Probabilistic Models in Functional Neuroimaging
Dirk Ostwald
Contents
Introduction
Mathematical Preliminaries
Sets
(1) Definition of sets
(2) Union, intersection, difference
(3) Selected sets of numbers
Functions
(1) Sums, products, and exponentiation
(2) General functions
(3) Basic properties of functions
(4) Elementary functions
Matrix Algebra
(1) Matrix definition
(2) Matrix addition and subtraction
(3) Matrix scalar multiplication
(4) Matrix multiplication
(5) Matrix inversion
(6) Inversion of small matrices by hand
(7) Matrix transposition
(8) Matrix determinants
(9) Rank of a matrix
(10) Matrix symmetry and positive-definiteness
Differential Calculus
(1) Intuition and definition of derivatives of univariate functions
(2) Derivatives of important functions
(3) Rules of differentiation
(4) Analytical optimization
(5) Multivariate, real-valued functions and partial derivatives
(6) Higher-order partial derivatives
(7) Gradient, Hessian, and Jacobian
(8) Taylor’s theorem
Integral Calculus
(1) Definite integrals – The integral as the signed area under a function’s graph
(2) Indefinite integrals – Integration as the inverse of differentiation
Sequences and Series
(1) Sequences
(2) Series
Ordinary Differential Equations
(1) Differential equations
(2) Initial value problems for systems of first-order ordinary differential equations
(3) Numerical approaches for initial value problems
An Introduction to Fourier Analysis
(1) Generalized cosine and sine functions
(2) Linear combinations of generalized cosine and sine functions
(3) The real Fourier series
(4) Periodic and protracted functions
(5) Complex Numbers
(6) Euler’s identities
(7) The complex form of the Fourier series
(8) The polar form of the Fourier series
(9) The Fourier transform
(10) The discrete Fourier transform
(11) Fast Fourier Transforms in Matlab
An Introduction to Numerical Optimization
(1) Gradient methods
(2) The Newton-Raphson method
Foundations of Probabilistic Models
Probability Theory
(1) Random variables
(2) Joint and marginal probability distributions
(3) Conditional probabilities
(4) Bayes Theorem
(5) Independent random variables
(6) Discrete random variables and probability mass functions
(7) Continuous random variables and probability density functions
(8) Expected value and variance of univariate random variables
Multivariate Gaussian distributions
(1) The bivariate Gaussian distribution
(2) The multivariate Gaussian distribution
(3) Independent Gaussian random variables and spherical covariance matrices
(4) The linear transformation theorems for Gaussian distributions
(5) The Gaussian joint and conditional distribution theorem
Information Theory
(1) Entropy
(2) Kullback-Leibler Divergence
Principles of Probabilistic Inference
(1) Maximum Likelihood Estimation
(2) Maximum likelihood estimation of the parameters of a univariate Gaussian distribution
(3) Numerical maximum likelihood estimation and Fisher-Scoring
(4) Bayesian Estimation
(5) Bayesian estimation of the expectation of a univariate Gaussian
(6) Principles of variational Bayes
Probability distributions in classical inference
(1) The Standard normal distribution
(2) The chi-squared distribution
(3) The 𝑡-distribution
(4) The 𝑓-distribution
Probability distributions in Bayesian inference
(1) The gamma distribution
(2) The inverse Gamma distribution
(3) The Wishart distribution
(4) The inverse Wishart distribution
(5) The normal-gamma distribution and the normal-inverse gamma distribution
(6) The univariate non-central 𝑡-distribution
(7) The multivariate non-central 𝑡-distribution
Basic Theory of the General Linear Model
Structural and probabilistic aspects
(1) Experimental design
(2) A verbose introduction
(3) Simple linear regression
(4) The Gaussian assumption
(5) Equivalent formulations
(6) Sampling a simple linear regression model
Maximum likelihood estimation
(1) Maximum likelihood and least-squares beta parameter estimation
(2) Least-squares beta parameter estimation for simple linear regression
(3) General beta parameter estimation
(4) Maximum likelihood variance parameter estimation
Frequentist Parameter Estimator Distributions
(1) The intuitive background for parameter estimator distributions
(2) The sampling distribution of the beta parameter estimator
(3) The sampling distribution of the scaled variance parameter estimator
(4) Overview of the frequentist GLM distribution theory
T- and F-Statistics
(1) Significance and hypothesis testing in frequentist statistics
(2) Definition and intuition of the T-statistic
(3) The T-statistic null distribution
(4) The T-statistic and null hypothesis significance testing
(5) Definition and intuition of the F-Statistic
(6) The F-Statistic Null Distribution
(7) The F-statistic and null hypothesis significance testing
(8) Classical variance partitioning formula and the GLM formulation of the F-Statistic
Bayesian estimation
(1) Model Formulation
(2) Bayesian estimation of the beta parameters
(3) Examples for Bayesian beta parameter estimation
(4) Bayesian joint estimation of beta and variance parameters
Fundamental Designs
(1) A spectrum of designs
(2) Simple linear regression
(3) Multiple linear regression
(4) One-sample T-test
(5) Independent two-sample T-test
(6) One-way ANOVA
(7) Multifactorial designs and two-way ANOVA
(8) Analysis of Covariance
Advanced theory of the General Linear Model
The generalized least-squares estimator and whitening
(1) Motivation
(2) Derivation of the generalized least-squares estimator
(3) Whitening
Restricted Maximum Likelihood
(1) Motivation
(2) The REML objective function and REML variance parameter estimators
(3) Derivatives of the REML objective function
(4) Fisher-Scoring for the REML objective function and covariance basis matrices
FMRI applications of the General Linear Model
The mass-univariate GLM-FMRI approach
(1) FMRI data acquisition and preprocessing
(2) Brain mapping using the GLM-FMRI approach
First-level regressors
(1) Discrete time-signals
(2) Discrete time-systems
(3) Convolution
(4) The canonical hemodynamic response function
(5) Stimulus onset convolution in GLM-FMRI
First-level design matrices
(1) Parameterizing event-related FMRI designs
(2) Measuring event-related FMRI design efficiency
(3) Finite impulse response designs
(4) Psychophysiological interaction designs
First-level covariance matrices and model estimation
(1) Models of FMRI serial correlations
(2) Estimation of mass-univariate GLMs with serial correlations
Second-level models
(1) The “summary-statistics” approach
(2) A hierarchical GLM
(3) An equivalent beta parameter estimates model
The multiple testing problem
(1) An introduction to the multiple testing problem in GLM-FMRI
(2) Type I error rates
(3) Exact, weak, and strong control of family-wise error rates
(4) The Bonferroni procedure and its “conservativeness” in GLM-FMRI
Multivariate Approaches
Classification Approaches
(1) Generative learning - Linear Discriminant Analysis
(2) Discriminative Learning - Logistic Regression
(3) Support Vector Classification
Deterministic Dynamical Models
Deterministic dynamic causal models for FMRI
(1) The structural form of DCM for FMRI
(2) The neural state evolution function
(3) Interpretation of the hemodynamic evolution and observer functions
(4) The probabilistic form of DCM for FMRI
Variational Bayesian inversion of deterministic dynamical models
(1) Model formulation
(2) Fixed-form evaluation of the variational free energy
(3) Optimization of the variational free energy
Introduction
The aim of this Section is to review some informal basics of quantitative data analysis and some
terminology of experimental design, in order to establish general terms for PMFN as a whole and
for the theory of the General Linear Model (GLM) in particular.
When designing any cognitive neuroscience experiment, it is essential to have at least a vague idea
about the data analytical procedures that are going to be used on the collected data, be it behavioral,
functional magnetic resonance imaging (FMRI) or magneto-/electroencephalographic (M/EEG) measures.
The aim of the following section is to provide a brief overview of the data analytical strategies employed in
noninvasive cognitive neuroimaging experiments and, more generally, in probabilistic quantitative data
analysis.
Data analysis as data reduction
Any cognitive experiment generates a wealth of data (numbers). For example, when conducting a
psychophysical experiment, one could present each stimulus of each condition multiple times to the
participant and gather reaction times and the correctness of the response on each trial. For reaction times
alone, with four conditions and 100 trials per condition, this would amount to 400 positive real numbers per
participant. Normally, one would acquire data from more than a single participant, and would thus deal with
400 data points times the number of participants. If one concomitantly acquires any form of neurophysiological data (e.g. FMRI data
over many voxels) the number of data points grows very large very quickly. Nevertheless, one would like to
know in what way the experimental manipulation has affected the recorded data. Any data analytical
method hence projects large sets of numbers onto a smaller set of numbers (sometimes also referred to as
“statistics”) that allow for the experimental effects to be more readily evaluated. While many data-analytical
techniques appear very different on the surface, the reduction of the “data dimensionality” is a ubiquitous
characteristic of all data analyses (Figure 1).
Figure 1. Raw data usually come in the form of large data matrices, here represented by a 100 × 150 array of different colours on the left. Usually, the raw data themselves are not reported in scientific reports, but rather a smaller set of numbers (such as T- or p-values in classical statistics, or log evidence ratios in Bayesian statistics). This smaller set of numbers is represented by the 2 × 2 array of different colours on the right.
The ubiquity of model-based data analysis
Another ubiquitous characteristic of any data analytical strategy is that it embodies some
assumptions about how the data were generated and which data aspects are important. In essentially every
data analytical approach, the key step is to compare how well a given set of quantitative assumptions or
“model” (often also referred to as “computational model”, or more generally, “theory”) can explain a set of
observed data. When studying a data analytical approach, it is always helpful to aim to identify the following
three components of the scientific method, which we will refer to as “model formulation”, “model
estimation”, and “model evaluation” (Figure 2).
Figure 2. On the relationship of reality and the scientific method.
By “model formulation” we understand the formalization of informal ideas about the generation of
empirical measurements. Important questions that are answered during probabilistic model formulation are
the following. Which intuitive idea is used to explain observed data? What are the deterministic and what
are the probabilistic parts of its formalized version? What are fixed parameters of the model, and which
parameters are allowed to vary and be determined by the data?
By “model estimation” we understand the adaptation of the model parameters (and, in the Bayesian
scenario, the approximation of the model’s plausibility for explaining the data) in light of observed data.
Interestingly, models are readily conceivable for which the estimation of their parameters is a non-trivial
task. In PMFN, we will be concerned both with models for which explicit and relatively straightforward
methods for model estimation exist (such as the GLM) and others, for which these methods become much
more involved (such as dynamic causal models in FMRI).
Finally, “model evaluation” refers to evaluating the obtained parameter estimates in some
meaningful sense and to drawing conclusions about the experimental hypothesis based on the parameters
and/or the overall model plausibility in light of the data.
Note that upon model evaluation, the scientific method proceeds by going back to the model
formulation step. At least two aims may be addressed during model reformulation: to conceive a
model formulation that captures the observed data in a more “meaningful” or “better” manner, or,
for example, to relax the assumptions of the model so as to derive a more general theory.
The classical statistics approach vs. Bayesian approaches
Modern, probabilistic model-based approaches to data analysis as encountered in cognitive
neuroimaging commonly comprise both “deterministic” (also referred to as “structural”) and “probabilistic”
aspects. The probabilistic aspects usually model the "noise" in the data, i.e. that part of the data variance
that is not explained by the deterministic aspects. Approaches from classical statistics and Bayesian
approaches differ in the way that the probabilistic aspects are dealt with and at which level probabilistic
concepts are invoked to model “uncertainty”.
Classical inference
In the most basic terms, in the classical statistics approach variants of the GLM are combined with the notion
of “null-hypothesis testing”. Informally, one assumes that if there is no experimental effect of interest, the
statistic that one is interested in (for example a group mean) has a certain distribution, namely the “null
distribution”. Once data are observed, one can use the null distribution of the statistic of interest to
compute the conditional probability of observing these (or more extreme) data given that the null hypothesis
is true, or, in other words, given that the null distribution is the true data distribution. If this probability,
known as the 𝑝-value, is small, one concludes that the data do not support the null hypothesis, and rejects it.
Bayesian approach
The Bayesian approach to data analysis takes a different viewpoint. In simplified terms, it uses the same
formalism as classical inference (namely probability theory), but does not interpret probabilities as
“objective” large sample limits, but as measures of “subjective uncertainty”. Accordingly, in the absence of any
experimental data, one may quantify one’s uncertainty about the parameter value of interest using a so-
called “prior distribution”. Further, one explicitly considers the data likelihood, i.e. the distribution of the
data given a specific value of the parameter. Using Bayes' theorem one may then compute the “posterior
distribution” of the parameter given the data, resulting in an updated belief about what the real parameter
underlying the data generation might be. At the same time, one aims to quantify the probability of the data
under the model assumptions employed.
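As a toy illustration of the two viewpoints, consider inferring the mean of Gaussian data with known standard deviation. The following sketch is not part of PMFN; the data values, the prior parameters, and all variable names are illustrative assumptions (a one-sample 𝑧-test on the classical side, a conjugate Gaussian prior on the Bayesian side).

```python
import math

# Hypothetical data: n observations, assumed Gaussian with known sigma
y = [0.8, 1.3, 0.2, 1.1, 0.9, 1.5, 0.4, 1.0]
n, sigma = len(y), 1.0
ybar = sum(y) / n

# Classical inference: under the null hypothesis mu = 0, the sample mean
# follows the null distribution N(0, sigma^2/n); the two-sided p-value is
# the probability of data at least this extreme under that distribution.
z = ybar / (sigma / math.sqrt(n))
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Bayesian inference: with the prior mu ~ N(m0, s0^2), Bayes' theorem
# yields a Gaussian posterior over mu (the conjugate update for known sigma).
m0, s0 = 0.0, 10.0
post_var = 1.0 / (1.0 / s0**2 + n / sigma**2)
post_mean = post_var * (m0 / s0**2 + sum(y) / sigma**2)

print(f"two-sided p-value under mu = 0: {p_value:.4f}")
print(f"posterior for mu: N({post_mean:.3f}, {post_var:.3f})")
```

Note how the classical branch returns a single probability about the data given a fixed parameter value, whereas the Bayesian branch returns a full distribution over the parameter given the data.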
It should be noted that the dichotomy between the classical and Bayesian approaches to data analysis is
not a strict one, and that mixed forms exist. Up to now, most of cognitive neuroscience (and in fact
quantitative science in general) has been dominated by classical statistics. Especially with the current interest
in, and the increasing availability of, large data sets (“big data”) combined with high-performance computing
solutions, however, the Bayesian approach appears to be gaining popularity.
It should not be surprising if the notions of classical statistical inference and Bayesian inference
appear somewhat nebulous at this point. One aim of PMFN is to develop a better intuition about both
frameworks, their commonalities, and their differences by putting them into action with respect to the same
underlying model, the GLM.
Univariate vs. multivariate approaches
Another way to classify data analytical procedures is according to whether they are “univariate” or
“multivariate”. The key difference between the two approaches is the dimensionality of the “dependent
experimental variable” or “outcome measure”. If this dimensionality is one, i.e. for each trial of each
experimental condition a single, scalar number (for example a reaction time, the BOLD signal in a given
voxel, the EEG frequency power in a specific frequency band) is observed and modeled by the approach,
one speaks of a “univariate” data analytical approach. Typical examples of univariate approaches are the
variants of the standard GLM encountered in PMFN.
On the other hand, if two or more numbers comprise the dependent experimental variable (e.g. a
reaction time and a verbal response, or the multi-voxel activation pattern of a given brain region) and are
modeled by the approach, one speaks of a multivariate approach. Typical examples of multivariate
approaches are the multivariate analysis of variance (MANOVA), canonical correlation, multi-dimensional
scaling and multivariate classification approaches.
Encoding vs. decoding approaches
Yet another form of classification of data analytical approaches often encountered in cognitive
neuroscience is along the lines of “encoding” and “decoding”. Encoding approaches like the GLM rest on an
explicit formulation of experimental circumstances that generate observed responses. Decoding approaches
on the other hand decode, according to some algorithm which is trained on a subset of the data, the
experimental circumstances from the observed response. It is important to note, however, that the distinction between
encoding- and decoding-based approaches is somewhat artificial: after training, any decoding algorithm is based on a
generative model of the response patterns associated with specific experimental circumstances. In this sense
both encoding and decoding approaches achieve the same thing: they test for the statistical non-
independence between independent and dependent experimental variables.
In summary, the GLM, which we will elucidate from both the classical and Bayesian viewpoint in this
Section of PMFN, corresponds to a univariate data dimensionality reduction technique that encodes simple
assumptions about how observed data are generated.
Study Questions
1. Give brief definitions of the terms Model Formulation, Model Estimation, and Model Evaluation.
2. Provide a brief overview of the differences and commonalities between the classical inference and Bayesian statistical approaches.
3. Why is the General Linear Model as discussed in the course referred to as a “univariate” data analysis method?
4. Define the terms “independent variable”, “dependent variable”, “categorical variable”, and “continuous variable”.
5. Explain the difference between within- and between-subject experimental designs.
Study Question Answers
1. By “Model Formulation” we understand the formalization of informal ideas about the generation of empirical measurements.
Important questions that we need to answer during probabilistic model formulation are the following. Which intuitive idea is
used to explain observed data? What are the deterministic and what are the probabilistic parts of its formalized version? What
are fixed parameters of the model, and which parameters are allowed to vary and be determined by the data? By “Model
estimation” we understand the adaptation of the model parameters and, in the Bayesian scenario, the approximation of the
model’s plausibility for explaining the data in light of observed data. “Model Evaluation” refers to evaluating the obtained
parameter estimates in some meaningful sense and to draw conclusions about the experimental hypothesis based on the
parameters or the overall model plausibility.
2. Both the classical and the Bayesian approach formulate probabilistic models, which may have the same structural and
probabilistic (likelihood) form. In classical statistics the notion of “null-hypothesis testing” by means of 𝑝-values is prevalent
and probabilities are interpreted as “objective large sample frequencies”. The Bayesian approach to data analysis uses the
same formalism as classical inference, namely probability theory, but does not interpret probabilities as “objective” large
sample limits, but as measures of “subjective uncertainty”. To this end, in the absence of any experimental data, one may quantify
one’s uncertainty about the parameter value of interest using a so-called “prior distribution”. Further, one explicitly considers
the data likelihood, i.e. the distribution of the data given a specific value of the parameter. Using Bayes' theorem one may then
compute the “posterior distribution” of the parameter given the data, resulting in an updated belief about what the real
parameter underlying the data generation might be. At the same time, one aims to quantify the probability of the data under
the model assumptions used.
3. The General Linear Model as discussed in the course is referred to as a “univariate” data analysis method because the
dimensionality of the observations at each trial/time-point is one.
4. An independent experimental variable refers to an aspect of an experimental design that is intentionally manipulated by the
experimenter and that is hypothesized to cause changes in the dependent variables. A dependent experimental variable is a
quantity that is measured by the experimenter in order to evaluate the effect of one or more independent variables. A
categorical variable is a variable that can take on one of several discrete values. A continuous variable is one that can take on
any value within a pre-specified range, usually described mathematically by a real number.
5. In a between-subject manipulation, different subject groups reflect different values of the independent variable, while in a (full)
within-subject design, subjects are exposed to all values of the independent variable.
Mathematical Preliminaries
Sets
(1) Definition of Sets
Sets may be defined according to Georg Cantor (1845 – 1918) as follows:
“A set is a gathering together into a whole of definite, distinct objects of our perception
[Anschauung] or of our thought – which are called elements of the set.”
In PMFN we primarily use sets as a means to denote which kind of mathematical objects we are
dealing with. Sets are usually denoted using curly brackets. For example, the set 𝐴 comprising the first five
lowercase letters of the Roman alphabet is denoted as
𝐴 ≔ {𝑎, 𝑏, 𝑐, 𝑑, 𝑒} (1)
We use the symbol “ ≔” to denote a definition and the symbol “=” to denote an equality.
There are essentially three ways to define sets: (a) by listing the elements of the set as in (1); (b) by
specifying the properties of the elements in a set, for example as
𝐵 ≔ {𝑥 | 𝑥 is one of the first five lowercase letters of the Roman alphabet} (2)
where the variable 𝑥 before the vertical bar denotes the elements of the set in a generic fashion, and the
statement after the vertical bar denotes the defining properties of the set; and finally (c) by defining a set to
correspond to another, well known set, for example
𝐶 ≔ ℕ (3)
where ℕ denotes the set of natural numbers (see below). To indicate that e.g. the letter 𝑏 is an element of
𝐴, we write
𝑏 ∈ 𝐴 (4)
which may be read as “𝑏 is in 𝐴” or “𝑏 is an element of 𝐴”. To indicate that, for example, the number 2 is not
an element of 𝐴, we write
2 ∉ 𝐴 (5)
which may be read as “2 is not in 𝐴” or “2 is not an element of 𝐴”. The number of elements of a set is
referred to as the “cardinality” of a set and is denoted by vertical bars. For example
|𝐴| = |𝐵| = 5 (6)
is the cardinality of the sets 𝐴 and 𝐵, because both contain five elements.
If a set 𝐴 contains all elements of another set 𝐵 and 𝐴 contains some additional elements, i.e., the
sets are not equal, then 𝐵 is said to be a “subset” of 𝐴, denoted by 𝐵 ⊂ 𝐴, and 𝐴 is said to be a “superset” of
𝐵, denoted by 𝐴 ⊃ 𝐵. Usually, denoting that a set is a subset of another set suffices. For example, if
𝐴 ≔ {1,2, 𝑎, 𝑏} and 𝐵 ≔ {1, 𝑎}, then 𝐵 ⊂ 𝐴, because all elements of 𝐵 are also in 𝐴. If a set 𝐵 may be a
subset of or may equal another set 𝐴, the notation 𝐵 ⊆ 𝐴 is also used.
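As an aside, the set notions introduced above map directly onto the built-in set type of many programming languages. The following Python sketch merely restates the examples from the text and is not part of the formal development.

```python
# The set A from (1) and the membership statements (4) and (5)
A = {"a", "b", "c", "d", "e"}
assert "b" in A       # b ∈ A
assert 2 not in A     # 2 ∉ A

# Cardinality (6): |A| = 5
assert len(A) == 5

# Subset and superset: for A := {1, 2, a, b} and B := {1, a},
# B is a proper subset of A and A a proper superset of B
A2 = {1, 2, "a", "b"}
B2 = {1, "a"}
assert B2 < A2    # B ⊂ A (proper subset)
assert A2 > B2    # A ⊃ B (proper superset)
assert B2 <= A2   # B ⊆ A also holds
```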
(2) Union, Intersection, Difference
Three operations on sets are sometimes helpful: (a) the union of two sets, (b) the intersection of two
sets, and (c) the difference of two sets. These may be defined as follows.
Let 𝑀 and 𝑁 be two arbitrary sets. Then
𝑀 ∪ 𝑁 ≔ {𝑥 | 𝑥 ∈ 𝑀 or 𝑥 ∈ 𝑁} (1)
denotes the “union” of the two sets 𝑀 and 𝑁. The union of two sets is a set that comprises elements that
are either in 𝑀 (only), in 𝑁 (only), or in both 𝑀 and 𝑁. The “or” in the definition of 𝑀 ∪ 𝑁 is thus to be
understood in an inclusive (“and/or”), rather than exclusive, way. As an example, for 𝑀 ≔ {1,2,3} and
𝑁 ≔ {2,3,5,7} we have
𝑀 ∪𝑁 = {1,2,3,5,7} (2)
The “intersection” of two arbitrary sets 𝑀 and 𝑁 is defined as
𝑀 ∩ 𝑁 ≔ {𝑥 | 𝑥 ∈ 𝑀 and 𝑥 ∈ 𝑁} (3)
The intersection 𝑀 ∩𝑁 is thus a set that only comprises elements that are both in 𝑀 and 𝑁. For the
example 𝑀 ≔ {1,2,3} and 𝑁 ≔ {2,3,5,7} we have
𝑀 ∩𝑁 = {2,3} (4)
because 2 and 3 are the only numbers that are both in 𝑀 and 𝑁. If the intersection of two sets does not
contain any elements (a set referred to as the “empty set” and denoted by ∅) the two sets are said to be
“disjoint”.
Finally, the “difference” between two sets, also referred to as the “difference set” of two sets 𝑀 and
𝑁, is defined as
𝑀\𝑁 ≔ {𝑥 | 𝑥 ∈ 𝑀 and 𝑥 ∉ 𝑁} (5)
The set 𝑀\𝑁 thus comprises all elements with the property that they are in 𝑀, but not in 𝑁. For the
example 𝑀 ≔ {1,2,3} and 𝑁 ≔ {2,3,5,7} we have
𝑀\𝑁 = {1} (6)
because the elements 2,3 ∈ 𝑀 are also in 𝑁. Note that the other elements of 𝑁 do not play a role in the
difference set. The difference of two sets is not symmetric, as we have for example
𝑁\𝑀 = {5,7} (7)
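The three set operations can likewise be checked numerically. The following Python sketch reuses the example sets 𝑀 and 𝑁 from above.

```python
M = {1, 2, 3}
N = {2, 3, 5, 7}

# Union (2), intersection (4), and the two difference sets (6) and (7)
assert M | N == {1, 2, 3, 5, 7}   # M ∪ N
assert M & N == {2, 3}            # M ∩ N
assert M - N == {1}               # M \ N
assert N - M == {5, 7}            # N \ M: the difference is not symmetric

# Two sets are disjoint if their intersection is the empty set
assert {1, 2} & {5, 7} == set()
```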
(3) Selected Sets of Numbers
In this section, we briefly introduce a selection of important sets of numbers that will be required
throughout PMFN.
The set of natural numbers (or positive integers) is denoted by ℕ and defined as
ℕ ≔ {1,2,3,… } (1)
where the dots “…” denote “to infinity”. Subsets of the set of natural numbers are the sets of natural numbers
of order 𝑛 ∈ ℕ, which are defined as
ℕ𝑛 ≔ {1,2,… , 𝑛} (2)
The union of the set of natural numbers and zero will be denoted by ℕ0, i.e. ℕ0 ≔ ℕ∪ {0}. If 0 and the
“negative natural numbers” are added to the set ℕ, one obtains the set of integers defined by
ℤ ≔ {… ,−3,−2,−1,0,1,2,3,… } (3)
Adding also the ratios of integers yields the set of rational numbers, defined as

ℚ ≔ {𝑝/𝑞 | 𝑝, 𝑞 ∈ ℤ, 𝑞 ≠ 0} (4)
The most important basic set of numbers for our purposes is the set of real numbers, denoted by ℝ.
The real numbers ℝ are a superset of the rational numbers, i.e. ℚ ⊂ ℝ. In addition to the rational numbers,
the real numbers include the solutions of certain algebraic equations (e.g. √2, the solution of the equation
𝑥^2 = 2, which is not an element of ℚ). These numbers are called irrational numbers. Additionally, the real
numbers include the limits of convergent sequences of rational numbers, such as 𝜋 ≈ 3.14… . Intuitively, the
set of real numbers is the set one thinks of when referring to “continuous” numbers. Interestingly, between any two
real numbers there exist infinitely many more real numbers, while ℝ also extends to negative and positive
infinity. One can show that there are more real numbers than natural numbers; the set of real numbers is
therefore said to be uncountable. In practice, the real numbers are the scalar (i.e. “single”) numbers that are used to
model most “continuous” data formats in cognitive neuroimaging. Sometimes, one is interested in only the
non-negative real numbers. These are denoted by
ℝ+ ≔ {𝑥 ∈ ℝ|𝑥 ≥ 0} (5)
In general, contiguous subsets of the real numbers are referred to as “intervals”. We will mainly
deal with closed intervals, i.e., subsets of ℝ that are delimited by two numbers 𝑎, 𝑏 ∈ ℝ and defined as

[𝑎, 𝑏] ≔ {𝑥 ∈ ℝ | 𝑎 ≤ 𝑥 ≤ 𝑏} (6)

where ≤ denotes the property “less than or equal to”. Note that 𝑎 and 𝑏 are elements of the interval [𝑎, 𝑏]
and that [𝑎, 𝑏] is the empty set if 𝑏 < 𝑎. Three further kinds of intervals can be defined by means of the
“less than” property <, referred to as “left-semi-open”, “right-semi-open”, and “open” intervals, respectively
]𝑎, 𝑏] ≔ {𝑥 ∈ ℝ | 𝑎 < 𝑥 ≤ 𝑏} (7)
[𝑎, 𝑏[ ≔ {𝑥 ∈ ℝ | 𝑎 ≤ 𝑥 < 𝑏} (8)
]𝑎, 𝑏[ ≔ {𝑥 ∈ ℝ | 𝑎 < 𝑥 < 𝑏} (9)
Note that 𝑎 and 𝑏 are not elements of the interval ]𝑎, 𝑏[.
The set of scalar real numbers can readily be generalized to the set of real 𝑛-tuples, or 𝑛-
dimensional vectors, denoted by ℝ^𝑛. For 𝑛 ∈ ℕ, the set ℝ^𝑛, with the special case ℝ^1 = ℝ, denotes the set of
𝑛-tuples with real entries. The 𝑛-tuples are usually denoted as lists, or “column vectors”, of the form
𝑥 ≔ (𝑥1, 𝑥2, … , 𝑥𝑛)^𝑇 (10)

where the superscript 𝑇 indicates that the listed row of numbers is to be read as a column (“transposition”).
We will usually write 𝑥 ∈ ℝ^𝑛 to denote that 𝑥 is a list or column vector of 𝑛 real numbers. An example for
𝑥 ∈ ℝ^4 is

𝑥 ≔ (0.161, 1.762, −0.203, 𝜋)^𝑇 (11)
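For readers who like to experiment numerically, interval membership according to definitions (6) and (9) and the example 4-tuple from (11) may be sketched in Python as follows; the function names are illustrative only.

```python
import math

def in_closed(x, a, b):
    """x ∈ [a, b] := {x ∈ ℝ | a ≤ x ≤ b}"""
    return a <= x <= b

def in_open(x, a, b):
    """x ∈ ]a, b[ := {x ∈ ℝ | a < x < b}"""
    return a < x < b

assert in_closed(0.0, 0.0, 1.0)      # the endpoints belong to [a, b] ...
assert not in_open(0.0, 0.0, 1.0)    # ... but not to ]a, b[

# An element of ℝ^4 as in (11), represented as a tuple of four real numbers
x = (0.161, 1.762, -0.203, math.pi)
assert len(x) == 4
```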
Study Questions
1. Give brief explanations of the symbols ℕ, ℕ𝑛, ℤ, ℚ, ℝ, ℝ^𝑛 and provide a numerical example of an 𝑥 ∈ ℝ^5.
2. Consider the sets 𝐴 ≔ {1,2,3} and 𝐵 ≔ {3,4,5}. Write down the sets 𝐶 ≔ 𝐴 ∪ 𝐵 and 𝐷 ≔ 𝐴 ∩ 𝐵
3. Write down the definition of the interval [0,1] ⊂ ℝ. Is 0 an element of this interval?
Study Question Answers
1. ℕ is the set of natural numbers, i.e. the positive integers from 1 to ∞. ℕ𝑛 is the set of natural numbers from 1 to 𝑛, i.e. the set
of positive integers 1, 2, … , 𝑛. ℤ is the set of integers, i.e. the set of positive and negative integers and zero,
… , −3, −2, −1, 0, 1, 2, 3, … , ranging from −∞ to ∞. ℚ is the set of rational numbers, i.e. all numbers that can be written as
ratios 𝑝/𝑞, where 𝑝 and 𝑞 are integers with 𝑞 ≠ 0. ℝ is the set of real numbers, i.e., the set of rational numbers and those
numbers which cannot be written as such ratios, such as 𝜋. ℝ^𝑛 is the space of 𝑛-tuples with real entries (or, equivalently, the
set of 𝑛-dimensional vectors with real entries). An example of 𝑥 ∈ ℝ^5 is 𝑥 = (1, 2, 3, 4, 5)^𝑇.
2. 𝐶 = {1,2,3} ∪ {3,4,5} = {1,2,3,4,5} and 𝐷 = {1,2,3} ∩ {3,4,5} = {3}
3. The interval [0,1] is defined as [0,1] ≔ {𝑥 ∈ ℝ|0 ≤ 𝑥 ≤ 1}. Yes, 0 is an element of this interval.
Functions
Before studying functions, we introduce three helpful concepts often encountered in higher
mathematics, the sum symbol, the product symbol, and the rules of exponentiation.
(1) Sums, products, and exponentiation
In mathematics, one often has to add numbers. A concise way to represent sums is afforded by the
sum symbol,
∑ (1)
The sum symbol is reminiscent of the Greek letter Sigma (Σ), corresponding to the Roman capital S
and thus mnemonic for “Sum”. Under the sum symbol one denotes the terms summed over, usually with the
help of indices. For example, for 𝑥1, 𝑥2, 𝑥3 ∈ ℝ we can write the equation
𝑥1 + 𝑥2 + 𝑥3 = 𝑦 (2)
in shorthand notation as
∑_{𝑖=1}^{3} 𝑥𝑖 = 𝑦 (3)
Note that the subscript at the sum symbol indicates the running index and its initial value (here 𝑖 and
1, respectively) and that the superscript at the sum symbol denotes the final value of the running index
(here 𝑖 = 3). It is an often encountered (and lamentable) behaviour of authors to not include these sub- and
superscripts, which then render the sum symbol somewhat meaningless. We will usually use the index sub-
and superscripts in PMNF.
To obtain some familiarity with the sum symbol, consider the following examples
𝑎 = 1 + 4 + 9 + 16 + 25 + 36 + 49 + 64 + 81 + 100 (4)
𝑏 = 1 ⋅ 𝑥1 + 2 ⋅ 𝑥2 + 3 ⋅ 𝑥3 + ⋯+ 𝑛 ⋅ 𝑥𝑛 (5)
𝑐 = 2 + 2 + 2 + 2 + 2 (6)
Using the sum symbol, these may be written as follows: For 𝑎 we add all squares of the natural numbers
from 1 to 10. We thus write
𝑎 = ∑_{𝑖=1}^{10} 𝑖^2 = 1 + 4 + 9 + 16 + 25 + 36 + 49 + 64 + 81 + 100 (7)
For 𝑏, we are given the numbers 𝑥1, … , 𝑥𝑛 ∈ ℝ and have to multiply each one with its index and then add
them all up. We thus write
𝑏 = ∑_{𝑖=1}^{𝑛} 𝑖 ⋅ 𝑥𝑖 = 1 ⋅ 𝑥1 + 2 ⋅ 𝑥2 + 3 ⋅ 𝑥3 + ⋯ + 𝑛 ⋅ 𝑥𝑛 (8)
For 𝑐, we have to add the number 2 five times. For this we can write
𝑐 = ∑_{𝑖=1}^{5} 2 = 2 + 2 + 2 + 2 + 2 (9)
The naming of the index is irrelevant; for example, we have

∑_{𝑖=1}^{10} 𝑖^2 = ∑_{𝑗=1}^{10} 𝑗^2 (10)
One property of sums that we will encounter frequently is that constant factors (i.e. factors that do
not depend on the sum index) may either be written under the sum symbol or outside of it. Consider the
arithmetic mean of 𝑛 ∈ ℕ real numbers 𝑥1, 𝑥2, … , 𝑥𝑛. The arithmetic mean corresponds to the sum of the 𝑛
numbers divided by 𝑛. We may write this as

(𝑥1 + 𝑥2 + ⋯ + 𝑥𝑛)/𝑛 = (∑_{𝑖=1}^{𝑛} 𝑥𝑖)/𝑛 = (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑥𝑖 (11)
or, equivalently, as
(𝑥1 + 𝑥2 + ⋯ + 𝑥𝑛)/𝑛 = 𝑥1/𝑛 + 𝑥2/𝑛 + ⋯ + 𝑥𝑛/𝑛 = ∑_{𝑖=1}^{𝑛} 𝑥𝑖/𝑛 = ∑_{𝑖=1}^{𝑛} (1/𝑛) 𝑥𝑖 (12)
We have thus shown that
∑_{𝑖=1}^{𝑛} (1/𝑛) 𝑥𝑖 = (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑥𝑖 (13)
or, informally, that “we may take out the constant factor 1/𝑛 from under the sum and instead multiply the
whole sum by it”.
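The sum examples and the constant-factor property above can be checked numerically. The following sketch (in Python, for illustration; the numbers in xs are hypothetical) evaluates equation (7) and verifies that the two forms of the arithmetic mean in (13) coincide.

```python
# Numerical check of the sum examples and the constant-factor property (13).
xs = [3.0, 1.0, 4.0, 1.0, 5.0]   # hypothetical numbers x_1, ..., x_n
n = len(xs)

a = sum(i ** 2 for i in range(1, 11))   # equation (7): squares of 1, ..., 10
mean_inside = sum(x / n for x in xs)    # factor 1/n kept under the sum
mean_outside = sum(xs) / n              # factor 1/n taken out of the sum

print(a)                                       # 385
print(abs(mean_inside - mean_outside) < 1e-12) # True
```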
Another common mathematical operation is the multiplication of numbers. To this end, the product
symbol Π (the Greek capital Pi, mnemonic for “Product”) allows for writing products of multiple factors in a concise manner.
In complete analogy to the sum symbol, the product symbol has the following semantics
∏_{𝑖=1}^{𝑛} 𝑎𝑖 = 𝑎1 ⋅ 𝑎2 ⋅ … ⋅ 𝑎𝑛 (14)
Closely related to the product is the exponentiation operation, essentially a product of a number
with itself. Recall that for 𝑎 ∈ ℝ and 𝑛 ∈ ℕ0 “𝑎 to the power of 𝑛” is defined (recursively) as
𝑎^0 ≔ 1 and 𝑎^{𝑛+1} ≔ 𝑎^𝑛 ⋅ 𝑎 (15)
Further, “𝑎 to the power of minus 𝑛” is defined for 𝑎 ∈ ℝ\{0} and 𝑛 ∈ ℕ by
𝑎^{−𝑛} ≔ (𝑎^𝑛)^{−1} ≔ 1/𝑎^𝑛 (16)
In the term 𝑎𝑛 the number 𝑎 is sometimes referred to as “base” and the number 𝑛 as “exponent” or
“power”. Based on the definition in (15) the following familiar “laws of exponentiation” can be derived,
which hold for all 𝑎, 𝑏 ∈ ℝ and 𝑛,𝑚 ∈ ℤ (given that 𝑎 ≠ 0 for negative powers)
𝑎^𝑛 𝑎^𝑚 = 𝑎^{𝑛+𝑚} (17)
(𝑎^𝑛)^𝑚 = 𝑎^{𝑛𝑚} (18)
(𝑎𝑏)^𝑛 = 𝑎^𝑛 𝑏^𝑛 (19)
Further note that the 𝑛th root of a number 𝑎 ∈ ℝ is defined as a number 𝑟 ∈ ℝ whose 𝑛th power
equals 𝑎,
𝑟^𝑛 = 𝑎 (20)
From this definition it follows that the 𝑛th root may equivalently be written using a rational exponent,
𝑟 = 𝑎^{1/𝑛} (21)
because from (20) and (17) it then follows that
𝑟^𝑛 = (𝑎^{1/𝑛})^𝑛 = 𝑎^{1/𝑛} ⋅ 𝑎^{1/𝑛} ⋅ ⋯ ⋅ 𝑎^{1/𝑛} = 𝑎^{∑_{𝑖=1}^{𝑛} 1/𝑛} = 𝑎^1 = 𝑎 (22)
The familiar square root of a number 𝑎 ∈ ℝ, 𝑎 ≥ 0 may thus equivalently be written as
√𝑎 = 𝑎^{1/2} (23)
which, together with the laws of exponentiation (17) – (19) often significantly simplifies the handling of
square roots.
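The product symbol, the laws of exponentiation (17)–(19), and the root identity (21) can be illustrated numerically. The following sketch (Python, for illustration; the test values for 𝑎, 𝑏, 𝑛, 𝑚 are arbitrary) checks each in turn.

```python
import math

# Checks of the product symbol and the laws of exponentiation (17)-(19).
a, b = 2.0, 3.0
n, m = 4, 5

factorial_5 = math.prod(range(1, 6))        # product symbol: 1 * 2 * 3 * 4 * 5
law_17 = a ** n * a ** m == a ** (n + m)    # a^n a^m = a^(n+m)
law_18 = (a ** n) ** m == a ** (n * m)      # (a^n)^m = a^(nm)
law_19 = (a * b) ** n == a ** n * b ** n    # (ab)^n = a^n b^n
fifth_root = 32.0 ** (1 / 5)                # equation (21): 32^(1/5) = 2

print(factorial_5, law_17, law_18, law_19)  # 120 True True True
print(abs(fifth_root - 2.0) < 1e-12)        # True
```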
(2) General functions
A “function” (also referred to as “mapping”) 𝑓 is generally specified in the form
𝑓:𝐷 → 𝑅, 𝑥 ↦ 𝑓(𝑥) (1)
where the set 𝐷 is called the “domain” of the function 𝑓 and the set 𝑅 is called the “range” or “co-domain”
of the function. In concordance with contemporary mathematical language, the terms “function” and
“mapping” are used interchangeably in these notes.
In (1) “𝑓: 𝐷 → 𝑅” may be read as “the function (or rule) 𝑓 maps all elements of the set 𝐷 onto
elements of the set 𝑅”. While this may appear artificial and cumbersome at first, it is of immense practical use
when dealing with functions and helps to be precise about their “input” set 𝐷 as well as their “output” set
𝑅. In (1) “𝑥 ↦ 𝑓(𝑥)” denotes the mapping of the domain element 𝑥 ∈ 𝐷 onto the range element 𝑓(𝑥) ∈ 𝑅
and may be read as “𝑥, which is an element of 𝐷, is mapped by 𝑓 onto 𝑓(𝑥), which is an element of 𝑅”.
Note that the arrow “→” is used to denote the mapping between the two sets and the arrow “↦” is used to
denote the mapping between an element in the domain of 𝑓 and an element in the range of 𝑓. The most
important thing to note about (1) is that this notation distinguishes between the function 𝑓 proper, which
corresponds to an “abstract rule”, and elements 𝑓(𝑥) of its range. A function can thus be understood as a
rule that relates two sets of quantities, the inputs and the outputs. Importantly, each input 𝑥 to a function is
related to an output of the function 𝑓(𝑥) in a deterministic fashion.
Usually in definitions of functions, the specification of the function abbreviation, its domain and its
range, as in (1) is followed by a definition of the functional form linking the domain elements to the range
elements. Consider the example of a familiar function, the square of a real number. Using the notation
introduced above, this function can be written as
𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) ≔ 𝑥2 (2)
(3) Basic Properties of Functions
Functions can be categorized according to a basic set of properties that specify how a function
relates elements of its domain to elements of its range. To define these properties it is helpful to introduce
the notions of an “image” and a “preimage” first. To this end, let 𝑓: 𝐷 → 𝑅 be a function and let 𝑥 ∈ 𝐷. Then
𝑓(𝑥), i.e., the element of 𝑅 that 𝑥 is mapped onto, is called “the image of 𝑥 under 𝑓”. The entire subset of
the range 𝑅 for which images under 𝑓 exist, i.e., the set
𝑓(𝐷) ≔ {𝑦 ∈ 𝑅 | there exists an 𝑥 ∈ 𝐷 with 𝑓(𝑥) = 𝑦} ⊆ 𝑅 (1)
is called “the image of 𝐷 under 𝑓” and is notated by “𝑓(𝐷)”. Note that the image 𝑓(𝐷) and the range 𝑅 are
not necessarily identical and that the image can be a subset of the range. If 𝑦 is an element of 𝑓(𝐷), then an
𝑥 ∈ 𝐷 for which 𝑓(𝑥) = 𝑦 holds is called a “preimage” of 𝑦 under 𝑓. The following relationships between
images and preimages are important to note: (1) Every element in the domain of a function is allocated
exactly one image in the range under 𝑓. This is a more precise definition of the notion of a function. (2) Not
every element in the range of a function has to be a member of the image of 𝑓. (3) If 𝑦 is an element of
𝑓(𝐷), then there may exist multiple preimages of 𝑦.
The standard example to understand these properties is the square function
𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) ≔ 𝑥2 (2)
While every real number 𝑥 ∈ ℝ may be multiplied by itself, and thus 𝑦 ≔ 𝑥2 always exists, many 𝑦 ∈ ℝ do
not have preimages under 𝑓, namely all negative numbers. This simply follows as the square of 0 is 0, the
square of a positive number is a positive number, and the square of a negative number is a positive number.
Finally, this function also has the property that for one 𝑦 ∈ 𝑓(ℝ) there exist multiple preimages: for example
4 has the preimages 2 and −2 under 𝑓.
The relations between images and preimages of functions as sketched above are formalized by the
notions of “injective”, “surjective”, and “bijective” functions. These are defined as follows. Let 𝑓: 𝐷 → 𝑅 be a
function. 𝑓 is called surjective, if every element 𝑦 ∈ 𝑅 is a member of the image of 𝑓, or in other words, if
𝑓(𝐷) = 𝑅. If this is not the case, 𝑓 is not surjective. 𝑓 is called injective, if every element in the image of 𝑓
has exactly one preimage under 𝑓. If this is not the case, 𝑓 is not injective. Finally, a function 𝑓 that is
surjective and injective is called a “bijective” or “one-to-one” mapping. Figure 1 below illustrates a non-
surjective function, a non-injective function, and a bijective function.
Figure 1. Non-surjective, non-injective, and bijective functions.
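For functions on finite sets, surjectivity and injectivity can be checked directly by comparing images, which makes the definitions above concrete. The following sketch (Python, for illustration; the finite domain D and range R are hypothetical) applies these checks to a restricted square function.

```python
# Direct checks of surjectivity and injectivity on finite sets.
def is_surjective(f, D, R):
    # f is surjective iff the image f(D) equals the range R
    return {f(x) for x in D} == set(R)

def is_injective(f, D):
    # f is injective iff no two domain elements share the same image
    images = [f(x) for x in D]
    return len(images) == len(set(images))

D = [-2, -1, 0, 1, 2]
R = [0, 1, 4]
square = lambda x: x * x

print(is_surjective(square, D, R))   # True: every element of R has a preimage
print(is_injective(square, D))       # False: square(-2) == square(2)
```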
(4) Elementary Functions
Because the GLM embeds the assumption of normally distributed error terms, and because the
normal or Gaussian distribution belongs to the family of exponential probability density functions, we will
require a number of properties of the exponential function and its inverse, the logarithmic function, in
these lecture notes. We will not prove that the properties hold, but rather collect them here for reference.
Proofs of these properties may be found in any undergraduate real analysis textbook.
The identity function
The identity function is defined as
𝑖𝑑 ∶ ℝ → ℝ, 𝑥 ↦ 𝑖𝑑(𝑥) ≔ 𝑥 (1)
The identity function thus maps values 𝑥 ∈ ℝ onto themselves. The derivative of the identity function is 1
𝑑/𝑑𝑥 id(𝑥) = 𝑑/𝑑𝑥 𝑥 = 1 (2)
The exponential function
The exponential function is defined as
exp: ℝ → ℝ, 𝑥 ↦ exp(𝑥) ≔ 𝑒^𝑥 ≔ ∑_{𝑛=0}^{∞} 𝑥^𝑛/𝑛! = 1 + 𝑥 + 𝑥^2/2! + 𝑥^3/3! + 𝑥^4/4! + ⋯ (1)
The mathematical object ∑_{𝑛=0}^{∞} 𝑥^𝑛/𝑛!, an infinite sum, is called a “series” and is actually quite intricate. For now
it suffices to recall that 𝑒 = 𝑒^1 ≈ 2.71… is called “Euler’s number” and that exp(𝑥) thus corresponds to Euler’s
number to the power of 𝑥. The graph of the exponential function is depicted in Figure 2.
A defining property of the exponential function is that it is equal to its own derivative
𝑑/𝑑𝑥 exp(𝑥) = exp′(𝑥) = exp(𝑥) (2)
Below we note some properties of the exponential function which are often helpful in algebraic
manipulations:
Special values: exp(0) = 1 and exp(1) = 𝑒 (3)
Value ranges: 𝑥 ∈ ]−∞, 0[ ⇒ 0 < exp(𝑥) < 1 and 𝑥 ∈ ]0, ∞[ ⇒ 1 < exp(𝑥) < ∞ (4)
The exponential function thus assumes only strictly positive values, exp(ℝ) = ]0, ∞[. The brackets ]𝑎, 𝑏[
denote the “open interval” 𝑎 < 𝑥 < 𝑏, and the brackets [𝑎, 𝑏] denote the closed interval 𝑎 ≤ 𝑥 ≤ 𝑏.
Monotonicity: The exponential function is strictly monotonically increasing. In other words, if 𝑥 < 𝑦,
then it follows that exp(𝑥) < exp(𝑦).
Exponentiation identity (“product property”): exp(𝑎 + 𝑏) = exp(𝑎) ⋅ exp(𝑏) (5)
The exponentiation identity, i.e. the fact that the product of the exponential of two arguments 𝑎 and 𝑏
corresponds to the exponential of the sum of 𝑎 and 𝑏 will be exploited many times in the context of
independent normally distributed random variables in PMNF. From it, it follows that we have
exp(𝑎 − 𝑏) = exp(𝑎)/exp(𝑏) and exp(𝑎) ⋅ exp(−𝑎) = exp(𝑎)/exp(𝑎) = 1 (6)
Note that the term “product property” is a convention used in these notes, but not a general label.
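The series definition (1) and the product property (5) can be verified numerically. The following sketch (Python, for illustration; the test values 𝑥, 𝑎, 𝑏 are arbitrary) compares a truncated series against the library exponential and checks the product property.

```python
import math

# Checks of the series definition (1) and the product property (5) of exp.
x = 0.5
partial_sum = sum(x ** k / math.factorial(k) for k in range(20))  # truncated series

a, b = 1.3, -0.7
product_property = abs(math.exp(a + b) - math.exp(a) * math.exp(b)) < 1e-12

print(abs(partial_sum - math.exp(x)) < 1e-12)   # True: the series converges to exp(x)
print(math.exp(0.0) == 1.0, product_property)   # True True
```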
The natural logarithm
The natural logarithm may be defined as the inverse function of the exponential function as follows
ln: ]0, ∞[ → ℝ, 𝑥 ↦ ln(𝑥) (7)
Figure 2. Visualization of the graphs of the exponential function and the natural logarithm. Note that 𝑒𝑥𝑝(0) = 1 and
that 𝑙𝑛(1) = 0.
where ln(𝑥) is characterized by the fact that
ln(exp(𝑥)) = 𝑥 for all 𝑥 ∈ ℝ and exp(ln(𝑥)) = 𝑥 for all 𝑥 ∈ ]0,∞[ (8)
Note that the natural logarithm is only defined for positive values 𝑥 ∈]0,∞[. The graph of the natural
logarithm is depicted in Figure 2. An important feature of the natural logarithm, which we will often exploit
in these notes, is its derivative
𝑑/𝑑𝑥 ln(𝑥) = ln′(𝑥) = 1/𝑥 (9)
Below we note some properties of the natural logarithm which are often helpful in algebraic manipulations.
Special values: ln(1) = 0 and ln(𝑒) = 1 (10)
Value ranges: 𝑥 ∈ ]0, 1[ ⇒ ln(𝑥) < 0 and 𝑥 ∈ ]1, ∞[ ⇒ ln(𝑥) > 0 (11)
The natural logarithm thus assumes values in the entire range of the real numbers, but is only defined on the
set of positive real numbers: ln(]0, ∞[) = ℝ.
Monotonicity. The natural logarithm is strictly monotonically increasing. In other words, if 𝑥 < 𝑦,
then it follows that ln(𝑥) < ln(𝑦).
Inverse property: ln(1/𝑥) = −ln(𝑥) for all 𝑥 ∈ ]0, ∞[ (12)
Product property: ln(𝑥𝑦) = ln(𝑥) + ln(𝑦) for all 𝑥, 𝑦 ∈ ]0, ∞[ (13)
Power property: ln(𝑥^𝑘) = 𝑘 ln(𝑥) for all 𝑥 ∈ ]0, ∞[ and 𝑘 ∈ ℚ (14)
Especially the facts that the natural logarithm “turns multiplication into addition”, ln(𝑥𝑦) = ln(𝑥) + ln(𝑦), and
that it “turns exponentiation into multiplication”, ln(𝑥^𝑘) = 𝑘 ⋅ ln(𝑥), will be exploited many times in PMNF. Note
that the terms “inverse property”, “product property” and “power property” are conventions used in PMNF,
but not general labels.
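The logarithm properties (12)–(14) and the inverse relation (8) can likewise be checked numerically. The following sketch (Python, for illustration; the test values 𝑥, 𝑦, 𝑘 are arbitrary) verifies each property within floating-point tolerance.

```python
import math

# Checks of the logarithm properties (12)-(14) and the inverse relation (8).
x, y, k = 2.5, 4.0, 3

inverse_property = abs(math.log(1 / x) + math.log(x)) < 1e-12          # (12)
product_property = abs(math.log(x * y) - (math.log(x) + math.log(y))) < 1e-12  # (13)
power_property = abs(math.log(x ** k) - k * math.log(x)) < 1e-12       # (14)
inverse_of_exp = abs(math.log(math.exp(x)) - x) < 1e-12                # (8)

print(inverse_property, product_property, power_property, inverse_of_exp)  # True True True True
```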
The cosine, sine, and tangent functions
We eschew a discussion of the geometric intuitions of the trigonometric functions (the interested
reader is referred to [Spivak 1994] for discussion of these basic properties) and introduce the cosine and sine
functions by means of their series representations
cos: ℝ → [−1, 1], 𝑥 ↦ cos(𝑥) ≔ ∑_{𝑛=0}^{∞} (−1)^𝑛 𝑥^{2𝑛}/(2𝑛)! (1)
sin: ℝ → [−1, 1], 𝑥 ↦ sin(𝑥) ≔ ∑_{𝑛=0}^{∞} (−1)^𝑛 𝑥^{2𝑛+1}/(2𝑛+1)! (2)
The graphs of the sine and cosine functions are shown in Figure 3. Notably, these functions vary in a
prescribed way between +1 and −1 and repeat themselves every 2𝜋.
Figure 3. Sine and cosine on the interval [−2𝜋, 4𝜋].
The following properties of the cosine and sine function, which we state without proofs, are useful
to remember (the interested reader is referred to [Spivak 1994] for a derivation of these properties from the
series definition of sine and cosine)
Derivatives: sin′ = cos and cos′ = −sin (3)
Special values: sin(0) = 0 and cos(0) = 1 (4)
Functional equations: cos(−𝑥) = cos(𝑥), sin(−𝑥) = −sin(𝑥), and cos^2(𝑥) + sin^2(𝑥) = 1 (5)
Addition theorems: For all 𝑥, 𝑦 ∈ ℝ
cos(𝑥 + 𝑦) = cos(𝑥) cos(𝑦) − sin(𝑥) sin(𝑦) and sin(𝑥 + 𝑦) = sin(𝑥) cos(𝑦) + cos(𝑥) sin(𝑦) (6)
Definition of 𝜋: The zero of cos in [0,2] multiplied by 2 is called 𝜋 (7)
Periodicity of cosine and sine with period 2𝜋: cos(𝑥 + 2𝜋) = cos(𝑥), sin(𝑥 + 2𝜋) = sin(𝑥) (8)
Zeros: cos(𝑥) = 0 ⇒ 𝑥 ∈ {𝜋/2 + 𝑘𝜋 | 𝑘 ∈ ℤ}, sin(𝑥) = 0 ⇒ 𝑥 ∈ {𝑘𝜋 | 𝑘 ∈ ℤ} (9)
Relation of cosine and sine: cos(𝑥 + 𝜋/2) = −sin(𝑥), sin(𝑥 + 𝜋/2) = cos(𝑥) (10)
The tangent function is defined as
tan: ℝ\{𝜋/2 + 𝑘𝜋 | 𝑘 ∈ ℤ} → ℝ, 𝑥 ↦ tan(𝑥) ≔ sin(𝑥)/cos(𝑥) (11)
The following properties of the tangent function, which we state without proofs, are useful (again
the interested reader is referred to [Spivak 1994] for a derivation of these properties)
Derivative: tan′(𝑥) = 1/cos^2(𝑥) = 1 + tan^2(𝑥) (𝑥 ≠ 𝜋/2 + 𝑘𝜋, 𝑘 ∈ ℤ) (12)
Special values: tan(0) = 0, tan(𝜋/4) = 1, tan(−𝜋/4) = −1 (13)
Functional equations: tan(−𝑥) = −tan(𝑥), tan(𝑥 + 𝜋) = tan(𝑥) (𝑥 ≠ 𝜋/2 + 𝑘𝜋, 𝑘 ∈ ℤ) (14)
The trigonometric functions are not injective and hence not invertible. However, on intervals on which
the functions are strictly monotonic, for example on [−𝜋/2, 𝜋/2] for the sine, [0, 𝜋] for the cosine, and
]−𝜋/2, 𝜋/2[ for the tangent function, one may define their inverse functions. These functions are referred to as
the “arc” functions and are defined as follows, where by 𝑓|_{[𝑎,𝑏]} we denote the restriction of a function 𝑓
to the interval [𝑎, 𝑏]:
arccos ≔ (cos|_{[0,𝜋]})^{−1}: [−1, 1] → [0, 𝜋] with cos(arccos(𝑥)) = 𝑥 (𝑥 ∈ [−1, 1]) (15)
arcsin ≔ (sin|_{[−𝜋/2,𝜋/2]})^{−1}: [−1, 1] → [−𝜋/2, 𝜋/2] with sin(arcsin(𝑥)) = 𝑥 (𝑥 ∈ [−1, 1]) (16)
and
arctan ≔ (tan|_{]−𝜋/2,𝜋/2[})^{−1}: ℝ → ]−𝜋/2, 𝜋/2[ with tan(arctan(𝑥)) = 𝑥 (𝑥 ∈ ℝ) (17)
Study Questions
1. Evaluate the sum 𝑦 ≔ ∑_{𝑖=1}^{4} 𝑎𝑖𝑥𝑖 for 𝑎1 = −1, 𝑎2 = 0, 𝑎3 = 2, 𝑎4 = −2 and 𝑥1 = 3, 𝑥2 = 2, 𝑥3 = 5, 𝑥4 = −2.
2. Explain the meaning of 𝑓: 𝐷 → 𝑅, 𝑥 ↦ 𝑓(𝑥) and its components 𝑓,𝐷, 𝑅, 𝑥, 𝑓(𝑥) using an example of your choice.
3. Is the function 𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) ≔ 2𝑥 + 2 a linear function? Justify your answer.
Study Questions Answers
1. With the definitions in the question we have
𝑦 = 𝑎1𝑥1 + 𝑎2𝑥2 + 𝑎3𝑥3 + 𝑎4𝑥4
= −1 ⋅ 3 + 0 ⋅ 2 + 2 ⋅ 5 + (−2) ⋅ (−2)
= −3 + 0 + 10 + 4
= 11
2. In 𝑓: 𝐷 → 𝑅, 𝑥 ↦ 𝑓(𝑥), 𝑓 denotes a function, i.e. a rule that allocates elements of 𝑅 to all elements of 𝐷. 𝐷 is a set and referred to
as the domain of the function (or its input set), 𝑅 is a set and referred to as the range of the function (or its output set). The
statement “𝑓: 𝐷 → 𝑅” defines the label of the function, its domain and its range. 𝑥 is an element of 𝐷 and denotes an input
argument of the function, ↦ is read as “maps to” and defines which function value 𝑓(𝑥) ∈ 𝑅 is allocated to an input argument 𝑥. As
an example, consider 𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) ≔ 𝑥2. Here, the function value for the input argument 𝑥 is defined as the square of 𝑥.
3. The function 𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) ≔ 2𝑥 + 2 on the vector space ℝ is not a linear function. For example, we have 𝑓(1) ≔ 2 ⋅ 1 +
2 = 4 and 𝑓(2) ≔ 2 ⋅ 2 + 2 = 6, but 𝑓(1 + 2) ≔ 2 ⋅ 3 + 2 = 8 ≠ 10 and thus 𝑓(1 + 2) ≠ 𝑓(1) + 𝑓(2).
Matrix Algebra
This Section reviews some aspects of elementary matrix algebra. The interested reader will find a
much more exhaustive and principled treatment in [Strang 2009].
(1) Matrix Definition
A matrix is a rectangular collection of numbers. Matrices are usually denoted by capital letters as
follows:
𝐴 = ( 𝑎11  𝑎12  …  𝑎1𝑚
      𝑎21  𝑎22  …  𝑎2𝑚
       ⋮    ⋮    ⋱   ⋮
      𝑎𝑛1  𝑎𝑛2  …  𝑎𝑛𝑚 )  = (𝑎𝑖𝑗)_{1≤𝑖≤𝑛, 1≤𝑗≤𝑚} (1)
An entry 𝑎𝑖𝑗 in a matrix 𝐴 is indexed by its row index 𝑖 and its column index 𝑗. For example, the entry 𝑎32 in
the matrix
𝐴 ≔ ( 2  7  8  2
      5  2  5  6
      6  4  9  2
      0  9  1  6 ) (2)
is 4.
The size (sometimes also called dimensionality) of a matrix is determined by its number of rows
𝑛 ∈ ℕ and its number of columns 𝑚 ∈ ℕ. If a matrix has the same number of rows and columns (that is, if
𝑛 = 𝑚) the matrix is called a “square matrix”. When describing a matrix, it is very helpful to mention its
number of rows and columns and the properties of its entries. In our treatment, the entries of a matrix will
usually be elements of the set of real numbers ℝ. To denote that a given matrix 𝐴 has entries from the set of
real numbers, and that it has 𝑛 rows and 𝑚 columns, we will write
𝐴 ∈ ℝ𝑛×𝑚 (3)
(3) may be expressed in words as “The matrix 𝐴 consists of 𝑛 rows and 𝑚 columns and the entries in 𝐴 are
real numbers”. For example, for the matrix 𝐴 in (2) we write
𝐴 ∈ ℝ^{4×4} (4)
because it has four rows and four columns (and is, in fact, a square matrix).
Above, we have introduced the set of real 𝑛-tuples (𝑛-dimensional vectors) ℝ𝑛. In the context of
matrix algebra, we can identify the 𝑛-dimensional vectors with the set of 𝑛 × 1 matrices. In other words, for
most of our purposes, we can treat 𝑛-dimensional vectors and matrices with 𝑛 rows and a single column as
equivalent. This also implies that we can set ℝ𝑛 ≔ ℝ𝑛×1 for 𝑛 ∈ ℕ.
Just as one can do algebra with real numbers, one can calculate with matrices. More specifically, one can
(1) Add and subtract two matrices of the same dimensionality (Matrix Addition and Subtraction)
(2) Multiply a matrix with a scalar (Matrix Scalar Multiplication)
(3) Multiply two matrices, but only if certain conditions hold (Matrix Multiplication)
(4) Divide by a matrix, or, more precisely, multiply by a matrix inverse (Matrix Inversion)
We will study each type of operation below. Addition, scalar multiplication, matrix multiplication, and matrix
inversion are not the only operations that can be performed with matrices, but will suffice for most of these
lecture notes. One additional concept we will require, however, is the notion of
(5) Matrix Transposition
(2) Matrix Addition and Subtraction
Two matrices of the same size, i.e. with the same number of rows and columns, can be added or subtracted
by element-wise adding or subtracting their entries. Formally, we can write this as
𝐴 + 𝐵 = 𝐶 (1)
with 𝐴, 𝐵, 𝐶 ∈ ℝ𝑛 ×𝑚, and the matrix 𝐶 is given by
𝐴 + 𝐵 = ( 𝑎11 + 𝑏11   𝑎12 + 𝑏12   …   𝑎1𝑚 + 𝑏1𝑚
          𝑎21 + 𝑏21   𝑎22 + 𝑏22   …   𝑎2𝑚 + 𝑏2𝑚
               ⋮            ⋮       ⋱        ⋮
          𝑎𝑛1 + 𝑏𝑛1   𝑎𝑛2 + 𝑏𝑛2   …   𝑎𝑛𝑚 + 𝑏𝑛𝑚 )
      = ( 𝑐11  𝑐12  …  𝑐1𝑚
          𝑐21  𝑐22  …  𝑐2𝑚
           ⋮    ⋮    ⋱   ⋮
          𝑐𝑛1  𝑐𝑛2  …  𝑐𝑛𝑚 )
      = 𝐶 (2)
The analogue element-wise operation is defined for subtraction
𝐴 − 𝐵 = 𝐶 (3)
Example
As an example, consider the 2 × 3 matrices 𝐴, 𝐵 ∈ ℝ^{2×3}
𝐴 ≔ ( 2  −3  0
      1   6  5 )   and   𝐵 ≔ (  4  1  0
                               −4  2  0 ) (4)
Since they both have the same size, we can add them, as below
𝐴 + 𝐵 = ( 2 + 4   −3 + 1   0 + 0
          1 − 4    6 + 2   5 + 0 )  = (  6  −2  0
                                        −3   8  5 ) (5)
and we can subtract them, as below
𝐴 − 𝐵 = ( 2 − 4   −3 − 1   0 − 0
          1 + 4    6 − 2   5 − 0 )  = ( −2  −4  0
                                         5   4  5 ) (6)
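The addition and subtraction in equations (5) and (6) can be reproduced with array software. The following sketch (Python/NumPy, for illustration) uses the same matrices 𝐴 and 𝐵 as the example.

```python
import numpy as np

# Element-wise matrix addition and subtraction, equations (5) and (6).
A = np.array([[2, -3, 0],
              [1,  6, 5]])
B = np.array([[ 4, 1, 0],
              [-4, 2, 0]])

print(A + B)   # [[ 6 -2  0] [-3  8  5]]
print(A - B)   # [[-2 -4  0] [ 5  4  5]]
```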
(3) Matrix Scalar Multiplication
One can also multiply a matrix 𝐴 ∈ ℝ^{𝑛×𝑚} by a number 𝑐 ∈ ℝ. Note that the set ℝ^{1×1} of
1 × 1 matrices with real entries is usually identified with the set of real numbers ℝ. The operation of
multiplying a matrix by a number is called “scalar multiplication” and is performed element-wise. Formally,
we have for 𝐴 ∈ ℝ^{𝑛×𝑚} and 𝑐 ∈ ℝ
𝑐𝐴 = 𝐵 (1)
where 𝐵 ∈ ℝ^{𝑛×𝑚} is evaluated according to
𝑐𝐴 = 𝑐 ( 𝑎11  𝑎12  …  𝑎1𝑚
         𝑎21  𝑎22  …  𝑎2𝑚
          ⋮    ⋮    ⋱   ⋮
         𝑎𝑛1  𝑎𝑛2  …  𝑎𝑛𝑚 )
   = ( 𝑐𝑎11  𝑐𝑎12  …  𝑐𝑎1𝑚
       𝑐𝑎21  𝑐𝑎22  …  𝑐𝑎2𝑚
        ⋮     ⋮    ⋱    ⋮
       𝑐𝑎𝑛1  𝑐𝑎𝑛2  …  𝑐𝑎𝑛𝑚 )
   = ( 𝑏11  𝑏12  …  𝑏1𝑚
       𝑏21  𝑏22  …  𝑏2𝑚
        ⋮    ⋮    ⋱   ⋮
       𝑏𝑛1  𝑏𝑛2  …  𝑏𝑛𝑚 )
   = 𝐵 (2)
As an example, consider the matrix 𝑋 ∈ ℝ^{4×3}
𝑋 ≔ ( 3  1  1
      5  2  5
      2  7  1
      3  4  2 ) (3)
and let 𝑐 ≔ −3. Then
𝑐𝑋 = −3 ( 3  1  1      ( −3 ∙ 3   −3 ∙ 1   −3 ∙ 1       (  −9   −3   −3
          5  2  5    =   −3 ∙ 5   −3 ∙ 2   −3 ∙ 5   =     −15   −6  −15
          2  7  1        −3 ∙ 2   −3 ∙ 7   −3 ∙ 1          −6  −21   −3
          3  4  2 )      −3 ∙ 3   −3 ∙ 4   −3 ∙ 2 )        −9  −12   −6 ) (4)
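The scalar multiplication in equation (4) can be reproduced numerically. The following sketch (Python/NumPy, for illustration) multiplies each entry of 𝑋 by 𝑐 = −3 in one operation.

```python
import numpy as np

# Element-wise scalar multiplication, equation (4).
X = np.array([[3, 1, 1],
              [5, 2, 5],
              [2, 7, 1],
              [3, 4, 2]])
c = -3

print(c * X)   # [[ -9  -3  -3] [-15  -6 -15] [ -6 -21  -3] [ -9 -12  -6]]
```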
(4) Matrix Multiplication
In addition to adding and subtracting matrices of the same size, and multiplying a matrix by a scalar,
one can also multiply two matrices. However, matrix multiplication is not an element-wise operation, but
has a special definition. Importantly, two matrices 𝐴 and 𝐵 can only be multiplied as
𝐴𝐵 (1)
if the first matrix 𝐴 has as many columns as the second matrix 𝐵 has rows, or in other words, if
𝐴 ∈ ℝ𝑛×𝑚 and 𝐵 ∈ ℝ𝑚×𝑝 (2)
with 𝑛,𝑚, 𝑝 ∈ ℕ. This condition is sometimes referred to as the equality of the “inner dimensions” with
respect to the product 𝐴𝐵. If this equality does not hold, the two matrices cannot be multiplied. However, if
the condition holds, the matrix product for 𝐴 ∈ ℝ𝑛×𝑚, 𝐵 ∈ ℝ𝑚×𝑝 can be written as
𝐴𝐵 = 𝐶 (3)
where 𝐶 ∈ ℝ^{𝑛×𝑝} is evaluated according to
𝐴𝐵 = ( 𝑎11  …  𝑎1𝑚        ( 𝑏11  …  𝑏1𝑝
        ⋮   ⋱   ⋮      ⋅     ⋮   ⋱   ⋮
       𝑎𝑛1  …  𝑎𝑛𝑚 )        𝑏𝑚1  …  𝑏𝑚𝑝 )
   = ( ∑_{𝑘=1}^{𝑚} 𝑎1𝑘𝑏𝑘1   ∑_{𝑘=1}^{𝑚} 𝑎1𝑘𝑏𝑘2   …   ∑_{𝑘=1}^{𝑚} 𝑎1𝑘𝑏𝑘𝑝
       ∑_{𝑘=1}^{𝑚} 𝑎2𝑘𝑏𝑘1   ∑_{𝑘=1}^{𝑚} 𝑎2𝑘𝑏𝑘2   …   ∑_{𝑘=1}^{𝑚} 𝑎2𝑘𝑏𝑘𝑝
                ⋮                       ⋮          ⋱             ⋮
       ∑_{𝑘=1}^{𝑚} 𝑎𝑛𝑘𝑏𝑘1   ∑_{𝑘=1}^{𝑚} 𝑎𝑛𝑘𝑏𝑘2   …   ∑_{𝑘=1}^{𝑚} 𝑎𝑛𝑘𝑏𝑘𝑝 )
   = ( 𝑐11  …  𝑐1𝑝
        ⋮   ⋱   ⋮
       𝑐𝑛1  …  𝑐𝑛𝑝 )
   = 𝐶 (4)
The expression for the matrix product is a bit unwieldy, but very important. What the expression says is that
the entry in the 𝑖-th row and 𝑗-th column of the matrix product of the two matrices 𝐴 ∈ ℝ^{𝑛×𝑚} and
𝐵 ∈ ℝ^{𝑚×𝑝} is given by overlaying the entries in the 𝑖-th row of matrix 𝐴 with the entries in the 𝑗-th column of
matrix 𝐵, multiplying the overlaid numbers, and then adding them all up. It takes some time to get used to
matrix multiplication, so do not worry if it is not completely clear now. Consider the example below next.
Example
Let 𝐴 ∈ ℝ^{2×3} and 𝐵 ∈ ℝ^{3×2} be defined as
𝐴 ≔ ( 2  −3  0
      1   6  5 )   and   𝐵 ≔ (  4  2
                               −1  0
                                1  3 ) (5)
We first consider the size of the matrix 𝐶 ≔ 𝐴𝐵. We know that 𝐴 has two rows and three columns, while 𝐵
has three rows and two columns. Because 𝐵 has the same number of rows as 𝐴 has columns, the matrix
product 𝐴𝐵 = 𝐶 is defined. The expression (4) tells us that the resulting matrix 𝐶 has two rows and two
columns, because the number of rows of the resulting matrix 𝐶 is determined by the number of rows of the
first matrix 𝐴, and the number of columns of the resulting matrix 𝐶 is determined by the number of columns
of the second matrix 𝐵. We thus know that 𝐶 is a 2 × 2 matrix, or 𝐶 ∈ ℝ^{2×2}. Overlaying the rows of 𝐴 on the
columns of 𝐵, multiplying the entries, and adding the results up yields
𝐴𝐵 = ( 2  −3  0     (  4  2
       1   6  5 )  ⋅  −1  0
                       1  3 ) (6)
   = ( (2 ∙ 4) + (−3 ∙ −1) + (0 ∙ 1)    (2 ∙ 2) + (−3 ∙ 0) + (0 ∙ 3)
       (1 ∙ 4) + (6 ∙ −1) + (5 ∙ 1)     (1 ∙ 2) + (6 ∙ 0) + (5 ∙ 3) )
   = ( 8 + 3 + 0    4 + 0 + 0
       4 − 6 + 5    2 + 0 + 15 )
   = ( 11   4
        3  17 )
   = 𝐶
It is essential to always keep track of the sizes of the matrices that are involved in a multiplication.
Specifically, if matrix 𝐴 is of size 𝑛 × 𝑚 and matrix 𝐵 is of size 𝑚 × 𝑝, then the product 𝐴𝐵 will always be of
size 𝑛 × 𝑝. This can be visualized as
(𝑛 × 𝑚) ∙ (𝑚 × 𝑝) = (𝑛 × 𝑝) (7)
i.e. the inner numbers 𝑚 disappear.
If one is calculating with scalars, multiplication is commutative, that is for 𝑎, 𝑏 ∈ ℝ, we have
𝑎𝑏 = 𝑏𝑎. In general, matrix multiplication is not commutative, i.e. the side on which 𝐴 and 𝐵 stand matters.
As an example, consider the matrices from above. We have just seen that
𝐶 ≔ 𝐴𝐵 = ( 2  −3  0     (  4  2           ( 11   4
           1   6  5 )  ⋅  −1  0       =      3  17 ) (8)
                           1  3 )
On the other hand, we have
𝐷 ≔ 𝐵𝐴 = (  4  2      ( 2  −3  0        ( 10   0  10
           −1  0    ⋅   1   6  5 )  =    −2   3   0
            1  3 )                        5  15  15 ) (9)
as you may convince yourself as an exercise.
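The products in equations (8) and (9), and hence the non-commutativity of matrix multiplication, can be verified numerically. The following sketch (Python/NumPy, for illustration) reuses the matrices from the example.

```python
import numpy as np

# Matrix multiplication and non-commutativity, equations (8) and (9).
A = np.array([[2, -3, 0],
              [1,  6, 5]])       # 2 x 3
B = np.array([[ 4, 2],
              [-1, 0],
              [ 1, 3]])          # 3 x 2

C = A @ B                        # 2 x 2
D = B @ A                        # 3 x 3 -- not even the same shape as AB

print(C)   # [[11  4] [ 3 17]]
print(D)   # [[10  0 10] [-2  3  0] [ 5 15 15]]
```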
(5) Matrix Inversion
In order to motivate the concept of matrix inversion consider the equation
𝐴𝑋 = 𝑌 (1)
with 𝐴 ∈ ℝ^{𝑛×𝑛}, 𝑋 ∈ ℝ^{𝑛×𝑛}, and thus 𝑌 ∈ ℝ^{𝑛×𝑛}. We now simplify the above by assuming 𝑛 = 1, i.e. the
matrices in (1) are in fact scalar numbers. To remind us of this assumption, we represent (1) using lower-case
letters 𝑎, 𝑥, 𝑦 ∈ ℝ as
𝑎𝑥 = 𝑦 (2)
Let us further assume that we know that 𝑎 = 2 and 𝑦 = 6, that is, we have the equation
2 ⋅ 𝑥 = 6 (3)
In school mathematics, we were taught how to solve equations of the form (3) for the “unknown variable
𝑥” for many years. The general strategy was to isolate 𝑥 by the appropriate operations (like addition and
multiplication) on the left-hand side and observe the outcome on the right-hand side. This strategy would
look similar to something like this
2 ⋅ 𝑥 = 6        | divide both sides by 2 (4)
(2/2) ⋅ 𝑥 = 6/2  | evaluate the ratios
𝑥 = 3
and we have learned that the unknown value of the variable 𝑥 is equal to 3. We also remember from high
school algebra that dividing by a scalar number 𝑎 corresponds to multiplication with its inverse, where the
inverse is understood as that number that yields 1 if multiplied with 𝑎, that is, 1/𝑎.
We may thus also write (3) as
2 ⋅ 𝑥 = 6                | multiply both sides by 1/2 (5)
(1/2) ⋅ 2 ⋅ 𝑥 = (1/2) ⋅ 6 | evaluate the products
𝑥 = 3
Of course, the strategy used does not change the result. You may also remember that the inverse of a scalar 𝑎,
i.e. 1/𝑎, can also be denoted as 𝑎^{−1}. In general, 𝑎^{−1} is called the inverse of 𝑎 if, and only if,
𝑎𝑎^{−1} = 𝑎^{−1}𝑎 = 1 (6)
where 1 denotes the “neutral element” of multiplication, that is, for all 𝑎
𝑎 ∙ 1 = 1 ∙ 𝑎 = 𝑎 (7)
To recapitulate (5) in more abstract terms, we have carried out the following operation
𝑎𝑥 = 𝑦              | multiply both sides by 𝑎^{−1} (8)
𝑎^{−1}𝑎𝑥 = 𝑎^{−1}𝑦  | evaluate the product
𝑥 = 𝑎^{−1}𝑦
We now return to the case that 𝐴 ∈ ℝ^{𝑛×𝑛}, 𝑋 ∈ ℝ^{𝑛×𝑝}, and 𝑌 ∈ ℝ^{𝑛×𝑝} are in fact matrices. Very often we will
encounter analogous statements to (2) for matrices, which are referred to as “Linear Systems of Equations”
𝐴𝑋 = 𝑌 (9)
where we know 𝐴 and 𝑌 and would like to evaluate 𝑋. In analogy to the strategy described in (8), we can
multiply both sides of (9) with the inverse of 𝐴, denoted as 𝐴−1 and obtain
𝐴−1𝐴𝑋 = 𝐴−1𝑌 (10)
In analogy to 𝑎−1𝑎 = 1 for scalars, the matrix product 𝐴−1𝐴 yields, by definition, a very specific matrix,
namely the identity matrix 𝐼𝑛 ∈ ℝ𝑛×𝑛. The identity matrix has ones on its main diagonal and zeros
everywhere else. That is, by definition, we have
𝐴^{−1}𝐴 = 𝐼𝑛 ≔ ( 1  0  …  0
                 0  1  …  0
                 ⋮  ⋮  ⋱  ⋮
                 0  0  …  1 ) (11)
In analogy to 𝑎 ∙ 1 = 1 ∙ 𝑎 = 𝑎 for scalars, the product of a matrix 𝐴 with the identity always yields the
matrix 𝐴 again
𝐴𝐼 = 𝐼𝐴 = 𝐴 (12)
If we consider (10) again, we thus see that we can evaluate the left-hand side of (10) as follows
𝐴−1𝐴𝑋 = 𝐴−1𝑌 ⇔ 𝐼𝑛𝑋 = 𝐴−1𝑌 ⇔ 𝑋 = 𝐴−1𝑌 (13)
In other words, if we have a way to evaluate the inverse 𝐴−1 of the matrix 𝐴, we can solve for the unknown
matrix 𝑋 just as we can solve for the unknown variable 𝑥 in 𝑎𝑥 = 𝑦.
There is a lot more to learn about matrix inversion. For example, so far, we have no idea how to
actually evaluate 𝐴−1 for, say
𝐴 ≔ ( 1   0  2
      2  −1  3
      4   1  8 ) (14)
or in other words, to find a matrix 𝐴−1 such that the matrix product 𝐴−1𝐴 evaluates as
𝐴^{−1}𝐴 = 𝐴^{−1} ( 1   0  2        ( 1  0  0
                   2  −1  3    =     0  1  0
                   4   1  8 )        0  0  1 )  = 𝐼3 (15)
Also, specific conditions exist under which a matrix may not even be invertible (just as one cannot evaluate
the inverse of the scalar 0, i.e. 1/0 is “not allowed”). There is also a deep relation between matrix inversion and
the concept of matrix determinants, which you may recall from high-school mathematics. Further note that
we have been specific about the dimensions of 𝐴, 𝑋, and 𝑌: the matrix 𝐴 to be inverted must be square. For non-
square matrices, things become more complex, and “proper” matrix inversion is not defined.
One example of an inverse matrix will be helpful with respect to multivariate Gaussian distributions:
the inverse of a diagonal matrix, i.e. a square matrix that has non-zero elements on its main diagonal, and
zeros everywhere else:
𝐷 = ( 𝑑1   0   ⋯   0
       0  𝑑2   ⋯   0
       ⋮   ⋮   ⋱   ⋮
       0   0   ⋯  𝑑𝑛 )  ∈ ℝ^{𝑛×𝑛} (16)
The inverse of 𝐷 is given by the diagonal matrix with the scalar inverses 𝑑𝑖^{−1} = 1/𝑑𝑖 on its diagonal:
𝐷^{−1} = ( 1/𝑑1    0    ⋯    0
             0   1/𝑑2   ⋯    0
             ⋮     ⋮    ⋱    ⋮
             0     0    ⋯  1/𝑑𝑛 )  ∈ ℝ^{𝑛×𝑛} (17)
as one may convince oneself by evaluating the matrix product 𝐷−1𝐷 for, say 𝑛 = 4.
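The suggested check of equations (16) and (17) for 𝑛 = 4 can be carried out numerically. The following sketch (Python/NumPy, for illustration; the diagonal entries are hypothetical) inverts the diagonal entries element-wise and verifies that the product gives the identity matrix.

```python
import numpy as np

# Inverse of a diagonal matrix, equations (16) and (17), checked for n = 4.
D = np.diag([2.0, 4.0, 5.0, 10.0])
D_inv = np.diag(1.0 / np.diag(D))   # invert each diagonal entry

print(np.allclose(D_inv @ D, np.eye(4)))   # True
print(np.allclose(D @ D_inv, np.eye(4)))   # True
```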
In the following section, the interested reader will find a procedure that allows for evaluating the
inverse of small (e.g. 𝑛 = 2, 3, 4) square matrices 𝐴 ∈ ℝ^{𝑛×𝑛} by hand. This is helpful for becoming familiar
with the theory of matrix inversion. However, in real life, matrix inverses are usually evaluated
algorithmically with the help of a computer, so being proficient in manual matrix inversion is not an essential
skill.
(6) Inversion of small matrices by hand
While the introduction of matrix inversion has so far explained the main principles, it has not been
discussed how to actually invert a matrix, i.e. how to compute 𝐴−1 if 𝐴 is known. In fact, there are different
ways to do that, and we will only discuss one approach here for the remainder of the chapter.
We will demonstrate this approach by trying to solve the following matrix equation
𝐴𝑋 = ( 1   0  2     ( 𝑥11  𝑥12  𝑥13       ( 2  −2  3
       2  −1  3   ⋅   𝑥21  𝑥22  𝑥23   =     1   4  2
       4   1  8 )     𝑥31  𝑥32  𝑥33 )       2   1  1 )  = 𝐵 (1)
with 𝐴 ∈ ℝ^{3×3} and 𝐵 ∈ ℝ^{3×3} and unknown 𝑋 ∈ ℝ^{3×3}. Recall that we can solve the equation for the
matrix 𝑋 ∈ ℝ^{3×3} if we find an inverse 𝐴^{−1} ∈ ℝ^{3×3} for the matrix 𝐴 ∈ ℝ^{3×3}, because then
𝑋 = 𝐴−1𝐵 (2)
Also recall, that for an inverse of 𝐴 ∈ ℝ3×3, the following equation holds
𝐴−1𝐴 = 𝐼 (3)
where
𝐼 = ( 1  0  0
      0  1  0
      0  0  1 )  ∈ ℝ^{3×3} (4)
A practical approach to finding the inverse of 𝐴 is the following
1. Write down the matrix 𝐴 and next to it the identity matrix 𝐼
( 1   0  2 | 1  0  0
  2  −1  3 | 0  1  0
  4   1  8 | 0  0  1 ) (5)
2. Now use three kinds of operations on the rows of 𝐴 to transform 𝐴 into the identity matrix. Apply the
same operations to the identity matrix 𝐼 in parallel. The operations allowed are
a. Exchanging two rows of 𝐴
b. Multiplying a row of 𝐴 by a number
c. Adding or subtracting a multiple of another row of 𝐴 from any row of 𝐴
Adding −2 times the first row to the second row, and −4 times the first row to the third row, yields
( 1   0   2 |  1  0  0
  0  −1  −1 | −2  1  0
  0   1   0 | −4  0  1 ) (6)
Exchanging the second and third rows yields
( 1   0   2 |  1  0  0
  0   1   0 | −4  0  1
  0  −1  −1 | −2  1  0 ) (7)
Adding the second row to the third row yields
( 1  0   2 |  1  0  0
  0  1   0 | −4  0  1
  0  0  −1 | −6  1  1 ) (8)
Adding 2 times the third row to the first row yields
( 1  0   0 | −11  2  2
  0  1   0 |  −4  0  1
  0  0  −1 |  −6  1  1 ) (9)
Multiplying the third row by −1 yields
( 1  0  0 | −11   2   2
  0  1  0 |  −4   0   1
  0  0  1 |   6  −1  −1 ) (10)
Having transformed the matrix on the left into the identity matrix, the matrix now standing on the right is
the inverse 𝐴^{−1} ∈ ℝ^{3×3}, as can be verified by computing
( 1   0  2     ( −11   2   2       ( 1  0  0
  2  −1  3   ⋅    −4   0   1   =     0  1  0
  4   1  8 )       6  −1  −1 )       0  0  1 ) (11)
Having obtained
𝐴^{−1} = ( −11   2   2
            −4   0   1
             6  −1  −1 ) (12)
the solution to
𝐴𝑋 = ( 1   0  2     ( 𝑥11  𝑥12  𝑥13       ( 2  −2  3
       2  −1  3   ⋅   𝑥21  𝑥22  𝑥23   =     1   4  2
       4   1  8 )     𝑥31  𝑥32  𝑥33 )       2   1  1 )  = 𝐵 (13)
is then
𝑋 = 𝐴^{−1}𝐵 = ( −11   2   2     ( 2  −2  3       ( −16   32  −27
                 −4   0   1   ⋅   1   4  2   =      −6    9  −11
                  6  −1  −1 )     2   1  1 )         9  −17   15 ) (14)
(7) Matrix Transposition
A very basic, and often useful, matrix operation is the transposition of a matrix. By the transposition
of a matrix we understand the exchange of its row and column elements. The transposition of a matrix 𝐴 is
denoted by 𝐴^𝑇 and implies that, if 𝐴 ∈ ℝ^{𝑛×𝑚}, then 𝐴^𝑇 ∈ ℝ^{𝑚×𝑛}. Formally, we have for 𝐴 ∈ ℝ^{𝑛×𝑚} and
𝐵 ≔ 𝐴^𝑇, where
𝐴 = ( 𝑎11  𝑎12  …  𝑎1𝑚
      𝑎21  𝑎22  …  𝑎2𝑚
       ⋮    ⋮    ⋱   ⋮
      𝑎𝑛1  𝑎𝑛2  …  𝑎𝑛𝑚 ) (1)
the following form of 𝐵 ∈ ℝ^{𝑚×𝑛}:
𝐵 = ( 𝑏11  𝑏12  …  𝑏1𝑛           ( 𝑎11  𝑎21  …  𝑎𝑛1
      𝑏21  𝑏22  …  𝑏2𝑛       ≔     𝑎12  𝑎22  …  𝑎𝑛2
       ⋮    ⋮    ⋱   ⋮              ⋮    ⋮    ⋱   ⋮
      𝑏𝑚1  𝑏𝑚2  …  𝑏𝑚𝑛 )           𝑎1𝑚  𝑎2𝑚  …  𝑎𝑛𝑚 ) (2)
For example, if
𝐴 ≔ ( 2  −3  0
      1   6  5 ) (3)
then
𝐴^𝑇 = (  2  1
        −3  6
         0  5 ) (4)
If the transpose of a matrix is identical to the matrix, which can only be the case for square matrices,
because otherwise the size of the matrix changes, then the matrix is called “symmetric”. Covariance matrices
encountered later are typical examples of symmetric matrices, as are identity matrices. Finally note that the
transpose of a 1 × 1 matrix, i.e. a scalar, is just the same scalar again.
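Transposition and the symmetry property mentioned above can be illustrated numerically. The following sketch (Python/NumPy, for illustration) reproduces equations (3) and (4) and checks that a product of the form 𝐴𝐴^𝑇 is symmetric.

```python
import numpy as np

# Transposition, equations (3) and (4), and a symmetry check.
A = np.array([[2, -3, 0],
              [1,  6, 5]])

print(A.T)                       # [[ 2  1] [-3  6] [ 0  5]]

S = A @ A.T                      # products of this form are always symmetric
print(np.array_equal(S, S.T))    # True
```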
(8) Matrix Determinants
The theory of matrix determinants is quite involved and has deep connections to the theory of linear
systems of equations and vector space theory. Historically, determinants were used to determine whether
systems of linear equations of the form 𝐴𝑥 = 𝑏, where 𝐴 ∈ ℝ𝑛×𝑛, 𝑥 ∈ ℝ𝑛, 𝑏 ∈ ℝ𝑛 have a unique solution,
which may be determined based on the value of the determinant of 𝐴. Today, it is actually debatable
whether determinants are really of much value in linear algebra. Nevertheless, we will encounter a matrix
determinant in the context of multivariate Gaussian distributions, so we briefly review some special
determinants in this section.
Determinants are defined for square matrices, are scalars, and are usually written as det(𝐴) or, as in
these notes, as |𝐴|. For a 2 × 2 matrix 𝐴 ∈ ℝ^{2×2}, the determinant is defined as follows
|𝐴| = | ( 𝑎  𝑏
          𝑐  𝑑 ) | ≔ 𝑎𝑑 − 𝑏𝑐 (1)
For a 3 × 3 matrix 𝐴 ∈ ℝ^{3×3}, the determinant is defined as
|𝐴| = | ( 𝑎  𝑏  𝑐
          𝑑  𝑒  𝑓
          𝑔  ℎ  𝑖 ) | ≔ 𝑎𝑒𝑖 + 𝑏𝑓𝑔 + 𝑐𝑑ℎ − 𝑏𝑑𝑖 − 𝑎𝑓ℎ − 𝑐𝑒𝑔 (2)
which one may either memorize using Sarrus’ rule, or based on the determinant development
|𝐴| = 𝑎 | ( 𝑒  𝑓        − 𝑏 | ( 𝑑  𝑓        + 𝑐 | ( 𝑑  𝑒
            ℎ  𝑖 ) |            𝑔  𝑖 ) |            𝑔  ℎ ) | (3)
    = 𝑎𝑒𝑖 − 𝑎𝑓ℎ − 𝑏𝑑𝑖 + 𝑏𝑓𝑔 + 𝑐𝑑ℎ − 𝑐𝑒𝑔
    = 𝑎𝑒𝑖 + 𝑏𝑓𝑔 + 𝑐𝑑ℎ − 𝑏𝑑𝑖 − 𝑎𝑓ℎ − 𝑐𝑒𝑔
For $n > 3$, the determinant of a square matrix $A \in \mathbb{R}^{n \times n}$ is more complicated to evaluate. A general formula for matrix determinants exists (the "Leibniz formula"). However, understanding and using this formula requires some group theory, and especially the notion of permutation groups, which we do not require (and hence will not cover) in these notes. We merely state it here for completeness. For $A \in \mathbb{R}^{n \times n}$, the determinant is given by

$$|A| = \left| (a_{ij})_{1 \le i,j \le n} \right| = \sum_{\sigma \in S_n} \mathrm{sgn}(\sigma) \prod_{i=1}^{n} a_{i,\sigma_i} \quad (4)$$

where $S_n$ denotes the symmetric group on $n$ elements, $\mathrm{sgn}(\sigma)$ the sign of an element $\sigma$ of the symmetric group, and $\sigma_i$ the value at the $i$th position upon application of $\sigma$.
More important than the Leibniz formula in the context of these notes is the determinant of a special matrix, namely a diagonal matrix with non-zero elements along its main diagonal and zeros everywhere else. For such matrices, the determinant is given by the product of the diagonal elements: let $A \in \mathbb{R}^{n \times n}$ be diagonal, i.e., $A$ is of the form

$$A = \begin{pmatrix} a_{11} & 0 & \cdots & 0 \\ 0 & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{nn} \end{pmatrix} \in \mathbb{R}^{n \times n} \quad (5)$$

Then

$$|A| = \prod_{i=1}^{n} a_{ii} = a_{11} \cdot a_{22} \cdot \ldots \cdot a_{nn} \quad (6)$$
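The $2 \times 2$ formula (1) and the diagonal rule (6) can be verified against a numerical routine. A minimal sketch in Python with NumPy (illustrative only):

```python
# Determinant of a 2x2 matrix via |A| = ad - bc, and of a diagonal matrix
# via the product of its diagonal elements, cross-checked with numpy.linalg.det.
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
det_formula = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]   # ad - bc

D = np.diag([2.0, 3.0, 4.0])       # diagonal 3x3 matrix
det_diag = np.prod(np.diag(D))     # product of the diagonal elements
```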
(9) Rank of a Matrix
The rank of a matrix is an important concept and intuitively serves as a measure of the "degenerateness" of the system of linear equations or the linear transformation encoded by a matrix. There are multiple ways to define the rank of a matrix, and all rely on additional theory with respect to the concepts covered thus far in this Section. Here, we introduce the rank of a matrix as the cardinality of the largest set of "linearly independent columns" of a matrix, which requires the notion of linearly independent columns, a concept from the theory of vector spaces.
Intuitively, the columns of a matrix 𝐴 ∈ ℝ𝑛×𝑛 can be conceived as “𝑛-dimensional vectors”, i.e., lists
of numbers with 𝑛 entries. For example, the matrix
$$A = \begin{pmatrix} 2 & 1 & 3 \\ 1 & 4 & 5 \\ 3 & 4 & 0 \end{pmatrix} \in \mathbb{R}^{3 \times 3} \quad (1)$$

can be conceived as the concatenation of the three 3-dimensional vectors (or matrices with a single column)

$$v_1 = \begin{pmatrix} 2 \\ 1 \\ 3 \end{pmatrix}, \; v_2 = \begin{pmatrix} 1 \\ 4 \\ 4 \end{pmatrix} \text{ and } v_3 = \begin{pmatrix} 3 \\ 5 \\ 0 \end{pmatrix} \in \mathbb{R}^{3 \times 1} =: \mathbb{R}^3 \quad (2)$$
Now, let $v_1, v_2, \ldots, v_n \in \mathbb{R}^n$ denote a set of $n$-dimensional vectors with real entries and let $a_1, a_2, \ldots, a_n \in \mathbb{R}$ denote a set of real scalars. Using the definitions of matrix scalar multiplication and matrix addition, we may then evaluate the sum

$$v := a_1 v_1 + a_2 v_2 + \cdots + a_n v_n = \sum_{i=1}^{n} a_i v_i \quad (3)$$
The resulting vector $v \in \mathbb{R}^n$ is referred to as a "linear combination" of the vectors $v_1, v_2, \ldots, v_n$. The vectors $v_1, v_2, \ldots, v_n$ are called linearly dependent, if there exist scalars $a_1, a_2, \ldots, a_n$, which are not all zero, such that

$$a_1 v_1 + a_2 v_2 + \cdots + a_n v_n = 0 \quad (4)$$

The linear combination with $a_1 = a_2 = \cdots = a_n = 0$ of a vector set $v_1, v_2, \ldots, v_n$, i.e.,

$$0 v_1 + 0 v_2 + \cdots + 0 v_n = 0 \quad (5)$$

is called the "trivial representation" of the zero vector $0 \in \mathbb{R}^n$. We may thus state that $v_1, v_2, \ldots, v_n$ are linearly dependent, if there exists a "non-trivial" representation of $0 \in \mathbb{R}^n$, and are linearly independent, if the only linear combination of $v_1, v_2, \ldots, v_n$ that results in the zero vector $0 \in \mathbb{R}^n$ is the trivial one.
Consider the following examples. Let
$$v_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \; v_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} \text{ and } v_3 = \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} \in \mathbb{R}^{3 \times 1} \quad (6)$$

Then $v_1, v_2, v_3$ are linearly dependent, because if we choose $a_1 = a_2 = 1$ and $a_3 = -1$, we can write the zero vector $0 \in \mathbb{R}^3$ using $a_i$'s ($i = 1,2,3$) that are not all zero:

$$1 \cdot \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} + 1 \cdot \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} - 1 \cdot \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} - \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} = 0 \quad (7)$$
On the other hand, let
$$v_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \; v_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} \text{ and } v_3 = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} \in \mathbb{R}^{3 \times 1} \quad (8)$$

Then $v_1, v_2, v_3$ are linearly independent, which can be made intuitive by considering their linear combination

$$a_1 \cdot \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} + a_2 \cdot \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} + a_3 \cdot \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} a_1 \\ 0 \\ 0 \end{pmatrix} + \begin{pmatrix} 0 \\ a_2 \\ 0 \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ a_3 \end{pmatrix} = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix} \quad (9)$$
As soon as we choose any of the $a_1, a_2, a_3$ to be non-zero, this linear combination no longer results in the zero vector, and hence the only representation of the zero vector by means of the vectors defined in (8) is the trivial one. In general, it is not easy to determine whether a given set of vectors is linearly dependent or not, and many specialized approaches exist to evaluate this.
Returning to the notion of a matrix rank, the column rank of a matrix was introduced above as "the cardinality of the largest set of linearly independent columns (i.e., column vectors)" of a matrix. From the discussion above, we may now infer that the rank of the matrix

$$A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \in \mathbb{R}^{3 \times 3} \quad (10)$$
is 3. If the rank of a matrix equals its number of columns, the matrix is said to be of “full-rank”.
Importantly, it can be shown that if a square matrix is of full-rank, it can be inverted, while it cannot be inverted if it is not of full-rank. To show that this is actually the case requires additional theory, and the interested reader is referred to [Strang 2009] in this regard.
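Numerically, the rank is readily evaluated. A minimal sketch in Python with NumPy (illustrative only), using the two vector sets from (6) and (8) as matrix columns:

```python
# Rank as the number of linearly independent columns, evaluated with
# numpy.linalg.matrix_rank.
import numpy as np

# Columns v1 = (1,0,0)^T, v2 = (0,1,0)^T, v3 = (1,1,0)^T from (6):
# v3 = v1 + v2, so only two columns are linearly independent.
A_dependent = np.array([[1, 0, 1],
                        [0, 1, 1],
                        [0, 0, 0]])
rank_dependent = np.linalg.matrix_rank(A_dependent)

# The identity matrix from (10) has three independent columns: full rank.
I3 = np.eye(3)
rank_identity = np.linalg.matrix_rank(I3)
```

Consistent with the statement above, a full-rank square matrix is invertible; `np.linalg.inv(A_dependent)` would raise a `LinAlgError` here, while `np.linalg.inv(I3)` succeeds.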
(10) Matrix symmetry and positive-definiteness
A square matrix $A \in \mathbb{R}^{n \times n}$ is called symmetric, if it equals its transpose, i.e., if $A = A^T$. For example, the matrix

$$A = \begin{pmatrix} 2 & 1 & 3 \\ 1 & 4 & 5 \\ 3 & 5 & 0 \end{pmatrix} \quad (1)$$

is symmetric, which can be checked by writing down its transpose. Square, symmetric matrices can have an additional property, referred to as positive-definiteness. The covariance matrix of a multivariate Gaussian distribution, which is an essential component of PMFN, has to be positive-definite, which motivates the introduction of this term in the current Section. As for the rank of a matrix, positive-definiteness can be approached from multiple perspectives, and it is usually not trivial to check whether a given matrix is in fact positive-definite. We here introduce a very basic notion of positive-definiteness that mainly serves to introduce the term itself.
A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is called positive-definite, if the scalar quantity $x^T A x > 0$ for all non-zero $x \in \mathbb{R}^n$, and is called positive semi-definite if $x^T A x \ge 0$ for all $x \in \mathbb{R}^n$.
As an example, we consider the real symmetric matrix

$$A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \in \mathbb{R}^{2 \times 2} \quad (2)$$

If we consider a non-zero vector (or "matrix with a single column")

$$x := \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \quad (3)$$

we find that

$$x^T A x = \begin{pmatrix} x_1 & x_2 \end{pmatrix} \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 2x_1 + x_2 & x_1 + 2x_2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = (2x_1 + x_2)x_1 + (x_1 + 2x_2)x_2 \quad (4)$$

If we consider the sum on the right-hand side of (4) further, we see

$$(2x_1 + x_2)x_1 + (x_1 + 2x_2)x_2 = 2x_1^2 + x_2 x_1 + x_1 x_2 + 2x_2^2 \quad (5)$$
$$= 2x_1^2 + 2x_1 x_2 + 2x_2^2$$
$$= x_1^2 + 2x_1 x_2 + x_2^2 + x_1^2 + x_2^2$$
$$= (x_1 + x_2)^2 + x_1^2 + x_2^2$$

For any non-zero $x = (x_1, x_2)^T$ we have thus obtained a sum of squares, which cannot be zero. Hence, the matrix $A$ is positive-definite.
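In practice, positive-definiteness is usually checked numerically rather than by completing squares. A minimal sketch in Python with NumPy (illustrative only; it uses the standard eigenvalue criterion, which is not derived in these notes: a symmetric matrix is positive-definite exactly if all its eigenvalues are positive):

```python
# Positive-definiteness of the example matrix (2) via its eigenvalues,
# plus a direct spot-check of x^T A x > 0 for some non-zero vectors.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues = np.linalg.eigvalsh(A)          # eigenvalues of a symmetric matrix
is_positive_definite = bool(np.all(eigenvalues > 0))

def quadratic_form(x):
    return float(x @ A @ x)                  # x^T A x

values = [quadratic_form(np.array(v))
          for v in [(1.0, 0.0), (1.0, -1.0), (-2.0, 3.0)]]
```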
Study Questions

1. Let
$$A := \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \text{ and } B := \begin{pmatrix} 1 & 1 \\ 0 & 2 \end{pmatrix}$$
Evaluate the following matrices: $C := A + B^T$, $D := A - B$, $E := AB$ and $F := BA$.
2. Let $X \in \mathbb{R}^{10 \times 3}$. What is the size of the matrix $Y := X^T X$?
3. Explain the concept of a matrix inverse.
4. Evaluate the inverse $A^{-1}$ for
$$A := \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & -4 \end{pmatrix}$$
5. Evaluate the determinants of the matrices
$$M := \begin{pmatrix} 1 & 2 \\ 0 & 1 \end{pmatrix} \text{ and } N := \begin{pmatrix} 2 & 1 \\ 2 & 1 \end{pmatrix}$$
6. Write down the definition of the rank of a matrix.
7. Write down the definition of a positive-definite matrix.

Study question answers
1. With the definitions of the question, we have

$$C = A + B^T = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} + \begin{pmatrix} 1 & 1 \\ 0 & 2 \end{pmatrix}^T = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} + \begin{pmatrix} 1 & 0 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 2 & 2 \\ 4 & 6 \end{pmatrix}$$

$$D = A - B = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} - \begin{pmatrix} 1 & 1 \\ 0 & 2 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 3 & 2 \end{pmatrix}$$

$$E = AB = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 0 & 2 \end{pmatrix} = \begin{pmatrix} 1 & 5 \\ 3 & 11 \end{pmatrix}$$

$$F = BA = \begin{pmatrix} 1 & 1 \\ 0 & 2 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} = \begin{pmatrix} 4 & 6 \\ 6 & 8 \end{pmatrix}$$
2. Using the "matrix size formula" $(m \times n) \cdot (n \times p) = m \times p$ for a matrix product $AB \in \mathbb{R}^{m \times p}$ of matrices $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$, we have with $X^T \in \mathbb{R}^{3 \times 10}$ that $(3 \times 10) \cdot (10 \times 3) = 3 \times 3$ and thus $Y = X^T X \in \mathbb{R}^{3 \times 3}$.
3. In analogy to the division of a number $x \in \mathbb{R}$ by itself, which yields 1, the inverse $A^{-1} \in \mathbb{R}^{n \times n}$ of a square matrix $A \in \mathbb{R}^{n \times n}$ is that matrix which, if (pre- or post-)multiplied by $A$, yields the identity matrix $I_n$: $A^{-1}A = AA^{-1} = I_n$.
4. The inverse of $A$ as defined in the question is given by

$$A^{-1} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & 0 \\ 0 & 0 & -\frac{1}{4} \end{pmatrix}$$

because

$$\begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & 0 \\ 0 & 0 & -\frac{1}{4} \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & -4 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & -4 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & 0 \\ 0 & 0 & -\frac{1}{4} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$
5. For $2 \times 2$ matrices, the determinant is given by $|A| = \begin{vmatrix} a & b \\ c & d \end{vmatrix} := ad - bc$. We thus have

$$|M| = 1 \cdot 1 - 2 \cdot 0 = 1 \quad \text{and} \quad |N| = 2 \cdot 1 - 1 \cdot 2 = 0$$
6. The rank of a matrix is the cardinality of the largest set of linearly independent columns of the matrix.
7. A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is called positive-definite, if $x^T A x > 0$ for all non-zero $x \in \mathbb{R}^n$.
Differential Calculus
Here we summarize some basic results from uni- and multivariate calculus. We eschew a discussion
of concepts from real analysis, such as continuity and differentiability of functions, which are assumed to
hold for the functions of interest covered in PMFN. Interested readers may find [Spivak 1994] a helpful
resource for further study.
(1) Intuition and definition of derivatives of univariate functions
We first concern ourselves with derivatives of univariate, real-valued functions, by which we
understand functions that map real, scalar numbers onto real, scalar numbers, in other words, functions 𝑓 of
the type
𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) (1)
The derivative $f'(x_0) \in \mathbb{R}$ of a function $f$ at the location $x_0 \in \mathbb{R}$ conveys two basic and familiar intuitions:
(1) It is a measure of the rate of change of $f$ at location $x_0$.
(2) It is the slope of the tangent line of $f$ at the point $(x_0, f(x_0)) \in \mathbb{R}^2$.
Formally, this may be expressed using the “differential quotient” of 𝑓. The differential quotient (sometimes
also referred to as “Newton’s difference quotient”) expresses the difference between two values of the
function 𝑓(𝑥 + ℎ) and 𝑓(𝑥), where 𝑥, ℎ ∈ ℝ with respect to the difference between the two locations 𝑥 and
𝑥 + ℎ for ℎ approaching zero:
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} \quad (2)$$
While (2) represents the formal definition of the derivative of 𝑓 and forms the basis for proofs of the rules of
differentiation we will discuss in the following, its practical importance for our purposes is rather negligible.
Before discussing some basic rules of differentiation, we note that the derivative can either be
considered at a specific point 𝑥0, which is often denoted as
𝑓′(𝑥)|𝑥=𝑥0 (3)
or, if evaluated for all possible values, be considered as a function
𝑓′:ℝ → ℝ, 𝑥0 ↦ 𝑓′(𝑥0) ≔ 𝑓′(𝑥)|𝑥=𝑥0 (4)
Intuitively, (4) just means that the derivative of a differentiable function may be evaluated at any point of
the real line.
The derivative $f'$ of a function $f$ is also referred to as the "first-order" derivative of a function, where the "zeroth-order" derivative of a function just corresponds to the function itself. Higher-order derivatives (i.e., second-order, third-order, and so on) can be evaluated by recursively forming the derivative of the respective lower-order derivative. For example, the second-order derivative of a function corresponds to the (first-order) derivative of the (first-order) derivative of a function. To this end, the $\frac{d}{dx}$ operator notation for derivatives is useful. Intuitively, $\frac{d}{dx} f$ can be understood as the imperative to evaluate the derivative of $f$, or simply as an alternative notation for $f'$. By itself, $\frac{d}{dx}$ carries no meaning. We thus have for the first-order derivative

$$\frac{d}{dx} f(x) = f'(x) \quad (5)$$

and for the second-order derivative

$$\frac{d^2}{dx^2} f(x) = \frac{d}{dx}\left(\frac{d}{dx} f(x)\right) = f''(x) \quad (6)$$
Intuitively, the second-order derivative measures the rate of change of the first-order derivative in the
vicinity of 𝑥. If these first-order derivatives, which may be visualized as tangent lines, change relatively
quickly in the vicinity of 𝑥 , the second-order derivative is large, and the function is said to have a high
“curvature”.
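The limit definition (2) also suggests a numerical approximation: for a small but finite $h$, Newton's difference quotient is close to $f'(x)$. A minimal sketch in Python (illustrative only), using $f(x) = x^2$ with $f'(x) = 2x$:

```python
# The difference quotient (f(x+h) - f(x)) / h approaches f'(x) as h -> 0,
# illustrated for f(x) = x^2 at x = 1, where f'(1) = 2.
def f(x):
    return x ** 2

def difference_quotient(f, x, h):
    """Newton's difference quotient of f at location x with step size h."""
    return (f(x + h) - f(x)) / h

# Shrinking h improves the approximation of f'(1) = 2.
approximations = [difference_quotient(f, 1.0, h) for h in (1e-1, 1e-3, 1e-6)]
```

For this particular $f$, the quotient equals $2x + h$ exactly, so the approximation error shrinks linearly with $h$.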
(2) Derivatives of important functions
In this section we collect without proof the derivatives of a number of commonly encountered
functions.
Constant function
The derivative of any constant function
𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) ≔ 𝑎 (𝑎 ∈ ℝ) (1)
is zero:
𝑓′: ℝ → ℝ, 𝑥 ↦ 𝑓′(𝑥) = 0 (2)
For example, the derivative of 𝑓(𝑥) ≔ 2 is 𝑓′(𝑥) = 0.
Single-term polynomials
Let $f$ be a "single-term polynomial function" of the form

$$f: \mathbb{R} \to \mathbb{R}, \; x \mapsto f(x) := a x^b \quad (a, b \in \mathbb{R}) \quad (3)$$

Then the derivative of $f$ is given by

$$f': \mathbb{R} \to \mathbb{R}, \; x \mapsto f'(x) = ab\, x^{b-1} \quad (4)$$
For example, the derivative of $f(x) := 2x^3$ is $f'(x) = 6x^2$, and the derivative of $g(x) := \sqrt{x} = x^{1/2}$ is

$$g'(x) = \frac{1}{2} x^{-1/2} = \frac{1}{2\sqrt{x}}$$
Sine and cosine

Let $f$ be the sine function

$$f: \mathbb{R} \to [-1,1], \; x \mapsto f(x) := \sin(x) \quad (5)$$

Then the derivative of $f$ is given by

$$f': \mathbb{R} \to [-1,1], \; x \mapsto f'(x) = \cos(x) \quad (6)$$

Further, let $g$ be the cosine function

$$g: \mathbb{R} \to [-1,1], \; x \mapsto g(x) := \cos(x) \quad (7)$$

Then the derivative of $g$ is given by

$$g': \mathbb{R} \to [-1,1], \; x \mapsto g'(x) = -\sin(x) \quad (8)$$
Exponential and logarithm

Let $f$ be the exponential function

$$f: \mathbb{R} \to \mathbb{R}^+, \; x \mapsto f(x) := \exp(x) \quad (9)$$

Then the derivative of $f$ is given by

$$f': \mathbb{R} \to \mathbb{R}^+, \; x \mapsto f'(x) = \exp(x) \quad (10)$$

Further, let $g$ be the natural logarithm

$$g: \mathbb{R}^+ \to \mathbb{R}, \; x \mapsto g(x) := \ln(x) \quad (11)$$

Then the derivative of $g$ is given by

$$g': \mathbb{R}^+ \to \mathbb{R}, \; x \mapsto g'(x) = \frac{1}{x} \quad (12)$$
(3) Rules of differentiation
In the following, we state important rules for computing derivatives of univariate functions without proof. For a formal derivation of these rules, the interested reader may consult [Spivak 1994].
Summation rule
Let
$$f: \mathbb{R} \to \mathbb{R}, \; x \mapsto f(x) := \sum_{i=1}^{n} g_i(x) \quad (13)$$

be the sum of $n$ arbitrary functions $g_i: \mathbb{R} \to \mathbb{R}$ ($i = 1,2,\ldots,n$). Then the derivative of $f$ is given by the sum of the derivatives of the $g_i$:

$$f': \mathbb{R} \to \mathbb{R}, \; x \mapsto f'(x) := \sum_{i=1}^{n} g_i'(x) \quad (14)$$

For example, the derivative of $f(x) = x^2 + 2x$ with $g_1(x) := x^2$ and $g_2(x) := 2x$ is

$$f'(x) = g_1'(x) + g_2'(x) = 2x + 2 \quad (15)$$
47
Chain Rule
Let $h$ be the concatenation of two functions $f: \mathbb{R} \to \mathbb{R}$ and $g: \mathbb{R} \to \mathbb{R}$, i.e.
ℎ:ℝ → ℝ, 𝑥 ↦ 𝑔(𝑓(𝑥)) (16)
Then the derivative of ℎ is given by
ℎ′: ℝ → ℝ, 𝑥 ↦ ℎ′(𝑥) ≔ 𝑔′(𝑓(𝑥))𝑓′(𝑥) (17)
In words: the derivative of a function that can be written as the concatenation of a first function 𝑓 with a
second function 𝑔 is given by the derivative of the second function 𝑔 “at the location of the function 𝑓”
multiplied with the derivative of the first function. For example, the derivative of ℎ(𝑥) ≔ exp(sin(𝑥)),
which can be written as the concatenation of a function 𝑓(𝑥) ≔ exp(𝑥) with derivative 𝑓′(𝑥) = exp(𝑥)
and a function 𝑔(𝑥) ≔ sin(𝑥) with derivative 𝑔′(𝑥) = cos(𝑥) is given by ℎ′(𝑥) = exp(sin(𝑥)) cos(𝑥).
Product Rule
Let $f$ be the product of two functions $g_i: \mathbb{R} \to \mathbb{R}$ ($i = 1,2$), i.e.

$$f: \mathbb{R} \to \mathbb{R}, \; x \mapsto f(x) := g_1(x) g_2(x) \quad (18)$$

Then the derivative of $f$ is given by

$$f': \mathbb{R} \to \mathbb{R}, \; x \mapsto f'(x) := g_1'(x) g_2(x) + g_1(x) g_2'(x) \quad (19)$$

where $g_1'$ and $g_2'$ denote the derivatives of $g_1$ and $g_2$, respectively. In words: if a function can be written as the product of a first and a second function, its derivative corresponds to the product of the derivative of the first function with the second function "plus" the product of the first function with the derivative of the second function. For example, the derivative of $f(x) := x^2 \exp(x)$ can be found by writing $f$ as $g_1 \cdot g_2$ with $g_1(x) := x^2$ and $g_2(x) := \exp(x)$ with derivatives $g_1'(x) = 2x$ and $g_2'(x) = \exp(x)$, respectively. This then yields for the derivative of $f$ that $f'(x) = 2x \exp(x) + x^2 \exp(x)$.
Quotient Rule
Let $f$ be the quotient of two functions $g_i: \mathbb{R} \to \mathbb{R}$ ($i = 1,2$), i.e.

$$f: \mathbb{R} \to \mathbb{R}, \; x \mapsto f(x) := \frac{g_1(x)}{g_2(x)} \quad (20)$$

Then the derivative of $f$ is given by

$$f': \mathbb{R} \to \mathbb{R}, \; x \mapsto f'(x) := \frac{g_1'(x) g_2(x) - g_1(x) g_2'(x)}{g_2^2(x)} \quad (21)$$

In words: the derivative of a function that can be written as the quotient of a first function in the numerator and a second function in the denominator is given by the difference of the product of the derivative of the first function with the second function and the product of the first function with the derivative of the second function, divided by the square of the second function, i.e., the function in the denominator of the original function. For example, the derivative of $f(x) := \frac{\sin(x)}{x^2+1}$ can be evaluated by considering $g_1(x) := \sin(x)$ with derivative $g_1'(x) = \cos(x)$ and $g_2(x) := x^2 + 1$ with derivative $g_2'(x) = 2x$ to yield

$$f'(x) = \frac{\cos(x) \cdot (x^2+1) - \sin(x) \cdot 2x}{(x^2+1)^2}$$
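These rules can be spot-checked numerically against a central difference approximation of the derivative. A minimal sketch in Python (illustrative only), using the three worked examples above:

```python
# Numerical check of the product, chain, and quotient rule examples via
# the central difference (f(x+h) - f(x-h)) / (2h) ~ f'(x).
import math

def numeric_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.7

# Product rule: f(x) = x^2 exp(x), f'(x) = 2x exp(x) + x^2 exp(x)
product_exact = 2 * x * math.exp(x) + x ** 2 * math.exp(x)
product_numeric = numeric_derivative(lambda t: t ** 2 * math.exp(t), x)

# Chain rule: h(x) = exp(sin(x)), h'(x) = exp(sin(x)) cos(x)
chain_exact = math.exp(math.sin(x)) * math.cos(x)
chain_numeric = numeric_derivative(lambda t: math.exp(math.sin(t)), x)

# Quotient rule: f(x) = sin(x) / (x^2 + 1)
quotient_exact = (math.cos(x) * (x ** 2 + 1) - math.sin(x) * 2 * x) / (x ** 2 + 1) ** 2
quotient_numeric = numeric_derivative(lambda t: math.sin(t) / (t ** 2 + 1), x)
```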
(4) Analytical Optimization
First- and second-order derivatives can be used to find (local) maxima and minima of functions.
Finding maxima and minima of functions is a fundamental aspect of applied mathematics and is referred to
in general as “optimization”. It is helpful to distinguish clearly between two aspects of optimization: on the
one hand, when finding a maximum or a minimum, one finds a value $f(x)$ of the function, say $f: D \to R$, in its range $R$ for which the condition $f(x) \ge f(\tilde{x})$ or $f(x) \le f(\tilde{x})$, respectively, holds for all $\tilde{x}$ at least in the vicinity of $x \in D$.
These values in the range of 𝑓 are called maxima or minima, and are sometimes abbreviated by
“max𝑥∈𝐷 𝑓 (𝑥)” and “min𝑥∈𝐷 𝑓 (𝑥)”. On the other hand, one simultaneously finds those points 𝑥 in the
domain of 𝑓, for which 𝑓(𝑥) assumes a maximum or minimum. These points, which are often more
interesting than the corresponding values 𝑓(𝑥) themselves, are referred to as “extremal points” and
sometimes abbreviated by “argmax𝑥∈𝐷 𝑓 (𝑥)” or “argmin𝑥∈𝐷 𝑓 (𝑥)” for extremal points that correspond
to maxima and minima of 𝑓, respectively. Note the difference between max𝑥∈𝐷 𝑓 (𝑥) and argmax𝑥∈𝐷 𝑓 (𝑥):
the former refers to a point (or a set of points) in the range of 𝑓, the latter to a point (or a set of points) in
the domain of 𝑓.
When using first- and second-order derivatives to find extremal points and their corresponding
maxima or minima, it is helpful to distinguish (a) “necessary” and (b) “sufficient” conditions for extremal
points.
Necessary condition for an extremum
The necessary condition for an extremum of a function 𝑓:ℝ → ℝ (i.e., a maximum or minimum) at a
point 𝑥 ∈ 𝐷 is that the first-derivative “vanishes”, or more precisely, is equal to zero: 𝑓′(𝑥) = 0. Intuitively,
this can be made transparent by considering a maximum of 𝑓 at a point 𝑥𝑚𝑎𝑥: for all values of 𝑥 which are
smaller than 𝑥𝑚𝑎𝑥, the derivative is positive, because the function is increasing, leading to the maximum. For
the values of 𝑥 which are larger than 𝑥𝑚𝑎𝑥, the derivative is negative, because the function is decreasing. At
the location of the maximum, the function is neither increasing nor decreasing, and thus 𝑓′(𝑥) = 0. The
reverse is true for a minimum of $f$ at a point $x_{min}$: for all values of $x$ which are smaller than $x_{min}$, the
derivative is negative, because the function is decreasing towards the minimum. For the values of 𝑥 which
are larger than 𝑥𝑚𝑖𝑛, the derivative is positive, because the function is increasing again and recovering from
the minimum. Again, at the location of the minimum, the function is neither increasing nor decreasing, and
thus 𝑓′(𝑥) = 0.
Because in both cases $f'(x) = 0$, one cannot decide, based on finding a point $x^*$ for which $f'(x^*) = 0$ holds, whether $x^*$ corresponds to a maximum or a minimum. On the other hand, if a minimum or maximum exists at a point $x^*$, it necessarily follows that $f'(x^*) = 0$, hence the nomenclature "necessary condition". In fact, there is a third possibility for points at which $f'(x^*) = 0$: the function may be increasing both for $x < x^*$ and for $x > x^*$, or decreasing on both sides (consider, e.g., $f(x) = x^3$ at $x^* = 0$). In this case, there is neither a maximum nor a minimum at $x^*$, but what is referred to as a "saddle point".
49
Sufficient condition for an extremum
The second-order derivative $f''(x)$, which intuitively refers to "the slope of the tangent line of the slope of the tangent line of $f: \mathbb{R} \to \mathbb{R}$ at $x$", allows one to test whether a critical point $x^*$ for an extremum (i.e., a point for which $f'(x^*) = 0$) is a maximum or a minimum. In brief, if $f''(x^*) < 0$, there is a maximum at $x^*$, and if $f''(x^*) > 0$, there is a minimum at $x^*$. If $f''(x^*) = 0$, the test is inconclusive: $x^*$ may be a maximum, a minimum, or a saddle point (consider, e.g., $f(x) = x^4$, which has a minimum at $0$ with $f''(0) = 0$).
Together with the condition 𝑓′(𝑥∗) = 0, these conditions are referred to as sufficient conditions for an
extremum. This can be made intuitive by considering a maximum at 𝑥∗. For points 𝑥 < 𝑥∗, the slope of the
tangent line at 𝑓(𝑥) must be positive, because 𝑓 is increasing towards 𝑥∗. Likewise for points 𝑥 > 𝑥∗, the
slope of the tangent line at 𝑓(𝑥) must be negative, because 𝑓 is decreasing after assuming its maximum in
$x^*$. In other words, $f'(x) > 0$ (positive) for $x < x^*$, $f'(x) < 0$ (negative) for $x > x^*$, and $f'(x^*) = 0$.
We now consider the change of 𝑓′, i.e. 𝑓′′: In the region around the maximum, 𝑓′ decreases from a positive
value, to zero, to a negative value, as just stated. Because 𝑓′(𝑥) is positive just before (to the left of) 𝑥∗ and
negative just after (i.e. to the right of) 𝑥∗, it obviously decreases from just before to just after 𝑥∗ . But this
means that its own rate of change, 𝑓′′, is negative in 𝑥∗. The reverse obviously holds for a minimum of 𝑓 in
𝑥∗. Figure 1 visualizes the behavior of 𝑓, 𝑓′ and 𝑓′′ for three different functions in the neighborhood of a
maximum or minimum of 𝑓.
To recapitulate, we have established the following conditions that use the derivatives of a function $f: \mathbb{R} \to \mathbb{R}$ to determine its extrema:
Necessary condition
If there is a maximum or minimum of $f: \mathbb{R} \to \mathbb{R}$ at $x^*$, then $f'(x^*) = 0 \quad (1)$
Sufficient conditions for an extremum
If for 𝑓:ℝ → ℝ 𝑓′(𝑥∗) = 0 and 𝑓′′(𝑥∗) > 0, then there is a minimum at 𝑥∗ (2)
If for 𝑓:ℝ → ℝ 𝑓′(𝑥∗) = 0 and 𝑓′′(𝑥∗) < 0, then there is a maximum at 𝑥∗ (3)
In the following, we discuss three examples in which we use the conditions above to determine the location
of extremal points (see Figure 1).
Figure 1. Analytical optimization of simple functions. For a detailed discussion, please refer to the main text.
Example 1
Consider the function (Figure 1, left panel, blue curve)
$$f: [0, \pi] \to [0,1], \; x \mapsto \sin(x) \quad (4)$$

The first derivative of $f$ is given by (Figure 1, left panel, red curve)

$$f': [0, \pi] \to [-1,1], \; x \mapsto \frac{d}{dx} \sin(x) = \cos(x) \quad (5)$$

and the cosine function assumes a zero at $\frac{\pi}{2}$ in the interval $[0, \pi]$. We thus have the critical point $x^* = \pi/2$ for an extremum. The second derivative of $f$ is given by (Figure 1, left panel, red dashed curve)

$$f'': [0, \pi] \to [-1,0], \; x \mapsto \frac{d}{dx} \cos(x) = -\sin(x) \quad (6)$$

and because $-\sin(\pi/2) = -1 < 0$, we can conclude that there is a maximum of $f$ at $x = \pi/2$. Of course, this is also obvious from the graph of $f$.
Example 2
Consider the function (Figure 1, middle panel, blue curve)
$$f: \mathbb{R} \to \mathbb{R}, \; x \mapsto (x-1)^2 \quad (7)$$

The first derivative of $f$ is given by (Figure 1, middle panel, red curve)

$$f': \mathbb{R} \to \mathbb{R}, \; x \mapsto \frac{d}{dx}\left((x-1)^2\right) = 2x - 2 \quad (8)$$

Setting this derivative to zero and solving for $x$ yields

$$2x - 2 = 0 \Leftrightarrow 2x = 2 \Leftrightarrow x = 1 \quad (9)$$

We thus have the critical point $x^* = 1$ for an extremum. The second derivative of $f$ is given by (Figure 1, middle panel, red dashed curve)

$$f'': \mathbb{R} \to \mathbb{R}, \; x \mapsto \frac{d}{dx}(2x - 2) = 2 \quad (10)$$

The second derivative is thus a constant function, and $f''(x^*) = 2 > 0$. We thus conclude that there is a minimum of $f$ at $x = 1$. Again, this is also obvious from the graph of $f$.
Example 3
Consider the function (Figure 1, right panel, blue curve)
$$f: \mathbb{R} \to \mathbb{R}, \; x \mapsto -x^2 \quad (11)$$

The first derivative of $f$ is given by (Figure 1, right panel, red curve)

$$f': \mathbb{R} \to \mathbb{R}, \; x \mapsto \frac{d}{dx}(-x^2) = -2x \quad (12)$$

Setting this derivative to zero and solving for $x$ yields

$$-2x = 0 \Leftrightarrow x = 0 \quad (13)$$

We thus have the critical point $x^* = 0$ for an extremum. The second derivative of $f$ is given by (Figure 1, right panel, red dashed curve)

$$f'': \mathbb{R} \to \mathbb{R}, \; x \mapsto \frac{d}{dx}(-2x) = -2 \quad (14)$$

The second derivative is thus a constant function, and $f''(x^*) = -2 < 0$. We thus conclude that there is a maximum of $f$ at $x = 0$.
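The necessary and sufficient conditions can also be checked numerically when analytical derivatives are unavailable. A minimal sketch in Python (illustrative only), applying finite-difference approximations of $f'$ and $f''$ to Example 2, $f(x) = (x-1)^2$:

```python
# Finite-difference check of the extremum conditions for f(x) = (x-1)^2:
# the critical point x* = 1 satisfies f'(x*) = 0 and f''(x*) = 2 > 0.
def f(x):
    return (x - 1.0) ** 2

def first_derivative(x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)             # ~ f'(x) = 2x - 2

def second_derivative(x, h=1e-4):
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2   # ~ f''(x) = 2

x_star = 1.0                                   # critical point found in (9)
necessary_holds = abs(first_derivative(x_star)) < 1e-8
is_minimum = second_derivative(x_star) > 0     # sufficient condition (2)
```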
(5) Multivariate, real-valued functions and partial derivatives
So far, we have considered functions of the form 𝑓 ∶ ℝ → ℝ, which map single numbers (scalars)
𝑥 ∈ ℝ onto single numbers (scalars) 𝑓(𝑥) ∈ ℝ. Another function type that is commonly encountered in
PMFN are functions of the form
$$f: \mathbb{R}^n \to \mathbb{R}, \; x \mapsto f(x), \quad \text{where } x := \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} \in \mathbb{R}^n \text{ are "$n$-tuples" or "$n$-dimensional vectors"} \quad (1)$$
We will call 𝑥 as defined above a “column vector”, and, if we write the vector as a “row vector” (𝑥1, … , 𝑥𝑛)
call this the “transpose” of 𝑥, denoted by a “𝑇”-superscript, 𝑥 = (𝑥1, … , 𝑥𝑛)𝑇. By default, all vectors we
consider are column vectors. In physics, functions of the type (1) are referred to as “scalar fields”, because
they allocate scalars 𝑓(𝑥) ∈ ℝ to “points in (𝑛-dimensional) space” (𝑥1, … , 𝑥𝑛)𝑇. Because the functional
value $f(x)$ is a real number, these kinds of functions are also called "real-valued" functions, in
contradistinction to “vector-valued functions” discussed in a later section. An example for a function of the
type (1) is the following
$$f: \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto f(x) = f\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\right) := x_1^2 + x_2^2 \quad (2)$$

which is visualized in two different ways in the upper panels of Figure 2. Another example is the function
$$g: \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto g(x) = g\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\right) := \exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right) \quad (3)$$

which is visualized in two different ways in the lower panels of Figure 2. Note that functions defined on spaces $\mathbb{R}^n$ with $n > 2$ are not easily visualized.
Just as for univariate, real-valued functions, one can ask how much a change in the input argument
at a specific point in ℝ𝑛 of a multivariate, real-valued function affects the value of that function. If one asks
this question for each of the subcomponents $x_i$ ($i = 1,\ldots,n$) of $x = (x_1, \ldots, x_n)^T \in \mathbb{R}^n$ independently of the
remaining 𝑛 − 1 subcomponents, one is led to the concept of a “partial derivative”: the partial derivative of
a multivariate, real-valued function 𝑓 ∶ ℝ𝑛 → ℝ with respect to a variable 𝑥𝑖 (𝑖 = 1,… , 𝑛) captures how
much the function value changes “in the direction” of 𝑥𝑖, i.e., in the cross-section through the space ℝ𝑛
defined by the variable of interest. Stated differently, the partial derivative of a function 𝑓 ∶ ℝ𝑛 → ℝ in a
point 𝑥 ∈ ℝ𝑛 with respect to a variable 𝑥𝑖 (𝑖 = 1,… , 𝑛) is the derivative of the function 𝑓 with respect to 𝑥𝑖
while all other variables $x_j$ ($j \in \{1,2,\ldots,n\}, j \ne i$) are held constant. The partial derivative of a function $f: \mathbb{R}^n \to \mathbb{R}$ at a point $x \in \mathbb{R}^n$ with respect to a variable $x_i$ is denoted by $\frac{\partial}{\partial x_i} f(x)$, where the "$\partial$" symbol is used to distinguish the notion of a partial derivative from a standard derivative. This notation is somewhat redundant, because the subscript $i$ on the $x$ in $\frac{\partial}{\partial x_i}$ already makes it clear that the derivative is with respect to $x_i$ only; however, it is commonly used, and if the subcomponents of $x$ are not denoted by $x_1, \ldots, x_n$, but by, say, $a := x_1, b := x_2, \ldots, e := x_n$, it is, in fact, helpful.
Because, as for the derivative of a univariate, real-valued function, one may evaluate the partial derivative for all $x \in \mathbb{R}^n$, one can also view the partial derivative of a multivariate, real-valued function as a function:

$$\frac{\partial}{\partial x_i} f: \mathbb{R}^n \to \mathbb{R}, \; x \mapsto \frac{\partial}{\partial x_i} f(x) \quad (6)$$
Figure 2. Visualization of multivariate (here: bivariate), real-valued functions. Real-valued functions of multiple variables are routinely visualized in a three-dimensional way, as in the left panels of the Figure. Note that although this is a 3D plot, the function is bivariate, i.e., a function of two variables. The same information can be conveyed by using "isocontour" plots, which visualize the "isocontours" of functions in a two-dimensional way. Isocontours are the lines along which the function assumes equal values in its range. Usually, isocontour plots suffice to convey all relevant information about a bivariate function.
Examples
We first consider the example
$$f: \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto f(x) = f\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\right) := x_1^2 + x_2^2 \quad (7)$$

Because this function has a 2-dimensional domain, one can evaluate 2 different partial derivatives, namely

$$\frac{\partial}{\partial x_1} f: \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto \frac{\partial}{\partial x_1} f(x) \quad \text{and} \quad \frac{\partial}{\partial x_2} f: \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto \frac{\partial}{\partial x_2} f(x) \quad (8)$$
To evaluate the partial derivative $\frac{\partial}{\partial x_1} f: \mathbb{R}^2 \to \mathbb{R}$, one considers the function

$$f_{x_2}: \mathbb{R} \to \mathbb{R}, \; x_1 \mapsto f_{x_2}(x_1) := x_1^2 + x_2^2 \quad (9)$$

where $x_2$ assumes the role of a constant. To indicate that $x_2$ is no longer an input argument of the function, but the function is still dependent on the constant $x_2$, we have used the subscript notation "$f_{x_2}(x_1)$". To evaluate the partial derivative $\frac{\partial}{\partial x_1} f$, one evaluates the standard (univariate) derivative of $f_{x_2}$:

$$f_{x_2}'(x_1) = 2x_1 \quad (10)$$

We thus have

$$\frac{\partial}{\partial x_1} f: \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto \frac{\partial}{\partial x_1} f(x) = \frac{\partial}{\partial x_1}\left(x_1^2 + x_2^2\right) = f_{x_2}'(x_1) = 2x_1 \quad (11)$$

and, accordingly, with the corresponding definition of $f_{x_1}$:

$$\frac{\partial}{\partial x_2} f: \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto \frac{\partial}{\partial x_2} f(x) = \frac{\partial}{\partial x_2}\left(x_1^2 + x_2^2\right) = f_{x_1}'(x_2) = 2x_2 \quad (12)$$
We next consider the example

$$g: \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto g(x) = g\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\right) := \exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right) \quad (13)$$

Again, there are two partial derivatives $\frac{\partial}{\partial x_1} g$ and $\frac{\partial}{\partial x_2} g$. Using the chain rule of (univariate) differentiation and the logic of treating the variable(s) with respect to which the derivative is not performed as constants in standard univariate differentiation, we obtain, without making the function properties of the partial derivatives explicit,

$$\frac{\partial}{\partial x_1} g(x) = \frac{\partial}{\partial x_1} \exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right)$$
$$= \exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right) \cdot \frac{\partial}{\partial x_1}\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right)$$
$$= -\exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right)(x_1 - 1) \quad (14)$$

and

$$\frac{\partial}{\partial x_2} g(x) = \frac{\partial}{\partial x_2} \exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right)$$
$$= \exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right) \cdot \frac{\partial}{\partial x_2}\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right)$$
$$= -\exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right)(x_2 - 1) \quad (15)$$
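The analytical results (14) and (15) can be spot-checked by finite differences, holding the other variable constant. A minimal sketch in Python (illustrative only):

```python
# Central-difference check of the partial derivatives (14) and (15) of
# g(x) = exp(-((x1-1)^2 + (x2-1)^2)/2) at an arbitrary point.
import math

def g(x1, x2):
    return math.exp(-0.5 * ((x1 - 1) ** 2 + (x2 - 1) ** 2))

def dg_dx1(x1, x2):               # analytical result (14)
    return -g(x1, x2) * (x1 - 1)

def dg_dx2(x1, x2):               # analytical result (15)
    return -g(x1, x2) * (x2 - 1)

h = 1e-6
x1, x2 = 0.3, 1.7
numeric_dx1 = (g(x1 + h, x2) - g(x1 - h, x2)) / (2 * h)   # x2 held constant
numeric_dx2 = (g(x1, x2 + h) - g(x1, x2 - h)) / (2 * h)   # x1 held constant
```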
(6) Higher-order partial derivatives¹
Like for the standard derivative of a univariate real-valued function 𝑓 ∶ ℝ → ℝ, higher-order partial
derivatives can be formulated and evaluated by taking partial derivatives of partial derivatives. Because
multivariate real-valued functions of the form 𝑓 ∶ ℝ𝑛 → ℝ are functions of multiple input arguments, more
possibilities exist for higher-order derivatives compared to the univariate case. For example, given the partial
derivative $\frac{\partial}{\partial x_1} f$ of a function $f: \mathbb{R}^3 \to \mathbb{R}$, one may next form the partial derivative again with respect to $x_1$, yielding the second-order partial derivative $\frac{\partial^2}{\partial x_1^2} f$, which is equivalent to the second-order derivative of a univariate function. However, one may also next form the partial derivative with respect to $x_2$, $\frac{\partial^2}{\partial x_2 \partial x_1} f$, or with respect to $x_3$, $\frac{\partial^2}{\partial x_3 \partial x_1} f$. Note that the numerator of the partial derivative sign increases its power with the order of the derivative, and the denominator denotes the variables with respect to which the derivative is taken. If the derivative is taken multiple times with respect to the same variable, the variable in the denominator is notated with the corresponding power. Again, note that these are mere conventions to signal the form of the partial derivative; the symbols themselves do not have any meaning besides the implicit encouragement to the reader to evaluate the corresponding partial derivative.
To clarify the notation, we evaluate the first and second-order partial derivatives of the function
$$f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto f(x) := x_1^2 + x_1 x_2 + x_2 \sqrt{x_3} \quad (1)$$
We have for the first-order derivatives

$$\frac{\partial}{\partial x_1} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial}{\partial x_1} f(x) = \frac{\partial}{\partial x_1}\left(x_1^2 + x_1 x_2 + x_2 \sqrt{x_3}\right) = 2x_1 + x_2 \quad (2)$$

$$\frac{\partial}{\partial x_2} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial}{\partial x_2} f(x) = \frac{\partial}{\partial x_2}\left(x_1^2 + x_1 x_2 + x_2 \sqrt{x_3}\right) = x_1 + \sqrt{x_3} \quad (3)$$

$$\frac{\partial}{\partial x_3} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial}{\partial x_3} f(x) = \frac{\partial}{\partial x_3}\left(x_1^2 + x_1 x_2 + x_2 \sqrt{x_3}\right) = x_2 \cdot \frac{1}{2} x_3^{-1/2} = \frac{x_2}{2\sqrt{x_3}}, \quad (4)$$
for the second-order derivatives with respect to $x_1$

$$\frac{\partial^2}{\partial x_1^2} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial^2}{\partial x_1^2} f(x) = \frac{\partial}{\partial x_1}\left(\frac{\partial}{\partial x_1} f(x)\right) = \frac{\partial}{\partial x_1}(2x_1 + x_2) = 2 \quad (5)$$

$$\frac{\partial^2}{\partial x_2 \partial x_1} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial^2}{\partial x_2 \partial x_1} f(x) = \frac{\partial}{\partial x_2}\left(\frac{\partial}{\partial x_1} f(x)\right) = \frac{\partial}{\partial x_2}(2x_1 + x_2) = 1 \quad (6)$$

$$\frac{\partial^2}{\partial x_3 \partial x_1} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial^2}{\partial x_3 \partial x_1} f(x) = \frac{\partial}{\partial x_3}\left(\frac{\partial}{\partial x_1} f(x)\right) = \frac{\partial}{\partial x_3}(2x_1 + x_2) = 0, \quad (7)$$
¹ This section requires some familiarity with matrix notation, as covered in the Section "Matrix Algebra".
for the second-order derivatives with respect to $x_2$

$$\frac{\partial^2}{\partial x_1 \partial x_2} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial^2}{\partial x_1 \partial x_2} f(x) = \frac{\partial}{\partial x_1}\left(\frac{\partial}{\partial x_2} f(x)\right) = \frac{\partial}{\partial x_1}\left(x_1 + \sqrt{x_3}\right) = 1 \quad (8)$$

$$\frac{\partial^2}{\partial x_2^2} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial^2}{\partial x_2^2} f(x) = \frac{\partial}{\partial x_2}\left(\frac{\partial}{\partial x_2} f(x)\right) = \frac{\partial}{\partial x_2}\left(x_1 + \sqrt{x_3}\right) = 0 \quad (9)$$

$$\frac{\partial^2}{\partial x_3 \partial x_2} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial^2}{\partial x_3 \partial x_2} f(x) = \frac{\partial}{\partial x_3}\left(\frac{\partial}{\partial x_2} f(x)\right) = \frac{\partial}{\partial x_3}\left(x_1 + \sqrt{x_3}\right) = \frac{1}{2} x_3^{-1/2} = \frac{1}{2\sqrt{x_3}} \quad (10)$$
and for the second-order derivatives with respect to $x_3$

$$\frac{\partial^2}{\partial x_1 \partial x_3} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial^2}{\partial x_1 \partial x_3} f(x) = \frac{\partial}{\partial x_1}\left(\frac{\partial}{\partial x_3} f(x)\right) = \frac{\partial}{\partial x_1}\left(\frac{x_2}{2\sqrt{x_3}}\right) = 0 \quad (11)$$

$$\frac{\partial^2}{\partial x_2 \partial x_3} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial^2}{\partial x_2 \partial x_3} f(x) = \frac{\partial}{\partial x_2}\left(\frac{\partial}{\partial x_3} f(x)\right) = \frac{\partial}{\partial x_2}\left(\frac{x_2}{2\sqrt{x_3}}\right) = \frac{1}{2\sqrt{x_3}} \quad (12)$$

$$\frac{\partial^2}{\partial x_3^2} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial^2}{\partial x_3^2} f(x) = \frac{\partial}{\partial x_3}\left(\frac{\partial}{\partial x_3} f(x)\right) = \frac{\partial}{\partial x_3}\left(x_2 \cdot \frac{1}{2} x_3^{-1/2}\right) = -\frac{1}{2} \cdot \frac{1}{2} x_2 x_3^{-3/2} = -\frac{1}{4} x_2 x_3^{-3/2} \quad (13)$$
Note from the above that it does not matter in which order the second derivatives are taken, as

$$\frac{\partial^2}{\partial x_1 \partial x_2} f(x) = \frac{\partial^2}{\partial x_2 \partial x_1} f(x) = 1, \quad \frac{\partial^2}{\partial x_1 \partial x_3} f(x) = \frac{\partial^2}{\partial x_3 \partial x_1} f(x) = 0 \quad \text{and} \quad \frac{\partial^2}{\partial x_2 \partial x_3} f(x) = \frac{\partial^2}{\partial x_3 \partial x_2} f(x) = \frac{1}{2\sqrt{x_3}} \quad (14)$$
This is a general property of partial derivatives known as "Schwarz' Theorem", which we state here without proof: for a multivariate real-valued function f : ℝⁿ → ℝ, x ↦ f(x), the following identity holds

$$\frac{\partial^2}{\partial x_i \partial x_j} f(x) = \frac{\partial^2}{\partial x_j \partial x_i} f(x) \quad \text{for } i, j \in \{1, 2, \ldots, n\} \qquad (15)$$

Schwarz' Theorem is helpful when evaluating partial derivatives: on the one hand, one can save some work by relying on it; on the other hand, it can help to validate analytical results, because if the identity fails for certain second-order partial derivatives, something must have gone wrong in the calculation.
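Schwarz' Theorem can also serve as a numerical sanity check. The following sketch (an illustration added here, not part of the original text) approximates the mixed second-order partial derivatives of the example function f(x) = x₁² + x₁x₂ + x₂√x₃ by central differences and compares them with the analytic values from equations (6) and (10):

```python
import math

def f(x):
    # the example function f(x) = x1^2 + x1*x2 + x2*sqrt(x3)
    x1, x2, x3 = x
    return x1**2 + x1*x2 + x2*math.sqrt(x3)

def mixed_partial(f, x, i, j, h=1e-4):
    """Central-difference estimate of d^2 f / (dx_i dx_j) at x, for i != j."""
    def shifted(di, dj):
        y = list(x)
        y[i] += di
        y[j] += dj
        return f(y)
    return (shifted(h, h) - shifted(h, -h)
            - shifted(-h, h) + shifted(-h, -h)) / (4 * h * h)

x = [1.0, 2.0, 4.0]
d21 = mixed_partial(f, x, 1, 0)  # d^2 f / (dx2 dx1), analytically 1
d12 = mixed_partial(f, x, 0, 1)  # d^2 f / (dx1 dx2), analytically 1
d32 = mixed_partial(f, x, 2, 1)  # d^2 f / (dx3 dx2), analytically 1/(2*sqrt(4)) = 0.25
d23 = mixed_partial(f, x, 1, 2)
print(d21, d12, d32, d23)
```

Both orders of differentiation agree up to rounding error, exactly as Schwarz' Theorem predicts.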
(7) Gradient, Hessian, and Jacobian
First- and second-order partial derivatives of multivariate real-valued functions can be summarized
in two entities known as the “gradient” and the “Hessian” or “Hessian matrix” of the function 𝑓 . The
gradient of a function
𝑓:ℝ𝑛 → ℝ, 𝑥 ↦ 𝑓(𝑥) (1)
at a location 𝑥 ∈ ℝ𝑛 is defined as the 𝑛-dimensional vector of the function’s partial derivatives evaluated at
this location and denoted by the ∇ (nabla) sign:
$$\nabla f : \mathbb{R}^n \to \mathbb{R}^n,\; x \mapsto \nabla f(x) := \begin{pmatrix} \frac{\partial}{\partial x_1} f(x) \\ \frac{\partial}{\partial x_2} f(x) \\ \vdots \\ \frac{\partial}{\partial x_n} f(x) \end{pmatrix} \in \mathbb{R}^n \qquad (2)$$
Intuitively, the gradient evaluated at x ∈ ℝⁿ is a vector that points in the direction of the greatest rate of increase (steepest ascent) of the function in its domain space ℝⁿ. That this is in fact the case is not easy to prove, so we omit the proof here and content ourselves with the intuition. Note that the gradient is a vector-valued function: it takes a vector x ∈ ℝⁿ as input and returns a vector ∇f(x) ∈ ℝⁿ.
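To make the gradient concrete, the following sketch (added here for illustration) approximates ∇f(x) of the example function from the previous section by central differences and compares it with the analytic partial derivatives (2)–(4):

```python
import math

def f(x):
    # the example function f(x) = x1^2 + x1*x2 + x2*sqrt(x3)
    x1, x2, x3 = x
    return x1**2 + x1*x2 + x2*math.sqrt(x3)

def gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

x = [1.0, 2.0, 4.0]
g = gradient(f, x)
# analytic gradient: (2*x1 + x2, x1 + sqrt(x3), x2/(2*sqrt(x3))) = (4, 3, 0.5)
print(g)
```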
Second-order partial derivatives are summarized in the so-called Hessian matrix of a multivariate real-valued function
𝑓:ℝ𝑛 → ℝ, 𝑥 ↦ 𝑓(𝑥) (3)
which is denoted by
$$Hf : \mathbb{R}^n \to \mathbb{R}^{n \times n},\; x \mapsto Hf(x) := \begin{pmatrix} \frac{\partial^2}{\partial x_1^2} f(x) & \frac{\partial^2}{\partial x_1 \partial x_2} f(x) & \cdots & \frac{\partial^2}{\partial x_1 \partial x_n} f(x) \\ \frac{\partial^2}{\partial x_2 \partial x_1} f(x) & \frac{\partial^2}{\partial x_2^2} f(x) & \cdots & \frac{\partial^2}{\partial x_2 \partial x_n} f(x) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2}{\partial x_n \partial x_1} f(x) & \frac{\partial^2}{\partial x_n \partial x_2} f(x) & \cdots & \frac{\partial^2}{\partial x_n^2} f(x) \end{pmatrix} =: \left(\frac{\partial^2}{\partial x_i \partial x_j} f(x)\right)_{i,j = 1, \ldots, n} \qquad (4)$$
Note that in each row of the Hessian, the second (in the order of differentiation, not in the order of
notation) of the two partial derivatives is constant, while the first varies from 1 to 𝑛 over columns, and the
reverse is true for each column. The Hessian matrix can be viewed as a matrix-valued function: it takes a
vector 𝑥 ∈ ℝ𝑛 as input and returns an 𝑛 × 𝑛 matrix 𝐻𝑓(𝑥) ∈ ℝ𝑛×𝑛. Also note that due to Schwarz’
Theorem, the Hessian matrix is symmetric, i.e.,
$$\left(Hf(x)\right)^T = Hf(x) \qquad (5)$$
Functions of the form f : ℝⁿ → ℝ map vectors x ∈ ℝⁿ onto single numbers (scalars) f(x) ∈ ℝ. A further function type commonly encountered in applied mathematics is that of functions which map vectors onto vectors. These functions are of the general form

$$f : \mathbb{R}^n \to \mathbb{R}^m,\; x \mapsto f(x) := \left(f_1(x), \ldots, f_m(x)\right)^T = \begin{pmatrix} f_1(x_1, \ldots, x_n) \\ f_2(x_1, \ldots, x_n) \\ \vdots \\ f_m(x_1, \ldots, x_n) \end{pmatrix} \qquad (6)$$
and, in physics, are referred to as “vector fields”. The first derivative of these functions evaluated at 𝑥 ∈ ℝ𝑛
in the direction of the canonical basis vector set is given by the so-called Jacobian matrix, which we denote
by
$$Jf(x) := \left(J_{ij} f\right)(x) := \begin{pmatrix} \frac{\partial}{\partial x_1} f_1(x) & \cdots & \frac{\partial}{\partial x_n} f_1(x) \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial x_1} f_m(x) & \cdots & \frac{\partial}{\partial x_n} f_m(x) \end{pmatrix} \in \mathbb{R}^{m \times n} \qquad (7)$$
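The Jacobian can likewise be approximated by finite differences. The sketch below is an added illustration; the vector field chosen, f(x₁, x₂) = (x₁²x₂, 5x₁ + sin x₂), is a hypothetical example, not one from the text:

```python
import math

def f(x):
    # a hypothetical example vector field f: R^2 -> R^2
    x1, x2 = x
    return [x1**2 * x2, 5 * x1 + math.sin(x2)]

def jacobian(f, x, h=1e-6):
    """Finite-difference Jacobian with entries J[i][j] ~ d f_i / d x_j at x."""
    m, n = len(f(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = f(xp), f(xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J

x = [1.0, 0.0]
J = jacobian(f, x)
# analytic Jacobian: [[2*x1*x2, x1**2], [5, cos(x2)]] = [[0, 1], [5, 1]]
print(J)
```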
(9) Taylor’s Theorem
Loosely speaking, Taylor’s theorem states that any 𝑘 times differentiable function 𝑓 ∶ ℝ → ℝ can be
locally approximated in the region of a given point of its domain by means of a polynomial. In the “Peano
form”, Taylor’s theorem takes the following form: let 𝑘 ∈ ℕ and let a univariate real-valued function
𝑓 ∶ ℝ → ℝ be 𝑘 times differentiable in a point 𝑎 ∈ ℝ. Then there exists a function ℎ𝑘:ℝ → ℝ such that
$$f(x) = f(a) + f'(a)(x-a) + \frac{1}{2!} f''(a)(x-a)^2 + \cdots + \frac{1}{k!} f^{(k)}(a)(x-a)^k + h_k(x)(x-a)^k \qquad (1)$$
with lim𝑥→𝑎 ℎ𝑘(𝑥) = 0. In other words, close to the expansion point 𝑎, the approximation of 𝑓(𝑥) by the
𝑘th order polynomial involving the 𝑘 derivatives of 𝑓 becomes exact. In the applied literature, Taylor’s
theorem is often used without the remainder term, and referred to as a “𝑘th order Taylor approximation” of
the function 𝑓 in (or around) the expansion point 𝑎. This is commonly denoted by
$$f(x) \approx f(a) + f'(a)(x-a) + \frac{1}{2!} f''(a)(x-a)^2 + \cdots + \frac{1}{k!} f^{(k)}(a)(x-a)^k \qquad (2)$$
First- and second-order Taylor approximations, which are perhaps the most frequently applied approximations, are thus given for x, a ∈ ℝ by

$$f(x) \approx f(a) + f'(a)(x-a) \qquad (3)$$

and

$$f(x) \approx f(a) + f'(a)(x-a) + \frac{1}{2} f''(a)(x-a)^2 \qquad (4)$$
Note that Taylor approximations are local and not global function approximations, i.e. they are good
approximations close to the expansion point 𝑎 ∈ ℝ, but bad approximations far away from the expansion
point.
In analogy to the univariate case, Taylor's theorem can also be formulated for functions of the form f : ℝⁿ → ℝ. Because f is now a function of multiple variables, partial derivatives come into play. We do not state Taylor's theorem for multivariate real-valued functions explicitly, but, in analogy to the univariate approximations above, content ourselves with stating first- and second-order Taylor approximations around an expansion point a ∈ ℝⁿ for multivariate real-valued functions. Notably, in these approximations the gradient vector takes on the role of the first derivative and the Hessian matrix takes the role of the second derivative. For the first-order Taylor approximation of a differentiable function f : ℝⁿ → ℝ in a ∈ ℝⁿ, we have

$$f(x) \approx f(a) + \nabla f(a)^T (x-a) \qquad (5)$$

and for the second-order Taylor approximation of f : ℝⁿ → ℝ in a ∈ ℝⁿ, we have

$$f(x) \approx f(a) + \nabla f(a)^T (x-a) + \frac{1}{2}(x-a)^T Hf(a)(x-a) \qquad (6)$$
Finally, also in the case of multivariate, vector-valued functions of the form f : ℝⁿ → ℝᵐ, local approximations around expansion points a ∈ ℝⁿ can be evaluated using an appropriately generalized Taylor's theorem. We content ourselves here with stating the first-order approximation of a function f : ℝⁿ → ℝᵐ in a point a ∈ ℝⁿ, in analogy to the first-order approximations stated above. Notably, the Jacobian matrix now takes on the role of the first derivative or gradient:

$$f(x) \approx f(a) + Jf(a)(x-a) \qquad (7)$$
Note that here 𝑓(𝑎) ∈ ℝ𝑚, 𝐽𝑓(𝑎) ∈ ℝ𝑚×𝑛 and (𝑥 − 𝑎) ∈ ℝ𝑛. Of course, for nonlinear 𝑓 this approximation
will only be reasonable in the close vicinity of 𝑎.
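The quality of these local approximations can be probed numerically. The sketch below (an added illustration, using the same function and expansion point as Panel A of Figure 3) compares first- and second-order Taylor approximations of ln x around a = 2.5:

```python
import math

a = 2.5
fa, f1, f2 = math.log(a), 1/a, -1/a**2  # ln(a), f'(a), f''(a)

def taylor1(x):
    # first-order approximation f(a) + f'(a)(x - a)
    return fa + f1 * (x - a)

def taylor2(x):
    # second-order approximation adds (1/2) f''(a) (x - a)^2
    return taylor1(x) + 0.5 * f2 * (x - a)**2

x = 2.6  # close to the expansion point
err1 = abs(math.log(x) - taylor1(x))
err2 = abs(math.log(x) - taylor2(x))
print(err1, err2)
```

Close to a, the second-order approximation is markedly more accurate; far from a, both approximations degrade, reflecting the local nature of Taylor approximations.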
Figure 3. Taylor approximations. Panel A depicts first and second order approximations of the univariate, real-valued logarithmic function 𝑓: ℝ+ → ℝ, 𝑥 ↦ 𝑓(𝑥) ≔ 𝑙𝑛 𝑥 around the expansion point 𝑎 = 2.5. Note that in the vicinity of 𝑎 the second-order approximation captures the behaviour of the function 𝑓 better than the first order approximation. Panel B depicts a first order approximation of the multivariate (here bivariate) real-valued function 𝑓:ℝ2 → ℝ, 𝑥 ↦ 𝑓(𝑥) ≔ −𝑥𝑇𝑥 around the expansion point 𝑎 = (0.75, 0.75)𝑇. Note that the approximation is only reasonable in the vicinity of the expansion point.
Study Questions
1. Give a brief explanation of the notion of a derivative of a univariate function 𝑓 in a point 𝑥.
2. Provide brief explanations of the symbols $\frac{d}{dx}$, $\frac{d^2}{dx^2}$, $\frac{\partial}{\partial x}$, and $\frac{\partial^2}{\partial x^2}$.
3. Compute the first derivatives of the following functions:

$$f : \mathbb{R} \to \mathbb{R},\; x \mapsto f(x) := 2\exp(-x^3)$$

$$g : \mathbb{R} \to \mathbb{R},\; x \mapsto g(x) := (x^2 + 2x - a)^3$$
4. Compute the partial derivatives of the function

$$f : \mathbb{R}^2 \to \mathbb{R},\; (x, y) \mapsto f(x, y) := \log x + \sum_{i=1}^{n} (y-3)^2$$

with respect to x and y.
5. Determine the minimum of the function $f : \mathbb{R} \to \mathbb{R},\; x \mapsto f(x) := x^2 + x + 2$.
Study Question Answers
1. The derivative f′(x) ∈ ℝ of a function f at the location x ∈ ℝ conveys two basic intuitions: it is a measure of the rate of change of f at location x, and it is the slope of the tangent line of f at the point (x, f(x)) ∈ ℝ².
2. If written in front of a function with input argument x, $\frac{d}{dx}$ is understood as the imperative to evaluate the (first) derivative of the function, and $\frac{d^2}{dx^2}$ as the imperative to evaluate the second derivative of the function (the derivative of the derivative of the function). Likewise, if written in front of a function with multiple input arguments, for example x, y, and z, $\frac{\partial}{\partial x}$ is best understood as the imperative to evaluate the (first) derivative of the function with respect to the input variable x (the partial derivative with respect to x), while $\frac{\partial^2}{\partial x^2}$ is understood as the imperative to evaluate the second derivative of the function with respect to x, i.e. the second partial derivative with respect to x.

3. With the chain rule, we have

$$f'(x) = \frac{d}{dx}\, 2\exp(-x^3) = 2 \frac{d}{dx} \exp(-x^3) = 2\exp(-x^3) \cdot \frac{d}{dx}(-x^3) = -6x^2 \exp(-x^3)$$

$$g'(x) = \frac{d}{dx}(x^2 + 2x - a)^3 = 3(x^2 + 2x - a)^2 \frac{d}{dx}(x^2 + 2x - a) = 3(x^2 + 2x - a)^2 (2x + 2)$$
4. With the linearity of differentiation, we have

$$\frac{\partial}{\partial x} f(x, y) = \frac{\partial}{\partial x}\left(\log x + \sum_{i=1}^{n}(y-3)^2\right) = \frac{\partial}{\partial x}\log x + \frac{\partial}{\partial x}\sum_{i=1}^{n}(y-3)^2 = \frac{1}{x} + 0 = \frac{1}{x}$$

and with the linearity and the chain rule of differentiation

$$\frac{\partial}{\partial y} f(x, y) = \frac{\partial}{\partial y}\left(\log x + \sum_{i=1}^{n}(y-3)^2\right) = \frac{\partial}{\partial y}\log x + \frac{\partial}{\partial y}\sum_{i=1}^{n}(y-3)^2 = 0 + \sum_{i=1}^{n}\frac{\partial}{\partial y}(y-3)^2 = \sum_{i=1}^{n} 2(y-3)\cdot\frac{\partial}{\partial y}(y-3) = \sum_{i=1}^{n} 2(y-3)\cdot 1 = 2\sum_{i=1}^{n}(y-3)$$
5. We first compute the first derivative of the function and then set it to zero, solving for a critical point. For the first derivative, we have

$$f'(x) := \frac{d}{dx}(x^2 + x + 2) = 2x + 1$$

Setting this to zero and solving for the critical point x* then yields

$$f'(x^\ast) = 0 \;\Leftrightarrow\; 2x^\ast + 1 = 0 \;\Leftrightarrow\; x^\ast = -\frac{1}{2}$$

Because f″(−1/2) = 2 > 0, we know that −1/2 is indeed a minimum of f.
Integral Calculus
We assume that the reader has a basic familiarity with integrals from high school mathematics and focus
the discussion on two features of integration: (1) the intuition of the definite Riemann integral as the signed
area under a function’s graph, and (2) indefinite integration as the inverse of differentiation.
(1) Definite integrals - The integral as the signed area under a function’s graph
We will denote the “definite integral” of a univariate real-valued function
𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) (1)
on an interval [𝑎, 𝑏] ⊂ ℝ by the real number
$$I := \int_a^b f(x)\, dx \in \mathbb{R} \qquad (2)$$

Two things are important to note with respect to the notation above: first, the definite integral is simply a real number, and second, the right-hand side is merely notational and to be understood as the task of "integrating the function f on the interval [a, b]". In other words, there is no mathematical meaning associated with the "dx" or the ∫ₐᵇ beyond the definition of the integral boundaries a and b. The term "definite" is used here to distinguish this integral from the "indefinite" integral discussed below. Put simply, definite integrals are those integrals for which the integral boundaries appear at the integral sign (although due to sloppy notation, they are sometimes omitted, for example, if the interval of integration is the entire real line).
Intuitively, the definite integral ∫ₐᵇ f(x) dx is best understood as the continuous generalization of the sum

$$\sum_{i=1}^{n} f(x_i)\, \Delta x \qquad (3)$$

where

$$a =: x_1,\quad x_2 := x_1 + \Delta x,\quad x_3 := x_2 + \Delta x,\; \ldots,\; x_n := b \qquad (4)$$

corresponds to an equipartition of the interval [a, b], i.e. a partition of the interval [a, b] into n − 1 bins of equal size Δx. The term f(xᵢ)Δx (i = 1, …, n) corresponds to the area of the rectangle formed by the value of the function f at xᵢ (i.e. the upper left corner of the rectangle) as height, and the bin width Δx as width. Summing over all rectangles then yields an approximation of the area under the graph of the function f, where terms with negative values of f(xᵢ) enter the sum with a negative sign. Letting the bin width Δx in the sum (3) approach zero (i.e. making it smaller and smaller) then approximates the integral of f on the interval [a, b]:

$$\int_a^b f(x)\, dx \approx \sum_{i=1}^{n} f(x_i)\, \Delta x \quad \text{for } \Delta x \to 0 \qquad (5)$$
See Figure 1 below for a visualization.
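This limiting construction is easy to mimic numerically. The sketch below (an addition, not from the original text) evaluates the sum (3) for an equipartition with a small bin width and compares it with a known integral, ∫₀¹ x² dx = 1/3:

```python
def riemann_sum(f, a, b, n):
    """Sum of f(x_i) * dx over an equipartition of [a, b] into n bins."""
    dx = (b - a) / n
    return sum(f(a + i * dx) for i in range(n)) * dx

# integral of x^2 on [0, 1]; the exact value is 1/3
approx = riemann_sum(lambda x: x * x, 0.0, 1.0, 100000)
print(approx)
```

As the bin width Δx shrinks, the sum approaches the value of the integral.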
Definite integrals have the "linearity property", which is often useful when evaluating integrals analytically. We will briefly elucidate this property. Based on the intuition that for a function f : ℝ → ℝ, x ↦ f(x)

$$\int_a^b f(x)\, dx \approx \sum_{i=1}^{n} f(x_i)\, \Delta x \qquad (6)$$

and the fact that for a second function g : ℝ → ℝ, x ↦ g(x)

$$\sum_{i=1}^{n} \left(f(x_i) + g(x_i)\right)\Delta x = \sum_{i=1}^{n} \left(f(x_i)\Delta x + g(x_i)\Delta x\right) = \sum_{i=1}^{n} f(x_i)\Delta x + \sum_{i=1}^{n} g(x_i)\Delta x \qquad (7)$$

and for a constant c ∈ ℝ

$$\sum_{i=1}^{n} c f(x_i)\Delta x = c \sum_{i=1}^{n} f(x_i)\Delta x \qquad (8)$$

we can infer the following linearity properties of the Riemann integral:

$$\int_a^b \left(f(x) + g(x)\right) dx = \int_a^b f(x)\, dx + \int_a^b g(x)\, dx \qquad (9)$$

and

$$\int_a^b c f(x)\, dx = c \int_a^b f(x)\, dx \qquad (10)$$
In words: firstly, the integral of the sum of two functions f + g over an interval [a, b] corresponds to the sum of the integrals of the individual functions f and g on [a, b]. Secondly, the integral of a function f multiplied by a constant c on an interval [a, b] corresponds to the integral of the function f on the interval [a, b] multiplied by the constant. Both properties are very useful when evaluating integrals analytically: the first allows for decomposing integrals of composite functions into sums of integrals of less complex functions, while the second allows for removing constants from the integration.
Figure 1. Visualization of the definite integral as area under a function’s graph
(2) Indefinite Integrals - Integration as the inverse of differentiation
Consider again a univariate real-valued function
62
𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) (1)
and a second function defined by means of making the upper integration boundary of an integral of 𝑓 its
input variable:
$$F : \mathbb{R}^+ \to \mathbb{R},\; x \mapsto F(x) := \int_0^x f(s)\, ds \qquad (2)$$
From the discussion above, we have that the value of F at x can be considered the signed area under the graph of the function f on the interval from 0 to x. Notably, the derivative of the function F at x, denoted by F′(x), corresponds to the value of the function f, i.e.

$$F'(x) = \frac{d}{dx}\left(\int_0^x f(s)\, ds\right) = f(x) \qquad (3)$$
Intuitively, the above statement says that integrating is the inverse of differentiation, in the sense
that first integrating 𝑓 from 0 to 𝑥 and then computing the derivative with respect to 𝑥 yields 𝑓. Any
function 𝐹 with the property 𝐹′(𝑥) = 𝑓(𝑥) for a function 𝑓 is called an anti-derivative or “indefinite
integral” of 𝑓. An indefinite integral is denoted by
$$F : \mathbb{R} \to \mathbb{R},\; x \mapsto F(x) = \int f(s)\, ds \qquad (4)$$
Note that the definite integral defined above corresponds to a real scalar number, while the indefinite integral is a function.
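Equation (3) can also be probed numerically. The sketch below (an added illustration) builds F(x) := ∫₀ˣ cos(s) ds via a midpoint-rule sum and differentiates it by central differences, recovering f(x) = cos(x):

```python
import math

def integral(f, a, b, n=20000):
    """Midpoint-rule approximation of the definite integral of f on [a, b]."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

def F(x):
    # F(x) := integral of f from 0 to x, with f = cos
    return integral(math.cos, 0.0, x)

x, h = 1.0, 1e-3
F_prime = (F(x + h) - F(x - h)) / (2 * h)
print(F_prime, math.cos(x))  # the two values agree closely
```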
Proof of (3)
While the statement of equation (3), that the derivative of a function's antiderivative corresponds to the function itself, is familiar and intuitive, it is not necessarily formally easy to grasp. Here we thus provide a proof of equation (3) for the interested reader, based on [Leithold 1996]. This proof makes use of limiting processes and the mean value theorem for integration, which the interested reader may review prior to studying the proof. Let f : ℝ → ℝ, s ↦ f(s) be a univariate, real-valued function, and define another function

$$F : \mathbb{R} \to \mathbb{R},\; x \mapsto F(x) := \int_a^x f(s)\, ds \qquad (3.1)$$

For any two numbers x₁ and x₁ + Δx in the (closed) interval [a, b] ⊂ ℝ, we then have

$$F(x_1) = \int_a^{x_1} f(s)\, ds \quad \text{and} \quad F(x_1 + \Delta x) = \int_a^{x_1 + \Delta x} f(s)\, ds \qquad (3.2)$$

Subtraction of these two equalities yields

$$F(x_1 + \Delta x) - F(x_1) = \int_a^{x_1 + \Delta x} f(s)\, ds - \int_a^{x_1} f(s)\, ds \qquad (3.3)$$

From the intuition of the integral as the area between the function f and the x-axis it follows naturally that the sum of the areas of two adjacent regions is equal to the area of both regions combined, i.e.,

$$\int_a^{x_1} f(s)\, ds + \int_{x_1}^{x_1 + \Delta x} f(s)\, ds = \int_a^{x_1 + \Delta x} f(s)\, ds \qquad (3.4)$$

From this it follows that the difference above evaluates to

$$F(x_1 + \Delta x) - F(x_1) = \int_{x_1}^{x_1 + \Delta x} f(s)\, ds \qquad (3.5)$$

According to the "mean value theorem for integration", there exists a real number c_Δx ∈ [x₁, x₁ + Δx] (the dependence of which on Δx we have denoted by the subscript) such that

$$\int_{x_1}^{x_1 + \Delta x} f(s)\, ds = f(c_{\Delta x})\, \Delta x \qquad (3.6)$$

and we hence obtain

$$F(x_1 + \Delta x) - F(x_1) = f(c_{\Delta x})\, \Delta x \qquad (3.7)$$

Division by Δx then yields

$$\frac{F(x_1 + \Delta x) - F(x_1)}{\Delta x} = f(c_{\Delta x}) \qquad (3.8)$$

where the left-hand side corresponds to "Newton's difference quotient". Taking the limit Δx → 0 on both sides then yields

$$\lim_{\Delta x \to 0} \frac{F(x_1 + \Delta x) - F(x_1)}{\Delta x} = \lim_{\Delta x \to 0} f(c_{\Delta x}) \;\Leftrightarrow\; F'(x_1) = \lim_{\Delta x \to 0} f(c_{\Delta x}) \qquad (3.9)$$

by definition of the derivative as the limit of Newton's difference quotient. The limit on the right-hand side of the above remains to be evaluated. To this end, we recall that c_Δx ∈ [x₁, x₁ + Δx], or in other words, that x₁ ≤ c_Δx ≤ x₁ + Δx. Notably, lim_{Δx→0} x₁ = x₁ and lim_{Δx→0} (x₁ + Δx) = x₁. Therefore, we can conclude that lim_{Δx→0} c_Δx = x₁, as c_Δx is "squeezed" between these two limits. We thus find

$$F'(x_1) = \lim_{\Delta x \to 0} f(c_{\Delta x}) = f(x_1) \qquad (3.10)$$

which concludes the proof. □
Indefinite integrals (or antiderivatives) allow for the evaluation of definite integrals ∫ₐᵇ f(s) ds by means of the fundamental theorem of calculus:

$$\int_a^b f(s)\, ds = F(b) - F(a) \qquad (5)$$
In words: to evaluate the integral of a univariate real-valued function f on the interval [a, b], one first computes an antiderivative F of f, and then forms the difference between the antiderivative evaluated at the upper integral boundary b and the antiderivative evaluated at the lower integral boundary a. Equation (5) is very familiar, so we postpone a formal derivation and first consider properties and examples.
Without proof we note that the linearity properties of the definite integral also hold for the
indefinite integral, i.e., for functions 𝑓, 𝑔:ℝ → ℝ and constant 𝑐 ∈ ℝ we have
∫(𝑓(𝑥) + 𝑔(𝑥)) 𝑑𝑥 = ∫𝑓(𝑥) 𝑑𝑥 + ∫𝑔(𝑥) 𝑑𝑥 (6)
and
∫ 𝑐𝑓(𝑥) 𝑑𝑥 = 𝑐 ∫𝑓(𝑥) 𝑑𝑥 (7)
Just as for differentiation, it is useful to know the antiderivatives of a handful of commonly encountered univariate functions f : ℝ → ℝ. We present a selection without proofs below; they can readily be verified by evaluating the derivatives of the respective antiderivatives to recover the original functions. Note that the derivative of the constant function f(x) := c with c ∈ ℝ is zero, which is why an arbitrary additive constant c appears in each antiderivative. We have
$$f(x) := a \;\Rightarrow\; F(x) = ax + c$$

$$f(x) := x^a \;\Rightarrow\; F(x) = \frac{1}{a+1}\, x^{a+1} + c \quad (a \ne -1)$$

$$f(x) := x^{-1} \;\Rightarrow\; F(x) = \ln x + c$$

$$f(x) := \exp(x) \;\Rightarrow\; F(x) = \exp(x) + c$$

$$f(x) := \sin(x) \;\Rightarrow\; F(x) = -\cos(x) + c$$

$$f(x) := \cos(x) \;\Rightarrow\; F(x) = \sin(x) + c \qquad (8)$$
The interested reader may find proofs of the above in [Spivak 1994].
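As noted above, each row of (8) can be verified by differentiating the antiderivative. The sketch below (an addition, not from the original text) does this numerically for two rows of the table:

```python
import math

def deriv(F, x, h=1e-6):
    """Central-difference derivative of F at x."""
    return (F(x + h) - F(x - h)) / (2 * h)

x = 0.7
# row f(x) = sin(x): F(x) = -cos(x), so F'(x) should equal sin(x)
d_sin = deriv(lambda t: -math.cos(t), x)
# row f(x) = x**a with a = 2: F(x) = x**3/3, so F'(x) should equal x**2
d_pow = deriv(lambda t: t**3 / 3, x)
print(d_sin, math.sin(x))
print(d_pow, x**2)
```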
Example
To illustrate the theoretical discussion above, we evaluate the antiderivative of the function

$$f : \mathbb{R} \to \mathbb{R},\; x \mapsto f(x) := 2x^2 + x + 1 \qquad (9)$$

and use this antiderivative to evaluate the definite integral of this function on the interval [1, 2], i.e. ∫₁² f(x) dx. To this end, we first use the linearity property of the indefinite integral, which yields for the antiderivative

$$F : \mathbb{R} \to \mathbb{R},\; x \mapsto F(x) := \int f(x)\, dx = \int (2x^2 + x + 1)\, dx = 2\int x^2\, dx + \int x\, dx + \int 1\, dx \qquad (10)$$

We then make use of (8) to evaluate the remaining integral terms:

$$F(x) = \frac{2}{3} x^3 + \frac{1}{2} x^2 + x + C \qquad (11)$$
where the constant C ∈ ℝ comprises all constant terms. Importantly, this constant term vanishes once we evaluate a definite integral:

$$\int_1^2 f(x)\, dx = F(2) - F(1) = \left(\frac{2}{3}\, 2^3 + \frac{1}{2}\, 2^2 + 2 + C\right) - \left(\frac{2}{3}\, 1^3 + \frac{1}{2}\, 1^2 + 1 + C\right) = \frac{16}{3} + \frac{4}{2} + 2 + C - \frac{2}{3} - \frac{1}{2} - 1 - C = \frac{32}{6} + \frac{12}{6} + \frac{12}{6} - \frac{4}{6} - \frac{3}{6} - \frac{6}{6} = \frac{43}{6} \qquad (12)$$
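A quick numerical check of this result (added here for illustration): evaluating the antiderivative at the boundaries, and independently summing a fine Riemann sum, both reproduce 43/6:

```python
def f(x):
    return 2 * x**2 + x + 1

def F(x):
    # antiderivative from (11); the constant C cancels in F(2) - F(1)
    return (2/3) * x**3 + (1/2) * x**2 + x

exact = F(2) - F(1)
dx = 1e-5
riemann = sum(f(1 + i * dx) * dx for i in range(100000))
print(exact, riemann, 43/6)
```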
In the remainder of this Section, we formally justify equation (5) for the interested reader.
Proof of (5)
Like equation (3), equation (5) is very familiar, but again, a formal derivation is somewhat more involved. The following proof makes use of limiting processes and the mean value theorem of differentiation. We first consider the quantity F(b) − F(a). To this end, we select numbers x₁, …, xₙ such that

$$a := x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n =: b \qquad (5.1)$$

It follows that F(b) − F(a) = F(xₙ) − F(x₀). Now each F(xᵢ), i = 1, …, n − 1 is added to the quantity F(b) − F(a) together with its additive inverse:

$$F(b) - F(a) = F(x_n) + \left(-F(x_{n-1}) + F(x_{n-1})\right) + \cdots + \left(-F(x_1) + F(x_1)\right) - F(x_0) = \left(F(x_n) - F(x_{n-1})\right) + \left(F(x_{n-1}) - F(x_{n-2})\right) + \cdots + \left(F(x_1) - F(x_0)\right) = \sum_{i=1}^{n} \left(F(x_i) - F(x_{i-1})\right) \qquad (5.2)$$

The mean value theorem of differentiation states that for a function F : [a, b] → ℝ there exists (under certain constraints) a number c ∈ ]a, b[ such that

$$F'(c) = \frac{F(b) - F(a)}{b - a} \qquad (5.3)$$

From the mean value theorem of differentiation it thus follows that for the terms of the sum above we have, with appropriately chosen cᵢ ∈ ]x_{i−1}, xᵢ[ (i = 1, …, n),

$$F(x_i) - F(x_{i-1}) = F'(c_i)(x_i - x_{i-1}) \qquad (5.4)$$

and substitution yields

$$F(b) - F(a) = \sum_{i=1}^{n} F'(c_i)(x_i - x_{i-1}) \qquad (5.5)$$

By definition we have F′(cᵢ) = f(cᵢ), and setting Δx_{i−1} := xᵢ − x_{i−1} yields

$$F(b) - F(a) = \sum_{i=1}^{n} f(c_i)\, \Delta x_{i-1} \qquad (5.6)$$

By taking the limit of the above, we obtain the Riemann integral:

$$\lim_{\Delta x_{i-1} \to 0} \left(F(b) - F(a)\right) = \lim_{\Delta x_{i-1} \to 0} \sum_{i=1}^{n} f(c_i)\, \Delta x_{i-1} \qquad (5.7)$$

F(b) and F(a) are independent of the xᵢ, and the left-hand side of the above thus evaluates to F(b) − F(a). For the right-hand side, we note that x_{i−1} ≤ cᵢ ≤ x_{i−1} + Δx_{i−1}, and thus lim_{Δx_{i−1}→0} x_{i−1} = x_{i−1} and lim_{Δx_{i−1}→0} (x_{i−1} + Δx_{i−1}) = x_{i−1}, from which it follows that lim_{Δx_{i−1}→0} cᵢ = x_{i−1}. We thus have

$$F(b) - F(a) = \lim_{\Delta x_{i-1} \to 0} \sum_{i=1}^{n} f(x_{i-1})\, \Delta x_{i-1} = \lim_{\Delta x_i \to 0} \sum_{i=0}^{n-1} f(x_i)\, \Delta x_i = \int_a^b f(s)\, ds \qquad (5.8)$$

where the last equality holds by the definition of the Riemann integral, under the generalization that the Δxᵢ need not be equally spaced. □
Study Questions
1. State the intuitions for (a) the definite integral ∫ₐᵇ f(x) dx of a function f on an interval [a, b] ⊂ ℝ and (b) the indefinite integral ∫ f(x) dx of a function.

2. Evaluate the integral $\int_1^3 1\, ds$.
Study question answers
1. The definite integral ∫ₐᵇ f(x) dx of a function f on an interval [a, b] ⊂ ℝ is intuitively understood as the signed area between the function's graph and the x-axis. The indefinite integral ∫ f(x) dx of a function is that function F whose derivative is the function f, i.e. F′(x) = d/dx (∫ f(x) dx) = f(x).

2. We first evaluate the antiderivative of the constant function 1 : ℝ → ℝ, x ↦ 1(x) := 1, which is given by the identity function id : ℝ → ℝ, x ↦ id(x) := x, because d/dx id(x) = 1 for all x ∈ ℝ. Using the fundamental theorem of calculus, we thus have

$$\int_1^3 1\, ds = id(3) - id(1) = 3 - 1 = 2$$
Sequences and Series
(1) Sequences
Intuitively, a sequence is an infinite ordered list. More formally, let 𝑀 be a nonempty set, for
example the real line ℝ. A “sequence in 𝑀” is a function 𝑓, which allocates to each natural number 𝑛 ∈ ℕ a
unique element 𝑓(𝑛) ∈ 𝑀. If 𝑀 ≔ ℝ, the sequence is called a “real sequence”. The elements 𝑎𝑛 ≔ 𝑓(𝑛)
are called the “terms” of the sequence 𝑓. Sequences are usually denoted without reference to 𝑓, and the
following notations are commonly encountered
(𝑎𝑛)𝑛∈ℕ or (𝑎𝑛) or (𝑎1, 𝑎2, 𝑎3, … ) (1)
It is important to note that because sequences are defined as functions of the form
𝑓: ℕ → 𝑀, 𝑛 ↦ 𝑓(𝑛) ≔ 𝑎𝑛 (2)
and there are infinitely many natural numbers, sequences have infinitely many terms 𝑎𝑛. Examples of
sequences are
$$(q^n)_{n \in \mathbb{N}} = (q^1, q^2, q^3, \ldots) \quad \text{for } q \in \mathbb{R} \qquad (3)$$

$$(n^k)_{n \in \mathbb{N}} = (1^k, 2^k, 3^k, \ldots) \quad \text{for } k \in \mathbb{Z} \qquad (4)$$

and

$$(\sqrt{n})_{n \in \mathbb{N}} = (\sqrt{1}, \sqrt{2}, \sqrt{3}, \ldots) \qquad (5)$$
An important concept associated with sequences is the question of their convergence. Intuitively, a sequence is said to converge if, as n grows, its terms approach a fixed finite real number (the limit of the sequence) arbitrarily closely.
(2) Series
Intuitively, series are infinite sums. More formally, series are defined as a special kind of sequence: let (aₙ)_{n∈ℕ} be a sequence in ℝ. Then the series

$$\sum_{n=1}^{\infty} a_n = a_1 + a_2 + a_3 + \cdots \qquad (1)$$

is the sequence (sₙ)_{n∈ℕ} of partial sums, where

$$s_n := \sum_{k=1}^{n} a_k \qquad (2)$$

The aₙ are referred to as the "terms" of the series, whereas the sₙ are referred to as its "partial sums". If the series converges, i.e., if the sequence of partial sums approaches a finite limit as n goes to infinity, then the symbol Σ_{n=1}^∞ aₙ is also used to denote this limiting value.
Examples of series are the so-called "geometric series"

$$\sum_{n=0}^{\infty} \frac{1}{z^n} = 1 + \frac{1}{z} + \frac{1}{z^2} + \cdots \qquad (3)$$

which converges for |z| > 1 (i.e., for |1/z| < 1), and the so-called "harmonic series"

$$\sum_{n=1}^{\infty} \frac{1}{n} = 1 + \frac{1}{2} + \frac{1}{3} + \cdots \qquad (4)$$

which does not converge.
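The different behaviour of these two series is visible directly in their partial sums; a small sketch (added here for illustration, taking z = 2 in the geometric series):

```python
# partial sums of the geometric series for z = 2 (terms (1/2)**n, n = 0, 1, ...)
geo = [sum(0.5**n for n in range(N)) for N in (10, 20, 40)]
# partial sums of the harmonic series
har = [sum(1/n for n in range(1, N + 1)) for N in (10, 100, 10000)]
print(geo)  # approaches 2 = 1/(1 - 1/2)
print(har)  # keeps growing without bound
```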
Series can be used to define real-valued univariate functions of the form

$$f : \mathbb{R} \to \mathbb{R},\; x \mapsto f(x) \qquad (5)$$

To achieve this, the value f(x) is defined as a series in x, i.e. as a series that depends on the function's input argument x. An important example is the series definition of the exponential function

$$\exp : \mathbb{R} \to \mathbb{R},\; x \mapsto \exp(x) := \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots \qquad (6)$$

where n! := Π_{k=1}^n k denotes the factorial. Importantly, the series definition of the exponential function allocates the value of the convergent series Σ_{n=0}^∞ xⁿ/n! ∈ ℝ to each x ∈ ℝ. Further important examples of functions defined in terms of series are the sine and cosine functions introduced below. Finally, the Fourier series, too, is a function with domain ℝ that is defined in terms of a series.
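The series definition (6) lends itself to a direct numerical sketch (an addition, not from the original text); the partial sums converge quickly to the library exponential:

```python
import math

def exp_series(x, k=30):
    """Partial sum of the series sum_{n=0}^{k} x**n / n!."""
    term, s = 1.0, 1.0
    for n in range(1, k + 1):
        term *= x / n  # x**n / n!, built up incrementally
        s += term
    return s

print(exp_series(1.0), math.exp(1.0))
print(exp_series(2.0), math.exp(2.0))
```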
Study Questions
1. Write down the definition of a sequence.
2. Write down the definition of a series.
Study Questions Answers
1. For a nonempty set M, a "sequence in M" is a function f which allocates to each natural number n ∈ ℕ a unique element f(n) ∈ M, and for aᵢ ∈ M it is usually denoted by (aₙ)_{n∈ℕ} or (aₙ) or (a₁, a₂, a₃, …).

2. Let (aₙ)_{n∈ℕ} be a sequence in ℝ. Then the series Σ_{n=1}^∞ aₙ = a₁ + a₂ + a₃ + ⋯ is the sequence (sₙ)_{n∈ℕ} of partial sums sₙ := Σ_{k=1}^n aₖ.
Ordinary differential equations
(1) Differential equations
Differential equations specify sets of functions that model real-world phenomena. These sets of functions are defined implicitly: a differential equation specifies a relation between the function it defines, the function's input argument, and, importantly, the function's derivatives up to arbitrary order. In this manner, differential equations can be used to model real-world phenomena by explicitly describing their dynamics, i.e. the way the phenomena change, and the changes of this change. In contrast to algebraic equations, the solutions of which are numbers, the solutions of differential equations and associated initial and boundary value problems are functions. The classic theory of differential equations is primarily concerned with finding explicit solutions to differential equations and establishing criteria under which such solutions exist and are unique. The more modern viewpoint of dynamical systems theory is more interested in the qualitative properties of systems of differential equations.
Three classes of differential equations are of interest for probabilistic models of functional
neuroimaging data: (1) ordinary differential equations, (2) partial differential equations, and (3) stochastic
differential equations. Ordinary differential equations (ODEs) are characterized by the fact that only
derivatives with respect to a single scalar input variable (usually modelling time or space) appear. ODEs are
thus used to model systems that evolve either in time or space. Partial differential equations (PDEs) are
characterized by the fact that they comprise partial derivatives, i.e. derivatives with respect to two or more
scalar input variables. Often, these variables refer to time and space, and PDEs are used to model systems
that evolve in both time and space. Both ODEs and PDEs specify sets of "deterministic" functions (all functions are deterministic by definition), i.e. they do not account for random fluctuations in the evolution of the system that they model. Stochastic differential equations, which can be either ordinary or partial, explicitly take random innovations in the dynamics of systems into account. The solutions of stochastic differential equations, if they exist, are stochastic processes. In this section, we will be concerned with ordinary differential equations.
Ordinary differential equations specify functions of scalar variables. We begin our treatment of ODEs
with some preliminary remarks on notation. In the context of neuroimaging data analyses, the most
prevalent use of ODEs is to describe the evolution of systems over time. We will thus denote the input
argument to the functions specified by ODEs using the letter 𝑡 and assume that 𝑡 ∈ 𝐼 ⊆ ℝ, i.e. that 𝑡 is a real
scalar value in some interval 𝐼 of the real line. The functions specified by ODEs can be real-valued or vector-
valued, i.e. with 𝑛 ≥ 1 they are of the form
𝑥 ∶ 𝐼 ⊆ ℝ → ℝ𝑛, 𝑡 ↦ 𝑥(𝑡) (1)
Note that the xs that appear in the ODEs below always refer to functions, and not, as in algebraic equations, to numbers or vectors. Because we imply that the input variable t describes time, we will use the "dot notation" to indicate the derivatives of the function x with respect to its input argument:
$$\dot{x}(t) = \frac{d}{dt} x(t),\quad \ddot{x}(t) = \frac{d^2}{dt^2} x(t),\quad \dddot{x}(t) = \frac{d^3}{dt^3} x(t),\; \ldots \qquad (2)$$
Note that the dot notation only really works for lower order derivatives and that for higher order derivatives,
(𝑘 > 3), we use the more general notation
$$x^{(k)}(t) = \frac{d^k}{dt^k} x(t) \qquad (3)$$
In general, we will refrain from using the “prime” notation 𝑥′(𝑡), 𝑥′′(𝑡),… for derivatives in the context of
ODEs. Furthermore, we will usually use the notation �̇�(𝑡), �̈�(𝑡), … to denote the derivatives of the function 𝑥
evaluated at 𝑡, while we use the notation �̇�, �̈�, … to denote the derivative of the function 𝑥 proper, i.e. the
functions
𝑥(𝑘) ∶ 𝐼 ⊆ ℝ → ℝ𝑛, 𝑡 ↦ 𝑥(𝑘)(𝑡) (𝑘 = 1,2,… ) (4)
which allocate the respective derivatives to all input arguments 𝑡 ∈ 𝐼.
Based on these preliminary remarks, we can now give a first formalization of ODEs. An ODE is an
equation that can comprise a function 𝑥, its input argument 𝑡, derivatives of the function 𝑥 up to arbitrary
order, and a function of these entities. In general ODEs can be written as
$$F\left(t, x, \dot{x}, \ddot{x}, \ldots, x^{(k)}\right) = 0 \qquad (5)$$

where we leave the range and domain of the function F unspecified for the moment. The order k ∈ ℕ of the highest derivative occurring in an ODE is referred to as the "order" of the ODE. If the ODE can be written in the form

$$x^{(k)} = f\left(t, x, \dot{x}, \ddot{x}, \ldots, x^{(k-1)}\right) \qquad (6)$$
by appropriately reformulating the function 𝐹 for the function 𝑓, the ODE is referred to as an “explicit” ODE.
Otherwise, it is referred to as an “implicit ODE”. Finally, if the specifying function 𝐹 or 𝑓 is not a function of 𝑡,
the ODE is referred to as “autonomous”, otherwise, it is referred to as “non-autonomous”.
To illustrate the nomenclature we give two examples of ODEs. Let
𝑥 ∶ 𝐼 ⊆ ℝ → ℝ, 𝑡 ↦ 𝑥(𝑡), �̇� ∶ 𝐼 ⊆ ℝ → ℝ, 𝑡 ↦ �̇�(𝑡), … (7)
denote a function and its derivatives. Then
(1) ẋ = x is an explicit, autonomous, first-order ODE (8)
(2) ẋ = 2tx is an explicit, non-autonomous, first-order ODE (9)
(3) ẍ = sin(t) x is an explicit, non-autonomous, second-order ODE (10)
(2) Initial value problems for systems of first-order ordinary differential equations
Explicit systems of first-order differential equations are of eminent importance in the theory of ODEs
and dynamical systems, because many higher-order problems can be rewritten as such.
Systems of first-order ODEs
To introduce systems of first-order ODEs, let 𝐼 ⊆ ℝ be an interval of the real line and 𝐷 ⊆ ℝ𝑛 a subset of the
𝑛-dimensional space of vectors with real entries. Further, let
𝑥 ∶ 𝐼 → 𝐷, 𝑡 ↦ 𝑥(𝑡) (1)
denote a function (or, more specifically, a curve in ℝ𝑛) and let the first derivative of 𝑥 be denoted by
$$\dot{x} := \begin{pmatrix} \dot{x}_1 \\ \dot{x}_2 \\ \vdots \\ \dot{x}_n \end{pmatrix} \qquad (3)$$
where �̇�𝑖 (𝑖 = 1,… , 𝑛) denote the first derivatives of the 𝑛 component functions of 𝑥. Further, let
𝑓 ∶ 𝐼 × 𝐷 → ℝ𝑛, (𝑡, 𝑥) ↦ 𝑓(𝑡, 𝑥) (4)
denote a vector-valued function of a real and a vector-valued argument, sometimes referred to as the
“evolution” function of the system. Then
�̇� = 𝑓(𝑡, 𝑥) (5)
denotes an 𝑛-dimensional system of first-order ODEs. For a given time-point 𝑡 ∈ 𝐼, the system of first-order
ODEs (5) can be written more explicitly as
$$\begin{pmatrix} \dot{x}_1(t) \\ \dot{x}_2(t) \\ \vdots \\ \dot{x}_n(t) \end{pmatrix} = \begin{pmatrix} f_1\left(t, (x_1(t), x_2(t), \ldots, x_n(t))^T\right) \\ f_2\left(t, (x_1(t), x_2(t), \ldots, x_n(t))^T\right) \\ \vdots \\ f_n\left(t, (x_1(t), x_2(t), \ldots, x_n(t))^T\right) \end{pmatrix} \qquad (6)$$
where the

$$f_i : I \times D \to \mathbb{R},\; (t, x) \mapsto f_i(t, x) \quad \text{for } i = 1, \ldots, n \qquad (7)$$

are the component functions of the function f.
□
Before introducing the notion of an initial value problem for systems of first-order ODEs, we consider
an example. Let 𝑛 = 1 and
𝑓 ∶ 𝐼 × 𝐷 → ℝ, (𝑡, 𝑥) ↦ 𝑓(𝑡, 𝑥) ≔ 𝑥 (8)
Then (5) specifies the one-dimensional system of first-order ODEs
�̇� = 𝑥 (9)
In words (9) specifies the set of functions for which the rate of change �̇�(𝑡) at each point in time 𝑡 ∈ 𝐼
corresponds to the value of the function 𝑥(𝑡) at the same time point. Without stating how the solution is
derived, we postulate that
𝑦 ∶ 𝐼 → ℝ, 𝑡 ↦ 𝑦(𝑡) ≔ 𝑐 exp(𝑡) for 𝑐 ∈ ℝ (10)
denotes a solution of (9). To prove this postulate, we evaluate the derivative of 𝑦 for each 𝑡 ∈ 𝐼
$$\dot{y}(t) = \frac{d}{dt} y(t) = \frac{d}{dt}\left(c \exp(t)\right) = c \frac{d}{dt} \exp(t) = c \exp(t) = y(t) \qquad (11)$$
and thus the function 𝑦 as defined in (10) fulfills the ODE (9).
Note that there are an infinite number of functions 𝑦 of the form (10) due to the coefficient 𝑐 ∈ ℝ. The ODE
in (9) thus specifies an infinite set of functions. Often, one is interested in specific members of this set,
which, in addition to the ODE also fulfil additional conditions such as taking on a specific value for a given
input argument. This gives rise to the concept of “initial value problems”. We define the initial value
problem for a system of first-order ODEs next.
Initial value problem for a system of first-order ODEs
In the context of a system of first-order ODEs as discussed above let 𝑡0 ∈ 𝐼 be a fixed and specified
input argument of the function x. Then an initial value problem for a system of first-order ODEs corresponds to the task of finding a solution y : I → ℝⁿ of the system
�̇� = 𝑓(𝑡, 𝑥) (12)
which satisfies the initial value
𝑥(𝑡0) = 𝑥0 (13)
where 𝑥0 ∈ ℝ𝑛 is a fixed and specified value referred to as “initial value”.
□
An initial value problem for the example one-dimensional system discussed above would be the
following
�̇� = 𝑥, 𝑥(0) = 2 (14)
In words (14) specifies the set of functions for which the rate of change �̇�(𝑡) at each point in time 𝑡 ∈ 𝐼
corresponds to the value of the function 𝑥(𝑡) at the same time point and for which the value at time-point
𝑡0 ≔ 0 is given by 2. Based on the solution above, we find that
𝑦(0) = 2 ⇔ 𝑐 exp(0) = 2 ⇔ 𝑐 ⋅ 1 = 2 (15)
Thus, the function
𝑦 ∶ 𝐼 → ℝ, 𝑡 ↦ 𝑦(𝑡) ≔ 2exp(𝑡) (16)
satisfies the ODE �̇� = 𝑥 and the initial value 𝑥(0) = 2.
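This verification can also be performed numerically. A minimal sketch (in Python; the finite-difference check and all numerical choices are ours, not part of the text):

```python
import numpy as np

def y(t):
    # candidate solution of the IVP dx/dt = x, x(0) = 2
    return 2.0 * np.exp(t)

# the initial value is satisfied
print(y(0.0))  # 2.0

# check dy/dt = y via central finite differences on a grid
t = np.linspace(-1.0, 1.0, 201)
h = 1e-6
dy = (y(t + h) - y(t - h)) / (2 * h)  # numerical derivative of y
print(np.max(np.abs(dy - y(t))))       # close to zero
```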
One-dimensional initial value problems can be visualized by plotting the derivative of the function 𝑥
specified by 𝑓 as linear slope at a number of locations in the (𝑡, 𝑥) plane (Figure 1). An intuition about the
solutions to initial value problems can then be gained by conceiving the resulting vector field as the flow
profile of a liquid, and the initial condition as the location of where a floating particle is added to the fluid.
Over time, this particle will move in the (𝑡, 𝑥) plane, and the solution of the initial value problem specifies
the route of this movement.
Figure 1. Visualization of one-dimensional initial value problems. The left panel depicts the flow field corresponding to the ODE of the example �̇� = 𝑥 discussed in the main text, together with the solution path of an initial value problem where 𝑡0 ≔ 0 and 𝑥(𝑡0) ≔ 2. Note that the direction of the flow at a given location in the (𝑡, 𝑥) plane is a function of 𝑥 only, as evident from the definition 𝑓(𝑡, 𝑥) ≔ 𝑥 of the ODE. The right panel depicts a more intricate flow field, which according to the definition 𝑓(𝑡, 𝑥) ≔ 𝑥² − 𝑡 is a function of both 𝑡 and 𝑥. Here, a solution for the initial condition 𝑡0 ≔ −1.5 and 𝑥(𝑡0) ≔ −1.8 is depicted.
(3) Numerical approaches for initial value problems
The mathematical theory of ODEs and initial value problems is primarily concerned with (a) finding
and characterizing analytical approaches to solve ODEs and initial value problems, and (b) establishing
analytic criteria under which solutions of ODEs exist and to determine whether these solutions are unique.
Real-world applications of ODEs, however, often result in rather complicated dynamical systems, for which
often no analytical treatment exists. In these cases, numerical procedures can be used to evaluate
trajectories described by initial value problems. In this section, we introduce a number of basic numerical
approaches. We here consider the initial value problem for systems of first-order ODEs in the following
formulation
�̇�(𝑡) = 𝑓(𝑡, 𝑥(𝑡)) for all 𝑡 ∈ [𝑎, 𝑏], 𝑥(𝑎) = 𝑥𝑎 and 𝑓 ∶ [𝑎, 𝑏] × ℝ𝑛 → ℝ𝑛 (1)
Euler methods
To introduce the explicit and modified Euler methods for the numerical solution of initial value
problems of the form (1), we specifically consider the scalar case, i.e. 𝑛 = 1. Numerical approaches are
characterized by the fact that they replace continuous time by discrete time: instead of generating
solutions for the uncountably many values 𝑡 ∈ [𝑎, 𝑏] ⊂ ℝ, they typically generate solutions for a
discrete set of “support points” 𝑡𝑘, where 𝑘 = 0,… ,𝑚 with 𝑚 ∈ ℕ and 𝑚 < ∞. We thus consider a
discretization of the interval 𝐼 of the form
𝐼Δ ≔ {𝑎 =: 𝑡0 < 𝑡1 < ⋯ < 𝑡𝑚 ≔ 𝑏} (2)
To simplify proceedings, we define the distance between two adjacent support points by
ℎ𝑘 ≔ 𝑡𝑘+1 − 𝑡𝑘 (𝑘 = 0,1,… ,𝑚 − 1) (3)
Note that the ℎ𝑘 are not all required to be the same, i.e. the support points need not be “equidistant”. This is,
however, often the case in implementations of numerical methods. The central idea of basically all
numerical approaches for the solution of initial value problems is to find good approximations for the values
𝑥(𝑡) of the sought solution function in the form of values 𝑥𝑘 at the support points 𝑡𝑘. In other words,
numerical approaches find values 𝑥𝑘 such that
𝑥𝑘 ≈ 𝑥(𝑡𝑘) for 𝑘 = 1,… ,𝑚 (4)
For 𝑘 = 0, the initial condition 𝑥(𝑡0) = 𝑥(𝑎) = 𝑥𝑎 is usually employed. Recursion equations for the
remaining 𝑥𝑘 are then usually motivated from the perspective of Taylor approximations. Recall that for small
ℎ𝑘 = 𝑡𝑘+1 − 𝑡𝑘 a first-order Taylor approximation of 𝑥(𝑡𝑘+1) is given by
𝑥(𝑡𝑘+1) ≈ 𝑥(𝑡𝑘) + �̇�(𝑡𝑘)(𝑡𝑘+1 − 𝑡𝑘) (5)
= 𝑥(𝑡𝑘) + ℎ𝑘�̇�(𝑡𝑘)
In words: the value of the sought function 𝑥 at the support point 𝑡𝑘+1 is approximately equal to the value of
the function at the previous support point plus the derivative (“slope”) of the function 𝑥 at support point 𝑡𝑘
multiplied by the step-size ℎ𝑘. Importantly, the derivative of 𝑥 at the support point 𝑡𝑘 is specified by the ODE
of the initial value problem (1), such that we have
𝑥(𝑡𝑘+1) ≈ 𝑥(𝑡𝑘) + ℎ𝑘𝑓(𝑡𝑘 , 𝑥(𝑡𝑘)) (6)
In (6), the values 𝑥(𝑡𝑘+1) and 𝑥(𝑡𝑘) are unknown. If one replaces them by approximations 𝑥𝑘+1 and 𝑥𝑘,
respectively, one obtains the explicit Euler method algorithm
1. Set 𝑥0 ≔ 𝑥𝑎
2. For 𝑘 = 0,… ,𝑚 − 1 set
𝑥𝑘+1 ≔ 𝑥𝑘 + ℎ𝑘𝑓(𝑡𝑘, 𝑥𝑘) (7)
Figure 2 depicts the application of Euler’s method for the one-dimensional initial value problem discussed in
the previous section.
Figure 2. Application of the Euler method algorithm for the numerical solution of the initial value problem �̇� = 𝑥 with 𝑡0 ≔ 0 and 𝑥(𝑡0) = 2. Note that the deviation between the analytical and numerical solutions decreases as the number of support points 𝑚 increases.
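The explicit Euler recursion translates directly into code. The following sketch applies it to the example problem ẋ = x, x(0) = 2 on [0, 1] with equidistant support points and compares the result to the analytical solution 2 exp(t). It is written in Python rather than the Matlab used elsewhere in PMFN, and the function name and parameter choices are our own:

```python
import numpy as np

def explicit_euler(f, a, b, xa, m):
    """Explicit Euler method for the IVP x'(t) = f(t, x(t)), x(a) = xa,
    using m equidistant steps on [a, b]."""
    t = np.linspace(a, b, m + 1)  # support points t_0, ..., t_m
    h = (b - a) / m               # constant step size h_k = h
    x = np.empty(m + 1)
    x[0] = xa                     # initial condition x_0 := x_a
    for k in range(m):
        x[k + 1] = x[k] + h * f(t[k], x[k])
    return t, x

# example IVP: x' = x, x(0) = 2, with analytical solution 2*exp(t)
t, x = explicit_euler(lambda t, x: x, 0.0, 1.0, 2.0, 100)
err = np.max(np.abs(x - 2.0 * np.exp(t)))
print(err)  # global error shrinks roughly linearly in the step size
```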
74
Modified Euler method
A fundamental property of Euler’s method is that for an interval [𝑡𝑘 , 𝑡𝑘+1] the derivative at the left
interval boundary is used as the basis for the approximation. An alternative approach is to consider the
derivative at the center of the interval, i.e. to use the derivative at 𝑡𝑘 + ℎ𝑘/2. Using the same approach as
above, a first-order Taylor approximation of 𝑥(𝑡𝑘+1) takes the form
𝑥(𝑡𝑘+1) ≈ 𝑥(𝑡𝑘) + �̇�(𝑡𝑘 + ℎ𝑘/2)(𝑡𝑘+1 − 𝑡𝑘) = 𝑥(𝑡𝑘) + ℎ𝑘 �̇�(𝑡𝑘 + ℎ𝑘/2) = 𝑥(𝑡𝑘) + ℎ𝑘 𝑓(𝑡𝑘 + ℎ𝑘/2, 𝑥(𝑡𝑘 + ℎ𝑘/2)) (8)
Replacing the unknown values 𝑥(𝑡𝑘) and 𝑥(𝑡𝑘+1) by the approximations 𝑥𝑘 and 𝑥𝑘+1 and evaluating
the unknown value 𝑥(𝑡𝑘 + ℎ𝑘/2) by means of Euler’s method, i.e., by setting

𝑥(𝑡𝑘 + ℎ𝑘/2) ≈ 𝑥𝑘 + (ℎ𝑘/2) 𝑓(𝑡𝑘, 𝑥𝑘) (9)
then yields the “modified Euler method” algorithm
1. Set 𝑥0 ≔ 𝑥𝑎
2. For 𝑘 = 0,… ,𝑚 − 1 set
𝑥𝑘+1 ≔ 𝑥𝑘 + ℎ𝑘 𝑓(𝑡𝑘 + ℎ𝑘/2, 𝑥𝑘 + (ℎ𝑘/2) 𝑓(𝑡𝑘, 𝑥𝑘)) (10)
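A corresponding sketch of the modified Euler recursion for the same example problem (again Python, with our own naming); because the slope at the interval midpoint is used, the deviation from the analytical solution is considerably smaller than for the explicit Euler method at the same number of support points:

```python
import numpy as np

def modified_euler(f, a, b, xa, m):
    """Modified (midpoint) Euler method for x'(t) = f(t, x(t)), x(a) = xa."""
    t = np.linspace(a, b, m + 1)
    h = (b - a) / m
    x = np.empty(m + 1)
    x[0] = xa
    for k in range(m):
        # Euler predictor for the midpoint value x(t_k + h/2)
        x_mid = x[k] + 0.5 * h * f(t[k], x[k])
        # step with the slope evaluated at the interval midpoint
        x[k + 1] = x[k] + h * f(t[k] + 0.5 * h, x_mid)
    return t, x

# same example IVP as before: x' = x, x(0) = 2
t, x = modified_euler(lambda t, x: x, 0.0, 1.0, 2.0, 100)
err = np.max(np.abs(x - 2.0 * np.exp(t)))
print(err)  # much smaller than the explicit Euler error for the same m
```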
Study Questions
1. What do differential equations specify?
2. Verbally discuss the notions of (1) ordinary differential equations, (2) partial differential equations, and (3) stochastic differential
equations.
3. What is the order of the ordinary differential equation 𝐹 = 𝑚�̈�(𝑡), known as Newton’s law?
4. Formulate the general initial value problem for a system of first-order differential equations.
5. What is the difference between an analytical solution and a numerical solution of an initial value problem?
Study Questions Answers
1. Differential equations specify sets of functions.
2. Ordinary differential equations (ODEs) are characterized by the fact that only derivatives with respect to a single scalar input
variable (usually modelling time or space) appear. ODEs are thus used to model systems that evolve either in time or space. Partial
differential equations (PDEs) are characterized by the fact that they comprise partial derivatives, i.e. derivatives with respect to two
or more scalar input variables. Often, these variables refer to time and space, and PDEs are used to model systems that evolve in
both time and space. Both ODEs and PDEs specify sets of “deterministic” functions (all functions are deterministic by definition), i.e.
they do not account for random fluctuations in the evolution of the system that they model. Stochastic differential equations, which
can be either ordinary or partial, explicitly take random innovations in the dynamics of systems into account. The solutions of
stochastic differential equations, if they exist, are stochastic processes.
3. The highest derivative occurring in the differential equation is of second-order, Newton’s law is thus a second-order ordinary
differential equation.
4. Let 𝐼 ⊆ ℝ be an interval of the real line and 𝐷 ⊆ ℝ𝑛 a subset of the 𝑛-dimensional space of vectors with real entries. Let
𝑥 ∶ 𝐼 → 𝐷, 𝑡 ↦ 𝑥(𝑡) denote a function (or, more specifically, a curve in ℝ𝑛) and let the first derivative of 𝑥 be denoted by �̇�. Further,
let 𝑓 ∶ 𝐼 × 𝐷 → ℝ𝑛, (𝑡, 𝑥) ↦ 𝑓(𝑡, 𝑥) denote a vector-valued function of a real and a vector-valued argument specifying the first-order
system of differential equations �̇� = 𝑓(𝑡, 𝑥) and let 𝑡0 ∈ 𝐼 be a fixed and specified input argument of the function 𝑥. Then an initial
value problem for a system of first-order ODEs corresponds to the task of finding a solution 𝑦 ∶ 𝐼 → ℝ𝑛 of the system �̇� = 𝑓(𝑡, 𝑥) which
satisfies the initial value 𝑥(𝑡0) = 𝑥0, where 𝑥0 ∈ ℝ𝑛 is a fixed and specified value referred to as “initial value”.
5. An analytical solution of an initial value problem is derived by theoretical considerations. A numerical solution is derived by
approximating the solution function of an initial value problem with the help of a computer which implements a recursive algorithm.
An Introduction to Fourier Analysis
The ability to represent time- or space-domain signals in the frequency domain is a fundamental
aspect of modern signal processing. In functional neuroimaging, the representation of frequency content is
ubiquitous: methodological approaches in the analysis of neural oscillations in cognitive tasks, the
implementation of filters in EEG data processing, the visualization of electromagnetic MRI signals, the
analysis of FMRI resting-state networks, as well as the analysis of task-related FMRI data in the context of
the GLM all involve the transformation of signals from the time- or space-domain to the frequency domain.
Representing signals as linear combinations of sine and cosine functions has its origins in the theory of partial
differential equations, most prominently in Fourier’s work on heat diffusion [Fourier (1822) “Théorie
analytique de la chaleur”]. The modern reliance on frequency decompositions in digital signal processing
owes much to the development of the fast Fourier transform (FFT) algorithm [Cooley & Tukey (1965) “An
algorithm for the machine calculation of complex Fourier series”]. In this and the following sections, we
develop the theory of Fourier analysis following [Weaver (1983) “Applications of discrete and continuous
Fourier Analysis”]. We eschew a discussion of the origins of Fourier analysis in the theory of partial
differential equations and instead focus on the notion of frequency and phase content of time- or space-
domain signals. To this end, we first develop the representation of a univariate real-valued function by
means of a Fourier series. We then discuss alternative representations of the real Fourier series (the
complex-exponential and polar forms) that are less intuitive, but more closely related to technical
implementations of Fourier transforms. Finally, we consider the notions of a Fourier transform of a function,
and, from the perspective of digital signal processing more importantly, the concept of the discrete Fourier
transform. We close with a discussion of the fast Fourier transform algorithm. Fourier analysis rests on a
number of mathematical preliminaries (e.g. sequences and series, the trigonometric functions, complex
numbers, Riemann integration) on which we base our discussion.
(1) Generalized cosine and sine functions
In this subsection we consider the fundamental building blocks of Fourier analysis, namely the
following two functions, referred to here as the “generalized” sine and cosine functions:
𝑓:ℝ → [−𝑎, 𝑎], 𝑥 ↦ 𝑓(𝑥) ≔ 𝑎 cos(2𝜋𝜔𝑥 − 𝜑) (1)
𝑔:ℝ → [−𝑏, 𝑏], 𝑥 ↦ 𝑔(𝑥) ≔ 𝑏 sin(2𝜋𝜔𝑥 − 𝜙) (2)
We discuss the components of (1) and (2) in turn, starting from the behavior of “pure” sine and cosine
functions.
The factors 𝑎 and 𝑏 in the definition of the generalized cosine and sine functions are referred to as
“amplitudes”. They are simply constants that scale the height of the cosine and sine functions and cause
them to vary between −𝑎 and 𝑎, and between −𝑏 and 𝑏, rather than between −1 and +1, respectively. Note that the
pre-multiplication with 𝑎 or 𝑏 does not alter the fact that sine and cosine repeat themselves every 2𝜋 and
does not alter their zero crossings.
We now consider the terms cos(2𝜋𝜔𝑥) and sin(2𝜋𝜔𝑥) in the definition of the generalized cosine
and sine functions. To simplify the discussion we define 𝜇 ≔ 2𝜋𝜔. 𝜇 is called the “radial frequency” of the
sine and cosine function and is a measure of how often the functions repeat themselves with respect to
units of 𝜋. Specifically, cos(𝜇𝑥) and sin(𝜇𝑥) repeat themselves every 2𝜋/𝜇, or, in other words, perform 𝜇
revolutions within every 2𝜋 (Figure 1, left panels). Usually, oscillations are not expressed with respect to 𝜋,
but with respect to a temporal unit such as seconds, or a spatial unit such as meters. To this end, 𝜔, the
so-called “circular frequency” of cosine and sine allows for expressing these functions in a meaningful way
without an 𝑥-axis measured in units of 𝜋. Specifically, if measured in circular frequency, the period of the
cosine and sine functions is 1/𝜔, i.e. they perform a full revolution every 1/𝜔 𝑥-units (Figure 1 right panels).
In other words, the period 𝑇 of sin(2𝜋𝜔𝑥) is defined as the number of 𝑥-units required to complete one
cycle of a sine function. It is given by 𝑇 = 1/𝜔. If the variable 𝑥 represents a time variable, 𝜔 is measured in
1/sec, referred to as “Hertz”. When the variable 𝑥 represents a spatial measurement, the period 𝑇 is called
the wavelength and denoted by 𝜆.
Figure 1 Radial and circular frequency for the sine function. If expressed in the form 𝑠𝑖𝑛(𝜇𝑥), 𝜇 is referred to as “radial frequency” and expresses the number of revolutions of the sine function over an interval of length 2𝜋. In other words, the sine function repeats itself every 2𝜋/𝜇. Note that the radial frequency is useful if the x-axis is measured in units of 𝜋. Usually, the x-axis is not measured in units of 𝜋, but in some physical measure, such as seconds or meters. In this case, the sine function is sensibly expressed as 𝑠𝑖𝑛(2𝜋𝜔𝑥), where 𝜔 is referred to as the “circular frequency” and measures the number of revolutions over an interval of length 1. If 𝑥 is measured in seconds, the number of revolutions per second is referred to as “Hertz”. 𝑇 = 1/𝜔 is referred to as the period and measures the number of 𝑥-units required for a full revolution of the sine function.
Finally, we consider the terms 𝜙 and 𝜑. For 𝑥 = 0 we have sin(2𝜋𝜔𝑥) = 0 and cos(2𝜋𝜔𝑥) = 1. It
is often convenient to be able to shift the functions so that at 𝑥 = 0 they may take on other values. This is
accomplished by the addition of a “phase” term to the argument. For example, the function sin(2𝜋𝜔𝑥 − 𝜑)
is the same as sin(2𝜋𝜔𝑥) shifted to the right by an amount 𝜑/(2𝜋𝜔). Because a sine function can be transformed
into a cosine function by shifting it by 𝜋/2 either to the right or to the left, and vice versa, phase terms are
usually referred to as “phase angles” and are considered only in the interval [−𝜋/2, 𝜋/2]. In Figure 2 below, we
depict generalized sine functions with different phase angles and identical amplitude 𝑎 = 1 and circular
frequency 𝜔 = 1/(2𝜋), i.e. generalized sine functions of the form sin(𝑥 − 𝜙).
Figure 2. Influence of the phase parameter ϕ of a generalized sine function. Note that inclusion of the additive term −ϕ for ϕ ∈ [0, π/2] shifts the original function sin(x) to the right and that the function sin(x − π/2) is equivalent to the negative cosine function −cos(x).
(2) Linear combinations of generalized cosine and sine functions
Using a number 𝑛 ∈ ℕ of generalized cosine and sine functions with phase 0, one can create a “linear
combination of generalized cosine and sine functions” of the form

𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) ≔ ∑_{𝑘=0}^{𝑛−1} (𝑎𝑘 cos(2𝜋𝜔𝑘𝑥) + 𝑏𝑘 sin(2𝜋𝜔𝑘𝑥)) (1)
where we introduced 𝑛 amplitude coefficients 𝑎𝑘 (𝑘 = 0,1,…,𝑛−1) for the cosine functions, 𝑛 amplitude
coefficients 𝑏𝑘 (𝑘 = 0,1,…,𝑛−1) for the sine functions, and 𝑛 circular frequency values 𝜔𝑘 (𝑘 = 0,1,…,𝑛−1)
which are associated with the respective amplitude coefficients. Linear combinations of
generalized cosine and sine functions are a very versatile formulation which allows for representing a wide
variety of functions.
We illustrate this concept with two examples. For a first example (Figure 1), let 𝑛 = 2,𝜔0 =
10, 𝑎0 = 0, 𝑏0 = 3,𝜔1 = 15, 𝑎1 = 2, 𝑏1 = 0. Then 𝑓 is given by
𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) = 0 cos(2𝜋10𝑥) + 3 sin(2𝜋10𝑥) + 2 cos(2𝜋15𝑥) + 0 sin(2𝜋15𝑥) (2)
In Figure 1 we plot the sine and cosine functions comprising this example and the resulting sum. Note that
the resulting sum is a periodic function, but does not have the “simple” form of a sine or cosine.
As a second example (Figure 2), let 𝑛 = 3, 𝜔0 = 1, 𝜔1 = 4, 𝜔2 = 10, 𝑎0 = 2, 𝑎1 = 4, 𝑎2 = 0, 𝑏0 = 0, 𝑏1 = −1, 𝑏2 = 3. Then 𝑓 is given by

𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) = 2 cos(2𝜋1𝑥) + 0 sin(2𝜋1𝑥) + 4 cos(2𝜋4𝑥) − 1 sin(2𝜋4𝑥) + 0 cos(2𝜋10𝑥) + 3 sin(2𝜋10𝑥) (3)
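The second example can be evaluated directly in code. A minimal Python sketch (the helper `lincomb` is our own construction; the coefficient values are those given in the text):

```python
import numpy as np

def lincomb(x, omegas, a, b):
    """Evaluate sum_k a_k*cos(2*pi*omega_k*x) + b_k*sin(2*pi*omega_k*x)."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    for w, ak, bk in zip(omegas, a, b):
        out += ak * np.cos(2 * np.pi * w * x) + bk * np.sin(2 * np.pi * w * x)
    return out

# example 2: n = 3, omega = (1, 4, 10), a = (2, 4, 0), b = (0, -1, 3)
x = np.linspace(0.0, 1.0, 1001)
fx = lincomb(x, [1, 4, 10], [2, 4, 0], [0, -1, 3])
print(fx[0])   # f(0) = 2 + 4 + 0 = 6
print(fx[-1])  # f(1) equals f(0), since all frequencies are integers
```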
Figure 1 Linear combinations of generalized cosine and sine functions, example 1.
Figure 2 Linear combinations of generalized cosine and sine functions, example 2.
(3) The real Fourier series
In the previous section we have introduced the generalized cosine and sine functions and have seen how
these can be combined by choosing their frequencies and amplitudes to yield quite arbitrary functions. In
the current section we consider the reverse approach: we start from an arbitrary function 𝑓 and ask which
frequencies and coefficients we may choose for generalized cosine and sine functions to reconstruct 𝑓 by an
infinite sum of generalized cosine and sine functions. More specifically, in this section, we consider the
following problem: Given a function 𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥), can we find real numbers 𝑎𝑘, 𝑏𝑘, 𝜔𝑘 such that 𝑓(𝑥)
can be represented by an infinite sum of generalized cosine and sine functions for all 𝑥 ∈ ℝ? In other words:
can we find 𝑎𝑘 , 𝑏𝑘, 𝜔𝑘 ∈ ℝ such that
𝑓(𝑥) = ∑_{𝑘=0}^{∞} (𝑎𝑘 cos(2𝜋𝜔𝑘𝑥) + 𝑏𝑘 sin(2𝜋𝜔𝑘𝑥)) for all 𝑥 ∈ ℝ (1)
To simplify proceedings, we will focus on functions for which we assume that they can be expressed in the
Fourier series form above, i.e., those functions for which the Fourier series converges for all 𝑥 ∈ ℝ.
Functions for which this is the case are said to satisfy the so-called “Dirichlet conditions”. The first Dirichlet
condition requires the function 𝑓 to be defined on the domain ℝ and periodic with period 𝑇, i.e.
𝑓(𝑥 + 𝑇) = 𝑓(𝑥) for all 𝑥 ∈ ℝ (2)
Understanding the further Dirichlet conditions requires some familiarity with basic mathematical concepts
such as continuity, which are not covered in PMFN. We hence omit a further discussion of them here (the
interested reader is referred to [Weaver, 1983] for a full discussion of the Dirichlet conditions). Instead we
focus on the derivation of formulas which, given the specification of a periodic function 𝑓 with period 𝑇,
yield the values of 𝑎𝑘, 𝑏𝑘 and 𝜔𝑘 for 𝑘 = 0, 1, 2, … in equation (1). We will state these formulas next and then
discuss how they are obtained.
The frequencies 𝜔𝑘 for 𝑘 = 0,1,2,… in equation (1) are given by
𝜔𝑘 = 𝑘/𝑇 (3)
where 𝑇 is the period of 𝑓. The cosine coefficients are given by the definite integrals
𝑎0 = (1/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) 𝑑𝑥 and 𝑎𝑘 = (2/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) cos(2𝜋(𝑘/𝑇)𝑥) 𝑑𝑥 for 𝑘 = 1,2,3,… (4)
The sine coefficients are given by
𝑏0 = 0 and 𝑏𝑘 = (2/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) sin(2𝜋(𝑘/𝑇)𝑥) 𝑑𝑥 for 𝑘 = 1,2,3,… (5)
In words: Given a periodic function 𝑓 with period 𝑇, the circular frequencies 𝜔𝑘 in expressions (1) are merely
the integer indices 𝑘 = 0,1,2,… divided by the period of 𝑓. Note that all these frequencies are integer
multiples of a basic frequency 𝜔 ≔ 1/𝑇. The cosine coefficient associated with the frequency 𝜔0 = 0, i.e. a
flat line, is given by the integral of the function 𝑓 on the interval [−𝑇/2, 𝑇/2], while all other cosine
coefficients 𝑎𝑘 , 𝑘 = 1,2, … are given by the integral of the function 𝑓 multiplied by a cosine term of circular
frequency 𝑘/𝑇. The sine coefficient associated with 𝑘 = 0, which corresponds to a sine wave of circular
frequency zero 𝜔0 = 0, i.e. a flat line, is 𝑏0 = 0, i.e., this term vanishes. All further sine coefficients
𝑏𝑘 , 𝑘 = 1,2, … are given by the integral of the function 𝑓 multiplied by a sine term of circular frequency 𝑘/𝑇
on [−𝑇/2, 𝑇/2]. To make the statements (3) – (5) explicit, we substitute them in equation (1) and obtain
𝑓(𝑥) = (1/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) 𝑑𝑥 + ∑_{𝑘=1}^{∞} [((2/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) cos(2𝜋(𝑘/𝑇)𝑥) 𝑑𝑥) cos(2𝜋(𝑘/𝑇)𝑥) + ((2/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) sin(2𝜋(𝑘/𝑇)𝑥) 𝑑𝑥) sin(2𝜋(𝑘/𝑇)𝑥)] (6)
Note that the right hand side of the above is fully specified in terms of the function 𝑓, its period 𝑇, the
cosine and sine functions, the evaluation of infinitely many integrals, and an infinite summation.
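To make the coefficient formulas concrete, the following Python sketch evaluates the integrals (4) and (5) by midpoint-rule quadrature for a T-periodic square wave (our own worked example, not from the text; for this function the classical result is b_k = 4/(πk) for odd k, with all other coefficients vanishing):

```python
import numpy as np

T = 2.0  # period of the example function

def f(x):
    # T-periodic square wave: +1 on (0, T/2), -1 on (-T/2, 0)
    return np.sign(np.sin(2 * np.pi * x / T))

def fourier_coeffs(f, T, k_max, n=20000):
    """Approximate the coefficients a_k, b_k of equations (4) and (5)
    by midpoint-rule quadrature on [-T/2, T/2]."""
    dx = T / n
    x = -T / 2 + (np.arange(n) + 0.5) * dx  # midpoint grid
    fx = f(x)
    a = np.empty(k_max + 1)
    b = np.empty(k_max + 1)
    a[0] = np.sum(fx) * dx / T              # a_0 = (1/T) * integral of f
    b[0] = 0.0                               # b_0 := 0 by convention
    for k in range(1, k_max + 1):
        a[k] = 2 / T * np.sum(fx * np.cos(2 * np.pi * k * x / T)) * dx
        b[k] = 2 / T * np.sum(fx * np.sin(2 * np.pi * k * x / T)) * dx
    return a, b

a, b = fourier_coeffs(f, T, 5)
print(np.round(b, 4))  # b_k close to 4/(pi*k) for odd k, 0 otherwise
```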
Proof of equations (4) and (5)
We now verify that the formulas given in (3) – (5) indeed fulfill equation (1) for the circular frequencies 𝜔𝑘 = 𝑘/𝑇, 𝑘 = 0,1,2,…. To do so, we require the following “orthogonality” properties of sine and cosine products for 𝑘, 𝑗 ∈ ℕ0:
∫_{−𝑇/2}^{𝑇/2} cos(2𝜋(𝑘/𝑇)𝑥) cos(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = 0 for 𝑘 ≠ 𝑗, and = 𝑇/2 for 𝑘 = 𝑗 ≠ 0 (4.1)

∫_{−𝑇/2}^{𝑇/2} sin(2𝜋(𝑘/𝑇)𝑥) sin(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = 0 for 𝑘 ≠ 𝑗, and = 𝑇/2 for 𝑘 = 𝑗 ≠ 0 (4.2)

∫_{−𝑇/2}^{𝑇/2} sin(2𝜋(𝑘/𝑇)𝑥) cos(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = 0 (4.3)
While these properties can readily be confirmed in a formal way based on the series definitions of sine and cosine and the notions of even and odd functions, they are perhaps more intuitively appreciated by inspection of Figure 1 and the intuition of the definite integral as the signed area under a function’s graph. A formal derivation of these orthogonality properties is given, for example, in Coleman MP (2013) “An Introduction to Partial Differential Equations with Matlab”, CRC Press (pp. 84 – 86).
Figure 1 Orthogonality properties of sine and cosine functions. For 𝑘, 𝑗 ∈ {1,2} the panels depict product functions of sine and cosine functions. From the intuition of the integral as the signed area between a function’s graph and the x-axis (dashed line), the orthogonality properties (4.1) – (4.3) may be appreciated.
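The orthogonality properties (4.1) – (4.3) can also be checked numerically. A small Python sketch using midpoint-rule quadrature over one period (the grid size and period are arbitrary choices of ours):

```python
import numpy as np

T = 2.0
n = 10000
dx = T / n
x = -T / 2 + (np.arange(n) + 0.5) * dx  # midpoint grid on [-T/2, T/2]

def inner(u, v):
    # midpoint-rule approximation of the integral of u*v on [-T/2, T/2]
    return np.sum(u * v) * dx

c = lambda k: np.cos(2 * np.pi * k * x / T)
s = lambda k: np.sin(2 * np.pi * k * x / T)

print(inner(c(2), c(3)))  # close to 0 (k != j)
print(inner(c(2), c(2)))  # close to T/2 (k = j != 0)
print(inner(s(1), s(1)))  # close to T/2
print(inner(s(1), c(1)))  # close to 0 (sine-cosine products)
```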
To derive the expression for the cosine coefficients in equation (4), we first substitute the frequencies 𝜔𝑘 = 𝑘/𝑇 on the right hand side of equation (1), which yields
𝑓(𝑥) = ∑_{𝑘=0}^{∞} (𝑎𝑘 cos(2𝜋(𝑘/𝑇)𝑥) + 𝑏𝑘 sin(2𝜋(𝑘/𝑇)𝑥)) (4.4)
We next choose an arbitrary 𝑗 ∈ ℕ and multiply both sides of equation (4.4) by cos(2𝜋(𝑗/𝑇)𝑥), yielding
𝑓(𝑥) cos(2𝜋(𝑗/𝑇)𝑥) = (∑_{𝑘=0}^{∞} 𝑎𝑘 cos(2𝜋(𝑘/𝑇)𝑥) + 𝑏𝑘 sin(2𝜋(𝑘/𝑇)𝑥)) cos(2𝜋(𝑗/𝑇)𝑥) = ∑_{𝑘=0}^{∞} (𝑎𝑘 cos(2𝜋(𝑘/𝑇)𝑥) cos(2𝜋(𝑗/𝑇)𝑥) + 𝑏𝑘 sin(2𝜋(𝑘/𝑇)𝑥) cos(2𝜋(𝑗/𝑇)𝑥)) (4.5)
We next integrate both sides on the interval [−𝑇/2, 𝑇/2]:
∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) cos(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = ∫_{−𝑇/2}^{𝑇/2} ∑_{𝑘=0}^{∞} (𝑎𝑘 cos(2𝜋(𝑘/𝑇)𝑥) cos(2𝜋(𝑗/𝑇)𝑥) + 𝑏𝑘 sin(2𝜋(𝑘/𝑇)𝑥) cos(2𝜋(𝑗/𝑇)𝑥)) 𝑑𝑥 (4.6)
Assuming that we may exchange the order of integration and summation on the right hand side of the above, we then obtain with the linearity property of the definite integral
∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) cos(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = ∑_{𝑘=0}^{∞} (𝑎𝑘 ∫_{−𝑇/2}^{𝑇/2} cos(2𝜋(𝑘/𝑇)𝑥) cos(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 + 𝑏𝑘 ∫_{−𝑇/2}^{𝑇/2} sin(2𝜋(𝑘/𝑇)𝑥) cos(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥) (4.7)
From the orthogonality properties of sine and cosine functions, we now know that the latter terms involving sine and cosine of either the same (for 𝑘 = 𝑗) or different (for 𝑘 ≠ 𝑗) frequencies evaluate to zero and we have
∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) cos(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = ∑_{𝑘=0}^{∞} 𝑎𝑘 ∫_{−𝑇/2}^{𝑇/2} cos(2𝜋(𝑘/𝑇)𝑥) cos(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 (4.8)
We further know that all integrals with 𝑘 ≠ 𝑗 in the infinite sum on the right-hand side above evaluate to zero, and we thus have
∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) cos(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = 𝑎𝑗 ∫_{−𝑇/2}^{𝑇/2} cos(2𝜋(𝑗/𝑇)𝑥) cos(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = 𝑎𝑗 𝑇/2 (4.9)
Dividing by 𝑇/2, we thus obtain

𝑎𝑗 = (2/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) cos(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 (4.10)

which, upon exchanging the arbitrary index 𝑗 for 𝑘, yields equation (4). For the special case 𝑗 = 0, we have in equation (4.8)
∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) cos(2𝜋(0/𝑇)𝑥) 𝑑𝑥 = 𝑎0 ∫_{−𝑇/2}^{𝑇/2} cos(2𝜋(0/𝑇)𝑥) cos(2𝜋(0/𝑇)𝑥) 𝑑𝑥
⇔ ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) 𝑑𝑥 = 𝑎0 ∫_{−𝑇/2}^{𝑇/2} 1 𝑑𝑥 = 𝑎0 (𝑇/2 − (−𝑇/2)) = 𝑎0𝑇
⇔ 𝑎0 = (1/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) 𝑑𝑥 (4.11)
and thus obtained equation (4).
Likewise, we may obtain the formula for the sine coefficients 𝑏𝑘. We first substitute 𝜔𝑘 = 𝑘/𝑇 for 𝑘 = 1,2,… and multiply both sides of equation (1) by sin(2𝜋(𝑗/𝑇)𝑥) for 𝑗 ∈ ℕ. We then integrate both sides from −𝑇/2 to 𝑇/2. In this case all terms involving 𝑎𝑘 vanish and in the remaining infinite sum only the term for 𝑗 = 𝑘 remains. As the corresponding right hand side of (4.10) now involves the integral of 𝑓(𝑥) multiplied by sin(2𝜋(𝑗/𝑇)𝑥), we find equation (5).
In detail, we first substitute the frequencies 𝜔𝑘 = 𝑘/𝑇 on the right hand side of equation (1)
𝑓(𝑥) = ∑_{𝑘=0}^{∞} (𝑎𝑘 cos(2𝜋(𝑘/𝑇)𝑥) + 𝑏𝑘 sin(2𝜋(𝑘/𝑇)𝑥)) (5.1)
We next multiply both sides of the above by sin(2𝜋(𝑗/𝑇)𝑥), where we choose an arbitrary 𝑗 ∈ ℕ0:
𝑓(𝑥) sin(2𝜋(𝑗/𝑇)𝑥) = (∑_{𝑘=0}^{∞} 𝑎𝑘 cos(2𝜋(𝑘/𝑇)𝑥) + 𝑏𝑘 sin(2𝜋(𝑘/𝑇)𝑥)) sin(2𝜋(𝑗/𝑇)𝑥) = ∑_{𝑘=0}^{∞} (𝑎𝑘 cos(2𝜋(𝑘/𝑇)𝑥) sin(2𝜋(𝑗/𝑇)𝑥) + 𝑏𝑘 sin(2𝜋(𝑘/𝑇)𝑥) sin(2𝜋(𝑗/𝑇)𝑥)) (5.2)
Next we integrate both sides of the above on the interval [−𝑇/2, 𝑇/2]:
∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) sin(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = ∫_{−𝑇/2}^{𝑇/2} ∑_{𝑘=0}^{∞} (𝑎𝑘 cos(2𝜋(𝑘/𝑇)𝑥) sin(2𝜋(𝑗/𝑇)𝑥) + 𝑏𝑘 sin(2𝜋(𝑘/𝑇)𝑥) sin(2𝜋(𝑗/𝑇)𝑥)) 𝑑𝑥 (5.3)
Again assuming that we may exchange the order of integration and summation on the right hand side of (5.3) above, we then obtain with the linearity property of the definite integral
∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) sin(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = ∑_{𝑘=0}^{∞} (𝑎𝑘 ∫_{−𝑇/2}^{𝑇/2} cos(2𝜋(𝑘/𝑇)𝑥) sin(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 + 𝑏𝑘 ∫_{−𝑇/2}^{𝑇/2} sin(2𝜋(𝑘/𝑇)𝑥) sin(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥) (5.4)
As in the evaluation of the cosine coefficients, we know from the orthogonality properties of the sine and cosine functions, that all the integrals involving a multiplication of sine and cosine evaluate to zero, and thus we obtain
∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) sin(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = ∑_{𝑘=0}^{∞} 𝑏𝑘 ∫_{−𝑇/2}^{𝑇/2} sin(2𝜋(𝑘/𝑇)𝑥) sin(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 (5.5)
We also know that all integrals involving the multiplication of sin(2𝜋(𝑘/𝑇)𝑥) with sin(2𝜋(𝑗/𝑇)𝑥) for which 𝑘 ≠ 𝑗 evaluate to zero, thus the infinite sum in (5.5) simplifies to a single term, i.e. the case 𝑘 = 𝑗 for our chosen 𝑗 ∈ ℕ0. If 𝑗 > 0, we have
∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) sin(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = 𝑏𝑗 ∫_{−𝑇/2}^{𝑇/2} sin(2𝜋(𝑗/𝑇)𝑥) sin(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 (5.6)
and thus
∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) sin(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = 𝑏𝑗 𝑇/2 ⇔ 𝑏𝑗 = (2/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) sin(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 (5.7)
By changing the arbitrary index 𝑗 on the right hand side of the above to 𝑘, we have
𝑏𝑘 = (2/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) sin(2𝜋(𝑘/𝑇)𝑥) 𝑑𝑥 for 𝑘 = 1,2,3,… (5.8)
and thus derived equation (5). If 𝑗 = 0, we have
∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) sin(2𝜋(0/𝑇)𝑥) 𝑑𝑥 = 𝑏0 ∫_{−𝑇/2}^{𝑇/2} sin(2𝜋(0/𝑇)𝑥) sin(2𝜋(0/𝑇)𝑥) 𝑑𝑥 ⇔ 0 = 𝑏0 ⋅ 0 (5.9)
and we may arbitrarily define 𝑏0, which we do by setting 𝑏0 = 0.
□
In summary, the expressions for the Fourier series cosine and sine coefficients (4) and (5) result from
choosing circular frequencies as integer multiples of 𝜔 = 1/𝑇, multiplication of equation (1) by either a
cosine or a sine term, integration over the interval −𝑇/2 to 𝑇/2 and the orthogonality properties of the
cosine and sine functions. In the next section, we clarify why for periodic functions with period 𝑇 the interval
of integration −𝑇/2 to 𝑇/2 is a sensible choice.
(4) Periodic and protracted functions
When introducing the formulas for the sine and cosine frequencies and amplitude coefficients for a
Fourier series of a function 𝑓 in the previous section, we required that this function was periodic with period
𝑇. We also performed the respective integrals from −𝑇/2 to 𝑇/2 which implied that a period of the function
is centered at 𝑥 = 0. In this section, we relax these conditions and introduce the notion of a “protracted
function”. In data analytical problems one is usually interested in functions over a finite domain (for example
the event-related potential for a fixed peristimulus time window in EEG data analysis, or the BOLD time-
course of a voxel over an experimental run). Above the Fourier series was introduced based on functions
over the infinite domain ℝ, but with period 𝑇. If we have a function with finite domain, which is not periodic,
we may create from it a periodic function with infinite domain ℝ by simply putting many identical copies of
the function of finite domain next to each other. This new function, called the “protracted function of 𝑓“ is
then periodic, where the period 𝑇 corresponds to the length of the finite domain of the original function,
and, by construction, has an infinite domain. In other words: from a non-periodic function with finite domain
it is easy to create a periodic function with infinite domain, which may be accessible to Fourier series
representation as in equation (1). There are mathematical subtleties associated with this approach, like the
introduction of discontinuities in the protracted function; however, we will not discuss them here. The
interested reader is referred to [Weaver, 1983] for a mathematically sound discussion of the idea of
protracted functions. We visualize the idea of protraction of a function in Figure 1.
Figure 1 Protraction of a function with finite domain yielding a periodic function with infinite domain.
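The construction of a protracted function is easy to sketch in code by laying identical copies of a finite-domain signal next to each other (a toy Python illustration; the quadratic signal and all parameters are our own choices):

```python
import numpy as np

# a non-periodic "signal" on the finite domain [0, 2) (so T = 2)
T = 2.0
n = 200
x = np.linspace(0.0, T, n, endpoint=False)
signal = x ** 2  # an arbitrary function of finite domain

# protraction: lay identical copies of the signal next to each other,
# yielding a T-periodic function on a (here: five times) larger domain
n_copies = 5
protracted = np.tile(signal, n_copies)

# the protracted function is T-periodic by construction: its values on
# [0, T) and on [T, 2T) coincide
print(protracted.shape)                                 # (1000,)
print(bool(np.allclose(protracted[:n], protracted[n:2 * n])))  # True
```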
The concept of a protracted function also clarifies the integral boundaries −𝑇/2 and 𝑇/2 used in the
previous section: they refer to the left and right endpoint of the period of the function and not necessarily to
the specific values of −𝑇/2 and 𝑇/2. For the example shown in Figure 1, this means that, as the period of
the function is 𝑇 = 2, the Fourier series integrals may be performed from 0 to 2 rather than from −1 to 1.
Note, however, that by construction also the protracted function defined on [−1,1] is a periodic function of
period 𝑇.
More generally, the periodicity of functions with period 𝑇 allows for performing the integrals over
intervals of length 𝑇 centered anywhere without changing the value of the integral. Formally, for a periodic
function 𝑓 with period 𝑇 and an arbitrary number 𝛿 < 𝑇, we have
∫_{−𝑇/2+𝛿}^{𝑇/2+𝛿} 𝑓(𝑥) 𝑑𝑥 = ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) 𝑑𝑥 (1)
In words the right hand side of equation (1) specifies an integral of the periodic function on a domain of
length 𝑇 shifted from a center at 0 to a center at 𝛿. The equality to the integral of this function on a domain
of length 𝑇 centered at 0 then implies the shift-invariance of integrals over domains of the length of the
period for periodic functions. Figure 2 visualizes this result.
Figure 2 Shift invariance of the integral of periodic functions over intervals of length 𝑇.
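Equation (1) is straightforward to check numerically. A Python sketch using an example T-periodic function and midpoint-rule quadrature (the function and the shift values δ are our own choices):

```python
import numpy as np

T = 2.0

def f(x):
    # an example T-periodic function
    return np.cos(2 * np.pi * x / T) ** 2 + 0.5 * np.sin(2 * np.pi * x / T)

def integral(lo, hi, n=100000):
    # midpoint-rule quadrature of f on [lo, hi]
    x = lo + (np.arange(n) + 0.5) * (hi - lo) / n
    return np.sum(f(x)) * (hi - lo) / n

centered = integral(-T / 2, T / 2)
for delta in (0.3, 0.75, 1.9):
    shifted = integral(-T / 2 + delta, T / 2 + delta)
    print(abs(shifted - centered))  # close to 0 for every shift delta < T
```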
Proof of equation (1)
Equation (1) may be seen as follows. Let 𝛿 < 𝑇 be an arbitrary number. We first note that performing an integral whose lower boundary 𝑏 + 𝛿 is larger than its upper boundary 𝑏 corresponds to the negative of the integral with lower boundary 𝑏 and upper boundary 𝑏 + 𝛿. For 𝑎 < 𝑏 this may be seen with the fundamental theorem of calculus as follows
∫_{𝑎}^{𝑏} 𝑓(𝑥) 𝑑𝑥 = 𝐹(𝑏) − 𝐹(𝑎) ⇔ −∫_{𝑎}^{𝑏} 𝑓(𝑥) 𝑑𝑥 = 𝐹(𝑎) − 𝐹(𝑏) = ∫_{𝑏}^{𝑎} 𝑓(𝑥) 𝑑𝑥 (1.1)
Then, by inspection of Figure 2, we note that by the linearity of the definite integral
∫_{−𝑇/2+𝛿}^{𝑇/2+𝛿} 𝑓(𝑥) 𝑑𝑥 = ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) 𝑑𝑥 + ∫_{−𝑇/2+𝛿}^{−𝑇/2} 𝑓(𝑥) 𝑑𝑥 + ∫_{𝑇/2}^{𝑇/2+𝛿} 𝑓(𝑥) 𝑑𝑥 (1.2)
where the second term on the right hand side is negative. Further, with 𝑓(𝑥 + 𝑇) = 𝑓(𝑥), we have
∫_{−𝑇/2+𝛿}^{−𝑇/2} 𝑓(𝑥) 𝑑𝑥 = ∫_{−𝑇/2+𝛿}^{−𝑇/2} 𝑓(𝑥 + 𝑇) 𝑑𝑥 = ∫_{𝑇/2+𝛿}^{𝑇/2} 𝑓(𝜉) 𝑑𝜉 = −∫_{𝑇/2}^{𝑇/2+𝛿} 𝑓(𝜉) 𝑑𝜉 (1.3)
where we have changed the integration boundaries by substitution of 𝜉 ≔ 𝑥 + 𝑇 in the second equality. We thus have
∫_{−𝑇/2+𝛿}^{𝑇/2+𝛿} 𝑓(𝑥) 𝑑𝑥 = ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) 𝑑𝑥 − ∫_{𝑇/2}^{𝑇/2+𝛿} 𝑓(𝑥) 𝑑𝑥 + ∫_{𝑇/2}^{𝑇/2+𝛿} 𝑓(𝑥) 𝑑𝑥 = ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) 𝑑𝑥 (1.4)
We thus have shown that the integral of a periodic function 𝑓 with period 𝑇 over an interval of length 𝑇 may be performed from an arbitrary lower boundary.
□
In summary, the formulas derived in the previous section for the coefficients of the real Fourier series representation can (ignoring mathematical subtleties) be applied to any function with finite domain and to all functions with infinite domain and period 𝑇.
(5) Complex Numbers
In the following we introduce two alternative formulations of the real Fourier series: the “complex
exponential form” and the “polar form”. From an intuitive viewpoint, the Fourier series in its real form is
probably the most accessible, while from a mathematical viewpoint, the complex form is the most convenient and the one most commonly encountered in applications. The complex exponential representation of the Fourier series also
forms the basis for the Fourier transform. Before we can introduce the complex form of the Fourier series,
we will have to briefly discuss the notion of complex numbers, and the relation of cosine, sine, and the
exponential by means of complex numbers and Euler’s identities.
The theory of complex numbers can be motivated from the desire to solve quadratic equations of
the form
ax² + bx + c = 0 (1)
for the unknown variable x and with known constants a, b, c ∈ ℝ. From high-school mathematics we know
that (1) has the general solutions
x_{1,2} = (−b ± √(b² − 4ac)) / (2a) (2)
If for the constants in equation (1) we have b² − 4ac < 0, then (2) requires us to take the square root of a
negative number. This is not defined when computing with real numbers. An example for such a case is the
quadratic equation
x² − 2x + 2 = 0 (3)
where
b² − 4ac = (−2)² − 4 ⋅ 1 ⋅ 2 = 4 − 8 = −4 (4)
To nevertheless define solutions of equations such as (3), one can define the number “𝑖”, the so called
“imaginary unit”, as follows
i ≔ √−1 ⇒ i² = (√−1)² = −1 (5)
In words: 𝑖 is defined as the square root of −1, or, stated differently, the number that if taken to the power
of 2, yields −1. This definition allows for expressing square roots of negative numbers such as −4, as
√−4 = √(−1 ⋅ 4) = √4 ⋅ √−1 = 2i (6)
The quadratic equation above then can be said to have the solutions
x₁ = (2 − √−4)/2 = (2 − 2i)/2 = 1 − i and x₂ = (2 + 2i)/2 = 1 + i (7)
A number of the form 𝑎 + 𝑏𝑖, where 𝑎, 𝑏 ∈ ℝ and 𝑖 = √−1 is called a “complex number”. The set of all
complex numbers is denoted by ℂ, i.e.,
ℂ ≔ {𝑎 + 𝑖𝑏|𝑎, 𝑏 ∈ ℝ, 𝑖 = √−1} (8)
Note that for 𝑏 = 0, we have 𝑎 + 𝑖0 = 𝑎 ∈ ℝ, which shows that the real numbers are a subset of the
complex numbers.
In calculations, complex numbers may be treated just like real numbers, while attending to the fact that i² = −1. For example, we can show that x₂ is a solution of the quadratic equation (3) by substitution as follows
(1 + i)² − 2(1 + i) + 2 = 0 (9)
⇔ 1² + 2 ⋅ 1 ⋅ i + i² − 2 ⋅ 1 − 2 ⋅ i + 2 = 0
⇔ 1 + 2i − 1 − 2 − 2i + 2 = 0
⇔ 0 = 0
We next define the notions of “real and imaginary parts”, “absolute values and arguments”, and “conjugates” of complex numbers. A complex number of the form z = a + ib has two components, a “real part” and an “imaginary part”, which can be written as
𝑅𝑒(𝑧) = 𝑎 and 𝐼𝑚(𝑧) = 𝑏 (10)
The “absolute value” of a complex number is
R = Abs(z) = √(a² + b²) (11)
and the “argument” of a complex number is
θ = Arg(z) = arctan(b/a) (12)
The “(complex) conjugate” of a complex number
𝑧 = 𝑎 + 𝑖𝑏 (13)
is denoted by 𝑧̅ and defined as
𝑧̅ = 𝑎 − 𝑖𝑏 (14)
Notably, the multiplication of a complex number with its conjugate always yields a real number
z z̄ = (a + ib)(a − ib) = a² − aib + aib − i²b² = a² − (√−1)²b² = a² + b² (15)
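Complex arithmetic of this kind can be explored directly with Python’s built-in complex type (a check added for illustration, not part of the original text; the example value z = 3 + 4i is assumed). The snippet verifies the roots from equation (7) and the conjugate product from equation (15):

```python
# verify the roots of x^2 - 2x + 2 = 0 from equation (7)
for root in (1 - 1j, 1 + 1j):
    assert root**2 - 2 * root + 2 == 0

# conjugate, absolute value, and the real-valued product z * conj(z), equation (15)
z = 3 + 4j
assert z.conjugate() == 3 - 4j
assert (z * z.conjugate()).real == 3**2 + 4**2   # a^2 + b^2 = 25
assert (z * z.conjugate()).imag == 0
print("Abs(z) =", abs(z))                        # sqrt(3^2 + 4^2)
```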
(6) Euler’s identities
By means of complex numbers, the exponential, sine and cosine functions can be shown to be
related according to what are known as “Euler’s identities”
exp(𝑖𝑥) = cos(𝑥) + 𝑖 sin(𝑥) and exp(−𝑖𝑥) = cos(𝑥) − 𝑖 sin(𝑥) (1)
which we prove below.
Proof of equation (1)
Recall that the cosine, sine, and exponential functions are defined by means of the series
cos(x) ≔ ∑_{n=0}^∞ (−1)^n x^{2n}/(2n)! = 1 − x²/2! + x⁴/4! − x⁶/6! + ⋯ (1.1)
sin(x) ≔ ∑_{n=0}^∞ (−1)^n x^{2n+1}/(2n + 1)! = x − x³/3! + x⁵/5! − x⁷/7! + ⋯ (1.2)
exp(x) ≔ ∑_{n=0}^∞ x^n/n! = 1 + x/1! + x²/2! + x³/3! + x⁴/4! + ⋯ (1.3)
Euler’s identity can then be shown by considering the series expression of exp(𝑖𝑥) given by
exp(ix) = ∑_{n=0}^∞ (ix)^n/n! = 1 + ix/1! + i²x²/2! + i³x³/3! + i⁴x⁴/4! + i⁵x⁵/5! + ⋯ (1.4)
Noting that for n = 0, 1, 2, … the powers of i cycle through the values 1, i, −1, −i,
i⁰ = +1, i¹ = +i, i² = −1, i³ = −i, i⁴ = +1, i⁵ = +i, i⁶ = −1, i⁷ = −i, … (1.5)
we obtain
exp(ix) = 1 + ix/1! − x²/2! − ix³/3! + x⁴/4! + ix⁵/5! + ⋯ (1.6)
= (1 − x²/2! + x⁴/4! − ⋯) + i(x/1! − x³/3! + x⁵/5! − ⋯)
= ∑_{n=0}^∞ (−1)^n x^{2n}/(2n)! + i ∑_{n=0}^∞ (−1)^n x^{2n+1}/(2n + 1)!
= cos(x) + i sin(x)
Similar considerations yield the second identity. □
From equations (1), we may express the cosine and sine function in terms of the exponential
function with a complex argument as follows
cos(x) = (exp(ix) + exp(−ix))/2 =: (e^{ix} + e^{−ix})/2 and sin(x) = (exp(ix) − exp(−ix))/(2i) =: (e^{ix} − e^{−ix})/(2i) (2)
Note that we introduced the notation 𝑒𝑥 ≔ exp(𝑥) in (2) to keep the notation concise. Based on the
relations (1) and (2) we are now in the position to derive the complex exponential form of the Fourier series.
Proof of equation (2)
The first equation derives from the fact that
exp(𝑖𝑥) + exp(−𝑖𝑥) = (cos(𝑥) + 𝑖 sin(𝑥)) + (cos(𝑥) − 𝑖 sin(𝑥)) = 2 cos(𝑥) (2.1)
and the second equation derives from the fact that
exp(𝑖𝑥) − exp(−𝑖𝑥) = (cos(𝑥) + 𝑖 sin(𝑥)) − (cos(𝑥) − 𝑖 sin(𝑥)) = 2𝑖 sin(𝑥) (2.2)
□
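Both Euler’s identities and the resulting expressions for cosine and sine can be confirmed numerically. The following Python/NumPy sketch (added for illustration; the grid of evaluation points is an assumption) checks them on a range of arguments:

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 101)
# Euler's identities, equation (1)
assert np.allclose(np.exp(1j * x), np.cos(x) + 1j * np.sin(x))
assert np.allclose(np.exp(-1j * x), np.cos(x) - 1j * np.sin(x))
# cosine and sine via complex exponentials, equation (2)
assert np.allclose(np.cos(x), (np.exp(1j * x) + np.exp(-1j * x)) / 2)
assert np.allclose(np.sin(x), (np.exp(1j * x) - np.exp(-1j * x)) / (2 * 1j))
print("Euler identities verified numerically")
```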
(7) The complex form of the Fourier series
Before introducing the complex form of the Fourier series, we will make a small adjustment to the
notation of the real form of the Fourier series, which shortens the ensuing expressions. In the previous
Section we have seen that if we have a function 𝑓 that is periodic with period of 𝑇, we can decompose it into
a sum of generalized cosine and sine functions
f(x) = a₀ + ∑_{k=1}^∞ [a_k cos(2π(k/T)x) + b_k sin(2π(k/T)x)] (1)
where the coefficients are given by the following integrals
a₀ = (1/T) ∫_{−T/2}^{T/2} f(x) dx and a_k = (2/T) ∫_{−T/2}^{T/2} f(x) cos(2π(k/T)x) dx, b_k = (2/T) ∫_{−T/2}^{T/2} f(x) sin(2π(k/T)x) dx (2)
for 𝑘 = 1,2, …. If we consider 𝑘 = 1, the input argument of the cosine and sine functions takes the form
2π(k/T)x = k(2π/T)x = (2π/T)x (3)
If we define
ω₀ ≔ 2π/T (4)
(note that the subscript “0” does not refer to k) we may replace every cosine and sine argument “2π(k/T)x” with the more concise argument “kω₀x”:
2π(k/T)x = k(2π/T)x = kω₀x (5)
In other words, with 𝜔0 as defined in equation (4), we can write the Fourier series in its real form as
f(x) = a₀ + ∑_{k=1}^∞ [a_k cos(kω₀x) + b_k sin(kω₀x)] (6)
We now rewrite the Fourier series in its complex form. To this end, we first note that from the
previous Section
cos(x) = (e^{ix} + e^{−ix})/2 and sin(x) = (e^{ix} − e^{−ix})/(2i) = i(e^{ix} − e^{−ix})/(2i²) = −i(e^{ix} − e^{−ix})/2 (7)
Substitution of the complex exponential expressions for cosine and sine in the real form of the Fourier series
(6) then yields
f(x) = a₀ + ∑_{k=1}^∞ [a_k ((e^{ikω₀x} + e^{−ikω₀x})/2) + b_k (−i(e^{ikω₀x} − e^{−ikω₀x})/2)] (8)
= a₀ + ∑_{k=1}^∞ [(a_k/2) e^{ikω₀x} + (a_k/2) e^{−ikω₀x} − (ib_k/2) e^{ikω₀x} + (ib_k/2) e^{−ikω₀x}] (9)
In two terms on the right-hand side of the above, we find the expression “−ikω₀x”, which we may interpret as “(−k) ⋅ iω₀x”. We can thus rewrite the above using two sums, one with indices k = 1, 2, … and one with indices k = −∞, …, −3, −2, −1 as follows
f(x) = a₀ + ∑_{k=1}^∞ ((a_k/2) e^{ikω₀x} − (ib_k/2) e^{ikω₀x}) + ∑_{k=−∞}^{−1} ((a_{−k}/2) e^{ikω₀x} + (ib_{−k}/2) e^{ikω₀x})
= a₀ + ∑_{k=1}^∞ (1/2)(a_k − ib_k) e^{ikω₀x} + ∑_{k=−∞}^{−1} (1/2)(a_{−k} + ib_{−k}) e^{ikω₀x} (10)
Note that a_{−k} for, e.g., k = −3 corresponds to a_{−(−3)} = a₃ and is thus identical to one of the coefficients in expression (9).
Expression (10) can be written more compactly by defining a new set of coefficients 𝑐𝑘 , 𝑘 ∈ ℤ as
follows
c_k ≔ { (1/2)(a_{−k} + ib_{−k}) for k < 0; a₀ for k = 0; (1/2)(a_k − ib_k) for k > 0 } (11)
Substitution of these definitions in (10) then yields
f(x) = a₀ + ∑_{k=1}^∞ (1/2)(a_k − ib_k) e^{ikω₀x} + ∑_{k=−∞}^{−1} (1/2)(a_{−k} + ib_{−k}) e^{ikω₀x} (12)
= c₀ + ∑_{k=1}^∞ c_k e^{ikω₀x} + ∑_{k=−∞}^{−1} c_k e^{ikω₀x}
= ∑_{k=−∞}^{−1} c_k e^{ikω₀x} + c₀ e^{0⋅iω₀x} + ∑_{k=1}^∞ c_k e^{ikω₀x}
= ∑_{k=−∞}^∞ c_k e^{ikω₀x}
The expression
f(x) = ∑_{k=−∞}^∞ c_k e^{ikω₀x} (13)
where, in general, 𝑐𝑘 ∈ ℂ is referred to as the complex form of the Fourier series. Note that in contrast to
the real form, the indices 𝑘 are signed integers.
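The passage from the real to the complex form can be checked numerically. The Python/NumPy sketch below (an illustration added here; the square-wave-like example coefficients a₀ = 0.5, b_k = 2/(πk) for odd k are assumed values) builds the truncated series in both forms using the coefficient definitions in (11) and verifies that they agree:

```python
import numpy as np

# assumed example: real coefficients a_0 = 0.5, b_k = 2/(pi*k) for odd k
T = 2.0
w0 = 2 * np.pi / T
K = 25                                   # truncation order
a = np.zeros(K + 1)
b = np.zeros(K + 1)
a[0] = 0.5
for k in range(1, K + 1, 2):
    b[k] = 2 / (np.pi * k)

x = np.linspace(-1.0, 1.0, 201)
# truncated real form: f(x) = a_0 + sum_k a_k cos(k w0 x) + b_k sin(k w0 x)
f_real = a[0] + sum(a[k] * np.cos(k * w0 * x) + b[k] * np.sin(k * w0 * x)
                    for k in range(1, K + 1))

# complex coefficients per equation (11), summed over k = -K, ..., K
f_cplx = a[0] * np.ones_like(x, dtype=complex)          # c_0 = a_0
for k in range(1, K + 1):
    ck = 0.5 * (a[k] - 1j * b[k])                       # c_k for k > 0
    cmk = 0.5 * (a[k] + 1j * b[k])                      # c_{-k} = (1/2)(a_k + i b_k)
    f_cplx += ck * np.exp(1j * k * w0 * x) + cmk * np.exp(-1j * k * w0 * x)

assert np.allclose(f_cplx.imag, 0, atol=1e-12)          # imaginary parts cancel
assert np.allclose(f_cplx.real, f_real)
print("complex and real Fourier series agree")
```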
(8) The polar form of the Fourier series
In the complex exponential form of the Fourier series, the coefficients 𝑐𝑘 contain information about
the “frequency magnitude” and “frequency phase“ for a given frequency 𝑘𝜔0 in the original function 𝑓. As
discussed in the next Section, numerical approaches that find the Fourier series coefficients for a given
function or sequence usually return the complex coefficients 𝑐𝑘. Here, we consider the interpretation of
these coefficients in terms of frequency magnitude and phase. To this end, we first note that the coefficients
c_k with negative k and those with positive k refer to the identical frequency |k|ω₀ in the original sine and cosine representation of the Fourier series. We may thus consider only coefficients with k ≥ 0, which, in general, are complex numbers that we write in terms of their real and imaginary parts as
c_k = a_k + ib_k (1)
To see how these complex numbers relate to frequency magnitude and phase, we first consider the so-called
“polar form” of the Fourier series. The polar form of the Fourier series is a reformulation of the real form of
the Fourier series. Specifically, based on the coefficients of the real Fourier series 𝑎0, 𝑎1, 𝑎2, … , 𝑏1, 𝑏2, … the
polar form of the Fourier series is given by
f(x) = d₀ + ∑_{k=1}^∞ d_k cos(kω₀x − φ_k) (2)
where the “polar form coefficients” 𝑑0, 𝑑1, 𝑑2, … and “phase angles” 𝜙1, 𝜙2, … can be obtained from the real
form coefficients by
d₀ = a₀, d_k = √(a_k² + b_k²) and φ_k = arctan(b_k/a_k) for k = 1, 2, … (3)
Proof of equation (2)
To show that the above holds, we capitalize on the following identities for trigonometric functions
cos(𝑥 + 𝑦) = cos(𝑥) cos(𝑦) − sin(𝑥) sin(𝑦), cos(−𝑥) = cos(𝑥) and sin(−𝑥) = − sin(𝑥) (3.1)
and the definition of the tangent function as
tan(x) ≔ sin(x)/cos(x) (3.2)
We then have
f(x) = d₀ + ∑_{k=1}^∞ d_k cos(kω₀x − φ_k) (3.3)
= d₀ + ∑_{k=1}^∞ d_k (cos(kω₀x) cos(−φ_k) − sin(kω₀x) sin(−φ_k))
= d₀ + ∑_{k=1}^∞ d_k (cos(kω₀x) cos(φ_k) + sin(kω₀x) sin(φ_k))
= d₀ + ∑_{k=1}^∞ [d_k cos(φ_k) cos(kω₀x) + d_k sin(φ_k) sin(kω₀x)]
The latter statement is equivalent to the real form of the Fourier series as defined in the previous section given that
𝑎𝑘 = 𝑑𝑘 cos(𝜙𝑘) and 𝑏𝑘 = 𝑑𝑘 sin(𝜙𝑘) (3.4)
To evaluate φ_k in terms of a_k and b_k we may thus set
b_k/a_k = (d_k sin(φ_k))/(d_k cos(φ_k)) = sin(φ_k)/cos(φ_k) = tan(φ_k) ⇒ φ_k = arctan(b_k/a_k), φ_k ∈ (−π/2, π/2) (3.5)
Squaring and adding the statements in (3.4) above to find d_k, we obtain
a_k² + b_k² = (d_k cos(φ_k))² + (d_k sin(φ_k))² = d_k²(cos²(φ_k) + sin²(φ_k)) = d_k² ⇒ d_k = √(a_k² + b_k²) (3.6)
In summary, the polar form coefficients and phase angles can be derived from the real form coefficients by means of trigonometric identities and the definition of the tangent function.
□
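The conversion from real form to polar form coefficients can be illustrated numerically. In the following Python/NumPy sketch (added here for illustration), the values of a_k, b_k, k, and ω₀ are assumed example values:

```python
import numpy as np

# assumed example coefficients for a single harmonic
ak, bk, k, w0 = 1.2, 0.7, 3, 2 * np.pi
dk = np.sqrt(ak**2 + bk**2)        # d_k, equation (3.6)
phik = np.arctan(bk / ak)          # phi_k, equation (3.5) (valid here since ak > 0)

x = np.linspace(0.0, 1.0, 301)
lhs = dk * np.cos(k * w0 * x - phik)                      # polar form term
rhs = ak * np.cos(k * w0 * x) + bk * np.sin(k * w0 * x)   # real form term
assert np.allclose(lhs, rhs)
print("d_k =", round(dk, 4))
```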
The polar representation of the Fourier series thus shows how the real and imaginary parts of the coefficients c_k (i.e., Re(c_k) = a_k and Im(c_k) = b_k) relate to a generalized cosine function of frequency kω₀: the square root of their sum of squares yields the amplitude coefficient d_k, and the arctangent of their ratio yields the phase parameter φ_k. Neglecting that the underlying function of the polar Fourier series is a cosine function (as one may equally express the polar Fourier series in terms of a sine function), one may thus refer to “frequency-specific amplitudes and phases”.
A commonly used approach to visualize the relation between the real and imaginary parts of Fourier series coefficients and frequency magnitude and phase is to plot the complex number c_k as a point in the two-dimensional plane spanned by the values of its real part Re(c_k) = a_k and its imaginary part Im(c_k) = b_k (Figure 1). Basic geometry then relates the “frequency magnitude” d_k to the length of the line from the origin to the point (a_k, b_k)ᵀ and the “frequency phase” to the angle spanned by this line and the a_k-axis.
Figure 1. Representation of the complex Fourier series coefficients in the “real-imaginary plane”. Visualizing the real and imaginary
part of such a coefficient as a point in this plane, relates the amplitude coefficients of the polar form of the Fourier series to the
length of the line from the origin to the point, and phase parameters in the polar form of the Fourier series to the angle spanned by
this line and the real axis.
(9) The Fourier transform
The Fourier transform is an example of an integral transform which transforms one function defined
on one domain into a second function defined on another domain. In the case of the Fourier transform, the
former domain is usually time or space, i.e., the arguments of the input function to the Fourier transform
represent temporal or spatial units, while the latter domain is the frequency domain, i.e. the input
arguments of the Fourier transformed function are frequencies. More formally, for a function
𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) (1)
the Fourier transform is defined as the function F, for which
F: ℝ → ℂ, ω ↦ F(ω) ≔ ∫_{−∞}^∞ f(x) exp(−2πiωx) dx (2)
The inverse Fourier transform is defined as
f: ℝ → ℝ, x ↦ f(x) ≔ ∫_{−∞}^∞ F(ω) exp(2πiωx) dω (3)
The definition of the Fourier transform given in equation (2) can be motivated as the “continuous frequency approximation” of the “discrete frequency form” of the Fourier series. In order to see this, we first note that we can write the coefficients c_k (k ∈ ℤ) of the complex exponential form of the Fourier series directly in terms of the function f and a complex exponential. That is, for a function f as defined in (1) and periodic with period T, the coefficients in its complex exponential Fourier series representation
f(x) = ∑_{k=−∞}^∞ c_k exp(2πi(k/T)x) (4)
can be written as
c_k = (1/T) ∫_{−T/2}^{T/2} f(x) exp(−2πi(k/T)x) dx (k ∈ ℤ) (5)
which avoids recourse to the real form coefficients a₀ and a_k, b_k, k ∈ ℕ.
Proof of equation (5)
To show that the coefficients 𝑐𝑘 (𝑘 ∈ ℤ) of the complex exponential form of the Fourier series may be written as in
equation (5) we follow a similar line of reasoning as for the derivation of the real form coefficients: we first establish an
orthogonality relation for complex exponentials, multiply the complex form of the Fourier series by a complex exponential and
integrate. More specifically, we require the following complex exponential orthogonality property:
∫_{−T/2}^{T/2} exp(2πinx/T) exp(−2πimx/T) dx = { 0 for n ≠ m; T for n = m } (5.1)
Note the similarity of this property to the cosine and sine orthogonality properties above. Again, we will eschew a formal proof of (5.1), but capitalize on it in the following. Multiplying the complex form of the Fourier series
f(x) = ∑_{k=−∞}^∞ c_k exp(2πikx/T) (5.2)
by exp(−2πimx/T) for an arbitrary m ∈ ℤ and subsequently integrating over the interval [−T/2, T/2] yields
∫_{−T/2}^{T/2} f(x) exp(−2πimx/T) dx = ∫_{−T/2}^{T/2} ∑_{k=−∞}^∞ c_k exp(2πikx/T) exp(−2πimx/T) dx (5.3)
Assuming that we may exchange the order of summation and integration on the right-hand side of (5.3), we obtain
∫_{−T/2}^{T/2} f(x) exp(−2πimx/T) dx = ∑_{k=−∞}^∞ ∫_{−T/2}^{T/2} c_k exp(2πikx/T) exp(−2πimx/T) dx (5.4)
Using the orthogonality property (5.1), we hence obtain
∫_{−T/2}^{T/2} f(x) exp(−2πimx/T) dx = c_m ∫_{−T/2}^{T/2} exp(2πimx/T) exp(−2πimx/T) dx = c_m T (5.5)
and exchanging the (arbitrary) index 𝑚 ∈ ℤ for 𝑘 ∈ ℤ, we get
c_k = (1/T) ∫_{−T/2}^{T/2} f(x) exp(−2πikx/T) dx (k ∈ ℤ) (5.6)
□
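For a concrete check of equation (5.6), consider f(x) = cos(2πx/T), whose complex coefficients should be c₁ = c₋₁ = 1/2 with all others zero. The following Python/NumPy sketch (a numerical illustration added here; the period and grid size are assumed values) evaluates the integral numerically:

```python
import numpy as np

# f(x) = cos(2*pi*x/T) should have c_1 = c_{-1} = 1/2 and all other c_k = 0
T = 2.0
n = 200000
x = np.linspace(-T / 2, T / 2, n, endpoint=False) + T / (2 * n)  # midpoints

def c(k):
    # c_k = (1/T) * integral over one period of f(x) exp(-2*pi*i*k*x/T), eq. (5.6)
    fx = np.cos(2 * np.pi * x / T)
    # (1/T) * dx = (1/T) * (T/n) = 1/n, so the midpoint sum is divided by n
    return np.sum(fx * np.exp(-2j * np.pi * k * x / T)) / n

assert abs(c(1) - 0.5) < 1e-6
assert abs(c(-1) - 0.5) < 1e-6
assert abs(c(0)) < 1e-6 and abs(c(2)) < 1e-6
print("c_1 =", round(c(1).real, 3))
```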
Having established the direct complex exponential form of 𝑐𝑘, we now return to the view of the
Fourier transform as “continuous frequency generalization” of the Fourier series. We recall that the
constants 𝑐𝑘 are a measure of the “amount” of the discrete frequencies 𝑘𝜔0 = 2𝜋𝑘/𝑇 that are combined to
represent the periodic function 𝑓(𝑥). We also note that although we have an infinite number of these
discrete frequencies, they are all multiples of a basic frequency given by ω₀ = 2π/T. As the period increases, this basic frequency decreases, and therefore the discrete frequencies move closer together, until in the limit T → ∞ they form a continuous frequency spectrum. Formally, we may consider
the distance Δ𝜔𝑘 between two discrete frequencies given by
Δω_k = ω_{k+1} − ω_k = (k + 1)/T − k/T = 1/T (k ∈ ℤ) (6)
We may analogously write
lim_{T→∞} k/T = ω (7)
as 𝑇 → ∞, Δ𝜔𝑘 → 0 and we arrive at a continuous frequency representation 𝜔 ∈ ℝ. Consider now the
Fourier series coefficient 𝑐𝑘, which we multiply with 𝑇:
T c_k = ∫_{−T/2}^{T/2} f(x) exp(−2πi(k/T)x) dx (k ∈ ℤ) (8)
Replacing the discrete frequency k/T by the continuous frequency ω and letting T → ∞, we obtain
T c_k = ∫_{−∞}^∞ f(x) exp(−2πiωx) dx =: F(ω) (9)
i.e. the Fourier transform formula. Note that 𝐹(𝜔) can be interpreted as the discrete contribution 𝑐𝑘
(tending to zero, as more and more frequencies become available) multiplied by the (constant) period of the
input function (tending to infinity, as the function becomes less and less periodic).
From a data analytical perspective, one usually computes frequency spectra of observed data, not of
analytical functions. We will thus omit a discussion of the properties and implications of the Fourier
transform and instead introduce the discrete Fourier transform next. The interested reader will find much
material on the analytical properties of the Fourier transform and examples for Fourier transforms of a
selection of functions in [Weaver, 1983].
(10) The discrete Fourier transform
We may regard the Fourier series in its complex exponential form as an operation that takes a
function 𝑓 and returns a sequence (i.e., an infinite ordered list) of coefficients 𝑐𝑘 , 𝑘 ∈ ℤ. Likewise, the
Fourier transform may be considered an operation that maps a function 𝑓 onto another function 𝐹. To
determine either the Fourier series or the Fourier transform, integrals have to be evaluated. Thus, it is only possible to consider functions that can be described analytically, and even then these functions must be relatively simple. In the real world, we rarely find such functions and therefore must turn to a computer for help. A computer, however, does not represent continuous functions, but sequences of numbers that may
represent functions or empirical data on finite domains. The discrete Fourier transform can be viewed as
the discretized analogue of the continuous Fourier transform as defined in the previous subsection. To
introduce the discrete Fourier transform, we first consider the notion of an “𝑛th-order sequence”.
A sequence of a finite number of n terms, or an “nth-order sequence”, is defined as a function whose domain is the set of integers ℕ_{n−1}^0 ≔ {0, 1, 2, …, n − 1} for n ∈ ℕ and whose range is the set of function values {f(0), f(1), …, f(n − 1)}. Alternatively, an nth-order sequence may be conceived as the set of ordered pairs {(0, f(0)), (1, f(1)), …, (n − 1, f(n − 1))}. Here we use a shorter notation and follow the common practice of denoting the sequence by (f_k)_{k=0,1,…,n−1} and the kth term of the sequence by f_k. For example, we may have a sequence (f_k)_{k=0,1,…,n−1} whose terms are defined by
example, we may have a sequence (𝑓𝑘)𝑘=0,1,…,𝑛−1 whose terms are defined by
𝑓𝑘 ≔1
𝑘+1, 𝑘 = 0,1, … , 𝑛 − 1 (1)
and thus
(𝑓𝑘)𝑘∈ℕ𝑛−10 = {1,1
2,1
3, … ,
1
𝑛} (2)
Notably, we do not require a formula or equation for f_k in terms of k to define a sequence. In other words, any empirically observed set of n univariate data points y_i ∈ ℝ, i = 0, 1, …, n − 1 can be regarded as an nth-order sequence, and we may identify data vectors y ∈ ℝⁿ with nth-order sequences (y_k)_{k∈ℕ_{n−1}^0}. In order to emphasize the data analytical aspect of the discrete Fourier transform, we will use
y = (y_k)_{k∈ℕ_{n−1}^0} (3)
in the following to denote an nth-order sequence. We call such a sequence “bounded” if all of its terms are finite valued.
By analogy to the Fourier transform of a function defined on the continuous domain ℝ given by
F: ℝ → ℂ, ω ↦ F(ω) ≔ ∫_{−∞}^∞ f(x) exp(−2πiωx) dx (4)
the discrete Fourier transform of a bounded nth-order sequence is defined as
Y_j ≔ (1/n) ∑_{k=0}^{n−1} y_k exp(−2πi(j/n)k) for j = 0, 1, …, n − 1 (5)
Further, in analogy to the inverse Fourier transform, the inverse discrete Fourier transform is defined as
y_k = ∑_{j=0}^{n−1} Y_j exp(2πi(j/n)k) for k = 0, 1, …, n − 1 (6)
Like the Fourier transform and the inverse Fourier transform, the discrete Fourier transform and its inverse are reciprocal: having obtained a sequence Y = (Y_j)_{j∈ℕ_{n−1}^0} from a sequence (y_k)_{k∈ℕ_{n−1}^0} and applying the inverse discrete Fourier transform equation (6) to it yields the original sequence (y_k)_{k∈ℕ_{n−1}^0}.
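This reciprocity of equations (5) and (6) can be verified with a direct (non-fast) implementation of both transforms. The following Python/NumPy sketch (added here for illustration; the sequence length and random data are assumptions) uses the 1/n normalization adopted in this text:

```python
import numpy as np

def dft(y):
    # Y_j = (1/n) * sum_k y_k exp(-2*pi*i*(j/n)*k), equation (5)
    n = len(y)
    j = np.arange(n)
    W = np.exp(-2j * np.pi * np.outer(j, j) / n)   # W[j, k] = exp(-2*pi*i*j*k/n)
    return W @ y / n

def idft(Y):
    # y_k = sum_j Y_j exp(2*pi*i*(j/n)*k), equation (6)
    n = len(Y)
    j = np.arange(n)
    W = np.exp(2j * np.pi * np.outer(j, j) / n)    # W[k, j] = exp(+2*pi*i*k*j/n)
    return W @ Y

rng = np.random.default_rng(0)
y = rng.standard_normal(16)
assert np.allclose(idft(dft(y)), y)    # the two transforms are reciprocal
print("inverse DFT recovers the original sequence")
```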
Note that we use Y_j to denote the (usually complex) values of the discrete Fourier transform of the nth-order sequence y. Further notice that, with respect to the Fourier transform, the frequency resolution depends on the order of the sequence, i.e., the number of data points: ω ∈ ℝ is replaced by j/n for j = 0, 1, …, n − 1. Usually, the values of y refer to data sampled equidistantly over time or space. Each value thus refers to a given time or space increment, and the reciprocal value of this increment is the data sampling frequency f_s. Dividing the sampling frequency by the number of data points and multiplying with j = 0, 1, …, n − 1 then yields the discrete support frequencies of the discrete Fourier transform. In other words: the complex number Y_j contains magnitude and phase information about the frequency (j/n) ⋅ f_s in the signal of interest. Finally, note that in the definition of the complex exponential Fourier series, the index of the values c_k covered the negative and positive integers. The first n/2 + 1 values Y_j correspond to the c_k values for k = 0, …, n/2, while the latter n/2 − 1 values Y_j correspond to the c_k values for k = −n/2 + 1, …, −1. Because for real-valued data these latter terms are redundant, it suffices to consider the first n/2 + 1 terms Y_j for characterizing frequency magnitude and phase.
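The correspondence between the index j, the support frequencies (j/n)·f_s, and the redundancy of the second half of the coefficients for real-valued data can be illustrated as follows (a Python/NumPy sketch with assumed values of n and f_s; note that numpy’s fft, like Matlab’s, omits the 1/n factor, which is therefore applied explicitly):

```python
import numpy as np

n, fs = 8, 100.0                         # assumed: 8 samples at 100 Hz
t = np.arange(n) / fs
y = np.sin(2 * np.pi * 12.5 * t) + 0.5   # real-valued signal
Y = np.fft.fft(y) / n                    # DFT with the 1/n convention used here

freqs = np.arange(n) * fs / n            # support frequencies (j/n) * fs
assert freqs[1] == fs / n                # frequency resolution fs/n = 12.5 Hz

# for real-valued data the second half is redundant: Y_{n-j} = conj(Y_j)
for j in range(1, n // 2):
    assert np.allclose(Y[n - j], np.conj(Y[j]))
print("Y_0 (mean):", round(Y[0].real, 3))
```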
For notational simplicity, one may define a so-called “weighting kernel” 𝑤𝑛 by
w_n ≔ exp(2πi/n) (7)
Note that for a scalar a ∈ ℝ, taking w_n to the power of a yields
w_n^a ≔ (w_n)^a = (exp(2πi/n))^a = exp(2πai/n) (8)
Rewriting the definitions of the discrete Fourier transform and the inverse discrete Fourier transform above
using the weighting kernel 𝑤𝑛 we have
Y_j = (1/n) ∑_{k=0}^{n−1} y_k w_n^{−kj} for j = 0, 1, …, n − 1 (9)
and
y_k = ∑_{j=0}^{n−1} Y_j w_n^{kj} for k = 0, 1, …, n − 1 (10)
We next turn to the question of how to compute the discrete Fourier transform (5) of a given data sequence (y_k)_{k∈ℕ_{n−1}^0} by means of the so-called Fast Fourier Transform algorithm.
(11) The Fast Fourier Transform Algorithm
The Fast Fourier Transform algorithm computes the values Y_j for j = 0, 1, …, n − 1 from a given data sequence (y_k)_{k∈ℕ_{n−1}^0} according to the definition of the discrete Fourier transform
Y_j ≔ (1/n) ∑_{k=0}^{n−1} y_k w_n^{−kj} (1)
In principle, these values can be computed readily using basic programming tools, because each value Y_j is merely given by the products of n terms of the form y_k w_n^{−kj} and their subsequent summation. The problem with a direct computation of the values Y_j according to formula (1) is that for long data sequences, i.e., large values of n, many numerical operations have to be performed. More specifically, evaluating the definition of Y_j for j = 0, 1, …, n − 1 requires n² “numerical operations” (= multiplications and additions): for each Y_j, n multiplications and additions have to be performed, and there are n terms Y_j, thus resulting in a total of n² operations. Doubling the number of data points from, say, n₁ = 100 to n₂ = 200 thus yields a fourfold increase in the number of necessary computations (from 100² = 10,000 to 200² = 40,000). The fast Fourier transform algorithm reduces the number of computations necessary to evaluate the discrete Fourier transform of a given nth-order data sequence to n ⋅ log₂ n. Given that each numerical operation performed on a computer takes a non-zero amount of time, the fast Fourier transform algorithm thus returns the values Y_j much more quickly than the direct evaluation of the discrete Fourier transform according to equation (1). Figure 1 depicts the values n² and n ⋅ log₂ n as a function of the data sequence order n.
Figure 1. Number of numerical operations (multiplications and additions) necessary to evaluate the discrete Fourier transform of an 𝑛th order data sequence. Note that the fast Fourier transform algorithm massively reduces the computational demand.
To see how the fast Fourier transform achieves this reduction of computational demand, we follow the discussion in [Weaver, 1983]. To keep the notation concise, we will write y(k) ≔ y_k for the kth element of a sequence (y_k)_{k∈ℕ_{n−1}^0} in the following. For a given nth-order data sequence we assume that n is an even integer, and we can thus split the sequence (y_k)_{k∈ℕ_{n−1}^0} into two new subsequences (y_{1k})_{k∈ℕ_{m−1}^0} and (y_{2k})_{k∈ℕ_{m−1}^0} by defining
y_{1k} = y₁(k) ≔ y(2k) and y_{2k} = y₂(k) ≔ y(2k + 1) for k = 0, 1, …, m − 1 where m = n/2 (2)
To make the above transparent, consider for example the 6th-order sequence
(y_k)_{k∈ℕ_5^0} ≔ {y(0), y(1), y(2), y(3), y(4), y(5)} = {0, 1, 2, 14, 16, 20} (3)
Then (2) defines the two subsequences
(y_{1k})_{k∈ℕ_2^0} = {y₁(0), y₁(1), y₁(2)} = {y(0), y(2), y(4)} = {0, 2, 16} (4)
and
(y_{2k})_{k∈ℕ_2^0} = {y₂(0), y₂(1), y₂(2)} = {y(1), y(3), y(5)} = {1, 14, 20} (5)
The first subsequence thus comprises the “even” terms of the original sequence (starting from the zeroth term), while the second subsequence comprises the “odd” terms of the original sequence (starting from the
first term). Note that both new subsequences as defined in (2) are periodic sequences with periodicity 𝑚:
𝑦1(𝑘 + 𝑚) = 𝑦(2(𝑘 + 𝑚)) = 𝑦(2𝑘 + 𝑛) = 𝑦(2𝑘) = 𝑦1(𝑘) (6)
and
y₂(k + m) = y(2(k + m) + 1) = y(2k + n + 1) = y(2k + 1) = y₂(k) (7)
Since (y_{1k})_{k∈ℕ_{m−1}^0} and (y_{2k})_{k∈ℕ_{m−1}^0} are mth-order sequences, we can determine their discrete Fourier transforms according to the definition in equation (1):
Y₁(j) = (1/m) ∑_{k=0}^{m−1} y_{1k} w_m^{−kj} for j = 0, 1, …, m − 1 (8)
Y₂(j) = (1/m) ∑_{k=0}^{m−1} y_{2k} w_m^{−kj} for j = 0, 1, …, m − 1 (9)
Now let us consider the discrete Fourier transform of the original nth-order sequence (y_k)_{k∈ℕ_{n−1}^0}:
Y(j) = (1/n) ∑_{k=0}^{n−1} y_k w_n^{−kj} for j = 0, 1, …, n − 1 (10)
By splitting the summation into even- and odd-indexed terms, we can write the preceding equation as
Y(j) = (1/n) ∑_{k=0}^{m−1} y(2k) w_n^{−2kj} + (1/n) ∑_{k=0}^{m−1} y(2k + 1) w_n^{−(2k+1)j} (11)
However, we note that
w_n^{−2kj} = exp(−(2 ⋅ 2πikj)/n) = exp(−2πikj/(n/2)) = w_m^{−kj} (12)
and
w_n^{−(2k+1)j} = exp(−2πi(2k + 1)j/n) = exp(−(2 ⋅ 2πikj)/n) exp(−2πij/n) = w_m^{−kj} w_n^{−j} (13)
Therefore, equation (11) can be written as
Y(j) = (1/n) ∑_{k=0}^{m−1} y_{1k} w_m^{−kj} + (w_n^{−j}/n) ∑_{k=0}^{m−1} y_{2k} w_m^{−kj} for j = 0, …, n − 1 (14)
Comparison to the discrete Fourier transforms of the subsequences yields
Y(j) = Y₁(j)/2 + w_n^{−j} Y₂(j)/2 for j = 0, …, n − 1 (15)
Because Y₁(j) and Y₂(j) are periodic with periodicity m, and because w_n^{−(j+m)} = −w_n^{−j}, we have
Y(j) = (1/2)(Y₁(j) + Y₂(j) w_n^{−j}) and Y(j + m) = (1/2)(Y₁(j) − Y₂(j) w_n^{−j}) for j = 0, …, m − 1 (16)
As we have noted, calculating the discrete Fourier transform of (y_k)_{k∈ℕ_{n−1}^0} requires n² numerical operations (additions and multiplications), whereas calculating the discrete Fourier transforms of (y_{1k})_{k∈ℕ_{m−1}^0} and (y_{2k})_{k∈ℕ_{m−1}^0} requires only m² = (n/2)² = n²/4 complex operations each. When using equation (16) to obtain Y(j) based on the values Y₁(j) and Y₂(j), we first require 2(n²/4) = n²/2 operations to calculate the two Fourier transforms Y₁(j) and Y₂(j), and then we require n additional operations as prescribed by equation (16). The total number of operations for computing the discrete Fourier transform coefficients has thus been reduced from n² to n²/2 + n.
Now suppose n is divisible by 4, i.e., m = n/2 is divisible by 2. Then the subsequences (y_{1k})_{k∈ℕ_{m−1}^0} and (y_{2k})_{k∈ℕ_{m−1}^0} can be further subdivided into four sequences of order m/2,
g₁(k) = y₁(2k), g₂(k) = y₁(2k + 1), h₁(k) = y₂(2k), h₂(k) = y₂(2k + 1) (17)
for k = 0, 1, …, m/2 − 1. Thus, we can also use the scheme above to obtain the Fourier transforms of (y_{1k})_{k∈ℕ_{m−1}^0} and (y_{2k})_{k∈ℕ_{m−1}^0} and use these results to obtain the values Y(j). A little thought reveals that this requires 2n + n²/4 operations in total. Thus, when subdividing a sequence twice, we reduce the number of operations from n² to 2n + n²/4. The 2n term is the result of applying the scheme above twice, whereas the n²/4 term is the result of transforming the four reduced sequences. For the case n = 4, we note that we completely reduce the sequence to four first-order sequences that are their own transforms, and therefore we do not need the additional n²/4 transform operations. The formula then becomes 2n. The smallest value of n that does not result in complete reduction of the sequence is 8. For this case we have a reduction factor of 1/2, whereas for large n the factor approaches 1/4. Continuing in this way, one can show that if n is divisible by 2^p, then the number of operations required to compute the discrete Fourier transform of the nth-order sequence by repeated subdivision is
pn + n²/2^p (18)
Again, for complete reduction, i.e., n = 2^p, the n²/2^p term is not required and we obtain pn for the number of operations. The number of required operations in this case is thus pn = n log₂ n.
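The repeated subdivision described above is naturally expressed as a recursion. The following Python/NumPy sketch (an illustration, not the text's reference implementation) implements the radix-2 case n = 2^p with the 1/n normalization used in this text and compares it against numpy's built-in FFT:

```python
import numpy as np

def fft_recursive(y):
    # radix-2 decimation-in-time FFT for n = 2^p, with the 1/n normalization
    n = len(y)
    if n == 1:
        return np.asarray(y, dtype=complex)   # a first-order sequence is its own transform
    Y1 = fft_recursive(y[0::2])               # even-indexed subsequence
    Y2 = fft_recursive(y[1::2])               # odd-indexed subsequence
    w = np.exp(-2j * np.pi * np.arange(n // 2) / n)   # w_n^{-j}
    # butterfly recombination, equation (16)
    return 0.5 * np.concatenate([Y1 + w * Y2, Y1 - w * Y2])

rng = np.random.default_rng(1)
y = rng.standard_normal(64)                   # n = 2^6
assert np.allclose(fft_recursive(y), np.fft.fft(y) / len(y))
print("recursive FFT matches numpy.fft.fft up to the 1/n factor")
```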
(12) Fast Fourier Transforms in Matlab
Below we include Matlab code that evaluates the discrete Fourier transform of a simulated time-series using Matlab’s fast Fourier transform implementation. Note that this implementation does not include the normalization factor 1/n used in the definition of the discrete Fourier transform above. The complex Fourier coefficients are returned in the order j = 0, 1, …, n − 1, such that the second half of the coefficients is redundant for real-valued data. To obtain the coefficients c_k in the order k = …, −1, 0, 1, …, the function fftshift.m is applied. Figure 2 below visualizes the simulation.
% Matlab DFT/FFT example
% -------------------------------------------------------------------------
% simulation parameters
n = 2^9 ; % number of samples (window length)
fs = 200 ; % sampling frequency (Hz)
nf = fs/2 ; % Nyquist frequency
dt = 1/fs ; % time increment per sample
t = (0:n-1)/fs ; % time vector
s = 1.5 ; % noise standard deviation
% sample data with 5 and 20 Hz components and Gaussian noise
y = 2*sin(2*pi*5*t) + 3*sin(2*pi*20*t) + s*randn(1,length(t));
% use Matlabs fft to compute the DFT of y and its power
f = (0:n-1)*(fs/n) ; % frequency range of DFT
Y = fft(y) ; % DFT Y_j
p = Y.*conj(Y)/n ; % power of the DFT
% reorder coefficients according to k = -3,-2,-1,0,1,2,3,…
Y0 = fftshift(Y) ; % coefficient reordering
f0 = (-n/2:n/2-1)*(fs/n) ; % frequency range for c_k
p0 = Y0.*conj(Y0)/n ; % power of the DFT
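For comparison, a hypothetical equivalent of the Matlab script in Python/NumPy (the random seed and the printed diagnostic are assumptions of this sketch):

```python
import numpy as np

# simulation parameters (mirroring the Matlab script above)
n = 2 ** 9                     # number of samples (window length)
fs = 200.0                     # sampling frequency (Hz)
t = np.arange(n) / fs          # time vector
s = 1.5                        # noise standard deviation

rng = np.random.default_rng(0) # assumed seed for reproducibility
# sample data with 5 and 20 Hz components and Gaussian noise
y = (2 * np.sin(2 * np.pi * 5 * t) + 3 * np.sin(2 * np.pi * 20 * t)
     + s * rng.standard_normal(n))

f = np.arange(n) * fs / n          # frequency range of the DFT
Y = np.fft.fft(y)                  # DFT Y_j (numpy, like Matlab, omits the 1/n factor)
p = (Y * np.conj(Y)).real / n      # power of the DFT

# reorder coefficients according to k = ..., -2, -1, 0, 1, 2, ...
Y0 = np.fft.fftshift(Y)
f0 = np.arange(-n // 2, n // 2) * fs / n
p0 = (Y0 * np.conj(Y0)).real / n

# the largest power in the first half of the spectrum sits near the 20 Hz component
print(round(f[np.argmax(p[: n // 2])], 1))
```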
Figure 2. Visualization of Matlab’s fast Fourier transform implementation. The upper panel depicts 512 samples of a noisy simulated time-series at a sampling frequency of 200 Hz comprising two sine components of frequency 5 Hz and 20 Hz, respectively. The lower left panel depicts the power of all returned discrete Fourier transform coefficients centred at zero, while the lower right panel depicts only the first half of the coefficients.
Study questions
1. Write down the generalized cosine function and discuss the intuitive meaning of its components.
2. Sketch the functions 𝑓(𝑥) ≔ 3 sin(2𝜋𝜔𝑥) for 𝜔 ≔ 1 and 𝑔(𝑥) ≔ 0.5 sin(2𝜋𝜔𝑥) for 𝜔 ≔ 3 by hand.
3. Write down the formulas for 𝜔𝑘 , 𝑎𝑘 , 𝑏𝑘 in the Fourier series representation
f(x) = a₀ + ∑_{k=1}^∞ [a_k cos(2πω_k x) + b_k sin(2πω_k x)]
for a function 𝑓 with period 𝑇 satisfying the Dirichlet conditions.
4. Verbally sketch the derivation of the Fourier series coefficient formulas.
5. Verbally define the orthogonality properties of sine and cosine functions.
6. Verbally explain the idea of “protracting” a periodic function.
7. Verbally explain why the lower integral boundary for a protracted periodic function with period 𝑇 for an integral over an
interval of length 𝑇 is irrelevant.
8. Determine the real and imaginary part, the complex conjugate, the absolute value, and the argument of the complex number
𝑧 ≔ 2 + 3𝑖
9. For 𝑧1 = 1 + 𝑖 and 𝑧2 = 2 − 3𝑖 evaluate the sum 𝑧3 = 𝑧1 + 𝑧2 and the product 𝑧4 = 𝑧1𝑧2̅
10. Write down Euler’s identity. Why does it hold?
11. Write down the definitions of the coefficients c_k and the frequency ω₀ in the complex form of the Fourier series of a periodic function f given by
f(x) = ∑_{k=−∞}^∞ c_k exp(ikω₀x)
in terms of the coefficients a_k, b_k of the real form of the Fourier series
f(x) = a₀ + ∑_{k=1}^∞ [a_k cos(kω₀x) + b_k sin(kω₀x)]
and the period T of f.
12. Let the real form of the Fourier series of a periodic function f be given by
f(x) = a₀ + ∑_{k=1}^∞ [a_k cos(kω₀x) + b_k sin(kω₀x)]
and the polar form of the Fourier series be given by
f(x) = d₀ + ∑_{k=1}^∞ d_k cos(kω₀x − φ_k)
Express the polar form coefficients d_k, k = 0, 1, 2, … and phase angles φ_k, k = 1, 2, … in terms of the real form coefficients a_k, k = 0, 1, 2, … and b_k, k = 1, 2, …
Study Question Answers
1. The generalized cosine function is given by f: ℝ → ℝ, x ↦ f(x) ≔ a cos(2πωx − φ). Here, a ∈ ℝ is an “amplitude coefficient”, which scales the function to vary between −a and a, rather than between −1 and 1 as the cosine does. ω is a “circular frequency” term; the higher this term, the higher the frequency of the generalized cosine function. More specifically, the generalized cosine function repeats itself every 1/ω units of the x-axis, i.e., it is periodic with period T = 1/ω. The smaller the period, the higher the frequency. φ is a “phase angle”, which describes how much the standard cosine function is shifted on the x-axis (to the right).
2. The important points for the sketch are that (1.) the function 𝑓 varies between −3 and 3 and performs a full revolution every 1/𝜔 = 1/1 = 1 𝑥-units, and (2.) that, equivalently, 𝑔 varies between −0.5 and 0.5 and performs a full revolution every 1/𝜔 = 1/3 𝑥-units, or in other words, performs three revolutions every 𝑥-unit.
3. The frequencies are given by 𝜔𝑘 = 𝑘/𝑇 and the coefficients are given by
𝑎0 = (2/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) 𝑑𝑥, 𝑎𝑘 = (2/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) cos(2𝜋(𝑘/𝑇)𝑥) 𝑑𝑥 and 𝑏𝑘 = (2/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) sin(2𝜋(𝑘/𝑇)𝑥) 𝑑𝑥
Note that with the convention 𝑓(𝑥) = (1/2)𝑎0 + ∑_{𝑘=1}^{∞}(𝑎𝑘 cos(𝑘𝜔0𝑥) + 𝑏𝑘 sin(𝑘𝜔0𝑥)), the constant coefficient 𝑎0 carries the same 2/𝑇 prefactor as the remaining coefficients.
4. The Fourier series coefficient formulas can be derived by (1.) substituting the frequencies 𝜔𝑘 = 𝑘/𝑇 in the Fourier series, (2.) multiplying the Fourier series representation with the cosine or sine function of a chosen frequency for the cosine or sine coefficients, respectively, (3.) subsequently integrating from −𝑇/2 to 𝑇/2, and finally (4.) solving for the coefficients.
5. The orthogonality properties of sine and cosine functions correspond to the following three statements: (1) The integral of the product of two sine or two cosine functions with different circular frequencies 𝑘/𝑇 and 𝑗/𝑇 (𝑘 ≠ 𝑗, 𝑘, 𝑗 ∈ ℕ) over an interval of length 𝑇 is zero. (2.) The integral of the product of two sine or two cosine functions with the same circular frequencies 𝑘/𝑇 (𝑘 ∈ ℕ) over an interval of length 𝑇 is 𝑇/2. (3) The integral of the product of a sine and a cosine function with different circular frequencies 𝑘/𝑇 and 𝑗/𝑇 (𝑘 ≠ 𝑗, 𝑘, 𝑗 ∈ ℕ) , or the same circular frequency 𝑘/𝑇 (𝑘 ∈ ℕ) over an interval of length 𝑇 is zero.
6. A non-periodic function 𝑓 on a finite domain 𝐷 ⊂ ℝ can be rendered a periodic function on the infinite domain ℝ by putting many identical copies of 𝑓 next to each other.
7. The lower integral boundary for an integral of a protracted function with period 𝑇 for an integral on an interval of length 𝑇 is irrelevant, because the part of the function that is not integrated at the lower end due to the integral boundary shift is integrated at the upper end due to the periodicity of the function.
8. The real part of 𝑧 ≔ 2 + 3𝑖 is 𝑅𝑒{𝑧} = 2 and the imaginary part is 𝐼𝑚{𝑧} = 3. The complex conjugate is given by 𝑧̅ = 2 − 3𝑖, the absolute value is given by 𝑅 = √(2² + 3²) = √13, and the argument is given by 𝜃 = arctan(3/2).
9. We have
𝑧3 = 𝑧1 + 𝑧2 = 1 + 𝑖 + 2 − 3𝑖 = 3 − 2𝑖
and, using the complex conjugate 𝑧2̅ = 2 + 3𝑖,
𝑧4 = 𝑧1𝑧2̅ = (1 + 𝑖)(2 + 3𝑖) = 1 ⋅ 2 + 1 ⋅ 3𝑖 + 𝑖 ⋅ 2 + 𝑖 ⋅ 3𝑖 = 2 + 3𝑖 + 2𝑖 − 3 = −1 + 5𝑖
10. Euler’s identity is given by exp(𝑖𝑥) = cos(𝑥) + 𝑖 sin(𝑥) and it holds due to the series definitions of the exponential, cosine, and sine functions, and the fact that 𝑖² = −1.
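Answers 8 to 10 can be cross-checked with Python's built-in complex numbers; the following is a quick sketch of ours, not part of the original text:

```python
import cmath
import math

# Answer 8: z = 2 + 3i
z = 2 + 3j
z_conj = z.conjugate()        # 2 - 3i
R = abs(z)                    # absolute value sqrt(2^2 + 3^2) = sqrt(13)
theta = cmath.phase(z)        # argument arctan(3/2)

# Answer 9: sum, and product with the complex conjugate of z2
z1, z2 = 1 + 1j, 2 - 3j
z3 = z1 + z2                  # 3 - 2i
z4 = z1 * z2.conjugate()      # (1 + i)(2 + 3i) = -1 + 5i

# Answer 10: Euler's identity exp(ix) = cos(x) + i sin(x), checked at a sample point
x = 0.7
lhs = cmath.exp(1j * x)
rhs = math.cos(x) + 1j * math.sin(x)
```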
11. The coefficients in the complex form of the Fourier series are given in terms of the real form coefficients 𝑎0, 𝑎1, 𝑎2, …, 𝑏1, 𝑏2, … by
𝑐𝑘 ≔ (1/2)(𝑎−𝑘 + 𝑖𝑏−𝑘) for 𝑘 < 0, 𝑐0 ≔ (1/2)𝑎0, and 𝑐𝑘 ≔ (1/2)(𝑎𝑘 − 𝑖𝑏𝑘) for 𝑘 > 0,
and the fundamental frequency 𝜔0 is given in terms of the period 𝑇 of the function 𝑓 by 𝜔0 = 2𝜋/𝑇.
12. The coefficients of the polar form of the Fourier series in terms of the real form coefficients are given by 𝑑0 = (1/2)𝑎0 and 𝑑𝑘 = √(𝑎𝑘² + 𝑏𝑘²), and the phase angles are given by 𝜙𝑘 = arctan(𝑏𝑘/𝑎𝑘) for 𝑘 = 1,2,…
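The real-to-polar conversion of question 12 can be sketched in Python. One implementation detail of ours: the sketch uses atan2(b_k, a_k) instead of a plain arctan(b_k/a_k), so that the phase angle also comes out correctly when a_k ≤ 0; the conventions assumed are those of the series in question 12.

```python
import math

def real_to_polar(a, b):
    """Convert real Fourier coefficients a = [a0, a1, a2, ...] and b = [b1, b2, ...]
    to polar form coefficients d = [d0, d1, ...] and phase angles phi = [phi1, ...],
    assuming the conventions
      f(x) = a0/2 + sum_k (a_k cos(k w0 x) + b_k sin(k w0 x))
           = d0 + sum_k d_k cos(k w0 x - phi_k)."""
    d = [a[0] / 2.0] + [math.hypot(a[k], b[k - 1]) for k in range(1, len(a))]
    phi = [math.atan2(b[k - 1], a[k]) for k in range(1, len(a))]
    return d, phi
```

For example, a_1 = 3, b_1 = 4 yields d_1 = 5 and phi_1 = atan2(4, 3), and 3 cos(θ) + 4 sin(θ) = 5 cos(θ − phi_1) holds for all θ.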
An Introduction to Numerical Optimization
(1) Gradient methods
The necessary condition for a local extremal point (i.e., the location of a local maximum or minimum in the domain of a function) is that the first derivative of the function under study vanishes. For multivariate real-valued functions
𝑓: Θ ⊆ ℝ^𝑝 → ℝ, 𝜃 ↦ 𝑓(𝜃) (1)
this corresponds to the condition that ∇𝑓(𝜃∗) = 0 ∈ ℝ^𝑝 at the location of an extremal point 𝜃∗ ∈ ℝ^𝑝. The gradient ∇𝑓 of a multivariate real-valued function evaluated at a point of its domain points into the direction of steepest ascent of the function. To find a local minimum of the function, one can thus take a step proportional to the negative gradient, in which case one performs a gradient descent. Alternatively, to find a local maximum of the function, one can take a step proportional to the (positive) gradient. While a fairly obvious method, the length of the step is crucial for the success of a gradient method. This length is usually referred to as the step-size. In its simplest form, the step-size is set to a constant 𝜅 > 0. In Table 1, we describe a gradient descent scheme that tests on each iteration whether the necessary condition of a vanishing gradient is fulfilled for the current iterand.
Initialization
0. Define a starting point 𝜃^(0) ∈ ℝ^𝑝 and set 𝑘 ≔ 0. If ∇𝑓(𝜃^(0)) = 0, stop! 𝜃^(0) is a zero of ∇𝑓. If not, proceed to iterations.
Until Convergence
1. Set 𝜃^(𝑘+1) = 𝜃^(𝑘) − 𝜅 ∇𝑓(𝜃^(𝑘))
2. If ∇𝑓(𝜃^(𝑘+1)) = 0, stop! 𝜃^(𝑘+1) is a zero of ∇𝑓. If not, go to 3.
3. Set 𝑘 ≔ 𝑘 + 1 and go to 1.
Table 1. A fixed step-size gradient descent scheme for finding an extremal point of a multivariate real-valued function.
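The scheme of Table 1 is easily sketched in code. The following is a minimal Python sketch of ours, not from the text: in practice the exact condition ∇𝑓(𝜃) = 0 is replaced by a tolerance on the gradient norm, and the example function, gradient, and step-size are illustrative choices.

```python
def gradient_descent(grad, theta0, kappa=0.1, delta=1e-8, max_iter=10_000):
    """Fixed step-size gradient descent: theta^(k+1) = theta^(k) - kappa * grad f(theta^(k)).
    Stops once the gradient norm falls below delta (numerical stand-in for a vanishing gradient)."""
    theta = list(theta0)
    for _ in range(max_iter):
        g = grad(theta)
        if sum(gi * gi for gi in g) ** 0.5 < delta:
            break
        theta = [ti - kappa * gi for ti, gi in zip(theta, g)]
    return theta

# Illustrative example: f(theta) = (theta_1 - 1)^2 + 2*(theta_2 + 3)^2, minimum at (1, -3)
grad_f = lambda t: [2.0 * (t[0] - 1.0), 4.0 * (t[1] + 3.0)]
theta_min = gradient_descent(grad_f, [0.0, 0.0])
```

For a maximum, one would step along the positive gradient instead; too large a 𝜅 makes the iteration diverge, too small a 𝜅 makes it very slow.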
(2) The Newton-Raphson method
In brief, the Newton-Raphson method is an algorithm that allows for finding zeros (or “roots”) of
functions, and in the context of numerical optimization is used to find zeros of the derivatives of a target
function. In the following, we first consider the univariate Newton-Raphson method from two perspectives: (1) as a method of finding the root of the derivative of a function of interest, and (2) as the minimization of a quadratic approximation to the function of interest. Both perspectives are based on the notion of Taylor approximations, which the interested reader may find useful to review. Finally, we extend the concept of the Newton-Raphson method to the multivariate case.
If we assume that a univariate function
𝑓: ℝ → ℝ, 𝜃 ↦ 𝑓(𝜃) (1)
of interest is differentiable and that at a zero 𝜃∗ of its first derivative, 𝑓′(𝜃∗) = 0, its second derivative is non-zero, i.e., 𝑓′′(𝜃∗) ≠ 0, then 𝑓 has a maximum or a minimum at 𝜃∗. The idea of the Newton-Raphson method is to first guess an initial value 𝜃^(0) and set an iteration index 𝑘 to zero, such that 𝑘 ≔ 0 and thus 𝜃^(𝑘) = 𝜃^(0). To find an approximate value for the zero of 𝑓′, the derivative 𝑓′ is then approximated using a first-order Taylor series around the current value 𝜃^(𝑘):
𝑓′(𝜃) ≈ 𝑓̃′(𝜃) ≔ 𝑓′(𝜃^(𝑘)) + 𝑓′′(𝜃^(𝑘))(𝜃 − 𝜃^(𝑘)) (2)
In brief, equation (2) states that the value of the derivative 𝑓′ at the location 𝜃 ∈ ℝ is approximated by the sum of (1) the value of the derivative 𝑓′ at the location 𝜃^(𝑘) ∈ ℝ and (2) the product of the rate of change of the function 𝑓′ at 𝜃^(𝑘) and the distance (𝜃 − 𝜃^(𝑘)) between the location of interest 𝜃 ∈ ℝ and the current location 𝜃^(𝑘) (see Figure 1 for an illustration).
Figure 1. Visualization of the univariate Newton-Raphson method for two iterations. Aspects of the first iteration are colored red, aspects of the second iteration are colored blue. For details, see the main text.
The Newton-Raphson method then proposes to approximate the zero of 𝑓′ by the zero of 𝑓̃′, which is readily analytically evaluated:
𝑓̃′(𝜃) = 0 ⇔ 𝑓′(𝜃^(𝑘)) + 𝑓′′(𝜃^(𝑘))(𝜃 − 𝜃^(𝑘)) = 0 (3)
⇔ 𝑓′′(𝜃^(𝑘))(𝜃 − 𝜃^(𝑘)) = −𝑓′(𝜃^(𝑘))
⇔ 𝜃 − 𝜃^(𝑘) = −𝑓′(𝜃^(𝑘))/𝑓′′(𝜃^(𝑘))
⇔ 𝜃 = 𝜃^(𝑘) − 𝑓′(𝜃^(𝑘))/𝑓′′(𝜃^(𝑘))
Note that the right-hand side of the last line in (3) only involves known terms, as 𝜃^(𝑘) has been defined above, and the first and second derivatives of 𝑓 are known (provided the function 𝑓 is accessible to analytical differentiation, or the derivatives can be evaluated numerically). The full Newton-Raphson method then takes the form of the following iterative algorithm
Initialization
0. Define a starting point 𝜃^(0) ∈ ℝ and set 𝑘 ≔ 0. If 𝑓′(𝜃^(0)) = 0, stop! 𝜃^(0) is a zero of 𝑓′. If not, proceed to iterations.
Until Convergence
1. Set 𝜃^(𝑘+1) ≔ 𝜃^(𝑘) − 𝑓′(𝜃^(𝑘))/𝑓′′(𝜃^(𝑘))
2. If 𝑓′(𝜃^(𝑘+1)) = 0, stop! 𝜃^(𝑘+1) is a zero of 𝑓′. If not, go to 3.
3. Set 𝑘 ≔ 𝑘 + 1 and go to 1.
Table 2. The Newton-Raphson method for finding an extremal point of a univariate real-valued function.
Here, “convergence” denotes the case that a zero 𝜃^(𝑘) of 𝑓′ has been found, or, alternatively, that a value of 𝜃^(𝑘) has been found such that the distance between 𝑓′(𝜃^(𝑘)) and zero is smaller than some small value 𝛿 > 0. From the theory of nonlinear optimization it is known that the Newton-Raphson method converges from any starting point 𝜃^(0) as long as 𝑓′ is twice differentiable, is a convex function, and does, in fact, have a zero.
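Table 2 translates directly into code. A minimal Python sketch of ours, with the exact stopping condition 𝑓′(𝜃) = 0 replaced by |𝑓′(𝜃)| < 𝛿 as discussed above; the example function is an illustrative choice:

```python
def newton_raphson(fprime, fsecond, theta0, delta=1e-10, max_iter=100):
    """Univariate Newton-Raphson: theta^(k+1) = theta^(k) - f'(theta^(k)) / f''(theta^(k)),
    stopping once |f'(theta^(k))| < delta."""
    theta = theta0
    for _ in range(max_iter):
        if abs(fprime(theta)) < delta:
            break
        theta = theta - fprime(theta) / fsecond(theta)
    return theta

# Illustrative example: f(theta) = theta^4 - 2*theta^2 with f'(theta) = 4*theta^3 - 4*theta
# and f''(theta) = 12*theta^2 - 4; started at theta = 2, the iteration converges to the
# local minimum of f at theta = 1, a zero of f'.
theta_star = newton_raphson(lambda t: 4 * t**3 - 4 * t, lambda t: 12 * t**2 - 4, 2.0)
```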
Rather than a method for finding a zero of the first derivative of a function 𝑓 of interest, the Newton-Raphson method may also be interpreted as a maximization/minimization of a second-order approximation to the original function 𝑓 of interest. The second-order Taylor approximation of a twice differentiable, real-valued, univariate function 𝑓 at a location 𝜃^(𝑘) ∈ ℝ is given by
𝑓(𝜃) ≈ 𝑓̃(𝜃) ≔ 𝑓(𝜃^(𝑘)) + 𝑓′(𝜃^(𝑘))(𝜃 − 𝜃^(𝑘)) + (1/2)𝑓′′(𝜃^(𝑘))(𝜃 − 𝜃^(𝑘))² (4)
Note the differences to equation (2): firstly, the approximation is formulated here for 𝑓(𝜃), while in (2) it was formulated for 𝑓′(𝜃), and secondly, the approximation 𝑓̃(𝜃) also includes the second-order term (1/2)𝑓′′(𝜃^(𝑘))(𝜃 − 𝜃^(𝑘))². To find an extremal point of 𝑓̃(𝜃) in lieu of 𝑓(𝜃), the usual approach familiar from analytical optimization is chosen: the derivative of 𝑓̃(𝜃) is evaluated and set to zero.
Here, we have for the derivative of 𝑓̃(𝜃):
𝑓̃′(𝜃) = 𝑑/𝑑𝜃 𝑓̃(𝜃) (5)
= 𝑑/𝑑𝜃 (𝑓(𝜃^(𝑘)) + 𝑓′(𝜃^(𝑘))(𝜃 − 𝜃^(𝑘)) + (1/2)𝑓′′(𝜃^(𝑘))(𝜃 − 𝜃^(𝑘))²)
= 𝑑/𝑑𝜃 𝑓(𝜃^(𝑘)) + 𝑑/𝑑𝜃 (𝑓′(𝜃^(𝑘))𝜃) − 𝑑/𝑑𝜃 (𝑓′(𝜃^(𝑘))𝜃^(𝑘)) + 𝑑/𝑑𝜃 ((1/2)𝑓′′(𝜃^(𝑘))𝜃²) − 𝑑/𝑑𝜃 (𝑓′′(𝜃^(𝑘))𝜃𝜃^(𝑘)) + 𝑑/𝑑𝜃 ((1/2)𝑓′′(𝜃^(𝑘))(𝜃^(𝑘))²)
= 𝑓′(𝜃^(𝑘)) + 𝑓′′(𝜃^(𝑘))𝜃 − 𝑓′′(𝜃^(𝑘))𝜃^(𝑘)
= 𝑓′(𝜃^(𝑘)) + 𝑓′′(𝜃^(𝑘))(𝜃 − 𝜃^(𝑘))
Setting 𝑓̃′(𝜃) to zero then results in
𝑓̃′(𝜃) = 0 ⇔ 𝑓′(𝜃^(𝑘)) + 𝑓′′(𝜃^(𝑘))(𝜃 − 𝜃^(𝑘)) = 0 (6)
which is equivalent to equation (3). The discussion following (3) is thus applicable also from the perspective of a second-order approximation to 𝑓.
So far, we have only considered the numerical optimization of univariate real-valued functions of the form
𝑓: ℝ → ℝ, 𝜃 ↦ 𝑓(𝜃) (7)
We next generalize the Newton-Raphson method to the case of multivariate, real-valued functions
𝑓: ℝ^𝑝 → ℝ, 𝜃 ↦ 𝑓(𝜃) (8)
Note that the output value of these functions of interest is still a scalar value.
Above, we have seen that the Newton-Raphson method requires the first and second derivatives of the function of interest. The multivariate extension of the Newton-Raphson method takes exactly the same form as the univariate case, if one replaces the notion of the function’s first-order derivative 𝑓′(𝜃^(𝑘)) by its gradient ∇𝑓(𝜃^(𝑘)) and the notion of the function’s second-order derivative 𝑓′′(𝜃^(𝑘)) by its Hessian 𝐻𝑓(𝜃^(𝑘)) = ∇²𝑓(𝜃^(𝑘)). Noting that division by a scalar corresponds to multiplication with the inverse of a matrix then yields the multivariate form of the Newton-Raphson method as shown in Table 3.
Initialization
0. Define a starting point 𝜃^(0) ∈ ℝ^𝑝 and set 𝑘 ≔ 0. If ∇𝑓(𝜃^(0)) = 0, stop! 𝜃^(0) is a zero of ∇𝑓. If not, proceed to iterations.
Until Convergence
1. Set 𝜃^(𝑘+1) = 𝜃^(𝑘) − (∇²𝑓(𝜃^(𝑘)))⁻¹ ∇𝑓(𝜃^(𝑘))
2. If ∇𝑓(𝜃^(𝑘+1)) = 0, stop! 𝜃^(𝑘+1) is a zero of ∇𝑓. If not, go to 3.
3. Set 𝑘 ≔ 𝑘 + 1 and go to 1.
Table 3. The Newton-Raphson method for finding an extremal point of a multivariate real-valued function.
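For 𝑝 = 2, the update of Table 3 can be sketched without a linear algebra library by inverting the 2×2 Hessian by hand (cf. the section on inversion of small matrices). The quadratic test function below is an illustrative choice of ours, for which the Newton-Raphson method converges in a single step:

```python
def newton_2d(grad, hess, theta0, delta=1e-10, max_iter=100):
    """Multivariate Newton-Raphson for p = 2:
    theta^(k+1) = theta^(k) - (H_f(theta^(k)))^{-1} grad f(theta^(k))."""
    t1, t2 = theta0
    for _ in range(max_iter):
        g1, g2 = grad(t1, t2)
        if (g1 * g1 + g2 * g2) ** 0.5 < delta:
            break
        (a, b), (c, d) = hess(t1, t2)
        det = a * d - b * c             # inverse of a 2x2 matrix by hand
        s1 = (d * g1 - b * g2) / det    # first component of H^{-1} grad
        s2 = (-c * g1 + a * g2) / det   # second component of H^{-1} grad
        t1, t2 = t1 - s1, t2 - s2
    return t1, t2

# Illustrative example: f(theta) = theta_1^2 + theta_1*theta_2 + 2*theta_2^2, minimum at (0, 0)
grad_f = lambda t1, t2: (2.0 * t1 + t2, t1 + 4.0 * t2)
hess_f = lambda t1, t2: ((2.0, 1.0), (1.0, 4.0))
theta_star = newton_2d(grad_f, hess_f, (3.0, -2.0))
```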
Foundations of Probabilistic Models
Probability Theory
In this Section we will start out with a pragmatic approach to probabilistic concepts and introduce mathematically more rigorous concepts at a later stage. It is important to keep in mind that probability theory is a mathematical model for real-world phenomena that appear unpredictable (i.e., random) to humans. In PMFN we describe this theory, not the real world itself. In other words, the statement “A fair coin comes up heads with a probability of 0.5” is a statement within probability theory and is to be understood as a definition within this theory. Whether fair coins exist in the real world, and whether in the limit of an infinite number of observations of their tossing behaviour they actually come up heads in half of the cases, is a different question, and is not addressed here. The interested reader may find fine-grained discussions of the philosophical underpinnings of probability theory in works such as [Jaynes 2003] and [DeFinetti 1981].
(1) Random variables
Random variables can be introduced in at least two different ways: first, in an intuitive, mathematically highly imprecise way, and second, in a non-intuitive, mathematically precise way, which uses modern measure theory. In the latter sense, random variables are measurable mappings from a probability space into a measurable space. As such, random variables are neither “random” nor are they “variables”; they are deterministic mappings or functions. We introduce the measure-theoretic foundations of probability theory below; for now, however, it is helpful to use the following intuitive concept of a random variable:
“𝑥 is a random variable, if it takes on random values” (1)
Although this concept may almost be considered a tautology, it is helpful nonetheless. It tells us that a random variable 𝑥 can take on different values. For a given random variable 𝑥, it is always helpful to think about what kind of values it can take on. For example, if the random variable 𝑥 is used to describe a coin toss, we may choose the values “heads” and “tails” as the values that 𝑥 can take on. Alternatively, if 𝑥 is used to describe a die, we may choose the values {1,2,3,4,5,6} as the values it can take on.
Consider for the moment a random variable that can take on 𝑛 ∈ ℕ different values. For each value 𝑥𝑖 (𝑖 = 1,…,𝑛) that 𝑥 can take on, we can define a probability that 𝑥 takes on the value 𝑥𝑖. The probability that 𝑥 takes on value 𝑥𝑖 may be written as
𝑝(𝑥 = 𝑥𝑖) (2)
(2) should be read as “the probability that the random variable 𝑥 takes on the value 𝑥𝑖”. Probabilities of
random variables have two well-known properties: they lie between 0 and 1, and for all possible values that
𝑥 can take on, they sum up to 1. For example, if we believe that all sides of a six-sided die have the same
probability of being the result of a throw of the die, we could write down the following:
𝑝(𝑥 = 𝑖) = 1/6 for 𝑖 = 1,…,6 (3)
In (3) we allocate to each value that 𝑥 can take on a probability, namely 1/6. Because there are six possible values that 𝑥 can take on (corresponding to the sides of the six-sided die), the sum of these probabilities is 1.
More abstractly, we have specified a mapping from the “outcome space” of 𝑥 to the interval [0,1] in
the form
ℕ6 → [0,1], 𝑥𝑖 ↦ 𝑝(𝑥 = 𝑥𝑖) = 1/6 (4)
In general, in probability theory, probabilities, i.e., numbers in the interval [0,1], are assigned to the outcomes of random variables.
(2) Joint and marginal probability distributions
A very important concept we require is the notion of joint probabilities. To this end, we first
consider two random variables 𝑥 and 𝑦. Each of the random variables may take on a set of different values
𝑥1, … , 𝑥𝑛 and 𝑦1, … , 𝑦𝑚 with 𝑛,𝑚 ∈ ℕ. Joint probabilities then describe the probabilities that 𝑥 and 𝑦
simultaneously take on values 𝑥𝑖 and 𝑦𝑗 (𝑖 = 1,… , 𝑛, 𝑗 = 1,… ,𝑚).
To obtain an intuition for joint probabilities, consider the following example: Assume that 𝑥 can take
on 3 different values 𝑥1 = 1, 𝑥2 = 2 and 𝑥3 = 3. Further assume that 𝑦 can take on 4 different values
𝑦1 = 1, 𝑦2 = 2, 𝑦3 = 3, 𝑦4 = 4. Then there are 3 × 4 = 12 different combinations of the values of 𝑥 and 𝑦:
(𝑥 = 𝑥1, 𝑦 = 𝑦1) (𝑥 = 𝑥1, 𝑦 = 𝑦2) (𝑥 = 𝑥1, 𝑦 = 𝑦3) (𝑥 = 𝑥1, 𝑦 = 𝑦4)
(𝑥 = 𝑥2, 𝑦 = 𝑦1) (𝑥 = 𝑥2, 𝑦 = 𝑦2) (𝑥 = 𝑥2, 𝑦 = 𝑦3) (𝑥 = 𝑥2, 𝑦 = 𝑦4)
(𝑥 = 𝑥3, 𝑦 = 𝑦1) (𝑥 = 𝑥3, 𝑦 = 𝑦2) (𝑥 = 𝑥3, 𝑦 = 𝑦3) (𝑥 = 𝑥3, 𝑦 = 𝑦4)
(1)
A joint probability distribution now allocates a probability, i.e., a number in the interval [0,1] to each
possible combination of the values for 𝑥 and 𝑦. As in the case for single random variables (also called
marginal random variables, because they can be considered the “marginal projections” of multivariate
random variables), all these probabilities must sum up to one. A joint probability distribution for the
example above may take the following form
𝑝(𝑥 = 1, 𝑦 = 1) = 2/12 𝑝(𝑥 = 1, 𝑦 = 2) = 0 𝑝(𝑥 = 1, 𝑦 = 3) = 0 𝑝(𝑥 = 1, 𝑦 = 4) = 1/12
𝑝(𝑥 = 2, 𝑦 = 1) = 0 𝑝(𝑥 = 2, 𝑦 = 2) = 3/12 𝑝(𝑥 = 2, 𝑦 = 3) = 1/12 𝑝(𝑥 = 2, 𝑦 = 4) = 2/12
𝑝(𝑥 = 3, 𝑦 = 1) = 1/12 𝑝(𝑥 = 3, 𝑦 = 2) = 0 𝑝(𝑥 = 3, 𝑦 = 3) = 2/12 𝑝(𝑥 = 3, 𝑦 = 4) = 0
(2)
If we sum up the probabilities from all cells, we obtain
∑_{𝑖=1}^{3} ∑_{𝑗=1}^{4} 𝑝(𝑥 = 𝑥𝑖, 𝑦 = 𝑦𝑗) = 2/12 + 0 + 0 + 1/12 + 0 + 3/12 + 1/12 + 2/12 + 1/12 + 0 + 2/12 + 0 = 12/12 = 1 (3)
and thus, the probabilities sum up to one, as required for probability distributions.
We may also sum up the probabilities row-wise, i.e., over the different values that 𝑦 can take on. We obtain the following
𝑝(𝑥 = 1) ≔ ∑_{𝑗=1}^{4} 𝑝(𝑥 = 1, 𝑦 = 𝑦𝑗) = 2/12 + 0 + 0 + 1/12 = 3/12
𝑝(𝑥 = 2) ≔ ∑_{𝑗=1}^{4} 𝑝(𝑥 = 2, 𝑦 = 𝑦𝑗) = 0 + 3/12 + 1/12 + 2/12 = 6/12
𝑝(𝑥 = 3) ≔ ∑_{𝑗=1}^{4} 𝑝(𝑥 = 3, 𝑦 = 𝑦𝑗) = 1/12 + 0 + 2/12 + 0 = 3/12
(4)
The probabilities of 𝑥 taking on the values 1, 2, or 3 obtained in this manner sum up to one again
∑_{𝑖=1}^{3} 𝑝(𝑥 = 𝑥𝑖) = 3/12 + 6/12 + 3/12 = 12/12 = 1 (5)
This is no coincidence, but reflects the fact that if we have specified a joint probability distribution of two random variables, or in other words a bivariate distribution, we have also specified the marginal distributions of the two univariate random variables. Above, we have seen the marginal distribution of 𝑥. The marginal distribution of 𝑦 is obtained by summing over columns:
𝑝(𝑦 = 1) ≔ ∑_{𝑖=1}^{3} 𝑝(𝑥 = 𝑥𝑖, 𝑦 = 1) = 2/12 + 0 + 1/12 = 3/12
𝑝(𝑦 = 2) ≔ ∑_{𝑖=1}^{3} 𝑝(𝑥 = 𝑥𝑖, 𝑦 = 2) = 0 + 3/12 + 0 = 3/12
𝑝(𝑦 = 3) ≔ ∑_{𝑖=1}^{3} 𝑝(𝑥 = 𝑥𝑖, 𝑦 = 3) = 0 + 1/12 + 2/12 = 3/12
𝑝(𝑦 = 4) ≔ ∑_{𝑖=1}^{3} 𝑝(𝑥 = 𝑥𝑖, 𝑦 = 4) = 1/12 + 2/12 + 0 = 3/12
(6)
Again, the probabilities of 𝑦 taking on the values 1, 2, 3, and 4 sum up to one, as required:
∑_{𝑗=1}^{4} 𝑝(𝑦 = 𝑦𝑗) = 3/12 + 3/12 + 3/12 + 3/12 = 12/12 = 1 (7)
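The row- and column-wise summations above are mechanical enough to be checked in code. A small Python sketch of the example joint distribution using exact fractions (the variable names are ours):

```python
from fractions import Fraction

F = Fraction  # shorthand; all probabilities are given in twelfths
joint = {
    (1, 1): F(2, 12), (1, 2): F(0),     (1, 3): F(0),     (1, 4): F(1, 12),
    (2, 1): F(0),     (2, 2): F(3, 12), (2, 3): F(1, 12), (2, 4): F(2, 12),
    (3, 1): F(1, 12), (3, 2): F(0),     (3, 3): F(2, 12), (3, 4): F(0),
}

total = sum(joint.values())  # all cells sum to one
p_x = {i: sum(joint[i, j] for j in range(1, 5)) for i in range(1, 4)}  # row-wise: marginal of x
p_y = {j: sum(joint[i, j] for i in range(1, 4)) for j in range(1, 5)}  # column-wise: marginal of y
```

p_x then equals the marginal distribution of 𝑥 in (4), and p_y the marginal distribution of 𝑦 in (6).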
(3) Conditional Probabilities
Consider the following joint probability distribution
𝑝(𝑥, 𝑦) (1)
over the random quantities 𝑥 and 𝑦. As discussed above, a joint probability distribution of two random variables allocates to each possible combination of values that the two random variables can take on a probability mass (for discrete random variables) or a probability density (for continuous random variables).
For example, for discrete random variables, the probability that 𝑥 = 1 and 𝑦 = 2 may be specified by (1) as
𝑝(𝑥 = 1, 𝑦 = 2) = 0.1 (2)
If we have specified a joint distribution, we may ask the following questions: (1) What is the probability of
each marginal variable (i.e., either 𝑥 or 𝑦) to take on specific values, irrespective of the values of the other
marginal variable? (2) If we know the value of one marginal variable, say 𝑥 = 1, what is the probability for
the other marginal variable to take on a specific value? The answer to the first question relates to the notion
of marginal probability distributions as discussed above, while the answer to the second question relates to
the notion of conditional probability distributions.
As an example, consider the following joint probability mass function
𝑝(𝑥 = 1, 𝑦 = 1) = 0.1 𝑝(𝑥 = 1, 𝑦 = 2) = 0.3 𝑝(𝑥 = 1, 𝑦 = 3) = 0.2
𝑝(𝑥 = 2, 𝑦 = 1) = 0.2 𝑝(𝑥 = 2, 𝑦 = 2) = 0.1 𝑝(𝑥 = 2, 𝑦 = 3) = 0.1
Table 1. A probability mass function over two random variables 𝑥 and 𝑦.
Here, 𝑥 can take on the values 1 and 2 and 𝑦 can take on the values 1, 2, and 3. Summing probabilities over all possible combinations of values that the random variables can take on yields 1, as required for probability distributions. Consider now the marginal variable 𝑥. Summing over the values of 𝑦 as discussed above yields the probability distribution 𝑝(𝑥), which is defined by
𝑝(𝑥 = 1) = 0.6 and 𝑝(𝑥 = 2) = 0.4 (1)
and is referred to as the marginal probability distribution of 𝑥. Using the same summing procedure with respect to 𝑥 to obtain the marginal distribution 𝑝(𝑦), we get
𝑝(𝑦 = 1) = 0.3, 𝑝(𝑦 = 2) = 0.4 and 𝑝(𝑦 = 3) = 0.3 (2)
We now consider the second question posed above. For the probability mass function example of Table 1, what is the probability that 𝑦 = 2, if we know that 𝑥 = 1? Looking into the table, one may be tempted to say 0.3, because we have 𝑝(𝑥 = 1, 𝑦 = 2) = 0.3. However, note that 𝑝(𝑥 = 1, 𝑦 = 1) = 0.1 and 𝑝(𝑥 = 1, 𝑦 = 3) = 0.2. Apparently, the values 0.1, 0.3, and 0.2 allocated to 𝑦 = 1, 𝑦 = 2, and 𝑦 = 3 do not sum to 1 and hence do not define a probability distribution over 𝑦. Nevertheless, the intuition that the probability of 𝑦 taking on the value 2, if we know that 𝑥 = 1, should be higher than the probability of 𝑦 taking on the values 1 and 3, respectively, as indicated in the table, is perfectly valid. Based on the fact that, over all combinations of values of 𝑥 and 𝑦, the above probability mass function sums to one, it turns out that the values specified in the table for 𝑝(𝑥 = 1, 𝑦 = 1), 𝑝(𝑥 = 1, 𝑦 = 2) and 𝑝(𝑥 = 1, 𝑦 = 3) can be turned into a proper probability distribution over 𝑦, if we divide them by the marginal probability of 𝑥 taking on the value 1, i.e., 𝑝(𝑥 = 1) = 0.6. Intuitively, this step merely corresponds to “normalizing” (i.e., setting to one) the marginal probability 𝑝(𝑥 = 1) = 0.6:
𝑝(𝑥 = 1)/0.6 = 0.6/0.6 = 1 ⇔ (∑_{𝑖=1}^{3} 𝑝(𝑥 = 1, 𝑦 = 𝑖))/0.6 = (0.1 + 0.3 + 0.2)/0.6 = 0.1/0.6 + 0.3/0.6 + 0.2/0.6 (3)
We then have
𝑝(𝑥 = 1, 𝑦 = 1)/𝑝(𝑥 = 1) = 0.1/0.6 = 1/6, 𝑝(𝑥 = 1, 𝑦 = 2)/𝑝(𝑥 = 1) = 0.3/0.6 = 3/6 and 𝑝(𝑥 = 1, 𝑦 = 3)/𝑝(𝑥 = 1) = 0.2/0.6 = 2/6 (4)
and these values sum to 1 and thus represent a probability distribution over 𝑦. We can now answer the
question posed above: The probability that 𝑦 takes on the value 2, if we know that 𝑥 takes on the value 1 is
𝑝(𝑦 = 2|𝑥 = 1) = 0.5 (5)
The “|” in the statement above should be read as “given that”, such that the whole statement reads in verbal
terms: “The probability that 𝑦 takes on the value 2 given that 𝑥 takes on the value 1 is 0.5”. The statement
above is a statement of conditional probability, i.e., it describes the probability of 𝑦 taking on a specific value
conditioned on the fact that 𝑥 takes on a specific value. Based on how we computed the probabilities above,
the general specification of a conditional probability for 𝑦 taking on a value 𝑦∗ and 𝑥 taking on a value 𝑥∗ is
given by
𝑝(𝑦 = 𝑦∗|𝑥 = 𝑥∗) = 𝑝(𝑥 = 𝑥∗, 𝑦 = 𝑦∗)/𝑝(𝑥 = 𝑥∗) (6)
Because the rule above holds for any values 𝑦∗ and 𝑥∗ that the random variables 𝑦 and 𝑥 may take on, the specific arguments are usually suppressed, leading to the general definition of the conditional probability distribution of 𝑦 given 𝑥
𝑝(𝑦|𝑥) = 𝑝(𝑥, 𝑦)/𝑝(𝑥) (7)
The statement above says that to compute the conditional probability of 𝑦 taking on an arbitrary value in its
outcome space 𝒴, given that 𝑥 takes on an arbitrary value in its outcome space 𝒳, we have to look up the
joint probability 𝑝(𝑥, 𝑦) of 𝑥 and 𝑦 taking on these values, and divide this probability by the marginal
probability of 𝑥 taking on the specified value. As above, we may evaluate the marginal probability
distribution of 𝑥 by marginalizing over 𝑦. It is very important to note that the conditional probability distribution 𝑝(𝑦|𝑥 = 𝑥∗) and the joint distribution 𝑝(𝑥 = 𝑥∗, 𝑦), for a given value 𝑥∗, are virtually identical: they differ only by the multiplicative constant 1/𝑝(𝑥 = 𝑥∗), which merely renders 𝑝(𝑥 = 𝑥∗, 𝑦) a proper probability distribution over 𝑦. In other words, the relative differences between the probabilities for different values of 𝑦 specified in the joint distribution of 𝑥 and 𝑦 are maintained in the conditional distribution of 𝑦 given a specific value of 𝑥.
So far, we have considered the probability of 𝑦 taking on a specific value, given that 𝑥 takes on a
specific value. Of course we may ask the symmetric question with respect to the probability of 𝑥 taking on a
specific value, given that we know that 𝑦 takes on a prespecified value. In this case, we merely have to
exchange the roles of 𝑦 and 𝑥 in the definition of conditional probability distribution above
𝑝(𝑥|𝑦) = 𝑝(𝑥, 𝑦)/𝑝(𝑦) (8)
Note that there is nothing to exchange with respect to the joint distribution 𝑝(𝑥, 𝑦), because the order of its
arguments is irrelevant.
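The computation that led from Table 1 to 𝑝(𝑦 = 2|𝑥 = 1) = 0.5 can be written out as a small Python sketch of definition (7) (the helper name is ours):

```python
# Joint probability mass function of the example in Table 1
joint = {
    (1, 1): 0.1, (1, 2): 0.3, (1, 3): 0.2,
    (2, 1): 0.2, (2, 2): 0.1, (2, 3): 0.1,
}

def conditional_y_given_x(joint, x_star):
    """p(y | x = x*) = p(x = x*, y) / p(x = x*), where p(x = x*) is obtained by
    marginalizing the joint distribution over y."""
    p_x_star = sum(p for (x, _), p in joint.items() if x == x_star)
    return {y: p / p_x_star for (x, y), p in joint.items() if x == x_star}

cond = conditional_y_given_x(joint, 1)  # division by p(x = 1) = 0.6 normalizes the row
```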
(4) Bayes Theorem
Bayes theorem is a statement about conditional probabilities. In itself, it is thus a mere corollary of the definition of conditional probability and is, by itself, silent on whether probabilities are interpreted as observed frequencies in a large data limit (corresponding to the view of classical statistics) or as degrees of subjective uncertainty (corresponding to the view of the “Bayesian” school of thought). Bayes theorem is readily derived as follows:
From equation (7) of the previous section we have
𝑝(𝑦|𝑥) = 𝑝(𝑥, 𝑦)/𝑝(𝑥) ⇒ 𝑝(𝑥, 𝑦) = 𝑝(𝑥)𝑝(𝑦|𝑥) (1)
if we multiply both sides by 𝑝(𝑥). Likewise, from equation (8) of the previous section we have
𝑝(𝑥|𝑦) = 𝑝(𝑥, 𝑦)/𝑝(𝑦) ⇒ 𝑝(𝑥, 𝑦) = 𝑝(𝑦)𝑝(𝑥|𝑦) (2)
if we multiply both sides by 𝑝(𝑦). We thus have two different ways to write the joint distribution 𝑝(𝑥, 𝑦)
based on the definitions of the conditional probability distributions 𝑝(𝑦|𝑥) and 𝑝(𝑥|𝑦). Of course, the joint
distribution 𝑝(𝑥, 𝑦) is equal to itself. We may thus write
𝑝(𝑥, 𝑦) = 𝑝(𝑥, 𝑦) (3)
Substituting (1) on the left-hand side, and (2) on the right-hand side of the equality above, we obtain
𝑝(𝑥)𝑝(𝑦|𝑥) = 𝑝(𝑦)𝑝(𝑥|𝑦) (4)
If we divide both sides of the above by 𝑝(𝑥), we obtain the conditional probability distribution of 𝑦 given 𝑥
in the following form
𝑝(𝑦|𝑥) = 𝑝(𝑦)𝑝(𝑥|𝑦)/𝑝(𝑥) (5)
Equation (5) above is known as “Bayes Theorem”. Resubstituting 𝑝(𝑥, 𝑦) for 𝑝(𝑦)𝑝(𝑥|𝑦), we see that (5) is identical to equation (7) of the previous section, our initial definition of the conditional probability distribution of 𝑦 given 𝑥
𝑝(𝑦|𝑥) = 𝑝(𝑦)𝑝(𝑥|𝑦)/𝑝(𝑥) = 𝑝(𝑥, 𝑦)/𝑝(𝑥) (6)
In other words, Bayes Theorem is a rule to compute conditional probability distributions based on joint
distributions.
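Since Bayes Theorem is a corollary of the definition of conditional probability, computing 𝑝(𝑦|𝑥) via 𝑝(𝑦)𝑝(𝑥|𝑦)/𝑝(𝑥) must agree with the direct definition 𝑝(𝑥, 𝑦)/𝑝(𝑥). A quick numerical check in Python, reusing the joint distribution of the previous section (names ours):

```python
# Joint distribution of the example of the previous section (Table 1)
joint = {
    (1, 1): 0.1, (1, 2): 0.3, (1, 3): 0.2,
    (2, 1): 0.2, (2, 2): 0.1, (2, 3): 0.1,
}
p_x = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (1, 2)}
p_y = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (1, 2, 3)}

def p_y_given_x_direct(y, x):
    """Direct definition: p(y|x) = p(x, y) / p(x)."""
    return joint[x, y] / p_x[x]

def p_y_given_x_bayes(y, x):
    """Bayes Theorem: p(y|x) = p(y) p(x|y) / p(x), with p(x|y) = p(x, y) / p(y)."""
    return p_y[y] * (joint[x, y] / p_y[y]) / p_x[x]
```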
(5) Independent random variables
In probability theory, the notion of “stochastic independence” refers to the factorization of joint distributions into the products of their marginal distributions. In other words, two random variables 𝑥 and 𝑦 are said to be stochastically independent if and only if their joint probability equals the product of their marginal probabilities
𝑝(𝑥, 𝑦) = 𝑝(𝑥)𝑝(𝑦) (1)
Notably, (1) applies in equivalent form to probability mass and density functions. To make the factorization
definition of stochastic independence intuitive, consider the conditional probability 𝑝(𝑥|𝑦) for the case of two random variables 𝑥 and 𝑦. The conditional probability 𝑝(𝑥|𝑦) implies that the distribution over the values that the random variable 𝑥 may take on depends on the values that 𝑦 takes on. For example, for
a random variable 𝑥 taking on values in {1,2,3} and a random variable 𝑦 taking on values in {1,2}, we could
have the following
𝑝(𝑥 = 1|𝑦 = 1) = 0.2, 𝑝(𝑥 = 2|𝑦 = 1) = 0.3, 𝑝(𝑥 = 3|𝑦 = 1) = 0.5 (2)
𝑝(𝑥 = 1|𝑦 = 2) = 0.4, 𝑝(𝑥 = 2|𝑦 = 2) = 0.4, 𝑝(𝑥 = 3|𝑦 = 2) = 0.2 (3)
Notably, the probability that 𝑥 takes on, say, the value 2 is dependent on the value of 𝑦: if 𝑦 is 1, this
probability is 0.3 and hence lower than if 𝑦 is 2, in which case this probability is 0.4. In the special case that, given two random variables 𝑥 and 𝑦, the distribution of the random variable 𝑥 does not depend on the value that the random variable 𝑦 takes on, 𝑥 is said to be “stochastically independent” of 𝑦. In this case, the conditioning statement “|𝑦” is redundant, and we may simply write
𝑝(𝑥|𝑦) = 𝑝(𝑥) (4)
However, from the definition of the conditional probability 𝑝(𝑥|𝑦) this implies that
𝑝(𝑥|𝑦) = 𝑝(𝑥) ⇒ 𝑝(𝑥, 𝑦)/𝑝(𝑦) = 𝑝(𝑥) (5)
The latter equation, however, is only possible if 𝑝(𝑦) can be cancelled out on the left-hand side, from which it follows that 𝑝(𝑥, 𝑦) must factorize into 𝑝(𝑥)𝑝(𝑦)
𝑝(𝑥, 𝑦)/𝑝(𝑦) = 𝑝(𝑥) ⇒ 𝑝(𝑥, 𝑦) = 𝑝(𝑥)𝑝(𝑦) (6)
because in this case
𝑝(𝑥, 𝑦)/𝑝(𝑦) = 𝑝(𝑥)𝑝(𝑦)/𝑝(𝑦) = 𝑝(𝑥) (7)
In other words, the assumption that the probability distribution over the random variable 𝑥 is not affected by the random variable 𝑦 implies, by means of the definition of the conditional probability, the factorization of the joint distribution 𝑝(𝑥, 𝑦) into the product 𝑝(𝑥)𝑝(𝑦). Note that if 𝑝(𝑥|𝑦) = 𝑝(𝑥) and hence 𝑝(𝑥, 𝑦) = 𝑝(𝑥)𝑝(𝑦), the conditional probability of 𝑦 given 𝑥 is likewise independent of 𝑥
𝑝(𝑦|𝑥) = 𝑝(𝑥, 𝑦)/𝑝(𝑥) = 𝑝(𝑥)𝑝(𝑦)/𝑝(𝑥) = 𝑝(𝑦) (8)
Below, we depict the joint distribution for independent random variables 𝑥 and 𝑦 based on the marginal distributions of the example in Section (2) “Joint and marginal probability distributions”. Note that each cell entry is derived merely by multiplication of the respective row and column marginal entries
𝑝(𝑥 = 1, 𝑦 = 1) = 1/16 𝑝(𝑥 = 1, 𝑦 = 2) = 1/16 𝑝(𝑥 = 1, 𝑦 = 3) = 1/16 𝑝(𝑥 = 1, 𝑦 = 4) = 1/16 | 𝑝(𝑥 = 1) = 3/12
𝑝(𝑥 = 2, 𝑦 = 1) = 2/16 𝑝(𝑥 = 2, 𝑦 = 2) = 2/16 𝑝(𝑥 = 2, 𝑦 = 3) = 2/16 𝑝(𝑥 = 2, 𝑦 = 4) = 2/16 | 𝑝(𝑥 = 2) = 6/12
𝑝(𝑥 = 3, 𝑦 = 1) = 1/16 𝑝(𝑥 = 3, 𝑦 = 2) = 1/16 𝑝(𝑥 = 3, 𝑦 = 3) = 1/16 𝑝(𝑥 = 3, 𝑦 = 4) = 1/16 | 𝑝(𝑥 = 3) = 3/12
𝑝(𝑦 = 1) = 3/12 𝑝(𝑦 = 2) = 3/12 𝑝(𝑦 = 3) = 3/12 𝑝(𝑦 = 4) = 3/12
(9)
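The construction of (9) — every cell as the product of its row and column marginals — can be verified in code. A Python sketch with exact fractions (names ours):

```python
from fractions import Fraction

F = Fraction
p_x = {1: F(3, 12), 2: F(6, 12), 3: F(3, 12)}               # marginal distribution of x
p_y = {1: F(3, 12), 2: F(3, 12), 3: F(3, 12), 4: F(3, 12)}  # marginal distribution of y

# Joint distribution of two independent random variables: product of the marginals
joint = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}

def is_independent(joint, p_x, p_y):
    """True iff p(x, y) = p(x) p(y) holds for every combination of values."""
    return all(joint[x, y] == p_x[x] * p_y[y] for x in p_x for y in p_y)
```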
(6) Discrete random variables and probability mass functions
The univariate and bivariate probability distributions of random variables that we have considered so far are examples of so-called “discrete outcome random variables” or “discrete random variables”. The defining feature of discrete random variables is that they can only take on a finite (or countably infinite) set of discrete values. Each of these values is assigned a probability, and this assignment will be referred to as a “probability mass function”. Intuitively, the complete probability mass of 1 is partitioned and distributed over the values that the discrete random variable can take on. Note that this does not imply uniform distributions: an example of a probability mass function is the “Bernoulli distribution” of a random variable 𝑥, which can take on the values 𝑥 = 1 and 𝑥 = 0 with probabilities 𝑝 ∈ [0,1] and (1 − 𝑝) ∈ [0,1], respectively. The Bernoulli distribution is often used to model the outcome of a coin toss, where, for an unbiased coin, one would set 𝑝 = 0.5.
(7) Continuous random variables and probability density functions
In contrast to neurophysiology, where one commonly encounters discrete probability distributions when modelling data like neuronal “spikes” (action potentials), neuroimaging data is usually assumed to be continuous. For example, the values of the MR signal at a given brain location may fluctuate around some baseline value, say 200, and can take on continuous values around that, like, e.g., 159.34, 177.67, 221.89, and so on. Likewise, the potential measured at a specific EEG electrode at a specific time point fluctuates continuously around, say, 0 𝜇𝑉 with respect to a reference electrode, taking on values in the range of −100 𝜇𝑉 to 100 𝜇𝑉. Because we are interested in describing these kinds of signals using probabilistic concepts, the focus in PMFN will be on “continuous” random variables, or in other words, random variables that take on values in the set of real numbers ℝ. From a probability theory viewpoint, this is quite a complication, because one now needs to distribute the probability mass of 1 over an infinity of values that the random variable can take on, both in the large value limit ]−∞,∞[ and in the very small value limit, i.e., over the infinity of real numbers that lie between any two real numbers. We will not strive for too much formal correctness with respect to these mathematical subtleties. For our purposes, it is enough to acknowledge that for “continuous” or “real random variables” probability is not assigned to specific values, but to intervals. Informally, the function that assigns probability to intervals in the case of real random variables is called a “probability density function”. One may conceive of this in analogy to the density concept in Archimedean physics: mass density is defined as mass per volume. This implies that if one has zero volume, one has zero mass. With respect to probability density, if one does not have an interval in the real numbers, but only one of its boundaries, i.e., a scalar in ℝ, this scalar is allocated zero probability.
The most important univariate probability density function required in PMFN is the “Gaussian” or
“normal” probability density function, defined as
𝑝𝜇,𝜎²: ℝ → ℝ+, 𝑥 ↦ 𝑝(𝑥; 𝜇, 𝜎²) ≔ (1/√(2𝜋𝜎²)) exp(−(1/(2𝜎²))(𝑥 − 𝜇)²) (1)
The univariate Gaussian probability density function (or, informally, “Gaussian distribution”) has two parameters: the “expectation parameter” or “mean parameter” 𝜇 ∈ ℝ and the “variance parameter” 𝜎² ∈ ℝ+. Note that the variance parameter is by definition positive (i.e., strictly larger than zero). The expectation parameter determines the location of the Gaussian bell curve on the 𝑥-axis, and the width of the curve increases with increasing 𝜎². We will sometimes use the following notation to indicate that a random variable is distributed according to a univariate Gaussian distribution (or, in other words, that it is associated with a Gaussian probability density function)
𝑥 ~ 𝑁(𝑥; 𝜇, 𝜎²) (2)
where the tilde ~ should be read as “is distributed according to”. Note that the symbol before the semicolon
indicates the random variable, the symbols behind the semicolon denote the parameters of the associated
probability density function. Likewise, we may write
𝑝(𝑥) = 𝑁(𝑥; 𝜇, 𝜎²) = (1/√(2𝜋𝜎²)) exp(−(1/(2𝜎²))(𝑥 − 𝜇)²) (3)
which is somewhat sloppy, but is understood in the same way as (1) and (2).
Informally, what expression (1) says is that the probability of a value of the random variable
𝑥 falling into an interval on the real line ℝ which is close to the value of 𝜇 is higher than the probability of it falling into an
interval on ℝ which is far from 𝜇. Simultaneously, expression (1) states that the “dispersion” or
“variability” or “probability of a deviation” from 𝜇 increases with higher values of 𝜎². This is most readily
appreciated by inspecting the probability density functions corresponding to different values of the
parameters 𝜇 and 𝜎² (Figure 1, uppermost panel). Note that the value of 𝜇 changes the location of the
curve, while the value of 𝜎² changes the width of the curve. Intuitively, the probability of the random
variable 𝑥 assuming a value in a very small interval around, say, 𝑥 = 2.5 varies with the value of 𝑝(𝑥 = 2.5).
In Figure 1 (uppermost panel), values around 𝑥 = 3 have a higher probability under the univariate normal
distribution with 𝜇 = 3, 𝜎² = 0.3 than under the other two probability distributions.
Figure 1. The univariate Gaussian distribution. The uppermost panel depicts the probability density functions of three different Gaussian distributions, represented by different settings of the parameters of their probability density functions. The middle panel depicts a sample of 20 values from the Gaussian distribution with the depicted probability density function. Note that these samples cluster where the probability density function assumes high values in 𝑥 space. The lowermost panel depicts a “histogram estimate”, based on a large sample from the Gaussian distribution, of the underlying probability density function. While we have not introduced the theory of histogram estimates, intuitively, these may be understood as frequency counts of sampled values falling into intervals in 𝑥 space. The frequency counts are appropriately scaled and depicted as grey bars over the respective intervals on the 𝑥 axis.
A good way to obtain an intuitive understanding of the theory discussed in this section is to explore
sampling from Gaussian distributions using the random number generator implemented in Matlab. Random
number generators allow one to sample numerical values from specified distributions. Matlab and Matlab’s
Statistics Toolbox comprise a large variety of random number generators for many different distributions
(uniform, binomial, Student t, and so on). In addition Matlab provides functions that return probability
densities, which may be seen as the “analytic” counterpart to the “empirical” random number generators.
The middle and lowermost panels in Figure 1 were generated by capitalizing on the random number
generators and probability density functions implemented in Matlab.
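In the same spirit, the sampling-plus-histogram comparison underlying Figure 1 can be sketched in a few lines. The snippet below uses Python/NumPy in place of Matlab purely for illustration, and the parameter values are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma2 = 3.0, 0.3              # expectation and variance parameters (illustrative values)
sigma = np.sqrt(sigma2)

# draw a large sample from N(x; mu, sigma^2) with a random number generator
sample = rng.normal(loc=mu, scale=sigma, size=100_000)

# "histogram estimate": scaled frequency counts over intervals in x space
counts, edges = np.histogram(sample, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# analytic probability density at the bin centers
pdf = np.exp(-(centers - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# the histogram estimate closely tracks the analytic density
print(np.max(np.abs(counts - pdf)))
```

Plotting `centers` against `counts` and `pdf` reproduces the qualitative picture of the lowermost panel of Figure 1.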
(8) Expected value and variance of univariate random variables
Two further concepts from probability theory we require are the “expected value” or “expectation”
of a random variable, and the “variance” of a random variable. The reader will likely have an intuition about
what these terms refer to: the expected value is the value one “expects” to see “most of the time”, or “in
the long-run average” when sampling from the random variable. Likewise, the variance is a measure of the
“variability” around this average value in the long-run.
From a mathematical viewpoint, it is very important to clearly differentiate between the notions of
“expected value” and “variance” on the one hand, and “empirical mean” and “empirical variance” on the
other. The former two terms are theoretical constructs that can be evaluated as soon as some probability
mass or density function has been defined for a given random variable. The latter two terms refer to
constructs that can only be evaluated once realizations of a random variable have been observed. Of course,
expectation and variance as theoretical constructs carry a clear intuition with respect to sample mean and
sample variance. However, they are not the same and one should always be clear about whether a given
statement refers to the “theoretical” expectation of a random variable, or the “empirical” evaluation of a
sample from the random variable (and likewise for variances).
Because expected value/expectation and variance are theoretical constructs, they can be evaluated
solely on the basis of a probability mass or density function. For discrete random variables, the expected
value is defined as
𝐸(𝑥) = ∑_{𝑖=1}^{𝑛} 𝑝(𝑥𝑖)𝑥𝑖 (1)
Note that we assumed that the discrete random variable 𝑥 can take on 𝑛 discrete values 𝑥1, … , 𝑥𝑛. For
example, the expected value of a random variable modelling a fair die using the probability mass function
ℕ6 → [0,1], 𝑥𝑖 ↦ 𝑝(𝑥 = 𝑥𝑖) = 1/6 (2)
is given by
𝐸(𝑥) = (1/6)⋅1 + (1/6)⋅2 + (1/6)⋅3 + (1/6)⋅4 + (1/6)⋅5 + (1/6)⋅6 = (1 + 2 + 3 + 4 + 5 + 6)/6 = 21/6 = 3.5 (3)
Note two important features of expected values based on this example: they are single numbers, and they
take on values in the range spanned by the values of the random variable under consideration (even if, as for
the die, the expected value is not itself a possible outcome). Also note that the expected value has a clear intuition: it corresponds to the
sum over all values that the random variable can take on, where each value is weighted by a number in the
interval [0,1]. Notably, the entire set of weights sums to one.
For continuous real random variables and their associated probability density functions, the
summation in the definition of the expected value is replaced by integration, but the intuition remains the
same: the outcome values that the random variable can take on are multiplicatively weighted by their
associated probability densities and “added up”
𝐸(𝑥) = ∫ℝ 𝑝(𝑥) 𝑥 𝑑𝑥 (4)
As an example, the expected value of a univariate Gaussian random variable is given by its expectation
parameter (which we will not prove here): if 𝑥 ~ 𝑁(𝑥; 𝜇, 𝜎²), then
𝐸(𝑥) = ∫ℝ (1/√(2𝜋𝜎²)) exp(−(1/(2𝜎²))(𝑥 − 𝜇)²) ⋅ 𝑥 𝑑𝑥 = 𝜇 (5)
The variance of a random variable corresponds to the expected squared deviation from its expected
value. Formally, we write
𝑉(𝑥) = 𝐸((𝑥 − 𝐸(𝑥))²) (6)
The variance can thus be viewed as a measure of the “variability” of a random variable in the large
sample limit. For probability mass functions of discrete random variables with finite outcome spaces, the
expected values in (6) can again be written as weighted sums over the outcome space:
𝑉(𝑥) = 𝐸((𝑥 − 𝐸(𝑥))²) = 𝐸((𝑥 − ∑_{𝑖=1}^{𝑛} 𝑝(𝑥𝑖)𝑥𝑖)²) = ∑_{𝑖=1}^{𝑛} 𝑝(𝑥𝑖)(𝑥𝑖 − ∑_{𝑗=1}^{𝑛} 𝑝(𝑥𝑗)𝑥𝑗)² (7)
As an example, consider again the fair die. Above, we evaluated the expectation as
𝐸(𝑥) = ∑_{𝑖=1}^{6} 𝑝(𝑥𝑖)𝑥𝑖 = 3.5 (8)
To evaluate the variance of this random variable, we have to subtract 𝐸(𝑥) from each value 𝑥𝑖 that 𝑥 can
take on (i.e. 1,2,3,4,5,6), square each result, weight it with the probability 𝑝(𝑥 = 𝑥𝑖) and finally sum over all
values we obtained:
𝑉(𝑥) = ∑_{𝑖=1}^{6} 𝑝(𝑥𝑖)(𝑥𝑖 − ∑_{𝑗=1}^{6} 𝑝(𝑥𝑗)𝑥𝑗)²
= ∑_{𝑖=1}^{6} 𝑝(𝑥𝑖)(𝑥𝑖 − 3.5)²
= (1/6)⋅(1 − 3.5)² + (1/6)⋅(2 − 3.5)² + (1/6)⋅(3 − 3.5)² + (1/6)⋅(4 − 3.5)² + (1/6)⋅(5 − 3.5)² + (1/6)⋅(6 − 3.5)²
= (1/6)⋅(−2.5)² + (1/6)⋅(−1.5)² + (1/6)⋅(−0.5)² + (1/6)⋅(0.5)² + (1/6)⋅(1.5)² + (1/6)⋅(2.5)²
= (1/6)⋅6.25 + (1/6)⋅2.25 + (1/6)⋅0.25 + (1/6)⋅0.25 + (1/6)⋅2.25 + (1/6)⋅6.25
= 17.5/6
≈ 2.92
(9)
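The fair-die computation above can be checked in a few lines of code (a sketch; Python is used here rather than the Matlab environment referenced earlier):

```python
import numpy as np

values = np.arange(1, 7)        # the outcomes 1, ..., 6 of the die
pmf = np.full(6, 1 / 6)         # fair die: p(x = x_i) = 1/6 for all i

# expected value: probability-weighted sum over the outcomes
E = np.sum(pmf * values)

# variance: probability-weighted sum of squared deviations from E
V = np.sum(pmf * (values - E) ** 2)

print(E, V)  # 3.5 and 17.5/6 ≈ 2.9167
```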
For a continuous random variable, summation is again replaced by integration, and the following
equation for the variance applies
𝑉(𝑥) = 𝐸((𝑥 − 𝐸(𝑥))²) = ∫ℝ 𝑝(𝑥)(𝑥 − 𝐸(𝑥))² 𝑑𝑥 (10)
As an example, the variance of a univariate Gaussian random variable is given by its variance parameter
(which we will not prove here): if 𝑥 ~ 𝑁(𝑥; 𝜇, 𝜎²), then
𝑉(𝑥) = ∫ℝ (1/√(2𝜋𝜎²)) exp(−(1/(2𝜎²))(𝑥 − 𝜇)²) ⋅ (𝑥 − 𝜇)² 𝑑𝑥 = 𝜎² (11)
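Statements (5) and (11) can be verified numerically by approximating the integrals with a Riemann sum on a fine grid (a sketch with arbitrary illustrative parameter values):

```python
import numpy as np

mu, sigma2 = 2.0, 1.5                            # illustrative parameter values
x = np.linspace(mu - 12.0, mu + 12.0, 200_001)   # grid covering virtually all probability mass
dx = x[1] - x[0]

# univariate Gaussian probability density on the grid
p = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

E = np.sum(p * x) * dx               # expected value: integral of p(x)*x
V = np.sum(p * (x - E) ** 2) * dx    # variance: integral of p(x)*(x - E)^2

print(E, V)  # close to mu = 2.0 and sigma2 = 1.5
```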
An often encountered variant of the variance of a random variable expresses the variance in terms of the
expectation of the square of the random variable and the square of the expectation. Specifically, we have
from (6)
𝑉(𝑥) = 𝐸((𝑥 − 𝐸(𝑥))²) = 𝐸(𝑥² − 2𝑥𝐸(𝑥) + 𝐸(𝑥)²) (12)
Applying the expectation to each term on the right-hand side of (12), and noting that 𝐸(𝑥) is not a
random variable but a constant, whose expectation is thus again 𝐸(𝑥), we have
𝐸(𝑥² − 2𝑥𝐸(𝑥) + 𝐸(𝑥)²) = 𝐸(𝑥²) − 2𝐸(𝑥)𝐸(𝑥) + 𝐸(𝑥)² = 𝐸(𝑥²) − 𝐸(𝑥)² (13)
We thus have
𝑉(𝑥) = 𝐸(𝑥²) − 𝐸(𝑥)² (14)
In summary, expectation and variance are theoretical concepts that can be applied to any random
variable to measure its “mean tendency” and its “dispersion” about this mean tendency. In the context of
PMFN it is essential to differentiate between three mathematical objects that evoke similar connotations,
but refer to different intuitions: (1) expectation/variance, (2) empirical mean/empirical variance, and (3)
the expectation parameter/variance parameter of Gaussian distributions. Again, (1) refers to measures of mean
tendency and dispersion, which depend on a random variable’s probability mass or density function, and are
“theoretical” or “analytical” measures. (2) refers to summary statistics of samples or realizations of random
variables. These can only be evaluated once samples of a random variable have been obtained. Finally, (3)
refers to numbers that describe the functional form of Gaussian distributions. Conveniently, the expectation
parameter of a Gaussian corresponds to its expectation, and the variance parameter corresponds to its
variance. This, however, need not be the case: consider for example the Bernoulli distribution introduced
above. Here, the probability mass function has a single parameter 𝑝. One can show that the expectation of a
Bernoulli distributed random variable is given by 𝐸(𝑥) = 𝑝, while its variance is given by 𝑉(𝑥) = 𝑝(1 − 𝑝).
One may thus refer to 𝑝 as the expectation parameter of 𝑥; however, unlike the Gaussian, the Bernoulli
distribution does not have a “variance parameter”. It nevertheless has a variance.
Study Questions
1. Provide a verbal explanation of the following statement: 𝑝(𝑥 = 2, 𝑦 = 4) = 0.4.
2. Explain the notions of probability mass functions and probability density functions using examples.
3. Compute the marginal distributions 𝑝(𝑥) and 𝑝(𝑦) and the conditional distributions 𝑝(𝑥|𝑦 = 2) and 𝑝(𝑦|𝑥 = 1) of the
distribution given by the following probability mass function for random variables 𝑥 and 𝑦
𝑝(𝑥 = 1, 𝑦 = 1) = 2/10  𝑝(𝑥 = 1, 𝑦 = 2) = 1/10  𝑝(𝑥 = 1, 𝑦 = 3) = 3/10
𝑝(𝑥 = 2, 𝑦 = 1) = 1/10  𝑝(𝑥 = 2, 𝑦 = 2) = 1/10  𝑝(𝑥 = 2, 𝑦 = 3) = 2/10
4. Derive Bayes theorem based on a joint distribution 𝑝(𝑥, 𝑦) and the definition of the conditional distributions 𝑝(𝑥|𝑦) and
𝑝(𝑦|𝑥).
5. Evaluate the expectation of the distribution given by the following probability mass function for the random variable 𝑥 taking
on values in the set {1,4,10}.
𝑝(𝑥 = 1) = 0.3, 𝑝(𝑥 = 4) = 0.2, 𝑝(𝑥 = 10) = 0.5
6. Evaluate the joint distribution resulting, under the assumption of stochastic independence, from the marginal distributions of the
random variable 𝑥 with probability mass function
𝑝(𝑥 = 1) = 0.4, 𝑝(𝑥 = 2) = 0.6
and random variable 𝑦 with probability mass function
𝑝(𝑦 = 1) = 0.2, 𝑝(𝑦 = 2) = 0.8.
Study Questions Answers
1. 𝑝(𝑥 = 2, 𝑦 = 4) = 0.4 can be read as “the probability that the random variable 𝑥 takes on the value 2 and that the random
variable 𝑦 (simultaneously) takes on the value 4 is 0.4 ”.
2. A probability mass function allocates to each possible value that a random variable can take on a probability mass, such that over
all values that the random variable can take on, these masses add to 1. For example, if 𝑥 is a random variable taking on the values
0 and 1, then a probability mass function for 𝑥 is 𝑝(𝑥 = 0) = 0.2, 𝑝(𝑥 = 1) = 0.8. A probability density function allocates to
each possible value that a random variable can take on a probability density, which, if considered over a small interval around a
value, may be multiplied with the length of the interval to yield (approximately) the probability of the random variable taking on values in that
interval. A typical example of a probability density function is the Gaussian.
3. For the marginal distributions, see the rightmost column for 𝑝(𝑥) and the lowermost row
for 𝑝(𝑦) below. Note that the marginal distribution probability masses each sum to 1.
𝑝(𝑥 = 1, 𝑦 = 1) = 2/10  𝑝(𝑥 = 1, 𝑦 = 2) = 1/10  𝑝(𝑥 = 1, 𝑦 = 3) = 3/10  𝑝(𝑥 = 1) = 2/10 + 1/10 + 3/10 = 6/10
𝑝(𝑥 = 2, 𝑦 = 1) = 1/10  𝑝(𝑥 = 2, 𝑦 = 2) = 1/10  𝑝(𝑥 = 2, 𝑦 = 3) = 2/10  𝑝(𝑥 = 2) = 1/10 + 1/10 + 2/10 = 4/10
𝑝(𝑦 = 1) = 2/10 + 1/10 = 3/10  𝑝(𝑦 = 2) = 1/10 + 1/10 = 2/10  𝑝(𝑦 = 3) = 3/10 + 2/10 = 5/10
For the conditional distributions we have by definition
𝑝(𝑥|𝑦 = 2) = 𝑝(𝑥, 𝑦 = 2)/𝑝(𝑦 = 2) and 𝑝(𝑦|𝑥 = 1) = 𝑝(𝑥 = 1, 𝑦)/𝑝(𝑥 = 1)
As 𝑥 takes on the values 1 and 2 in the current example, the conditional distribution of 𝑥 given 𝑦 = 2 is given by
𝑝(𝑥 = 1|𝑦 = 2) = 𝑝(𝑥 = 1, 𝑦 = 2)/𝑝(𝑦 = 2) = (1/10)⋅(10/2) = 1/2
𝑝(𝑥 = 2|𝑦 = 2) = 𝑝(𝑥 = 2, 𝑦 = 2)/𝑝(𝑦 = 2) = (1/10)⋅(10/2) = 1/2
And as 𝑦 takes on the values 1, 2 and 3 in the current example, the conditional distribution of 𝑦 given 𝑥 = 1 is given by
𝑝(𝑦 = 1|𝑥 = 1) = 𝑝(𝑥 = 1, 𝑦 = 1)/𝑝(𝑥 = 1) = (2/10)⋅(10/6) = 2/6
𝑝(𝑦 = 2|𝑥 = 1) = 𝑝(𝑥 = 1, 𝑦 = 2)/𝑝(𝑥 = 1) = (1/10)⋅(10/6) = 1/6
𝑝(𝑦 = 3|𝑥 = 1) = 𝑝(𝑥 = 1, 𝑦 = 3)/𝑝(𝑥 = 1) = (3/10)⋅(10/6) = 3/6
Note that the probabilities of both 𝑝(𝑥|𝑦 = 2) and 𝑝(𝑦|𝑥 = 1) sum to 1 over the values of 𝑥 and 𝑦, respectively.
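The marginalization and conditioning steps of question 3 can also be carried out programmatically; in the sketch below (Python/NumPy for illustration), rows index the values of 𝑥 and columns the values of 𝑦:

```python
import numpy as np

# joint pmf p(x, y): rows are x in {1, 2}, columns are y in {1, 2, 3}
joint = np.array([[2, 1, 3],
                  [1, 1, 2]]) / 10

p_x = joint.sum(axis=1)   # marginal p(x): sum over y
p_y = joint.sum(axis=0)   # marginal p(y): sum over x

# conditioning: p(x | y = 2) = p(x, y = 2) / p(y = 2), and analogously for p(y | x = 1)
p_x_given_y2 = joint[:, 1] / p_y[1]
p_y_given_x1 = joint[0, :] / p_x[0]

print(p_x)            # ≈ [0.6, 0.4]
print(p_y)            # ≈ [0.3, 0.2, 0.5]
print(p_x_given_y2)   # ≈ [0.5, 0.5]
print(p_y_given_x1)   # ≈ [2/6, 1/6, 3/6]
```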
4. From the definition of conditional probabilities, we see that we can write the joint distribution 𝑝(𝑥, 𝑦) in two ways
𝑝(𝑦|𝑥) = 𝑝(𝑥, 𝑦)/𝑝(𝑥) ⇒ 𝑝(𝑥, 𝑦) = 𝑝(𝑥)𝑝(𝑦|𝑥) and 𝑝(𝑥|𝑦) = 𝑝(𝑥, 𝑦)/𝑝(𝑦) ⇒ 𝑝(𝑥, 𝑦) = 𝑝(𝑦)𝑝(𝑥|𝑦)
Setting the joint distribution 𝑝(𝑥, 𝑦) equal to itself then yields the following equivalences
𝑝(𝑥, 𝑦) = 𝑝(𝑥, 𝑦) ⇔ 𝑝(𝑥)𝑝(𝑦|𝑥) = 𝑝(𝑦)𝑝(𝑥|𝑦) ⇔ 𝑝(𝑦|𝑥) = 𝑝(𝑦)𝑝(𝑥|𝑦)/𝑝(𝑥)
where the last equality is known as “Bayes’ theorem”.
5. The expectation is given by
𝐸(𝑥) = 1 ⋅ 𝑝(𝑥 = 1) + 4 ⋅ 𝑝(𝑥 = 4) + 10 ⋅ 𝑝(𝑥 = 10)
= 1 ⋅ 0.3 + 4 ⋅ 0.2 + 10 ⋅ 0.5
= 0.3 + 0.8 + 5
= 6.1.
6. Under the assumption of independent marginal distributions, the entries in the respective cells of the joint distribution result
from the multiplication of the corresponding marginal probabilities. We thus have
𝑝(𝑥 = 1, 𝑦 = 1) = 8/100  𝑝(𝑥 = 1, 𝑦 = 2) = 32/100  𝑝(𝑥 = 1) = 4/10
𝑝(𝑥 = 2, 𝑦 = 1) = 12/100  𝑝(𝑥 = 2, 𝑦 = 2) = 48/100  𝑝(𝑥 = 2) = 6/10
𝑝(𝑦 = 1) = 2/10  𝑝(𝑦 = 2) = 8/10
Note that the joint distribution sums to 1: 8/100 + 32/100 + 12/100 + 48/100 = 100/100 = 1.
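Question 6's joint distribution is simply the outer product of the two marginal pmfs; a minimal sketch:

```python
import numpy as np

p_x = np.array([0.4, 0.6])   # p(x = 1), p(x = 2)
p_y = np.array([0.2, 0.8])   # p(y = 1), p(y = 2)

# under stochastic independence, p(x = i, y = j) = p(x = i) * p(y = j) for every cell
joint = np.outer(p_x, p_y)

print(joint)         # ≈ [[0.08, 0.32], [0.12, 0.48]]
print(joint.sum())   # ≈ 1
```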
Multivariate Gaussian distributions
(1) The bivariate Gaussian distribution
Multivariate Gaussian distributions extend the idea of univariate Gaussians to more than one
random variable. We will start by considering the most commonly discussed multivariate Gaussian
distribution, the 2-dimensional, or bivariate Gaussian distribution. Consider first the univariate Gaussian
probability density function
𝑝(𝑥 = 𝑥∗) = 𝑁(𝑥 = 𝑥∗; 𝜇, 𝜎²) = (1/√(2𝜋𝜎²)) exp(−(1/(2𝜎²))(𝑥∗ − 𝜇)²) (1)
As discussed previously, the univariate Gaussian probability density function 𝑁(𝑥; 𝜇, 𝜎²) is a formula that
returns for each possible value 𝑥∗ of the random variable 𝑥 the associated probability density, or, more
intuitively, the probability of the random variable 𝑥 assuming a value in an infinitesimally small interval
around 𝑥∗. As noted previously, this probability depends on the parameters of the univariate Gaussian
probability density function: the “expectation parameter” 𝜇 ∈ ℝ, indexing the centre of the bell-shaped
Gaussian curve, and the “variance parameter” 𝜎² > 0 (which has to be positive), indexing the width of the
bell-shaped curve. As a generalization of this idea, the bivariate Gaussian distribution is a formula that
returns for each pair of values 𝑥1∗ and 𝑥2∗ of the random variables 𝑥1 and 𝑥2 the probability of 𝑥1 and 𝑥2
simultaneously falling into a very small square around (𝑥1∗, 𝑥2∗) in the space ℝ². For example, based
on the multivariate Gaussian formula, we can evaluate the probability density at 𝑥1∗ = 0.7 and (simultaneously)
𝑥2∗ = 0.5. The values 𝑥1∗ and 𝑥2∗ may be summarized in a two-dimensional vector
𝑥∗ = (𝑥1∗, 𝑥2∗)^𝑇 (2)
The formula for the probability density of the random vector 𝑥 at a specific value 𝑥∗, i.e. the bivariate
Gaussian probability density function, assumes a form that is very similar to (1) and is written as
𝑝(𝑥 = 𝑥∗) = 𝑁(𝑥 = 𝑥∗; 𝜇, Σ) = (2𝜋)^(−1) |Σ|^(−1/2) exp(−(1/2)(𝑥∗ − 𝜇)^𝑇 Σ^(−1) (𝑥∗ − 𝜇)) (3)
where |Σ| denotes the determinant and Σ−1 denotes the inverse of Σ . As for the univariate case, the
important part is the expression in the exponential function, whereas the factor (2𝜋)^(−1)|Σ|^(−1/2) serves the
purpose of normalization, i.e. integration of the probability density function to 1 over the outcome space
ℝ². A number of bivariate Gaussian probability density functions and their parameters are visualized in
Figure 1. Note the similarity between (1) and (3): instead of the mean or expectation parameter 𝜇 ∈ ℝ in (1),
the bivariate Gaussian distribution introduces the “mean” or “expectation vector” 𝜇 ∈ ℝ². The intuition for
𝜇 is the same in both cases: it represents the center of mass of the distribution, or in other words, the center
location of the bell-curve. Likewise, instead of a variance parameter 𝜎² > 0, equation (3) introduces the
notion of a “covariance matrix (parameter)” Σ ∈ ℝ^(2×2) that governs the width of the corresponding bell-
curve by means of the entries on its main diagonal. The off-diagonal elements have a somewhat different
interpretation, which we will examine in more detail below. Beforehand, however, we will consider the
example
𝑥∗ = (𝑥1∗, 𝑥2∗)^𝑇 = (0.7, 0.5)^𝑇 (4)
introduced above in a bit more detail: to evaluate the probability of the random vector 𝑥 = (𝑥1, 𝑥2)^𝑇
falling into a very small square in the vicinity of (0.7, 0.5)^𝑇 (i.e. the probability density at 𝑥∗ = (0.7, 0.5)^𝑇)
based on (3), we need to know the values of 𝜇 and Σ. Let us assume that
𝜇 = (𝜇1, 𝜇2)^𝑇 ≔ (1, 1)^𝑇 and Σ = (𝜎11² 𝜎12²; 𝜎21² 𝜎22²) ≔ (0.1 0.07; 0.07 0.1) (5)
Equation (3) then simply means the following: to evaluate the probability density at 𝑥∗ = (0.7, 0.5)^𝑇,
compute
𝑝(𝑥 = 𝑥∗) = (2𝜋)^(−1) |Σ|^(−1/2) exp(−(1/2)(𝑥∗ − 𝜇)^𝑇 Σ^(−1) (𝑥∗ − 𝜇))
= (2𝜋)^(−1) |(𝜎11² 𝜎12²; 𝜎21² 𝜎22²)|^(−1/2) exp(−(1/2)((𝑥1∗, 𝑥2∗)^𝑇 − (𝜇1, 𝜇2)^𝑇)^𝑇 (𝜎11² 𝜎12²; 𝜎21² 𝜎22²)^(−1) ((𝑥1∗, 𝑥2∗)^𝑇 − (𝜇1, 𝜇2)^𝑇))
= (2𝜋)^(−1) |(0.1 0.07; 0.07 0.1)|^(−1/2) exp(−(1/2)((0.7, 0.5)^𝑇 − (1, 1)^𝑇)^𝑇 (0.1 0.07; 0.07 0.1)^(−1) ((0.7, 0.5)^𝑇 − (1, 1)^𝑇)) (6)
Note that the value 𝑝(𝑥 = 𝑥∗) returned by the somewhat lengthy expression (6) (which is usually evaluated
on a computer and not by hand) is merely a positive scalar number and corresponds to the color at
𝑥∗ = (0.7, 0.5)^𝑇 ∈ ℝ² in Figure 1. Intuitively, as stated above, the warmth of this color corresponds to the
probability of the random variable 𝑥 falling into a small square centered at (0.7, 0.5)^𝑇.
Figure 1. Bivariate Gaussian distributions with identical expectation parameter 𝜇 = (1, 1)^𝑇 and varying covariance matrices. The white cross indicates the point (0.7, 0.5)^𝑇 ∈ ℝ². The respective covariance matrices are, for the left panel, 𝛴 ≔ (0.1 0.07; 0.07 0.1), for the middle panel 𝛴 ≔ (0.1 −0.07; −0.07 0.1), and for the right panel 𝛴 ≔ (0.1 0; 0 0.1).
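Expression (6) is indeed usually evaluated on a computer; a minimal sketch (Python/NumPy in place of Matlab) that implements equation (3) directly:

```python
import numpy as np

def bivariate_gaussian_pdf(x, mu, Sigma):
    """Bivariate Gaussian probability density, equation (3)."""
    d = x - mu
    return (2 * np.pi) ** -1 * np.linalg.det(Sigma) ** -0.5 \
        * np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d)

mu = np.array([1.0, 1.0])
Sigma = np.array([[0.10, 0.07],
                  [0.07, 0.10]])
x_star = np.array([0.7, 0.5])

# the density at x* is a single positive scalar (approximately 0.62 here)
print(bivariate_gaussian_pdf(x_star, mu, Sigma))

# the density is maximal at the expectation parameter, where it equals (2*pi)^-1 |Sigma|^-1/2
print(bivariate_gaussian_pdf(mu, mu, Sigma))
```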
We now consider the general covariance matrix of a bivariate Gaussian distribution
Σ ≔ (𝜎11² 𝜎12²; 𝜎21² 𝜎22²) (7)
in more detail. The elements 𝜎11² and 𝜎22² correspond to the variances of the (univariate) marginal variables
𝑥1 and 𝑥2. Together with the entries 𝜇1 and 𝜇2 of the mean vector 𝜇 ∈ ℝ², these values uniquely specify the
shape of the so-called marginal distributions of 𝑥1 and 𝑥2, respectively. The values 𝜎12² and 𝜎21² have a
different connotation: they specify the covariation (intuitively, the reader may think of the “correlation”) of
𝑥1 and 𝑥2. This covariation is always symmetric (like correlation is always symmetric). In other words, the
covariation of 𝑥1 and 𝑥2 is the same as the covariation of 𝑥2 and 𝑥1, and thus 𝜎12² = 𝜎21². Because the off-
diagonal elements of Σ are equal, the transpose of Σ equals Σ, i.e. we have
Σ^𝑇 = Σ (8)
Covariance matrices are thus always symmetric. Like the variance parameter 𝜎² of a univariate Gaussian has
to be positive, the covariance matrix of a multivariate Gaussian has to be “positive-definite”. By inspecting
Figure 1 and the associated covariance matrices, one realizes that positive values of the covariation
𝜎12² = 𝜎21² imply that high values of 𝑥1 are associated with a high probability of 𝑥2 also being high
and a low probability of 𝑥2 being low. Conversely, negative values of 𝜎12² = 𝜎21² imply that high
values of 𝑥1 are associated with a high probability of 𝑥2 being low and a low probability of 𝑥2 being high.
Finally, if 𝜎12² = 𝜎21² = 0, high values of 𝑥1 are associated with the same probabilities
for low and high values of 𝑥2. In this case, knowledge of the value of 𝑥1 does not give us any information
about the likely value of 𝑥2, which intuitively corresponds to the idea that 𝑥1 and 𝑥2 are “stochastically
independent”. This last case is very important for the theory of the general linear model, such that it comes
with its own label: if a bivariate Gaussian covariance matrix is of the form
Σ ≔ (𝜎11² 0; 0 𝜎22²) (9)
it is called “diagonal”. If, in addition, the values of 𝜎11² and 𝜎22² are both equal to a common value 𝜎² > 0, the
covariance matrix takes the form
Σ ≔ (𝜎² 0; 0 𝜎²) = 𝜎² (1 0; 0 1) = 𝜎²𝐼2 (10)
where 𝐼2 denotes the 2 × 2 identity matrix. The corresponding appearance of the bivariate Gaussian
probability density function is round or “spherical”. A covariance matrix of the form 𝜎²𝐼2 is thus called a
“spherical covariance matrix”. Spherical covariance matrices of 𝑛-dimensional multivariate Gaussian
distributions are of fundamental importance for the theory of the general linear model. We will thus next
extend the notion of bivariate Gaussian distributions to “𝑛-variate” or “𝑛-dimensional” Gaussian
distributions.
(2) The multivariate Gaussian distribution
The extension of the “𝑛-variate” (or “𝑛-dimensional”) Gaussian distribution and its associated
probability density function from 𝑛 = 2 to arbitrary 𝑛 ∈ ℕ, and thus to “multivariate” Gaussian distributions, is
relatively straightforward: 𝑛-dimensional Gaussian distributions describe the joint distribution of 𝑛
univariate random variables 𝑥1, … , 𝑥𝑛, or, equivalently, the distribution of 𝑛-dimensional “random vectors”
𝑥 ≔ (𝑥1, … , 𝑥𝑛)^𝑇 ∈ ℝ^𝑛 (1)
The probability density functions of 𝑛-dimensional Gaussian distributions have the general form
𝑝(𝑥) = (2𝜋)^(−𝑛/2) |Σ|^(−1/2) exp(−(1/2)(𝑥 − 𝜇)^𝑇 Σ^(−1) (𝑥 − 𝜇)) (2)
where
𝜇 = (𝜇1, … , 𝜇𝑛)^𝑇 ∈ ℝ^𝑛 (3)
is called the “mean vector” or “expectation parameter” and
Σ ≔ (𝜎11² 𝜎12² ⋯ 𝜎1𝑛²; 𝜎21² 𝜎22² ⋯ 𝜎2𝑛²; ⋮ ⋮ ⋱ ⋮; 𝜎𝑛1² 𝜎𝑛2² ⋯ 𝜎𝑛𝑛²) ∈ ℝ^(𝑛×𝑛) (4)
is called the covariance matrix. Like in the bivariate case, the covariance matrix of an 𝑛-dimensional Gaussian
probability density function is required to be symmetric and positive-definite. Further, like in the bivariate
case, the 𝑖th diagonal element (𝑖 = 1,… , 𝑛) represents the variance of the 𝑖th marginal variable 𝑥𝑖 and the
(𝑖, 𝑗)th off-diagonal element (𝑖 = 1,… , 𝑛, 𝑗 = 1,… , 𝑛) represents the covariance of the 𝑖th marginal variable
𝑥𝑖 with the 𝑗th marginal variable 𝑥𝑗 (which, of course, is equal to the covariance of the 𝑗th marginal variable
𝑥𝑗 with the 𝑖th marginal variable 𝑥𝑖). For the formulation of the general linear model, the 𝑛-dimensional
generalization of the bivariate spherical covariance matrix is of fundamental importance. In analogy to the
above, for 𝜎² > 0 it takes the general form
Σ ≔ (𝜎² 0 ⋯ 0; 0 𝜎² ⋯ 0; ⋮ ⋮ ⋱ ⋮; 0 0 ⋯ 𝜎²) = 𝜎² (1 0 ⋯ 0; 0 1 ⋯ 0; ⋮ ⋮ ⋱ ⋮; 0 0 ⋯ 1) = 𝜎²𝐼𝑛 ∈ ℝ^(𝑛×𝑛) (5)
where 𝐼𝑛 denotes the 𝑛-dimensional identity matrix.
Figure 2: (Left panel) Probability density function of a bivariate Gaussian with 𝜇 ≔ (1, 1)^𝑇 and covariance matrix 𝛴 ≔ (0.1 0.07; 0.07 0.1). (Right panel) 200 samples drawn from the distribution characterized by the probability density function on the left.
To summarize, it is important to be clear about the fact that the only thing we have discussed in this
section is a specific way of specifying a probabilistic model. It is very important not to confuse the “analytic”
or “probability-theoretical” notions of “mean vectors” or “covariance matrices” introduced in the current
section with “empirical” or “statistical concepts” such as the “sample mean” or the “sample correlation”.
These concepts, which may be very familiar from undergraduate statistics, have not been touched upon in
the current section. In other words, with respect to the undergraduate statistics curriculum, the 𝜇’s and Σ’s
introduced in the current section might best be thought of as what is often referred to as “population
parameters” in undergraduate statistics. Finally, like for the univariate Gaussian distribution, the probability
density functions discussed in the current section come with a clear intuition: if samples are drawn from, for
example, a bivariate Gaussian distribution described by a given parameter setting of its associated
probability density function, the majority of samples will fall into regions of the outcome space that are
associated with high probability density, and only a minority of samples will fall into regions associated with
low probability density. Figure 2 above visualizes this for the case of a bivariate Gaussian probability density
function.
(3) IID Gaussian random variables and spherical covariance matrices
In this part we introduce a theorem that is fundamental for the importance of multivariate Gaussian
distributions in the theory of the general linear model. Intuitively, this theorem may be stated as
follows: the joint distribution of 𝑛 independent univariate Gaussian random variables with (not necessarily
identical) expectation parameters and a common variance parameter is identical to the distribution of an
𝑛-dimensional random vector distributed according to a multivariate Gaussian distribution with an
expectation vector given by the concatenation of the individual univariate expectation parameters and a
spherical covariance matrix resulting from the multiplication of the 𝑛 × 𝑛 identity matrix with the common
variance parameter of the univariate Gaussian random variables. This result is fundamental because it
allows the assumption of independently and identically distributed error terms for the observations in
a general linear model to be concisely expressed by a single distribution, and all classical and Bayesian
inference schemes for the general linear model have tight connections to this basic property.
To express the above more formally, first recall that the notion of stochastic independence
intuitively corresponds to the idea that the value of, say, 𝑥𝑖 has no influence on the value of, say, 𝑥𝑗
(1 ≤ 𝑖, 𝑗 ≤ 𝑛, 𝑖 ≠ 𝑗). Above we have seen that stochastic independence is mathematically expressed (or
modelled) by noting that the joint probability of 𝑥𝑖 = 𝑎 and 𝑥𝑗 = 𝑏 (for 𝑖 ≠ 𝑗) can be evaluated as the
product of the probability of 𝑥𝑖 = 𝑎 and the probability of 𝑥𝑗 = 𝑏:
𝑝(𝑥𝑖 = 𝑎, 𝑥𝑗 = 𝑏) = 𝑝(𝑥𝑖 = 𝑎)𝑝(𝑥𝑗 = 𝑏) (1)
The central theorem of this section now claims that the joint distribution of
a) 𝑛 marginal variables 𝑥𝑖 (𝑖 = 1, … , 𝑛) that are independently distributed according to univariate normal
distributions with expectation parameters 𝜇𝑖 (𝑖 = 1, … , 𝑛) (where 𝜇1, 𝜇2, … , 𝜇𝑛 are in general not
identical) and variance parameters 𝜎1² = 𝜎2² = ⋯ = 𝜎𝑛² = 𝜎² (i.e., the variance parameter is the same
for all variables), and
b) an 𝑛-dimensional random vector 𝑥 = (𝑥1, … , 𝑥𝑛)^𝑇 that is distributed according to a multivariate
Gaussian distribution with expectation parameter vector
𝜇 ≔ (𝜇1, … , 𝜇𝑛)^𝑇 (2)
where the entries correspond to the 𝜇𝑖’s of a), and spherical covariance matrix
Σ ≔ 𝜎²𝐼𝑛 ∈ ℝ^(𝑛×𝑛) (3)
where 𝜎² corresponds to the variance parameter of a),
are identical. From a sampling perspective, this may be rephrased as follows: it does not matter whether one
samples 𝑛 values from independent, univariate normally distributed variables with individual expectation
parameters 𝜇𝑖 (𝑖 = 1, … , 𝑛) and common variance parameter 𝜎² “one after the other”, or simultaneously
samples an 𝑛-dimensional vector of variables 𝑥 = (𝑥1, … , 𝑥𝑛)^𝑇 from a multivariate Gaussian distribution
with expectation parameter 𝜇 ∈ ℝ^𝑛 with the entries 𝜇𝑖 (𝑖 = 1, … , 𝑛) and spherical covariance matrix
Σ ≔ 𝜎²𝐼𝑛 ∈ ℝ^(𝑛×𝑛). The distributions of the sampled values are the same.
Below we derive this insight formally. To this end, we show that the probability density functions
describing (a) 𝑛 independent univariate Gaussian random variables distributed as described above, and
describing (b) an 𝑛-dimensional Gaussian random vector with the parameters described above, are identical,
in short
∏_{𝑖=1}^{𝑛} 𝑁(𝑥𝑖; 𝜇𝑖, 𝜎²) = 𝑁(𝑥; 𝜇, 𝜎²𝐼𝑛) (4)
with the notation above.
Proof of (4)
Let 𝑥 ≔ (𝑥1, … , 𝑥𝑛)^𝑇 ∈ ℝ^𝑛 be an 𝑛-dimensional random vector, let 𝜇 ≔ (𝜇1, … , 𝜇𝑛)^𝑇 ∈ ℝ^𝑛 be an expectation parameter,
and let Σ ≔ 𝜎²𝐼𝑛 ∈ ℝ^(𝑛×𝑛), 𝑝. 𝑑., be a spherical covariance matrix. Then statement (4) claims that
∏_{𝑖=1}^{𝑛} 𝑁(𝑥𝑖; 𝜇𝑖, 𝜎²) = 𝑁(𝑥; 𝜇, 𝜎²𝐼𝑛) (4.1)
To see this, we consider the left-hand side of (4.1). The univariate Gaussian probability density functions forming the terms of the
product are given by
𝑁(𝑥𝑖; 𝜇𝑖, 𝜎²) = (1/√(2𝜋𝜎²)) exp(−(1/(2𝜎²))(𝑥𝑖 − 𝜇𝑖)²) (𝑖 = 1, … , 𝑛) (4.2)
For the product ∏_{𝑖=1}^{𝑛} 𝑁(𝑥𝑖; 𝜇𝑖, 𝜎²), we thus have
∏_{𝑖=1}^{𝑛} 𝑁(𝑥𝑖; 𝜇𝑖, 𝜎²) = (1/√(2𝜋𝜎²)) exp(−(1/(2𝜎²))(𝑥1 − 𝜇1)²) ⋅ (1/√(2𝜋𝜎²)) exp(−(1/(2𝜎²))(𝑥2 − 𝜇2)²) ⋅ … ⋅ (1/√(2𝜋𝜎²)) exp(−(1/(2𝜎²))(𝑥𝑛 − 𝜇𝑛)²) (4.3)
In (4.3), the factor 1/√(2𝜋𝜎²) occurs 𝑛 times, which we may summarize as
(1/√(2𝜋𝜎²))^𝑛 = ((2𝜋𝜎²)^(−1/2))^𝑛 = (2𝜋𝜎²)^(−𝑛/2) = (2𝜋)^(−𝑛/2) 𝜎^(−𝑛) (4.4)
Further, we can use the property
exp(𝑎) exp(𝑏) = exp (𝑎 + 𝑏) (4.5)
of the exponential function to write the product of exponentials in (4.3) more compactly as
exp(−(1/(2𝜎²))(𝑥1 − 𝜇1)²) ⋅ … ⋅ exp(−(1/(2𝜎²))(𝑥𝑛 − 𝜇𝑛)²) = exp(−∑_{𝑖=1}^{𝑛} (1/(2𝜎²))(𝑥𝑖 − 𝜇𝑖)²) = exp(−(1/(2𝜎²)) ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − 𝜇𝑖)²) (4.6)
These simplifications allow for re-expressing the left-hand side of (4) as
∏_{𝑖=1}^{𝑛} 𝑁(𝑥𝑖; 𝜇𝑖, 𝜎²) = (2𝜋)^(−𝑛/2) 𝜎^(−𝑛) exp(−(1/(2𝜎²)) ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − 𝜇𝑖)²) (4.7)
Next consider the right-hand side of (4). By definition, we have
𝑁(𝑥; 𝜇, 𝜎²𝐼𝑛) = (2𝜋)^(−𝑛/2) |𝜎²𝐼𝑛|^(−1/2) exp(−(1/2)(𝑥 − 𝜇)^𝑇 (𝜎²𝐼𝑛)^(−1) (𝑥 − 𝜇)) (4.8)
From linear algebra, we know that the determinant of a diagonal matrix, i.e. a matrix that has non-zero entries only on its main
diagonal, is given by the product of its diagonal elements. We thus have
|𝜎²𝐼𝑛|^(−1/2) = (∏_{𝑖=1}^{𝑛} 𝜎²)^(−1/2) = (𝜎²)^(−𝑛/2) = 𝜎^(−𝑛) (4.9)
Because the inverse of a diagonal matrix corresponds to the matrix with the multiplicative inverses of the diagonal entries along its
main diagonal, we further have
(𝜎²𝐼𝑛)^(−1) = (1/𝜎²) 𝐼𝑛 (4.10)
Finally,
(1/𝜎²)(𝑥 − 𝜇)^𝑇 𝐼𝑛 (𝑥 − 𝜇) = (1/𝜎²)(𝑥 − 𝜇)^𝑇 (𝑥 − 𝜇) = (1/𝜎²) ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − 𝜇𝑖)²
which is readily seen by considering the matrix product 𝐴^𝑇𝐴 of a vector 𝐴 ∈ ℝ^(𝑛×1) with itself. Summarizing the above, we thus have
𝑁(𝑥; 𝜇, 𝜎²𝐼𝑛) = (2𝜋)^(−𝑛/2) 𝜎^(−𝑛) exp(−(1/(2𝜎²)) ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − 𝜇𝑖)²) (4.11)
Comparing (4.7) and (4.11) now shows that indeed
𝑁(𝑥; 𝜇, 𝜎²𝐼𝑛) = ∏_{𝑖=1}^{𝑛} 𝑁(𝑥𝑖; 𝜇𝑖, 𝜎²) (4.12)
□
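The identity in (4) is easily checked numerically at an arbitrary point: the sketch below (Python/NumPy, with arbitrary illustrative parameters) evaluates both sides and finds them equal up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 5
mu = rng.normal(size=n)      # expectation parameters mu_1, ..., mu_n (arbitrary)
sigma2 = 0.8                 # common variance parameter (arbitrary)
x = rng.normal(size=n)       # an arbitrary evaluation point in R^n

# left-hand side of (4): product of n univariate Gaussian densities
lhs = np.prod(np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2))

# right-hand side of (4): n-dimensional Gaussian density with spherical covariance
Sigma = sigma2 * np.eye(n)
d = x - mu
rhs = (2 * np.pi) ** (-n / 2) * np.linalg.det(Sigma) ** -0.5 \
    * np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d)

print(lhs, rhs)  # the two values agree up to floating-point error
```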
(4) The linear transformation theorem for Gaussian distributions
The “linear transformation theorem for Gaussian distributions” is a fundamental result in the theory
of multivariate Gaussian distributions. Intuitively, it corresponds to the statement that the matrix product of
a multivariate Gaussian random variable 𝑥 with a given matrix 𝐴, plus a second multivariate
Gaussian random variable 𝜀, (1) yields a third multivariate Gaussian random variable 𝑦, and
that (2) the parameters of the distribution of 𝑦 can be evaluated based on the parameters of 𝑥 and 𝜀 and the
matrix 𝐴.
Formally, the transformation theorem states that if
𝑝(𝑥) = 𝑁(𝑥; 𝜇𝑥, Σ𝑥), where 𝑥, 𝜇𝑥 ∈ ℝ^𝑑, Σ𝑥 ∈ ℝ^(𝑑×𝑑) positive-definite, (1)
𝑝(𝜀) = 𝑁(𝜀; 𝜇𝜀, Σ𝜀), where 𝜀, 𝜇𝜀 ∈ ℝ^𝑑, Σ𝜀 ∈ ℝ^(𝑑×𝑑) positive-definite, (2)
the covariation of 𝑥 and 𝜀 is zero, i.e. ℂ(𝑥, 𝜀) = (ℂ(𝜀, 𝑥))^𝑇 = 0 ∈ ℝ^(𝑑×𝑑), and (3)
𝐴 ∈ ℝ^(𝑑×𝑑) is a matrix, (4)
then the random variable
𝑦 ≔ 𝐴𝑥 + 휀 (5)
is distributed according to a multivariate Gaussian distribution
𝑝(𝑦) = 𝑁(𝑦; 𝜇𝑦, Σy) where 𝑦 ∈ ℝ𝑑, 𝜇𝑦 ∈ ℝ𝑑 , Σ𝑦 ∈ ℝ
𝑑×𝑑 positive definite (6)
and specifically,
127
𝜇𝑦 = 𝐴𝜇𝑥 + 𝜇 and Σ𝑦 = 𝐴Σ𝑥𝐴𝑇 + Σ (7)
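A minimal Monte Carlo sketch of the theorem (all names, the sample size, and the tolerances are our choices): drawing many realizations of $x$ and $\varepsilon$, forming $y=Ax+\varepsilon$, and comparing the sample moments of $y$ with the parameters predicted by (7):

```python
import numpy as np

# Monte Carlo check of (7): for y = A x + eps with x, eps independent
# Gaussians, E(y) = A mu_x + mu_eps and Cov(y) = A Sigma_x A^T + Sigma_eps.
rng = np.random.default_rng(0)
n = 200_000

mu_x, Sigma_x = np.array([1.0, -1.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
mu_e, Sigma_e = np.array([0.5, 0.0]), np.array([[1.0, 0.0], [0.0, 0.3]])
A = np.array([[1.0, 2.0], [0.0, 1.0]])

x = rng.multivariate_normal(mu_x, Sigma_x, size=n)
eps = rng.multivariate_normal(mu_e, Sigma_e, size=n)
y = x @ A.T + eps                        # y_i = A x_i + eps_i per sample row

mu_y = A @ mu_x + mu_e                   # theoretical expectation of y
Sigma_y = A @ Sigma_x @ A.T + Sigma_e    # theoretical covariance of y

print(np.abs(y.mean(axis=0) - mu_y).max())   # close to 0
print(np.abs(np.cov(y.T) - Sigma_y).max())   # close to 0
```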
(5) The Gaussian joint and conditional distribution theorem
Joint distribution
Given a Gaussian marginal distribution

$$p(z)=N(z;\mu_z,\Sigma_z), \text{ where } z,\mu_z\in\mathbb{R}^p,\ \Sigma_z\in\mathbb{R}^{p\times p} \text{ p.d.} \qquad (1)$$

and a Gaussian conditional distribution

$$p(y|z)=N(y;Az,\Sigma_{y|z}), \text{ where } y\in\mathbb{R}^n,\ A\in\mathbb{R}^{n\times p},\ \Sigma_{y|z}\in\mathbb{R}^{n\times n} \text{ p.d.} \qquad (2)$$

the joint distribution of $y$ and $z$

$$p(y,z)=p(y|z)p(z) \qquad (3)$$

is given by

$$p(y,z)=N\left(\begin{pmatrix}y\\z\end{pmatrix};\mu_{y,z},\Sigma_{y,z}\right) \text{ with } (y,z)^T,\mu_{y,z}\in\mathbb{R}^{n+p},\ \Sigma_{y,z}\in\mathbb{R}^{(n+p)\times(n+p)} \qquad (4)$$

where

$$\mu_{y,z}=\begin{pmatrix}A\mu_z\\\mu_z\end{pmatrix} \quad\text{and}\quad \Sigma_{y,z}=\begin{pmatrix}\Sigma_{y|z}+A\Sigma_z A^T & A\Sigma_z\\ \Sigma_z A^T & \Sigma_z\end{pmatrix}. \qquad (5)$$
Conditional and marginal distributions
The conditional distribution

$$p(z|y)=\frac{p(y,z)}{p(y)} \qquad (6)$$

is given by

$$p(z|y)=N(z;\mu_{z|y},\Sigma_{z|y}) \qquad (7)$$

where

$$\mu_{z|y}=\Sigma_{z|y}\left(\Sigma_z^{-1}\mu_z+A^T\Sigma_{y|z}^{-1}y\right)\in\mathbb{R}^p \quad\text{and}\quad \Sigma_{z|y}=\left(\Sigma_z^{-1}+A^T\Sigma_{y|z}^{-1}A\right)^{-1}\in\mathbb{R}^{p\times p} \qquad (8)$$

and the marginal distribution

$$p(y)=\int p(y,z)\,dz \qquad (9)$$

is given by

$$p(y)=N(y;\mu_y,\Sigma_y), \text{ where } \mu_y=A\mu_z\in\mathbb{R}^n \text{ and } \Sigma_y=\Sigma_{y|z}+A\Sigma_z A^T\in\mathbb{R}^{n\times n} \qquad (10)$$
□
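The "precision form" of the conditional parameters in (8) can be cross-checked against the equivalent "covariance form" obtained by standard Gaussian conditioning on the joint distribution (5); the two forms agree by the matrix inversion lemma. A sketch with arbitrary example parameters (all names are ours):

```python
import numpy as np

# Cross-check of (8): the posterior parameters in precision form agree with
# standard Gaussian conditioning applied to the joint covariance blocks of (5).
rng = np.random.default_rng(1)
p_dim, n_dim = 2, 3

A = rng.standard_normal((n_dim, p_dim))
mu_z = rng.standard_normal(p_dim)
Sigma_z = np.eye(p_dim) * 1.5
Sigma_yz = np.eye(n_dim) * 0.5           # Sigma_{y|z}
y = rng.standard_normal(n_dim)

# precision form, equation (8)
Sigma_post = np.linalg.inv(np.linalg.inv(Sigma_z) + A.T @ np.linalg.inv(Sigma_yz) @ A)
mu_post = Sigma_post @ (np.linalg.inv(Sigma_z) @ mu_z + A.T @ np.linalg.inv(Sigma_yz) @ y)

# covariance form via the joint covariance blocks of (5)
Sigma_y = Sigma_yz + A @ Sigma_z @ A.T
K = Sigma_z @ A.T @ np.linalg.inv(Sigma_y)
mu_post2 = mu_z + K @ (y - A @ mu_z)
Sigma_post2 = Sigma_z - K @ A @ Sigma_z

print(np.allclose(mu_post, mu_post2), np.allclose(Sigma_post, Sigma_post2))
```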
As an example of the theorem, we consider the special case that $y,z\in\mathbb{R}$, i.e., that the joint distribution $p(y,z)$ is a bivariate Gaussian distribution. We thus have the marginal distribution

$$p(z)=N(z;\mu_z,\sigma_z^2), \text{ where } z,\mu_z\in\mathbb{R},\ \sigma_z^2>0 \qquad (11)$$

and the conditional distribution

$$p(y|z)=N(y;az,\sigma_{y|z}^2), \text{ where } y,a\in\mathbb{R},\ \sigma_{y|z}^2>0 \qquad (12)$$

From (5), we thus have the bivariate Gaussian joint distribution

$$p(y,z)=N\left(\begin{pmatrix}y\\z\end{pmatrix};\mu_{y,z},\Sigma_{y,z}\right) \text{ with } \mu_{y,z}=\begin{pmatrix}a\mu_z\\\mu_z\end{pmatrix},\ \Sigma_{y,z}=\begin{pmatrix}\sigma_{y|z}^2+a^2\sigma_z^2 & a\sigma_z^2\\ a\sigma_z^2 & \sigma_z^2\end{pmatrix} \qquad (13)$$

the conditional distribution

$$p(z|y)=N(z;\mu_{z|y},\sigma_{z|y}^2) \qquad (14)$$

where

$$\sigma_{z|y}^2=\left((\sigma_z^2)^{-1}+a^2(\sigma_{y|z}^2)^{-1}\right)^{-1}>0 \quad\text{and}\quad \mu_{z|y}=\sigma_{z|y}^2\left((\sigma_z^2)^{-1}\mu_z+a(\sigma_{y|z}^2)^{-1}y\right)\in\mathbb{R} \qquad (15)$$

and the marginal distribution

$$p(y)=N(y;\mu_y,\sigma_y^2), \text{ where } \mu_y=a\mu_z\in\mathbb{R} \text{ and } \sigma_y^2=\sigma_{y|z}^2+a^2\sigma_z^2>0 \qquad (16)$$
Figure 1 below depicts Bayesian inference for the expectation parameter of a univariate Gaussian based on a single observation with a “tight”, i.e. low-variance, prior (upper panels) and a “loose”, i.e. high-variance, prior (lower panels) for the unobserved variable $z$.
Figure 1. Bayesian inference for a univariate Gaussian distribution based on a single observation and with known variance $\sigma_{y|z}^2$. Note that for a narrow prior distribution over the latent variable, the posterior distribution over the latent variable is dominated by the prior, while it is dominated by the data for a wide prior.
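The effect depicted in Figure 1 can be reproduced numerically from equation (15), here with $a=1$ (the helper `posterior` and the parameter values are illustrative choices):

```python
import numpy as np

# Posterior parameters of z given a single observation y, per equation (15).
# A "tight" prior pulls the posterior mean toward the prior mean; a "loose"
# prior lets the observation dominate.
def posterior(mu_z, s2_z, y, s2_yz, a=1.0):
    s2_post = 1.0 / (1.0 / s2_z + a ** 2 / s2_yz)
    mu_post = s2_post * (mu_z / s2_z + a * y / s2_yz)
    return mu_post, s2_post

y_obs, s2_yz = 2.0, 1.0
mu_tight, _ = posterior(mu_z=0.0, s2_z=0.1, y=y_obs, s2_yz=s2_yz)
mu_loose, _ = posterior(mu_z=0.0, s2_z=10.0, y=y_obs, s2_yz=s2_yz)
print(mu_tight, mu_loose)   # tight prior: near prior mean 0; loose: near y_obs
```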
Study Questions
1. Write down the probability density function of a multivariate Gaussian distribution and provide explanations for its components.
2. Assume you know the following covariance matrix of a bivariate Gaussian distribution

$$\Sigma=\begin{pmatrix}\sigma_{11}^2 & \sigma_{12}^2\\ \sigma_{21}^2 & \sigma_{22}^2\end{pmatrix}:=\begin{pmatrix}1 & -0.5\\ -0.5 & 2\end{pmatrix}$$
What can you say about the variances of the marginal variables and their correlation?
Study Questions Answers
1. The probability density function of a multivariate Gaussian distribution of an $n$-dimensional random vector $x$ is given by

$$p(x)=N(x;\mu,\Sigma)=(2\pi)^{-\frac{n}{2}}|\Sigma|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$

Its core components are an expectation parameter vector $\mu\in\mathbb{R}^n$, which defines where the realizations of the random vector $x$ are centered in $\mathbb{R}^n$, and a symmetric positive-definite covariance matrix parameter $\Sigma\in\mathbb{R}^{n\times n}$, which specifies on its diagonal how much the realizations of $x$ vary about $\mu$ in each dimension, and on its off-diagonal elements how much the components of $x$ covary.
2. From the diagonal elements of $\Sigma$ we can conclude that the first component $x_1$ of the bivariate random vector $x=(x_1,x_2)$ has a smaller variance than the second component. From the off-diagonal elements we can infer that $x_1$ and $x_2$ are negatively correlated, i.e., if $x_1$ takes on a high value, $x_2$ tends to take on a low value, and vice versa.
Information Theory
Information theory can be viewed as a collection of information theoretic quantities. Well known
examples of information theoretic quantities are information entropy and mutual information. The common
and defining feature of information theoretic quantities is that they are functions that map probability
distributions onto scalar numbers. Because probability distributions (or more precisely, probability density
functions and probability mass functions) are themselves functions, information theoretic quantities are also
referred to as functionals (functions of functions). We will highlight this special mathematical nature of information theoretic quantities by using calligraphic symbols for them, for example $\mathcal{F}$ instead of $F$. In this Section, we briefly review two central information theoretic quantities in the context of parametric Bayesian inference: entropy and the Kullback-Leibler divergence. We conceive these quantities merely as functionals of probability density functions and make no attempt to relate them to their origins in statistical physics. Also, in contrast to common introductions to information theory, we consider these quantities only for probability density functions. In basic information theoretic terms, the quantities we are concerned with here may thus be referred to as “differential entropy” and “differential Kullback-Leibler divergence”.
(1) Entropy
The differential entropy of a random variable $x$ governed by a probability density function $p(x)$ is defined as

$$\mathcal{H}(p(x)):=-\int p(x)\ln p(x)\,dx \qquad (1)$$
The entropy of a random variable is a measure of its variability and becomes maximal for a uniformly
distributed random variable. Its value is independent of the actual values taken on by 𝑥, and only depends
on the probability density function of 𝑥. This distinguishes entropy from other measures of random variable
variability: for example, the variance of a random variable is given by the expectation of the squared deviation of the values the random variable can take on from the expectation of the random variable. It is
important to note that if the probability density function of a random variable 𝑥 is of known functional form,
the entropy of the random variable can be evaluated analytically. Without proof, we note that for a 𝑑-
dimensional random vector 𝑥 ∈ ℝ𝑑 governed by the 𝑑-dimensional Gaussian distribution 𝑁(𝑥; 𝜇, Σ), the
entropy integral evaluates to
$$\mathcal{H}(N(x;\mu,\Sigma))=\frac{1}{2}\ln|\Sigma|+\frac{d}{2}(1+\ln 2\pi)=\frac{1}{2}\ln\left((2\pi e)^d|\Sigma|\right) \qquad (2)$$

which, for a univariate Gaussian probability density function, simplifies to

$$\mathcal{H}(N(x;\mu,\sigma^2))=\frac{1}{2}\ln(2\pi e\sigma^2) \qquad (3)$$
Notably, the entropy of a Gaussian random variable is thus monotonically increasing with its variance
parameter as shown in Figure 1.
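Equation (3) can be verified by numerical integration of the defining integral (1) for a Gaussian density (the grid width and resolution are arbitrary choices):

```python
import numpy as np

# Numerical check of (3): the differential entropy of N(x; mu, s2) obtained
# by integrating -p ln p over a wide grid matches (1/2) ln(2*pi*e*s2).
mu, s2 = 0.0, 2.5
x = np.linspace(mu - 20 * np.sqrt(s2), mu + 20 * np.sqrt(s2), 200_001)
p = np.exp(-(x - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

integrand = -p * np.log(p)
H_numeric = np.sum(integrand) * (x[1] - x[0])   # simple Riemann sum
H_closed = 0.5 * np.log(2 * np.pi * np.e * s2)
print(H_numeric, H_closed)   # the two values agree closely
```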
(2) Kullback-Leibler Divergence
The Kullback-Leibler divergence of two probability distributions $q(x)$ and $p(x)$ is defined as

$$\mathcal{KL}(q(x)\,||\,p(x)):=\int q(x)\ln\left(\frac{q(x)}{p(x)}\right)dx \qquad (4)$$
Figure 1. Entropy of Gaussian probability density functions. Panels A and B relate to the entropy of univariate Gaussian distributions, panels C and D relate to the entropy of bivariate Gaussian distributions. Specifically, Panel A depicts Gaussian probability density functions with expectation parameter $\mu=0$ and three variance parameter settings. Panel B depicts the differential entropy of Gaussian probability density functions as a function of the variance parameter. The entropies of the three Gaussian densities shown in Panel A are indicated using colored markers. Entropy increases nonlinearly as a function of variance. Panel C depicts two bivariate Gaussian densities with expectation parameter $\mu=(0,0)^T$ and covariance matrix $\Sigma=\begin{pmatrix}1&\rho\\\rho&1\end{pmatrix}$. Larger off-diagonal elements increase the correlation between the marginal variables $x_1$ and $x_2$ and reduce the entropy of bivariate Gaussians nonlinearly, as shown in Panel D. The markers depict the entropies of the bivariate Gaussians shown in Panel C.
Intuitively, the Kullback-Leibler divergence is a measure of the “distance” between two probability density functions (it is, however, not a metric on the space of probability density functions). Without proof, we note that the KL divergence between two Gaussian probability density functions on a $d$-dimensional random vector $x\in\mathbb{R}^d$, $N(x;\mu_q,\Sigma_q)$ and $N(x;\mu_p,\Sigma_p)$, is given by
$$\mathcal{KL}\left(N(x;\mu_q,\Sigma_q)\,||\,N(x;\mu_p,\Sigma_p)\right)=\frac{1}{2}\left(\ln\frac{|\Sigma_p|}{|\Sigma_q|}+\mathrm{tr}\left(\Sigma_p^{-1}\Sigma_q\right)+(\mu_q-\mu_p)^T\Sigma_p^{-1}(\mu_q-\mu_p)-d\right) \qquad (5)$$

where $\mathrm{tr}(A)=\sum_{i=1}^{d}a_{ii}$ denotes the trace of a matrix $A\in\mathbb{R}^{d\times d}$. For the special univariate case $x\in\mathbb{R}$, the above simplifies to

$$\mathcal{KL}\left(N(x;\mu_q,\sigma_q^2)\,||\,N(x;\mu_p,\sigma_p^2)\right)=\frac{1}{2}\left(\ln\frac{\sigma_p^2}{\sigma_q^2}+\frac{\sigma_q^2}{\sigma_p^2}+\frac{1}{\sigma_p^2}(\mu_q-\mu_p)^2-1\right) \qquad (6)$$
The KL divergence between two univariate Gaussian probability density functions is thus a function of the squared distance of their expectation parameters and the ratio of their variance parameters (Figure 2). Note that for $\sigma_q^2=\sigma_p^2$ and $\mu_q=\mu_p$, (6) evaluates to zero.
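Equation (6) can likewise be checked numerically against the defining integral (4) (the helper `kl_gauss` and the parameter values are our choices):

```python
import numpy as np

# Equation (6): closed-form KL divergence between two univariate Gaussians.
def kl_gauss(mu_q, s2_q, mu_p, s2_p):
    return 0.5 * (np.log(s2_p / s2_q) + s2_q / s2_p
                  + (mu_q - mu_p) ** 2 / s2_p - 1.0)

kl_same = kl_gauss(0.0, 1.0, 0.0, 1.0)   # identical densities
kl_diff = kl_gauss(0.0, 1.0, 2.0, 1.5)   # shifted, wider reference density

# cross-check against numerical integration of the definition (4)
x = np.linspace(-15, 15, 300_001)
q = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
p = np.exp(-(x - 2.0) ** 2 / 3.0) / np.sqrt(2 * np.pi * 1.5)
kl_numeric = np.sum(q * np.log(q / p)) * (x[1] - x[0])
print(kl_same, kl_diff, kl_numeric)   # kl_same is 0; kl_diff matches kl_numeric
```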
Figure 2. Kullback-Leibler divergences for univariate Gaussians. Panel A depicts the KL divergence between a reference univariate Gaussian density with parameters $\mu_q=0$ and $\sigma_q^2=1$ and a “test” Gaussian density with parameters $\mu_p$ and $\sigma_p^2$ as a function of the test Gaussian density’s parameters. The colored markers refer to specific cases of the test Gaussian parameters: the red marker refers to the parameter setting in the upper subpanel of Panel B, the green marker refers to the center subpanel parameter settings of Panel B, and the blue marker refers to the lowermost subpanel parameter settings of Panel B. Note that the KL divergence between two univariate Gaussians is a nonlinear function of the differences in expectation and variance parameters.
Two properties of the KL divergence are central. Firstly, the KL divergence is non-negative for all choices of probability densities $q(x)$ and $p(x)$,

$$\mathcal{KL}(q(x)\,||\,p(x))\ge 0 \text{ for all } q(x),p(x) \qquad (7)$$

and secondly, it is zero if and only if the two densities are identical, i.e. $q(x)=p(x)$,

$$\mathcal{KL}(q(x)\,||\,p(x))=0 \text{ for } q(x)=p(x) \qquad (8)$$
Proof of (7) and (8)
We follow (Bishop, 2007) (pp. 55-56) to show why (7) and (8) hold. Two aspects are central: the negative logarithm is a convex function, and for convex functions Jensen’s inequality applies. Recall that convex functions are defined by the property that every straight line connecting two points on the function’s graph lies above it, or formally: for $f:[x_1,x_2]\subset\mathbb{R}\to\mathbb{R}$ and $q\in[0,1]$,

$$f(qx_1+(1-q)x_2)\le qf(x_1)+(1-q)f(x_2) \qquad (8.1)$$

does hold. Intuitively, (8.1) can be extended to more than two points $x_i,\ i=1,\ldots,n$ with $q_i\ge 0$ and $\sum_{i=1}^{n}q_i=1$ in the form

$$f\left(\sum_{i=1}^{n}q_i x_i\right)\le \sum_{i=1}^{n}q_i f(x_i) \qquad (8.2)$$

(8.2) can in turn, intuitively, be extended to a continuum of points $x$ and associated values $q(x)$, where $q(x)\ge 0$ and $\int q(x)\,dx=1$, as

$$f\left(\int q(x)x\,dx\right)\le \int q(x)f(x)\,dx \qquad (8.3)$$

From a probabilistic viewpoint, $\int q(x)x\,dx$ corresponds to the expectation of $x$ under $q(x)$ and $\int q(x)f(x)\,dx$ to the expectation of $f(x)$ under $q(x)$, i.e., for convex $f$ we have

$$\mathbb{E}_{q(x)}(f(x))\ge f\left(\mathbb{E}_{q(x)}(x)\right) \qquad (8.4)$$
The results (8.2)-(8.4) are known as Jensen’s inequality. Noting from real analysis that the logarithm is a concave function, and $f:=-\ln$ hence a convex function, we thus have for the KL divergence as defined in equation (4):

$$\mathcal{KL}(q(x)\,||\,p(x)):=\int q(x)\ln\left(\frac{q(x)}{p(x)}\right)dx=-\int q(x)\ln\left(\frac{p(x)}{q(x)}\right)dx\ge-\ln\int q(x)\frac{p(x)}{q(x)}\,dx=-\ln 1=0 \qquad (8.5)$$

Also note that the KL divergence $\mathcal{KL}(q(x)\,||\,p(x))$ vanishes for $p(x):=q(x)$, because in this case the logarithmic term in the integral evaluates to zero for all $x$.
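Jensen's inequality (8.4) for the convex function $f:=-\ln$ can be illustrated with samples from an arbitrary density on the positive reals (the sampling distribution and sample size are arbitrary choices):

```python
import numpy as np

# Numerical illustration of Jensen's inequality (8.4) for f = -ln:
# E_q(-ln(x)) >= -ln(E_q(x)) for a positive random variable x.
rng = np.random.default_rng(2)
x = rng.uniform(0.1, 5.0, size=100_000)   # samples from an arbitrary q(x)

lhs = np.mean(-np.log(x))                 # E_q(f(x))
rhs = -np.log(np.mean(x))                 # f(E_q(x))
print(lhs >= rhs)                         # True
```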
Principles of Probabilistic Inference
(1) Maximum Likelihood Estimation
The maximum likelihood method is a fairly general principle for deriving estimators in probabilistic models that was popularized by Ronald A. Fisher between 1912 and 1922, but found application already in the works of Pierre-Simon Laplace (1749-1827) and Carl Friedrich Gauss (1777-1855). It is based on the following intuition: in the context of a probabilistic model, the most likely parameter value to underlie an observed data set is the parameter value for which the probability of the data under the model is maximal.
To render this intuition more precise, we first introduce the so-called “likelihood function”. Consider a probabilistic model which specifies the probability of data $y$ based on a family of parameterized probability density (or mass) functions $p(y;\theta)$, where $\theta\in\Theta$ denotes the model's parameter and $\Theta$ denotes the model's parameter space. Then the function

$$L:\Theta\times\mathbb{R}^n\to\mathbb{R}_+,\ (\theta,y)\mapsto L(\theta,y):=p(y;\theta) \qquad (1)$$

where $\theta\in\mathbb{R}^k$ and $y\in\mathbb{R}^n$, is called the likelihood function of the parameter $\theta$ for the observation $y$. The important thing to note about the likelihood function is that it is (primarily) viewed as a function of the parameter value $\theta$, while, in a concrete case, the value of $y$ is fixed. This contrasts with the notion of a probability density function $p(y;\theta)$, which is a function of the random variable $y$. Less formally, this may be expressed as follows: the input argument of a probability density (or mass) function is the value of a random variable, and its output is the probability density (or mass) for this value given a fixed parameter value; the input argument of a likelihood function, on the other hand, is a parameter value, and its output is the probability density (or mass) for a fixed value of the random variable given this parameter value. If the random variable value and parameter value submitted to a probability density (or mass) function and its corresponding likelihood function are identical, so is the output of both functions.
The maximum likelihood estimator for a given probabilistic model $p(y;\theta)$ is that value $\hat\theta_{ML}$ of $\theta$ which maximizes the likelihood function. Formally, this can be expressed as

$$\hat\theta_{ML}:=\mathrm{argmax}_{\theta\in\Theta}\,L(\theta,y) \qquad (2)$$

(2) should be read as “$\hat\theta_{ML}$ is that argument of the likelihood function $L$ for which $L(\theta,y)$ assumes its maximal value over all possible parameter values $\theta$ in the parameter space $\Theta$”. A standard approach to find “closed-form” or analytical expressions for ML estimators is to maximize the likelihood function with respect to $\theta$ by means of the analytical determination of critical values at which its first derivative (or gradient for multivariate $\theta$) vanishes. We will see examples of this approach below. Another approach, often encountered in practical numerical computing, is to automatically shift the values of $\theta$ around while monitoring the value of the likelihood function and to stop once this value is considered to be maximal based on some sensible stopping criterion. For now, we concern ourselves with the analytical approach. Candidate values for the ML estimator $\hat\theta_{ML}$ fulfill the following requirement:

$$\frac{\partial}{\partial\theta_i}L(\theta,y)\Big|_{\theta=\hat\theta_{ML}}=0 \quad (i=1,\ldots,k) \qquad (3)$$
(3) should be read as “at the location of the ML estimator $\hat\theta_{ML}$ ($|_{\theta=\hat\theta_{ML}}$), the partial derivatives $\frac{\partial}{\partial\theta_i}L$ of the likelihood function with respect to the entries of the parameter vector $\theta:=(\theta_1,\ldots,\theta_k)^T$ vanish, i.e. they are equal to zero”, or alternatively, “at the location of the ML estimator $\hat\theta_{ML}$, the gradient of $L$ with respect to $\theta$ is equal to the zero vector”. This condition merely corresponds to the necessary condition for function minima or maxima, which states that at the location of a minimum or maximum, the slope of the function is zero. By evaluating the derivatives and setting them to zero, one obtains a set of equations which can be solved for the ML estimator.
To simplify this approach, one usually considers the logarithm of the likelihood function, the so-called “log likelihood function”. The log likelihood function is defined as

$$\ell:\Theta\times\mathbb{R}^n\to\mathbb{R},\ (\theta,y)\mapsto \ell(\theta,y):=\ln L(\theta,y)=\ln p(y;\theta) \qquad (4)$$

The use of the logarithm in the context of ML estimation is pragmatic: first, one often considers probability density functions which have an exponential term, which is simplified by a log transform. Second, one often assumes independent observations, under which the factorization of the probability density function is rendered a summation under the log transform. Finally, the logarithm is a monotonically increasing function, which implies that the location in parameter space at which the likelihood function assumes its maximal value corresponds to the location in parameter space at which the log likelihood function assumes its maximal value. In analogy to (3), the “log likelihood equation” for the ML estimator is given as

$$\frac{\partial}{\partial\theta_i}\ell(\theta,y)\Big|_{\theta=\hat\theta_{ML}}=0 \quad (i=1,\ldots,k) \qquad (5)$$

which, like (3), can be solved for $\hat\theta_{ML}$.
Before we demonstrate the idea of ML estimation in a first example, we note two assumptions that simplify the application of the ML method considerably: firstly, the assumption of a concave log likelihood function, and secondly, the assumption of independent observables.

If the log likelihood function is “concave”, then the necessary condition for a maximum of the likelihood function is also sufficient. A multivariate real-valued function $f:\mathbb{R}^n\to\mathbb{R}$ is referred to as “concave” if for all input arguments $x,y\in\mathbb{R}^n$ the straight line connecting $f(x)$ and $f(y)$ lies below the function's graph (see Figure 1). Formally, this may be expressed by the inequality

$$f(tx+(1-t)y)\ge tf(x)+(1-t)f(y) \quad (x,y\in\mathbb{R}^n,\ t\in[0,1]) \qquad (6)$$

Note that $tx+(1-t)y$ $(t\in[0,1])$ describes a straight line in the domain of the function, while $tf(x)+(1-t)f(y)$ $(t\in[0,1])$ describes a straight line in the range of the function. Leaving mathematical subtleties aside, it is roughly correct that concave functions have a single maximum, or in other words, that a critical point at which the gradient vanishes is guaranteed to be a maximum. In other words, if the log likelihood function is concave, finding a parameter value at which the log likelihood equation of vanishing partial derivatives holds is sufficient to know that there is indeed a maximum at this location. In principle, for every log likelihood function that we discuss below, we would have to show that it indeed fulfills condition (6) and thus that a maximum can be found by merely setting its gradient to zero and solving for the critical point. However, because this goes beyond the formalism strived for in PMFN, we content ourselves with noting without proof that the log likelihood functions we will encounter are all concave.
We now consider the assumption of independent observed variables. If the observed variables $y:=(y_1,\ldots,y_n)$ are stochastically independent and each variable is governed by a probability density function parameterized by the same parameter vector $\theta$, i.e. it has the probability density function $p(y_i;\theta)$ for $i=1,\ldots,n$, then the joint probability density function is given as the product of the individual probability density functions

$$p(y;\theta)=p(y_1,\ldots,y_n;\theta)=p(y_1;\theta)\cdot p(y_2;\theta)\cdot\ldots\cdot p(y_n;\theta)=\prod_{i=1}^{n}p(y_i;\theta) \qquad (7)$$

This may be viewed from two angles: one may either conceive the $y_i$ $(i=1,\ldots,n)$ to be governed by one and the same underlying probability distribution, from which one can “sample with replacement”, or one may conceive each $y_i$ $(i=1,\ldots,n)$ to be governed by its individual probability density function, which, however, is the same for all $i=1,\ldots,n$. For the purposes of PMFN, these two stances are equivalent, while the latter feels somewhat closer to the formal developments below. In the case of independent observed variables $y_1,\ldots,y_n$, the log likelihood function is given by

$$\ell(\theta,y)=\ln p(y;\theta)=\ln\left(\prod_{i=1}^{n}p(y_i;\theta)\right) \qquad (8)$$

Repeated application of the “product property” of the logarithm, i.e. the fact that for $a,b\in\mathbb{R}_+$ we have $\ln(ab)=\ln a+\ln b$, then allows for writing the right-hand side of (8) as

$$\ln\left(\prod_{i=1}^{n}p(y_i;\theta)\right)=\sum_{i=1}^{n}\ln p(y_i;\theta) \qquad (9)$$

In other words, the evaluation of the logarithm of a product of probability density functions $p(y_i;\theta)$ is simplified to the summation over the logarithms of the individual probability density functions $p(y_i;\theta)$ $(i=1,\ldots,n)$.
The developments above, together with the assumptions of independent observed variables and the concavity of the log likelihood function, suggest the following three-step procedure for the analytical derivation of maximum likelihood estimators in a given probabilistic model scenario.

1. Formulation of the log likelihood function. This step corresponds to (a) writing down the probability of a data sample under the probabilistic model, i.e., formulating the likelihood function, where special attention has to be paid to the number of observed variables considered and their independence properties, and (b) taking the logarithm.

2. Evaluation of the log likelihood function gradient. Usually, probabilistic models of interest have more than one parameter, and maximum likelihood estimators for each parameter are required. To this end, the partial derivatives of the log likelihood function with respect to the parameters have to be evaluated, which is usually eased by the use of the log likelihood function in the case of probability density functions belonging to the exponential family of distributions, and by the assumption of observed variable independence, resulting in sums rather than products under the logarithm.

3. Equating the log likelihood function gradient with zero and solving for a critical value of the parameters, which, under the assumption of a concave log likelihood function, corresponds to the location of a maximum of the log likelihood function in parameter space. The parameter value obtained, which is usually a function of the observed variables, then corresponds to a maximum likelihood estimator for the given parameter of the probabilistic model under consideration.
(2) Maximum likelihood estimation of the parameters of a univariate Gaussian distribution
To obtain an intuition of how the maximum likelihood estimation procedure outlined above works in practice, we consider the case of obtaining $n\in\mathbb{N}$ independent and identically distributed observations $y_1,\ldots,y_n$ from a univariate Gaussian distribution with parameters $\mu\in\mathbb{R}$ and $\sigma^2>0$. Our aim is to derive maximum likelihood estimators $\hat\mu_{ML}$ and $\hat\sigma^2_{ML}$ for $\mu$ and $\sigma^2$.

Equivalently, we may view this example as a “one-sample” GLM of the form

$$p(y):=N(y;X\beta,\sigma^2 I_n) \qquad (1)$$

where $y:=(y_1,\ldots,y_n)\in\mathbb{R}^n$, $\sigma^2>0$, $\beta:=\mu\in\mathbb{R}$, and $X\in\mathbb{R}^{n\times 1}$ is given by

$$X:=\begin{pmatrix}1\\1\\\vdots\\1\end{pmatrix} \qquad (2)$$

i.e., the design matrix corresponds to a vector of $n$ ones, and the single parameter $\beta$ corresponds to the expectation parameter of $p(y_i)$. The $i$th variable $y_i$ $(i=1,\ldots,n)$ is distributed according to

$$p(y_i)=N(y_i;\mu,\sigma^2) \qquad (3)$$
and the covariances of variables 𝑦𝑖 and 𝑦𝑗 for 𝑖 ≠ 𝑗, 1 ≤ 𝑖, 𝑗 ≤ 𝑛 are assumed to be zero, corresponding to
the assumption of independent and identically distributed observations.
The first step in the application of the maximum likelihood principle is the determination of the log
likelihood function. In the current example, the 𝑖th random variable 𝑦𝑖 is distributed according to
𝑁(𝑦𝑖; 𝜇, 𝜎2) for all 𝑖 = 1,… , 𝑛, and the random variables 𝑦1, … , 𝑦𝑛 are assumed to be independent. The
probability density for the joint observation of 𝑦1, … , 𝑦𝑛 thus corresponds to the product of the probability
density for each individual observation, which we may write as
$$p(y_1,\ldots,y_n)=p(y_1)\cdot p(y_2)\cdot\ldots\cdot p(y_n) \qquad (4)$$

The individual probability density functions are given by the univariate Gaussian $N(y_i;\mu,\sigma^2)$ $(i=1,\ldots,n)$, and we thus have

$$p_{\mu,\sigma^2}(y)=p_{\mu,\sigma^2}(y_1,\ldots,y_n)=N(y_1;\mu,\sigma^2)\cdot N(y_2;\mu,\sigma^2)\cdot\ldots\cdot N(y_n;\mu,\sigma^2) \qquad (5)$$

In (5) we have made the dependence of the probability density function on the parameters $\mu$ and $\sigma^2$ explicit by means of subscripts. Because

$$N(y_i;\mu,\sigma^2):=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(y_i-\mu)^2\right) \quad (i=1,\ldots,n) \qquad (6)$$

we may re-express (5) as

$$p_{\mu,\sigma^2}(y)=(2\pi\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right) \qquad (7)$$
Proof of (7)
We consider the product

$$\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(y_1-\mu)^2\right)\cdot\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(y_2-\mu)^2\right)\cdots\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(y_n-\mu)^2\right) \qquad (7.1)$$

that comprises $n$ factors of the form $(2\pi\sigma^2)^{-\frac{1}{2}}$ and $n$ factors of the form $\exp\left(-\frac{1}{2\sigma^2}(y_i-\mu)^2\right)$. From the rules of exponentiation, we know that

$$\left((2\pi\sigma^2)^{-\frac{1}{2}}\right)^n=(2\pi\sigma^2)^{-\frac{1}{2}\cdot n}=(2\pi\sigma^2)^{-\frac{n}{2}} \qquad (7.2)$$

From the fundamental properties of the exponential function, we know that

$$\exp(a)\exp(b)=\exp(a+b) \qquad (7.3)$$

and thus a product of $n$ exponential factors is equivalent to the exponential of the sum over the exponential inputs

$$\prod_{i=1}^{n}\exp\left(-\frac{1}{2\sigma^2}(y_i-\mu)^2\right)=\exp\left(-\sum_{i=1}^{n}\frac{1}{2\sigma^2}(y_i-\mu)^2\right)=\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right) \qquad (7.4)$$

We thus found that

$$\prod_{i=1}^{n}N(y_i;\mu,\sigma^2)=(2\pi\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right) \qquad (7.5)$$

□
Based on the above, we may now formally write down the likelihood function for the case of $n$ independent and identically distributed observations $y=(y_1,\ldots,y_n)$

$$L:(\mathbb{R}\times\mathbb{R}_+\setminus\{0\})\times\mathbb{R}^n\to\mathbb{R}_+,\ ((\mu,\sigma^2),y)\mapsto L((\mu,\sigma^2),y):=p_{\mu,\sigma^2}(y)=(2\pi\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right) \qquad (8)$$

Note that in the current scenario the parameter space $\Theta$ is given by $\Theta:=\mathbb{R}\times\mathbb{R}_+\setminus\{0\}$ and the dimensionality of the parameter vector $\theta:=(\mu,\sigma^2)$ is given by $k=2$. The corresponding log likelihood function then evaluates to

$$\ell:(\mathbb{R}\times\mathbb{R}_+\setminus\{0\})\times\mathbb{R}^n\to\mathbb{R},\ ((\mu,\sigma^2),y)\mapsto \ell((\mu,\sigma^2),y)=-\frac{n}{2}\ln(2\pi)-\frac{n}{2}\ln\sigma^2-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2 \qquad (9)$$
Derivation of (9)
Equation (9) follows from basic properties of the logarithm function, specifically

$$\ln(ab)=\ln a+\ln b,\quad \ln a^b=b\ln a \quad\text{and}\quad \ln(\exp(a))=a \qquad (9.1)$$

We have

$$\ln\left((2\pi\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right)\right)=\ln(2\pi\sigma^2)^{-\frac{n}{2}}+\ln\left(\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right)\right) \qquad (9.2)$$

with the first of these properties,

$$\ln(2\pi\sigma^2)^{-\frac{n}{2}}+\ln\left(\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right)\right)=-\frac{n}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2 \qquad (9.3)$$

with the second and third of these properties, and finally

$$-\frac{n}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2=-\frac{n}{2}\ln(2\pi)-\frac{n}{2}\ln\sigma^2-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2 \qquad (9.4)$$

again with the first of these properties. □
The second step in the analytical derivation of ML estimators is the evaluation of the log likelihood equations, i.e. the gradient of the log likelihood function. In the current case, we have $\theta:=(\mu,\sigma^2)$. To identify critical points of the log likelihood function, we thus have to evaluate the two derivatives constituting the gradient of the log likelihood function

$$\nabla\ell((\mu,\sigma^2),y)=\begin{pmatrix}\frac{\partial}{\partial\mu}\ell((\mu,\sigma^2),y)\\[4pt]\frac{\partial}{\partial\sigma^2}\ell((\mu,\sigma^2),y)\end{pmatrix} \qquad (10)$$

and subsequently set the gradient (i.e., each partial derivative) to zero. For the partial derivative with respect to the expectation parameter $\mu$, we obtain

$$\frac{\partial}{\partial\mu}\ell((\mu,\sigma^2),y)=\frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i-\mu) \qquad (11)$$

and for the partial derivative with respect to the variance parameter $\sigma^2$, we obtain

$$\frac{\partial}{\partial\sigma^2}\ell((\mu,\sigma^2),y)=-\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i-\mu)^2 \qquad (12)$$
Derivation of (11) and (12)
We consider the derivative of $\ell((\mu,\sigma^2),y)$ with respect to $\mu$ first. Using the summation and chain rules of differential calculus, we obtain

$$\begin{aligned}\frac{\partial}{\partial\mu}\ell((\mu,\sigma^2),y)&=\frac{\partial}{\partial\mu}\left(-\frac{n}{2}\ln 2\pi-\frac{n}{2}\ln\sigma^2-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right)\\&=\frac{\partial}{\partial\mu}\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right)\\&=-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\frac{\partial}{\partial\mu}(y_i-\mu)^2\\&=-\frac{1}{2\sigma^2}\sum_{i=1}^{n}2(y_i-\mu)\frac{\partial}{\partial\mu}(-\mu)\\&=-\left(-\frac{2}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)\right)\\&=\frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i-\mu)\end{aligned} \qquad (11.1)$$

We next consider the partial derivative with respect to the variance parameter $\sigma^2$. Using the summation rule of differential calculus, the fact that $(\ln x)'=x^{-1}$, and the fact that $(x^{-1})'=-x^{-2}$, we obtain

$$\begin{aligned}\frac{\partial}{\partial\sigma^2}\ell((\mu,\sigma^2),y)&=\frac{\partial}{\partial\sigma^2}\left(-\frac{n}{2}\ln 2\pi-\frac{n}{2}\ln\sigma^2-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right)\\&=-\frac{n}{2}\frac{\partial}{\partial\sigma^2}(\ln\sigma^2)-\frac{\partial}{\partial\sigma^2}\left(\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right)\\&=-\frac{n}{2}\frac{1}{\sigma^2}-\frac{1}{2}\left(\frac{\partial}{\partial\sigma^2}(\sigma^2)^{-1}\right)\sum_{i=1}^{n}(y_i-\mu)^2\\&=-\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i-\mu)^2\end{aligned} \qquad (12.1)$$

□
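The analytical partial derivatives (11) and (12) can be verified against central finite differences of the log likelihood (9) at an arbitrary point (all names and parameter values are illustrative):

```python
import numpy as np

# Finite-difference check of the analytical gradient (11), (12) of the
# Gaussian log likelihood l((mu, s2), y), equation (9).
def loglik(mu, s2, y):
    n = y.size
    return (-n / 2 * np.log(2 * np.pi) - n / 2 * np.log(s2)
            - np.sum((y - mu) ** 2) / (2 * s2))

rng = np.random.default_rng(3)
y = rng.normal(1.0, 2.0, size=50)
mu, s2, h = 0.7, 1.3, 1e-6

# analytical partial derivatives, equations (11) and (12)
d_mu = np.sum(y - mu) / s2
d_s2 = -y.size / (2 * s2) + np.sum((y - mu) ** 2) / (2 * s2 ** 2)

# central finite differences
d_mu_num = (loglik(mu + h, s2, y) - loglik(mu - h, s2, y)) / (2 * h)
d_s2_num = (loglik(mu, s2 + h, y) - loglik(mu, s2 - h, y)) / (2 * h)
print(d_mu - d_mu_num, d_s2 - d_s2_num)   # both differences are tiny
```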
The log likelihood equations for the case of independent and identical sampling from the univariate Gaussian are thus given by

$$\frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i-\mu)=0 \qquad (13)$$

$$-\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i-\mu)^2=0 \qquad (14)$$

Notably, these log likelihood equations display a dependence between the maximum likelihood estimator for $\mu$ and that for $\sigma^2$, as both parameters appear in both equations. To solve the log likelihood equations for $\hat\mu_{ML}$ and $\hat\sigma^2_{ML}$, a standard approach is to first solve equation (13) for $\hat\mu_{ML}$ and then use this solution to solve equation (14) for $\hat\sigma^2_{ML}$. We obtain the following results

$$\hat\mu_{ML}=\frac{1}{n}\sum_{i=1}^{n}y_i \qquad (15)$$

$$\hat\sigma^2_{ML}=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat\mu_{ML})^2 \qquad (16)$$
Derivation of (15) and (16)
Equation (13) states that

$$\frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i-\hat\mu_{ML})=0 \qquad (15.1)$$

This implies that $\frac{1}{\sigma^2}$ equals zero or $\sum_{i=1}^{n}(y_i-\hat\mu_{ML})$ equals zero. Because, by definition, $\sigma^2>0$ and thus also $\frac{1}{\sigma^2}>0$, the equation can only hold if $\sum_{i=1}^{n}(y_i-\hat\mu_{ML})$ equals zero. We hence obtain for $\hat\mu_{ML}$

$$\sum_{i=1}^{n}(y_i-\hat\mu_{ML})=0 \Leftrightarrow \sum_{i=1}^{n}y_i-\sum_{i=1}^{n}\hat\mu_{ML}=0 \Leftrightarrow \sum_{i=1}^{n}y_i-n\hat\mu_{ML}=0 \Leftrightarrow n\hat\mu_{ML}=\sum_{i=1}^{n}y_i \Leftrightarrow \hat\mu_{ML}=\frac{1}{n}\sum_{i=1}^{n}y_i \qquad (15.2)$$

To find the maximum likelihood estimator for $\sigma^2$, we substitute the result above in equation (14) and solve for $\hat\sigma^2_{ML}$

$$-\frac{n}{2\hat\sigma^2_{ML}}+\frac{1}{2\hat\sigma^4_{ML}}\sum_{i=1}^{n}(y_i-\hat\mu_{ML})^2=0 \Leftrightarrow \frac{1}{2\hat\sigma^4_{ML}}\sum_{i=1}^{n}(y_i-\hat\mu_{ML})^2=\frac{n}{2\hat\sigma^2_{ML}} \Leftrightarrow \sum_{i=1}^{n}(y_i-\hat\mu_{ML})^2=\frac{2n\hat\sigma^4_{ML}}{2\hat\sigma^2_{ML}} \Leftrightarrow \hat\sigma^2_{ML}=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat\mu_{ML})^2 \qquad (16.1)$$

□
With equation (15), we thus derived the following result: the ML estimator for the expectation parameter $\mu$ of a univariate Gaussian based on $n$ independent and identically distributed observations $y_1,\ldots,y_n$ is given by

$$\hat\mu_{ML}=\frac{1}{n}\sum_{i=1}^{n}y_i \qquad (17)$$

Notably, the formula for this ML estimator corresponds to the well-known sample mean, usually denoted as

$$\bar y:=\frac{1}{n}\sum_{i=1}^{n}y_i \qquad (18)$$

for observations $y_1,\ldots,y_n$. Note that a sample mean can be computed irrespective of whether one assumes the data to correspond to independent and identically distributed samples from a univariate Gaussian. If one makes this assumption, however, in classical point estimation schemes, the sample mean corresponds to the best guess for the expectation parameter of the assumed underlying Gaussian.

With equation (16), on the other hand, we derived the following result: the maximum likelihood estimator for the variance parameter $\sigma^2$ of a univariate Gaussian based on $n$ independent and identically distributed observations and the ML estimator $\hat\mu_{ML}$ is given by

$$\hat\sigma^2_{ML}=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat\mu_{ML})^2 \qquad (19)$$

Note that the ML estimator for the variance does not correspond to the familiar sample variance, which is given by

$$s^2=\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar y)^2 \qquad (20)$$
In other words, if one computes the sample variance of any sample, one does not need to assume that the sample was generated by independent and identical sampling from a univariate Gaussian distribution, and even if one does, the result is not the maximum likelihood estimator for the variance parameter of the assumed underlying Gaussian. Using classical frequentist estimator quality theory, it can be shown that the maximum likelihood estimator for the variance parameter of a univariate Gaussian is not “ideal”, or, more formally, it is not “bias-free”. However, the notion of “biased estimators” and a principled method for deriving bias-free estimators using a modified maximum likelihood method, referred to as “restricted maximum likelihood”, is beyond the scope of the current section and is covered in more depth in the Section “Restricted Maximum Likelihood” of PMFN.
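The relation between the ML variance estimator (19) and the sample variance (20), namely $\hat\sigma^2_{ML}=\frac{n-1}{n}s^2$, follows directly from the formulas and can be confirmed numerically (data and names are illustrative choices):

```python
import numpy as np

# ML estimators (17) and (19) for a univariate Gaussian, compared with the
# sample variance (20): sigma2_ml equals ((n-1)/n) * s2.
rng = np.random.default_rng(4)
y = rng.normal(3.0, 1.5, size=100)
n = y.size

mu_ml = y.mean()                             # equation (17), the sample mean
s2_ml = np.sum((y - mu_ml) ** 2) / n         # equation (19), divisor n
s2 = np.sum((y - y.mean()) ** 2) / (n - 1)   # equation (20), divisor n - 1
print(s2_ml, s2, (n - 1) / n * s2)           # s2_ml equals ((n-1)/n) * s2
```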
(3) Numerical maximum likelihood estimation and Fisher-Scoring
Depending on the probabilistic model of interest, the likelihood and log likelihood functions may take arbitrarily complex forms that often render a direct analytical optimization as in the previous example impossible. In this case, numerical optimization offers an alternative. As discussed above, numerical optimization procedures initialize the input arguments of an objective function to some suitably chosen initial value and then iteratively modify this value such that the objective function approaches an extremal point. In the context of maximum likelihood estimation, the objective function corresponds to the log likelihood function, which is numerically optimized with respect to the model parameters. A popular approach for numerical maximum likelihood estimation is based on an adaptation of the multivariate Newton-Raphson approach, in which the Hessian matrix of the log likelihood function is approximated by its expected value under repeated sampling from the model. This expected Hessian is known as Fisher's information matrix, and the resulting algorithm is known as “Fisher-Scoring”.

Formally, for a parametric probabilistic model $p(y;\theta)$ with a $p$-dimensional parameter vector $\theta:=(\theta_1,\ldots,\theta_p)^T\in\Theta\subset\mathbb{R}^p$, we consider the problem of finding a maximizing value for its log likelihood function, which we denote by

$$\ell(\cdot;y):\Theta\subset\mathbb{R}^p\to\mathbb{R},\ \theta\mapsto\ell(\theta;y):=\ln p(y;\theta) \qquad (1)$$
In the context of parametric statistics, the gradient of the log likelihood function at a location 휃 ∈ Θ is
referred to as a “score vector” or a “score function”
S : Θ → ℝᵖ, θ ↦ S(θ) ≔ ∇ℓ(θ; y) = (∂ℓ(θ; y)/∂θ_1, ∂ℓ(θ; y)/∂θ_2, …, ∂ℓ(θ; y)/∂θ_p)ᵀ   (2)
Further, the negative Hessian of the log-likelihood function at a location 휃 ∈ Θ is known as “Fisher’s
Information Matrix” or simply as “Fisher Information”
I : Θ → ℝᵖˣᵖ, θ ↦ I(θ) ≔ −∇²ℓ(θ; y), i.e., the p × p matrix whose (i, j)-th entry is given by −∂²ℓ(θ; y)/(∂θ_i ∂θ_j) for i, j = 1, …, p   (3)
Note that for a fixed observation of the data 𝑦, both 𝑆 and 𝐼 are functions of the parameter 휃 ∈ Θ only.
However, because the data 𝑦 is a random variable the values of 𝑆(휃) and 𝐼(휃) can also be conceived as
realizations of random variables. One may thus form an expected value of the score vector and Fisher’s
information matrix under the distribution of the data, denoted by 𝐸(𝑆(휃)) and 𝐸(𝐼(휃)), respectively.
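To build intuition for these expectations, the following sketch (an illustrative assumption, not code from PMFN) checks by Monte Carlo simulation that, for the univariate Gaussian with known variance, the score of the expectation parameter averages to zero when evaluated at the true parameter value:

```python
import numpy as np

# Monte Carlo sketch (illustrative): for y_1, ..., y_n ~ N(mu, sigma^2) with known
# sigma^2, the score of the expectation parameter is S(mu) = sum_i (y_i - mu) / sigma^2.
# Averaging S(mu) over repeated samples from the model approximates E(S(mu)) = 0
# at the true parameter value.
rng = np.random.default_rng(1)
mu, sigma2, n = 1.0, 2.0, 10

scores = [
    (rng.normal(mu, np.sqrt(sigma2), n) - mu).sum() / sigma2
    for _ in range(100000)
]
mean_score = np.mean(scores)  # close to zero up to Monte Carlo error
```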
Initialization
0. Define a starting point θ^(0) ∈ Θ and set k ≔ 0. If S(θ^(0)) = ∇ℓ(θ^(0); y) = 0, stop! θ^(0) is a zero of ∇ℓ(⋅; y). If not, proceed to the iterations.
Until Convergence
1. Set
θ^(k+1) = θ^(k) + E(I(θ^(k)))⁻¹ S(θ^(k))
2. If S(θ^(k+1)) = 0, stop! θ^(k+1) is a zero of ∇ℓ(⋅; y). If not, go to 3.
3. Set 𝑘 ≔ 𝑘 + 1 and go to 1.
Table 1. The Fisher-Scoring method for numerical maximum likelihood estimation.
Based on the definitions above, we can now introduce the notion of Fisher-Scoring: If the function of
interest of the Newton-Raphson procedure corresponds to a log likelihood function (i.e. the aim of the
optimization is to find maximum likelihood parameter estimates), and the Hessian of the log-likelihood
function is replaced by the negative expected Fisher-Information (which is often easier to evaluate for the
current setting of the parameter 휃(𝑘)), then this procedure is referred to as “Fisher Scoring” or “Fisher
Scoring Algorithm”. Replacing the objective function by the log likelihood function ℓ(⋅; y), the gradient of
the objective function by the score vector S, and the Hessian of the objective function by the negative
expected Fisher information in the multivariate Newton-Raphson method thus yields the Fisher-Scoring
algorithm (Table 1).
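As a minimal numerical sketch of the iteration in Table 1 (an illustrative implementation, not code from PMFN), consider maximum likelihood estimation of the expectation parameter of a univariate Gaussian with known variance; the `score` and `expected_info` functions below are the assumed closed forms for this model:

```python
import numpy as np

def fisher_scoring(score, expected_info, theta0, max_iter=100, tol=1e-10):
    """Fisher-Scoring iteration of Table 1: theta <- theta + E(I(theta))^{-1} S(theta)."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    for _ in range(max_iter):
        s = np.atleast_1d(score(theta))
        if np.max(np.abs(s)) < tol:  # S(theta) approximately zero: stop
            break
        theta = theta + np.linalg.solve(np.atleast_2d(expected_info(theta)), s)
    return theta

# Example: ML estimation of the expectation mu of N(mu, sigma^2) with known sigma^2.
rng = np.random.default_rng(0)
sigma2 = 2.0
y = rng.normal(1.5, np.sqrt(sigma2), size=100)
n = y.size

score = lambda th: np.array([np.sum(y - th[0]) / sigma2])  # gradient of the log likelihood
expected_info = lambda th: np.array([[n / sigma2]])        # expected Fisher information

mu_ml = fisher_scoring(score, expected_info, theta0=[0.0])
# For this model the iteration recovers the sample mean (here in a single step).
```

For this simple model the expected information does not depend on θ and the algorithm converges immediately; for models with curved likelihoods, several iterations are required.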
(4) Bayesian Estimation
There are at least two reasons to introduce the Bayesian approach to the estimation and evaluation
of the GLM. The first reason is data-analytic: Bayesian inference provides a principled, mathematically
grounded, and coherent framework for dealing with uncertainty in scientific theories (“models”) in light of data.
In contrast to classical statistics, it does not comprise a collection of special-purpose tools (which, as discussed so far,
can be summarized under the classical framework of the GLM), but provides a general framework for probabilistic
inference on any model class. Although historically inaccurate, Bayesian data analysis still carries the connotation of
“modern data analysis” and hence is becoming increasingly popular in the psychological and neuroscientific
literature. For example, the Dynamic Causal Modelling approach to EEG and fMRI is an explicitly Bayesian
approach to neuroimaging data analysis that is receiving increasing attention. There is an important
second reason: it is fairly safe to say that, at the beginning of the 21st century, the dominant view of the
human brain is that of a dynamic system implementing Bayesian statistical inference. The core idea here is
that the brain encodes a model of the world and matches its sensory input and motor output to adhere to its
own predictions. This “Bayesian brain hypothesis” has a long history, dating back centuries, and currently
enjoys popularity under the name “free energy principle”. To understand this leading brain theory (and/or
to come up with sensible alternative hypotheses), it is necessary to understand the formal framework that the
Bayesian brain hypothesis rests on. A fundamental building block of Bayesian inference is the notion of
conditional probability, which the reader is encouraged to review at this point.
Bayesian parameter estimation – Prior, posterior, and likelihood
For a joint distribution p(y, θ) of random entities y and θ (i.e., random variables or vectors),
“Bayesian inference” corresponds to a specific interpretation of y and θ and their respective marginal and
conditional distributions. Specifically, the random entity y is interpreted as “the data”, the random entity
θ is interpreted as “the parameter”, and the joint distribution p(y, θ) is interpreted as “the (“generative” or
“probabilistic”) model”. Based on these interpretations, Bayes theorem in the form

p(θ|y) = p(θ)p(y|θ) / p(y)   (1)

allows one to determine the conditional probability distribution of the parameter θ “given” (or
“conditioned on”) the data y. Note that p(θ|y) reflects the conditional probability distributions over the
parameter values θ for all possible values that the data y may take on. In a given experimental context, a
specific data point (or data set) y* is usually observed. In this case, the conditional probability distribution
over parameter values may be determined from

p(θ|y = y*) = p(θ)p(y = y*|θ) / p(y = y*)   (2)
To determine the conditional distribution 𝑝(휃|𝑦) of the parameter given the data and use it for
probabilistic statements about the conditional probabilities of 휃 is the aim of “Bayesian parameter
estimation”. Note that in contrast to the parameter estimation schemes we have discussed so far, Bayesian
parameter estimation does not aim to estimate a single value for the true, but unknown, parameter value,
but a probability distribution over the possible parameter values. The data conditional parameter
distribution is also called the “posterior distribution”, because it can be conceived as the probability
distribution over parameters “once the data have been observed”. Note however, that the conditional
parameter distribution is inherent in the specification of the model 𝑝(𝑦, 휃) and is thus not created de novo
once the data is observed.
Figure 1. Fundamentals of Bayesian Inference. The left panel depicts a joint probability distribution (or, more precisely, joint probability density function (PDF)) over two scalar random variables, 𝑦 and 휃. The right panels show marginal and conditional distributions inherent in the joint distribution of the left panel: The upper right panels depict the marginal distributions over 휃 and 𝑦, respectively, while the lower right panels depict examples for conditional distributions of 휃 given 𝑦, here for 𝑦 = 0 and 𝑦 = −2. The parameters (expectation and covariance) of the marginal and conditional probability density functions can be evaluated as functions of the parameters (expectation and covariance) of the joint distribution. The details of this procedure will be discussed in the following Sections.
In principle, the posterior parameter distribution 𝑝(휃|𝑦) for a given observation 𝑦 = 𝑦∗ may be
directly determined from the joint probability distribution 𝑝(𝑦 = 𝑦∗, 휃) and the marginal probability
𝑝(𝑦 = 𝑦∗). However, the Bayesian paradigm usually proceeds by explicitly stating the marginal distribution
𝑝(휃) and the conditional probability distribution 𝑝(𝑦|휃), whose product results in 𝑝(𝑦, 휃). The marginal
distribution 𝑝(휃) over the parameter is called the “prior” distribution in the Bayesian paradigm. It
corresponds to the probability distribution one specifies over the parameter values independent of any data.
To form the posterior distribution 𝑝(휃|𝑦), the distribution 𝑝(𝑦, 휃) and thus also 𝑝(휃) has to be already
specified, which explains the notion of a “prior” distribution over 휃.
The term p(y|θ) in the numerator on the right-hand side of (1) is referred to as “the (data)
likelihood”. For each specific parameter value θ = θ*, this term defines a conditional probability distribution
over y. We have encountered a “likelihood” on many occasions previously. Specifically, we have repeatedly
stated that the GLM specifies a probability distribution over data y of the form

p(y) = N(y; Xβ, σ²I_n)   (3)

where we referred to β and σ² as the fixed true, but unknown, values. Let these specific fixed values be
denoted by β = β* and σ² = σ²*, and think of β and σ² as random variables. From a Bayesian viewpoint, we
may define a generative model in which θ ≔ (β, σ²)ᵀ is governed by a probability distribution
corresponding to

p(θ) ≔ p((β, σ²)ᵀ)   (4)

In this case we may write the data likelihood as
p(y|θ) ≔ p(y|(β, σ²)ᵀ) = N(y; Xβ, σ²I_n)   (5)
In more general terms, the likelihood specifies a probability distribution over the data given specific
parameter values. How these parameter values are linked to specific data distributions is dependent on the
“functional” (or “structural”) form of the generative model, and we will study different scenarios in the
sections to come.
Consider again Bayes theorem for a specific data observation

p(θ|y = y*) = p(θ)p(y = y*|θ) / p(y = y*)   (6)

The denominator on the right-hand side, p(y = y*), which we will discuss in more depth below, corresponds
to a constant multiplicative factor for p(θ)p(y = y*|θ). In other words, the posterior distribution
p(θ|y = y*) is proportional to the product of the prior parameter distribution p(θ) and the likelihood
p(y = y*|θ). A common mnemonic for Bayes theorem from the perspective of the Bayesian paradigm is
thus

Posterior ∝ Prior × Likelihood   (7)
The posterior distribution over the parameter given the data is one of the (two) main outcomes of a
Bayesian data analysis. Often, it is helpful to summarize this distribution using some meaningful numbers.
One example would be the parameter value with the highest posterior probability, which intuitively
corresponds to the “most likely” posterior parameter value. This value, i.e. the mode of the posterior
probability distribution, is known as the maximum-a-posteriori (MAP) parameter estimate. Another quantity
of interest may be the probability for the parameter to fall into a specific interval or to be larger than a given
pre-specified value. For example, for a posterior distribution corresponding to a univariate Gaussian (Figure 1),
the probability of the parameter falling in the interval [0,1] may be of interest. Intervals and their
associated posterior probabilities of this sort are known as “credible intervals”.
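As a sketch of such posterior summaries (using hypothetical posterior parameter values, chosen here to match the univariate Gaussian example developed in the next section), the MAP estimate and the posterior probability of an interval can be evaluated as follows:

```python
from scipy.stats import norm

# Posterior summaries for a hypothetical Gaussian posterior p(theta|y); the values
# mean 4/3 and variance 2/3 are illustrative (they arise in the univariate example below).
mu_post, var_post = 4.0 / 3.0, 2.0 / 3.0
sd_post = var_post ** 0.5

map_estimate = mu_post  # for a Gaussian posterior, the mode (MAP estimate) equals the mean
p_interval = norm.cdf(1.0, mu_post, sd_post) - norm.cdf(0.0, mu_post, sd_post)
# p_interval is the posterior probability that theta falls in the interval [0, 1]
```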
Bayesian model comparison – Evidence
To determine the posterior parameter distribution and to make inferences based on it is the first principal aim
of Bayesian inference. The second aim is “Bayesian model comparison”. Bayesian model comparison is
usually performed by posing the following question: given two models p₁(y, θ) and p₂(y, θ), under which
model is the observed data y* more likely? As such, Bayesian model comparison is based on the ratio of the
marginal probabilities of the observed data y* under each model

p₁(y = y*) / p₂(y = y*)   (1)
Note that for any model p(y, θ), the probability p(y = y*) corresponds to the normalization factor on the
right-hand side of Bayes theorem as specified in equation (1) of the previous section. In the Bayesian
paradigm, this marginal probability is also known as the “model evidence”. Based on their respective
generative models, the marginal probabilities p₁(y = y*) and p₂(y = y*) may be obtained from the joint
probabilities p₁(y = y*, θ) and p₂(y = y*, θ) by “integrating out” or “summing over” all possible
parameter values. We may thus rewrite the above as

p₁(y = y*) / p₂(y = y*) = ∑_{θ*∈Θ₁} p₁(y = y*, θ = θ*) / ∑_{θ*∈Θ₂} p₂(y = y*, θ = θ*)   (2)
where we used Θ1 and Θ2 to denote the parameter spaces of the generative models 𝑝1(𝑦, 휃) and 𝑝2(𝑦, 휃),
respectively. Before we return to a general discussion of Bayesian parameter inference and Bayesian model
comparison with respect to the statistical inference machinery introduced for the GLM so far, it is helpful to
consider a basic concrete example of Bayesian inference.
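For illustration of the evidence ratio in (2), the sums over parameter values can be approximated on a dense grid; the two priors and the observation y* = 2 below are hypothetical choices, not values from PMFN:

```python
import numpy as np
from scipy.stats import norm

# Grid approximation of the evidence ratio in (2) for two hypothetical generative models
# that share the likelihood p(y|theta) = N(y; theta, 1) but differ in their prior over theta.
y_star = 2.0
theta = np.linspace(-10.0, 10.0, 20001)
dtheta = theta[1] - theta[0]

prior1 = norm.pdf(theta, 0.0, np.sqrt(2.0))    # model 1: prior N(0, 2)
prior2 = norm.pdf(theta, -3.0, np.sqrt(2.0))   # model 2: prior N(-3, 2)
lik = norm.pdf(y_star, theta, 1.0)             # shared likelihood, evaluated at y* = 2

evidence1 = np.sum(prior1 * lik) * dtheta      # ~ p1(y = y*), analytically N(2; 0, 3)
evidence2 = np.sum(prior2 * lik) * dtheta      # ~ p2(y = y*), analytically N(2; -3, 3)
bayes_factor = evidence1 / evidence2           # > 1: the observed data favor model 1
```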
(5) Bayesian estimation of the expectation of a univariate Gaussian
Consider the following likelihood

p(y|θ) ≔ N(y; θ, σ_y²) = (1/√(2πσ_y²)) exp(−(1/(2σ_y²))(y − θ)²)   (1)
where y ∈ ℝ is a scalar random variable and σ_y² > 0 is a fixed and known constant. According to the
likelihood statement above, the random variable y is normally distributed with expectation parameter
θ ∈ ℝ and variance parameter σ_y² > 0. In contrast to our previous discussion of the univariate Gaussian, in
the current example the expectation parameter θ is considered a random variable. A standard choice for
the marginal distribution p(θ) is a univariate Gaussian with expectation parameter μ_θ ∈ ℝ and variance
parameter σ_θ² > 0:
p(θ) ≔ N(θ; μ_θ, σ_θ²) = (1/√(2πσ_θ²)) exp(−(1/(2σ_θ²))(θ − μ_θ)²)   (2)
Here, we used the subscript θ to emphasize that the parameters μ_θ and σ_θ² govern the distribution of θ.
From a Bayesian perspective, the distribution 𝑝(휃) corresponds to the prior distribution over 휃 before we
observe a value of 𝑦, and the likelihood 𝑝(𝑦|휃) provides the distribution over 𝑦 for each value that 휃 may
take on. Together, 𝑝(휃) and 𝑝(𝑦|휃) specify a generative model, i.e. a joint distribution over both 𝑦 ∈ ℝ and
휃 ∈ ℝ of the form
p(y, θ) = p(θ)p(y|θ) = N(θ; μ_θ, σ_θ²) N(y; θ, σ_y²)   (3)
In the Bayesian formulation of Gaussian probability density function models, working with the inverse of the variance (or
covariance) parameter, referred to as the “precision” parameter, is often preferred. By defining the precision
parameters

λ_y ≔ 1/σ_y² and λ_θ ≔ 1/σ_θ²   (4)
we can rewrite the generative model as follows

p(y, θ) = N(θ; μ_θ, λ_θ⁻¹) N(y; θ, λ_y⁻¹) = (√λ_θ/√(2π)) exp(−(λ_θ/2)(θ − μ_θ)²) ⋅ (√λ_y/√(2π)) exp(−(λ_y/2)(y − θ)²)   (5)
Note that Gaussians with a high variance have a low precision and vice versa.
Below we will be concerned with general properties of and Bayesian inference in such generative
Gaussian models. For the moment, we merely state that based on the definition of the prior 𝑝(휃) in (2) and
the likelihood 𝑝(𝑦|휃) in (1), we may evaluate the posterior distribution 𝑝(휃|𝑦) by applying Bayes theorem in
the form
p(θ|y) = p(θ)p(y|θ) / p(y)   (6)
Substituting (1) and (2) into (6), it can be shown (and we will do so in the following Section) that the
posterior distribution over θ conditioned on the data y is given by a Gaussian distribution of the form

p(θ|y) = N(θ; μ_θ|y, λ_θ|y⁻¹)   (7)

where

λ_θ|y ≔ λ_y + λ_θ and μ_θ|y ≔ (λ_y/(λ_y + λ_θ)) y + (λ_θ/(λ_y + λ_θ)) μ_θ   (8)
We use the subscript ⋅𝜃|𝑦 to denote parameters of conditional distributions, here the distribution of 휃 given
𝑦. A couple of things are worth noting about equations (7) and (8).
Figure 2. Bayesian inference for the expectation parameter of the univariate Gaussian with prior parameters μ_θ ≔ 0 and σ_θ² ≔ 2, likelihood parameter σ_y² ≔ 1, and observed data value y ≔ 2. The left panel depicts the generative model p(y, θ) as specified by equation (5). The white triangle depicts an observation y = 2 and the corresponding generative model “section” in θ space (dashed white line). The right panel depicts the prior distribution over θ and the posterior distribution over θ given the observation y = 2 as specified by the parameter update equations (8).
First, the posterior distribution over θ is a Gaussian distribution, as is the prior distribution over θ.
Second, the parameters of the posterior distribution over θ result from relatively simple formulas that
involve the parameters of the prior distribution and the likelihood (λ_θ, λ_y, μ_θ) and the data y. Distributions
with the property that the posterior distribution belongs to the same class as the prior distribution, but with
(potentially) different parameters, are called “conjugate distributions”. For the expectation parameter of a
univariate Gaussian likelihood, the univariate Gaussian is thus the conjugate prior (and hence posterior)
distribution. Third, the parameter update equations (8) for the parameters of the posterior distribution in
terms of the prior and likelihood parameters and the data can be memorized very well: to obtain the precision
parameter λ_θ|y of the posterior distribution, the precision parameters of the prior λ_θ and likelihood λ_y are
added; to obtain the expectation parameter μ_θ|y, a weighted average is formed between the observed data
y ∈ ℝ and the prior parameter μ_θ, where the weighting factors are given by their precisions relative to the
posterior precision, λ_y/λ_θ|y and λ_θ/λ_θ|y, respectively. From a Bayesian perspective, the combined
precisions of likelihood and prior lead to an increased precision in our knowledge about the value of θ.
Further, the expectation parameter of the posterior distribution thus offers a sensible compromise between
our prior knowledge about the value of θ and the data observed. This is visualized in Figure 2.
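The update equations (8) can be evaluated directly for this example (prior μ_θ = 0, σ_θ² = 2, likelihood σ_y² = 1, observation y = 2); the snippet below is a plain transcription of the formulas, not code from PMFN:

```python
# Direct transcription of the precision-weighted update equations (8) for the example
# of Figure 2: prior N(0, 2), likelihood N(theta, 1), observed data y = 2 (illustrative).
mu_theta, sigma2_theta = 0.0, 2.0   # prior expectation and variance
sigma2_y = 1.0                      # known likelihood variance
y = 2.0                             # observed data point

lam_y, lam_theta = 1.0 / sigma2_y, 1.0 / sigma2_theta
lam_post = lam_y + lam_theta                               # posterior precision: 1.5
mu_post = (lam_y * y + lam_theta * mu_theta) / lam_post    # posterior expectation: 4/3
```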
We can reformulate the equations for the posterior distribution parameters using variances instead
of precisions to gain further insight into the analytic properties of the posterior distribution. Based on (8) we
have

μ_θ|y = c μ_θ + (1 − c) y with c ≔ σ_y²/(σ_y² + σ_θ²) and σ_θ|y² = σ_y²σ_θ²/(σ_y² + σ_θ²)   (9)
Proof of (9)

For the posterior variance parameter we have

σ_θ|y² = 1/λ_θ|y = 1/(λ_y + λ_θ) = 1/(1/σ_y² + 1/σ_θ²) = ((σ_θ² + σ_y²)/(σ_y²σ_θ²))⁻¹ = σ_y²σ_θ²/(σ_θ² + σ_y²).   (9.1)

For the posterior distribution expectation parameter we have

μ_θ|y = (λ_y/(λ_y + λ_θ)) y + (λ_θ/(λ_y + λ_θ)) μ_θ = (1/σ_y²)(σ_y²σ_θ²/(σ_θ² + σ_y²)) y + (1/σ_θ²)(σ_y²σ_θ²/(σ_θ² + σ_y²)) μ_θ = (σ_θ²/(σ_θ² + σ_y²)) y + (σ_y²/(σ_θ² + σ_y²)) μ_θ = (σ_y²μ_θ + σ_θ²y)/(σ_y² + σ_θ²).   (9.2)

To bring the above into the form (9), we rewrite it as follows

μ_θ|y = (σ_y²μ_θ + σ_θ²y)/(σ_y² + σ_θ²) = (σ_y²/(σ_y² + σ_θ²)) μ_θ + (σ_θ²/(σ_y² + σ_θ²)) y = (σ_y²/(σ_y² + σ_θ²)) μ_θ + ((σ_y² + σ_θ² − σ_y²)/(σ_y² + σ_θ²)) y = c μ_θ + (1 − c) y   (9.3)

where we defined

c ≔ σ_y²/(σ_y² + σ_θ²).   (9.4)

□
We first consider the expression for the posterior expectation in (9). Because σ_y² > 0 and σ_θ² > 0,
the constant c takes on values in the (open) interval ]0,1[. Note that if σ_θ² is very small in comparison to σ_y²,
c tends towards 1, while if σ_θ² is relatively large with respect to σ_y², c tends towards 0. Further, because
0 < c < 1, we have c + (1 − c) = 1. From these properties of c, we see that the posterior expectation
parameter μ_θ|y corresponds to a weighted average of the prior expectation parameter μ_θ and the observed
data point y (assume, for example, that c = 0.4; then μ_θ|y = 0.4 ⋅ μ_θ + 0.6 ⋅ y. For c = 0.5, μ_θ|y
corresponds to the standard (unweighted) average of μ_θ and y). If σ_θ² is large relative to σ_y²,
corresponding to vague prior information, c is small, and more weight is given to the data y compared to
the prior expectation parameter μ_θ. Conversely, if σ_θ² is small compared to σ_y², in other words, if the prior is
highly informative, c is large and the posterior expectation parameter μ_θ|y is close to the prior expectation
parameter μ_θ. The constant c is sometimes referred to as a shrinkage factor, because it gives the proportion of the
distance by which the posterior expectation parameter is “shrunk back” from the ordinary estimate y towards
the prior mean μ_θ.
We next consider the expression for the posterior variance in (9). The posterior variance can be written
both as σ_θ|y² = (σ_y²/(σ_y² + σ_θ²)) ⋅ σ_θ² and as σ_θ|y² = (σ_θ²/(σ_y² + σ_θ²)) ⋅ σ_y². Because both fractions
σ_y²/(σ_y² + σ_θ²) and σ_θ²/(σ_y² + σ_θ²) are larger than 0 but smaller than 1, the posterior variance
parameter is always smaller than both the variance parameter of the prior and that of the likelihood.
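These variance-form expressions can be checked numerically for the running example (σ_y² = 1, σ_θ² = 2, μ_θ = 0, y = 2); the snippet is an illustrative transcription of equation (9):

```python
# Variance-form view of the posterior parameters (equation (9)) for the running example
# sigma_y^2 = 1, sigma_theta^2 = 2, mu_theta = 0, y = 2 (illustrative transcription).
sigma2_y, sigma2_theta = 1.0, 2.0
mu_theta, y = 0.0, 2.0

c = sigma2_y / (sigma2_y + sigma2_theta)                         # shrinkage factor: 1/3
mu_post = c * mu_theta + (1 - c) * y                             # shrunk towards the prior mean
var_post = sigma2_y * sigma2_theta / (sigma2_y + sigma2_theta)   # 2/3: below both variances
```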
In summary, we have seen that Bayesian statistical inference is based on joint distributions over
parameters θ and data y. For most of the models discussed in PMFN these joint distributions and their
marginal and conditional distributions are represented by parameterized probability density functions. In
order not to confuse the model parameter θ of interest with these parameters, the latter are sometimes referred
to as “hyperparameters”. However, in our opinion, it is more helpful to think of θ and y as “unobserved and
observed random variables”, and of the parameters of their joint distributions simply as parameters. In the following
Section, we explore what happens if we consider the beta parameter β ∈ ℝᵖ of the GLM as an unobserved
random variable, i.e., if we, based on a prior distribution p(β), aim to infer the posterior distribution p(β|y)
and the model evidence p(y). Note that the discussion of the univariate Gaussian case above corresponds to a
GLM with a single observation (n ≔ 1) and a design matrix X ≔ 1.
(6) Principles of variational Bayes
Variational Bayes (VB) is a statistical framework for probabilistic models comprising unobserved
variables and has received increasing attention in the machine learning, theoretical neuroscience, and
neuroimaging literature since the late 1990s. The general starting point of a Bayesian approach is a joint
distribution over observed random variables 𝑦 and unobserved random variables 𝜗
𝑝(𝑦, 𝜗) = 𝑝(𝜗)𝑝(𝑦|𝜗) (1)
where 𝑝(𝜗) is usually referred to as the prior distribution and 𝑝(𝑦|𝜗) as the likelihood. Joint distributions
over observed and unobserved variables are sometimes referred to as “generative models”, a convention we
will follow here. Given an observed value 𝑦∗ of 𝑦, the first aim of a Bayesian approach is to determine the
conditional distribution of 𝜗 given 𝑦∗, referred to as the posterior distribution. The second aim of a Bayesian
approach is to evaluate the logarithm of the marginal probability of the observed data 𝑦 denoted by
ln 𝑝(𝑦) = ln ∫ 𝑝(𝑦, 𝜗) 𝑑𝜗 (2)
If a model comprises only non-random quantities, classically referred to as “parameters”, the left-hand side
of (2) is referred to as the “log likelihood” and no integration as on the right-hand side of (2) is required.
However, if a model comprises unobserved random variables, which are integrated out as on the right-hand
side of (2), then (2) is referred to as the “log marginal likelihood” or “log model evidence”. The log model
evidence allows for comparing different models in their plausibility to explain observed data. It thus forms
the necessary prerequisite for Bayesian model comparison. In the VB framework it is not the log model
evidence itself which is evaluated, but rather a lower bound approximation to it. This is due to the fact that if
a model comprises many unobserved variables 𝜗, the integration on the right-hand side of equation (2) can
become analytically burdensome or even intractable. To nevertheless achieve the two aims of a Bayesian
approach (posterior parameter estimation and model evidence evaluation), VB in effect replaces an
integration problem with an optimization problem. To this end, VB exploits a set of information theoretic
quantities, as introduced previously and below.
The following log model evidence decomposition forms the core of the VB approach (Figure 1):

ln p(y) = ℱ(q(𝜗)) + 𝒦ℒ(q(𝜗)||p(𝜗|y))   (3)

where q(𝜗) denotes an arbitrary probability distribution over the unobserved variables, which is used as an
approximation of the posterior distribution p(𝜗|y). In the following, q(𝜗) is referred to as the “variational
distribution”. In words, equation (3) states that for an arbitrary variational distribution q(𝜗) over the
unobserved variables, the log model evidence comprises the sum of two information theoretic quantities:
the so-called “variational free energy”, which is defined here as

ℱ(q(𝜗)) ≔ ∫ q(𝜗) ln(p(y, 𝜗)/q(𝜗)) d𝜗   (4)

and the Kullback-Leibler (KL) divergence between the variational distribution q(𝜗) and the true posterior
distribution p(𝜗|y), where for general densities q and p

𝒦ℒ(q(x)||p(x)) ≔ ∫ q(x) ln(q(x)/p(x)) dx   (5)
Proof of (3)

Based on the definitions of ℱ(q(𝜗)) and 𝒦ℒ(q(𝜗)||p(𝜗|y)) it is easy to show that the decomposition of the log model evidence
formally holds: by the definition of the variational free energy in (4) and the factorization p(y, 𝜗) = p(y)p(𝜗|y), we have

ℱ(q(𝜗)) = ∫ q(𝜗) ln(p(y)p(𝜗|y)/q(𝜗)) d𝜗   (3.1)

Using the properties of the logarithm and the linearity of integrals, it follows that

ℱ(q(𝜗)) = ∫ q(𝜗) ln p(y) d𝜗 + ∫ q(𝜗) ln(p(𝜗|y)/q(𝜗)) d𝜗   (3.2)

With the linearity of integrals, we then also have

ℱ(q(𝜗)) = ln p(y) ∫ q(𝜗) d𝜗 + ∫ q(𝜗) ln(p(𝜗|y)/q(𝜗)) d𝜗   (3.3)

and because q(𝜗) is a probability distribution (and thus integrates to 1), and again with the properties of the logarithm, we obtain

ℱ(q(𝜗)) = ln p(y) − ∫ q(𝜗) ln(q(𝜗)/p(𝜗|y)) d𝜗   (3.4)

The definition of the KL divergence then allows writing (3.4) as

ℱ(q(𝜗)) = ln p(y) − 𝒦ℒ(q(𝜗)||p(𝜗|y))   (3.5)

from which (3) follows immediately.

□
The non-negativity property of the KL divergence has the consequence that the variational free
energy ℱ(q(𝜗)) is always smaller than or equal to the log model evidence, that is

ℱ(q(𝜗)) ≤ ln p(y)   (6)

This fact is exploited in the numerical application of the VB approach to probabilistic models: because the log
model evidence is a fixed quantity, which depends only on the choice of p(y, 𝜗) and a specific data
realization y*, manipulating the variational distribution q(𝜗) for a given data set in such a manner that the
variational free energy increases has two consequences: first, the lower bound to the log model evidence
becomes tighter, and the variational free energy becomes a better approximation to the log model evidence. Second,
because the left-hand side of (3) remains constant, the KL divergence between the true posterior and its
variational approximation decreases, which renders the variational distribution q(𝜗) a better approximation
to the true posterior distribution p(𝜗|y) (Figure 2).
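The decomposition (3) can be verified numerically for the conjugate Gaussian example of the previous section (prior N(0, 2), likelihood N(θ, 1), observation y = 2); the variational density chosen below is arbitrary and purely illustrative:

```python
import numpy as np
from scipy.stats import norm

# Numerical check of ln p(y) = F(q) + KL(q || p(theta|y)) for the conjugate Gaussian
# model: prior N(0, 2), likelihood N(theta, 1), observation y = 2. The variational
# density q is an arbitrary Gaussian; all integrals are approximated on a dense grid.
y = 2.0
theta = np.linspace(-12.0, 12.0, 40001)
d = theta[1] - theta[0]

prior = norm.pdf(theta, 0.0, np.sqrt(2.0))
lik = norm.pdf(y, theta, 1.0)
joint = prior * lik                                     # p(y, theta) on the grid
log_evidence = np.log(norm.pdf(y, 0.0, np.sqrt(3.0)))   # ln p(y): marginal is N(0, 2 + 1)
posterior = norm.pdf(theta, 4.0 / 3.0, np.sqrt(2.0 / 3.0))

q = norm.pdf(theta, 1.0, np.sqrt(0.5))                  # an arbitrary variational density
free_energy = np.sum(q * (np.log(joint) - np.log(q))) * d    # F(q), equation (4)
kl = np.sum(q * (np.log(q) - np.log(posterior))) * d         # KL(q || posterior)
# free_energy + kl recovers ln p(y), and free_energy <= ln p(y) as in (6).
```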
Figure 1. Visualization of the log model evidence decomposition that lies at the heart of the VB approach. The upper vertical bar is meant to represent the log model evidence, which is a function of the generative model 𝑝(𝑦, 𝜗) and is constant for any observation 𝑦∗ of 𝑦. As shown in the main text, the log model evidence can readily be rewritten into the sum of the variational free energy term
ℱ(𝑞(𝜗)) and a KL-divergence term 𝒦ℒ(𝑞(𝜗)||𝑝(𝜗|𝑦)), if one introduces an arbitrary variational distribution over the unobserved
variables 𝜗. Maximizing the variational free energy hence minimizes the KL divergence between the variational distribution 𝑞(𝜗) and the true posterior distribution 𝑝(𝜗|𝑦) and renders the variational free energy a better approximation of the log model evidence. Equivalently, minimizing the KL divergence between the variational distribution 𝑞(𝜗) and the true posterior distribution 𝑝(𝜗|𝑦) maximizes the free energy and also renders it a tighter approximation to the log model evidence ln 𝑝(𝑦).
The log model evidence decomposition in terms of a variational free energy and a Kullback-Leibler
divergence induced by a variational distribution, as discussed above, is a fairly general approach for Bayesian
parameter estimation and model evidence approximation. For concrete generative models, it serves
as a guiding principle rather than a concrete numerical algorithm: algorithms that make use of the log
model evidence decomposition are jointly referred to as variational Bayesian algorithms, but many variants
exist. One may roughly classify these algorithms along two dimensions: (1) whether or not they employ
parametric assumptions about the variational distribution q(𝜗), referred to as “fixed-form” and “free-form”
variational Bayes, respectively, and (2) whether or not they assume that the variational distribution q(𝜗)
factorizes over groups of unobserved variables. The latter is usually referred to as the mean-field assumption
and, for s groups of unobserved random variables, is denoted by q(𝜗) = ∏_{i=1}^s q(𝜗_i). The variational Bayes
variant often employed for the estimation of differential equation models of neuroimaging data corresponds
to a fixed-form variational Bayesian approach with a mean-field assumption.
Figure 2. The log model evidence decomposition of Figure 1 is exploited in numerical algorithms for free-form VB inference: based on a mean-field approximation q(𝜗) = q(𝜗_s)q(𝜗_\s), the variational free energy can be maximized in a coordinate-wise fashion. Maximizing the variational free energy in turn has two implications: it decreases the KL divergence between q(𝜗) and the true posterior p(𝜗|y) and renders the variational free energy a closer approximation to the log model evidence. This holds true because the log model evidence for a given observation y* is constant (represented by the constant length of the vertical bar) and the KL divergence is non-negative.
Study Questions
1. Write down the general form of the likelihood function 𝐿 and name and explain its components.
2. Write down the general form of a maximum likelihood estimator 휃̂𝑀𝐿 and explain the definition.
3. Name two approaches for obtaining maximum likelihood estimators given a likelihood function.
4. Write down the log likelihood function for 𝑛 independent and identically distributed univariate Gaussian random variables
5. Write down the maximum likelihood estimator for the expectation and variance parameter of a univariate Gaussian based on 𝑛
independent and identical observations and verbally explain how these are derived from the corresponding likelihood function.
6. Based on a joint probability distribution 𝑝(𝑦, 휃) over data 𝑦 and parameters 휃, write down the formal equivalent to the
mnemonic 𝑃𝑜𝑠𝑡𝑒𝑟𝑖𝑜𝑟 ∝ 𝑃𝑟𝑖𝑜𝑟 × 𝐿𝑖𝑘𝑒𝑙𝑖ℎ𝑜𝑜𝑑.
7. Write down the definition of the precision parameter for a univariate Gaussian distribution with variance parameter 𝜎2.
8. Write down the parameter update equations for the conjugate posterior distribution p(θ|y) = N(θ; μ_θ|y, λ_θ|y⁻¹) arising in expectation parameter estimation for the univariate Gaussian p(y|θ) ≔ N(y; θ, σ_y²), using precision parameters, and provide a verbal explanation of their intuition.
9. Write down the definition of the variational free energy and name its constituents.
10. Why is the variational free energy useful for Bayesian inference?
Study Question Answers
1. For a probabilistic model which specifies the probability of data y by means of parameterized probability density functions p(y; θ),
where θ ∈ Θ denotes the model’s parameter and Θ ⊂ ℝᵖ denotes the model’s parameter space, the function

L: Θ × ℝⁿ → ℝ₊, (θ, y) ↦ L(θ, y) ≔ p(y; θ)

where θ ∈ ℝᵖ and y ∈ ℝⁿ, is called the likelihood function of the parameter θ for the observation y. Notably, the likelihood function
is a function of the parameter value θ, while, in the case of an available realization of the random variable y, the value of y is fixed.
This contrasts with the notion of a probability density function p(y; θ), which is a function of the random variable y.
2. The maximum likelihood estimator for a given probabilistic model p(y; θ) is that value θ̂_ML of θ which maximizes the likelihood
function. Formally, this can be expressed as

θ̂_ML ≔ argmax_{θ∈Θ} L(θ, y)

The above should be read as “θ̂_ML is that argument of the likelihood function L for which L(θ, y) assumes its maximal value over all
possible parameter values θ in the parameter space Θ”.
3. One approach to find “closed-form” or analytical expressions for ML estimators is to maximize the likelihood function with
respect to 휃 by means of the analytical determination of critical values at which its derivative vanishes and checking, whether its
second derivative is negative. Another approach, often encountered in practical numerical computing is to automatically shift the
values of 휃 around by means of an algorithm, while monitoring the value of the likelihood function, and to stop, once this value is
considered to be maximal based on a sensible stopping criterion.
4. The log likelihood function for n independent and identically distributed univariate Gaussian random variables is given by

ℓ: (ℝ × ℝ₊\{0}) × ℝⁿ → ℝ, ((μ, σ²), y) ↦ ℓ((μ, σ²), y) ≔ ln p(y; μ, σ²) = −(n/2) ln 2π − (n/2) ln σ² − (1/(2σ²)) ∑_{i=1}^n (y_i − μ)²
5. The maximum likelihood estimators for the expectation and variance parameter of a univariate Gaussian distribution based on 𝑛 independent and identically distributed observations are given by
𝜇̂𝑀𝐿 = (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑦𝑖 and 𝜎̂2𝑀𝐿 = (1/𝑛) ∑_{𝑖=1}^{𝑛} (𝑦𝑖 − 𝜇̂𝑀𝐿)^2
and are derived from the corresponding likelihood function by computing the respective partial derivatives, setting them to zero, and solving for the critical points (which, due to the concavity of the log likelihood function, correspond to maxima).
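The two estimators above can be checked numerically. The following Python sketch simulates data from a Gaussian with made-up parameter values (𝜇 = 2, 𝜎2 = 4) and evaluates the ML formulas:

```python
import random

random.seed(0)
n = 100_000
# Simulated data: n i.i.d. draws from a Gaussian with made-up mu = 2, sigma^2 = 4
y = [random.gauss(2.0, 2.0) for _ in range(n)]

# ML estimators from the text: sample mean and 1/n-normalized sample variance
mu_ml = sum(y) / n
sigma2_ml = sum((yi - mu_ml) ** 2 for yi in y) / n
print(mu_ml, sigma2_ml)  # close to 2 and 4 for large n
```

For large 𝑛, the estimates approach the true parameter values, as expected from the consistency of ML estimators.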
6. The formal equivalent is given by Bayes’ theorem applied for the current joint distribution
𝑝(휃|𝑦 = 𝑦∗) = 𝑝(휃)𝑝(𝑦 = 𝑦∗|휃)/𝑝(𝑦 = 𝑦∗)
7. A precision parameter is the multiplicative inverse of a variance parameter. We thus have 𝜆 ≔ 1/𝜎2.
8. The parameters of 𝑝(휃|𝑦) are given by
𝜆𝜃|𝑦 ≔ 𝜆𝑦 + 𝜆𝜃 and 𝜇𝜃|𝑦 ≔ (𝜆𝑦/(𝜆𝑦 + 𝜆𝜃)) 𝑦 + (𝜆𝜃/(𝜆𝑦 + 𝜆𝜃)) 𝜇𝜃
Intuitively, the posterior certainty 𝜆𝜃|𝑦 about 휃 is larger than the prior certainty 𝜆𝜃 due to the contribution of 𝜆𝑦 > 0, and the posterior expectation of 휃 is a weighted sum of the data 𝑦 and the prior expectation 𝜇𝜃, with weights given by their relative precisions, i.e. their respective precisions normalized by the sum of precisions.
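The update equations can be evaluated directly for concreteness; all numerical values below are illustrative:

```python
# Illustrative values: prior N(mu_theta, 1/lambda_theta) and one datum y with precision lambda_y
mu_theta, lambda_theta = 0.0, 1.0   # prior expectation and precision (made up)
y, lambda_y = 4.0, 3.0              # observed datum and data precision (made up)

# Posterior precision and expectation, as given in the text
lambda_post = lambda_y + lambda_theta
mu_post = (lambda_y / (lambda_y + lambda_theta)) * y \
        + (lambda_theta / (lambda_y + lambda_theta)) * mu_theta
print(lambda_post, mu_post)  # 4.0 3.0
```

With the data three times as precise as the prior, the posterior expectation lies three quarters of the way from the prior expectation towards the datum.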
9. The variational free energy is defined as the functional (function of a function)
ℱ(𝑞(𝜗)) ≔ ∫ 𝑞(𝜗) ln(𝑝(𝑦, 𝜗)/𝑞(𝜗)) 𝑑𝜗
It allocates a real number ℱ(𝑞(𝜗)) to a probability density function 𝑞(𝜗), which is an arbitrary distribution over the unobserved variables and serves as an approximation to the posterior distribution 𝑝(𝜗|𝑦) in a generative model 𝑝(𝑦, 𝜗), i.e. the joint distribution of observed random variables 𝑦 and unobserved random variables 𝜗.
10. Due to the decomposition of the log marginal likelihood ln 𝑝(𝑦) = ℱ(𝑞(𝜗)) + 𝒦ℒ(𝑞(𝜗)||𝑝(𝜗|𝑦)) into a variational free energy term and the Kullback-Leibler divergence between the variational distribution 𝑞(𝜗) and the posterior distribution 𝑝(𝜗|𝑦), maximization of the variational free energy implies (1) minimization of the KL divergence between 𝑞(𝜗) and 𝑝(𝜗|𝑦), rendering 𝑞(𝜗) an approximation to the posterior distribution, and (2) minimization of the difference between the variational free energy and the log marginal likelihood, rendering ℱ(𝑞(𝜗)) an approximation to the log marginal likelihood ln 𝑝(𝑦), also known as the “log model evidence”, an important constituent of Bayesian model comparison.
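The decomposition can be verified numerically for a toy discrete model, with the integral replaced by a sum; all probabilities below are made up:

```python
import math

# Toy generative model: observed y fixed, discrete theta in {0, 1}
p_joint = [0.1, 0.3]                     # p(y, theta=0), p(y, theta=1) (made-up values)
p_y = sum(p_joint)                       # marginal likelihood p(y)
p_post = [pj / p_y for pj in p_joint]    # posterior p(theta | y)

q = [0.5, 0.5]                           # an arbitrary variational distribution

# Variational free energy and KL divergence (sums replace the integral)
F = sum(qi * math.log(pj / qi) for qi, pj in zip(q, p_joint))
KL = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p_post))

# The decomposition: ln p(y) = F + KL
print(math.log(p_y), F + KL)  # the two numbers coincide
```

Because the KL divergence is non-negative, ℱ(𝑞(𝜗)) is a lower bound on ln 𝑝(𝑦) for any choice of 𝑞(𝜗).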
Probability distributions in classical inference
(1) The standard normal distribution
The standard normal distribution is the univariate normal distribution 𝑁(𝑥; 𝜇, 𝜎2) with parameters
𝜇 = 0 and 𝜎2 = 1.
(2) The chi-squared distribution
The distribution of the scalar random variable
𝜉 ≔ ∑_{𝑖=1}^{𝑛} 𝑋𝑖^2, where 𝑝(𝑋𝑖) ≔ 𝑁(𝑋𝑖; 0,1) (𝑖 = 1,… , 𝑛) (1)
i.e. the sum of 𝑛 squared univariate random variables 𝑋𝑖 (𝑖 = 1,… , 𝑛), each distributed according to a
“standard normal distribution” (i.e. a normal distribution with expectation parameter 𝜇 = 0 and variance
parameter 𝜎2 = 1) is called a chi-squared distribution with 𝑛 degrees of freedom and is denoted by
𝜒2(𝜉; 𝑛). A probability density function of the chi-squared distribution 𝜒2(𝜉; 𝑛) is given by
𝑓𝑛: ℝ+ → ℝ+, 𝜉 ↦ 𝑓𝑛(𝜉) ≔ (2^(𝑛/2) 𝛤(𝑛/2))^(−1) 𝜉^(𝑛/2 − 1) exp(−𝜉/2) (2)
where 𝛤 denotes the Gamma function
𝛤: ℝ+ → ℝ+, 𝑥 ↦ 𝛤(𝑥) ≔ ∫_0^∞ exp(−𝑡) 𝑡^(𝑥−1) 𝑑𝑡 (3)
Probability density functions of the chi-squared distribution for 𝑛 = 1,2,3 degrees of freedom are shown in Figure 1. Note that the expectation of a chi-squared distributed variable with 𝑛 ∈ ℕ degrees of freedom is 𝑛 and its variance is 2𝑛.
Figure 1 Chi-squared distribution probability density functions for 𝑛 = 1,2,3
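Definition (1) and the stated moments can be checked by simulation. The following Python sketch uses 𝑛 = 3 and a made-up number of realizations:

```python
import random
from statistics import fmean, pvariance

random.seed(1)
n = 3            # degrees of freedom
N = 100_000      # number of simulated realizations (made up)

# xi = sum of n squared standard normal random variables, as in definition (1)
xi = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n)) for _ in range(N)]

mean_xi = fmean(xi)
var_xi = pvariance(xi, mean_xi)
print(mean_xi, var_xi)  # close to n = 3 and 2n = 6
```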
(3) The 𝒕-distribution
Let 𝑋 and 𝑌 be two independent scalar random variables. Let 𝑋 be distributed according to a “standard” normal distribution 𝑁(𝑋; 0,1) with expectation parameter 𝜇 = 0 and variance parameter 𝜎2 = 1, and let 𝑌 be distributed according to a chi-squared distribution with 𝑛 degrees of freedom, 𝜒2(𝑌; 𝑛). Then the distribution of the scalar random variable
𝑇 ≔ 𝑋/√(𝑌/𝑛) (1)
is called 𝑡-distribution with 𝑛 degrees of freedom, denoted as 𝑡(𝑇; 𝑛). A probability density function of 𝑡 is
given by
𝑓𝑛: ℝ → ℝ+, 𝑥 ↦ 𝑓𝑛(𝑥) ≔ (Γ((𝑛 + 1)/2)/(√(𝑛𝜋) Γ(𝑛/2))) ⋅ (1 + 𝑥^2/𝑛)^(−(𝑛+1)/2) (2)
where 𝛤 denotes the Gamma function
𝛤: ℝ+ → ℝ+, 𝑥 ↦ 𝛤(𝑥) ≔ ∫_0^∞ exp(−𝑡) 𝑡^(𝑥−1) 𝑑𝑡 (3)
For large 𝑛, 𝑡(𝑇; 𝑛) asymptotically approaches the standard normal distribution 𝑁(𝑇; 0,1). For 𝑛 = 1, the expectation is not defined; for 𝑛 = 2,3, …, the expectation is given by 0. For 𝑛 = 1,2, the variance is not defined, and for 𝑛 = 3,4,… the variance is given by 𝑛/(𝑛 − 2). Figure 1 depicts the probability density function of 𝑡(𝑇; 𝑛) for 𝑛 = 2,… ,5.
Figure 1 The 𝑡-distribution with varying degrees of freedom
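The construction in (1) and the variance formula 𝑛/(𝑛 − 2) can be checked by simulation; degrees of freedom and sample size below are illustrative:

```python
import math
import random
from statistics import fmean, pvariance

random.seed(2)
n = 5            # degrees of freedom
N = 100_000      # number of simulated realizations (made up)

def t_draw():
    x = random.gauss(0.0, 1.0)                              # X ~ N(0, 1)
    y = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))  # Y ~ chi-squared(n)
    return x / math.sqrt(y / n)                             # T = X / sqrt(Y/n)

T = [t_draw() for _ in range(N)]
mean_T = fmean(T)
var_T = pvariance(T, mean_T)
print(mean_T, var_T)  # close to 0 and n/(n-2) = 5/3
```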
(4) The 𝒇-distribution
The distribution of a random variable 𝑋, where 𝑋 is given by
𝑋 ≔ (𝑌1/𝑚)/(𝑌2/𝑛) with 𝑝(𝑌1) = 𝜒2(𝑌1;𝑚) and 𝑝(𝑌2) = 𝜒2(𝑌2; 𝑛) (1)
is called 𝑓-distribution with (𝑚, 𝑛) degrees of freedom and is denoted as 𝑓(𝑋;𝑚, 𝑛). A probability density
function for the 𝑓 -distribution is given by
𝑔𝑚,𝑛: ℝ+\{0} → ℝ+, 𝑥 ↦ 𝑔𝑚,𝑛(𝑥) ≔ (Γ((𝑚 + 𝑛)/2)/(Γ(𝑚/2)Γ(𝑛/2))) 𝑚^(𝑚/2) 𝑛^(𝑛/2) ⋅ 𝑥^((𝑚−2)/2)/(𝑚𝑥 + 𝑛)^((𝑚+𝑛)/2) (2)
where 𝛤 denotes the Gamma function
𝛤: ℝ+ → ℝ+, 𝑥 ↦ 𝛤(𝑥) ≔ ∫_0^∞ exp(−𝑡) 𝑡^(𝑥−1) 𝑑𝑡 (3)
Some probability density functions of 𝑓-distributions with varying degrees of freedom are visualized in Figure 1.
Figure 1 𝑓-distribution probability density functions for varying degrees of freedom
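Definition (1) can likewise be simulated. The expectation used in the check below, 𝑛/(𝑛 − 2) for 𝑛 > 2, is a standard fact about the 𝑓-distribution that is not stated in the text above; the degrees of freedom are made up:

```python
import random
from statistics import fmean

random.seed(3)
m, n = 4, 10     # numerator and denominator degrees of freedom (illustrative)
N = 100_000

def chi2_draw(k):
    # Sum of k squared standard normal random variables
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))

# X = (Y1/m) / (Y2/n) with Y1 ~ chi^2(m) and Y2 ~ chi^2(n)
X = [(chi2_draw(m) / m) / (chi2_draw(n) / n) for _ in range(N)]

mean_X = fmean(X)
print(mean_X)  # close to n/(n-2) = 1.25
```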
Probability distributions in Bayesian inference
(1) The gamma distribution
The gamma distribution is a useful distribution to describe uncertainty about univariate random
variables with strictly positive domain, such as the precision parameter of a Gaussian distribution. A
probability density function for the gamma distribution is given, in its “shape and scale” parameterization, by
𝐺: ℝ+ → ℝ+, 𝜆 ↦ 𝐺(𝜆; 𝑎, 𝑏) ≔ (1/Γ(𝑎)) (1/𝑏^𝑎) 𝜆^(𝑎−1) exp(−𝜆/𝑏) for 𝑎, 𝑏 > 0 (1)
where 𝑎 is referred to as the “shape parameter” and 𝑏 is referred to as the “scale parameter”. Γ(𝑥) denotes
the gamma function which is defined for 𝑥 > 0 as:
Γ: ℝ+\{0} → ℝ, 𝑥 ↦ Γ(𝑥) ≔ ∫_0^∞ 𝑡^(𝑥−1) 𝑒^(−𝑡) 𝑑𝑡 (2)
The expectation and variance of 𝜆 under 𝐺(𝜆; 𝑎, 𝑏) are expressed in terms of the parameters as
𝐸(𝜆) = 𝑎𝑏 and 𝑉(𝜆) = 𝑎𝑏2 (3)
Figure 1 below depicts the gamma distribution for a range of shape and scale parameters.
Figure 1. Gamma distribution probability density functions for varying shape and scale parameters. If the shape parameter 𝑎 is smaller than or equal to 1, the mode of the density function is at 0; otherwise it is larger than 0. The scale parameter changes the scale of the distribution, i.e. an increase in 𝑏 stretches the distribution, shifting probability mass away from zero.
In addition to the shape and scale parameterization, the gamma distribution is also often
characterized by probability density functions using the “shape and rate” parameterization. For a positive
random variable 𝜆, these take the form
𝐺: ℝ+ → ℝ+, 𝜆 ↦ 𝐺(𝜆; 𝛼, 𝛽) ≔ (𝛽^𝛼/𝛤(𝛼)) 𝜆^(𝛼−1) exp(−𝛽𝜆) for 𝛼, 𝛽 > 0 (4)
where 𝛼 is referred to as the “shape” parameter and 𝛽 is referred to as the “rate” parameter. In terms of
these parameters, the expectation and variance of 𝜆 under 𝐺(𝜆; 𝛼, 𝛽) are given by
𝐸(𝜆) = 𝛼/𝛽 and 𝑉(𝜆) = 𝛼/𝛽^2 (5)
Figure 2 below depicts the gamma distribution for a range of shape and rate parameters.
Figure 2. Gamma distribution probability density functions for varying shape and rate parameters. If the shape parameter 𝛼 is smaller than or equal to 1, the mode of the density function is at 0; otherwise it is larger than 0. The rate parameter also changes the scale of the distribution: an increase in 𝛽 pushes the mass of the distribution towards zero and upwards. Note that the case 𝛼 = 1 corresponds to exponential distribution probability density functions and the case 𝛼 = 2 to Erlang distribution probability density functions.
The shape and rate parameterization of the gamma distribution is closely related to the exponential
distribution, which is defined as
𝐸𝑥𝑝(𝜆; 𝛽) ≔ 𝐺(𝜆; 1, 𝛽) (6)
i.e., the gamma distribution with shape parameter 𝛼 ≔ 1, whose mode is zero. The exponential distribution is the governing distribution of the inter-event times of a Poisson process, which is itself in turn defined by the rate parameter 𝛽.
The gamma distribution has close ties with at least three other distributions. Firstly, if the shape
parameter is set to 𝛼 ≔ 2 it corresponds to the one-parameter Erlang distribution
𝐸𝑟𝑙𝑎𝑛𝑔(𝜆; 𝛽) = 𝐺(𝜆; 2, 𝛽) (7)
Notably, the mode of the Erlang distribution is larger than zero. The Erlang distribution is used in
probabilistic models of queuing processes, where it takes the role of modelling inter-event times like the
exponential distribution for the Poisson process. Secondly, the chi-squared distribution, introduced above as
the distribution of the sum of squared univariate Gaussian random variables, is a special case of the gamma
distribution. Specifically, the probability density function of a chi-squared distribution with 𝑛 degrees of
freedom corresponds to a shape and rate parameterized Gamma distribution with 𝛼 ≔ 𝑛/2 and 𝛽 ≔ 1/2:
𝜒2(𝜉; 𝑛) = 𝐺(𝜉; 𝑛/2, 1/2) (8)
Finally, it is closely related to the inverse gamma distribution and is generalized to a multivariate distribution
in the form of the Wishart distribution.
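The equivalence of the two parameterizations (a rate 𝛽 corresponds to a scale 𝑏 = 1/𝛽) and the chi-squared relation (8) can be checked by evaluating the density functions directly; the evaluation points below are arbitrary:

```python
import math

def gamma_scale_pdf(lam, a, b):
    # Shape-and-scale parameterization, equation (1)
    return (1.0 / math.gamma(a)) * (1.0 / b ** a) * lam ** (a - 1) * math.exp(-lam / b)

def gamma_rate_pdf(lam, alpha, beta):
    # Shape-and-rate parameterization, equation (4)
    return (beta ** alpha / math.gamma(alpha)) * lam ** (alpha - 1) * math.exp(-beta * lam)

def chi2_pdf(xi, n):
    # Chi-squared density, equation (2) of the previous section
    return (2 ** (n / 2) * math.gamma(n / 2)) ** -1 * xi ** (n / 2 - 1) * math.exp(-xi / 2)

lam, a, b = 1.5, 2.0, 0.5
same = abs(gamma_scale_pdf(lam, a, b) - gamma_rate_pdf(lam, a, 1 / b)) < 1e-12
chi2_as_gamma = abs(chi2_pdf(lam, 3) - gamma_rate_pdf(lam, 3 / 2, 1 / 2)) < 1e-12
print(same, chi2_as_gamma)  # True True
```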
(2) The inverse Gamma distribution
The Gamma distribution is traditionally used to model uncertainty over the precision parameter
𝜆 ≔ (𝜎2)−1 of a Gaussian distribution. A more direct way is to formulate uncertainty directly over the
variance parameter 𝜎2. To this end, one can show that if a random variable 𝜆 > 0 is distributed according to
a Gamma distribution, then its inverse 𝜎2 = 𝜆−1 is distributed according to an “inverse gamma distribution”
for which a probability density function is given by
𝐼𝐺: ℝ+ → ℝ+, 𝜎2 ↦ 𝐼𝐺(𝜎2; 𝑎, 𝑏) ≔ (𝑏^𝑎/𝛤(𝑎)) (𝜎2)^(−(𝑎+1)) exp(−𝑏/𝜎2) (1)
The expectation and variance of the inverse Gamma distribution in terms of its “shape parameter” 𝑎 and its
“scale parameter” 𝑏 are given by
𝐸(𝜎2) = 𝑏/(𝑎 − 1) and 𝑉(𝜎2) = 𝑏^2/((𝑎 − 1)^2(𝑎 − 2)) (2)
Note that the expectation only exists if 𝑎 > 1, and the variance only exists if 𝑎 > 2. Figure 1 depicts gamma distribution probability densities for a range of parameter settings (left panel) and their corresponding inverse gamma distribution probability densities (right panel).
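The transformation and expectation formula (2) can be checked by sampling 𝜆 from a gamma distribution and inspecting 1/𝜆; parameter values and sample size below are illustrative:

```python
import random
from statistics import fmean

random.seed(4)
a, beta = 3.0, 2.0   # gamma shape and rate (made-up values)
N = 100_000

# random.gammavariate takes (shape, scale); a rate of beta corresponds to a scale of 1/beta
sigma2 = [1.0 / random.gammavariate(a, 1.0 / beta) for _ in range(N)]

# If lambda ~ G(a, rate beta), then sigma^2 = 1/lambda is inverse-gamma with
# shape a and scale b = beta, so E(sigma^2) = b/(a-1) = 2/2 = 1 per equation (2)
mean_sigma2 = fmean(sigma2)
print(mean_sigma2)  # close to 1
```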
Figure 1. Gamma (left panel) and inverse gamma (right panel) distribution probability density functions for varying parameters. Note that if 𝛼 = 𝑎 = 1, the mode of the distribution of 𝜆 is zero, while the mode of the distribution of 𝜎2 = 1/𝜆 is larger than zero.
(3) The Wishart distribution
The Wishart distribution is the multivariate generalization of the Gamma distribution. Whereas the Gamma distribution is defined for random variables taking on only positive values, the Wishart distribution is defined for positive-definite matrices. Just as the Gamma distribution is classically used to describe uncertainty over the
precision parameter 𝜆 ≔ (𝜎2)−1 of (univariate or multivariate) Gaussian probability density functions, the
Wishart distribution is used to model uncertainty about precision matrices Λ = Σ−1 of multivariate Gaussian
probability density functions with covariance matrix parameter Σ. A probability density function of the
Wishart distribution is given by
𝑊: ℝ𝑝.𝑑.𝑛×𝑛 → ℝ+, Λ ↦ 𝑊(Λ; 𝑆, 𝜈) ≔ (2^(𝜈𝑛/2) Γ𝑛(𝜈/2) |𝑆|^(𝜈/2))^(−1) |Λ|^((𝜈−𝑛−1)/2) exp(−(1/2) tr(Λ𝑆⁻¹)) (1)
In (1) we denote by ℝ𝑝.𝑑.𝑛×𝑛 the set of positive-definite matrices of size 𝑛 × 𝑛, by 𝑆 ∈ ℝ𝑛×𝑛 the “scale matrix”
parameter of the Wishart distribution, and by 𝜈 ∈ ℝ the “degrees of freedom parameter” of the Wishart
distribution. Finally, Γ𝑛 denotes the “multivariate gamma function”, defined in terms of the Gamma function
Γ as
Γ𝑛: ℝ+\{0} → ℝ, 𝑥 ↦ Γ𝑛(𝑥) ≔ 𝜋^(𝑛(𝑛−1)/4) ∏_{𝑖=1}^{𝑛} Γ(𝑥 + (1 − 𝑖)/2) (2)
Note that the multivariate Gamma function is referred to as “multivariate”, because it is useful in
multivariate statistics, not because it is defined on a multivariate domain.
The Wishart distribution is only defined for 𝜈 > 𝑛 − 1, as otherwise the normalization constant (2^(𝜈𝑛/2) Γ𝑛(𝜈/2) |𝑆|^(𝜈/2))^(−1) of its probability density function does not exist. The expectation of a Wishart
distribution with scale matrix 𝑆 and degrees of freedom 𝜈 is given by
𝐸(Λ) = 𝜈𝑆 (3)
For 𝑛 = 1, 𝜆 ≔ Λ ∈ ℝ𝑝.𝑑.1×1, scale parameter 𝑠⁻¹ and degrees of freedom parameter 𝜈, the Wishart distribution corresponds to the Gamma distribution with shape parameter 𝛼 = 𝜈/2 and rate parameter 𝛽 = 𝑠/2; thus the following equivalence of probability density functions holds
𝑊(𝜆; 𝑠⁻¹, 𝜈) = 𝐺(𝜆; 𝜈/2, 𝑠/2) (4)
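Expectation formula (3) can be checked by simulation. For integer 𝜈, a Wishart matrix with scale 𝑆 and 𝜈 degrees of freedom can be constructed as the sum of 𝜈 outer products of 𝑁(0, 𝑆) vectors (a standard fact assumed here, not stated in the text); all numerical values are made up:

```python
import math
import random

random.seed(5)
S = [[2.0, 0.5], [0.5, 1.0]]   # illustrative 2x2 positive-definite scale matrix
nu = 5                          # degrees of freedom (integer, for the construction below)
R = 20_000                      # number of simulated Wishart draws (made up)

# Cholesky factor L of S (S = L L^T), computed by hand for the 2x2 case
l11 = math.sqrt(S[0][0])
l21 = S[1][0] / l11
l22 = math.sqrt(S[1][1] - l21 ** 2)

mean_lambda = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(R):
    lam = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(nu):
        z1, z2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
        x1, x2 = l11 * z1, l21 * z1 + l22 * z2   # x ~ N(0, S)
        lam[0][0] += x1 * x1; lam[0][1] += x1 * x2
        lam[1][0] += x2 * x1; lam[1][1] += x2 * x2
    for i in range(2):
        for j in range(2):
            mean_lambda[i][j] += lam[i][j] / R

print(mean_lambda)  # close to nu * S = [[10, 2.5], [2.5, 5]]
```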
(4) The inverse Wishart distribution
The inverse Wishart distribution is useful to describe uncertainty directly about covariance matrix
parameters instead of precision matrix parameters in Gaussian models. Based on the definition of the
Wishart distribution 𝑊(Λ; 𝑆, 𝜈) over a positive-definite matrix Λ ∈ ℝ𝑝.𝑑.𝑛×𝑛, it can be shown that the inverse Σ ≔ Λ⁻¹ of Λ is distributed according to the following probability density function
𝐼𝑊: ℝ𝑝.𝑑.𝑛×𝑛 → ℝ+, Σ ↦ 𝐼𝑊(Σ; 𝑆, 𝜈) ≔ (2^(𝜈𝑛/2) Γ𝑛(𝜈/2) |𝑆|^(𝜈/2))^(−1) |Σ|^(−(𝜈+𝑛+1)/2) exp(−(1/2) tr(𝑆⁻¹Σ⁻¹)) (1)
The expectation of the inverse Wishart distribution over Σ is given in terms of its parameters by
𝐸(Σ) = (𝜈 − 𝑛 − 1)−1𝑆−1 (2)
Just as the Wishart distribution reduces to the Gamma distribution in the case 𝑛 = 1, the inverse Wishart distribution reduces to the inverse Gamma distribution for 𝑛 = 1 and 𝜎2 ∈ ℝ𝑝.𝑑.1×1, i.e.
𝐼𝑊(𝜎2; 𝑠⁻¹, 𝜈) = 𝐼𝐺(𝜎2; 𝜈/2, 𝑠/2) (3)
(5) The normal-gamma distribution and the normal-inverse gamma distribution
The normal-gamma distribution is the conjugate prior distribution for inference about Gaussian
models with unknown (univariate) precision parameter. It is a joint distribution over an 𝑛-dimensional
random vector 𝑥 and a univariate random variable 𝜆 that factorizes as follows
𝑝(𝑥, 𝜆) = 𝑝(𝑥|𝜆)𝑝(𝜆) (1)
for which the marginal distribution over 𝜆 is given by a gamma distribution in its shape and rate
parameterization
𝑝(𝜆) ≔ 𝐺(𝜆; 𝛼, 𝛽) ≔ (𝛽^𝛼/Γ(𝛼)) 𝜆^(𝛼−1) exp(−𝛽𝜆) (2)
and the conditional distribution is given by
𝑝(𝑥|𝜆) ≔ 𝑁(𝑥; 𝜇, (𝜏𝜆)⁻¹𝐼𝑛) ≔ (2𝜋)^(−𝑛/2) (𝜏𝜆)^(𝑛/2) exp(−(𝜏𝜆/2) (𝑥 − 𝜇)𝑇(𝑥 − 𝜇)) (3)
With (1), a probability density function for the normal-gamma distribution is thus given by
𝑁𝐺(𝑥, 𝜆; 𝜇, 𝜏, 𝛼, 𝛽) ≔ (𝛽^𝛼/Γ(𝛼)) 𝜆^(𝛼−1) exp(−𝛽𝜆) ⋅ (2𝜋)^(−𝑛/2) (𝜏𝜆)^(𝑛/2) exp(−(𝜏𝜆/2) (𝑥 − 𝜇)𝑇(𝑥 − 𝜇)) (4)
Notably, the marginal distribution of 𝑥 under a normal-gamma distribution 𝑝(𝑥, 𝜆) is given by a
multivariate non-central t-distribution, and the marginal distribution of 𝜆 under a normal-gamma
distribution 𝑝(𝑥, 𝜆) is given by a gamma distribution. Figure 1 depicts the normal-gamma distribution for a
univariate variable 𝑥 ∈ ℝ and a number of parameter settings.
Figure 1. Normal-gamma distribution probability density functions for 𝑥 ∈ ℝ as a function of the expectation parameter 𝜇 ∈ ℝ (panel rows) and the shape parameter 𝛼 > 0 (panel columns). The white dots depict the expectations of the respective distributions.
In analogy to the normal-gamma distribution, the normal-inverse gamma distribution is obtained by assuming an inverse gamma distribution over the variance parameter of the Gaussian conditional distribution. In other words, the normal-inverse gamma distribution is a joint distribution over an 𝑛-dimensional random vector 𝑥 and a univariate positive random variable 𝜎2 that factorizes according to
𝑝(𝑥, 𝜎2) = 𝑝(𝑥|𝜎2)𝑝(𝜎2) (5)
where
𝑝(𝜎2) ≔ 𝐼𝐺(𝜎2; 𝑎, 𝑏) = (𝑏^𝑎/𝛤(𝑎)) (𝜎2)^(−(𝑎+1)) exp(−𝑏/𝜎2) (6)
and
𝑝(𝑥|𝜎2) ≔ 𝑁(𝑥; 𝜇, 𝜎2𝐼𝑛) ≔ (2𝜋)^(−𝑛/2) (𝜎2)^(−𝑛/2) exp(−(1/(2𝜎2)) (𝑥 − 𝜇)𝑇(𝑥 − 𝜇)) (7)
The normal-inverse gamma distribution probability density function is thus given as
𝑁𝐼𝐺(𝑥, 𝜎2; 𝜇, 𝑎, 𝑏) = (2𝜋)^(−𝑛/2) (𝜎2)^(−𝑛/2) (𝑏^𝑎/Γ(𝑎)) (𝜎2)^(−(𝑎+1)) exp(−(1/(2𝜎2)) (𝑥 − 𝜇)𝑇(𝑥 − 𝜇) − 𝑏/𝜎2) (8)
Figure 2 depicts the normal-inverse gamma distribution for a univariate variable 𝑥 ∈ ℝ and a number of parameter settings.
Figure 2. Normal-inverse gamma distribution probability density functions for 𝑥 ∈ ℝ as a function of the expectation parameter 𝜇 ∈ ℝ (panel rows) and the scale parameter 𝑏 (panel columns). The white dots depict the expectations of the respective distributions.
(6) The univariate non-central 𝒕 -distribution
The univariate non-central t-distribution is a generalization of the t-distribution. A probability density
function for the non-central t-distribution is given by
𝑡: ℝ → ℝ+, 𝑥 ↦ 𝑡(𝑥; 𝜇, 𝜎2, 𝜈) ≔ (Γ((𝜈 + 1)/2)/Γ(𝜈/2)) ⋅ (1/√(𝜈𝜋𝜎2)) ⋅ (1 + (1/𝜈)((𝑥 − 𝜇)/𝜎)^2)^(−(𝜈+1)/2) (1)
where Γ denotes the Gamma function. The non-central t-distribution has three parameters: the expectation
parameter 𝜇 ∈ ℝ, the scale parameter 𝜎2 > 0, and the degrees of freedom parameter 𝜈 > 0.
The expectation of the univariate non-central t-distribution for 𝜈 > 1 is given by
𝐸(𝑥) = 𝜇 (2)
For 𝜈 = 1, the univariate non-central t-distribution is referred to as the “Cauchy distribution” and has the peculiar property that the expectation does not exist, because the respective integral diverges. The variance of the univariate non-central t-distribution for 𝜈 > 2 is given by
𝑉(𝑥) = (𝜈/(𝜈 − 2)) 𝜎2 (3)
Figure 1 depicts the univariate non-central t-distribution for two choices of 𝜇 and a number of choices of 𝜎2
and 𝜈.
Figure 1. Examples of univariate non-central t-distribution probability densities
(7) The multivariate non-central 𝒕-distribution
The multivariate non-central t-distribution is the multivariate generalization of the univariate non-
central t-distribution. Let 𝑥 ∈ ℝ𝑛 denote a random vector. A probability density function of the multivariate
noncentral t-distribution is given by
𝑡: ℝ𝑛 → ℝ+, 𝑥 ↦ 𝑡(𝑥; 𝜇, Σ, 𝜈) ≔ (Γ((𝜈 + 𝑛)/2)/Γ(𝜈/2)) ⋅ (1/((𝜋𝜈)^(𝑛/2)|Σ|^(1/2))) ⋅ (1 + (1/𝜈)(𝑥 − 𝜇)𝑇Σ⁻¹(𝑥 − 𝜇))^(−(𝜈+𝑛)/2) (1)
where Γ denotes the Gamma function. The multivariate non-central t-distribution has three parameters: the expectation parameter 𝜇 ∈ ℝ𝑛, the scale matrix parameter Σ ∈ ℝ𝑝.𝑑.𝑛×𝑛, and the degrees of freedom parameter 𝜈 > 0. In terms of its parameters, the expectation of the multivariate non-central t-distribution for 𝜈 > 1 is given
by
𝐸(𝑥) = 𝜇 (2)
and its covariance for 𝜈 > 2 is given by
𝐶(𝑥) = (𝜈/(𝜈 − 2)) Σ (3)
Figure 1 below depicts examples of the multivariate non-central t-distribution for the special case 𝑛 = 2.
Figure 1. Examples of multivariate non-central t-distribution probability densities. The expectation and degrees of freedom parameters of both probability densities shown are 𝜇 = (1,1)𝑇 and 𝜈 = 2, respectively. The shape matrix of the left probability density is Σ1 ≔ (1 0.5; 0.5 1), while the shape matrix of the right probability density is Σ2 ≔ (2 0; 0 2).
Basic Theory of the General Linear Model
Structural and probabilistic aspects
(1) Experimental design
The aim of the following section is to briefly review some of the key terms in experimental design.
Experiment
A scientific experiment can be defined as the controlled test of a “hypothesis” or “theory”.
Experiments manipulate some aspect of the world and then measure the outcome of that manipulation. In
cognitive neuroimaging experiments, scientists often manipulate some aspect of a stimulus, e.g. showing a
face or an object visually, or manipulating whether a word is easy or difficult to remember, and then
measure the observer's behavior and/or brain activity using FMRI or M/EEG.
Experimental Design
Experimental design refers to the organization of an experiment to allow for the effective
investigation of the research hypothesis. All well-designed experiments share several characteristics: they
test specific hypotheses, rule out alternative explanations for the data, and minimize the cost of running the
experiment.
Experimental variables
An experimental variable can be defined as a measured or manipulated quantity that varies within
an experiment. Two classes of experimental variables are central: independent and dependent variables.
Independent experimental variables are aspects of the experimental design that are intentionally
manipulated by the experimenter and that are hypothesized to cause changes in the dependent variables.
Independent variables in cognitive neuroscience experiments include for example different forms of sensory
stimulation, different cognitive contexts, or different motor tasks. The different values of an independent
variable are often referred to as “conditions” or “levels”. Usually, independent variables are explicitly
controlled. Mathematically, they are thus not represented by random variables, but rather by known
constants.
Dependent experimental variables are quantities that are measured by the experimenter in order to
evaluate the effect of the independent variables. Examples for dependent variables in cognitive
neuroscientific experiments are the response accuracy and reaction time on a given psychophysical task, the
BOLD signal at a given voxel in an FMRI experiment, or the frequency composition at a specific channel in an
M/EEG experiment. Mathematically, dependent experimental variables are usually modeled by random
variables.
Categorical and continuous variables
In principle, both independent and dependent variables can be either categorical or continuous. A
categorical variable is one that can take one of several discrete values, for example sensory stimulation vs.
no sensory stimulation, or different stimulus categories, e.g. faces and houses. Such categorical variables are
also often referred to as “factors”, which take on different “levels”. Mathematically, categorical variables are
usually represented as elements of the natural numbers or signed integers. A continuous variable is one that
can take on any value within a pre-specified range. Examples of continuous variables include
different levels of contrast of a visual stimulus, as well as most observed signals in noninvasive neuroimaging
such as the BOLD signal or the electrical potential in EEG. Mathematically, continuous variables are usually
elements of the real numbers. One defining feature of the GLM is that it accommodates both scenarios of
categorical and continuous independent variables, while the dependent variable is usually continuous.
Between-subjects and within-subject (repeated measures) designs
Experimental designs can further be classified according to whether the independent variable
treatments are applied to the same group of participants or to different groups of participants. In a between-subjects manipulation, different subject groups reflect different values of the independent variable. More
common in cognitive neuroscience are within-subject designs where each subject participates in all
experimental conditions. These designs are commonly referred to as “repeated-measures designs”.
After the introductory remarks of the previous section we now begin with the introduction of the
general linear model (GLM). In brief, the GLM is a unifying perspective on many parametric statistical
methods such as simple and multiple linear regression, one- and two-sample t-tests, and the many variants of the analyses of variance and covariance. Typically, in undergraduate statistics courses, these methods are introduced one after the other. Here we take a different route: in the Section “Theory of the GLM”, we will discuss the generalization of these methods in the form of the GLM, and only after this has been achieved in some depth will we re-introduce the aforementioned special cases of the GLM in the subsequent Section “Applications of the GLM”.
(2) A verbose introduction
The GLM is often written as
𝑋𝛽 + 휀 = 𝑦 (1)
where 𝑋 denotes a “design matrix”, 𝛽 denotes a set of “parameters”, 휀 denotes a probabilistic Gaussian
distributed “error” term, and 𝑦 denotes “the data”. The aim of the current section is to introduce basic
aspects of equation (1). To this end, we will ignore the error term 휀 and focus on the product 𝑋𝛽 and the
data 𝑦. The importance of the error term 휀 will be studied in the subsequent Section.
From elementary statistics we are familiar with the concept of an independent variable, which we
will denote as “𝑥” for the moment, and the concept of a dependent variable, denoted as “𝑦” for the
moment. The idea is that variable 𝑥 is under control of the experimenter, while the variable 𝑦 is the
phenomenon that is being observed. 𝑦 is not under direct control of the experimenter, but in some way
related to 𝑥. In real world experiments, both 𝑥 and 𝑦 come about in a number of ways. For example 𝑥 could
represent a set of qualitatively different groupings, or treatments, as for example pharmacological or
behavioural treatments of depression. In other contexts, the independent variable may be quantitative, for
example representing the dosage of a drug given, or the duration that a visual stimulus is presented to a
human observer. Likewise, the variable 𝑦 can take different forms. For example, 𝑦 could be the number of
rats that die during a poison experiment, the time it takes a neuropsychological patient to identify an object,
or an MR contrast value observed at specific brain location at a specific time.
In any case, there will be two sets of quantities for which we postulate some kind of relationship. For
reasons of simplicity and flexibility (and because many smooth functional relationships are at least locally approximately linear), it is common to represent such relationships in terms of linear models. In verbose terms, a (noise-free) linear
model states that “an observed value of the dependent variable 𝑦 is equal to a weighted sum of values
associated with one or more independent variables 𝑥.” In order to render the last statement more precisely,
we will have to introduce some formal notation:
Let 𝑦𝑖 denote one observation of the dependent variable 𝑦, where 𝑖 = 1,… , 𝑛. Likewise let
𝑥𝑖𝑗 , 𝑖 = 1,… , 𝑛, 𝑗 = 1,… , 𝑝 denote the values of a number of independent variables that are supposed to be
associated with the observation 𝑦𝑖. Here, 𝑝 represents the number of “predictors” or independent variables.
The statement that the value 𝑦𝑖 equals the weighted sum of the values of the independent variables
𝑥𝑖𝑗 associated with this observation can then be written as
𝑥𝑖1𝛽1 + 𝑥𝑖2𝛽2 + 𝑥𝑖3𝛽3 +⋯+ 𝑥𝑖𝑝𝛽𝑝 = 𝑦𝑖 (1)
The 𝛽𝑗 (“beta 𝑗”) values are multiplicative coefficients that quantify the contribution of the
independent variable 𝑥𝑖𝑗 to the observed effect 𝑦𝑖. All variables in the expression above can be thought of as
real scalar numbers. A numerical example of the expression above, for the seventh observation 𝑦7 in a set of observations with 𝑝 = 4 independent variables (and, correspondingly, 𝑝 = 4 beta parameters), would be
𝑥71𝛽1 + 𝑥72𝛽2 + 𝑥73𝛽3 + 𝑥74𝛽4 = 𝑦7 (2)
A numerical instance of (2) could be
16 ∙ 0.25 + 1 ∙ 2 + 3 ∙ 0.5 + 2.5 ∙ 1 = 10 (3)
Here, the values of the independent variables are
𝑥71 = 16, 𝑥72 = 1, 𝑥73 = 3, 𝑥74 = 2.5 (4)
the beta parameter values are
𝛽1 = 0.25, 𝛽2 = 2, 𝛽3 = 0.5, 𝛽4 = 1 (5)
and the observed value of the dependent variable is
𝑦7 = 10 (6)
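The arithmetic of this example can be reproduced directly as a weighted sum:

```python
# Values of the independent variables for observation 7, and the beta parameters
x7 = [16, 1, 3, 2.5]
beta = [0.25, 2, 0.5, 1]

# Weighted sum x71*b1 + x72*b2 + x73*b3 + x74*b4
y7 = sum(x * b for x, b in zip(x7, beta))
print(y7)  # 10.0
```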
It is very important to be always clear about which variables are known at what point in an
experiment. The independent variables 𝑥𝑖𝑗 are specified by the experimenter, thus they are known as soon
as the experimenter has decided how to set up the experiment. The observation values 𝑦𝑖 are known as
soon as the experimenter has collected data points in response to the fixed set of independent variables
𝑥𝑖1, … , 𝑥𝑖𝑝. What the experimenter does not know in advance (but might have a hypothesis about, or some
interest in), is how much each of the variables 𝑥𝑖1, … , 𝑥𝑖𝑝 contributes to the sum. That is, the weighting
coefficients 𝛽1, … , 𝛽𝑝 are not known in advance and can only be determined by repeatedly observing the
dependent variable in the presence of the fixed set of independent variables.
In terms of mathematical modelling, model values that are determined from the data are called
“parameters” or “free parameters”. Usually, the free parameters are adjusted in a way that the
mathematical model is able to predict an observed response in the best possible way. The process of finding
these parameter values is called “model estimation”, “parameter fitting”, or “model fitting”. We will discuss
the details of model estimation and the assumptions behind it for the GLM in more detail in subsequent
sections.
(3) Simple linear regression
Expressions like
𝑥𝑖1𝛽1 + 𝑥𝑖2𝛽2 + 𝑥𝑖3𝛽3 +⋯+ 𝑥𝑖𝑝𝛽𝑝 = 𝑦𝑖 (1)
are usually introduced in undergraduate statistics courses under the label of “multiple linear
regression”. Multiple linear regression is usually introduced as a generalization of “simple linear regression”
to more than two (or one, depending on how simple linear regression is treated) independent variables. The
defining feature of multiple regression as introduced in undergraduate statistics is that the independent
variables represent “continuous variables”. In this section, we will consider simple linear regression as a special case of the multiple linear regression problem, in order to link the intuitions about simple linear regression with the matrix notation of the GLM.
Most likely, the reader will have encountered simple linear regression in the form of the following
equation
𝑦 = 𝑎 + 𝑏𝑥 (1)
where 𝑎 was referred to as the “offset” and 𝑏 as the “slope”, 𝑥 was called the “independent variable” and 𝑦
was called the “dependent variable”.
Let us ponder the meaning of (1) for a bit. This equation says that, if we know the values of 𝑥, 𝑏 and 𝑎 (which are all real scalar numbers), we can compute the value of 𝑦. Let us assume
that we would like to compute the value of 𝑦 for five different values of 𝑥, namely,
𝑥12 = 0.2, 𝑥22 = 1.4, 𝑥32 = 2.3, 𝑥42 = 0.7 and 𝑥52 = 0.5. (2)
The reader may remember from undergraduate statistics that the values of 𝑥 and 𝑦 were allowed to vary,
whereas 𝑎 and 𝑏 were “fixed” or “always the same”. Let us assume that 𝑎 = 0.8 and that 𝑏 = 1.3. We may
thus write the five values of 𝑦 corresponding to the five values of 𝑥 as:
𝑦1 = 𝑎 + 𝑏𝑥12 = 1 ⋅ 0.8 + 1.3 ⋅ 0.2 (3)
𝑦2 = 𝑎 + 𝑏𝑥22 = 1 ⋅ 0.8 + 1.3 ⋅ 1.4
𝑦3 = 𝑎 + 𝑏𝑥32 = 1 ⋅ 0.8 + 1.3 ⋅ 2.3
𝑦4 = 𝑎 + 𝑏𝑥42 = 1 ⋅ 0.8 + 1.3 ⋅ 0.7
𝑦5 = 𝑎 + 𝑏𝑥52 = 1 ⋅ 0.8 + 1.3 ⋅ 0.5
If the reader is familiar with matrix notation and especially matrix multiplication it will be readily apparent
that expression (3) can also be written as
(𝑦1)   (𝑥11 𝑥12)         (1 0.2)         (1 ⋅ 0.8 + 1.3 ⋅ 0.2)
(𝑦2)   (𝑥21 𝑥22)         (1 1.4)         (1 ⋅ 0.8 + 1.3 ⋅ 1.4)
(𝑦3) = (𝑥31 𝑥32) (𝑎)  =  (1 2.3) (0.8) = (1 ⋅ 0.8 + 1.3 ⋅ 2.3)   (4)
(𝑦4)   (𝑥41 𝑥42) (𝑏)     (1 0.7) (1.3)   (1 ⋅ 0.8 + 1.3 ⋅ 0.7)
(𝑦5)   (𝑥51 𝑥52)         (1 0.5)         (1 ⋅ 0.8 + 1.3 ⋅ 0.5)
Note that in (4) we have introduced another “independent variable” 𝑥𝑖1, 𝑖 = 1,… , 𝑛 which takes on the
value 1 for all values 𝑦𝑖 , 𝑖 = 1,… , 𝑛.
What did we gain from rewriting (3) as (4)? Not much yet, but we can express the relatively large expression (3) more compactly in matrix notation, if we additionally define
    (𝑦1)       (𝑥11 𝑥12)
    (𝑦2)       (𝑥21 𝑥22)
𝑦 ≔ (𝑦3), 𝑋 ≔ (𝑥31 𝑥32) and 𝛽 ≔ (𝑎)   (5)
    (𝑦4)       (𝑥41 𝑥42)         (𝑏)
    (𝑦5)       (𝑥51 𝑥52)
The last definition in (5) can be simplified and put into the context of the standard GLM by setting
𝛽1 ≔ 𝑎 and 𝛽2 ≔ 𝑏 (6)
i.e., by setting
𝛽 ≔ (𝛽1, 𝛽2)𝑇 (7)
The reader may take note of the dimensions of 𝑦, 𝑋, and 𝛽: 𝑦 is a 5 × 1 real vector, which we write as 𝑦 ∈ ℝ5; 𝑋 is a 5 × 2 real matrix, written as 𝑋 ∈ ℝ5×2; and 𝛽 is a 2 × 1 real vector, written as 𝛽 ∈ ℝ2. In matrix form, with 𝑦 ∈ ℝ5, 𝑋 ∈ ℝ5×2 and 𝛽 ∈ ℝ2, we can thus write (3) very compactly as
𝑦 = 𝑋𝛽 (8)
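The compact form 𝑦 = 𝑋𝛽 can be reproduced numerically for the five-observation example above (𝑎 = 0.8, 𝑏 = 1.3):

```python
# Design matrix: a column of ones (for the offset) and the independent variable values
X = [[1, 0.2], [1, 1.4], [1, 2.3], [1, 0.7], [1, 0.5]]
beta = [0.8, 1.3]   # (a, b)

# Matrix-vector product y = X beta, computed row by row
y = [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]
print(y)
```

Each entry of the resulting vector equals 𝑎 + 𝑏𝑥, i.e. the straight-line equation evaluated at the corresponding value of the independent variable.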
To conclude this section, consider equation (1) again. In comparison with equation (8), it is apparent that so far we did not consider the error term 휀. In fact, equation (8) merely describes the “deterministic” or “structural” aspect of the simple linear regression model. This can be formulated in two ways: the simple linear regression model corresponds to the straight-line equation “under the addition of observation noise”. More sensibly, however, the “observation noise” is appreciated as an integral part of the model of interest here, namely the GLM, and thus the GLM is a “probabilistic model” with some “deterministic aspects”. So far, we have considered the deterministic aspects; next, we will consider the probabilistic aspects.
(4) The Gaussian assumption
In the previous Section and the treatment of matrix notation and matrix multiplication in the
mathematical preliminaries we have considered the terms 𝑋𝛽 and 𝑦 in the GLM equation
𝑋𝛽 + 휀 = 𝑦 (1)
in some detail. For the special case of simple linear regression we have seen that 𝑦 corresponds to a vector
of real values with 𝑛 ∈ ℕ entries, where 𝑛 corresponds to the number of data points. We also have seen that
the design matrix 𝑋 ∈ ℝ𝑛×2 has two columns, as there are two parameters 𝛽1 and 𝛽2 in a simple linear regression model (the “offset” and the “slope”), and thus that 𝛽 ≔ (𝛽1, 𝛽2)𝑇 ∈ ℝ2. One fundamental aspect
of the GLM equation (1) is that it generalizes many special cases such as simple linear regression by allowing
for different numbers of parameters 𝑝 ∈ ℕ and different forms of design matrices.
Using matrix notation we can express (1) more precisely by stating the GLM equation as
𝑋𝛽 + 휀 = 𝑦 (2)
where 𝑋 ∈ ℝ𝑛×𝑝, 𝛽 ∈ ℝ𝑝, and 𝑦 ∈ ℝ𝑛. It is important to note that the design matrix 𝑋 ∈ ℝ𝑛×𝑝 always has as
many rows as there are data points (𝑛) and as many columns as there are parameters (𝑝). 𝛽 ∈ ℝ𝑝 is usually
referred to as the “parameter vector” and 𝑦 ∈ ℝ𝑛 the “data vector”. In general, we can thus unpack
equation (1) in the following form
(
𝑥11 𝑥12 ⋯ 𝑥1𝑝
𝑥21 𝑥22 ⋯ 𝑥2𝑝
𝑥31 𝑥32 ⋯ 𝑥3𝑝
⋮ ⋮ ⋱ ⋮
𝑥𝑛1 𝑥𝑛2 ⋯ 𝑥𝑛𝑝
)(
𝛽1
𝛽2
⋮
𝛽𝑝
) + 휀 = (
𝑦1
𝑦2
𝑦3
⋮
𝑦𝑛
) (3)
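As a numerical illustration, the matrix product in (3) can be checked against the row-wise sums it abbreviates. The following is a minimal sketch in Python/NumPy (rather than the Matlab environment referenced elsewhere in this text); the matrix entries are illustrative assumptions:

```python
import numpy as np

# Hypothetical small design matrix (n = 3 data points, p = 2 parameters)
X = np.array([[1.0, 2.0],
              [1.0, 4.0],
              [1.0, 6.0]])
beta = np.array([0.5, 1.5])

# Matrix product X beta as in equation (3)
Xb = X @ beta

# The i-th entry equals the row-wise weighted sum x_i1*beta_1 + ... + x_ip*beta_p
rowwise = np.array([sum(X[i, j] * beta[j] for j in range(X.shape[1]))
                    for i in range(X.shape[0])])

print(Xb)                        # [3.5 6.5 9.5]
print(np.allclose(Xb, rowwise))  # True
```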
We now consider 휀 in (3). Because 𝑋 ∈ ℝ𝑛×𝑝 and 𝑦 ∈ ℝ𝑛, 휀 must also be an 𝑛-dimensional real vector, i.e.
휀 ∈ ℝ𝑛, and we thus have
(
𝑥11 𝑥12 ⋯ 𝑥1𝑝
𝑥21 𝑥22 ⋯ 𝑥2𝑝
𝑥31 𝑥32 ⋯ 𝑥3𝑝
⋮ ⋮ ⋱ ⋮
𝑥𝑛1 𝑥𝑛2 ⋯ 𝑥𝑛𝑝
)(
𝛽1
𝛽2
⋮
𝛽𝑝
) + (
휀1
휀2
휀3
⋮
휀𝑛
) = (
𝑦1
𝑦2
𝑦3
⋮
𝑦𝑛
) (4)
To introduce the meaning of 휀 in the context of equation (1), we consider the 𝑖th row of (3), which reads
𝑥𝑖1𝛽1 + 𝑥𝑖2𝛽2 +⋯+ 𝑥𝑖𝑝𝛽𝑝 + 휀𝑖 = 𝑦𝑖 (5)
The left-hand side of (5) corresponds to the modelling assumption about the 𝑖th “data point” 𝑦𝑖 and
comprises two categorically different objects. The first part
𝑥𝑖1𝛽1 + 𝑥𝑖2𝛽2 +⋯+ 𝑥𝑖𝑝𝛽𝑝 (6)
is “deterministic”, by which we understand that if we know the values of the 𝑥𝑖𝑗 (𝑖 = 1,… , 𝑛, 𝑗 = 1,… , 𝑝)
and the values of the 𝛽𝑗 (𝑗 = 1,… , 𝑝), we can uniquely compute the value of (6). The second part, 휀𝑖, is
different: 휀𝑖 is conceived of as a random variable. This means that 휀𝑖 assumes values according to a probability
distribution. We might know some parameters of this probability distribution, such as its mean and
variance, but the exact value that 휀𝑖 takes on does not uniquely follow from this knowledge. Equation
(5) thus implies that the value of 𝑦𝑖 is given by the sum of a “deterministic” term and a “probabilistic” or
“random” term. Informally, consider obtaining samples from the distribution of 휀𝑖 and adding them to a
constant value
𝜇𝑖 ≔ 𝑥𝑖1𝛽1 + 𝑥𝑖2𝛽2 +⋯+ 𝑥𝑖𝑝𝛽𝑝 (7)
in the form
𝑦𝑖 = 𝜇𝑖 + 휀𝑖 (8)
In (8), 𝜇𝑖 is constant and 휀𝑖 is a random variable, for which we now imagine drawing samples from a
univariate Gaussian distribution with mean (expectation) parameter 0 and variance parameter 𝜎2 = 1. Most
of the time, these values will be close to zero, but on occasion somewhat positive or somewhat negative.
Consider drawing the samples 휀1 = 0.2, 휀2 = −0.001, 휀3 = 0.05 and consider 𝜇1 = 𝜇2 = 𝜇3 = 1. If we
evaluate (8) for these values, we obtain
𝑦1 = 𝜇1 + 휀1 = 1 + 0.2 = 1.2 (9)
𝑦2 = 𝜇2 + 휀2 = 1 − 0.001 = 0.999
𝑦3 = 𝜇3 + 휀3 = 1 + 0.05 = 1.05
The most important thing to realize about (9) is that despite the fact that each 𝑦𝑖 (𝑖 = 1,2,3) has the same
deterministic aspect 𝜇𝑖 = 1 (𝑖 = 1,2,3), the values that the 𝑦𝑖 take on vary, because something random is
added to the deterministic aspect. Most importantly, this renders the 𝑦𝑖 themselves “random”, or, more
formally, random variables. We can also infer how they are distributed: because the 휀𝑖 have a mean of zero,
the expectation of the 𝑦𝑖 corresponds to the deterministic aspects 𝜇𝑖. The variance of the 𝑦𝑖, on the other
hand, corresponds to the variance of the 휀𝑖.
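The little computation in (9) can be reproduced directly; a minimal Python/NumPy sketch with the same assumed values:

```python
import numpy as np

# The worked example: mu_i = 1 for i = 1, 2, 3 and fixed "noise samples"
mu = np.array([1.0, 1.0, 1.0])
eps = np.array([0.2, -0.001, 0.05])   # imagined draws from N(0, 1)

y = mu + eps                          # equation (8), applied componentwise
print(y)                              # [1.2   0.999 1.05 ]

# Fresh draws from N(0, sigma^2 = 1) give different y values each time,
# although the deterministic aspect mu stays fixed
rng = np.random.default_rng(0)
y_new = mu + rng.normal(loc=0.0, scale=1.0, size=3)
```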
There are two ways to express the above more formally. We can either state that
𝑦𝑖 = 𝜇𝑖 + 휀𝑖 (10)
where the distribution of 휀𝑖 is given by a univariate Gaussian distribution with expectation 0 and variance 𝜎2
𝑝(휀𝑖) = 𝑁(휀𝑖; 0, 𝜎2) (11)
Likewise (and more intuitively) we can simply state that the distribution of 𝑦𝑖 is given by a univariate
Gaussian distribution with expectation 𝜇𝑖 and variance 𝜎2
𝑝(𝑦𝑖) = 𝑁(𝑦𝑖; 𝜇𝑖 , 𝜎2) (12)
Recall that
𝜇𝑖 = 𝑥𝑖1𝛽1 + 𝑥𝑖2𝛽2 +⋯+ 𝑥𝑖𝑝𝛽𝑝 (13)
We may denote (13) using matrix multiplication as
𝜇𝑖 = 𝑥𝑖𝛽 (14)
where we defined 𝑥𝑖 ∈ ℝ1×𝑝 as the row vector
𝑥𝑖 ≔ (𝑥𝑖1 𝑥𝑖2 … 𝑥𝑖𝑝) ∈ ℝ1×𝑝 (15)
which, importantly, corresponds to the 𝑖th row of the design matrix 𝑋 ∈ ℝ𝑛×𝑝. 𝛽 ∈ ℝ𝑝 corresponds to the
parameter vector
𝛽 ≔ (𝛽1, 𝛽2, … , 𝛽𝑝)𝑇 (16)
We thus can rewrite (12) as
𝑝(𝑦𝑖) = 𝑁(𝑦𝑖; 𝜇𝑖 , 𝜎2) = 𝑁(𝑦𝑖; 𝑥𝑖𝛽, 𝜎2) (17)
Let us summarize what we have achieved so far: starting from the GLM equation
𝑋𝛽 + 휀 = 𝑦 (18)
we have seen that 𝑋𝛽 corresponds to a matrix product, each row 𝑖 (where 𝑖 = 1,… , 𝑛) of which specifies a
deterministic contribution 𝜇𝑖 ∈ ℝ to the corresponding data value 𝑦𝑖. Each entry in the “noise vector” 휀, on
the other hand, contributes a random component 휀𝑖 to the value of 𝑦𝑖. It is important to note that these
ideas are assumptions, or in other words, they amount to the formulation of a probabilistic mathematical
model for some data. They need not correspond to the true data-generating process, which remains unknown. We have also
seen that we can re-express these ideas as a univariate Gaussian distribution 𝑁(𝑦𝑖; 𝑥𝑖𝛽, 𝜎2) that describes
the probability distribution of a single dependent variable 𝑦𝑖. In analogy to equation (17), one may
formulate a probability distribution for the “data vector” 𝑦 ∈ ℝ𝑛 (or in other words, for all dependent
variables 𝑦1, … , 𝑦𝑛) in the form of a multivariate Gaussian distribution
𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (19)
However, to understand the implications of (19) one requires some familiarity with multivariate Gaussian
distributions, which the reader is encouraged to review next.
(5) Equivalent formulations of the GLM
In this section, we consider the equivalence
𝑋𝛽 + 휀 = 𝑦, 𝑝(휀) = 𝑁(휀; 0, 𝜎2𝐼𝑛) ⇔ 𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (1)
from the perspective of the linear transformation theorem for Gaussian distributions.
Recall that the linear transformation theorem for Gaussian distributions states that if
𝑝(𝑥) = 𝑁(𝑥; 𝜇𝑥 , Σ𝑥) (𝑥, 𝜇𝑥 ∈ ℝ𝑑 , Σ𝑥 ∈ ℝ𝑑×𝑑 p.d.) and 𝑝(휀) = 𝑁(휀; 𝜇휀 , Σ휀) (휀, 𝜇휀 ∈ ℝ𝑑 , Σ휀 ∈ ℝ𝑑×𝑑 p.d.) (2)
and 𝐴 ∈ ℝ𝑑×𝑑 is a matrix, then the random variable 𝑦 ≔ 𝐴𝑥 + 휀 is distributed according to
𝑝(𝑦) = 𝑁(𝑦; 𝜇𝑦 , Σ𝑦), where 𝑦 ∈ ℝ𝑑 , 𝜇𝑦 = 𝐴𝜇𝑥 + 𝜇휀 ∈ ℝ𝑑 and Σ𝑦 = 𝐴Σ𝑥𝐴𝑇 + Σ휀 ∈ ℝ𝑑×𝑑 (3)
Applying this theorem in the current context with 𝑑 ≔ 𝑛, we first note that with 𝐴 ≔ 𝐼𝑛 and 𝑥 ≔ 𝑋𝛽 we have 𝜇𝑥 = 𝑋𝛽
and Σ𝑥 = 0, or in other words, 𝑥 is not a random variable. Further, setting 𝜇휀 = 0 ∈ ℝ𝑛 and Σ휀 = 𝜎2𝐼𝑛, we
have
𝑝(𝑦) = 𝑁(𝑦; 𝜇𝑦, Σ𝑦), where 𝑦 ∈ ℝ𝑛, 𝜇𝑦 = 𝐼𝑛𝑋𝛽 + 0 = 𝑋𝛽 ∈ ℝ𝑛 and Σ𝑦 = 𝐼𝑛 ⋅ 0 ⋅ 𝐼𝑛𝑇 + 𝜎2𝐼𝑛 = 𝜎2𝐼𝑛 ∈ ℝ𝑛×𝑛 (4)
and thus
𝑋𝛽 + 휀 = 𝑦, 𝑝(휀) = 𝑁(휀; 0, 𝜎2𝐼𝑛) ⇒ 𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (5)
as postulated previously. In words: the Gaussian distribution of the error term 휀 with expectation parameter
0 renders 𝑦 = 𝑋𝛽 + 휀 a random variable which is distributed with expectation parameter 𝑋𝛽 and
covariance matrix parameter corresponding to the error covariance matrix.
(6) Sampling a simple linear regression model
Based on the ability to sample 𝑛-variate Gaussian distributions by means of random number
generators as implemented for example in Matlab, we may now draw a sample from a simple linear
regression model with 𝑛 data points. To this end, we define the expectation parameter 𝜇 ∈ ℝ𝑛 of a 𝑛-variate
Gaussian using the matrix product of the simple linear regression design matrix 𝑋 ∈ ℝ𝑛×2 (recall that this
comprises a column of ones and a column of the independent variable values 𝑥1, … , 𝑥𝑛) and the parameter
vector 𝛽 ∈ ℝ2. To embed the notion of independent samples, we define the 𝑛-variate Gaussian covariance
matrix 𝛴 ∈ ℝ𝑛×𝑛 as the 𝑛 × 𝑛 spherical covariance matrix 𝜎2𝐼𝑛, where 𝜎2 > 0 is the variance parameter of
the ensuing GLM, as discussed in the Section on the multivariate Gaussian. Figure 1 visualizes the result.
Figure 1 A sample of a simple linear regression model, obtained by sampling an 𝑛-variate Gaussian.
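As a sketch of this sampling procedure (in Python/NumPy rather than the Matlab environment mentioned above; the design, parameter, and variance values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

n = 10
x = np.linspace(0.0, 1.0, n)           # assumed independent variable values
X = np.column_stack([np.ones(n), x])   # design matrix: column of ones, column of x
beta = np.array([0.5, 2.0])            # assumed offset and slope
sigma2 = 0.25                          # assumed variance parameter

mu = X @ beta                          # expectation parameter of the n-variate Gaussian
Sigma = sigma2 * np.eye(n)             # spherical covariance matrix sigma^2 I_n

# One sample from N(y; X beta, sigma^2 I_n)
y = rng.multivariate_normal(mean=mu, cov=Sigma)
print(y.shape)   # (10,)
```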
Study Questions
1. Consider the GLM equation 𝑋𝛽 + 휀 = 𝑦. Which of the symbols 𝑋, 𝛽, 휀, 𝑦 represents independent experimental variables, which
of the symbols represents dependent experimental variables?
2. Consider the GLM equation 𝑋𝛽 + 휀 = 𝑦. In an experimental context, which of the components are known to the experimenter
before performing the experiment, and which of the components are known to the experimenter after performing the experiment,
before estimating the model?
3. Write the following matrix statement as a set of equations
𝑋𝛽 = 𝑦, where 𝑋 ∈ ℝ4×3, 𝛽 ∈ ℝ3 and 𝑦 ∈ ℝ4
4. Write the following set of equations in matrix product notation
2 ⋅ 3 + 3 ⋅ 4 − 1 ⋅ 2 = 16
1 ⋅ 3 + 3 ⋅ 4 + 2 ⋅ 2 = 19
0 ⋅ 3 + 0 ⋅ 4 + 7 ⋅ 2 = 14
5. The design matrix 𝑋 of a GLM is of dimensionality 𝑛 × 𝑝. What do 𝑛 ∈ ℕ and 𝑝 ∈ ℕ represent, respectively?
6. Name and explain the components and their properties in the GLM equation 𝑋𝛽 + 휀 = 𝑦.
7. In words, explain the equivalence of the statements
𝑋𝛽 + 휀 = 𝑦, 𝑝(휀) = 𝑁(휀; 0, 𝜎2𝐼𝑛) and 𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛)
Study Questions Answers
1. 𝑋, the design matrix, represents values of independent experimental variables, 𝑦, the data vector represents values of the
dependent variable
2. Before the experiment, the design matrix 𝑋 is known to the experimenter, after performing the experiment, the design matrix 𝑋
and the data vector 𝑦 are known to the experimenter.
3. The matrix notation corresponds to the following set of (“one-dimensional”) equations
𝑥11𝛽1 + 𝑥12𝛽2 + 𝑥13𝛽3 = 𝑦1
𝑥21𝛽1 + 𝑥22𝛽2 + 𝑥23𝛽3 = 𝑦2
𝑥31𝛽1 + 𝑥32𝛽2 + 𝑥33𝛽3 = 𝑦3
𝑥41𝛽1 + 𝑥42𝛽2 + 𝑥43𝛽3 = 𝑦4
4. In matrix notation, the equations may be expressed as
(
2 3 −1
1 3 2
0 0 7
)(
3
4
2
) = (
16
19
14
)
5. 𝑛 represents the number of data points, 𝑝 represents the number of parameters.
6. In the GLM equation 𝑋𝛽 + 휀 = 𝑦, 𝑋 ∈ ℝ𝑛×𝑝 is referred to as the “design matrix” and encodes the values of 𝑝 independent variables for each of 𝑛 observations, which are concatenated in the “data vector” 𝑦 ∈ ℝ𝑛. 𝛽 ∈ ℝ𝑝 is the “parameter vector”, which encodes how much each independent variable (corresponding to the columns of 𝑋) contributes to the observed values. 휀 ∈ ℝ𝑛 is an 𝑛-dimensional random vector, which is distributed according to an 𝑛-dimensional Gaussian distribution with expectation parameter 0 ∈ ℝ𝑛 and, under standard assumptions, covariance matrix parameter 𝛴 = 𝜎2𝐼𝑛, where 𝜎2 encodes the variance of each component of 𝑦.
7. The first statement formulates an observed data vector 𝑦 as the sum of a deterministic component 𝑋𝛽 and a random error vector 휀 ∈ ℝ𝑛, which conforms to a Gaussian distribution whose expectation is zero and whose components have variance 𝜎2. Because 𝑦 results from the addition of a “deterministic” and a “random” entity, one may equally consider 𝑦 a random vector distributed according to a multivariate Gaussian with expectation parameter 𝑋𝛽 ∈ ℝ𝑛 and covariance matrix parameter 𝜎2𝐼𝑛, whose components have variance 𝜎2.
Maximum Likelihood Estimation
In the previous Sections, we have introduced the GLM in the following form: Let 𝑋 ∈ ℝ𝑛×𝑝, 𝛽 ∈
ℝ𝑝, 𝑦 ∈ ℝ𝑛 and 휀 ~ 𝑁(휀; 0, 𝜎2𝐼𝑛), where 0 ∈ ℝ𝑛, 𝜎2 > 0 and 𝐼𝑛 ∈ ℝ𝑛×𝑛 is the 𝑛 × 𝑛 identity matrix. Then
the GLM is defined by the equation
𝑋𝛽 + 휀 = 𝑦 (1)
We have seen that the probabilistic nature of the error term 휀 renders this model a multivariate Gaussian
distribution for data 𝑦 ∈ ℝ𝑛, which we may write as
𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (2)
Further, we saw that the spherical covariance matrix 𝜎2𝐼𝑛 embeds the assumptions of independent and
homoscedastic (equal variance) noise contributions. In the current Section we now turn to the problem of obtaining
parameter point estimates of 𝛽 ∈ ℝ𝑝 and 𝜎2 > 0 based on the observation of a data set 𝑦 = 𝑦∗. We use the
asterisk to denote that 𝑦∗ ≔ (𝑦1∗, … , 𝑦𝑛∗)𝑇 is a sample of the 𝑛-dimensional random vector 𝑦. However,
because the theory is developed for all possible values 𝑦∗ that 𝑦 may take on, we will rarely emphasize that
it also applies to a specific data set 𝑦∗ and will use the more general notation “𝑦” most of the time. The
point estimates for 𝛽 and 𝜎2 will be denoted by �̂� and �̂�2 (beta hat and sigma square hat).
Classical parameter point estimation unfolds upon the following intuition: we assume that “in
reality” there are fixed parameter values 𝛽 = �̅� and 𝜎2 = �̅�2 that govern (in interaction with the design
matrix) the probability distribution of the data 𝑦. We cannot observe �̅� and �̅�2 directly, but only infer their
values based on the observed data 𝑦. �̅� and �̅�2 denote the “true, but unknown,” parameter values of 𝛽 and
𝜎2 that we assume to give rise to observed data. Note that the true, but unknown, parameter values �̅� and
�̅�2 remain unknown after the evaluation of the point estimates �̂� and �̂�2. �̂� and �̂�2 merely represent our
“best guess” of what the value of the true, but unknown, values �̅� and �̅�2 are based on the observation of
𝑦. The notion of “true, but unknown” parameter values roughly corresponds to the notion of “population
parameters” as familiar from undergraduate statistics. Note however, that the idea of a “population”
conveys some notion of finiteness of the underlying process, which is not required in our case. Because, as
for the data, we develop the theory for all possible values that 𝛽 and 𝜎2 may assume as “true, but unknown”
values, we will mostly implicitly refer to the “the true, but unknown, values of 𝛽 and 𝜎2” and only rarely
explicitly denote them by �̅� and �̅�2.
To develop the classical theory of parameter point estimation we proceed as follows. We first
introduce a very general procedure to obtain parameter estimates in the context of probabilistic models, the
concept of maximum likelihood (ML) estimation, in Section (1). Based on this general development, we will then discuss
a first application of ML estimation to a familiar model, the univariate Gaussian in Section (2). In Section (3),
we study the equivalence between ML and least-squares estimation in the context of the GLM and in Section
(4) explicitly derive the ordinary least-squares beta parameter estimator for the simple linear regression
GLM. In Section (5), we then generalize this estimator for all kinds of specific GLMs. Finally, in Section (6) we
discuss the maximum likelihood estimation of the variance parameter estimator of the GLM.
(1) Maximum likelihood and least-squares beta parameter estimation
In the context of simple or multiple linear regression the reader may have encountered the concept
of “least-squares” estimation. The idea of least-squares estimation is to minimize the squared distance
between the observed data points and the fitted regression line/plane. In the context of the GLM, least-
squares estimation and maximum-likelihood estimation of the effect size parameters 𝛽 are equivalent. In
the current section, we will firstly spell out this equivalence for the “general” GLM case, and secondly derive
the least-squares estimator for the parameters 𝛽 ≔ (𝛽1, 𝛽2)𝑇 of a simple linear regression model. This
derivation will form the basis for the ordinary least-squares estimators of the GLM introduced in the next
Section. In the current section, we assume that the value 𝜎2 > 0 is known, and we only would like to derive
an estimator for the 𝛽 parameter.
Consider the GLM in the form
𝑋𝛽 + 휀 = 𝑦, 휀 ~ 𝑁(휀; 0, 𝜎2𝐼𝑛) ⇔ 𝑦 ~ 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (1)
where 𝑦, 휀 ∈ ℝ𝑛, 𝑋 ∈ ℝ𝑛×𝑝, 𝛽 ∈ ℝ𝑝, 𝜎2 > 0 and 𝐼𝑛 denotes the 𝑛 × 𝑛 identity matrix. As discussed in the
previous Section, the assumption of a spherical covariance matrix 𝜎2𝐼𝑛 renders the error terms 휀1, … , 휀𝑛
independently distributed. Note that in contrast to the univariate Gaussian example discussed above, the
data variables are not identically distributed: the expectation parameters given by the 𝑖th row of the matrix
product 𝑋𝛽 may well differ. Here, we will use the notation (𝑋𝛽)𝑖 to denote the 𝑖th row of 𝑋𝛽. From (1), the
distribution of the 𝑖th random variable 𝑦𝑖 is given by
𝑝(𝑦𝑖) = 𝑁(𝑦𝑖; (𝑋𝛽)𝑖, 𝜎2) (2)
We may thus write down the likelihood function of the parameter 𝛽 for the observation 𝑦 as
𝐿:ℝ𝑝 × ℝ𝑛 → ℝ+, (𝛽, 𝑦) ↦ 𝐿(𝛽, 𝑦) ≔ 𝑝(𝑦; 𝛽) (3)
Note that in the current scenario the parameter space 𝛩 is given by 𝛩 ≔ ℝ𝑝 and the dimensionality of the
parameter vector 휃 ≔ 𝛽 ∈ ℝ𝑝 is given by 𝑝 ∈ ℕ. Due to the independence assumption, the probability
density function of the observation 𝑦 ∈ ℝ𝑛 may be written as the product of its individual components
𝑝(𝑦1, … , 𝑦𝑛) = 𝑝(𝑦1) ⋅ 𝑝(𝑦2)⋯𝑝(𝑦𝑛) = 𝑁(𝑦1; (𝑋𝛽)1, 𝜎2) ⋅ 𝑁(𝑦2; (𝑋𝛽)2, 𝜎2) ⋅ ⋯ ⋅ 𝑁(𝑦𝑛; (𝑋𝛽)𝑛, 𝜎2) (4)
Substitution of the univariate Gaussian probability density functions
𝑝(𝑦𝑖) = 𝑁(𝑦𝑖; (𝑋𝛽)𝑖, 𝜎2) = (1/√(2𝜋𝜎2)) 𝑒𝑥𝑝 (−(1/(2𝜎2))(𝑦𝑖 − (𝑋𝛽)𝑖)2) (5)
then yields the likelihood function
𝐿(𝛽, 𝑦) = 𝑝(𝑦; 𝛽) = (2𝜋𝜎2)−𝑛/2 𝑒𝑥𝑝 (−(1/(2𝜎2)) ∑𝑖=1𝑛 (𝑦𝑖 − (𝑋𝛽)𝑖)2) (6)
The necessary algebraic manipulations to get from (5) to (6) are equivalent to those considered in the
previous section and are thus omitted here for brevity. Notably, the likelihood function 𝐿(𝛽, 𝑦) in (6) contains
the sum of squared deviations between the data points 𝑦𝑖 (𝑖 = 1,… , 𝑛) and the model predictions (𝑋𝛽)𝑖 in
the exponential term. Because this “sum-of-squares” ∑𝑖=1𝑛 (𝑦𝑖 − (𝑋𝛽)𝑖)2 is non-negative (i.e., zero or
positive), and it enters with a minus sign, the exponential term in (6) becomes maximal if the squared
deviations between model prediction and data become minimal. In other words, the likelihood function
𝐿(𝛽, 𝑦) is maximized if the sum of squared deviations between data and model prediction is minimized,
which is just the least-squares estimation principle. Note that we have assumed a fixed and known variance
parameter 𝜎2 in these considerations. To summarize: whether one uses the technique of maximum
likelihood estimation or the minimization of the squared deviations between data and model prediction, the
resulting estimators for the 𝛽 parameters of a GLM are the same.
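The equivalence can be illustrated numerically: because the log likelihood is an affine, decreasing function of the sum of squares, the candidate 𝛽 minimizing the sum of squares is exactly the candidate maximizing the likelihood. A minimal Python/NumPy sketch under assumed example values:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data from an assumed simple linear regression model
n = 50
x = np.linspace(0, 1, n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([1.0, 3.0])
sigma2 = 0.5  # fixed and known variance parameter, as assumed in the text
y = X @ beta_true + rng.normal(0.0, np.sqrt(sigma2), n)

def sse(beta):
    # sum of squared deviations between data and model prediction
    r = y - X @ beta
    return r @ r

def log_likelihood(beta):
    # log of equation (6); an affine, decreasing function of sse(beta)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - sse(beta) / (2 * sigma2)

# Candidate beta values on a grid around the true values
candidates = [np.array([b1, b2])
              for b1 in np.linspace(0, 2, 21)
              for b2 in np.linspace(2, 4, 21)]
i_min_sse = int(np.argmin([sse(b) for b in candidates]))
i_max_ll = int(np.argmax([log_likelihood(b) for b in candidates]))
print(i_min_sse == i_max_ll)  # True
```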
(2) Least-squares beta parameter estimation for simple linear regression
In this section, we will explicitly derive the “least-squares” (or, as shown in the previous section,
maximum likelihood) 𝛽 parameter estimator for the case of the simple linear regression GLM. This serves as
a preparation for the introduction of the general so-called “ordinary least-squares estimator” �̂� in the
subsequent section. The central result of the current section is that the least-squares estimator for the 𝛽
parameter in a simple linear regression model can be written as
�̂� = (𝑋𝑇𝑋)−1 𝑋𝑇𝑦 (1)
To see this, first consider the simple linear regression GLM
𝑋𝛽 + 휀 = 𝑦, 휀 ~ 𝑁(휀; 0, 𝜎2𝐼𝑛) (2)
where 𝑦, 휀 ∈ ℝ𝑛, 𝑋 ∈ ℝ𝑛×2, 𝛽 ∈ ℝ2, 𝜎2 > 0 and 𝐼𝑛 denotes the 𝑛 × 𝑛 identity matrix. Recall that the design
matrix, parameter vector, and data vector in the special GLM case of simple linear regression take the forms
𝑋 ≔ (
1 𝑥1
1 𝑥2
⋮ ⋮
1 𝑥𝑛
) , 𝛽 ≔ (𝛽1, 𝛽2)𝑇 and 𝑦 ≔ (𝑦1, 𝑦2, … , 𝑦𝑛)𝑇 (3)
where the 𝑥𝑖 (𝑖 = 1,… , 𝑛) carry the notion of “independent variables”, 𝛽1 corresponds to the simple linear
regression “offset” and 𝛽2 to the simple linear regression “slope”. The 𝑖th row of the matrix product 𝑋𝛽 is
given in this case as
(𝑋𝛽)𝑖 = 𝛽1 + 𝑥𝑖𝛽2 (4)
To minimize the sum of squared deviations between model prediction and data
∑𝑖=1𝑛 (𝑦𝑖 − (𝑋𝛽)𝑖)2 (5)
we view this sum as a function of 𝛽1 and 𝛽2 and write
𝑓: ℝ2 → ℝ, (𝛽1, 𝛽2) ↦ 𝑓(𝛽1, 𝛽2) = ∑𝑖=1𝑛 (𝑦𝑖 − 𝛽1 − 𝑥𝑖𝛽2)2 (6)
As for a maximum, the necessary condition for a minimum is the vanishing of the partial derivatives of this
function, i.e. at a location of a minimum of 𝑓 (corresponding to a maximum of the log likelihood function),
the gradient of 𝑓
𝛻𝑓(𝛽1, 𝛽2) = (
𝜕𝑓(𝛽1, 𝛽2)/𝜕𝛽1
𝜕𝑓(𝛽1, 𝛽2)/𝜕𝛽2
) (7)
equals the zero vector (0,0)𝑇. Evaluation of the respective partial derivatives yields
𝛻𝑓(𝛽1, 𝛽2) = (
2∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2 − 𝑦𝑖)
2∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2 − 𝑦𝑖)𝑥𝑖
) (8)
Derivation of (8)
We substitute the functional form of 𝑓(𝛽1, 𝛽2) and evaluate the partial derivatives with respect to 𝛽1 and 𝛽2. Using the summation and chain rules of differential calculus, we obtain
𝜕𝑓(𝛽1, 𝛽2)/𝜕𝛽1 = 𝜕/𝜕𝛽1 (∑𝑖=1𝑛 (𝑦𝑖 − 𝛽1 − 𝑥𝑖𝛽2)2) = ∑𝑖=1𝑛 𝜕/𝜕𝛽1 (𝑦𝑖 − 𝛽1 − 𝑥𝑖𝛽2)2 = ∑𝑖=1𝑛 2(𝑦𝑖 − 𝛽1 − 𝑥𝑖𝛽2)(−1) = 2∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2 − 𝑦𝑖) (8.1)
and
𝜕𝑓(𝛽1, 𝛽2)/𝜕𝛽2 = 𝜕/𝜕𝛽2 (∑𝑖=1𝑛 (𝑦𝑖 − 𝛽1 − 𝑥𝑖𝛽2)2) = ∑𝑖=1𝑛 𝜕/𝜕𝛽2 (𝑦𝑖 − 𝛽1 − 𝑥𝑖𝛽2)2 = ∑𝑖=1𝑛 2(𝑦𝑖 − 𝛽1 − 𝑥𝑖𝛽2)(−𝑥𝑖) = 2∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2 − 𝑦𝑖)𝑥𝑖 (8.2)
respectively. □
We have thus derived the following necessary condition for a minimum of the sum of squared deviations
between model prediction and observed data:
∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2 − 𝑦𝑖) = 0 (9)
∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2 − 𝑦𝑖)𝑥𝑖 = 0 (10)
To derive the standard form of the least-squares estimator, we reformulate equations (9) and (10) as
𝑛𝛽1 + (∑𝑖=1𝑛 𝑥𝑖)𝛽2 = ∑𝑖=1𝑛 𝑦𝑖 (11)
(∑𝑖=1𝑛 𝑥𝑖)𝛽1 + (∑𝑖=1𝑛 𝑥𝑖2)𝛽2 = ∑𝑖=1𝑛 𝑦𝑖𝑥𝑖 (12)
Derivation of (11) and (12)
We rewrite equation (9) as follows
∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2 − 𝑦𝑖) = 0 ⇔ ∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2) − ∑𝑖=1𝑛 𝑦𝑖 = 0 ⇔ ∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2) = ∑𝑖=1𝑛 𝑦𝑖 ⇔ 𝑛𝛽1 + (∑𝑖=1𝑛 𝑥𝑖)𝛽2 = ∑𝑖=1𝑛 𝑦𝑖 (11.1)
and we rewrite equation (10) as
∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2 − 𝑦𝑖)𝑥𝑖 = 0 ⇔ ∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2)𝑥𝑖 − ∑𝑖=1𝑛 𝑦𝑖𝑥𝑖 = 0 ⇔ ∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2)𝑥𝑖 = ∑𝑖=1𝑛 𝑦𝑖𝑥𝑖 ⇔ ∑𝑖=1𝑛 (𝛽1𝑥𝑖 + 𝑥𝑖2𝛽2) = ∑𝑖=1𝑛 𝑦𝑖𝑥𝑖 ⇔ (∑𝑖=1𝑛 𝑥𝑖)𝛽1 + (∑𝑖=1𝑛 𝑥𝑖2)𝛽2 = ∑𝑖=1𝑛 𝑦𝑖𝑥𝑖 (12.1)
□
We may now express the linear equations (11) and (12) in matrix notation as
(
𝑛 ∑𝑖=1𝑛 𝑥𝑖
∑𝑖=1𝑛 𝑥𝑖 ∑𝑖=1𝑛 𝑥𝑖2
)(
𝛽1
𝛽2
) = (
∑𝑖=1𝑛 𝑦𝑖
∑𝑖=1𝑛 𝑦𝑖𝑥𝑖
) (13)
Finally, using the definitions of design matrix and data vector for simple linear regression as in equation (3)
we see that
(
𝑛 ∑𝑖=1𝑛 𝑥𝑖
∑𝑖=1𝑛 𝑥𝑖 ∑𝑖=1𝑛 𝑥𝑖2
) = (
1 1 ⋯ 1
𝑥1 𝑥2 ⋯ 𝑥𝑛
)(
1 𝑥1
1 𝑥2
⋮ ⋮
1 𝑥𝑛
) = 𝑋𝑇𝑋 (14)
and that
(
∑𝑖=1𝑛 𝑦𝑖
∑𝑖=1𝑛 𝑦𝑖𝑥𝑖
) = (
1 1 ⋯ 1
𝑥1 𝑥2 ⋯ 𝑥𝑛
)(
𝑦1
𝑦2
⋮
𝑦𝑛
) = 𝑋𝑇𝑦 (15)
In other words, the necessary condition for a minimum of the sum of squared deviations as given by the
system of linear equations (13) can be written very compactly in matrix notation as
𝑋𝑇𝑋𝛽 = 𝑋𝑇𝑦 (16)
The system of equations represented by (16) is called the “system of normal equations” and represents a
necessary condition for the function 𝑓 to assume a minimum in 𝛽 (or, equivalently, for the log likelihood
function of the simple linear regression model to assume a maximum with respect to 𝛽). Note that 𝑋𝑇𝑋 ∈ ℝ𝑝×𝑝 is a
square matrix and thus its inverse can be computed, if it is in fact invertible, which depends on the entries in
the columns of 𝑋 ∈ ℝ𝑛×𝑝.
For the moment, we assume that the matrix 𝑋𝑇𝑋 is invertible, and may thus multiply both sides of
(16) by (𝑋𝑇𝑋)−1 and solve for the “least-squares” (or maximum likelihood) parameter estimator �̂� as
follows:
𝑋𝑇𝑋�̂� = 𝑋𝑇𝑦 ⇔ (𝑋𝑇𝑋)−1(𝑋𝑇𝑋)�̂� = (𝑋𝑇𝑋)−1𝑋𝑇𝑦 ⇔ �̂� = (𝑋𝑇𝑋)−1𝑋𝑇𝑦 (17)
In summary, we have shown that for the case of simple linear regression, the maximum likelihood or least-
squares estimator for the beta parameter vector is given by
�̂� = (𝑋𝑇𝑋)−1𝑋𝑇𝑦 (18)
Note again that in the current section we assumed that 𝜎2 > 0 is known and that the design matrix
𝑋 ∈ ℝ𝑛×2 and the parameter vector 𝛽 ∈ ℝ2 have a specific form – they correspond to the simple linear
regression case. However, in the next section, we will see that the form of the beta parameter estimator (18)
can be transferred to the case of general GLMs, i.e., arbitrary design matrices 𝑋 ∈ ℝ𝑛×𝑝 and parameter
vectors 𝛽 ∈ ℝ𝑝.
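As a numerical check of (18), the estimator can be evaluated in Python/NumPy with assumed, noise-free example values, so that it recovers 𝛽 exactly:

```python
import numpy as np

# Noise-free check: with y = X beta and no error term, the least-squares
# estimator recovers beta exactly (values are illustrative assumptions)
n = 6
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X = np.column_stack([np.ones(n), x])   # simple linear regression design matrix
beta = np.array([2.0, 0.5])            # offset beta_1 and slope beta_2
y = X @ beta

# beta_hat = (X^T X)^{-1} X^T y, equation (18)
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)   # approximately [2.0, 0.5]

# beta_hat satisfies the system of normal equations X^T X beta = X^T y, equation (16)
print(np.allclose(X.T @ X @ beta_hat, X.T @ y))  # True
```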
(3) General beta parameter estimation
To derive the so-called “ordinary least squares (OLS)” estimator for 𝛽 parameters of a “general”
GLM we generalize the result we obtained for the simple linear regression case, i.e.
�̂� ≔ (𝑋𝑇𝑋)−1𝑋𝑇𝑦 ∈ ℝ2 (1)
where 𝑋 ∈ ℝ𝑛×2 consists of a column of ones and a column of the values of the independent variable
𝑥1, … , 𝑥𝑛, to the general case of (almost) arbitrary design matrices 𝑋 ∈ ℝ𝑛×𝑝 and 𝛽 ∈ ℝ𝑝. Note that from the
considerations above, we may equivalently refer to the OLS estimator for the 𝛽 parameters of the GLM as the
“least-squares” or “maximum-likelihood” estimator; however, the term OLS estimator seems to be most
commonly used in the literature.
To generalize (1), we make the assumption that the design matrix 𝑋 ∈ ℝ𝑛×𝑝 is of rank
𝑝 ∈ ℕ, sometimes referred to as 𝑋 being of “full column rank”. Importantly, this assumption guarantees that
the matrix product 𝑋𝑇𝑋 ∈ ℝ𝑝×𝑝 is of rank 𝑝 as well, and thus invertible. In addition, we require the
following two results from linear algebra
(a) (𝐴𝐵)𝑇 = 𝐵𝑇𝐴𝑇 for all matrices 𝐴 ∈ ℝ𝑚×𝑛 and 𝐵 ∈ ℝ𝑛×𝑝. (3)
(b) If a matrix 𝐵 ∈ ℝ𝑚×𝑚 is of the form 𝐵 = 𝐴𝐴𝑇 for 𝐴 ∈ ℝ𝑚×𝑛 of full row rank, the matrix 𝐵 is positive-definite. (4)
Based on these assumptions, we can now show that the OLS estimator for a general GLM is of the same form as
for the simple linear regression scenario. Because this is a fundamental result, we present it in the form of a
theorem, i.e. a factual statement followed by its mathematical proof.
Theorem. The Ordinary Least Squares �̂� estimator
Let 𝑋𝛽 + 휀 = 𝑦, where 𝑋 ∈ ℝ𝑛×𝑝 is of full-column rank, 𝛽 ∈ ℝ𝑝, 𝑦 ∈ ℝ𝑛 and 𝑝(휀) = 𝑁(휀; 0, 𝜎2𝐼𝑛)
for 𝑛, 𝑝 ∈ ℕ and 𝜎2 > 0 denote the GLM. Based on an observation of the random vector 𝑦 ∈ ℝ𝑛 the
ordinary least-squares estimator (= maximum likelihood estimator) is given by
�̂� ≔ (𝑋𝑇𝑋)−1𝑋𝑇𝑦 ∈ ℝ𝑝 (5)
□
Proof of (5)
The theorem implies that if �̂� is defined by
�̂� ≔ (𝑋𝑇𝑋)−1𝑋𝑇𝑦 (5.1)
the so-called “sum-of-error-squares” (SES)
휀𝑇휀 ≔ (𝑦 − 𝑋𝛽)𝑇(𝑦 − 𝑋𝛽) ∈ ℝ+ (5.2)
becomes minimal. To show that this is indeed the case, we will first show that
𝑋𝑇(𝑦 − 𝑋�̂�) = 0 ∈ ℝ𝑝 and (𝑦 − 𝑋�̂�)𝑇𝑋 = 0 ∈ ℝ1×𝑝 (5.3)
This may be seen as follows: by definition, we have
�̂� = (𝑋𝑇𝑋)−1𝑋𝑇𝑦 ⇔ 𝑋𝑇𝑋�̂� = 𝑋𝑇𝑦 ⇔ 0 = 𝑋𝑇𝑦 − 𝑋𝑇𝑋�̂� ⇔ 0 = 𝑋𝑇(𝑦 − 𝑋�̂�) (5.4)
Note that above 0 ∈ ℝ𝑝. Forming the transpose of 𝑋𝑇(𝑦 − 𝑋�̂�) = 0, we see that
0𝑇 = (𝑋𝑇(𝑦 − 𝑋�̂�))𝑇 = (𝑦 − 𝑋�̂�)𝑇𝑋 (5.5)
and thus also (𝑦 − 𝑋�̂�)𝑇𝑋 corresponds to 0 ∈ ℝ1×𝑝.
We now use (5.3) to show that �̂� minimizes the SES. To this end, we first reformulate the SES as follows:
(𝑦 − 𝑋𝛽)𝑇(𝑦 − 𝑋𝛽) = (𝑦 − 𝑋�̂� + 𝑋�̂� − 𝑋𝛽)𝑇(𝑦 − 𝑋�̂� + 𝑋�̂� − 𝑋𝛽) = ((𝑦 − 𝑋�̂�) + 𝑋(�̂� − 𝛽))𝑇((𝑦 − 𝑋�̂�) + 𝑋(�̂� − 𝛽)) (5.6)
Resolving the brackets then yields
(𝑦 − 𝑋𝛽)𝑇(𝑦 − 𝑋𝛽) = (𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�) + (𝑦 − 𝑋�̂�)𝑇𝑋(�̂� − 𝛽) + (�̂� − 𝛽)𝑇𝑋𝑇(𝑦 − 𝑋�̂�) + (�̂� − 𝛽)𝑇𝑋𝑇𝑋(�̂� − 𝛽) (5.7)
With (5.3), the second and the third terms in expression (5.7) are zero, and we have for the SES
(𝑦 − 𝑋𝛽)𝑇(𝑦 − 𝑋𝛽) = (𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�) + (�̂� − 𝛽)𝑇𝑋𝑇𝑋(�̂� − 𝛽) (5.8)
Based on this expression, we can conclude that the OLS estimator �̂� minimizes the sum-of-error-squares: the second term is always larger than or equal to zero, because 𝑋𝑇𝑋 ∈ ℝ𝑝×𝑝 is positive-definite, and it can become zero (and thus minimize the right-hand side of (5.8)) only for the choice 𝛽 = �̂� ∈ ℝ𝑝. □
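The content of the theorem can also be checked numerically: the SES evaluated at �̂� never exceeds the SES evaluated at any perturbed parameter vector. A Python/NumPy sketch under assumed example values:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated GLM data (design matrix, parameters, and noise level are
# illustrative assumptions)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(0.0, 1.0, n)

# OLS estimator beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

def ses(beta):
    # sum-of-error-squares (y - X beta)^T (y - X beta)
    r = y - X @ beta
    return r @ r

# SES(beta_hat) <= SES(beta) for randomly perturbed parameter vectors
ok = all(ses(beta_hat) <= ses(beta_hat + rng.normal(size=p))
         for _ in range(100))
print(ok)  # True
```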
(4) Maximum likelihood variance parameter estimation
So far we have only considered estimation of the beta parameter vector 𝛽 ∈ ℝ𝑝 and have assumed
that 𝜎2 is known. Using the equivalence of least-squares and maximum likelihood estimation in the case
of the beta parameter, we have found the so-called “ordinary least-squares” �̂� ∈ ℝ𝑝 estimator for 𝛽 ∈ ℝ𝑝.
In this section, we consider the maximum likelihood estimation of the GLM parameter 𝜎2 > 0. As in the
maximum likelihood derivation of the variance estimator for a univariate Gaussian, we will proceed as
follows: We will first write down the likelihood and log likelihood function of the GLM as a function of both
𝛽 ∈ ℝ𝑝 and 𝜎2 > 0, and then substitute �̂� ∈ ℝ𝑝 to treat the log likelihood function as a function of 𝜎2 > 0
only. We then use the standard maximum likelihood approach: we evaluate the derivative of the log
likelihood function with respect to 𝜎2, set this derivative to zero, and solve for the maximum likelihood
estimator �̂�2 > 0 of 𝜎2 > 0.
Recapitulating the above, we consider the GLM
𝑋𝛽 + 휀 = 𝑦, 𝑝(휀) = 𝑁(휀; 0, 𝜎2𝐼𝑛) ⇔ 𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (1)
where 𝑦, 휀 ∈ ℝ𝑛, 𝑋 ∈ ℝ𝑛×𝑝, 𝛽 ∈ ℝ𝑝, 𝜎2 > 0 and 𝐼𝑛 denotes the 𝑛 × 𝑛 identity matrix, and its associated
“two parameter” likelihood function
𝐿: ℝ𝑝 × ℝ+\{0} × ℝ𝑛 → ℝ+, ((𝛽, 𝜎2), 𝑦) ↦ 𝐿((𝛽, 𝜎2), 𝑦) ≔ (2𝜋𝜎2)−𝑛/2 𝑒𝑥𝑝 (−(1/(2𝜎2))(𝑦 − 𝑋𝛽)𝑇(𝑦 − 𝑋𝛽)) (2)
where (𝑋𝛽)𝑖 ∈ ℝ denotes the 𝑖th entry of the 𝑛 × 1 vector 𝑋𝛽. Logarithmic transformation then yields the
log likelihood function
ℓ: ℝ𝑝 × ℝ+\{0} × ℝ𝑛 → ℝ, ((𝛽, 𝜎2), 𝑦) ↦ ℓ((𝛽, 𝜎2), 𝑦) ≔ −(𝑛/2) 𝑙𝑛 2𝜋 − (𝑛/2) 𝑙𝑛 𝜎2 − (1/(2𝜎2))(𝑦 − 𝑋𝛽)𝑇(𝑦 − 𝑋𝛽) (3)
Substitution of the ordinary least-squares estimator �̂� then renders this function a function of the parameter
𝜎2 only
ℓ�̂�: ℝ+\{0} × ℝ𝑛 → ℝ, (𝜎2, 𝑦) ↦ ℓ�̂�(𝜎2, 𝑦) ≔ ℓ((�̂�, 𝜎2), 𝑦) = −(𝑛/2) 𝑙𝑛 2𝜋 − (𝑛/2) 𝑙𝑛 𝜎2 − (1/(2𝜎2))(𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�) (4)
The derivative of ℓ�̂� with respect to 𝜎2 evaluates to
𝑑ℓ�̂�(𝜎2, 𝑦)/𝑑𝜎2 = −(1/2)(𝑛/𝜎2) + (1/2)(1/(𝜎2)2)(𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�) (5)
Derivation of equation (5)
We have
𝑑ℓ�̂�(𝜎2, 𝑦)/𝑑𝜎2 = 𝑑/𝑑𝜎2 (−(𝑛/2) 𝑙𝑛 2𝜋 − (𝑛/2) 𝑙𝑛 𝜎2 − (1/(2𝜎2))(𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�))
= −(𝑛/2) 𝑑/𝑑𝜎2 𝑙𝑛 𝜎2 − (1/2)(𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�) 𝑑/𝑑𝜎2 (𝜎2)−1
= −(𝑛/2)(1/𝜎2) − (1/2)(𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�)(−1)(𝜎2)−2
= −(1/2)(𝑛/𝜎2) + (1/2)(1/(𝜎2)2)(𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�) (5.1)
□
Setting the derivative of ℓ�̂� to zero and solving for the value of 𝜎2 then yields
�̂�2 = (𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�)/𝑛 (6)
Derivation of equation (6)
We have
𝑑ℓ�̂�(�̂�2, 𝑦)/𝑑𝜎2 = 0 ⇔ −(1/2)(𝑛/�̂�2) + (1/2)(1/(�̂�2)2)(𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�) = 0 ⇔ (1/2)(1/(�̂�2)2)(𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�) = (1/2)(𝑛/�̂�2)
⇔ (𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�) = (�̂�2)2𝑛/�̂�2 ⇔ 𝑛�̂�2 = (𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�) ⇔ �̂�2 = (𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�)/𝑛 (6.1)
□
The numerator of the expression for �̂�2, given by (𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�), is referred to as the
“residual-sum-of-squares” (the analogous quantity for general 𝛽, (𝑦 − 𝑋𝛽)𝑇(𝑦 − 𝑋𝛽), is referred to as the
“error-sum-of-squares”). The quantity (𝑦 − 𝑋�̂�) corresponds to the difference between the observed data
and the data prediction obtained using the OLS beta estimator �̂� ∈ ℝ𝑝. In other words, the
estimator for the error variance corresponds to a scaled version of the remaining mismatch between
observed data and OLS-based data prediction, which may seem somewhat surprising, but is inherent in the theory of
the GLM.
Unfortunately, however, as for the univariate Gaussian, the maximum likelihood estimator �̂�2
in the form of equation (6) is not “bias-free”. This can readily be rectified, however, by dividing the RSS not
by 𝑛, but by 𝑛 − 𝑝. This yields the “corrected” or “restricted maximum likelihood” estimator for the error
variance
�̂�2 ≔ (𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�)/(𝑛 − 𝑝) (7)
For details on the motivation and derivation of (7) the reader is referred to the Section “Restricted maximum
likelihood estimation of the GLM”.
In summary, we have established the following classical point estimators for the parameters 𝛽 ∈ ℝ𝑝
and 𝜎2 of a GLM of the form
𝑋𝛽 + 휀 = 𝑦, 𝑝(휀) = 𝑁(휀; 0, 𝜎2𝐼𝑛) ⇔ 𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (8)
1. The ordinary least-squares beta estimator
�̂� ≔ (𝑋𝑇𝑋)−1𝑋𝑇𝑦 ∈ ℝ𝑝 (9)
2. The variance parameter estimator
�̂�2 ≔ (𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�)/(𝑛 − 𝑝) > 0 (10)
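Both estimators can be evaluated on simulated data; a Python/NumPy sketch with assumed true parameter values:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated data with assumed true parameter values
n, p = 200, 2
x = np.linspace(0, 1, n)
X = np.column_stack([np.ones(n), x])
beta_bar = np.array([1.0, 2.0])
sigma2_bar = 0.09
y = X @ beta_bar + rng.normal(0.0, np.sqrt(sigma2_bar), n)

# Equation (9): ordinary least-squares beta estimator
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Residuals and the two variance estimators discussed above
r = y - X @ beta_hat
sigma2_hat_ml = (r @ r) / n        # biased ML estimator, equation (6)
sigma2_hat = (r @ r) / (n - p)     # bias-corrected estimator, equation (10)

print(beta_hat)     # close to [1.0, 2.0]
print(sigma2_hat)   # close to 0.09
```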
Study Questions
1. Write down the formula for the least-squares or beta-parameter estimator �̂� of the GLM, explain its components, and, verbally,
sketch its derivation.
2. Write down the formula for the variance parameter estimator �̂�2 of the GLM and explain its components.
Study Questions Answers
1. The least-squares or beta-parameter estimator for the GLM is given by
�̂� ≔ (𝑋𝑇𝑋)−1𝑋𝑇𝑦 ∈ ℝ𝑝
where 𝑋 ∈ ℝ𝑛×𝑝 is the design matrix, 𝑦 ∈ ℝ𝑛 denotes the data vector (and the −1 and 𝑇 superscripts denote matrix inversion and
transposition, respectively). The formula for the least-squares estimator follows from evaluating a critical point of the GLM log
likelihood function or minimizing the sum of squared deviations between the GLM prediction 𝑋𝛽 and the observed data 𝑦.
2. The variance parameter estimator of the GLM is given by
�̂�2 ≔ (𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�)/(𝑛 − 𝑝)
where 𝑋 ∈ ℝ𝑛×𝑝 is the design matrix, 𝑦 ∈ ℝ𝑛 denotes the data vector, and �̂� ≔ (𝑋𝑇𝑋)−1𝑋𝑇𝑦 ∈ ℝ𝑝 denotes the least-squares
parameter estimator of the GLM. 𝑛 corresponds to the number of data points, and 𝑝 to the number of parameters.
Frequentist Parameter Estimator Distributions
(1) The intuitive background for parameter estimator distributions
In the previous sections, we have seen that parameter point estimators for the GLM
𝑋𝛽 + 휀 = 𝑦, 𝑝(휀) = 𝑁(휀; 0, 𝜎2𝐼𝑛) ⇔ 𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (1)
with 𝑋 ∈ ℝ𝑛×𝑝, 𝛽 ∈ ℝ𝑝, 휀 ∈ ℝ𝑛, 𝑦 ∈ ℝ𝑛, 𝜎2 > 0, 𝐼𝑛 ∈ ℝ𝑛×𝑛 can be derived by means of (restricted)
maximum likelihood estimation in the forms
�̂� ≔ (𝑋𝑇𝑋)−1𝑋𝑇𝑦 ∈ ℝ𝑝 (2)
and
�̂�2 ≔ (𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�)/(𝑛 − 𝑝) ∈ ℝ (3)
In this section, we discuss the “sampling distributions” of the estimators �̂� and �̂�2. To this end, it is helpful to
initially review the fundamental “sampling” intuition associated with the GLM. As discussed previously, if the
GLM is applied in a given data analytical context, the following implicit assumptions are made in a
frequentist scenario: In the real world, there exist values �̅� ∈ ℝ𝑝 and �̅�2 > 0, which are fixed and true, but
unknown. The data vector 𝑦(1) ∈ ℝ𝑛 that forms the basis for data analysis is assumed to be a single sample
from the distribution 𝑁(𝑦; 𝑋�̅�, �̅�2𝐼𝑛), and based on this sample, point estimates for the values �̅� and �̅�2
can be obtained by substituting 𝑦(1) in the equations for �̂� and �̂�2, i.e., setting
�̂�(1) = (𝑋𝑇𝑋)−1𝑋𝑇𝑦(1) and �̂�2(1)≔
(𝑦(1)−𝑋�̂�)𝑇(𝑦(1)−𝑋�̂�)
𝑛−𝑝 (4)
where the “(1)“- superscripts are meant to indicate that these estimator values are obtained based on the
data realization 𝑦(1). If the experiment were to be repeated under identical circumstances, a second sample
from 𝑁(𝑦; 𝑋�̅�, �̅�2𝐼𝑛) could be obtained, which we denote by 𝑦(2). The values of 𝑦(2) will not be identical to
those of 𝑦(1), but largely similar, as both are derived from the same underlying distribution. Again, point
estimates for the values of �̅� and �̅�2 could be obtained by means substitution of 𝑦(2) in (2) and (3). Because
the design matrix 𝑋 and its properties 𝑛 and 𝑝 will remain identical, the values for �̂�(2) and �̂�2(2)
will show
some variation with respect to �̂�(1) and �̂�2(1)
, but, as 𝑦(2) is largely similar to 𝑦(1), not too much. One can
readily imagine taking more samples from 𝑁(𝑦; 𝑋�̅�, �̅�2𝐼𝑛), resulting in more values 𝑦(3), 𝑦(4), 𝑦(5)… ∈ ℝ𝑛,
which in turn give rise to more values �̂�(3), �̂�(4), �̂�(5), … and �̂�2(2), �̂�2
(3), �̂�2
(4), ….. In this section, we
investigate the following question: given that we know the distribution of the data values 𝑦(𝑖), 𝑖 = 1,2,3,…,
what can we say about the distribution of the corresponding �̂�(𝑖) and �̂�2(𝑖)
values? These distributional
properties form the theoretical basis for the statistical testing theory based on the 𝑇 and 𝐹 statistics
discussed in subsequent sections.
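The repeated-sampling scheme just described can be sketched in a few lines. The following minimal simulation (an illustrative sketch, not part of the original text) assumes an arbitrary simple linear regression design with true parameter values β̄ = (0, 1)ᵀ and σ̄² = 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)

# simple linear regression design: intercept and a linear trend
n = 20
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
p = X.shape[1]

beta_true = np.array([0.0, 1.0])   # "true, but unknown" beta parameters
sigma2_true = 0.5                  # "true, but unknown" variance parameter

# draw repeated samples y^(1), y^(2), ... and evaluate the estimators on each
n_samples = 1000
beta_hats = np.empty((n_samples, p))
sigma2_hats = np.empty(n_samples)
for i in range(n_samples):
    y = X @ beta_true + rng.normal(0.0, np.sqrt(sigma2_true), size=n)
    beta_hats[i] = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hats[i]
    sigma2_hats[i] = (resid @ resid) / (n - p)

# the estimator values vary from sample to sample, scattering around the truth
print(beta_hats.mean(axis=0), sigma2_hats.mean())
```

Each row of `beta_hats` corresponds to one β̂⁽ⁱ⁾ and each entry of `sigma2_hats` to one σ̂²⁽ⁱ⁾ in the notation above.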
Before introducing the sampling distributions of β̂ and σ̂², we remark that these estimators, and in fact any “statistic”, can be conceived as a mapping from the data space into the parameter space. This is a useful perspective for studying properties of estimators. For example, we may write the β̂ and σ̂² estimators as the following functions

β̂_X : ℝⁿ → ℝᵖ, y ↦ β̂_X(y) ≔ (XᵀX)⁻¹Xᵀy  (5)

σ̂²_X : ℝⁿ → ℝ, y ↦ σ̂²_X(y) ≔ (y − Xβ̂_X(y))ᵀ(y − Xβ̂_X(y)) / (n − p)  (6)
(2) The sampling distribution of the beta parameter estimator
The fundamental assumption of the GLM is that the distribution of the data random vector y ∈ ℝⁿ is given by a multivariate Gaussian distribution, N(y; Xβ, σ²Iₙ). Intuitively, when forming the OLS estimator β̂ ∈ ℝᵖ, one multiplies the data with the fixed matrix (XᵀX)⁻¹Xᵀ, i.e., one stretches or squeezes the respective values by a fixed amount, but leaves them unchanged otherwise. One may thus expect the OLS estimator β̂ to be distributed according to a multivariate Gaussian as well, which is indeed the case. Formally, the parameters of the normal distribution of β̂ can be inferred based on the linear transformation theorem for multivariate Gaussian distributions.
Specifically, as introduced previously, this theorem states that if (1) a random vector x ∈ ℝⁿ is distributed according to a Gaussian distribution with expectation parameter μₓ ∈ ℝⁿ and covariance matrix parameter Σₓ ∈ ℝⁿˣⁿ, (2) A ∈ ℝᵐˣⁿ is a matrix of rank m, (3) ε ∈ ℝᵐ is a Gaussian random vector with expectation parameter μ_ε ∈ ℝᵐ and covariance matrix parameter Σ_ε ∈ ℝᵐˣᵐ, and (4) the covariance between x and ε is zero, then

z ≔ Ax + ε  (1)

is distributed according to a Gaussian distribution with expectation parameter μ_z = Aμₓ + μ_ε ∈ ℝᵐ and covariance matrix parameter Σ_z = AΣₓAᵀ + Σ_ε ∈ ℝᵐˣᵐ. In the current scenario, we are concerned with a special case of this theorem, namely the case ε ≡ 0, i.e., the case in which ε does not exist as a random variable. In this case, we may omit the contributions of ε to the parameters of z. Formulated in general terms, we thus obtain the following simplified linear transformation theorem for multivariate Gaussian distributions.
Theorem. Simplified linear transformation theorem for multivariate Gaussian distributions.
Let x ∈ ℝⁿ be distributed according to an n-dimensional Gaussian distribution with expectation parameter μ ∈ ℝⁿ and covariance parameter Σ ∈ ℝⁿˣⁿ. Let further A ∈ ℝᵐˣⁿ be a matrix of rank m ∈ ℕ. Then the product Ax ∈ ℝᵐ is distributed according to an m-dimensional Gaussian distribution with expectation parameter Aμ ∈ ℝᵐ and covariance parameter AΣAᵀ ∈ ℝᵐˣᵐ. In other words, for p(x) = N(x; μ, Σ) and z ≔ Ax we have p(z) = N(z; Aμ, AΣAᵀ).
□
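As an illustrative numerical check of the theorem (a sketch added here, not part of the original text), one can draw many samples of x, apply A, and compare the empirical moments of Ax with Aμ and AΣAᵀ; the Gaussian parameters and the rank-2 matrix A below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

# arbitrary illustrative Gaussian parameters (n = 3) and a rank-2 matrix A
mu = np.array([1.0, -1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, -1.0]])

# sample x ~ N(mu, Sigma) and transform to z = Ax
x = rng.multivariate_normal(mu, Sigma, size=200_000)
z = x @ A.T

# empirical moments of z approach the theorem's parameters A mu and A Sigma A^T
print(z.mean(axis=0), A @ mu)          # expectation parameter
print(np.cov(z.T), A @ Sigma @ A.T)    # covariance parameter
```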
To apply this theorem in the current case, we note without proof that (XᵀX)⁻¹Xᵀ ∈ ℝᵖˣⁿ is of rank p, and obtain the following result:

p(y) = N(y; Xβ, σ²Iₙ) ⇒ p(β̂) = N(β̂; β, σ²(XᵀX)⁻¹)  (2)
Derivation of (2)
Upon setting

x ≔ y ∈ ℝⁿ, μ ≔ Xβ ∈ ℝⁿ, Σ ≔ σ²Iₙ ∈ ℝⁿˣⁿ, A ≔ (XᵀX)⁻¹Xᵀ ∈ ℝᵖˣⁿ, and z ≔ β̂ = (XᵀX)⁻¹Xᵀy ∈ ℝᵖ  (2.1)

application of the simplified linear transformation theorem for multivariate Gaussian distributions yields the following p-dimensional Gaussian distribution for the estimator β̂:

p(β̂) = N(β̂; (XᵀX)⁻¹XᵀXβ, (XᵀX)⁻¹Xᵀ(σ²Iₙ)((XᵀX)⁻¹Xᵀ)ᵀ)  (2.2)

The parameters of this distribution can be simplified as follows:

μ_β̂ ≔ (XᵀX)⁻¹XᵀXβ = β ∈ ℝᵖ  (2.3)

and

Σ_β̂ ≔ (XᵀX)⁻¹Xᵀ(σ²Iₙ)((XᵀX)⁻¹Xᵀ)ᵀ = (XᵀX)⁻¹Xᵀ(σ²Iₙ)X(XᵀX)⁻¹ = σ²(XᵀX)⁻¹XᵀX(XᵀX)⁻¹ = σ²(XᵀX)⁻¹ ∈ ℝᵖˣᵖ  (2.4)

In (2.4) we used in the second equality that both XᵀX and its inverse (XᵀX)⁻¹ are symmetric matrices, and thus ((XᵀX)⁻¹)ᵀ = (XᵀX)⁻¹. In the third equality, the scalar σ² was moved to the front of the product and the identity factor Iₙ was suppressed. Finally, in the fourth equality, the matrix product XᵀX(XᵀX)⁻¹ was evaluated to the identity Iₚ.

□
In verbose form, we obtained the following result: from the assumption of error terms ε ∈ ℝⁿ distributed according to a multivariate Gaussian distribution with expectation 0 ∈ ℝⁿ and spherical covariance matrix σ²Iₙ ∈ ℝⁿˣⁿ and the GLM equation y = Xβ + ε, it first follows that the data y ∈ ℝⁿ is distributed according to a Gaussian distribution with expectation parameter Xβ ∈ ℝⁿ and covariance matrix σ²Iₙ ∈ ℝⁿˣⁿ. From the Gaussian distribution of the data, in turn, it follows with the (simplified) linear transformation theorem for Gaussian distributions that the OLS estimator β̂ ∈ ℝᵖ is distributed according to a p-dimensional Gaussian distribution with expectation parameter β ∈ ℝᵖ, i.e., the true, but unknown, value of the parameters, and covariance matrix σ²(XᵀX)⁻¹ ∈ ℝᵖˣᵖ. Notably, the covariance matrix of the beta parameter estimator β̂ thus depends on the error variance parameter σ² and the design matrix X ∈ ℝⁿˣᵖ. Figure 1 below visualizes the Gaussian sampling distribution of the two-dimensional OLS estimator β̂ ∈ ℝ² in the case of a simple linear regression GLM.
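This result can also be checked by simulation (an illustrative sketch, not part of the original text): the empirical covariance of many β̂ samples should approach σ²(XᵀX)⁻¹. The design matrix and true parameter values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)

# arbitrary illustrative design matrix and true parameters
n = 10
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
beta = np.array([0.5, -0.3])
sigma2 = 2.0

# sample many data realizations and evaluate the OLS beta estimator for each
Y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=(100_000, n))
B = Y @ (np.linalg.inv(X.T @ X) @ X.T).T     # rows are beta_hat samples

# empirical moments approach beta and sigma^2 (X'X)^{-1}
print(B.mean(axis=0), beta)
print(np.cov(B.T), sigma2 * np.linalg.inv(X.T @ X))
```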
Figure 1. Visualization of the distributional properties of the OLS β estimator for the case of a simple linear regression GLM. The panel in the first row, first column shows 20 samples from simple linear regression GLMs with true, but unknown, parameter values β̄ ≔ (0, 1)ᵀ as red dots around their respective expectation (black line), where the true, but unknown, variance parameter was set to σ̄² ≔ 0.5. The second column depicts the corresponding evaluations of the OLS β estimator in parameter space ℝ². The third column depicts the analytical probability density function of the OLS β estimator derived by means of the linear transformation theorem for Gaussian distributions. The panels in the second row depict the equivalent entities for β̄ ≔ (2, −1)ᵀ.
(3) The sampling distribution of the scaled variance parameter estimator
The derivation of the sampling distribution of the variance parameter estimator σ̂² is, compared to the derivation of the OLS beta estimator distribution, somewhat more involved. We thus content ourselves with stating the result here and provide some intuition and a sampling-based demonstration of it. Interested readers may refer to the Section “Probability Distributions” for a formal derivation of the following result:

Let Xβ + ε = y, where X ∈ ℝⁿˣᵖ, β ∈ ℝᵖ, y ∈ ℝⁿ and ε ~ N(ε; 0, σ²Iₙ) for n, p ∈ ℕ and σ² > 0, denote the GLM, and let

σ̂² = (y − Xβ̂)ᵀ(y − Xβ̂) / (n − p)  (1)

denote the unbiased estimator for the variance parameter σ² > 0. Then the distribution of the “scaled variance parameter estimator”

((n − p)/σ²) σ̂² = (y − Xβ̂)ᵀ(y − Xβ̂) / σ²  (2)

is given by a chi-squared distribution with n − p degrees of freedom, denoted by

p(((n − p)/σ²) σ̂²) = χ²(((n − p)/σ²) σ̂²; n − p)  (3)
What is the intuition behind this result? First, from the previous discussion of the distribution of the OLS estimator for the beta parameters β̂, it should be clear that based on data samples y⁽¹⁾, y⁽²⁾, y⁽³⁾, … (and values of the corresponding β̂⁽¹⁾, β̂⁽²⁾, β̂⁽³⁾, …), we may evaluate values σ̂²⁽¹⁾, σ̂²⁽²⁾, σ̂²⁽³⁾, … based on the formula in equation (1) and consider their distribution. In contrast to the distributions encountered so far, the distribution of the estimator σ̂² is not a normal distribution. This can intuitively be made plausible by considering the term (y − Xβ̂)ᵀ(y − Xβ̂), i.e., the so-called “residual sum-of-squares” in the numerator of the right-hand side of equation (1). From the definition of the GLM and the discussion so far, we know that y is normally distributed with expectation Xβ and that β̂ is normally distributed with expectation β. Intuitively, the difference z ≔ y − Xβ̂ ∈ ℝⁿ is thus normally distributed with expectation Xβ − XE(β̂) = 0 ∈ ℝⁿ. The numerator of the right-hand side of (2) thus corresponds to computing the sum of squares zᵀz of a normally distributed random vector z with expectation 0.
We next consider sampling values of the random variable z from its distribution. One may obtain values equally likely to be positive or negative and mainly close to zero. Now consider squaring these values. Most notably, all the negative values will become positive, and thus the distribution of zᵀz, i.e., the sum of squared entries of z, cannot be a normal distribution centered on zero. In fact, as noted in the definition of the chi-squared distribution in the Section “Probability Distributions”, it is a standard result that the sum of squared univariate random variables xᵢ (i = 1, …, n), each distributed according to a “standard” normal distribution with expectation μ = 0 and variance σ² = 1, i.e., ζ ≔ Σᵢ₌₁ⁿ xᵢ², is distributed according to a chi-squared distribution with n degrees of freedom, i.e., χ²(ζ; n). This fact, the covariance matrix of the random variable z = y − Xβ̂, and the denominator of the estimator σ̂² all contribute to the fact that ((n − p)/σ²) σ̂² is distributed according to a chi-squared distribution with n − p degrees of freedom. The reason that the product ((n − p)/σ²) σ̂² is considered here, instead of the “pure” variance parameter estimator σ̂², lies merely in the mathematical fact that it is this product which is chi-squared distributed. To recover the distribution of the “pure” σ̂², one would have to use the “transformation theorem for probability density functions”, which we eschew here for simplicity. We visualize the sampling distributions of σ̂² and ((n − p)/σ²) σ̂² in Figure 2 below. Note that these entities are one-dimensional, and their distribution is hence always readily visualized, irrespective of the dimensionality of the beta parameter vector.
Figure 2. Visualization of the sampling distributional properties of the σ̂² estimator for the case of a simple linear regression GLM. The panel in the first row, first column shows 100 samples from simple linear regression GLMs with true, but unknown, parameter values β̄ ≔ (1, 1)ᵀ and σ̄² ≔ 0.1. The panel in the second column depicts the frequency counts of observations of σ̂² falling into equally spaced bins of width ≈ 0.08. Finally, the third column depicts the histogram of the product ((n − p)/σ²) σ̂² and the probability density function of the chi-squared distribution with the appropriate degrees of freedom. The panels in the second row depict the same entities for the case σ̄² ≔ 1. Note that while the distribution of σ̂² changes, the distribution of ((n − p)/σ²) σ̂² differs only up to sampling error from the case σ̄² ≔ 0.1, due to the scaling by σ².
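A minimal sampling-based check of this result (an illustrative sketch, not part of the original text): simulate the scaled variance estimator repeatedly and compare its empirical moments with those of a χ²(n − p) distribution, whose mean is n − p and whose variance is 2(n − p). The design and true parameter values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

# simple linear regression design, arbitrary true parameters
n = 12
X = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])
p = X.shape[1]
beta_true = np.array([1.0, 1.0])
sigma2_true = 0.1

# repeatedly evaluate the scaled variance parameter estimator
n_samples = 50_000
scaled = np.empty(n_samples)
for i in range(n_samples):
    y = X @ beta_true + rng.normal(0.0, np.sqrt(sigma2_true), size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    scaled[i] = (resid @ resid) / sigma2_true   # = ((n-p)/sigma^2) * sigma2_hat

# chi-squared(n - p): mean n - p, variance 2(n - p)
print(scaled.mean(), n - p)
print(scaled.var(), 2 * (n - p))
```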
(4) Overview of the frequentist GLM distribution theory
In brief, we may summarize the sampling distribution theory of the GLM and its parameter point estimators, with fixed quantities X ∈ ℝⁿˣᵖ, β ∈ ℝᵖ, n, p ∈ ℕ, σ² > 0, Iₙ ∈ ℝⁿˣⁿ and random vectors y, ε ∈ ℝⁿ, β̂ ∈ ℝᵖ and σ̂² ∈ ℝ, as follows:

General Linear Model: Xβ + ε = y, p(ε) = N(ε; 0, σ²Iₙ)  (1)
Data distribution: p(y) = N(y; Xβ, σ²Iₙ)  (2)
Beta parameter estimator distribution: p(β̂) = N(β̂; β, σ²(XᵀX)⁻¹)  (3)
Scaled variance parameter estimator distribution: p(((n − p)/σ²) σ̂²) = χ²(((n − p)/σ²) σ̂²; n − p)  (4)
Study Questions

1. Discuss the frequentist sampling perspective of the GLM and address the role of the data that a GLM analysis is applied to.
2. Write down the “simplified linear transformation theorem for Gaussian distributions”.
3. Verbally explain the fact that the beta parameter and variance parameter estimators have a probability distribution.
4. Write down the probability distribution of the beta parameter estimator β̂ and explain its components.
5. Write down the probability distribution of the scaled variance parameter estimator ((n − p)/σ²) σ̂² and explain its components.
Study Questions Answers

1. From a frequentist sampling perspective, there exist fixed true, but unknown, values of the GLM beta and variance parameters. Based on these parameters (and a given design matrix), data vectors y⁽¹⁾, y⁽²⁾, … can be sampled from the GLM. The distribution of these data vectors is described by p(y) = N(y; Xβ, σ²Iₙ). If a given data set is analyzed by means of the GLM, it is assumed that this data set is one of the many potential samples taken from the underlying GLM.

2. Let x ∈ ℝⁿ be distributed according to an n-dimensional Gaussian distribution with expectation parameter μ ∈ ℝⁿ and covariance parameter Σ ∈ ℝⁿˣⁿ. Let further A ∈ ℝᵐˣⁿ be a matrix of rank m ∈ ℕ. Then the product Ax ∈ ℝᵐ is distributed according to an m-dimensional Gaussian distribution with expectation parameter Aμ ∈ ℝᵐ and covariance parameter AΣAᵀ ∈ ℝᵐˣᵐ. In other words, for p(x) = N(x; μ, Σ) and z ≔ Ax we have p(z) = N(z; Aμ, AΣAᵀ).

3. From a frequentist viewpoint, data vectors can be sampled from a GLM many times. By means of the formulas for the beta and variance parameter estimators, these data sample vectors can be converted into samples of the beta and variance parameter estimators. From the distributional assumptions about the error terms in the GLM, and the ensuing distribution of the data samples, the probability distributions of the beta and variance parameter estimators then follow. In an experimental context, there is only one data realization and thus only one realization of the beta and variance parameter estimators. The distributions of the beta and variance parameter estimators are thus “hypothetical”.

4. The probability distribution of the beta parameter estimator is given by the normal distribution N(β̂; β, σ²(XᵀX)⁻¹), with expectation parameter β ∈ ℝᵖ, i.e., the true, but unknown, parameter value, and positive-definite covariance matrix parameter σ²(XᵀX)⁻¹ ∈ ℝᵖˣᵖ, where σ² > 0 is the true, but unknown, variance parameter of the GLM noise term (i.e., p(ε) = N(ε; 0, σ²Iₙ)) and X ∈ ℝⁿˣᵖ is the design matrix. The (co)variance of the beta parameter estimator distribution is thus given by the combination of the noise term variance parameter and the experimental design.

5. The probability distribution of the scaled variance parameter estimator is given by the chi-squared distribution χ²(((n − p)/σ²) σ̂²; n − p), where the “degrees of freedom” n − p refer to the difference between the number of data points and the number of parameters of the underlying GLM, and σ² > 0 is the true, but unknown, variance parameter of the GLM noise term (i.e., p(ε) = N(ε; 0, σ²Iₙ)).
T- and F-Statistics
(1) Significance and hypothesis testing in frequentist statistics
In classical (= frequentist) statistics, model evaluation corresponds to statistical testing. Informally,
statistical testing works as follows: based on some initial assumptions about the distribution of the observed
data, the distribution of estimated model parameters under a null hypothesis of assumed parameter values
is determined and summarized in the null distribution of a “test statistic”. Based on some observed data and
the corresponding parameter estimates, the point estimate of this statistic is computed next. The value of
the test statistic is then compared to its distribution under the null hypothesis. If the value of the test
statistic falls into a region that, under the null hypothesis, is associated with a very small probability of occurring,
it is inferred that it is unlikely that the observed data was generated from the null distribution and the null
hypothesis is rejected. The probability of observing the data-based test statistic (or a more extreme value),
given that the null hypothesis is true, is known as the p-value. If the p-value falls below some conventional
threshold, say p < 0.05, the result is labelled statistically significant, otherwise it is not.
Commonly, in addition to this notion of “significance testing”, the following set-up is used to describe the test situation: if the null hypothesis, denoted as H₀, is in fact true, it may be rejected erroneously with a given probability α ∈ [0,1], known as the Type I error rate. If, to the contrary, an alternative hypothesis H₁ is true, this may erroneously be rejected (and the null hypothesis “accepted”) with a probability β ∈ [0,1], known as the Type II error rate. Table 1 below illustrates this situation.
                     H₀ is true                        H₁ is true
H₀ is not rejected   Correct decision                  Type II error with probability β
H₀ is rejected       Type I error with probability α   Correct decision

Table 1: The problem of statistical testing
Pondering this set-up of statistical testing raises many questions: Why is a null hypothesis only ever rejected, and the alternative hypothesis never accepted? Why is a value of p < 0.05 so important for having a “statistically significant” result, and thus, in terms of academia, the opportunity to write a paper, but not p = 0.08? How should one evaluate the “statistical power” 1 − β in a set-up where one has no idea about the effect size difference between H₀ and H₁? Why does one always want to reject the null hypothesis of “no effect”, when really one is interested in the probability of an experimental manipulation having resulted in an experimental effect? Finally, why does the p-value have such a counterintuitive definition that takes a long time to get used to?
To obtain an intuition about why statistical testing (and thus, in fact, much of contemporary empirical science) is dominated by p-values and uninformed use of statistical software instead of straightforward probabilistic reasoning, it is helpful to consider the history of “significance testing” as put forward by Ronald Fisher and of “hypothesis testing” as developed by Jerzy Neyman and Egon Pearson, documented in the following excerpt from Wikipedia.
“Significance testing is largely the product of Karl Pearson (p-value, Pearson’s chi-squared test), William
Sealy Gosset (Student’s t-distribution) and Ronald Fisher (“null hypothesis”, analysis of variance,
“significance test”), while hypothesis testing was developed by Jerzy Neyman and Egon Pearson, the son of
Karl Pearson. Ronald Fisher, mathematician and biologist, began his life in statistics as a Bayesian, but soon
grew disenchanted with the subjectivity involved (namely use of the principle of indifference, i.e. uniform
probabilities, when determining prior probabilities), and sought to provide a more “objective” approach to
inductive inference.
Fisher was an agricultural statistician who emphasized rigorous experimental design and methods to
extract results from few samples assuming Gaussian distributions. Neyman, who teamed with the younger
Pearson, emphasized mathematical rigor and methods to obtain more results from many samples and a
wider range of distributions. Modern hypothesis testing is an inconsistent hybrid of the Fisher vs. Neyman-Pearson formulations, methods, and terminology developed in the early 20th century. While hypothesis
testing was popularized early in the 20th century, evidence of its use can be found much earlier. In the 1770s
Laplace considered the statistics of half a million births. The statistics showed an excess of boys compared to
girls. He concluded by calculation of a p-value that the excess was a real, but unexplained, effect.
Fisher popularized the “significance test”. He required a null hypothesis (corresponding to a population
frequency distribution) and a sample. His (now familiar) calculations determined whether to reject the null-
hypothesis or not. Significance testing did not utilize an alternative hypothesis, so there was no concept of a
Type II error [i.e. the error of “not rejecting the null hypothesis, if the alternative hypothesis is actually
true”]. The p-value [i.e. “the probability of observing a data set as extreme or more extreme than the
actually observed one, given that the null hypothesis is true”] was devised as an informal, but objective,
index meant to help a researcher determine whether to modify further experiments or strengthen one’s
faith in the null hypothesis.
Neyman and Pearson considered a different problem, which they called “hypothesis testing”. They
initially considered two simple hypotheses [A and B], both with [specified] frequency distributions. They
calculated [the] two probabilities [of an observed data sample under each hypothesis] and typically selected
the hypothesis with the higher probability, i.e. the hypothesis more likely to have generated the sample.
Hypothesis testing and the associated Type I and Type II errors [i.e. deciding, based on the observed data,
that hypothesis B is true, when in fact hypothesis A is true, and vice versa, deciding that hypothesis A is true,
when in fact hypothesis B is true, respectively] were devised as a more “objective” alternative to Fisher’s p-
value. Their method thus always selected a hypothesis and allowed for the calculation of both types of error
probabilities.
Fisher and Neyman-Pearson clashed bitterly. Neyman-Pearson considered their formulation to be an
improved generalization of significance testing. Fisher thought that it was not applicable to scientific
research because often, during the course of the experiment, it is discovered that the initial assumptions
about the null hypothesis are questionable due to unexpected sources of errors. He believed that the rigid
reject/accept decisions based on models formulated before data is collected was incompatible with this
common scenario faced by scientists and attempts to apply this method to scientific research would lead to
mass confusion.
The dispute between Fisher and Neyman-Pearson was waged on philosophical grounds, characterized by
a philosopher as a dispute over the proper role of models in statistical inference.
Events intervened: Neyman accepted a position in the western hemisphere, breaking his partnership
with Pearson and separating disputants, who had occupied the same building, by much of the planetary
diameter. World War II provided an intermission in the debate. The dispute between Fisher and Neyman
terminated unresolved after 27 years with Fisher’s death in 1962. Neyman wrote a well-regarded eulogy.
Some of Neyman’s later publications reported p-values and significance levels.
The modern version of hypothesis testing is a hybrid of the two approaches that resulted from
confusion by writers of statistical textbooks (as predicted by Fisher) in the 1940s. Great conceptual
differences and many caveats in addition to those mentioned above were ignored. Neyman and Pearson
provided the stronger terminology, the more rigorous mathematics, and the more consistent philosophy,
but the subject taught today in introductory statistics has more similarities with Fisher’s method than
theirs. This history explains the inconsistent terminology (example: the null hypothesis is never accepted,
but there is a region of acceptance).
Sometime around 1940, in an apparent effort to provide researchers with a “non-controversial” way
to have their cake and eat it too, the authors of statistical text books began anonymously combining the two
strategies by using the p-value in place of the test statistic (or data) to test against the Neyman-Pearson
“significance level”. Thus, researchers were encouraged to infer the strength of their data against some null
hypothesis using p-values, while also thinking they are retaining the post-data collection objectivity provided
by hypothesis testing. It then became customary for the null hypothesis, which was originally some realistic
research hypothesis, to be used almost solely as a strawman “nil” hypothesis (one where a treatment has no
effect, regardless of the context).
[The two different approaches developed by Fisher and Neyman-Pearson may be summarized as follows]
Fisher’s significance testing
1. Set up a statistical null hypothesis. The null hypothesis need not be a nil hypothesis [in the sense of a
difference from zero, but may be with respect to any value].
2. Report the exact level of significance (e.g. 𝑝 = 0.051 or 𝑝 = 0.049). Do not use a conventional 5%
level and do not talk about accepting or rejecting hypotheses. If the result is “not significant”, draw no
conclusions and make no decisions, but suspend judgment until further data is available.
3. Use this procedure only if little is known about the problem at hand, and only to draw provisional
conclusions in the context of an attempt to understand the experimental situation.
Neyman-Pearson’s hypothesis testing
1. Set up two [formalized] statistical hypotheses [“probabilistic models”], H₁ and H₂, and decide on desirable probabilities α and β for Type I and Type II errors, respectively, and on the sample size, before the experiment, based on subjective cost-benefit considerations. These define a rejection region for each hypothesis.
2. If the data falls into the rejection region of 𝐻1, accept 𝐻2, otherwise accept 𝐻1. Note that accepting a
hypothesis does not mean that you believe in it, but only that you act as if it were true.
3. The usefulness of the procedure is limited, among others, to situations where one has a disjunction of
hypotheses [e.g. two alternative univariate Gaussians with expectation parameters 𝜇1 ≠ 𝜇2 and only
one of them is true] and where one can make meaningful cost-benefit trade-offs for choosing 𝛼 and
1 − 𝛽.”
(recovered from http://en.wikipedia.org/w/index.php?oldid=581635341)
Some of the differences between Fisher’s significance testing and Neyman-Pearson’s hypothesis testing
frameworks are summarized in Figure 1.
In the following, we will consider the T- and F-statistics and the associated T- and F-tests, with the aim of providing some background on the common use of these statistics in neuroimaging and psychological research. Based on the discussion above, it will become evident that the T-statistic and T-test conform more closely to the idea of null hypothesis significance testing and are best motivated from a null hypothesis in a scenario where no alternative hypothesis exists. The F-statistic and F-test, on the other hand, form a stronger hybrid between both approaches and are best motivated from the concept of a likelihood ratio test, in which two hypotheses exist explicitly but are finally evaluated using a null hypothesis significance testing approach.

For both the T- and F-statistics, we will first introduce their formulas and discuss their intuitive meaning irrespective of the framework of statistical testing. In a second step, we will then discuss their distribution under assumed null hypotheses and briefly introduce how they are used for statistical testing.
Figure 1. Significance testing vs. hypothesis testing.
(2) Definition and intuition of the T-statistic
For a “contrast vector” c ∈ ℝᵖ, the T-statistic is defined as

T : ℝᵖ → ℝ, β̂ ↦ T(β̂) ≔ cᵀβ̂ / √(σ̂² cᵀ(XᵀX)⁻¹c)  (1)
The first thing to note about this definition is that T is a function of the data y by means of the beta and variance estimators β̂ and σ̂². To get at the intuition behind this definition, we first consider the case of the GLM representing independent and identical sampling of n data points y₁, …, yₙ from a univariate Gaussian distribution, in the form

p(y) = N(y; Xβ, σ²Iₙ)  (2)
where y ∈ ℝⁿ, β ∈ ℝ, σ² > 0, and the design matrix is given by a vector of ones, X = (1, 1, …, 1)ᵀ ∈ ℝⁿˣ¹. As discussed previously, in this case the beta parameter can be interpreted as the expectation parameter of the univariate Gaussian distribution from which the n samples are obtained, and it is estimated by the OLS beta estimator according to the sample mean ȳ:

β̂ ≔ (XᵀX)⁻¹Xᵀy = (1/n) Σᵢ₌₁ⁿ yᵢ =: ȳ  (3)
We further assume the value of c ∈ ℝ to be c ≔ 1. In this case, the T-statistic defined in (1) evaluates to

T ≔ cᵀβ̂ / √(σ̂² cᵀ(XᵀX)⁻¹c) = (1 · β̂) / √(σ̂² · 1 · n⁻¹ · 1) = ȳ / (σ̂/√n)

⇔ T = Sample mean / Standard error of the sample mean  (4)

which may be recognized as the T-statistic for “one-sample t-tests” familiar from undergraduate courses in statistics.
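This equivalence is easy to verify numerically. The following sketch (illustrative, not part of the original text; the data values are arbitrary) evaluates the GLM form of the T-statistic for a design matrix of ones and compares it with the classical one-sample formula ȳ / (s/√n):

```python
import numpy as np

rng = np.random.default_rng(3)

# n i.i.d. samples from a univariate Gaussian; design matrix of ones
n = 15
y = rng.normal(0.4, 1.0, size=n)
X = np.ones((n, 1))
c = np.array([1.0])

# GLM-based T-statistic: c'beta_hat / sqrt(sigma2_hat * c'(X'X)^{-1}c)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = (resid @ resid) / (n - X.shape[1])
T = (c @ beta_hat) / np.sqrt(sigma2_hat * c @ np.linalg.inv(X.T @ X) @ c)

# classical one-sample t-statistic: sample mean / standard error of the mean
t_classical = y.mean() / (y.std(ddof=1) / np.sqrt(n))
print(T, t_classical)
```

Both expressions yield the same value, since for X = (1, …, 1)ᵀ we have XᵀX = n and n − p = n − 1.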
A more general intuition behind the T-statistic expressed in this form is

T = Effect size / Sample-size-scaled data variability  (5)

because ȳ corresponds to the sample average and its distance from the (implicit) “null assumption” that this expectation is zero, and σ̂ corresponds to the sample standard deviation, which is a measure of the data variability. High absolute T values thus have the following interpretation: in comparison to the variability of the data, the effect size is large, either in the positive or negative direction. Conversely, low absolute T values, i.e., T values close to zero, speak to a small effect size in comparison to the data variability. Another way to frame this is to say that “the T-statistic measures the effect size with respect to the yardstick of data variability”.
This interpretation carries over to the more general case of more than one beta parameter, i.e., β ∈ ℝᵖ, p > 1. We first consider the case in which c ∈ ℝᵖ contains all zeros, except for a one at the jth entry. In this case, the T-statistic is given by

T = β̂ⱼ / √(σ̂² ((XᵀX)⁻¹)ⱼⱼ)  (6)

where ((XᵀX)⁻¹)ⱼⱼ denotes the (j, j)th entry of the matrix (XᵀX)⁻¹. Notably, as we will see below, the denominator in (6) is a measure of the variance of β̂ⱼ, which scales with the general variance parameter estimate σ̂². (Note that also in the familiar T-statistic as given in equation (4), the denominator is in fact the “standard error of the mean” σ̂/√n, which is, intuitively, a measure of the variance of the sample average ȳ. However, for fixed n it scales with the general variance parameter σ².) In other words, as for the univariate Gaussian case, with a contrast vector singling out a specific beta parameter estimate β̂ⱼ (j ∈ ℕₚ), the T-statistic is a measure of the effect size associated with the jth independent variable relative to the data variability.

Finally, the general form of the T-statistic as in (1) allows for the evaluation of any kind of linear combination (i.e., weighted sum) of beta parameter estimates. Commonly encountered contrasts in FMRI data analysis are contrasts that test whether one parameter estimate β̂ᵢ is larger than another parameter estimate β̂ⱼ: if β̂ᵢ is larger than β̂ⱼ, then β̂ᵢ − β̂ⱼ > 0, and thus the corresponding contrast vector c ∈ ℝᵖ contains all zeros, except at the ith position, where it contains a 1, and at the jth position, where it contains a −1.
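A general contrast-based T-statistic routine following equation (1) might look as follows (an illustrative sketch; the design matrix, data, and contrast below are arbitrary):

```python
import numpy as np

def t_statistic(X, y, c):
    """T-statistic c'beta_hat / sqrt(sigma2_hat * c'(X'X)^{-1}c) for contrast c."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = (resid @ resid) / (n - p)
    return (c @ beta_hat) / np.sqrt(sigma2_hat * c @ XtX_inv @ c)

# two-condition design: one column of ones per group of 10 observations
rng = np.random.default_rng(4)
X = np.kron(np.eye(2), np.ones((10, 1)))      # 20 x 2 design matrix
y = X @ np.array([2.0, 1.0]) + rng.normal(0.0, 1.0, size=20)

# contrast c = (1, -1)' testing whether beta_1 is larger than beta_2
c = np.array([1.0, -1.0])
print(t_statistic(X, y, c))
```

Note that negating the contrast vector simply flips the sign of the statistic, reflecting the direction of the tested difference.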
(3) The T-statistic null distribution
How can the T-statistic as defined in equation (1) of the previous section be used to implement a null
hypothesis test as outlined in the beginning of this Section? To this end, the T-statistic must first be
expanded with an assumed parameter value 𝛽0 ∈ ℝ𝑝 representing a null hypothesis 𝐻0 (note that 𝛽0 ∈ ℝ
𝑝
is not required to be zero, and hence that a “null hypothesis” is not required to be a “nil hypothesis” of”),
and second, the sampling distribution of this “extended”, i.e., more general, T-statistics must be assessed.
Formally, a “null hypothesis” corresponds to the assumption that the data 𝑦 is sampled from the
following “null hypothesis distribution”
𝐻0 ⇔ 𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽0, 𝜎2𝐼𝑛) (1)
The “extended” T-statistic takes the form
\[
T : \mathbb{R}^p \to \mathbb{R}, \quad \hat{\beta} \mapsto T(\hat{\beta}) := \frac{c^T\hat{\beta} - c^T\beta_0}{\sqrt{\hat{\sigma}^2\, c^T (X^T X)^{-1} c}} \qquad (2)
\]
Note that this definition corresponds to the T-statistic introduced in equation (1) of the previous section for the case $\beta_0 := 0 \in \mathbb{R}^p$. The introduction of the term $-c^T\beta_0$ above merely ensures that the expectation of $T$ is zero if the null hypothesis is true and $c^T\hat{\beta}$ indeed has the expected value $c^T E(\hat{\beta}) = c^T\beta_0$.
Having established a more general form of the T-statistic, we now ask how this statistic is distributed under the null hypothesis (1). To this end, we briefly rehearse what we have learned about the data and parameter estimator distributions of the GLM so far. First, the probability distribution of the data is given by
𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (3)
Based on (3), the distribution of the OLS beta estimator is given by the linear transformation theorem for Gaussian random variables as
𝑝(�̂�) = 𝑁(�̂�; 𝛽, 𝜎2(𝑋𝑇𝑋)−1 ) (4)
Applying the linear transformation theorem for Gaussian random variables to $\hat{\beta}$ (and noting that $(c^T)^T = c$), one sees that the distribution of the OLS beta parameter contrast $c^T\hat{\beta}$ is given by the univariate Gaussian distribution
𝑝(𝑐𝑇�̂� ) = 𝑁(𝑐𝑇�̂� ; 𝑐𝑇𝛽, 𝜎2𝑐𝑇(𝑋𝑇𝑋)−1𝑐) (5)
Finally, we know that the scaled variance parameter estimator $\frac{n-p}{\sigma^2}\hat{\sigma}^2$ is distributed according to a chi-squared distribution

\[
p\left(\frac{n-p}{\sigma^2}\hat{\sigma}^2\right) = \chi^2\left(\frac{n-p}{\sigma^2}\hat{\sigma}^2;\, n-p\right) \qquad (6)
\]
Based on these results, one can now ask how the ratio between linear combinations of the parameter estimates and the variance, i.e., the extended T-statistic, is distributed. To this end, the reader is first encouraged to review the definition of the $t$-distribution. Most importantly, the $t$-distribution describes the probability distribution of a random variable $T$ that is the ratio of a standard normally distributed random variable and the square root of a chi-squared distributed random variable with $n$ degrees of freedom divided by $n$. To establish that the T-statistic as defined in (2) is in fact distributed according to a $t$-distribution, we continue from the distribution of the contrasted OLS beta estimator (5)
and the distribution of the variance estimator (6), and ask how these random variables have to be modified to match those appearing in the definition of the $t$-distribution.
Figure 1 Analytical and sampling distributions for a GLM representing independent and identical sampling from a univariate Gaussian distribution. For the upper panels, the null hypothesis corresponds to $\beta_0 = 0$ and for the lower panels to $\beta_0 = 2$. The first panel in each row depicts the univariate Gaussian probability density function $N(y; \beta_0, 1)$ from which a realization of $n = 10$ data points is drawn. The second panels depict, for 100 draws of $n = 10$ data points, the ensuing sampling distribution (histogram estimate) of $c^T\hat{\beta}$, where $c^T = 1$, and the corresponding analytical distribution. Note that the beta parameter estimates (sample means) are centered around the corresponding null hypothesis parameters $\beta_0$. The third panels depict the empirical and analytical distributions of the ensuing $\hat{\sigma}^2$ estimates multiplied by $\frac{n-p}{\sigma^2} = 9$, which are both distributed according to a chi-squared distribution with 9 degrees of freedom. Finally, the last panels depict the resulting empirical and analytical distributions of the "extended" T-statistic as defined in equation (2). Note that irrespective of the underlying null hypothesis parameters $\beta_0$, both are centered around 0.
Firstly, to render $c^T\hat{\beta}$ a random variable that is distributed according to a standard normal distribution, we have to $z$-transform it. Recall that the $z$-transform maps a normally distributed random variable $x$ to a standard normally distributed random variable $z$. In other words, if

\[
p(x) = N(x; \mu, \sigma^2) \qquad (7)
\]

then

\[
p\left(\frac{x-\mu}{\sigma}\right) =: p(z) = N(z; 0, 1) \qquad (8)
\]
$z$-transformation of $c^T\hat{\beta}$ yields the standard normally distributed random variable

\[
X := \frac{c^T\hat{\beta} - c^T\beta_0}{\sqrt{\sigma^2\, c^T (X^T X)^{-1} c}} \qquad (9)
\]
Note that this is not the T-statistic as defined in (2), because the denominator contains the true, but unknown, GLM variance parameter $\sigma^2$ rather than the variance estimator $\hat{\sigma}^2$. However, division of $X$ by the square root of the chi-squared random variable $Y := \frac{n-p}{\sigma^2}\hat{\sigma}^2$ divided by its degrees of freedom $n - p$ yields

\[
\frac{X}{\sqrt{Y/(n-p)}}
= \frac{c^T\hat{\beta} - c^T\beta_0}{\sqrt{\sigma^2\, c^T (X^T X)^{-1} c}} \cdot \frac{\sqrt{n-p}}{\sqrt{\frac{n-p}{\sigma^2}\hat{\sigma}^2}}
= \frac{\left(c^T\hat{\beta} - c^T\beta_0\right)\sqrt{n-p}}{\sqrt{\sigma^2/\sigma^2}\,\sqrt{c^T (X^T X)^{-1} c}\,\sqrt{n-p}\,\sqrt{\hat{\sigma}^2}}
= \frac{c^T\hat{\beta} - c^T\beta_0}{\sqrt{\hat{\sigma}^2\, c^T (X^T X)^{-1} c}} =: T \qquad (10)
\]
In summary, we have seen that the T-statistic as defined in (2) is, under the null hypothesis as defined in (1), distributed according to a $t$-distribution with $n - p$ degrees of freedom. Figure 1 above visualizes this result for the case of a GLM representing independent and identical sampling from a univariate Gaussian distribution, and Figure 2 visualizes this result for the case of a GLM representing a simple linear regression model.
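This distributional result can also be checked numerically by Monte Carlo simulation. The following sketch (assuming numpy and scipy are available; the settings mirror the Figure 1 scenario, but the seed and sample counts are arbitrary choices) repeatedly samples data under the null hypothesis and compares the resulting T-statistics with the analytical $t(n-p)$ distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# GLM for i.i.d. sampling from a univariate Gaussian: X is a column of ones
n, p = 10, 1
X = np.ones((n, 1))
beta_0 = np.array([2.0])      # null hypothesis value; need not be zero
c = np.array([1.0])
XtX_inv = np.linalg.inv(X.T @ X)

T_vals = []
for _ in range(2000):
    y = X @ beta_0 + rng.standard_normal(n)   # sample under H0 (sigma^2 = 1)
    beta_hat = XtX_inv @ X.T @ y
    sigma2_hat = (y - X @ beta_hat) @ (y - X @ beta_hat) / (n - p)
    T_vals.append((c @ beta_hat - c @ beta_0)
                  / np.sqrt(sigma2_hat * (c @ XtX_inv @ c)))

# Compare the sampling distribution with the analytical t(n - p) = t(9) distribution
ks = stats.kstest(T_vals, stats.t(df=n - p).cdf)
print(ks.pvalue)   # typically large: samples are consistent with t(9)
```

A large Kolmogorov-Smirnov p-value indicates that the simulated T-statistics are consistent with the analytical $t(9)$ distribution, as in the last panels of Figure 1.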
Figure 2 Analytical and sampling distributions for a GLM representing a simple linear regression model. The first panel depicts a simple linear regression model based on the null hypothesis parameter setting $\beta_0 = (-1, 1)^T$ and variance parameter $\sigma^2 = 1$, together with a single realization from this model. The second panel depicts the beta parameter estimators obtained by sampling the model in the first panel 100 times. Note that the first component of these $\hat{\beta}$'s is distributed around $-1$ and the second component is distributed around $1$, in accordance with the null hypothesis parameter setting. The third panel depicts the analytical and empirical distribution of the product $c^T\hat{\beta}$ for $c = (0, 1)^T$. Note that while the distribution of $\hat{\beta}$ can be high-dimensional, the distribution of a product $c^T\hat{\beta}$ is always one-dimensional. The fourth panel depicts the empirical and analytical distributions of the ensuing $\hat{\sigma}^2$ estimates multiplied by $\frac{n-p}{\sigma^2} = 8$. Finally, the last panel depicts the ensuing empirical and analytical distribution of the T-statistic as defined in equation (2) for the current scenario.
(4) The T-statistic and null hypothesis significance testing
In summary, we have shown that the distribution of the T-statistic, as defined in equation (2) of the previous section, under a suitably chosen "null" assumption about the true, but unknown, parameter values $\beta \in \mathbb{R}^p$ is analytically available. In a given experimental context, one usually observes a single data set $y^* \in \mathbb{R}^n$, from which one can compute a single T-statistic value $T^* \in \mathbb{R}$ for a given "null hypothesis" $\beta_0 \in \mathbb{R}^p$. The logic of null hypothesis significance testing is then as follows: if the observed T-statistic value and more extreme values have a very small probability of occurring under the assumed null hypothesis/distribution (for example, a probability for it or more extreme values of less than 0.05), one may infer that it is not very likely that the data that gave rise to the computed T-statistic value was actually generated from a GLM for which the null hypothesis $H_0: \beta = \beta_0 \in \mathbb{R}^p$ holds true. Informally, we may state this as follows: if the observed data (i.e., the data observations proper and all summaries thereof, such as beta and variance estimators, and the T-statistic) is allocated a low probability density value under the assumption that the null hypothesis is true, one may reject the null hypothesis. In general, it is sensible not only to report whether one rejected or did not reject a null hypothesis based on some (arbitrarily chosen) threshold value, but to report the value of the T-statistic itself and the associated probability mass for more extreme values (the so-called "p-value"). Put simply, more information is conveyed by stating, e.g., that a t-test with $n$ degrees of freedom resulted in a T-value of $T = \cdots$ and an associated p-value of $p = \cdots$, than by merely stating that a null hypothesis was rejected because the p-value was below, e.g., 0.05.
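As a minimal illustration of reporting a T-value together with its two-sided p-value (assuming scipy is available; the observed T-value and degrees of freedom below are arbitrary placeholders):

```python
from scipy import stats

# Report a T-value with its two-sided p-value (numbers are arbitrary placeholders)
T_obs, dof = 2.31, 18
p_value = 2 * stats.t.sf(abs(T_obs), df=dof)
print(f"t({dof}) = {T_obs:.2f}, p = {p_value:.4f}")
```

Here `stats.t.sf` gives the probability mass above the observed value, which is doubled for a two-sided test.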
(5) Definition and intuition of the F-Statistic
In this section we first introduce the F-statistic and its associated distribution under a null hypothesis. From a GLM perspective, F-statistics and F-tests correspond to a model comparison procedure for "nested" GLMs, i.e., the comparison of a GLM with another GLM that is nested within the first. In a second part, we relate this formulation of the F-statistic to the perhaps more familiar notion of F-statistics in classical variance partitioning schemes of one-factorial ANOVA designs. The benefit of the first perspective is that it is based on a clearly defined set of modelling assumptions and corresponds to a fairly general view that is easily applied in "general" GLMs, while the second view is formulated ad hoc and readily applicable only in the case of a one-factorial ANOVA design.
In both approaches, various "sums-of-squares" play an important role. We thus first define the so-called "error sum-of-squares" and the "residual sum-of-squares" for a GLM
𝑋𝛽 + 휀 = 𝑦, 𝑝(휀) = 𝑁(휀; 0, 𝜎2𝐼𝑛) ⇔ 𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (1)
where 𝑋 ∈ ℝ𝑛×𝑝, 𝛽 ∈ ℝ𝑝, 𝑛, 𝑝 ∈ ℕ, 𝜎2 > 0, 𝐼𝑛 ∈ ℝ𝑛×𝑛 and random vectors 𝑦, 휀 ∈ ℝ𝑛.
The “error sum-of-squares” is a theoretical quantity that relates to the deviations of the random
variables 𝑦𝑖 from their expectation (𝑋𝛽)𝑖 (i.e. the 𝑖th entry in the 𝑛-dimensional vector 𝑋𝛽 ∈ ℝ𝑛). Of course,
these quantities correspond by definition to the error terms 휀𝑖, 𝑖 = 1,… , 𝑛. Squaring the deviations
휀𝑖 = 𝑦𝑖 − (𝑋𝛽)𝑖 and summing over all 𝑛 deviations yields the “error sum-of-squares”, which can be written
in matrix notation as
\[
\varepsilon^T\varepsilon = (y - X\beta)^T(y - X\beta) = \sum_{i=1}^{n}\left(y_i - (X\beta)_i\right)^2 = \sum_{i=1}^{n}\varepsilon_i^2 \qquad (2)
\]
The "residual sum-of-squares" is a different entity. It refers to the sum of squared deviations between the realized data points $y_i$ and their GLM predictions based on the OLS estimates of the $\beta$ parameters, i.e., the deviations $e_i := y_i - (X\hat{\beta})_i$ for $i = 1, \ldots, n$. Squaring these deviations and summing over all $n$ yields the "residual sum-of-squares", which can be written in matrix notation as

\[
e^T e = (y - X\hat{\beta})^T(y - X\hat{\beta}) = \sum_{i=1}^{n}\left(y_i - (X\hat{\beta})_i\right)^2 = \sum_{i=1}^{n}e_i^2 \qquad (3)
\]
Below, we will be mainly concerned with various forms of “residual sum-of-squares”.
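The distinction between the two quantities can be made concrete numerically. In the following sketch (assuming numpy is available; the design and parameter values are arbitrary), the error sum-of-squares uses the true $\beta$, which is accessible only in simulation, while the residual sum-of-squares uses the OLS estimate; because OLS minimizes the sum of squares, the latter can never exceed the former:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simple linear regression GLM with known true beta (available only in simulation)
n = 50
X = np.column_stack([np.ones(n), np.linspace(0, 1, n)])
beta_true = np.array([1.0, 2.0])
eps = 0.5 * rng.standard_normal(n)
y = X @ beta_true + eps

# Error sum-of-squares: squared deviations from the true expectation X beta
ess = (y - X @ beta_true) @ (y - X @ beta_true)     # equals eps @ eps

# Residual sum-of-squares: squared deviations from the OLS prediction X beta_hat
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
rss = e @ e

# OLS minimizes the sum of squares, so the residual SS never exceeds the error SS
print(rss <= ess)   # → True
```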
The F-Statistic in a GLM context can be motivated from the perspective of likelihood ratio-based
model comparison. To make this transparent, consider again the simple linear regression GLM
𝑋𝛽 + 휀 = 𝑦, 𝑝(휀) = 𝑁(휀; 0, 𝜎2𝐼𝑛) (1)
where $y, \varepsilon \in \mathbb{R}^n$, $X \in \mathbb{R}^{n\times 2}$, $\beta \in \mathbb{R}^2$, $\sigma^2 > 0$, and $I_n$ denotes the $n \times n$ identity matrix, and where the design matrix, parameter vector, and data vector take the form

\[
X := \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \quad
\beta := \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} \quad\text{and}\quad
y := \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \qquad (2)
\]
and where the $x_i$ ($i = 1, \ldots, n$) carry the notion of "independent variables", $\beta_1$ corresponds to the simple linear regression "offset" and $\beta_2$ to the simple linear regression "slope". The central question we would like to address in this section is whether the inclusion of the second column of the design matrix (the $x_i$'s) is beneficial in explaining an observed data set $y \in \mathbb{R}^n$ whose generative process is assumed to be unknown.
To treat the general case from the outset, we consider a GLM with 𝑝 ∈ ℕ design matrix
columns/beta parameters, which we will refer to as the “full model”. We aim for a comparison of models
including the set of all 𝑝 regressors and their associated effect sizes with a model comprising 𝑝1 < 𝑝
regressors, which we will refer to as the “reduced model”. Note that we can partition any design matrix
𝑋 ∈ ℝ𝑛×𝑝 with 𝑝 > 1 into two components
\[
X = (X_1\;\; X_2), \quad\text{where } X_1 \in \mathbb{R}^{n\times p_1},\; X_2 \in \mathbb{R}^{n\times p_2} \text{ and } p_1 + p_2 = p \qquad (3)
\]
and obtain two separate models, if we also partition 𝛽 ∈ ℝ𝑝 accordingly into 𝛽 ≔ (𝛽1, 𝛽2)𝑇. In the simple
linear regression example, the corresponding design matrix partition would be
\[
X_1 := \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \in \mathbb{R}^{n\times p_1}
\quad\text{and}\quad
X_2 := \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \in \mathbb{R}^{n\times p_2},
\quad\text{where } p_1 = p_2 = 1 \text{ and } p_1 + p_2 = 2 \qquad (4)
\]
In words, the central question in this case is whether the variability in the independent variable (the 𝑥𝑖’s) is
actually important to describe the observed data variability, or whether the assumption of independent and
identical sampling from a univariate Gaussian is sufficient to explain it.
The Neyman-Pearson notion of "likelihood-ratio testing" offers a principled means to address this question. In general terms, the idea of likelihood ratio testing is to fit two alternative models $m_1$ and $m_2$ to a given data set using the maximum-likelihood method and then to compute the ratio of the two maximized likelihoods. In other words, one compares the probabilities $p_{m_1}(y^*)$ and $p_{m_2}(y^*)$ of both models to account for the same data $y^*$ under the optimal parameter settings of each model. If the probability of observing the data $y^*$ under, say, the optimized model $m_1$ is much higher than under the optimized model $m_2$, then one concludes that $m_1$ is a better model for the data. Note that because logarithmic transforms turn ratios into differences, and we actually work with the log likelihood function below, we will formulate "likelihood-ratio testing" as "log-likelihood difference testing" in what follows.
To apply the idea of likelihood ratios (= log likelihood differences) in the context of the GLM,
suppose that one estimates the beta parameters of a GLM comprising only the design matrix subset 𝑋1 using
the maximum likelihood principle, where we assume for the moment that $\sigma^2 > 0$ is known. As we saw above, the maximized log likelihood is given by

\[
\max_{\beta_1 \in \mathbb{R}^{p_1}} \ell(\beta_1)
= -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}(y - X_1\hat{\beta}_1)^T(y - X_1\hat{\beta}_1)
= -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}e_1^T e_1 \qquad (5)
\]
where we defined
𝑒1 ≔ 𝑦 − 𝑋1�̂�1 (6)
and $e_1^T e_1$ is the "residual sum-of-squares term of the reduced model". Consider next maximizing the log likelihood function of the full model $X = (X_1\; X_2)$ with all predictors. In this case, the maximized log likelihood is given by

\[
\begin{aligned}
\max_{\beta \in \mathbb{R}^{p}} \ell(\beta)
&= -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}(y - X\hat{\beta})^T(y - X\hat{\beta}) \\
&= -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\left(y - (X_1\; X_2)\begin{pmatrix}\hat{\beta}_1\\ \hat{\beta}_2\end{pmatrix}\right)^T \left(y - (X_1\; X_2)\begin{pmatrix}\hat{\beta}_1\\ \hat{\beta}_2\end{pmatrix}\right) \\
&= -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}e_{12}^T e_{12}
\end{aligned} \qquad (7)
\]
where we defined

\[
e_{12} := y - (X_1\; X_2)\begin{pmatrix}\hat{\beta}_1\\ \hat{\beta}_2\end{pmatrix} \qquad (8)
\]
and refer to $e_{12}^T e_{12}$ as the "residual sum-of-squares term of the full model". Taking the difference of the two maximized log-likelihoods then yields a basic statistic for model comparison according to the likelihood ratio principle:
\[
\begin{aligned}
\Delta\ell &:= \max_{\beta \in \mathbb{R}^p} \ell(\beta) - \max_{\beta_1 \in \mathbb{R}^{p_1}} \ell(\beta_1) \\
&= -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}e_{12}^T e_{12} - \left(-\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}e_1^T e_1\right) \\
&= \frac{1}{2\sigma^2}\left(e_1^T e_1 - e_{12}^T e_{12}\right)
\end{aligned} \qquad (9)
\]
The important thing to note about (9) is that $e_1^T e_1$ corresponds to the residual sum-of-squares resulting from fitting the reduced (smaller) model, comprising fewer regressors, to the data, while $e_{12}^T e_{12}$ corresponds to the residual sum-of-squares resulting from fitting the full model, in which, in addition to the first $p_1$ regressors, the additional $p_2$ regressors are used to model the data.
The log-likelihood difference $\Delta\ell$ is the major building block of the F-statistic, which is defined as

\[
F : \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}^+, \quad (e_1, e_{12}) \mapsto F(e_1, e_{12}) := \frac{(e_1^T e_1 - e_{12}^T e_{12})/p_2}{e_{12}^T e_{12}/(n-p)} \qquad (10)
\]
To obtain an intuition about the definition of the F-statistic, we first consider its numerator. To this end, it is helpful to consider two scenarios: firstly, that the data on which it is based is in fact generated from the reduced model, and secondly, that the data is in fact generated by the full model. If the data is generated from the reduced model, the parameter estimates for the additional regressors of the full model will tend to zero, and the residual errors $e_1$ and $e_{12}$ will both be normally distributed around zero. If the data is generated from the full model, the residual errors $e_1$ resulting from fitting the reduced model will be larger than the residual errors resulting from fitting the full model. The numerator of the F-statistic thus represents the reduction in the residual errors resulting from including the additional $p_2$ regressors, per additional regressor included. If this reduction is small, i.e., the additional regressors do not explain much data variance, the F-value will be small. If this reduction is large, and the full model is a much better model of the data, the F-value will be large. Regardless of the generating model, the denominator of the F-statistic is an estimator of the variance parameter $\sigma^2$: if the data is in fact generated from the reduced model, the beta parameter estimates for the regressors of the full model that are not part of the reduced model will tend to zero, and the residual sum-of-squares is evaluated as for the reduced model. The F-statistic thus measures the reduction of the residual sum-of-squares attributable to the inclusion of the $p_2$ additional regressors with respect to the reduced model, per regressor, normalized by the estimated variance of the data.
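The F-statistic of equation (10) can be evaluated directly from the residual sums-of-squares of the reduced and full models. A minimal sketch (assuming numpy is available; the simple linear regression design and parameter values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Full model: intercept + slope (p = 2); reduced model: intercept only (p1 = 1)
n = 30
x = np.linspace(-1, 1, n)
X1 = np.ones((n, 1))
X = np.column_stack([X1, x])
y = X @ np.array([1.0, 0.8]) + rng.standard_normal(n)

def rss(M, y):
    """Residual sum-of-squares of the OLS fit of y on M."""
    beta_hat = np.linalg.lstsq(M, y, rcond=None)[0]
    e = y - M @ beta_hat
    return e @ e

p = X.shape[1]
p2 = p - X1.shape[1]
e1Te1, e12Te12 = rss(X1, y), rss(X, y)
F = ((e1Te1 - e12Te12) / p2) / (e12Te12 / (n - p))
print(F >= 0.0)   # → True (the full model's residual SS is never larger)
```

A large $F$ value here would indicate that including the slope regressor substantially reduces the residual sum-of-squares relative to the intercept-only model.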
(6) The F-Statistic Null Distribution
As for the T-statistic, one may ask how the F-statistic as defined above is distributed if a "null hypothesis" about the full model holds. To establish this distribution, we first consider the distributions of the F-statistic's numerator and denominator under a null hypothesis, to be defined, separately.

Assume, as in the set-up above with $\varepsilon, y \in \mathbb{R}^n$, that we partition a design matrix $X \in \mathbb{R}^{n\times p}$ in the form $X = (X_1\; X_2)$, where $X_1 \in \mathbb{R}^{n\times p_1}$ and $X_2 \in \mathbb{R}^{n\times p_2}$, and that the correspondingly partitioned beta parameter vector is given by $\beta := (\beta_1, \beta_2)^T \in \mathbb{R}^{p_1+p_2}$. Now, if the null hypothesis
𝐻0: 𝛽2 = 0 ∈ ℝ𝑝2 (1)
is true, the full model
𝑋𝛽 + 휀 = 𝑦 (2)
reduces to the model
𝑋1𝛽1 + 휀 = 𝑦 (3)
As above, let $e_{12}^T e_{12}$ denote the residual sum-of-squares of the full model, and let $e_1^T e_1$ denote the residual sum-of-squares of the reduced model. Under the null assumption $\beta_2 = 0$, sampling from the full model yields a distribution of the variance-parameter-scaled "extra sum-of-squares" $e_1^T e_1 - e_{12}^T e_{12}$ that corresponds to a $\chi^2$-distribution with $p_2$ degrees of freedom

\[
p\left(\frac{e_1^T e_1 - e_{12}^T e_{12}}{\sigma^2}\right) = \chi^2\left(\frac{e_1^T e_1 - e_{12}^T e_{12}}{\sigma^2};\, p_2\right) \qquad (4)
\]
Note that we have a $\chi^2$ distribution for $(e_1^T e_1 - e_{12}^T e_{12})/\sigma^2$, while the actual numerator of the F-statistic is $(e_1^T e_1 - e_{12}^T e_{12})/p_2$. We will return to this below, but first consider the denominator of the F-statistic.
The denominator of the F-statistic, $e_{12}^T e_{12}/(n-p)$, corresponds to the standard variance parameter estimator $\hat{\sigma}^2$ as discussed previously, where we have seen that $\frac{n-p}{\sigma^2}\hat{\sigma}^2$ is distributed according to a chi-squared distribution with $n - p$ degrees of freedom

\[
p\left(\frac{n-p}{\sigma^2}\cdot\frac{e_{12}^T e_{12}}{n-p}\right) = p\left(\frac{e_{12}^T e_{12}}{\sigma^2}\right) = \chi^2\left(\frac{e_{12}^T e_{12}}{\sigma^2};\, n-p\right) \qquad (5)
\]
Taken together, we see that, under the null hypothesis 𝐻0: 𝛽2 ≔ 0 the F-statistic corresponds roughly to a
ratio of chi-squared distributed random variables with 𝑝2 and 𝑛 − 𝑝 degrees of freedom.
The distribution of ratios of chi-squared random variables is called an $f$-distribution, as the reader may now review in the Mathematical Preliminaries section. To use the $f$-distribution as defined therein, we have to reformulate the distributional results (4) and (5) in accordance with it, which will eventually yield the F-statistic as defined in the previous section. Firstly, forming the ratio of the chi-squared distributed random variables above yields
\[
\frac{\left(e_1^T e_1 - e_{12}^T e_{12}\right)/\sigma^2}{e_{12}^T e_{12}/\sigma^2} = \frac{e_1^T e_1 - e_{12}^T e_{12}}{e_{12}^T e_{12}} \qquad (6)
\]
The right-hand side of (6) is a ratio of two chi-squared distributed random variables. In the definition of the $f$-distribution, these are divided by their respective degrees of freedom. Dividing the numerator by $p_2$ and the denominator by $n - p$ then yields the F-statistic in the form (10):
\[
F = \frac{(e_1^T e_1 - e_{12}^T e_{12})/p_2}{e_{12}^T e_{12}/(n-p)} \qquad (7)
\]
In other words, under the null hypothesis $H_0: \beta_2 = 0 \in \mathbb{R}^{p_2}$, the F-statistic as defined in (7) is distributed according to an $f$-distribution with $(p_2, n-p)$ degrees of freedom.
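This null distribution result can again be checked by simulation. The following sketch (assuming numpy and scipy are available; the settings mirror the simple linear regression scenario with $p_2 = 1$ and $n - p = 8$, but the seed and sample count are arbitrary) repeatedly samples data under $H_0: \beta_2 = 0$ and compares the resulting F-statistics with the analytical $f$-distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Simple linear regression under H0: beta_2 = 0, so p2 = 1 and n - p = 8
n, p, p2 = 10, 2, 1
X1 = np.ones((n, 1))
X = np.column_stack([np.ones(n), np.linspace(0, 1, n)])

def rss(M, y):
    """Residual sum-of-squares of the OLS fit of y on M."""
    b = np.linalg.lstsq(M, y, rcond=None)[0]
    r = y - M @ b
    return r @ r

F_vals = []
for _ in range(2000):
    y = 1.0 + rng.standard_normal(n)          # offset only: the null hypothesis holds
    e1, e12 = rss(X1, y), rss(X, y)
    F_vals.append(((e1 - e12) / p2) / (e12 / (n - p)))

# The sampling distribution should match the analytical f(p2, n - p) distribution
ks = stats.kstest(F_vals, stats.f(dfn=p2, dfd=n - p).cdf)
print(ks.pvalue)   # typically large: samples are consistent with f(1, 8)
```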
(7) The F-statistic and null hypothesis significance testing
In summary, if we observe an F-statistic value which under this null distribution has a very small probability of occurring (for example, a probability for it or more extreme values of less than 0.05), we can infer that it is not very likely that the data that gave rise to the computed F-statistic value was actually generated from a GLM for which the null hypothesis $H_0: \beta_2 = 0 \in \mathbb{R}^{p_2}$ holds true. The idea of the F-statistic and its distribution under a null hypothesis is visualized for the case of a simple linear regression model in Figure 1.
Figure 1 This figure visualizes the idea of the null distribution of the F-statistic as defined in (6.28) for the special case of the simple linear regression GLM. Specifically, the full model for simple linear regression corresponds to a design matrix $X := (X_1\; X_2) \in \mathbb{R}^{n\times 2}$, where $X_1 \in \mathbb{R}^{n\times 1}$ is a column of ones, modelling the regression line offset, and $X_2 \in \mathbb{R}^{n\times 1}$ comprises the values of the univariate independent variable, modelling the regression line slope. For the upper left panel, 200 samples were obtained from a simple linear regression model for $n = 10$ data points, for which the true, but unknown, $\beta_1$ parameter was set to $\beta_1 = 1$ and the true, but unknown, $\beta_2$ parameter was set to $\beta_2 = 0$, conforming to the null hypothesis $H_0: \beta_2 = 0$. The lower left panel depicts the empirical distribution (gray histogram bars) of the F-statistic evaluated over the 200 samples. Additionally, it depicts the $f$-distribution probability density function (red line) for $p_2 = 1$ and $n - p = 10 - 2$ degrees of freedom. In correspondence with the theory discussed in section 6.5, the empirical distribution conforms to the analytical $f$-distribution. The right panels depict the case in which the null hypothesis $H_0: \beta_2 = 0$ does not hold. Specifically, here the true, but unknown, beta parameter vector was set to $\beta = (1, 1.1)^T$ and 200 samples were obtained as on the left side. The lower right panel shows the empirical and analytical distribution of the F-statistic. Because the null hypothesis $H_0: \beta_2 = 0$ does not hold in this scenario, the empirical distribution does not conform to the $f$-distribution.
(8) Classical variance partitioning formula and the GLM formulation of the F-statistic
In conventional undergraduate statistics courses, the F-statistic is usually introduced in the context of single-factor ANOVA designs. In this context, the F-statistic refers to the ratio of "between-group variance" (also referred to as "treatment variance") and "within-group variance" (also referred to as "error variance"). The aim of the current section is to link this "traditional ANOVA" view of the F-statistic to the "full and reduced GLM model comparison" view of the previous sections. To this end, we will first review the single-factor ANOVA design with independent measures and the associated variance partitioning approach of classical ANOVA. We will next relate the variance partitioning scheme to the structural form of the corresponding full and reduced GLMs and finally demonstrate the equivalence of both approaches.
Conventional treatments introduce single-factor ANOVA as the extension of a two-sample T-test to more than two groups. This is usually motivated by stating that to assess "significant differences" between, say, three group means, a series of two-sample T-tests would be required (group 1 vs. group 2, group 2 vs. group 3, and group 1 vs. group 3), resulting in a multiple testing procedure and the problem of an increased probability of Type I errors. From this view, the F-test offers an alternative by providing a single test procedure that allows for the assessment of the null hypothesis that the population means of all three groups are in fact identical, or, equivalently, that the three treatment groups were sampled from the same population. The categorical independent variable in the context of ANOVA designs is referred to as a "factor", and the different values that it may assume (for example treatment 1, treatment 2, and treatment 3) are referred to as "levels". Data acquired in a one-factorial design with a balanced scheme, i.e., with equal group sizes, then takes the following form
Group 1      Group 2      ⋯      Group p
y_{11}       y_{21}       ⋯      y_{p1}
y_{12}       y_{22}       ⋯      y_{p2}
y_{13}       y_{23}       ⋯      y_{p3}
⋮            ⋮            ⋱      ⋮
y_{1n_1}     y_{2n_2}     ⋯      y_{pn_p}
Table 1. Data layout of a one-factorial ANOVA design.
In the table, the entry $y_{ij} \in \mathbb{R}$ refers to the data obtained from the $j$th experimental unit ("subject") in the $i$th experimental group ("level of the experimental factor"), where $j = 1, \ldots, n_i$ and $i = 1, \ldots, p$. Note that the subscript indices in Table 1 do not correspond to matrix row and column indices, but are reversed in this respect. $n_i \in \mathbb{N}$ is the number of experimental units in group $i$, and the assumption of a balanced design implies that $n_1 = n_2 = \cdots = n_p$. $p \in \mathbb{N}$ is the number of experimental groups, or levels of the experimental factor, and has to be larger than 1. The total number of data points (or experimental units) is given by $n = \sum_{i=1}^{p} n_i$. In conventional discussions of one-factorial designs it is usually additionally stated that the data points were obtained from "independent experimental units", for example, that each data point represents data acquired from a single human participant who contributed data only in this specific condition and in no other.
The fundamental idea of a one-factorial ANOVA is to assess whether the variability in the observed
data points is largely due to the variability in the independent variable, i.e. the differences between the
levels of the factor, or due to “inherent noise” in the dependent variable. This is achieved by assessing the
relative contributions of “treatment related variance” and “noise-related variance” in a partitioning of the
“overall variance“, which, intuitively, takes the following form
Overall data variance = Treatment-related variance + Noise-related variance (1)
If the ratio of treatment-related variance and noise-related variance

F-value ≈ Treatment-related variance / Noise-related variance (2)

(which roughly corresponds to the F-statistic, as we will see below) is large, one may infer that the treatment had some effect on the observed values, and the null hypothesis that the treatment had no effect on the observed data may be rejected. We next formalize these intuitions by introducing a number of quantities that can be computed from data represented as in Table 1.
Firstly, we can compute the average of all data points included in the table, that is, an "overall mean" or "grand mean"

\[
\bar{y} = \frac{1}{n}\sum_{i=1}^{p}\sum_{j=1}^{n_i} y_{ij} \qquad (3)
\]
Of course, this quantity represents the average of all data points over all treatment levels. Likewise, we can compute $p$ treatment-level-specific averages or "group means"

\[
\bar{y}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij} \quad\text{for } i = 1, 2, \ldots, p \qquad (4)
\]
Note that there are 𝑝 group means, which reflect the average of all data points corresponding to a given
treatment level.
Secondly, we can use sums of squared differences between averages and individual data points to assess the variability about the means introduced above. To this end, we first define the sum of squared deviations from the grand mean, a quantity referred to as the "total sum-of-squares"

\[
SS_{Total} := \sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}\right)^2 \qquad (5)
\]
We next define the sum of squared deviations of the individual group means from the overall mean, weighted by the number of data points in each group, a quantity referred to as the "between-groups sum-of-squares" or "treatment sum-of-squares"

\[
SS_{between} := \sum_{i=1}^{p} n_i\left(\bar{y}_i - \bar{y}\right)^2 \qquad (6)
\]
Finally, we can define the sum of squared deviations of all individual data points from their respective group means, a quantity referred to as the "within-groups sum-of-squares" or "error sum-of-squares"

\[
SS_{within} := \sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2 \qquad (7)
\]
Based on these definitions, it can be shown that the total sum-of-squares can be written as the sum of the between- and within-group sums-of-squares, i.e.

\[
SS_{Total} = SS_{between} + SS_{within} \qquad (8)
\]

or, in other words, that

\[
\sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}\right)^2 = \sum_{i=1}^{p} n_i\left(\bar{y}_i - \bar{y}\right)^2 + \sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2 \qquad (9)
\]
Proof of equations (8) and (9)

We have

\[
\begin{aligned}
\sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}\right)^2
&= \sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i + \bar{y}_i - \bar{y}\right)^2 \\
&= \sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(\left(y_{ij} - \bar{y}_i\right) + \left(\bar{y}_i - \bar{y}\right)\right)^2 \\
&= \sum_{i=1}^{p}\left(\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2 + \sum_{j=1}^{n_i} 2\left(y_{ij} - \bar{y}_i\right)\left(\bar{y}_i - \bar{y}\right) + \sum_{j=1}^{n_i}\left(\bar{y}_i - \bar{y}\right)^2\right) \\
&= \sum_{i=1}^{p}\left(\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2 + 2\left(\bar{y}_i - \bar{y}\right)\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right) + n_i\left(\bar{y}_i - \bar{y}\right)^2\right) \\
&= \sum_{i=1}^{p}\left(\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2 + 2\left(\bar{y}_i - \bar{y}\right)\left(\sum_{j=1}^{n_i} y_{ij} - \sum_{j=1}^{n_i} y_{ij}\right) + n_i\left(\bar{y}_i - \bar{y}\right)^2\right) \\
&= \sum_{i=1}^{p}\left(\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2 + 0 + n_i\left(\bar{y}_i - \bar{y}\right)^2\right) \\
&= \sum_{i=1}^{p} n_i\left(\bar{y}_i - \bar{y}\right)^2 + \sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2
\end{aligned}
\]

where the cross term vanishes because $\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right) = \sum_{j=1}^{n_i} y_{ij} - n_i\bar{y}_i = 0$. With the definitions (5) – (7) above, we thus have

\[
SS_{Total} = SS_{between} + SS_{within}
\]

□
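The partition of equation (8) holds exactly for any data set, which can also be verified numerically. A minimal sketch (assuming numpy is available; the balanced layout and group offsets are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical balanced one-factorial layout: p groups with n_i observations each
p, n_i = 3, 8
y = rng.standard_normal((p, n_i)) + np.array([[0.0], [1.0], [2.0]])  # group offsets

grand_mean = y.mean()
group_means = y.mean(axis=1)

ss_total = ((y - grand_mean) ** 2).sum()
ss_between = (n_i * (group_means - grand_mean) ** 2).sum()
ss_within = ((y - group_means[:, None]) ** 2).sum()

# The variance partition SS_Total = SS_between + SS_within holds exactly
print(np.isclose(ss_total, ss_between + ss_within))   # → True
```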
In order to define the F-statistic in this classical treatment of one-factorial ANOVA, the additional concepts of "degrees of freedom" and "mean squares" are introduced. In this context, the notion of degrees of freedom refers to the number of "independent data points" that remain once a mean over a group of data points has been computed. Specifically, for the case of the grand mean and its associated total sum-of-squares, we have $n$ values and, if the grand mean is known, $n - 1$ free choices for the data points. In other words, the number of degrees of freedom of $SS_{Total}$ is $df_{total} = n - 1$. Likewise, for the case of the grand mean, the group means, and the associated between-group sum-of-squares: if the grand mean is known, $p - 1$ of the group means may be chosen freely. The number of degrees of freedom of $SS_{between}$ is thus $df_{between} = p - 1$. Finally, for the case of the within-group sum-of-squares, we consider the $p$ group means and the individual data points. For each group data set, we have $n_i - 1$ degrees of freedom if the group mean is known. Because we have $p$ groups, the total degrees of freedom in this within sum-of-squares scenario are thus $df_{within} = p(n_i - 1) = n - p$. In summary, we obtain for the degrees of freedom of the sums-of-squares introduced above

\[
df_{total} = df_{between} + df_{within} \;\Leftrightarrow\; n - 1 = (p - 1) + (n - p) \qquad (10)
\]
Division of the sums-of-squares terms by their respective degrees of freedom then yields estimators for the total, between-group, and within-group variances, referred to as "mean squares"

\[
MS_{Total} = \frac{\sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}\right)^2}{n-1}, \quad
MS_{Between} = \frac{\sum_{i=1}^{p} n_i\left(\bar{y}_i - \bar{y}\right)^2}{p-1} \quad\text{and}\quad
MS_{Within} = \frac{\sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2}{n-p} \qquad (11)
\]
Finally, the F-statistic is introduced as the ratio between the "between-group mean square" and the "within-group mean square"

\[
F = \frac{MS_{Between}}{MS_{Within}} = \frac{\sum_{i=1}^{p} n_i\left(\bar{y}_i - \bar{y}\right)^2 / (p-1)}{\sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2 / (n-p)} \qquad (12)
\]
and is declared to be distributed according to an 𝑓-distribution with 𝑝 − 1, 𝑛 − 𝑝 degrees of freedom.
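The classical variance partitioning formula (12) is what standard ANOVA routines compute. A sketch (assuming numpy and scipy are available; the group data are arbitrary simulated values) compares a manual evaluation of (12) with scipy's one-way ANOVA:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Hypothetical balanced one-factorial data: p = 3 groups with 10 units each
groups = [rng.normal(loc=m, size=10) for m in (0.0, 0.5, 1.0)]
p = len(groups)
n = sum(len(g) for g in groups)

# Classical variance partitioning: mean squares as in equation (11)
grand_mean = np.mean(np.concatenate(groups))
ms_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (p - 1)
ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - p)
F_manual = ms_between / ms_within

# scipy's one-way ANOVA implements the same computation
F_scipy, p_value = stats.f_oneway(*groups)
print(np.isclose(F_manual, F_scipy))   # → True
```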
This treatment of one-factorial ANOVA is somewhat ad hoc. Specifically, an ANOVA data table is introduced without specifying the random variables of which the data are to be considered realizations, yet, at the very end, the distribution of the derived $F$ statistic is presented as known and given by a well-defined probability density function. Likewise, the definition of the three sums-of-squares and three mean squares does not feel well motivated, as other definitions may appear equally appropriate. That the computational scheme discussed above is nevertheless a sensible approach can be established by viewing the one-factorial ANOVA as a probabilistic model, namely a specific GLM, from the outset. In the following, we will establish this model and various aspects of it, and finally relate full and reduced versions of this model to the quantities evaluated in the previous section.
Assume again that an experimental design has resulted in a data layout as in Table 1. However, in
contrast to the previous section, now assume that the data points are realizations of independent univariate
Gaussian variables, such that
𝑝(𝑦1𝑗) = 𝑁(𝑦1𝑗; 𝜇1, 𝜎2) for 𝑗 = 1,… , 𝑛1 (13)
𝑝(𝑦2𝑗) = 𝑁(𝑦2𝑗; 𝜇2, 𝜎2) where 𝜇2 = 𝜇1 + 𝛼2 for 𝑗 = 1,… , 𝑛2 (14)
⋯
𝑝(𝑦𝑝𝑗) = 𝑁(𝑦𝑝𝑗; 𝜇𝑝, 𝜎2) where 𝜇𝑝 = 𝜇1 + 𝛼𝑝 for 𝑗 = 1,… , 𝑛𝑝 (15)
where $\mu_1, \alpha_2, \ldots, \alpha_p \in \mathbb{R}$ and $\sigma^2 > 0$. In other words, the data points in the first group of the ANOVA table are assumed to be realizations of independent Gaussian random variables, all with the same expectation parameter $\mu_1$ and the same variance parameter $\sigma^2$. The data points in the $i$th group, for $i = 2, \ldots, p$, are assumed to be realizations of independent Gaussian random variables with expectation parameter given by $\mu_i := \mu_1 + \alpha_i$, where $\mu_1$ corresponds to the expectation parameter of the first group, while $\alpha_i = \mu_i - \mu_1$ refers to a treatment-level-specific additional "effect", and with the same variance parameter $\sigma^2$ as for the first group. Notably, the null hypothesis that all data points are realizations from the same population corresponds to the assumption that $\alpha_2 = \alpha_3 = \cdots = \alpha_p = 0$. Further, there is a single variance parameter $\sigma^2$ that governs the variability of each random variable. Of course, the above may equivalently be written in structural form as
𝑦1𝑗 = 𝜇1 + 𝜀1𝑗 with 𝑝(𝜀1𝑗) = 𝑁(𝜀1𝑗; 0, 𝜎2) for 𝑗 = 1, …, 𝑛1 (16)
and
𝑦𝑖𝑗 = 𝜇1 + 𝛼𝑖 + 𝜀𝑖𝑗 with 𝑝(𝜀𝑖𝑗) = 𝑁(𝜀𝑖𝑗; 0, 𝜎2) for 𝑗 = 1, …, 𝑛𝑖 and 𝑖 = 2, …, 𝑝 (17)
As always, the 𝑛 univariate independent Gaussian random variables can be formulated as the univariate
projections of an 𝑛-dimensional Gaussian probability distribution (i.e. a GLM) by defining an appropriate
data vector, design matrix, beta parameter vector, and spherical covariance matrix. Specifically, the
equivalent GLM representation of the above is given by defining
𝑝(𝑦) ≔ 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (18)
where 𝑛 = ∑_{𝑖=1}^{𝑝} 𝑛𝑖, 𝜎2 > 0 and
𝑦 ≔ (𝑦11, …, 𝑦1𝑛1, 𝑦21, …, 𝑦2𝑛2, …, 𝑦𝑝1, …, 𝑦𝑝𝑛𝑝)^𝑇 ∈ ℝ𝑛,
𝑋 ≔
( 1 0 ⋯ 0 ) (𝑛1 identical rows)
( 1 1 ⋯ 0 ) (𝑛2 identical rows)
( ⋮ ⋮ ⋱ ⋮ )
( 1 0 ⋯ 1 ) (𝑛𝑝 identical rows)
∈ ℝ𝑛×𝑝, 𝛽 ≔ (𝜇1, 𝛼2, …, 𝛼𝑝)^𝑇 ∈ ℝ𝑝 (19)
We now consider partitioning the GLM above according to
𝑋 = (𝑋1 𝑋2) ∈ ℝ𝑛×𝑝, where 𝑋1 ∈ ℝ𝑛×1 and 𝑋2 ∈ ℝ𝑛×(𝑝−1) (20)
and
𝛽 = (𝛽1, 𝛽2^𝑇)^𝑇, where 𝛽1 ≔ 𝜇1 ∈ ℝ and 𝛽2 ≔ (𝛼2, 𝛼3, …, 𝛼𝑝)^𝑇 ∈ ℝ𝑝−1 (21)
For this partitioning, the “reduced model” corresponds to
𝑝(𝑦) ≔ 𝑁(𝑦; 𝑋1𝛽1, 𝜎2𝐼𝑛) with 𝑦 ∈ ℝ𝑛 as in (19), 𝑋1 ≔ (1, …, 1)^𝑇 ∈ ℝ𝑛×1, 𝛽1 ≔ 𝜇 ∈ ℝ (22)
or, equivalently, in structural form to
𝑦𝑖𝑗 = 𝜇 + 𝜀𝑖𝑗 with 𝑝(𝜀𝑖𝑗) = 𝑁(𝜀𝑖𝑗; 0, 𝜎2) (𝑖 = 1, …, 𝑝, 𝑗 = 1, …, 𝑛𝑖) (23)
while the full model merely corresponds to the equivalent forms of equations (13) – (19).
Next, we consider the OLS 𝛽 estimators and the residual sums-of-squares of these reduced and full
models, respectively. It is straightforward to show that the OLS 𝛽 estimator for the reduced model is given
by the average over all data points, i.e.
𝛽̂1 = (1/𝑛) ∑_{𝑖=1}^{𝑝} ∑_{𝑗=1}^{𝑛𝑖} 𝑦𝑖𝑗 =: ȳ (24)
Further, the residual sum-of-squares is given by
𝑒1^𝑇𝑒1 ≔ (𝑦 − 𝑋1𝛽̂1)^𝑇(𝑦 − 𝑋1𝛽̂1) = ∑_{𝑖=1}^{𝑝} ∑_{𝑗=1}^{𝑛𝑖} (𝑦𝑖𝑗 − ȳ)² (25)
Notably, this quantity is identical to the total sum-of-squares as defined in equation (5). We thus have
𝑆𝑆Total = 𝑒1^𝑇𝑒1 (26)
or, in other words, the equality of the total sum-of-squares as defined in the classical variance partitioning
of the one-factorial ANOVA and the “residual sum-of-squares” of the reduced model in the GLM model
partitioning scheme of a one-factorial ANOVA GLM.
For the full model, the 𝛽 parameter estimator takes the form
𝛽̂ = (𝛽̂1, 𝛽̂2, …, 𝛽̂𝑝)^𝑇 = (𝜇̂1, 𝜇̂2 − 𝜇̂1, …, 𝜇̂𝑝 − 𝜇̂1)^𝑇
= ((1/𝑛1) ∑_{𝑗=1}^{𝑛1} 𝑦1𝑗, (1/𝑛2) ∑_{𝑗=1}^{𝑛2} 𝑦2𝑗 − (1/𝑛1) ∑_{𝑗=1}^{𝑛1} 𝑦1𝑗, …, (1/𝑛𝑝) ∑_{𝑗=1}^{𝑛𝑝} 𝑦𝑝𝑗 − (1/𝑛1) ∑_{𝑗=1}^{𝑛1} 𝑦1𝑗)^𝑇
=: (ȳ1, ȳ2 − ȳ1, …, ȳ𝑝 − ȳ1)^𝑇 (27)
Further, the residual error vector corresponds to
𝑒12 = 𝑦 − 𝑋𝛽̂
= (𝑦11 − ȳ1, …, 𝑦1𝑛1 − ȳ1, 𝑦21 − (ȳ1 + ȳ2 − ȳ1), …, 𝑦2𝑛2 − (ȳ1 + ȳ2 − ȳ1), …, 𝑦𝑝1 − (ȳ1 + ȳ𝑝 − ȳ1), …, 𝑦𝑝𝑛𝑝 − (ȳ1 + ȳ𝑝 − ȳ1))^𝑇
= (𝑦11 − ȳ1, …, 𝑦1𝑛1 − ȳ1, 𝑦21 − ȳ2, …, 𝑦2𝑛2 − ȳ2, …, 𝑦𝑝1 − ȳ𝑝, …, 𝑦𝑝𝑛𝑝 − ȳ𝑝)^𝑇 (28)
such that
𝑒12^𝑇𝑒12 = ∑_{𝑖=1}^{𝑝} ∑_{𝑗=1}^{𝑛𝑖} (𝑦𝑖𝑗 − ȳ𝑖)² (29)
Notably, this is identical to the within sum-of-squares as defined in equation (7). We thus have
𝑆𝑆Within = 𝑒12^𝑇𝑒12 (30)
Further, it follows by means of the equality (8), that
𝑆𝑆Total = 𝑆𝑆Between + 𝑆𝑆Within ⇔ 𝑆𝑆Between = 𝑆𝑆Total − 𝑆𝑆Within = 𝑒1^𝑇𝑒1 − 𝑒12^𝑇𝑒12 (31)
With 𝑝2 ≔ 𝑝 − 1 in the current scenario and
𝐹 = 𝑀𝑆Between/𝑀𝑆Within = (𝑆𝑆Between/(𝑝 − 1)) / (𝑆𝑆Within/(𝑛 − 𝑝)) = ((𝑒1^𝑇𝑒1 − 𝑒12^𝑇𝑒12)/𝑝2) / (𝑒12^𝑇𝑒12/(𝑛 − 𝑝)) (32)
we thus have the equivalence of the F-statistic as formulated in the classical variance partitioning scheme of
a one-factorial ANOVA design and the F-statistic as formulated as the comparison of a full and a reduced
model in the context of the equivalent one-factorial ANOVA GLM formulation.
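This equivalence can also be checked numerically. The following Python sketch is a minimal illustration, assuming numpy is available; the group sizes, group means, and random seed are arbitrary illustrative choices. It computes the F-statistic once via the classical variance partitioning as in equation (12), and once via the residual sums-of-squares of the reduced and full GLM as in equation (32):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-factor design: p = 3 groups with unequal group sizes n_i
ns = [4, 5, 6]
p, n = len(ns), sum(ns)
y = np.concatenate([rng.normal(mu, 1.0, size=ni)
                    for mu, ni in zip([1.0, 2.0, 0.5], ns)])

# Classical variance partitioning, equation (12)
ybar = y.mean()
groups = np.split(y, np.cumsum(ns)[:-1])
ss_between = sum(ni * (g.mean() - ybar) ** 2 for g, ni in zip(groups, ns))
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
F_classical = (ss_between / (p - 1)) / (ss_within / (n - p))

# Full design matrix as in equation (19) and reduced design matrix X1
X = np.zeros((n, p))
X[:, 0] = 1.0                         # constant first column
row = ns[0]
for i, ni in enumerate(ns[1:], start=1):
    X[row:row + ni, i] = 1.0          # indicator column for group i + 1
    row += ni
X1 = np.ones((n, 1))

def rss(D, y):
    """Residual sum-of-squares of the OLS fit for design matrix D."""
    e = y - D @ np.linalg.lstsq(D, y, rcond=None)[0]
    return e @ e

# Full-versus-reduced model comparison, equation (32)
e1, e12 = rss(X1, y), rss(X, y)
F_glm = ((e1 - e12) / (p - 1)) / (e12 / (n - p))

print(np.isclose(F_classical, F_glm))
```

Both formulations yield the same value, since 𝑒1^𝑇𝑒1 is the total sum-of-squares and 𝑒12^𝑇𝑒12 the within sum-of-squares.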
Study Questions
1. Discuss commonalities and differences between Fisher’s significance testing and Neyman-Pearson’s hypothesis testing
frameworks.
2. Write down the formal definition of the (non-extended) T-Statistic and discuss its intuitive meaning.
3. The t-distribution is defined as the distribution of the random variable 𝑡 ≔ 𝑋/√(𝑌/𝑛). What do 𝑋, 𝑌 and 𝑛 refer to, and how are these entities related to the GLM framework?
4. Explain how the T-statistic can be used to test a null hypothesis.
5. Write down the formal definition of the F-Statistic and discuss its intuitive meaning.
6. The f-distribution is defined as the distribution of the random variable 𝑋 ≔ (𝑌1/𝑚)/(𝑌2/𝑛). What do 𝑌1, 𝑌2, 𝑚 and 𝑛 refer to, and how are these entities related to the GLM framework?
Study Questions Answers
1. Both Fisher’s and Neyman-Pearson’s frameworks are rooted in frequentist assumptions about distributions of data and ensuing
statistics given a true, but unknown, parameterized probabilistic model. Fisher’s framework requires the specification of a null
hypothesis (not necessarily a nil hypothesis in the sense of “no effect”) and uses the probability of observed data under this null
hypothesis to decide whether sufficient evidence against the null hypothesis has been gathered. In contrast, Neyman-Pearson’s
framework requires the explicit specification of two hypotheses and a commitment to acceptable Type I and Type II error rates, which
together define a decision criterion.
2. The T-Statistic in its “simple form” is given by
𝑇: ℝ𝑝 → ℝ, 𝛽̂ ↦ 𝑇(𝛽̂) ≔ 𝑐^𝑇𝛽̂ / √(𝜎̂2 𝑐^𝑇(𝑋^𝑇𝑋)^−1𝑐).
Intuitively, the numerator of the T-statistic, 𝑐^𝑇𝛽̂, is a measure of the effect size encoded in a linear combination of the beta parameter
estimates 𝛽̂ ∈ ℝ𝑝, where the type of the linear combination (for example, the selection of a specific subcomponent of 𝛽̂) is encoded
by 𝑐 ∈ ℝ𝑝. The denominator √(𝜎̂2 𝑐^𝑇(𝑋^𝑇𝑋)^−1𝑐) is a measure of the variance associated with the beta parameter estimate 𝛽̂, which scales with the estimated GLM variance parameter 𝜎̂2. Intuitively, the T-statistic is thus a ratio between an effect size and its variance. The larger the estimated effect in comparison to its estimated variance, the more “reliable” the effect may be considered.
3. In the definition of the random variable 𝑡 ≔ 𝑋/√(𝑌/𝑛), 𝑋 and 𝑌 refer to two independent scalar random variables. Specifically, 𝑋
refers to a random variable distributed according to a standard normal distribution 𝑁(𝑋; 0, 1), whereas 𝑌 refers to a random variable distributed according to a chi-square distribution with 𝑛 degrees of freedom, 𝜒2(𝑌; 𝑛). With respect to the GLM, 𝑋 corresponds roughly to a standardized parameter estimator contrast, 𝑌 roughly to an estimated variance parameter, and 𝑛 to the number of data points.
4. In a given experimental context, one usually observes a single data set 𝑦 ∈ ℝ𝑛 from which one can compute a single T-statistic
value 𝑇 ∈ ℝ for a given “null hypothesis” 𝛽0 ∈ ℝ𝑝. The logic of null hypothesis significance testing is then as follows: if the observed
T-statistic value and more extreme values have a very small probability of occurring under the assumed null hypothesis/distribution
(for example, a probability of less than 0.05 for it or more extreme values), one may infer that it is not very likely that the data that gave
rise to the computed T-statistic value was actually generated from a GLM for which the null hypothesis 𝐻0: 𝛽 = 𝛽0 ∈ ℝ𝑝 holds true,
and one would “reject the null hypothesis”.
5. The formal definition of the F-statistic is given by
𝐹: ℝ𝑛 × ℝ𝑛 → ℝ, (𝑒1, 𝑒12) ↦ 𝐹(𝑒1, 𝑒12) ≔ ((𝑒1^𝑇𝑒1 − 𝑒12^𝑇𝑒12)/𝑝2) / (𝑒12^𝑇𝑒12/(𝑛 − 𝑝))
where 𝑒1^𝑇𝑒1 corresponds to the residual sum-of-squares obtained under a reduced GLM “nested” within a complete GLM with
residual sum-of-squares 𝑒12^𝑇𝑒12, 𝑛 corresponds to the number of data points, 𝑝 to the number of parameters in the full GLM, and 𝑝2
to the number of regressors/parameters that are added to the reduced model to obtain the full GLM. If the reduction in the residual sum-of-squares afforded by the 𝑝2 additional regressors is small compared to the residual sum-of-squares achieved under the full model, the added regressors (compared to the reduced model) are not very valuable. In other words: an F-statistic value of 0 indicates that the residual sums-of-squares of both the reduced and the full model are identical, and thus, that the additional regressors of the full model do not contribute much to the explanation of the observed data.
6. In the F-statistic variable 𝑋 ≔ (𝑌1/𝑚)/(𝑌2/𝑛), 𝑌1 refers to a chi-squared distributed random variable with 𝑚 degrees of freedom and 𝑌2 refers to a chi-squared distributed random variable with 𝑛 degrees of freedom. Within the GLM framework, 𝑌1 takes on the role of the difference in residual sums-of-squares between a reduced and a full model and 𝑚 the role of the additional degrees of freedom, i.e., parameters incorporated in the full model with respect to the reduced model. 𝑌2 takes on the role of the full model’s residual sum-of-squares and 𝑛 the degrees of freedom of the full model.
Bayesian Estimation
(1) Model Formulation
In this section we apply the Bayesian paradigm to the GLM. Specifically, we derive a posterior
distribution 𝑝(𝛽|𝑦) for the beta parameter and an expression for the evidence (marginal probability) 𝑝(𝑦) in
the Bayesian context. These derivations, in the current section, are conditional on two assumptions. Firstly,
we assume that we know the variance parameter of the GLM. Secondly, we assume that the marginal (prior)
distribution of the beta parameter is given by a multivariate Gaussian distribution. Notably, in contrast to the
classical estimation of the beta parameter by means of parameter point estimation, we treat 𝛽 as an
unobserved random variable. We thus rewrite our familiar form of the GLM data distribution
𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛), where 𝑦 ∈ ℝ𝑛, 𝑋 ∈ ℝ𝑛×𝑝, 𝛽 ∈ ℝ𝑝, 𝜎2 > 0 (1)
as a conditional distribution
𝑝(𝑦|𝛽) = 𝑁(𝑦; 𝑋𝛽, 𝜎𝑦|𝛽^2 𝐼𝑛), where 𝑦 ∈ ℝ𝑛, 𝑋 ∈ ℝ𝑛×𝑝, 𝛽 ∈ ℝ𝑝, 𝜎𝑦|𝛽^2 > 0 (2)
Note that from now on we use 𝜎𝑦|𝛽^2 instead of 𝜎2 to denote the variance parameter of this distribution, the
reason for which will become clear immediately. To simplify the notation, we will write Σ𝑦|𝛽 ≔ 𝜎𝑦|𝛽^2 𝐼𝑛, and
note that the matrix Σ𝑦|𝛽 ∈ ℝ𝑛×𝑛 is positive-definite. The assumption of a Gaussian prior distribution for the
beta parameter is formalized by defining
𝑝(𝛽) ≔ 𝑁(𝛽; 𝜇𝛽, Σ𝛽), where 𝛽 ∈ ℝ𝑝, 𝜇𝛽 ∈ ℝ𝑝, Σ𝛽 ∈ ℝ𝑝×𝑝 positive-definite (3)
Note that by specifying the
“likelihood” 𝑝(𝑦|𝛽) and the “prior” 𝑝(𝛽), we implicitly define a joint distribution (or “generative model”)
over both the data 𝑦 and the parameters 𝛽 given by
𝑝(𝑦, 𝛽) = 𝑝(𝑦|𝛽)𝑝(𝛽) (4)
We will further investigate this joint distribution below. For the moment we state without proof that the
posterior distribution over 𝛽 given the data 𝑦 is again a Gaussian distribution, the parameters of which we
denote by 𝜇𝛽|𝑦 ∈ ℝ𝑝 and Σ𝛽|𝑦 ∈ ℝ𝑝×𝑝, such that we can write
𝑝(𝛽|𝑦) = 𝑁(𝛽; 𝜇𝛽|𝑦, Σ𝛽|𝑦) (5)
Likewise, we state without proof that the marginal distribution of the data 𝑦 is a Gaussian distribution, the
parameters of which we denote by 𝜇𝑦 ∈ ℝ𝑛 and Σ𝑦 ∈ ℝ𝑛×𝑛, such that we can write
𝑝(𝑦) = 𝑁(𝑦; 𝜇𝑦, Σ𝑦) (6)
Our principal aim in this section is thus to derive equations for the parameters 𝜇𝛽|𝑦, Σ𝛽|𝑦, 𝜇𝑦 and Σ𝑦 in terms
of the design matrix 𝑋, the likelihood variance parameter 𝜎𝑦|𝛽^2, and the prior parameters 𝜇𝛽 and Σ𝛽. This
endeavor is an example of a “parametric Bayesian approach with conjugate priors”. Intuitively, this means
that all distributions involved are determined by parameters (i.e. we only need to specify the expectation
parameter and covariance parameter for all distributions involved), and the prior distribution and posterior
distribution are of the same functional type, i.e. both are Gaussians, such that the posterior parameters can
be determined in terms of “parameter update equations”. This is not the only conceivable case for applying
the Bayesian paradigm, but with respect to the GLM, a very important one.
(2) Bayesian estimation of the beta parameters
As a first step, we evaluate the functional form and parameters of the joint distribution of 𝑦 and 𝛽.
Because both distributions over the observed random variable 𝑦 and the unobserved random variable 𝛽 are
Gaussian distributions, we can immediately apply the results for the parameters of joint Gaussian
distributions specified in terms of a Gaussian marginal and a Gaussian conditional distribution derived
above. For the current scenario, we have the Gaussian marginal distribution
𝑝(𝛽) = 𝑁(𝛽; 𝜇𝛽, Σ𝛽), where 𝛽, 𝜇𝛽 ∈ ℝ𝑝, Σ𝛽 ∈ ℝ𝑝×𝑝 and positive-definite (1)
and the conditional distribution over the GLM data vector 𝑦 ∈ ℝ𝑛 given 𝛽 ∈ ℝ𝑝
𝑝(𝑦|𝛽) = 𝑁(𝑦; 𝑋𝛽, Σ𝑦|𝛽) where 𝑦 ∈ ℝ𝑛, 𝑋 ∈ ℝ𝑛×𝑝, Σ𝑦|𝛽 ∈ ℝ𝑛×𝑛 and positive-definite (2)
From the Gaussian joint distribution and conditioning theorem we can read off that the joint distribution of
the unobserved random variable 𝛽 and the observed random variable 𝑦 is a Gaussian distribution
𝑝(𝑦, 𝛽) = 𝑁((𝑦, 𝛽)^𝑇; 𝜇𝑦,𝛽, Σ𝑦,𝛽) (3)
where
𝜇𝑦,𝛽 = (𝑋𝜇𝛽, 𝜇𝛽)^𝑇 ∈ ℝ𝑛+𝑝 (4)
and
Σ𝑦,𝛽 = ( Σ𝑦|𝛽 + 𝑋Σ𝛽𝑋^𝑇   𝑋Σ𝛽
         Σ𝛽𝑋^𝑇           Σ𝛽 ) ∈ ℝ(𝑛+𝑝)×(𝑛+𝑝) (5)
The formula for the expectation parameter of the joint distribution (4) reveals that the expectation for the 𝛽
subpart of the random vector (𝑦, 𝛽)^𝑇 is identical to the prior expectation of 𝛽, while the expectation for
the 𝑦 subpart corresponds to the projection of the prior expectation of 𝛽 into the data space by means of
the design matrix 𝑋. The formula for the covariance matrix parameter of the joint distribution (5) reveals
that the “variance” parameter of the 𝛽 subpart of the random vector (𝑦, 𝛽)^𝑇 is identical to the prior
“variance” parameter of 𝛽. The “variance” parameter of the 𝑦 subpart on the other hand is given as the sum
of the conditional variance Σ𝑦|𝛽 and the marginal variance Σ𝛽 projected into the data space by means of the
design matrix. In other words, in the Bayesian conjugate prior scenario, the data covariance matrix is
affected by the parameter prior covariance – higher prior variance thus implies a higher data variance.
Finally, the covariance of the subparts 𝑦 and 𝛽 of the random vector (𝑦, 𝛽)^𝑇 results from an interaction
between the design matrix and the prior covariance parameter of 𝛽, but is unaffected by the data-conditional
covariance Σ𝑦|𝛽.
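The marginal moment formulas in (4) and (5) lend themselves to a quick Monte Carlo sanity check via ancestral sampling. The sketch below is a minimal illustration, assuming numpy; the two-point design, prior values, and seed are hypothetical choices. It samples 𝛽 from 𝑝(𝛽), then 𝑦 from 𝑝(𝑦|𝛽), and compares the empirical moments of 𝑦 with 𝑋𝜇𝛽 and Σ𝑦|𝛽 + 𝑋Σ𝛽𝑋^𝑇:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical minimal GLM: n = 2 data points, p = 1 beta parameter
X = np.array([[1.0], [2.0]])
mu_b, s2_b = 0.5, 2.0          # prior expectation and variance of beta
s2_y = 0.25                    # conditional data variance sigma^2_{y|beta}

# Ancestral sampling: beta ~ p(beta), then y ~ p(y | beta)
S = 200_000
betas = rng.normal(mu_b, np.sqrt(s2_b), size=S)
ys = betas[:, None] * X.T + rng.normal(0.0, np.sqrt(s2_y), size=(S, 2))

# Marginal moments of y predicted by equations (4) and (5)
mu_y_theory = (X * mu_b).ravel()                      # X mu_beta
Sigma_y_theory = s2_y * np.eye(2) + s2_b * (X @ X.T)  # Sigma_y|b + X Sigma_b X^T

print(np.allclose(ys.mean(axis=0), mu_y_theory, atol=0.05))
print(np.allclose(np.cov(ys.T), Sigma_y_theory, atol=0.1))
```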
Because we have derived the joint distribution 𝑝(𝑦, 𝛽) from the specifications of the prior
distribution 𝑝(𝛽) and the conditional distribution 𝑝(𝑦|𝛽) above, we can immediately apply the Gaussian
joint distribution and conditioning theorem to the parameters of this joint distribution in order to derive the
parameters of the posterior distribution 𝑝(𝛽|𝑦) and the marginal distribution 𝑝(𝑦).
For the conditional distribution, we obtain
𝑝(𝛽|𝑦) ≔ 𝑁(𝛽; 𝜇𝛽|𝑦, Σ𝛽|𝑦) (6)
where
Σ𝛽|𝑦 ≔ (Σ𝛽^−1 + 𝑋^𝑇Σ𝑦|𝛽^−1 𝑋)^−1 ∈ ℝ𝑝×𝑝 𝑝.𝑑. (7)
and
𝜇𝛽|𝑦 ≔ Σ𝛽|𝑦(Σ𝛽^−1𝜇𝛽 + 𝑋^𝑇Σ𝑦|𝛽^−1 𝑦) ∈ ℝ𝑝 (8)
We see that the posterior covariance matrix of 𝛽 results from a mixture of the prior covariance matrix and
the conditional data covariance filtered by the design matrix. Notably, the posterior covariance matrix of the
parameter 𝛽 is unaffected by the data. In other words, whatever realization of the data is observed, the
posterior covariance matrix of the parameter 𝛽 is unaffected by this outcome. The expectation parameter of
the posterior parameter distribution, on the other hand, results from a mixture of the prior expectation
parameter and the data, weighted by their respective precisions (inverse covariance matrices) and scaled by
the posterior covariance parameter. High prior precision of 𝛽 and high data variability (i.e. low conditional
precision of 𝑦) thus assigns more weight to the prior expectation, and vice versa.
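The posterior parameter equations (7) and (8) translate directly into code. The following sketch is a minimal numpy illustration for a hypothetical simple linear regression design with known variance parameter; the true parameters, prior values, and seed are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical simple linear regression GLM: intercept and slope
n, p = 20, 2
x = np.linspace(0, 1, n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([1.0, -2.0])
sigma2 = 0.25                               # known likelihood variance
y = X @ beta_true + rng.normal(0, np.sqrt(sigma2), n)

# Prior p(beta) = N(mu_b, Sigma_b), a loose (high-variance) choice
mu_b = np.zeros(p)
Sigma_b = 10.0 * np.eye(p)
Sigma_y_b = sigma2 * np.eye(n)

# Posterior covariance and expectation, equations (7) and (8)
Sigma_post = np.linalg.inv(np.linalg.inv(Sigma_b)
                           + X.T @ np.linalg.inv(Sigma_y_b) @ X)
mu_post = Sigma_post @ (np.linalg.inv(Sigma_b) @ mu_b
                        + X.T @ np.linalg.inv(Sigma_y_b) @ y)
```

With this loose prior, 𝜇𝛽|𝑦 lies close to the OLS estimate; tightening the prior (smaller Σ𝛽) pulls it toward 𝜇𝛽.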
For the marginal distribution of the data 𝑦, we have
𝑝(𝑦) = 𝑁(𝑦; 𝜇𝑦, Σ𝑦) (9)
where
𝜇𝑦 = 𝑋𝜇𝛽 ∈ ℝ𝑛 (10)
and
Σ𝑦 = Σ𝑦|𝛽 + 𝑋Σ𝛽𝑋𝑇 ∈ ℝ𝑛×𝑛 (11)
As noted above, the marginal expectation of 𝑦 depends on the prior expectation of 𝛽 and the design
matrix 𝑋. From a model evidence perspective, this implies that data can achieve the highest probability
under a given model if both the prior assumptions of 𝛽 as well as the “data-generating mechanism” (i.e., the
design matrix) match their true, but unknown, counterparts.
In the following, we apply the formulas for the conditional distribution of 𝛽 in three low-dimensional
scenarios, which are readily visualized: Bayesian inference for the expectation parameter of the univariate
Gaussian based on a single observation and on multiple observations, and Bayesian inference for the offset
and slope parameters in simple linear regression.
(3) Examples for Bayesian beta parameter estimation
Bayesian inference for the expectation of a univariate Gaussian with a single observation
Based on the expression for the parameters of the conditional distribution 𝑝(𝛽|𝑦) stated above, we
can now verify the expressions for the posterior parameters in the context of Bayesian inference for the
expectation of a univariate Gaussian with a single observation as discussed in the previous Section. Notably,
in this case, the design matrix corresponds to the scalar 1, and the expectation parameter of the univariate
Gaussian corresponds to the only beta parameter. We thus have for the marginal distribution of the
unknown parameter
𝑝(𝛽) ≔ 𝑁(𝛽; 𝜇𝛽, 𝜎𝛽^2) = 𝑁(𝛽; 𝜇𝛽, 𝜆𝛽^−1), where 𝛽, 𝜇𝛽 ∈ ℝ, 𝜆𝛽 = 1/𝜎𝛽^2 > 0 (1)
and for the conditional distribution of the single data point
𝑝(𝑦|𝛽) ≔ 𝑁(𝑦; 𝛽, 𝜎𝑦|𝛽^2) = 𝑁(𝑦; 𝛽, 𝜆𝑦|𝛽^−1), where 𝑦, 𝛽 ∈ ℝ, 𝜆𝑦|𝛽 = 1/𝜎𝑦|𝛽^2 > 0 (2)
Application of the formulae for the posterior distribution over 𝛽 then yields for the posterior precision
parameter
𝜆𝛽|𝑦 = (𝜆𝛽 + 1^𝑇 ⋅ 𝜆𝑦|𝛽 ⋅ 1) = 𝜆𝛽 + 𝜆𝑦|𝛽 > 0 (3)
and for the posterior expectation parameter
𝜇𝛽|𝑦 = (1/(𝜆𝛽 + 𝜆𝑦|𝛽))(𝜆𝛽𝜇𝛽 + 1^𝑇𝜆𝑦|𝛽𝑦) = (𝜆𝛽/(𝜆𝛽 + 𝜆𝑦|𝛽))𝜇𝛽 + (𝜆𝑦|𝛽/(𝜆𝛽 + 𝜆𝑦|𝛽))𝑦 ∈ ℝ (4)
By working through the general (𝑛 + 𝑝)-dimensional joint distribution case of GLM beta parameters and
data, we have thus justified the Bayesian inference scheme for the univariate Gaussian discussed in the
previous section.
Bayesian inference for the expectation of a univariate Gaussian with 𝑛 observations
We have repeatedly seen that inference for the expectation of a univariate Gaussian based on 𝑛
independent and identically distributed observations can be cast in GLM form by defining a design matrix
comprising a single column of 𝑛 1′𝑠 and identifying the single 𝛽 parameter in this GLM with the expectation
parameter of the univariate Gaussian to be inferred. In the Bayesian conjugate prior context with known
variance parameter 𝜎𝑦|𝛽^2 we thus have the following marginal parameter and conditional data distributions
𝑝(𝛽) ≔ 𝑁(𝛽; 𝜇𝛽, 𝜎𝛽^2) = 𝑁(𝛽; 𝜇𝛽, 𝜆𝛽^−1), where 𝛽, 𝜇𝛽 ∈ ℝ, 𝜆𝛽 = 1/𝜎𝛽^2 > 0 (5)
and
𝑝(𝑦|𝛽) ≔ 𝑁(𝑦; 𝑋𝛽, 𝜎𝑦|𝛽^2 𝐼𝑛) = 𝑁(𝑦; 𝑋𝛽, 𝜆𝑦|𝛽^−1 𝐼𝑛), where 𝑦 ∈ ℝ𝑛, 𝑋 = (1, …, 1)^𝑇 ∈ ℝ𝑛, 𝜆𝑦|𝛽 = 1/𝜎𝑦|𝛽^2 > 0 (6)
Application of the formulae for the posterior distribution over 𝛽 in terms of precisions then yields for the
posterior precision parameter
𝜆𝛽|𝑦 = 𝜆𝛽 + (1, …, 1) 𝜆𝑦|𝛽 (1, …, 1)^𝑇 = 𝜆𝛽 + 𝑛𝜆𝑦|𝛽 > 0 (7)
and for the posterior expectation parameter
𝜇𝛽|𝑦 = (1/(𝜆𝛽 + 𝑛𝜆𝑦|𝛽))(𝜆𝛽𝜇𝛽 + (1, …, 1)𝜆𝑦|𝛽(𝑦1, …, 𝑦𝑛)^𝑇) = (𝜆𝛽/(𝜆𝛽 + 𝑛𝜆𝑦|𝛽))𝜇𝛽 + (𝜆𝑦|𝛽/(𝜆𝛽 + 𝑛𝜆𝑦|𝛽)) ∑_{𝑖=1}^{𝑛} 𝑦𝑖 ∈ ℝ (8)
Note that for a prior expectation of 𝜇𝛽 = 0 and 𝜆𝛽 approaching zero, i.e. infinitely high prior variance, we
recover the ML point estimator (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑦𝑖 for the posterior expectation parameter.
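The precision-weighted update in equations (7) and (8), and the recovery of the sample mean in the flat-prior limit, can be sketched in a few lines of Python. This is a minimal illustration assuming numpy; the data, known likelihood precision, and prior values are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(3)

# n i.i.d. observations from a univariate Gaussian; variance assumed known
y = rng.normal(2.0, 1.0, size=50)
n = y.size
lam_y = 1.0              # likelihood precision lambda_{y|beta}, assumed known
mu_b, lam_b = 0.0, 0.5   # prior expectation and precision (hypothetical values)

# Posterior precision and expectation, equations (7) and (8)
lam_post = lam_b + n * lam_y
mu_post = (lam_b * mu_b + lam_y * y.sum()) / lam_post

# In the limit lam_b -> 0 the posterior expectation recovers the sample mean
mu_flat = (0.0 * mu_b + lam_y * y.sum()) / (0.0 + n * lam_y)
print(np.isclose(mu_flat, y.mean()))
```

The posterior expectation is a convex combination of the prior expectation and the sample mean, with weights given by the respective precisions.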
(4) Bayesian Estimation of the beta and variance parameters
In the current section we assume that both the beta parameter vector 𝛽 ∈ ℝ𝑝 and the likelihood
variance parameter 𝜎𝑦|𝛽^2 > 0 are unknown and are treated as unobserved random variables in a joint
distribution model of the form
𝑝(𝑦, 𝛽, 𝜎𝑦|𝛽^2) = 𝑝(𝑦|𝛽, 𝜎𝑦|𝛽^2) 𝑝(𝛽, 𝜎𝑦|𝛽^2) (1)
where 𝑝(𝛽, 𝜎𝑦|𝛽^2) denotes the joint marginal (prior) distribution over both 𝛽 ∈ ℝ𝑝 and 𝜎𝑦|𝛽^2 > 0. Because
the unobserved random variable 𝜎𝑦|𝛽^2 is required to be strictly positive for the Gaussian likelihood
𝑝(𝑦|𝛽, 𝜎𝑦|𝛽^2) ≔ 𝑁(𝑦; 𝑋𝛽, 𝜎𝑦|𝛽^2 𝐼𝑛) (2)
to be well-defined, the marginal and conditional distributions of 𝜎𝑦|𝛽^2 cannot be set to Gaussian
distributions. A number of approaches for modeling uncertainty about 𝜎𝑦|𝛽^2 can be found in the literature. A
common approach is to formulate the Gaussian likelihood in terms of a precision parameter 𝜆 ≔ (𝜎𝑦|𝛽^2)^−1
and assume a gamma marginal distribution for 𝜆. Another approach, which we discuss below, is to formulate
the Gaussian likelihood in terms of the variance parameter 𝜎𝑦|𝛽^2, and use the inverse gamma distribution to
model its uncertainty. Yet another approach is to use a so-called “Jeffreys” or “reference” prior distribution
for 𝜎𝑦|𝛽^2, which takes the role of an “uninformative” prior distribution. We discuss these approaches in turn.
For a normal-inverse gamma prior distribution, we assume the following factorization of the joint
distribution over 𝑦, 𝛽 and 𝜎𝑦|𝛽^2
𝑝(𝑦, 𝛽, 𝜎𝑦|𝛽^2) = 𝑝(𝑦|𝛽, 𝜎𝑦|𝛽^2) 𝑝(𝛽|𝜎𝑦|𝛽^2) 𝑝(𝜎𝑦|𝛽^2) (1)
We thus assume that the marginal (prior) distribution over 𝛽 and 𝜎𝑦|𝛽^2 factorizes into the conditional
distribution 𝑝(𝛽|𝜎𝑦|𝛽^2) and the marginal distribution 𝑝(𝜎𝑦|𝛽^2). Notably, this induces a dependence of the
probability density function values for the beta parameter 𝛽 on the value of the likelihood variance
parameter 𝜎𝑦|𝛽^2. While this may not necessarily be the most natural scenario, it is nevertheless (due to its
mathematical tractability) commonly encountered in the literature. As usual, the data likelihood is set to
𝑝(𝑦|𝛽, 𝜎𝑦|𝛽^2) ≔ 𝑁(𝑦; 𝑋𝛽, 𝜎𝑦|𝛽^2 𝐼𝑛) (2)
while the marginal (prior) distribution over 𝛽 and 𝜎𝑦|𝛽^2 is set to the product of a Gaussian conditional
distribution of 𝛽 given 𝜎𝑦|𝛽^2 and an inverse gamma distribution over 𝜎𝑦|𝛽^2, i.e., the normal-inverse gamma
distribution
𝑝(𝛽, 𝜎𝑦|𝛽^2) = 𝑁𝐼𝐺(𝛽, 𝜎𝑦|𝛽^2; 𝜇𝛽, Σ𝛽, 𝑎𝜎², 𝑏𝜎²) = 𝑁(𝛽; 𝜇𝛽, 𝜎𝑦|𝛽^2 Σ𝛽) 𝐼𝐺(𝜎𝑦|𝛽^2; 𝑎𝜎², 𝑏𝜎²) (3)
With this marginal distribution and likelihood, one can show that the data-conditional (posterior)
distribution over 𝛽 and 𝜎𝑦|𝛽2 is given again by a normal-inverse Gamma distribution, i.e., the normal-inverse
Gamma prior distribution is a conjugate prior distribution for the Gaussian likelihood:
𝑝(𝛽, 𝜎𝑦|𝛽^2|𝑦) = 𝑁𝐼𝐺(𝛽, 𝜎𝑦|𝛽^2; 𝜇𝛽|𝑦, Σ𝛽|𝑦, 𝑎𝜎²|𝑦, 𝑏𝜎²|𝑦) (4)
The parameters of this distribution are given in terms of the prior distribution parameters and the data as
follows
Σ𝛽|𝑦 = (Σ𝛽^−1 + 𝑋^𝑇𝑋)^−1 (5)
𝜇𝛽|𝑦 = Σ𝛽|𝑦(Σ𝛽^−1𝜇𝛽 + 𝑋^𝑇𝑦) (6)
𝑎𝜎²|𝑦 = 𝑎𝜎² + 𝑛/2 (7)
𝑏𝜎²|𝑦 = 𝑏𝜎² + (1/2)(𝜇𝛽^𝑇Σ𝛽^−1𝜇𝛽 + 𝑦^𝑇𝑦 − 𝜇𝛽|𝑦^𝑇Σ𝛽|𝑦^−1𝜇𝛽|𝑦) (8)
The posterior parameters Σ𝛽|𝑦 and 𝜇𝛽|𝑦 are thus similar to the case that 𝜎𝑦|𝛽^2 is known. The posterior
parameter 𝑎𝜎²|𝑦 is determined by the prior parameter on 𝜎𝑦|𝛽^2 and the number of data points. Finally, the
posterior parameter 𝑏𝜎²|𝑦 is determined by the prior parameter 𝑏𝜎² and three sums-of-squares: the sum-
of-squares of the prior expectation of 𝛽, the empirical sum-of-squares 𝑦^𝑇𝑦, and the sum-of-squares of the
posterior expectation of 𝛽.
From the properties of the normal-inverse gamma distribution, we can infer that the posterior
marginal distributions over 𝛽 and 𝜎𝑦|𝛽^2 are given by the following non-central multivariate t-distribution and
inverse gamma distribution, respectively
𝑝(𝛽|𝑦) = 𝑡(𝛽; 𝜇𝛽|𝑦, (𝑏𝜎²|𝑦/𝑎𝜎²|𝑦) Σ𝛽|𝑦, 2𝑎𝜎²|𝑦) (9)
𝑝(𝜎𝑦|𝛽^2|𝑦) = 𝐼𝐺(𝜎𝑦|𝛽^2; 𝑎𝜎²|𝑦, 𝑏𝜎²|𝑦) (10)
The two upper rows of Figure 1 visualize the Bayesian estimation of the GLM for the special case of
independent and identical sampling from a univariate Gaussian with unknown expectation (beta) and
variance parameter and a normal-inverse gamma prior distribution for these parameters.
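The normal-inverse gamma parameter update equations (5)–(8) can be written down directly. The following is a minimal numpy sketch; the design matrix, data, and prior parameter values are hypothetical illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical GLM data: intercept and slope design
n, p = 30, 2
X = np.column_stack([np.ones(n), np.linspace(-1, 1, n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(0, 0.6, n)

# Normal-inverse gamma prior parameters (hypothetical choices)
mu_b = np.zeros(p)
Sigma_b = 5.0 * np.eye(p)
a0, b0 = 2.0, 1.0

# Posterior parameters, equations (5)-(8)
Sigma_b_inv = np.linalg.inv(Sigma_b)
Sigma_post = np.linalg.inv(Sigma_b_inv + X.T @ X)
mu_post = Sigma_post @ (Sigma_b_inv @ mu_b + X.T @ y)
a_post = a0 + n / 2
b_post = b0 + 0.5 * (mu_b @ Sigma_b_inv @ mu_b + y @ y
                     - mu_post @ np.linalg.inv(Sigma_post) @ mu_post)

# Posterior marginal over the variance is IG(a_post, b_post);
# for a_post > 1 its expectation is b_post / (a_post - 1)
sigma2_post_mean = b_post / (a_post - 1)
```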
Another choice of marginal (prior) distribution over 𝛽 and 𝜎𝑦|𝛽^2 commonly encountered is
𝑝(𝛽, 𝜎𝑦|𝛽^2) ∝ 1/𝜎𝑦|𝛽^2 (1)
This choice of marginal distribution is referred to as an “uninformative”, “Jeffreys”, or “reference” prior,
because it can be argued that the distribution thus defined conveys minimal information about 𝛽 and 𝜎𝑦|𝛽^2 in
a well-defined sense.
In addition, (1) is an example of a so-called “improper” prior distribution. Improper prior distributions are
characterized by density functions that do not integrate to 1, i.e., which are not probability density functions
in the strict sense. They are usually denoted using proportionality statements as in (1). The marginal
distribution (1) may also be expressed as an improper normal-inverse gamma distribution
𝑝(𝛽, 𝜎𝑦|𝛽^2) = 𝑁𝐼𝐺(𝛽, 𝜎𝑦|𝛽^2; 𝜇𝛽, Σ𝛽, 𝑎𝜎², 𝑏𝜎²) (2)
with parameters
𝜇𝛽 = 0, Σ𝛽 = ∞𝐼𝑝, 𝑎𝜎² = −1/2 and 𝑏𝜎² = 0 (3)
Despite the fact that the marginal distribution thus defined is not a probability density function, it can be
shown that the data-conditional (posterior) distribution over 𝛽 and 𝜎𝑦|𝛽^2 is given by a proper probability
density function, namely a normal-inverse gamma distribution. This normal-inverse gamma distribution has
the following parameters
Σ𝛽|𝑦 = (𝑋𝑇𝑋)−1 (4)
𝜇𝛽|𝑦 = (𝑋^𝑇𝑋)^−1𝑋^𝑇𝑦 (5)
𝑎𝜎²|𝑦 = (𝑛 − 𝑝)/2 (6)
𝑏𝜎²|𝑦 = (1/2)(𝑦 − 𝑋𝜇𝛽|𝑦)^𝑇(𝑦 − 𝑋𝜇𝛽|𝑦) (7)
In this case, the data-conditional marginal distribution of 𝛽 is given by a multivariate non-central 𝑡-distribution
of the following form
𝑝(𝛽|𝑦) = 𝑡(𝛽; 𝜇𝛽|𝑦, ((𝑦 − 𝑋𝜇𝛽|𝑦)^𝑇(𝑦 − 𝑋𝜇𝛽|𝑦)/(𝑛 − 𝑝))(𝑋^𝑇𝑋)^−1, 𝑛 − 𝑝) (8)
Notably, for this choice of prior, the posterior parameters have strong similarities to the classical maximum
likelihood point estimation scenario: the posterior expectation parameter 𝜇𝛽|𝑦 corresponds to the maximum
likelihood estimator 𝛽̂. The posterior covariance parameter Σ𝛽|𝑦 corresponds closely to the covariance
parameter of the sampling distribution of 𝛽̂. Finally, the residual sum-of-squares directly enters the rate
parameter of the posterior distribution over 𝜎𝑦|𝛽^2.
If we consider the marginal posterior distribution 𝑝(𝛽𝑖|𝑦) for a single beta parameter 𝛽𝑖 with 𝑖 ∈ {1, …, 𝑝}, we have
an even stronger similarity between Bayesian and classical point estimation: for the case of the improper
prior distribution (1), the posterior marginal distribution of 𝛽𝑖 and the sampling distribution of 𝛽̂𝑖 are
equivalent. From (8), we have with
𝜎̂^2 = (𝑦 − 𝑋𝛽̂)^𝑇(𝑦 − 𝑋𝛽̂)/(𝑛 − 𝑝) (9)
that
𝑝(𝛽𝑖|𝑦) = 𝑡(𝛽𝑖; 𝛽̂𝑖, 𝜎̂^2 ((𝑋^𝑇𝑋)^−1)𝑖𝑖, 𝑛 − 𝑝) (10)
In other words, with
𝑇𝐵 ≔ (𝛽𝑖 − 𝛽̂𝑖)/√(𝜎̂^2 ((𝑋^𝑇𝑋)^−1)𝑖𝑖) (11)
We thus have
𝑝(𝑇𝐵|𝑦) = 𝑡(𝑇𝐵; 𝑛 − 𝑝) (12)
Note, however, that while the formulas for the random variables 𝑇 and 𝑇𝐵 and their distributions are
equivalent, they arise from fundamentally different assumptions about 𝛽𝑖: in the classical frequentist
scenario, 𝛽𝑖 is a true, but unknown, fixed value, not a random variable. On the other hand, 𝛽̂𝑖 is a random
variable in both scenarios, owing to the common assumption of the Gaussian distribution of the data 𝑦.
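The correspondence between the reference-prior posterior parameters and the classical point estimation quantities can be checked numerically. The sketch below is a minimal numpy illustration with a hypothetical design and data; it compares the posterior parameters of equations (4)–(7) against the OLS estimate and the classical variance estimator:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical GLM data
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(0, 1.0, n)

# Posterior parameters under the reference prior, equations (4)-(7)
XtX_inv = np.linalg.inv(X.T @ X)
mu_post = XtX_inv @ X.T @ y           # posterior expectation parameter
resid = y - X @ mu_post
a_post = (n - p) / 2
b_post = 0.5 * resid @ resid

# Classical counterparts: OLS estimate and variance estimate
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
sigma2_hat = resid @ resid / (n - p)

print(np.allclose(mu_post, beta_hat))        # posterior mean matches OLS
print(np.isclose(b_post / a_post, sigma2_hat))  # b/a matches sigma^2 hat
```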
The third row of Figure 1 visualizes the Bayesian estimation of the GLM for the special case of
independent and identical sampling from a univariate Gaussian with unknown expectation (beta) and
variance parameter and a reference prior distribution for these parameters.
Figure 1. Bayesian estimation of the GLM 𝛽 and 𝜎𝑦|𝛽2 parameters for the special univariate Gaussian case. Rows depict three different
prior scenarios: the first and second rows depict the case of a tight and loose normal-inverse gamma distribution, respectively, while the third row depicts the case of a reference prior. Columns depict components of the probabilistic model: the first column depicts
the prior distribution over 𝛽 and 𝜎𝑦|𝛽2 , the second column depicts the true, but unknown, Gaussian likelihood over the 𝑛 data
points, and a sample of 𝑛 = 10. The third column depicts the inferred posterior distributions, while the two last columns depict the
posterior marginal distributions over the 𝛽 and 𝜎𝑦|𝛽2 parameter, respectively.
Fundamental designs
The General Linear Model is a mathematical unification of a number of data modeling procedures.
Specifically, it unites the following concepts: simple linear regression, multiple linear regression, T-tests, the
multifactorial analysis of variance (ANOVA) and the multifactorial analysis of covariance (ANCOVA). All these
approaches instantiate specific examples of the GLM, i.e., they are characterized by their specific design
matrix and beta parameter interpretation.
To exemplify these approaches in the following, we will use one common artificial data set. In this
data set, we conceive the data vector 𝑦 ∈ ℝ30 as representing the “anatomical volume” of a brain structure
(in mm³) (for example the dorsolateral prefrontal cortex (DLPFC)) of each of 𝑛 = 30 participants (or
“experimental units”) that took part in an anatomical MRI study. The study interest concerns the question
whether the experimental factors (= independent variables) “Age” (measured in years) and “Alcohol” (intake
measured in units (= 7.9 gram) of pure alcohol) influence the dependent variable “anatomical volume”. A
fictitious data set of this kind is shown in Table 1. Note that we are dealing with a between-subject design
which justifies the assumption that the error components 𝜀𝑖, 𝑖 = 1, …, 𝑛 in the GLM formulation are
independent. In the following we will consider this data set from different perspectives of experimental
designs and statistical inference procedures.
Participant Age [years] Alcohol [units] DLPFC Volume
1 15 3 178.7708
2 16 6 168.4660
3 17 5 169.9513
4 18 7 162.0778
5 19 4 170.1884
6 20 8 156.9287
7 21 1 175.4092
8 22 2 173.3972
9 23 7 154.4907
10 24 5 158.3642
11 25 1 172.1033
12 26 3 162.6648
13 27 2 165.4449
14 28 8 142.2121
15 29 4 154.3557
16 30 6 145.6544
17 31 3 155.5286
18 32 4 150.5144
19 33 7 137.8262
20 34 1 160.1183
21 35 2 155.4419
22 36 8 127.1715
23 37 5 138.0237
24 38 6 133.4589
25 39 4 139.3813
26 40 3 145.1997
27 41 7 123.7259
28 42 5 130.7300
29 43 8 114.1148
30 44 1 151.1943
31 45 6 121.7235
32 46 2 140.9424
Table 1. An example data set.
(1) A spectrum of GLM designs
GLM designs, i.e. specific design matrices and their corresponding beta parameter vectors, for any
data set may be conceived as lying within a spectrum between two extremes: on one side of the spectrum, one
may assume that there is no systematic variation in the outcome measure whatsoever. This would
correspond to the notion that each data point 𝑦𝑖 ∈ ℝ (𝑖 = 1, …, 𝑛) has been sampled from a univariate
Gaussian, and all these Gaussians have an identical mean. In other words, all observed data variability over
experimental units may be explained as “pure Gaussian noise” around a common expectation 𝜇 ∈ ℝ
𝑝(𝑦𝑖) = 𝑁(𝑦𝑖; 𝜇, 𝜎2) ⇔ 𝑦𝑖 = 𝜇 + 𝜀𝑖, 𝑝(𝜀𝑖) ≔ 𝑁(𝜀𝑖; 0, 𝜎2) (𝑖 = 1, …, 𝑛) (1)
As discussed previously, this “null model” corresponds to the case of a design matrix formed by a column of
ones and a single beta parameter. The 𝑖th entry in the matrix product of these, (𝑋𝛽)𝑖, represents the
participant-independent expectation parameter 𝜇 ∈ ℝ of the univariate Gaussian:
𝑋 ≔ (1, …, 1)^𝑇 ∈ ℝ𝑛×1, 𝛽 ∈ ℝ, 𝜇 ≔ (𝑋𝛽)𝑖 (𝑖 = 1, …, 𝑛) (2)
On the other side of the spectrum, one may conceive a case in which there is complete and
unsystematic variability over experimental units in the sense that each participant’s data actually
corresponds to a sample from a participant-specific univariate Gaussian distribution. In other words, each
experimental unit is modelled by an experimental-unit-specific expectation parameter 𝜇𝑖 (𝑖 = 1, …, 𝑛):
𝑝(𝑦𝑖) = 𝑁(𝑦𝑖; 𝜇𝑖, 𝜎2) ⇔ 𝑦𝑖 = 𝜇𝑖 + 𝜀𝑖, 𝑝(𝜀𝑖) ≔ 𝑁(𝜀𝑖; 0, 𝜎2) (𝑖 = 1, …, 𝑛) (3)
In terms of GLM designs, this would correspond to the square identity matrix as design matrix and a
beta parameter vector comprising as many parameters as there are data points. This renders the 𝑖th entry in
the matrix product 𝑋𝛽 the participant-specific expectation parameter 𝜇𝑖 ∈ ℝ
𝑋 ≔ 𝐼𝑛 ∈ ℝ𝑛×𝑛, 𝛽 ∈ ℝ𝑛, 𝜇𝑖 = (𝑋𝛽)𝑖 (𝑖 = 1, …, 𝑛) (4)
Note that in both cases no interesting statements can be made about the relation between the columns of the design matrix and the outcome data variable, in the sense that (2) assumes that all participants are "the same", and (4) assumes that all participants are "mutually different". Most of the designs we will encounter in the following lie somewhere between (2) and (4) and thus constrain the "differences" between participants or data points in some meaningful way which lends itself to an interpretation in terms of the independent experimental variables that the design matrix columns represent.
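The two extreme designs can be checked numerically. Below is a minimal sketch in Python/NumPy (the outcome values are assumed for illustration, not taken from the text): for the column-of-ones design the OLS estimate (XᵀX)⁻¹Xᵀy reduces to the sample mean, while for the identity design it simply reproduces the data.

```python
import numpy as np

# Assumed example outcome data, n = 8 (not from the text).
rng = np.random.default_rng(1)
y = rng.normal(160.0, 10.0, size=8)

# "Null model" (2): X is a single column of ones; the OLS estimate
# (X^T X)^{-1} X^T y reduces to the sample mean of y.
X_null = np.ones((8, 1))
beta_null = np.linalg.solve(X_null.T @ X_null, X_null.T @ y)

# "Saturated model" (4): X is the identity matrix; the OLS estimate
# reproduces y exactly, one expectation parameter per data point.
X_sat = np.eye(8)
beta_sat = np.linalg.solve(X_sat.T @ X_sat, X_sat.T @ y)
```

The first estimator collapses all variability into one number; the second leaves no variability unexplained, which is exactly why neither design supports interesting inferences.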
Design matrices can further be classified according to whether their columns are formed by
continuously varying real numbers or by so-called “indicator” or “dummy” variables, i.e. ones and zeros. In
the first case, the designs are referred to as “continuous” or “regression” designs, in which the independent
variables represent continuous experimental factors, and the design matrix columns are usually referred to
as “regressors”, “predictors”, and sometimes as “(co)variates”. In the second case, the designs are referred
to as “categorical” or “ANOVA-type” designs. In categorical designs the independent experimental variables
are usually referred to as “experimental factors” and the values that they can take on as “levels of the
experimental factor”.
The difference between the "continuous" and "categorical" approaches lies in the expectation about changes in the dependent variable as the independent variable changes. If one treats an independent experimental variable as a continuous variate, one assumes a linear effect of this predictor on the dependent variable. If one treats an independent experimental variable as a discrete variate, one does not need to assume that every unit change in the independent variable produces a scaled unit change in the value of the dependent variable. In other words, one allows for arbitrary changes in the response from one category
to another. This approach has the advantage of a simple interpretation and may be viewed as a prediction of
“qualitative” differences. On the other hand, by grouping independent variable values into discrete
categories, we are discarding information contained in the continuous covariation of independent and
dependent variables.
In the following we will discuss two forms of continuous GLM designs, simple and multiple linear
regression, two forms of categorical GLM designs, T-tests and ANOVA designs, and one mixed form, the
ANCOVA design. For each design, we will select the relevant aspects of the example data set given in Table
1, write down the corresponding GLM in structural and design matrix form, show a typical visualization, and
discuss its estimation and interpretation from a classical and a Bayesian viewpoint.
(2) Simple Linear Regression
The central idea of simple linear regression is that the expectation of the i-th data variable y_i (i = 1,…,n) is given by a constant offset a ∈ ℝ plus the value of a single "predictor" or "regressor" independent experimental variable x_i multiplied by a slope coefficient b ∈ ℝ:

p(y_i) = N(y_i; a + bx_i, σ²) ⇔ y_i = a + bx_i + ε_i, p(ε_i) := N(ε_i; 0, σ²) (1)
for 𝑖 = 1,… , 𝑛. In its design matrix formulation the above corresponds to
p(y) := N(y; Xβ, σ²I_n), where y ∈ ℝ^n, X := (1 x_1; 1 x_2; ⋮ ⋮; 1 x_n) ∈ ℝ^{n×2}, β ∈ ℝ² and σ² > 0 (2)
Here the first entry of the parameter vector β := (β₁, β₂)ᵀ assumes the role of the offset a, and the second entry assumes the role of the slope b.
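The OLS estimate for this design follows from the general GLM result β̂ = (XᵀX)⁻¹Xᵀy. A sketch in Python/NumPy using the Age and DLPFC-volume values of Table 2 below (the use of NumPy rather than the text's own tooling is an assumption for illustration):

```python
import numpy as np

# Age and DLPFC volume data from Table 2.
x = np.arange(15, 47, dtype=float)               # ages 15, ..., 46
y = np.array([178.7708, 168.4660, 169.9513, 162.0778, 170.1884, 156.9287,
              175.4092, 173.3972, 154.4907, 158.3642, 172.1033, 162.6648,
              165.4449, 142.2121, 154.3557, 145.6544, 155.5286, 150.5144,
              137.8262, 160.1183, 155.4419, 127.1715, 138.0237, 133.4589,
              139.3813, 145.1997, 123.7259, 130.7300, 114.1148, 151.1943,
              121.7235, 140.9424])

X = np.column_stack([np.ones_like(x), x])        # n x 2 design matrix of (2)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # (X^T X)^{-1} X^T y
a_hat, b_hat = beta_hat                          # offset and slope estimates
```

The estimated slope is negative, consistent with the downward trend of DLPFC volume over age visible in Table 2.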
As an example, we reconsider the example data set above by ignoring the alcohol intake variable,
which results in the data set shown in Table 2. Here “Age” corresponds to the independent experimental
variable encoded in the values 𝑥𝑖 and “DLPFC Volume” corresponds to the dependent experimental variables
𝑦𝑖. Simple linear regression designs are typically visualized by plotting the values of the independent
experimental variable on the x-axis and the values of the dependent experimental variable on the y-axis.
𝒊 𝒙𝒊: Age 𝒚𝒊 : DLPFC Volume
1 15 178.7708
2 16 168.4660
3 17 169.9513
4 18 162.0778
5 19 170.1884
6 20 156.9287
7 21 175.4092
8 22 173.3972
9 23 154.4907
10 24 158.3642
11 25 172.1033
12 26 162.6648
13 27 165.4449
14 28 142.2121
15 29 154.3557
16 30 145.6544
17 31 155.5286
18 32 150.5144
19 33 137.8262
20 34 160.1183
21 35 155.4419
22 36 127.1715
23 37 138.0237
24 38 133.4589
25 39 139.3813
26 40 145.1997
27 41 123.7259
28 42 130.7300
29 43 114.1148
30 44 151.1943
31 45 121.7235
32 46 140.9424
Table 2. The example data set considered as a simple linear regression design.
Figure 1. Visualization of a simple linear regression design.
(3) Multiple Linear Regression
Multiple linear regression may be viewed as the most general application of the GLM in the sense that all columns of the design matrix are allowed to take on arbitrary values, and there may be arbitrarily many of them. Here, we explicitly state the multiple linear regression design applicable to the data set above, which includes two predictor variables. For the special case of two independent experimental variables, the model takes the form

p(y_i) = N(y_i; a + b₁x_{1i} + b₂x_{2i}, σ²) ⇔ y_i = a + b₁x_{1i} + b₂x_{2i} + ε_i, p(ε_i) := N(ε_i; 0, σ²) (1)
for (𝑖 = 1,… , 𝑛). In its design matrix formulation the above corresponds to
p(y) := N(y; Xβ, σ²I_n), where y ∈ ℝ^n, X := (1 x_{11} x_{21}; 1 x_{12} x_{22}; ⋮ ⋮ ⋮; 1 x_{1n} x_{2n}) ∈ ℝ^{n×3}, β ∈ ℝ³ and σ² > 0 (2)
Here, the first entry of the beta parameter vector β := (β₁, β₂, β₃)ᵀ assumes the role of the offset a, the second entry assumes the role of the slope with respect to the first independent experimental variable x_{1i}, and the third entry assumes the role of the slope with respect to the second independent experimental variable x_{2i}.
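The same closed-form OLS estimator applies with the three-column design matrix. A sketch in Python/NumPy with assumed, noise-free example values (so that the estimator recovers the assumed coefficients exactly):

```python
import numpy as np

# Assumed true parameters and predictors (for illustration only).
n = 10
x1 = np.arange(n, dtype=float)
x2 = np.array([3., 1., 4., 1., 5., 9., 2., 6., 5., 3.])
a, b1, b2 = 200.0, -1.5, -2.0
y = a + b1 * x1 + b2 * x2                        # noise-free outcome data

X = np.column_stack([np.ones(n), x1, x2])        # n x 3 design matrix of (2)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # (X^T X)^{-1} X^T y
```

With noiseless data the estimate equals (a, b₁, b₂) up to numerical precision; with Gaussian noise added to y it would scatter around these values.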
As an example, we consider the data set of Table 1 and relabel the columns in accordance with the multiple linear regression design as shown in Table 3. Multiple linear regression designs are not easily visualized, especially if the number of independent experimental variables is larger than 2. In Figure 2 below we visualize the multiple linear regression design of the current example; note, however, that this kind of graph is rarely seen in the literature.
𝒊 𝒙𝟏𝒊 : Age 𝒙𝟐𝒊: Alcohol 𝒚𝒊 : DLPFC Volume
1 15 3 178.7708
2 16 6 168.4660
3 17 5 169.9513
4 18 7 162.0778
5 19 4 170.1884
6 20 8 156.9287
7 21 1 175.4092
8 22 2 173.3972
9 23 7 154.4907
10 24 5 158.3642
11 25 1 172.1033
12 26 3 162.6648
13 27 2 165.4449
14 28 8 142.2121
15 29 4 154.3557
16 30 6 145.6544
17 31 3 155.5286
18 32 4 150.5144
19 33 7 137.8262
20 34 1 160.1183
21 35 2 155.4419
22 36 8 127.1715
23 37 5 138.0237
24 38 6 133.4589
25 39 4 139.3813
26 40 3 145.1997
27 41 7 123.7259
28 42 5 130.7300
29 43 8 114.1148
30 44 1 151.1943
31 45 6 121.7235
32 46 2 140.9424
Table 3. The example data set considered as a multiple linear regression design with two independent experimental variables.
Figure 2. Visualization of a multiple linear regression design with two independent experimental variables.
(4) One-sample T-Test
The one-sample T-Test is usually portrayed as a procedure to evaluate the null hypothesis that all data points were generated from univariate Gaussian distributions with identical expectation parameters. From a GLM viewpoint it corresponds to a categorical design with a single experimental factor taking on a single level. Specifically, each data point is modelled by a univariate Gaussian variable with identical expectation parameter over data points

p(y_i) = N(y_i; μ, σ²) ⇔ y_i = μ + ε_i, p(ε_i) := N(ε_i; 0, σ²) (1)

where i = 1,…,n. In its design matrix formulation, (1) corresponds to

p(y) := N(y; Xβ, σ²I_n), where y ∈ ℝ^n, X := (1, …, 1)ᵀ ∈ ℝ^{n×1}, β ∈ ℝ and σ² > 0 (2)
Here the single entry of the parameter vector β assumes the role of the true, but unknown, expectation μ. If applied to the example data set of Table 1, the one-sample T-Test allows for evaluating the null hypothesis that all observed data points were generated from univariate Gaussians with identical expectations.
Table 4 shows the example data set viewed from the perspective of a one-sample T-Test.
𝒊 𝒚𝒊 : DLPFC Volume
1 178.7708
2 168.4660
3 169.9513
4 162.0778
5 170.1884
6 156.9287
7 175.4092
8 173.3972
9 154.4907
10 158.3642
11 172.1033
12 162.6648
13 165.4449
14 142.2121
15 154.3557
16 145.6544
17 155.5286
18 150.5144
19 137.8262
20 160.1183
21 155.4419
22 127.1715
23 138.0237
24 133.4589
25 139.3813
26 145.1997
27 123.7259
28 130.7300
29 114.1148
30 151.1943
31 121.7235
32 140.9424
Table 4. The example data set considered as a one-sample T-test design.
Figure 3. Visualization of a one-sample T-Test
One-sample T-Test data are not usually visualized. However, in line with the visualization of other categorical designs, it is most appropriate to visualize them by means of their sample mean and sample standard deviation or standard error of the mean (Figure 3).
For the model in (2), the classical beta parameter point estimator evaluates to

β̂ = (1/n) Σ_{i=1}^n y_i (3)

and is thus given by the so-called sample mean ȳ := (1/n) Σ_{i=1}^n y_i. The variance parameter estimator evaluates to

σ̂² = (1/(n−1)) Σ_{i=1}^n (y_i − ȳ)² (4)

and is thus given by the so-called sample variance s² := (1/(n−1)) Σ_{i=1}^n (y_i − ȳ)². Specification of the null hypothesis H₀: μ = μ₀ then yields the familiar formula for the T-statistic

T = √n (ȳ − μ₀)/s (5)

where we defined the unbiased sample standard deviation as

s := √s² (6)

Based on a data realization we may thus evaluate the T-statistic, which under the null hypothesis is distributed according to a t-distribution with n − 1 degrees of freedom. If the observed T-statistic value (or a more extreme one) is associated with a low probability under the null hypothesis, we may consider revising the null hypothesis.
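Equations (3) – (6) translate directly into code. A sketch in Python/NumPy with a deliberately tiny assumed data set so the result can be verified by hand (ȳ = 2, s² = 1, T = 2√3):

```python
import numpy as np

# Tiny assumed data set (for illustration only).
y = np.array([1.0, 2.0, 3.0])
mu0 = 0.0                                   # null hypothesis H0: mu = mu0

n = y.size
y_bar = y.mean()                            # beta estimate = sample mean (3)
s2 = ((y - y_bar) ** 2).sum() / (n - 1)     # sample variance (4)
T = np.sqrt(n) * (y_bar - mu0) / np.sqrt(s2)   # T-statistic (5)
```

Under H₀ this T would be compared against a t-distribution with n − 1 = 2 degrees of freedom.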
Proof of (3) – (5)

For the current GLM design, the beta parameter estimator is given by

β̂ = (XᵀX)⁻¹Xᵀy (3.1)
  = ((1 ⋯ 1)(1, …, 1)ᵀ)⁻¹ (1 ⋯ 1)(y_1, …, y_n)ᵀ
  = n⁻¹ Σ_{i=1}^n y_i
  = (1/n) Σ_{i=1}^n y_i

corresponding to the arithmetic mean ȳ of the data points in y ∈ ℝ^n. With p = 1, the sigma parameter estimator is given by

σ̂² = (1/(n−1)) (y − Xβ̂)ᵀ(y − Xβ̂) (4.1)
   = (1/(n−1)) Σ_{i=1}^n (y_i − (1/n) Σ_{j=1}^n y_j)²
   = (1/(n−1)) Σ_{i=1}^n (y_i − ȳ)²

corresponding to the sample variance s² of the data points in y ∈ ℝ^n. Finally, with a contrast vector c := 1 and true, but unknown, beta parameter β₀ = μ₀, the T-statistic evaluates to

T = (cᵀβ̂ − cᵀβ₀)/√(σ̂² cᵀ(XᵀX)⁻¹c) (5.1)
  = (ȳ − μ₀)/√(s² n⁻¹)
  = √n (ȳ − μ₀)/s

where we defined the sample standard deviation as s := √s².

□
(5) Independent two-sample T-Test
The independent two-sample T-Test is a procedure to evaluate whether two groups of data points y_1 ∈ ℝ^{n₁} and y_2 ∈ ℝ^{n₂} were generated from the same underlying univariate Gaussian distribution, i.e. Gaussians with the same expectation parameters, μ₁ = μ₂, corresponding to the null hypothesis, or from two different ones, μ₁ ≠ μ₂, corresponding to the alternative hypothesis. In terms of GLM designs, this case thus corresponds to one experimental factor taking on two discrete levels. The assumed probability distribution for data points collected at the first level takes the form

p(y_{1i}) = N(y_{1i}; μ₁, σ²) ⇔ y_{1i} = μ₁ + ε_i, p(ε_i) := N(ε_i; 0, σ²) (1)

where i = 1,…,n₁, and the assumed probability distribution for data points collected at the second level takes the form

p(y_{2i}) = N(y_{2i}; μ₂, σ²) ⇔ y_{2i} = μ₂ + ε_i, p(ε_i) := N(ε_i; 0, σ²) (2)

where i = 1,…,n₂. In its design matrix formulation, the two-sample T-test corresponds to

p(y) := N(y; Xβ, σ²I_n), where y := (y_{11}, …, y_{1n₁}, y_{21}, …, y_{2n₂})ᵀ ∈ ℝ^n, X := (1_{n₁} 0_{n₁}; 0_{n₂} 1_{n₂}) ∈ ℝ^{n×2}, β ∈ ℝ² and σ² > 0 (3)

where n := n₁ + n₂, 1_{n_j} and 0_{n_j} denote n_j-dimensional vectors of ones and zeros, respectively, and, notably, the data points and ones in the design matrix have to be arranged in such a manner that they identify the corresponding group membership.
As an example, consider regrouping the example data set into two groups of observations corresponding to participants younger than 31 years of age and participants of age 31 years or older, as shown in Table 5.
Group 1 Group 2
Participant 𝒊𝒋 𝒚𝟏𝒊 : DLPFC Volume Participant 𝒊𝒋 𝒚𝟐𝒊: DLPFC Volume
1 11 178.7708 17 21 155.5286
2 12 168.4660 18 22 150.5144
3 13 169.9513 19 23 137.8262
4 14 162.0778 20 24 160.1183
5 15 170.1884 21 25 155.4419
6 16 156.9287 22 26 127.1715
7 17 175.4092 23 27 138.0237
8 18 173.3972 24 28 133.4589
9 19 154.4907 25 29 139.3813
10 110 158.3642 26 210 145.1997
11 111 172.1033 27 211 123.7259
12 112 162.6648 28 212 130.7300
13 113 165.4449 29 213 114.1148
14 114 142.2121 30 214 151.1943
15 115 154.3557 31 215 121.7235
16 116 145.6544 32 216 140.9424
Table 5. The example data set of Table 1 rearranged for an independent two-sample T-test. The column Participant comprises the original data labels as in Table 1, while the columns i_j comprise the indices of the data points y_{1i} and y_{2i} after relabeling for the independent two-sample T-test.
Independent two-sample T-Test designs are usually visualized by portraying their group sample means and
associated standard deviations or standard errors of mean.
Figure 4. Visualization of an independent two-sample T-Test. Errorbars depict the within-group standard deviations.
For the model formulated in (1) – (3), the beta parameter estimator evaluates to

β̂ = ((1/n₁) Σ_{i=1}^{n₁} y_{1i}, (1/n₂) Σ_{i=1}^{n₂} y_{2i})ᵀ = (ȳ₁, ȳ₂)ᵀ (4)

In other words, the two entries of the beta parameter estimator correspond to the two sample averages ȳ₁ and ȳ₂. The variance parameter estimator is given by

σ̂² = (Σ_{i=1}^{n₁} (y_{1i} − ȳ₁)² + Σ_{i=1}^{n₂} (y_{2i} − ȳ₂)²)/((n₁ − 1) + (n₂ − 1)) =: s₁₂² (5)

where we defined the "pooled" or "averaged" sample variance as s₁₂², the square root of which corresponds to the "pooled" sample standard deviation s₁₂ := √s₁₂². Reformulating the null hypothesis H₀: μ₁ = μ₂ as H₀: μ₁ − μ₂ = 0, specifying the contrast vector as c := (1, −1)ᵀ and setting β₀ := 0 ∈ ℝ² then yields the familiar formula for the T-statistic of the independent two-sample T-test

T = (ȳ₁ − ȳ₂)/√((1/n₁ + 1/n₂) s₁₂²) (6)
Based on a data realization we may thus evaluate the T-statistic, which under the null hypothesis is distributed according to a t-distribution with n − p = n₁ + n₂ − 2 degrees of freedom. If the observed T-statistic value (or a more extreme value) is associated with a low probability under the null hypothesis, we may consider revising the null hypothesis.
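Equations (4) – (6) can again be checked with a tiny assumed data set. In the sketch below (Python/NumPy, values chosen so the arithmetic is easy to verify by hand), the pooled variance is 1 and T = −√(3/2):

```python
import numpy as np

# Tiny assumed two-group data set (for illustration only).
y1 = np.array([1.0, 2.0, 3.0])
y2 = np.array([2.0, 3.0, 4.0])
n1, n2 = y1.size, y2.size

y1_bar, y2_bar = y1.mean(), y2.mean()       # beta estimator entries (4)
s12_sq = (((y1 - y1_bar) ** 2).sum()
          + ((y2 - y2_bar) ** 2).sum()) / (n1 + n2 - 2)   # pooled variance (5)
T = (y1_bar - y2_bar) / np.sqrt((1.0 / n1 + 1.0 / n2) * s12_sq)  # (6)
```

Under H₀ this T would be referred to a t-distribution with n₁ + n₂ − 2 = 4 degrees of freedom.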
Proof of (4) – (6)

For the current GLM design, the beta parameter estimator is given by

β̂ = (XᵀX)⁻¹Xᵀy (4.1)
  = (n₁ 0; 0 n₂)⁻¹ (Σ_{i=1}^{n₁} y_{1i}, Σ_{i=1}^{n₂} y_{2i})ᵀ
  = (1/n₁ 0; 0 1/n₂) (Σ_{i=1}^{n₁} y_{1i}, Σ_{i=1}^{n₂} y_{2i})ᵀ
  = ((1/n₁) Σ_{i=1}^{n₁} y_{1i}, (1/n₂) Σ_{i=1}^{n₂} y_{2i})ᵀ

which corresponds to the sample averages of the data points y_{11}, …, y_{1n₁}, i.e. ȳ₁, and y_{21}, …, y_{2n₂}, i.e. ȳ₂, respectively. With n = n₁ + n₂ and p = 2, the variance estimator for the current GLM design is given by

σ̂² = (1/(n₁ + n₂ − 2)) (y − Xβ̂)ᵀ(y − Xβ̂) (5.1)
   = (1/(n₁ + n₂ − 2)) ((y_{11} − ȳ₁, …, y_{1n₁} − ȳ₁, y_{21} − ȳ₂, …, y_{2n₂} − ȳ₂)ᵀ)ᵀ (y_{11} − ȳ₁, …, y_{1n₁} − ȳ₁, y_{21} − ȳ₂, …, y_{2n₂} − ȳ₂)ᵀ
   = (Σ_{i=1}^{n₁} (y_{1i} − ȳ₁)² + Σ_{i=1}^{n₂} (y_{2i} − ȳ₂)²)/(n₁ + n₂ − 2)

which corresponds to the pooled sample variance s₁₂². Finally, setting c := (1, −1)ᵀ and β₀ := (0, 0)ᵀ, the T-statistic evaluates to

T = (cᵀβ̂ − cᵀβ₀)/√(σ̂² cᵀ(XᵀX)⁻¹c) (6.1)
  = (ȳ₁ − ȳ₂)/√(s₁₂² (1, −1)(1/n₁ 0; 0 1/n₂)(1, −1)ᵀ)
  = (ȳ₁ − ȳ₂)/√((1/n₁ + 1/n₂) s₁₂²)

□
(6) One-way ANOVA
The simplest way to think about the one-way analysis of variance (ANOVA) is to consider it the
extension of an (independent) two-sample t-test to more than two groups. It is helpful to modify the GLM
notation to reflect the one-way ANOVA layout of the data explicitly. Let 𝑚 ∈ ℕ denote the number of
groups or levels of the factor, let n_i ∈ ℕ denote the number of observations in group i ∈ ℕ_m, and let y_{ij} ∈ ℝ denote the dependent variable of the j-th unit in the i-th group (i ∈ ℕ_m, j ∈ ℕ_{n_i}). As usual, we conceive of y_{ij} as a realization of a random variable with a univariate Gaussian distribution. In the one-way ANOVA case, this distribution takes the form

p(y_{ij}) = N(y_{ij}; μ_i, σ²) ⇔ y_{ij} = μ_i + ε_{ij}, p(ε_{ij}) := N(ε_{ij}; 0, σ²), σ² > 0 (1)
where the variance parameter 𝜎2 is identical for all observations. The underlying assumption of the one-way
ANOVA model in terms of the deterministic GLM aspect is
𝜇𝑖 ≔ 𝜇0 + 𝛼𝑖 (2)
where 𝜇0 ∈ ℝ takes on the role of a common offset for all data groups, and 𝛼𝑖 ∈ ℝ represents the effect of
level i of the experimental factor. The one-way ANOVA model (1) can be formulated as a special case of the GLM. To this end, the design matrix X ∈ ℝ^{n×p} comprises p = m + 1 columns: a column of ones representing the constant offset (as in simple linear regression) and m columns of "indicator" variables. These indicator variables take on the value 1 for dependent variables corresponding to level i of the factor and the value 0 otherwise. For the exemplary case of 4 experimental groups with n_i (i ∈ ℕ₄) data points each, we have X ∈ ℝ^{n×5} with n := Σ_{i=1}^4 n_i. Using this design matrix and parameter formulation, we can write the GLM form of the one-way ANOVA layout as follows:
p(y) = N(y; Xβ, σ²I_n) (3)

where

y := (y_{11}, …, y_{1n₁}, y_{21}, …, y_{2n₂}, y_{31}, …, y_{3n₃}, y_{41}, …, y_{4n₄})ᵀ ∈ ℝ^n,

X := (1_{n₁} 1_{n₁} 0 0 0; 1_{n₂} 0 1_{n₂} 0 0; 1_{n₃} 0 0 1_{n₃} 0; 1_{n₄} 0 0 0 1_{n₄}) ∈ ℝ^{n×5}, β := (μ₀, α₁, α₂, α₃, α₄)ᵀ ∈ ℝ⁵ and σ² > 0 (4)

where 1_{n_i} and 0 denote n_i-dimensional vectors of ones and zeros, respectively.
To illustrate the use of one-way ANOVA in the context of the example data set of Table 1, we ignore the factor "Alcohol" and consider only the experimental factor "Age". We define four discrete levels (L) for this factor: L1: 15 – 22 years, L2: 23 – 30 years, L3: 31 – 38 years, and L4: 39 – 46 years. Regrouping the data accordingly results in the data layout of Table 6. In the example, we have m = 4 and y_{ij} is the DLPFC volume measure of the j-th subject in the i-th category of the age factor, where i ∈ ℕ₄ and j = 1,…,n_i with n₁ = n₂ = n₃ = n₄ = 8.
L1 L2 L3 L4
P 𝒊𝒋 𝒚𝟏𝒊: DLPFC Volume P 𝒊𝒋 𝒚𝟐𝒊: DLPFC Volume P 𝒊𝒋 𝒚𝟑𝒊 : DLPFC Volume P 𝒊𝒋 𝒚𝟒𝒊: DLPFC Volume
1 11 178.7708 9 21 154.4907 17 31 155.5286 25 41 139.3813
2 12 168.4660 10 22 158.3642 18 32 150.5144 26 42 145.1997
3 13 169.9513 11 23 172.1033 19 33 137.8262 27 43 123.7259
4 14 162.0778 12 24 162.6648 20 34 160.1183 28 44 130.7300
5 15 170.1884 13 25 165.4449 21 35 155.4419 29 45 114.1148
6 16 156.9287 14 26 142.2121 22 36 127.1715 30 46 151.1943
7 17 175.4092 15 27 154.3557 23 37 138.0237 31 47 121.7235
8 18 173.3972 16 28 145.6544 24 38 133.4589 32 48 140.9424
Table 6. The example data set in a one-way ANOVA layout. The participant labels in the P column correspond to the labels in Table 1 and the ij-indices correspond to the relabelled dependent variables y_{ij} (1 ≤ i ≤ 4, 1 ≤ j ≤ 8).
One-way ANOVA designs are usually visualized by depicting the group sample means and the associated standard deviations or standard errors of the mean.
Figure 5. Visualization of a one-way ANOVA design. Errorbars depict the within-group standard deviations.
Unfortunately, the GLM formulation of the one-way ANOVA design as described above requires a reformulation in order to enable the estimation of its parameters. As it stands, the design is "overparameterized". This problem may be viewed from at least three perspectives. From a data-analytical perspective, we have "more unknowns than observed variables": we can obtain average data from m groups, but have m + 1 parameters, α₁, …, α_m and the offset μ₀, to determine. From the perspective of the design matrix formulation (4), the design matrix is rank-deficient, because the first column is the sum of the last m columns. It can be shown that the rank-deficiency of X ∈ ℝ^{n×(m+1)} causes the cross-product matrix XᵀX ∈ ℝ^{(m+1)×(m+1)} to be rank-deficient as well. This in turn corresponds to XᵀX being non-invertible, which implies that the OLS beta estimator is not defined for the one-way ANOVA GLM formulation as defined so far. Finally, from the perspective of systems of linear equations, "overparameterization" implies that we have more unknown parameters than equations from which to determine them. A simple example is the system of linear equations

p₁ + p₂ + p₃ = 0
2p₁ + p₂ + p₃ = 1
⇔ (1 1 1; 2 1 1)(p₁, p₂, p₃)ᵀ = (0, 1)ᵀ (5)
In (5) we have two equations (and thus two "outcome measures" associated with a specific parameter combination) and three "parameters" p₁, p₂ and p₃. The problem is that different parameter values for p₁, p₂ and p₃ solve the system (or model) described by (5), and we thus cannot uniquely infer the parameters from the measurement outcomes. For example, both parameter vectors p_a = (1, 1, −2)ᵀ and p_b = (1, −1, 0)ᵀ solve the system of linear equations.
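The non-uniqueness of (5) can be verified directly. A sketch in Python/NumPy showing that both stated parameter vectors reproduce the same outcome vector, and that the matrix rank (2) is smaller than the number of unknowns (3):

```python
import numpy as np

# The coefficient matrix and outcome vector of equation (5).
A = np.array([[1., 1., 1.],
              [2., 1., 1.]])
target = np.array([0., 1.])

# The two parameter vectors given in the text.
p_a = np.array([1., 1., -2.])
p_b = np.array([1., -1., 0.])

same_a = np.allclose(A @ p_a, target)       # p_a solves the system
same_b = np.allclose(A @ p_b, target)       # ... and so does p_b
rank = np.linalg.matrix_rank(A)             # 2 equations for 3 unknowns
```

The same rank check applied to the one-way ANOVA design matrix of (4) would return m rather than m + 1 columns' worth of rank, which is the overparameterization in matrix form.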
To nevertheless obtain a useful one-way ANOVA model, the model of equations (1) – (4) needs to be reformulated. There are several ways in which this can be done. One approach is to set μ₀ = 0, i.e. to simply drop the constant offset μ₀. If this approach is chosen, the α_i (i ∈ ℕ_m) become the factor level expectations, and α_i represents the expected response at factor level i. While simple and attractive, this approach does not generalize well to models with more than one factor, and the so-called "reference cell method" is thus preferred. This approach entails, instead of setting the overall offset μ₀ to zero, setting one of the α_i's to zero. Conventionally, one sets α₁ := 0, but any of the groups could be chosen as the "reference cell" (or "reference level"). Importantly, in this approach, the parameter μ₀ becomes the expected response
of the reference cell, and 𝛼𝑖 becomes the effect of level 𝑖 of the factor compared to the reference cell. In
other words 𝛼𝑖 becomes the expected difference between a given level of the experimental factor and the
reference cell. The table below illustrates the reformulation at the single-data point level implementing the
reference cell method for the example case of four levels.
Original Formulation Reference Cell Formulation
Level 1 𝜇0 + 𝛼1 𝜇0
Level 2 𝜇0 + 𝛼2 𝜇0 + 𝛼2
Level 3 𝜇0 + 𝛼3 𝜇0 + 𝛼3
Level 4 𝜇0 + 𝛼4 𝜇0 + 𝛼4
Table 7 The reference cell reformulation of the one-way ANOVA model.
At the level of the GLM formulation, the reference cell method is equivalent to removing one of the indicator variables representing the levels of the factor. If we choose to set the first level's effect to zero, we obtain the following reformulation

p(y) := N(y; Xβ, σ²I_n) (6)

where

y := (y_{11}, …, y_{1n₁}, y_{21}, …, y_{2n₂}, y_{31}, …, y_{3n₃}, y_{41}, …, y_{4n₄})ᵀ ∈ ℝ^n,

X := (1_{n₁} 0 0 0; 1_{n₂} 1_{n₂} 0 0; 1_{n₃} 0 1_{n₃} 0; 1_{n₄} 0 0 1_{n₄}) ∈ ℝ^{n×4}, β := (μ₀, α₂, α₃, α₄)ᵀ ∈ ℝ⁴ and σ² > 0 (7)
We have now formulated an explicit GLM for the case of the one-way ANOVA. Parameter estimation and model inference can then proceed using the results of the general GLM theory.
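The reference-cell design matrix is of full column rank, so the OLS estimator is now defined. A sketch in Python/NumPy with m = 4 groups of two assumed, noise-free observations each, showing that β̂ recovers the reference-cell mean and the three group differences:

```python
import numpy as np

# Assumed group expectations and two noise-free observations per group.
means = np.array([160., 155., 145., 135.])
y = np.repeat(means, 2)
group = np.repeat(np.arange(4), 2)

# Reference-cell design: a column of ones plus indicators for groups 2-4.
X = np.column_stack([np.ones(8)]
                    + [(group == i).astype(float) for i in (1, 2, 3)])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # (mu0, alpha2, alpha3, alpha4)
```

Here μ̂₀ equals the first group's mean and each α̂_i equals the i-th group mean minus the reference-cell mean, exactly as the reference cell method prescribes.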
(7) Multifactorial designs and two-way ANOVA
Multifactorial designs are characterized by the fact that two or more independent experimental
factors are manipulated and all possible combinations of both factors assessed. They are usually referred to
simply as “factorial designs”. For example, a typical 2 × 2-factorial design employed in cognitive
neuroimaging could involve a stimulus manipulation (e.g. low and highly degraded visual stimuli) and a
cognitive manipulation (e.g. attended and unattended visual stimuli). Any form of two-dimensional 𝑛 ×𝑚 or
higher-dimensional 𝑛 ×𝑚 × 𝑝 × 𝑞 × … factorial design is conceivable. Due to experimental constraints and
the aim to measure each factorial combination from the same amount of experimental trials, 2 × 2-factorial
designs are probably the most prevalent designs in cognitive neuroimaging. Factorial designs allow for
measuring (1) the main effect of each factor, i.e. the differential variability in the outcome measure due to
different levels of this factor averaged over the other factors, (2) the interaction between factors. In intuitive
terms, an interaction in a 2 × 2-factorial design refers to a difference in a difference.
To illustrate the concept of a 2 × 2-factorial design before concerning ourselves with its GLM formulation, we consider the example data set of Table 1. Here, we define the experimental factors "Age" and "Alcohol" and allow each of these factors to take on only two levels: "younger than 31 years" (factor Age, level 1) and "31 years or older" (factor Age, level 2), and "less than 5 units of alcohol consumption" (factor Alcohol, level 1) and "5 or more units of alcohol consumption" (factor Alcohol, level 2). Each combination of a specific level of one factor with a specific level of the other factor is referred to as a "cell" of the design. 2 × 2-factorial designs are sensibly depicted using a square lattice (Figure 5), and average data from the different cells of the design are commonly depicted as a bar graph (Figure 6). According to the layout in the square lattice, the factors may also be referred to as "row" and "column" factors, respectively.
Figure 5. Conceptual visualization of a two-way ANOVA experimental design
Figure 6. Visualization of data obtained in a two-way ANOVA design. Errorbars depict the within-group standard deviations.
Based on the average observed DLPFC volume for each cell of the design and the various forms of variability about the means, the following set of questions may now be investigated:

1. Does DLPFC volume change with the age of the participant, irrespective of (i.e. averaged over) whether the participant consumes more or fewer units of alcohol? Colloquially, the answer to this question is referred to as the "main effect of age".

2. Does DLPFC volume change with the alcohol consumption of the participant, irrespective of (i.e. averaged over) whether the participant belongs to the young or the old age group? Colloquially, the answer to this question is referred to as the "main effect of alcohol".

3. Does the difference in DLPFC volume observed for the different levels of the age factor change with the different levels of the alcohol factor? Or, the other way round, does the difference in DLPFC volume observed for the different levels of the alcohol factor change between the old and young age groups? This difference of differences is colloquially referred to as the "interaction between age and alcohol".
In the current section, we first discuss the two-way ANOVA GLM formulation that applies if only the first two questions are of interest, and then extend this formulation to the case that all three questions are of interest. To formulate the two-way ANOVA GLM, it is helpful to first adapt the notation to the 2 × 2-factorial design. We start by considering the dependent variables. Specifically, we now use the notation

y_{ijk} ∈ ℝ, where i = 1,…,r, j = 1,…,c and k = 1,…,n_{ij} (1)

to denote the k-th data point in the cell corresponding to the combination of the i-th level of the "row factor" (i = 1,…,r) with the j-th level of the "column factor" (j = 1,…,c). Each cell comprises n_{ij} ∈ ℕ data points. For the special case of a 2 × 2 ANOVA design, we have r = c = 2.
Low (L1) High (L2)
P (𝑖𝑗𝑘) 𝒚𝟏𝟏𝒌: DLPFC Volume P (𝑖𝑗𝑘) 𝒚𝟏𝟐𝒌: DLPFC Volume
1 111 178.7708 2 121 168.4660
5 112 170.1884 3 122 169.9513
7 113 175.4092 4 123 162.0778
Young (L1) 8 114 173.3972 6 124 156.9287
11 115 172.1033 9 125 154.4907
12 116 162.6648 10 126 158.3642
13 117 165.4449 14 127 142.2121
15 118 154.3557 16 128 145.6544
P (𝑖𝑗𝑘) 𝒚𝟐𝟏𝒌: DLPFC Volume P (𝑖𝑗𝑘) 𝒚𝟐𝟐𝒌: DLPFC Volume
17 211 155.5286 19 221 137.8262
18 212 150.5144 22 222 127.1715
20 213 160.1183 23 223 138.0237
Old (L2) 21 214 155.4419 24 224 133.4589
25 215 139.3813 27 225 123.7259
26 216 145.1997 28 226 130.7300
30 217 151.1943 29 227 114.1148
32 218 140.9424 31 228 121.7235
Table 8 The example data set of Table 1 in a 2 × 2 ANOVA layout with row factor “Age”, taking on the levels “Young” and “Old” and the column factor “Alcohol” taking on the levels “Low” and “High”. Note that the column P denotes the original participant label, while the column (𝑖𝑗𝑘) denotes the relabelled dependent variable index.
For the 2 × 2 ANOVA design without interaction, we conceive of the expectation of each data variable y_{ijk} as the sum of the effects of the levels of each factor. In other words, we conceive of y_{ijk} as a realization of a random variable with a univariate Gaussian distribution of the form

p(y_{ijk}) = N(y_{ijk}; μ_{ij}, σ²) ⇔ y_{ijk} = μ_{ij} + ε_{ijk}, p(ε_{ijk}) := N(ε_{ijk}; 0, σ²) (2)

for k = 1,…,n_{ij}, where

μ_{ij} := μ₀ + α_i + β_j (3)

In this formulation, μ₀ represents a constant offset common to all cells, α_i (i = 1,…,r) represents the effect of the i-th level of the row factor, and β_j (j = 1,…,c) represents the effect of the j-th level of the column factor. In the design matrix formulation of the GLM defined in (2) and (3), the design matrix X ∈ ℝ^{n×(1+r+c)} comprises a column of ones representing the constant offset and two sets of indicator variables representing the r ∈ ℕ levels of the row factor and the c ∈ ℕ levels of the column factor, while the beta parameter vector encodes the effect of each level of each factor. That is, we have
p(y) = N(y; Xβ, σ²I_n) (4)

where

y := (y_{111}, …, y_{11n₁₁}, y_{121}, …, y_{12n₁₂}, y_{211}, …, y_{21n₂₁}, y_{221}, …, y_{22n₂₂})ᵀ ∈ ℝ^n,

X := (1 1 0 1 0; 1 1 0 0 1; 1 0 1 1 0; 1 0 1 0 1) ∈ ℝ^{n×5}, β := (μ₀, α₁, α₂, β₁, β₂)ᵀ ∈ ℝ⁵ and σ² > 0 (5)

where each row of the displayed pattern for X stands for a block of n_{ij} identical design matrix rows corresponding to the cells (1,1), (1,2), (2,1) and (2,2), respectively.
As for the one-way ANOVA, the model defined above is overparameterized. Effectively, we have five parameters and only four equations for the respective cell expectations. Viewed differently, on the level of equation (3) we could add a constant either to each of the α_i's or to each of the β_j's and subtract it from μ₀ without altering any of the expected responses. We thus require two constraints to identify the model. These correspond to setting

α₁ := β₁ := 0 (6)

and thus identifying the combination of the first level of the row factor and the first level of the column factor as the reference cell. The meaning of the remaining parameters is then as provided in Tables 9 and 10. The entries in these tables depict the expected dependent variable responses for each combination of levels of the row and column factors in terms of the initial formulation of the additive 2 × 2 ANOVA and in terms of its reference cell reformulation, respectively.
In the reference cell formulation of Table 10, μ₀ ∈ ℝ represents the expected response in the reference cell, α_i (i = 1, 2) represents the effect of level i of the row factor (compared to level 1) for any fixed level of the column factor, and β_j (j = 1, 2) represents the effect of level j of the column factor (compared to level 1) for any fixed level of the row factor. As for the one-way ANOVA, the parameters α₂ and β₂ encode the differences in expected values between the design cells. Note that the model is additive in the sense that the effect of each factor is the same at all levels of the other factor. To see this, consider moving from the first to the second row: the expected response increases by α₂, regardless of whether one moves down the first or the second column.
1 2
1 𝜇0 + 𝛼1 + 𝛽1 𝜇0 + 𝛼1 + 𝛽2
2 𝜇0 + 𝛼2 + 𝛽1 𝜇0 + 𝛼2 + 𝛽2
Table 9 Initial formulation of an over-parameterized two-way additive ANOVA GLM model
Table 10 Reference cell method reformulation of the two-way additive ANOVA GLM model
1 2
1 𝜇0 𝜇0 + 𝛽2
2 𝜇0 + 𝛼2 𝜇0 + 𝛼2 + 𝛽2
Equivalently, the design matrix defined in (5) is not of full column rank, because the row factor indicator variables (columns 2 and 3) as well as the column factor indicator variables (columns 4 and 5) add up to the constant offset indicator (column 1). The two required constraints correspond to dropping the indicator variables for the first row factor level and the first column factor level. This results in the following reformulation of the GLM for the 2 × 2 ANOVA layout:

p(y) = N(y; Xβ, σ²I_n) (7)

where

y := (y_{111}, …, y_{11n₁₁}, y_{121}, …, y_{12n₁₂}, y_{211}, …, y_{21n₂₁}, y_{221}, …, y_{22n₂₂})ᵀ ∈ ℝ^n,

X := (1 0 0; 1 0 1; 1 1 0; 1 1 1) ∈ ℝ^{n×3}, β := (μ₀, α₂, β₂)ᵀ ∈ ℝ³ and σ² > 0 (8)

where each row of the displayed pattern for X stands for a block of n_{ij} identical design matrix rows corresponding to the cells (1,1), (1,2), (2,1) and (2,2), respectively.
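The identified additive design can be checked numerically. A sketch in Python/NumPy with one assumed observation per cell and assumed parameter values, showing that the reference-cell design matrix has full column rank and that OLS recovers μ₀, α₂ and β₂ from noise-free cell data:

```python
import numpy as np

# Reference-cell 2x2 additive design; rows correspond to the cells
# (1,1), (1,2), (2,1), (2,2), one (assumed) observation per cell.
X = np.array([[1., 0., 0.],     # cell (1,1): mu0
              [1., 0., 1.],     # cell (1,2): mu0 + beta2
              [1., 1., 0.],     # cell (2,1): mu0 + alpha2
              [1., 1., 1.]])    # cell (2,2): mu0 + alpha2 + beta2

# Assumed parameter values (for illustration only).
theta = np.array([170.0, -20.0, -10.0])      # (mu0, alpha2, beta2)
y = X @ theta                                # noise-free cell data

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

Note that the row-factor effect α₂ enters both second-row cells identically: this additivity is exactly what the interaction term of the next formulation relaxes.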
Based on the formulation of the two-way ANOVA design in (8) we may test for significant main effects of
either experimental factor. We cannot, however, test for a significant interaction, because this has not been
modelled by the GLM. In order to allow for the modelling of interaction effects in 2 × 2-factorial designs, the
GLM of the previous section is modified as follows
$$
p(y_{ijk}) = N(y_{ijk}; \mu_{ij}, \sigma^2) \;\Leftrightarrow\; y_{ijk} = \mu_{ij} + \varepsilon_{ijk}, \quad p(\varepsilon_{ijk}) := N(\varepsilon_{ijk}; 0, \sigma^2) \quad (10)
$$
for 𝑘 = 1,… , 𝑛𝑖𝑗 where we now define
𝜇𝑖𝑗 ≔ 𝜇0 + 𝛼𝑖 + 𝛽𝑗 + (𝛼𝛽)𝑖𝑗 (11)
In this formulation the first three terms are familiar: 𝜇0 is a constant, and 𝛼𝑖 and 𝛽𝑗 are the main effects of level 𝑖 ∈ ℕ𝑟 of the row factor and level 𝑗 ∈ ℕ𝑐 of the column factor, respectively. The new term (𝛼𝛽)𝑖𝑗 is an interaction effect. It represents the effect of the combination of levels 𝑖 and 𝑗 of the row and column factors. The notation (𝛼𝛽) should be understood as a single symbol, not as a product. One could have chosen 𝛾𝑖𝑗 to denote this interaction effect, but the notation (𝛼𝛽)𝑖𝑗 is more suggestive and reminds us that the term corresponds to an effect due to the combination of the levels 𝑖 and 𝑗 of each factor. Table 11 visualizes the two-way ANOVA with interaction parameters.
                 Column factor 1                  Column factor 2
Row factor 1     𝜇0 + 𝛼1 + 𝛽1 + (𝛼𝛽)11           𝜇0 + 𝛼1 + 𝛽2 + (𝛼𝛽)12
Row factor 2     𝜇0 + 𝛼2 + 𝛽1 + (𝛼𝛽)21           𝜇0 + 𝛼2 + 𝛽2 + (𝛼𝛽)22

Table 11 Initial formulation of an over-parameterized two-way ANOVA GLM model with interaction.
As always, equations (10) and (11) can be reformulated in design matrix form:
𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛)
where
$$
y := \begin{pmatrix} y_{111} \\ \vdots \\ y_{11n_{11}} \\ y_{121} \\ \vdots \\ y_{12n_{12}} \\ y_{211} \\ \vdots \\ y_{21n_{21}} \\ y_{221} \\ \vdots \\ y_{22n_{22}} \end{pmatrix} \in \mathbb{R}^n, \quad
X := \begin{pmatrix}
1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1
\end{pmatrix} \in \mathbb{R}^{n \times 9}, \quad
\beta := \begin{pmatrix} \mu_0 \\ \alpha_1 \\ \alpha_2 \\ \beta_1 \\ \beta_2 \\ (\alpha\beta)_{11} \\ (\alpha\beta)_{12} \\ (\alpha\beta)_{21} \\ (\alpha\beta)_{22} \end{pmatrix} \in \mathbb{R}^9 \text{ and } \sigma^2 > 0
$$
As can be seen, addition of the second and third columns, of the fourth and fifth columns, and of the sixth to ninth columns each results in the first column, yielding a multiply rank-deficient design matrix. We thus re-express (10) and (11) in terms of an extended reference cell method, which sets to zero all parameters involving the first row or the first column of the two-way layout, such that
𝛼1 ≔ 𝛽1 ≔ (𝛼𝛽)1𝑗 ≔ (𝛼𝛽)𝑖1 ≔ 0 (𝑗 ∈ ℕ𝑐 , 𝑖 ∈ ℕ𝑟) (12)
The meaning of the remaining parameters can then be read off from Table 12. Here, 𝜇0 ∈ ℝ is the expected response in the reference cell, just as before. The main effects are now more specialized: 𝛼2 is the expected difference in response due to level 2 of the row factor, compared to level 1, when the column factor is at level 1. 𝛽2 is the expected difference due to level 2 of the column factor, compared to level 1, when the row factor is at level 1. The interaction term (𝛼𝛽)22 is the additional effect of level 2 of the row factor, compared to level 1, when the column factor is at level 2 rather than 1. This term can also be interpreted as the additional effect of level 2 of the column factor, compared to level 1, when the row factor is at level 2 rather than 1. The key feature of this model is that the effect of a factor now depends on the level of the other. For example, the effect of level 2 of the row factor, compared to level 1, is 𝛼2 in the first column and 𝛼2 + (𝛼𝛽)22 in the second column.
The design matrix formulation of the 2 × 2 ANOVA GLM with interaction after reformulation based on the extended reference cell method then takes the following form: The design matrix is of size

𝑛 × (1 + (𝑟 − 1) + (𝑐 − 1) + (𝑟 − 1) ⋅ (𝑐 − 1)) (13)

Specifically, it comprises a column of ones to represent the constant offset 𝜇0, a set of (𝑟 − 1) ∈ ℕ indicator variables representing the row effects, a set of (𝑐 − 1) ∈ ℕ indicator variables representing the column effects, and a set of (𝑟 − 1) ⋅ (𝑐 − 1) ∈ ℕ indicator variables representing the interactions. The easiest way to compute the values of the interaction indicator variables is as products of the row and column indicator variable values. In other words, if 𝑟𝑖 takes the value 1 for observations in row 𝑖 and 0 otherwise, and 𝑐𝑗 takes the value 1 for observations in column 𝑗 and 0 otherwise, then the product 𝑟𝑖𝑐𝑗 takes the value 1 for observations that are in row 𝑖 and column 𝑗, and is 0 for all others.

                 Column factor 1        Column factor 2
Row factor 1     𝜇0                     𝜇0 + 𝛽2
Row factor 2     𝜇0 + 𝛼2                𝜇0 + 𝛼2 + 𝛽2 + (𝛼𝛽)22

Table 12 Reference cell method reformulation of the two-way ANOVA GLM model with interaction.
For the case of a 2 × 2 ANOVA model with interaction, the design matrix formulation after reference cell
reformulation thus takes the following form
𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (14)
where
$$
y := \begin{pmatrix} y_{111} \\ \vdots \\ y_{11n_{11}} \\ y_{121} \\ \vdots \\ y_{12n_{12}} \\ y_{211} \\ \vdots \\ y_{21n_{21}} \\ y_{221} \\ \vdots \\ y_{22n_{22}} \end{pmatrix} \in \mathbb{R}^n, \quad
X := \begin{pmatrix}
1 & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & 1 & 1
\end{pmatrix} \in \mathbb{R}^{n \times 4}, \quad
\beta := \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \beta_2 \\ (\alpha\beta)_{22} \end{pmatrix} \in \mathbb{R}^4 \text{ and } \sigma^2 > 0 \quad (15)
$$
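The interaction column of this design matrix can be generated as the element-wise product of the two main-effect indicator columns. A minimal numpy sketch, with hypothetical cell sizes:

```python
import numpy as np

# Hypothetical cell sizes for the 2 x 2 layout with interaction
n_cells = [3, 3, 3, 3]
n = sum(n_cells)

row2 = np.repeat([0, 0, 1, 1], n_cells)   # row factor level-2 indicator
col2 = np.repeat([0, 1, 0, 1], n_cells)   # column factor level-2 indicator
inter = row2 * col2                       # element-wise product of indicators
X = np.column_stack([np.ones(n), row2, col2, inter])

# Full column rank: (mu_0, alpha_2, beta_2, (ab)_22) are all identifiable
assert np.linalg.matrix_rank(X) == 4

# The interaction indicator equals 1 only for observations in cell (2, 2)
assert inter.sum() == n_cells[3]
```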
(8) Analysis of Covariance
The simplest way to think about the analysis of covariance (ANCOVA) is as a combination between
the categorical ANOVA-type and the continuous regression-type approaches. To illustrate this we again
consider the example data set and now conceive the participant’s age as a categorical factor with levels
young (< 31 years of age) and old (≥ 32 years of age) and alcohol consumption as a continuous factor. As
above, we modify our notation to reflect the structure of the data. To this end let 𝑚 ∈ ℕ denote the number
of groups or levels of the discrete factor, 𝑛𝑖 the number of observations in group 𝑖 ∈ ℕ𝑚, 𝑦𝑖𝑗 the value of the
dependent variable for the 𝑗th unit in the 𝑖th group, and 𝑥𝑖𝑗 the value
of the continuous factor for the 𝑗th unit in the 𝑖th group, where 𝑗 = 1, . . . , 𝑛𝑖 and 𝑖 = 1, . . . , 𝑚. This
notation is illustrated for the example data set in Table 13.
𝒊 = 𝟏: Young Group 𝒊 = 𝟐: Old Group
P 𝒙𝟏𝒋 : Alcohol 𝒚𝟏𝒋: DLPFC Volume P 𝒙𝟐𝒋 : Alcohol 𝒚𝟐𝒋: DLPFC Volume
1 3 178.7708 17 3 155.5286
2 6 168.4660 18 4 150.5144
3 5 169.9513 19 7 137.8262
4 7 162.0778 20 1 160.1183
5 4 170.1884 21 2 155.4419
6 8 156.9287 22 8 127.1715
7 1 175.4092 23 5 138.0237
8 2 173.3972 24 6 133.4589
9 7 154.4907 25 4 139.3813
10 5 158.3642 26 3 145.1997
11 1 172.1033 27 7 123.7259
12 3 162.6648 28 5 130.7300
13 2 165.4449 29 8 114.1148
14 8 142.2121 30 1 151.1943
15 4 154.3557 31 6 121.7235
16 6 145.6544 32 2 140.9424
Table 13 The example data set of Section 7.2 rearranged in an analysis of covariance layout. Note that the column P denotes the original participant label.
As usual each data point 𝑦𝑖𝑗 is conceived as a realization of a random variable with univariate
Gaussian distribution. In the ANCOVA case with one discrete factor taking on two levels and one continuous
factor, this distribution takes the form
$$
p(y_{ij}) = N(y_{ij}; \mu_{ij}, \sigma^2) \;\Leftrightarrow\; y_{ij} = \mu_{ij} + \varepsilon_{ij}, \quad p(\varepsilon_{ij}) := N(\varepsilon_{ij}; 0, \sigma^2), \quad \sigma^2 > 0 \quad (1)
$$
where 𝑖 = 1, 2 and 𝑗 = 1,… , 𝑛𝑖 for each value of 𝑖. Then, to express the dependence of the expected
response 𝜇𝑖𝑗 on the discrete factor with two levels we use an ANOVA-type model of the form
𝜇𝑖𝑗 ≔ 𝜇0 + 𝛼𝑖 (𝑖 = 1,2) (2)
whereas to model the effect of a continuous predictor, we use a regression-type model of the form
𝜇𝑖𝑗 ≔ 𝜇0 + 𝛽1𝑥𝑖𝑗 (3)
where 𝑗 = 1,… , 𝑛𝑖 for each value of 𝑖 = 1,2. Combining these models we obtain the additive ANCOVA model
𝜇𝑖𝑗 ≔ 𝜇0 + 𝛼𝑖 + 𝛽1𝑥𝑖𝑗 (4)
Essentially, this model defines a set of straight-line regressions, one for each level of the discrete
factor. These lines have different offsets 𝜇0 + 𝛼𝑖, but a common slope 𝛽1; in other words, they are parallel. The common slope 𝛽1 represents the effect of the continuous covariate at any level of the factor, and the differences in the offsets 𝛼𝑖 (𝑖 = 1,2) represent the effects of the discrete factor at any given value of the covariate.
The ANCOVA model can be formulated in design matrix form by letting the design matrix 𝑋 ∈
ℝ𝑛×(𝑚+2) have a column of 1's representing the constant, a set of 2 indicator variables representing the
levels of the discrete factor, and a column with the values of the continuous variate. Together, this amounts
to the following formulation of the ANCOVA model with one discrete, two-level factor, and one continuous
factor:
𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (5)
where
$$
y := \begin{pmatrix} y_{1,1} \\ \vdots \\ y_{1,n_1} \\ y_{2,1} \\ \vdots \\ y_{2,n_2} \end{pmatrix} \in \mathbb{R}^n, \quad
X := \begin{pmatrix}
1 & 1 & 0 & x_{1,1} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & 0 & x_{1,n_1} \\
1 & 0 & 1 & x_{2,1} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 1 & x_{2,n_2}
\end{pmatrix} \in \mathbb{R}^{n \times 4}, \quad
\beta := \begin{pmatrix} \mu_0 \\ \alpha_1 \\ \alpha_2 \\ \beta_1 \end{pmatrix} \in \mathbb{R}^4 \text{ and } \sigma^2 > 0 \quad (6)
$$
As for the ANOVA models discussed above, the model as defined by the equations above is not identified: one could add a constant to each 𝛼𝑖 and subtract it from 𝜇0 without changing any of the expected values. To solve this problem, we apply the reference cell method and set 𝛼1 ≔ 0, so that 𝜇0 becomes the intercept of the reference cell, and 𝛼𝑖 becomes the difference in intercepts between level 𝑖 and level 1 of the factor. At the level of the GLM formulation, the model is not of full column rank because the indicator variables add up to the constant, so one of them is removed to obtain the reference cell parameterization shown in expression (7) below.
$$
y := \begin{pmatrix} y_{1,1} \\ \vdots \\ y_{1,n_1} \\ y_{2,1} \\ \vdots \\ y_{2,n_2} \end{pmatrix} \in \mathbb{R}^n, \quad
X := \begin{pmatrix}
1 & 0 & x_{1,1} \\ \vdots & \vdots & \vdots \\ 1 & 0 & x_{1,n_1} \\
1 & 1 & x_{2,1} \\ \vdots & \vdots & \vdots \\ 1 & 1 & x_{2,n_2}
\end{pmatrix} \in \mathbb{R}^{n \times 3}, \quad
\beta := \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \beta_1 \end{pmatrix} \in \mathbb{R}^3 \quad (7)
$$
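The parallel-lines interpretation of this parameterization can be checked by simulation. The following numpy sketch uses hypothetical group sizes, covariate values, and parameter values (not the example data set) and recovers the parameters by ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical group sizes and covariate values for the two factor levels
n1 = n2 = 20
x1 = rng.uniform(1, 8, n1)
x2 = rng.uniform(1, 8, n2)

# Reference cell design matrix: constant, level-2 indicator, covariate
X = np.column_stack([np.ones(n1 + n2),
                     np.concatenate([np.zeros(n1), np.ones(n2)]),
                     np.concatenate([x1, x2])])

# Simulate data with known (mu_0, alpha_2, beta_1) and recover them by OLS;
# the model corresponds to two parallel regression lines with offsets
# mu_0 and mu_0 + alpha_2 and common slope beta_1
beta_true = np.array([175.0, -20.0, -3.0])
y = X @ beta_true + rng.normal(0.0, 1.0, n1 + n2)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_true, atol=2.0)
```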
Figure 7. Visualization of a purely additive ANCOVA analysis of the example data set with the categorical factor “age” and the continuous factor “alcohol consumption”. Note that in the purely additive design, a common slope parameter is fitted to all levels of the categorical factor.
Finally, we consider an ANCOVA model which allows for group-specific slopes, i.e. a model that
explicitly models the interaction between the categorical and the continuous experimental factor. In
structural form, we now assume that the expectation of the 𝑗th outcome measure on the 𝑖th level of the
categorical factor is given by
𝜇𝑖𝑗 ≔ (𝜇0 + 𝛼𝑖) + (𝛽0 + 𝛾𝑖)𝑥𝑖𝑗 (8)
for 𝑖 = 1,… ,𝑚 and 𝑗 = 1,… , 𝑛𝑖. In the model described by (8), (𝜇0 + 𝛼𝑖) represents the group specific
offset of the regression lines of each group, while (𝛽0 + 𝛾𝑖) represents the group specific regression line
slopes. Both are described by the combination of a common offset and slope, 𝜇0 and 𝛽0, respectively, and a
group specific additive effect to offset and slope, 𝛼𝑖 and 𝛾𝑖 (𝑖 = 1,… ,𝑚), respectively.
We now consider the model described by (8) in more detail for the case of 𝑚 ≔ 2. Resolving the brackets in equation (8) yields

$$
\mu_{1j} = \mu_0 + \alpha_1 + \beta_0 x_{1j} + \gamma_1 x_{1j} \quad (9)
$$

and

$$
\mu_{2j} = \mu_0 + \alpha_2 + \beta_0 x_{2j} + \gamma_2 x_{2j} \quad (10)
$$
In design matrix form, we can write (9) and (10) as

$$
y := \begin{pmatrix} y_{1,1} \\ \vdots \\ y_{1,n_1} \\ y_{2,1} \\ \vdots \\ y_{2,n_2} \end{pmatrix} \in \mathbb{R}^n, \quad
X := \begin{pmatrix}
1 & 1 & 0 & x_{1,1} & x_{1,1} & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & 0 & x_{1,n_1} & x_{1,n_1} & 0 \\
1 & 0 & 1 & x_{2,1} & 0 & x_{2,1} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 1 & x_{2,n_2} & 0 & x_{2,n_2}
\end{pmatrix} \in \mathbb{R}^{n \times 6}, \quad
\beta := \begin{pmatrix} \mu_0 \\ \alpha_1 \\ \alpha_2 \\ \beta_0 \\ \gamma_1 \\ \gamma_2 \end{pmatrix} \in \mathbb{R}^6 \quad (11)
$$
In this design matrix, the first column is the sum of the second and third columns, while the fourth column is the sum of the fifth and sixth columns. The model is thus overparameterized, and we will use the reference cell restrictions to render it estimable. Specifically, we set 𝛼1 ≔ 𝛾1 ≔ 0. This changes the interpretation of the remaining parameters as follows: 𝜇0 and 𝛽0 correspond to the offset and slope of the reference cell, i.e., the first group, and 𝛼2 and 𝛾2 correspond to the expected differences in offset and slope when the categorical factor is at level 2 rather than 1. In design matrix form, we obtain
$$
y := \begin{pmatrix} y_{1,1} \\ \vdots \\ y_{1,n_1} \\ y_{2,1} \\ \vdots \\ y_{2,n_2} \end{pmatrix} \in \mathbb{R}^n, \quad
X := \begin{pmatrix}
1 & 0 & x_{1,1} & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & x_{1,n_1} & 0 \\
1 & 1 & x_{2,1} & x_{2,1} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & x_{2,n_2} & x_{2,n_2}
\end{pmatrix} \in \mathbb{R}^{n \times 4}, \quad
\beta := \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \beta_0 \\ \gamma_2 \end{pmatrix} \in \mathbb{R}^4 \quad (12)
$$
Note that the design matrix column that models the difference in slopes between the levels of the categorical factor, i.e., the last column, can be conceived as the element-wise product of the column modelling the difference in offset between levels of the categorical factor (the second column) and the column modelling the common slope (the third column). In this representation, 𝛾2 is also referred to as the “interaction effect”, because it models how the effect of the continuous factor differs between levels of the categorical factor. A common application of this type of ANCOVA design with interaction in cognitive neuroimaging are so-called “psychophysiological interaction” (PPI) analyses. In Figure 8, we visualize the ANCOVA design with interaction applied to the example data set of Table 1.
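The rank deficiency of the six-column formulation and the product construction of the interaction column can both be verified numerically. A short numpy sketch with hypothetical group sizes and covariate values:

```python
import numpy as np

rng = np.random.default_rng(1)
n1 = n2 = 10
x = np.concatenate([rng.uniform(1, 8, n1), rng.uniform(1, 8, n2)])
g1 = np.concatenate([np.ones(n1), np.zeros(n2)])   # level-1 indicator
g2 = 1.0 - g1                                      # level-2 indicator
ones = np.ones(n1 + n2)

# Over-parameterized six-column model: column 1 = column 2 + column 3
# and column 4 = column 5 + column 6, so only four columns are independent
X_full = np.column_stack([ones, g1, g2, x, g1 * x, g2 * x])
assert np.linalg.matrix_rank(X_full) == 4

# Reference cell version: the slope-difference column is the element-wise
# product of the level-2 indicator and the covariate column
X_ref = np.column_stack([ones, g2, x, g2 * x])
assert np.linalg.matrix_rank(X_ref) == 4
```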
Figure 8. Visualization of an ANCOVA analysis with interaction of the example data set with the categorical factor “age” and the continuous factor “alcohol consumption”. Note that in the ANCOVA design with interaction, different slopes result for different levels of the categorical factor.
Study questions
1. Explain the notions of categorical, continuous, and multifactorial experimental designs.
2. Write down the design matrix formulation for the independent and identical sampling from a univariate Gaussian distribution.
3. Write down the design matrix formulation of a simple linear regression model. What do the two beta parameters refer to?
4. Write down the design matrix formulation of a multiple linear regression model comprising three parametric regressors.
5. Write down the design matrix formulation of a one-sample t-test. In which experimental situation is a one-sample t-test
appropriate?
6. Write down the design matrix formulation of an independent two-sample t-test. In which experimental situation is an
independent two-sample t-test appropriate?
7. Verbally, explain the notion of the reference cell method reformulation of ANOVA GLM designs.
8. Verbally explain the notion of a statistical interaction using a 2 x 2 factorial experimental design of your choice.
9. Discuss the commonalities and differences between one-way ANOVA designs, additive two-way ANOVA design, and two-way
ANOVA designs with interaction.
10. Write down the design matrix formulation of a 2 x 2 ANOVA design with interaction after its reference cell formulation.
11. Write down the design matrix formulation of an additive ANCOVA model with one two-level discrete and one parametric
experimental factor after its reference cell reformulation.
12. Write down the design matrix formulation of an ANCOVA model with interaction with one two-level discrete and one
parametric experimental factor after its reference cell reformulation.
Study Questions Answers
1. In categorical designs the independent variable, also referred to as the experimental factor, takes on discrete levels that usually
do not bear a quantitative relationship to one another. In continuous designs, the experimental variable or experimental
variable takes on quantitative values over a specific range with explicit quantitative relationships. In multifactorial designs,
typically multiple categorical experimental factors are crossed, such that the dependent variable is observed for all
combinations of all levels of all factors.
2. Sampling 𝑛 times independently and identically from a univariate Gaussian distribution can be written in design matrix form as

$$
p(y) := N(y; X\beta, \sigma^2 I_n) \text{ where } y \in \mathbb{R}^n, \; X := (1, \ldots, 1)^T \in \mathbb{R}^{n \times 1}, \; \beta \in \mathbb{R} \text{ and } \sigma^2 > 0
$$
3. Simple linear regression for a data set comprising 𝑛 observations can be written in design matrix form as
$$
p(y) := N(y; X\beta, \sigma^2 I_n) \text{ where } y \in \mathbb{R}^n, \; X := \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \in \mathbb{R}^{n \times 2}, \; \beta \in \mathbb{R}^2 \text{ and } \sigma^2 > 0
$$
The first component of the beta parameter vector 𝛽 ∈ ℝ2 corresponds to the offset of the simple linear regression line, i.e. its
crossing of the y-axis, while the second component of the beta parameter vector corresponds to the slope of the simple linear
regression line.
4. For three parametric regressors, denoted here by the 𝑛-dimensional vectors 𝑥(1), 𝑥(2), 𝑥(3) ∈ ℝ𝑛, the multiple regression model for a data set 𝑦 ∈ ℝ𝑛 can be written in design matrix form as

$$
p(y) := N(y; X\beta, \sigma^2 I_n) \text{ where } y \in \mathbb{R}^n, \; X := \begin{pmatrix} 1 & x_1^{(1)} & x_1^{(2)} & x_1^{(3)} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_n^{(1)} & x_n^{(2)} & x_n^{(3)} \end{pmatrix} \in \mathbb{R}^{n \times 4}, \; \beta \in \mathbb{R}^4 \text{ and } \sigma^2 > 0
$$
5. The design matrix form of the GLM underlying the one-sample T-Test can be written as

$$
p(y) := N(y; X\beta, \sigma^2 I_n) \text{ where } y \in \mathbb{R}^n, \; X := (1, \ldots, 1)^T \in \mathbb{R}^{n \times 1}, \; \beta \in \mathbb{R} \text{ and } \sigma^2 > 0
$$

It is appropriate in single-factor, single-categorical-level designs and can be used to evaluate whether the null hypothesis that all data points were generated from univariate Gaussian distributions with an identical expectation parameter of a specified value (typically zero) can be rejected.
6. In its design matrix formulation, the two-sample T-Test for independent, equally-sized samples under the assumption of equality of the respective group variances for data of two groups 𝐴 and 𝐵 with 𝑛 ≔ 𝑛𝐴 + 𝑛𝐵 can be written as

$$
p(y) := N(y; X\beta, \sigma^2 I_n), \text{ where } y := \begin{pmatrix} y_1^A \\ \vdots \\ y_{n_A}^A \\ y_1^B \\ \vdots \\ y_{n_B}^B \end{pmatrix} \in \mathbb{R}^n, \; X := \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 0 & 1 \end{pmatrix} \in \mathbb{R}^{n \times 2}, \; \beta \in \mathbb{R}^2 \text{ and } \sigma^2 > 0
$$
Two-sample T-tests as formulated above are appropriate in experimental situations in which data from two groups of different
experimental subjects are available, the variances of the two groups can be assumed to be equal, and the interest lies in whether the underlying distributions have identical expectations.
7. In an ANOVA design, the “reference cell method” reformulation corresponds to setting one of the experimental effects to zero, usually the first effect of the first cell. Using this approach, the offset parameter becomes the expected response of the reference cell, and all other effects become the level-dependent effects compared to the reference cell. In other words, all other effects become the expected differences between a given level of the experimental factor and the reference cell.
8. A statistical interaction in a 2 x 2 factorial design refers to a difference in a difference. For example, if both the visual coherence (low/high) and the contrast (low/high) of a stimulus are manipulated and a data pattern suggests faster stimulus recognition times for low visual coherence than high visual coherence when the contrast is high, but slower stimulus recognition times for low visual coherence than high visual coherence when the contrast is low, one would speak of an interaction.
9. One-way ANOVA designs, additive two-way ANOVA design, and two-way ANOVA designs with interaction can all be formulated
using the GLM notation and the reference cell method reformulation. They differ with respect to how many columns are included in
the GLM design matrix.
10. Let 𝑛𝑖𝑗 refer to the number of data points 𝑦𝑖𝑗𝑘 (𝑘 = 1,… , 𝑛𝑖𝑗) of the 𝑖th (𝑖 = 1,2) level of the first and 𝑗th (𝑗 = 1,2) level of the
second factor. Then, in its design matrix formulation, the 2 x 2 ANOVA design with interaction takes the following form
$$
p(y) = N(y; X\beta, \sigma^2 I_n) \text{ where } y := \begin{pmatrix} y_{111} \\ \vdots \\ y_{11n_{11}} \\ y_{121} \\ \vdots \\ y_{12n_{12}} \\ y_{211} \\ \vdots \\ y_{21n_{21}} \\ y_{221} \\ \vdots \\ y_{22n_{22}} \end{pmatrix} \in \mathbb{R}^n, \; X := \begin{pmatrix}
1 & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & 1 & 1
\end{pmatrix} \in \mathbb{R}^{n \times 4}, \; \beta := \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \beta_2 \\ (\alpha\beta)_{22} \end{pmatrix} \in \mathbb{R}^4 \text{ and } \sigma^2 > 0
$$
11. For data vectors 𝑦1 ∈ ℝ𝑛1 and 𝑦2 ∈ ℝ𝑛2, corresponding to the data observed for the first and second level of the two-level discrete experimental factor, and corresponding vectors 𝑥1 ∈ ℝ𝑛1 and 𝑥2 ∈ ℝ𝑛2 for the values of the parametric/continuous factor, the design matrix formulation of an additive ANCOVA model with one two-level discrete and one parametric experimental factor after its reference cell reformulation is given by

$$
p(y) = N(y; X\beta, \sigma^2 I_n) \text{ where } y := \begin{pmatrix} y_{1,1} \\ \vdots \\ y_{1,n_1} \\ y_{2,1} \\ \vdots \\ y_{2,n_2} \end{pmatrix} \in \mathbb{R}^n, \; X := \begin{pmatrix}
1 & 0 & x_{1,1} \\ \vdots & \vdots & \vdots \\ 1 & 0 & x_{1,n_1} \\
1 & 1 & x_{2,1} \\ \vdots & \vdots & \vdots \\ 1 & 1 & x_{2,n_2}
\end{pmatrix} \in \mathbb{R}^{n \times 3}, \; \beta := \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \beta_1 \end{pmatrix} \in \mathbb{R}^3 \text{ and } \sigma^2 > 0
$$
12. Let 𝑦𝑖,𝑗 denote the 𝑗th data point (𝑗 = 1,… , 𝑛𝑖) at the 𝑖th level of the discrete factor and 𝑥𝑖,𝑗 (𝑖 = 1,2, 𝑗 = 1,… , 𝑛𝑖) denote the corresponding value of the parametric factor. Then the design matrix formulation of an ANCOVA model with interaction with one two-level discrete and one parametric experimental factor after its reference cell reformulation is given by

$$
p(y) = N(y; X\beta, \sigma^2 I_n) \text{ where } y := \begin{pmatrix} y_{1,1} \\ \vdots \\ y_{1,n_1} \\ y_{2,1} \\ \vdots \\ y_{2,n_2} \end{pmatrix} \in \mathbb{R}^n, \; X := \begin{pmatrix}
1 & 0 & x_{1,1} & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & x_{1,n_1} & 0 \\
1 & 1 & x_{2,1} & x_{2,1} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & x_{2,n_2} & x_{2,n_2}
\end{pmatrix} \in \mathbb{R}^{n \times 4}, \; \beta := \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \beta_0 \\ \gamma_2 \end{pmatrix} \in \mathbb{R}^4 \text{ and } \sigma^2 > 0
$$
Advanced Theory of the GLM
The generalized least-squares estimator and whitening
(1) Motivation
A central assumption in the estimation and inference theory discussed for the GLM
$$
y = X\beta + \varepsilon, \text{ where } y \in \mathbb{R}^n, \; X \in \mathbb{R}^{n \times p}, \; \beta \in \mathbb{R}^p \text{ and } p(\varepsilon) = N(\varepsilon; 0, \sigma^2 I_n) \quad (1)
$$
thus far has been the sphericity of the error covariance matrix, i.e. the fact that
$$
\mathrm{Cov}(\varepsilon) = \sigma^2 I_n \quad (2)
$$
In applications of the GLM, non-spherical error covariance matrices arise frequently, for example in the context of repeated measures designs (see Section “Repeated measures ANOVA”), mixed linear models (see Sections “Mixed Linear Models” and “Estimation of Mixed Linear Models”), or in the application of the GLM to time-series data with serial error correlations in the context of FMRI data analysis (see Section “FMRI serial correlations”). The aim of the current section is to establish why non-spherical error covariance matrices necessitate a modification of the GLM estimation and inference theory, and to discuss “generalized least-squares” estimation, a fundamental approach that allows the results of the GLM derived under the assumption of spherical error covariance matrices to be used also in the context of non-spherical error covariance matrices. Specific forms of non-spherical error covariance matrices will be elucidated in subsequent sections. Notably, throughout we assume that the error distribution parameters are known.
Formally, we consider the estimation of the beta parameter vector in a general GLM with non-spherical covariance matrix, i.e., the model

$$
y = X\beta + \varepsilon, \text{ where } y \in \mathbb{R}^n, \; X \in \mathbb{R}^{n \times p}, \; \beta \in \mathbb{R}^p \text{ and } p(\varepsilon) = N(\varepsilon; 0, \sigma^2 V) \quad (3)
$$

where 𝑉 ∈ ℝ𝑛×𝑛 is a known symmetric and positive-definite, but not necessarily spherical, matrix. In other words, 𝑉 is not necessarily of the form 𝐼𝑛, which we abbreviate below as “𝑉 ≠ 𝐼𝑛”.

To understand the implications of 𝑉 ≠ 𝐼𝑛, we first investigate the consequences of using the standard OLS beta parameter and variance estimators for the GLM specified in equation (3). If we consider the expectation of the OLS beta parameter estimator
$$
\hat{\beta} := (X^T X)^{-1} X^T y \quad (4)
$$
we see that it is unaffected by the covariance matrix of the error terms, and thus, in the sense of estimation bias, 𝑉 ≠ 𝐼𝑛 does not affect the estimation of the beta parameters:

$$
E(\hat{\beta}) = E((X^T X)^{-1} X^T y) = (X^T X)^{-1} X^T E(y) = (X^T X)^{-1} X^T X \beta = \beta \quad (5)
$$
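This unbiasedness can be illustrated by simulation. The following numpy sketch uses an illustrative AR(1)-like correlation structure and a hypothetical simple linear regression design; it shows that the average OLS estimate stays at the true beta even though the errors are correlated:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
beta = np.array([1.0, 0.5])

# A non-spherical, AR(1)-like error covariance matrix (illustrative)
V = np.fromfunction(lambda i, j: 0.5 ** np.abs(i - j), (n, n))

# Average the OLS estimator over many draws of correlated errors:
# its mean stays at the true beta, illustrating E(beta_hat) = beta
B = np.linalg.pinv(X)                                 # (X^T X)^{-1} X^T
eps = rng.multivariate_normal(np.zeros(n), V, size=20000)
beta_hats = (X @ beta + eps) @ B.T
assert np.allclose(beta_hats.mean(axis=0), beta, atol=0.05)
```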
The classical approach to beta parameter inference as discussed in previous sections would now proceed by estimating the variance parameter 𝜎2 by

$$
\hat{\sigma}^2 := \frac{(y - X\hat{\beta})^T (y - X\hat{\beta})}{n - p} \quad (6)
$$

and use $\hat{\sigma}^2 (X^T X)^{-1}$ as the estimator for the beta parameter covariance $\sigma^2 (X^T X)^{-1}$, which enters the denominator of the T-statistic. However, the beta parameter covariance for the GLM with 𝑉 ≠ 𝐼𝑛 does not
correspond to $\sigma^2 (X^T X)^{-1}$, as we show below, and thus classical inference is misguided. Using the linear transformation theorem for Gaussian distributions, we see that the covariance of the OLS beta estimator is

$$
\mathrm{Cov}(\hat{\beta}) = (X^T X)^{-1} X^T (\sigma^2 V) ((X^T X)^{-1} X^T)^T = \sigma^2 (X^T X)^{-1} X^T V X (X^T X)^{-1} \quad (7)
$$

and because 𝑉 ≠ 𝐼𝑛, the right-hand side of (7) does not correspond to $\sigma^2 (X^T X)^{-1}$. Hence, the T-statistic formed by the ratio of contrasts of $\hat{\beta}$ and $\hat{\sigma}^2 (X^T X)^{-1}$ is no longer guaranteed to be distributed according to the 𝑡-distribution. This result is illustrated in Figure 1. Similarly, it can be shown that in the case of a non-spherical error covariance matrix, the F-statistic is not distributed according to the 𝑓-distribution.
Figure 1 Effect of non-spherical error covariance matrices on the distribution of the T-statistic. The upper left-hand panel depicts a spherical covariance matrix of the form 𝜎2𝐼𝑛 for a simple linear regression model with 𝑛 = 10 data points. The lower left-hand panel depicts the familiar result that repeated sampling from the model specified using the spherical error covariance matrix yields a distribution of empirical 𝑇 values that follows the analytical 𝑡-distribution. The upper right-hand panel depicts a non-spherical covariance matrix for the same model as on the left-hand side, where the 𝑖, 𝑗th entry of the matrix is given by exp(−|𝑖 − 𝑗|). The interpretation of this covariance matrix will be explored in the sections on FMRI serial correlations. For now, we note that the empirical 𝑇 values sampled from a simple linear regression model with this non-spherical covariance matrix do not follow the corresponding 𝑡-distribution. Moreover, it can be seen that extremely large or small 𝑇 values are more likely to occur than predicted by the analytical 𝑡-distribution, which is based on the assumption of a spherical covariance matrix. Assuming a spherical covariance matrix when there are in fact error serial correlations as implemented in the non-spherical covariance matrix thus inflates the probability of detecting “significant results”.
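The variance inflation described above can be made explicit for the simplest design, the one-sample layout, where the sandwich expression in equation (7) can be evaluated in closed form. A numpy sketch with an illustrative correlation parameter (the value 0.8 is a hypothetical choice):

```python
import numpy as np

n = 10
sigma2 = 1.0
X = np.ones((n, 1))     # one-sample design: beta_hat is the sample mean

# Serially correlated errors with entries rho^|i-j| (illustrative rho)
V = np.fromfunction(lambda i, j: 0.8 ** np.abs(i - j), (n, n))

XtX_inv = np.linalg.inv(X.T @ X)                              # = 1/n
cov_assumed = (sigma2 * XtX_inv).item()                       # spherical theory
cov_true = (sigma2 * XtX_inv @ X.T @ V @ X @ XtX_inv).item()  # equation (7)

# Positive serial correlations inflate the true variance of beta_hat
# relative to the spherical assumption, making T values too liberal
assert cov_true > cov_assumed
```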
(2) Derivation of the generalized least-squares estimator
Under the assumption that the error covariance matrix 𝜎2𝑉 is known, there exists a conceptually simple solution to the problem of using the classical inference theory results in the context of non-spherical error covariance matrices. This solution is known as “generalized least-squares” estimation and is based on the following intuition: if classical inference is only valid in the case of spherical error covariance matrices, then a GLM with a non-spherical error covariance matrix is simply transformed into a GLM with a spherical error covariance matrix, and inference is performed based on this transformed GLM.
To introduce the approach, we assume for the moment that a matrix 𝐴 ∈ ℝ𝑛×𝑛 is known which transforms the (non-spherical) GLM

$$
y = X\beta + \varepsilon, \text{ where } y \in \mathbb{R}^n, \; X \in \mathbb{R}^{n \times p}, \; \beta \in \mathbb{R}^p, \; p(\varepsilon) = N(\varepsilon; 0, \sigma^2 V) \quad (1)
$$

into the (spherical) GLM

$$
y^* = X^* \beta + \varepsilon^* \quad (2)
$$

where

$$
y^* := Ay \in \mathbb{R}^n, \; X^* := AX \in \mathbb{R}^{n \times p}, \; \beta \in \mathbb{R}^p, \; \varepsilon^* := A\varepsilon, \; p(\varepsilon^*) = N(\varepsilon^*; 0, \sigma^2 I_n) \quad (3)
$$
Then the GLM

$$
y^* = X^* \beta + \varepsilon^* \quad (4)
$$

is of the standard form with a spherical error covariance matrix, and classical inference on the parameters can proceed as usual using the OLS beta estimator of the transformed model

$$
\hat{\beta} := (X^{*T} X^*)^{-1} X^{*T} y^* = ((AX)^T (AX))^{-1} (AX)^T A y \quad (5)
$$
The question is then how to specify 𝐴 ∈ ℝ𝑛×𝑛 so that this transformation works. To this end, assume that a matrix 𝐾 ∈ ℝ𝑛×𝑛 exists, such that we can write the positive-definite, but non-spherical, error covariance matrix 𝑉 ∈ ℝ𝑛×𝑛 as

$$
V = K K^T \quad (6)
$$

where 𝐾 ∈ ℝ𝑛×𝑛 is a (triangular) matrix with the properties

$$
(K^{-1})^T = (K^T)^{-1} \text{ and } V^{-1} = (K^{-1})^T (K^{-1}) \quad (7)
$$

If we set 𝐴 ≔ 𝐾−1 ∈ ℝ𝑛×𝑛 and use the linear transformation theorem for Gaussian distributions, we see that the covariance matrix of 𝜀∗ is given by the spherical covariance matrix 𝜎2𝐼𝑛, i.e.,

$$
p(\varepsilon) = N(\varepsilon; 0, \sigma^2 V), \; V = KK^T, \; A = K^{-1}, \; \varepsilon^* := A\varepsilon \;\Rightarrow\; p(\varepsilon^*) = N(\varepsilon^*; 0, \sigma^2 I_n) \quad (8)
$$
Proof of (8)

We have

$$
\begin{aligned}
p(\varepsilon^*) &= p(K^{-1}\varepsilon) \\
&= N(K^{-1}\varepsilon; K^{-1} 0, K^{-1}(\sigma^2 V)(K^{-1})^T) \\
&= N(\varepsilon^*; 0, \sigma^2 K^{-1} K K^T (K^{-1})^T) \\
&= N(\varepsilon^*; 0, \sigma^2 K^{-1} K K^T (K^T)^{-1}) \\
&= N(\varepsilon^*; 0, \sigma^2 I_n)
\end{aligned} \quad (8.1)
$$

□
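The existence of a factor 𝐾 with 𝑉 = 𝐾𝐾ᵀ and the whitening effect of 𝐴 = 𝐾⁻¹ stated in (8) can be verified numerically. A short numpy sketch with an illustrative choice of 𝑉 and 𝜎²:

```python
import numpy as np

n = 8
sigma2 = 2.0

# A symmetric positive-definite, non-spherical V (illustrative values) and
# a triangular "square root" V = K K^T obtained via Cholesky factorization
V = np.fromfunction(lambda i, j: 0.7 ** np.abs(i - j), (n, n))
K = np.linalg.cholesky(V)
assert np.allclose(K @ K.T, V)

# Transforming eps ~ N(0, sigma^2 V) by A = K^{-1} yields a spherical
# covariance, as stated in (8): A (sigma^2 V) A^T = sigma^2 I_n
A = np.linalg.inv(K)
assert np.allclose(A @ (sigma2 * V) @ A.T, sigma2 * np.eye(n))
```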
248
We next consider the OLS estimator of the transformed GLM 𝑦∗ = 𝑋∗𝛽 + 𝜀∗ for this choice of 𝐴. Here, we find that

$$
\hat{\beta} = (X^T V^{-1} X)^{-1} X^T V^{-1} y \quad (9)
$$

Proof of (9)

With 𝐴 ≔ 𝐾−1 we have from (5)

$$
\begin{aligned}
\hat{\beta} &= ((AX)^T (AX))^{-1} (AX)^T A y \\
&= ((K^{-1}X)^T (K^{-1}X))^{-1} (K^{-1}X)^T K^{-1} y \\
&= (X^T (K^{-1})^T K^{-1} X)^{-1} X^T (K^{-1})^T K^{-1} y \\
&= (X^T V^{-1} X)^{-1} X^T V^{-1} y
\end{aligned} \quad (9.1)
$$

where the last equality follows with the properties of 𝐾 ∈ ℝ𝑛×𝑛. □
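The equivalence proven above — OLS on the transformed model equals the closed-form expression (9) — can also be checked numerically. A numpy sketch with hypothetical design, data, and covariance values:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 12, 3
X = rng.normal(size=(n, p))   # hypothetical design matrix
y = rng.normal(size=n)        # hypothetical data vector

# Non-spherical V (illustrative), its Cholesky factor K, and A = K^{-1}
V = np.fromfunction(lambda i, j: 0.6 ** np.abs(i - j), (n, n))
K = np.linalg.cholesky(V)
A = np.linalg.inv(K)

# OLS applied to the transformed model y* = X* beta + eps* ...
Xs, ys = A @ X, A @ y
beta_trans = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)

# ... coincides with the closed-form expression (9)
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
assert np.allclose(beta_trans, beta_gls)
```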
The OLS estimator for 𝛽 of the transformed GLM 𝑦∗ = 𝑋∗𝛽 + 𝜀∗ in (9) is known as the generalized least-squares estimator and is usually abbreviated as $\hat{\beta}_{GLS}$. Note that in the case of a spherical error covariance matrix, the generalized least-squares estimator reduces to the ordinary least-squares estimator:

$$
\hat{\beta}_{GLS} = (X^T (I_n)^{-1} X)^{-1} X^T (I_n)^{-1} y = (X^T X)^{-1} X^T y = \hat{\beta} \quad (10)
$$

For the special case that 𝑉 is merely “heteroscedastic”, i.e., diagonal with possibly unequal diagonal entries, the estimator is also known as the “weighted least-squares estimator”.
A remaining question is how a matrix 𝐾 ∈ ℝ𝑛×𝑛 with the desired properties (cf. equations (7) and (8)) can be obtained, and, in fact, whether it is ensured to exist at all. The answers to these questions fall into the mathematical theory of linear algebra. In short, if 𝑉 ∈ ℝ𝑛×𝑛 is positive-definite, a matrix 𝐾 ∈ ℝ𝑛×𝑛 with the desired properties is guaranteed to exist and can be computed from 𝑉 ∈ ℝ𝑛×𝑛 using, for example, a Cholesky decomposition.
(3) Whitening
A notion related to the concept of the generalized or weighted least-squares estimator is the concept of “data (pre)whitening”, as prevalent in the neuroimaging literature. Here, the emphasis is on pre-multiplying a GLM with non-spherical error covariance matrix

$$
y = X\beta + \varepsilon, \text{ where } y \in \mathbb{R}^n, \; X \in \mathbb{R}^{n \times p}, \; \beta \in \mathbb{R}^p, \; p(\varepsilon) = N(\varepsilon; 0, \sigma^2 V) \quad (1)
$$

with an adequate whitening matrix 𝑊 ∈ ℝ𝑛×𝑛, such that the resulting GLM, referred to as the “whitened” GLM, has a spherical covariance matrix. Given the discussion above, the adequate matrix for pre-multiplying the GLM is of course again the matrix 𝑊 ≔ 𝐾−1 ∈ ℝ𝑛×𝑛, where

$$
V = K K^T \quad (2)
$$

By the discussion above, multiplication of the GLM in (1) by 𝑊,

$$
Wy = WX\beta + W\varepsilon \quad (3)
$$

then yields with
$$
y^* := Wy, \; X^* := WX \text{ and } \varepsilon^* := W\varepsilon \quad (4)
$$

that

$$
p(\varepsilon^*) := p(W\varepsilon) = p(K^{-1}\varepsilon) = N(\varepsilon^*; 0, \sigma^2 I_n) \quad (5)
$$

and the classical distribution theory for the ordinary least-squares estimator $\hat{\beta} \in \mathbb{R}^p$ of $\beta \in \mathbb{R}^p$ and its derivatives such as 𝑇 and 𝐹 values will be appropriate.
Study Questions

1. Why do non-spherical error covariance matrices require modifications to GLM estimation and inference?
2. Verbally sketch the derivation of the generalized least-squares estimator.
3. Write down the generalized least-squares estimator for a GLM of the form
$$
y = X\beta + \varepsilon, \; y \in \mathbb{R}^n, \; X \in \mathbb{R}^{n \times p}, \; \beta \in \mathbb{R}^p, \; p(\varepsilon) = N(\varepsilon; 0, \sigma^2 V)
$$
4. Show that the ordinary least-squares estimator is a special case of the generalized least-squares estimator for the case of a spherical error covariance matrix.
5. Given a GLM with non-spherical error covariance matrix 𝜎2𝑉, with which matrix does the GLM have to be pre-multiplied in order to whiten it?
Study Questions Answers

1. The classical distribution theory for the GLM, i.e., the analytical forms of the distributions of the ordinary least-squares beta estimator and the variance parameter estimator, and the ensuing distributions of T and F statistics, all assume independently and identically distributed error terms. If this assumption does not hold, the analytical statements about these distributions become invalid. Practically, using the standard distributional results in the case of correlated error terms increases the risk of false positives.

2. The generalized least-squares estimator can be derived by asking which form a matrix must have such that, if multiplied with both sides of the GLM equation, it renders the error terms independently and identically distributed. To this end, one finds with the linear transformation theorem for Gaussian distributions that an appropriate matrix corresponds to the inverse of the “square root” of the non-spherical covariance matrix of the original GLM.

3. The generalized least-squares estimator for the GLM denoted in the question has the form

$$
\hat{\beta} := (X^T V^{-1} X)^{-1} X^T V^{-1} y
$$

4. In the case of a spherical covariance matrix, the error terms are distributed according to $p(\varepsilon) = N(\varepsilon; 0, \sigma^2 V)$, where 𝑉 ≔ 𝐼𝑛. Substitution in the generalized least-squares estimator formula yields

$$
\begin{aligned}
\hat{\beta} &= (X^T (\sigma^2 I_n)^{-1} X)^{-1} X^T (\sigma^2 I_n)^{-1} y \\
&= (X^T (\sigma^2)^{-1} (I_n)^{-1} X)^{-1} X^T (\sigma^2)^{-1} (I_n)^{-1} y \\
&= (X^T (\sigma^2)^{-1} X)^{-1} X^T (\sigma^2)^{-1} y \\
&= ((\sigma^2)^{-1})^{-1} (\sigma^2)^{-1} (X^T X)^{-1} X^T y \\
&= \sigma^2 (\sigma^2)^{-1} (X^T X)^{-1} X^T y \\
&= (X^T X)^{-1} X^T y
\end{aligned}
$$

5. The whitening matrix 𝑊 ∈ ℝ𝑛×𝑛 for the GLM corresponds to the inverse of the square root of the non-spherical covariance matrix, i.e., 𝑊 ≔ 𝐾−1, where 𝑉 = 𝐾𝐾𝑇 ∈ ℝ𝑛×𝑛.
Restricted Maximum Likelihood
(1) Motivation
Above we introduced the maximum likelihood (ML) approach as a general method for the derivation
of estimators in probabilistic models. For the GLM
$$
y = X\beta + \varepsilon, \; y \in \mathbb{R}^n, \; X \in \mathbb{R}^{n \times p}, \; \beta \in \mathbb{R}^p \text{ with } p(\varepsilon) = N(\varepsilon; 0, \sigma^2 I_n), \; \sigma^2 > 0 \quad (1)
$$
we used the ML approach to derive the ML/OLS beta parameter estimator

$$
\hat{\beta} := (X^T X)^{-1} X^T y \quad (2)
$$

and the ML variance parameter estimator

$$
\hat{\sigma}^2 := \frac{(y - X\hat{\beta})^T (y - X\hat{\beta})}{n} \quad (3)
$$
We noted above that the ML variance parameter estimator is biased and showed that the modified estimator

$$
\hat{\sigma}^2 := \frac{(y - X\hat{\beta})^T (y - X\hat{\beta})}{n - p} \quad (4)
$$

offers a bias-free alternative. The notion of the estimation bias of the ML variance estimator is illustrated in Figure 1.
Figure 1 The figure depicts the empirical expectation (i.e., average) of both the REML and the ML variance estimators for a simple linear regression model as a function of the number of data points 𝑛. For each 𝑛, the second to fifth columns of the design matrix 𝑋 ∈ ℝ𝑛×5 comprised 𝑛 equally spaced values between 0 and 1 raised to the power of the design matrix column number minus 1. Data were generated based on the true, but unknown, parameter values 𝛽 ≔ (1,1,1,1,1)𝑇 and 𝜎2 ≔ 1. Per number of data points 𝑛, 1000 samples were taken from the model, and the corresponding REML and ML estimators for the variance parameter were evaluated. The average and standard error of the mean of these values are plotted as blue and red curves, respectively. The ML variance estimator is clearly biased downward with respect to the true, but unknown, value of 𝜎2, while the REML variance estimator is not.
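The bias pattern described in the figure can be reproduced in a few lines. The following numpy sketch uses a design matrix analogous to the one described in the figure caption, but with a single, fixed number of data points and fewer Monte Carlo samples for brevity:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 5
# Columns are powers 0, ..., 4 of n equally spaced values in [0, 1]
X = np.column_stack([np.linspace(0, 1, n) ** k for k in range(p)])
beta, sigma2 = np.ones(p), 1.0

# Monte Carlo estimate of the expectations of the 1/n (ML) and the
# 1/(n - p) (unbiased, REML-type) variance estimators
P = np.eye(n) - X @ np.linalg.pinv(X)     # residual-forming matrix
ml, reml = [], []
for _ in range(5000):
    y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), n)
    rss = float((P @ y) @ (P @ y))
    ml.append(rss / n)
    reml.append(rss / (n - p))

# The ML estimator is biased downward by the factor (n - p)/n = 0.75
assert abs(np.mean(reml) - sigma2) < 0.05
assert np.mean(ml) < np.mean(reml)
```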
Above, equation (4) was introduced without much explanation. In fact, the ML approach has the disadvantage that the variance estimators it yields are biased. This motivated statisticians to search for another principled method that allows for the derivation of better estimators, leading to the introduction of the “restricted” (also known as “residual”) maximum likelihood (REML) approach. In brief, REML is an approach that enables the derivation of unbiased estimators for variance parameters in linear models by using a modified likelihood function. This likelihood function is known as the REML objective function and is derived in the next paragraph.
(2) The REML objective function and REML variance parameter estimators
The REML objective function can be motivated based on the probabilistic model
𝑝(𝑦, 𝛽) = 𝑝(𝑦|𝛽)𝑝(𝛽) (1)
where
𝑝(𝑦|𝛽) = 𝑁(𝑦; 𝑋𝛽, 𝑉𝜆) (2)
𝑋 ∈ ℝ𝑛×𝑝 is a known design matrix, 𝑝(𝛽) is the marginal distribution of 𝛽 and 𝑉𝜆 is a positive-definite
covariance matrix parameter that depends on an unknown parameter 𝜆 ∈ ℝ𝑞. Examples of 𝑉𝜆 are the
spherical covariance matrix
𝑉𝜆 ≔ 𝜎2𝐼𝑛 (3)
in which case $\lambda := \sigma^2 > 0$, or the formulation of the error covariance matrix as a linear combination of covariance basis matrices $Q_i \in \mathbb{R}^{n\times n}$ ($i = 1,\ldots,q$), such that $\lambda := (\lambda_1, \ldots, \lambda_q)^T \in \mathbb{R}^q$ and
$$V_\lambda := \sum_{i=1}^{q} \lambda_i Q_i \in \mathbb{R}^{n\times n}\ \text{p.d.} \qquad (4)$$
As discussed above, maximum likelihood estimation of 𝜆 corresponds to maximizing the log likelihood
function
ℓ ∶ Λ ⊂ ℝ𝑞 → ℝ, 𝜆 ↦ ℓ(𝜆) ≔ ln𝑝(𝑦|𝛽) (5)
with respect to 𝜆. Restricted maximum likelihood estimation of 𝜆 corresponds to maximizing the log
restricted likelihood function
ℓ𝑟: Λ ⊂ ℝ𝑞 → ℝ, 𝜆 ↦ ℓ𝑟(𝜆) ≔ ln∫ 𝑝(𝑦|𝛽)𝑑𝛽 (6)
with respect to 𝜆. Equivalently, this may be viewed as maximization of the marginal log likelihood function
ℓ𝑟: Λ ⊂ ℝ𝑞 → ℝ, 𝜆 ↦ ℓ𝑟(𝜆) ≔ ln∫ 𝑝(𝑦, 𝛽)𝑑𝛽 (7)
under the assumption of a constant (uniform) improper prior distribution
$$p(\beta) = c, \quad c > 0 \qquad (8)$$
over 𝛽, such that
ℓ𝑟: Λ ⊂ ℝ𝑞 → ℝ, 𝜆 ↦ ℓ𝑟(𝜆) = ln∫ 𝑝(𝑦, 𝛽)𝑑𝛽 = ln∫ 𝑝(𝑦|𝛽)𝑝(𝛽)𝑑𝛽 = ln 𝑐 + ln ∫ 𝑝(𝑦|𝛽)𝑑𝛽 (9)
Because the first term on the right-hand side of (9) is not a function of 𝜆, (9) is equivalent to (6).
Given the generalized least-squares estimator
$$\hat{\beta}_\lambda := \left(X^T V_\lambda^{-1} X\right)^{-1} X^T V_\lambda^{-1} y \qquad (10)$$
of $\beta$, the log restricted likelihood function evaluates to
$$\ell_r(\lambda) = -\left(\frac{n-p}{2}\right)\ln 2\pi - \frac{1}{2}\ln|V_\lambda| - \frac{1}{2}\ln\left|X^T V_\lambda^{-1} X\right| - \frac{1}{2}\left(y - X\hat{\beta}_\lambda\right)^T V_\lambda^{-1}\left(y - X\hat{\beta}_\lambda\right) \qquad (11)$$

Proof of (11)

For ease of notation, we set $\hat{\beta} := \hat{\beta}_\lambda$ and $V := V_\lambda$. Using the identity
$$(y - X\beta)^T V^{-1}(y - X\beta) = (y - X\hat{\beta})^T V^{-1}(y - X\hat{\beta}) + (\beta - \hat{\beta})^T\left(X^T V^{-1} X\right)(\beta - \hat{\beta}) \qquad (11.1)$$
which follows from expanding both sides and noting that $X^T V^{-1}(y - X\hat{\beta}) = 0$, and the normalizing integral of the Gaussian distribution, we have
$$
\begin{aligned}
\ell_r(\lambda) &= \ln \int p(y|\beta)\,d\beta \\
&= \ln \int N(y; X\beta, V)\,d\beta \\
&= \ln\left(\int (2\pi)^{-\frac{n}{2}} |V|^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2}(y - X\beta)^T V^{-1}(y - X\beta)\right) d\beta\right) \\
&= \ln\left(\int (2\pi)^{-\frac{n}{2}} |V|^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2}(y - X\hat{\beta})^T V^{-1}(y - X\hat{\beta})\right) \exp\left(-\tfrac{1}{2}(\beta - \hat{\beta})^T \left(X^T V^{-1} X\right)(\beta - \hat{\beta})\right) d\beta\right) \\
&= \ln\left((2\pi)^{-\frac{n}{2}} |V|^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2}(y - X\hat{\beta})^T V^{-1}(y - X\hat{\beta})\right) \int \exp\left(-\tfrac{1}{2}(\beta - \hat{\beta})^T \left(X^T V^{-1} X\right)(\beta - \hat{\beta})\right) d\beta\right) \\
&= \ln\left((2\pi)^{-\frac{n}{2}} |V|^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2}(y - X\hat{\beta})^T V^{-1}(y - X\hat{\beta})\right) (2\pi)^{\frac{p}{2}} \left|X^T V^{-1} X\right|^{-\frac{1}{2}}\right) \\
&= -\frac{n}{2}\ln 2\pi - \frac{1}{2}\ln|V| - \frac{1}{2}(y - X\hat{\beta})^T V^{-1}(y - X\hat{\beta}) + \frac{p}{2}\ln 2\pi - \frac{1}{2}\ln\left|X^T V^{-1} X\right| \\
&= -\left(\frac{n-p}{2}\right)\ln 2\pi - \frac{1}{2}\ln|V| - \frac{1}{2}\ln\left|X^T V^{-1} X\right| - \frac{1}{2}(y - X\hat{\beta})^T V^{-1}(y - X\hat{\beta})
\end{aligned}
$$
□
Having established the REML objective function, we can now state explicitly the notions of an ML and a REML estimator for the variance parameter of a GLM. As previously, an ML variance parameter estimator is defined as a value $\lambda_{ML}$ that maximizes the function
$$\ell : \Lambda \subset \mathbb{R}^q \to \mathbb{R},\ \lambda \mapsto \ell(\lambda) := -\frac{1}{2}\ln|V_\lambda| - \frac{1}{2}\left(y - X\hat{\beta}_\lambda\right)^T V_\lambda^{-1}\left(y - X\hat{\beta}_\lambda\right) \qquad (12)$$
A REML variance parameter estimator, on the other hand, is defined as a value $\lambda_{REML}$ that maximizes the function
$$\ell_r : \Lambda \subset \mathbb{R}^q \to \mathbb{R},\ \lambda \mapsto \ell_r(\lambda) = -\frac{1}{2}\ln|V_\lambda| - \frac{1}{2}\ln\left|X^T V_\lambda^{-1} X\right| - \frac{1}{2}\left(y - X\hat{\beta}_\lambda\right)^T V_\lambda^{-1}\left(y - X\hat{\beta}_\lambda\right) \qquad (13)$$
Note that ℓ corresponds to the log likelihood function for the variance parameter of the standard GLM as introduced above after estimation of the beta parameter vector, with the modification that the Gaussian covariance matrix $V_\lambda$ and the beta parameter estimator $\hat{\beta}_\lambda$ have been made explicit functions of the variance parameter, and that additive constants, which do not affect the location of extremal points, have been omitted. Further note that the REML objective function $\ell_r$ is identical to $\ell$, except for the extra term $-\frac{1}{2}\ln|X^T V_\lambda^{-1} X|$.
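The role of the extra REML term can be made concrete for the spherical case 𝑉𝜆 = 𝜎2𝐼𝑛: numerically maximizing ℓ and ℓ𝑟 over 𝜎2 recovers the biased estimator RSS/𝑛 and the unbiased estimator RSS/(𝑛 − 𝑝), respectively. A sketch (the polynomial design matrix and sample size are arbitrary illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
n, p = 30, 5
X = np.column_stack([np.linspace(0, 1, n) ** j for j in range(p)])
y = X @ np.ones(p) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS = GLS when V = s2 * I
rss = np.sum((y - X @ beta_hat) ** 2)

def neg_ll(s2):
    # ML objective (12) for V = s2 * I, sign-flipped, constants dropped
    return 0.5 * n * np.log(s2) + 0.5 * rss / s2

def neg_rll(s2):
    # REML objective (13): ln|X^T (s2 I)^-1 X| = -p ln s2 + ln|X^T X|
    return 0.5 * (n - p) * np.log(s2) + 0.5 * rss / s2

s2_ml = minimize_scalar(neg_ll, bounds=(1e-6, 10.0), method="bounded").x
s2_reml = minimize_scalar(neg_rll, bounds=(1e-6, 10.0), method="bounded").x

assert np.isclose(s2_ml, rss / n, atol=1e-4)
assert np.isclose(s2_reml, rss / (n - p), atol=1e-4)
```

Here the closed-form maximizers are known, so the numerical optimization merely confirms that the extra −½ ln|𝑋𝑇𝑉𝜆−1𝑋| term shifts the maximizer from RSS/𝑛 to RSS/(𝑛 − 𝑝).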
(3) Derivatives of the REML objective function
The derivatives of the REML objective function, which are required for finding its maximum, can be computed analytically. Specifically, with
$$P := V_\lambda^{-1} - V_\lambda^{-1} X \left(X^T V_\lambda^{-1} X\right)^{-1} X^T V_\lambda^{-1} \in \mathbb{R}^{n\times n} \qquad (1)$$
the components of the gradient $\nabla \ell_r(\lambda) \in \mathbb{R}^q$ of the REML objective function are given by
$$\frac{\partial}{\partial \lambda_i}\ell_r(\lambda) = -\frac{1}{2}\mathrm{tr}\left(P \frac{\partial V_\lambda}{\partial \lambda_i}\right) + \frac{1}{2}\left(y - X\hat{\beta}_\lambda\right)^T V_\lambda^{-1} \left(\frac{\partial V_\lambda}{\partial \lambda_i}\right) V_\lambda^{-1} \left(y - X\hat{\beta}_\lambda\right) \qquad (2)$$
for $i = 1,\ldots,q$, and the components of the Hessian $\nabla^2 \ell_r(\lambda) \in \mathbb{R}^{q\times q}$ of the REML objective function are given by
$$\frac{\partial^2}{\partial \lambda_i \partial \lambda_j}\ell_r(\lambda) = -\frac{1}{2}\mathrm{tr}\left(P \frac{\partial^2 V_\lambda}{\partial \lambda_i \partial \lambda_j} - P\frac{\partial V_\lambda}{\partial \lambda_i} P \frac{\partial V_\lambda}{\partial \lambda_j}\right) + \frac{1}{2}\left(y - X\hat{\beta}_\lambda\right)^T V_\lambda^{-1}\left(\frac{\partial^2 V_\lambda}{\partial \lambda_i \partial \lambda_j} - 2\frac{\partial V_\lambda}{\partial \lambda_i} P \frac{\partial V_\lambda}{\partial \lambda_j}\right) V_\lambda^{-1} \left(y - X\hat{\beta}_\lambda\right) \qquad (3)$$
for $i, j = 1,\ldots,q$. The expected value of the $(i,j)$-th component of the Hessian matrix under the data distribution can also be evaluated analytically and results in
$$E\left(\frac{\partial^2}{\partial \lambda_i \partial \lambda_j}\ell_r(\lambda)\right) = -\frac{1}{2}\mathrm{tr}\left(P \frac{\partial V_\lambda}{\partial \lambda_i} P \frac{\partial V_\lambda}{\partial \lambda_j}\right) \qquad (4)$$
(4) Fisher-Scoring for the REML objective function and covariance basis matrices
Maximization of the REML objective function with respect to the variance parameter $\lambda \in \Lambda \subset \mathbb{R}^q$, which parameterizes the covariance matrix $V_\lambda$, can also be achieved numerically. In this case, the Fisher-Scoring algorithm introduced in the context of maximum likelihood estimation can be adapted for the REML objective function. Recall that, essentially, Fisher-Scoring corresponds to a multivariate Newton-Raphson algorithm for the numerical optimization of an objective function (here the log likelihood function) with the modification that the Hessian matrix of the objective function is replaced by its expected value under the probabilistic model of interest. To develop a Fisher-Scoring analogue for the REML objective function, we thus require its gradient and its expected Hessian matrix with respect to the components of the variance component vector $\lambda$. These were derived in their general form above. Notably, the gradient vector and Hessian matrix of the REML objective function simplify considerably if one assumes that the covariance matrix is a linear combination of covariance basis matrices. Specifically, if we assume that
$$V_\lambda = \sum_{i=1}^{q} \lambda_i Q_i \in \mathbb{R}^{n\times n}\ \text{p.d.} \qquad (1)$$
then
$$\frac{\partial}{\partial \lambda_i}\ell_r(\lambda) = -\frac{1}{2}\mathrm{tr}(P Q_i) + \frac{1}{2}\left(y - X\hat{\beta}_\lambda\right)^T V_\lambda^{-1} Q_i V_\lambda^{-1}\left(y - X\hat{\beta}_\lambda\right) \qquad (2)$$
Proof of (2)
For an error covariance matrix of the form (1), we have
$$\frac{\partial}{\partial \lambda_i} V_\lambda = \frac{\partial}{\partial \lambda_i}\left(\sum_{j=1}^{q} \lambda_j Q_j\right) = \sum_{j=1}^{q} \left(\frac{\partial}{\partial \lambda_i}\lambda_j\right) Q_j = Q_i \qquad (2.1)$$
Substitution into the general gradient expression then yields
$$
\begin{aligned}
\frac{\partial}{\partial \lambda_i}\ell_r(\lambda) &= -\frac{1}{2}\mathrm{tr}\left(P \frac{\partial V_\lambda}{\partial \lambda_i}\right) + \frac{1}{2}\left(y - X\hat{\beta}_\lambda\right)^T V_\lambda^{-1}\left(\frac{\partial V_\lambda}{\partial \lambda_i}\right) V_\lambda^{-1}\left(y - X\hat{\beta}_\lambda\right) \qquad (2.2)\\
&= -\frac{1}{2}\mathrm{tr}(P Q_i) + \frac{1}{2}\left(y - X\hat{\beta}_\lambda\right)^T V_\lambda^{-1} Q_i V_\lambda^{-1}\left(y - X\hat{\beta}_\lambda\right)
\end{aligned}
$$
Likewise, the expectation of the $(i,j)$-th component of the Hessian matrix of the REML objective function simplifies to
$$E\left(\frac{\partial^2}{\partial \lambda_i \partial \lambda_j}\ell_r(\lambda)\right) = -\frac{1}{2}\mathrm{tr}\left(P\frac{\partial V_\lambda}{\partial \lambda_i} P \frac{\partial V_\lambda}{\partial \lambda_j}\right) = -\frac{1}{2}\mathrm{tr}(P Q_i P Q_j) \qquad (3)$$
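A quick way to gain confidence in the simplified gradient formula (2) is a finite-difference check against a direct implementation of the REML objective. The two covariance basis matrices below (an identity matrix and an exponential-decay matrix) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
idx = np.arange(n)
Q = [np.eye(n), np.exp(-np.abs(np.subtract.outer(idx, idx)))]  # assumed bases
lam = np.array([1.0, 0.5])

def l_r(lam):
    """Direct implementation of the REML objective (constants dropped)."""
    V = lam[0] * Q[0] + lam[1] * Q[1]
    Vinv = np.linalg.inv(V)
    A = X.T @ Vinv @ X
    e = y - X @ np.linalg.solve(A, X.T @ Vinv @ y)
    return (-0.5 * np.linalg.slogdet(V)[1]
            - 0.5 * np.linalg.slogdet(A)[1]
            - 0.5 * e @ Vinv @ e)

# Analytic gradient, equation (2): -tr(P Q_i)/2 + e^T Vinv Q_i Vinv e / 2
V = lam[0] * Q[0] + lam[1] * Q[1]
Vinv = np.linalg.inv(V)
A = X.T @ Vinv @ X
P = Vinv - Vinv @ X @ np.linalg.solve(A, X.T @ Vinv)
e = y - X @ np.linalg.solve(A, X.T @ Vinv @ y)
grad = np.array([-0.5 * np.trace(P @ Qi) + 0.5 * e @ Vinv @ Qi @ Vinv @ e
                 for Qi in Q])

# Central differences of l_r agree with the analytic gradient
h = 1e-6
num = np.array([(l_r(lam + h * d) - l_r(lam - h * d)) / (2 * h)
                for d in np.eye(2)])
assert np.allclose(grad, num, atol=1e-4)
```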
In summary, under the assumption that the error covariance matrix decomposes into a linear combination of covariance basis matrices, one may thus use the following numerical approach for REML objective function maximization:

Initialization

0. Define a starting point $\lambda^{(0)} \in \mathbb{R}^q$ and set $k := 0$. If $\nabla \ell_r(\lambda^{(0)}) = 0$, stop! $\lambda^{(0)}$ is a zero of $\nabla \ell_r$. If not, proceed to the iterations.

Until Convergence

1. For $i = 1,\ldots,q$ set
$$\lambda_i^{(k+1)} := \lambda_i^{(k)} + \left(\frac{1}{2}\mathrm{tr}(P Q_i P Q_i)\right)^{-1}\left(-\frac{1}{2}\mathrm{tr}(P Q_i) + \frac{1}{2}\left(y - X\hat{\beta}_{\lambda^{(k)}}\right)^T V_{\lambda^{(k)}}^{-1} Q_i V_{\lambda^{(k)}}^{-1}\left(y - X\hat{\beta}_{\lambda^{(k)}}\right)\right)$$
where $P$ is evaluated at $\lambda^{(k)}$; that is, each component is updated by a Newton-type step in which the $(i,i)$-th entry of the negative expected Hessian takes the place of the Hessian.

2. If $\nabla \ell_r(\lambda^{(k+1)}) = 0$, stop! $\lambda^{(k+1)}$ is a zero of $\nabla \ell_r$. If not, go to 3.

3. Set $k := k + 1$ and go to 1.

Table 1. A Fisher-Scoring algorithm analogue for the numerical optimization of the REML objective function for error covariance matrices that decompose into linear combinations of covariance basis matrices.
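The scheme of Table 1 can be sketched in a few lines of NumPy. This is a bare-bones illustration: it takes a full Newton-type step with the negative expected Hessian matrix rather than only its diagonal entries, and it includes no safeguard for keeping 𝑉𝜆 positive-definite. For the spherical special case with a single basis matrix 𝑄1 = 𝐼𝑛, the iteration reproduces the unbiased estimator RSS/(𝑛 − 𝑝):

```python
import numpy as np

def reml_fisher_scoring(y, X, Q, lam0, max_iter=50, tol=1e-10):
    """Fisher-Scoring for the REML objective with V = sum_i lam_i Q_i (sketch)."""
    lam = np.array(lam0, dtype=float)
    for _ in range(max_iter):
        V = sum(l * Qi for l, Qi in zip(lam, Q))
        Vinv = np.linalg.inv(V)
        A = X.T @ Vinv @ X
        P = Vinv - Vinv @ X @ np.linalg.solve(A, X.T @ Vinv)
        e = y - X @ np.linalg.solve(A, X.T @ Vinv @ y)
        # Gradient of the REML objective (equation (2))
        g = np.array([-0.5 * np.trace(P @ Qi)
                      + 0.5 * e @ Vinv @ Qi @ Vinv @ e for Qi in Q])
        # Negative expected Hessian (equation (3))
        F = np.array([[0.5 * np.trace(P @ Qi @ P @ Qj) for Qj in Q] for Qi in Q])
        step = np.linalg.solve(F, g)
        lam = lam + step
        if np.max(np.abs(step)) < tol:
            break
    return lam

# Spherical sanity check: with Q = [I_n] the REML estimate equals RSS/(n - p)
rng = np.random.default_rng(3)
n, p = 40, 4
X = rng.normal(size=(n, p))
y = X @ np.ones(p) + rng.normal(size=n)
lam_hat = reml_fisher_scoring(y, X, [np.eye(n)], [2.0])

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
rss = np.sum((y - X @ beta_ols) ** 2)
assert np.isclose(lam_hat[0], rss / (n - p))
```

In this special case the very first update already lands on RSS/(𝑛 − 𝑝), because the Fisher-Scoring step solves the stationarity condition exactly.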
Study Questions

1. State the motivation for the introduction of the restricted maximum likelihood framework.
2. Under what kind of probabilistic model can the restricted maximum likelihood objective function be derived as a marginal log likelihood function?
3. What is the difference between the maximum likelihood and the restricted maximum likelihood objective function for variance parameter estimation?
4. How can the restricted maximum likelihood objective function be used for statistical inference?
5. Why is the assumption of a covariance matrix that can be written as a linear combination of covariance basis matrices helpful in the context of REML?

Study Question Answers
1. The variance parameter estimator derived under the maximum likelihood framework is biased, i.e. its analytical expectation does not correspond to the true, but unknown, value that it is supposed to estimate. The restricted maximum likelihood framework can be conceived of as originating from the desire to establish a general framework under which unbiased variance parameter estimators can be derived.
2. The restricted maximum likelihood objective function can be viewed as the marginal log likelihood function of a GLM with a uniform (improper) prior distribution over the beta parameter, i.e. as the logarithm of the data distribution with the beta parameter integrated out.
3. The difference is the additional term $-\frac{1}{2}\ln|X^T V_\lambda^{-1} X|$, which appears in the restricted maximum likelihood objective function, but not in the maximum likelihood objective function.
4. The restricted maximum likelihood objective function can be used for statistical inference like the maximum likelihood objective function (i.e. the log likelihood function): it is viewed as a function of a parameter of a probabilistic model, and the parameter is optimized so as to maximize the probability of the observed data under the model. As in the maximum likelihood framework, such an optimization can either proceed analytically, by finding the zeros of the derivative (or gradient) of the restricted maximum likelihood objective function explicitly, or numerically, for example using the Fisher-Scoring algorithm.
5. Under the assumption of a covariance matrix of the form $V_\lambda = \sum_{i=1}^{q} \lambda_i Q_i \in \mathbb{R}^{n\times n}$ p.d., where $Q_i \in \mathbb{R}^{n\times n}$ are suitably chosen covariance basis matrices, the gradient and expected Hessian of the restricted maximum likelihood objective function simplify considerably.
FMRI applications of the General Linear Model
The mass-univariate GLM-FMRI approach
The application of the GLM to FMRI signal time-series of single voxel data, referred to as the “mass-
univariate” GLM approach, hereafter GLM-FMRI, is a surprisingly complex topic. In the current section we
briefly review the acquisition and preprocessing of FMRI data to become familiar with the data organization
that is used for GLM-FMRI. We will next use a simple two-condition/two-voxel example to obtain an
intuition of how the GLM is used for the purpose of cognitive process brain mapping in FMRI research.
(1) FMRI data acquisition and preprocessing
The fundamental idea of GLM-FMRI studies is to map cognitive processes onto brain areas. To this end, two fortuitous facts are exploited: firstly, because of its metabolic demands, local neural activity results in a local alteration of the ratio of deoxygenated to oxygenated haemoglobin. Secondly, the local displacement of deoxygenated haemoglobin alters the local magnetic susceptibility of brain tissue and can be detected as an increase of the local magnetic resonance signal by an MR scanner. The changes in the MR signal induced by local neural activity are referred to as the blood-oxygen level dependent (BOLD) contrast signal. Based on this interaction between physics and biology, the idea of GLM-FMRI is to induce specific psychological states in human participants, which presumably are reflected in specific neural activity states, resulting in metabolic demands, which in turn can be detected by means of FMRI. If two different psychological processes are reflected in anatomically different brain structures, GLM-FMRI, i.e. the statistical evaluation of where an MR signal difference occurred post-stimulus, thus allows for mapping cognitive processes onto the anatomy of the human brain. A full discussion of the fundamentals of FMRI is beyond the scope of this
PMFN. For an excellent introduction to this topic, please refer to (Huettel et al., 2014). Below, we provide an
overview of the steps that are involved in a standard FMRI study before the data are modelled using the
GLM.
The MRI signal arises from a complex interplay between a strong magnetic field, the magnetic
properties of atomic particles and the behaviour of the latter in the presence of electromagnetic stimulation.
The quantitative characterization of this process falls into the realm of quantum mechanics and even in
introductory discussions requires some familiarity with ordinary differential equations (Huettel et al., 2014).
For the purpose of PMFN, it suffices to know that the MRI scanner allows for taking images, i.e. three-
dimensional arrays of numbers, of the brain. The process of image generation using the electromagnetic MR
signal is based on concepts from Fourier analysis. Depending on the specific parameters that are used to
take MR images, known as sequence parameters, different image types result. So-called T1-weighted
images, which take approximately 10 minutes to acquire, have high spatial resolution and reveal fine
anatomical detail. T2*-weighted images, on the other hand, take only 1 to 3 seconds to acquire and are sensitive to variations in the MR signal due to local deoxyhemoglobin changes, that is, the BOLD contrast signal. T2*-weighted images are the images used for FMRI. They are usually acquired using an MR sequence type known as “echo-planar imaging”, hence these images are also known as “EPI” images. The GLM-FMRI approach essentially converts EPI image time-series into statistical maps that indicate where local activations occurred. These maps in turn are referred to as “statistical parametric maps” (SPMs). Here, “parametric” refers to the fact that the statistics are evaluated using parametric assumptions about their underlying distribution.
In general, FMRI data is organized as follows. A single human participant is usually scanned in a
single “session”, which comprises multiple “runs”. A run is usually about 10 to 15 minutes long. During a run,
the participant carries out a cognitive task (for example responding to visually presented stimuli by button
presses) while a series of EPI images is acquired simultaneously and continuously. Each run thus comprises a
sequence of EPI “images” or “volumes”. The time it takes to acquire a single volume corresponds to the
time-resolution of FMRI and is called “time-to-repetition” or in brief “TR”. Each volume comprises a number
of “slices” (usually in the order of 30 for whole-brain imaging), each of which contains a number of “voxels”.
Voxels, the three-dimensional analogue of the two-dimensional concept of a pixel, thus make up the entire image. It is very helpful to simply think of EPI images as 3D arrays of numbers representing the grey values of the image's voxels. It is the sequence of these grey values of a particular voxel over the course of an
experimental run to which the GLM is applied below. Importantly, the same design matrix is applied to
model the time-series data of each and every voxel. Before the GLM is applied, however, the data undergo a
significant amount of “preprocessing” to limit the influence of artefacts on the results. We will briefly review
these preprocessing steps and their purposes in the following.
FMRI data preprocessing usually comprises a sequence of steps known as (1) distortion correction, (2)
realignment, (3) slice-time correction, (4) normalization, and (5) smoothing. We briefly review each of these
steps in turn.
(1) Distortion correction. Due to inhomogeneities of the magnetic field, certain image parts of EPI
volumes may be distorted with respect to the object the image is taken of. Correcting these distortions
based on the knowledge of the field inhomogeneities is referred to as “distortion correction”. The aim of
distortion correction is thus to render the image a more veridical representation of the imaged object.
(2) Realignment. In order to allocate an observed effect after analysis of the FMRI data to a specific brain
region, one has to be sure that the time-course of a voxel actually refers to the same region during the
course of the experiment. The MR scanner’s voxel grid is overlaid over the subject’s brain in a fixed position.
Thus, if the subject moves during a run, voxels will represent different brain regions over time. For this
reason, subjects are usually fixated as much as possible and encouraged not to move during scanning.
However, some residual motion (for example by the pulsation of the blood in the brain’s vasculature) cannot
be avoided and is corrected during the realignment step, usually using the first image of the first run as
reference.
(3) Slice-time correction. The slices comprising a single EPI volume are acquired one after the other.
Because of this fact and because during data analysis EPI volumes are typically considered as data samples
for a single time point, temporal interpolation can be used to resample each slice with respect to a single EPI
volume onset time.
(4) Normalization. Normalization refers to the transformation of the subject-specific three-dimensional
voxel time-series into a standard group anatomical space. This transformation is performed by translating
and rotating the acquired data in three-dimensions and possibly also stretching and squeezing it.
Normalization of FMRI data is required, if the experiment aims at comparing data between different
subjects. Because all brains are a bit different, normalization will never really bring the same regions of two
subjects into full alignment (the question is also what “same regions” really means), but it is a useful
approach for group studies.
(5) Smoothing. “Smoothing” refers to the spatial weighted averaging of individual voxel data with data
from voxels in its vicinity. Intuitively, smoothing removes random signal fluctuations over space.
After data acquisition and data pre-processing (which does not alter the data format, but only the
data content) FMRI data correspond to “voxel time-courses”. In other words, for each three-dimensional
brain location (voxel) a time-course of MR signal values is obtained (Figure 1).
Figure 1 A visualization of “voxel time-courses”.
Numerically, these values may look like the table below.
          Volume 1   Volume 2   Volume 3   Volume 4   ...   Volume 𝑛𝑇𝑅
Voxel 1     97.3       90.2       86.1       89.9     ...     85.3
Voxel 2     98.2       91.1       87.0       89.5     ...     86.2
...         ...        ...        ...        ...      ...     ...
Voxel 𝑛𝑉    98.2       91.1       87.0       89.5     ...     86.2
Table 1 Tabular representation of voxel time courses of a single FMRI run.
In Table 1 and Figure 1, 𝑛𝑉 denotes the total number of voxels of the image/volume, and each line contains as many values as the experimental run had sampling points, i.e. EPI volume acquisitions. The
fundamental idea of GLM-FMRI is to treat each voxel in isolation and apply the same GLM to each voxel, one
after the other. This is called the “mass-univariate” approach, because the dependent variable is one-
dimensional (= univariate): it represents the MR signal time-course of a single voxel. In all following
discussion of this Section, we will hence deal with the analysis of the data of a single voxel, for which we
model the observed signal value time course using the GLM. Because we are dealing with a single voxel, we
will not use an index referencing different voxels.
(2) Brain mapping using the GLM-FMRI approach
To link the FMRI data format with the GLM theory developed until now, we next discuss how the
FMRI data and knowledge about the timing of experimental stimulation is formulated in the standard GLM
form given by
$$X\beta + \varepsilon = y, \ \text{where}\ y \in \mathbb{R}^n,\ X \in \mathbb{R}^{n\times p},\ \beta \in \mathbb{R}^p,\ p(\varepsilon) = N(\varepsilon; 0, \sigma^2 I_n),\ \sigma^2 > 0 \qquad (1)$$
To this end, we consider the data first. Instead of writing the data of a voxel as a line, we can also
write it as a column, where each line of the column indicates the value of that voxel at the respective TR
(Table 2).
Volume Variable MR Signal
1 𝑦1 87.3
2 𝑦2 90.2
3 𝑦3 86.1
... ... ...
𝑛𝑇𝑅 𝑦𝑛𝑇𝑅 85.3
Table 2. Tabular representation of a single voxel time course as GLM data vector 𝑦 ∈ ℝ𝑛𝑇𝑅. The data displayed correspond to the data of Voxel 1 in Table 1.
In other words, in GLM-FMRI the 𝑖th data point of a voxel time-course corresponds to the 𝑖th dependent variable 𝑦𝑖. The entire data set of a single voxel's MR signal time course written as a column vector thus corresponds to the data vector 𝑦 ∈ ℝ𝑛, where 𝑛 corresponds to the number of volumes acquired in a given run, i.e. 𝑛 ≔ 𝑛𝑇𝑅.
The GLM for FMRI takes the form of multiple linear regression over time. In the design matrix, we
thus include what we believe might have had an influence on the value of 𝑦 at a given time-point. Like
always, the 𝑖th MR signal is thus modelled as weighted sum of the values of the independent variables at the
𝑖th time-point plus a time-point specific noise term:
$$x_{i1}\beta_1 + x_{i2}\beta_2 + x_{i3}\beta_3 + \cdots + x_{ip}\beta_p + \varepsilon_i = y_i, \quad p(\varepsilon_i) = N(\varepsilon_i; 0, \sigma^2),\ \sigma^2 > 0 \qquad (2)$$
However, two things are important now: First, the values of the independent variables 𝑥 change over time,
as they represent for example the presence or the absence of an external stimulus during the course of an
experimental run. If we consider a discrete index 𝑡 representing time, and replace the index 𝑖 with it, we may
rewrite the above as
$$x_1(t)\beta_1 + x_2(t)\beta_2 + x_3(t)\beta_3 + \cdots + x_p(t)\beta_p + \varepsilon(t) = y(t), \quad \varepsilon(t) \sim N(\varepsilon(t); 0, \sigma^2),\ \sigma^2 > 0 \qquad (3)$$
where 𝑡 = 1,2, … , 𝑛. Note that the independent variables and the dependent variables are indexed by time,
but the parameters are not. As usual, we assume that the weighting coefficients 𝛽𝑗 do not change over time
(or experimental units), but represent the fixed contribution that the independent variable 𝑥𝑗(𝑡) makes to
the observed value 𝑦(𝑡) at all times. Because we are dealing with discrete time, the explicit time indexing 𝑡 = 1,2,…,𝑛 is in fact rarely carried along, however, and we represent (3) in standard matrix notation as
$$X\beta + \varepsilon = y, \quad p(\varepsilon) = N(\varepsilon; 0, \sigma^2 I_n),\ \sigma^2 > 0 \qquad (4)$$
It should be noted, however, that for GLM-FMRI the rows of the design matrix, error vector, and the data are
functions of time as opposed to functions of independent experimental units.
We next explore, how the formulation of the mass-univariate GLM in (4) and the intuition of the
specificity of a brain region’s response to a cognitive process or experimental condition are related. To this
end, consider an FMRI experiment with one experimental factor (e.g. the category of a visual stimulus)
comprising two levels (e.g. category 1: pictures of faces, category 2: pictures of houses). For generality, we
will just refer to the two different levels as condition 1 and condition 2. In a typical experiment, one presents
the participant with each of the conditions over the course of an experimental run. During each run FMRI
data is acquired continuously, for example for 10 minutes, in discrete samples taken every TR = 2 s (time-to-repetition). As described above, the fundamental idea of GLM analyses of FMRI data is that in response to a
stimulus or a cognitive process, neurons become active in a region that is “specialized” for this stimulus or
cognitive process. This neural activity causes a metabolic cascade which in turn leads to a local increase in
the level of oxygenated hemoglobin in the area of the neural activity. Hemodynamic responses to a variety
of stimuli have been measured, and the concept of the hemodynamic response function, i.e. a mathematical
model that describes the ideal change in the MR signal upon a neural event, has been formulated. The
particularities of hemodynamic response functions will be discussed in more detail later; for now, it suffices
to note that the MR signal at a specific voxel in response to a single neural event at time point 𝑡 = 0
approximately looks like the function shown in Figure 2.
Figure 2. The hemodynamic response function. The hemodynamic response function reflects the idealized MR signal response to a brief neural event. It functions as the basis for GLM modelling of FMRI data in a temporal convolution framework. The Figure shows the “canonical hemodynamic response function” with the default parameters as implemented in the SPM software toolbox.
Assume now that an observer was presented with conditions 1 and 2 in random order over the
course of an experimental run of 260 𝑠 at the times shown by the “stick functions” in Figure 3 below.
Whenever the stick function of a condition is 1, the respective condition was presented. For example at
times 𝑡 = 0 and 𝑡 = 16 condition 1 was presented, at time 𝑡 = 32 condition 2 was presented, and so on.
Next, assume that while these conditions were presented, FMRI data were collected every 2 s from two
voxels A and B representing different brain areas, and the data shown in Figure 4 was extracted. Comparing
the event time points of Figure 3 to the voxel-time courses in Figure 4 above, one would come to the
conclusion that voxel A always shows an excursion of the MR signal, when condition 1 is presented and no
excursion when condition 2 is presented. For voxel B, on the other hand, the MR signal is responsive to both
condition 1 and 2.
Figure 3 Example condition onsets (stimulus timing) in an experiment with two conditions 1 and 2.
Figure 4 Example MR signal time-courses from two different voxels A and B.
Figure 5. Stimulus onset functions of Figure 3 convolved with a canonical hemodynamic response function.
In GLM-FMRI, this voxel-specific “responsiveness to a specific condition” is represented by the beta
parameter values of the predictor variable representing the respective condition. To see this, consider first
the predicted time-courses for voxels which are solely and ideally responsive to conditions 1 or 2,
respectively. These time-courses are obtained by replacing the stick functions of Figure 3 with the assumed
hemodynamic response functions, yielding Figure 5. Technically, this replacement is achieved by convolving
the stimulus stick functions with the hemodynamic response function, the details of which will be discussed
in the next section.
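The construction described above can be sketched end-to-end: build stick functions from onset times, convolve them with an HRF, stack the results as design matrix columns, and verify that least squares recovers 𝛽𝐴 ≈ (1,0)𝑇 and 𝛽𝐵 ≈ (1,1)𝑇 for simulated voxels A and B. The onset times beyond 𝑡 = 0, 16, 32 s, the double-gamma HRF parameters, and the noise level are assumptions for illustration; SPM's exact canonical HRF parameters are not reproduced here.

```python
import numpy as np
from scipy.stats import gamma

TR, dur = 2.0, 260.0                        # run timing from the example
t = np.arange(0, dur, TR)
n = t.size

# Stick functions; onsets after t = 32 s are made up for illustration
onsets1, onsets2 = [0, 16, 48, 96, 144, 192], [32, 64, 112, 160, 208]
u1 = np.isin(t, onsets1).astype(float)
u2 = np.isin(t, onsets2).astype(float)

# A double-gamma HRF sketch (parameters assumed, not SPM's canonical ones)
th = np.arange(0, 32, TR)
hrf = gamma.pdf(th, 6) - gamma.pdf(th, 16) / 6
hrf /= hrf.max()

# Convolve, truncate to run length; the results are the design matrix columns
x1 = np.convolve(u1, hrf)[:n]
x2 = np.convolve(u2, hrf)[:n]
X = np.column_stack([x1, x2])

# Simulate voxel A (condition 1 only) and voxel B (both conditions)
rng = np.random.default_rng(4)
yA = X @ np.array([1.0, 0.0]) + 0.05 * rng.normal(size=n)
yB = X @ np.array([1.0, 1.0]) + 0.05 * rng.normal(size=n)

beta_A = np.linalg.solve(X.T @ X, X.T @ yA)   # approximately (1, 0)
beta_B = np.linalg.solve(X.T @ X, X.T @ yB)   # approximately (1, 1)
```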
The predicted time-series for each condition are now identified with the columns of the design
matrix 𝑋 ∈ ℝ𝑛×2 for an observed time-series 𝑦 ∈ ℝ𝑛. That is, the number that is given for predictor 1 and the number that is given for predictor 2 at time-point 𝑖 are concatenated to a row vector with two entries and entered in the 𝑖th row of the design matrix, where 𝑖 = 1,…,𝑛 and 𝑛 is the total number of time-points (=
volumes). Alternatively, one can imagine transposing the predicted MR signal of condition 1 into a column
vector and entering this as the first column of the design matrix, and likewise transposing the predicted MR
signal of condition 2 into a column vector and entering this as the second column of the design matrix.
Commonly, the design matrix is represented as a grey-scale image of its entries, which is shown in the left panel of Figure 6. Recall that the design matrix has as many rows as there are observed data points 𝑛 and as many
columns as there are predictor variables.
Now consider again the MR signal time course of voxel A in Figure 4 above. Transposing this row
vector to a column vector 𝑦 ∈ ℝ𝑛, we see that we can write down the GLM equation for this voxel time-
series quite well, if we chose the true, but unknown, parameter vector to be approximately 𝛽𝐴 ≔ (1,0)𝑇.
Intuitively, we can represent the corresponding GLM matrix multiplication graphically as in Figure 6.
Figure 6. Graphical representation of the GLM matrix product for 𝛽𝐴.
Likewise consider the MR signal time course of voxel B in Figure 4 above. Here we see that we can
recreate the voxel time-series using the same design matrix as for voxel A but setting the true, but
unknown, parameter vector to 𝛽𝐵 ≔ (1,1)𝑇 as shown in Figure 7. Note that as usual, the observed signal
results from the outcome of the design matrix and parameter multiplication plus a stochastic noise vector,
here denoted as 휀𝐴/휀𝐵.
Figure 7. Graphical representation of the GLM matrix product for 𝛽𝐵.
Equivalently, we can overlay the predicted (= estimated) time-series 𝑋�̂�𝐴 and 𝑋�̂�𝐵 and the observed
time-series to confirm that our choice of the beta parameter estimates as shown in the figure yields a good approximation of the voxel time-courses (Figure 8). Here, the parameter estimates resulted in 𝛽̂𝐴 = (1.0027, −0.0471)𝑇 and 𝛽̂𝐵 = (1.0033, 1.0025)𝑇.
Figure 8 Time course representation of predicted and observed voxel time-courses based on the parameter choices �̂�𝐴
and �̂�𝐵.
In summary, the value of the (true, but unknown, or estimated) beta parameter that belongs to a specific onset regressor tells us something about the voxel's preference with respect to the experimental condition: for voxel A, the first entry in 𝛽𝐴/𝛽̂𝐴 is large (≈ 1) and the second entry in 𝛽𝐴/𝛽̂𝐴 is small (≈ 0).
From the discussion above we see that this means that voxel A is responsive to condition 1, but not to
condition 2. Likewise, for voxel B, the two entries in 𝛽𝐵/�̂�𝐵 are very similar (both are ≈ 1), and as discussed
above, voxel B responds equally well to both conditions.
Finally, performing statistical tests on the estimated parameters for all voxels over the brain as discussed in Section 7 yields the so-called “statistical parametric maps”. For example, for voxel A, the T-statistic using a contrast vector of the form 𝑐 = (1, −1)𝑇 will yield a value deviating substantially from zero, while for voxel B the difference between the entries in 𝛽̂𝐵 is around zero, and the T-statistic is thus also close to zero (given that the estimated variance parameter is not also close to zero). Marking voxels with large positive T-statistic values (e.g. > 2) with hot colors and voxels with large negative T-statistic values (e.g. < −2) with cold colors then results in statistical parametric maps. Note that these maps are usually thresholded based on the corresponding p-values, so that, for example, voxels with T-statistic values corresponding to p-values larger than 0.001 are not colored at all.
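The contrast-based T-statistic used here follows the standard GLM formula 𝑡 = 𝑐𝑇𝛽̂ / √(𝜎̂2 𝑐𝑇(𝑋𝑇𝑋)−1𝑐). The design matrix, noise level, and the two simulated “voxels” below are illustrative assumptions:

```python
import numpy as np

def glm_t_statistic(y, X, c):
    """T-statistic for the contrast c under the standard GLM."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p)   # unbiased/REML
    return (c @ beta_hat) / np.sqrt(sigma2_hat * c @ XtX_inv @ c)

rng = np.random.default_rng(5)
n = 100
X = rng.normal(size=(n, 2))          # two illustrative condition regressors
c = np.array([1.0, -1.0])            # condition 1 minus condition 2

y_A = X @ np.array([1.0, 0.0]) + 0.5 * rng.normal(size=n)   # "voxel A"
y_B = X @ np.array([1.0, 1.0]) + 0.5 * rng.normal(size=n)   # "voxel B"

t_A = glm_t_statistic(y_A, X, c)     # large in magnitude: effects differ
t_B = glm_t_statistic(y_B, X, c)     # typically near zero: effects are equal
```

Applying this computation voxel-by-voxel and thresholding the resulting values is what produces a statistical parametric map.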
Figure 9. A statistical parametric map
Study Questions

1. What does it mean that the GLM is used for FMRI data analysis in a “mass-univariate” fashion?
2. Describe the FMRI data organization after FMRI data preprocessing.
3. What is the difference between a hemodynamic response and a hemodynamic response function?
4. Which GLM design category is used for the analysis of FMRI time-series data?
5. What do the beta parameter estimates obtained in a GLM of the FMRI time-series data of a single voxel reflect?
Study Question Answers
1. In standard GLM analyses of FMRI data sets, the time-series data from each individual voxel are modelled using the same design matrix, and parameter estimation and inference are performed in a voxel-by-voxel fashion. The signal modelled is thus “univariate”, i.e. it corresponds to the scalar voxel-specific MR signal over time. However, the same procedure is used as many times as there are voxels in the data set, hence “mass-univariate”.
2. After data acquisition and data preprocessing (which does not alter the data format, but only the data content), FMRI data correspond to “voxel time-courses”. In other words, for each three-dimensional brain location (voxel) a time-course of MR signal values is obtained.
3. A hemodynamic response is an empirical entity. In the context of FMRI it corresponds to the MR signal change measured at a given brain location, usually in response to some stimulation. A hemodynamic response function, on the other hand, is a theoretical
entity. It corresponds to an idealized, mathematical model of a hemodynamic response and is often employed in GLM-based FMRI data analyses.
4. FMRI time-series data is usually analysed using multiple linear regression models. In these models, the design matrix columns correspond to continuous “regressors” or “predictors”, which in their weighted summation correspond to the (deterministic) model of observed MR signal changes at a given voxel.
5. Beta parameter estimates in GLM analyses of FMRI data reflect the estimated effect that a given regressor has on the observed signal of a given voxel. For example, if the regressor encodes the temporal evolution of a specific experimental condition, the beta parameter estimate serves as the measure of the effect size of this condition for the given voxel.
First-Level Regressors
In the previous Section we have seen that the stimulus-onset times are converted into “predicted MR signal time-courses” which form the columns of the voxel time-course GLM. In the current Section, we explore the rationale and the technical details of this conversion, commonly referred to as the “convolution of stimulus onset functions with a hemodynamic response function”. In general, this convolution-based approach is motivated by a “linear system” view of the experimental perturbation and the evoked BOLD response, which we discuss below.
(1) Discrete time-signals
Discrete-time signals can be represented mathematically as “real-valued sequences”. A sequence of numbers, in which the 𝑖th number in the sequence is denoted 𝑥𝑖, is formally written as

𝑥 ≔ (𝑥𝑖) (1)
If the number of elements 𝑥𝑖 is finite, we call the sequence a “finite sequence”. Sequences of numbers
acquired in an experimental context are usually finite. In this case, if 𝑛 ∈ ℕ denotes the number of sequence
elements, a sequence is identical to an element of ℝ𝑛. If the number of elements 𝑥𝑖 of a sequence is
infinite, we call the sequence an “infinite sequence”. Infinite sequences are usually theoretical constructs,
that can serve as approximations to data sequences acquired in experimental settings. In the current context
we allow for both negative and positive discrete indices 𝑖, i.e., we set 𝑖 ∈ ℤ. Infinite sequences with integer
indices are identical to elements of ℝℤ. The value 𝑥𝑖 of a sequence is often referred to as the “𝑖th sample of
the sequence”. An example of a real-valued finite sequence with 𝑛 = 7 elements is the following
𝑥 = (𝑥−3, 𝑥−2, 𝑥−1, 𝑥0, 𝑥1, 𝑥2, 𝑥3) = (5.2, 1.0, 4.2, 𝜋, √2, 3.8, 1.0) (2)
Discrete sequences are most sensibly visualized using “stem plots”; however, they are often visualized more conventionally using line plots for convenience (Figure 1).
For two infinite sequences or two finite sequences of equal length 𝑥′ = (𝑥𝑖′) and 𝑥′′ = (𝑥𝑖′′), we define their sum by element-wise addition, i.e., if

𝑦 = 𝑥′ + 𝑥′′ ⇔ (𝑦𝑖) = (𝑥𝑖′) + (𝑥𝑖′′) (3)

then

𝑦𝑖 = 𝑥𝑖′ + 𝑥𝑖′′ for all 𝑖 ∈ ℤ or 𝑖 ∈ ℕ𝑛. (4)
Likewise, we define the scalar multiplication of a sequence (𝑥𝑖) with a scalar 𝑎 ∈ ℝ as the multiplication of
all elements of the sequence with that scalar. In other words, if
𝑦 = 𝑎𝑥 ⇔ (𝑦𝑖) = 𝑎(𝑥𝑖) (5)
then
𝑦𝑖 = 𝑎𝑥𝑖 for all 𝑖 ∈ ℤ or 𝑖 ∈ ℕ𝑛 (6)
An important example of a sequence is the so-called “unit sample sequence”. It is defined as the infinite
sequence
𝛿 ≔ (𝛿𝑖), where 𝛿𝑖 ≔ {1, 𝑖 = 0; 0, 𝑖 ≠ 0} for all 𝑖 ∈ ℤ (7)

The unit sample sequence thus takes on the value 1 at sequence index 𝑖 = 0, and takes on the value 0 for all other values of 𝑖 ∈ ℤ. It can be visualized as a “stick function” with a single stick at 𝑖 = 0 (Figure 1). The unit
sample sequence plays the same role for discrete-time signals and systems that the unit impulse function or
“Dirac delta function” does for continuous-time signals and systems. For convenience, we often refer to the
unit sample sequence as a “discrete-time impulse” or simply as an “impulse”. It is important to note that a discrete-time impulse does not suffer from the mathematical complications of the continuous-time impulse; its definition is simple and precise.
Figure 1. Discrete time-signals. The upper left panel depicts the example sequence (2) in form of a “stem” plot, while the upper right panel depicts the same sequence in more conventional, but less precise, form. The lower panel depicts the unit sample sequence 𝛿.
One of the important aspects of the unit sample sequence is that the elements of an arbitrary
sequence can be represented as a sum of scaled, delayed impulse functions. For example, the elements of
the sequence 𝑥 given by
𝑥 ≔ (𝑥−3, 𝑥−2, 𝑥−1, 𝑥0, 𝑥1, 𝑥2, 𝑥3) = (𝑎−3, 0, 0, 0, 𝑎1, 𝑎2, 0) with 𝑎−3, 𝑎1, 𝑎2 ∈ ℝ (8)
can be expressed as
𝑥𝑖 = 𝑎−3 ⋅ 𝛿𝑖+3 + 0 ⋅ 𝛿𝑖+2 + 0 ⋅ 𝛿𝑖+1 + 0 ⋅ 𝛿𝑖 + 𝑎1 ⋅ 𝛿𝑖−1 + 𝑎2 ⋅ 𝛿𝑖−2 + 0 ⋅ 𝛿𝑖−3 (9)
for 𝑖 = −3, −2, −1, 0, 1, 2, 3. As an example, consider the case 𝑖 = 2. Then (9) results in
𝑥2 = 𝑎−3𝛿2+3 + 0 ⋅ 𝛿2+2 + 0 ⋅ 𝛿2+1 + 0 ⋅ 𝛿2 + 𝑎1 ⋅ 𝛿2−1 + 𝑎2 ⋅ 𝛿2−2 + 0 ⋅ 𝛿2−3
= 𝑎−3 ⋅ 𝛿5 + 0 ⋅ 𝛿4 + 0 ⋅ 𝛿3 + 0 ⋅ 𝛿2 + 𝑎1 ⋅ 𝛿1 + 𝑎2 ⋅ 𝛿0 + 0 ⋅ 𝛿−1
= 𝑎−3 ⋅ 0 + 0 ⋅ 0 + 0 ⋅ 0 + 0 ⋅ 0 + 𝑎1 ⋅ 0 + 𝑎2 ⋅ 1 + 0 ⋅ 0
= 𝑎2 (10)
More generally speaking, the elements of any sequence can be expressed as

𝑥𝑖 = ∑_{𝑘=−∞}^{∞} 𝑥𝑘 𝛿𝑖−𝑘 (11)

Equation (11) is important, because it expresses the sequence values 𝑥𝑖 as an infinite sum of the products of shifted impulse sequences 𝛿𝑖−𝑘 (recall that 𝛿𝑖−𝑘 is one for 𝑖 = 𝑘 and zero everywhere else) with the original sequence values 𝑥𝑘 as coefficients. This representation of a sequence will be used below to
express the transformation of a sequence under a linear system.
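The impulse representation (11) can be checked numerically. The following is a minimal Python sketch (the document's own numerical examples use Matlab, so the language choice is an assumption here): it reconstructs every element of the finite example sequence (2) as a weighted sum of shifted unit sample sequences.

```python
def delta(i):
    """Unit sample sequence of equation (7): 1 at index 0, 0 elsewhere."""
    return 1.0 if i == 0 else 0.0

# A finite sequence with support on indices -3 .. 3, stored as a dict from
# index to value (values approximate those of equation (2)).
x = {-3: 5.2, -2: 1.0, -1: 4.2, 0: 3.14159, 1: 1.41421, 2: 3.8, 3: 1.0}

# Equation (11): each element x_i equals the sum over k of x_k * delta(i - k),
# i.e. a weighted sum of shifted unit sample sequences.
reconstructed = {i: sum(x[k] * delta(i - k) for k in x) for i in x}

print(reconstructed == x)  # True: the impulse representation recovers x
```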
(2) Discrete-time systems
A discrete-time system can be defined mathematically as a function 𝑇 that maps values of an input
sequence 𝑥 onto values of an output sequence 𝑦.
𝑇:ℝℤ → ℝℤ, (𝑥𝑖) ↦ (𝑦𝑖) ≔ 𝑇((𝑥𝑖)) (1)
As usual 𝑇 represents a rule or formula for computing the output sequence value from the input sequence
values. It is important to note that the value of the output sequence at each value of the index 𝑖 ∈ ℤ can
depend on all or part of the entire sequence 𝑥. Usually, systems are defined by means of the definition of
their values, i.e. the general definition of a system as in (1) is followed by a definition of the values 𝑦𝑖 of (𝑦𝑖)
in terms of the values 𝑥𝑖 of (𝑥𝑖). An example is the so-called “accumulator system” which is defined by the
transformation
𝑇:ℝℤ → ℝℤ, (𝑥𝑖) ↦ 𝑇((𝑥𝑖)) = (𝑦𝑖), where 𝑦𝑖 ≔ ∑_{𝑘=−∞}^{𝑖} 𝑥𝑘 (2)
In the current context, we are concerned with specific discrete time-systems: linear and time-invariant
discrete-time systems, also referred to as “LTI systems”. We next define the concept of a linear system and
then the concept of a time-invariant system.
Linear Systems
Linear systems are systems that are additive and homogeneous. A system 𝑇 is said to be “additive” if and only if

𝑇((𝑥𝑖′) + (𝑥𝑖′′)) = 𝑇((𝑥𝑖′)) + 𝑇((𝑥𝑖′′)) (3)

for all input sequences (𝑥𝑖′) and (𝑥𝑖′′). In other words, a system is said to be additive if its output for a sum of two input sequences is equal to the sum of the system’s outputs for each individual input sequence. A system is said to be “homogeneous” if and only if
𝑇(𝑎(𝑥𝑖)) = 𝑎𝑇((𝑥𝑖)) (4)
for all scalars 𝑎 ∈ ℝ and all sequences (𝑥𝑖). A system 𝑇 that is both additive and homogeneous is referred to as a “linear system”.
Linear systems fulfill the following “superposition principle”. If a system 𝑇 is linear, then for all sequences (𝑥𝑖′) and (𝑥𝑖′′) and all scalars 𝑎, 𝑏 ∈ ℝ

𝑇(𝑎(𝑥𝑖′) + 𝑏(𝑥𝑖′′)) = 𝑎𝑇((𝑥𝑖′)) + 𝑏𝑇((𝑥𝑖′′)) (5)
Proof of (5)

Because the system is additive, we have

𝑇(𝑎(𝑥𝑖′) + 𝑏(𝑥𝑖′′)) = 𝑇(𝑎(𝑥𝑖′)) + 𝑇(𝑏(𝑥𝑖′′)) (5.1)

Because the system is homogeneous, we further have

𝑇(𝑎(𝑥𝑖′)) + 𝑇(𝑏(𝑥𝑖′′)) = 𝑎𝑇((𝑥𝑖′)) + 𝑏𝑇((𝑥𝑖′′)) (5.2)

Thus

𝑇(𝑎(𝑥𝑖′) + 𝑏(𝑥𝑖′′)) = 𝑎𝑇((𝑥𝑖′)) + 𝑏𝑇((𝑥𝑖′′)) (5.3)

i.e., the superposition principle holds.

□
As an example, we show that the accumulator system is a linear system. To this end, we consider two sequences (𝑥𝑖′) and (𝑥𝑖′′), which the accumulator system maps onto the two sequences

𝑇((𝑥𝑖′)) = (𝑦𝑖′), where 𝑦𝑖′ ≔ ∑_{𝑘=−∞}^{𝑖} 𝑥𝑘′ (6)

and

𝑇((𝑥𝑖′′)) = (𝑦𝑖′′), where 𝑦𝑖′′ ≔ ∑_{𝑘=−∞}^{𝑖} 𝑥𝑘′′ (7)

respectively. Using the definition of the multiplication of sequences with scalars, the definition of sequence addition, and the definition of the accumulator system, we have

(𝑦𝑖) = 𝑇(𝑎(𝑥𝑖′) + 𝑏(𝑥𝑖′′)), where 𝑦𝑖 ≔ ∑_{𝑘=−∞}^{𝑖} (𝑎𝑥𝑘′ + 𝑏𝑥𝑘′′) (8)

Considering the right-hand side of the above, we have with the well-known properties of sums

𝑦𝑖 = ∑_{𝑘=−∞}^{𝑖} (𝑎𝑥𝑘′ + 𝑏𝑥𝑘′′) = ∑_{𝑘=−∞}^{𝑖} 𝑎𝑥𝑘′ + ∑_{𝑘=−∞}^{𝑖} 𝑏𝑥𝑘′′ = 𝑎 ∑_{𝑘=−∞}^{𝑖} 𝑥𝑘′ + 𝑏 ∑_{𝑘=−∞}^{𝑖} 𝑥𝑘′′ = 𝑎𝑦𝑖′ + 𝑏𝑦𝑖′′ (9)

Because the above holds for all 𝑖 ∈ ℤ, we have found

(𝑦𝑖) = 𝑎(𝑦𝑖′) + 𝑏(𝑦𝑖′′) = 𝑎𝑇((𝑥𝑖′)) + 𝑏𝑇((𝑥𝑖′′)) (10)

and thus

𝑇(𝑎(𝑥𝑖′) + 𝑏(𝑥𝑖′′)) = 𝑎𝑇((𝑥𝑖′)) + 𝑏𝑇((𝑥𝑖′′)) (11)
Time-invariant systems
A time-invariant system (also referred to as a “shift-invariant” system) is a system for which a time
shift or delay of the input sequence causes a corresponding shift in the output sequence. Formally, suppose
that a system transforms the input sequence (𝑥𝑖) into the output sequence (𝑦𝑖). Then, the system is said to be time-invariant if, for all 𝑖0 ∈ ℤ, the input sequence (𝑥𝑖′), defined by 𝑥𝑖′ ≔ 𝑥𝑖−𝑖0 for all 𝑖 ∈ ℤ, produces the output sequence (𝑦𝑖′), where 𝑦𝑖′ ≔ 𝑦𝑖−𝑖0 for all 𝑖 ∈ ℤ.
In the next section, we exploit the linearity and time-invariance properties of LTI systems to introduce the notion of discrete-time convolution.
(3) Convolution
For LTI systems, the output or “response” to an arbitrary input sequence (𝑥𝑖) can be obtained by convolving the input sequence with the system’s “impulse response function”, which is the system’s output sequence for a unit sample sequence. In other words, for an LTI system
𝑇:ℝℤ → ℝℤ, (𝑥𝑖) ↦ (𝑦𝑖) ≔ 𝑇((𝑥𝑖)) (1)
with impulse response function
(ℎ𝑖) ≔ 𝑇((𝛿𝑖)) (2)
the elements of (𝑦𝑖) can be written as

𝑦𝑖 = ∑_{𝑘=−∞}^{∞} 𝑥𝑘 ℎ𝑖−𝑘 for all 𝑖 ∈ ℤ (3)
Proof of (3)

To see that the above holds, first consider the representation of a sequence 𝑥 as an infinite sum of the products of shifted impulse sequences 𝛿𝑖−𝑘 with the values of its elements, i.e.

(𝑥𝑖) = ∑_{𝑘=−∞}^{∞} 𝑥𝑘 (𝛿𝑖−𝑘) (3.1)

Because the system 𝑇 considered is linear and obeys the superposition principle, its output to (3.1) can be written as

(𝑦𝑖) = 𝑇((𝑥𝑖)) = 𝑇(∑_{𝑘=−∞}^{∞} 𝑥𝑘 (𝛿𝑖−𝑘)) = ∑_{𝑘=−∞}^{∞} 𝑥𝑘 𝑇((𝛿𝑖−𝑘)) (3.2)

In other words, the output of a linear system to an input sequence 𝑥 can be written as the infinite sum of the outputs of the system to the unit sample sequences (𝛿𝑖−𝑘), weighted by the values 𝑥𝑘 of the input sequence 𝑥. Let (ℎ𝑖𝑘) denote the output sequence of the system in response to the unit sample sequence (𝛿𝑖−𝑘), i.e.

(ℎ𝑖𝑘) ≔ 𝑇((𝛿𝑖−𝑘)) (3.3)

For a time-invariant system, if (ℎ𝑖) denotes the response to the unit sample sequence (𝛿𝑖), then (ℎ𝑖−𝑘) corresponds to the response to the shifted unit sample sequence (𝛿𝑖−𝑘), such that we can write

(ℎ𝑖−𝑘) = 𝑇((𝛿𝑖−𝑘)) (3.4)

We can thus write

(𝑦𝑖) = ∑_{𝑘=−∞}^{∞} 𝑥𝑘 𝑇((𝛿𝑖−𝑘)) = ∑_{𝑘=−∞}^{∞} 𝑥𝑘 (ℎ𝑖−𝑘) (3.5)

where, notably,

𝑦𝑖 = ∑_{𝑘=−∞}^{∞} 𝑥𝑘 ℎ𝑖−𝑘 for all 𝑖 ∈ ℤ (3.6)

□
The formation of a system’s output sequence by means of computing its values via the sum

𝑦𝑖 = ∑_{𝑘=−∞}^{∞} 𝑥𝑘 ℎ𝑖−𝑘 for all 𝑖 ∈ ℤ (4)

where 𝑥𝑘 denotes the 𝑘th value of the input sequence 𝑥 and ℎ𝑖−𝑘 denotes the (𝑖 − 𝑘)th value of the system’s impulse response function, i.e., the response of the system to the unit sample sequence 𝛿, is referred to as the “convolution of the input sequence with the impulse response function” and denoted by

𝑦 = 𝑥 ⨂ ℎ (5)

The operation of discrete-time convolution thus takes two sequences (𝑥𝑖) and (ℎ𝑖) and produces a third sequence (𝑦𝑖), the values 𝑦𝑖 of which are given by the convolution sum (4).
Note that the evaluation of each value of the output sequence in (4) requires the computation of an infinite sum. In practical scenarios, this can often be avoided by considering systems with “finite impulse response functions”. Finite impulse response functions are impulse response functions that have non-zero values only on a finite support set −𝑚, −𝑚 + 1, … , 𝑚 − 1, 𝑚 of indices. In this case, the values of the output sequence (𝑦𝑖) can be computed based on the finite sums

𝑦 = ℎ ⨂ 𝑥 ⇒ 𝑦𝑖 = ∑_{𝑗=−𝑚}^{𝑚} ℎ𝑗 𝑥𝑖−𝑗 for all 𝑖 ∈ ℤ (6)
Proof of (6)

To show that for finite impulse response functions ℎ the infinite sum for each element 𝑦𝑖 of the output sequence 𝑦 of a system can be evaluated based on a finite sum comprising as many terms as the impulse response function contains non-zero elements, we first show that the convolution operation is commutative, i.e., that

𝑦 = 𝑥 ⨂ ℎ = ℎ ⨂ 𝑥 (6.1)

Specifically, substituting 𝑗 ≔ 𝑖 − 𝑘 (and thus 𝑘 = 𝑖 − 𝑗) in (4), we have

𝑦𝑖 = ∑_{𝑘=−∞}^{∞} 𝑥𝑘 ℎ𝑖−𝑘 = ∑_{𝑗=−∞}^{∞} 𝑥𝑖−𝑗 ℎ𝑗 = ∑_{𝑗=−∞}^{∞} ℎ𝑗 𝑥𝑖−𝑗 for all 𝑖 ∈ ℤ ⇔ 𝑦 = ℎ ⨂ 𝑥 (6.2)

Now, for a finite impulse response function ℎ, i.e., a sequence ℎ that takes on non-zero values only on a finite support set −𝑚, −𝑚 + 1, … , 𝑚 − 1, 𝑚, one may thus use a finite summation for the evaluation of 𝑦:

𝑦 = ℎ ⨂ 𝑥 ⇒ 𝑦𝑖 = ∑_{𝑗=−∞}^{∞} ℎ𝑗 𝑥𝑖−𝑗 = ⋯ + 0 + 0 + ∑_{𝑗=−𝑚}^{𝑚} ℎ𝑗 𝑥𝑖−𝑗 + 0 + 0 + ⋯ = ∑_{𝑗=−𝑚}^{𝑚} ℎ𝑗 𝑥𝑖−𝑗 (6.3)

because for all other values of 𝑗 we have ℎ𝑗 = 0, and thus the contribution ℎ𝑗 𝑥𝑖−𝑗 to the sum is zero.

□
As an example, consider Figure 2. Here, the impulse response function is non-zero for the indices 𝑗 = 0, … , 7 (i.e., it is of length 8) and zero elsewhere. The system input function comprises all zeros, except for 𝑖 = 0, 𝑖 = 9, 𝑖 = 17, 𝑖 = 21, and 𝑖 = 29. Note that the input function is also finite in the current case, comprising 40 elements. Evaluation of

𝑦𝑖 = ∑_{𝑗=0}^{7} ℎ𝑗 𝑥𝑖−𝑗 for 𝑖 = 0, 1, 2, … , 39 (7)

results in the sequence shown in the lower panel. Note that for 𝑖 < 7, some of the sequence elements 𝑥𝑖−𝑗 in (7) are not defined. To nevertheless evaluate 𝑦𝑖 for 𝑖 < 7, one usually “pads” the input sequence with zeros. In the current case, this amounts to setting 𝑥𝑘 ≔ 0 for 𝑘 = −7, −6, … , −1.
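The finite convolution sum (7), including the zero-padding convention for small 𝑖, can be sketched in Python as follows. The kernel values below are arbitrary illustrative numbers standing in for the length-8 impulse response of the example; only the impulse onsets match the text.

```python
def conv_finite(x, h):
    """Finite convolution y_i = sum_j h_j * x_{i-j} (eq. (7)), with the input
    implicitly zero-padded for indices i - j outside 0 .. len(x) - 1."""
    return [sum(h[j] * x[i - j] for j in range(len(h)) if 0 <= i - j < len(x))
            for i in range(len(x))]

# Arbitrary length-8 impulse response standing in for the h of the example.
h = [0.0, 0.3, 0.8, 1.0, 0.7, 0.3, 0.1, 0.0]

# Input of 40 elements with unit impulses at the onsets used in the example.
x = [0.0] * 40
for onset in (0, 9, 17, 21, 29):
    x[onset] = 1.0

y = conv_finite(x, h)
print(y[:8] == h)  # True: an isolated unit impulse reproduces a copy of h
```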
Figure 2. Convolution. The upper right panel depicts a finite impulse response function ℎ. The upper left panel depicts an input sequence 𝑥. The lower panel depicts the response of an LTI system with impulse response function ℎ to the input sequence 𝑥, or, equivalently, the convolution of the sequence 𝑥 with the sequence ℎ.
(4) The canonical hemodynamic response function
The canonical hemodynamic response function used to model the MR signal time-course in response to an instantaneous neural event is based on the gamma probability density function. However, in the current context, the gamma probability density function is not viewed as a probability density, but merely as a function, i.e., it carries no probabilistic connotation. The input argument of this function is peri-stimulus time, and the function is parameterized in its shape and rate form

𝐺 ∶ ℝ+ → ℝ+, 𝑡 ↦ 𝐺(𝑡; 𝛼, 𝛽) ≔ (𝛽^𝛼/𝛤(𝛼)) 𝑡^{𝛼−1} exp(−𝛽𝑡) for 𝛼, 𝛽 > 0 (1)
Recall that in (1),

𝛤 ∶ ℝ+\{0} → ℝ, 𝛼 ↦ 𝛤(𝛼) = ∫_0^∞ 𝜏^{𝛼−1} 𝑒^{−𝜏} 𝑑𝜏 (2)

denotes the Gamma function.
The canonical hemodynamic response function is given by the difference of two gamma probability density functions. The first density function describes the main response of an MR signal increase following stimulation, while the second density function, which is subtracted from the first, describes the post-stimulus undershoot. For an onset at 𝑡 = 0, a length of 𝑇 > 0 seconds, and a time-bin width of 𝛥𝑡 > 0, the canonical hemodynamic response function is parameterized by five parameters 𝜃1, … , 𝜃5 > 0 and given by

𝐻 ∶ ℕ𝑛0 → ℝ, 𝑢 ↦ 𝐻(𝑢) ≔ 𝐺(𝑢; 𝜃1/𝜃3, 𝛥𝑡/𝜃3) − (1/𝜃5) 𝐺(𝑢; 𝜃2/𝜃4, 𝛥𝑡/𝜃4) (3)

where 𝑢 is a time-bin index in the support interval and 𝑛 ≔ 𝑇/𝛥𝑡 is the number of support points, rounded to the nearest integer. In this parameterization, the parameters have the following interpretations: 𝜃1 corresponds to the delay of the peak of the hemodynamic response function with respect to its onset, and 𝜃2 corresponds to the delay of the post-stimulus undershoot with respect to its onset. 𝜃3 and 𝜃4 describe the widths of the main response and the undershoot, respectively. Finally, 𝜃5 encodes the ratio of the main response to the undershoot.
Commonly chosen values for the parameters of the hemodynamic response function, specifically in the context of high temporal resolution convolution (see below), are 𝑇 = 32 seconds and 𝛥𝑡 = 𝑇𝑅/16 seconds, where 𝑇𝑅 is the scan repetition time used in the experimental data acquisition in seconds, and 𝜃1 = 6, 𝜃2 = 16, 𝜃3 = 1, 𝜃4 = 1, and 𝜃5 = 6. Figure 1 visualizes the canonical hemodynamic response function for four different parameter settings with 𝑇 = 32 and 𝑇𝑅 = 2, i.e., 𝛥𝑡 = 0.125 seconds.
Figure 1. The canonical hemodynamic response function for four different parameter settings of 𝜃 ≔ (𝜃1, 𝜃2, 𝜃3, 𝜃4, 𝜃5).
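Under the parameterization of equation (3), the canonical HRF can be evaluated in a few lines. The following is a minimal Python sketch (an assumption, since the document's numerical examples use Matlab), with math.gamma supplying the Gamma function of equation (2); with the commonly chosen parameter values stated above, the main response peaks near 5 seconds of peri-stimulus time.

```python
import math

def gamma_pdf(t, alpha, beta):
    """Gamma density of equation (1) in shape (alpha) / rate (beta) form."""
    if t <= 0:
        return 0.0
    return beta ** alpha / math.gamma(alpha) * t ** (alpha - 1) * math.exp(-beta * t)

def canonical_hrf(tr=2.0, total=32.0, theta=(6.0, 16.0, 1.0, 1.0, 6.0)):
    """Double-gamma HRF of equation (3) on time bins of width dt = TR / 16."""
    t1, t2, t3, t4, t5 = theta
    dt = tr / 16.0
    n = round(total / dt)
    hrf = [gamma_pdf(u, t1 / t3, dt / t3) - gamma_pdf(u, t2 / t4, dt / t4) / t5
           for u in range(n + 1)]
    return hrf, dt

hrf, dt = canonical_hrf()
peak_bin = max(range(len(hrf)), key=lambda u: hrf[u])
print(peak_bin * dt)  # peri-stimulus time of the main response peak, close to 5 s
```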
(5) Stimulus onset convolution in GLM-FMRI
In this section, we discuss some details of the generation of predicted MR signal time-courses that result from the combination of linear time-invariant system theory with stimulus onset functions and the canonical hemodynamic response function. We first note that the convolution of stimulus onset functions with the canonical HRF is commonly performed at a higher temporal resolution than that given by the MR sampling rate (inverse TR) (Figure 1).
We next consider the effect of different stimulus durations and amplitudes on the predicted MR signal time-course, assuming no temporal overlap. As evident from the upper panel of Figure 2, increasing the stimulus duration has two effects: firstly, up to a critical duration, it scales the predicted hemodynamic response, i.e., longer-duration stimuli predict an overall larger MR signal change. Secondly, if the stimulus duration exceeds the duration of the canonical HRF kernel, no further signal increase is predicted, but the return to baseline is predicted to be delayed. On the other hand, keeping the stimulus duration constant and changing its amplitude leads to proportional changes in the amplitude of the predicted MR signal, but no change in its temporal evolution (Figure 2, lower panel).
Figure 1. Microtime resolution convolution
Figure 2. Effects of stimulus duration and amplitude
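The duration effect described above can be illustrated numerically: convolving boxcar stimulus functions of increasing duration with a fixed kernel increases the predicted peak response until the boxcar outlasts the kernel. The kernel below is an arbitrary HRF-like shape chosen for illustration, not the canonical HRF.

```python
def convolve(x, h):
    """Zero-padded discrete convolution of the input x with the kernel h."""
    return [sum(h[j] * x[i - j] for j in range(len(h)) if 0 <= i - j)
            for i in range(len(x))]

# Arbitrary HRF-like kernel (not the canonical HRF) on a coarse time grid.
h = [0.0, 0.2, 0.6, 1.0, 0.8, 0.4, 0.1, -0.1, -0.05, 0.0]

# Boxcar stimulus functions of duration 1 and 5 time bins, respectively.
short = [1.0 if i < 1 else 0.0 for i in range(60)]
long_ = [1.0 if i < 5 else 0.0 for i in range(60)]

peak_short = max(convolve(short, h))
peak_long = max(convolve(long_, h))
print(peak_short < peak_long)  # True: longer stimuli predict larger responses
```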
Finally, we consider the effect of zero-duration, constant-amplitude stimulus onset functions which evoke temporal overlap in the corresponding MR signal (Figure 3). As evident from Figure 3, short stimulus onset asynchronies prevent the predicted MR signal from returning to baseline between stimulus onsets. Up to a certain rate, the stimulus variability is still propagated to the MR signal; at high rates, however, the predicted MR signal takes the form of that predicted under single stimuli of long duration.
Figure 3. Effects of stimulus onset asynchronies.
Study Questions
1. How are the columns of a design matrix in the application of the GLM to FMRI commonly generated?
2. On which signal processing framework is the generation of GLM-FMRI regressors based?
3. Which mathematical function forms the basis of the “canonical HRF” and what does the canonical HRF describe?
4. What is the effect of increasing the duration of a stimulus in the stimulus-onset function/HRF-convolution framework for the GLM
analysis of FMRI data?
5. What is the effect of decreasing the inter-stimulus times for short duration stimulus onset functions in the stimulus-onset
function/HRF-convolution framework for the GLM analysis of FMRI data?
Study Questions Answers
1. Commonly, stimulus-onset times are converted into predicted MR signal time-courses which form the columns of the voxel time-
course GLM. Technically, the process corresponds to the convolution of stimulus onset functions with a hemodynamic response
function
2. The generation of GLM-FMRI regressors is based on the framework of discrete-time linear time-invariant system theory.
3. The canonical hemodynamic response function is based on the difference of two gamma probability density functions. The first
density function describes the main response of an MR signal increase following stimulation, while the second density function,
which is subtracted from the first, describes the post-stimulus undershoot.
4. Increasing the stimulus duration has two effects: firstly, up to a critical duration, it scales the predicted hemodynamic response,
i.e. the prediction of longer duration stimuli is an overall increase of the MR signal change. Secondly, if the stimulus duration exceeds
the duration of the canonical HRF kernel, no further signal increase is predicted, but the return to baseline is predicted to be
delayed.
5. Short inter-stimulus onset intervals prevent the predicted MR signal from returning to baseline between stimulus onsets. Up to a certain rate, the stimulus variability is still propagated to the MR signal; at high rates, however, the predicted MR signal takes the form of that predicted under single stimuli of long duration.
First-Level Design Matrices
(1) Parameterizing event-related FMRI designs
Experimental design in convolution-based GLM-FMRI conforms to two fundamental questions: (1)
which conditions should be included in the experiment and (2) when should which condition be presented?
In the current section, we consider the latter question, i.e., we assume that the number of conditions and
their properties have been decided upon and the question is how to distribute them over the experimental
time-course in order to achieve a design that is in some sense “good”. More specifically, the question of what defines a good design will be dealt with in the next section. In the current section, we consider the more
fundamental question of how designs can be parameterized. To this end, we make a number of simplifying
assumptions. Firstly, we do not consider psychological factors, such as the perceived randomness of the
design. Secondly, we consider only designs comprising one or two conditions.
In the case of a single experimental condition, the only question with respect to experimental design
amounts to the question of when to present the trials of the condition. Here, we assume that a trial of an
experimental condition comprises a single event and that the minimal time between onsets of events 𝑡∗,
referred to as “minimal stimulus-onset asynchrony” is constant. In this case, the design can be
parameterized in terms of an “event-probability function” which assigns to each possible event-time a
probability of the event happening at this time or not. More formally, let
𝑆 = {0, 𝑡∗, 2𝑡∗, 3𝑡∗, … , 𝑛𝑡∗} = {𝑘𝑡∗|𝑘 ∈ ℕ𝑛0} (1)
denote a partition of the total time of an experimental run 𝑇 ≔ 𝑛𝑡∗ ∈ ℝ+. Then an “event-probability
function” is a function
𝑓 ∶ 𝑆 → [0,1], 𝑘𝑡∗ ↦ 𝑓(𝑘𝑡∗) (2)
An event-probability function induces a set of random variables {𝑒𝑘𝑡∗ | 𝑘 ∈ ℕ𝑛0} that can take on the values 0 and 1, encoding the states that an event happens at time 𝑘𝑡∗ or that it does not, respectively. Notably, the
event probability function 𝑓 assigns to each time-point 𝑘𝑡∗ (𝑘 ∈ ℕ𝑛0) the probability of the event 𝑒𝑘𝑡∗ = 1,
such that
𝑓 ∶ 𝑆 → [0,1], 𝑘𝑡∗ ↦ 𝑓(𝑘𝑡∗) ≔ 𝑝(𝑒𝑘𝑡∗ = 1) (3)
Many event-probability functions are conceivable. An example of an event-probability function is the following. Assuming that 𝑛 is an even integer, an event-probability function can be defined as

𝑓1 ∶ 𝑆 → [0,1], 𝑘𝑡∗ ↦ 𝑓1(𝑘𝑡∗) ≔ {1, 𝑘 ≤ 𝑛/2; 0, 𝑘 > 𝑛/2} (4)

This function assigns a probability of 1 to the first 𝑛/2 + 1 minimal stimulus-onset asynchrony time-points, and a probability of 0 to the last 𝑛/2 minimal stimulus-onset asynchrony time-points (see Figure 1, uppermost panel). Another example of an event-probability function is

𝑓2 ∶ 𝑆 → [0,1], 𝑘𝑡∗ ↦ 𝑓2(𝑘𝑡∗) ≔ 1 (5)
which assigns an event-occurrence probability of 1 to each minimal stimulus-onset asynchrony time-point (see Figure 1, lowermost panel). A third example of an event-probability function is the following

𝑓3 ∶ 𝑆 → [0,1], 𝑘𝑡∗ ↦ 𝑓3(𝑘𝑡∗) ≔ 0.5 (cos(2𝜋𝜔𝑘𝑡∗/𝑇) + 1) (6)

which parameterizes a time-varying event-probability by means of the frequency parameter 𝜔, representing the number of cycles per experimental run length 𝑇 (see Figure 1, middle panels, for 𝜔 = 2, 𝜔 = 4, and 𝜔 = 6, respectively).
In the case of two experimental conditions, the event-probability function has to encode the probabilities of trials of either condition occurring at each minimal stimulus-onset asynchrony time-point. In other words, the random variable 𝑒𝑘𝑡∗ now takes on four values 1, 2, 3, 4, encoding the events that a trial of the first condition is either happening (1) or not (2), and that a trial of the second condition is either happening (3) or not (4). A simple event-probability function in this case is given by

𝑓4 ∶ 𝑆 → [0,1], 𝑘𝑡∗ ↦ {𝑝(𝑒𝑘𝑡∗ = 1) = 0.5; 𝑝(𝑒𝑘𝑡∗ = 2) = 0.0; 𝑝(𝑒𝑘𝑡∗ = 3) = 0.5; 𝑝(𝑒𝑘𝑡∗ = 4) = 0.0} (7)

assuming that (1) a trial of either condition is presented at each 𝑘𝑡∗, and (2) the probability that this trial is of either condition is 0.5 (Figure 1 of the next section).
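Sampling a concrete two-condition design from the event-probability function 𝑓4 amounts to flipping a fair coin at every minimal stimulus-onset asynchrony time-point. The following Python sketch does this; the values of t_star and n are arbitrary illustrative choices.

```python
import random

random.seed(1)
t_star = 4   # minimal stimulus-onset asynchrony in seconds (arbitrary choice)
n = 20       # number of inter-onset intervals, so n + 1 candidate time-points

# Event-probability function f4 of equation (7): at every time k * t_star a
# trial occurs, and it belongs to condition 1 or condition 2 with probability
# 0.5 each.
onsets_c1, onsets_c2 = [], []
for k in range(n + 1):
    if random.random() < 0.5:
        onsets_c1.append(k * t_star)
    else:
        onsets_c2.append(k * t_star)

print(len(onsets_c1) + len(onsets_c2))  # n + 1: one trial per time-point
```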
Figure 1. Parameterization of single-condition convolution-based GLM designs in terms of event-probability functions. The red lines depict sampled onset functions according to the event-probability functions specified in the main text and visualized here as grey lines, and the blue lines depict the resulting design matrix regressors upon convolution with the canonical hemodynamic response function. In the current one-condition scenario, the design matrix comprises a single column only, such that 𝜉 = 𝑋𝑇𝑋. The uppermost and lowermost panels show that a blocked presentation design is more efficient than an equispaced design.
(2) Measuring event-related FMRI design efficiency
The “goodness” of a GLM design can be measured according to a variety of criteria. Here, we focus on a simple criterion that has been employed in the GLM-FMRI literature and relates to the variance of the beta parameter estimates. Specifically, recall that the distribution of the OLS beta estimator 𝛽̂ ∈ ℝ𝑝 is given by

𝑝(𝛽̂) = 𝑁(𝛽̂; 𝛽, 𝜎2(𝑋𝑇𝑋)−1) (1)

where 𝛽 ∈ ℝ𝑝 and 𝜎2 > 0 are the true, but unknown, GLM parameters, and 𝑋 ∈ ℝ𝑛×𝑝 is the design matrix. Based on the definition of the multivariate Gaussian distribution, the covariance matrix of the OLS beta estimator is thus

𝐶𝑜𝑣(𝛽̂) = 𝜎2(𝑋𝑇𝑋)−1 (2)

Intuitively, the diagonal elements of this covariance matrix encode how, for fixed 𝛽, 𝜎2, and 𝑋, the effect estimates 𝛽̂𝑗 (𝑗 = 1, … , 𝑝) of each experimental condition vary over repeated sampling from the GLM, or, in other words, how reliable these estimates are over repeated sampling of the identical GLM. According to (2), this variability is a function of the GLM parameter 𝜎2 and the inverse of the “design matrix correlation matrix” 𝑋𝑇𝑋. Assuming that 𝜎2 is constant, the variability of the effect size estimate 𝛽̂𝑗 is thus a function of the diagonal entry

((𝑋𝑇𝑋)−1)𝑗𝑗 = 𝑒𝑗𝑇(𝑋𝑇𝑋)−1𝑒𝑗 (3)
where 𝑒𝑗 denotes the 𝑗th canonical unit vector, i.e. the vector with all zeros except a one at the 𝑗th entry. For
a contrast vector 𝑐 ∈ ℝ𝑝, this motivates the following measure of design efficiency
𝜉 ∶ (𝑐, 𝑋) ↦ 𝜉(𝑐, 𝑋) ≔ (𝑐𝑇(𝑋𝑇𝑋)−1𝑐)−1 (4)
𝜉(𝑐, 𝑋) thus increases with decreasing variability of linear compounds of the parameter estimate 𝛽̂ over
repeated sampling of the GLM. Importantly, it depends on both the design matrix 𝑋 and the contrast of
interest 𝑐. In other words, according to the criterion (4) the same FMRI design can, in principle, be efficient
with respect to one contrast of interest and inefficient with respect to another.
A different interpretation of (4) is afforded by recalling the definition of the T-statistic as
𝑇 ≔ (𝑐𝑇𝛽̂ − 𝑐𝑇𝛽0) / √(𝜎2𝑐𝑇(𝑋𝑇𝑋)−1𝑐) (5)
for a null hypothesis represented by 𝛽0 ∈ ℝ𝑝 and the assumption of a known variance parameter 𝜎2. Assuming identical effect sizes, i.e., true, but unknown, values of 𝛽 ∈ ℝ𝑝, adopting the design efficiency criterion (4) corresponds to favoring larger 𝑇 values.
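The efficiency measure (4) is straightforward to evaluate for a two-column design matrix, since the 2 × 2 inverse can be written out by hand. The toy regressor values below are arbitrary illustrative numbers; they merely show that the same design matrix can be efficient for one contrast and inefficient for another.

```python
def efficiency(X, c):
    """Design efficiency xi(c, X) = (c^T (X^T X)^{-1} c)^{-1} of equation (4),
    for a two-column design matrix X; the 2x2 inverse is written out by hand."""
    a = sum(row[0] * row[0] for row in X)
    b = sum(row[0] * row[1] for row in X)
    d = sum(row[1] * row[1] for row in X)
    det = a * d - b * b
    inv = [[d / det, -b / det], [-b / det, a / det]]
    quad = sum(c[i] * inv[i][j] * c[j] for i in range(2) for j in range(2))
    return 1.0 / quad

# Toy regressors (arbitrary values): the two columns are strongly correlated.
X = [[1.0, 0.9], [0.8, 0.7], [0.2, 0.3], [0.5, 0.4], [0.9, 1.0]]
c_sum, c_diff = (1.0, 1.0), (1.0, -1.0)

# Correlated regressors are efficient for detecting common activation but
# inefficient for detecting differential activation.
print(efficiency(X, c_sum) > efficiency(X, c_diff))  # True
```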
Examples for the efficiency of GLM-FMRI designs with two conditions as a function of the minimal stimulus-onset asynchrony are shown in Figure 1. Specifically, for the two-condition event-probability function discussed in the previous section, two sampled designs for a fixed number of 20 events are shown for 𝑡∗ values of 4 and 12 seconds, respectively. In the lower right panel, the design efficiency 𝜉 as a function of 𝑡∗ is evaluated for two different contrasts of interest, 𝑐1 ≔ (1,1)𝑇, i.e., detecting activation over both conditions, and 𝑐2 ≔ (1,−1)𝑇, i.e., detecting differential activation between both conditions. Notably, short
stimulus onset asynchrony values are more efficient for detecting activation across both conditions, while
intermediate stimulus onset asynchronies of around 10 sec are most efficient for detecting differential
activations.
Figure 1. Measures of design efficiency in convolution-based GLM-FMRI. See main text for a detailed description.
(3) Finite impulse response designs
So far we have considered the standard approach to GLM-FMRI event-related design modelling using a pre-specified and fixed canonical hemodynamic response function as “basis function”. In principle, many different basis functions, i.e., abstract representations of the expected MR signal response at a given voxel, are conceivable. In the current Section, we discuss the most flexible approach, which in fact corresponds to an estimation of the HRF shape itself at a given voxel. In other words, the finite impulse response (FIR) approach may be seen as a GLM implementation of event-related hemodynamic response averaging. Event-related hemodynamic response averaging (also referred to as “selective averaging”) works by partitioning the data into peri-stimulus time-courses and averaging the corresponding data. FIR modelling, on the other hand, uses a combination of the notion of unit sample responses and the GLM formulation to achieve a similar goal.
To understand FIR designs, we first consider estimating the shape of a single peri-stimulus time-course using the GLM. To this end, reconsider the formulation of a finite discrete-time signal (𝑦𝑖) (𝑖 = 1, … , 𝑛) as a sum of weighted unit sample sequences 𝛿𝑖. We saw previously that we can rewrite a signal as

𝑦𝑖 = ∑_{𝑘=−∞}^{∞} 𝑦𝑘 𝛿𝑖−𝑘 (1)

Consider the specific case that the sum index 𝑘 runs from 𝑘 = 1 to 𝑘 = 𝑛, i.e., a finite impulse response:

𝑦𝑖 = ∑_{𝑘=1}^{𝑛} 𝑦𝑘 𝛿𝑖−𝑘 (2)
Then we have for the values 𝑦𝑖 (𝑖 = 1, … , 𝑛)

𝑦1 = 𝑦1𝛿1−1 + 𝑦2𝛿1−2 + 𝑦3𝛿1−3 + ⋯ + 𝑦𝑛𝛿1−𝑛 (3)

𝑦2 = 𝑦1𝛿2−1 + 𝑦2𝛿2−2 + 𝑦3𝛿2−3 + ⋯ + 𝑦𝑛𝛿2−𝑛

𝑦3 = 𝑦1𝛿3−1 + 𝑦2𝛿3−2 + 𝑦3𝛿3−3 + ⋯ + 𝑦𝑛𝛿3−𝑛

…

𝑦𝑛 = 𝑦1𝛿𝑛−1 + 𝑦2𝛿𝑛−2 + 𝑦3𝛿𝑛−3 + ⋯ + 𝑦𝑛𝛿𝑛−𝑛

which, because 𝛿𝑖−𝑘 = 1 for 𝑘 = 𝑖 and 𝛿𝑖−𝑘 = 0 otherwise, is equivalent to

𝑦1 = 𝑦1 ⋅ 1 (4)

𝑦2 = 𝑦2 ⋅ 1

𝑦3 = 𝑦3 ⋅ 1

…

𝑦𝑛 = 𝑦𝑛 ⋅ 1
Assume now that we aim to estimate the coefficients 𝑦𝑘 on the right-hand side of (4), i.e., we treat them as parameters and replace them by the symbols 𝛽𝑘:

𝑦𝑖 = ∑_{𝑘=1}^{𝑛} 𝛽𝑘 𝛿𝑖−𝑘 (5)

Then, by analogy to (4) and exchanging the order of multiplication, we have the following system of equations
𝑦1 = 1 ⋅ 𝛽1 (6)
𝑦2 = 1 ⋅ 𝛽2
𝑦3 = 1 ⋅ 𝛽3
…
𝑦𝑛 = 1 ⋅ 𝛽𝑛
We may thus rewrite (6) in matrix notation as

𝑦 = 𝑋𝛽 (7)

where 𝑦 ∈ ℝ𝑛, 𝑋 ≔ 𝐼𝑛 ∈ ℝ𝑛×𝑛, and 𝛽 ∈ ℝ𝑛. Assuming additive, independent and identically distributed zero-mean Gaussian noise, we arrive at the GLM formulation

𝑝(𝑦𝑖) = 𝑁(𝑦𝑖; 𝜇𝑖, 𝜎2) ⇔ 𝑦𝑖 = 𝜇𝑖 + 𝜀𝑖, 𝑝(𝜀𝑖) = 𝑁(𝜀𝑖; 0, 𝜎2) (𝑖 = 1, … , 𝑛) (8)
and
𝑋 ≔ 𝐼𝑛 = (
1 0 ⋯ 0
0 1 ⋯ 0
⋮ ⋮ ⋱ ⋮
0 0 ⋯ 1
) ∈ ℝ𝑛×𝑛, 𝛽 ∈ ℝ𝑛, 𝜇𝑖 = (𝑋𝛽)𝑖 (𝑖 = 1, … , 𝑛) (9)
In other words, to estimate the coefficients of a finite impulse response (1) based on a single observation of
this function, we can formulate a GLM using the (𝑛 × 𝑛) identity matrix as design matrix in a GLM which has
the same number of beta parameters as there are data points. Figure 1 below illustrates this process.
Figure 1. Estimation of a finite impulse response based on a single observation. The upper panels depict two true, but unknown, finite impulse responses, corresponding to the expectation parameters 𝜇 ∈ ℝ20 of a GLM. Taking single samples from the GLM with these expectations results in the center panels. Finally, the lower panels depict the estimated beta parameters 𝛽 ≔ (𝛽1, … , 𝛽𝑛)𝑇 based on the samples. As the design matrix corresponds to the identity matrix, the beta parameter values equal the sample values.
Consider next the case of observing the impulse response twice, i.e., observing 𝑦 ∈ ℝ2𝑛. In this case, the GLM formulation for averaging the corresponding data points into the correct peri-stimulus time bins corresponds to

𝑝(𝑦𝑖) = 𝑁(𝑦𝑖; 𝜇𝑖, 𝜎2) ⇔ 𝑦𝑖 = 𝜇𝑖 + 𝜀𝑖, 𝑝(𝜀𝑖) = 𝑁(𝜀𝑖; 0, 𝜎2) (𝑖 = 1, … , 2𝑛) (10)

and

𝑋 ≔ (𝐼𝑛
𝐼𝑛) ∈ ℝ2𝑛×𝑛, 𝛽 ∈ ℝ𝑛, 𝜇𝑖 = (𝑋𝛽)𝑖 (𝑖 = 1, … , 2𝑛) (11)
More generally speaking, for the FIR GLM design for FMRI data analyses, the stimulus impulse function design matrix as discussed above is replaced by identity matrices of the length of the expected impulse response function. Figure 2 below depicts this process. On the left, the stimulus impulse functions for two stimuli are shown. The center panel depicts the stimulus impulse functions convolved with a canonical HRF as discussed previously. Finally, the right-most panel depicts the FIR model design matrix for an expected HRF length of 16 TRs. In other words, each stimulus impulse on the left and its 15 post-stimulus entries were replaced by a (16 × 16) identity matrix. Correspondingly, if this design matrix is employed in a GLM, the first 16 parameters in the beta parameter vector 𝛽 ∈ ℝ32 will correspond to the finite impulse response coefficients for stimulus 1 and the second 16 parameters will correspond to the finite impulse response coefficients for stimulus 2.
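The construction of such an FIR design matrix from stimulus onsets can be sketched as follows (a Python/NumPy sketch; the onset times, scan count, and FIR order are illustrative assumptions, not values from the text):

```python
import numpy as np

def fir_design(onsets, n_scans, order):
    """FIR design matrix: one column per post-stimulus time bin."""
    X = np.zeros((n_scans, order))
    for t in onsets:                  # stimulus onset scan indices
        for j in range(order):        # post-stimulus bins 0 .. order-1
            if t + j < n_scans:
                X[t + j, j] = 1.0
    return X

# hypothetical example: two stimulus conditions, a 16-bin FIR basis each
X1 = fir_design([0, 40, 80], n_scans=120, order=16)
X2 = fir_design([20, 60, 100], n_scans=120, order=16)
X = np.hstack([X1, X2])               # full design matrix with 32 columns
assert X.shape == (120, 32)
```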
Figure 2 Illustration of the FIR model GLM implementation. The leftmost panel depicts the stimulus impulse functions in design matrix form, and the center panel the same functions upon convolution with a canonical double gamma HRF. The rightmost panel depicts the FIR model GLM implementation corresponding to the stimulus impulse functions on the left.
Consider sampling MR signal time-courses from the GLM formulated by the design matrix in the center of Figure 2 for true, but unknown, beta parameter values of 𝛽1 ≔ (0,0)𝑇, 𝛽2 ≔ (1,0)𝑇 and 𝛽3 ≔ (1,1)𝑇, resulting in the time-course shown in Figure 3. While the data shown in Figure 3 were sampled from a GLM with the design matrix shown in the center panel of Figure 2, one may nevertheless conceive of it as a realization of the FIR design matrix model shown in the rightmost panel of Figure 2, or, put simply, “analyse it with the FIR design matrix”. Estimating the beta parameter vector for this model,
comprising 2 ⋅ 16 = 32 entries, where the first 16 entries correspond to the finite impulse response for
stimulus 1 and the second 16 entries correspond to the finite impulse response for stimulus 2, results in the
BOLD time-courses shown in Figure 4. Note that the blue curve corresponds to the first 16 estimated beta parameters and the red curve to the second 16 estimated beta parameters of the GLM parameter vector 𝛽̂ ∈ ℝ32.
Figure 3 Simulated voxel time-series based on the canonical HRF design matrix shown in the center panel of Figure 2 for beta parameter vectors set to 𝛽1 ≔ (0,0)𝑇, 𝛽2 ≔ (1,0)𝑇 and 𝛽3 ≔ (1,1)𝑇.
Figure 4 FIR model results (= estimated beta parameters) for the data shown in Figure 3.
In summary, the FIR model formulation of the GLM allows one to estimate the hemodynamic response to stimulus conditions without the need to partition the data. It should be noted, however, that the FIR model procedure does not implicitly correct for the parameter uncertainty that results from overlapping hemodynamic response functions. In other words, it is most appropriate if the trial onsets are well separated in time (e.g. by 20 seconds) to allow the hemodynamic response to return to baseline. FIR models are not the standard way to analyse whole-brain GLM data, but can sometimes be useful to evaluate the hemodynamic response time-courses that gave rise to an interesting statistical effect.
(4) Psychophysiological interaction designs
Psychophysiological interaction (PPI) GLM designs for mass-univariate FMRI data analyses may be
viewed as the application of Analysis of Covariance (ANCOVA) designs to FMRI data acquired from a single
participant. Recall that we introduced the ANCOVA model as a combination of discrete-factorial and
continuous-parametric GLM designs. Specifically, we discussed the following additive ANCOVA design. For
𝑖 = 1,2 and 𝑗 = 1, … , 𝑛𝑖, we conceived the observation variables 𝑦𝑖𝑗 as realizations of random variables with univariate Gaussian distribution of the form
$$p(y_{ij}) = N(y_{ij}; \mu_{ij}, \sigma^2) \Leftrightarrow y_{ij} = \mu_{ij} + \varepsilon_{ij}, \quad p(\varepsilon_{ij}) = N(\varepsilon_{ij}; 0, \sigma^2), \quad \sigma^2 > 0 \qquad (1)$$
for which the dependence of the expected value for each 𝑦𝑖𝑗 was given by
𝜇𝑖𝑗 ≔ 𝜇0 + 𝛼𝑖 + 𝛽1𝑥𝑖𝑗 (2)
Here, 𝜇0 ∈ ℝ models an offset, 𝛼𝑖 the effect of the discrete factor taking on two levels (𝑖 = 1,2) and 𝛽1 ∈ ℝ
the contribution of the value 𝑥𝑖𝑗 ∈ ℝ of the continuous factor. In its GLM formulation, this model, upon its
reference cell reformulation was written as
$$y := \begin{pmatrix} y_{1,1} \\ \vdots \\ y_{1,n_1} \\ y_{2,1} \\ \vdots \\ y_{2,n_2} \end{pmatrix} \in \mathbb{R}^n, \quad X = \begin{pmatrix} 1 & 0 & x_{1,1} \\ \vdots & \vdots & \vdots \\ 1 & 0 & x_{1,n_1} \\ 1 & 1 & x_{2,1} \\ \vdots & \vdots & \vdots \\ 1 & 1 & x_{2,n_2} \end{pmatrix} \in \mathbb{R}^{n \times 3}, \quad \beta = \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \beta_1 \end{pmatrix} \in \mathbb{R}^3 \text{ and } \sigma^2 > 0 \qquad (3)$$
In PPI for GLM-FMRI, the data 𝑦, offset 𝜇0, main effect of the discrete factor 𝛼2, and the independent
variable values 𝑥𝑖𝑗 take on the following meaning:
- 𝑦 ∈ ℝ𝑛 represents the MR signal time-course of a specific voxel. The same GLM design is chosen for each voxel, hence we do not index the voxel number.
- 𝜇0 ∈ ℝ models the run-specific MR signal offset.
- The discrete experimental factor corresponds to a “psychological state”. For example, 𝛼2 may model the difference in MR signal observed under Task A with respect to Task B. An example for Task A could be “attend to the colour of a visual stimulus”, an example for Task B could be “attend to the shape of a visual stimulus”. Another example could be: Task A “Imagine a tactile stimulus” and Task B “Perceive a tactile stimulus”. In other words, 𝛼2 models the MR signal difference in response to different experimental conditions.
- 𝑥𝑖𝑗 ∈ ℝ represents the MR signal time-course of a fixed and pre-determined “seed region” or “seed voxel”. Without the psychological factor, the PPI design would thus correspond to a simple linear regression of the MR signal of each and every voxel onto the MR signal of a pre-determined voxel. We have seen above that simple linear regression and correlation are closely related. In the simplest terms, a PPI design without an experimental factor would thus correspond to a correlation of the voxel MR signal time-series of all voxels (one after the other) with a single seed voxel.
Based on the specific interpretations of the components introduced above, (3) then takes the following
structural form
$$p(y_i) = N(y_i; \mu_i, \sigma^2) \Leftrightarrow y_i = \mu_i + \varepsilon_i, \quad p(\varepsilon_i) = N(\varepsilon_i; 0, \sigma^2), \quad \sigma^2 > 0 \quad (i = 1, \dots, n) \qquad (4)$$

for which the dependence of the expected value for each 𝑦𝑖 is given by

$$\mu_i := \mu_0 + \alpha_2 + \beta_1 x_i \qquad (5)$$
and 𝑛 corresponds to the number of EPI volumes acquired per experimental run. In their traditional formulation, PPI designs model the main effect of the discrete factor using 1’s and -1’s for task blocks and 0’s for fixation blocks and do not discriminate between the independent variable values for each of the two conditions. In their GLM formulation for FMRI, additive ANCOVA designs then take the following exemplary form
$$y := \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{n_{TR}} \end{pmatrix} \in \mathbb{R}^n, \quad X = \begin{pmatrix} 1 & 1 & x_1 \\ \vdots & \vdots & \vdots \\ 1 & 1 & x_i \\ 1 & -1 & x_{i'} \\ \vdots & \vdots & \vdots \\ 1 & -1 & x_{n_{TR}} \end{pmatrix} \in \mathbb{R}^{n \times 3}, \quad \beta = \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \beta_1 \end{pmatrix} \in \mathbb{R}^3, \quad \sigma^2 > 0 \qquad (6)$$
So far, we considered an additive ANCOVA model only. Crucially, the PPI design corresponds to an
ANCOVA design with interaction. On the level of the GLM, the design matrix column corresponding to the
interaction parameter is formed by a nonlinear combination of the discrete factor and continuous factor
time-series. The usual way to form the interaction design matrix column is by the point-wise (Hadamard)
multiplication of the corresponding main effect columns. The same approach was initially chosen for PPI,
resulting in the following structural and GLM formulation:
$$p(y_i) = N(y_i; \mu_i, \sigma^2) \Leftrightarrow y_i = \mu_i + \varepsilon_i, \quad p(\varepsilon_i) = N(\varepsilon_i; 0, \sigma^2), \quad \sigma^2 > 0 \quad (i = 1, \dots, n) \qquad (7)$$
where
𝜇𝑖 ≔ 𝜇0 + 𝛼2 + 𝛽1𝑥𝑖 + 𝛽2(𝛼2𝛽1𝑥𝑖) (8)
Correspondingly, we have the following GLM formulation
$$y := \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{n_{TR}} \end{pmatrix} \in \mathbb{R}^n, \quad X = \begin{pmatrix} 1 & 1 & x_1 & x_1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & x_i & x_i \\ 1 & -1 & x_{i'} & -x_{i'} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & -1 & x_{n_{TR}} & -x_{n_{TR}} \end{pmatrix} \in \mathbb{R}^{n \times 4}, \quad \beta = \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \beta_1 \\ \beta_2 \end{pmatrix} \in \mathbb{R}^4, \quad \sigma^2 > 0 \qquad (9)$$
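The formation of the four design matrix columns, including the point-wise (Hadamard) product for the interaction column, can be sketched as follows (a Python/NumPy sketch; the block structure and the toy seed signal are illustrative assumptions):

```python
import numpy as np

n = 200
rng = np.random.default_rng(2)
psy = np.where(np.arange(n) % 40 < 20, 1.0, -1.0)  # psychological factor (+1/-1 blocks)
phys = np.cumsum(rng.normal(0, 1, n)) * 0.05       # seed-voxel time course (toy signal)
ppi = psy * phys                                   # point-wise (Hadamard) product
X = np.column_stack([np.ones(n), psy, phys, ppi])  # design matrix as in (9)
assert X.shape == (n, 4)
```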
Figure 1 below shows a design matrix 𝑋 ∈ ℝ𝑛×4 of the kind introduced in equation (9). The first column in the Figure corresponds to the indicator variable for the offset, the second column models a “psychological” main effect (e.g. attention vs. no attention), the third column a “physiological” main effect corresponding to an MR signal time series, and the fourth column models the psychophysiological interaction. Based on the design matrix shown in Figure 1 and a beta parameter setting of 𝛽 ≔ (1, 0.1, 0.1, 1)𝑇, Figure 2 shows a simulated PPI analysis. Here, the seed voxel time-course was generated based on an Ornstein-Uhlenbeck process. The upper panel of Figure 2 depicts the simulated seed voxel time-course and a sampled target voxel time-course based on the corresponding PPI design. The lower
panels depict the correlation between the MR signal in the seed and target voxels as a function of the
psychological factor. Note that the correlation changes with the psychological factor, corresponding to an
interaction.
Figure 1. A design matrix for a psychophysiological interaction analysis.
Figure 2 A PPI-GLM-FMRI simulation. The upper panel depicts a simulated seed and a resulting target voxel MR time-course based on the design matrix shown in Figure 1 and a beta parameter setting of 𝛽 ≔ (1, 0.1, 0.1, 1)𝑇. The lowermost panel depicts a correlation analysis between the seed and target voxel time-courses of the middle panel separately for blocks of the psychological main effect.
Why are PPI designs for mass-univariate GLM-FMRI analyses interesting (and in fact, somewhat surprisingly, increasingly being used)? In the simplest terms, the detection of a statistically significant value for the interaction parameter 𝛽2 ∈ ℝ at a given voxel can be interpreted as a difference in the slope of the regression between its time-series and the seed-voxel time-series under a modulation of the “psychological context”, i.e. under Task A vs. Task B. If one interprets the correlation between two voxel time-series as an indicator of their coupling or “connectivity”, a significant interaction parameter indicates that regions of the brain modulate their coupling depending on the current task being carried out. This speaks to a current interest in the “dynamics of brain function”. Note, however, that a significant interaction parameter may be interpreted in two ways: a) as evidence that the contribution of one area to another is altered by the experimental context, or b) as evidence that the response of an area to an experimental context is altered due to activity variation in the seed region.
Study Questions
1. Discuss, why for a design matrix 𝑋 ∈ ℝ𝑛×𝑝 and contrast vector 𝑐 ∈ ℝ𝑝 the function
𝜉 ∶ (𝑐, 𝑋) ↦ 𝜉(𝑐, 𝑋) ≔ (𝑐𝑇(𝑋𝑇𝑋)−1𝑐)−1
has some merit as a measure of the “efficiency” of the design encoded in 𝑋 for the contrast of interest 𝑐.
2. Explain the rationale for performing a FIR analysis of FMRI data.
3. Discuss which standard GLM model PPI GLM-FMRI analyses correspond to and which specific meaning the regressors and
parameters take on in the PPI context.
Study Question Answers
1. The variability of the GLM beta estimator is proportional to (𝑋𝑇𝑋)−1, where the variance parameter 𝜎2 corresponds to the proportionality constant. Assuming constant 𝜎2, and noting that the pre- and post-multiplication with 𝑐 extracts the relevant components of (𝑋𝑇𝑋)−1 with respect to the beta estimator contrast 𝑐𝑇𝛽̂, the function above thus increases with decreasing variability of the beta estimator due to the −1 exponent. In other words, if the variability in the contrasted beta estimator value is small, the function 𝜉 yields a large value and is hence an intuitive measure of “design efficiency”.
2. A GLM-FIR model allows for the estimation of the hemodynamic response to different stimulus conditions without the need to partition the data and perform event-related averaging.
3. PPI GLM-FMRI analyses correspond to an ANCOVA GLM with interaction. The data to be modelled correspond to the MR signal time-course of a specific voxel. The same GLM design is chosen for each voxel, referred to as a mass-univariate approach. The discrete experimental factor in the ANCOVA GLM for PPI corresponds to a “psychological state”. For example, it may model the difference in MR signal observed under Task A with respect to Task B. An example for Task A could be “attend to the colour of a visual stimulus”, an example for Task B could be “attend to the shape of a visual stimulus”. The parametric/continuous regressor corresponds to the MR signal time-course of a fixed and pre-determined “seed region” or “seed voxel”. Finally, the interaction regressor and its associated parameter can be interpreted a) as capturing how the contribution of one area to another is altered by the experimental context, or b) as capturing that the response of an area to an experimental context is altered due to activity variation in the seed region.
First-level covariance matrices
(1) Serial correlations in FMRI
We have seen in a previous section that classical inference rests on the assumption of spherical error covariance matrices and that the violation of this assumption results in an increase of the false-positive risk. Likewise, we have seen that the REML approach can be used to estimate the parameters of covariance matrices, specifically if the error covariance matrix decomposes into a linear combination of known covariance matrix basis functions
$$V = \sum_{i=1}^{q} \lambda_i Q_i \qquad (1)$$

where 𝑄𝑖 ∈ ℝ𝑛×𝑛 (𝑖 = 1, … , 𝑞) denote known covariance basis functions.
In the analysis of FMRI data, non-spherical error covariance matrices are estimated routinely based
on the notion of “error serial correlations”. The fundamental idea is that FMRI time-series comprise
correlations of values adjacent in time that are not captured by the deterministic aspect of the GLM, i.e. the
expectation parameter 𝑋𝛽. Assumed physiological origins for such serial correlations are for example the
breathing cycle or the heartbeat, which are assumed to induce fluctuations in the local deoxyhemoglobin
content that are unrelated to experimental stimulation. Note that if these fluctuations were monitored, they
could enter the GLM analysis in its deterministic part, i.e. could be used to derive additional regressors.
Common covariance basis functions used in the analysis of FMRI data for 𝑞 ≔ 2 and 𝜏 ≔ 1 are

$$Q_1 := I_n \qquad (2)$$

$$(Q_2)_{ij} := \begin{cases} \exp\left(-\frac{1}{\tau}|i - j|\right), & i \neq j \\ 0, & i = j \end{cases} \quad 1 \leq i, j \leq n \qquad (3)$$
Figure 2 depicts the resulting error covariance matrices for different choices of 𝜆1, 𝜆2 and 𝜏.
Figure 2. Covariance matrices resulting from the combination of the covariance basis functions 𝑄1 and 𝑄2.
The first covariance basis function in (2) models independently and identically distributed errors. The second covariance basis function in (3) models short-range correlations based on a first-order autoregressive model of the error terms. Note that the extent of these short-term correlations is governed by a time constant 𝜏, which is usually assumed to be fixed, such that it does not form an additional parameter of the GLM framework.
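The two covariance basis functions can be constructed in a few lines (a Python/NumPy sketch; the dimension and the weights 𝜆1, 𝜆2 are illustrative):

```python
import numpy as np

def covariance_basis(n, tau=1.0):
    """Covariance basis functions Q1 (iid) and Q2 (off-diagonal exponential decay)."""
    Q1 = np.eye(n)
    i, j = np.indices((n, n))
    Q2 = np.exp(-np.abs(i - j) / tau)
    np.fill_diagonal(Q2, 0.0)        # zero diagonal, as in equation (3)
    return Q1, Q2

Q1, Q2 = covariance_basis(5, tau=1.0)
V = 1.0 * Q1 + 0.5 * Q2              # V = lambda1*Q1 + lambda2*Q2, as in (1)
assert np.allclose(V, V.T)           # a covariance candidate must be symmetric
```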
In Figure 3 we illustrate the effect of error serial correlations on observed time-series. In three simulations, the identical expectation parameter 𝑋𝛽 was used and the error covariance was created based on (1), (2) and (3). The first two panels show the sampled time-series data for independent error terms (upper panel) and the residual error (lower panel). In the same layout, the third and fourth panels depict the case of short-range error correlations with a fast time constant, and the fifth and the sixth panel depict the case of short-range error correlations with a slower decay. As the range of the error serial correlations increases, the residual error assumes a “smoother” profile.
Figure 3. The effect of error serial correlations on time-series realizations.
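The simulation logic behind such illustrations can be sketched as follows (Python/NumPy, with illustrative variance weights and time constants; this is not the exact simulation used for the figure):

```python
import numpy as np

n = 100
rng = np.random.default_rng(3)
row, col = np.indices((n, n))

def sample_errors(lam1, lam2, tau):
    """Draw one error time-series from N(0, lam1*Q1 + lam2*Q2)."""
    Q2 = np.exp(-np.abs(row - col) / tau)
    np.fill_diagonal(Q2, 0.0)
    V = lam1 * np.eye(n) + lam2 * Q2
    return rng.multivariate_normal(np.zeros(n), V)

e_iid = sample_errors(1.0, 0.0, 1.0)   # independent errors
e_cor = sample_errors(1.0, 0.9, 4.0)   # serially correlated errors, slower decay
# the correlated draw varies less from scan to scan, i.e. looks "smoother"
```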
Second-level models
(1) The “summary-statistics” approach
Most GLM-FMRI studies comprise a group of participants. This leads to at least two sources of variation in observed condition-specific MR signals: within-participant variation and between-participant
variation. Commonly, analyses that ignore between-participant variation (for example, because they refer to
data of a single participant only) are referred to as “fixed effects” or “first-level” analyses, and account for
scan-to-scan variance. Analyses that explicitly take into account between-participant (or between-session)
variation and aim to make inferences about the population that the participants were sampled from, are
referred to as “random effects” or “second-level” analyses. Usually, these models comprise both fixed and
random effects and are thus better referred to as “mixed effects” models.
A typical FMRI group analysis in the framework of the GLM uses the so-called “summary-statistics” approach, which we sketch in the following. Note that before second-level analyses are performed, the FMRI data are typically spatially warped into a common group brain space such that voxel coordinates correspond to (approximately) the same brain regions across participants. The summary-statistics approach to group inference then proceeds as follows: First, the data from individual participants are analysed separately, usually using a common GLM design that is motivated from the study’s experimental design. In this manner, for each voxel, beta parameter estimates for each of the model’s regressors are obtained for each participant. To isolate specific participant-specific beta parameter estimates or to combine them in a linear fashion, an effect of interest is specified for each subject using a contrast vector. This generates a “contrast image” comprising the contrast of beta parameter estimates across voxels for each participant. Upon contrast formation, there exists a single scalar number for each voxel and each participant which reflects the participant-specific effect for this contrast and can be viewed as a scalar, participant-specific outcome measure. Finally, the contrast images are subjected to voxel-by-voxel one-sample t-tests in order to infer which voxels display a significant average effect size when compared to the between-participant variation.
In the following we consider this approach from the perspective of a hierarchical GLM and detail what kind of assumptions this approach entails about the covariance structure of the underlying mixed-effects model. To this end, we first formulate a first-level “all-in-one” GLM and then introduce a second-level random effects model. Finally, we consider a set of assumptions in this framework that renders it equivalent to the summary-statistics approach sketched above. In all of the below, we assume that the respective variance parameters are known and focus on the estimation of the beta parameters.
(2) A hierarchical GLM
For 𝑘 = 1,… , 𝐾 subjects, let the 𝑘th first-level individual-subject GLM be denoted by
$$y_k = X_k \beta_k + \varepsilon_k \qquad (1)$$

where $y_k \in \mathbb{R}^{n_k}$, $X_k \in \mathbb{R}^{n_k \times p}$, $\beta_k \in \mathbb{R}^p$, and

$$p(\varepsilon_k) = N(\varepsilon_k; 0, \sigma_k^2 V_{n_k}) \qquad (2)$$

where $0 \in \mathbb{R}^{n_k}$, $\sigma_k^2 > 0$, $V_{n_k} \in \mathbb{R}^{n_k \times n_k}$. Then the $k$th first-level single-subject effects $\beta_k \in \mathbb{R}^p$ can be estimated using the generalized least-squares estimator

$$\hat{\beta}_k = \left(X_k^T V_{n_k}^{-1} X_k\right)^{-1} X_k^T V_{n_k}^{-1} y_k \qquad (3)$$
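The generalized least-squares estimator translates directly into code (a Python/NumPy sketch with simulated data; for illustration we use a spherical 𝑉, in which case GLS reduces to OLS):

```python
import numpy as np

def gls(X, y, V):
    """Generalized least-squares estimate: (X'V^-1 X)^-1 X'V^-1 y."""
    Vi = np.linalg.inv(V)
    return np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)

rng = np.random.default_rng(4)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -1.0])
V = np.eye(n)                          # spherical case for illustration
y = X @ beta + rng.normal(0, 0.1, n)
beta_hat = gls(X, y, V)
# with V = I the GLS estimate coincides with OLS
ols = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(beta_hat, ols)
```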
By concatenation, the 𝐾 models

$$y_k = X_k \beta_k + \varepsilon_k, \quad k = 1, \dots, K \qquad (4)$$

specified above may be formulated as a large, subject-separable, first-level model

$$y_s = X_s \beta_s + \varepsilon_s \qquad (5)$$

with $n = \sum_{k=1}^{K} n_k$ as follows:

$$y_s := \begin{pmatrix} y_1 \\ \vdots \\ y_K \end{pmatrix} \in \mathbb{R}^n, \quad X_s := \begin{pmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & X_K \end{pmatrix} \in \mathbb{R}^{n \times Kp}, \quad \beta_s := \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_K \end{pmatrix} \in \mathbb{R}^{Kp} \qquad (6)$$

where the zeros in $X_s$ denote appropriately sized matrices with all zero entries and

$$\varepsilon_s := \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_K \end{pmatrix} \in \mathbb{R}^n, \quad p(\varepsilon_s) = N(\varepsilon_s; 0, V_s), \quad 0 \in \mathbb{R}^n, \quad V_s := \begin{pmatrix} \sigma_1^2 V_1 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 V_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_K^2 V_K \end{pmatrix} \in \mathbb{R}^{n \times n} \qquad (7)$$
The “𝑠” subscripts on the variables involved in this “all-in-one” model are mnemonic for “subjects” or “single-units” and are meant to remind the reader that these variables correspond to concatenated variables of subjects or single-units. See Figure 1 for an example of an “all-in-one” GLM. Note that due to the block-diagonal structure of $X_s$ and $V_s$, the generalized least-squares estimator for this all-in-one model corresponds to

$$\hat{\beta}_s := \left(X_s^T V_s^{-1} X_s\right)^{-1} X_s^T V_s^{-1} y_s \in \mathbb{R}^{Kp} \qquad (8)$$
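The block-diagonal concatenation of the subject-specific design matrices can be sketched with a small helper function (Python/NumPy; the subject design matrices here are toy placeholders):

```python
import numpy as np

def block_diag(*mats):
    """Stack matrices block-diagonally, with zeros elsewhere."""
    rows = sum(m.shape[0] for m in mats)
    cols = sum(m.shape[1] for m in mats)
    out = np.zeros((rows, cols))
    r = c = 0
    for m in mats:
        out[r:r + m.shape[0], c:c + m.shape[1]] = m
        r += m.shape[0]
        c += m.shape[1]
    return out

# three single-subject designs concatenated into an all-in-one model
X1, X2, X3 = np.ones((4, 2)), np.ones((4, 2)), np.ones((4, 2))
Xs = block_diag(X1, X2, X3)
assert Xs.shape == (12, 6)    # n = sum(n_k) rows, K*p columns
```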
In the “all-in-one” GLM above we have introduced the concatenated participant effects beta parameter 𝛽𝑠 ∈ ℝ𝐾𝑝. On a second level, we now model this parameter vector as the result of a linear combination of population parameters 𝛽𝑝 under additive Gaussian noise. We thus relate subject-specific effects or parameters in 𝛽𝑠 ∈ ℝ𝐾𝑝 to population parameters 𝛽𝑝 in the form

$$\beta_s = X_p \beta_p + \varepsilon_p \qquad (9)$$

where $X_p \in \mathbb{R}^{Kp \times q}$ denotes the second-level design matrix, $\beta_p \in \mathbb{R}^q$ a set of unknown population effects, and $\varepsilon_p \in \mathbb{R}^{Kp}$ denotes the population error term with distribution

$$p(\varepsilon_p) = N(\varepsilon_p; 0, \sigma_p^2 V_p), \quad 0 \in \mathbb{R}^{Kp}, \quad \sigma_p^2 > 0, \quad V_p \in \mathbb{R}^{Kp \times Kp} \qquad (10)$$
The “𝑝” subscripts on the variables involved in the second-level model are mnemonic for “population” and are meant to remind the reader that these variables model structural aspects of the population from which the first-level units are derived. See Figure 2 for an example.
In summary, we have formulated the following hierarchical linear Gaussian model
$$\beta_s = X_p \beta_p + \varepsilon_p \qquad (11a)$$

$$y_s = X_s \beta_s + \varepsilon_s \qquad (11b)$$

where on the second level (11a) $\beta_s \in \mathbb{R}^{Kp}$, $X_p \in \mathbb{R}^{Kp \times q}$, $\beta_p \in \mathbb{R}^q$, $p(\varepsilon_p) = N(\varepsilon_p; 0, \sigma_p^2 V_p)$, $0 \in \mathbb{R}^{Kp}$, $\sigma_p^2 > 0$, $V_p \in \mathbb{R}^{Kp \times Kp}$ and, with $n = \sum_{k=1}^{K} n_k$, on the first level (11b) $y_s \in \mathbb{R}^n$, $X_s \in \mathbb{R}^{n \times Kp}$, $p(\varepsilon_s) = N(\varepsilon_s; 0, V_s)$, $0 \in \mathbb{R}^n$, $V_s \in \mathbb{R}^{n \times n}$. Note that both $y_s$ and $\beta_s$ are random variables. They are, however, distinct: $y_s$ is an observable random variable which models the concatenated FMRI data time series over participants, while $\beta_s$ is an unobservable random variable for which only an estimate can be obtained.
Figure 1. An all-in-one GLM as conceptualization of a first-level FMRI model. The upper right panel depicts an exemplary single-session/single-subject design matrix for 𝑛𝑘 = 180 data points (scans) comprising three regressors (one constant regressor modelling the MR signal offset, and two event-related condition regressors). The upper left panel depicts a non-spherical error covariance matrix for the same data set. The lower panel depicts the block-wise concatenation of the design and covariance matrices into an “all-in-one” GLM for the concatenated data across 𝐾 = 3 participants.
(3) An equivalent beta parameter estimates model
In the summary statistics approach outline above, we first obtain subject-specific beta parameter
estimate contrasts 𝑐𝑇�̂�𝑘 (𝑘 = 1,… , 𝐾) and then model these at the second-level using a one-sample t-test
GLM. We next consider the beta parameter estimates model implied by the hierarchical GLM formulated in
the previous section and show, how we can estimate the population parameter vector 𝛽𝑝 ∈ ℝ𝑞 at the
second level based on the generalized least-squares estimate �̂�𝑠 ∈ ℝ𝐾𝑝 of the first level. To this end, we
Figure 2. An exemplary second-level design for the all-in-one model shown in the lower panels of Figure 1. Note that there are 3 ⋅ 3 = 9 regressors at the first level, corresponding to an offset regressor and two condition-specific regressors per subject. According to the second-level framework discussed in the main text, these are assumed to be the result of a set of 𝑞 basis parameters at the second level. The design matrix 𝑋𝑝 shown in the left panel implies that there are 3 population beta parameters (one offset, two condition-specific), which are mapped onto the three subject-specific parameters in the partitioning of 𝛽𝑠. The covariance matrix shown in the right panel in addition implies that there is some degree of within-participant covariation in the beta parameters, but no between-participant correlation.
have the following result: the hierarchical GLM introduced in the previous section implies a GLM for the first-level generalized least-squares estimator $\hat{\beta}_s \in \mathbb{R}^{Kp}$ of the following form

$$\hat{\beta}_s = X_p \beta_p + \varepsilon_{\hat{\beta}_s} \qquad (1)$$

where $\hat{\beta}_s \in \mathbb{R}^{Kp}$, $X_p \in \mathbb{R}^{Kp \times q}$, $\beta_p \in \mathbb{R}^q$ as previously, and the error term $\varepsilon_{\hat{\beta}_s} \in \mathbb{R}^{Kp}$ is distributed according to

$$p(\varepsilon_{\hat{\beta}_s}) = N(\varepsilon_{\hat{\beta}_s}; 0, V_{\hat{\beta}_s}), \text{ where } 0 \in \mathbb{R}^{Kp} \text{ and } V_{\hat{\beta}_s} = \sigma_p^2 V_p + \left(X_s^T V_s^{-1} X_s\right)^{-1} \in \mathbb{R}^{Kp \times Kp} \qquad (2)$$

If we know the values of $\sigma_p^2$, $V_p$ and $V_s$, and thus $V_{\hat{\beta}_s}$, we may use the generalized least-squares estimator for $\beta_p$ in (1) to estimate the population effects from the single-subject generalized least-squares estimates:

$$\hat{\beta}_p^{GLS} := \left(X_p^T V_{\hat{\beta}_s}^{-1} X_p\right)^{-1} X_p^T V_{\hat{\beta}_s}^{-1} \hat{\beta}_s \qquad (3)$$
Proof of (1)

From the second-level model for the subject-specific effects $\beta_s \in \mathbb{R}^{Kp}$ we can derive a model for the subject-specific beta estimates $\hat{\beta}_s$. To this end, we add $\hat{\beta}_s$ to both sides of (11a) and rearrange to obtain

$$\beta_s = X_p \beta_p + \varepsilon_p \Leftrightarrow \beta_s + \hat{\beta}_s = X_p \beta_p + \varepsilon_p + \hat{\beta}_s \Leftrightarrow \hat{\beta}_s = X_p \beta_p + \varepsilon_p + (\hat{\beta}_s - \beta_s) \qquad (1.1)$$

We next consider the distribution of the term

$$\tilde{\varepsilon} := \varepsilon_p + (\hat{\beta}_s - \beta_s) \in \mathbb{R}^{Kp} \qquad (1.2)$$

in detail to write the right-hand side of (1.1) in standard GLM form. For fixed $\beta_s$, both $\varepsilon_p$ and $\hat{\beta}_s$ are Gaussian random variables, thus their sum is a Gaussian random variable as well, and we derive expressions for its expectation and covariance parameters. The expectation of $\tilde{\varepsilon}$, due to the unbiasedness of the generalized least-squares estimator, is given by

$$E(\tilde{\varepsilon}) = E\left(\varepsilon_p + (\hat{\beta}_s - \beta_s)\right) = E(\varepsilon_p) + E(\hat{\beta}_s - \beta_s) = 0 + 0 = 0 \in \mathbb{R}^{Kp} \qquad (1.3)$$

Under the additional assumption of $Cov(\varepsilon_p, \hat{\beta}_s) = 0$, the covariance of $\tilde{\varepsilon}$ evaluates to

$$Cov(\tilde{\varepsilon}) = Cov\left(\varepsilon_p + (\hat{\beta}_s - \beta_s)\right) = Cov(\varepsilon_p) + Cov(\hat{\beta}_s) = \sigma_p^2 V_p + \left(X_s^T V_s^{-1} X_s\right)^{-1} \in \mathbb{R}^{Kp \times Kp} \qquad (1.4)$$

□
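The second-level generalized least-squares estimator can be sketched numerically as follows (Python/NumPy; the second-level design, the variance values, and the stand-in for the first-level precision term are illustrative assumptions):

```python
import numpy as np

K, p, q = 3, 2, 2
rng = np.random.default_rng(5)
Xp = np.vstack([np.eye(p)] * K)         # second-level design matrix (Kp x q), here q = p
Vp = np.eye(K * p)                      # assumed spherical second-level covariance
XtVX_inv = 0.1 * np.eye(K * p)          # stand-in for (Xs' Vs^-1 Xs)^-1
V_bs = 0.5 * Vp + XtVX_inv              # covariance of the estimates model, sigma_p^2 = 0.5
beta_s_hat = rng.normal(size=K * p)     # simulated first-level GLS estimates
Vi = np.linalg.inv(V_bs)
beta_p_hat = np.linalg.solve(Xp.T @ Vi @ Xp, Xp.T @ Vi @ beta_s_hat)
# with spherical covariances the GLS estimate is the across-subject mean
assert np.allclose(beta_p_hat, beta_s_hat.reshape(K, p).mean(axis=0))
```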
In the summary-statistics approach one is primarily interested in a weighted linear combination of parameter estimates on the first level of the form $c^T \hat{\beta}_k$, $k = 1, \dots, K$, where $c \in \mathbb{R}^p$. In this case the following modifications of the second-level model for $\beta_s$ and $\hat{\beta}_s$ ensue:

$\hat{\beta}_s$ takes on the form

$$\hat{\beta}_c = \begin{pmatrix} c^T \hat{\beta}_1 \\ \vdots \\ c^T \hat{\beta}_K \end{pmatrix} \in \mathbb{R}^K \qquad (1)$$
The covariance of $\hat{\beta}_c$ becomes a diagonal matrix (rather than a block-diagonal matrix)

$$Cov(\hat{\beta}_c) := \begin{pmatrix} \sigma_1^2 c^T (X_1^T X_1)^{-1} c & 0 & \cdots & 0 \\ 0 & \sigma_2^2 c^T (X_2^T X_2)^{-1} c & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_K^2 c^T (X_K^T X_K)^{-1} c \end{pmatrix} \in \mathbb{R}^{K \times K} \qquad (2)$$

Under the further assumption of homogeneous, i.e. equal, variance over subjects, the above simplifies to

$$Cov(\hat{\beta}_c) := \sigma_s^2 I_K \in \mathbb{R}^{K \times K} \qquad (3)$$
and under the assumption of a spherical covariance matrix on the second level, we have

$$\tilde{V}_p = \sigma_s^2 I_K + \sigma_p^2 I_K = (\sigma_s^2 + \sigma_p^2) I_K = \sigma_{sp}^2 I_K \qquad (4)$$

In this case, we may use the ordinary least-squares estimator to estimate the population effects

$$\hat{\beta}_p^{OLS} := \left(X_p^T X_p\right)^{-1} X_p^T \hat{\beta}_c \qquad (5)$$

based on the first-level subject-specific beta parameter estimate contrasts $c^T \hat{\beta}_1, \dots, c^T \hat{\beta}_K$.
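Under these simplifying assumptions, the second-level OLS estimator for a one-sample design is simply the across-subject mean of the contrast values, as the following Python/NumPy sketch confirms (the contrast values are simulated):

```python
import numpy as np

rng = np.random.default_rng(6)
K = 12
contrasts = rng.normal(0.5, 1.0, K)     # c' beta_hat_k per subject (toy values)
Xp = np.ones((K, 1))                    # one-sample second-level design matrix
beta_p = np.linalg.solve(Xp.T @ Xp, Xp.T @ contrasts)  # OLS population estimate
# OLS under the one-sample design is just the across-subject mean
assert np.allclose(beta_p, contrasts.mean())
```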
Study Questions

1. Verbally, discuss the notions of “fixed” and “random/mixed” effects analyses in GLM-FMRI.
2. Verbally, describe the one-sample t-test approach to second-level inference in GLM-FMRI.
3. Write down the all-in-one formulation of a first-level group GLM-FMRI model.
4. Write down the hierarchical model underlying two-level group inference in GLM-FMRI.

Study Question Answers

1. Fixed effects analyses usually refer to analyses that discount any between-subject variation, and thus are usually applicable to the data of a single participant (possibly over multiple FMRI sessions). Random/mixed-effects analyses on the other hand take into account between-subject variation for making statistical inferences that allow one to generalize to the population that the participants were sampled from.
2. The one-sample t-test approach to second-level inference in GLM-FMRI corresponds to the three step procedure of (1) fitting an individual voxel-wise GLM to the FMRI data of each participant, (2) defining an effect of interest for each participant using a contrast vector, resulting in a participant specific “contrast image” and (3) evaluating the contrast images over participants using a one-sample t-test of the null hypothesis that the population mean of the contrast of interest is zero.
3. For $k = 1, \dots, K$ first-level GLMs $y_k = X_k \beta_k + \varepsilon_k$ with $y_k \in \mathbb{R}^{n_k}$, $X_k \in \mathbb{R}^{n_k \times p}$, $\beta_k \in \mathbb{R}^p$, and $p(\varepsilon_k) = N(\varepsilon_k; 0, \sigma_k^2 V_k)$, the all-in-one formulation of the first-level model is given with $n = \sum_{k=1}^{K} n_k$ by $y_s = X_s \beta_s + \varepsilon_s$, where

$$y_s := \begin{pmatrix} y_1 \\ \vdots \\ y_K \end{pmatrix} \in \mathbb{R}^n, \quad X_s := \begin{pmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & X_K \end{pmatrix} \in \mathbb{R}^{n \times Kp}, \quad \beta_s := \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_K \end{pmatrix} \in \mathbb{R}^{Kp}$$

and

$$\varepsilon_s := \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_K \end{pmatrix} \in \mathbb{R}^n, \quad p(\varepsilon_s) = N(\varepsilon_s; 0, V_s), \quad 0 \in \mathbb{R}^n, \quad V_s := \begin{pmatrix} \sigma_1^2 V_1 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 V_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_K^2 V_K \end{pmatrix} \in \mathbb{R}^{n \times n}$$
4. The hierarchical model that is used to relate subject-specific effects or parameters in $\beta_s \in \mathbb{R}^{Kp}$ to population parameters $\beta_p$ is given by $\beta_s = X_p \beta_p + \varepsilon_p$, where $X_p \in \mathbb{R}^{Kp \times q}$ denotes the second-level design matrix, $\beta_p \in \mathbb{R}^q$ a set of unknown population effects, and $\varepsilon_p \in \mathbb{R}^{Kp}$ denotes the population error term with distribution $p(\varepsilon_p) = N(\varepsilon_p; 0, \sigma_p^2 V_p)$, $0 \in \mathbb{R}^{Kp}$, $\sigma_p^2 > 0$, $V_p \in \mathbb{R}^{Kp \times Kp}$.
The multiple testing problem
(1) An introduction to the multiple testing problem in GLM-FMRI
In this and the next Section we are concerned with the problem of classical inference in mass-
univariate (voxel-by-voxel) applications of the GLM. Intuitively, the problem may be framed as follows. Upon
classical point estimation of the beta and variance parameters of a GLM, these can be combined in order to
produce a test statistic, such as a 𝑇 or an 𝐹 value. For example, based on a contrast vector $c \in \mathbb{R}^p$, the 𝑇 statistic for a test of the null hypothesis $H_0: c^T\beta = 0$ is given by

$$T := \frac{c^T \hat{\beta}}{\sqrt{\hat{\sigma}^2 c^T (X^T X)^{-1} c}} \qquad (1)$$
An observed value 𝑇∗ of 𝑇 at a given voxel may then be compared to the distribution of the random variable
𝑇 under the null hypothesis, which is given by the 𝑡-distribution. If the probability of the observed value 𝑇∗ is
low under the null hypothesis, for example 𝑝(𝑇 > 𝑇∗) < 0.05, this is taken as evidence against the null
hypothesis, which may then be rejected. In terms of GLM-FMRI for a given voxel, if the test statistic exceeds
a given threshold, and thus its associated probability under the null hypothesis falls below a given
probability, the corresponding voxel may be labelled “activated” for the experimental effect expressed by
𝑐𝑇𝛽. Likewise, if the test statistic falls below a given threshold, and thus its associated probability under the
null hypothesis exceeds a given probability, the corresponding voxel may be labelled “not activated”. It is
this categorical decision of a voxel being “activated” or “not-activated” based on a given statistical threshold
that classical inference for GLM-FMRI is concerned with. Note that statistical thresholds in terms of “critical
values” of a test statistic are, based on the properties of the null distribution, always associated with
probabilities for this value or a more extreme value to occur, i.e. their corresponding significance levels.
More formally, we may restate the above as follows: when testing a single null hypothesis 𝐻0, the
probability of a Type I error, i.e. rejecting the null hypothesis when it is true, is usually controlled at some
designated “significance level” 𝛼 ∈ [0,1]. This can be achieved by choosing a critical value 𝑐𝛼, such that
𝑝(𝐻0=0)(|𝑇| > 𝑐𝛼) ≤ 𝛼 (2)
We here consider “𝑇” in a generic sense, i.e. 𝑇 may refer to either the 𝑇- or the 𝐹-, or some other suitably chosen test statistic. Note that we indicate by the subscript (𝐻0 = 0) that the probability statement above
refers to the case that the null hypothesis is true and refrain from conditioning on 𝐻0 = 0 as often seen,
because 𝐻0 = 0 is not a random event in the context considered here. Also note that we consider the
absolute value |𝑇| of 𝑇 to allow for the evaluation of two-sided tests. Based on the above, 𝐻0 is rejected, if
|𝑇| > 𝑐𝛼. Notably, in the neuroimaging literature 𝑐𝛼 is often referred to as a “threshold” and denoted by
“𝑢”.
We now consider testing not only a single, but multiple null hypotheses, which we refer to as “multiple testing”. Above we described that if one defines a significance level of 𝛼 (e.g. 𝛼 = 0.05), one will, if the null hypothesis is true, declare voxels “activated” with a Type I error probability of 0.05. In other words, the probability to declare a voxel “not activated”, given that for this voxel the null hypothesis is true, corresponds to 1 − 0.05 = 0.95. Imagine now a set of 5 voxels, for each of which the null hypothesis is true, and which, in some meaningful sense, can be regarded as stochastically independent entities. In this case, based on a significance level of 𝛼 = 0.05, the probability to declare the first voxel “not activated” is 0.95.
Likewise, the probability to declare the second voxel “not activated” is 0.95. However, the probability to
declare both voxels “not activated”, assuming independence, is only 0.95 ⋅ 0.95 = 0.9025. The probability
of making at least one Type I error, i.e. declaring a voxel “activated”, although its null hypothesis is true, thus
increased from 1 − 0.95 = 0.05 to 1 − 0.9025 = 0.0975 over the two voxels. Consider now all five voxels.
The probability to declare all voxels “not activated”, assuming for all that the null hypothesis is true, and
that they are independent entities, is given by 0.95 ⋅ 0.95 ⋅ 0.95 ⋅ 0.95 ⋅ 0.95 = 0.95⁵ ≈ 0.7738. The probability
of making at least one Type I error over the set of five voxels thus corresponds to 1 − 0.7738 = 0.2262.
Given that a typical FMRI data set may comprise around 40,000 voxels, the probability of at least one
Type I error, based on an individual voxel significance-level of 𝛼 = 0.05, approaches certainty (see Figure
1).
Figure 1. The probability of at least one Type 1 error as function of the number of tests for two different significance levels 𝛼.
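The curves of Figure 1 follow directly from the independence calculation just described: for 𝑚 independent tests, each at level 𝛼, the probability of at least one Type I error is 1 − (1 − 𝛼)^𝑚. A minimal illustrative sketch in Python (not part of the text's own examples):

```python
# Probability of at least one Type I error across m independent tests,
# each performed at significance level alpha
def fwer_independent(m, alpha):
    return 1.0 - (1.0 - alpha) ** m

p2 = fwer_independent(2, 0.05)        # 0.0975, as in the two-voxel example
p5 = fwer_independent(5, 0.05)        # ~0.2262, as in the five-voxel example
p40k = fwer_independent(40000, 0.05)  # practically 1 for a whole-brain analysis
```

Evaluating the function over a grid of 𝑚 values at 𝛼 = 0.05 and 𝛼 = 0.01 reproduces the two curves of Figure 1.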
A different view of the problem is provided by the following simulation. Consider the case of second-
level GLM-FMRI analysis. Assume that from the first level the values of 𝑐𝑇�̂�𝑘𝑣 for voxels 𝑣 = 1, …, 𝑉 and
participants 𝑘 = 1, …, 𝐾, where 𝑉 = 1000 and 𝐾 = 12, have been taken to the second level, and that a one-
sample T-test of the null hypothesis 𝐻0: 𝑐𝑇𝛽𝑣 = 0 is performed for each voxel. Assume further that, for all 12
participants, the null hypothesis is in fact true for every voxel, and that the between-subject variance is
identical across voxels and given by 𝜎2 = 1. Assuming stochastic independence over voxels, the
variables 𝑐𝑇�̂�𝑘𝑣 in this case correspond to 12000 independent and identically distributed univariate Gaussian variables
𝑥𝑣𝑘 whose probability density functions are given by

𝑝(𝑥𝑣𝑘) = 𝑁(𝑥𝑣𝑘; 0,1) (3)

where 𝑣 = 1, …, 1000 and 𝑘 = 1, …, 12. As numerically evaluated in Figure 2, performing one-sample T-tests
and rejecting the null hypothesis at a significance-level of 𝛼 = 0.05 results in approximately 1000 ⋅ 0.05 =
50 erroneous rejections, with associated T- and p-values distributed according to the histograms shown in
the lower two panels.
Figure 2. Simulated Type I errors numbers.
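The simulation underlying Figure 2 can be reproduced along the following lines (an illustrative Python sketch assuming NumPy; the critical value 2.201 is the two-sided 𝛼 = 0.05 cutoff of the 𝑡-distribution with 𝐾 − 1 = 11 degrees of freedom):

```python
import numpy as np

rng = np.random.default_rng(1)
V, K, alpha = 1000, 12, 0.05          # voxels, participants, significance level
x = rng.standard_normal((V, K))       # all voxel null hypotheses are true

# One-sample T statistics, one per voxel: sqrt(K) * sample mean / sample std
t = np.sqrt(K) * x.mean(axis=1) / x.std(axis=1, ddof=1)

c_alpha = 2.201                       # two-sided critical value of t(11) at alpha = 0.05
n_rejected = int(np.sum(np.abs(t) > c_alpha))
# n_rejected fluctuates around V * alpha = 50 over repeated realizations
```

Histograms of `t` and of the associated 𝑝-values correspond to the lower two panels of Figure 2.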
We have seen above that in the case of testing multiple null hypotheses simultaneously, controlling
the Type I error rate for single tests at conventional thresholds can lead to a large number of false positive
results, if the null hypothesis is in fact true for all tests considered. Together, these issues may informally be
referred to as the “multiple testing problem” (we prefer “multiple testing” over “multiple comparisons”,
because the latter appears to be more inspired by the notion of evaluating many contrasts in a given GLM
scenario, rather than evaluating the same contrast over many repeats, i.e. voxels). To avoid concluding that
there are important effects from high test statistics and associated low p-values in the case of multiple testing for
subsets of tests in large “families” of tests, the significance-level 𝛼 may be lowered, such that it becomes
increasingly “difficult” for a single test statistic to exceed the associated critical value 𝑐𝛼. This is the basic
tenet of all multiple testing procedures. Note, however, that there exists no common definition for the notion
of a family of tests - in neuroimaging, the “family of tests” of interest usually refers to the collection of tests
over voxels, and not, for example, to the assessment of multiple contrasts, or to the number of significance tests a given
researcher has performed throughout her career.
(2) Type I error rates
In the case of a single statistical test, there exists only a single “Type I error rate”, namely the
probability of rejecting the null hypothesis when it is in fact true. As soon as multiple testing scenarios are
considered, it is not immediately clear, what the term “Type I error rate” refers to, and a careful
differentiation is required. In this Section we introduce a number of commonly employed error rates in the
GLM-FMRI literature. Before doing so, we formalize the multiple testing problem.
Consider the problem of simultaneously testing 𝑚 ∈ ℕ null hypotheses 𝐻0(𝑖) (𝑖 ∈ ℕ𝑚). For 𝑖 ∈ ℕ𝑚, let
𝐻0(𝑖) = 0 denote the case that the null hypothesis 𝐻0(𝑖) is true, and 𝐻0(𝑖) = 1 denote the case that the null
hypothesis 𝐻0(𝑖) is not true (false). The number 𝑚 ∈ ℕ of null hypotheses is assumed to be known, while
the sets

𝑀0 ≔ {𝑖 ∈ ℕ𝑚 | 𝐻0(𝑖) = 0} and 𝑀1 ≔ {𝑖 ∈ ℕ𝑚 | 𝐻0(𝑖) = 1} (1)

i.e. the (sub)sets of elements 𝑖 of ℕ𝑚 for which the null hypothesis is true (𝑀0) or not true (𝑀1), are assumed
to be unknown. For simplicity, we define

𝑀 ≔ 𝑀0 ∪ 𝑀1 = ℕ𝑚 (2)

and denote the numbers of elements of 𝑀0 and 𝑀1 (i.e. their cardinalities) by 𝑚0 ≔ |𝑀0| and 𝑚1 ≔ |𝑀1|,
respectively, where 𝑚0, 𝑚1 ∈ ℕ𝑚0 ≔ {0, 1, …, 𝑚}. Note that by choosing the set ℕ𝑚0 for both 𝑚0 and 𝑚1, we allow for the
case that either set is the empty set, i.e. 𝑀0 = ∅ or 𝑀1 = ∅. Further note that 𝑚0 and 𝑚1 correspond to
true, but unknown, non-random quantities.
Based on observed test statistics 𝑇𝑖 (𝑖 ∈ 𝑀) (which are random variables, as they are derived from
the random variable “data”) and a fixed critical value 𝑐𝛼, each null hypothesis 𝐻0(𝑖) can either be not rejected
or rejected. Note that, as in the case of the single test, we here consider the 𝑇𝑖 in a generic sense, i.e. the 𝑇𝑖
refer to either 𝑇- or 𝐹-, or some other suitably chosen test statistics (but all of the same kind). Let 𝑊 ∈ ℕ𝑚0 denote
the number of not rejected null hypotheses and 𝑅 ∈ ℕ𝑚0 the number of rejected null hypotheses, where
𝑊 + 𝑅 = 𝑚. Due to the random nature of the 𝑇𝑖 (𝑖 ∈ 𝑀), 𝑊 and 𝑅 are observed random entities.
From these definitions, four unobservable random variables follow:
(1) the number 𝑈 ∈ ℕ𝑚0 of null hypotheses 𝐻0(𝑖) for which 𝐻0(𝑖) = 0 and 𝐻0(𝑖) is not rejected
(2) the number 𝑉 ∈ ℕ𝑚0 of null hypotheses 𝐻0(𝑖) for which 𝐻0(𝑖) = 0 and 𝐻0(𝑖) is rejected
(3) the number 𝑇 ∈ ℕ𝑚0 of null hypotheses 𝐻0(𝑖) for which 𝐻0(𝑖) = 1 and 𝐻0(𝑖) is not rejected
(4) the number 𝑆 ∈ ℕ𝑚0 of null hypotheses 𝐻0(𝑖) for which 𝐻0(𝑖) = 1 and 𝐻0(𝑖) is rejected
For an overview, consider Table 1.
                NHs not rejected for 𝑐𝛼    NHs rejected for 𝑐𝛼    Total
True NHs        𝑈                          𝑉                      𝑚0
Non-true NHs    𝑇                          𝑆                      𝑚1
Total           𝑊                          𝑅                      𝑚
Table 1. Numbers of importance in the multiple testing scenario.
Note that
𝑈 + 𝑉 = 𝑚0, 𝑇 + 𝑆 = 𝑚1, 𝑈 + 𝑇 = 𝑊, 𝑉 + 𝑆 = 𝑅 and 𝑈 + 𝑉 + 𝑇 + 𝑆 = 𝑚 (3)
Note that the rejection or non-rejection numbers refer to a fixed, but arbitrary, critical value 𝑐𝛼. In the table,
all variables refer to “numbers of”, and NH abbreviates “null hypothesis”. Note again that 𝑚 is a fixed and
known number (the total number of performed tests, e.g. voxels in the neuroimaging context), that 𝑚0 and 𝑚1
are true, but unknown, fixed numbers (in the neuroimaging context referring to the numbers of truly, but
unknowably, non-activated and activated voxels, respectively), that 𝑊 and 𝑅 are observed random variables, whose
outcomes depend on 𝑚0 and 𝑚1, the underlying samples, and the chosen and fixed critical value 𝑐𝛼, and finally, that
𝑈, 𝑉, 𝑇 and 𝑆 are unobserved random variables.
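The quantities of Table 1 can be made concrete by simulation. In the following illustrative Python sketch (assuming NumPy; the effect size and the split into 𝑚0 = 800 true and 𝑚1 = 200 non-true null hypotheses are arbitrary hypothetical choices), the counts 𝑈, 𝑉, 𝑇, 𝑆 are computed from the simulated ground truth, while only 𝑊 and 𝑅 would be observable in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
m0, m1 = 800, 200               # numbers of true and non-true null hypotheses
c_alpha = 1.96                  # two-sided critical value at alpha = 0.05

# Test statistics: standard normal for true nulls, mean-shifted for non-true nulls
stats = np.concatenate([rng.standard_normal(m0), rng.standard_normal(m1) + 3.0])
null_true = np.concatenate([np.ones(m0, bool), np.zeros(m1, bool)])
rejected = np.abs(stats) > c_alpha

U = int(np.sum(null_true & ~rejected))       # true NHs not rejected
V = int(np.sum(null_true & rejected))        # true NHs rejected (Type I errors)
T_cnt = int(np.sum(~null_true & ~rejected))  # non-true NHs not rejected
S = int(np.sum(~null_true & rejected))       # non-true NHs rejected
W, R = U + T_cnt, V + S                      # observed totals of Table 1
```

The identities 𝑈 + 𝑉 = 𝑚0, 𝑇 + 𝑆 = 𝑚1 and 𝑊 + 𝑅 = 𝑚 of equation (3) hold by construction in every realization.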
As noted above, for a single null hypothesis 𝐻0, the probability of a Type I error, i.e. rejecting the null
hypothesis when it is true, is usually controlled at some designated significance-level 𝛼 ∈ [0,1]. This can be
achieved by choosing a critical value 𝑐𝛼 such that 𝑝(𝐻0=0)(|𝑇| > 𝑐𝛼) ≤ 𝛼 and rejecting 𝐻0 if |𝑇| > 𝑐𝛼. In
the multiple testing situation, a variety of generalizations of Type I error rates are possible. We next
introduce a selection of Type I error rates based on the layout of Table 1.
Per-comparison error rate. The per-comparison error rate 𝑃𝐶𝐸𝑅 is defined as the expected value of the
number of Type I errors per total number of hypotheses, i.e.

𝑃𝐶𝐸𝑅 = 𝐸(𝑉)/𝑚 (4)
Per-family error rate. The per-family error rate 𝑃𝐹𝐸𝑅 is not a “rate” but corresponds to the expected
number of Type I errors
𝑃𝐹𝐸𝑅 = 𝐸(𝑉) (5)
Family-wise error rate. The family-wise (sometimes also referred to as “experiment-wise”) error rate 𝐹𝑊𝐸𝑅
is defined as the probability of at least one Type I error over the family of 𝑚 hypotheses
𝐹𝑊𝐸𝑅 = 𝐸(1{𝑉>0}) = 𝑃(𝑉 > 0) (6)
where 1{𝑉>0} denotes the indicator function, i.e. the random variable 1{𝑉>0} that takes on the value 1 if
𝑉 > 0 and the value 0 otherwise².
False-discovery rate. The most straightforward way to define the false-discovery rate 𝐹𝐷𝑅 would be as the
expectation of the ratio between the number 𝑉 of rejections of the null hypothesis when it is true and the
total number 𝑅 of rejections,

𝐹𝐷𝑅 = 𝐸(𝑉/𝑅) (7)

However, as 𝑅 is a random variable, and it is in principle possible that 𝑅 = 0 (no rejections), this ratio may
not be defined. Defining

𝑉/𝑅 ≔ 0 if 𝑅 = 0 (8)

results in the FDR definition of Benjamini & Hochberg (1995):

𝐹𝐷𝑅 = 𝐸((𝑉/𝑅)1{𝑅>0}) = 𝐸(𝑉/𝑅 | 𝑅 > 0) ⋅ 𝑃(𝑅 > 0) (9)
²Unfortunately, appreciation of the equality 𝐸(1{𝑉>0}) = 𝑃(𝑉 > 0) is conditional on familiarity with measure-theoretic
approaches to probability theory. For those readers who fulfil this criterion, we recall that for a probability space (Ω, 𝒜, 𝑃) and a set 𝐴 ∈ 𝒜 (i.e. an event), the indicator function is defined as

1𝐴 : Ω → {0,1}, 𝜔 ↦ 1𝐴(𝜔) ≔ 1 if 𝜔 ∈ 𝐴, and 0 if 𝜔 ∉ 𝐴

Informally, the random variable 1𝐴 takes on the value 1 in the case that the event 𝐴 occurred, and the value 0 in the case that the event 𝐴 did not occur. Notably, the expected value of the random variable 1𝐴 corresponds to the probability of the event 𝐴, because, using the Lebesgue integral, we have

𝐸(1𝐴) = ∫Ω 1𝐴(𝜔) 𝑑𝑃(𝜔) = ∫𝐴 1 𝑑𝑃(𝜔) = 𝑃(𝐴)
Positive false discovery rate. If at least one null hypothesis 𝐻0(𝑖) is rejected, we have 𝑅 > 0. In this case, the
conditional expectation of the proportion of Type I errors among the rejected hypotheses, given that at least
one hypothesis is rejected, is referred to as the positive false discovery rate 𝑝𝐹𝐷𝑅

𝑝𝐹𝐷𝑅 = 𝐸(𝑉/𝑅 | 𝑅 > 0) (10)
We next consider some basic relationships between the Type I error rates defined above, which
follow from their definitions and the formulation of the problem as summarized in Table 1. Firstly, note that,
by the relationships in (3),

0 ≤ 𝑉 ≤ 𝑅 ≤ 𝑚 and 𝑅 = 0 ⇒ 𝑉 = 0 (11)

We thus have

𝑉/𝑚 ≤ (𝑉/𝑅)1{𝑅>0} ≤ 1{𝑉>0} ≤ 𝑉 (12)

Taking expectations of the above results in

𝐸(𝑉/𝑚) = 𝐸(𝑉)/𝑚 ≤ 𝐸((𝑉/𝑅)1{𝑅>0}) ≤ 𝐸(1{𝑉>0}) ≤ 𝐸(𝑉) ⇔ 𝑃𝐶𝐸𝑅 ≤ 𝐹𝐷𝑅 ≤ 𝐹𝑊𝐸𝑅 ≤ 𝑃𝐹𝐸𝑅 (13)
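The chain of inequalities (13) can be illustrated by Monte Carlo simulation (an illustrative Python sketch assuming NumPy; the scenario with 𝑚0 = 18 true and 𝑚1 = 2 non-true null hypotheses, and the effect size, are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(2)
m0, m1 = 18, 2                      # true and non-true null hypotheses
m, c_alpha, n_sim = m0 + m1, 1.96, 20000

V_sum, fwe_cnt, fdp_sum = 0.0, 0.0, 0.0
for _ in range(n_sim):
    T = np.concatenate([rng.standard_normal(m0), rng.standard_normal(m1) + 3.0])
    rej = np.abs(T) > c_alpha
    V = int(np.sum(rej[:m0]))       # Type I errors (the first m0 nulls are true)
    R = int(np.sum(rej))            # total number of rejections
    V_sum += V
    fwe_cnt += (V > 0)
    fdp_sum += V / R if R > 0 else 0.0

PCER, PFER = V_sum / (n_sim * m), V_sum / n_sim
FWER, FDR = fwe_cnt / n_sim, fdp_sum / n_sim
# The chain PCER <= FDR <= FWER <= PFER holds for the estimates as well,
# because the underlying pointwise inequality (12) holds in every realization
```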
In the neuroimaging literature, the most popular Type I error rates are arguably the family-wise error
rate 𝐹𝑊𝐸𝑅 and the false discovery rate 𝐹𝐷𝑅. We thus consider their relationship in more detail. From (13),
the false discovery rate is less than, or equal to, the family-wise error rate. Equality holds in the special case that
all null hypotheses 𝐻0(𝑖) are true, i.e. 𝐻0(𝑖) = 0 for all 𝑖 ∈ 𝑀. In other words: for 𝑚 = 𝑚0, the inequality
𝐹𝐷𝑅 ≤ 𝐹𝑊𝐸𝑅 reduces to the equality 𝐹𝐷𝑅 = 𝐹𝑊𝐸𝑅. This may be seen from the relationships in Table 1 as
follows: From 𝑚0 = 𝑚, it follows that 𝑚1 = 0. 𝑚1 = 0 implies that 𝑇 + 𝑆 is zero, and as 𝑇, 𝑆 ≥ 0, this
implies that both 𝑇 = 0 and 𝑆 = 0. From 𝑆 = 0 it follows that 𝑅 = 𝑉. There are now two possible scenarios:
the number of rejected null hypotheses 𝑅 is zero, or it is not zero (and all rejections, if any, are necessarily
wrong). In the first scenario, 𝑅 = 𝑉 = 0, such that

(𝑉/𝑅)1{𝑅>0} = 0 = 1{𝑉>0} (14)

In the second scenario, 𝑅 = 𝑉 > 0, such that

(𝑉/𝑅)1{𝑅>0} = 1 ⋅ 1 = 1 = 1{𝑉>0} (15)

In both scenarios, the random variables (𝑉/𝑅)1{𝑅>0} and 1{𝑉>0} thus take on identical values, and hence

𝐹𝐷𝑅 = 𝐸((𝑉/𝑅)1{𝑅>0}) = 𝐸(1{𝑉>0}) = 𝐹𝑊𝐸𝑅 (16)

In other words: under the assumption that all null hypotheses are true, 𝐹𝐷𝑅 and 𝐹𝑊𝐸𝑅 are equivalent.
(3) Exact, weak, and strong control of family-wise error rates
In this section, we define what we understand by the exact, weak, or strong control of a family-wise
error rate. This is important in the context of mass-univariate GLM-FMRI because, intuitively, it allows one to
associate the ability to localize an observed effect with different forms of control of family-wise error rates.
To this end, we first introduce the notion of a liberal, conservative, or exact statistical test.
Liberal, exact, and conservative statistical tests
Given a null hypothesis 𝐻0 and a test statistic 𝑇 ∈ 𝒯, where 𝒯 denotes the set of values that the test
statistic can take on, a statistical test is said to be liberal, conservative, or exact, if, for any given significance
level 𝛼 ∈ [0,1] and corresponding rejection region 𝑅𝛼 ⊂ 𝒯 the probability that 𝑇 belongs to the rejection
region 𝑅𝛼, denoted by
𝑝(𝐻0=0)(𝑇 ∈ 𝑅𝛼) (1)
is greater than, less than, or equal to 𝛼, respectively. Appropriate control of the Type I error rate requires an
exact or conservative test. In other words, for a liberal test, we have
𝑝(𝐻0=0)(𝑇 ∈ 𝑅𝛼) > 𝛼 (2)
For an exact test, we have
𝑝(𝐻0=0)(𝑇 ∈ 𝑅𝛼) = 𝛼 (3)
and for a conservative test, we have
𝑝(𝐻0=0)(𝑇 ∈ 𝑅𝛼) < 𝛼 (4)
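These definitions can be checked empirically. In the following illustrative Python sketch (assuming NumPy), a two-sided z-test with cutoff 1.96 is approximately exact at 𝛼 = 0.05, while an inflated cutoff (here 2.5, a hypothetical choice) yields a conservative test:

```python
import numpy as np

rng = np.random.default_rng(3)
T = rng.standard_normal(200_000)   # test statistic under H0 = 0

# Exact test: the rejection region {t : |t| > 1.96} has probability ~alpha = 0.05
rejection_rate = float(np.mean(np.abs(T) > 1.96))

# Conservative test: an inflated cutoff rejects with probability below alpha
conservative_rate = float(np.mean(np.abs(T) > 2.5))
```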
Weak and strong control of family-wise error rates
In order to handle the multiple testing problem appropriately, the rejection criteria, i.e. the rejection
regions 𝑅𝛼 of the given tests, have to be chosen so that the probability of rejecting one or more of the null
hypotheses when the rejected null hypotheses are actually true is sufficiently small. Let the search volume

Ω ≔ {𝑣1, …, 𝑣𝐾} (5)

consist of 𝐾 ∈ ℕ voxels 𝑣1, …, 𝑣𝐾 and let 𝐻01, …, 𝐻0𝐾 be the null hypotheses for each voxel. The omnibus null
hypothesis 𝐻Ω is the logical conjunction of the events (𝐻01 = 0), …, (𝐻0𝐾 = 0), that is

𝐻Ω ≔ (𝐻01 = 0) ∩ (𝐻02 = 0) ∩ … ∩ (𝐻0𝐾 = 0) (6)

In words: 𝐻Ω corresponds to the case that 𝐻01 and 𝐻02 and … and 𝐻0𝐾 are true. To test each of the 𝐻01, …, 𝐻0𝐾
we use a set or “family” of tests 𝑇1, …, 𝑇𝐾. For all 𝑗 ∈ {1, …, 𝐾}, let 𝐸𝑗 be the event that the test 𝑇𝑗
incorrectly rejects 𝐻0𝑗, that is,

𝐸𝑗 ≔ {𝑇𝑗 ∈ 𝑅𝛼𝑗 and 𝐻0𝑗 = 0} (7)

where 𝑅𝛼𝑗 denotes the corresponding rejection region and 𝛼𝑗 the corresponding significance level. Suppose each test is exact or possibly conservative, i.e.

𝑝𝐻Ω(𝐸𝑗) ≤ 𝛼𝑗 (8)
In the context of the family {𝑇𝑗}𝑗=1,…,𝐾 of tests, the family-wise error rate 𝐹𝑊𝐸𝑅 is defined as the probability
of falsely rejecting any of the null hypotheses {𝐻0𝑗}𝑗=1,…,𝐾. Let 𝐸Ω denote the event that the omnibus hypothesis
is rejected, that is,

𝐸Ω ≔ 𝐸1 ∪ 𝐸2 ∪ … ∪ 𝐸𝐾 = ∪𝑗=1𝐾 𝐸𝑗 (9)

Note that this union of events is to be understood as the event that at least one 𝐸𝑗 of the 𝐾 events has
occurred, or, in other words, that 𝐸1 and/or 𝐸2 and/or … and/or 𝐸𝐾 has occurred. If even a single null
hypothesis 𝐻0𝑗 is rejected, the omnibus hypothesis is rejected.
Weak control of the 𝐹𝑊𝐸𝑅 requires that the probability of falsely rejecting the omnibus null
hypothesis 𝐻Ω is, at most, the test level 𝛼, i.e. that

𝑝𝐻Ω(𝐸Ω) ≤ 𝛼 (10)

Evidence against the omnibus null hypothesis 𝐻Ω indicates that at least one of the null hypotheses
𝐻0𝑗, 𝑗 = 1, …, 𝐾, is false, or, in brain imaging terms, that there is “some activation somewhere”. Informally, this
implies that the test has no “localizing power”, meaning that the Type I error rate for individual voxels is not
controlled. Tests that have only weak control over the 𝐹𝑊𝐸𝑅 are called “omnibus tests” and are useful for
detecting whether there is any experimentally induced effect at all, regardless of location. If, on the other
hand, there is interest in not only detecting an experimentally induced signal but also reliably locating the
effect, a test procedure with “strong control” over the 𝐹𝑊𝐸𝑅 is required.
Strong control over the 𝐹𝑊𝐸𝑅 requires that the 𝐹𝑊𝐸𝑅 be controlled not just under 𝐻Ω, but also
under any subset of hypotheses. Specifically, for any subset of voxels 𝐵 ⊆ Ω and corresponding omnibus
hypothesis

𝐻𝐵 ≔ ∩𝑗∈𝐵 (𝐻0𝑗 = 0) (11)

the probability of the event of rejecting the omnibus hypothesis 𝐻𝐵,

𝐸𝐵 ≔ ∪𝑗∈𝐵 𝐸𝑗 (12)

is smaller than or equal to 𝛼, i.e.

𝑝𝐻𝐵(𝐸𝐵) ≤ 𝛼 (13)
Note again, that this inequality is required to hold for all possible choices of the subset 𝐵 from the original
set Ω. In other words, all possible subsets of hypotheses are tested with weak control over the 𝐹𝑊𝐸𝑅. This
ensures that the test is valid (i.e. exact or conservative) at every voxel and, from a neuroimaging perspective,
that the validity of the test in any given region is not affected by the truth of the null hypothesis elsewhere.
Thus, a test procedure with strong control over the 𝐹𝑊𝐸𝑅 can be said to have “localizing power”.
(4) The Bonferroni procedure and its “conservativeness” in GLM-FMRI
Broadly speaking, there are two classes of multiple testing procedures commonly used in the
literature: single-step and stepwise procedures. In single-step procedures, identical adjustments are made
for the tests of all hypotheses, regardless of the observed test statistics or raw 𝑝-values. In stepwise
procedures, the rejection of particular null hypotheses is based not only on the total number of hypotheses,
but also on the outcomes of the tests of the other hypotheses. One may further distinguish step-down
procedures, which order the raw 𝑝-values or associated test statistics starting with the most extreme under
the null hypothesis, from step-up procedures, which use the reverse strategy. We here consider the Bonferroni
procedure as an example of a single-step approach controlling the 𝐹𝑊𝐸𝑅 in the strong sense.
The Bonferroni procedure
The Bonferroni procedure may now be formulated as follows. Given a family of null hypotheses
𝐻0(𝑖), 𝑖 = 1, …, 𝑚, and an interest in a family-wise Type I error rate of less than or equal to a significance-level
𝛼 ∈ [0,1], each individual hypothesis 𝐻0(𝑖) is tested at a reduced significance-level 𝛼𝑖 ∈ [0,1], such that
∑𝑖=1𝑚 𝛼𝑖 = 𝛼. To this end, one sets the adjusted 𝛼-values 𝛼𝑖 to

𝛼𝑖 ≔ 𝛼/𝑚 for 𝑖 = 1, …, 𝑚 (1)

and rejects the null hypothesis 𝐻0(𝑖) for 𝑝𝑖 < 𝛼𝑖, where 𝑝𝑖 denotes the raw 𝑝-value obtained for the test of
hypothesis 𝐻0(𝑖). One can show that the Bonferroni procedure controls the 𝐹𝑊𝐸𝑅 in the strong sense
by capitalizing on Boole’s inequality

𝑃(∪𝑖=1𝑛 𝐴𝑖) ≤ ∑𝑖=1𝑛 𝑃(𝐴𝑖) (2)
as follows. Given a set of 𝑚 null hypotheses 𝐻0(1), 𝐻0(2), …, 𝐻0(𝑚) and associated observed raw 𝑝-values
𝑝1, 𝑝2, …, 𝑝𝑚, we have

𝐹𝑊𝐸𝑅 = 𝑃(𝑉 > 0) ≤ 𝑃(∪𝑖=1𝑚0 {𝑝𝑖 ≤ 𝛼/𝑚}) ≤ ∑𝑖=1𝑚0 𝑃({𝑝𝑖 ≤ 𝛼/𝑚}) ≤ 𝑚0 ⋅ 𝛼/𝑚 ≤ 𝑚 ⋅ 𝛼/𝑚 = 𝛼 (3)
Less formally, consider the event 𝑝𝑖 ≤ 𝛼/𝑚, i.e. that the 𝑖th observed raw 𝑝-value is smaller than or equal to the
adjusted significance level 𝛼/𝑚. The probability that this holds for at least one of the (arbitrarily chosen) 𝑚0 ≤ 𝑚 true null
hypotheses is, by Boole’s inequality, less than or equal to the sum of the probabilities of each of these events,
each of which is at most 𝛼/𝑚. Since the sum runs over all tested null hypotheses which are true, the sum is less than or
equal to 𝑚0 ⋅ 𝛼/𝑚, and as 𝑚0 ≤ 𝑚, this value is less than or equal to 𝛼. Because the above holds true for any choice
of 𝑚0, strong control of the 𝐹𝑊𝐸𝑅 follows.
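Because raw 𝑝-values of true null hypotheses are uniformly distributed on [0,1] for exact tests, the strong 𝐹𝑊𝐸𝑅 control of the Bonferroni procedure can be checked directly by simulation (an illustrative Python sketch, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(4)
m, alpha, n_sim = 100, 0.05, 10000

# Raw p-values of m tests whose null hypotheses are all true: uniform on [0, 1]
p = rng.uniform(size=(n_sim, m))

# FWER without correction vs. with the Bonferroni-adjusted level alpha / m
uncorrected_fwer = float(np.mean((p < alpha).any(axis=1)))
bonferroni_fwer = float(np.mean((p < alpha / m).any(axis=1)))
# uncorrected_fwer is near 1 - 0.95**100; bonferroni_fwer stays at or below ~alpha
```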
It is often stated that the Bonferroni procedure is “too conservative” for controlling Type I error
rates in mass-univariate GLM-FMRI. We next consider how this statement may be understood. Recall that
one way to achieve strong 𝐹𝑊𝐸𝑅 control is to adjust the level of significance at which the different
hypotheses 𝐻01, …, 𝐻0𝐾 are tested. As seen above, the single-step Bonferroni procedure is an illustrative
example of such a strategy. Suppose that the set of hypotheses {𝐻0𝑗}𝑗=1,…,𝐾 is tested at an equal level
𝑏 ∈ [0,1], such that

𝑝𝐻Ω(𝐸𝑗) ≤ 𝑏 (4)
In general,

𝑝𝐻Ω(𝐸Ω) = 𝑝𝐻Ω(∪𝑗=1𝐾 𝐸𝑗) ≤ 𝑝𝐻Ω(𝐸1) + ⋯ + 𝑝𝐻Ω(𝐸𝐾) ≤ 𝐾𝑏 (5)

If 𝑏 is chosen such that 𝐾𝑏 = 𝛼, i.e. 𝑏 = 𝛼/𝐾, it follows that

𝑝𝐻Ω(𝐸Ω) ≤ 𝐾𝑏 = 𝐾 ⋅ 𝛼/𝐾 = 𝛼 (6)
In the case that
𝑝𝐻Ω(𝐸Ω) < 𝑝𝐻Ω(𝐸1) + ⋯+ 𝑝𝐻Ω(𝐸𝐾) (7)
the Bonferroni correction will thus correspond to a conservative test. To see when such a situation might
occur, consider the case 𝐾 = 2. Then, from elementary probability theory, we have
𝑃(𝐸1 ∪ 𝐸2) = 𝑃(𝐸1) + 𝑃(𝐸2) − 𝑃(𝐸1 ∩ 𝐸2) (8)
Note that for disjoint events 𝐸1 ∩ 𝐸2 = ∅ and thus
𝑃(𝐸1 ∩ 𝐸2) = 𝑃(∅) = 0 (9)
If the events 𝐸1 and 𝐸2 are not disjoint, i.e. if 𝑃(𝐸1 ∩ 𝐸2) > 0, then we have

𝑃(𝐸1 ∪ 𝐸2) < 𝑃(𝐸1) + 𝑃(𝐸2) (10)
and thus indeed a conservative test, as
𝑝𝐻Ω(𝐸Ω) = 𝑝𝐻Ω(𝐸1 ∪ 𝐸2) < 𝑝𝐻Ω(𝐸1) + 𝑝𝐻Ω(𝐸2) ≤ 𝛼 (11)
Note, however, that disjoint events are defined by 𝐸1 ∩ 𝐸2 = ∅, while independent events are
defined by 𝑃(𝐸1 ∩ 𝐸2) = 𝑃(𝐸1)𝑃(𝐸2). Why are the events 𝐸1 and 𝐸2 not disjoint? Because it is possible that
both 𝐸1 and 𝐸2, i.e. the events that the test 𝑇1 incorrectly rejects 𝐻01 and that the test 𝑇2 incorrectly rejects 𝐻02,
occur simultaneously. What is the probability that this will happen? That depends on the dependence
structure between the test statistics. For independent tests, it is given by

𝑃(𝐸1 ∩ 𝐸2) = 𝑃(𝐸1)𝑃(𝐸2) = 𝑃(𝑇1 ∈ 𝑅𝛼1)𝑃(𝑇2 ∈ 𝑅𝛼2) (12)
We conclude that, under the formalization above, it is the fact that the events 𝐸𝑗, 𝑗 = 1, …, 𝐾 are not
mutually exclusive (i.e. not disjoint) that renders the Bonferroni procedure conservative. For two events, how
conservative it will be depends on the probability of their joint occurrence, 𝑃(𝐸1 ∩ 𝐸2). This sense of
conservativeness of the Bonferroni correction has motivated the search for alternative approaches to Type I
error rate control in mass-univariate GLM-FMRI, such as topological inference based on the concept of
Gaussian random fields, and non-parametric approaches to multiple testing control.
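For the special case of stochastically independent tests, the degree of conservativeness can be quantified exactly: the probability of at least one of the 𝐾 events 𝐸𝑗 is then 1 − (1 − 𝛼/𝐾)^𝐾, which lies slightly below the Bonferroni bound 𝐾 ⋅ 𝛼/𝐾 = 𝛼. A small illustrative Python sketch (the choice of 𝐾 values is arbitrary):

```python
# Exact family-wise error probability of K independent tests, each at level
# b = alpha / K, versus the Bonferroni bound K * b = alpha
def exact_fwer(K, alpha):
    return 1.0 - (1.0 - alpha / K) ** K

gaps = {K: 0.05 - exact_fwer(K, 0.05) for K in (2, 10, 1000)}
# All gaps are positive but small: under independence the Bonferroni procedure
# is only mildly conservative; positive dependence between test statistics
# (as induced by the spatial smoothness of FMRI data) can widen the gap
```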
Study Questions
1. Why is the multiple testing problem of relevance for classical GLM-FMRI?
2. Define the family-wise error rate and the false discovery rate for a multiple testing problem.
3. How are the false-discovery rate and the family-wise error rate related?
4. Define the notions of liberal, exact, and conservative statistical tests.
5. Discuss the notions of weak and strong control of a multiple testing error rate in the context of GLM-FMRI.
6. Formulate the Bonferroni procedure.
Study Questions Answers
1. GLM-FMRI analyses are typically mass-univariate, i.e. a statistical test is performed at each voxel. Because there are usually
many voxels, the probability of falsely rejecting the null hypothesis when it is in fact true, i.e. of committing a “Type I error”, at least
once is very high.
2. The family-wise error rate is defined as the probability of at least one Type I error over a family of hypotheses. The false-
discovery rate corresponds (roughly) to the expectation of the ratio between the number of rejections of the null hypothesis,
when it is true, and the total number of rejections over a family of hypotheses.
3. In general, 𝐹𝐷𝑅 ≤ 𝐹𝑊𝐸𝑅, that is, the false-discovery rate is smaller than or equal to the family-wise error rate. Equality of the false-
discovery rate and the family-wise error rate holds in the special case that all null hypotheses are true over the whole set of
statistical tests.
4. Given a null hypothesis 𝐻0 and a test statistic 𝑇 of the data, a statistical test is said to be liberal, conservative, or exact, if, for any
given significance level 𝛼 ∈ [0,1] and corresponding rejection region 𝑅𝛼, the probability that 𝑇 belongs to the rejection region
𝑅𝛼, denoted by 𝑝(𝐻0=0)(𝑇 ∈ 𝑅𝛼), is greater than, less than, or equal to 𝛼, respectively. Appropriate control of the Type I error rate
requires an exact or conservative test. In other words, for a liberal test, we have 𝑝(𝐻0=0)(𝑇 ∈ 𝑅𝛼) > 𝛼; for an exact test, we have
𝑝(𝐻0=0)(𝑇 ∈ 𝑅𝛼) = 𝛼; and for a conservative test, we have 𝑝(𝐻0=0)(𝑇 ∈ 𝑅𝛼) < 𝛼.
5. Controlling a Type I error rate under the assumption of the “omnibus null hypothesis”, i.e. the assumption that the null
hypothesis is true for all tests considered, is referred to as weak control. In the context of GLM-FMRI with voxel null hypotheses
of no activation at a given voxel, the omnibus null hypothesis may be expressed as “no activation at any voxel in the volume of
the brain under examination”. Evidence against this null hypothesis indicates the presence of “some activation somewhere”.
Informally, such a test has no “localizing power”, in that the Type I error for individual voxels is not controlled. In other words: if the
omnibus null hypothesis has been rejected, then any set of voxels could be declared “activated”. Controlling a Type I error
rate for every possible choice of subset of tests for which the null hypothesis holds is referred to as “strong control”. A test
procedure with strong control over a Type I error rate has “localizing power”.
6. Given a family of null hypotheses 𝐻0(𝑖), 𝑖 = 1, …, 𝑚, and an interest in a family-wise Type I error rate of less than or equal to a
significance-level 𝛼 ∈ [0,1], the Bonferroni procedure corresponds to testing each individual hypothesis 𝐻0(𝑖) at a reduced
significance-level 𝛼𝑖 ∈ [0,1], such that ∑𝑖=1𝑚 𝛼𝑖 = 𝛼. To this end, one sets the adjusted 𝛼-values 𝛼𝑖 to 𝛼𝑖 ≔ 𝛼/𝑚 for 𝑖 = 1, …, 𝑚
and rejects the null hypothesis 𝐻0(𝑖) for 𝑝𝑖 < 𝛼𝑖, where 𝑝𝑖 denotes the raw 𝑝-value obtained for the test of hypothesis 𝐻0(𝑖). In
other words, instead of testing against the significance level 𝛼, one tests the 𝑚 individual hypotheses against 𝛼/𝑚.
Classification Approaches
Binary classification methods for fMRI and M/EEG data analysis posit the existence of a “training
data set”

{(𝑥(𝑖), 𝑦(𝑖))}𝑖=1𝑛 (1)

comprising “training examples” (𝑥(𝑖), 𝑦(𝑖)), where 𝑥(𝑖) ∈ ℝ𝑚 is an 𝑚-dimensional data vector, usually
referred to as “feature vector” or “input vector” (or, more generally, “feature variable”), and 𝑦(𝑖) ∈ ℝ is a
univariate “target variable”, also referred to as “output variable”. The superscript 𝑖 = 1, …, 𝑛 indexes the
training examples. For binary classification schemes, we may have 𝑦(𝑖) ∈ {0,1} or 𝑦(𝑖) ∈ {−1,1}. As an
example, 𝑥(𝑖) may be a vector of the MR signal from 𝑚 voxels of a cortical region of interest, while the
corresponding 𝑦(𝑖) may code the condition the participant was exposed to when the data 𝑥(𝑖) was recorded.
The goal of classification approaches is to learn a mapping from an observed data pattern 𝑥(𝑖) to the
corresponding value of the target variable. In terms of fMRI, the goal is thus to predict the stimulation
condition from the observed pattern of voxel activation. Note that, compared to the theory of the GLM, the
intuitive meanings of “𝑥” and “𝑦” are reversed: in the GLM “𝑥” refers to aspects of the experimental design,
i.e., the independent variable, and is used to predict data “𝑦”. In the current scenario “𝑥” refers to data that
is used to predict the experimental design “𝑦”. In other words, while in the familiar GLM scenario data “𝑦” is
regressed on predictors “𝑥”, in the current section, data “𝑥” is used to predict experimental conditions “𝑦”.
In general, there exist at least three different learning approaches (Bishop, 2006). The first
approach is referred to as a generative learning approach and models the joint distribution of feature and
target variables 𝑝(𝑥(𝑖), 𝑦(𝑖)), such that the conditional probability 𝑝(𝑦(𝑖)|𝑥(𝑖)) can explicitly be evaluated.
An example of a generative learning method is linear discriminant analysis, which will be discussed in Section
1 below. The second approach is referred to as a discriminative learning approach and models the
probability of 𝑦(𝑖) given the value of 𝑥(𝑖), but not formally conditioned on 𝑥(𝑖), as the feature vector is not considered a
random variable. An example of a discriminative learning method is logistic regression, which we consider
as special case of a larger model class known as “generalized linear models” in Section 2. Finally, a third
approach makes use of linear discriminant functions based on geometrical considerations without reference
to probabilistic concepts. An example of such an approach is the popular support vector classification
technique, which we will discuss in Section 3.
The most common use of these procedures in FMRI is based on the idea that, if the prediction accuracy
of the learning model is larger than chance, then the underlying data (for example, from a cortical region of
interest) must represent some “information” about the experimental condition of interest. To this end, a
full data set {(𝑥(𝑖), 𝑦(𝑖))}𝑖=1𝑛 is often partitioned into a “training data set” {(𝑥(𝑖), 𝑦(𝑖))}𝑖=1𝑟 and a mutually
exclusive “test data set” {(𝑥(𝑗), 𝑦(𝑗))}𝑗=1𝑠, such that 𝑟 + 𝑠 = 𝑛; the learning model is trained on the training
data set and used to predict the experimental condition 𝑦(𝑗) of the associated test data set feature vector
𝑥(𝑗) for 𝑗 = 1, …, 𝑠. By repeatedly changing the allocation of examples from the full data set to the training
and test data sets, different combinations of training and test data are explored. This procedure is referred
to as “cross-validation”.
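The partitioning scheme described above can be sketched as follows (illustrative Python, assuming NumPy; the 4-fold split, the nearest-class-mean classifier, and all data-generating settings are hypothetical stand-ins for an actual learning model):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data: n examples with m features and two classes of equal size
n, m = 40, 5
y = np.repeat([0, 1], n // 2)
x = rng.standard_normal((n, m)) + 1.5 * y[:, None]   # class-dependent mean shift

def nearest_mean_predict(x_train, y_train, x_test):
    """Allocate each test vector to the class with the closer training mean."""
    m0 = x_train[y_train == 0].mean(axis=0)
    m1 = x_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(x_test - m0, axis=1)
    d1 = np.linalg.norm(x_test - m1, axis=1)
    return (d1 < d0).astype(int)

# 4-fold cross-validation: each example serves exactly once as a test datum
folds = np.tile(np.arange(4), n // 4)
accuracies = []
for f in range(4):
    train, test = folds != f, folds == f
    y_hat = nearest_mean_predict(x[train], y[train], x[test])
    accuracies.append(float(np.mean(y_hat == y[test])))
cv_accuracy = float(np.mean(accuracies))   # well above the 0.5 chance level here
```

A cross-validated accuracy markedly above chance would, in the informal sense described above, indicate that the feature vectors carry information about the experimental condition.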
A simple discriminative learning approach for binary classification is afforded by logistic regression.
In brief, upon learning a set of parameters 𝛽, the logistic regression model outputs the probability of a
feature vector 𝑥(𝑖) ∈ ℝ𝑚 belonging to one of two classes encoded by the target variable 𝑦(𝑖) ∈ {0,1}. To classify the
feature vector, the following decision rule is usually employed: if the output probability of 𝑥(𝑖) belonging to the
second class is less than or equal to 0.5, then 𝑥(𝑖) is allocated to the first class (i.e. the class membership prediction is set
to 𝑦(𝑖) ≔ 0), and else it is allocated to the second class (i.e. the class membership prediction is set to 𝑦(𝑖) ≔ 1).
Comparing the allocation of feature vectors to classes with their known class memberships then allows for
(informally) inferring how much information the feature vectors carry with respect to their class membership.
(1) Generative learning - Linear Discriminant Analysis
The linear discriminant analysis model posits that the multivariate data points 𝑥(𝑖) (𝑖 = 1, …, 𝑛)
correspond to realizations of one of two possible multivariate Gaussian random variables governed by
distinct expectation vectors 𝜇0 ∈ ℝ𝑚 and 𝜇1 ∈ ℝ𝑚, but with the same positive-definite covariance matrix Σ ∈ ℝ𝑚×𝑚.
The class membership (i.e. the expectation vector) of the 𝑖th data point is assumed to be determined by an
associated Bernoulli random variable 𝑦(𝑖) with parameter 𝜇 ∈ [0,1]. Note that 𝜇0, 𝜇1 refer to expectations
of Gaussian random variables, while 𝜇 corresponds to the expectation of a Bernoulli variable. Formally, for
𝑖 = 1, …, 𝑛
𝑝(𝑥(𝑖), 𝑦(𝑖)) = 𝑝(𝑦(𝑖))𝑝(𝑥(𝑖)|𝑦(𝑖)) (1)
where
𝑝(𝑦(𝑖)) = 𝐵𝑒𝑟𝑛(𝑦(𝑖); 𝜇) (2)
and
𝑝(𝑥(𝑖)|𝑦(𝑖)) = 𝑁(𝑥(𝑖); 𝜇𝑦(𝑖), Σ) (3)
This model is best understood from an “ancestral” sampling perspective: first, a value 𝑦(𝑖) ∈ {0,1} is sampled
from a Bernoulli distribution with probability 𝜇 ∈ [0,1] of being 1. Next, depending on the outcome of 𝑦(𝑖),
the associated multivariate data point 𝑥(𝑖) ∈ ℝ𝑚 is sampled from the multivariate normal distribution with
expectation 𝜇0, if 𝑦(𝑖) = 0, and expectation 𝜇1, if 𝑦(𝑖) = 1. Figure 1 below depicts 100 samples (or “training
data points”) (𝑥(𝑖), 𝑦(𝑖)), 𝑖 = 1, …, 100 from an LDA model with two-dimensional data points 𝑥(𝑖) ∈ ℝ2 and
with parameters 𝜇 = 0.7, 𝜇0 = (−1, −1)𝑇, 𝜇1 = (1,1)𝑇 and covariance matrix Σ = 𝐼2. Note that the outcome of
𝑦(𝑖) is depicted as the color of the 𝑥(𝑖) data points.
The LDA model can be used for classification of novel feature vectors, denoted here by 𝑥∗ ∈ ℝ𝑚, as
follows. First, based on a training data set {(𝑥(𝑖), 𝑦(𝑖)) | 𝑖 = 1, …, 𝑛}, the parameters of the LDA model are
learned by means of analytical maximum likelihood estimation. We omit the full derivation of the ML
estimators here for brevity and only provide a sketch of it: as usual, the log likelihood function of the
training data is formulated as

ℓ(𝜇, 𝜇0, 𝜇1, Σ) = ln(∏𝑖=1𝑛 𝑝(𝑥(𝑖), 𝑦(𝑖))) = ln(∏𝑖=1𝑛 𝑝(𝑥(𝑖)|𝑦(𝑖))𝑝(𝑦(𝑖))) (4)
and evaluated based on the functional forms of 𝑝(𝑦(𝑖)) and 𝑝(𝑥(𝑖)|𝑦(𝑖)) specified in (2) and (3). Next, the log
likelihood function is maximized with respect to the parameters 𝜇, 𝜇0, 𝜇1 and Σ by computing the
corresponding partial derivatives of the log likelihood function, setting them to zero, and solving for the ML
estimators. Using the indicator function

1𝐴 ≔ 1 if 𝐴 is true, and 0 if 𝐴 is false (5)

the resulting ML estimators for the parameters of the LDA model are given by

�̂� = (1/𝑛) ∑𝑖=1𝑛 1{𝑦(𝑖)=1} (6)

�̂�0 = (∑𝑖=1𝑛 1{𝑦(𝑖)=0} 𝑥(𝑖)) / (∑𝑖=1𝑛 1{𝑦(𝑖)=0}) (7)

�̂�1 = (∑𝑖=1𝑛 1{𝑦(𝑖)=1} 𝑥(𝑖)) / (∑𝑖=1𝑛 1{𝑦(𝑖)=1}) (8)

Σ̂ = (1/𝑛) ∑𝑖=1𝑛 (𝑥(𝑖) − �̂�𝑦(𝑖))(𝑥(𝑖) − �̂�𝑦(𝑖))𝑇 (9)
Proof of (6) – (9)
Figure 1 Linear discriminant analysis realization and classification
Note that these estimators have very intuitive interpretations: the Bernoulli parameter 𝜇 is
estimated by the proportion of observed 1’s in the training data set, the Gaussian parameter 𝜇0 is
estimated by the average of all 𝑥(𝑖) belonging to training points with 𝑦(𝑖) = 0, and likewise for 𝜇1, while the
common covariance Σ is estimated by the pooled empirical covariance of the class-mean-centered data
points of the first and second classes. For the training set realized in Figure 1, for which the true, but unknown,
parameters are given by

𝜇 = 0.7, 𝜇0 = (−1, −1)𝑇, 𝜇1 = (1,1)𝑇 and Σ = (1 0; 0 1) (10)

these parameter estimates are given by

�̂� = 0.70, �̂�0 = (−1.15, −0.93)𝑇, �̂�1 = (0.95, 0.88)𝑇 and Σ̂ = (0.86 0.06; 0.06 1.09) (11)
Once estimated, the LDA model can be used for the classification of a novel feature vector $x^* \in \mathbb{R}^m$ as follows: to determine the class membership of $x^*$, the probability of $y^*$ being 1 conditional on $x^*$ is evaluated. To this end, Bayes' theorem

$p(y^*|x^*) \propto p(y^*)\,p(x^*|y^*)$   (12)

is applied, which, upon some algebraic manipulation, results in the following expression for the probability of $y^*$ being 1 given $x^*$:

$p(y^* = 1|x^*) = \frac{1}{1+\exp(-x^{*T}\hat{\beta})}$   (13)
As will be seen below, expression (13) corresponds to the logistic regression mean function. Notably, in expression (13), $x^*$ refers to the "augmented" feature vector, i.e. $x^* := (1, x^{*T})^T$, and $\hat{\beta}$ corresponds to the following vector function of the ML estimates

$\hat{\beta} := \begin{pmatrix} \frac{1}{2}\hat{\mu}_0^T\hat{\Sigma}^{-1}\hat{\mu}_0 - \ln(1-\hat{\mu}) - \frac{1}{2}\hat{\mu}_1^T\hat{\Sigma}^{-1}\hat{\mu}_1 + \ln(\hat{\mu}) \\ -\hat{\Sigma}^{-1}(\hat{\mu}_0 - \hat{\mu}_1) \end{pmatrix}$   (14)
Based on the probability $p(y^* = 1|x^*)$, a simple decision rule can then be used to classify the novel feature vector $x^*$: if the probability $p(y^* = 1|x^*)$ is larger than 0.5, $x^*$ is classified as a member of the class $y = 1$; otherwise, it is classified as a member of the class $y = 0$.
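The estimation equations (6)-(9) and the classification rule (13)-(14) can be illustrated numerically. The following is a minimal Python/NumPy sketch (Python is used here for illustration only; the parameter values follow equation (10), and all variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample a training set from the LDA generative model with the parameters
# of equation (10): mu = 0.7, mu_0 = (-1,-1)^T, mu_1 = (1,1)^T, Sigma = I_2
n, mu, mu_0, mu_1, Sigma = 100, 0.7, np.array([-1., -1.]), np.array([1., 1.]), np.eye(2)
y = (rng.random(n) < mu).astype(float)                     # y^(i) ~ Bern(mu)
X = np.where(y[:, None] == 1, mu_1, mu_0) \
    + rng.multivariate_normal(np.zeros(2), Sigma, size=n)  # x^(i)|y^(i) ~ N(mu_y, Sigma)

# ML estimation, equations (6)-(9)
mu_hat  = y.mean()
mu0_hat = X[y == 0].mean(axis=0)
mu1_hat = X[y == 1].mean(axis=0)
R       = X - np.where(y[:, None] == 1, mu1_hat, mu0_hat)  # class-specific residuals
Sig_hat = (R.T @ R) / n

# Classification of a novel feature vector, equations (13)-(14)
Si = np.linalg.inv(Sig_hat)
beta_hat = np.concatenate((
    [0.5 * mu0_hat @ Si @ mu0_hat - np.log(1 - mu_hat)
     - 0.5 * mu1_hat @ Si @ mu1_hat + np.log(mu_hat)],
    -Si @ (mu0_hat - mu1_hat)))

def classify(x_star):
    """Return p(y*=1|x*) via (13) on the augmented x*, and the class decision."""
    p = 1.0 / (1.0 + np.exp(-np.concatenate(([1.0], x_star)) @ beta_hat))
    return p, int(p > 0.5)

p1, c1 = classify(np.array([1.0, 1.0]))    # a point near mu_1
p0, c0 = classify(np.array([-1.0, -1.0]))  # a point near mu_0
```

A point near $\mu_1$ obtains a class-1 probability above 0.5 and is assigned to class 1, and conversely for a point near $\mu_0$.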
(2) Discriminative Learning - Logistic Regression
Here we consider logistic regression, which we formulate in terms of a generalized linear model.
Consider now the case of a Bernoulli distributed target variable $y^{(i)}$ with

$p(y^{(i)}) = \mathrm{Bern}(y^{(i)}; \mu^{(i)})$   (1)
where $E(y^{(i)}) = \mu^{(i)} \in [0,1]$ and the Bernoulli distribution is a member of the exponential family. To formulate a generalized linear model, we require a link function $g$, which maps the expectation of the response variable $y^{(i)}$ onto the linear predictor $\eta^{(i)}$. For the logistic regression model, this link function is defined by the so-called "logit function", given by

$g: [0,1] \to \mathbb{R},\ E(y^{(i)}) = \mu^{(i)} \mapsto \ln\frac{\mu^{(i)}}{1-\mu^{(i)}} =: \eta^{(i)}$   (2)
The functional role of the link function becomes somewhat clearer if we consider its inverse $g^{-1}$, which maps the familiar linear predictor $\eta^{(i)} := \beta_0 + x_1^{(i)}\beta_1 + \dots + x_m^{(i)}\beta_m$ onto the expectation of $y^{(i)}$, and which is given by

$g^{-1}: \mathbb{R} \to [0,1],\ \eta^{(i)} \mapsto \frac{1}{1+\exp(-\eta^{(i)})} = \mu^{(i)} = E(y^{(i)})$   (3)
This mapping is found by solving the defining equation (2) of 𝑔 for 𝜇(𝑖) as shown below.
Proof of (3)

We solve the defining equation of $g$ for $\mu^{(i)}$. From (2), we have

$\eta^{(i)} = \ln\frac{\mu^{(i)}}{1-\mu^{(i)}} \Leftrightarrow -\eta^{(i)} = -\ln\frac{\mu^{(i)}}{1-\mu^{(i)}} \Leftrightarrow -\eta^{(i)} = \ln\frac{1-\mu^{(i)}}{\mu^{(i)}}$   (3.1)

where the last equality follows from the properties of the logarithm. Taking the exponential of the last equality then results in

$\exp(-\eta^{(i)}) = \exp\left(\ln\frac{1-\mu^{(i)}}{\mu^{(i)}}\right) \Leftrightarrow \exp(-\eta^{(i)}) = \frac{1-\mu^{(i)}}{\mu^{(i)}}$   (3.2)

which we may now solve for $\mu^{(i)}$:

$\exp(-\eta^{(i)}) = \frac{1-\mu^{(i)}}{\mu^{(i)}} \Leftrightarrow \mu^{(i)}\exp(-\eta^{(i)}) + \mu^{(i)} = 1 \Leftrightarrow \mu^{(i)}\left(1+\exp(-\eta^{(i)})\right) = 1$   (3.3)

and thus

$g^{-1}: \mathbb{R} \to [0,1],\ \eta^{(i)} \mapsto \frac{1}{1+\exp(-\eta^{(i)})} = \mu^{(i)} = E(y^{(i)})$   (3.4)

□
The mean function $g^{-1}$ (also referred to as the "logistic function" or, more generally because of its S-shape, a "sigmoid function") may thus be regarded as a nonlinear function of the linear predictor $x^{(i)T}\beta$, which maps the unbounded values $x^{(i)T}\beta$ onto values in the range $[0,1]$, which in turn serve as the expectations of a Bernoulli random variable. Figure 1 below depicts the link function and its inverse, the mean function.

Figure 1. The logistic regression link function (left panel) and its inverse, the mean function (right panel).
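The inverse relationship between the logit link (2) and the logistic mean function (3) is easy to verify numerically. The following small Python/NumPy sketch (illustrative only) checks the round trip $g^{-1}(g(\mu)) = \mu$ and the range of $g^{-1}$:

```python
import numpy as np

# Logit link function g, equation (2), and logistic mean function g^{-1}, equation (3)
g     = lambda mu:  np.log(mu / (1.0 - mu))
g_inv = lambda eta: 1.0 / (1.0 + np.exp(-eta))

# g^{-1} undoes g on the interior of [0, 1]
mu  = np.array([0.1, 0.5, 0.9])
eta = g(mu)                                  # unbounded linear-predictor scale
round_trip_ok = np.allclose(g_inv(eta), mu)

# g^{-1} maps arbitrary real values into the open interval (0, 1)
eta_grid = np.linspace(-10, 10, 101)
in_range = np.all((g_inv(eta_grid) > 0) & (g_inv(eta_grid) < 1))
```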
For simplicity, we will denote the function $g^{-1}$ in the following by

$f_\beta: \mathbb{R}^m \to \mathbb{R},\ x \mapsto f_\beta(x) := \frac{1}{1+\exp(-x^T\beta)}$   (4)

to emphasize its character as a nonlinear transformation of the input feature vector $x \in \mathbb{R}^m$ and its dependence on a parameter vector $\beta \in \mathbb{R}^m$. Note that with
$\mu^{(i)} = f_\beta(x^{(i)})$   (5)

the Bernoulli probability mass function of $y^{(i)}$ may be written as

$p(y^{(i)}) = \mathrm{Bern}\left(y^{(i)}; f_\beta(x^{(i)})\right) = \left(f_\beta(x^{(i)})\right)^{y^{(i)}}\left(1 - f_\beta(x^{(i)})\right)^{1-y^{(i)}}$   (6)
As derived below, the log likelihood function for the parameters of the logistic regression model
based on 𝑛 training examples is given by
$\ell(\beta) = \sum_{i=1}^{n} y^{(i)}\ln\left(\frac{1}{1+\exp(-x^{(i)T}\beta)}\right) + (1-y^{(i)})\ln\left(1-\frac{1}{1+\exp(-x^{(i)T}\beta)}\right)$   (7)
Proof of (7)

Assuming that the $n$ training examples were generated independently, we can write the likelihood function of the parameters $\beta \in \mathbb{R}^{m+1}$ as

$L(\beta) = p_\beta(y^{(1)},\dots,y^{(n)}) = \prod_{i=1}^{n}\mathrm{Bern}\left(y^{(i)}; f_\beta(x^{(i)})\right) = \prod_{i=1}^{n}\left(f_\beta(x^{(i)})\right)^{y^{(i)}}\left(1-f_\beta(x^{(i)})\right)^{1-y^{(i)}}$   (7.1)

Taking the logarithm, we obtain the log likelihood function (7) as follows:

$\ell(\beta) := \ln L(\beta)$   (7.2)
$= \ln\left(\prod_{i=1}^{n}\left(f_\beta(x^{(i)})\right)^{y^{(i)}}\left(1-f_\beta(x^{(i)})\right)^{1-y^{(i)}}\right)$
$= \sum_{i=1}^{n}\ln\left(\left(f_\beta(x^{(i)})\right)^{y^{(i)}}\left(1-f_\beta(x^{(i)})\right)^{1-y^{(i)}}\right)$
$= \sum_{i=1}^{n}\ln\left(\left(f_\beta(x^{(i)})\right)^{y^{(i)}}\right) + \ln\left(\left(1-f_\beta(x^{(i)})\right)^{1-y^{(i)}}\right)$
$= \sum_{i=1}^{n} y^{(i)}\ln\left(f_\beta(x^{(i)})\right) + (1-y^{(i)})\ln\left(1-f_\beta(x^{(i)})\right)$
□
A common way to maximize the log likelihood function in the case of logistic regression is by means
of gradient ascent. Based on the analytical evaluation of the partial derivatives of the log likelihood function
with respect to 𝛽𝑗, a gradient ascent algorithm for logistic regression is given by
1. Select an initial $\beta^0$ and a learning rate $\alpha > 0$ appropriately.
2. For $k = 0, 1, 2, \dots$ set

$\beta^{k+1} = \beta^k + \alpha\nabla\ell(\beta^k) = \begin{pmatrix} \beta_0^k \\ \beta_1^k \\ \vdots \\ \beta_m^k \end{pmatrix} + \alpha\begin{pmatrix} \sum_{i=1}^{n}\left(y^{(i)} - f_{\beta^k}(x^{(i)})\right)x_0^{(i)} \\ \sum_{i=1}^{n}\left(y^{(i)} - f_{\beta^k}(x^{(i)})\right)x_1^{(i)} \\ \vdots \\ \sum_{i=1}^{n}\left(y^{(i)} - f_{\beta^k}(x^{(i)})\right)x_m^{(i)} \end{pmatrix}$   (8)

Note that the magnitude of the update of the $j$th entry of $\beta^{k+1}$ is proportional to the summed, feature-weighted "prediction error" $\sum_{i=1}^{n}\left(y^{(i)} - f_{\beta^k}(x^{(i)})\right)x_j^{(i)}$, i.e. the summed differences between the observed data points $y^{(i)},\ i = 1,\dots,n$ and the data point predictions $f_{\beta^k}(x^{(i)})$ based on the previous parameter setting $\beta^k$, each weighted by the corresponding feature entry $x_j^{(i)}$.
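The gradient ascent scheme (8) can be sketched in a few lines. The following Python/NumPy illustration (not the notes' own code; data, seed, and step size are our own choices) fits a logistic regression model with one intercept and one feature on synthetic Bernoulli data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training data with augmented features, x_0^(i) := 1
n = 200
X = np.column_stack((np.ones(n), rng.normal(size=n)))  # shape (n, 2)
beta_true = np.array([-0.5, 2.0])
f = lambda X, b: 1.0 / (1.0 + np.exp(-X @ b))          # mean function f_beta, eq. (4)
y = (rng.random(n) < f(X, beta_true)).astype(float)    # y^(i) ~ Bern(f_beta(x^(i)))

# Gradient ascent, equation (8): beta^{k+1} = beta^k + alpha * X^T (y - f_beta(X))
beta, alpha = np.zeros(2), 0.01
for _ in range(2000):
    beta = beta + alpha * X.T @ (y - f(X, beta))

# Log likelihood (7) at the estimate; it should exceed the chance-level value n*ln(0.5)
ll = np.sum(y * np.log(f(X, beta)) + (1 - y) * np.log(1 - f(X, beta)))
```

Note how the update is exactly the "prediction error" $y - f_\beta(x)$ projected onto each feature column, as in (8).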
Proof of (8)

We evaluate the partial derivatives $\frac{\partial}{\partial\beta_j}\ell(\beta),\ j = 0,1,\dots,m$ of the log likelihood function (7). To this end, it is helpful to first rewrite the function

$f_\beta: \mathbb{R}^m \to \mathbb{R},\ x \mapsto f_\beta(x) := \frac{1}{1+\exp(-x^T\beta)}$   (8.1)

as a function of the dot product $x^T\beta$, i.e. as

$f: \mathbb{R} \to \mathbb{R},\ x^T\beta \mapsto f(x^T\beta) := \frac{1}{1+\exp(-x^T\beta)}$   (8.2)

We further note that the derivative of the function $f$ evaluates to

$f': \mathbb{R} \to \mathbb{R},\ x^T\beta \mapsto f'(x^T\beta) = f(x^T\beta)\left(1 - f(x^T\beta)\right)$   (8.3)

This may be seen as follows: for a function

$h: \mathbb{R} \to \mathbb{R},\ z \mapsto h(z) = \frac{1}{1+\exp(-z)}$   (8.4)

we have, using the chain rule of differentiation,

$\frac{d}{dz}h(z) = \frac{d}{dz}\left(1+\exp(-z)\right)^{-1} = \left(-(1+\exp(-z))^{-2}\right)\exp(-z)(-1)$   (8.5)

and thus

$\frac{d}{dz}h(z) = \frac{\exp(-z)}{\left(1+\exp(-z)\right)^2}$   (8.6)

The right-hand side may now be reformulated as $h(z)(1-h(z))$:

$\frac{\exp(-z)}{(1+\exp(-z))^2} = \frac{1+\exp(-z)}{(1+\exp(-z))^2} - \frac{1}{(1+\exp(-z))^2} = \frac{1}{1+\exp(-z)} - \frac{1}{(1+\exp(-z))^2} = \frac{1}{1+\exp(-z)}\left(1 - \frac{1}{1+\exp(-z)}\right) = h(z)\left(1-h(z)\right)$   (8.7)
Using this result, we can now evaluate the entries $\frac{\partial}{\partial\beta_j}\ell(\beta),\ j = 0,1,\dots,m$:

$\frac{\partial}{\partial\beta_j}\ell(\beta) = \frac{\partial}{\partial\beta_j}\left(\sum_{i=1}^{n} y^{(i)}\ln\left(f(x^{(i)T}\beta)\right) + (1-y^{(i)})\ln\left(1-f(x^{(i)T}\beta)\right)\right)$   (8.8)
$= \sum_{i=1}^{n} y^{(i)}\frac{\partial}{\partial\beta_j}\ln\left(f(x^{(i)T}\beta)\right) + (1-y^{(i)})\frac{\partial}{\partial\beta_j}\ln\left(1-f(x^{(i)T}\beta)\right)$
$= \sum_{i=1}^{n} y^{(i)}\frac{1}{f(x^{(i)T}\beta)}\frac{\partial}{\partial\beta_j}f(x^{(i)T}\beta) + (1-y^{(i)})\frac{1}{1-f(x^{(i)T}\beta)}\frac{\partial}{\partial\beta_j}\left(1-f(x^{(i)T}\beta)\right)$
$= \sum_{i=1}^{n} y^{(i)}\frac{1}{f(x^{(i)T}\beta)}\frac{\partial}{\partial\beta_j}f(x^{(i)T}\beta) - (1-y^{(i)})\frac{1}{1-f(x^{(i)T}\beta)}\frac{\partial}{\partial\beta_j}f(x^{(i)T}\beta)$
$= \sum_{i=1}^{n}\left(y^{(i)}\frac{1}{f(x^{(i)T}\beta)} - (1-y^{(i)})\frac{1}{1-f(x^{(i)T}\beta)}\right)\frac{\partial}{\partial\beta_j}f(x^{(i)T}\beta)$

Using (8.3) together with the chain rule, we then have

$\frac{\partial}{\partial\beta_j}\ell(\beta) = \sum_{i=1}^{n}\left(y^{(i)}\frac{1}{f(x^{(i)T}\beta)} - (1-y^{(i)})\frac{1}{1-f(x^{(i)T}\beta)}\right)f(x^{(i)T}\beta)\left(1-f(x^{(i)T}\beta)\right)\frac{\partial}{\partial\beta_j}\left(x^{(i)T}\beta\right)$   (8.9)
Finally, with

$\frac{\partial}{\partial\beta_j}\left(x^{(i)T}\beta\right) = \frac{\partial}{\partial\beta_j}\left(\beta_0 + x_1^{(i)}\beta_1 + \dots + x_m^{(i)}\beta_m\right) = \frac{\partial}{\partial\beta_j}\beta_0 + \frac{\partial}{\partial\beta_j}x_1^{(i)}\beta_1 + \dots + \frac{\partial}{\partial\beta_j}x_m^{(i)}\beta_m = x_j^{(i)}$   (8.10)

we then further have

$\frac{\partial}{\partial\beta_j}\ell(\beta) = \sum_{i=1}^{n}\left(y^{(i)}\frac{f(x^{(i)T}\beta)(1-f(x^{(i)T}\beta))}{f(x^{(i)T}\beta)} - (1-y^{(i)})\frac{f(x^{(i)T}\beta)(1-f(x^{(i)T}\beta))}{1-f(x^{(i)T}\beta)}\right)x_j^{(i)}$   (8.11)
$= \sum_{i=1}^{n}\left(y^{(i)}\left(1-f(x^{(i)T}\beta)\right) - (1-y^{(i)})f(x^{(i)T}\beta)\right)x_j^{(i)}$
$= \sum_{i=1}^{n}\left(y^{(i)} - y^{(i)}f(x^{(i)T}\beta) - f(x^{(i)T}\beta) + y^{(i)}f(x^{(i)T}\beta)\right)x_j^{(i)}$
$= \sum_{i=1}^{n}\left(y^{(i)} - f(x^{(i)T}\beta)\right)x_j^{(i)}$
$= \sum_{i=1}^{n}\left(y^{(i)} - f_\beta(x^{(i)})\right)x_j^{(i)}$
□
(3) Support Vector Classification
In contrast to linear discriminant analysis and logistic regression (and, in fact, to the great majority of models discussed herein), the popular support vector classification approach is not a probabilistic model. As will be seen below, estimating the parameters of a support vector classifier corresponds to solving a constrained quadratic programming problem which is derived from geometric intuitions. While support vector classifiers may thus in practice achieve very good classification performance, the interpretation of this performance is somewhat limited by the fact that the target value assigned to a feature vector has only geometric, but no probabilistic, meaning. Support vector classifiers hence cannot serve as models that quantify the remaining uncertainty about a phenomenon of interest given a set of observations.

To develop the theory of support vector classification, we proceed as follows. First, we review the geometric intuitions and properties of linear discriminant functions. Second, we discuss the cases of linearly separable classes and of soft-margin classification. Third, we review the formulation of these geometric problems as constrained quadratic programming problems and their solution. Finally, we discuss the notion of kernel functions for support vector classification, which gives rise to the notion of a support vector machine.
Geometry of linear discriminant functions
Consider a training data set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ comprising $n$ training examples $(x^{(i)}, y^{(i)})$, where $x^{(i)} \in \mathbb{R}^m$ is an $m$-dimensional feature vector and $y^{(i)} \in \{-1,1\}$ is the corresponding target variable signifying the class membership of $x^{(i)}$. The aim of support vector classification is to learn the parameters of a linear discriminant function that can be used for the classification of feature vectors. The structural model that is used to relate feature vectors to target variables in support vector classification is given by a "linear classification function". We define a linear classification function as a function

$f: \mathbb{R}^m \to \mathbb{R},\ x \mapsto f(x) := w^Tx + w_0$   (1)

i.e., as an affine-linear function that maps a feature vector $x \in \mathbb{R}^m$ onto an output value $f(x) \in \mathbb{R}$ based on a "weight vector" $w \in \mathbb{R}^m$ and a "bias parameter" $w_0 \in \mathbb{R}$. To achieve a binary classification of the feature vector $x$ based on the output of the linear classification function, it is augmented with a decision function

$g: \mathbb{R} \to \{-1,1\},\ f(x) \mapsto g(f(x)) := \begin{cases} -1, & f(x) < 0 \\ 1, & f(x) \geq 0 \end{cases}$   (2)
In other words, if the output of the classification function $f$ for a feature vector is negative, the feature vector is assigned to class $-1$; otherwise, it is assigned to class $1$. It is critical to note that, for a fixed feature vector, the linear classification function's (and thus also the decision function's) output behaviour depends on the values of the weight vector $w$ and the bias $w_0$. The aim of training a support vector classifier represented by linear classification and decision functions thus corresponds to selecting $w$ and $w_0$ according to some criterion. Before discussing approaches of how this is achieved, we review the geometry induced by a training set and by the combination of linear classification and decision functions. While the geometric intuitions discussed below hold for arbitrary feature space dimensions, the basic concepts are best visualized in the plane, i.e. for $m := 2$ (Figure 1).
Figure 1. Basic concepts of binary linear classification. Note that the hyperplane 𝐻 divides the feature space (here ℝ2) into two
decision regions 𝐷+1 and 𝐷−1.
Consider the case of a training set of $n$ examples which is "linearly separable", i.e., one may find a line such that all feature vectors of class $-1$ lie on one side of that line and all feature vectors of class $+1$ lie on the other side (Figure 1). Such a line is also referred to as a "decision surface", which separates the
underlying feature space into two “decision regions”. Notably, the dimensionality of the decision surface is
𝑚 − 1, which in the case of a planar feature space corresponds to a 1-dimensional subspace of the plane,
i.e. a line. The decision surface is defined as the set of points in feature space for which the output value of a
linear discriminant function is 0. Such a decision surface is also referred to as a “hyperplane”, which is
defined in terms of (1) as
𝐻𝑤 ≔ {𝑥 ∈ ℝ𝑚|𝑓(𝑥) = 𝑤𝑇𝑥 + 𝑤0 = 0} ⊂ ℝ𝑚 (3)
Note that the hyperplane, and thus, intuitively, the location and orientation of the decision surface, is a function of the weight vector $w$ and the bias parameter $w_0$ of the linear discriminant function. In other words, changing the weight vector and the bias changes the hyperplane (3). Three geometric characteristics of hyperplanes and the associated parameters of the underlying linear discriminant function are crucial for the theory of support vector classification (Figure 2):

(1) The weight vector $w$ is always orthogonal to the hyperplane. Changing the weight vector thus allows for changing the orientation of the decision surface, while a specific value of the weight vector corresponds to a specific orientation of the decision surface in feature space.

(2) The output of the linear discriminant function $f$ for a feature vector $x$ provides a measure of the distance of $x$ from the hyperplane.

(3) The bias parameter $w_0$ of the linear discriminant function corresponds to the distance of the hyperplane from the origin of feature space. Changing the bias parameter thus shifts the hyperplane in feature space, and a specific choice of the bias parameter corresponds to a specific location of the hyperplane in feature space.
Figure 2. Geometric properties of hyperplanes and linear classification functions.
In the following, we restate the properties above more formally and provide proofs.
Orthogonality of the weight vector and the hyperplane. Let $f$ denote a linear discriminant function and $H_w$ the associated $(m-1)$-dimensional hyperplane induced by the choice of the weight vector $w \in \mathbb{R}^m$. Then the weight vector $w$ is orthogonal to any vector $y$ pointing in the direction of the hyperplane.   (4)

Proof of (4)

Let $x_a, x_b \in H_w$ be arbitrary points on the hyperplane. Then the following system of affine-linear equations holds

$w^Tx_a + w_0 = 0$   (4.1)
$w^Tx_b + w_0 = 0$   (4.2)

Subtracting (4.2) from (4.1) yields

$w^Tx_a - w^Tx_b = 0 \Leftrightarrow w^T(x_a - x_b) = 0$   (4.3)

and thus the weight vector is orthogonal to the vector $y := x_a - x_b$, which points in the direction of the hyperplane.
□
Linear discriminant function output as distance from the hyperplane. Let $f$ denote a linear discriminant function and $H_w$ the associated $(m-1)$-dimensional hyperplane induced by the choice of the weight vector $w \in \mathbb{R}^m$. Then $f(x)$ is proportional to the signed Euclidean distance $d \in \mathbb{R}$ of the point $x$ from the hyperplane, with proportionality constant $\|w\|_2^{-1}$:

$d = \frac{1}{\|w\|_2}f(x)$   (5)

Proof of (5)

Consider the decomposition of a point $x \in \mathbb{R}^m$ into its orthogonal projection $x_p \in \mathbb{R}^m$ onto the hyperplane and a multiple $d \in \mathbb{R}$ of the unit vector $\frac{w}{\|w\|_2}$:

$x = x_p + d\frac{w}{\|w\|_2}$   (5.1)

Note that this decomposition is possible because $w$ is orthogonal to any vector pointing in the direction of the hyperplane and $\left\|\frac{w}{\|w\|_2}\right\|_2 = 1$ (Figure 2). Consider now the evaluation of the so-decomposed $x$ under the linear discriminant function:

$f(x) = w^Tx + w_0 = w^T\left(x_p + d\frac{w}{\|w\|_2}\right) + w_0 = w^Tx_p + w_0 + d\frac{w^Tw}{\|w\|_2}$   (5.2)

Because $x_p \in H_w$, we have $w^Tx_p + w_0 = 0$ and thus

$f(x) = d\frac{w^Tw}{\|w\|_2} = d\frac{\|w\|_2^2}{\|w\|_2} = d\|w\|_2 \Rightarrow d = \frac{1}{\|w\|_2}f(x)$   (5.3)
□
Bias parameter as distance of the hyperplane from the origin. Let $f$ denote a linear discriminant function and $H_w$ the associated $(m-1)$-dimensional hyperplane induced by the choice of the weight vector $w \in \mathbb{R}^m$. Let $d_0$ denote the signed distance of the hyperplane from the origin. Then

$d_0 = \frac{w_0}{\|w\|_2}$   (6)

Proof of (6)

Consider the evaluation of the distance of the origin $x_0 = (0,\dots,0)^T \in \mathbb{R}^m$ from the hyperplane by means of (5). Then

$d_0 = \frac{1}{\|w\|_2}f(x_0) = \frac{1}{\|w\|_2}\left(w^Tx_0 + w_0\right) = \frac{1}{\|w\|_2}\left(w^T\begin{pmatrix}0\\\vdots\\0\end{pmatrix} + w_0\right) = \frac{w_0}{\|w\|_2}$   (6.1)
□
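The three geometric properties (4)-(6) can also be checked numerically for a concrete hyperplane. The following Python/NumPy sketch (illustrative values of $w$ and $w_0$ chosen by us) verifies orthogonality, the distance formula (5), and the origin distance (6) in the plane:

```python
import numpy as np

# A concrete hyperplane in the plane: f(x) = w^T x + w_0, with ||w||_2 = 5
w, w0 = np.array([3.0, 4.0]), -5.0
f = lambda x: w @ x + w0

# Two points on H_w (f = 0); property (4): w is orthogonal to x_a - x_b
x_a = np.array([3.0, -1.0])                   # 9 - 4 - 5 = 0
x_b = np.array([-1.0, 2.0])                   # -3 + 8 - 5 = 0
orthogonal = np.isclose(w @ (x_a - x_b), 0.0)

# Property (5): shifting x_a by 2 units along the unit normal gives signed distance 2
x = x_a + 2.0 * w / np.linalg.norm(w)
d = f(x) / np.linalg.norm(w)

# Property (6): signed distance of H_w from the origin is w0 / ||w||_2
d0 = w0 / np.linalg.norm(w)                   # the origin lies on the f < 0 side
```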
Hyperplane margin, support vectors, and canonical hyperplane
Note again that (4) and (6) imply that the weight vector $w$ and the bias parameter $w_0$ of the linear discriminant function together uniquely determine the orientation and location of the hyperplane $H_w$. In other words, by adjusting the values of these parameters, one can determine the classification of all possible input feature vectors. We next consider the question of how (in some well-defined sense) "good" values for $w$ and $w_0$ can be determined based on a training set. To this end, we first introduce the notions of the training-set dependent "hyperplane margin" and "support vector set".
For a given training data set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ and a given linear discriminant function specified in terms of the values of $w$ and $w_0$, the distance $d^{(i)}$ of each feature vector $x^{(i)}$ from the induced hyperplane can be evaluated based on (5). In order to measure this distance in an absolute sense, i.e., irrespective of on which side of the hyperplane the feature vector $x^{(i)}$ is located, we multiply (5) by the label $y^{(i)} \in \{-1,1\}$, such that all thus measured distances of correctly classified feature vectors are positive:

$d^{(i)} = \frac{y^{(i)}}{\|w\|_2}f(x^{(i)}) = \frac{y^{(i)}(w^Tx^{(i)} + w_0)}{\|w\|_2} \quad (i = 1,\dots,n)$   (7)

Based on (7), the "margin" of the hyperplane induced by $f$ for a given training set is defined as the minimal distance of a training feature vector from the hyperplane (Figure 3):

$d^* = \min_{x^{(i)}}\{d^{(i)}\} = \min_{x^{(i)}}\left\{\frac{y^{(i)}(w^Tx^{(i)} + w_0)}{\|w\|_2}\right\}$   (8)
Figure 3. Hyperplane margin.
The points $x^{(i)}$ for which the equality $d^{(i)} = d^*$ holds, i.e., which lie on the margin of the hyperplane, are called "support vectors". With the help of support vectors, we may specify the notion of a "canonical hyperplane". So far, equivalent hyperplanes can be obtained by multiplying the linear discriminant function $f$ by a scalar $a \neq 0$, as

$f(x) = 0 \Leftrightarrow af(x) = aw^Tx + aw_0 = a(w^Tx + w_0) = 0$   (9)

To obtain a unique representation, we now scale $w$ and $w_0$ such that the output of $f$ for a support vector, multiplied by its class label, equals 1. Let $x^*$ be a support vector and $y^*$ the associated class label. Then this convention means that, by definition,

$y^*f(x^*) = y^*(w^Tx^* + w_0) = 1$   (10)

Moreover, as $x^*$ achieves the minimum in (8), the margin is given by

$d^* = \frac{1}{\|w\|_2}$   (11)
This convention has the benefit that now, by definition, any feature vector that is not a support vector, say $x^{(i)}$, satisfies $y^{(i)}(w^Tx^{(i)} + w_0) > 1$, and that the margin $d^*$ is a simple function of the weight vector $w$. We next consider how this weight vector $w$ is determined based on the intuition of "maximizing the margin". To this end, we consider two scenarios: a linearly separable training data set, i.e., the case in which a hyperplane can be found that results in a correct classification of all training feature vectors, and a non-linearly separable training set, in which no such hyperplane can be found.
Linearly separable training sets and maximum margin classification
Having introduced the margin of a canonical hyperplane, we can now state a basic criterion for deciding which settings of $w$ and $w_0$ are "good" for a given training set in a well-defined sense. We first consider the case of a "linearly separable training set", i.e. a training set for which a linear discriminant function can be found such that all training data points are classified correctly (Figure 4). A "good" linear discriminant function can be conceived of as satisfying two conditions: first, it results in the correct classification of all training data points, and second, it maximizes the margin. The latter condition can be formalized in terms of an optimal $w^*$ as

$w^* = \arg\max_w\left(\frac{1}{\|w\|_2}\right)$   (12)

while the former can be formalized as the set of linear constraints

$y^{(i)}(w^Tx^{(i)} + w_0) \geq 1 \text{ for } i = 1,\dots,n$   (13)
Figure 4. Maximum margin classification for linearly separable training sets.
Note that instead of maximizing $\|w\|_2^{-1}$, one may equivalently minimize $\frac{1}{2}\|w\|_2^2$, a formulation which has some benefits with respect to the standard theory of quadratic programming. The set of $n$ inequalities (13) corresponds to the fact that all training data feature vectors $x^{(i)}$ either are support vectors, i.e., $y^{(i)}(w^Tx^{(i)} + w_0) = 1$, or lie beyond the margin of the canonical hyperplane on the correct side, i.e. $y^{(i)}(w^Tx^{(i)} + w_0) > 1$. In summary, for linearly separable training sets and the intuitive notion of maximum margin classification, learning of the parameters $w$ and $w_0$ can be cast as the linearly constrained quadratic programming problem

$\min_w\left(\frac{1}{2}\|w\|_2^2\right) \text{ subject to } y^{(i)}(w^Tx^{(i)} + w_0) \geq 1 \text{ for } i = 1,\dots,n$   (14)
As discussed in more detail below, (14) can be solved for 𝑤∗ and 𝑤0∗ using techniques from the nonlinear
optimization literature.
Nonlinearly separable training sets and soft-margin classification
We next generalize the formulation of learning the linear discriminant function parameters from the
linearly separable case with maximum margin to the non-linearly separable case with “soft margin”.
Consider the scenario of overlapping class distributions (Figure 5). In this case, it is not possible to find a setting of $w$ and $w_0$ such that all training data feature vectors are correctly classified, and the maximum margin problem (14) has no solution. To establish the notion of a "good" hyperplane also in this case, the linear constraints (13) are augmented with "slack variables" $\xi_i \geq 0$, such that the constraints in this case are given by

$y^{(i)}(w^Tx^{(i)} + w_0) \geq 1 - \xi_i \text{ for } i = 1,\dots,n$   (15)
To make sense of (15), first note that there is a slack variable $\xi_i$ for each training data feature vector $x^{(i)}$. If $\xi_i = 0$, (15) corresponds to (13), and the constraint corresponds to the maximum margin constraint as above. Second, if $0 < \xi_i < 1$, then the training feature vector is still correctly classified, but lies closer to the hyperplane than in the maximum margin case. Finally, if $\xi_i > 1$, the training feature vector $x^{(i)}$ is misclassified. The fundamental aims of soft-margin classification are hence, first, to minimize $\frac{1}{2}\|w\|_2^2$ so as to maximize the margin, and second, to minimize the sum of the slack variables $\xi_i$ over training data points in order to achieve good classification performance on the training data set. In terms of a constrained quadratic programming problem, this intuition can be formalized as

$\min_{w,\xi_i}\left(\frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{n}\xi_i^k\right) \text{ subject to } y^{(i)}(w^Tx^{(i)} + w_0) \geq 1 - \xi_i,\ \xi_i \geq 0 \text{ for } i = 1,\dots,n$   (16)
Figure 5. Soft-margin classification for nonlinearly separable training sets.
Note that the objective function captures the intuition of hyperplane margin maximization by means of minimization of $\frac{1}{2}\|w\|_2^2$, while simultaneously attempting to minimize the "loss" $\sum_{i=1}^{n}\xi_i^k$. Here $k \in \mathbb{N}$ is a constant that determines the particular kind of loss that is considered for the optimization. If $k = 1$, the "loss term" $\sum_{i=1}^{n}\xi_i$ is referred to as "hinge loss"; if $k = 2$, $\sum_{i=1}^{n}\xi_i^2$ is referred to as "quadratic loss". Finally, the constant $C$ determines the relative contribution of the margin maximization and loss terms and is usually chosen empirically.
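For the hinge loss case $k = 1$, the optimal slack variables are $\xi_i = \max(0, 1 - y^{(i)}(w^Tx^{(i)} + w_0))$, so (16) is equivalent to an unconstrained hinge-loss minimization that can be tackled by (sub)gradient descent. The following Python/NumPy sketch illustrates this equivalent formulation on synthetic overlapping classes; it is a simplified illustration, not the constrained quadratic programming procedure developed in the text, and all data and step sizes are our own choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two overlapping 2-D Gaussian classes with labels y in {-1, +1}
n = 100
y = np.repeat([-1.0, 1.0], n // 2)
X = rng.normal(size=(n, 2)) + np.outer(y, [1.5, 1.5])

# Unconstrained form of (16) with k = 1:
# minimize 0.5 ||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + w_0))
C, w, w0, lr = 1.0, np.zeros(2), 0.0, 1e-3

def objective(w, w0):
    xi = np.maximum(0.0, 1.0 - y * (X @ w + w0))   # optimal slack variables
    return 0.5 * w @ w + C * xi.sum()

obj_start = objective(w, w0)
for _ in range(3000):
    violated = (y * (X @ w + w0) < 1.0)            # active hinge terms
    grad_w  = w - C * (y[violated, None] * X[violated]).sum(axis=0)
    grad_w0 = -C * y[violated].sum()
    w, w0 = w - lr * grad_w, w0 - lr * grad_w0

acc = np.mean(np.sign(X @ w + w0) == y)            # training accuracy of g(f(x))
```

The objective decreases from its starting value and the resulting decision function classifies most training points correctly despite the class overlap.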
Dual Lagrangian formulation of soft-margin support vector classifiers with hinge loss
Above we have seen that for a training data set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ the parameters $w \in \mathbb{R}^m$ and $w_0 \in \mathbb{R}$ of a "good" linear discriminant function

$f: \mathbb{R}^m \to \mathbb{R},\ x \mapsto f(x) := w^Tx + w_0$   (17)

can be found by solving the constrained quadratic programming problem

$\min_{w,\xi_i}\left(\frac{1}{2}w^Tw + C\sum_{i=1}^{n}\xi_i^k\right) \text{ subject to } y^{(i)}(w^Tx^{(i)} + w_0) - 1 + \xi_i \geq 0,\ \xi_i \geq 0 \text{ for } i = 1,\dots,n$   (18)
In principle, any constrained quadratic programming technique may be employed to solve (18). Traditionally,
however, the constrained quadratic programming problem above is transformed into its dual Lagrangian
form. This has the advantage that the feature vectors 𝑥(𝑖) enter the procedure only by means of their inner
products, which reduces the dimensionality of the problem and lends itself nicely to a generalization by
means of “kernel functions”. In this section, we consider the transformation of (18) into its dual Lagrangian
form.
From the general theory of constrained nonlinear optimization, we know that for a general
constrained nonlinear optimization problem of the form

$\min_{x\in\mathbb{R}^\nu} f(x) \text{ subject to } c_i(x) = 0\ (i \in E) \text{ and } c_i(x) \geq 0\ (i \in I)$   (19)

with $|E \cup I| = m$, the Lagrangian function is given by

$L: \mathbb{R}^\nu \times \mathbb{R}^m \to \mathbb{R},\ (x, \lambda_1,\dots,\lambda_m) \mapsto L(x,\lambda_1,\dots,\lambda_m) := f(x) - \sum_{i\in E\cup I}\lambda_i c_i(x)$   (20)

and the dual objective function is given by

$q: \mathbb{R}^m \to \mathbb{R},\ \lambda \mapsto q(\lambda) := \min_x L(x,\lambda)$   (21)

Setting $w := (w_1,\dots,w_m)^T$, $\xi := (\xi_1,\dots,\xi_n)^T$, $\omega := (w^T, w_0, \xi^T)^T$ and $\lambda := (\alpha^T, \beta^T)^T$, where $\alpha := (\alpha_1,\dots,\alpha_n)^T$ and $\beta := (\beta_1,\dots,\beta_n)^T$, the Lagrangian function for the hinge loss case $k = 1$ of (18) is thus given by
$L_{primal}: \mathbb{R}^{m+1+n} \times \mathbb{R}^{2n} \to \mathbb{R},\ (\omega,\lambda) \mapsto L_{primal}(\omega,\lambda) :=$   (22)
$\frac{1}{2}w^Tw + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left(y^{(i)}(w^Tx^{(i)} + w_0) - 1 + \xi_i\right) - \sum_{i=1}^{n}\beta_i\xi_i$
Analytical evaluation of the minimum of the function $L_{primal}$ with respect to $\omega$ then yields the dual Lagrangian (objective) function

$L_{dual}: \mathbb{R}^n \to \mathbb{R},\ \alpha \mapsto L_{dual}(\alpha) := \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y^{(i)}y^{(j)}x^{(i)T}x^{(j)}$   (23)
Proof of (23)

Computing the partial derivatives of $L_{primal}$ with respect to $w$, $w_0$ and $\xi_i$, setting them to zero, and solving for the extremal points yields the following conditions:

$\frac{\partial}{\partial w}L_{primal} = 0 \Leftrightarrow \frac{\partial}{\partial w}\left(\frac{1}{2}w^Tw\right) - \frac{\partial}{\partial w}\sum_{i=1}^{n}\alpha_i y^{(i)}w^Tx^{(i)} = 0 \Leftrightarrow w - \sum_{i=1}^{n}\alpha_i y^{(i)}x^{(i)} = 0 \Leftrightarrow w = \sum_{i=1}^{n}\alpha_i y^{(i)}x^{(i)}$   (23.1)

$\frac{\partial}{\partial w_0}L_{primal} = 0 \Leftrightarrow -\frac{\partial}{\partial w_0}\sum_{i=1}^{n}\alpha_i y^{(i)}w_0 = 0 \Leftrightarrow -\sum_{i=1}^{n}\alpha_i y^{(i)} = 0$   (23.2)

and

$\frac{\partial}{\partial \xi_i}L_{primal} = 0 \Leftrightarrow C\frac{\partial}{\partial \xi_i}\sum_{j=1}^{n}\xi_j - \frac{\partial}{\partial \xi_i}\sum_{j=1}^{n}\alpha_j\xi_j - \frac{\partial}{\partial \xi_i}\sum_{j=1}^{n}\beta_j\xi_j = 0 \Leftrightarrow C - \alpha_i - \beta_i = 0 \Leftrightarrow \beta_i = C - \alpha_i \quad (i = 1,\dots,n)$   (23.3)

We next reformulate the primal Lagrangian functional form as

$\frac{1}{2}w^Tw + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left(y^{(i)}(w^Tx^{(i)} + w_0) - 1 + \xi_i\right) - \sum_{i=1}^{n}\beta_i\xi_i$   (23.4)
$= \frac{1}{2}w^Tw + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i y^{(i)}w^Tx^{(i)} - w_0\sum_{i=1}^{n}\alpha_i y^{(i)} + \sum_{i=1}^{n}\alpha_i - \sum_{i=1}^{n}\alpha_i\xi_i - \sum_{i=1}^{n}\beta_i\xi_i$
$= \frac{1}{2}w^Tw - w^T\sum_{i=1}^{n}\alpha_i y^{(i)}x^{(i)} - w_0\sum_{i=1}^{n}\alpha_i y^{(i)} + \sum_{i=1}^{n}\alpha_i + \sum_{i=1}^{n}(C - \alpha_i - \beta_i)\xi_i$

Substitution of the conditions (23.1)-(23.3) then yields

$L_{dual}(\alpha)$   (23.5)
$= \frac{1}{2}w^Tw - w^Tw - w_0 \cdot 0 + \sum_{i=1}^{n}\alpha_i + \sum_{i=1}^{n}0\cdot\xi_i$
$= \sum_{i=1}^{n}\alpha_i - \frac{1}{2}w^Tw$
$= \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\left(\sum_{i=1}^{n}\alpha_i y^{(i)}x^{(i)}\right)^T\left(\sum_{j=1}^{n}\alpha_j y^{(j)}x^{(j)}\right)$
$= \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y^{(i)}y^{(j)}x^{(i)T}x^{(j)}$
□
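The central algebraic step of the proof, namely that with $w = \sum_i \alpha_i y^{(i)}x^{(i)}$ the double sum in (23) equals $w^Tw$, can be verified numerically for arbitrary values. The following Python/NumPy sketch (random illustrative data, not a QP solver) checks this identity and evaluates $L_{dual}$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Random feature vectors, labels, and dual variables
n, m = 20, 3
X = rng.normal(size=(n, m))
y = rng.choice([-1.0, 1.0], size=n)
alpha = rng.random(n)

# Stationarity condition (23.1): w = sum_i alpha_i y_i x_i
w = (alpha * y) @ X

# Double sum in (23): sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
G = X @ X.T                                   # Gram matrix of inner products
double_sum = (alpha * y) @ G @ (alpha * y)

# The identity used in the proof: the double sum equals w^T w, so
# L_dual(alpha) = sum_i alpha_i - 0.5 * w^T w
identity_holds = np.isclose(double_sum, w @ w)
L_dual = alpha.sum() - 0.5 * double_sum
```

Note that the dual depends on the feature vectors only through the Gram matrix $G$ of inner products, which is the entry point for kernel functions.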
Study Questions
1. Explain the concept of “cross-validation” in classification approaches to fMRI data analysis.
2. Verbally describe the maximum likelihood estimators for the parameters 𝜇 ∈ [0,1], 𝜇0, 𝜇1 ∈ ℝ𝑚 and Σ ∈ ℝ𝑚×𝑚 of a linear
discriminant analysis model.
3. According to which distribution is the target variable 𝑦 distributed in a logistic regression model?
4. Name a commonality and a difference between logistic regression and linear discriminant analysis.
Study Questions Answers
1. Classification approaches in fMRI data analysis are based on the idea that if the condition prediction accuracy of a learning model is larger than chance, then the underlying data (for example, BOLD signal feature vectors of a cortical region of interest) must represent some "information" about the experimental condition of interest. To this end, a full data set $\{(x^{(i)}, y^{(i)})\,|\,i = 1,\dots,n\}$ of data feature vectors $x^{(i)}$ and condition labels $y^{(i)}$ is often partitioned into a "training data set" $\{(x^{(i)}, y^{(i)})\,|\,i = 1,\dots,m\}$ and a mutually exclusive "test data set" $\{(x^{(j)}, y^{(j)})\,|\,j = 1,\dots,q\}$, such that $m + q = n$. The classification-learning model is trained on the training data set and used to predict the experimental conditions $y^{(j)}$ of the associated test data set feature vectors $x^{(j)}\ (j = 1,\dots,q)$, and the prediction accuracy is recorded. By repeatedly changing the allocation of examples from the full data set to the training and test data sets, all possible combinations of training and test data are explored, and the prediction accuracies are averaged over combinations to yield a final prediction accuracy.
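The partitioning logic behind cross-validation can be sketched in a few lines. The following Python/NumPy illustration uses a $k$-fold rotation (a common simplification of exploring all possible allocations) and a placeholder accuracy in place of an actual trained classifier:

```python
import numpy as np

rng = np.random.default_rng(4)

n, k = 20, 5                          # n examples, k cross-validation folds
indices = rng.permutation(n)          # random allocation of examples to folds
folds = np.array_split(indices, k)

accuracies = []
for f in range(k):
    test_idx = folds[f]                                   # held-out test data set
    train_idx = np.concatenate([folds[g] for g in range(k) if g != f])
    # training and test sets are mutually exclusive and jointly exhaustive
    assert len(set(train_idx) & set(test_idx)) == 0
    assert len(train_idx) + len(test_idx) == n
    # ... train a classifier on train_idx, record its accuracy on test_idx ...
    accuracies.append(1.0)            # placeholder accuracy for this sketch
final_accuracy = np.mean(accuracies)  # averaged over folds
```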
2. The maximum likelihood estimator for the Bernoulli parameter $\mu$ is given by the proportion of observed 1's in the training data set, the Gaussian parameter $\mu_0 \in \mathbb{R}^m$ is estimated by the average of all $x^{(i)}$ belonging to training points with $y^{(i)} = 0$, and likewise for $\mu_1 \in \mathbb{R}^m$, while the common covariance $\Sigma \in \mathbb{R}^{m\times m}$ is estimated by the average of the empirical covariances of the first and second classes.
3. The target variable 𝑦 ∈ {0,1} of a logistic regression model is distributed according to a Bernoulli distribution.
4. Both logistic regression and linear discriminant analysis associate multivariate feature vectors $x^{(i)} \in \mathbb{R}^m$ with values $y^{(i)} \in \{0,1\}$ of a random variable distributed according to a Bernoulli distribution. While logistic regression treats the data points $x^{(i)}$ as non-random entities, linear discriminant analysis treats them as realizations of two multivariate Gaussian distributions with expectation parameters $\mu_0 \in \mathbb{R}^m$ and $\mu_1 \in \mathbb{R}^m$ associated with the values $y^{(i)}$ being either 0 or 1.
Deterministic Dynamical Models
Deterministic dynamic causal models for FMRI
(1) The structural form of dynamic causal models for FMRI
In structural form, dynamic causal models (DCMs) for FMRI take the following form
$\dot{x} = f(x, u, \theta_f)$   (1)
$y = g(x, \theta_g)$   (2)
where

$u: I \to U \subseteq \mathbb{R}^l,\ t \mapsto u(t)$   (3)

is a vector-valued function of system inputs, usually corresponding to the temporal evolution of $l \in \mathbb{N}$ experimental conditions,

$x: I \to X \subseteq \mathbb{R}^n,\ t \mapsto x(t)$   (4)

is a vector-valued function of latent system states describing the evolution of neural and hemodynamic variables, and

$y: I \to Y \subseteq \mathbb{R}^m,\ t \mapsto y(t) := g(x(t), \theta_g)$   (5)

is a vector-valued function of system observables in continuous time, usually representing the MR signal in $m \in \mathbb{N}$ regions of interest, and $\theta \in \Theta \subseteq \mathbb{R}^p$ is a vector of parameters. While the function $u$ is usually explicitly formalized based on the experimental manipulation, the function $x$ is specified in terms of a first-order $n$-dimensional system of ordinary differential equations. The function

$f: X \times U \times \Theta_f \to X,\ (x(t), u(t), \theta_f) \mapsto f(x(t), u(t), \theta_f) := \dot{x}(t)$   (6)

specifies the rate of change of the system state and is usually referred to as the system's "evolution function". The function

$g: X \times \Theta_g \to Y,\ (x(t), \theta_g) \mapsto g(x(t), \theta_g) := y(t)$   (7)

maps latent states onto system observables and is referred to as the "observation function".
In standard formulations of deterministic DCMs, each of the 𝑚 ∈ ℕ regions is equipped with five
latent states, such that the hidden state space of the system corresponds to 𝑋 ⊆ ℝ5𝑚. One of these latent
states describes the evolution of a lumped neural activity process, while four of these latent states describe
the dynamics of regionally-specific neuro-vascular coupling processes. In this formulation, the evolution
function $f$ takes the following form:

$f: X \times U \times \Theta_f \to X,\ (x, u, \theta_f) \mapsto f(x, u, \theta_f) := \left(f^{(i)}(x, u, \theta_f)\right)_{i=1,\dots,m} = \begin{pmatrix} f^{(1)}(x, u, \theta_f) \\ f^{(2)}(x, u, \theta_f) \\ \vdots \\ f^{(m)}(x, u, \theta_f) \end{pmatrix}$   (8)
In other words, the evolution function partitions region-of-interest-wise. To ensure positivity, the state variables $x_3$, $x_4$ and $x_5$ are exponentiated, which we will denote by defining $\tilde{x}_3 := \exp(x_3)$, $\tilde{x}_4 := \exp(x_4)$ and $\tilde{x}_5 := \exp(x_5)$, respectively.
Each region-of-interest specific evolution function then takes the following form:

$f^{(i)}: X \times U \times \Theta_f \to \mathbb{R}^5,\ (x, u, \theta_f) \mapsto f^{(i)}(x, u, \theta_f) := \begin{pmatrix} a_i x_1 + \left(\sum_{k=1}^{l} u_k b_i^k\right)x_1 + c_i u \\ x_1^{(i)} - \kappa_s x_2^{(i)} - \kappa_f\left(\tilde{x}_3^{(i)} - 1\right) \\ x_2^{(i)}/\tilde{x}_3^{(i)} \\ \frac{1}{\tau_0\tilde{x}_4^{(i)}}\left(\tilde{x}_3^{(i)} - \left(\tilde{x}_4^{(i)}\right)^{1/\alpha}\right) \\ \frac{1}{\tau_0\tilde{x}_5^{(i)}}\left(\tilde{x}_3^{(i)}\,\frac{1-(1-E_0)^{1/\tilde{x}_3^{(i)}}}{E_0} - \left(\tilde{x}_4^{(i)}\right)^{1/\alpha}\frac{\tilde{x}_5^{(i)}}{\tilde{x}_4^{(i)}}\right) \end{pmatrix}$   (9)

where $\{a_i \in \mathbb{R}^{1\times m}, b_i^k \in \mathbb{R}^{1\times m}, c_i \in \mathbb{R}^{1\times l}, \kappa_s, \kappa_f, \tau_0, \alpha, E_0 \in \mathbb{R}\} \subset \theta_f$ are parameters, $x_1 := (x_1^{(1)},\dots,x_1^{(m)})^T$ denotes the vector of neural state variables of all regions, and $x_j^{(i)}\ (j = 1,\dots,5)$ are the five state variables that together form the state vector $x^{(i)} := (x_1^{(i)}, x_2^{(i)}, x_3^{(i)}, x_4^{(i)}, x_5^{(i)})^T$ of the $i$th region of interest. We discuss the differential equations for each state variable in more detail below.
Like the evolution function, the observation function partitions region-wise, such that it takes the form

$g: X \times \Theta_g \to Y,\ (x, \theta_g) \mapsto \left(g^{(i)}(x, \theta_g)\right)_{i=1,\dots,m} = \begin{pmatrix} g^{(1)}(x, \theta_g) \\ g^{(2)}(x, \theta_g) \\ \vdots \\ g^{(m)}(x, \theta_g) \end{pmatrix}$   (10)

with

$g^{(i)}: X \times \Theta_g \to \mathbb{R},\ (x, \theta_g) \mapsto g^{(i)}(x, \theta_g) := V_0\left(c_1\left(1 - \tilde{x}_5^{(i)}\right) + c_2^{(i)}\left(1 - \frac{\tilde{x}_5^{(i)}}{\tilde{x}_4^{(i)}}\right) + c_3^{(i)}\left(1 - \tilde{x}_4^{(i)}\right)\right)$   (11)

where $c_1 = 4.3\nu_0 E_0 T_E$ is a region-independent constant, and $c_2^{(i)} = \epsilon_0^{(i)}r_0 E_0 T_E$ and $c_3^{(i)} = 1 - \epsilon_0^{(i)}$ are region-dependent constants.
(2) The neural state evolution function
The form of the evolution function for the neural state variable,

$$ \dot{x}_1^{(i)} = a_i x_1 + \left(\sum_{k=1}^{l} u_k b_i^k\right) x_1 + c_i u, \qquad (1) $$

is motivated by the following considerations. Let $x_1 := (x_1^{(1)},\ldots,x_1^{(m)})^T \in S \subseteq \mathbb{R}^m$ denote a state vector comprising the neural state variables of a set of $m \in \mathbb{N}$ regions, and let $\Theta_1 \subset \Theta$ denote the parameter subspace for parameters governing the evolution of the first state variable of each regional system. Consider the system's evolution function in the form

$$ \dot{x}_1 = \phi(x_1, u, \theta_1) \qquad (2) $$
In DCM for FMRI, the evolution function for the neural state $x_1$ is defined as

$$ \phi : S \times U \times \Theta_1 \to \mathbb{R}^m,\; (x_1,u,\theta_1) \mapsto \phi(x_1,u,\theta_1) := A x_1 + \left(\sum_{k=1}^{l} u_k B^k\right) x_1 + C u \qquad (3) $$

where $\theta_1 := \{A, B^1, \ldots, B^l \in \mathbb{R}^{m\times m}, C \in \mathbb{R}^{m\times l}\}$ is a set of matrices. The first term in (3) encodes a (possible) contribution of each system variable's current state to the rate of change of each system variable, independent of the system input function $u$; the second term in (3) encodes a (possible) interactive contribution of the system's current state and the system inputs; and the third term encodes a (possible) contribution of the system's input function to the rate of change of each system variable. Intuitively, if one assumes that the input functions $u_k$ take on only the values 1 and 0, then the last term in (3) represents the contribution of the inputs to the rate of change of the state variables, scaled by the values in the matrix $C$. Likewise, the middle term represents a condition-dependent contribution of the neural state variables, scaled by the values in the condition-specific matrices $B^1, \ldots, B^l$, to the rate of change of the neural state vector. Together, the parameters in $\theta_1$ thus specify how neural activity at a given time-point and in a given region-of-interest influences the evolution of other neural activities. Intuitively, these parameters thus capture the notion of "effective connectivity", informally defined as the "effect neural activity in one brain region has on neural activity in another brain region".
As an example, consider $m = 3$ and $l = 2$, i.e. a neural system comprising three regions, and thus three neural state variables, and two input functions (e.g. the evolution of two experimental conditions over time). Then (1) takes the form

$$ \dot{x}_1(t) = A x_1(t) + \left(u_1(t)B^1 + u_2(t)B^2\right)x_1(t) + C u(t) \qquad (4) $$

where $x_1(t) \in \mathbb{R}^3$, $u(t) \in \mathbb{R}^2$, $A, B^1, B^2 \in \mathbb{R}^{3\times3}$ and $C \in \mathbb{R}^{3\times2}$. Writing out the above explicitly results in
$$ \begin{pmatrix} \dot{x}_1^{(1)}(t) \\ \dot{x}_1^{(2)}(t) \\ \dot{x}_1^{(3)}(t) \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} x_1^{(1)}(t) \\ x_1^{(2)}(t) \\ x_1^{(3)}(t) \end{pmatrix} + \left( u_1(t) \begin{pmatrix} b_{11}^1 & b_{12}^1 & b_{13}^1 \\ b_{21}^1 & b_{22}^1 & b_{23}^1 \\ b_{31}^1 & b_{32}^1 & b_{33}^1 \end{pmatrix} + u_2(t) \begin{pmatrix} b_{11}^2 & b_{12}^2 & b_{13}^2 \\ b_{21}^2 & b_{22}^2 & b_{23}^2 \\ b_{31}^2 & b_{32}^2 & b_{33}^2 \end{pmatrix} \right) \begin{pmatrix} x_1^{(1)}(t) \\ x_1^{(2)}(t) \\ x_1^{(3)}(t) \end{pmatrix} + \begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \\ c_{31} & c_{32} \end{pmatrix} \begin{pmatrix} u_1(t) \\ u_2(t) \end{pmatrix} \qquad (5) $$
and we see that indeed

$$ \dot{x}_1^{(i)} = a_i x_1 + \left(\sum_{k=1}^{l} u_k b_i^k\right) x_1 + c_i u \qquad (6) $$

for vectors $a_i \in \mathbb{R}^{1\times m}$, $b_i^k \in \mathbb{R}^{1\times m}$ $(k = 1,\ldots,l)$ and $c_i \in \mathbb{R}^{1\times l}$ as specified in the previous section.
To obtain an intuition about the dynamic repertoire of neural systems specified as in (3), we first consider the case of $m = 2$ and $l = 1$ with $B^1 = 0$. We thus obtain

$$ \dot{x}_1(t) = A x_1(t) + C u(t) \qquad (7) $$

where $x_1(t) \in \mathbb{R}^2$, $A \in \mathbb{R}^{2\times2}$, $C \in \mathbb{R}^{2\times1}$ and $u(t) \in \mathbb{R}$. For $C := (1,0)^T$, modelling a positive influence of the input function on the rate of change of the first neural state variable $x_1^{(1)}$, a "Gaussian-like" input function

$$ u : [0,T] \to \mathbb{R}_+,\; t \mapsto u(t) := \exp\left(-\frac{1}{\sigma^2}(t-\mu)^2\right) \qquad (8) $$

and the initial condition $x_1(0) = (0,0)^T$, we consider the scenarios
$$ A^{(1)} = \begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix},\quad A^{(2)} = \begin{pmatrix} -1 & 0 \\ 0.9 & -1 \end{pmatrix},\quad A^{(3)} = \begin{pmatrix} -1 & 0.9 \\ 0 & -1 \end{pmatrix} \quad\text{and}\quad A^{(4)} = \begin{pmatrix} -1 & 0.9 \\ 0.9 & -1 \end{pmatrix} \qquad (9) $$
Intuitively, the diagonal entries of the matrices in (9) model a self-inhibitory dynamic, that is, the rate of change of each neural state is negatively proportional to its current state. In addition, $A^{(1)}$ models the case of no connections between the neural state variables, $A^{(2)}$ models the case of a positive influence of neural state $x_1^{(1)}$ on the rate of change of neural state $x_1^{(2)}$, $A^{(3)}$ models the case of a positive influence of neural state $x_1^{(2)}$ on neural state $x_1^{(1)}$, and $A^{(4)}$ models the case of positive influences of both state variables on each other. The dynamics resulting from these scenarios are visualized in Figure 1.
Figure 1. Neural system scenarios based on equations (7)–(9). The left panels depict the intuitive connectivity structure that is represented by the $A^{(i)} \in \mathbb{R}^{2\times2}$ and $C \in \mathbb{R}^{2\times1}$ matrices of the current example system. Note that the self-inhibitory values on the diagonal of the $A^{(i)}$ result in the dissipation of neural state activity over time. Further note that from the structure of $A^{(1)}$ and $A^{(3)}$ it follows that $x_1^{(2)}$ remains at the baseline level of 0.
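The scenarios of Figure 1 can be reproduced with a few lines of code. The following is a minimal sketch that integrates the example system (7) with the input (8) using a simple Euler scheme; the step size and the input parameters μ and σ are illustrative choices, not values prescribed in the text.

```python
import numpy as np

# Euler integration of the two-region example system (7)-(9):
# x1'(t) = A x1(t) + C u(t), driven by a "Gaussian-like" input u(t).
# Step size dt and input parameters mu, sigma are illustrative choices.

def simulate(A, T=20.0, dt=0.01, mu=2.0, sigma=1.0):
    C = np.array([1.0, 0.0])          # input drives the first region only
    n = int(T / dt)
    x = np.zeros(2)                   # initial condition x1(0) = (0, 0)^T
    trajectory = np.zeros((n, 2))
    for k in range(n):
        t = k * dt
        u = np.exp(-(t - mu) ** 2 / sigma ** 2)   # input function (8)
        x = x + dt * (A @ x + C * u)              # Euler step for (7)
        trajectory[k] = x
    return trajectory

A1 = np.array([[-1.0, 0.0], [0.0, -1.0]])  # scenario A(1): no connections
A2 = np.array([[-1.0, 0.0], [0.9, -1.0]])  # scenario A(2): region 1 drives region 2

traj1 = simulate(A1)
traj2 = simulate(A2)

# In scenario A(1) the second region is never driven and stays at baseline 0,
# while in scenario A(2) it transiently deviates from baseline.
print(np.max(np.abs(traj1[:, 1])), np.max(np.abs(traj2[:, 1])))
```

Swapping in the remaining matrices $A^{(3)}$ and $A^{(4)}$ reproduces the other two panels of Figure 1.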
(3) Interpretation of the hemodynamic evolution and observer functions
In the following, we discuss the intuition of the state variables $x_2^{(i)}$, $x_3^{(i)}$, $x_4^{(i)}$ and $x_5^{(i)}$ for each of the $m$ regions of interest. Because the intuition is identical over regions, we drop the superscript $(i)$ in the following discussion. The reader should nevertheless be aware that these state variables exist for each of the regions and that the parameters governing the evolution of these states can either be assumed to be regionally specific or common to all regions.
For each region, the second and third state variables model the dynamics of neurovascular coupling. State variable $x_2$ models the evolution of a vasodilatory signal and state variable $x_3$ models the evolution of the regional blood flow. More specifically, the system of ODEs

$$ \text{Vasodilatory signal:}\quad \dot{x}_2 = x_1 - \kappa_s x_2 - \kappa_f(\tilde{x}_3 - 1) \qquad (7) $$
$$ \text{Blood flow:}\quad \dot{x}_3 = x_2 / \tilde{x}_3 \qquad (8) $$
establishes a link between the value of the regional neural state variable and the regional blood flow. As evident from (7), the change of the vasodilatory signal is positively related to the neural activity state and negatively related to the signal itself, modelling a signal decay. Finally, the change of the vasodilatory signal is negatively related to the regional blood flow. The change of regional blood flow $\dot{x}_3$ is positively related to the vasodilatory signal state $x_2$ and scales inversely with the exponentiated flow state $\tilde{x}_3$.
The fourth and fifth state variables of each region model the blood-flow-induced changes of blood volume and deoxyhemoglobin content. More specifically, this subsystem describes the behaviour of the post-capillary venous compartment by analogy to an inflated balloon, and models the evolution of blood volume $x_4$ and deoxyhemoglobin content $x_5$. The evolution equations are given by

$$ \text{Blood volume:}\quad \dot{x}_4 = \frac{1}{\tau_0 \tilde{x}_4}\left(\tilde{x}_3 - \tilde{x}_4^{1/\alpha}\right) \qquad (9) $$
$$ \text{Deoxyhemoglobin:}\quad \dot{x}_5 = \frac{\tilde{x}_3\left(1-(1-E_0)^{1/\tilde{x}_3}\right)}{E_0} - \tilde{x}_4^{1/\alpha}\,\frac{\tilde{x}_5}{\tilde{x}_4} \qquad (10) $$
The motivation for these equations is as follows. The rate of change of volume $x_4$ is given by the difference of the current inflow $\tilde{x}_3$ and a volume-dependent outflow $\tilde{x}_4^{1/\alpha}$. Notably, this form of outflow models the balloon-like characteristic of the venous compartment to expel blood at a greater rate when distended. Finally, the rate of change of deoxyhemoglobin reflects the delivery of deoxyhemoglobin into the venous compartment minus that expelled. More specifically, the first term in (10) reflects the product of the current blood inflow $\tilde{x}_3$ and a blood-flow-dependent model of oxygen extraction given by $(1-(1-E_0)^{1/\tilde{x}_3})/E_0$, where $E_0$ denotes a resting oxygen extraction fraction. The second term in (10) denotes the product of the volume-dependent blood outflow $\tilde{x}_4^{1/\alpha}$ and the concentration of deoxyhemoglobin per blood volume, $\tilde{x}_5/\tilde{x}_4$.
Finally, the observer function

$$ \text{BOLD signal:}\quad g : \mathcal{X} \times \Theta \to \mathbb{R},\; (x,\theta) \mapsto g(x,\theta) := V_0\left(c_1(1-\tilde{x}_5) + c_2\left(1-\frac{\tilde{x}_5}{\tilde{x}_4}\right) + c_3(1-\tilde{x}_4)\right) \qquad (11) $$

with constants $c_1 := 4.3\nu_0 E_0 T_E$, $c_2 := \epsilon_0 r_0 E_0 T_E$, and $c_3 := 1-\epsilon_0$ relates the state vector of the regional neurovascular system to the observed MR signal change, or "BOLD signal" for short. Here, $\nu_0 = 40.3$ denotes a frequency offset at the outer surface of magnetized vessels in Hz, $E_0 = 0.4$ denotes the resting-state oxygen extraction fraction, $T_E = 0.04$ denotes the echo time in seconds, $\epsilon_0$ denotes the ratio of intra- to extravascular MR signal and is estimated for each region, and $r_0 = 25$ denotes the slope of the intravascular relaxation rate as a function of oxygen saturation.
In Figure 2 we depict the evolution of the hemodynamic and BOLD signal variables for a pre-specified evolution of the neural system state of a single region. Notably, an increase in neural activity results in an increase of the vasodilatory signal and blood flow, and a delayed increase in blood volume, while these three mechanisms are accompanied by a decrease of the local deoxyhemoglobin concentration. Based on the observer function, the predicted BOLD signal change exhibits the typical increase upon stimulation and a post-stimulus undershoot.
Figure 2. Regional hemodynamic and BOLD signal components of DCMs for FMRI. The uppermost panel depicts the evolution of the neural system variable $x_1(t)$ of a single region, here simulated by a Gaussian centered at time 0. The middle panel depicts the resulting evolution of the hemodynamic system state variables $x_2(t)$, $x_3(t)$, $x_4(t)$, and $x_5(t)$, modelling the region-specific vasodilatory signal, blood flow, blood volume, and deoxyhemoglobin content, respectively. Note that neural activation induces an increase in the vasodilatory signal, followed by an increase in blood flow and, with some delay, an increase in regional blood volume, while it induces a decrease in deoxyhemoglobin content. The changes in blood volume and deoxyhemoglobin in turn result in an increase of the regional BOLD signal.
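A response of the kind depicted in Figure 2 can be reproduced qualitatively with a simple Euler integration of equations (7)-(11) for a single region. The rate constants κ_s, κ_f, τ_0, α, V_0 and ε_0 used below are illustrative values, not values given in the text; only ν_0, E_0, T_E and r_0 are taken from the constants stated above.

```python
import numpy as np

# Euler integration of the hemodynamic equations (7)-(10) and the observer
# function (11) for a single region, driven by a Gaussian neural activity
# pulse. kappa_s, kappa_f, tau_0, alpha, V_0, epsilon_0 are illustrative.
ks, kf, tau0, alpha, V0, eps0 = 0.65, 0.41, 0.98, 0.32, 0.04, 1.0
nu0, E0, TE, r0 = 40.3, 0.4, 0.04, 25.0          # constants from the text
c1, c2, c3 = 4.3 * nu0 * E0 * TE, eps0 * r0 * E0 * TE, 1.0 - eps0

dt, T = 0.01, 30.0
n = int(T / dt)
x = np.zeros(4)               # states x2..x5; x3..x5 live in log space
bold = np.zeros(n)
for k in range(n):
    t = k * dt
    x1 = np.exp(-(t - 3.0) ** 2)                  # prescribed neural pulse
    x2, x3, x4, x5 = x
    f, v, q = np.exp(x3), np.exp(x4), np.exp(x5)  # exponentiated states
    dx2 = x1 - ks * x2 - kf * (f - 1.0)           # vasodilatory signal (7)
    dx3 = x2 / f                                  # blood flow (8)
    dx4 = (f - v ** (1.0 / alpha)) / (tau0 * v)   # blood volume (9)
    dx5 = f * (1.0 - (1.0 - E0) ** (1.0 / f)) / E0 - v ** (1.0 / alpha) * q / v  # dHb (10)
    x = x + dt * np.array([dx2, dx3, dx4, dx5])
    bold[k] = V0 * (c1 * (1.0 - q) + c2 * (1.0 - q / v) + c3 * (1.0 - v))  # observer (11)

print(bold.max(), bold.min())
```

As in Figure 2, the neural pulse drives a transient increase in flow and volume, a washout of deoxyhemoglobin, and hence a positive predicted BOLD signal change.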
(4) The probabilistic form of DCM for FMRI
To render DCM for FMRI a probabilistic model that allows for parameter inference and model
comparison based on observed data, the structural form discussed above is embedded into a parametric
Gaussian framework in discrete time. To this end, let

$$ Y := \begin{pmatrix} Y_1 \\ \vdots \\ Y_m \end{pmatrix} \qquad (1) $$

denote the concatenated time-series data from $m$ regions of interest, where $Y_i \in \mathbb{R}^n$ corresponds to the $n \in \mathbb{N}$ measurements taken for region $i$. We may conceive the structural form of DCM for FMRI as discussed previously as a mapping

$$ h : U \times \Theta_f \times \Theta_g \to \mathbb{R}^{mn},\; (u,\theta_f,\theta_g) \mapsto h(u,\theta_f,\theta_g) \qquad (2) $$

which, using an appropriately chosen discretization method, maps an input function $u$ and parameters $\theta_f$ and $\theta_g$ onto the concatenated predicted discrete MR signal time-series of the $m$ regions. The likelihood model of DCM for FMRI then takes the following form

$$ Y = h(u,\theta_f,\theta_g) + X\theta_h + \varepsilon \qquad (3) $$
In (3),

$$ X := \mathrm{diag}(X_1, X_2, \ldots, X_m) \in \mathbb{R}^{mn \times mq} \qquad (4) $$

denotes a block-diagonal matrix comprising a set of $q \in \mathbb{N}$ "nuisance" regressors for each region. Typically, these regressors are chosen as a set of discrete-time, discrete-frequency cosine functions which are used to model slow drifts of the observed signals $Y_1, \ldots, Y_m$, and $\theta_h \in \mathbb{R}^{mq}$ is a further parameter vector. $\varepsilon$ is an additive Gaussian error term, i.e. $p(\varepsilon) = N(\varepsilon; 0, \Sigma)$ with expectation $0 \in \mathbb{R}^{mn}$ and positive-definite covariance matrix $\Sigma \in \mathbb{R}^{mn\times mn}$, which is often chosen to be of the form $\Sigma := \sum_{i=1}^{k} \lambda_i Q_i$ for parameters $\lambda_i \in \mathbb{R}$ and covariance basis matrices $Q_i \in \mathbb{R}^{mn\times mn}$.
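As a sketch of how the nuisance regressor matrix in (4) may be constructed, the following builds a discrete cosine set per region and stacks it block-diagonally; the choices of n, q and m, as well as the particular cosine basis, are illustrative.

```python
import numpy as np

n, q, m = 100, 3, 2          # scans per region, regressors, regions (illustrative)
t = np.arange(n)
# discrete-time, discrete-frequency cosine set modelling slow signal drifts
X_i = np.stack([np.cos(np.pi * k * (2 * t + 1) / (2 * n)) for k in range(q)], axis=1)
# block-diagonal stacking over regions: X = diag(X_1, ..., X_m) as in (4)
X = np.zeros((m * n, m * q))
for i in range(m):
    X[i * n:(i + 1) * n, i * q:(i + 1) * q] = X_i
print(X.shape)   # (m*n, m*q)
```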
To simplify the variational treatment of (3) below, we set $\Theta := \{\Theta_f, \Theta_g, \Theta_h\}$ and summarize the structural DCM model function and the nuisance term in a single, input-function-specific function

$$ \psi_u : \Theta \to \mathbb{R}^{mn},\; \theta \mapsto \psi_u(\theta) \qquad (5) $$

Based on these conventions, the likelihood-embedding of DCM for FMRI thus takes the following form

$$ p(Y|\theta) = N(Y; \psi_u(\theta), \Sigma) \qquad (6) $$
Study questions
1. Which mathematical framework underlies deterministic dynamic causal models for FMRI?
2. DCM for FMRI is often portrayed as a framework to measure "effective connectivity". How is the notion of effective connectivity built into the DCM for FMRI framework?
3. DCM for FMRI is a model for the BOLD signal time-series of $m$ regions of interest. How many latent state variables are used to model the dynamics of each region and what do they model?
4. What does the observer function of DCM for FMRI describe?
Study Question Answers
1. Deterministic dynamical causal models for FMRI are based on latent systems of ordinary differential equations equipped with
nonlinear observation functions.
2. The evolution function of DCM for FMRI corresponds to a system of ordinary differential equations, some of which specify the
evolution of lumped neural activity variables in selected regions of interest. These equations are governed by parameters which can
be viewed as representations of the effect that neural state activity (or input activity) has on the rate of change of itself or on other
regions.
3. Five state variables are used to model the dynamics of each region. The first state variable models lumped neural activity, the
second state variable describes a vasodilatory signal, the third state variable models the region’s blood (in)flow, the fourth variable
models the region’s blood volume, and the fifth state variable models the regional concentration of deoxyhemoglobin.
4. The observer function of DCM for FMRI describes how local blood volume and deoxyhemoglobin concentration give rise to the
observed BOLD signal change.
Variational Bayesian inversion of deterministic dynamical models
(1) Model formulation
In this section, we consider the variational Bayesian estimation of the following static two-level
nonlinear Gaussian model
𝑦 = 𝑓(휃) + 휀 (1)
휃 = 𝜇𝜃 + 휂 (2)
where $y \in \mathbb{R}^n$, $\varepsilon \in \mathbb{R}^n$, $\theta \in \mathbb{R}^m$, $\mu_\theta, \eta \in \mathbb{R}^m$ $(n, m \in \mathbb{N})$, $\varepsilon$ and $\eta$ are distributed according to $p(\varepsilon) = N(\varepsilon; 0, \Sigma_y)$ and $p(\eta) = N(\eta; 0, \Sigma_\theta)$ with positive-definite $\Sigma_y \in \mathbb{R}^{n\times n}$ and $\Sigma_\theta \in \mathbb{R}^{m\times m}$, respectively, and
𝑓 ∶ ℝ𝑚 → ℝ𝑛, 휃 ↦ 𝑓(휃) (3)
is a differentiable, not necessarily linear, function3. To apply the principles of variational Bayes to the model
problem given by (1) and (2), we first consider the joint distribution corresponding to the structural form of
(1) and (2). To this end, we make the simplifying assumption that the parameters Σ𝑦 and Σ𝜃 that govern the
covariance of the probabilistic error 휀 and 휂 are known (these parameters are sometimes referred to as
“hyperparameters”). We further assume that the expectation parameter 𝜇𝜃 of the unobserved random
variable 휃 is known as well. We then have the random variables 𝑦 and 휃 which are governed by the
following joint distribution
𝑝(𝑦, 휃) = 𝑝(𝑦|휃)𝑝(휃) (4)
The conditional distribution of the observed random variable 𝑦 is specified by (1), and the marginal
distribution of the unobserved random variable 휃 is specified by (2). In functional form, we can thus write
the joint distribution (4) as the product of two multivariate Gaussian distributions
𝑝(𝑦, 휃) = 𝑁(𝑦; 𝑓(휃), Σ𝑦)𝑁(휃; 𝜇𝜃, Σ𝜃) (5)
To summarize, we have specified a joint distribution over one observed random variable 𝑦 and one
unobserved random variable 휃 that is given by the product of two multivariate Gaussian distributions and
parameterized by an additional set of known parameters Σ𝑦, 𝜇𝜃 and Σ𝜃. Given an observed value 𝑦∗ of the
observed random variable, the aim is now to obtain an approximation to the posterior distribution of the
unobserved random variables 휃.
(2) Fixed-form evaluation of the variational free energy
To achieve this aim, the standard approach in the DCM literature is to approximate the posterior distribution $p(\theta|y)$ (a) using a variational mean-field approximation, which is redundant here, as there is only a single unobserved random variable, and (b) using a fixed-form VB approach, by setting the variational distribution to a Gaussian distribution from the outset:
3 Note that with respect to deterministic models for fMRI, we have set 𝑛 ≔ 𝑛𝑚, 𝑦 ≔ 𝑌, 𝑓 ≔ 𝜓𝑢 and have assumed that Θ = ℝ𝑚. In
addition, we have introduced a Gaussian distribution over the parameter vector 휃.
𝑞(휃) ≔ 𝑁(휃;𝑚𝜃, 𝑆𝜃) (1)
(note that in this Section we use roman letters to distinguish the variational parameters from the
parameters of the marginal distributions of the joint distribution). Somewhat unfortunately, this latter
assumption is referred to in the DCM literature as “Laplace approximation”. This is unfortunate, because the
term “Laplace approximation” is used in the general machine learning and statistics literature for the
approximation of an arbitrary probability density function with a Gaussian density and not for the definition
of a variational distribution in terms of a Gaussian distribution. As discussed above the latter is usually
referred to as a “fixed form” variational distribution and the ensuing inversion scheme as “fixed-form
variational Bayes”. Above we noted that fixed-form VB has the benefit that a probabilistic calculus problem
is reformulated as a nonlinear optimization problem, as we will sketch in the following.
Recall that the principal mechanism of variational Bayes is to maximize the variational free energy

$$ \mathcal{F}(q(\vartheta)) = \int q(\vartheta) \ln\left(\frac{p(y,\vartheta)}{q(\vartheta)}\right) d\vartheta \qquad (2) $$
with respect to the variational distribution 𝑞(𝜗) over the unobserved random variables 𝜗. Also recall that
this is a problem of variational calculus, because the free energy as defined in (2) is a function of a
(probability density) function. Under a fixed-form assumption as in (1), however, all probability densities
involved in the definition of the variational energy are known and can be substituted. In effect, this renders
the variational free energy a function of the variational parameters and transforms a problem of variational
calculus to a nonlinear optimization problem in multivariate calculus, which can be addressed using the
standard machinery of nonlinear optimization algorithms. In the current case, the unobserved random
variables 𝜗 correspond to 휃 and substitution yields the following form of the variational free energy
$$ F : \mathbb{R}^m \times \mathbb{R}^{m\times m} \to \mathbb{R},\; (m_\theta, S_\theta) \mapsto F(m_\theta, S_\theta) \qquad (3) $$

where

$$ F(m_\theta, S_\theta) := \int N(\theta; m_\theta, S_\theta) \ln\left(\frac{N(y; f(\theta), \Sigma_y)\,N(\theta; \mu_\theta, \Sigma_\theta)}{N(\theta; m_\theta, S_\theta)}\right) d\theta \qquad (4) $$
Note that we have changed the symbol for the variational free energy from $\mathcal{F}$ to $F$ to indicate that instead of a functional we are now dealing with a function of real-valued parameters. From a mathematical perspective, it is worth noting that the fixed-form reformulation of the variational free energy by no means results in a trivial problem. First, the value of the function $F$ is defined by an integral term involving the nonlinear function $f$ which, as discussed below, can be evaluated analytically only approximately. This is important, because it calls into question the validity of the optimized free energy as an approximation to the log marginal probability. However, as of now, the magnitude of the ensuing approximation error as a function of the degree of nonlinearity of $f$ does not seem to have been systematically studied in the literature. Second, the function $F$ is not a "simple" real-valued multivariate function, in the sense that its arguments are not only real vectors, but also covariance parameters, which have a predefined structure, i.e. they are positive-definite matrices. Fortunately, optimization of the function $F$ with respect to these parameters can be achieved analytically, as discussed below. We next provide the functional form of the free energy function (4), including its derivation (which, surprisingly, is largely absent from the DCM literature), and then proceed to discuss its optimization with respect to the variational parameters $m_\theta$ and $S_\theta$.
Using a multivariate first-order Taylor approximation of the nonlinear function $f$ and setting $\Sigma_y := \lambda_y^{-1} I_n$, the function defined in (3) and (4) can be approximated by

$$ F : \mathbb{R}^m \times \mathbb{R}^{m\times m} \to \mathbb{R},\; (m_\theta, S_\theta) \mapsto F(m_\theta, S_\theta) \qquad (5) $$

where

$$ F(m_\theta, S_\theta) := -\frac{\lambda_y}{2}(y-f(m_\theta))^T(y-f(m_\theta)) - \frac{\lambda_y}{2}\mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right) - \frac{1}{2}(m_\theta-\mu_\theta)^T \Sigma_\theta^{-1}(m_\theta-\mu_\theta) - \frac{1}{2}\mathrm{tr}\left(\Sigma_\theta^{-1} S_\theta\right) - \frac{1}{2}\ln|\Sigma_\theta| + \frac{1}{2}\ln|S_\theta| \qquad (6) $$

where $\mathrm{tr}$ denotes the trace operator, $J_f(m_\theta)$ denotes the Jacobian matrix of the function $f$ evaluated at the variational expectation parameter, and a number of constant terms have been removed for ease of presentation. Note that, formally, we should have used different symbols for the function defined in (2) and (4) and its approximation provided in (5) and (6). We provide a derivation of (6) from (4) below.
Proof of (6)
For the derivation of (6), we require the following property of expectations of multivariate random variables $x \in \mathbb{R}^d$ under Gaussian distributions, which we state without proof.

Gaussian expectation theorem. For $x, m, \mu \in \mathbb{R}^d$, positive-definite $\Sigma \in \mathbb{R}^{d\times d}$ and $A \in \mathbb{R}^{d\times d}$,

$$ \left\langle (x-m)^T A (x-m) \right\rangle_{N(x;\mu,\Sigma)} = (\mu-m)^T A (\mu-m) + \mathrm{tr}(A\Sigma) \qquad (6.1) $$

where

$$ \mathrm{tr} : \mathbb{R}^{d\times d} \to \mathbb{R},\; M := (m_{ij})_{1\le i,j\le d} \mapsto \mathrm{tr}(M) := \sum_{i=1}^{d} m_{ii} \qquad (6.2) $$

denotes the trace operator, i.e. the sum of the diagonal elements of its argument. Proofs of the above can be found for example in (Petersen et al., 2006) and in the references therein. We are concerned with the following joint distribution
𝑝(𝑦, 휃) = 𝑝(𝑦|휃)𝑝(휃) (6.3)
where
𝑝(𝑦|휃) ≔ 𝑁(𝑦; 𝑓(휃), Σ𝑦) and 𝑝(휃) ≔ 𝑁(휃; 𝜇𝜃 , Σ𝜃) (6.4)
and the variational distribution
𝑞(휃) = 𝑁(휃;𝑚𝜃 , 𝑆𝜃) (6.5)
Based on this notational simplification, we now consider the variational free energy integral. Using the properties of the logarithm and the linearity of the integral, we first note that

$$ \mathcal{F}(q(\theta)) = \int q(\theta) \ln\left(\frac{p(y,\theta)}{q(\theta)}\right) d\theta = \int q(\theta)\left(\ln p(y,\theta) - \ln q(\theta)\right) d\theta = \int q(\theta) \ln p(y,\theta)\, d\theta - \int q(\theta) \ln q(\theta)\, d\theta \qquad (6.6) $$

Of the remaining two integral terms, the latter corresponds to the negative differential entropy of a multivariate Gaussian distribution and is well known to correspond to a nonlinear function of the variational covariance parameter $S_\theta$:

$$ \int q(\theta) \ln q(\theta)\, d\theta = -\mathcal{H}\left(N(\theta; m_\theta, S_\theta)\right) = -\frac{1}{2}\ln|S_\theta| - \frac{m}{2}\ln(2\pi e) \qquad (6.7) $$
There thus remains the evaluation of the first integral term, which corresponds to the expectation of the log joint probability density of the observed and unobserved random variables under the variational distribution of the unobserved random variables:

$$ \int q(\theta) \ln p(y,\theta)\, d\theta = \left\langle \ln p(y,\theta) \right\rangle_{q(\theta)} \qquad (6.8) $$
Substitution of the functional form of $p(y,\theta)$ (cf. (6.3) and (6.4)) then results, using the current notation for expectations, in

$$ \begin{aligned} \left\langle \ln p(y,\theta) \right\rangle_{q(\theta)} &= \left\langle \ln\left(N(y; f(\theta), \Sigma_y)\,N(\theta; \mu_\theta, \Sigma_\theta)\right) \right\rangle_{N(\theta;m_\theta,S_\theta)} \\ &= \left\langle \ln N(y; f(\theta), \Sigma_y) \right\rangle_{N(\theta;m_\theta,S_\theta)} + \left\langle \ln N(\theta; \mu_\theta, \Sigma_\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} \\ &= \left\langle \ln\left((2\pi)^{-\frac{n}{2}} |\Sigma_y|^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2}(y-f(\theta))^T \Sigma_y^{-1} (y-f(\theta))\right)\right) \right\rangle_{N(\theta;m_\theta,S_\theta)} \\ &\quad + \left\langle \ln\left((2\pi)^{-\frac{m}{2}} |\Sigma_\theta|^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2}(\theta-\mu_\theta)^T \Sigma_\theta^{-1} (\theta-\mu_\theta)\right)\right) \right\rangle_{N(\theta;m_\theta,S_\theta)} \\ &= -\frac{n}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_y| - \frac{1}{2}\left\langle (y-f(\theta))^T \Sigma_y^{-1} (y-f(\theta)) \right\rangle_{N(\theta;m_\theta,S_\theta)} \\ &\quad - \frac{m}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_\theta| - \frac{1}{2}\left\langle (\theta-\mu_\theta)^T \Sigma_\theta^{-1} (\theta-\mu_\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} \end{aligned} \qquad (6.9) $$
There thus remain two expectation terms. Of these, the latter can be evaluated readily using the Gaussian expectation properties introduced above. Specifically, we have

$$ \begin{aligned} \left\langle (\theta-\mu_\theta)^T \Sigma_\theta^{-1} (\theta-\mu_\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} &= \left\langle \theta^T \Sigma_\theta^{-1}\theta - \theta^T \Sigma_\theta^{-1}\mu_\theta - \mu_\theta^T \Sigma_\theta^{-1}\theta + \mu_\theta^T \Sigma_\theta^{-1}\mu_\theta \right\rangle_{N(\theta;m_\theta,S_\theta)} \\ &= \left\langle \theta^T \Sigma_\theta^{-1}\theta \right\rangle_{N(\theta;m_\theta,S_\theta)} - 2\mu_\theta^T \Sigma_\theta^{-1}\left\langle \theta \right\rangle_{N(\theta;m_\theta,S_\theta)} + \mu_\theta^T \Sigma_\theta^{-1}\mu_\theta \\ &= \mathrm{tr}\left(\Sigma_\theta^{-1} S_\theta\right) + m_\theta^T \Sigma_\theta^{-1} m_\theta - 2\mu_\theta^T \Sigma_\theta^{-1} m_\theta + \mu_\theta^T \Sigma_\theta^{-1}\mu_\theta \\ &= \mathrm{tr}\left(\Sigma_\theta^{-1} S_\theta\right) + (m_\theta-\mu_\theta)^T \Sigma_\theta^{-1}(m_\theta-\mu_\theta) \end{aligned} \qquad (6.10) $$
There thus remains the evaluation of the first expectation term on the right-hand side of (6.9). To simplify proceedings, we assume in the following that $\Sigma_y := \lambda_y^{-1} I_n$, such that we only need to consider the term

$$ \left\langle (y-f(\theta))^T (y-f(\theta)) \right\rangle_{N(\theta;m_\theta,S_\theta)} = \left\langle y^T y - y^T f(\theta) - f(\theta)^T y + f(\theta)^T f(\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} = y^T y - 2y^T \left\langle f(\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} + \left\langle f(\theta)^T f(\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} \qquad (6.11) $$
We are hence led to the evaluation of the expectation of a Gaussian random variable $\theta$ under the nonlinear transformation $f$. The (apparent) idea of the DCM literature is to approximate the function $f$ by a multivariate first-order Taylor expansion in order to evaluate the remaining expectations (see (Chappell, Groves, Whitcher, & Woolrich, 2009) for an explicit discussion of this approach). Denoting the Jacobian matrix of $f$ evaluated at the variational expectation parameter $m_\theta$ as the function

$$ J_f : \mathbb{R}^m \to \mathbb{R}^{n\times m},\; m_\theta \mapsto J_f(m_\theta) := \left.\left(\frac{d}{d\theta} f\right)\right|_{\theta=m_\theta} := \begin{pmatrix} \frac{\partial}{\partial\theta_1} f_1(m_\theta) & \cdots & \frac{\partial}{\partial\theta_m} f_1(m_\theta) \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial\theta_1} f_n(m_\theta) & \cdots & \frac{\partial}{\partial\theta_m} f_n(m_\theta) \end{pmatrix} \qquad (6.12) $$
we thus write

$$ f(\theta) \approx f(m_\theta) + J_f(m_\theta)(\theta - m_\theta) \qquad (6.13) $$

By replacing $f(\theta)$ in the first expectation on the right-hand side of (6.11) with the approximation (6.13), we then obtain

$$ \begin{aligned} \left\langle f(\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} &\approx \left\langle f(m_\theta) + J_f(m_\theta)(\theta - m_\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} \\ &= f(m_\theta) + J_f(m_\theta)\left(\left\langle \theta \right\rangle_{N(\theta;m_\theta,S_\theta)} - m_\theta\right) \\ &= f(m_\theta) + J_f(m_\theta)(m_\theta - m_\theta) \\ &= f(m_\theta) \end{aligned} \qquad (6.14) $$
Further, replacing $f(\theta)$ in the second expectation on the right-hand side of (6.11) with the approximation (6.13), we obtain

$$ \begin{aligned} \left\langle f(\theta)^T f(\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} &\approx \left\langle \left(f(m_\theta) + J_f(m_\theta)(\theta-m_\theta)\right)^T \left(f(m_\theta) + J_f(m_\theta)(\theta-m_\theta)\right) \right\rangle_{N(\theta;m_\theta,S_\theta)} \\ &= \left\langle f(m_\theta)^T f(m_\theta) + 2 f(m_\theta)^T J_f(m_\theta)(\theta-m_\theta) + \left(J_f(m_\theta)(\theta-m_\theta)\right)^T \left(J_f(m_\theta)(\theta-m_\theta)\right) \right\rangle_{N(\theta;m_\theta,S_\theta)} \\ &= f(m_\theta)^T f(m_\theta) + 2 f(m_\theta)^T J_f(m_\theta)\left\langle \theta-m_\theta \right\rangle_{N(\theta;m_\theta,S_\theta)} + \left\langle \left(J_f(m_\theta)(\theta-m_\theta)\right)^T \left(J_f(m_\theta)(\theta-m_\theta)\right) \right\rangle_{N(\theta;m_\theta,S_\theta)} \end{aligned} \qquad (6.15) $$

Considering the first of the remaining expectations yields

$$ \left\langle \theta - m_\theta \right\rangle_{N(\theta;m_\theta,S_\theta)} = \left\langle \theta \right\rangle_{N(\theta;m_\theta,S_\theta)} - m_\theta = m_\theta - m_\theta = 0 \qquad (6.16) $$
To evaluate the second remaining expectation, we first rewrite it as

$$ \left\langle \left(J_f(m_\theta)(\theta-m_\theta)\right)^T \left(J_f(m_\theta)(\theta-m_\theta)\right) \right\rangle_{N(\theta;m_\theta,S_\theta)} = \left\langle (\theta-m_\theta)^T J_f(m_\theta)^T J_f(m_\theta)(\theta-m_\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} \qquad (6.17) $$

and note that $(\theta-m_\theta)^T \in \mathbb{R}^{1\times m}$, $J_f(m_\theta)^T \in \mathbb{R}^{m\times n}$, $J_f(m_\theta) \in \mathbb{R}^{n\times m}$ and $(\theta-m_\theta) \in \mathbb{R}^{m\times 1}$. Application of the Gaussian expectation theorem (6.1) then yields

$$ \begin{aligned} \left\langle (\theta-m_\theta)^T J_f(m_\theta)^T J_f(m_\theta)(\theta-m_\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} &= (m_\theta-m_\theta)^T J_f(m_\theta)^T J_f(m_\theta)(m_\theta-m_\theta) + \mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right) \\ &= \mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right) \end{aligned} \qquad (6.18) $$
We thus have

$$ \left\langle f(\theta)^T f(\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} \approx f(m_\theta)^T f(m_\theta) + \mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right) \qquad (6.19) $$

In summary, we obtain the following approximation for the expectation on the left-hand side of (6.11):

$$ \begin{aligned} \left\langle (y-f(\theta))^T (y-f(\theta)) \right\rangle_{N(\theta;m_\theta,S_\theta)} &= y^T y - 2y^T \left\langle f(\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} + \left\langle f(\theta)^T f(\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} \\ &\approx y^T y - 2y^T f(m_\theta) + f(m_\theta)^T f(m_\theta) + \mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right) \\ &= (y-f(m_\theta))^T (y-f(m_\theta)) + \mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right) \end{aligned} \qquad (6.20) $$
Concatenating the results (6.10) and (6.20), we have thus obtained the following approximation of the expectation of the log joint density of the observed and unobserved random variables under the variational distribution

$$ \begin{aligned} \left\langle \ln p(y,\theta) \right\rangle_{q(\theta)} \approx\; & -\frac{n}{2}\ln 2\pi + \frac{n}{2}\ln \lambda_y - \frac{\lambda_y}{2}\left((y-f(m_\theta))^T(y-f(m_\theta)) + \mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right)\right) \\ & -\frac{m}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_\theta| - \frac{1}{2}\left(\mathrm{tr}\left(\Sigma_\theta^{-1} S_\theta\right) + (m_\theta-\mu_\theta)^T \Sigma_\theta^{-1}(m_\theta-\mu_\theta)\right) \end{aligned} \qquad (6.21) $$
Together with the result for the entropy term in (6.7), we have thus found the following approximation for the variational free energy functional under the Gaussian fixed-form assumption about $q(\theta)$:

$$ \begin{aligned} \mathcal{F}(q(\theta)) \approx\; & -\frac{n}{2}\ln 2\pi + \frac{n}{2}\ln \lambda_y - \frac{\lambda_y}{2}(y-f(m_\theta))^T(y-f(m_\theta)) - \frac{\lambda_y}{2}\mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right) \\ & -\frac{m}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_\theta| - \frac{1}{2}(m_\theta-\mu_\theta)^T \Sigma_\theta^{-1}(m_\theta-\mu_\theta) - \frac{1}{2}\mathrm{tr}\left(\Sigma_\theta^{-1} S_\theta\right) + \frac{1}{2}\ln|S_\theta| + \frac{m}{2}\ln(2\pi e) \end{aligned} \qquad (6.22) $$

Notably, upon evaluation of the defining integrals, the variational free energy can now be expressed as a function of the variational parameters $m_\theta$ and $S_\theta$. Because the presence of additive constants does not change the location of the optima of this free energy function, we omit these constants for ease of presentation in the following and in the main text, and write

$$ F : \mathbb{R}^m \times \mathbb{R}^{m\times m} \to \mathbb{R},\; (m_\theta, S_\theta) \mapsto F(m_\theta, S_\theta) \qquad (6.23) $$
where

$$ F(m_\theta, S_\theta) = -\frac{\lambda_y}{2}(y-f(m_\theta))^T(y-f(m_\theta)) - \frac{\lambda_y}{2}\mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right) - \frac{1}{2}(m_\theta-\mu_\theta)^T \Sigma_\theta^{-1}(m_\theta-\mu_\theta) - \frac{1}{2}\mathrm{tr}\left(\Sigma_\theta^{-1} S_\theta\right) - \frac{1}{2}\ln|\Sigma_\theta| + \frac{1}{2}\ln|S_\theta| \qquad (6.24) $$

Equation (6.24) corresponds to equation (6) of the main text.
□
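As a numerical sanity check of the derivation above, the free energy expression (6.22) (including the additive constants) can be compared against a Monte Carlo estimate of the defining integral (4) for a linear model f(θ) = Aθ, for which the first-order Taylor step is exact. All numbers below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 2
A = rng.normal(size=(n, m))               # linear model: f(theta) = A theta
lam_y = 2.0                               # Sigma_y = lam_y^{-1} I_n
mu, Sig = np.zeros(m), np.eye(m)          # prior N(mu_theta, Sigma_theta)
y = rng.normal(size=n)
m_t = rng.normal(size=m)                  # variational expectation m_theta
S_t = np.array([[0.5, 0.1], [0.1, 0.3]])  # variational covariance S_theta

# Analytic fixed-form free energy (6.22), with J_f = A for linear f:
r = y - A @ m_t
F = (-0.5 * n * np.log(2 * np.pi) + 0.5 * n * np.log(lam_y)
     - 0.5 * lam_y * (r @ r) - 0.5 * lam_y * np.trace(A.T @ A @ S_t)
     - 0.5 * m * np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(Sig))
     - 0.5 * (m_t - mu) @ np.linalg.solve(Sig, m_t - mu)
     - 0.5 * np.trace(np.linalg.solve(Sig, S_t))
     + 0.5 * np.log(np.linalg.det(S_t)) + 0.5 * m * np.log(2 * np.pi * np.e))

# Monte Carlo estimate of the integral (4): E_q[ln p(y,theta) - ln q(theta)]
th = rng.multivariate_normal(m_t, S_t, size=200_000)
d1 = y - th @ A.T
lp_y = -0.5 * (n * np.log(2 * np.pi) - n * np.log(lam_y) + lam_y * (d1 ** 2).sum(1))
d2 = th - mu                              # prior covariance is the identity here
lp_th = -0.5 * (m * np.log(2 * np.pi) + (d2 ** 2).sum(1))
d3 = th - m_t
S_inv = np.linalg.inv(S_t)
lq = -0.5 * (m * np.log(2 * np.pi) + np.log(np.linalg.det(S_t))
             + np.einsum('ij,jk,ik->i', d3, S_inv, d3))
F_mc = np.mean(lp_y + lp_th - lq)
print(F, F_mc)   # should agree up to Monte Carlo error
```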
(3) Optimization of the variational free energy
Optimizing, i.e. finding a minimum or a maximum of, a nonlinear multivariate real-valued function of the form $h : \mathbb{R}^n \to \mathbb{R}$, such as the free energy function $F$ above, with respect to its input arguments is a central problem in the mathematical theory of nonlinear optimization. Intuitively, many methods of nonlinear optimization are based on a simple premise: from basic calculus we know that a necessary condition for an extremal point at a given location in the input space of a function is that the first derivative evaluates to zero at this point, i.e. the function is neither increasing nor decreasing. If one extends this idea to functions of multidimensional entities, one can show that one may maximize the function $F$ with respect to its input argument $S_\theta$ based on a simple formula. Omitting all terms of the function $F$ that do not depend on $S_\theta$, and which hence do not contribute to changes in the value of $F$ as $S_\theta$ changes, we write the first derivative of $F$ with respect to $S_\theta$ suggestively as

$$ \frac{\partial}{\partial S_\theta} F(m_\theta, S_\theta) = -\frac{\lambda_y}{2} J_f(m_\theta)^T J_f(m_\theta) - \frac{1}{2}\Sigma_\theta^{-1} + \frac{1}{2} S_\theta^{-1} \qquad (1) $$

Setting the derivative of $F$ with respect to $S_\theta$ to zero and solving for the extremal argument $S_\theta^*$ then yields the following update rule for the variational covariance parameter:

$$ S_\theta^* = \left(\lambda_y J_f(m_\theta)^T J_f(m_\theta) + \Sigma_\theta^{-1}\right)^{-1} \qquad (2) $$
Proof of (1)

We only provide a heuristic proof to demonstrate the general idea. A formal mathematical proof would also require the characterization of the function $F$ as a concave function and a sensible notation for derivatives of functions of multivariate entities such as vectors and positive-definite matrices. Here, we use the notation for partial derivatives. We have

$$ \frac{\partial}{\partial S_\theta} F(m_\theta, S_\theta) = -\frac{\lambda_y}{2}\frac{\partial}{\partial S_\theta}\mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right) - \frac{1}{2}\frac{\partial}{\partial S_\theta}\mathrm{tr}\left(\Sigma_\theta^{-1} S_\theta\right) + \frac{1}{2}\frac{\partial}{\partial S_\theta}\ln|S_\theta| \qquad (1.1) $$

which, using the following rules for matrix derivatives involving the trace operator and logarithmic determinants (cf. equations (103) and (57) in (Petersen et al., 2006))

$$ \frac{\partial}{\partial X}\mathrm{tr}(A X^T) = A \quad\text{and}\quad \frac{\partial}{\partial X}\ln|X| = (X^T)^{-1} \qquad (1.2) $$

with $S_\theta = S_\theta^T$, yields

$$ \frac{\partial}{\partial S_\theta} F(m_\theta, S_\theta) = -\frac{\lambda_y}{2} J_f(m_\theta)^T J_f(m_\theta) - \frac{1}{2}\Sigma_\theta^{-1} + \frac{1}{2} S_\theta^{-1} \qquad (1.3) $$

Setting the above to zero then yields the equivalent relations

$$ \frac{\partial}{\partial S_\theta} F(m_\theta, S_\theta) = 0 \;\Leftrightarrow\; -\frac{\lambda_y}{2} J_f(m_\theta)^T J_f(m_\theta) - \frac{1}{2}\Sigma_\theta^{-1} + \frac{1}{2} S_\theta^{-1} = 0 \;\Leftrightarrow\; S_\theta^* = \left(\lambda_y J_f(m_\theta)^T J_f(m_\theta) + \Sigma_\theta^{-1}\right)^{-1} \qquad (1.4) $$

□
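The update rule (2) can be checked numerically: for a fixed Jacobian, the S_θ-dependent part of the free energy should attain its maximum at S_θ*. The Jacobian, precision and perturbation scales below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 3
J = rng.normal(size=(n, m))      # stands in for the Jacobian J_f(m_theta)
lam_y = 1.5
Sig_inv = np.eye(m)              # prior precision Sigma_theta^{-1}

def F_S(S):
    """The S_theta-dependent terms of the free energy function (6)."""
    _, logdet = np.linalg.slogdet(S)
    return (-0.5 * lam_y * np.trace(J.T @ J @ S)
            - 0.5 * np.trace(Sig_inv @ S)
            + 0.5 * logdet)

S_star = np.linalg.inv(lam_y * J.T @ J + Sig_inv)   # update rule (2)

# F is strictly concave in S_theta on the positive-definite cone, so random
# positive-definite perturbations of S* should never increase the value.
ok = True
for _ in range(100):
    E = rng.normal(size=(m, m)) * 0.01
    S_pert = S_star + 0.5 * (E + E.T) + 0.05 * np.eye(m)
    ok = ok and F_S(S_pert) <= F_S(S_star)
print(ok)
```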
In contrast to the variational covariance parameter, maximization of the variational free energy function with respect to $m_\theta$ cannot be achieved analytically, but requires an iterative numerical optimization algorithm. In the DCM literature, the algorithm employed to this end is fairly specific, but related to standard nonlinear optimization algorithms such as gradient and Newton methods. To simplify the notational complexity of the discussion below, we rewrite the function $F$ as a function of only the variational expectation parameter, assuming that it has been maximized with respect to the variational covariance parameter $S_\theta$ previously. The function of interest then takes the form

$$ F : \mathbb{R}^m \to \mathbb{R},\; m_\theta \mapsto F(m_\theta) \qquad (3) $$

where

$$ F(m_\theta) = -\frac{\lambda_y}{2}(y-f(m_\theta))^T(y-f(m_\theta)) - \frac{\lambda_y}{2}\mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right) - \frac{1}{2}(m_\theta-\mu_\theta)^T \Sigma_\theta^{-1}(m_\theta-\mu_\theta) \qquad (4) $$

Numerical optimization schemes usually work by "guessing" an initial value for the maximization argument of the nonlinear function under study and then iteratively updating this guess according to some update rule. A basic gradient ascent scheme for the function specified in (3) is provided in Table 1.
Initialization
0. Define a starting point $m_\theta^{(0)} \in \mathbb{R}^m$, a step-size $\kappa > 0$, and set $k := 0$. If $\nabla F(m_\theta^{(0)}) = 0$, stop: $m_\theta^{(0)}$ is a zero of $\nabla F$. If not, proceed to the iterations.

Until convergence
1. Set $m_\theta^{(k+1)} := m_\theta^{(k)} + \kappa \nabla F(m_\theta^{(k)})$.
2. If $\nabla F(m_\theta^{(k+1)}) = 0$, stop: $m_\theta^{(k+1)}$ is a zero of $\nabla F$. If not, go to 3.
3. Set $k := k+1$ and go to 1.

Table 1. Gradient ascent algorithm for the determination of an optimal variational expectation parameter $m_\theta$.
A prerequisite for the application of the algorithm described in Table 1 is the availability of the gradient of the function $F$ evaluated at $m_\theta^{(k)}$ for $k = 0, 1, 2, \ldots$. In the DCM literature, it is proposed to approximate this gradient analytically by omitting higher derivatives of the function $f$ with respect to $m_\theta$. The function (4) comprises first derivatives of the function $f$ with respect to $m_\theta$ in the form of the Jacobian $J_f(m_\theta)$ in the second term. If this term is omitted, the gradient of $F$ evaluates to

$$ \nabla F(m_\theta) = \lambda_y J_f(m_\theta)^T (y - f(m_\theta)) - \Sigma_\theta^{-1}(m_\theta - \mu_\theta) \qquad (5) $$

and the update rule for the variational expectation parameter takes the form

$$ m_\theta^{(k+1)} = m_\theta^{(k)} + \kappa\left(\lambda_y J_f\left(m_\theta^{(k)}\right)^T \left(y - f\left(m_\theta^{(k)}\right)\right) - \Sigma_\theta^{-1}\left(m_\theta^{(k)} - \mu_\theta\right)\right) \quad (k = 0, 1, \ldots) \qquad (6) $$
Proof of (5)

$$ \nabla F(m_\theta) = -\frac{\lambda_y}{2}\frac{\partial}{\partial m_\theta}\left((y-f(m_\theta))^T(y-f(m_\theta))\right) - \frac{\lambda_y}{2}\frac{\partial}{\partial m_\theta}\mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right) - \frac{1}{2}\frac{\partial}{\partial m_\theta}\left((m_\theta-\mu_\theta)^T \Sigma_\theta^{-1}(m_\theta-\mu_\theta)\right) \qquad (5.1) $$

Notably, the second term above involves second-order derivatives of the function $f$ with respect to $m_\theta$. Following the DCM literature, we neglect this term and obtain, using the rules of the calculus of multivariate real-valued functions (Petersen et al., 2006) and the chain rule (note that $\frac{\partial}{\partial m_\theta}(y-f(m_\theta)) = -J_f(m_\theta)$),

$$ \nabla F(m_\theta) = -\frac{\lambda_y}{2}\cdot 2\left(-J_f(m_\theta)\right)^T (y-f(m_\theta)) - \frac{1}{2}\cdot 2\,\Sigma_\theta^{-1}(m_\theta-\mu_\theta) = \lambda_y J_f(m_\theta)^T (y-f(m_\theta)) - \Sigma_\theta^{-1}(m_\theta-\mu_\theta) \qquad (5.2) $$

□
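For a linear model f(θ) = Aθ, the maximizer of the free energy with respect to m_θ has a closed form, so the gradient ascent of Table 1 with the gradient (5) can be checked directly. A minimal sketch with illustrative numbers:

```python
import numpy as np

# Gradient ascent (Table 1) with the approximate gradient (5) for a linear
# model f(theta) = A theta, for which the maximizer is available in closed
# form: m* = (lam_y A^T A + Sigma^{-1})^{-1} (lam_y A^T y + Sigma^{-1} mu).
rng = np.random.default_rng(2)
n, m = 6, 2
A = rng.normal(size=(n, m))
lam_y = 1.0
Sig_inv = np.eye(m)                   # prior precision Sigma_theta^{-1}
mu = np.zeros(m)
y = rng.normal(size=n)

def grad_F(m_t):
    # approximate gradient (5); J_f = A for linear f
    return lam_y * A.T @ (y - A @ m_t) - Sig_inv @ (m_t - mu)

m_t = np.zeros(m)                     # starting point m_theta^{(0)}
kappa = 0.02                          # illustrative step size
for _ in range(2000):                 # update rule (6)
    m_t = m_t + kappa * grad_F(m_t)

m_star = np.linalg.solve(lam_y * A.T @ A + Sig_inv, lam_y * A.T @ y + Sig_inv @ mu)
print(np.max(np.abs(m_t - m_star)))  # should be numerically zero
```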
From a technical viewpoint, plain gradient ascent schemes for the maximization of nonlinear functions are suboptimal and thus rarely employed in numerical computing. Furthermore, as can be shown in simple univariate examples, the approximated gradient can easily fail to reliably identify the necessary condition for an extremal point. A more robust method is a globalized Newton scheme with Hessian modification and numerically evaluated gradients and Hessians, as shown in Table 2.
Initialization
0. Define a starting point m_θ^(0) ∈ ℝ^m and set k := 0. If ∇F(m_θ^(0)) = 0, stop! m_θ^(0) is a zero of ∇F. If not, proceed to iterations.
Until Convergence
1. Evaluate the Newton search direction p_k := (H_F(m_θ^(k)))^{-1} ∇F(m_θ^(k)).
2. If p_k^T ∇F(m_θ^(k)) < 0, p_k is a descent direction. In this case, modify H_F(m_θ^(k)) to render it positive-definite and re-evaluate p_k.
3. Evaluate a step-size t_k fulfilling the sufficient increase condition (the first Wolfe condition) using the following backtracking algorithm: set t_k := 1 and select ρ ∈ ]0,1[, c ∈ ]0,1[. Until F(m_θ^(k) + t_k p_k) ≥ F(m_θ^(k)) + c t_k ∇F(m_θ^(k))^T p_k, set t_k := ρ t_k.
4. Set m_θ^(k+1) := m_θ^(k) + t_k p_k.
5. If ∇F(m_θ^(k+1)) = 0, stop! m_θ^(k+1) is a zero of ∇F. If not, go to 6.
6. Set k := k + 1 and go to 1.
Table 2. A globalized Newton method with Hessian modification
Intuitively, the algorithm described in Table 2 works by approximating the target function F at the current iterand m_θ^(k) by a second-order Taylor expansion and analytically determining an extremal point of this approximation at each iteration. The location of this extremal point corresponds to the search direction p_k. If, however, the Hessian H_F(m_θ^(k)) is not positive-definite, there is no guarantee that p_k is an ascent direction. This is often the case in regions far away from a local extremal point. Thus, a number of modification techniques have been developed that minimally change H_F(m_θ^(k)), but render it positive-definite, such that an ascent direction is obtained. Finally, on each iteration the Newton step-size t_k is determined so as to yield an increase in the target function while avoiding overly short steps. This approach is referred to as backtracking, and the conditions for sensible step-lengths are given by the Wolfe conditions. Notably, the algorithm shown in Table 2 is a standard approach for the iterative optimization of a nonlinear function, and hence analytical results on its performance bounds are available.
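A sketch of the scheme in Table 2 with numerically evaluated gradients and Hessians is given below. The eigenvalue-based Hessian modification and the concave quadratic toy target are illustrative choices, standing in for the more careful modification techniques discussed in the numerical optimization literature.

```python
import numpy as np

def num_grad(F, m, h=1e-5):
    """Central finite-difference gradient of F at m."""
    return np.array([(F(m + h * e) - F(m - h * e)) / (2 * h)
                     for e in np.eye(m.size)])

def num_hess(F, m, h=1e-4):
    """Central finite-difference Hessian of F at m, symmetrized."""
    E = np.eye(m.size)
    H = np.array([[(F(m + h * ei + h * ej) - F(m + h * ei - h * ej)
                    - F(m - h * ei + h * ej) + F(m - h * ei - h * ej))
                   / (4 * h * h) for ej in E] for ei in E])
    return 0.5 * (H + H.T)

def newton_ascent(F, m0, rho=0.5, c=1e-4, tol=1e-5, max_iter=100):
    """Globalized Newton ascent as in Table 2."""
    m = np.asarray(m0, dtype=float)
    for _ in range(max_iter):
        g = num_grad(F, m)
        if np.linalg.norm(g) < tol:          # steps 0/5: stationary point
            break
        H = num_hess(F, m)
        p = np.linalg.solve(H, g)            # step 1: Newton direction
        if p @ g < 0:                        # step 2: descent direction, so
            w, V = np.linalg.eigh(H)         # modify H: flip/floor its
            w = np.maximum(np.abs(w), 1e-8)  # eigenvalues to render it
            p = V @ ((V.T @ g) / w)          # positive-definite
        t = 1.0                              # step 3: backtracking
        while F(m + t * p) < F(m) + c * t * (g @ p):
            t *= rho
        m = m + t * p                        # step 4
    return m

# Concave quadratic toy target with maximum at (1, -0.5)
F_toy = lambda m: -(m[0] - 1.0) ** 2 - 2.0 * (m[1] + 0.5) ** 2
m_star = newton_ascent(F_toy, np.array([3.0, 2.0]))
```

For this concave quadratic, the modified Newton direction equals the exact Newton ascent step, so the scheme reaches the maximum essentially in one iteration.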
The optimization scheme actually employed in much of the DCM simulation literature, however, is
more specific. It derives from the local linearization method for nonlinear stochastic dynamical systems and
is usually formulated in differential equation form. In Table 4, we provide a basic numerical optimization
formulation of this scheme.
Initialization
0. Define a starting point m_θ^(0) ∈ ℝ^m and set k := 0. If ∇F(m_θ^(0)) = 0, stop! m_θ^(0) is a zero of ∇F. If not, proceed to iterations.
Until Convergence
1. Set m_θ^(k+1) := m_θ^(k) + (exp(τ H_F(m_θ^(k))) − I) H_F(m_θ^(k))^{-1} ∇F(m_θ^(k)).
2. If ∇F(m_θ^(k+1)) = 0, stop! m_θ^(k+1) is a zero of ∇F. If not, go to 3.
3. Set k := k + 1 and go to 1.
Table 4. Local linearization based gradient ascent algorithm
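A sketch of the update in Table 4 is given below, with the matrix exponential computed via an eigendecomposition of the (symmetric) Hessian. The concave quadratic toy target and the value τ = 1 are illustrative assumptions.

```python
import numpy as np

def expm_sym(M):
    """Matrix exponential of a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.exp(w)) @ V.T

def ll_ascent(grad_F, hess_F, m0, tau=1.0, tol=1e-8, max_iter=1_000):
    """Local linearization update of Table 4:
    m <- m + (exp(tau * H_F(m)) - I) H_F(m)^{-1} grad_F(m)."""
    m = np.asarray(m0, dtype=float)
    I = np.eye(m.size)
    for _ in range(max_iter):
        g = grad_F(m)
        if np.linalg.norm(g) < tol:
            break
        H = hess_F(m)
        m = m + (expm_sym(tau * H) - I) @ np.linalg.solve(H, g)
    return m

# Concave quadratic toy target F(m) = -0.5 (m - a)^T C (m - a):
# grad F(m) = -C (m - a), H_F(m) = -C, maximum at a
C = np.diag([1.0, 3.0])
a = np.array([2.0, -1.0])
m_star = ll_ascent(lambda m: -C @ (m - a), lambda m: -C, np.zeros(2))
```

For this quadratic target the iterates satisfy m_θ^(k+1) − a = exp(τ H_F)(m_θ^(k) − a), so with a negative-definite Hessian the scheme contracts toward the maximum for any τ > 0.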