CRYO-EM: BREAKTHROUGHS IN CHEMISTRY, STRUCTURAL BIOLOGY, AND STATISTICAL UNDERPINNINGS

By

Tze Leung Lai
Stanford University

Shao-Hsuan Wang, Yi-Ching Yao, Szu-Chi Chung, Wei-Hau Chang, I-Ping Tu
Academia Sinica, Taiwan

Technical Report No. 2020-14
November 2020

This research was supported in part by National Science Foundation grant DMS 1811818.

Department of Statistics
STANFORD UNIVERSITY
Stanford, California 94305-4065

http://statistics.stanford.edu
Cryo-EM: Breakthroughs in Chemistry, Structural Biology, and Statistical Underpinnings

Tze Leung Lai^1, Shao-Hsuan Wang^2, Yi-Ching Yao^2, Szu-Chi Chung^2, Wei-Hau Chang^3, and I-Ping Tu^2

^1 Department of Statistics, Stanford University, USA
^2 Institute of Statistical Science, Academia Sinica, Taiwan
^3 Institute of Chemistry, Academia Sinica, Taiwan

August 10, 2020
Abstract

Cryogenic electron microscopy (cryo-EM) has revolutionized structural biology, organic and medicinal chemistry, and molecular and cellular physiology; its fundamental importance was recognized by the 2017 Nobel Prize in Chemistry. Herein we first review the statistical underpinnings of high-resolution 3D image reconstruction from 2D cryo-EM data, which have three characteristic features: missing data, high noise level, and massive datasets. We then discuss challenges and opportunities for statistical science posed by high-resolution structure determination using cryo-EM, and review recent advances in high-dimensional multivariate analysis, dimension reduction, maximum a posteriori estimation of latent variables, hidden Markov models, and uncertainty quantification in this connection.

Keywords: Cross correlation, cryo-EM, hidden Markov models, MCMC, model bias, multivariate analysis, regularized likelihood
1 Introduction
Cryogenic electron microscopy (cryo-EM) is an imaging technique that uses transmitted electron waves to obtain projection images of a biological sample. In contrast to X-ray crystallography, single particle cryo-EM does not need crystals and is thereby amenable to structural determination of proteins that are refractory to crystallization, including membrane proteins and yeast spliceosomes that exhibit dynamic patterns (Liao et al. (2013); Yan et al. (2015)). This capability enables single particle cryo-EM to record structures in solution, and the single particle cryo-EM breakthroughs in high-resolution structure determination pioneered by Jacques Dubochet, Joachim Frank, and Richard Henderson were awarded the Nobel Prize in Chemistry in 2017.
Although single particle cryo-EM analysis has become the mainstream method for solving high-resolution 3D structure density maps of biomolecules, cryo-EM images are extremely noisy, with a signal-to-noise ratio often less than 0.1. As a result, a typical cryo-EM experiment tends to collect a large number of particle images (usually more than hundreds of thousands) and to compensate for the noise contamination by averaging. The size of a cryo-EM particle image often exceeds 100 pixels in each direction. The data characteristics of cryo-EM images thus include strong noise contamination, huge dimension, and large sample size, making their processing and statistical analysis very challenging. Henderson (2013) pointed out how spurious patterns could easily emerge by averaging a large number of white-noise images aligned to a reference image through rotation and translation. In particular, using Einstein’s facial image as the reference, a blurred Einstein’s face emerged by averaging 1000 aligned white-noise images, a phenomenon he dubbed “Einstein from noise”.
Section 2 explains the “Einstein from noise” phenomenon from the statistical perspective of multivariate analysis, extreme value theory, and variable selection bias in analyzing high-dimensional data. Section 3 begins with a review of how noisy cryo-EM images are analyzed to circumvent such bias and also to avoid overfitting, culminating in “near-atomic-resolution (i.e., 3-4 Å) structures for several icosahedral viruses and resolutions in the range of 4-6 Å for complexes with less or no symmetry” mentioned by Scheres (2012b) and explained in Section 3.2 below, where the background of 3-4 Å and 4-6 Å for near-atomic resolutions is also provided. It then discusses the statistical underpinnings of this work and subsequent developments in cryo-EM data collection and analysis in studies of molecular and cellular physiology or protein/RNA/virus/toxin/organometallic compound structure, and describes some recent advances in hidden Markov models that can pave the way toward a definitive solution. Section 4 describes the opportunities and challenges cryo-EM image analysis poses to statistical and computational sciences, and gives further discussion, additional background, and concluding remarks.
2 A statistical perspective of “Einstein from noise”
“Einstein from noise” actually refers to the work of Stewart and Grigorieff (2004), who performed a simulation experiment generating 1000 white-noise images and aligning each of them to Einstein’s facial image through rotation and translation. A blurred Einstein’s face emerged from averaging the 1000 aligned images. Henderson (2013) used this phenomenon to warn the community that an incorrect 3D density map could be constructed when data are blindly fitted to a model. In this section, we provide a statistical explanation and theory of this phenomenon. To simplify the presentation, we do not delve into the technical details of how rotating an image may destroy the pixel format; these details and how to address them will be given in Section 3. Instead we treat an image as a vector of dimension $p$ and a white-noise image as a random vector uniformly distributed on the $(p-1)$-dimensional unit sphere. The cross correlation of two images is defined as the inner product of the corresponding vectors. Section 2.1 describes the statistical model and presents a simulation study with $n = 10^6$ white-noise images with the pixel number $p = 100 \times 100 = 10^4$. Among the one million white-noise images, the largest cross correlation value with Einstein’s facial image (the reference) is 0.048, and yet the cross correlation increases dramatically to 0.652 after averaging the 600 images that have the largest cross correlation values with Einstein’s facial image, thereby illustrating the essence of the “Einstein from noise” phenomenon. Section 2.2 connects this phenomenon to extreme value theory and multivariate analysis.
2.1 White-noise Images and Image Selection Bias in Reference Alignment
Let $R$ be the reference matrix (the digital version of the reference image) of dimension $d_1 \times d_2$. We assume that $\|R\| = 1$, where $\|\cdot\|$ denotes the Frobenius norm of a matrix or the Euclidean norm of a vector. We generate $n$ independent and identically distributed (iid) white-noise images as follows. Let $Z_1, \ldots, Z_n$ be iid $d_1 \times d_2$ random matrices such that the $d_1 d_2$ components of each $Z_i$ are iid standard normal. We refer to $Z_i/\|Z_i\|$ (the normalized version of $Z_i$), $i = 1, \ldots, n$, as $n$ iid white-noise images. Let $r = \mathrm{vec}(R)$, the $p$-dimensional column vector which is the vectorized version of $R$, where $p = d_1 d_2$. The fact that $\|r\| = 1$ implies $r \in S^{p-1}$ (the $(p-1)$-dimensional unit sphere). Let $X_i = \mathrm{vec}(Z_i)/\|Z_i\|$. Thus, $X_1, \ldots, X_n$ are iid uniformly distributed on $S^{p-1}$. We refer to both $Z_i/\|Z_i\|$ and $X_i$ as the $i$-th white-noise image. The cross correlation of $X_i$ and $r$ (or equivalently of $Z_i/\|Z_i\|$ and $R$) is defined as $r^\top X_i$ (the inner product of $X_i$ and $r$), where $r^\top$ denotes the transpose of $r$. Note that $r^\top X_i = \cos\Theta_i$, where $\Theta_i$ is the angle between $r$ and $X_i$. The $n$ white-noise images are ordered (and denoted by $X_{(1)}, \ldots, X_{(n)}$) according to their cross correlation values with $r$. In other words, $(X_{(1)}, \ldots, X_{(n)})$ is a permutation of $(X_1, \ldots, X_n)$ such that $r^\top X_{(1)} \ge r^\top X_{(2)} \ge \cdots \ge r^\top X_{(n)}$. Let $\Theta_{1:n} \le \Theta_{2:n} \le \cdots \le \Theta_{n:n}$ be the order statistics of the angles $(\Theta_1, \ldots, \Theta_n)$, so that $\cos\Theta_{i:n} = r^\top X_{(i)}$, $i = 1, \ldots, n$. Let $\bar{X}_m = m^{-1}\sum_{i=1}^m X_{(i)}$. Then $\bar{X}_m/\|\bar{X}_m\| \in S^{p-1}$ is the normalized average of the $m$ white-noise images that are most highly cross-correlated with the reference image. Our goal is to find a good approximation of the distribution of $\rho_{n,p,m} = r^\top \bar{X}_m/\|\bar{X}_m\|$ when $n$, $p$, and $m$ are large. Note that for $m = 1$, $\rho_{n,p,1} = r^\top X_{(1)} = \cos\Theta_{1:n}$ is the largest cross correlation value. Figures 1 and 2 summarize our simulation study.
Figure 1 consists of the reference image, which is Einstein’s face, and 6 images corresponding to $m^{-1}\sum_{i=1}^m Z_{(i)}/\|Z_{(i)}\|$ for $m = 100, \ldots, 600$, where $Z_{(i)}$ ($i = 1, \ldots, 6$) have the 6 largest cross correlation (CC) values (0.048, 0.046, 0.045, 0.044, 0.044, 0.044) with the reference image. Note that Einstein’s face clearly emerges at $m = 300, 400, 500, 600$, with different degrees of blurring, corresponding to CC values 0.540, 0.585, 0.623, 0.652. Figure 2 shows similar results with three different reference images ranging from a simple circle to a tree frog, indicating that the phenomenon of “Einstein from noise” is robust across various reference images. The cross correlation values in Figure 2 are about the same across different reference images. This can be explained by the fact that if $X$ is uniformly distributed on $S^{p-1}$, then the distribution of $r^\top X$ does not depend on $r$.
Figure 1: Example with Einstein’s face as the reference image.
Figure 2: Reference images ranging from single circle to tree frog.
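The selection-bias mechanism behind these figures can be reproduced in miniature with a few lines of code. The sketch below is our own illustration (not the authors' code) and uses smaller $n$ and $p$ than the paper's $n = 10^6$, $p = 10^4$ to keep memory modest; the qualitative effect is the same: every individual white-noise image is nearly uncorrelated with the reference, yet the normalized average of the top-$m$ images is strongly correlated with it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 20_000, 400, 600          # smaller than the paper's n = 10^6, p = 10^4

# Reference "image" r and n white-noise images, all unit vectors in R^p
r = rng.standard_normal(p)
r /= np.linalg.norm(r)
Z = rng.standard_normal((n, p))
X = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # uniform on S^{p-1}

cc = X @ r                                   # cross correlations with the reference
top = np.argsort(cc)[-m:]                    # indices of the m best-matching images
xbar = X[top].mean(axis=0)                   # average of the selected images
rho = r @ xbar / np.linalg.norm(xbar)        # cross correlation of the average

print(cc.max(), rho)   # individual correlations stay small; rho is dramatically larger
```

Selecting on the largest cross correlations and then averaging is exactly the alignment-plus-averaging operation that produces "Einstein from noise".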
2.2 Extreme Value Theory and Multivariate Analysis
Recall that $\cos\Theta_{1:n}$ is the largest cross correlation, whose distribution can be approximated as follows (when $n$ and $p$ are large). Let
$$K_{n,p} = -\ln n + \frac{1}{2}\ln\ln n - \frac{1}{2}\ln\left(\frac{(2\ln n)/p}{1 - \exp\{-(2\ln n)/p\}}\right) + \frac{1}{2}\ln(4\pi).$$
Then
$$\frac{p-1}{2}\ln(1 - \cos^2\Theta_{1:n}) - K_{n,p} \xrightarrow{d} G \quad \text{uniformly as } \min\{n, p\} \to \infty, \qquad (1)$$
where $\xrightarrow{d}$ denotes convergence in distribution and the cumulative distribution function of $G$ is given by $G(t) = 1 - e^{-e^t}$, $t \in \mathbb{R}$, which is known as the extreme value distribution of Gumbel type. Based on (1), for fixed $\alpha \in (0, 1)$ and for large $n$ and $p$, an approximate $100\alpha$-th quantile of the distribution of $\cos\Theta_{1:n}$ is
$$M_{n,p}(\alpha) = \sqrt{1 - \exp\left\{2\left(K_{n,p} + \ln\ln\alpha^{-1}\right)/(p-1)\right\}}.$$
Recall that $\cos\Theta_{1:n} = 0.048$ in the simulation study summarized in Figure 1, where $n = 10^6$ and $p = 10^4$. This observed value is compatible with the approximate 25th, 50th, and 75th quantiles, i.e., $M_{10^6,10^4}(0.25) = 0.046$, $M_{10^6,10^4}(0.5) = 0.047$, $M_{10^6,10^4}(0.75) = 0.052$. Figure 3 plots $M_{n,p}(\alpha)$ versus $\log_{10} n$ for $n \le 10^{100}$ with $p = 10^4$ and $\alpha = .1, .5, .9$. Note that the three quantile curves are very close to each other, indicating that $\cos\Theta_{1:n}$ has a small standard deviation. Figure 3 suggests that for $P(\cos\Theta_{1:n} \ge 0.1)$ to be at least 0.1, $n$ is required to be greater than $10^{20}$, and for $P(\cos\Theta_{1:n} \ge 0.2)$ to be at least 0.1, $n$ is required to be greater than $10^{80}$. In other words, it is unlikely for any of $n$ iid white-noise images of dimension $100 \times 100$ to have a cross correlation value with Einstein’s face greater than 0.2 unless $n$ is astronomically large.
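The quantile approximation is straightforward to evaluate numerically. The following sketch (our own, implementing the displayed formulas for $K_{n,p}$ and $M_{n,p}(\alpha)$) yields quantiles in the 0.045-0.05 range for $n = 10^6$ and $p = 10^4$, consistent with the largest observed cross correlation 0.048 in the simulation.

```python
import numpy as np

def K(n, p):
    """K_{n,p} from the Gumbel approximation (1)."""
    x = 2.0 * np.log(n) / p
    return (-np.log(n) + 0.5 * np.log(np.log(n))
            - 0.5 * np.log(x / (1.0 - np.exp(-x)))
            + 0.5 * np.log(4.0 * np.pi))

def M(n, p, alpha):
    """Approximate 100*alpha-th quantile of cos(Theta_{1:n})."""
    return np.sqrt(1.0 - np.exp(2.0 * (K(n, p) + np.log(np.log(1.0 / alpha))) / (p - 1.0)))

print(M(1e6, 1e4, 0.25), M(1e6, 1e4, 0.5), M(1e6, 1e4, 0.75))
```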
By (1), when $n$ and $p$ are large, $\ln(1 - \cos^2\Theta_{1:n}) = -2p^{-1}(\ln n)(1 + o_p(1))$, hence
$$\cos\Theta_{1:n} = \sqrt{\frac{2\ln n}{p}}\,(1 + o_p(1)), \qquad (2)$$
if $(\ln n)/p = o(1)$. Thus, under the condition $(\ln n)/p = o(1)$, with high probability the $n$ iid white-noise images all have negligible cross correlation values with the reference.
On the other hand, the cross correlation $\rho_{n,p,m}$ of $\bar{X}_m/\|\bar{X}_m\|$ and $r$ may be significantly greater than 0 if $m = m_n$ grows with $n$. We now sketch the derivation of a crude approximation of $\rho_{n,p,m}$ when $p = p_n$ and $m = m_n$ both grow with $n$. Since each $X_i$ is uniformly distributed on $S^{p-1}$, the distribution of $r^\top X_i$ does not depend on $r$. Therefore, the distribution of $\rho_{n,p,m} = r^\top\bar{X}_m/\|\bar{X}_m\|$ does not depend on $r$. In this subsection, for convenience we take $r = (1, 0, \ldots, 0)^\top$ as the reference vector. We begin by decomposing each $X_{(i)}$ into two components, one parallel to $r$ and the other orthogonal to $r$, namely,
$$X_{(i)} = (r^\top X_{(i)})\,r + X_{(i)}^{\perp} = (\cos\Theta_{i:n})\,r + X_{(i)}^{\perp},$$
where $X_{(i)}^{\perp} = X_{(i)} - (\cos\Theta_{i:n})\,r$ denotes the component of $X_{(i)}$ orthogonal to $r$. (See Figure 4, in which the black vector is the sorted $i$-th white-noise vector $X_{(i)} \in S^{p-1}$, which is decomposed into the blue vector parallel to the reference vector shown in orange and the red vector lying in the subspace orthogonal to $r$; note that the cross correlation $\cos\Theta_{i:n}$ is the inner product of the black vector and the orange vector.)
Figure 3: Approximate 100α-th quantile (α = .1, .5, .9) of cos Θ1:n versus log10 n.
Figure 4: Relationship between the reference vector and a white-noise vector.
Therefore,
$$\bar{X}_m = \frac{1}{m}\sum_{i=1}^m X_{(i)} = \left(\frac{1}{m}\sum_{i=1}^m \cos\Theta_{i:n}\right) r + \frac{1}{m}\sum_{i=1}^m X_{(i)}^{\perp}. \qquad (3)$$
Note that under the condition $(\ln n)/p = o(1)$, (2) implies that $\cos\Theta_{1:n}$ is approximately $\sqrt{(2\ln n)/p}$. In addition to that condition, if $m = m_n$ also grows at a slow rate, then $\frac{1}{m}\sum_{i=1}^m \cos\Theta_{i:n}$ will only be slightly less than $\cos\Theta_{1:n}$, so that
$$\frac{1}{m}\sum_{i=1}^m \cos\Theta_{i:n} = \sqrt{\frac{2\ln n}{p}}\,(1 + o_p(1)).$$
On the other hand,
$$1 \ge \|X_{(i)}^{\perp}\| = \sqrt{1 - \cos^2\Theta_{i:n}} \ge \sqrt{1 - \cos^2\Theta_{1:n}} \approx \sqrt{1 - \frac{2\ln n}{p}},$$
showing that $\|X_{(i)}^{\perp}\|$ is nearly equal to 1 under the specified condition. Furthermore, it is readily seen that the (normalized) orthogonal components $X_{(i)}^{\perp}/\|X_{(i)}^{\perp}\|$, $i = 1, \ldots, n$, are iid uniformly distributed on the $(p-2)$-dimensional unit sphere $\{(0, x_2, \ldots, x_p) : x_2^2 + \cdots + x_p^2 = 1\}$, which implies that if $m = m_n$ grows slowly and is much smaller than $p = p_n$, then with high probability $X_{(i)}^{\perp}/\|X_{(i)}^{\perp}\|$, $i = 1, \ldots, m$ (and hence $X_{(i)}^{\perp}$, $i = 1, \ldots, m$) are nearly orthogonal to each other. It then follows that the length of $\sum_{i=1}^m X_{(i)}^{\perp}$ is approximately $\sqrt{m}$. Combining all these arguments together with (3) yields that with high probability
$$\rho_{n,p,m}^2 = \left(\frac{r^\top\bar{X}_m}{\|\bar{X}_m\|}\right)^2 = \frac{\left(\frac{1}{m}\sum_{i=1}^m \cos\Theta_{i:n}\right)^2}{\left(\frac{1}{m}\sum_{i=1}^m \cos\Theta_{i:n}\right)^2 + \left\|\frac{1}{m}\sum_{i=1}^m X_{(i)}^{\perp}\right\|^2} \approx \frac{\frac{2\ln n}{p}}{\frac{2\ln n}{p} + \frac{1}{m}} = \frac{\frac{2m\ln n}{p}}{1 + \frac{2m\ln n}{p}}.$$
More precisely, when $n$, $p$ and $m$ are all large, we have the following asymptotic results. If $p = p_n$ satisfies $(\ln n)^2/p = o(1)$ and $m = m_n$ satisfies $m/n = o(1)$, then
$$\rho_{n,p,m}^2 = \frac{\beta_{n,p,m}}{1 + \beta_{n,p,m}} + o_p(1), \qquad (4)$$
where
$$\beta_{n,p,m} = \frac{m}{p}\left\{2\ln\frac{n}{m} - \ln\ln\frac{n}{m} - \ln(4\pi) + 2\right\}$$
is a model bias index. If $m = m_n$ satisfies the stronger condition $m(\ln\ln n)^4/(\ln n)^2 = o(1)$, then
$$\frac{p\,(1 + \beta_{n,p,m})^2}{\left(8m + 2p\,\beta_{n,p,m}^2\right)^{1/2}}\left(\rho_{n,p,m}^2 - \frac{\beta_{n,p,m}}{1 + \beta_{n,p,m}}\right) \xrightarrow{d} N(0, 1). \qquad (5)$$
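As a consistency check (our own computation, using only (4) and the definition of $\beta_{n,p,m}$), plugging the simulation settings of Section 2.1 into the bias index predicts the observed cross correlation of the top-600 average almost exactly:

```python
import numpy as np

def beta(n, p, m):
    """Model bias index beta_{n,p,m} defined above."""
    return (m / p) * (2 * np.log(n / m) - np.log(np.log(n / m)) - np.log(4 * np.pi) + 2)

b = beta(n=1e6, p=1e4, m=600)
rho_pred = np.sqrt(b / (1.0 + b))   # predicted rho from (4), ignoring the o_p(1) term
print(rho_pred)                     # close to the observed value 0.652 from Section 2.1
```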
3 Statistical advances for cryo-EM image analysis
We begin this section with a review and discussion of the statistical underpinnings of cryo-EM image analysis. Noting that “a major goal of structural biology is to provide mechanistic understanding of critical biological processes” and that “the most detailed insights come from atomic structures of macromolecules and complexes in these processes in relevant functional states”, Cheng (2015) says that until the breakthroughs in cryo-EM image analysis “only a few years ago”, the routine method for studies of these atomic structures was X-ray crystallography, which “completely depends on growth of well-ordered 3D crystals”, whose production “is a major bottleneck for challenging targets”. Because of “recent breakthroughs in hardware and software”, which Cheng et al. (2015) review, cryo-EM has emerged as a technique for determining structures at atomic resolution comparable to the crystallographic approaches, and has also determined “a number of structures of proteins and complexes that have vexed crystallographers”. Kimanius et al. (2016) point out that despite such advances, two technological factors still limit wide applicability of cryo-EM. One is limited access to high-end microscopes, and the other is that data processing and analysis require computational resources that are not directly accessible to many labs. They say that “to extract fine structural details, one needs to average over multiple images of identical complexes to cancel noise sufficiently.” They then explain: “This is achieved by isolating two-dimensional particle-projections in the micrographs, which can then be recombined into a three-dimensional structure (Cheng et al., 2015). The latter requires the estimation of the relative orientation of all particles, which can be done by a wide range of different image processing programs” that have to address the issue that “any one data set typically comprises images of multiple different structures”, including multiple conformations and data heterogeneity due to contaminants and other sources, and “the classification of heterogeneous data into homogeneous subsets has therefore proven critical for high resolution structure determination and provides a tool for structural analysis of dynamic systems”. In Section 3.1, we consider some recent developments in statistical science for identifying structurally homogeneous subsets from these heterogeneous cryo-EM data. Kimanius et al. (2016) also note that an increasingly popular choice in their list of image processing programs is RELION (Scheres, 2012b), which uses an empirical Bayes (EB) approach to single-particle analysis (Scheres, 2012a). Section 3.2 reviews this approach and recent developments for its implementation. Section 3.3 describes recent breakthroughs in the more general problem of adaptive filtering in hidden Markov models and how they can be used to develop a definitive alternative to the (somewhat circuitous) EM approach.
3.1 2D Clustering Methods for Cryo-EM Images
An important step in the workflow of cryo-EM image analysis is 2D clustering to identify clean particle image sets, after these images have gone through the multi-reference alignment (MRA) and contrast transfer function steps of the workflow, which will be described in Section 3.2. The K-means approach was first used by van Heel (1984) in an MRA method for clustering randomly oriented biological macromolecules. Sorzano et al. (2010) proposed a K-means algorithm, CL2D, for implementing their clustering approach to MRA of 2D projections in cryo-EM. For a given number $m$ of clusters, CL2D iteratively bisects the data until at least $m$ clusters appear. During the process, CL2D dismisses the clusters whose size is smaller than a pre-specified number such as 30 and bisects the largest cluster once a dismissal is executed. Another novelty of CL2D is that it adopts a kernel-based entropy measure, “correntropy”, to measure the distance, which can mitigate the impact of outliers. When $m$ is large, as in the case of cryo-EM, setting good initial values is important for CL2D to avoid being stuck at local minima. However, Yang et al. (2012) have pointed out the difficulty of finding good initial values and have found the K-means approach to be unsatisfactory. Moreover, it is difficult to prespecify a manageable (not too large) number of clusters for the particle images.
Noting that two major approaches have been adopted in “the vast number of clustering algorithms developed”, namely, model-based and distance-based approaches, Chen et al. (2014) have proposed to combine both approaches into a clustering algorithm called γ-SUP for 2D clustering of cryo-EM images. A model-based approach models the data as a mixture of parametric distributions with the mean estimates used as cluster centers, whereas a distance-based approach uses some distance measure of the similarity between data points, as in the K-means method, hierarchical clustering, and the SUP (self-updating process) method of Shiu and Chen (2016). The γ-SUP algorithm models the data with a q-Gaussian mixture (model-based approach) and uses the γ-divergence (to measure the similarity between the empirical distribution and the model distribution) in SUP (distance-based approach). The q-Gaussian mixture model has a density function of the form
$$f(y) = \sum_{j=1}^m p_j\, g_q(y; \mu_j, \sigma), \quad y \in \mathbb{R}^p, \qquad (6)$$
where $g_q(\cdot; \mu, \sigma)$ is the q-Gaussian density function
$$g_q(y; \mu, \sigma) = (\sqrt{2\pi}\sigma)^{-p}\, c_{p,q} \exp_q\!\left(-\|y - \mu\|^2/(2\sigma^2)\right), \qquad (7)$$
with $q < 1$, the q-exponential function $\exp_q(u) = (1 + (1-q)u)_+^{1/(1-q)}$, and normalizing constant $c_{p,q} = (1-q)^{p/2}\,\Gamma(1 + p/2 + (1-q)^{-1})/\Gamma(1 + (1-q)^{-1})$. These distributions have compact support in $\mathbb{R}^p$. Letting $q \to 1$ yields the multivariate normal density with mean vector $\mu$ and covariance matrix $\sigma^2 I_p$.
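As a quick sanity check on (7) (our own snippet, not from Chen et al.), one can verify numerically that $g_q$ integrates to one; here in the univariate case $p = 1$ with $q = 0.5$, for which the support is $|y - \mu| \le 2\sigma$:

```python
import numpy as np
from math import gamma, pi, sqrt

def g_q(y, mu=0.0, sigma=1.0, q=0.5, p=1):
    """q-Gaussian density (7) with q < 1, which has compact support."""
    nu = 1.0 / (1.0 - q)
    c = (1.0 - q) ** (p / 2) * gamma(1 + p / 2 + nu) / gamma(1 + nu)
    base = np.clip(1.0 - (1.0 - q) * (y - mu) ** 2 / (2 * sigma ** 2), 0.0, None)
    return c * base ** nu / (sqrt(2 * pi) * sigma) ** p

y = np.linspace(-3.0, 3.0, 600_001)        # grid covering the support [-2, 2]
total = np.sum(g_q(y)) * (y[1] - y[0])     # Riemann sum of the density
print(total)                               # ≈ 1
```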
Chen et al. (2014) note in their Section 3.1 that instead of working with the mixture density $f$, which requires specification of the number $m$ of mixture components, it is more practical to fit each component separately through the optimization problem $\min_{\mu_j,\sigma} D_\gamma(F^* \,\|\, g_q(\cdot; \mu_j, \sigma))$, where $F^*$ is the actual (unknown) distribution generating the data and $D_\gamma(\cdot \,\|\, \cdot)$ is the γ-divergence (to be defined below) with $\gamma > 1 - q$. For given $\sigma$, the minimizer $\mu_j^*$ is given by the solution of the equation $\mu_j^* = \int y\, w(y; \mu_j^*, \sigma)\, dF^*(y) \big/ \int w(y; \mu_j^*, \sigma)\, dF^*(y)$. Hence replacing $F^*$ by the empirical distribution $\hat{F}$ of $\{Y_i, 1 \le i \le n\}$ leads to the following recursive implementation for $\mu_j^*$:
$$\mu_j^{(\ell+1)} = \frac{\int y\, w(y; \mu_j^{(\ell)}, \sigma)\, d\hat{F}(y)}{\int w(y; \mu_j^{(\ell)}, \sigma)\, d\hat{F}(y)}, \quad \ell = 0, 1, \ldots \qquad (8)$$
Section 3.2 of Chen et al. (2014) then uses the SUP algorithm of Shiu and Chen (2016) to replace $\hat{F}$ in (8) by the empirical distribution $\hat{F}^{(\ell)}$ of $\{\mu_i^{(\ell)}, 1 \le i \le n\}$, thereby leading to the γ-SUP recursion
$$\mu_j^{(\ell+1)} = \frac{\int y\, w(y; \mu_j^{(\ell)}, \sigma)\, d\hat{F}^{(\ell)}(y)}{\int w(y; \mu_j^{(\ell)}, \sigma)\, d\hat{F}^{(\ell)}(y)} = \frac{\sum_{i=1}^n w_{ij}^{(\ell)} \mu_i^{(\ell)}}{\sum_{i=1}^n w_{ij}^{(\ell)}}, \quad \ell = 0, 1, \ldots \qquad (9)$$
The explicit formulas for the weight function $w$ and weights $w_{ij}^{(\ell)}$ are derived in Sections 2.3 and 3.2 of Chen et al. (2014) by using (a) the γ-divergence $D_\gamma(F \,\|\, g_q(\cdot; \mu, \sigma)) = C_\gamma(F \,\|\, g_q(\cdot; \mu, \sigma)) - C_\gamma(F \,\|\, F)$, where $C_\gamma(\cdot \,\|\, \cdot)$ is the “γ cross-entropy” defined by $C_\gamma(F \,\|\, g) = -\int g^\gamma(y)\, dF(y)/[\gamma(\gamma+1)\|g\|_{\gamma+1}^\gamma]$, in which $\|g\|_{\gamma+1}$ is a normalizing constant, and (b) an equivalent and more tractable problem of maximizing $\int [\exp_q(-\|y - \mu\|^2/(2\sigma^2))]^\gamma\, dF(y)$, culminating in the formula
$$w_{ij}^{(\ell)} = \exp_{1-s}\!\left(-\left\|\left(\mu_j^{(\ell)} - \mu_i^{(\ell)}\right)/\tau\right\|^2\right), \qquad (10)$$
where $\tau = \sqrt{2}\sigma/\sqrt{\gamma - (1-q)} > 0$, $s = (1-q)/\{\gamma - (1-q)\} > 0$, $\mu^{(\ell)}$ is defined recursively by (9), and $\|x - y\|$ is the Euclidean distance between $x, y \in \mathbb{R}^p$. Chen et al. (2014, p. 269) note that in view of (10), γ-SUP starts with $n$ (scaled) “cluster centers” $Y_i/\tau$, $1 \le i \le n$, thereby circumventing the need of other methods to initialize with random centers; moreover, these nonnegative weights, which decrease with the Euclidean distance $\|\mu_i^{(\ell)} - \mu_j^{(\ell)}\|$, and the compact support of the q-Gaussian distribution ensure the convergence of γ-SUP. Hence, eventually “γ-SUP converges to certain K clusters, where K depends on the tuning parameters (s, τ) but otherwise is data-driven.” Another advantage of γ-SUP is that $\sigma$ is absorbed in the tuning parameter $\tau$, hence selection of $\tau$ obviates the need to select $\sigma$. Chen et al. (2014, p. 277) note that “when τ reaches a critical value, the images in the same (eventual) cluster can start with attracting each other and will finally merge”, in contrast to the case of small $\tau$ for which each particle image forms its own cluster; hence a phase transition diagram, illustrated in their Figure 5, can be used to determine $\tau$. After choosing $\tau$ in this way, $s$ can be chosen by minimizing the impurity, or c-impurity, performance measure introduced in their Section 4.2. After presenting simulation studies of γ-SUP and real cryo-EM data on the E. coli 70S ribosome, their conclusion section says: “Characteristically, sets of cryo-EM images have low signal-to-noise ratio, many of which are misaligned and should be treated as outliers, and which form a large number of clusters due to their free orientations. Because of its capability to identify outliers, γ-SUP can separate out the misaligned images and create the possibility for further correcting them”, which provides a much smaller number of clusters with larger cluster sizes, and cluster averages comparable to those of a 2D projection of the ribosome 3D structure obtained by X-ray crystallography.
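A minimal implementation of the γ-SUP recursion (9)-(10) is sketched below. This is our own illustration under the stated formulas; the algorithm of Chen et al. (2014) additionally handles the tuning of $(s, \tau)$ via the phase transition diagram and the impurity measure described above.

```python
import numpy as np

def gamma_sup(Y, tau, s, n_iter=100, tol=1e-8):
    """gamma-SUP recursion: every data point starts as its own cluster center."""
    mu = np.asarray(Y, dtype=float).copy()
    for _ in range(n_iter):
        diff = mu[:, None, :] - mu[None, :, :]
        d2 = np.sum((diff / tau) ** 2, axis=-1)            # ||(mu_j - mu_i)/tau||^2
        w = np.clip(1.0 - s * d2, 0.0, None) ** (1.0 / s)  # w_ij = exp_{1-s}(-d2), eq. (10)
        mu_new = (w @ mu) / w.sum(axis=1, keepdims=True)   # weighted update, eq. (9)
        if np.max(np.abs(mu_new - mu)) < tol:
            return mu_new
        mu = mu_new
    return mu

# Two well-separated blobs: centers within a blob merge, centers across blobs do not,
# because the compactly supported weight is exactly zero at large distances.
rng = np.random.default_rng(0)
Y = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(10.0, 0.1, (20, 2))])
centers = gamma_sup(Y, tau=1.0, s=0.5)
```

The compact support of the q-Gaussian is what makes distant points exert no pull at all, which is the mechanism behind γ-SUP's robustness to outliers.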
Chung et al. (2020) recently provided further analysis of the benchmark cryo-EM datasets 70S Ribosome and 80S Ribosome together with a novel two-stage procedure for dimension reduction, 2SDR. A ribosome is made from complexes of RNAs (ribonucleic acids) that are present in all living cells to perform protein synthesis by linking amino acids together, in the order specified by the codons of mRNA (messenger RNA) molecules, to form polypeptide chains. A 70S ribosome comprises a large 50S subunit and a small 30S subunit; the S stands for Svedberg, a unit of time equal to $10^{-13}$ seconds that measures how fast molecules move in a centrifuge. Eukaryotic ribosomes are also known as 80S ribosomes and have a large 60S subunit and a small 40S subunit. In the past decade, the linear subspace model, which represents protein motion using the eigenvolumes of the covariance matrix of 3D structures, has been an active research area, as will be discussed further in Section 4. In all these approaches, PCA (principal component analysis) plays an important role in estimating the top eigenvolumes.
Let $X, X_1, \ldots, X_n$ be iid $p \times q$ random matrices and let $y_i = \mathrm{vec}(X_i)$, where vec is the operator of matrix vectorization that stacks the matrix into a vector by columns. The statistical model for PCA is $y = \mu + \Gamma\nu + \epsilon$, where $\mu$ is the mean, $\nu \in \mathbb{R}^r$ with $r \le pq$, $\Gamma$ is a $pq \times r$ matrix with orthonormal columns, and $\epsilon$ is independent of $\nu$ with $E(\epsilon) = 0$ and $\mathrm{Cov}(\epsilon) = c\, I_{pq}$. The zero-mean vector $\nu$ has covariance matrix $\Delta = \mathrm{diag}(\delta_1, \delta_2, \ldots, \delta_r)$ with $\delta_1 \ge \delta_2 \ge \cdots \ge \delta_r > 0$. The estimate $\hat{\Gamma}$ contains the first $r$ eigenvectors of the sample covariance matrix $S_n = n^{-1}\sum_{i=1}^n (y_i - \bar{y})(y_i - \bar{y})^\top$, and $\mathrm{vec}(\bar{X}) + \hat{\Gamma}\hat{\nu}_i$ provides a reconstruction of the noisy data $\mathrm{vec}(X_i)$. The computational cost, which increases with both the sample size $n$ and the dimension $pq$, becomes overwhelming for high-dimensional data. An alternative approach to matrix vectorization is MPCA (multilinear PCA), developed by Ye (2005) and
Hung et al. (2012), which models a $p \times q$ random matrix $X$ by
$$X = Z + \epsilon \in \mathbb{R}^{p\times q}, \qquad Z = M + AUB^\top, \qquad (11)$$
where $M \in \mathbb{R}^{p\times q}$ is the mean, $U \in \mathbb{R}^{p_0\times q_0}$ is a random matrix with $p_0 \le p$, $q_0 \le q$, $A$ and $B$ are non-random $p \times p_0$ and $q \times q_0$ matrices with orthonormal column vectors, and $\epsilon$ is a zero-mean random matrix independent of $U$ such that $\mathrm{Cov}(\mathrm{vec}(\epsilon)) = \sigma^2 I_{pq}$. Ye (2005) proposed to use generalized low-rank approximations of matrices to estimate $A$ and $B$. Given $(p_0, q_0)$, $\hat{A}$ consists of the leading $p_0$ eigenvectors of the covariance matrix $\sum_{i=1}^n (X_i - \bar{X}) P_{\hat{B}} (X_i - \bar{X})^\top$, and $\hat{B}$ consists of the leading $q_0$ eigenvectors of $\sum_{i=1}^n (X_i - \bar{X})^\top P_{\hat{A}} (X_i - \bar{X})$, where the matrix $P_A = AA^\top$ (respectively, $P_B = BB^\top$) is the projection operator onto the span of the column vectors of $A$ (respectively, $B$). The estimates can be computed by an iterative procedure that usually takes no more than 10 iterations to converge. Replacing $A$ and $B$ by their estimates $\hat{A}$ and $\hat{B}$ in (11) yields
$$\hat{U}_i = \hat{A}^\top (X_i - \bar{X})\hat{B}, \quad \text{hence} \quad \hat{A}\hat{U}_i\hat{B}^\top = P_{\hat{A}}(X_i - \bar{X})P_{\hat{B}}, \qquad (12)$$
i.e., $\mathrm{vec}(\hat{A}\hat{U}_i\hat{B}^\top) = P_{\hat{B}\otimes\hat{A}}\,\mathrm{vec}(X_i - \bar{X})$, where $\otimes$ denotes the Kronecker product and $P_{B\otimes A} = (BB^\top) \otimes (AA^\top)$. Chung et al. (2020) propose a new model, called hybrid PCA and denoted by H$_M$PCA, in which the subscript M stands for MPCA and H stands for “hybrid” of MPCA and PCA. Specifically, H$_M$PCA assumes the MPCA model (11) with reduced rank $(p_0, q_0)$ via the $p_0 \times q_0$ random matrix $U$, and then assumes a rank-$r$ model, with $r \le p_0 q_0$, for $\mathrm{vec}(U)$, to which a zero-mean random error $\epsilon$ with $\mathrm{Cov}(\epsilon) = cI_{p_0 q_0}$ is added. This leads to dimension reduction of $\mathrm{vec}(X - M - \epsilon) = \mathrm{vec}(A_{p_0} U B_{q_0}^\top)$ from $p_0 q_0$ to $r$. Since $U = A_{p_0}^\top(X - M)B_{q_0}$ in view of (12), $\mathrm{vec}(A_{p_0} U B_{q_0}^\top) = P_{B_{q_0}\otimes A_{p_0}}\mathrm{vec}(X - M - \epsilon)$ is the projection of $X - M - \epsilon$ onto $\mathrm{span}(B_{q_0} \otimes A_{p_0})$, which has dimension $r$ after this further rank reduction. The actual ranks, which we denote by $(p_0^*, q_0^*)$ and $r^*$, are unknown, as are the other parameters of the H$_M$PCA model, and 2SDR uses a sample of size $n$ to fit the model and estimate the ranks.
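Ye's alternating eigendecompositions for estimating $A$ and $B$ in (11) can be sketched as follows. This is our own illustration under the stated model; the data-driven selection of $(p_0, q_0)$ is omitted here.

```python
import numpy as np

def glram(Xs, p0, q0, n_iter=10):
    """Alternating estimation of A (p x p0) and B (q x q0), as in Ye (2005)."""
    Xc = Xs - Xs.mean(axis=0)                 # center the stack of n (p x q) images
    n, p, q = Xc.shape
    B = np.eye(q)[:, :q0]                     # simple initial value for B
    for _ in range(n_iter):
        PB = B @ B.T                          # projection onto span(B)
        SA = sum(Xi @ PB @ Xi.T for Xi in Xc)
        A = np.linalg.eigh(SA)[1][:, ::-1][:, :p0]   # leading p0 eigenvectors
        PA = A @ A.T                          # projection onto span(A)
        SB = sum(Xi.T @ PA @ Xi for Xi in Xc)
        B = np.linalg.eigh(SB)[1][:, ::-1][:, :q0]   # leading q0 eigenvectors
    U = np.einsum('ji,njk,kl->nil', A, Xc, B)        # U_i = A^T (X_i - Xbar) B, eq. (12)
    return A, B, U
```

Each iteration solves only two symmetric eigenproblems of sizes $p \times p$ and $q \times q$, which is why the procedure is so much cheaper than PCA on $pq$-dimensional vectorized images and typically converges within about 10 iterations.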
The first stage of 2SDR uses (11) to model a noisy image $X$ as a matrix. Ye’s estimates $\hat{A}$ and $\hat{B}$ described in the preceding paragraph depend on the given values of the rank pair $(p_0, q_0)$. Chung et al. (2020) introduce a rank selection criterion to choose the rank pair and show that its minimizer $(\hat{p}_0, \hat{q}_0)$ is a consistent estimator (as $n \to \infty$) of the true value $(p_0^*, q_0^*)$ of the rank pair. The second stage of 2SDR performs PCA on the covariance matrix $n^{-1}\sum_{i=1}^n \mathrm{vec}(\hat{U}_i)\mathrm{vec}(\hat{U}_i)^\top$ to obtain its ordered eigenvalues, which are used in a “generalized information criterion” (GIC) to choose the rank $r \le \hat{p}_0\hat{q}_0$; minimization of GIC over $r \le \hat{p}_0\hat{q}_0$ yields the estimator $\hat{r}_{\mathrm{GIC}}$, which is then shown to be a consistent estimator of the true rank $r^*$ under certain sparsity conditions. The analysis of the 70S Ribosome benchmark data in Section 3 of Chung et al. (2020) shows that 2SDR can improve 2D image clustering to curate the clean particles and 3D classification to separate various conformations, and can enhance the performance of γ-SUP via dimension reduction. The 80S Ribosome dataset contains 105,247 particle images of size 360 × 360 pixels, for which “many current PCA implementation fail to solve the complete set of eigenvectors”, whereas 2SDR reduces the computational complexity “by several orders of magnitude”, thereby circumventing the prohibitive computational overhead of vectorization for dimension reduction via PCA. To illustrate, the top row of Figure 5 shows nine images randomly selected from a dataset that contains 5090 Betagal particle images of size 256 × 256 pixels. The next five rows of Figure 5 show their reconstructions by PCA, MPCA with $(p_0, q_0) = (19, 19)$, 2SDR with $(p_0, q_0) = (19, 19)$ and $\hat{r}_{\mathrm{GIC}} = 33$, Wavelet, and BM3D, respectively. “Wavelet” refers to the adaptive wavelet thresholding for image denoising and compression introduced by Chang et al. (2000), who proposed to use (a) matrix representation of 2D noisy images, (b) wavelet thresholding to minimize a Bayes risk assuming a generalized Gaussian model for the signal and a Gaussian model for the noise, and (c) the minimum description length criterion for choosing quantization levels and binwidths. “BM3D” refers to the block-matching and 3D reconstruction introduced by Danielyan et al. (2011), who proposed to use (i) a block-matching algorithm to group image fragments (which may not be disjoint but have the same size), (ii) hard thresholding to filter the 3D group spectra, and (iii) inversion of the filtered spectra to provide an individual reconstruction for each block in the group, so that the final image reconstruction can be computed as a weighted average of all blockwise estimates. Figure 5 shows that 2SDR clearly performs much better than the other methods.
Figure 5: Nine randomly selected particle images and their reconstructions.
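The second stage of 2SDR, ordinary PCA on the vectorized MPCA cores $\hat{U}_i$, can be sketched as follows (our own illustration; the GIC-based choice of $r$ is omitted, and the ordering used for vectorization does not affect the spectrum):

```python
import numpy as np

def second_stage_pca(Us, r):
    """PCA on vec(U_i): eigendecompose n^{-1} sum_i vec(U_i) vec(U_i)^T."""
    n = Us.shape[0]
    V = Us.reshape(n, -1)              # rows are vec(U_i), length p0*q0
    S = V.T @ V / n                    # the second-moment matrix used by 2SDR
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1]    # sort eigenvalues in decreasing order
    evals, evecs = evals[order], evecs[:, order]
    return evals[:r], evecs[:, :r], V @ evecs[:, :r]   # eigenvalues, loadings, scores
```

Because $p_0 q_0$ is tiny compared with the original $pq$, this eigendecomposition is cheap even for very large image stacks.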
3.2 Latent Orientations and Regularized Likelihood Optimization
Scheres (2012a) gives an overview of RELION (REgularized LIkelihood OptimizatioN), an
open-source computer program that implements the “Bayesian approach to cryo-EM structure
determination, in which the reconstruction problem is expressed as the optimization of a single
target function” in Scheres (2012b). In connection with the latter reference, for which Section 1
refers to this section for background and explanation, the angstrom (Å) is 0.1 nanometer, a unit
of length that is widely used to express the sizes of atoms, molecules, and electromagnetic
wavelengths. That reference pinpoints a fundamental difficulty with 3D structure reconstruction
from cryo-EM data, which is “the lack of information about the relative orientations of all
particles and, in the case of structural variability in the sample, also their assignment to a
structurally unique class” because “these data are lost during the experiment, where molecules
in distinct conformations coexist in solution and adopt random orientations in the ice.” Hence
cryo-EM structure determination is an “ill-posed problem” which needs to be “tackled by
regularization, where the experimental data are complemented by prior information so that the
two sources of information together fully determine a unique solution.” Since in practice the
experimental data often “need to be supplemented with prior information” because of low
signal-to-noise ratio (SNR) or insufficiently large sample size, a Bayesian approach that assumes
“a Gaussian distribution on the Fourier components of the signal” is used for maximum a
posteriori (MAP) estimation of the latent vector of actual orientations of the images. MAP is a
point estimate defined by the mode (i.e., argmax_θ f(θ | X_n)) of the posterior density f(· | X_n)
given the observed sample X_n of size n; hence it is the solution of a regularized likelihood
maximization problem. For a loss function of the form L(φ, a) = I{‖φ−a‖ ≥ c}, the Bayes
estimate φ_c approaches the MAP estimate as c → 0 if −f(· | X_n) is convex. Scheres (2012b,
pp. 521-525) uses (a) the Dempster-Laird-Rubin EM algorithm to evaluate the MAP estimate,
with fast Fourier-space interpolation and adaptive EM to speed up the computations, (b) the
ratio π_F/π_T (of the posterior probability of assigning a false orientation φ_F to that of assigning
the true orientation φ_T) to assess the accuracy of the orientation assignments of individual
particles, and (c) a “gold standard” (explained below) to avoid overfitting.
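To illustrate why MAP estimation under a Gaussian prior amounts to regularized likelihood maximization, consider a toy one-component version (all variances and stand-in CTF values below are illustrative assumptions): the posterior is Gaussian, so its mode (the MAP estimate) equals its mean, a ridge-type shrinkage of the least-squares solution.

```python
import numpy as np

# Toy model: observe y_i = c_i * x + eps_i with eps_i ~ N(0, sigma2),
# and a Gaussian prior x ~ N(0, v) on the signal component.
def map_estimate(y, c, sigma2, v):
    """MAP estimate = posterior mean = ridge-shrunken least squares."""
    return np.sum(c * y) / (np.sum(c ** 2) + sigma2 / v)

rng = np.random.default_rng(1)
x_true = 2.0                                  # illustrative signal component
c = rng.uniform(0.5, 1.5, 50)                 # stand-in for CTF values c_i
y = c * x_true + rng.normal(0.0, 0.3, 50)
x_map = map_estimate(y, c, sigma2=0.09, v=4.0)
x_mle = np.sum(c * y) / np.sum(c ** 2)        # letting v -> infinity recovers the MLE
```

The prior variance v plays the role of the regularization hyperparameter: small v shrinks aggressively toward zero, large v recovers unregularized maximum likelihood.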
Scheres (2016) reviews the processing of structurally heterogeneous cryo-EM data in RELION.
He recognizes that the Bayesian approach in Scheres (2012b,a) is actually empirical
Bayes (EB) regularization because the hyperparameters in the prior model are replaced by
their estimates in the regularized likelihood. He says: “Whereas in standard Bayesian methods
the prior is fixed before any data are observed, inside RELION parameters of the prior are
estimated from the data themselves. This type of algorithm is referred to as an empirical Bayes
approach”, in which “both the likelihood and the prior are expressed in the Fourier domain”,
which “permits a convenient description of the effects of microscope optics and defocusing (by
the so-called contrast transfer function, or CTF).” He notes that using the EM algorithm to
compute the MAP estimate in the Fourier domain “results in an update formula for the reconstruction
that shows strong similarities with previously introduced Wiener filters”, which
depend on estimates for the power of the signal and the noise as a function of spatial frequency.
By using the EB approach to update these estimates from the data, “RELION effectively calculates the
best possible filter, in the sense that it yields the least noisy reconstruction, at every iteration
of the optimization process.” As pointed out by Scheres (2012b, p.520), the MAP estimate
is based on the following linear regression model in Fourier space for the kth homogeneous
structure group (k = 1, . . . , K):
$$x_{ij} = c_{ij} \sum_{\ell=1}^{L} P^{\phi}_{j\ell}\, s_{k\ell} + \varepsilon_{ij}, \qquad (13)$$

where x_ij is the jth component (j = 1, . . . , J) of the 2D Fourier transform of the ith image
(i = 1, . . . , N), c_ij is the jth component of the CTF for the ith image, s_kℓ is the ℓth component
of the 3D Fourier transform of the kth structure group, ε_ij is noise (usually assumed to be
independent and normally distributed with mean 0 and variance σ²) in the complex plane,
and P^φ = (P^φ_jℓ)_{1≤j≤J, 1≤ℓ≤L} is a matrix that relates the 2D Fourier transform to the 3D Fourier
transform by the projection-slice theorem; see Section 4.2 for background and further details.
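A real-valued toy simulation of model (13) with a single structure class and known orientations (so that P^φ is known; the sizes, noise level, and stand-in CTF values below are illustrative) shows the linear-inverse-problem nature of the known-orientation case, where each Fourier component can be recovered by least squares:

```python
import numpy as np

rng = np.random.default_rng(2)
L, J, N = 8, 8, 200                     # toy sizes (illustrative)
s = rng.normal(size=L)                  # "3D Fourier components" of one structure
P = np.eye(J, L)                        # toy projection matrix (orientation known)
C = rng.uniform(0.3, 1.0, (N, J))       # stand-in for CTF values c_ij
# real-valued analog of model (13) with K = 1:
X = C * (P @ s) + 0.5 * rng.normal(size=(N, J))

# With orientations known, each component decouples into a 1D least-squares problem:
s_hat = np.sum(C * X, axis=0) / np.sum(C ** 2, axis=0)
```

Averaging over many images with varying CTF values fills in the frequencies that any single defocused image attenuates; the latent-orientation case replaces this closed-form solve with the EM iterations described above.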
We next discuss the “gold standard” for avoiding overfitting, which uses FSC (Fourier Shell
Correlation) curves, and also movies to correct beam-induced motion during the exposure to
the electron beam. Scheres (2012b, p.520) provides details of the derivation of the iterative
algorithm, showing how to update the noise variance σ²_{t+1} from s^{(t)}_{kℓ}, the signal
variance v^{(t+1)}_{kℓ} from s^{(t+1)}_{kℓ}, and the posterior probability π^{(t+1)}_{kℓ,φ} of
class assignment k and orientation φ (given the data on the ith image) from π^{(t)}_{kℓ,φ},
s^{(t)}_{kℓ}, and v^{(t)}_{kℓ}. The updating formulas involve integrals over the latent
orientation φ, which in practice “are replaced by (Riemann) summations over discretely sampled
orientations, and translations are limited to a user-defined range.” Scheres (2016, pp.129-132)
describes the statistical underpinnings and historical background of the use of FSC curves and
the development of the gold standard for the validation of cryo-EM structures. Because the EB
approach updates the hyperparameters and latent orientations concurrently at each iteration,
“once one over-estimates the power of the true signal due to an inadvertent build-up of a small
amount of noise in the reconstruction, one will allow even more noise in the next iteration”,
which led to “overfitting, where noise in the model iteratively builds up”, in many cryo-EM
structure determination projects by 2010, and to the convening of “a community-driven task-force
for the validation of EM structures”.
One of the recommendations of the task-force was to use the “gold-standard approach to refinement”
by splitting the data into two halves and refining independent reconstructions against
each “half-set”; see Henderson et al. (2012). The FSC between the two independent reconstructions
“then yields a reliable resolution estimate so that the iterative build-up of noise can be
prevented.” This is akin to two-fold cross-validation (CV), and can therefore be extended to k-fold
CV if N is sufficiently large, using FSC as the performance measure. It departs from what had
become common practice in the field, which “had evolved toward the refinement of a single reconstruction”
from which “the resulting angles would be used to make two (no longer independent) reconstructions
from random half-sets at each iteration.” In 2011, the tilt-pair experiments by Henderson, Chen
and their collaborators showed, however, that “alignments are dominated by the lower spatial
frequencies, which are almost indistinguishable between reconstructions from all or half of the
data”, and that the previously introduced FSC = 0.143 criterion performed well when two independent
reconstructions were used but was “too optimistic” when using a single reconstruction, which
produced “inflated FSC”.
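The gold-standard computation can be sketched as follows: correlate the 3D Fourier transforms of two independently refined half-maps over spherical shells of spatial frequency (the shell count and toy volumes below are illustrative):

```python
import numpy as np

def fsc(v1, v2, n_shells=16):
    """Fourier Shell Correlation between two equally sized cubic volumes."""
    F1, F2 = np.fft.fftn(v1), np.fft.fftn(v2)
    f = np.fft.fftfreq(v1.shape[0])
    r = np.sqrt(f[:, None, None] ** 2 + f[None, :, None] ** 2 + f[None, None, :] ** 2)
    edges = np.linspace(0.0, 0.5, n_shells + 1)
    curve = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (r >= lo) & (r < hi)                     # voxels in this frequency shell
        num = np.real(np.sum(F1[m] * np.conj(F2[m])))
        den = np.sqrt(np.sum(np.abs(F1[m]) ** 2) * np.sum(np.abs(F2[m]) ** 2))
        curve.append(num / den if den > 0 else 0.0)
    return np.array(curve)

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 32)
blob = np.exp(-(x[:, None, None] ** 2 + x[None, :, None] ** 2 + x[None, None, :] ** 2) / 0.1)
half1 = blob + 0.2 * rng.normal(size=blob.shape)   # independently refined "half-maps"
half2 = blob + 0.2 * rng.normal(size=blob.shape)
curve = fsc(half1, half2)
res_shell = int(np.argmax(curve < 0.143))          # first shell below the 0.143 cut
```

Because the two half-maps share the signal but have independent noise, the curve is near 1 at low frequencies and decays toward 0 where noise dominates; the reported resolution is read off at the 0.143 crossing.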
Scheres (2012a, p.411) points out that “there remains one problem” with the “assumption
of independence between Fourier components of the signal” in the prior distribution underlying
the MAP estimate; “this assumption is known to be a poor one” for a macromolecular complex,
resulting in under-estimation of power in the signal and over-smoothing of the reconstruction.
Scheres (2012b, p.527) introduced a “3D auto-refinement” with which “the user only selects
a relatively coarse initial orientational sampling and this sampling rate is automatically increased
during the refinement”, which monitors two convergence criteria, namely “the estimated
resolution based on the (over-smoothed) gold-standard FSC curve” and “the average changes in
the optimal orientation and class assignments for all particles”, so that “once both criteria no
longer improve from one iteration to the next, the orientational sampling rates are increased.”
Specifically, rotational sampling is “increased 2-fold by using the next Healpix grid” and translational
sampling is “adjusted to the estimated accuracy of the translational assignments” based
on the π_F/π_T criterion, and the adaptive EM algorithm is used during all iterations. Before
termination upon convergence, “a final iteration is performed where the two independent
halves of the data are combined in a single reconstruction.” Scheres (2016, Section 2.3) describes
an alternative method that was recommended by the task-force to carry out this final
step, saying: “Because orientational and class assignments are predominantly driven by the
lower frequency content of the images, they are usually not noticeably affected by the under-estimation
of resolution. However, upon convergence the highest amount of information needs
to be extracted from the reconstruction . . . . By masking out the solvent region from the
two half-reconstructions, the noise gets reduced and the FSCs will increase,” but masking also
introduces “convolution effects” that can be corrected by “phase randomization”, details of
which are given in Scheres (2016, p.132).
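Phase randomization itself is straightforward to sketch: keep the Fourier amplitudes but replace the phases beyond a chosen frequency shell with uniform random phases (the cutoff and the toy cubic volume below are illustrative):

```python
import numpy as np

def phase_randomize(vol, cutoff, rng):
    """Keep Fourier amplitudes of a cubic volume but randomize phases at
    spatial frequencies >= cutoff (in cycles/voxel)."""
    F = np.fft.rfftn(vol)
    n = vol.shape[0]
    f, fr = np.fft.fftfreq(n), np.fft.rfftfreq(n)
    r = np.sqrt(f[:, None, None] ** 2 + f[None, :, None] ** 2 + fr[None, None, :] ** 2)
    phases = rng.uniform(0.0, 2.0 * np.pi, F.shape)
    F_rand = np.where(r >= cutoff, np.abs(F) * np.exp(1j * phases), F)
    return np.fft.irfftn(F_rand, s=vol.shape)

rng = np.random.default_rng(4)
vol = rng.normal(size=(16, 16, 16))            # stand-in for a masked half-map
randomized = phase_randomize(vol, 0.25, rng)   # phases scrambled beyond 0.25
```

Comparing the FSC of phase-randomized half-maps with that of the masked half-maps reveals how much of the apparent correlation is a convolution artifact of the mask rather than true signal.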
3.3 Hidden Markov Models, MCMC and Uncertainty Quantification
Using the EB approach via the EM algorithm for iterative estimation of the state vector
s_k = (s_{k1}, . . . , s_{kL})⊤, together with gold-standard FSC coupled with phase randomization for
concurrent iterative estimation of the hyperparameter vector θ, appears circuitous when (13)
is actually a hidden Markov model (HMM) and the task at hand is joint state and parameter
estimation (or adaptive filtering) in the HMM, which is a long-standing problem with major
applications in many STEM (Science, Technology, Engineering, Mathematics) fields.

Lai et al. (2020) recently developed MCMC-SS (Markov Chain Monte Carlo with Sequential
Substitutions), a new method for adaptive filtering in HMMs on general state spaces. Their basic
idea is to approximate a target distribution by the empirical distribution of M representative
atoms, chosen sequentially by an MCMC scheme so that the empirical distribution approximates the
target distribution after a large number K of iterations. Making use of bounds on a weighted
total variation norm of the difference between the target distribution and the empirical measure
defined by the sample paths of the MCMC scheme, they have also developed an asymptotic
theory of MCMC-SS. This theory includes asymptotic normality (as both K and M approach
∞) of the estimates of functionals of the target distribution using MCMC-SS, together with
consistent estimation of their standard errors, and provides oracle properties that establish their
asymptotic optimality. In particular, convergence is guaranteed and automated for MCMC-SS,
in contrast to standard MCMC schemes for which manual checks of convergence are needed.
Moreover, the computation can be vectorized and accelerated using a GPU, and parallelized
across multiple GPUs; see Lai et al. (2020), where applications to image analysis with uncertainty
quantification are also given.
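The flavor of approximating a target distribution by an empirical measure of MCMC draws, with a standard error for a functional of the target, can be conveyed by a generic random-walk Metropolis sketch (this is not the MCMC-SS algorithm of Lai et al., whose sequential-substitution scheme and theory are in their report; the batch-means error estimate here is a standard stand-in):

```python
import numpy as np

def metropolis(logp, x0, n_steps, step, rng):
    """Random-walk Metropolis chain targeting the density proportional to exp(logp)."""
    x, lp = x0, logp(x0)
    out = np.empty(n_steps)
    for k in range(n_steps):
        prop = x + step * rng.normal()
        lp_prop = logp(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept/reject step
            x, lp = prop, lp_prop
        out[k] = x
    return out

rng = np.random.default_rng(5)
chain = metropolis(lambda t: -0.5 * (t - 3.0) ** 2, 0.0, 20000, 1.0, rng)
atoms = chain[5000:]                          # retained "atoms" after burn-in
est = atoms.mean()                            # estimate of the target mean (= 3)
batch_means = atoms.reshape(30, -1).mean(axis=1)
se = batch_means.std(ddof=1) / np.sqrt(len(batch_means))   # batch-means standard error
```

In this generic scheme the burn-in length and convergence must be checked manually; automating and guaranteeing that step is precisely the point of the MCMC-SS theory described above.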
4 Statistical and Computational Challenges/Opportunities
An article by Bendory et al. (2020) in IEEE Signal Processing Magazine, referred to hereafter as
BBS, which introduces the “challenging and exciting computational tasks involved in reconstructing
3-D molecular structures by cryo-EM” and describes the associated “computational
challenges and opportunities”, is the inspirational source behind this concluding section.
We focus on the statistical challenges and opportunities, which we relate to the computational
ones that BBS has covered comprehensively and eloquently. In this connection, Section
4.1 also provides additional background for the material presented in Section 3.2. Section 4.2
discusses the statistical challenges and opportunities together with concluding remarks.
4.1 Mathematical Theory of Cryo-EM Image Reconstruction, Verification and Computational Building Blocks
Section II of BBS describes the mathematical model generating cryo-EM images and formulates
the inverse problem of image reconstruction as estimating the molecular structure φ of the
particles embedded in the ice from the 2D images I_1, . . . , I_N; each image I_i is formed by
rotating φ by a 3D rotation R_{ω_i} followed by a 2D shift t_i. While ω_1, . . . , ω_N,
t_1, . . . , t_N are unknown a priori, they are nuisance parameters as “their estimation is not an
aim” of the reconstruction of φ, which “is possible up to three intrinsic ambiguities: a global
3D rotation, the (3D) location of the center of the molecule, and handedness.” The first two
are related to the nuisance parameters and are handled by stochastic modeling as in the EM
algorithm of RELION. In proteins, the polypeptide chain forms a number of right- and left-handed
helices and superhelices, and other structures such as a right-handed double helical β-hairpin
that is strongly twisted and coiled; see Efimov (2018). BBS notes that “the handedness
of the structure cannot be determined from cryo-EM images alone because the original 3D object
and its reflection give rise to identical sets of projections.” Hence, “φ may be thought of as a
random signal with an unknown distribution defined over a space of possible structures”, which
BBS discusses in Section VIIA as a computational and theoretical challenge, surveying two
different approaches in the literature. One is in the direction that we have discussed in Section
3.2 and has the “apparent drawbacks” that “it does not scale well for large K and ignores the
correlation between different functional states of the molecule and thus overlooks important
information.” The other assumes that “φ_1, . . . , φ_N can be embedded in a low-dimensional
space”, which can be learned by using PCA for linear subspaces or “by other spectral methods,
such as diffusion maps” in more intricate low-dimensional manifolds. We will return to these
challenges and opportunities in Section 4.2.
Section V of BBS describes five building blocks in the algorithmic pipeline of single-particle
reconstruction using cryo-EM, including the 2D classification and denoising/dimension reduction
techniques that we have already discussed in Section 3.1. Section 4.2 will consider CTF estimation
and bias mitigation in particle picking. Here we consider motion correction, about which
Section II of BBS says: “Modern electron microscopes produce multiple micrographs” and the
electron detectors “acquire multiple frames per micrograph,” allowing them to partially correct
the blur caused by beam-induced movement by “aligning and averaging the frames”;
motion correction is in essence a multi-reference alignment problem, for which the main challenge
for cryo-EM images is “the high noise levels that hamper precise estimation of relative
shifts between frames.” Recent solutions include “strategies for per-particle correction”, whereas
earlier solutions aim at estimating the movement of the entire micrograph, which is divided into
patches and “motions within each patch are estimated based on cross-correlation.”
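Cross-correlation shift estimation between two frames can be sketched with FFTs (a toy integer-shift version; real pipelines handle subpixel shifts and far lower SNR):

```python
import numpy as np

def estimate_shift(a, b):
    """Return the integer shift s with np.roll(b, s, axis=(0, 1)) best aligned to a."""
    # circular cross-correlation via the FFT
    xc = np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(b))).real
    idx = np.unravel_index(np.argmax(xc), xc.shape)
    # map array indices to signed shifts
    return tuple(int(i) if i <= n // 2 else int(i - n) for i, n in zip(idx, a.shape))

rng = np.random.default_rng(6)
frame = rng.normal(size=(64, 64))                      # reference frame
moved = np.roll(frame, (3, -5), axis=(0, 1)) + 0.1 * rng.normal(size=(64, 64))
shift = estimate_shift(frame, moved)                   # recovers (-3, 5)
```

The correlation peak stands far above the noise floor because the whole frame contributes; per-particle correction must cope with much smaller patches and hence much noisier peaks.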
Scheres (2016, Section 2.4) describes a hardware-based “movie-processing” approach to motion
correction. He points out that with the advent of direct-electron detectors, “the possibility
arose to collect movies during the exposure of the sample to the electron beam”; when
the beam’s electrons hit the sample, inelastically scattered electrons deposit energy and “the
sample starts to move upon exposure to the electron beam.” Hence, “this beam-induced motion
causes a blurring in the images that can be corrected by movie-processing, since each of the
movie frames contains a sharper snapshot of the moving objects”, as has been illustrated for
large rotavirus particles. Although it is more difficult to correct for beam-induced motions of
smaller complexes, the movie-processing approach can be adapted “based on the observation
that neighboring particles often move in similar directions”, and therefore “by fitting straight
lines through the most likely translations from the original movie-processing approach, and by
considering groups of neighboring particles in these fits, the high noise levels in the estimated
movement tracks could be sufficiently reduced” for smaller complexes.
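The straight-line fitting step can be sketched as a least-squares degree-1 fit to noisy per-frame translation estimates, pooled over a group of neighboring particles (the drift parameters, group size, and noise level below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
frames = np.arange(24, dtype=float)            # movie frame indices
true_track = 0.3 * frames - 1.0                # linear drift in pixels (illustrative)
# noisy per-frame translation estimates for 5 neighboring particles
group = true_track + rng.normal(0.0, 1.5, (5, frames.size))

pooled = group.mean(axis=0)                    # pool the neighboring particles
slope, intercept = np.polyfit(frames, pooled, 1)
smoothed = slope * frames + intercept          # straight-line movement track
```

Pooling divides the estimation noise by the group size while the line fit constrains the track to two parameters, which together tame the high per-frame noise.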
We now consider the closely related problem of verification that a 3D reconstruction is
“a reliable and faithful representation of the underlying molecule”, discussed in Section VIIB
of BBS as a computational and theoretical challenge. Although “this is a question of crucial
importance for any scientific field” and several validation techniques have been proposed in
the cryo-EM literature, “there are no agreed-upon computational verification methods” and in
practice, “structure validation is based on a set of heuristics and the experts’ knowledge and
experience”, e.g., initializing the 3D reconstruction algorithm from multiple random points and
attaining similar structures in all cases, or recovering a similar structure by applying other technologies
(such as X-ray crystallography and nuclear magnetic resonance) to the same molecule. The
movie-processing approach described in the preceding paragraph offers a promising verification
procedure, which will be discussed further in Section 4.2.
4.2 Statistical Challenges and Opportunities
Analyzing “beam-induced motions for groups of neighboring particles” is only a key constituent
of the movie-processing approach reviewed by Scheres (2016, pp.133-134), who also describes
the importance of combining it with “a novel way of handling radiation-damage weighting” as an
adjuvant. He refers to Glaeser (2016) for reviews of radiation chemistry and radiation damage
caused by exposure to the electron beam, which starts with the breakage of chemical bonds in
the sample, destroying its high-resolution content, and continues with unfolding
of the secondary structure elements and protein domains, with low-resolution information persisting
longer than high-resolution information, until the macromolecular complex is eventually
destroyed. Hence dose-response relations (in which response refers to the resolution-dependent
effect of radiation damage) of the type used in pharmacokinetics/pharmacodynamics are potentially
useful for such modeling and for providing frequency-dependent weights to average the
aligned movie frames of each particle. He says: “By weighting the different spatial frequencies
in each movie frame differently, the useful information from each movie frame is retained.” He
remarks that this approach results in improved signal-to-noise ratios and is sometimes called
“particle polishing”.
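Such frequency-dependent frame weighting can be sketched as follows; the exponential decay with accumulated dose and the toy critical-dose curve are illustrative assumptions, not the published weighting scheme:

```python
import numpy as np

def dose_weights(freqs, doses, scale=10.0):
    """Illustrative dose weights: higher frequencies fade faster with accumulated dose."""
    crit = scale / (1.0 + 10.0 * freqs)        # toy frequency-dependent critical dose
    return np.exp(-doses[:, None] / (2.0 * crit[None, :]))

freqs = np.linspace(0.0, 0.5, 6)               # spatial frequencies (cycles/pixel)
doses = np.linspace(1.0, 25.0, 10)             # accumulated dose per frame
W = dose_weights(freqs, doses)                 # weight of each frame at each frequency
W_norm = W / W.sum(axis=0, keepdims=True)      # normalized weights for frame averaging
```

Early frames dominate the high-frequency average while every frame still contributes at low frequencies, so the useful information in each frame is retained rather than discarded.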
The next two statistical challenges and opportunities are CTF estimation and bias mitigation
in particle picking, mentioned in the last paragraph of Section 4.1. Both are related to what
Section III of BBS calls “three characteristic features of the cryo-EM data: high noise level,
missing data, and massive datasets.” Estimation of the CTF parameters c_ij in (13) would be a linear
regression problem if there were no missing data. As pointed out by BBS, if the viewing direction
and location associated with each particle were known, “estimating φ would be a linear
inverse problem”, and recovery in this case is based on the projection-slice theorem (also called
the Fourier slice theorem), which states that the 2D Fourier transform of a projection of the 3D
structure φ is the restriction of the 3D Fourier transform of φ to a 2D plane; hence φ
can be estimated to a certain resolution if one has enough projections from known viewing
directions. The statistical challenge is that the viewing directions are unknown “missing data”
and “any method to estimate the viewing directions is destined to fail” because of low SNR
and the “Neyman-Scott paradox” (caused by too many nuisance parameters relative to the
sample size). Hence “it is essential to consider statistical methods that circumvent rotation
estimation”, providing an opportunity for statistical science that we have discussed in Sections
3.2 and 3.3.
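The projection-slice theorem is easy to verify numerically in its 2D analog: the 1D Fourier transform of an image's projection (its sum along one axis) equals the central line of its 2D Fourier transform.

```python
import numpy as np

rng = np.random.default_rng(8)
img = rng.normal(size=(32, 32))          # stand-in for a 2D "structure"
proj = img.sum(axis=1)                   # projection (line integrals along one axis)
proj_ft = np.fft.fft(proj)               # 1D Fourier transform of the projection
slice_ft = np.fft.fft2(img)[:, 0]        # central line of the 2D Fourier transform
# the two agree: this is the projection-slice theorem in its 2D analog
```

In 3D the same identity makes each 2D projection supply one central plane of the 3D Fourier transform, which is why enough projections from known viewing directions determine φ up to the sampled resolution.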
Sections IVB1 and IVC of BBS discuss model bias, exemplified by the “Einstein from noise”
phenomenon, which shows how the reconstructed molecular structure can be biased by the
reference image picked manually by the analyst to “extract 2D projections (particles) from the
noisy micrographs”, as we have also shown in Section 2. To mitigate the user’s selection bias,
RELION uses an automated particle picking algorithm that is described, with updates, by Scheres
(2016, pp.135-137), which we summarize here. To identify the positions of individual particles
in all micrographs, the user’s task is to extract suitable particles from a relatively small subset
of micrographs (typically a thousand particles is enough); after extraction the particles
undergo a first round of reference-free 2D classification in RELION, and the 2D classes are used
as templates for automated particle picking in all micrographs. The autopicking algorithm has
two important parameters: a threshold that expresses how restrictive the particle picking is
(with higher threshold values picking fewer particles and therefore being more restrictive) and the
minimum inter-particle distance. Note that improvements in 2D classification methodology
and software implementation have been evolving, as described above in Section 3.1 and in BBS
(Section VD). Indeed 3D multi-reference alignment for high-resolution structure determination
is a continually improving area that offers challenges and opportunities for statistical science.
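A toy sketch of template-based autopicking with the two parameters just described, a picking threshold on normalized cross-correlation and a minimum inter-particle distance (all sizes and values are illustrative and this is not RELION's implementation):

```python
import numpy as np

def autopick(micrograph, template, threshold, min_dist):
    """Pick peaks of normalized cross-correlation above a threshold,
    greedily enforcing a minimum inter-particle distance."""
    t = (template - template.mean()) / template.std()
    h, w = template.shape
    H, W = micrograph.shape
    scores = np.full((H - h + 1, W - w + 1), -np.inf)
    for i in range(H - h + 1):                 # plain loops keep the sketch short
        for j in range(W - w + 1):
            patch = micrograph[i:i + h, j:j + w]
            sd = patch.std()
            if sd > 0:
                scores[i, j] = np.mean((patch - patch.mean()) / sd * t)
    picks = []
    for flat in np.argsort(scores, axis=None)[::-1]:   # best candidates first
        i, j = np.unravel_index(flat, scores.shape)
        if scores[i, j] < threshold:
            break
        if all((i - a) ** 2 + (j - b) ** 2 >= min_dist ** 2 for a, b in picks):
            picks.append((int(i), int(j)))
    return picks

rng = np.random.default_rng(9)
yy, xx = np.mgrid[0:9, 0:9]
tmpl = np.exp(-((yy - 4.0) ** 2 + (xx - 4.0) ** 2) / 4.0)   # toy particle template
mic = 0.2 * rng.normal(size=(64, 64))
mic[10:19, 10:19] += tmpl                     # plant two particles
mic[40:49, 30:39] += tmpl
picks = autopick(mic, tmpl, threshold=0.55, min_dist=6)
```

Raising the threshold makes picking more restrictive, while the minimum inter-particle distance suppresses the near-duplicate hits that cluster around each true particle.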
BBS says: “Establishing computational tools that provide confidence intervals to estimated
structures and are immunized against systematic errors is one of the remaining challenges in
the field.” The recent advances in joint state and parameter estimation in hidden Markov
models described in Section 3.3 are opportune developments to meet these challenges.
In conclusion, we share the perspective of BBS that the emerging field of cryo-EM is an
alluring area for statistical and computational scientists to tackle challenging problems and
develop analytical and computational tools to “drive the field forward”, and in return “broaden
our understanding of the fundamental mechanisms of life.” This understanding is not only
timely in the battle against SARS-CoV-2 but also important for developing vaccines, diagnostics
and drugs to prevent or treat COVID-19; see the articles by Wrapp et al. (2020) and
Gao et al. (2020) in Science early this year.
References
Bendory, T., Bartesaghi, A., and Singer, A. (2020). Single-particle cryo-electron microscopy:
Mathematical theory, computational challenges, and opportunities. IEEE Signal Processing
Magazine, 37(2):58–76.

Chang, S. G., Yu, B., and Vetterli, M. (2000). Adaptive wavelet thresholding for image denoising
and compression. IEEE Transactions on Image Processing, 9:1532–1546.

Chen, T.-L., Hsieh, D.-N., Hung, H., Tu, I.-P., Wu, P.-S., Wu, Y.-M., Chang, W.-H., and
Huang, S.-Y. (2014). γ-SUP: A clustering algorithm for cryo-electron microscopy images of
asymmetric particles. The Annals of Applied Statistics, 8:259–285.

Cheng, Y. (2015). Single-particle cryo-EM at crystallographic resolution. Cell, 161:450–457.

Cheng, Y., Grigorieff, N., Penczek, P., and Walz, T. (2015). A primer to single-particle cryo-electron
microscopy. Cell, 161:438–449.

Chung, S.-C., Wang, S.-H., Niu, P.-Y., Huang, S.-Y., Chang, W.-H., and Tu, I.-P. (2020). Two-stage
dimension reduction for noisy high-dimensional images and application to cryogenic
electron microscopy. Annals of Mathematical Sciences and Applications, 5: in press.

Danielyan, A., Katkovnik, V., and Egiazarian, K. (2011). BM3D frames and variational image
deblurring. IEEE Transactions on Image Processing, 21:1715–1728.

Efimov, A. V. (2018). Chirality and handedness of protein structures. Biochemistry (Moscow),
83:S103–S110.

Gao, Q., Bao, L., Mao, H., Wang, L., Xu, K., Yang, M., Li, Y., Zhu, L., Wang, N.,
Lv, Z., Gao, H., Ge, X., Kan, B., Hu, Y., Liu, J., Cai, F., Jiang, D., Yin, Y.,
Qin, C., Li, J., Gong, X., Lou, X., Shi, W., Wu, D., Zhang, H., Zhu, L., Deng,
W., Li, Y., Lu, J., Li, C., Wang, X., Yin, W., Zhang, Y., and Qin, C. (2020).
Development of an inactivated vaccine candidate for SARS-CoV-2. Science, in press.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7202686/pdf/abc1932.pdf.

Glaeser, R. M. (2016). Specimen behavior in the electron beam. Chapter 2 (pp. 20–50) in
Methods in Enzymology, vol. 579. Elsevier.
Henderson, R. (2013). Avoiding the pitfalls of single particle cryo-electron microscopy: Einstein
from noise. Proc. Natl. Acad. Sci. USA, 110:18037–18041.

Henderson, R., Sali, A., Baker, M. L., Carragher, B., Devkota, B., Downing, K. H., Egelman,
E. H., Feng, Z., Frank, J., Grigorieff, N., Jiang, W., Ludtke, S. J., Medalia, O., Penczek,
P. A., Rosenthal, P. B., Rossmann, M. G., Schmid, M. F., Schroder, G. F., Steven, A. C.,
Stokes, D. L., Westbrook, J. D., Wriggers, W., Yang, H., Young, J., Berman, H. M.,
Chiu, W., Kleywegt, G. J., and Lawson, C. L. (2012). Outcome of the first electron microscopy validation
task force meeting. Structure, 20:205–214.

Hung, H., Wu, P.-S., Tu, I.-P., and Huang, S.-Y. (2012). On multilinear principal component
analysis of order-two tensors. Biometrika, 99:569–583.

Kimanius, D., Forsberg, B. O., Scheres, S. H., and Lindahl, E. (2016). Accelerated cryo-EM
structure determination with parallelisation using GPUs in RELION-2. Elife, 5:e18722.

Lai, T. L., Xu, H., Zhu, M. H., and Chan, H. P. (2020). MCMC with sequential substitutions
for joint state and parameter estimation in hidden Markov models. Technical Report,
Department of Statistics, Stanford University.

Liao, M., Cao, E., Julius, D., and Cheng, Y. (2013). Structure of the TRPV1 ion channel
determined by electron cryo-microscopy. Nature, 504:107–112.

Scheres, S. H. (2012a). A Bayesian view on cryo-EM structure determination. Journal of
Molecular Biology, 415:406–418.

Scheres, S. H. (2012b). RELION: Implementation of a Bayesian approach to cryo-EM structure
determination. Journal of Structural Biology, 180:519–530.

Scheres, S. H. (2016). Processing of structurally heterogeneous cryo-EM data in RELION.
Chapter 6 (pp. 125–157) in Methods in Enzymology, vol. 579. Elsevier.

Shiu, S. Y. and Chen, T. L. (2016). On the strengths of the self-updating process clustering
algorithm. Journal of Statistical Computation and Simulation, 86:1010–1031.
Sorzano, C. O. S., Bilbao-Castro, J. R., Shkolnisky, Y., Alcorlo, M., Melero, R., Caffarena-Fernandez,
G., Li, M., Xu, G., Marabini, R., and Carazo, J. M. (2010). A clustering approach
to multireference alignment of single-particle projections in electron microscopy. Journal of
Structural Biology, 171:197–206.

Stewart, A. and Grigorieff, N. (2004). Noise bias in the refinement of structures derived from
single particles. Ultramicroscopy, 102:67–84.

van Heel, M. (1984). Multivariate statistical classification of noisy images (randomly oriented
biological macromolecules). Ultramicroscopy, 13:165–183.

Wrapp, D., Wang, N., Corbett, K. S., Goldsmith, J. A., Hsieh, C. L., Abiona, O., Graham,
B. S., and McLellan, J. S. (2020). Cryo-EM structure of the 2019-nCoV spike in the prefusion
conformation. Science, 367:1260–1263.

Yan, C., Hang, J., Wan, R., Huang, M., Wong, C. C., and Shi, Y. (2015). Structure of a yeast
spliceosome at 3.6-angstrom resolution. Science, 349:1182–1191.

Yang, Z., Fang, J., Chittuluru, J., Asturias, F. J., and Penczek, P. A. (2012). Iterative stable
alignment and clustering of 2D transmission electron microscope images. Structure, 20:237–247.

Ye, J. (2005). Generalized low rank approximation of matrices. Machine Learning, 61:167–191.