
CRYO-EM: BREAKTHROUGHS IN CHEMISTRY, STRUCTURAL BIOLOGY, AND STATISTICAL UNDERPINNINGS

By

Tze Leung Lai Shao-Hsuan Wang

Yi-Ching Yao Szu-Chi Chung Wei-Hau Chang

I-Ping Tu

Technical Report No. 2020-14 November 2020

Department of Statistics STANFORD UNIVERSITY

Stanford, California 94305-4065

This research was supported in part by National Science Foundation grant DMS 1811818.

http://statistics.stanford.edu

Cryo-EM: Breakthroughs in Chemistry, Structural Biology, and Statistical Underpinnings

Tze Leung Lai^1, Shao-Hsuan Wang^2, Yi-Ching Yao^2, Szu-Chi Chung^2, Wei-Hau Chang^3, and I-Ping Tu^2

^1 Department of Statistics, Stanford University, USA

^2 Institute of Statistical Science, Academia Sinica, Taiwan

^3 Institute of Chemistry, Academia Sinica, Taiwan

August 10, 2020

Abstract

Cryogenic electron microscopy (cryo-EM) has revolutionized structural biology, organic and medicinal chemistry, molecular and cellular physiology, and its fundamental importance was recognized in the 2017 Nobel Prize in Chemistry. Herein we first review the statistical underpinnings of high-resolution 3D image reconstruction from 2D cryo-EM data that have three characteristic features: missing data, high noise level, and massive datasets. We then discuss challenges and opportunities for statistical science posed by high-resolution structure determination using cryo-EM, and review recent advances in high-dimensional multivariate analysis, dimension reduction, maximum a posteriori estimation of latent variables, hidden Markov models and uncertainty quantification in this connection.

Keywords: Cross correlation, cryo-EM, hidden Markov models, MCMC, model bias, multivariate analysis, regularized likelihood

1 Introduction

Cryogenic electron microscopy (cryo-EM) is an imaging technique that uses transmitted electron waves to obtain projection images of a biological sample. In contrast to X-ray crystallography, single particle cryo-EM does not need crystals and is therefore amenable to structural determination of proteins that are refractory to crystallization, including membrane proteins and yeast spliceosomes that exhibit dynamic patterns (Liao et al., 2013; Yan et al., 2015). This capability enables single particle cryo-EM to record structures in solution, and the single particle cryo-EM breakthroughs in high-resolution structure determination pioneered by Jacques Dubochet, Joachim Frank, and Richard Henderson were recognized by the Nobel Prize in Chemistry in 2017.

Although single particle cryo-EM analysis has become the mainstream method for solving high-resolution 3D structure density maps of biomolecules, cryo-EM images are extremely noisy and the signal-to-noise ratio is often less than 0.1. As a result, a typical cryo-EM experiment collects a large number of particle images (usually several hundred thousand or more) and compensates for the noise contamination by averaging. A cryo-EM particle image is often more than 100 pixels in each direction. The data characteristics of cryo-EM images thus include strong noise contamination, huge dimension and large sample size, making their processing and statistical analysis very challenging. Henderson (2013) pointed out how spurious patterns could easily emerge by averaging a large number of white-noise images aligned to a reference image through rotation and translation. In particular, using Einstein’s facial image as the reference, a blurred Einstein’s face emerged by averaging 1000 aligned white-noise images, which he dubbed “Einstein from noise”.

Section 2 explains the “Einstein from noise” phenomenon from the statistical perspective of multivariate analysis, extreme value theory, and variable selection bias in analyzing high-dimensional data. Section 3 begins with a review of how noisy cryo-EM images are analyzed to circumvent such bias and also to avoid overfitting, culminating in “near-atomic-resolution (i.e., 3-4 Å) structures for several icosahedral viruses and resolutions in the range of 4-6 Å for complexes with less or no symmetry” mentioned by Scheres (2012b) and explained in Section 3.2 below, where the background of 3-4 Å and 4-6 Å for near-atomic resolutions is also provided. It then discusses the statistical underpinnings of this work and subsequent developments in cryo-EM data collection and analysis in studies of molecular and cellular physiology or protein/RNA/virus/toxin/organometallic compound structure, and describes some recent advances in hidden Markov models that can pave the way toward a definitive solution. Section 4 describes the opportunities and challenges cryo-EM image analysis poses to statistical and computational sciences, and gives further discussion, additional background and concluding remarks.

2 A statistical perspective of “Einstein from noise”

“Einstein from noise” actually refers to the work of Stewart and Grigorieff (2004), who carried out a simulation experiment generating 1000 white-noise images and aligning each of them to Einstein’s facial image through rotation and translation. A blurred Einstein’s face emerged from averaging the 1000 aligned images. Henderson (2013) used this phenomenon to warn the community that an incorrect 3D density map could be constructed when data are blindly fitted to a model. In this section, we provide a statistical explanation and theory of this phenomenon. To simplify the presentation, we do not delve into the technical details of how rotating an image may destroy the pixel format; these details and how to address them will be given in Section 3. Instead we treat an image as a vector of dimension $p$ and a white-noise image as a random vector uniformly distributed on the $(p-1)$-dimensional unit sphere. The cross correlation of two images is defined as the inner product of the corresponding vectors. Section 2.1 describes the statistical model and presents a simulation study with $n = 10^6$ white-noise images, each with $p = 100 \times 100 = 10^4$ pixels. Among the one million white-noise images, the largest cross correlation value with Einstein’s facial image (the reference) is 0.048, and yet the cross correlation increases dramatically to 0.652 after averaging the 600 images that have the largest cross correlation values with Einstein’s facial image in the simulation study, which illustrates the essence of the “Einstein from noise” phenomenon. Section 2.2 connects this theory to extreme value theory and multivariate analysis.

2.1 White-noise Images and Image Selection Bias in Reference Alignment

Let $R$ be the reference matrix (the digital version of the reference image) of dimension $d_1 \times d_2$. We assume that $\|R\| = 1$, where $\|\cdot\|$ denotes the Frobenius norm of a matrix or the Euclidean norm of a vector. We generate $n$ independent and identically distributed (iid) white-noise images as follows. Let $Z_1, \ldots, Z_n$ be iid $d_1 \times d_2$ random matrices such that the $d_1 d_2$ components of each $Z_i$ are iid standard normal. We refer to $Z_i/\|Z_i\|$ (the normalized version of $Z_i$), $i = 1, \ldots, n$, as $n$ iid white-noise images. Let $r = \mathrm{vec}(R)$, the $p$-dimensional column vector which is the vectorized version of $R$, where $p = d_1 d_2$. The fact that $\|r\| = 1$ implies $r \in S^{p-1}$ (the $(p-1)$-dimensional unit sphere). Let $X_i = \mathrm{vec}(Z_i)/\|Z_i\|$. Thus, $X_1, \ldots, X_n$ are iid uniformly distributed on $S^{p-1}$. We refer to both $Z_i/\|Z_i\|$ and $X_i$ as the $i$-th white-noise image. The cross correlation of $X_i$ and $r$ (or equivalently of $Z_i/\|Z_i\|$ and $R$) is defined as $r^\top X_i$ (the inner product of $X_i$ and $r$), where $r^\top$ denotes the transpose of $r$. Note that $r^\top X_i = \cos\Theta_i$, where $\Theta_i$ is the angle between $r$ and $X_i$. The $n$ white-noise images are ordered (and denoted by $X_{(1)}, \ldots, X_{(n)}$) according to their cross correlation values with $r$. In other words, $(X_{(1)}, \ldots, X_{(n)})$ is a permutation of $(X_1, \ldots, X_n)$ such that $r^\top X_{(1)} \ge r^\top X_{(2)} \ge \cdots \ge r^\top X_{(n)}$. Let $\Theta_{1:n} \le \Theta_{2:n} \le \cdots \le \Theta_{n:n}$ be the order statistics of the angles $(\Theta_1, \ldots, \Theta_n)$, so that $\cos\Theta_{i:n} = r^\top X_{(i)}$, $i = 1, \ldots, n$. Let $\bar{X}_m = m^{-1}\sum_{i=1}^m X_{(i)}$. Then $\bar{X}_m/\|\bar{X}_m\| \in S^{p-1}$ is the normalized average of the $m$ white-noise images that are most highly cross-correlated with the reference image. Our goal is to find a good approximation of the distribution of $\rho_{n,p,m} = r^\top \bar{X}_m/\|\bar{X}_m\|$ when $n$, $p$, and $m$ are large. Note that for $m = 1$, $\rho_{n,p,1} = r^\top X_{(1)} = \cos\Theta_{1:n}$ is the largest cross correlation value. Figures 1 and 2 summarize our simulation study.

Figure 1 consists of the reference image, which is Einstein’s face, and 6 images corresponding to $m^{-1}\sum_{i=1}^m Z_{(i)}/\|Z_{(i)}\|$ for $m = 100, \ldots, 600$, where $Z_{(i)}$ ($i = 1, \ldots, 6$) have the 6 largest cross correlation (CC) values (0.048, 0.046, 0.045, 0.044, 0.044, 0.044) with the reference image. Note that Einstein’s face clearly emerges at $m = 300, 400, 500, 600$, with different degrees of blurring, corresponding to CC values 0.540, 0.585, 0.623, 0.652. Figure 2 shows similar results with three different reference images ranging from a simple circle to a tree frog, indicating that the phenomenon of “Einstein from noise” is robust across various reference images. The cross correlation values in Figure 2 are about the same across different reference images. This can be explained by the fact that if $X$ is uniformly distributed on $S^{p-1}$, then the distribution of $r^\top X$ does not depend on $r$.
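The selection-and-averaging experiment described above can be reproduced at a smaller scale. The following is a minimal sketch, assuming NumPy; the function name `top_m_average_correlation`, the random reference vector, and the reduced sizes ($n = 20000$, $p = 400$, $m = 200$) are illustrative choices rather than the settings used for Figures 1 and 2. A random unit vector stands in for the reference image, which is harmless here because the distribution of $r^\top X$ does not depend on $r$.

```python
import numpy as np

def top_m_average_correlation(n, p, m, seed=0):
    """Return (largest single CC, CC of the normalized average of the top-m images)."""
    rng = np.random.default_rng(seed)
    # Hypothetical reference: a random unit vector replaces the reference image.
    r = rng.standard_normal(p)
    r /= np.linalg.norm(r)
    # White-noise images: iid Gaussian vectors normalized onto the unit sphere.
    Z = rng.standard_normal((n, p))
    X = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    cc = X @ r                              # cross correlations r^T X_i
    top = np.argsort(cc)[::-1][:m]          # indices of the m largest cross correlations
    x_bar = X[top].mean(axis=0)             # average of the selected images
    rho = float(r @ x_bar / np.linalg.norm(x_bar))
    return float(cc.max()), rho

largest_cc, rho = top_m_average_correlation(n=20000, p=400, m=200)
print(f"largest single CC = {largest_cc:.3f}, CC of top-200 average = {rho:.3f}")
```

No single noise image is strongly correlated with the reference, yet the average of the selected images is, which is the selection bias at work.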

Figure 1: Example with Einstein’s face as the reference image.

Figure 2: Reference images ranging from single circle to tree frog.

2.2 Extreme Value Theory and Multivariate Analysis

Recall that $\cos\Theta_{1:n}$ is the largest cross correlation, whose distribution can be approximated as follows (when $n$ and $p$ are large). Let

$$K_{n,p} = -\ln n + \frac{1}{2}\ln\ln n - \frac{1}{2}\ln\left(\frac{(2\ln n)/p}{1 - \exp\{-(2\ln n)/p\}}\right) + \frac{1}{2}\ln(4\pi).$$

Then

$$\frac{p-1}{2}\ln(1 - \cos^2\Theta_{1:n}) - K_{n,p} \xrightarrow{d} G \quad \text{uniformly as } \min\{n, p\} \to \infty, \tag{1}$$

where $\xrightarrow{d}$ denotes convergence in distribution and the cumulative distribution function of $G$ is given by $G(t) = 1 - e^{-e^{t}}$, $t \in \mathbb{R}$, which is known as the extreme value distribution of Gumbel type. Based on (1), for fixed $\alpha \in (0, 1)$ and for large $n$ and $p$, an approximate $100\alpha$-th quantile of the distribution of $\cos\Theta_{1:n}$ is

$$M_{n,p}(\alpha) = \sqrt{1 - \exp\{2(K_{n,p} + \ln\ln\alpha^{-1})/(p-1)\}}.$$

Recall that $\cos\Theta_{1:n} = 0.048$ in the simulation study summarized in Figure 1, where $n = 10^6$ and $p = 10^4$. This observed value is compatible with the approximate 25th, 50th, and 75th quantiles, i.e. $M_{10^6,10^4}(0.25) = 0.046$, $M_{10^6,10^4}(0.5) = 0.047$, $M_{10^6,10^4}(0.75) = 0.052$. Figure 3 plots $M_{n,p}(\alpha)$ versus $\log_{10} n$ for $n \le 10^{100}$ with $p = 10^4$ and $\alpha = .1, .5, .9$. Note that the three quantile curves are very close to each other, indicating that $\cos\Theta_{1:n}$ has a small standard deviation. Figure 3 suggests that for $P(\cos\Theta_{1:n} \ge 0.1)$ to be at least 0.1, $n$ is required to be greater than $10^{20}$, and for $P(\cos\Theta_{1:n} \ge 0.2)$ to be at least 0.1, $n$ is required to be greater than $10^{80}$. In other words, it is unlikely for any of $n$ iid white-noise images of dimension $100 \times 100$ to have a cross correlation value with Einstein’s face greater than 0.2 unless $n$ is astronomically large.
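To make the quantile approximation concrete, the sketch below evaluates $K_{n,p}$ and $M_{n,p}(\alpha)$ as written above using only the Python standard library. The function names `K` and `quantile_max_cc` are hypothetical, introduced here for illustration; the code is a direct transcription of the two formulas and not the authors' software, so its values may differ from the quoted ones by rounding.

```python
import math

def K(n, p):
    """Centering constant K_{n,p} from the Gumbel approximation (1)."""
    a = 2.0 * math.log(n) / p
    return (-math.log(n)
            + 0.5 * math.log(math.log(n))
            - 0.5 * math.log(a / (-math.expm1(-a)))   # a / (1 - exp(-a))
            + 0.5 * math.log(4.0 * math.pi))

def quantile_max_cc(n, p, alpha):
    """Approximate 100*alpha-th quantile M_{n,p}(alpha) of the largest cross correlation."""
    t = math.log(math.log(1.0 / alpha))               # ln ln(1/alpha)
    return math.sqrt(1.0 - math.exp(2.0 * (K(n, p) + t) / (p - 1)))

# Quartiles for the simulation setting n = 10^6, p = 10^4.
for alpha in (0.25, 0.5, 0.75):
    print(alpha, round(quantile_max_cc(10**6, 10**4, alpha), 4))

# How slowly the upper quantile grows with n (p fixed at 10^4).
for log10_n in (6, 20, 50, 80):
    print(log10_n, round(quantile_max_cc(10.0**log10_n, 10**4, 0.9), 4))
```

The second loop illustrates why an astronomically large $n$ is needed before any single white-noise image attains an appreciable cross correlation.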

By (1), when $n$ and $p$ are large, $\ln(1 - \cos^2\Theta_{1:n}) = -2p^{-1}(\ln n)(1 + o_p(1))$, hence

$$\cos\Theta_{1:n} = \sqrt{\frac{2\ln n}{p}}\,(1 + o_p(1)), \tag{2}$$

if $(\ln n)/p = o(1)$. Thus, under the condition $(\ln n)/p = o(1)$, with high probability, the $n$ iid white-noise images all have negligible cross correlation values with the reference.

On the other hand, the cross correlation $\rho_{n,p,m}$ of $\bar{X}_m/\|\bar{X}_m\|$ and $r$ may be significantly greater than 0 if $m = m_n$ grows with $n$. We now sketch the derivation of a crude approximation of $\rho_{n,p,m}$ when $p = p_n$ and $m = m_n$ both grow with $n$. Since each $X_i$ is uniformly distributed on $S^{p-1}$, the distribution of $r^\top X_i$ does not depend on $r$. Therefore, the distribution of $\rho_{n,p,m} = r^\top \bar{X}_m/\|\bar{X}_m\|$ does not depend on $r$. In this subsection, for convenience we take $r = (1, 0, \ldots, 0)^\top$ as the reference vector. We begin by decomposing each $X_{(i)}$ into two components, one parallel to $r$ and the other orthogonal to $r$, namely,

$$X_{(i)} = (r^\top X_{(i)})\,r + X_{(i)}^{\perp} = (\cos\Theta_{i:n})\,r + X_{(i)}^{\perp},$$

where $X_{(i)}^{\perp} = X_{(i)} - (\cos\Theta_{i:n})\,r$ denotes the component of $X_{(i)}$ orthogonal to $r$. (See Figure 4, in which the black vector is the sorted $i$th white-noise vector $X_{(i)} \in S^{p-1}$, which is decomposed into the blue vector parallel to the reference vector shown in orange and the red vector lying in the subspace orthogonal to $r$; note that the cross correlation $\cos\Theta_{i:n}$ is the inner product of the black vector and the orange vector.)

Figure 3: Approximate $100\alpha$-th quantile ($\alpha = .1, .5, .9$) of $\cos\Theta_{1:n}$ versus $\log_{10} n$.

Figure 4: Relationship between the reference vector and a white-noise vector.

Therefore,

$$\bar{X}_m = \frac{1}{m}\sum_{i=1}^m X_{(i)} = \left(\frac{1}{m}\sum_{i=1}^m \cos\Theta_{i:n}\right) r + \frac{1}{m}\sum_{i=1}^m X_{(i)}^{\perp}. \tag{3}$$

Note that under the condition $(\ln n)/p = o(1)$, (2) implies that $\cos\Theta_{1:n}$ is approximately $\sqrt{(2\ln n)/p}$. In addition to that condition, if $m = m_n$ also grows at a slow rate, then $\frac{1}{m}\sum_{i=1}^m \cos\Theta_{i:n}$ will only be slightly less than $\cos\Theta_{1:n}$, so that

$$\frac{1}{m}\sum_{i=1}^m \cos\Theta_{i:n} = \sqrt{\frac{2\ln n}{p}}\,(1 + o_p(1)).$$

On the other hand,

$$1 \ge \|X_{(i)}^{\perp}\| = \sqrt{1 - \cos^2\Theta_{i:n}} \ge \sqrt{1 - \cos^2\Theta_{1:n}} \approx \sqrt{1 - \frac{2\ln n}{p}},$$

showing that $\|X_{(i)}^{\perp}\|$ is nearly equal to 1 under the specified condition. Furthermore, it is readily seen that the (normalized) orthogonal components $X_{(i)}^{\perp}/\|X_{(i)}^{\perp}\|$, $i = 1, \ldots, n$, are iid uniformly distributed on the $(p-2)$-dimensional unit sphere $\{(0, x_2, \ldots, x_p) : x_2^2 + \cdots + x_p^2 = 1\}$, which implies that if $m = m_n$ grows slowly and is much smaller than $p = p_n$, then with high probability, $X_{(i)}^{\perp}/\|X_{(i)}^{\perp}\|$, $i = 1, \ldots, m$ (and hence $X_{(i)}^{\perp}$, $i = 1, \ldots, m$) are nearly orthogonal to each other. It then follows that the length of $\sum_{i=1}^m X_{(i)}^{\perp}$ is approximately $\sqrt{m}$. Combining all these arguments together with (3) yields that with high probability

$$\rho_{n,p,m}^2 = \left(r^\top \bar{X}_m/\|\bar{X}_m\|\right)^2 = \frac{\left(\frac{1}{m}\sum_{i=1}^m \cos\Theta_{i:n}\right)^2}{\left(\frac{1}{m}\sum_{i=1}^m \cos\Theta_{i:n}\right)^2 + \left\|\frac{1}{m}\sum_{i=1}^m X_{(i)}^{\perp}\right\|^2} \approx \frac{\frac{2\ln n}{p}}{\frac{2\ln n}{p} + \frac{1}{m}} = \frac{\frac{2m}{p}\ln n}{1 + \frac{2m}{p}\ln n}.$$

More precisely, when $n$, $p$ and $m$ are all large, we have the following asymptotic results. If $p = p_n$ satisfies $(\ln n)^2/p = o(1)$ and $m = m_n$ satisfies $m/n = o(1)$, then

$$\rho_{n,p,m}^2 = \frac{\beta_{n,p,m}}{1 + \beta_{n,p,m}} + o_p(1), \tag{4}$$

where

$$\beta_{n,p,m} = \frac{m}{p}\left\{2\ln\frac{n}{m} - \ln\ln\frac{n}{m} - \ln(4\pi) + 2\right\}$$

is a model bias index. If $m = m_n$ satisfies the stronger condition $m(\ln\ln n)^4/(\ln n)^2 = o(1)$, then

$$\frac{p\,(1 + \beta_{n,p,m})^2}{\left(8m + 2p\,\beta_{n,p,m}^2\right)^{1/2}}\left(\rho_{n,p,m}^2 - \frac{\beta_{n,p,m}}{1 + \beta_{n,p,m}}\right) \xrightarrow{d} N(0, 1). \tag{5}$$
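As a numerical check of (4), the sketch below evaluates the model bias index $\beta_{n,p,m}$ and the implied cross correlation $\sqrt{\beta_{n,p,m}/(1+\beta_{n,p,m})}$ for the simulation setting of Section 2.1 ($n = 10^6$, $p = 10^4$). The function names `model_bias_index` and `predicted_cc` are hypothetical, and the $o_p(1)$ term in (4) is simply dropped.

```python
import math

def model_bias_index(n, p, m):
    """Model bias index beta_{n,p,m} appearing in (4)."""
    log_ratio = math.log(n / m)
    return (m / p) * (2.0 * log_ratio - math.log(log_ratio)
                      - math.log(4.0 * math.pi) + 2.0)

def predicted_cc(n, p, m):
    """Leading-order approximation sqrt(beta / (1 + beta)) of rho_{n,p,m}."""
    beta = model_bias_index(n, p, m)
    return math.sqrt(beta / (1.0 + beta))

# Compare with the simulated CC values 0.540, 0.585, 0.623, 0.652 reported for Figure 1.
for m in (300, 400, 500, 600):
    print(m, round(predicted_cc(10**6, 10**4, m), 3))
```

The predicted values agree with the simulated cross correlations to about two decimal places, which is how closely the leading-order term in (4) tracks the simulation in this setting.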

3 Statistical advances for cryo-EM image analysis

We begin this section with a review and discussion of the statistical underpinnings of cryo-EM image analysis. Noting that “a major goal of structural biology is to provide mechanistic understanding of critical biological processes” and that “the most detailed insights come from atomic structures of macromolecules and complexes in these processes in relevant functional states”, Cheng (2015) says that until the breakthroughs in cryo-EM image analysis “only a few years ago”, the routine method for studies of these atomic structures was X-ray crystallography, which “completely depends on growth of well-ordered 3D crystals” where production “is a major bottleneck for challenging targets”. Because of “recent breakthroughs in hardware and software”, which Cheng et al. (2015) review, cryo-EM has emerged as a technique for determining structures at atomic resolution comparable to the crystallographic approaches, and has also determined “a number of structures of proteins and complexes that have vexed crystallographers”. Kimanius et al. (2016) point out that despite such advances, two technological factors still limit wide applicability of cryo-EM. One is limited access to high-end microscopes, and the other is that data processing and analysis require computational resources that are not directly accessible to many labs. They say that “to extract fine structural details, one needs to average over multiple images of identical complexes to cancel noise sufficiently.” They then explain: “This is achieved by isolating two-dimensional particle-projections in the micrographs, which can then be recombined into a three-dimensional structure (Cheng et al., 2015). The latter requires the estimation of the relative orientation of all particles, which can be done by a wide range of different image processing programs” that have to address the issue that “any one data set typically comprises images of multiple different structures”, including multiple conformations and data heterogeneity due to contaminants and other sources, and they note that “the classification of heterogeneous data into homogeneous subsets has therefore proven critical for high resolution structure determination and provides a tool for structural analysis of dynamic systems”. In Section 3.1, we consider some recent developments in statistical science for identifying structurally homogeneous subsets from these heterogeneous cryo-EM data. Kimanius et al. (2016) also note that an increasingly popular choice in their list of image processing programs is RELION (Scheres, 2012b), which uses an empirical Bayes (EB) approach to single-particle analysis (Scheres, 2012a). Section 3.2 reviews this approach and recent developments for its implementation. Section 3.3 describes recent breakthroughs in the more general problem of adaptive filtering in hidden Markov models and how they can be used to develop a definitive alternative to the (somewhat circuitous) EM approach.

3.1 2D Clustering Methods for Cryo-EM Images

An important step in the workflow of cryo-EM image analysis is 2D clustering to identify clean particle image sets, after these images have gone through the multi-reference alignment (MRA) and contrast transfer function steps of the workflow, which will be described in Section 3.2. The K-means approach was first used by van Heel (1984) in an MRA method for clustering randomly oriented biological macromolecules. Sorzano et al. (2010) proposed a K-means algorithm, CL2D, for implementing their clustering approach to MRA of 2D projections in cryo-EM. For a given number $m$ of clusters, CL2D iteratively bisects the data until at least $m$ clusters appear. During the process, CL2D dismisses the clusters whose size is smaller than a pre-specified number such as 30 and bisects the largest cluster once a dismissal is executed. Another novelty of CL2D is that it uses a kernel-based entropy measure, correntropy, as a distance measure that mitigates the impact of outliers. When $m$ is large, as in the case of cryo-EM, setting good initial values is important for CL2D to avoid being stuck at local minima. However, Yang et al. (2012) have pointed out the difficulty of finding good initial values and have found the K-means approach to be unsatisfactory. Moreover, it is difficult to prespecify a manageable (not too large) number of clusters for the particle images.

Noting that two major approaches have been adopted in “the vast number of clustering algorithms developed”, namely, model-based and distance-based approaches, Chen et al. (2014) have proposed to combine both approaches into a clustering algorithm called γ-SUP for 2D clustering of cryo-EM images. A model-based approach models the data as a mixture of parametric distributions with the mean estimates used as cluster centers, whereas a distance-based approach uses some distance measure of the similarity between data points, as in the K-means method, hierarchical clustering, and the SUP (self-updating process) method of Shiu and Chen (2016). The γ-SUP algorithm models the data with a q-Gaussian mixture (model-based approach) and uses the γ-divergence (to measure the similarity between the empirical distribution and the model distribution) in SUP (distance-based approach). The q-Gaussian mixture model has a density function of the form

$$f(y) = \sum_{j=1}^m p_j\, g_q(y; \mu_j, \sigma), \quad y \in \mathbb{R}^p, \tag{6}$$

where $g_q(\cdot; \mu, \sigma)$ is the q-Gaussian density function

$$g_q(y; \mu, \sigma) = (\sqrt{2\pi}\,\sigma)^{-p}\, c_{p,q}\, \exp_q\!\left(-\|y - \mu\|^2/(2\sigma^2)\right), \tag{7}$$

with $q < 1$, the q-exponential function $\exp_q(u) = (1 + (1-q)u)_+^{1/(1-q)}$ and normalizing constant $c_{p,q} = (1-q)^{p/2}\,\Gamma(1 + p/2 + (1-q)^{-1})/\Gamma(1 + (1-q)^{-1})$. These distributions have compact support in $\mathbb{R}^p$. Letting $q \to 1$ yields the multivariate normal density with mean vector $\mu$ and covariance matrix $\sigma^2 I_p$.
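The properties just stated (compact support, and the Gaussian limit as $q \to 1$) can be checked numerically. The sketch below is an illustration assuming NumPy; the function names `exp_q` and `q_gaussian_density` are hypothetical and implement (7) exactly as written above, not code from Chen et al. (2014).

```python
import numpy as np
from math import lgamma, log, pi, sqrt

def exp_q(u, q):
    """q-exponential (1 + (1 - q) u)_+^{1/(1 - q)} for q < 1."""
    base = np.maximum(1.0 + (1.0 - q) * np.asarray(u, dtype=float), 0.0)
    return base ** (1.0 / (1.0 - q))

def q_gaussian_density(y, mu, sigma, q):
    """q-Gaussian density (7); the last axis of y indexes the p coordinates."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    p = y.shape[-1]
    # log of the normalizing constant c_{p,q}, computed with log-gamma for stability
    log_c = (0.5 * p * log(1.0 - q)
             + lgamma(1.0 + p / 2.0 + 1.0 / (1.0 - q))
             - lgamma(1.0 + 1.0 / (1.0 - q)))
    log_norm = -p * log(sqrt(2.0 * pi) * sigma) + log_c
    d2 = np.sum((y - mu) ** 2, axis=-1)
    return np.exp(log_norm) * exp_q(-d2 / (2.0 * sigma ** 2), q)

# Compact support: with q = 0.5 and sigma = 1 the density vanishes once ||y - mu|| >= 2.
print(q_gaussian_density([[1.0], [2.5]], [0.0], 1.0, 0.5))

# Gaussian limit: exp_q(u) approaches exp(u) as q -> 1.
print(exp_q(-0.5, 0.999999), np.exp(-0.5))
```

The support radius is $\sigma\sqrt{2/(1-q)}$, so points farther than this from $\mu$ receive exactly zero density, which is what lets the method discard outliers.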

Chen et al. (2014) note in their Section 3.1 that instead of working with the mixture density $f$ that requires specification of the number $m$ of mixture components, it is more practical to fit each component separately through the optimization problem $\min_{\mu_j, \sigma} D_\gamma(F^* \,\|\, g_q(\cdot; \mu_j, \sigma))$, where $F^*$ is the actual (unknown) distribution generating the data and $D_\gamma(\cdot \,\|\, \cdot)$ is the γ-divergence (to be defined below) with $\gamma > 1 - q$. For given $\sigma$, the minimizer $\mu_j^*$ is given by the solution of the equation $\mu_j^* = \int y\, w(y; \mu_j^*, \sigma)\, dF^*(y) \big/ \int w(y; \mu_j^*, \sigma)\, dF^*(y)$. Hence replacing $F^*$ by the empirical distribution $\hat{F}$ of $\{Y_i, 1 \le i \le n\}$ leads to the following recursive implementation of $\mu_j^*$:

$$\mu_j^{(\ell+1)} = \frac{\int y\, w(y; \mu_j^{(\ell)}, \sigma)\, d\hat{F}(y)}{\int w(y; \mu_j^{(\ell)}, \sigma)\, d\hat{F}(y)}, \quad \ell = 0, 1, \ldots \tag{8}$$

Section 3.2 of Chen et al. (2014) then uses the SUP algorithm of Shiu and Chen (2016) to replace $\hat{F}$ in (8) by the empirical distribution $\hat{F}^{(\ell)}$ of $\{\mu_i^{(\ell)}, 1 \le i \le n\}$, thereby leading to the γ-SUP recursion

$$\mu_j^{(\ell+1)} = \frac{\int y\, w(y; \mu_j^{(\ell)}, \sigma)\, d\hat{F}^{(\ell)}(y)}{\int w(y; \mu_j^{(\ell)}, \sigma)\, d\hat{F}^{(\ell)}(y)} = \frac{\sum_{i=1}^n w_{ij}^{(\ell)}\, \mu_i^{(\ell)}}{\sum_{i=1}^n w_{ij}^{(\ell)}}, \quad \ell = 0, 1, \ldots \tag{9}$$

The explicit formulas for the weight function $w$ and weights $w_{ij}^{(\ell)}$ are derived in Sections 2.3 and 3.2 of Chen et al. (2014) by using (a) the γ-divergence $D_\gamma(F \,\|\, g_q(\cdot; \mu, \sigma)) = C_\gamma(F \,\|\, g_q(\cdot; \mu, \sigma)) - C_\gamma(F \,\|\, F)$, where $C_\gamma(\cdot \,\|\, \cdot)$ is the “γ cross-entropy” defined by $C_\gamma(F \,\|\, g) = -\int g^\gamma(y)\, dF(y) \big/ [\gamma(\gamma+1)\|g\|_{\gamma+1}^{\gamma}]$, in which $\|g\|_{\gamma+1}$ is a normalizing constant, and (b) an equivalent and more tractable problem of maximizing $\int [\exp_q(-\|y - \mu\|^2/(2\sigma^2))]^\gamma\, dF(y)$, culminating in the formula

$$w_{ij}^{(\ell)} = \exp_{1-s}\!\left(-\left\|(\mu_j^{(\ell)} - \mu_i^{(\ell)})/\tau\right\|^2\right), \tag{10}$$

where $\tau = \sqrt{2}\,\sigma/\sqrt{\gamma - (1-q)} > 0$, $s = (1-q)/\{\gamma - (1-q)\} > 0$, $\mu^{(\ell)}$ is defined recursively by (9), and $\|x - y\|$ is the Euclidean distance between $x, y \in \mathbb{R}^p$. Chen et al. (2014, p. 269) note that in view of (10), γ-SUP starts with $n$ (scaled) “cluster centers” $Y_i/\tau$, $1 \le i \le n$, thereby circumventing the need to initialize with random centers as in other methods; moreover, these nonnegative and decreasing weights (with respect to the Euclidean distance $\|\mu_i^{(\ell)} - \mu_j^{(\ell)}\|$) and the compact support of the q-Gaussian distribution ensure the convergence of γ-SUP. Hence, eventually “γ-SUP converges to certain K clusters, where K depends on the tuning parameters (s, τ) but otherwise is data-driven.” Another advantage of γ-SUP is that $\sigma$ is absorbed in the tuning parameter $\tau$, hence selection of $\tau$ obviates the need to select $\sigma$. Chen et al. (2014, p. 277) note that “when τ reaches a critical value, the images in the same (eventual) cluster can start with attracting each other and will finally merge”, in contrast to the case of small $\tau$ for which each particle image forms a cluster, hence a phase transition diagram illustrated in their Figure 5 can be used to determine $\tau$. After choosing $\tau$ in this way, $s$ can be chosen by minimizing the impurity, or c-impurity, performance measure introduced in their Section 4.2. After presenting simulation studies of γ-SUP and real cryo-EM data on E. coli 70S ribosome, their conclusion section says: “Characteristically, sets of cryo-EM images have low signal-to-noise ratio, many of which are misaligned and should be treated as outliers, and which form a large number of clusters due to their free orientations. Because of its capability to identify outliers, γ-SUP can separate out the misaligned images and create the possibility for further correcting them”, which provides a much smaller number of clusters with larger cluster sizes, and cluster averages comparable to those of a 2D projection of the ribosome 3D structure obtained by X-ray crystallography.
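The recursion (9) with weights (10) is short enough to sketch directly. The code below is a minimal illustration assuming NumPy, not the implementation of Chen et al. (2014): the function names `gamma_sup` and `distinct_centers`, the stopping rule, the synthetic two-cluster data, and the values of $(s, \tau)$ are all assumptions made for the example. It uses $\exp_{1-s}(u) = (1 + s u)_+^{1/s}$, so the weight between two centers is exactly zero once their distance exceeds $\tau/\sqrt{s}$.

```python
import numpy as np

def gamma_sup(Y, s, tau, n_iter=100, tol=1e-10):
    """Iterate the gamma-SUP recursion (9), starting with every data point as a center."""
    mu = np.array(Y, dtype=float)
    for _ in range(n_iter):
        diff = mu[:, None, :] - mu[None, :, :]
        d2 = np.sum(diff ** 2, axis=-1) / tau ** 2           # ||(mu_j - mu_i)/tau||^2
        W = np.maximum(1.0 - s * d2, 0.0) ** (1.0 / s)       # weights (10), symmetric in (i, j)
        new_mu = (W @ mu) / W.sum(axis=1, keepdims=True)     # weighted average of current centers
        converged = np.max(np.abs(new_mu - mu)) < tol
        mu = new_mu
        if converged:
            break
    return mu

def distinct_centers(mu, tol=1e-6):
    """Collapse numerically identical centers into a list of distinct ones."""
    centers = []
    for x in mu:
        if all(np.linalg.norm(x - c) > tol for c in centers):
            centers.append(x)
    return centers

# Synthetic data: two well-separated groups of 30 points each in the plane.
rng = np.random.default_rng(0)
Y = np.vstack([rng.normal([0, 0], 0.5, size=(30, 2)),
               rng.normal([10, 0], 0.5, size=(30, 2))])
mu = gamma_sup(Y, s=0.5, tau=np.sqrt(12.5))                  # zero weight beyond distance 5
centers = distinct_centers(mu)
print(len(centers), np.round(centers, 2))
```

No number of clusters and no random initial centers are supplied: the two clusters emerge from the data once $(s, \tau)$ are fixed, which is the data-driven behavior described above.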

Chung et al. (2020) recently provided further analysis of the benchmark cryo-EM datasets 70S Ribosome and 80S Ribosome, together with a novel two-stage dimension reduction procedure (2SDR). A ribosome is a complex of RNAs (ribonucleic acids) and proteins that is present in all living cells and performs protein synthesis by linking amino acids together in the order specified by the codons of mRNA (messenger RNA) molecules to form polypeptide chains. A 70S Ribosome comprises a large 50S subunit and a small 30S subunit; the S stands for Svedberg, a unit of time equal to $10^{-13}$ seconds, measuring how fast molecules move in a centrifuge. Eukaryotic ribosomes are also known as 80S Ribosomes and have a large 60S subunit and a small 40S subunit. In the past decade, the linear subspace model that represents the protein motion using the eigenvolumes from the covariance matrix of 3D structures has been an active research area, as will be discussed further in Section 4. In all these approaches, PCA (principal component analysis) plays an important role in estimating the top eigenvolumes.

Let $X, X_1, \ldots, X_n$ be i.i.d. $p \times q$ random matrices. Let $y = \mathrm{vec}(X)$ and $y_i = \mathrm{vec}(X_i)$, where vec is the operator that vectorizes a matrix by stacking its columns. The statistical model for PCA is $y = \mu + \Gamma\nu + \varepsilon$, where $\mu$ is the mean, $\nu \in \mathbb{R}^r$ with $r \le pq$, $\Gamma$ is a $pq \times r$ matrix with orthonormal columns, and $\varepsilon$ is independent of $\nu$ with $E(\varepsilon) = 0$ and $\mathrm{Cov}(\varepsilon) = c\, I_{pq}$. The zero-mean vector $\nu$ has covariance matrix $\Delta = \mathrm{diag}(\delta_1, \delta_2, \ldots, \delta_r)$ with $\delta_1 \ge \delta_2 \ge \cdots \ge \delta_r > 0$. The estimate $\hat{\Gamma}$ contains the first $r$ eigenvectors of the sample covariance matrix $S_n = n^{-1}\sum_{i=1}^n (y_i - \bar{y})(y_i - \bar{y})^\top$, and $\mathrm{vec}(\bar{X}) + \hat{\Gamma}\hat{\nu}_i$ provides a reconstruction of the noisy data $\mathrm{vec}(X_i)$. The computational cost, which increases with both the sample size $n$ and the dimension $pq$, becomes overwhelming for high-dimensional data. An alternative to matrix vectorization is MPCA (multilinear PCA), developed by Ye (2005) and Hung et al. (2012), which models a $p \times q$ random matrix $X$ by

$$X = Z + \varepsilon \in \mathbb{R}^{p \times q}, \quad Z = M + A U B^\top, \tag{11}$$

where $M \in \mathbb{R}^{p \times q}$ is the mean, $U \in \mathbb{R}^{p_0 \times q_0}$ is a random matrix with $p_0 \le p$, $q_0 \le q$, $A$ and $B$ are non-random $p \times p_0$ and $q \times q_0$ matrices with orthonormal column vectors, and $\varepsilon$ is a zero-mean random matrix independent of $U$ such that $\mathrm{Cov}(\mathrm{vec}(\varepsilon)) = \sigma^2 I_{pq}$. Ye (2005) proposed to use generalized low-rank approximations of matrices to estimate $A$ and $B$. Given $(p_0, q_0)$, $\hat{A}$ consists of the leading $p_0$ eigenvectors of the covariance matrix $\sum_{i=1}^n (X_i - \bar{X}) P_{\hat{B}} (X_i - \bar{X})^\top$, and $\hat{B}$ consists of the leading $q_0$ eigenvectors of $\sum_{i=1}^n (X_i - \bar{X})^\top P_{\hat{A}} (X_i - \bar{X})$, where the matrix $P_{\hat{A}} = \hat{A}\hat{A}^\top$ (respectively, $P_{\hat{B}} = \hat{B}\hat{B}^\top$) is the projection operator onto the span of the column vectors of $\hat{A}$ (respectively, $\hat{B}$). The estimates can be computed by an iterative procedure that usually takes no more than 10 iterations to converge. Replacing $A$ and $B$ by their estimates $\hat{A}$ and $\hat{B}$ in (11) yields

$$\hat{U}_i = \hat{A}^\top (X_i - \bar{X}) \hat{B}, \quad \text{hence} \quad \hat{A}\hat{U}_i\hat{B}^\top = P_{\hat{A}} (X_i - \bar{X}) P_{\hat{B}}, \tag{12}$$

i.e., $\mathrm{vec}(\hat{A}\hat{U}_i\hat{B}^\top) = P_{\hat{B} \otimes \hat{A}}\, \mathrm{vec}(X_i - \bar{X})$, where $\otimes$ denotes the Kronecker product and $P_{\hat{B} \otimes \hat{A}} = (\hat{B}\hat{B}^\top) \otimes (\hat{A}\hat{A}^\top)$. Chung et al. (2020) propose a new model, called hybrid PCA and denoted by $\mathrm{H}_{\mathrm{M}}\mathrm{PCA}$, in which the subscript M stands for MPCA and H stands for “hybrid” of MPCA and PCA. Specifically, $\mathrm{H}_{\mathrm{M}}\mathrm{PCA}$ assumes the MPCA model (11) with reduced rank $(p_0, q_0)$ via the $p_0 \times q_0$ random matrix $U$ and then assumes a rank-$r$ model, with $r \le p_0 q_0$, for $\mathrm{vec}(U)$ to which a zero-mean random error $\epsilon$ with $\mathrm{Cov}(\epsilon) = c\, I_{p_0 q_0}$ is added. This leads to dimension reduction of $\mathrm{vec}(X - M - \varepsilon) = \mathrm{vec}(A_{p_0} U B_{q_0}^\top)$ from $p_0 q_0$ to $r$. Since $U = A_{p_0}^\top (X - M) B_{q_0}$ in view of (12), $\mathrm{vec}(A_{p_0} U B_{q_0}^\top) = P_{B_{q_0} \otimes A_{p_0}} \mathrm{vec}(X - M - \varepsilon)$ is the projection of $X - M - \varepsilon$ into $\mathrm{span}(B_{q_0} \otimes A_{p_0})$, which has dimension $r$ after this further rank reduction. The actual ranks, which we denote by $(p_0^*, q_0^*)$ and $r^*$, are unknown, as are the other parameters of the $\mathrm{H}_{\mathrm{M}}\mathrm{PCA}$ model, and 2SDR uses a sample of size $n$ to fit the model and estimate the ranks.
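The alternating eigenvector computation behind (11) and (12) can be sketched as follows. This is a minimal illustration assuming NumPy and fixed, known ranks $(p_0, q_0)$; the function names `top_eigvecs` and `mpca`, the identity-based starting value for $\hat{B}$, and the synthetic data are assumptions made for the example. It covers neither the rank selection criterion nor the second (PCA) stage of 2SDR.

```python
import numpy as np

def top_eigvecs(S, k):
    """Leading k eigenvectors of a symmetric matrix S (eigh returns ascending order)."""
    _, V = np.linalg.eigh(S)
    return V[:, ::-1][:, :k]

def mpca(X, p0, q0, n_iter=10):
    """Alternating estimation of A (p x p0) and B (q x q0) for images X of shape (n, p, q)."""
    M = X.mean(axis=0)
    Xc = X - M                                        # centered images X_i - X_bar
    q = Xc.shape[2]
    B = np.eye(q)[:, :q0]                             # simple starting value (an assumption)
    for _ in range(n_iter):
        XB = Xc @ B                                   # (n, p, q0)
        # sum_i (X_i - X_bar) P_B (X_i - X_bar)^T, using P_B = B B^T
        A = top_eigvecs(np.einsum('nik,njk->ij', XB, XB), p0)
        XtA = np.transpose(Xc, (0, 2, 1)) @ A         # (n, q, p0)
        # sum_i (X_i - X_bar)^T P_A (X_i - X_bar), using P_A = A A^T
        B = top_eigvecs(np.einsum('nik,njk->ij', XtA, XtA), q0)
    U = A.T @ Xc @ B                                  # core matrices U_i of (12), shape (n, p0, q0)
    return A, B, U, M

# Synthetic check: low-rank matrix signal plus small noise.
rng = np.random.default_rng(0)
n, p, q, p0, q0 = 200, 12, 10, 3, 2
A0, _ = np.linalg.qr(rng.standard_normal((p, p0)))
B0, _ = np.linalg.qr(rng.standard_normal((q, q0)))
signal = A0 @ rng.standard_normal((n, p0, q0)) @ B0.T
X = signal + 0.01 * rng.standard_normal((n, p, q))

A, B, U, M = mpca(X, p0, q0)
recon = M + A @ U @ B.T                               # M + P_A (X_i - X_bar) P_B
rel_err = np.linalg.norm(recon - signal) / np.linalg.norm(signal)
print(f"relative reconstruction error: {rel_err:.4f}")
```

Each iteration requires only eigen-decompositions of a $p \times p$ and a $q \times q$ matrix, rather than of the $pq \times pq$ covariance matrix needed after vectorization, which is the source of the computational savings discussed in the text.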

The first stage of 2SDR uses (11) to model a noisy image $X$ as a matrix. Ye’s estimates $\hat{A}$ and $\hat{B}$ described in the preceding paragraph depend on the given values of the rank pair $(p_0, q_0)$. Chung et al. (2020) introduce a rank selection criterion to choose the rank pair and show that its minimizer $(\hat{p}_0, \hat{q}_0)$ is a consistent estimator (as $n \to \infty$) of the true value $(p_0^*, q_0^*)$ of the rank pair. The second stage of 2SDR performs PCA on the covariance matrix $n^{-1}\sum_{i=1}^n \mathrm{vec}(\hat{U}_i)\mathrm{vec}(\hat{U}_i)^\top$ to obtain its ordered eigenvalues, which are used in a “generalized information criterion” (GIC) to choose the rank $r \le \hat{p}_0\hat{q}_0$; minimization of GIC over $r \le \hat{p}_0\hat{q}_0$ yields the estimator $\hat{r}_{\mathrm{GIC}}$, which is then shown to be a consistent estimator of the true rank $r^*$ under certain sparsity conditions. The analysis of 70S Ribosome benchmark data in Section 3 of Chung et al. (2020) shows that 2SDR can improve 2D image clustering to curate the clean particles and 3D classification to separate various conformations, and enhance the performance of γ-SUP via dimension reduction. The 80S Ribosome dataset contains 105,247 particle images of $360 \times 360$ pixels, for which “many current PCA implementation fail to solve the complete set of eigenvectors” whereas 2SDR reduces the computational complexity “by several orders of magnitude”, thereby circumventing the prohibitive computational overhead of vectorization for dimension reduction via PCA. To illustrate, the top row of Figure 5 shows nine images randomly selected from a dataset that contains 5090 Betagal particle images of $256 \times 256$ pixels. The next five rows of Figure 5 show their reconstructions by PCA, MPCA with $(p_0, q_0) = (19, 19)$, 2SDR with $(p_0, q_0) = (19, 19)$ and $\hat{r}_{\mathrm{GIC}} = 33$, Wavelet and BM3D, respectively. “Wavelet” refers to adaptive wavelet thresholding for image denoising and compression introduced by Chang et al. (2000), who proposed to use (a) matrix representation of 2D noisy images, (b) wavelet thresholding to minimize a Bayes risk assuming a generalized Gaussian model for the signal and a Gaussian model for the noise, and (c) the minimum description length criterion for choosing quantization levels and binwidths. “BM3D” refers to block-matching and 3D filtering introduced by Danielyan et al. (2011), who proposed to use (i) a block-matching algorithm to group image fragments (which may not be disjoint but have the same size), (ii) hard thresholding to filter the 3D group spectra, and (iii) inversion of the filtered spectra to provide an individual reconstruction for each block in the group, so that the final image reconstruction can be computed as a weighted average of all blockwise estimates. Figure 5 shows that 2SDR clearly performs much better than the other methods.

Figure 5: Nine randomly selected particle images and their reconstructions.

3.2 Latent Orientations and Regularized Likelihood Optimization

Scheres (2012a) gives an overview of RELION (REgularized LIkelihood OptimizatioN), which is an open-source computer program to implement the “Bayesian approach to cryo-EM structure determination, in which the reconstruction problem is expressed as the optimization of a single target function” in Scheres (2012b). Section 1 refers to this section for background and explanation of the latter reference; recall that the angstrom (Å) is 0.1 nanometer, a unit of length that is widely used to express the sizes of atoms and molecules and electromagnetic wavelengths. Scheres (2012b) pinpoints a fundamental difficulty with 3D structure reconstruction from cryo-EM data, which is “the lack of information about the relative orientations of all particles and, in the case of structural variability in the sample, also their assignment to a structurally unique class” because “these data are lost during the experiment, where molecules in distinct conformations coexist in solution and adopt random orientations in the ice.” Hence cryo-EM structure determination is an “ill-posed problem” which needs to be “tackled by regularization, where the experimental data are complemented by prior information so that the two sources of information together fully determine a unique solution.” Since in practice the experimental data often “need to be supplemented with prior information” because of low signal-to-noise ratio (SNR) or insufficiently large sample size, a Bayesian approach that assumes “a Gaussian distribution on the Fourier components of the signal” is used for maximum a posteriori (MAP) estimation of the latent vector of actual orientations of the images. MAP is a point estimate defined by the mode (i.e., $\arg\max_\theta f(\theta \mid X_n)$) of the posterior density $f(\cdot \mid X_n)$ given the observed sample $X_n$ of size n; since the log posterior density is the sum of the log-likelihood and the log prior density up to an additive constant, MAP is the solution of a regularized likelihood maximization problem. For a loss function of the form $L(\phi, a) = I_{\{\|\phi - a\| \ge c\}}$, the Bayes estimate $\phi_c$ approaches MAP as $c \to 0$ if $-f(\cdot \mid X_n)$ is convex. Scheres (2012b, pp. 521-525) uses (a) the Dempster-Laird-Rubin EM algorithm to evaluate MAP with fast Fourier-space interpolation and adaptive EM to speed up the computations, (b) the ratio $\pi_F/\pi_T$ (of the posterior probability of assigning a false orientation $\phi_F$ to that of assigning the true orientation $\phi_T$) to assess the accuracy of the orientation assignments of individual particles, and (c) a “gold standard” (explained below) to avoid overfitting.
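
The sense in which MAP estimation under a Gaussian prior is a regularized likelihood problem can be seen in a one-parameter toy case. The Python sketch below assumes a single real Fourier component s observed through n noisy, CTF-like scaled copies x_i = c_i s + ε_i with Gaussian noise of variance σ² and a Gaussian prior of variance τ² on s; all names and numerical values are illustrative assumptions, and this is not the RELION update formula itself.

```python
import numpy as np

def map_estimate(x, c, sigma2, tau2):
    """Closed-form posterior mode of s for x_i = c_i * s + noise."""
    return np.sum(c * x) / (np.sum(c ** 2) + sigma2 / tau2)

def log_posterior(s, x, c, sigma2, tau2):
    """Log-likelihood plus log-prior on a grid of s, up to a constant."""
    resid = x[:, None] - c[:, None] * s[None, :]
    return -0.5 * np.sum(resid ** 2, axis=0) / sigma2 - 0.5 * s ** 2 / tau2

rng = np.random.default_rng(1)
n, sigma2, tau2 = 50, 4.0, 1.0
c = rng.uniform(-1.0, 1.0, size=n)       # stand-in for CTF values
s_true = 0.8
x = c * s_true + np.sqrt(sigma2) * rng.standard_normal(n)

s_map = map_estimate(x, c, sigma2, tau2)
s_mle = np.sum(c * x) / np.sum(c ** 2)    # unregularized least squares
grid = np.linspace(-3.0, 3.0, 60001)
s_grid = grid[np.argmax(log_posterior(grid, x, c, sigma2, tau2))]
print(f"MLE {s_mle:.4f}, MAP {s_map:.4f}, grid maximizer {s_grid:.4f}")
```

The extra term σ²/τ² in the denominator shrinks the estimate toward zero when the data are weak, which is the same mechanism that makes the Fourier-domain update resemble a Wiener filter.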

Scheres (2016) reviews the processing of structurally heterogeneous cryo-EM data in RELION. He recognizes that the Bayesian approach in Scheres (2012a,b) is actually empirical Bayes (EB) regularization because the hyperparameters in the prior model are replaced by their estimates in the regularized likelihood. He says: “Whereas in standard Bayesian methods the prior is fixed before any data are observed, inside RELION parameters of the prior are estimated from the data themselves. This type of algorithm is referred to as an empirical Bayes approach”, in which “both the likelihood and the prior are expressed in the Fourier domain” that “permits a convenient description of the effects of microscope optics and defocusing (by the so-called contrast transfer function, or CTF).” He notes that using the EM algorithm to compute the MAP estimate in the Fourier domain “results in an update formula for the reconstruction that shows strong similarities with previously introduced Wiener filters” which depend on estimates of the signal power and the noise power as functions of spatial frequency. By using the EB approach to update these estimates from the data, “RELION effectively calculates the best possible filter, in the sense that it yields the least noisy reconstruction, at every iteration of the optimization process.” As pointed out by Scheres (2012b, p.520), the MAP estimate is based on the following linear regression model in Fourier space for the kth homogeneous structure group (k = 1, . . . , K):

$$x_{ij} = c_{ij} \sum_{\ell=1}^{L} P^{\phi}_{j\ell}\, s_{k\ell} + \varepsilon_{ij}, \tag{13}$$

where $x_{ij}$ is the jth component (j = 1, . . . , J) of the 2D Fourier transform of the ith image (i = 1, . . . , N), $c_{ij}$ is the jth component of the CTF for the ith image, $s_{k\ell}$ is the ℓth component of the 3D Fourier transform of the kth structure group, $\varepsilon_{ij}$ is noise (usually assumed to be independent and normally distributed with mean 0 and variance $\sigma^2$) in the complex plane, and $P^{\phi} = (P^{\phi}_{j\ell})_{1 \le j \le J,\, 1 \le \ell \le L}$ is a matrix that relates the 2D Fourier transform to the 3D Fourier transform by the projection-slice theorem; see Section 4.2 for background and further details.
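
The projection-slice theorem behind the matrix $P^{\phi}$ in (13) can be checked numerically in its simplest discrete form. The Python sketch below uses a random toy volume (an assumption) and a projection along a coordinate axis, for which the identity holds exactly; a general orientation φ requires interpolation of a tilted central slice in Fourier space, which is not attempted here.

```python
import numpy as np

rng = np.random.default_rng(2)
volume = rng.standard_normal((16, 16, 16))     # toy 3D "structure"

projection = volume.sum(axis=2)                # integrate along the z axis
ft_projection = np.fft.fft2(projection)        # 2D transform of the image
central_slice = np.fft.fftn(volume)[:, :, 0]   # k_z = 0 plane of 3D transform

max_abs_diff = np.max(np.abs(ft_projection - central_slice))
print(f"largest discrepancy: {max_abs_diff:.2e}")
```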

We next discuss the “gold standard”, which uses FSC (Fourier Shell Correlation) curves, together with movies to correct beam-induced motion during the exposure to the electron beam, to avoid overfitting. Scheres (2012b, p.520) provides details of the derivation of the iterative algorithm, showing how to update the iterates: $\sigma^2_{t+1}$ from $s^{(t)}_{k\ell}$, the signal variance $v^{(t+1)}_{k\ell}$ from $s^{(t+1)}_{k\ell}$, and the posterior probability $\pi^{(t+1)}_{k\ell,\phi}$ of class assignment k and orientation φ (given the data on the ith image) from $\pi^{(t)}_{k\ell,\phi}$, $s^{(t)}_{k\ell}$, and $v^{(t)}_{k\ell}$. The updating formulas involve integrations over the latent orientation φ, which in practice “are replaced by (Riemann) summations over discretely sampled orientations, and translations are limited to a user-defined range.” Scheres (2016, pp.129-132) describes the statistical underpinnings and historical background of the use of FSC curves and the development of the gold standard for the validation of cryo-EM structures. Because the EB approach updates the hyperparameters and latent orientations concurrently at each iteration, “once one over-estimates the power of the true signal due to an inadvertent build-up of a small amount of noise in the reconstruction, one will allow even more noise in the next iteration”, which led to “overfitting, where noise in the model iteratively builds up”, in many cryo-EM structure determination projects by 2010 and to the convening of “a community-driven task-force for the validation of EM structures”.

One of the recommendations of the task-force was to use the “gold-standard approach to refinement” by splitting the data into two halves and refining independent reconstructions against each “half-set”; see Henderson et al. (2012). The FSC between the two independent reconstructions “then yields a reliable resolution estimate so that the iterative build-up of noise can be prevented.” This is akin to two-fold cross-validation (CV) with FSC as the performance measure, and can therefore be extended to k-fold CV if N is sufficiently large. Before this recommendation, common practice in the field “had evolved toward the refinement of a single reconstruction” from which “the resulting angles would be used to make two (no longer independent) reconstructions from random half-sets at each iteration.” In 2011, the tilt-pair experiments by Henderson, Chen and their collaborators showed, however, that “alignments are dominated by the lower spatial frequencies, which are almost indistinguishable between reconstructions from all or half of the data”, and that the previously introduced FSC = 0.143 threshold performed well when two independent reconstructions were used but was “too optimistic” when using a single reconstruction, which produced “inflated FSC”.
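
As an illustration of the FSC between two half-set reconstructions, the Python sketch below computes the normalized cross-correlation of two volumes over shells of integer radius in Fourier space and reports the first shell at which the curve drops below 0.143. The synthetic "half-maps" (a smooth blob plus independent noise), the shell binning and the function name are assumptions chosen for illustration; this is not the RELION implementation.

```python
import numpy as np

def fourier_shell_correlation(vol1, vol2):
    """FSC between two cubic volumes, one value per integer-radius shell."""
    n = vol1.shape[0]
    f1 = np.fft.fftshift(np.fft.fftn(vol1))
    f2 = np.fft.fftshift(np.fft.fftn(vol2))
    axis = np.arange(n) - n // 2
    kx, ky, kz = np.meshgrid(axis, axis, axis, indexing="ij")
    shell = np.rint(np.sqrt(kx**2 + ky**2 + kz**2)).astype(int)
    n_shells = n // 2
    keep = shell < n_shells                       # drop the corners of the cube
    idx = shell[keep]
    cross = np.bincount(idx, weights=(f1 * np.conj(f2)).real[keep],
                        minlength=n_shells)
    p1 = np.bincount(idx, weights=(np.abs(f1) ** 2)[keep], minlength=n_shells)
    p2 = np.bincount(idx, weights=(np.abs(f2) ** 2)[keep], minlength=n_shells)
    return cross / np.sqrt(p1 * p2)

# Two synthetic half-maps: same smooth signal, independent noise (assumption).
rng = np.random.default_rng(3)
n = 32
axis = np.arange(n) - n // 2
x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
signal = np.exp(-(x**2 + y**2 + z**2) / (2 * 3.0**2))
half1 = signal + 0.05 * rng.standard_normal((n, n, n))
half2 = signal + 0.05 * rng.standard_normal((n, n, n))

fsc = fourier_shell_correlation(half1, half2)
below = np.nonzero(fsc < 0.143)[0]
cutoff_shell = int(below[0]) if below.size else None
print("first shell with FSC < 0.143:", cutoff_shell)
```

Because the noise in the two halves is independent, the curve falls toward zero at the spatial frequencies where noise dominates; if both volumes shared the same noise, as in refinement of a single reconstruction, the correlation would stay artificially high.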

Scheres (2012a, p.411) points out that “there remains one problem” with the “assumption of independence between Fourier components of the signal” in the prior distribution underlying the MAP estimate; “this assumption is known to be a poor one” for a macromolecular complex, resulting in under-estimation of power in the signal and over-smoothing of the reconstruction. Scheres (2012b, p.527) introduced a “3D auto-refinement” procedure with which “the user only selects a relatively coarse initial orientational sampling and this sampling rate is automatically increased during the refinement”. The procedure monitors two convergence criteria, namely “the estimated resolution based on the (over-smoothed) gold-standard FSC curve” and “the average changes in the optimal orientation and class assignments for all particles”, so that “once both criteria no longer improve from one iteration to the next, the orientational sampling rates are increased.” Specifically, rotational sampling is “increased 2-fold by using the next Healpix grid” and translational sampling is “adjusted to the estimated accuracy of the translational assignments” based on the $\pi_F/\pi_T$ criterion, and the adaptive EM algorithm is used during all iterations. Before termination upon convergence, “a final iteration is performed where the two independent halves of the data are combined in a single reconstruction.” Scheres (2016, Section 2.3) describes an alternative method that was recommended by the task-force to carry out this final step, saying: “Because orientational and class assignments are predominantly driven by the lower frequency content of the images, they are usually not noticeably affected by the under-estimation of resolution. However, upon convergence the highest amount of information needs to be extracted from the reconstruction . . . . By masking out the solvent region from the two half-reconstructions, the noise gets reduced and the FSCs will increase,” but masking also introduces “convolution effects” that can be corrected by “phase randomization”, details of which are given in Scheres (2016, p.132).

3.3 Hidden Markov Models, MCMC and Uncertainty Quantification

Using the EB approach via the EM algorithm for iterative estimation of the state vector $s_k = (s_{k1}, \dots, s_{kL})^\top$, together with gold standard FSC coupled with phase randomization for concurrent iterative estimation of the hyperparameter vector θ, appears circuitous when (13) is actually a hidden Markov model (HMM) and the task at hand is joint state and parameter estimation (or adaptive filtering) in the HMM, which is a long-standing problem with major applications in many STEM (Science, Technology, Engineering, Mathematics) fields.

Lai et al. (2020) recently developed a new method, MCMC-SS (Markov Chain Monte Carlo with Sequential Substitutions), for adaptive filtering in HMMs on general state spaces. Their basic idea is to approximate a target distribution by the empirical distribution of M representative atoms, chosen sequentially by an MCMC scheme so that the empirical distribution approximates the target distribution after a large number K of iterations. Making use of bounds on a weighted total variation norm of the difference between the target distribution and the empirical measure defined by the sample paths of the MCMC scheme, they have also developed an asymptotic theory of MCMC-SS. This theory includes asymptotic normality (as both K and M approach ∞) of the estimates of functionals of the target distribution using MCMC-SS, together with consistent estimation of their standard errors, and provides oracle properties that prove their asymptotic optimality. In particular, convergence is guaranteed and automated for MCMC-SS, in contrast to standard MCMC schemes for which manual checks of convergence are needed. Moreover, the computation can be vectorized and accelerated using a GPU, and parallelized across multiple GPUs; see Lai et al. (2020), where applications to image analysis with uncertainty quantification are also given.

4 Statistical and computational challenges/opportunities

This concluding section is inspired by an article by Bendory et al. (2020), referred to hereafter as BBS, that just appeared in IEEE Signal Processing Magazine; it introduces the “challenging and exciting computational tasks involved in reconstructing 3-D molecular structures by cryo-EM” and describes the “computational challenges and opportunities”. We focus on the statistical challenges and opportunities, which we will relate to the computational ones that they have covered comprehensively and eloquently. In this connection, Section 4.1 also provides additional background for the material presented in Section 3.2. Section 4.2 discusses the statistical challenges and opportunities together with concluding remarks.

4.1 Mathematical Theory of Cryo-EM Image Reconstruction, Verification and Computational Building Blocks

Section II of BBS describes the mathematical model generating cryo-EM images and formulates the inverse problem of image reconstruction as estimating the molecular structure φ of the particles, which are embedded in the ice at unknown 3D orientations, from the 2D images $I_1, \dots, I_N$ (here φ denotes the 3D structure, not an orientation as in Section 3.2); each image $I_i$ is formed from φ by a 3D rotation $R_{\omega_i}$ and a 2D shift $t_i$. While $\omega_1, \dots, \omega_N$, $t_1, \dots, t_N$ are unknown a priori, they are nuisance parameters as “their estimation is not an aim” of the reconstruction of φ, which “is possible up to three intrinsic ambiguities: a global 3D rotation, the (3D) location of the center of the molecule, and handedness.” The first two are related to the nuisance parameters and are handled by stochastic modeling as in the EM algorithm of RELION. In proteins, the polypeptide chain forms a number of right- and left-handed helices and superhelices, and other structures such as a right-handed double helical β-hairpin that is strongly twisted and coiled; see Efimov (2018). BBS notes that “the handedness of structure cannot be determined from cryo-EM images alone because the original 3D object and its reflection give rise to identical sets of projections.” Moreover, “φ may be thought of as a random signal with an unknown distribution defined over a space of possible structures”, which BBS discusses in Section VIIA as a computational and theoretical challenge, surveying two different approaches in the literature. One is in the direction that we have discussed in Section 3.2 and has “apparent drawbacks” in that “it does not scale well for large K and ignores the correlation between different functional states of the molecule and thus overlooks important information.” The other assumes that “$\phi_1, \dots, \phi_N$ can be embedded in a low-dimensional space”, which can be learned by using PCA for linear subspaces or “by other spectral methods, such as diffusion maps” in more intricate low-dimensional manifolds. We will return to these challenges and opportunities in Section 4.2.

Section V of BBS describes five building blocks in the algorithmic pipeline of single particle reconstruction using cryo-EM, including 2D classification and denoising/dimension reduction techniques that we have already discussed in Section 3.1. Section 4.2 will consider CTF estimation and bias mitigation in particle picking. Here we consider motion correction, about which Section II of BBS says: “Modern electron microscopes produce multiple micrographs” and the electron detectors “acquire multiple frames per micrograph,” making it possible to partially correct the blur caused by beam-induced movement by “aligning and averaging the frames”; motion correction is in essence a multi-reference alignment problem, for which the main challenge for cryo-EM images is “the high noise levels that hamper precise estimation of relative shifts between frames.” Recent solutions include “strategies for per-particle correction”, whereas earlier solutions aim at estimating the movement of the entire micrograph, which is divided into patches and “motions within each patch are estimated based on cross-correlation.”
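
The idea of estimating relative shifts between frames by cross-correlation, then aligning and averaging, can be sketched in a few lines of Python. The example below assumes integer, whole-frame shifts with periodic boundaries and a synthetic reference image, and the function name `estimate_shift` is hypothetical; actual motion-correction software works with patches, sub-pixel shifts and dose weighting, none of which is modeled here.

```python
import numpy as np

def estimate_shift(reference, frame):
    """Integer (dy, dx) such that rolling `reference` by it best matches `frame`."""
    cross_power = np.fft.fft2(frame) * np.conj(np.fft.fft2(reference))
    correlation = np.fft.ifft2(cross_power).real   # circular cross-correlation
    peak = np.unravel_index(np.argmax(correlation), correlation.shape)
    # Indices past the midpoint correspond to negative (wrapped) shifts.
    return tuple(int(p) if p <= s // 2 else int(p) - s
                 for p, s in zip(peak, correlation.shape))

rng = np.random.default_rng(4)
reference = rng.standard_normal((64, 64))          # synthetic "true" image
true_shifts = [(0, 0), (2, -1), (4, -3), (5, -6)]
frames = [np.roll(reference, s, axis=(0, 1)) + rng.standard_normal((64, 64))
          for s in true_shifts]

# Align every frame to the first one, then average the aligned frames.
estimated = [estimate_shift(frames[0], f) for f in frames]
aligned = [np.roll(f, (-dy, -dx), axis=(0, 1))
           for f, (dy, dx) in zip(frames, estimated)]
average = np.mean(aligned, axis=0)

print("estimated shifts:", estimated)
print("MSE single frame:", np.mean((frames[0] - reference) ** 2))
print("MSE aligned average:", np.mean((average - reference) ** 2))
```

With higher noise levels the correlation peak becomes harder to locate, which is the difficulty that the quoted passage from BBS points to.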

Scheres (2016, Section 2.4) describes a hardware-based “movie-processing” approach to motion correction. He points out that with the advent of direct-electron detectors, “the possibility arose to collect movies during the exposure of the sample to the electron beam”. When the beam’s electrons hit the sample, inelastically scattered electrons deposit energy and “the sample starts to move upon exposure to the electron beam.” Hence, “this beam-induced motion causes a blurring in the images that can be corrected by movie-processing, since each of the movie frames contains a sharper snapshot of the moving objects”, as has been illustrated for large rotavirus particles. Although it is more difficult to correct for beam-induced motions of smaller complexes, the movie-processing approach can be adapted “based on the observation that neighboring particles often move in similar directions”, and therefore “by fitting straight lines through the most likely translations from the original movie-processing approach, and by considering groups of neighboring particles in these fits, the high noise levels in the estimated movement tracks could be sufficiently reduced” for smaller complexes.

We now consider the closely related problem of verification that a 3D reconstruction is “a reliable and faithful representation of the underlying molecule”, discussed in Section VIIB of BBS as a computational and theoretical challenge. Although “this is a question of crucial importance for any scientific field” and several validation techniques were proposed in the cryo-EM literature, “there are no agreed-upon computational verification methods” and in practice, “structure validation is based on a set of heuristics and the experts’ knowledge and experience”, e.g., initializing the 3D reconstruction algorithm from multiple random points and attaining similar structures in all cases, or recovering a similar structure by applying other technologies (such as X-ray crystallography and nuclear magnetic resonance) to the same molecule. The movie-processing approach described in the preceding paragraph offers a promising verification procedure which will be discussed further in Section 4.2.

4.2 Statistical Challenges and Opportunities

Analyzing “beam-induced motions for groups of neighboring particles” is only one key constituent of the movie-processing approach reviewed by Scheres (2016, pp.133-134), who also describes the importance of combining it with “a novel way of handling radiation-damage weighting” as an adjuvant. He refers to Glaeser (2016) for reviews of radiation chemistry and radiation damage caused by exposure to the electron beam, which starts with the breakage of chemical bonds in the sample, destroying the high-resolution content of the sample, and continues with unfolding of the secondary structure elements and protein domains, with low-resolution information persisting longer than high-resolution information, until the macromolecular complex is eventually destroyed. Hence dose-response relations (in which response refers to the resolution-dependent effect of radiation damage) of the type used in pharmacokinetics/pharmacodynamics are potentially useful for modeling radiation damage and for providing frequency-dependent weights to average the aligned movie frames of each particle. He says: “By weighting the different spatial frequencies in each movie frame differently, the useful information from each movie frame is retained.” He remarks that this approach results in improved signal-to-noise ratios and is sometimes called “particle polishing”.

The next two statistical challenges and opportunities are CTF estimation and bias mitigation in particle picking, mentioned in Section 4.1. Both are related to what Section III of BBS calls “three characteristic features of the cryo-EM data: high noise level, missing data, and massive datasets.” Estimation of the CTF parameters $c_{ij}$ in (13) would be a linear regression problem if there were no missing data. As pointed out by BBS, if the viewing direction and location associated with each particle were known, “estimating φ would be a linear inverse problem” and recovery in this case is based on the projection-slice theorem (also called the Fourier slice theorem), which states that the 2D Fourier transform of the projection of φ belonging to a 3D manifold is the restriction of the 3D Fourier transform of φ to a 2D plane; hence φ can be estimated to a certain resolution if one has enough projections from known viewing directions. The statistical challenge is that the viewing directions are unknown “missing data” and “any method to estimate the viewing directions is destined to fail” because of low SNR and the “Neyman-Scott paradox” (caused by too many nuisance parameters relative to the sample size). Hence “it is essential to consider statistical methods that circumvent rotation estimation”, providing an opportunity for statistical science that we have discussed in Sections 3.2 and 3.3.
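
The Neyman-Scott phenomenon mentioned above can be reproduced with the classical textbook example, sketched below in Python: pairs of observations share a pair-specific mean (a nuisance parameter) and a common variance, and the joint maximum likelihood estimate of the variance converges to half the true value however many pairs are observed. This generic simulation is an illustration under assumed settings, not a cryo-EM computation; the analogy is that each particle image carries its own unknown rotation and shift.

```python
import numpy as np

rng = np.random.default_rng(5)
n_pairs, sigma2 = 100_000, 1.0
mu = rng.uniform(-10.0, 10.0, size=n_pairs)        # one nuisance mean per pair
x = mu[:, None] + np.sqrt(sigma2) * rng.standard_normal((n_pairs, 2))

pair_means = x.mean(axis=1, keepdims=True)          # MLE of each nuisance mean
sigma2_mle = np.mean((x - pair_means) ** 2)         # joint MLE of the variance
sigma2_corrected = 2.0 * sigma2_mle                 # removes the known bias

print(f"true variance {sigma2}, joint MLE {sigma2_mle:.3f}, "
      f"corrected {sigma2_corrected:.3f}")
```

Estimating every nuisance parameter and plugging the estimates in therefore gives an inconsistent answer, which motivates methods that integrate over or otherwise bypass the nuisance parameters.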

Sections IVB1 and IVC of BBS discuss model bias, exemplified by the “Einstein from noise” phenomenon which shows how the reconstructed molecular structure can be biased by the reference image picked manually by the analyst to “extract 2D projections (particles) from the noisy micrographs”, as we have also shown in Section 2. To mitigate the user’s selection bias, RELION uses an automated particle picking algorithm that is described with updates by Scheres (2016, pp.135-137), which we summarize here. To identify the positions of individual particles in all micrographs, the user’s task is to extract suitable particles from a relatively small subset of micrographs (typically a thousand particles are enough). After extraction the particles undergo a first round of reference-free 2D classification in RELION, and the 2D classes are used as templates for automated particle picking of all micrographs. The autopicking algorithm has two important parameters: a threshold that expresses how restrictive the particle picking is (higher threshold values are more restrictive and pick fewer particles) and the minimum inter-particle distance. Note that improvements in 2D classification methodology and software implementation have been evolving, as described above in Section 3.1 and in BBS (Section VD). Indeed 3D multi-reference alignment for high-resolution structure determination is a continually improving area that offers challenges and opportunities for statistical science. BBS says: “Establishing computational tools that provide confidence intervals to estimated structures and are immunized against systematic errors is one of the remaining challenges in the field.” The recent advances in joint state and parameter estimation in hidden Markov models described in Section 3.3 are opportune developments to meet these challenges.
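
The roles of the two autopicking parameters described above, a score threshold and a minimum inter-particle distance, can be illustrated with a small greedy peak-picking routine in Python. The score map would in practice come from matching micrographs against 2D class templates; here it is a hand-made array, and the function `pick_particles` is a hypothetical illustration rather than the RELION algorithm.

```python
import numpy as np

def pick_particles(score_map, threshold, min_distance):
    """Greedy peak picking: highest scores first, enforcing a minimum spacing."""
    rows, cols = np.nonzero(score_map > threshold)
    order = np.argsort(score_map[rows, cols])[::-1]   # descending score
    picks = []
    for r, c in zip(rows[order], cols[order]):
        far_enough = all((r - pr) ** 2 + (c - pc) ** 2 >= min_distance ** 2
                         for pr, pc in picks)
        if far_enough:
            picks.append((int(r), int(c)))
    return picks

# Hand-made score map (assumption): four candidate peaks, two of them adjacent.
score_map = np.zeros((50, 50))
score_map[10, 10] = 0.9
score_map[12, 11] = 0.8     # too close to the peak at (10, 10)
score_map[30, 40] = 0.6
score_map[45, 5] = 0.3

picks_strict = pick_particles(score_map, threshold=0.5, min_distance=5)
picks_loose = pick_particles(score_map, threshold=0.2, min_distance=5)
print("threshold 0.5:", picks_strict)
print("threshold 0.2:", picks_loose)
```

Raising the threshold trades missed particles for fewer false positives, while the minimum distance prevents the same particle from being picked more than once.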

In conclusion, we share the perspective of BBS that the emerging field of cryo-EM is an alluring area for statistical and computational scientists to tackle challenging problems and develop analytical and computational tools to “drive the field forward”, and in return “broaden our understanding of the fundamental mechanisms of life.” This understanding is not only timely in the battle against SARS-CoV-2 but also important for developing vaccines, diagnostics and drugs to prevent or treat COVID-19 infections; see the articles by Wrapp et al. (2020) and Gao et al. (2020) in Science early this year.

References

Bendory, T., Bartesaghi, A., and Singer, A. (2020). Single-particle cryo-electron microscopy: Mathematical theory, computational challenges, and opportunities. IEEE Signal Processing Magazine, 37(2):58–76.

Chang, S. G., Yu, B., and Vetterli, M. (2000). Adaptive wavelet thresholding for image denoising and compression. IEEE Transactions on Image Processing, 9:1532–1546.

Chen, T.-L., Hsieh, D.-N., Hung, H., Tu, I.-P., Wu, P.-S., Wu, Y.-M., Chang, W.-H., and Huang, S.-Y. (2014). γ-SUP: A clustering algorithm for cryo-electron microscopy images of asymmetric particles. The Annals of Applied Statistics, 8:259–285.

Cheng, Y. (2015). Single-particle cryo-EM at crystallographic resolution. Cell, 161:450–457.

Cheng, Y., Grigorieff, N., Penczek, P., and Walz, T. (2015). A primer to single-particle cryo-electron microscopy. Cell, 161:438–449.

Chung, S.-C., Wang, S.-H., Niu, P.-Y., Huang, S.-Y., Chang, W.-H., and Tu, I.-P. (2020). Two-stage dimension reduction for noisy high-dimensional images and application to cryogenic electron microscopy. Annals of Mathematical Sciences and Applications, 5: in press.

Danielyan, A., Katkovnik, V., and Egiazarian, K. (2011). BM3D frames and variational image deblurring. IEEE Transactions on Image Processing, 21:1715–1728.

Efimov, A. V. (2018). Chirality and handedness of protein structures. Ultramicroscopy, 83:103–110.

Gao, Q., Bao, L., Mao, H., Wang, L., Xu, K., Yang, M., Li, Y., Zhu, L., Wang, N., Lv, Z., Gao, H., Ge, X., Kan, B., Hu, Y., Liu, J., Cai, F., Jiang, D., Yin, Y., Qin, C., Li, J., Gong, X., Lou, X., Shi, W., Wu, D., Zhang, H., Zhu, L., Deng, W., Li, Y., Lu, J., Li, C., Wang, X., Yin, W., Zhang, Y., and Qin, C. (2020). Development of an inactivated vaccine candidate for SARS-CoV-2. Science, in press. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7202686/pdf/abc1932.pdf.

Glaeser, R. M. (2016). Specimen behavior in the electron beam. Chapter 2 (pp. 20-50) in Methods in Enzymology, vol. 579, ScienceDirect, Elsevier.

Henderson, R. (2013). Avoiding the pitfalls of single particle cryo-electron microscopy: Einstein from noise. Proc. Natl. Acad. Sci. USA, 110:18037–18041.

Henderson, R., Sali, A., Baker, M. L., Carragher, B., Devkota, B., Downing, K. H., Egelman, E. H., Feng, Z., Frank, J., Grigorieff, N., Jiang, W., Ludtke, S. J., Medalia, O., Penczek, P. A., Rosenthal, P. B., Rossmann, M. G., Schmid, M. F., Schroder, G. H., Steven, A. C., Stokes, D. L., Westbrook, J. D., Wriggers, W., Yang, H., Young, J., Berman, H. M., Chiu, W., Kleywegt, G. J., and Lawson, C. L. (2012). Outcome of the first electron microscopy validation task force meeting. Structure, 20:205–214.

Hung, H., Wu, P.-S., Tu, I.-P., and Huang, S.-Y. (2012). On multilinear principal component analysis of order-two tensors. Biometrika, 99:569–583.

Kimanius, D., Forsberg, B. O., Scheres, S. H., and Lindahl, E. (2016). Accelerated cryo-EM structure determination with parallelisation using GPUs in RELION-2. Elife, 5:e18722.

Lai, T. L., Xu, H., Zhu, M. H., and Chan, H. P. (2020). MCMC with sequential substitutions for joint state and parameter estimation in hidden Markov models. Technical Report, Department of Statistics, Stanford University.

Liao, M., Cao, E., Julius, D., and Cheng, Y. (2013). Structure of the TRPV1 ion channel determined by electron cryo-microscopy. Nature, 504:107–112.

Scheres, S. H. (2012a). A Bayesian view on cryo-EM structure determination. Journal of Molecular Biology, 415:406–418.

Scheres, S. H. (2012b). RELION: implementation of a Bayesian approach to cryo-EM structure determination. Journal of Structural Biology, 180:519–530.

Scheres, S. H. (2016). Processing of structurally heterogeneous cryo-EM data in RELION. Chapter 6 (pp. 125-157) in Methods in Enzymology, vol. 579, ScienceDirect, Elsevier.

Shiu, S. Y. and Chen, T. L. (2016). On the strengths of the self-updating process clustering algorithm. Journal of Statistical Computation and Simulation, 86:1010–1031.

Sorzano, C. O. S., Bilbao-Castro, J. R., Shkolnisky, Y., Alcorlo, M., Melero, R., Caffarena-Fernandez, G., Li, M., Xu, G., Marabini, R., and Carazo, J. M. (2010). A clustering approach to multireference alignment of single-particle projections in electron microscopy. Journal of Structural Biology, 171:197–206.

Stewart, A. and Grigorieff, N. (2004). Noise bias in the refinement of structures derived from single particles. Ultramicroscopy, 102:67–84.

van Heel, M. (1984). Multivariate statistical classification of noisy images (randomly oriented biological macromolecules). Ultramicroscopy, 13:165–183.

Wrapp, D., Wang, N., Corbett, K. S., Goldsmith, J. A., Hsieh, C. L., Abiona, O., Graham, B. S., and McLellan, J. S. (2020). Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation. Science, 367:1260–1263.

Yan, C., Hang, J., Wan, R., Huang, M., Wong, C. C., and Shi, Y. (2015). Structure of a yeast spliceosome at 3.6-angstrom resolution. Science, 349:1182–1191.

Yang, Z., Fang, J., Chittuluru, J., Asturias, F. J., and Penczek, P. A. (2012). Iterative stable alignment and clustering of 2D transmission electron microscope images. Structure, 20:237–247.

Ye, J. (2005). Generalized low rank approximation of matrices. Machine Learning, 61:167–191.
