13
rsif.royalsocietypublishing.org Research Cite this article: Reeves DB, Magaret AS, Greninger AL, Johnston C, Schiffer JT. 2018 Model-based estimation of superinfection prevalence from limited datasets. J. R. Soc. Interface 15: 20170968. http://dx.doi.org/10.1098/rsif.2017.0968 Received: 22 December 2017 Accepted: 5 February 2018 Subject Category: Life Sciences–Mathematics interface Subject Areas: biomathematics, computational biology, biophysics Keywords: superinfection, expectation maximization, mathematical modelling, HIV, HSV, ecology Author for correspondence: Daniel B. Reeves e-mail: [email protected] Model-based estimation of superinfection prevalence from limited datasets Daniel B. Reeves 1 , Amalia S. Magaret 1,3,4 , Alex L. Greninger 3 , Christine Johnston 1,2 and Joshua T. Schiffer 1,2 1 Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA 2 Department of Medicine, 3 Department of Laboratory Medicine, and 4 Biostatistics, University of Washington, Seattle, WA, USA DBR, 0000-0001-5684-9538; ASM, 0000-0002-1221-3746; ALG, 0000-0002-7443-0527; CJ, 0000-0002-3073-0843; JTS, 0000-0002-2598-1621 Humans can be infected sequentially by different strains of the same virus. Estimating the prevalence of so-called ‘superinfection’ for a particular patho- gen is vital because superinfection implies a failure of immunologic memory against a given virus despite past exposure, which may signal challenges for future vaccine development. Increasingly, viral deep sequencing and phylogenetic inference can discriminate distinct strains within a host. Yet, a population-level study may misrepresent the true prevalence of super- infection for several reasons. First, certain infections such as herpes simplex virus (HSV-2) only reactivate single strains, making multiple samples necessary to detect superinfection. Second, the number of samples collected in a study may be fewer than the actual number of independently acquired strains within a single person. Third, detecting strains that are rela- tively less abundant can be difficult, even for other infections such as HIV-1 where deep sequencing may identify multiple strains simultaneously. Here we develop a model of superinfection inspired by ecology. We define an infected individual’s richness as the number of infecting strains and use ecological evenness to quantify the relative strain abundances. The model uses an EM methodology to infer the true prevalence of superinfection from limited clinical datasets. Simulation studies with known true prevalence are used to contrast our EM method to a standard (naive) calcu- lation. While varying richness, evenness and sampling we quantify the accuracy and precision of our method. The EM method outperforms in all cases, particularly when sampling is low, and richness or unevenness is high. Here, sensitivity to our assumptions about clinical data is considered. The simulation studies also provide insight into optimal study designs; estimates of prevalence improve equally by enrolling more participants or gathering more samples per person. Finally, we apply our method to data from published studies of HSV-2 and HIV-1 superinfection. 1. Introduction Superinfection is highly relevant to the disciplines of public health and immu- nology because it indicates that primary infection may provide only limited cross-immunity against re-exposure to a new strain of the virus. Development of a prophylactic vaccine may be more challenging if superinfection with a particular virus is common. As deep sequencing is increasingly performed on clinical samples, the issue of superinfection will be increasingly common, because deep sequencing provides previously unavailable strain-defining infor- mation. Therefore, we have designed a model to analyse limited sequence data to infer the prevalence of superinfection on a population level. Superinfection has been observed for many common viruses including human immunodeficiency virus (HIV) [1–3], hepatitis C [4], herpes simplex virus (HSV) [5,6] and cytomegalovirus (CMV) [7]. Prevalence often correlates with risk factors for acquisition. For instance, a population-based survey found HSV-2 & 2018 The Author(s) Published by the Royal Society. All rights reserved. on March 1, 2018 http://rsif.royalsocietypublishing.org/ Downloaded from

Model-based estimation of superinfection prevalence from ... · Superinfection is highly relevant to the disciplines of public health and immu-nology because it indicates that primary

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Model-based estimation of superinfection prevalence from ... · Superinfection is highly relevant to the disciplines of public health and immu-nology because it indicates that primary

on March 1, 2018http://rsif.royalsocietypublishing.org/Downloaded from

rsif.royalsocietypublishing.org

Research

Cite this article: Reeves DB, Magaret AS,

Greninger AL, Johnston C, Schiffer JT. 2018

Model-based estimation of superinfection

prevalence from limited datasets. J. R. Soc.

Interface 15: 20170968.

http://dx.doi.org/10.1098/rsif.2017.0968

Received: 22 December 2017

Accepted: 5 February 2018

Subject Category:Life Sciences – Mathematics interface

Subject Areas:biomathematics, computational biology,

biophysics

Keywords:superinfection, expectation maximization,

mathematical modelling, HIV, HSV, ecology

Author for correspondence:Daniel B. Reeves

e-mail: [email protected]

& 2018 The Author(s) Published by the Royal Society. All rights reserved.

Model-based estimation of superinfectionprevalence from limited datasets

Daniel B. Reeves1, Amalia S. Magaret1,3,4, Alex L. Greninger3,Christine Johnston1,2 and Joshua T. Schiffer1,2

1Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA2Department of Medicine, 3Department of Laboratory Medicine, and 4Biostatistics, University of Washington,Seattle, WA, USA

DBR, 0000-0001-5684-9538; ASM, 0000-0002-1221-3746; ALG, 0000-0002-7443-0527;CJ, 0000-0002-3073-0843; JTS, 0000-0002-2598-1621

Humans can be infected sequentially by different strains of the same virus.

Estimating the prevalence of so-called ‘superinfection’ for a particular patho-

gen is vital because superinfection implies a failure of immunologic memory

against a given virus despite past exposure, which may signal challenges for

future vaccine development. Increasingly, viral deep sequencing and

phylogenetic inference can discriminate distinct strains within a host. Yet,

a population-level study may misrepresent the true prevalence of super-

infection for several reasons. First, certain infections such as herpes

simplex virus (HSV-2) only reactivate single strains, making multiple

samples necessary to detect superinfection. Second, the number of samples

collected in a study may be fewer than the actual number of independently

acquired strains within a single person. Third, detecting strains that are rela-

tively less abundant can be difficult, even for other infections such as HIV-1

where deep sequencing may identify multiple strains simultaneously. Here

we develop a model of superinfection inspired by ecology. We define an

infected individual’s richness as the number of infecting strains and use

ecological evenness to quantify the relative strain abundances. The model

uses an EM methodology to infer the true prevalence of superinfection

from limited clinical datasets. Simulation studies with known true

prevalence are used to contrast our EM method to a standard (naive) calcu-

lation. While varying richness, evenness and sampling we quantify the

accuracy and precision of our method. The EM method outperforms in all

cases, particularly when sampling is low, and richness or unevenness is

high. Here, sensitivity to our assumptions about clinical data is considered.

The simulation studies also provide insight into optimal study designs;

estimates of prevalence improve equally by enrolling more participants or

gathering more samples per person. Finally, we apply our method to data

from published studies of HSV-2 and HIV-1 superinfection.

1. IntroductionSuperinfection is highly relevant to the disciplines of public health and immu-

nology because it indicates that primary infection may provide only limited

cross-immunity against re-exposure to a new strain of the virus. Development

of a prophylactic vaccine may be more challenging if superinfection with a

particular virus is common. As deep sequencing is increasingly performed on

clinical samples, the issue of superinfection will be increasingly common,

because deep sequencing provides previously unavailable strain-defining infor-

mation. Therefore, we have designed a model to analyse limited sequence data

to infer the prevalence of superinfection on a population level.

Superinfection has been observed for many common viruses including human

immunodeficiency virus (HIV) [1–3], hepatitis C [4], herpes simplex virus (HSV)

[5,6] and cytomegalovirus (CMV) [7]. Prevalence often correlates with risk factors

for acquisition. For instance, a population-based survey found HSV-2

Page 2: Model-based estimation of superinfection prevalence from ... · Superinfection is highly relevant to the disciplines of public health and immu-nology because it indicates that primary

rsif.royalsocietypublishing.orgJ.R.Soc.Interface

15:20170968

2

on March 1, 2018http://rsif.royalsocietypublishing.org/Downloaded from

superinfection prevalence of 3.7% with higher prevalences

found in individuals also infected with HIV [6]. Measured

prevalence for multiple strain HCV infections inclusive of

superinfection and simultaneous acquisition range from 0.7%

to 25% with particularly high prevalence noted among inject-

ing drug users [4,8,9]. HIV superinfection also varies

according to risk exposure and can approach 10% in certain

studies. Primary infection incidence exceeds superinfection

incidence in certain cohorts but not others [8,10]. Immune fac-

tors may protect against or enhance the risk of superinfection

[11,12], though even the presence of broadly neutralizing

antibodies may not protect against superinfection [13].

Yet, the prevalence of superinfection in populations is

generally unknown and empirical measures of superinfection

may represent underestimates for several reasons. For viruses

with periodic reactivation such as HSV and CMV, a single

sample during a period of active viral replication may only

detect the single reactivating strain (for HSV, see [14,15]).

Even for systemic infection with continual replication at mul-

tiple sites (such as HIV), where superinfecting strains are

more likely to replicate simultaneously, limited sampling

may not be adequate to detect all superinfecting strains, par-

ticularly if a single strain predominates. Anatomic sampling

challenges may exist.

Deep sequencing data can distinguish different strains

of viruses within a single host to uncover superinfection

[16–18]. If a certain genetic distance between strains is

observed, it is possible to rule out evolutionary linkages and

confidently infer multiple acquisitions of the same viral patho-

gen. Though deep sequencing does not necessarily distinguish

superinfection (sequential acquisition) from dual-strain infec-

tion (simultaneous acquisition), superinfection is likely to

represent the more common phenomenon: it is increasingly

recognized that bottlenecks at sites of transmission often

limit transmission to a single strain [19]. For HSV and other

periodic viruses, multiple samples over time are necessary to

determine superinfection. For HIV and other chronic systemic

viral infections, deep sequencing may allow identification of

multiple strains from a single timepoint, detecting a quasi-

species of many viral variants simultaneously [20,21]. In both

cases, even if sequential sampling is performed in study

participants, there is no guarantee that all existing strains will

be detected. Throughout the paper we ignore false positives,

assuming that proper sequencing protocols were used.

However, each sequencing platform and viral system carries

its own challenges in identifying distinct variants (sample

collection, target amplification, library preparation and recom-

bination [22]) and care must be taken to ensure the original data

used for the model are well-validated [23].

In ecology, estimators for animal richness (number of

species) and evenness (relative abundance of species) have

been developed given the challenges of wildlife collection

[24–26]. Some extensions of these methods have been used

for genetics [27] and virology [28,29]. However, these methods

are not well-suited for our purposes (superinfection studies)

because they assess scenarios with many species. Thus, we

adapted the richness and evenness framework to design

our own model for estimating superinfection.

Our method uses an expectation-maximization (EM)

algorithm to infer the real superinfection prevalence from

limited deep sequencing data. We first derive the probability

of observing a given richness from a true richness in a single

host. Moving to the population level, we demonstrate

estimation of underlying average richness from the limited

observed counts of superinfection. This estimation assumes

equivalent risk of infection over time and across study partici-

pants (i.e. a Poisson model for the true underlying richness).

By simulating clinical study data, we perform extensive com-

parisons of our EM method to the standard calculation (naive

estimate) of superinfection prevalence for relevant parameter

ranges. Our method begins with this estimate and improves

upon it by including more information about the sampling

procedure. For example, if the number of strains exceeds

the number of samples, some strains will necessarily be unde-

tected. Given the financial costs of superinfection studies,

these constraints raise a practical question: will a more confi-

dent estimate arise from sampling a larger distribution of

infected persons or performing more longitudinal samples

within a single person? By limiting the total number of

samples, we determine that EM estimation improves equally

from sampling more participants or increasing the number of

samples per participant. The first model applies to viruses

that reactivate periodically and thus require sequential

sampling to detect superinfection. Therefore, we also extend

our EM model to the context of viruses like HIV where mul-

tiple strains can be detected in a single sample. In that

context, simulation studies demonstrate that the extent of

underestimated superinfection depends on sampling depth

and evenness of strain distribution. We finally apply our

method to published HSV-2 and HIV data to demonstrate

our estimate of prevalence and how our model allows for

decreased sampling to achieve the same estimates. In the

HSV example [6], we demonstrate that in the most conserva-

tive estimate superinfection prevalence may be twofold

higher than would be expected using a standard calculation

(or naive estimate). While published HIV superinfection

studies are less-limited in sample size [2], our tool provides

a utilitarian alternative approach to minimize the cost and

patient burden of sampling.

1.1. An ecological framework for estimating theprevalence of viral superinfection

In the Results, we develop a framework and methodology to

infer the prevalence of viral superinfection from population-

level studies. In particular, we differentiate viruses in which

a single strain is more likely to be replicating at a single

point in time (HSV-2) from viruses in which multiple strains

replicate concurrently (HIV-1). Under the latter assumption,

we consider the impact of sequencing depth.

We include definitions of important terms below. Super-

infection means sequential infection by different strains of

the same virus—as opposed to dual-strain infection: simul-

taneous infection by more than one strain of a virus, or

coinfection: infection with more than one type of pathogen

(e.g. HSV þ HIV). Strains define distinct viral infections;

each strain is a collection of related sequence variants that

differ due to within-host viral mutation. To develop a math-

ematical framework to analyse superinfection, we borrow

from ecology. Richness (R) denotes the number of infecting

strains. Within a single host, two strains may not be observed

to be equally abundant due to random fluctuations, immune

pressure, timing of infection, inoculum size of virus or biases

in experimental detection (‘catchability’ in the ecological

setting [30]). For our purposes in estimating superinfection,

the cause of observed differences in relative abundance is

Page 3: Model-based estimation of superinfection prevalence from ... · Superinfection is highly relevant to the disciplines of public health and immu-nology because it indicates that primary

rsif.royalsocietypublishing.orgJ.R.Soc.Interface

15:20170968

3

on March 1, 2018http://rsif.royalsocietypublishing.org/Downloaded from

unimportant. Evenness (E) encapsulates relative strain abun-

dances. Based on a Shannon entropy [31], this measure is

bounded by 0 (minimal evenness) to 1 (maximal evenness,

all species equivalently abundant) given a certain value of

the richness R.

In clinical superinfection studies, participants are sam-

pled with a blood draw, or a swab of an infected surface,

and viral DNA/RNA is genetically sequenced. We wish to

estimate the average richness (denoted kRl) and the average

prevalence of superinfection (denoted Ps) in a study popu-

lation. The data we consider come from studies with Nparticipants who were each sampled n times. The data are

collected as counts Nobs(Robs), serving as the number

of study participants having observed richness Robs ¼ 1, 2,

3, . . . . Making this a probability distribution by dividing

by the total number of participants we have pobs(Robs) ¼

Nobs(Robs)/N. Thus, a naive estimation of the average rich-

ness and a naive estimate of the average superinfection

prevalence can be calculated as

kRl ¼X

rr pobs(r), Ps ¼ 1� pobs(0): ð1:1Þ

However, both of these calculations will necessarily under-

estimate the values because the observed distribution is not

equal to the true distribution, i.e. pobs(Robs) = p(R).

Therefore, our main objective is to calculate the probability

of richness p(R) based on limited datasets. We develop the

model by considering the simplest case first, where strains

are equally abundant and a single sample from an individual

results in a single detected strain (Results I). We then go on

to relax the assumption of evenness and then allow for mul-

tiple strains to be detected in a single sample (Results II). For

reference, we include the following lexicon:

— variant: distinct viral DNA/RNA sequence arising from

within-host mutation and grouped with closely related

variants into a strain

— strain: collection of related variants indicating a distinct

viral acquisition

— superinfection: sequential infection with more than one

strain of a virus

— richness: (R) number strains infecting a participant

— observed richness: (Robs) number of strains observed to be

infecting a participant over n measurements

— evenness: (E) summary measure of the relative abundance

of infecting strains in a participant, calculated from

normalized Shannon entropy [31]

— superinfection prevalence: (Ps) population-level probability

of superinfection

2. Results I: single detected strain per sample2.1. Mathematical model corrects for limited sampling

relative to naive estimateViruses that oscillate between a latent and active cycle (such

as HSV-2) are likely to reactivate only a single strain, and so

detection leads to a single strain per sample [15]. To date,

only one strain of HSV-2 has been detected in most samples

that have undergone next-generation sequencing [6,32]. We

illustrate a clinical study of such a virus in figure 1. Here,

the true richness in each participant (stars on human silhou-

ettes) varies from 1 to 3 and strains are evenly abundant.

Based on the true distribution, the average richness is kRl ¼1.6 strains per participant and the superinfection prevalence

is Ps ¼ 4=10. Each participant is sampled twice, as indicated

by the purple circles. In three cases, the true richness is

underestimated because either a single strain is sampled

twice or because true richness exceeds the number of

samples taken. Observations on a population level take the

form Nobs(Robs): the number of participants observed to be

infected by a certain number of strains. Using a naive esti-

mate, we would underestimate average richness (1.3) and

superinfection prevalence (3/10).

To correct the underestimate, we infer a probability

distribution for the true richness from the data. This is accom-

plished in multiple steps. First, we note that the probability of

detecting Robs strains in a single participant arises from a

combinatoric process based on the true richness R and the

number of samples n. If we label each strain as 1, 2, 3 . . ., R,

where the number of strains is the richness R, we can think

of sampling n times as selecting a combination of strains

from the multinomial expansion (s1 þ s2 þ s3 þ . . . sR)n. We

call these combinations ‘words’. For example, it is possible

strain 1 is drawn twice in two samples which would lead

to the word W ¼ s21, or that a different strain is drawn in

each of three samples leading to the combination

W ¼ s1s3s5. We note that the combination’s exponents must

sum to the number of samples n.

If we allow each strain to have probability fi, the probabil-

ities of different observed richnesses can be visualized using a

hypercube of dimension n where each edge is divided into a

grid of f1, f2, . . .fR. In figure 2a, a hypothetical participant is

infected with three strains R ¼ 3 and is sampled twice n ¼ 2.

The samplings that lead to observing only one strain

(the words sisi) are coloured blue. Calculating the area of all

the blue boxes gives the probability of observing a single

strain. The probability of observing superinfection is the area

of the grey boxes. Similarly, in figure 2b, three samples are

shown and the probability of measuring precisely two strains

can be visualized as the sum of the volumes of the grey

cubes (18/27). Here, the probability of observing superinfec-

tion (24/27) is inclusive of measuring two or three strains,

respectively. In figure 2a, superinfection will be observed 2/3

of the time. If a third sample is obtained from the same

person, then the chance of observing superinfection rises to

fs ¼ �89% but the chance of detecting all three strains remains

low (approx. 22%). We present the probability distribution for

observed richness based on true richness and number of

samples with derivation in the Material and methods. A sim-

pler quantity is the probability of observing superinfection

from a participant. This can be calculated using the complete

expression equation (5.1), or by noticing it is the complement

of the probability of all ways to observe a single strain from

a participant:

fs ¼ 1�XR

i¼1

fni : ð2:1Þ

For evenly abundant strains, fi ¼ 1/R and the probability of

observing superinfection in a single participant amounts to

fs ¼ 1� R1�n: ð2:2Þ

We have derived the probability of observing a richness

given a true richness and number of samples within a partici-

pant, showing how superinfection can be unobserved based on

Page 4: Model-based estimation of superinfection prevalence from ... · Superinfection is highly relevant to the disciplines of public health and immu-nology because it indicates that primary

R = 1R = 3 R = 2R = 1

R = 1R = 3 R = 2R = 1

R = 1

R = 1

strain measurement

n = 2 sequential samples per participant N = 10 participants

one strain detected per sample

0

2

4

6

1 2 3true richness (R)

coun

ts, N

(R)

avg richness = 1.6,superinfection prevalence = 4/10

observed richness (Robs)

0

2

4

6

1 2 3

coun

ts, N

obs (

Rob

s )

avg richness = 1.3,superinfection prevalence = 3/10

Figure 1. Because of incomplete sampling, population-level superinfection is underestimated. We show N ¼ 10 participants (human silhouettes) each with arandom richness (stars) drawn from p(R). These participants are sampled n ¼ 2 times ( purple rings). The observed richness distribution Nobs(Robs): the countsof participants with Robs strains—number of circled stars, does not agree with the true underlying distribution of richness N(R): counts of patients actuallyhaving a number of strains—number of stars. Therefore, a naive estimate of richness or superinfection prevalence (see equations (1.1)) is an underestimate.(Online version in colour.)

f1

(a) (b)

f (Robs = 3) = 6/27

f (Robs = 1) = 3/27

f (Robs = 2) = 18/27

f2

f3

f1

f2

f3

f1

f2

f3

f1

f2

f3 f

1

f2

f3

f (Robs = 1) = 1/3

R = 3, n = 2

f (Robs = 2) = 2/3f

S~ 0.7

R = 3, n = 3

fS~ 0.9

Figure 2. Within a single host, incomplete sampling leads to underestimation of superinfection. The probability of detecting superinfection (red boxes, fS) increaseswith number of samples (n). A geometric interpretation demonstrates the probability of observing Robs infections given a true richness of R and n samples from asingle infected participant. The true richness R determines the number of divisions on the hypercube of dimension n. (a) Blue sections indicate the combinations thatresult in a single strain measurement while grey sections indicate a two strain measurement. (b) Blue sections still indicate single strain measurements. There is nowthe possibility of two strain (grey) and three strain measurements (green). By increasing the number of samples from 2 to 3, the probability of observingsuperinfection fs increases rapidly.

rsif.royalsocietypublishing.orgJ.R.Soc.Interface

15:20170968

4

on March 1, 2018http://rsif.royalsocietypublishing.org/Downloaded from

limited sampling. To proceed to the population-level inference,

we must assume a function for the population-level distri-

bution of the richness, for which we choose a zero-truncated

Poisson (ZTP) distribution

Z(R; l) ¼ lR

R!

1

el � 1: ð2:3Þ

The distribution satisfies several criteria. It is completely

defined by a single parameter referred to as the ZTP

parameter l. It is only defined on the whole numbers for

R . 1, which mimics the clinical studies where each partici-

pant is at least singly infected. Lastly, that there is some

constant average probability of infection represents the

most basic assumption. This assumption should be valid for

viruses that are homogeneously dispersed in a population

based on equivalent route of transmission and contact net-

works. One pure example may be spread of a respiratory

virus in a day care setting. A widely distributed sexually

Page 5: Model-based estimation of superinfection prevalence from ... · Superinfection is highly relevant to the disciplines of public health and immu-nology because it indicates that primary

rsif.royalsocietypublishing.orgJ.R.Soc.Interface

15:20170968

5

on March 1, 2018http://rsif.royalsocietypublishing.org/Downloaded from

transmitted disease such as HSV-2 may also meet this

criterion if a cohort with relatively uniform risk is assembled.

We used the EM algorithm [33] to find the best fit-value l

based on study data. The optimization is accomplished

numerically in R and detailed in the Material and methods.

We then calculate the EM inferred average richness and the

superinfection prevalence as

kRl ¼ l

1� e�l, Ps ¼ 1� Z(0; l), ð2:4Þ

based on the properties of the ZTP distribution.

2.2. Decreased evenness lowers the probability ofobserving superinfection in a single host regardlessof true richness

A more general model allows for uneven strain abundance.

Uneven strains means the probability of detecting strain 1 is

not equal to the probability of detecting strain 2, i.e. f1 = f2.

We parameterize evenness with an exponential function

and a single parameter that we call the ‘superinfection

parameter’: a. The probability of measuring the j-th strain

(rank ordered from j ¼ 1! Robs) in a single sample from an

individual is

fj ¼ c (a, R)e�aj, ð2:5Þ

where c is a normalization constant that depends on a and R

c(a, R) ¼ ea � 1

1� e�aR , ð2:6Þ

see Material and methods for derivation. In figure 3a, we

demonstrate two geometric examples analogous to those in

figure 2. The formula for the probability of observing Robs

given a value of the superinfection parameter, the true richness

and the number of samples is presented and discussed in the

Material and methods, see equation (5.7).

We also use a summary measure describing the relative

abundances called the within-host strain evenness E. This

measure is a normalized Shannon entropy [34] common in

ecology,

E ¼ � 1

log R

XR

i¼1

fi logfi: ð2:7Þ

The exponential abundance model can be inserted into the

evenness to show that as a increases, evenness decreases as

E(a, R) ¼ � 1

ln R( lnc þ ac@ac

�1), ð2:8Þ

where @a denotes a partial derivative with respect to a.

Plots of this function are shown in figure 3b. Increasing a

means the probability of observing the most abundant

strain rises.

Again, we can also express the simpler probability of

observing superinfection from a single participant in the

uneven case. Inserting equation (2.5) into equation (2.1), we

have for the uneven case

fs ¼ 1� 1� e�anR

ean � 1

� �cn: ð2:9Þ

This expression is plotted in figure 3c, showing that the prob-

ability of observing superinfection fs increases with the

number of samples in an individual. This relationship

depends slightly on the true richness of that individual.

However, as unevenness increases (e.g. a ¼ 3, figure 3c(iv)),

the probability of observing superinfection is drastically

lower and only increases slightly with number of samples.

In that case, the increasing richness up to 5 has no impact

on the probability of observing superinfection because the

fifth strain is so rare compared to the first.

2.3. Simulated data illustrate the advantage of theexpectation-maximization estimation over thenaive estimation

To compare the accuracy of the EM method to the naive

method we simulated a theoretical distribution of partici-

pants (see Material and methods for procedure), varying

the number of samples n ¼ 2! 6, the number of study

participants N ¼ 40! 103, the average richness kRl ¼ 1!1.4 and the evenness (through superinfection parameter

a ¼ 1! 3) for a total of 1050 parameter combinations. We

show in figure 4 that as simulated richness or unevenness

increases (figure 4a,b), the EM method performs equivalently

well in terms of returning the correct parameter. However,

the error of the naive method increases with richness and

unevenness, in both cases resulting in larger underestimates

of the true average richness. The error in richness estimation

with the naive estimate can be as large as 0.3, which even at

the highest richness can mean a 20% underestimation of

superinfection prevalence. Conversely, even under extremely

unfavourable conditions (low richness, high unevenness,

low sample size), the EM method is off by less than 1%. Of

practical concern, the number of samples per person and

number of participants does not affect the accuracy of the

EM algorithm on this scale (figure 4c,d ).

2.4. Model estimates improve equally with increasedsampling due to more participants or moresamples per participant

Clinical studies may be constrained by the total number of

samples. We thus interrogated the properties of the EM

estimation technique on simulated data (see Material and

methods for procedure) having a fixed total sample size

nN. We found that regardless of average richness or evenness

between strains, estimation of the average richness is more

accurate and precise with increased numbers of total samples,

though the number of samples per participant or number of

participants were not influential given a fixed total number

of samples.

In figure 5, we demonstrate the EM method on simulated

data with fixed nN ¼ 250, nN ¼ 500, nN ¼ 1000 and nN ¼2000 (columns) for even (a ¼ 0) and highly uneven cases

(a ¼ 2). The lines are coloured by the superinfection preva-

lence (Ps) from 1% to 30% of participants having more than

one strain. We calculated our method’s accuracy, or bias, as

difference from the true value of richness and our methods

precision as standard deviation in the estimated ZTP

parameter lest over 2000 simulations. As nN increased, esti-

mate accuracy increases (figure 5a). Similarly, estimate

precision improved with increased sampling (figure 5b).

Accuracy was lowest when population richness was low,

unevenness was high and overall sample size was low. Bias

was, however, of low magnitude throughout scenarios

addressed, indicating the accuracy of our method over a

Page 6: Model-based estimation of superinfection prevalence from ... · Superinfection is highly relevant to the disciplines of public health and immu-nology because it indicates that primary

0.1

0

–0.1

–0.2

–0.3

0.2

–0.4

0.1

0

–0.1

–0.2

–0.3

0.2

–0.4

0.9 1.41.31.21.11.0 1.5actual average richness, ·RÒ superinfection param, a

2 543 6no. samples, n no. participants, log

10N

102 103

0 21

naiveEM

rich

ness

acc

urac

y(e

stim

ated

– a

ctua

l)ri

chne

ss a

ccur

acy

(est

imat

ed –

act

ual)

(a) (b)

(c) (d)

Figure 4. Comparison of naive and our EM estimation using simulated data varying the number of samples n ¼ 2! 6, the number of study participants N ¼ 40! 103, the average richness kRl ¼ 1! 1.4 and the evenness (through superinfection parameter a ¼ 1! 3) for a total of 1050 parameter combinations.Estimates were calculated for each parameter combination. Jitter was introduced artificially on the x-axis for visualization purposes. The EM algorithm was superiorin all cases, particularly as the naive method worsened for increased richness and increased unevenness. The naive method improved slightly with increasedsampling per person, while neither samples nor the number of participants affected the EM method substantially.

prob

abili

ty o

f ob

serv

ing

supe

rinf

ectio

n, f

S

no. samples, n

f1

f2

f1

fS~ 2f1f2

f2

f1 f2

f1 f2

1.0

0.8

0.6

0.4

0.2

1.0

0.8

0.6

0.4

0.2

0

1.0

0.8

0.6

0.4

0.2

0

even

ness

, E

0 21 3superinfection parameter, a

R = 2345

2 6543 7 2 6543 7

a = 0

(i) (ii)

(iii) (iv)

R=

2, n

=2,

a=

1R

=2,

n=

2, a

=2

a = 1

a = 2 a = 3

R = 2345

(a) (b) (c)

Figure 3. Observing superinfection becomes significantly unlikely even with large richness as unevenness increases. (a) A geometric demonstration of the probabilityof detecting the combinations of two strains for varying evenness. In the top example, the first strain is twice as abundant as the second (f1 � 2f2, a ¼ 1), inthe bottom the first is five times as abundant (f1 � 5f2, a ¼ 2). These relationships are calculated with equation (2.5). (b) The evenness E is plotted as afunction of the superinfection parameter a for different richness R (see equation (2.8)). Examples in (a) are noted with arrows. (c) The probability of observingsuperinfection increases with more samples, but the rate of increase is drastically slower for uneven strain abundances. The probabilities of observing superinfectioncorresponding to the examples in (a) can be seen as fs � 0.4 and � 0.2, respectively.

rsif.royalsocietypublishing.orgJ.R.Soc.Interface

15:20170968

6

on March 1, 2018http://rsif.royalsocietypublishing.org/Downloaded from

wide range of conditions. Increasing samples per participant

at the expense of number of participants had little effect on

accuracy or precision regardless of average richness or

evenness (no slope in lines). A simulation consisting of

nN ¼ 2000 total samples would eliminate nearly all bias

and most imprecision except in highly uneven conditions.

Page 7: Model-based estimation of superinfection prevalence from ... · Superinfection is highly relevant to the disciplines of public health and immu-nology because it indicates that primary

rich

ness

est

imat

e pr

ecis

ion

(s.d

. les

t for

200

0 si

mul

atio

ns)

rich

ness

est

imat

e ac

cura

cy(e

stim

ated

– a

ctua

l)

no. samples (n)2 654

0.02

a=

0a

=2

a=

2a

=0

–0.02

–0.01

0

0.01

0.02

–0.02

0.2

0

0.1

0.2

0

0.1

–0.01

0

0.01

3 2 6543 2 6543 2 6543

no. samples (n)2 6543 2 6543 2 6543 2 6543

nN = 250 nN = 500 nN = 1000 nN = 2000

nN = 250 nN = 500 nN = 1000 nN = 2000(b)

(a)

25102030

s = 1%

25102030

s = 1%

Figure 5. Accuracy and precision improve with increased sampling equivalently. We apply our method to simulated data for fixed total number of samples (nN—columns), evenness (a—rows) and richness (Ps—lines). Two thousand replicate simulations were completed for each parameter set. (a) Estimate accuracymeasured by the difference of the actual richness and the average estimated richness. (b) Estimate precision measured by standard deviation (s.d.) of estimatedZTP parameter across 2000 replicate simulations. Both accuracy and precision improve with total number of samples, but do not change with n or N when the totalnumber of samples (nN) remains constant.

rsif.royalsocietypublishing.orgJ.R.Soc.Interface

15:20170968

7

on March 1, 2018http://rsif.royalsocietypublishing.org/Downloaded from

2.5. Estimation of average richness is less biased ifinitial assumptions of evenness are too low ratherthan too high

The current model requires an assumption of the evenness.

Yet, this information is often not directly available for a given

infection. Preliminary studies of simultaneous inference of

evenness and richness show that many samples are needed

per person but will be the subject of further work. To show

the bias resulting from the evenness assumption, we tested

the accuracy of the model when the superinfection parameter

was incorrectly specified. We simulated a study with N ¼ 500

participants each sampled n ¼ 3 times. The evenness of the

virus was specified by a ¼ 0, 1, 2. Then, we used the EM algor-

ithm with varying estimated a to estimate the ZTP parameter.

The procedure was replicated 2000 times for each parameter

set. The results are shown in figure 6. For all values of average

richness (true l), estimating a too low (the true value is

denoted by the location of the black circles), resulted in an

underestimate of l, while overestimating a resulted in an over-

estimate of l (figure 6). Overestimates of a resulted in a greater

magnitude of bias than underestimates. It is thus advisable

to assume evenness in cases where no information about the

relative abundance is present.

2.6. Sensitivity to the assumption of homogeneouspopulation-level superinfection prevalence

A ZTP distribution assumes the superinfection risk is homo-

geneously distributed across the population. To assess

whether our estimator would perform well under departures

from the assumption of a true underlying ZTP, we simulated

superinfection to be possible only among half the population.

For example, when simulating a 30% superinfection rate,

instead of simulating a constant average number of infections

of 1.4 per person (l ¼ 0.68), we simulated an average of 2.0

infections (l ¼ 1.62) in half the population (to achieve a

60% expected superinfection rate) and exactly 1.0 infections

(or 0% superinfection rate) in the other half. We then esti-

mated superinfection prevalence from the resulting data

assuming the data were distributed with a single average

richness (i.e. keeping the ZTP distribution assumption in

Page 8: Model-based estimation of superinfection prevalence from ... · Superinfection is highly relevant to the disciplines of public health and immu-nology because it indicates that primary

1.5

a = 0 a = 1 a = 2

0

0 1.00.5 0 2 1.0 2.01.51

0.3

0.6

0.9

1.2

estim

ated

l

estimated a

true l

Figure 6. Richness estimates are less biased when the superinfection parameter is assumed to be too low relative to its true value rather than too high. The valuesof the true l and a are shown by the black dots, and for all l the estimate of l diverges quickly and repeatably (2000 replicates) for assumed a above true a.(Online version in colour.)

rsif.royalsocietypublishing.orgJ.R.Soc.Interface

15:20170968

8

on March 1, 2018http://rsif.royalsocietypublishing.org/Downloaded from

the model). We used 10 000 participants per each of 100

simulations to quickly estimate accuracy.

We varied superinfection rates (Ps ¼ 2%! 30%), per-

person numbers of samples (n ¼ 2! 6), and unevenness

(a ¼ 0! 3). Percent bias in the EM estimate of superinfection

prevalence was less than 20% over all simulations. Percent

bias was less than 5% for lower true superinfection preva-

lence (2–10%). Of note, superinfection estimates tended to

be slight overestimates when the population was highly

dichotomized in terms of risk rather than well mixed; the

highest overestimate was a bias of 36% for a high true super-

infection prevalence (30%) even strain abundance (a ¼ 0) and

high number of samples (n ¼ 5) samples per-person. Bias was

lower with higher unevenness, fewer samples per person and

lower superinfection prevalence.

If presented with different data suggesting heterogene-

ous incidence of infection, our model might potentially

be extended to infer more than one infection probability.

However, the small per cent biases that were calculated for

many conditions indicate that our methods are likely to be

approximately accurate even when the assumption of constant

incidence is not met.

2.7. Example from the literature, herpes simplex virus-2superinfection estimate

Recently, evidence for dual-strain infection with multiple

strains of HSV-2 was found in a global study with N ¼ 459

participants [6]. Two specimens per individual (n ¼ 2) were

collected a median of five months apart. Sequencing and

phylogenetic analysis found 17 dual-strain infection events.

We use these data as an example dataset to illustrate the

power of the EM algorithm in estimating the prevalence of

dual-strain infection. The raw data and the naive estimate

are compared with the EM estimate in figure 7. While there

are not extra data available to estimate the relative abundance

of strains (and therefore the superinfection parameter), we

make the neutral estimate of perfect evenness, finding that

the prevalence of dual-strain infection is approximately 7%,

doubling the naive estimate of 3.7% [6]. Moreover, if the

strains were even slightly uneven (a ¼ 1), the dual-strain

infection prevalence might be as high as 10%, and higher

unevenness indicates even higher prevalence.

3. Results II: many detected strains per sampleSuperinfection has been observed in systemically replicating

viruses in which a single sample can detect multiple strains.

These viruses include HIV [1–3] and hepatitis C [4]. For

these viruses, inference of the population-level prevalence of

superinfections requires additions to the previous model. In

figure 8, we assume that sampled sequences have already

been preprocessed into a phylogenetic tree [35] that revealed

multiple strain populations that are sufficiently distinct to con-

fidently infer dual-strain infection rather than intra-host

evolution. In our illustration, the single participant has been

detected to have Robs ¼ 2 strains in a sample with n ¼ 19

sequences, 12 of strain A and seven of strain C. It is possible

to use the ratio of these strain counts to estimate the evenness

of the virus, but viruses may change proportion over time. We

use the EM formalism as in the previous model to perform the

estimate of the average strain richness in a population.

We performed simulations to test the accuracy of the new

multi-strain per sample model. By simulating data from a

study with 200 individuals we compared the EM method

to the naive estimate. We varied the number of samples per

participant n ¼ 10! 200, the actual average richness of the

population kRl ¼ 1! 1.4 and the superinfection parameter

a ¼ 0! 5 for a total of 180 parameter combinations. Shown in

figure 9, the EM method has excellent accuracy for all parameter

combinations, while the naive method worsens with decreasing

number of samples, increasing average richness, and increasing

unevenness. Unevenness is modelled as before with an

exponential function and the single ‘superinfection’ parameter

a. Particularly in cases where small sample size is desirable

for practical reasons (e.g. cost), the EM method performs

substantially more accurately for modest unevenness a � 2.

Again, simulations were accomplished in R. A population

of participants (N ¼ 200) with random richness was created

by sampling from the ZTP distribution. Then, the observed

data were created by multinomial sampling with intra-host

probabilities for each strains calculated from the specified a

and that individual’s richness (see equation (5.2)) where the

number of samples n represents the number of sequences

measured and analysed simultaneously.

As a brief illustration, we used our method on freely

available data from a clinical study of HIV superinfection [2].

Piantadosi et al. acquired samples at two-time points (‘initial’

Page 9: Model-based estimation of superinfection prevalence from ... · Superinfection is highly relevant to the disciplines of public health and immu-nology because it indicates that primary

supe

rinf

ectio

n pr

eval

ence

, s

500

0

100

200

300

400

1.5

1.0

1.1

1.2

1.3

1.4

0.40

0.35

01 2

strains observed (Robs) assumed a

naiveestimate

naiveestimate

assumed a

estim

ated

ave

rage

ric

hnes

s, ·

part

icip

ant c

ount

s, N

(Rob

s )

0 321 0 321

0.30

0.25

0.20

0.15

0.10

0.05

(a) (b) (c)

Figure 7. HSV-2 data (a) showing counts of study participants found to be infected by one N(1) ¼ 442 or two N(2) ¼ 17 strains in a two-sample study. Even withperfectly even strains, the EM estimate of prevalence of dual-strain infection is a factor of two higher than the naively determined prevalence. (b) Likewise, richnessis underestimated substantially by the naive approach. (c) However, increasingly uneven strain abundance rapidly decreases the accuracy of the naive estimate ofaverage richness. (Online version in colour.)

0

5

10

15

A C

coun

ts

A (n = 12)

C (n = 7)

R = 2

strain

Figure 8. The model is adjusted for viruses that allow for multiple strain detection in a single sample. We use preprocessed phylogenetic trees to develop the straincounts. Here, strains A and C were detected 12 and seven times, respectively, for a total of n ¼ 19 and Robs ¼ 2. The population prevalence of superinfection iscalculated by incorporating multiple individuals. The relative proportion of observed strains provides a first-order estimate of the virus evenness. (Online versionin colour.)

rsif.royalsocietypublishing.orgJ.R.Soc.Interface

15:20170968

9

on March 1, 2018http://rsif.royalsocietypublishing.org/Downloaded from

and ‘chronic’ approx. 100 and 1000 days post infection,

respectively) in 36 HIV-infected participants. There was a

median of 15 sequences per participant. Possible superinfec-

tion was identified in seven participants, and additional

intermediate time points were analysed in those seven

patients. Using only sequences read on or following the

first time superinfection was observed in those seven persons,

we found that 9–15% of sequences observed were of the

less abundant (subdominant) strain. Only data from time

points with two or more strains detected are presented in

their work (1/2 of time points) indicating a 5–7% prevalence

of the subdominant strain, when two strains are present. This

unevenness corresponds to a � 2.6! 3.0, for which we

choose the lower bound. Using this superinfection parameter,

we applied our EM estimation algorithm to a single time

point (the final time point) for the env gene of HIV. If those

data were naively analysed, 4/36 or 11% superinfection

prevalence would be found. However, the algorithm

calculates 25% superinfection prevalence. This is similar to

the total study superinfection prevalence of 7/36 or 20%,

illustrating the inferential ability of algorithm using under-

sampled data. Moreover, the total study superinfection

prevalence also likely represents an underestimate due to

the unevenness of strains and relatively low sample size

(median seven sequences per env gene per time point).

4. DiscussionBecause clinical superinfection studies are challenging,

accounting for limited sampling is necessary to realistically

define the true prevalence of superinfection. We have shown

that if richness (total number of strains) or unevenness (relative

abundance of strains) is high within a host, a standard calcu-

lation (naive estimate) of prevalence at the population level is

likely to represent an underestimate particularly if the under-

lying biology of infection only allows for the detection of a

single viral strain per sample. Viruses that periodically reacti-

vate such as HSV-2 are likely to fall within category.

For infections such as HIV-1, in which multiple strains can

be obtained per sample, even minimal unevenness (a . 2)

still creates conditions in which naive estimates are likely to

underestimate true superinfection prevalence.

We incorporated an ecological framework and developed

an EM methodology to estimate the true prevalence of super-

infection on a population level, beginning with minimal

assumptions about the incidence distribution and the within-

host evenness. Our effort to describe uneven strains is justified

by the fact that individual viral strains expand and contract on

exponential scales in accordance with the intensity of the

immune response, making unevenness and single strain pre-

dominance a likely phenomenon [15]. The method infers a

single parameter distribution, the ZTP distribution, that can

Page 10: Model-based estimation of superinfection prevalence from ... · Superinfection is highly relevant to the disciplines of public health and immu-nology because it indicates that primary

0.05

rich

ness

acc

urac

y(e

stim

ated

– a

ctua

l)

–0.400 200 1.0 1.4 0 543211.31.21.1150

naiveEM

100number samples, n actual average richness, ·RÒ superinfection param, a50

–0.35

–0.30

–0.25

–0.20

–0.15

–0.10

–0.05

0

Figure 9. In the multi-strain per sample scenario we compared estimates of richness with naive and our EM method using simulated data. We fixed N ¼ 200 andvaried the number of samples per participant n ¼ 10! 200, the actual average richness of the population kRl ¼ 1! 1.4 and the superinfection parametera ¼ 0! 5 for a total of 180 parameter combinations. Estimates were calculated for each parameter combination. Jitter was introduced artificially on the x-axis forvisualization purposes. The EM method returns the simulated richness accurately for all parameter combinations, whereas the naive method worsens for increasingaverage richness or unevenness (encapsulated by the superinfection parameter). When even subtle unevenness exists (e.g. a � 2), the naive estimate can be up to15% off when fewer than 20 sequences can be detected. This error increases with unevenness. Conversely, the EM method is accurate (less than 1% error) for allparameter combinations including high unevenness and low number of samples.

rsif.royalsocietypublishing.orgJ.R.Soc.Interface

15:20170968

10

on March 1, 2018http://rsif.royalsocietypublishing.org/Downloaded from

be used to extrapolate the population-level prevalence of

superinfection from limited data.

Our first method was designed for infections such as

HSV-2 that typically admit a single strain per sample. In

this case, superinfection is particularly likely to be underesti-

mated by low sampling. When applied to a clinical study, we

found a twofold increase in the prevalence compared to the

naive estimate. The estimate marks a lower bound; the esti-

mate only increases with increasing assumed unevenness.

We specifically examine HSV-2, but similar outcomes can

be expected for other chronic viral infections that sequentially

reactivate single strains longitudinally. A caveat to this esti-

mation is provided in that current methods to estimate

evenness may not be adequate, and the assumption of even-

ness may alter the estimation of richness. We determined that

in cases where no information of evenness is available, the

most conservative estimates of superinfection prevalence

can be calculated by using perfect evenness (a ¼ 0).

We also extended our approach to include systemically

infecting viruses such as HIV-1 and hepatitis C that allow

for concurrent detection of multiple strains in a single

sample. These datasets allow us to use a specific evenness

distribution for each participant, making simultaneous infer-

ence of evenness and richness possible, and resulting in the

fact that underestimation of superinfection in these datasets

are much less extreme. As shown in a short analysis of litera-

ture data, our inference method may allow for substantially

less sampling to obtain the same results as previous studies

with large sample sizes.

By using large simulated datasets, we found that the accu-

racy and precision of the EM estimation is consistently

excellent for parameter ranges of interest (less than 1%

error in estimated average richness). Our model provides

the most benefit in situations where underestimation is high-

est with the naive method: when very limited samples are

collected from samples of individuals with higher richness

and uneven strains. The simulations also highlight a practical

consideration: that increased sampling per participant, or

enrolling more participants provide the same benefit to esti-

mation accuracy. This flexibility is valuable for studies

where one or the other option is simpler or more affordable.

Our tools can be applied to quantify superinfection in a

broad range of viruses. However, certain broad assumptions

must be made. First, deep sequencing must be able to dis-

tinguish strains, which may be difficult for viruses with low

population-level diversity. Second, we assume that a majority

of dual-strain infections results from superinfection rather

than dual-strain infection (simultaneous transmission). The

validity of this assumption is related to the particular epi-

demiology of the virus of interest. In general, if immunity

against primary infection is poor, and there are multiple

exposures to infection, then superinfection becomes far more

likely. Third, for our model, sampling must occur from a

population with relatively similar risks of multiple exposures

to a virus of interest. These populations would tend to include

those with homogeneous mixing and similar behavioural

risks. Estimating superinfection in heterogeneous populations

will be possible, but considerably more complex, requiring a

deep understanding of virology and exposure patterns to

accurately parameterize those situations. We have shown

that for limited data, bias generated by this assumption is

low. In future work, we might generalize our method to

other distributions, particularly to describe a study popu-

lation stratified by low and risk subpopulations. Fourth, we

must estimate the unevenness of the virus separately. This

can be done by comparing the relative abundance of strains

detected from viruses that allow multiple strains to be

detected simultaneously. Otherwise, we must guess uneven-

ness, and our sensitivity analysis shows that the least biased

assumption without extra information is to assume perfectly

even strain abundance (a ¼ 0). Unless additional data are

available to say otherwise, we recommend that perfect even-

ness should be assumed. Preliminary simulations on

simultaneous inference of evenness and richness in the

single strain per sample setting require very high numbers

of samples per person n�10. For practical purposes, it may

be possible to use a highly sampled subset of a study popu-

lation to create an estimation of the evenness, and use this

estimate across the rest of the low-sampled cohort. Moreover,

it may be that the biologic importance of superinfecting

strains may correlate with their abundance, making the

rarest strains less relevant in a vaccine context.

Page 11: Model-based estimation of superinfection prevalence from ... · Superinfection is highly relevant to the disciplines of public health and immu-nology because it indicates that primary

rsif.royalsocietypublishing.orgJ.R

11

on March 1, 2018http://rsif.royalsocietypublishing.org/Downloaded from

Prior estimates of richness are of practical value in terms

of deciding the samples required per person in a clinical

study. Thus, our methods are clinically relevant as increased

average richness or decreased strain evenness necessitates

higher sampling. Quantifying viral superinfection addresses

an important biologic question because superinfection

implies the failure of the memory immune response to

protect against re-exposure to a virus. The failure of cross-

immunity becomes immediately important in the context of

vaccine design and our results suggest in some contexts,

the phenomenon might occur more frequently than

previously envisioned.

.Soc.Interface15:201

5. Material and methodsCode used in all methods, and to make all plots is hosted on

https://github.com/dbrvs/superinfection. Short examples are

provided that could be modified by the user for their purposes.

70968

5.1. Derivation of within-host probability distributionfor even, single strain per measurement scenario

We are interested in the number of combinations of R strains in

n samples that contain exactly Robs strains. For a given Robs,

then we think of this as a ‘Polya urn’ problem, distribut-

ing each of the n samples into Robs urns, where no urn

can be left empty. The solution is the alternating sum

Robs

0

� �ðRobs � 0Þn� Robs

1

� �ðRobs � 1Þn þ Robs

2

� �ðRobs � 2Þn � . . . [36].

Then, there are RRobs

� �ways to select Robs strains from the true

underlying richness, and finally, to make it probability we

must divide by the total number of possible combinations from

the true richness, i.e. Rn. The result is the probability of observing

the richness r given n samples from an individual with a true

underlying richness R and a perfect evenness of viral strains.

This leads to the conditional probability discussed in the Results,

F(Robs ¼ rjR, n) ¼ 1

Rn

Rr

� �Xr�1

k¼0

rk

� �(�1)k(r� k)n: ð5:1Þ

5.2. Notes on exponential unevennessTo parameterize the uneven abundances, we made the assump-

tion that strain abundances could be modelled as exponentially

uneven. That is the j-th strain has probability

fj ¼ c exp (�aj): ð5:2Þ

The exponential function was chosen because given a single par-

ameter model, it is the ‘maximum entropy’ distribution, or that

which makes the minimally informative assumption [31]. We cal-

culate the normalization constant c, which depends on the

richness R, using a partial sum of the geometric seriesPRk¼1 xk ¼ x(xR � 1)=(x� 1). We replace x ¼ exp (�a), which

makes jxj , 1 as required. Some simplifying leads to

c(a, R) ¼XR

k¼1

e�ak

" #�1

¼ ea � 1

1� e�aR : ð5:3Þ

Note these formulae are only valid for a . 0, but allowing a! 0

we can expand terms to first order in a, i.e. using e2ax � 1 2 ax,

we have

fj �a!0

(1þ a)� 1

1� (1� aR)(1� aj) ¼ 1

R, ð5:4Þ

thus recovering the even abundance model.

Inserting the uneven abundance formulation into the

evenness function leads to

E ¼ � 1

ln R

XR

i¼1

ce�ai lnce�ai

¼ � c

ln R

XR

i¼1

(e�ai lncþ�aie�ai): ð5:5Þ

The first term in the sum is computedPR

i¼1 e�ai lnc ¼ c�1 lnc

recognizing again the partial sum. The second term can be com-

puted by using a partial derivative with respect to a, that is

�i exp (�ai) ¼ @a exp (�ai), shifting the derivative outside the

sum, and identifying c21. Thus we have the expression for

evenness in the exponentially abundant model

E(a, R) ¼ � 1

ln R(lncþ ac@ac

�1): ð5:6Þ

5.3. Derivation of within-host probability distributionfor uneven, single strain per measurement scenario

When strains are not assumed to be evenly abundant, detection

probabilities differ by strain. We name the strains by ranking

them in terms of their relative abundance, i.e. s2 is the second

most abundant strain. In the context of uneven abundance, the

probability of observing exactly k strains in an individual requi-

res enumeration of all possible observed combinations because

f1f2 = f2f3 (see visualization in figure 3a). Here, we derive the

general expression for the probability of observing a certain rich-

ness Robs given uneven strains probabilities parameterized by a,

underlying richness R and number of samples n in an individual.

We call each of the possible strain combinations a word W. The

word s1s2s4 means strain 1, 2 and 4 were observed. There are RRobs

� �combinations that result in the same number of observed strains.

We record these combinations in the set of all words {W}, the set

of all ways to select Robs observed strains from R true strains.

Then, keeping with the urn analogy, we define the set {U} to be

the set of all possible ordered words that distribute n samples

into Robs urns. This provides the sequence of observations of

strain 1! Robs. The sequence U l ¼ (1, 2, 1) means in the l-thsequence first strain was counted once, the second twice and the

third once. We also further define references to single strain prob-

abilities within words by subscript j such that fW ijdenotes the

probability of the strain in the j-th position of the i-th word.

To calculate the conditional probability we desire, we sum

over {W} and {U} where j . j indicates the magnitude of the set,

multiplying the probability of each strain k in a word Wik

being counted U jk times. We also divide by U jk! to account for

the fact that when any strain has been selected more than once

the order does not need to be considered. Finally, we obtain

CðRobs ¼ rjR, n,a) ¼Xj{W}j

i¼1

Xj{U}j

j¼1

Yr

k¼1

1

U jk!fWik

� �U jk : ð5:7Þ

The expression is used in the EM algorithm to calculate the

population-level prevalence of superinfection. Note in the case

where Robs ¼ 1, the expression simplifies toPR

i¼1 (fi)n. A more

involved example is illustrative:

Example 5.1 (R 5 6, Robs 5 3, n 5 3). The set of all words is

{W} ¼ {s1s2s3, s1s2s4, . . . ,s4s5s6} which has magnitude j{W}j ¼63

� �¼ 20. The set of all observation sequences is fUg¼f(1,1,4),

(1,2,3), (1,3,2), (1,4,1), (2,1,3), (2,2,2), (2,3,1), (3,1,2), (3,2,1),

(4,1,1)g which has j{U}j ¼ 10. Summing over all these combi-

nations, we compute the probability of the k-th listed strain in

the i-th word, exponentiating and dividing by the observed

Page 12: Model-based estimation of superinfection prevalence from ... · Superinfection is highly relevant to the disciplines of public health and immu-nology because it indicates that primary

rsif.royalsocietypublishing.orgJ.R.Soc.Interface

15:20170968

12

on March 1, 2018http://rsif.royalsocietypublishing.org/Downloaded from

count U jk. For example, using the set {W} above, we read off

fW2,3¼ f4 ¼ c exp (�4a) and U2,3 ¼ 3.

5.4. Derivation of within-host probability distributionfor multiple strain per measurement scenario

Because some viruses allow for concurrent measurements of mul-

tiple strains, we expand the model to account for this scenario. We

again sum over fWg the set of all ways to select Robs observed

strains from R true strains. Since the strains could be present in

any relative abundance, we then sum over fSg, the set of

all permutations of the observation counts of the Robs strains (n1,

n2, . . ., nRobs), and take the product of probabilities of that set. For

clarity, each member of a single permutation (Sj) is the list (Sj1,

Sj2, . . ., SjRobs), where each element is unique and represents a re-

ordering of the elements of the samples fnkg; there are Robs!

elements Sj within fSg. Thus the conditional probability is

Q(Robs ¼ rjR, n,a) ¼ nn1,n2, . . . ,nRobs

� �Xj{W}j

i¼1

Xj{S}j

j¼1

Yr

k¼1

(fWik)S jk , ð5:8Þ

where Q distinguishes from the first model. For clarity, nk ¼ fn1,

n2, . . ., nRobsg denotes the list of counts of each observed strain.

For example, from a sample of 20 reads, nk ¼ f195, 5g indicates

Robs ¼ 2. The real richness R may be 2, or R may be 3 and the

sample may contain five copies of the second or the third most

abundant strains. Our code currently stops at R ¼ 6 infecting

strains before computation time becomes an issue.

5.5. Estimating richness in a clinical study populationusing the expectation-maximization algorithm

To calculate the true average richness of a virus in a population, we

use conditional probability to estimate the real distribution from the

observed distribution. We write the general probability for the true

richness as p(RjRobs, n, a, l), which is conditional on the

experimental variables Robs, n and the parameters of our modela, l.

By Bayes’ rule, the probability of the true richness can be

calculated using the calculated probabilities above as

p(RjRobs, n,a, l) ¼ p(RobsjR, n,a)Z(R,l)PR p(RobsjR, n,a)Z(R, l)

: ð5:9Þ

We have calculated the observation probabilities p(RobsjR, n, a)

as F in equation (5.1) for the even case, C in equation (5.7) for the

uneven case, and Q in equation (5.8) for the multiple strains per

sample case. The denominator of the conditional distribution has

been written using the identity p(y) ¼P

x p(yjx)p(x). Throughout,

we have chosen the single parameter (l) ZTP distribution for the

distribution of richness Z(R,l).

The model assumes that we have counts of participants

infected with Robs strains N(Robs) which we transform into the

observed richness probability p(Robs) ¼ N(Robs)=P

Robs N(Robs).

The model also requires that the number of samples (or sequen-

ces found per person in the multi-strain per sample case) is equal

for all participants. Finally, we assume a single value of the

superinfection parameter (see discussion in Results for the

impact of this assumption).

With these assumptions, we determine the optimal model

parameter (l) with the EM algorithm. We begin with an initial

guess of l0 from the naive calculation of average richness (see

equation (1.1)) and the definition of the average richness with

respect to l,

kR0l ¼XRobs

p(Robs)Robs ¼ l0

1� e�l0: ð5:10Þ

We then use maximum-likelihood iteratively, given l0. We

calculate the conditional probability given the ZTP parameter

p(RjRobs, n, a, li) using equation (5.9) and equations (5.1), (5.7)

or 5.8 depending on the biology of the virus being studied.

Then, the expectation value of the average observed richness is

kRil ¼XRobs

p(Robs)XRmax

R¼1

R p(RjRobs, n,a, li), ð5:11Þ

where Rmax ¼ 6. We then calculate the ZTP parameter numeri-

cally by maximizing the likelihood, in this case analytically

equivalent to setting

liþ1

1� e�liþ1¼ kRil: ð5:12Þ

Then, this updated ZTP parameter is used to update the

conditional probability distribution p(RjRobs, n, a, liþ 1), and the

algorithm proceeds. We iterate the process until we reach our

desired convergence tolerance of liþ 1 2 li , e ¼ 0.002, or the

number of iterations increases beyond 50. In this way we com-

pute the maximum-likelihood value of l, which uniquely

specifies the average richness and the superinfection prevalence

in our population.

Data accessibility. All clinical data are available in referenced publi-cations, all code is freely available at https://github.com/dbrvs/superinfection

Authors’ contributions. J.T.S. and C.J. originally posed the problem of esti-mating superinfection. D.B.R. developed the simple model, andA.S.M. did the heavy lifting, deriving the generalized formalism,introducing the EM algorithm solver, and writing code for all compu-tational simulations. A.L.G. provided support for sequencing details.D.B.R. made figures and (with J.T.S.) wrote the paper. All authorscontributed to revised drafts.

Competing interests. We declare we have no competing interests.

Funding. We thank the VIDD faculty initiative at the Fred HutchinsonCancer Research Center, and the NIH for grants U19: AI096111(J.T.S.), UM1: AI12662 (J.T.S.) and P01: AI030731 (C.J., A.S.M.,J.T.S.). D.B.R. is supported by the Washington Research Foundation.

Acknowledgements. We thank the study participants, and several refereeswhose comments substantially improved the work.

References

1. Smith DM, Wong JK, Hightower GK, Ignacio CC,Koelsch KK, Daar ES, Richman DD, Lottle SJ. 2004Incidence of HIV superinfection following primaryinfection. JAMA 292, 1177 – 1178. (doi:10.1371/journal.ppat.0030177)

2. Piantadosi A, Chohan B, Chohan V, McClelland RS,Overbaugh J. 2007 Chronic HIV-1 infection frequentlyfails to protect against superinfection. PLoS. Pathog.3, 1745 – 1760. (doi:10.1371/journal.ppat.0030177)

3. Redd AD, Quinn TC, Tobian AAR. 2013 Frequencyand implications of HIV superinfection. Lancet Infect.Dis. 13, 322 – 328. (doi:10.1016/S1473-3099(13)70066-5)

4. Grebely J et al. 2012 Hepatitis C virusreinfection and superinfection among treatedand untreated participants with recent infection.Hepatology 55, 1058 – 1069. (doi:10.1002/hep.24754)

5. Johnston C, Koelle DM, Wald A. 2014 Current statusand prospects for development of an HSV vaccine.Vaccine 32, 1553 – 1560. (doi:10.1016/j.vaccine.2013.08.066)

6. Johnston C et al. 2017 Dual-strain genital herpessimplex virus type 2 (HSV-2) infection in the USA, Peru,and eight countries in sub-Saharan Africa: a nestedcross-sectional viral genotyping study. PLoS. Med. 14,e1002475. (doi:10.1371/journal.pmed.1002475)

Page 13: Model-based estimation of superinfection prevalence from ... · Superinfection is highly relevant to the disciplines of public health and immu-nology because it indicates that primary

rsif.royalsocietypublishing.orgJ.R.Soc.Interface

15:20170968

13

on March 1, 2018http://rsif.royalsocietypublishing.org/Downloaded from

7. Hansen S et al. 2010 Evasion of CD8þ T cells iscritical for superinfection by cytomegalovirus.Science 328, 102 – 106. (doi:10.1126/science.1185350)

8. Walker MR, Li H, Teutsch S, Betz-stablein B, Luciani F,Lloyd AR, Bull RA. 2016 Incident hepatitis C virusgenotype distribution and multiple. J. Clin. Microbiol.54, 1855 – 1861. (doi:10.1128/JCM.00287-16)

9. Pham ST, Bull RA, Bennett JM, Rawlinson WD, DoreGJ, Lloyd AR, White PA. 2010 Frequent multiplehepatitis C virus infections among injection drugusers in a prison setting. Hepatology 52, 1564 –1572. (doi:10.1002/hep.23885)

10. Ronen K et al. 2013 HIV-1 superinfection occurs lessfrequently than initial infection in a cohort of high-risk Kenyan women. PLoS. Pathog. 9, 14 – 16.(doi:10.1371/journal.ppat.1003593)

11. Gao Y, Tian W, Han X, GAO F. 2017 Immunological andvirological characteristics of humanimmunodeficiency virus type 1 superinfection:implications in vaccine design. Front. Med. 11, 480 –489. (doi:10.1007/s11684-017-0594-8)

12. Vesa J, Chaillon A, Wagner GA, Anderson CM,Richman DD, Smith DM, Little SJ. 2017 IncreasedHIV-1 superinfection risk in carriers of specifichuman leukocyte antigen alleles. AIDS (London,England) 31, 1149 – 1158. (doi:10.1097/QAD.0000000000001445)

13. Serwanga J et al. 2018 HIV-1 superinfection canoccur in the presence of broadly neutralizingantibodies. Vaccine 36, 578 – 586. (doi:10.1016/j.vaccine.2017.11.075)

14. Schiffer JT, Abu-Raddad L, Mark KE, Zhu J, Selke S,Magaret A, Corey WL. 2009 Frequent release of lowamounts of herpes simplex virus from neurons:results of a mathematical model. Sci. Transl. Med. 1,7ra16. (doi:10.1126/scitranslmed.3000193)

15. Schiffer JT et al. 2013 Rapid localized spread andimmunologic containment define herpes simplexvirus-2 reactivation in the human genital tract. eLife2013, 1 – 28. (doi:10.7554/eLife.00288)

16. Barzon L, Lavezzo E, Militello V, Toppo S, Palu G.2011 Applications of next-generation sequencingtechnologies to diagnostic virology. Int. J. Mol. Sci.12, 7861 – 7884. (doi:10.3390/ijms12117861)

17. Quinones-Mateu ME, Avila S, Reyes-Teran G,Martinez MA. 2014 Deep sequencing: becominga critical tool in clinical virology. J. Clin. Virol. 61,9 – 19. (doi:10.1016/j.jcv.2014.06.013)

18. Beerenwinkel N, Gunthard HF, Roth V, Metzner KJ. 2012Challenges and opportunities in estimating viral geneticdiversity from next-generation sequencing data. Front.Microbiol. 3, 1 – 16. (doi:10.3389/fmicb.2012.00329)

19. Joseph SB, Swanstrom R, Kashuba ADM, Cohen MS.2015 Bottlenecks in HIV-1 transmission: insightsfrom the study of founder viruses. Nat. Rev.Microbiol. 13, 414 – 425. (doi:10.1038/nrmicro3471)

20. Bukh J, Miller RH, Purcell RH. 1995 Geneticheterogeneity of hepatitis C virus: quasispecies andgenotypes. Semin. Liver Dis. 15, pp. 41 – 63.(doi:10.1055/s-2007-1007262)

21. Liu SL, Rodrigo AG, Shankarappa R, Learn GH, HsuL, Davidov O, Zhao LP, Mulins JI. 1996 HIVquasispecies and resampling. Science 273,413 – 417. (doi:10.1126/science.273.5274.413c)

22. Gorzer I, Guelly C, Trajanoski S, Puchhammer-StocklE. 2010 The impact of PCR-generated recombinationon diversity estimation of mixed viral populationsby deep sequencing. J. Virol. Methods 169,248 – 252. (doi:10.1016/j.jviromet.2010.07.040)

23. McCrone JT, Lauring AS. 2016 Measurements ofintrahost viral diversity are extremely sensitive tosystematic errors in variant calling. J. Virol. 90,6884 – 6895. (doi:10.1128/JVI.00667-16)

24. Lorentzon M, Hooker JC. 2006 Nonparametricestimation of Shannon’s index of diversity when thereare unseen species in sample. J. Adv. Nurs. 55, 273 –275. (doi:10.1111/j.1365-2648.2006.03971_1.x)

25. Burnham KP, Overton WS. 1979 Robust estimationof population size when capture probabilities varyamong animals. Ecology 60, 927 – 936. (doi:10.2307/1936861)

26. Chao A, Colwell RK, Lin CW, Gotelli NJ. 2009Sufficient sampling for asymptotic minimum speciesrichness estimators. Ecology 90, 1125 – 1133.(doi:10.1890/07-2147.1)

27. Chao A, Chiu C, Hsieh T, Davis T, Nipperess D,Faith D. 2015 Rarefaction and extrapolation ofphylogenetic diversity. Methods Ecol. Evol. 6,380 – 388. (doi:10.1111/2041-210X.12247)

28. Anthony SJ et al. 2013 A strategy to estimateunknown viral diversity in mammals. mBio 4,e00598-13. (doi:10.1128/mBio.00598-13)

29. Korber BT, Kunstman KJ, Patterson BK, Furtado M,Levy R, Wolinsky SM. 1994 Genetic differencesbetween blood- and brain-derived viral sequencesfrom human immunodeficiency virus type 1-infected patients: evidence of conserved elements inthe V3 region of the envelope protein of brain-derived sequences. J. Virol. 68, 7467 – 7481.

30. Chao A. 1987 Estimating the population size forcapture – recapture data with unequal catchability.Biometrics 43, 783 – 791. (doi:10.2307/2531532)

31. Jaynes ET. 2003 Probability theory: the logic ofscience. Cambridge, UK: Cambridge University Press.

32. Johnston C, et al. 2017 Highly conserved intragenicHSV-2 sequences: results from next-generationsequencing of HSV-2 U L and U S regions fromgenital swabs collected from 3 continents. Virology510, 90 – 98. (doi:10.1016/j.virol.2017.06.031)

33. Dempster AP, Laird NM, Rubin DB. 1977 Maximumlikelihood from incomplete data via the EMalgorithm. J. R. Stat. Soc. Ser. B Methodol. 39,1 – 38. (doi:10.2307/2984875)

34. Hill M. 1973 Diversity and evenness: a unifyingnotation and its consequences. Ecology 54,427 – 432. (doi:10.2307/1934352)

35. Felsenstein J. 1988 Phylogenies and quantitativecharacters. Annu. Rev. Ecol. Syst. 19, 445 – 471.(doi:10.1146/annurev.es.19.110188.002305)

36. Feller W. 1968 An introduction to probability theoryand its applications, vol. 3. New York, NY: JohnWiley & Sons New York.