

Unifying the spatial epidemiology and molecular evolution of emerging epidemics

Oliver G. Pybus^{a,1,2}, Marc A. Suchard^{b,c,d,1}, Philippe Lemey^{e,1}, Flavien J. Bernardin^{f,g}, Andrew Rambaut^{h,i}, Forrest W. Crawford^{b}, Rebecca R. Gray^{a}, Nimalan Arinaminpathy^{j}, Susan L. Stramer^{k}, Michael P. Busch^{f,g}, and Eric L. Delwart^{f,g}

^a Department of Zoology, University of Oxford, Oxford OX1 3PS, United Kingdom; Departments of ^b Biomathematics, ^c Biostatistics, and ^d Human Genetics, University of California, Los Angeles, CA 90095; ^e Department of Microbiology and Immunology, Rega Institute, KU Leuven, 3000 Leuven, Belgium; ^f Blood Systems Research Institute, San Francisco, CA 94118; ^g Department of Laboratory Medicine, University of California, San Francisco, CA 94143; ^h Institute for Evolutionary Biology, Edinburgh University, Edinburgh EH9 3JT, United Kingdom; ^i Fogarty International Center, National Institutes of Health, Bethesda, MD 20892-2220; ^j Department of Ecology and Evolution, Princeton University, Princeton, NJ 08544-2016; and ^k Scientific Support Office, American Red Cross, Gaithersburg, MD 20877

Edited by David M. Hillis, University of Texas at Austin, Austin, TX, and approved July 27, 2012 (received for review April 19, 2012)

We introduce a conceptual bridge between the previously unlinked fields of phylogenetics and mathematical spatial ecology, which enables the spatial parameters of an emerging epidemic to be directly estimated from sampled pathogen genome sequences. By using phylogenetic history to correct for spatial autocorrelation, we illustrate how a fundamental spatial variable, the diffusion coefficient, can be estimated using robust nonparametric statistics, and how heterogeneity in dispersal can be readily quantified. We apply this framework to the spread of the West Nile virus across North America, an important recent instance of spatial invasion by an emerging infectious disease. We demonstrate that the dispersal of West Nile virus is greater and far more variable than previously measured, such that its dissemination was critically determined by rare, long-range movements that are unlikely to be discerned during field observations. Our results indicate that, by ignoring this heterogeneity, previous models of the epidemic have substantially overestimated its basic reproductive number. More generally, our approach demonstrates that easily obtainable genetic data can be used to measure the spatial dynamics of natural populations that are otherwise difficult or costly to quantify.

phylogeny | phylogeography | transmission

The explanation of spatial patterns of infectious disease, particularly those of emerging pathogens, has remained a central problem of epidemiology since its inception (1). The existence and nature of traveling waves of infection were first explained in theoretical models (2, 3) and later quantified in empirical studies of rabies and the Black Death (4, 5). These and other studies highlighted the fundamental problem of spatial autocorrelation: observations of infection are statistically dependent due to transmission among proximate individuals, greatly complicating the analysis of spatiotemporal incidence. Consequently, many recent analyses of spatial epidemic behavior use detailed mathematical models of spatial structure to account for autocorrelation (6). Entirely independently, in the field of evolutionary biology there has developed a separate body of work, now termed phylogeography, which focuses on reconstructing past movement events from the genome sequences of sampled organisms (7–10). However, these evolutionary tools typically generate descriptive results that, though informative, remain divorced from epidemiological theory. Crucially, neither approach can be considered complete when applied to rapidly evolving viruses, whose spatial, epidemic, and evolutionary dynamics occur on the same timescale (11), necessitating the development of methods that consider all these processes together.

Here we introduce a unique approach that integrates the disciplines of spatial epidemiology and phylogenetics. To illustrate the utility of this approach, we show how, from pathogen genomes alone, it can estimate the diffusion coefficient (D) of an epidemic as well as variation in the process of spatial spread. D is a fundamental ecological measure of the intrinsic diffusivity of infected individuals, reflecting the area that an infected host will explore per unit time (not to be confused with the area covered by the whole epidemic). It is derived from simple reaction–diffusion models of spatial spread and, together with R0, determines the wavefront velocity of an epidemic invasion (4, 5). Despite its theoretical importance, D is exceptionally difficult to estimate in nature and rarely reported; its estimation usually requires tracking the movements of a large number of infected hosts by mark/recapture or telemetry (5, 12). As well as being time-consuming, this approach will fail to adequately capture spatial dynamics when dispersal behavior is highly variable among individuals. Alternatively, D can be inferred indirectly via its theoretical relationship to an epidemic’s observed wavefront velocity (4, 13, 14); however, this requires R0 and other transmission parameters to be known without error.

We apply our approach to the invasion of North America by the West Nile virus (WNV), an important recent example of viral spatial emergence. WNV is a mosquito-borne RNA virus whose primary host is birds, and was first detected in the United States in New York City in August 1999. The American epidemic resulted from the introduction of a single highly pathogenic lineage (15) and subsequently contributed to the decline of several North American bird species (16). Transmission from mosquitoes to humans has caused >1,200 deaths in the United States (17), although human cases are not thought to contribute to onward infection. Comprehensive records of WNV incidence in the United States demonstrate an apparent westward wave of infection that reached the country’s west coast by 2004 (17), representing a mean epidemic wavefront velocity of ∼1,000 km/y during invasion. However, incidence data alone cannot determine whether the invasion resulted primarily from local, short movements of hosts and vectors, or whether east/west spread was interrupted by long-distance bird migration movements to poorly sampled tropical locations (18, 19). Despite a plethora of mathematical models, many of which consider the transmission mechanisms of WNV in great detail (13, 14, 20, 21), models of the epidemic’s spatial dynamics have been explored only theoretically (22) or at very local scales (23, 24), and values reported for the basic reproductive number, R0, of the epidemic vary widely (14, 21, 25). Most phylogenetic studies have revealed little about the epidemic’s spatial structure due to the limited diversity of the subgenomic sequences typically used (26).

Author contributions: O.G.P. designed research; O.G.P., M.A.S., P.L., F.J.B., A.R., F.W.C., R.R.G., N.A., S.L.S., M.P.B., and E.L.D. performed research; M.A.S., P.L., S.L.S., M.P.B., and E.L.D. contributed new reagents/analytic tools; O.G.P., M.A.S., P.L., F.J.B., A.R., F.W.C., R.R.G., and N.A. analyzed data; and O.G.P., M.A.S., and P.L. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Data deposition: The sequences reported in this paper have been deposited in the GenBank database, www.ncbi.nlm.nih.gov (accession nos. GQ507468–GQ507484).

1 O.G.P., M.A.S., and P.L. contributed equally to this work.

2 To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1206598109/-/DCSupplemental.

15066–15071 | PNAS | September 11, 2012 | vol. 109 | no. 37 www.pnas.org/cgi/doi/10.1073/pnas.1206598109

Linking Phylogeography and Spatial Ecology

This section explains how evolutionary analyses of viral spread can be formally linked with spatial ecology, enabling the estimation of spatial epidemiological variables from genomic data. The approach is based on the application of a simple yet powerful idea: phylogenies reconstructed from spatial epidemics are branching structures that record the correlated histories of transmission among sampled infections (Fig. 1 A and B), hence the phylogeny of an epidemic can be used to correct for spatial autocorrelation. More specifically, if the dates and locations of all phylogenetic nodes are known or posited, then each phylogeny branch represents a conditionally independent trajectory of viral movement, defined by a start location, end location, and duration (27) (Fig. 1 A and B). Independence is conditional on the date and location values proposed for each node; any estimation or measurement uncertainty in these can be readily incorporated by marginalization. Consequently, the spatial dynamics of an epidemic can be quantified using simple, nonparametric statistics of these displacements. This approach is analogous to that used by phylogenetic comparative methods, which convert correlated species trait values into independent observations amenable to statistical tests (28).

Although many statistics of spatial dynamics could be calculated using this framework, we introduce the approach by estimating the diffusion coefficient, D, without an explicit model of spatial autocorrelation. Given a set of n movement observations (phylogeny branches) whose durations and start and end locations are specified, D can be estimated using

$$D \approx \frac{1}{n} \sum_{i=1}^{n} \frac{d_i^2}{4 t_i}, \qquad [1]$$

where t_i denotes the duration in years of branch i, during which the lineage has moved d_i km away from its start position in two dimensions (5, 12) (Fig. 1 A and B). This estimator follows the classical relationship between D and mean square displacement (29) and has been previously used to estimate the diffusivity of intentionally released rabid foxes that were subsequently tracked via telemetry (5).

Estimates of the dates and locations of internal phylogenetic nodes (ancestral infections; Fig. 1) can be readily obtained using current phylogeographic and molecular clock techniques (10). In our WNV analysis we infer the longitude and latitude of internal nodes using a 2D anisotropic random walk (Materials and Methods). The marginal posterior probability densities of these locations (and of D) can be estimated using standard Bayesian Markov chain Monte Carlo (MCMC) techniques; hence our procedure fully incorporates statistical uncertainty (10). Sequences sampled from the epidemic are assumed to have a single common ancestor (no recombination or introgression). Although there must be sufficient temporal information to reliably estimate the timescale of the phylogeny, the approach does not necessitate the assumption of neutral sequence evolution.

[Figure 1: panels A–C. Panel C has a time axis in years (1998.5 to 2006.5) and labels clades A–G and the NY99 lineage.]

Fig. 1. (A and B) The link between spatial ecology and phylogenetics. Filled circles represent viral sequences whose locations and dates of sampling are known. Squares represent unsampled ancestral infections whose locations and dates are estimated. The black squares in A and B denote the epidemic’s origin in space and time, respectively. (A) Colored arrows indicate the direction and distance d_i of the movement trajectory defined by each lineage. Thin colored lines show the random walk undertaken by each lineage. (B) The phylogeny resulting from the spatial infection process in A. Colored lines in B show the duration t_i of each lineage. Diffusivity can be inferred by combining the information in A and B. Diffusivity is low for lineages with long and winding paths that do not lead far (e.g., green), and is high for lineages that quickly move large distances (e.g., purple). (C) Maximum clade credibility phylogeny of the North American WNV epidemic, estimated from whole genomes under the best-fitting dispersal model (Table 1). Posterior probabilities of branching events are indicated by red (P > 0.95) and yellow (P > 0.85) circles. Blue bars show the 95% HPD credible intervals of the estimated dates of well-supported nodes. See Fig. S1 for full annotation.

We note two key benefits of this approach. First, it will be applicable to a broad range of situations because the inference of ancestral locations is separated from the estimation of D (or other spatial variable); for each application, the most statistically appropriate model for inferring the former can be chosen. Second, the approach extends readily to more realistic, heterogeneous dispersal processes. Specifically, in this study, we use a flexible relaxed random walk that allows the rate of dispersal to vary among phylogeny branches according to some probability distribution, while constraining it to be constant along each branch (Materials and Methods). As a result, we can directly measure heterogeneity in epidemic spread (and in D) by evaluating the variability of dispersal paths among phylogeny branches.
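The estimator in Eq. 1 and the among-branch variability it enables can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the branch values are hypothetical, and in the approach described above the calculation would be repeated over posterior samples of node dates and locations rather than applied to a single fixed tree.

```python
from statistics import mean, stdev

def branch_diffusion(d_km: float, t_years: float) -> float:
    """Per-branch term of Eq. 1: d_i^2 / (4 * t_i), in km^2 per year."""
    if t_years <= 0:
        raise ValueError("branch duration must be positive")
    return d_km ** 2 / (4.0 * t_years)

def estimate_D(branches):
    """Eq. 1: mean of the per-branch terms over (d_km, t_years) pairs."""
    return mean(branch_diffusion(d, t) for d, t in branches)

def coefficient_of_variation(branches):
    """Sample SD of the per-branch terms divided by their mean."""
    terms = [branch_diffusion(d, t) for d, t in branches]
    return stdev(terms) / mean(terms)

# Hypothetical branches (displacement in km, duration in years).
# These values are illustrative only and are not data from the paper.
branches = [(50.0, 0.5), (120.0, 1.0), (1000.0, 0.25), (30.0, 0.8)]

print("D estimate (km^2/y):", estimate_D(branches))
print("CV among branches:", coefficient_of_variation(branches))
```

With these made-up values a single fast, long-range branch dominates the mean, which is the kind of heterogeneity the relaxed random walk is meant to expose.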

Results

The commonly sequenced WNV E gene contains insufficient genetic variation to resolve the phylogeography of the North American epidemic in detail (26); therefore, we chose to analyze only whole viral genomes. However, almost all genomes available at the time of study were sampled before 2005. We therefore extended the range of sampling by fully sequencing 17 previously unreported WNV isolates sampled between 2004 and 2008 (Materials and Methods), thereby obtaining enough divergence to estimate a reliable WNV molecular clock. The resulting final alignment, comprising 104 genomes with defined sampling dates and locations and isolated from a variety of host and vector species (Table S1), was analyzed using the framework introduced above.

To infer the locations of ancestral infections, we used a variety of random walk models, all of which accurately recovered the epidemic’s temporal and geographic origin (Table 1). However, the homogeneous model (no dispersal rate variation) was very strongly rejected in favor of heterogeneous models that permitted significant variability among lineages (Table 1) and provided more precise estimates of spatial parameters. The phylogeographic structure of WNV we obtain (Fig. 1C and Fig. S1) is congruent with that obtained previously using subgenomic sequences (26, 30) while providing additional resolution and dates of lineage movement. In addition to discriminating the previously defined NY99 and WN02 lineages (30), our data reveal structure in sequences sampled from western areas: the majority of Californian sequences cluster together with basal lineages from Texas [defined as clade “D” in Gray et al. (26)]. All Mexican sequences cluster together (“F”) as do some sequences from the southwest (“G”).

When projected through space and time (Fig. 2 and Movie S1), this phylogeny shows a westward dissemination of WNV lineages that matches the observed spatiotemporal incidence of WNV (17). Of particular note are a handful of viral lineages that exhibit atypically rapid and long-distance travel. Lineages that move north to south along the Atlantic coast (reaching Florida by 2000) and along the Rocky Mountains are consistent with bird migration corridors bounded by geographic barriers (18). Interestingly, once WNV lineages reach the eastern boundary of the Rocky Mountains, in 2001, further westward movement appears to stall (Movie S1), possibly reflecting the impediment to migration imposed by high elevations.

A key parameter of any spatial epidemic is its wavefront velocity. If we assume no variation in dispersal rates, then, as theory predicts (12), our genetic analysis reconstructs a constant invasion velocity of ∼1,000 km/y (before the western seaboard is reached; Fig. 3A). However, under our best-fitting heterogeneous model (Table 1), we observe an accelerating invasion: from 1999 to 2003 the origin-to-wavefront distance doubled every 0.8 y on average (Fig. 3B). This acceleration rate, estimated solely from viral genomic data, is almost identical to that independently estimated from large-scale patterns of spatiotemporal WNV incidence (31). Such acceleration is theoretically predicted to occur when there is high variance in dispersal among infected hosts; specifically, when the dispersal kernel is positively skewed and “fat-tailed” (32). This result implies a WNV wavefront with a long leading edge, explaining the discontinuous spread of infection into new areas.

We report empirical estimates of the diffusion coefficient, D, of the WNV epidemic, and we further quantify variability in its spatial spread (Fig. 3 C and D). Mean D under homogeneous diffusion is estimated to be ∼200 km²/d. However, the best-fitting heterogeneous model indicates that WNV’s spatial spread is both extraordinarily variable (coefficient of variation of D among branches ∼4–8) and, on average, highly diffusive (mean D ∼1,000 km²/d; Fig. 3D). This exceptional mean diffusivity exceeds that estimated for the historical spread of Black Death throughout Europe (4) (∼70 km²/d) and can only be explained if some phylogeny branches represent long-distance colinear displacements (e.g., a branch representing 1,000 km unidirectional travel over 25 d would correspond to D = 10,000 km²/d). The existence of a few rapid, long-range movements also explains the strong correlation between the mean and variation of D among branches (Fig. 3D). The remaining less-diffusive lineages likely represent local transmission among hosts and vectors as they move within their typical home ranges.
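Two pieces of arithmetic in these results can be checked directly. The sketch below reproduces the worked example for a single unidirectional branch (1,000 km in 25 d) using the per-branch term of Eq. 1, and shows what a constant 0.8-y doubling time implies for the origin-to-wavefront distance over 1999 to 2003. The function names are hypothetical, and the assumption of strictly exponential growth in distance is a simplification of the reported average doubling time.

```python
def unidirectional_D(distance_km: float, days: float) -> float:
    """Per-branch term of Eq. 1, d^2 / (4 t), with t in days (km^2 per day)."""
    return distance_km ** 2 / (4.0 * days)

def wavefront_distance(d0_km: float, years_elapsed: float,
                       doubling_time_y: float = 0.8) -> float:
    """Origin-to-wavefront distance assuming a constant doubling time.

    Assumes pure exponential growth in distance, which is a simplification.
    """
    return d0_km * 2.0 ** (years_elapsed / doubling_time_y)

# Worked example from the text: 1,000 km of unidirectional travel over 25 d.
print(unidirectional_D(1000.0, 25.0))

# Growth factor in wavefront distance over the 4 years from 1999 to 2003.
print(wavefront_distance(1.0, 4.0))
```

Under this simplification, four years at a 0.8-y doubling time corresponds to five doublings of the wavefront distance.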

Table 1. Estimates of genetic and spatial parameters under different spatial models

| Parameter | Homogeneous dispersal† | Heterogeneous dispersal*: Cauchy | Heterogeneous dispersal*: Gamma | Heterogeneous dispersal*: Lognormal |
|---|---|---|---|---|
| ln marginal likelihood | −643.45 | −427.24 | −399.43 | −424.69 |
| ln Bayes factor | 244.02 | 27.81 | Best-fitting model | 25.26 |
| Date of epidemic origin | 1998.6 (1997.9–1999.3) | 1998.5 (1997.7–1999.2) | 1998.5 (1997.8–1999.1) | 1998.6 (1997.9–1999.1) |
| Mean genome evolution rate (substitutions per site per year) | 0.00058 (0.00049–0.00066) | 0.00058 (0.00051–0.00064) | 0.00057 (0.00051–0.00064) | 0.00058 (0.00051–0.00064) |
| Variability of evolution rate among branches (SD) | 0.38 (0.23–0.53) | 0.33 (0.21–0.45) | 0.33 (0.21–0.45) | 0.33 (0.20–0.44) |
| Latitude of epidemic origin | 40.3 (37.1, 43.7) | 41.3 (40.4, 43.2) | 41.1 (40.4, 43.2) | 41.1 (40.3, 43.2) |
| Longitude of epidemic origin | −76.5 (−82.9, −70.5) | −74.4 (−76.2, −73.2) | −74.6 (−76.1, −73.3) | −74.2 (−76.1, −72.9) |

*Dispersal rate varies among branches; rates for each are independently drawn from the corresponding distribution.
†Dispersal rate is equal for all branches.

Discussion

We introduce a conceptual link between phylogeny and spatial ecology and demonstrate that the large-scale dynamics of biological invasions can be quantified from easily sampled and increasingly inexpensive sets of genetic data. Our framework provides a practical method for estimating the diffusion coefficient of a spatial outbreak and for measuring the variability among hosts in spatial spread. Despite being rarely reported, diffusion coefficients are practically and theoretically valuable because they quantify the intrinsic diffusivities of epidemics, analogous to the manner in which R0 summarizes intrinsic transmission potential. Our approach will be most applicable to vector-borne viruses and to viral epizootics and epiphytotics, and is also suitable for newly emergent pathogens. Once a new pathogen has been identified, retrospective screening of available archived sera could generate a set of pathogen genomes, from which the spatial dynamics of the outbreak before its date of discovery can be inferred.

Our WNV analysis shows that the epidemic cannot be adequately described by homogeneous dispersal, and instead was critically shaped by high variation in dissemination of infected hosts. The importance of such heterogeneity in determining the dynamics of spatial invasions is increasingly recognized (24, 33). Bird migrations are the most likely source of rapid, long-distance WNV movements, yet their role in the spread of WNV has been questioned (19), and our current data cannot exclude the possibility of anthropogenic transport of infected hosts or vectors. However, a key benefit of our framework is that long-range viral movements (by whatever mechanism) will leave a detectable phylogenetic footprint even when such events are too rare to be feasibly detected by direct observation. Our results demonstrate that many current mathematical models of North American WNV (13, 14) that have assumed homogeneous diffusion are unrealistic, despite their use of complex transmission structures. Such studies have typically modeled host dispersal using data on the short-term home-range movements of birds, which exhibit low mean diffusion coefficients of D < 14 km²/d. By ignoring the substantial variability in WNV dispersal we have uncovered, these models significantly overestimate the R0 of the epidemic (e.g., R0 > 25) (14, 21). We do not need to assume an exceptionally transmissible pathogen in a weakly diffusive host to explain the observed wavefront velocity of ∼1,000 km/y. Instead, the invasion behavior of WNV is best explained by a pathogen with a lower mean R0 that transmits among hosts whose dispersal is very variable.

Despite capturing the broad-scale spatial dynamics of the WNV invasion of North America, our spatial sampling is not comprehensive and precludes more detailed inferences, for example, whether elliptical migration and central American/Caribbean bird populations were important to WNV dissemination (18, 19). However, our main conclusions are robust to the absence of data from the tropics, because if such movements were common, then estimates of D and its variability would be even greater than those presented here. Migratory movements might explain viral reintroduction into previously colonized locations, e.g., lineages moving northeastward in 2002. More specific hypotheses could be addressed within our framework as further data (including genomes from the tropics) become available. Higher-resolution sampling would also allow the application of more complex spatial processes (e.g., Lévy flights or advection-diffusion models).

[Figure 2: eight map panels, (a) 1999.5, (b) 2000.5, (c) 2001.5, (d) 2002.5, (e) 2003.5, (f) 2004.5, (g) 2005.5, (h) 2006.5.]

Fig. 2. The reconstructed spatiotemporal diffusion of WNV in North America, shown at annual intervals from mid-1999 onwards (A–H). White circles indicate isolate sampling locations. Black lines show a spatial projection of a representative phylogeny, with each node being mapped to its known (external node) or estimated (internal node) location. In each panel colored clouds represent statistical uncertainty in the estimated locations of WNV lineages (95% HPD regions) (42).

The genomes of rapidly evolving pathogens are already used to estimate the date of origin and R0 of emerging epidemics (34), most recently for pandemic H1N1/09 influenza (35). The methods introduced here could similarly enable the rate, direction, and mode of spatial spread of future emergent viruses to be inferred from genetic data. Such methods also open the door for the development of future approaches that could potentially jointly estimate R0 and D from sampled pathogen genomes (9); however, any such approach will require a much better understanding of the effects on lineage coalescence of nonequilibrium spatial dynamics. Further, the connection between phylogeny and spatial autocorrelation exploited here could be applied to other problems in spatial ecology, such as the control of invasive species, provided that suitably diverse genetic markers for the species in question are available.
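The trade-off between diffusivity and transmissibility discussed above can be illustrated with the classical Fisher-type reaction–diffusion wave speed, c = 2√(rD), where r is the epidemic growth rate. This specific formula is an assumption introduced here for illustration; the text states only that D and R0 together determine wavefront velocity, and the mapping from r to R0 depends on transmission details not modeled in this sketch. The function names are hypothetical.

```python
import math

def fisher_wave_speed(r_per_day: float, D_km2_per_day: float) -> float:
    """Asymptotic wave speed c = 2 * sqrt(r * D), in km per day (assumed model)."""
    return 2.0 * math.sqrt(r_per_day * D_km2_per_day)

def implied_growth_rate(c_km_per_day: float, D_km2_per_day: float) -> float:
    """Growth rate r needed to produce speed c for a given D: r = c^2 / (4 D)."""
    return c_km_per_day ** 2 / (4.0 * D_km2_per_day)

# Observed wavefront velocity of ~1,000 km/y, converted to km per day.
c_km_per_day = 1000.0 / 365.0

# D values quoted in the text: home-range movements (< 14 km^2/d) versus the
# mean under the best-fitting heterogeneous model (~1,000 km^2/d).
r_home_range = implied_growth_rate(c_km_per_day, 14.0)
r_heterogeneous = implied_growth_rate(c_km_per_day, 1000.0)

print("r needed with D = 14 km^2/d:", r_home_range)
print("r needed with D = 1000 km^2/d:", r_heterogeneous)
print("ratio:", r_home_range / r_heterogeneous)
```

Under this simplified model, a low assumed D forces a much larger growth rate to match the same wavefront velocity, which is qualitatively consistent with the argument that homogeneous, low-diffusivity models overestimate R0.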

Materials and Methods

Human Samples. Only four WNV complete genomes available at the time of study were sampled after 2004. To characterize more recent isolates (and thus estimate a reliable molecular clock) we obtained 17 infected human plasma samples detected during blood donor screening at blood centers across the United States (36). The isolates reported here were sampled during 2003–2007 (Table S1). This study was approved by the University of California San Francisco Committee on Human Research and informed consent was obtained.

RT-PCR and Genome Sequencing. WNV genomes were amplified and sequenced in four fragments. Briefly, total RNA was extracted from plasma using the QIAamp Viral RNA Mini Kit (Qiagen) and eluted in 50 μL of elution buffer in the presence of 40 U Protector RNase inhibitor (Roche). First-strand cDNA synthesis was initiated using 12.5 μL of RNA and 0.5 μg of primer R1, R2, R3, or R4a (37) and 400 U of murine leukemia virus reverse transcriptase (Promega). For amplification of each portion of the genome, a nested PCR was performed using 5 μL each of cDNA and TaKaRa Ex Taq DNA polymerase (TaKaRa Bio). Primer sequences and PCR cycling conditions were identical to those in Herring et al. (37). A single 2.8- to 3.2-kb band was detected on 0.8% agarose gel. PCR products were purified with QIAquick PCR (Qiagen) and sequenced using previously reported primers (37) and the BigDye Kit on an ABI3700 capillary sequencer. After manual editing, sequences were assembled using SeqMan (GenBank accession nos. GQ507468–GQ507484).

Sequence collation and annotation. All available North American WNV near-complete genome sequences were obtained from GenBank, one of which (DQ211652) was a duplicate of AF202541 and removed; these were added to our genomes, resulting in a final data set comprising 104 genomes, 11,029 nt long. Sequences were codon aligned by hand. Host species, sampling date, and location of each sequence were obtained from the literature or provided by previous authors. ZIP code locations were converted into latitude and longitude coordinates using ZIPList5. For 27 sequences, only the US or Mexican state was known; the latitude and longitude of these was defined as the geographic centroid of the state. If only the year of sampling was known, then the sampling date was defined as the midpoint of the year (Table S1).

Model selection analyses. Model selection analyses were first undertaken to select a statistically appropriate evolutionary model (Table S2). Eight model combinations were explored, representing all permutations of (i) the Hasegawa-Kishino-Yano (HKY) vs. general time-reversible (GTR) substitution model, (ii) incorporation vs. omission of a Γ distribution of among-site rate heterogeneity, and (iii) strict molecular clock vs. an uncorrelated lognormal relaxed molecular clock (38). For each model, parameters were estimated using the Bayesian MCMC approach implemented in BEAST alongside a Bayesian skyline coalescent model (39). Other coalescent models were investigated but performed poorly. MCMC chains were run for 50 million states, sampled every 5,000 states. MCMC convergence was evaluated using Tracer 1.5 (http://beast.bio.ed.ac.uk). The performance of each combination was compared using Bayes factors (40). Estimated evolutionary rates and divergence times were almost identical among models. The best-fitting model was GTR + Γ with a lognormal relaxed molecular clock (Table S2), and was thus used in subsequent analyses.

Relaxed random-walk models. We extended the phylogeographic approach in BEAST 1.7 (10) and used the BEAGLE library to accelerate computation (41). Movement in two dimensions was modeled as a scaled-mixture generalization of a Brownian motion process (SI Text). This model is motivated by formal Lévy flight models while not strictly enforcing dispersal kernels with power-law tails. Realized dispersal path lengths were corrected for the Earth's curvature using great circle distances. As in Lemey et al. (10), diffusion rate variation was implemented by rescaling the diffusion process along each phylogeny branch, with the scalars for each being drawn from

[Figure 3, panels A–D. A and B: furthest extent of epidemic wavefront (km from epidemic origin; 0–7,000) against time (1999.5–2006.5), under homogeneous diffusion (A) and heterogeneous diffusion (B). C and D: diffusion coefficient variation among lineages (0–14) against mean diffusion coefficient (km²/day; axis ticks at 100, 1,000, and 10,000), under homogeneous diffusion (C) and heterogeneous diffusion (D).]

Fig. 3. Characteristics of the North American WNV invasion estimated from viral genomes. Plots A and C were estimated under a homogeneous dispersal model; plots B and D under the best-fitting heterogeneous model (Table 1). Plots A and B show the reconstructed epidemic wavefront. For each point in time, the black line is the estimated distance from the epidemic wavefront to its estimated origin: the gradient of this line is thus the invasion velocity. Gray lines indicate the 95% credible regions of the estimated wavefront position. Plots C and D show kernel density estimates of the diffusion coefficient (D) parameters. The horizontal axis shows the estimated mean D among lineages; the vertical axis shows the coefficient of variation of D among lineages. The three contours show, in shades of decreasing darkness, the 50%, 75%, and 95% HPD regions via kernel density estimation.


a specified distribution: these scaled mixtures generate a wide range of relaxed random walks. We evaluated different probability distributions (Cauchy, gamma, lognormal) to accommodate among-branch diffusion rate variation and compared their fit to a homogeneous process. To aid computation, we developed unique analytical solutions to the marginalization of unobserved multivariate traits at internal nodes under relaxed random-walk models (SI Text). Methods are implemented in BEAST 1.7 (source code available from http://beast-mcmc.googlecode.com).

Postprocessing and visualization. MCMC chains were run for 250 million states, sampled every 50,000 states. The posterior distribution of phylogenies was summarized using maximum clade credibility (MCC) trees in TreeAnnotator. MCC trees and 95% highest posterior density (HPD) contours were visualized using SPREAD (42). Various statistics (e.g., the wavefront velocity) were extracted from the posterior distribution by sampling each rooted phylogeny at multiple time points and summarizing the resulting distributions.
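The great-circle correction and the wavefront summary described in the Methods can be illustrated with a minimal Python sketch. This is not the BEAST or SPREAD implementation: the function names, the mean Earth radius of 6,371 km, and the toy inputs are assumptions made for illustration. The coordinates are borrowed from Table S1 only as stand-ins for estimated lineage locations. For a single sampled phylogeny, the wavefront distance at a given time is taken here to be the largest great-circle distance between the epidemic origin and any lineage location at that time.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance (km) between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def wavefront_km(origin, lineage_locations):
    """Largest distance from the origin to any lineage location at one time point."""
    return max(great_circle_km(origin[0], origin[1], lat, lon)
               for lat, lon in lineage_locations)


# Toy input: stand-in lineage locations at two time points (not estimates from the paper).
origin = (40.7656, -73.9624)
locations_by_time = {
    2000.5: [(40.9453, -74.0713), (39.4250, -76.7779)],
    2002.5: [(29.9366, -91.8695), (43.7389, -84.6273)],
}
front = {t: wavefront_km(origin, locs) for t, locs in locations_by_time.items()}
years = sorted(front)
# Mean invasion velocity (km/year) between the two time points: slope of the wavefront curve.
velocity = (front[years[1]] - front[years[0]]) / (years[1] - years[0])
```

In a full analysis the same summary would be repeated over every phylogeny in the posterior sample, so that the spread of the resulting wavefront curves gives the credible region.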

ACKNOWLEDGMENTS. We thank Eddie Holmes, Mike Bonsall, Sunetra Gupta, John Drake, Robert May, and Paul Harvey for discussion. Support for this work was provided by the Royal Society (O.G.P. and A.R.); National Institutes of Health Grant R01 GM086887 (to M.A.S. and F.W.C.); Centers for Disease Control/National Center for Infectious Diseases Grant R01-CI-000214 (to F.J.B., M.P.B., and E.L.D.); UK Medical Research Council (R.R.G.); European Research Council Seventh Framework Programme Grant 260864 (to P.L.); and the Institute for Mathematical Sciences, National University of Singapore (M.A.S. and P.L.).

1. Snow J (1854) The cholera near Golden Square and at Deptford. Med Times Gazette 9:321–322.

2. Skellam JG (1951) Random dispersal in theoretical populations. Biometrika 38:196–218.

3. Mollison D (1977) Spatial contact models for ecological and epidemic spread. J Roy Stat Soc B 39:283–326.

4. Noble JV (1974) Geographic and temporal development of plagues. Nature 250:726–729.

5. Murray JD, Stanley EA, Brown DL (1986) On the spatial spread of rabies among foxes. Proc R Soc Lond B Biol Sci 229:111–150.

6. Grenfell BT, Bjørnstad ON, Kappey J (2001) Travelling waves and spatial hierarchies in measles epidemics. Nature 414:716–723.

7. Fitch WM (1996) The variety of human virus evolution. Mol Phylogenet Evol 5:247–258.

8. Bourhy H, et al. (1999) Ecology and evolution of rabies virus in Europe. J Gen Virol 80:2545–2557.

9. Biek R, Henderson JC, Waller LA, Rupprecht CE, Real LA (2007) A high-resolution genetic signature of demographic and spatial expansion in epizootic rabies virus. Proc Natl Acad Sci USA 104:7993–7998.

10. Lemey P, Rambaut A, Welch JJ, Suchard MA (2010) Phylogeography takes a relaxed random walk in continuous space and time. Mol Biol Evol 27:1877–1885.

11. Grenfell BT, et al. (2004) Unifying the epidemiological and evolutionary dynamics of pathogens. Science 303:327–332.

12. Shigesada N, Kawasaki K (1997) Biological Invasions: Theory and Practice (Oxford Univ Press, London).

13. Lewis M, Rencławowicz J, van den Driessche P (2006) Traveling waves and spread rates for a West Nile virus model. Bull Math Biol 68:3–23.

14. Maidana NA, Yang HM (2009) Spatial spreading of West Nile Virus described by traveling waves. J Theor Biol 258:403–417.

15. Lanciotti RS, et al. (1999) Origin of the West Nile virus responsible for an outbreak of encephalitis in the northeastern United States. Science 286:2333–2337.

16. LaDeau SL, Kilpatrick AM, Marra PP (2007) West Nile virus emergence and large-scale declines of North American bird populations. Nature 447:710–713.

17. Centers for Disease Control (2011) Statistics, Surveillance and Control Archive. Available at http://www.cdc.gov/ncidod/dvbid/westnile.

18. Reed KD, Meece JK, Henkel JS, Shukla SK (2003) Birds, migration and emerging zoonoses: West Nile Virus, Lyme disease, influenza A and enteropathogens. Clin Med Res 1:5–12.

19. Rappole JH, et al. (2006) Modeling movement of West Nile virus in the Western hemisphere. Vector Borne Zoonotic Dis 6:128–139.

20. Bowman C, Gumel AB, van den Driessche P, Wu J, Zhu H (2005) A mathematical model for assessing control strategies against West Nile virus. Bull Math Biol 67:1107–1133.

21. Wonham MJ, Lewis MA, Rencławowicz J, van den Driessche P (2006) Transmission assumptions generate conflicting predictions in host-vector disease models: A case study in West Nile virus. Ecol Lett 9:706–725.

22. Liu R, Shuai J, Wu J, Zhu H (2006) Modeling spatial spread of West Nile virus and impact of directional dispersal of birds. Math Biosci Eng 3:145–160.

23. Yiannakoulias NW, Schopflocher DP, Svenson LW (2006) Modelling geographic variations in West Nile virus. Can J Public Health 97:374–378.

24. Magori K, Bajwa WI, Bowden S, Drake JM (2011) Decelerating spread of West Nile virus by percolation in a heterogeneous urban landscape. PLOS Comput Biol 7:e1002104.

25. Cruz-Pacheco G, Esteva L, Montaño-Hirose JA, Vargas C (2005) Modelling the dynamics of West Nile virus. Bull Math Biol 67:1157–1172.

26. Gray RR, Veras NM, Santos LA, Salemi M (2010) Evolutionary characterization of the West Nile Virus complete genome. Mol Phylogenet Evol 56:195–200.

27. Felsenstein J (1985) Phylogenies and the comparative method. Am Nat 125:1–15.

28. Harvey PH, Pagel MD (1991) The Comparative Method in Evolutionary Biology (Oxford Univ Press, London).

29. Einstein A (2003) Investigations on the Theory of the Brownian Movement, ed Furth R (Dover, New York).

30. Davis CT, et al. (2005) Phylogenetic analysis of North American West Nile virus isolates, 2001–2004: Evidence for the emergence of a dominant genotype. Virology 342:252–265.

31. Mundt CC, Sackett KE, Wallace LD, Cowger C, Dudley JP (2009) Long-distance dispersal and accelerating waves of disease: Empirical relationships. Am Nat 173:456–466.

32. Kot M, Lewis MA, Van den Driessche P (1996) Dispersal data and the spread of invading organisms. Ecology 77:2027–2042.

33. Melbourne BA, Hastings A (2009) Highly variable spread rates in replicated biological invasions: Fundamental limits to predictability. Science 325:1536–1539.

34. Pybus OG, et al. (2001) The epidemic behavior of the hepatitis C virus. Science 292:2323–2325.

35. Fraser C, et al.; WHO Rapid Pandemic Assessment Collaboration (2009) Pandemic potential of a strain of influenza A (H1N1): Early findings. Science 324:1557–1561.

36. Busch MP, et al. (2005) Screening the blood supply for West Nile virus RNA by nucleic acid amplification testing. N Engl J Med 353:460–467.

37. Herring BL, et al. (2007) Phylogenetic analysis of WNV in North American blood donors during the 2003–2004 epidemic seasons. Virology 363:220–228.

38. Drummond AJ, Ho SY, Phillips MJ, Rambaut A (2006) Relaxed phylogenetics and dating with confidence. PLoS Biol 4:e88.

39. Drummond AJ, Suchard MA, Xie D, Rambaut A (2012) Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol 29:1969–1973.

40. Suchard MA, Weiss RE, Sinsheimer JS (2001) Bayesian selection of continuous-time Markov chain evolutionary models. Mol Biol Evol 18:1001–1013.

41. Suchard MA, Rambaut A (2009) Many-core algorithms for statistical phylogenetics. Bioinformatics 25:1370–1376.

42. Bielejec F, Rambaut A, Suchard MA, Lemey P (2011) SPREAD: Spatial phylogenetic reconstruction of evolutionary dynamics. Bioinformatics 27:2910–2912.


Supporting Information

Pybus et al. 10.1073/pnas.1206598109

SI Text

Spatial Phylogenetic Diffusion in Continuous Space. To estimate the locations of ancestral phylogenetic nodes in continuous space, we consider a Brownian motion process (1, 2) along the branches of an unknown yet estimable bifurcating phylogeny $\tau$. This random walk process produces bivariate trait observations (latitude and longitude) at the $N$ tree tips by imposing bivariate normally distributed displacements along each branch of $\tau$. The bivariate normal distribution is centered on zero and its variance is proportional to time (the duration of the branch). Displacements, each in a random direction, cumulate along successive branches as the process moves from the phylogeny root to its tips. Displacements are characterized by an infinitesimal precision matrix $P$ that is invariant with respect to changes in time units. To relax the assumption of constant-variance displacements throughout the phylogeny (i.e., the homogeneous dispersal model in Table 1), we follow a recently proposed Bayesian procedure that rescales $P$ along each branch of the phylogeny using a branch-specific scaling factor $\phi_b$ that is independently drawn from an underlying distribution. This procedure allows the diffusion process to vary from branch to branch in $\tau$ and is therefore referred to as a relaxed random walk (RRW) model. Lemey et al. (3) propose different distributional choices on $\phi_b$, including a one-parameter gamma distribution:

\[ \phi_b \overset{\text{iid}}{\sim} \mathrm{Gamma}(\nu/2,\ \nu/2), \tag{S1} \]

where $\nu$ counts degrees of freedom, which replaces the normally distributed displacements occurring along each branch with Student's t independent increments. Other distributions include a more restrictive Cauchy distribution (fixing $\nu = 1$ in the one-parameter gamma distribution) and a normal distribution on the log-scale:

\[ \log \phi_b \overset{\text{iid}}{\sim} \mathrm{Normal}(1,\ \sigma^2), \tag{S2} \]

with a fixed mode and estimable variance $\sigma^2$. Each of these RRW specifications generalizes the phylogenetically guided Brownian process into a different scale-mixture of normally distributed variates (4, 5).
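To make the scale-mixture idea in Eq. S1 concrete, the following Python sketch draws branch-specific scalars and the resulting displacements. It is an illustration under stated assumptions rather than the BEAST implementation: the function names are hypothetical, the process is assumed isotropic (P proportional to the identity, with a hypothetical `unit_variance` parameter), and the gamma distribution in Eq. S1 is read as shape ν/2 and rate ν/2, which gives the scalars a mean of 1.

```python
import math
import random


def sample_branch_scalar(nu, rng):
    """Draw phi_b ~ Gamma(shape=nu/2, rate=nu/2), as in Eq. S1 (mean 1)."""
    # random.gammavariate takes (shape, scale), and scale = 1 / rate = 2 / nu.
    return rng.gammavariate(nu / 2.0, 2.0 / nu)


def sample_displacement(branch_time, nu, rng, unit_variance=1.0):
    """Draw one 2D displacement along a branch under the relaxed random walk.

    Assumes an isotropic process: the branch precision is phi_b times the
    baseline precision, so each coordinate has variance
    unit_variance * branch_time / phi_b.
    """
    phi_b = sample_branch_scalar(nu, rng)
    sd = math.sqrt(unit_variance * branch_time / phi_b)
    return (rng.gauss(0.0, sd), rng.gauss(0.0, sd)), phi_b


rng = random.Random(1)
# With nu = 4 the scalars average close to 1, so the mean diffusion rate is preserved.
scalars = [sample_branch_scalar(4.0, rng) for _ in range(20000)]
mean_scalar = sum(scalars) / len(scalars)
# nu = 1 corresponds to the Cauchy case: occasional very small phi_b gives very long steps.
step, phi_b = sample_displacement(branch_time=0.5, nu=1.0, rng=rng)
```

Small values of ν concentrate more probability on very small scalars, which is how the model accommodates rare long-range movements without a separate jump process.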

Multivariate Continuous Trait Peeling. Previous approaches to the inference of the above RRW models in 2D space have exploited data augmentation of the unobserved locations of ancestral nodes in the phylogeny (3). Here, this algorithm performed poorly under combinations of large sample sizes (>100 taxa) and high diffusion heterogeneity, resulting in poor Markov chain Monte Carlo mixing. Consequently, we developed and implemented a substantially more efficient procedure that analytically integrates out internal and root node states. Previous solutions to this problem had been restricted to internal nodes under one-dimensional homogeneous Brownian models only (6, 7).

Consider an $N$-tipped bifurcating phylogenetic tree $\tau = (V, t)$ that is a graph with a set of vertices (nodes) $V$ and edge weights $t$. Each external node $V_i \in V$ for $i = 1, \ldots, N$ is of degree 1, having one parent node $V_{\mathrm{pa}(i)}$ from within the internal or root nodes. Each internal node $V_i$ for $i = N + 1, \ldots, 2N - 2$ is of degree 3 and the root node $V_{2N-1}$ is of degree 2. Connecting $V_i$ to its parent $V_{\mathrm{pa}(i)}$ lies an edge with weight $t_i$, and $t = (t_1, \ldots, t_{2N-2})$.

Let $(Y_1, \ldots, Y_{2N-1})$ record a $d$-dimensional continuous trait for each of the corresponding nodes in $\tau$. In the usual reconstruction problem, we observe $(Y_1, \ldots, Y_N)$ at the external nodes, but do not observe $(Y_{N+1}, \ldots, Y_{2N-1})$. Given the RRW that Lemey et al. (3) develop, one straightforwardly writes

\[ Y_i \sim \text{Multivariate-Normal}\left(Y_{\mathrm{pa}(i)},\ t_i P\right) \tag{S3} \]

for $i = 1, \ldots, 2N - 2$, where $P$ is an unknown $d \times d$ precision matrix for the unscaled Brownian motion. The RRW model further posits a conjugate root trait prior

\[ Y_{2N-1} \sim \text{Multivariate-Normal}\left(\mu,\ \phi P\right), \tag{S4} \]

which becomes relatively uninformative for $\phi$ small, e.g., $\phi = 0.001$. For notational convenience, we augment the edge weights with $t_{2N-1} = 1/\phi$. Combining [S3] and [S4] yields the joint distribution over all traits given $t$, $\phi$, and $P$ (left off of the remaining equations for clarity),

\[ p(Y_1, \ldots, Y_{2N-1}) = \left( \prod_{i=1}^{2N-2} p\left(Y_i \mid Y_{\mathrm{pa}(i)}\right) \right) p(Y_{2N-1}). \tag{S5} \]

We wish to compute the density of the observed traits only by integration over all possible realizations of the unobserved traits,

\[ p(Y_1, \ldots, Y_N) = \int\!\!\int \cdots \int p(Y_1, \ldots, Y_{2N-1}) \, dY_{N+1} \, dY_{N+2} \cdots dY_{2N-1}. \tag{S6} \]

Fortunately, Eq. S5 suggests a dynamic-programming approach to achieving an analytic solution. To see this, consider briefly a three-tipped tree in which $V_1$ and $V_2$ connect to $V_4$, and $V_3$ and $V_4$ connect to $V_5$. Here, we decompose

\[
\begin{aligned}
p(Y_1, Y_2, Y_3) &= \int \left[ \int p(Y_1 \mid Y_4)\, p(Y_2 \mid Y_4)\, p(Y_4 \mid Y_5) \, dY_4 \right] \times p(Y_3 \mid Y_5)\, p(Y_5) \, dY_5 \\
&= \int p(Y_1, Y_2 \mid Y_5)\, p(Y_3 \mid Y_5)\, p(Y_5) \, dY_5,
\end{aligned}
\tag{S7}
\]

such that integration becomes a postorder traversal and stereotyped operation on triples (1, 2, 4) and (4, 3, 5) of nodes along $\tau$. One might argue the rest is simply bookkeeping, but it stands as bookkeeping worth reporting to ease future implementation.

Let $\{Y_i\}$ represent the set of the observed trait values descendent from and including $V_i$, then the stereotyped operation we must forge is computing

\[ p\left(\{Y_k\} \mid Y_{\mathrm{pa}(k)}\right) = \int p\left(\{Y_i\} \mid Y_k\right) p\left(\{Y_j\} \mid Y_k\right) p\left(Y_k \mid Y_{\mathrm{pa}(k)}\right) dY_k, \tag{S8} \]

where $\mathrm{pa}(i) = \mathrm{pa}(j) = k$. Fortunately, all four of the functions $p(\cdot)$ in [S8] are proportional to multivariate normals, so it suffices in our traversal to keep track of partial mean vectors $m_k$; partial precision scalars $p_k$; and remainder terms $\rho_k$ for $k = N + 1, \ldots, 2N - 1$. By construction, let $m_i = Y_i$, $p_i = 1/t_i$, and $\rho_i = 1$ for $i = 1, \ldots, N$. Rewriting the product of descendent functions in [S8] simplifies

Pybus et al. www.pnas.org/cgi/content/short/1206598109 1 of 6


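\[
\begin{aligned}
p\left(\{Y_i\} \mid Y_k\right) p\left(\{Y_j\} \mid Y_k\right)
&= \left(\frac{p_i}{2\pi}\right)^{d/2} |P|^{1/2} \exp\left[ -\frac{p_i}{2} (m_i - Y_k)' P (m_i - Y_k) \right] \\
&\quad \times \left(\frac{p_j}{2\pi}\right)^{d/2} |P|^{1/2} \exp\left[ -\frac{p_j}{2} (m_j - Y_k)' P (m_j - Y_k) \right] \\
&= \rho_k \times \mathrm{MVN}\left(Y_k;\ m_k,\ (p_i + p_j) P\right),
\end{aligned}
\tag{S9}
\]

where $\mathrm{MVN}(\cdot\,; \kappa, \Lambda)$ signifies a normalized multivariate density function centered around $\kappa$ with precision $\Lambda$,

\[ m_k = \frac{p_i m_i + p_j m_j}{p_i + p_j}, \tag{S10} \]

and we introduce remainders

\[
\rho_k = \left( \frac{p_i p_j}{2\pi (p_i + p_j)} \right)^{d/2} |P|^{1/2} \,
\frac{\exp\left[ -\frac{p_i}{2} m_i' P m_i - \frac{p_j}{2} m_j' P m_j \right]}
{\exp\left[ -\frac{p_i + p_j}{2} m_k' P m_k \right]}.
\tag{S11}
\]

To complete the integration of [S9] with respect to $p(Y_k \mid Y_{\mathrm{pa}(k)})$, we exploit our normalization and discern that the Brownian process connecting $Y_k$ and $Y_{\mathrm{pa}(k)}$ inflates the variance, but does not influence the mean, such that

\[ \frac{1}{p_k} = t_k + \frac{1}{p_i + p_j}. \tag{S12} \]

Computing $m_k$, $p_k$, and $\rho_k$ for $k = N + 1, \ldots, 2N - 1$ in postorder traversal is straightforward to implement and furnishes $p(Y_1, \ldots, Y_N \mid Y_{2N-1})$.

The final integration of $Y_{2N-1}$ with respect to its prior proceeds in a similar manner, because one can interpret the conjugate multivariate-normal prior as an additional increment of the Brownian process. However, instead of ending the increment with length $1/\phi$ at an unknown quantity, the process ends at a constant point $\mu$. Exploiting this insight yields

\[ p(Y_1, \ldots, Y_N) = \left( \prod_{k=N+1}^{2N-1} \rho_k \right) \mathrm{MVN}\left(m_{2N-1};\ \mu,\ p_{2N-1} P\right). \tag{S13} \]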
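The recursion in Eqs. S9–S13 can be sketched in a few lines of Python. This is an illustrative reading of the equations, not the BEAST source code: the nested-dictionary tree layout, the function names, and the choice to pass log|P| as an argument are all assumptions, and the branch precision is taken to be (1/t_i)P, following the convention of Eq. S9. Computations are done on the log scale so that the product of remainders in Eq. S13 becomes a sum.

```python
import math


def quad(x, P):
    """Quadratic form x' P x for a vector x and a square matrix P (nested lists)."""
    d = len(x)
    return sum(x[a] * P[a][b] * x[b] for a in range(d) for b in range(d))


def log_mvn(x, mean, scale, P, log_det_P):
    """Log density at x of a normal with the given mean and precision scale * P."""
    d = len(x)
    diff = [xa - ma for xa, ma in zip(x, mean)]
    return (0.5 * d * math.log(scale / (2.0 * math.pi))
            + 0.5 * log_det_P - 0.5 * scale * quad(diff, P))


def peel(node, P, log_det_P):
    """Postorder pass returning (m_k, p_k, summed log remainders) for a node.

    A tip is {"y": trait, "t": branch length}; an internal node is
    {"left": ..., "right": ..., "t": branch length above the node}.
    """
    if "y" in node:  # tip: m_i = Y_i, p_i = 1 / t_i, log rho_i = 0
        return node["y"], 1.0 / node["t"], 0.0
    m_i, p_i, lr_i = peel(node["left"], P, log_det_P)
    m_j, p_j, lr_j = peel(node["right"], P, log_det_P)
    d, s = len(m_i), p_i + p_j
    m_k = [(p_i * a + p_j * b) / s for a, b in zip(m_i, m_j)]          # Eq. S10
    log_rho = (0.5 * d * math.log(p_i * p_j / (2.0 * math.pi * s))     # Eq. S11
               + 0.5 * log_det_P
               - 0.5 * p_i * quad(m_i, P) - 0.5 * p_j * quad(m_j, P)
               + 0.5 * s * quad(m_k, P))
    p_k = 1.0 / (node["t"] + 1.0 / s)                                  # Eq. S12
    return m_k, p_k, lr_i + lr_j + log_rho


def log_likelihood(root, mu, phi, P, log_det_P):
    """Log of Eq. S13, treating the root prior as a branch of length 1 / phi."""
    m, p, log_rho_sum = peel(dict(root, t=1.0 / phi), P, log_det_P)
    return log_rho_sum + log_mvn(m, mu, p, P, log_det_P)


# Toy two-tip example with bivariate traits (values are arbitrary).
P = [[2.0, 0.5], [0.5, 1.0]]
log_det_P = math.log(2.0 * 1.0 - 0.5 * 0.5)
tree = {"left": {"y": [1.0, 2.0], "t": 0.5}, "right": {"y": [0.0, -1.0], "t": 1.5}}
ll = log_likelihood(tree, mu=[0.5, 0.5], phi=0.1, P=P, log_det_P=log_det_P)
```

Each node is visited once, so the cost grows linearly with the number of tips, in contrast to sampling the internal node locations by data augmentation.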

1. Wiener N (1958) Nonlinear Problems in Random Theory (MIT Press, Cambridge, MA; Wiley, New York).

2. Edwards AWT, Cavalli-Sforza LL (1964) Phenetic and Phylogenetic Classification, 6, eds Heywood VH, McNeil J (Systematics Assoc Publishers, London), pp 67–76.

3. Lemey P, Rambaut A, Welch JJ, Suchard MA (2010) Phylogeography takes a relaxed random walk in continuous space and time. Mol Biol Evol 27:1877–1885.

4. Andrews DF, Mallows CL (1974) Scale mixtures of normal distributions. J R Stat Soc B 36:99–102.

5. West M (1984) Outlier models and prior distributions in Bayesian linear regression. J R Stat Soc B 46:431–439.

6. Blum MGB, Damerval C, Manel S, François O (2004) Brownian models and coalescent structures. Theor Popul Biol 65:249–261.

7. Novembre J, Slatkin M (2009) Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles. Evolution 63:2914–2925.


[Figure S1: time-scaled phylogeny with a horizontal time axis from 1998.5 to 2007.5. Tips are labeled with the accession number or isolate name of each genome and its state code (listed in Table S1). Clades A–G, the NY99 lineage, and the WN02 lineage are marked.]

Fig. S1. This phylogeny is an annotated version of that presented in Fig. 1C. Accession numbers or isolate names of each West Nile virus genome are noted at the tips (see Table S1 for further details), followed by a two-letter code representing the US state from which the isolate was sampled. Taxa labels are colored according to the species from which the genome was isolated (black, human; blue, bird; magenta, mosquito; green, horse). The tree is the maximum clade credibility phylogeny estimated under the best-fitting diffusion model. The posterior probability of internal nodes is indicated by red (P > 0.95) and yellow (P > 0.85) circles. Blue bars show the 95% credible regions of the estimated dates of ancestral nodes. Previously defined taxonomic groups are indicated, specifically the NY99 and WN02 lineages, and the clades A–G described in Gray et al. (1).

1. Gray RR, Veras NM, Santos LA, Salemi M (2010) Evolutionary characterization of the West Nile virus complete genome. Mol Phylogenet Evol 56:195–200.


Table S1. Sample information for all available West Nile virus complete genome sequences

GenBank accession no./isolate name* | Host species | Date of sampling | State code (M, Mexico) | ZIP code | Latitude | Longitude | New isolate?
AF260967 | Ec | 1999.5 | NY | 10065 | 40.7656 | −73.9624 | No
FJ151394 | Cb | 1999.5 | NY | 10101 | 40.7661 | −73.9874 | No
AF202541 | Hs | 1999.66 | NY | n/a | 42.7561 | −75.8166 | No
AF196835 | Pc | 1999.71 | NY | 10065 | 40.7656 | −73.9624 | No
AF404753 | Cb | 2000.5 | MD | 21117 | 39.4250 | −76.7779 | No
AF404754 | Cp | 2000.5 | NJ | 07652 | 40.9453 | −74.0713 | No
AF404756 | Cb | 2000.5 | NY | 10956 | 41.1560 | −73.9936 | No
AF404755 | Bu | 2000.5 | NY | 13114 | 43.4604 | −76.2434 | No
DQ080072 | Dc | 2001.5 | FL | 33480 | 26.6941 | −80.0379 | No
DQ164194 | Cb | 2001.5 | NY | 11901 | 40.9530 | −72.6420 | No
AF533540 | Hs | 2001.71 | NY | 11566 | 40.6667 | −73.5562 | No
DQ080071 | Ec | 2002.5 | FL | 33514 | 28.6964 | −82.0038 | No
DQ164198 | Hs | 2002.5 | TX | n/a | 31.1697 | −100.0787 | No
DQ164205 | Hs | 2002.5 | TX | n/a | 31.1697 | −100.0787 | No
DQ164196 | Hs | 2002.5 | GA | n/a | 32.6808 | −83.2519 | No
DQ164197 | Hs | 2002.5 | GA | n/a | 32.6808 | −83.2519 | No
DQ176637 | Qq | 2002.5 | TX | 79261 | 34.4489 | −100.6899 | No
DQ164200 | Hs | 2002.5 | IN | n/a | 39.7709 | −86.4445 | No
DQ164202 | Hs | 2002.5 | OH | n/a | 40.1937 | −82.6660 | No
DQ164195 | Cx | 2002.5 | NY | 11566 | 40.6667 | −73.5562 | No
DQ164186 | Cb | 2002.5 | NY | 11365 | 40.7391 | −73.7931 | No
DQ164187 | Cb | 2002.5 | NY | 13901 | 42.1934 | −75.8849 | No
AY646354 | Hs | 2002.5 | NY | n/a | 42.7561 | −75.8166 | No
AY795965 | Hs | 2002.5 | MI | n/a | 43.7389 | −84.6273 | No
DQ164193 | Cb | 2002.5 | NY | 12910 | 44.8543 | −73.6630 | No
DQ080062 | Cx | 2002.5 | LA | 70560 | 29.9366 | −91.8695 | No
AY289214 | Hs | 2002.62 | TX | 77701 | 30.0727 | −94.1066 | No
DQ080069 | Ec | 2003.5 | TAM (M) | n/a | 24.9314 | −98.6356 | No
DQ983578 | Cn | 2003.5 | FL | 32966 | 27.6393 | −80.6822 | No
DQ080070 | Qq | 2003.5 | SON (M) | n/a | 29.3524 | −111.6824 | No
AY712945 | Zm | 2003.5 | TX | 77011 | 29.7427 | −95.3079 | No
AY712946 | Cc | 2003.5 | TX | 77011 | 29.7427 | −95.3079 | No
AY712947 | Cc | 2003.5 | TX | 77011 | 29.7427 | −95.3079 | No
AY712948 | Cq | 2003.5 | TX | 77011 | 29.7427 | −95.3079 | No
DQ080063 | Cl | 2003.5 | BCN (M) | n/a | 30.2977 | −114.9399 | No
DQ080064 | Fa | 2003.5 | BCN (M) | n/a | 30.2977 | −114.9399 | No
DQ080065 | Qq | 2003.5 | BCN (M) | n/a | 30.2977 | −114.9399 | No
DQ080066 | Px | 2003.5 | BCN (M) | n/a | 30.2977 | −114.9399 | No
DQ080067 | Bvs | 2003.5 | BCN (M) | n/a | 30.2977 | −114.9399 | No
DQ080068 | Ah | 2003.5 | BCN (M) | n/a | 30.2977 | −114.9399 | No
DQ164199 | Hs | 2003.5 | TX | n/a | 31.1697 | −100.0787 | No
DQ164203 | Ph | 2003.5 | CO | n/a | 38.9962 | −105.5465 | No
DQ164204 | Bj | 2003.5 | CO | n/a | 38.9962 | −105.5465 | No
DQ005530 | Hs | 2003.5 | UT | n/a | 39.4970 | −111.5452 | No
DQ164190 | Cb | 2003.5 | NY | 11901 | 40.9530 | −72.6420 | No
DQ164188 | Cb | 2003.5 | NY | 10570 | 41.1279 | −73.7929 | No
DQ164192 | Cb | 2003.5 | NY | 10956 | 41.1560 | −73.9936 | No
DQ431695 | Hs | 2003.5 | IL | 60025 | 42.0783 | −87.8242 | No
DQ164191 | Cb | 2003.5 | NY | 14757 | 42.2326 | −79.5161 | No
DQ164189 | Cb | 2003.5 | NY | 12067 | 42.5451 | −73.9360 | No
DQ080051 | Ct | 2003.5 | AZ | 85638 | 31.7487 | −110.024 | No
DQ080055 | Ct | 2003.5 | CA | 92283 | 32.9392 | −114.9009 | No
DQ080056 | Ct | 2003.5 | CA | 92283 | 32.9392 | −114.9009 | No
DQ080052 | Ct | 2003.5 | AZ | 85326 | 33.3813 | −112.5735 | No
DQ080057 | Cb | 2003.5 | CA | 91007 | 34.1291 | −118.0483 | No
DQ080058 | Cb | 2003.5 | CA | 91007 | 34.1291 | −118.0483 | No
DQ080054 | Cq | 2003.5 | CA | 93550 | 34.4834 | −118.0804 | No
DQ080053 | Ct | 2003.5 | AZ | 86505 | 35.6053 | −109.4589 | No



DQ080059 | Pn | 2003.5 | CA | 95814 | 38.5795 | −121.4913 | No
DQ431697 | Hs | 2003.56 | FL | 33601 | 27.9826 | −82.3401 | No
DQ431693 | Hs | 2003.60 | TX | 79119 | 35.0767 | −102.0462 | No
DQ431694 | Hs | 2003.62 | TX | 79035 | 34.6538 | −102.7472 | No
DQ431698 | Hs | 2003.65 | FL | 33601 | 27.9826 | −82.3401 | No
DQ431696 | Hs | 2003.70 | WI | 53201 | 43.0386 | −87.9067 | No
DQ431699 | Hs | 2003.73 | FL | 33601 | 27.9826 | −82.3401 | No
DQ431711 | Hs | 2004.5 | AZ | 85280 | 33.4052 | −111.9254 | No
DQ164206 | Cc | 2004.5 | TX | 77020 | 29.7731 | −95.3138 | No
DQ080061 | Ccs | 2004.5 | LA | 70560 | 29.9366 | −91.8695 | No
DQ164201 | Hs | 2004.5 | AZ | n/a | 34.1698 | −111.9337 | No
DQ666448 | Hs | 2004.5 | AZ | n/a | 34.1698 | −111.9337 | No
DQ431700 | Hs | 2004.5 | CA | 94118 | 37.7817 | −122.4615 | No
DQ080060 | Cc | 2004.5 | BCN (M) | n/a | 30.2977 | −114.9399 | No
DQ431705 | Hs | 2004.52 | SD | 57701 | 44.0731 | −103.2051 | No
DQ431701 | Hs | 2004.53 | CO | 81501 | 39.0723 | −108.5429 | No
DQ431708 | Hs | 2004.54 | CA | 92101 | 32.7253 | −117.1721 | No
DQ431709 | Hs | 2004.54 | CA | 91762 | 34.0578 | −117.6703 | No
DQ431702 | Hs | 2004.57 | CO | 81501 | 39.0723 | −108.5429 | No
DQ431712 | Hs | 2004.57 | AZ | 85280 | 33.4052 | −111.9254 | No
DQ431706 | Hs | 2004.57 | NM | 87101 | 35.1995 | −106.6442 | No
DQ431703 | Hs | 2004.57 | CO | 81501 | 39.0723 | −108.5429 | No
DQ431704 | Hs | 2004.57 | CO | 81501 | 39.0723 | −108.5429 | No
DQ431710 | Hs | 2004.58 | CA | 90631 | 33.9421 | −117.9517 | No
DQ431707 | Hs | 2004.58 | NM | 87101 | 35.1995 | −106.6442 | No
DQ666449 | Hs | 2005.5 | TX | n/a | 31.1697 | −100.0787 | No
DQ666450 | Hs | 2005.5 | TX | n/a | 31.1697 | −100.0787 | No
DQ666451 | Hs | 2005.5 | AZ | n/a | 34.1698 | −111.9337 | No
DQ666452 | Hs | 2005.5 | SD | n/a | 44.2160 | −100.2502 | No
GQ507472 / WG024 | Hs | 2003.53 | CA | 92867 | 33.8143 | −117.8277 | No
GQ507475 / WG099 | Hs | 2004.49 | CA | 92399 | 34.0360 | −117.0172 | Yes
GQ507473 / WG080 | Hs | 2004.56 | CA | 91402 | 34.2243 | −118.4446 | Yes
GQ507474 / WG091 | Hs | 2004.66 | CA | 92346 | 34.1353 | −117.1532 | Yes
GQ507476 / WG101 | Hs | 2005.57 | CA | 91709 | 33.9508 | −117.7322 | Yes
GQ507477 / WG103 | Hs | 2005.58 | CA | 90712 | 33.8491 | −118.1468 | Yes
GQ507468 / WG007 | Hs | 2005.59 | TX | 79922 | 31.8156 | −106.561 | Yes
GQ507478 / WG116 | Hs | 2005.65 | CA | 90715 | 33.8404 | −118.0797 | Yes
GQ507469 / WG009 | Hs | 2005.67 | NM | 88003 | 32.2825 | −106.737 | Yes
GQ507479 / WG124 | Hs | 2005.69 | AZ | 85743 | 32.3120 | −111.2105 | Yes
GQ507480 / WG132 | Hs | 2005.76 | CA | 90280 | 33.9490 | −118.202 | Yes
GQ507481 / WG142 | Hs | 2006.64 | NE | 68154 | 41.2628 | −96.1193 | Yes
GQ507470 / WG011 | Hs | 2006.66 | TX | 79902 | 31.7840 | −106.4972 | Yes
GQ507482 / WG144 | Hs | 2006.67 | AZ | 85716 | 32.2436 | −110.9238 | Yes
GQ507471 / WG013 | Hs | 2007.48 | TX | 79912 | 31.8385 | −106.5311 | Yes
GQ507483 / WG148 | Hs | 2007.59 | CA | 91722 | 34.0976 | −117.9068 | Yes
GQ507484 / WG149 | Hs | 2007.63 | CA | 91325 | 34.2368 | −118.5185 | Yes

n/a, not available; Ah, Ardea herodias (blue heron); Bj, Buteo jamaicensis (red-tailed hawk); Bu, Bonasa umbellus (ruffed grouse); Bvs, Butorides virescens (green heron); Cb, Corvus brachyrhynchos (common crow); Cc, Cyanocitta cristata (blue jay); Ccs, Cardinalis cardinalis (northern cardinal); Cl, Columba livia (pigeon); Cn, Culex nigripalpus (mosquito); Cp, Culex pipiens (mosquito); Cq, Culex quinquefasciatus (mosquito); Ct, Culex tarsalis (mosquito); Cx, Culex sp. (mosquito); Dc, Dumetella carolinensis (catbird); Ec, Equus caballus (horse); Fa, Fulica americana (American coot); Hs, Homo sapiens (humans); Pc, Phoenicopterus chilensis (Chilean flamingo); Ph, Pica hudsonia (black-billed magpie); Pn, Pica nuttalli (yellow-billed magpie); Px, Phalacrocorax sp. (cormorant); Qq, Quiscalus quiscula (common grackle); Zm, Zenaida macroura (mourning dove).
*Isolate names for newly sequenced genomes.


Table S2. Statistical performance of molecular evolutionary models

Substitution model* | Molecular clock model | Estimated rate of evolution (95% HPD interval)† | Estimated date of most recent common ancestor (95% HPD interval) | Estimated log marginal likelihood (SE) | Coefficient of variation‡
GTR + Γ | Relaxed | 5.65 (5.07–6.32) | 1998.6 (1997.8–1999.1) | −25038.3 (0.52) | 0.33 (0.22–0.46)
GTR + Γ | Strict | 5.69 (5.10–6.32) | 1998.4 (1997.7–1999.0) | −25070.9 (0.50) | n/a
GTR | Relaxed | 5.60 (5.01–6.28) | 1998.6 (1997.8–1999.1) | −25159.4 (0.56) | 0.33 (0.21–0.44)
GTR | Strict | 5.63 (5.08–6.21) | 1998.4 (1997.7–1998.9) | −25190.5 (0.43) | n/a
HKY + Γ | Relaxed | 5.70 (5.13–6.39) | 1998.6 (1997.8–1999.1) | −25193.5 (0.46) | 0.33 (0.22–0.45)
HKY + Γ | Strict | 5.74 (5.16–6.40) | 1998.4 (1997.8–1999.0) | −25225.3 (0.33) | n/a
HKY | Relaxed | 5.59 (5.04–6.20) | 1998.6 (1997.8–1999.1) | −25342.8 (0.49) | 0.33 (0.22–0.45)
HKY | Strict | 5.63 (5.09–6.21) | 1998.4 (1997.7–1998.9) | −25372.9 (0.31) | n/a

n/a, not applicable.
*GTR, general time-reversible model; HKY, Hasegawa-Kishino-Yano model; Γ, gamma rate-heterogeneity model.
†Units are 10^−4 substitutions per nucleotide per year. HPD, highest posterior density.
‡A measure of the variation in evolutionary rate among phylogeny branches.
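As a worked illustration of the model comparison in Table S2, the short Python sketch below ranks the eight models by their estimated log marginal likelihoods and computes the log Bayes factor of the best model against each alternative, taken here simply as the difference in log marginal likelihood. The dictionary layout and variable names are assumptions for illustration, "G" stands in for Γ in the labels, and the sketch ignores the reported standard errors.

```python
# Estimated log marginal likelihoods copied from Table S2.
# Keys are (substitution model, clock model); "G" denotes the gamma rate-heterogeneity model.
log_marginal_likelihood = {
    ("GTR+G", "Relaxed"): -25038.3,
    ("GTR+G", "Strict"): -25070.9,
    ("GTR", "Relaxed"): -25159.4,
    ("GTR", "Strict"): -25190.5,
    ("HKY+G", "Relaxed"): -25193.5,
    ("HKY+G", "Strict"): -25225.3,
    ("HKY", "Relaxed"): -25342.8,
    ("HKY", "Strict"): -25372.9,
}

# The best-supported model has the highest log marginal likelihood.
best_model = max(log_marginal_likelihood, key=log_marginal_likelihood.get)

# Log Bayes factor in favor of the best model over each alternative.
log_bayes_factor = {
    model: log_marginal_likelihood[best_model] - value
    for model, value in log_marginal_likelihood.items()
    if model != best_model
}

for model, log_bf in sorted(log_bayes_factor.items(), key=lambda item: item[1]):
    print(model, round(log_bf, 1))
```

On these values the closest competitor to GTR + Γ with a relaxed clock is GTR + Γ with a strict clock, which differs by about 32.6 log units.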

Movie S1. The movie displays the estimated spatiotemporal diffusion dynamics of West Nile virus (WNV) in North America. The timescale of this process is indicated in the top right of the movie. The images in Fig. 2 represent snapshots of this continuous process. Colored lines show the estimated phylogeny (Fig. 1C), which has been spatiotemporally transformed so that each branch and node is mapped to its known or estimated point in space and time. Phylogenetic branches are colored using a red-to-blue gradient, such that early branches are redder and later branches bluer. Colored clouds represent the estimated locations of sampled WNV lineages (95% highest posterior density areas). The clouds are colored using a blue-to-red gradient, such that early locations are bluer and later locations redder.
