26
Annual Review of Genomics and Human Genetics Genotype Imputation from Large Reference Panels Sayantan Das, 1 Gonc ¸alo R. Abecasis, 1 and Brian L. Browning 2 1 Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109-2029, USA; email: [email protected], [email protected] 2 Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, Washington 98195-7720, USA; email: [email protected] Annu. Rev. Genom. Hum. Genet. 2018. 19:73–96 First published as a Review in Advance on May 23, 2018 The Annual Review of Genomics and Human Genetics is online at genom.annualreviews.org https://doi.org/10.1146/annurev-genom-083117- 021602 Copyright c 2018 by Annual Reviews. All rights reserved Keywords imputation, genotype imputation, genome-wide association study, GWAS Abstract Genotype imputation has become a standard tool in genome-wide associ- ation studies because it enables researchers to inexpensively approximate whole-genome sequence data from genome-wide single-nucleotide poly- morphism array data. Genotype imputation increases statistical power, fa- cilitates fine mapping of causal variants, and plays a key role in meta-analyses of genome-wide association studies. Only variants that were previously ob- served in a reference panel of sequenced individuals can be imputed. How- ever, the rapid increase in the number of deeply sequenced individuals will soon make it possible to assemble enormous reference panels that greatly increase the number of imputable variants. In this review, we present an overview of genotype imputation and describe the computational techniques that make it possible to impute genotypes from reference panels with mil- lions of individuals. 73 Annu. Rev. Genom. Hum. Genet. 2018.19:73-96. Downloaded from www.annualreviews.org Access provided by University of Washington on 08/23/19. For personal use only.

Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

Annual Review of Genomics and Human Genetics

Genotype Imputation fromLarge Reference PanelsSayantan Das,1 Goncalo R. Abecasis,1

and Brian L. Browning2

1Center for Statistical Genetics, Department of Biostatistics, University of Michigan, AnnArbor, Michigan 48109-2029, USA; email: [email protected], [email protected] of Medical Genetics, Department of Medicine, University of Washington, Seattle,Washington 98195-7720, USA; email: [email protected]

Annu. Rev. Genom. Hum. Genet. 2018. 19:73–96

First published as a Review in Advance onMay 23, 2018

The Annual Review of Genomics and Human Geneticsis online at genom.annualreviews.org

https://doi.org/10.1146/annurev-genom-083117-021602

Copyright c© 2018 by Annual Reviews.All rights reserved

Keywords

imputation, genotype imputation, genome-wide association study, GWAS

Abstract

Genotype imputation has become a standard tool in genome-wide associ-ation studies because it enables researchers to inexpensively approximatewhole-genome sequence data from genome-wide single-nucleotide poly-morphism array data. Genotype imputation increases statistical power, fa-cilitates fine mapping of causal variants, and plays a key role in meta-analysesof genome-wide association studies. Only variants that were previously ob-served in a reference panel of sequenced individuals can be imputed. How-ever, the rapid increase in the number of deeply sequenced individuals willsoon make it possible to assemble enormous reference panels that greatlyincrease the number of imputable variants. In this review, we present anoverview of genotype imputation and describe the computational techniquesthat make it possible to impute genotypes from reference panels with mil-lions of individuals.

73

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 2: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

INTRODUCTION

The field of human genetics has made great strides since R.A. Fisher first explored the geneticarchitecture of quantitative traits a century ago (29). In 2005, the advent of single-nucleotide poly-morphism (SNP) genotyping arrays made it possible to conduct the first genome-wide associationstudy (GWAS) (52). Although this study had a very small sample (∼100 macular degenerationcases and ∼50 controls), it was the first in a series of studies that definitively implicated the alter-nate complement pathway in macular degeneration. Just two years later, the Wellcome Trust CaseControl Consortium published a landmark GWAS, the largest of its time, which analyzed ∼17,000cases and controls for seven common diseases (92). Along with replicating many previously impli-cated genetic loci, the study revealed multiple new risk loci for several diseases, including Crohn’sdisease and rheumatoid arthritis.

The success of these early GWASs led to an explosion of interest in the field. Numerous GWASshave systematically evaluated the contributions of genetic factors to various complex diseases (68)along with quantitative traits such as human height (53), body mass index (60), and cholesterol(33). These studies revealed unexpected pathways in disease etiology, such as the importance ofcomplement factor genes in macular degeneration (52), the role of the central nervous system inobesity susceptibility (60), and the function of genes in the autophagy pathway in Crohn’s disease(7). They have also provided evidence for previously suspected molecular mechanisms [e.g., therole of IL-23 signaling in psoriasis (70) and the role of APOE in Alzheimer’s disease (15)] andare expected to enable the development of new drugs and treatment strategies (75, 97). AlthoughGWASs are now commonplace, they have greatly changed human genetics in the last 10 years byproviding a new, systematic method that can provide deeper insights into disease biology.

One limitation of SNP genotyping arrays is that they assay only a small fraction of humangenetic variation. The variants assayed on SNP arrays are chosen based on the linkage disequilib-rium structure of the human genome. The International HapMap Project facilitated the design ofthe first arrays to be used for GWASs (44–46). Without imputation, GWASs that test variants ona commercial genotyping array must rely on pairwise linkage disequilibrium between an assayedSNP and a causal variant to detect association between the assayed SNP and trait. However, rarevariants, which are more often associated with dramatic functional consequences, tend to have lowlevels of pairwise linkage disequilibrium with common variants on SNP genotyping arrays (32),which makes it difficult to detect signals of association from rare variants. However, given sufficientcoverage, whole-genome sequencing can detect the rarest of mutations with very high accuracy(5). Although next-generation technologies have significantly reduced the cost of sequencing agenome, it remains prohibitively expensive to whole-genome sequence the millions of samplesthat are included in genetic studies each year (34).

A more cost-efficient way of genotyping rare variants is to impute them (58). Rare-variantgenotypes that are not directly assayed on GWAS arrays can be reconstructed by comparingeach sample to a reference panel of sequenced genomes. This method of estimating genotypesor genotype probabilities at markers that have not been directly genotyped is known as genotypeimputation. The first two GWASs to use genotype imputation were a study of type 2 diabetes inFinnish samples (85) and the Wellcome Trust Case Control Consortium study (92). In the type 2diabetes study, imputation helped the researchers identify and replicate multiple risk variants andcompare their results with those of two other studies that used different genotyping arrays. Sincethen, imputation has been a key step in the analysis of human genetic studies—accelerating fine-mapping efforts, aiding the combination of results across studies (meta-analysis), and increasingthe power of gene mapping analyses (66). Some examples of the benefits of genotype imputationare given below.

74 Das · Abecasis · Browning

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 3: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

Fine Mapping

Imputation provides a higher-resolution view of a genetic region by adding more variants, therebyincreasing the chances of identifying a causal variant. For example, a study on blood triglyceridelevels that aimed to fine map the GCKR gene found that the strongest signal came from a missensevariant that was imputed and later confirmed by direct genotyping (76). Similarly, a recent fine-mapping study on type 2 diabetes used imputation to enhance the discovery and increase the SNPresolution of causal type 2 diabetes risk alleles (63).

Meta-Analysis

Imputation also helps in meta-analysis by facilitating the combination of results across studies.Different studies often use different genotyping arrays containing different sets of variants. Forinstance, only 20% of the SNPs included on the Affymetrix 6.0 SNP array are included on theIllumina 660K array. Genotype imputation can generate a common set of variants that can beanalyzed across all the studies to boost power. The first two examples of an imputation-based meta-analysis date from early 2008. In both cases, researchers were able to combine studies conductedusing different arrays and identify new association signals that could not be discerned in any of theoriginal studies individually (93, 94). Since then, this approach has been successful in discoveringassociated loci for many different traits, including type 1 and type 2 diabetes (16, 23), Parkinson’sdisease (14), coronary artery disease (73), different types of cancer (4, 51, 84), height (95), bodymass index (60, 86), and lipid levels (33).

Increasing the Power of Association Studies

Another benefit of imputation lies in increasing the power to detect an association signal. WhenSNPs are genotyped in only a portion of the samples, imputation can increase the effective samplesize by filling in the missing genotypes. This was demonstrated in a study on triglycerides andcholesterol, where a common variant in a known risk gene (LDLR) was missed when only genotypedSNPs were analyzed but was then identified following imputation (93). This was because thegenotyping chip used to assay most of the samples did not contain the common variant or anyvariant strongly correlated with the common variant. Some simulation studies have shown thatimputation can increase power by up to 10% when compared with testing only genotyped SNPs(87), while others have predicted more modest gains (6, 37). The differences in these estimates canbe attributed to differences in experimental design, including filtering thresholds and the differentgenotyping arrays that were used (87).

Other Benefits

Imputed data have also been used to test for pleiotropic effects by imputing a known risk variantinto multiple disease studies. As a case in point, Hoffmann et al. (38) found evidence suggestinga pleiotropic effect from a HOXB13 mutation across multiple cancers. Imputation has been usedto estimate other types of genetic variations, such as copy number variants (36) and classical HLAalleles (24, 49, 55, 98).

GENOTYPE IMPUTATION METHODS: MAJOR MILESTONES

Estimation of missing data is a ubiquitous problem in statistics, and human genetic studies are noexception. However the advent of GWASs ushered in a new era, with a new type of imputation

www.annualreviews.org • Genotype Imputation 75

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 4: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

problem that traditional methods were ill equipped to solve (96). The primary reason is theextremely high rate of missing data: Commercial GWAS arrays genotype <1% of the knowngenetic variants, and the remaining >99% of the genetic variation is missing data that needsto be imputed. The second reason is that common statistical imputation techniques, such aslinear regression, regression trees, and k-nearest neighbors, do not model key characteristics ofgenetic data (linkage patterns, recombination hot spots, mutations, genotyping errors, etc.). Thesechallenges necessitated the development of statistical methods and computational tools createdexplicitly for genotype imputation in GWASs (see Figure 1).

The basic intuition behind genotype imputation is as follows: Any two individuals, even ifapparently unrelated, can share short stretches of chromosome derived from a distant commonancestor. Consequently, once a study sample is genotyped on a commercial array (with mostlymissing data), the observed genotypes can be used to identify DNA segments shared between thestudy sample and a reference panel of sequenced genomes (with no missing data). In this way,a study haplotype can be represented as a mosaic of short segments of related haplotypes foundin the reference panel, enabling one to impute the sites that were not genotyped (see Figure 2).Points where the reference haplotype template changes represent historical recombination events.Points where the observed target allele differs from the template allele represent historical mutationevents, gene conversion events, genotype error, or even erroneously assigned matches. Since astudy haplotype can be represented by many possible mosaics of reference haplotype segments,a probabilistic framework is needed to summarize information from all possible mosaics intoimputed alleles.

The Li and Stephens Model

Although multiple research groups have developed numerous genotype imputation methods overthe last decade, the basic framework behind most of them is fundamentally the same and is knownas the Li and Stephens model. It was first described in 2003 (56) to allow haplotype estimationmethods to handle large stretches of chromosome, where individual haplotypes are all unique butare expected to share contiguous, mosaic stretches with other haplotypes in the sample. A modifiedversion of this approach was implemented in fastPHASE by Scheet & Stephens (83) to enablegenotype phasing of larger samples. The framework uses a hidden Markov model (HMM) (80) todescribe the data, where the observed genotypes of unknown phase in a study sample representthe observed data of the HMM, while an underlying and unobserved set of phased genotypesrepresent the hidden states of the HMM. The HMM framework was immediately beneficial andprovided notable and substantial improvements in the quality of inferred haplotypes comparedwith previous approaches (65).

The Li and Stephens model state space can be visualized as a two-dimensional grid of HMMstates, with rows corresponding to reference haplotypes and columns corresponding to markersin the reference panel. In Figure 2, each allele on each reference haplotype corresponds to anHMM state. Each study sample haplotype is assumed to trace an unobserved path through thegrid, proceeding left to right from the first reference marker to the last reference marker. This pathis equivalent to a mosaic of templates. A new segment in the mosaic begins when the path switchesreference haplotypes (rows) between one marker and the next. The probability of a template switchbetween markers is determined by the HMM transition probabilities and is closely related to thepopulation recombination rate. The probability that an observed allele differs from the templateallele is determined by the HMM emission probabilities. Given an observed haplotype with missingalleles, the probability of each possible path through the HMM states can be calculated. A pathprobability is penalized (i.e., decreased) each time the path switches reference haplotypes (via the

76 Das · Abecasis · Browning

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 5: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

2002

2018

Milestone International HapMap Project inception

Milestone Human Genome Project completion

2003Milestone Klein et al. 2005First genome-wide association study

2005

Referencepanel International HapMap Project phase 1

Software Scheet & Stephens 2006FastPHASE

2006

Milestone First genome-wide association studiesto use imputation

2007

Software Marchini et al. 2007IMPUTE

Software Browning & Browning 2007Beagle

Referencepanel International HapMap Project phase 2

Software Howie et al. 2009IMPUTE2

2009

Referencepanel International HapMap Project phase 3

2010

Referencepanel 1000 Genomes Project pilot

Software Li et al. 2010MaCH

Milestone Howie et al. 2012

Howie et al. 2012

Prephasing

2012

Referencepanel 1000 Genomes Project phase 1

Software Minimac

Software Fuchsberger et al. 2015Minimac2

2014

Referencepanel 1000 Genomes Project phase 3

Milestone Michigan Imputation Server

2015

Software Minimac3

Software Browning & Browning 2016Beagle 4.1

Referencepanel McCarthy et al. 2016Haplotype Reference Consortium

2016

Referencepanel TOPMed Consort., manuscript in preparationTrans-Omics for Precision Medicine

2017

Software Bycroft et al. 2017IMPUTE4

Software S. Das, K. Yu & G.R. Abecasis, manuscript inpreparation

S. Das, K. Yu & G.R. Abecasis, manuscript inpreparation

Minimac4

Software MetaMinimac

Software B.L. Browning, Y. Zhou & S.R. Browning,manuscript in preparation

Beagle 5.0

Int. HapMap Consort. 2003

Int. Hum. Genome Seq. Consort. 2004

Int. HapMap Consort. 2005

Scott et al. 2007,Wellcome Trust Case Control Consort. 2007

Int. HapMap Consort. 2007

Int. HapMap 3 Consort. 2010

1000 Genomes Proj. Consort. 2010

1000 Genomes Proj. Consort. 2012

1000 Genomes Proj. Consort. 2015

Das et al. 2016

Das et al. 2016

Figure 1A brief time line summarizing the major developments in genotype imputation. Each major development has been categorized as amilestone ( green), a reference panel (blue), or software (white).

www.annualreviews.org • Genotype Imputation 77

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 6: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

X1 A T A A G C A C T G A A A C G G C G C A T

X2 A T A T G C A C T G A A A C G G G G T A C

X3 T G A A G C T C T G A A A C G G C G C A C

X4 A C T T G C A C T G A A A C G G C G C A T

X5 A T A A A C T G A C T A T C T A C G T A T

X6 T T T A A T A C T G A A A C G A G C C T C

SI T g a a G c t g A c t a T c g A g c C t c

SG T . . . G . . . A . . . T . . A . . C . .

Imputed:

Missing:Re

fere

nce

hapl

otyp

es

Figure 2An illustration of genotype imputation, showing the process of imputation for a study haplotype (SG)genotyped at 6 markers using a reference panel of sequenced haplotypes at 21 markers. The alleles in SG areused to match short segments from the reference panel. For example, in the first genomic segment, thealleles T and G imply that the corresponding segment might have been copied from haplotype X3. In thesecond segment, the alleles A and T imply that haplotype X5 might have been copied. Proceeding similarly,the study haplotype can be represented as a mosaic of DNA segments from haplotypes X3, X5, and X6.Consequently, the missing sites can be imputed to obtain the final imputed haplotype, SI.

HMM transition probabilities) and each time the reference allele on the path differs from theobserved allele (via the HMM emission probabilities). The probability that the (unobserved) pathof a target haplotype goes through a particular HMM state (the state probability) can be calculatedefficiently with the HMM forward–backward algorithm (80). Imputed allele probabilities at amarker are obtained from the state probabilities. The probability that the target haplotype carriesa particular allele is the sum of the state probabilities corresponding to reference haplotypes thatcarry the allele.

Although most contemporary imputation tools employ an HMM framework, they differ in howthey define the state space and the parameters of the HMM. Table 1 summarizes the major impu-tation tools in the last decade. While fastPHASE, MaCH, and IMPUTE were quite similar, thefirst Beagle imputation algorithm was different because it did not employ the usual transition andemission functions, and the haplotype model was constructed from both reference and study sam-ples as opposed to only reference samples (11). However, the second Beagle imputation algorithm(introduced in version 4.1) uses the Li and Stephens model and is similar to the other tools (9).

While the term genotype imputation typically refers to imputing variants not directly assayedin a GWAS, it has also been used to refer to inferring genotypes from genotype likelihoods thatare estimated from SNP array or low-coverage sequence data. This inference of genotypes fromgenotype likelihoods, which is also called genotype refinement, is outside the scope of this review.However, HMMs are also used by several methods in this setting in tools such as Beagle, MaCH,SNPTools, and Stitch (10, 20, 57, 91).

Other imputation engines have employed direct methods of haplotype matching that donot employ the HMM framework. For example, PBWT uses the positional Burrows–Wheelertransformation to find set-maximal matches at each marker of the genotyped sample, whichare in turn used to assign alleles at the missing markers (27). While such methods are highly

78 Das · Abecasis · Browning

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 7: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

Table 1 Genotype imputation tools that employ a hidden Markov model (HMM)

Tool Year Description of state spaceComputational

complexity HMM parameter functions

FastPHASE 2006 All genotype configurationsfrom a fixed number oflocalized haplotype clusters

Maximization-steplinear in number ofhaplotypes, quadraticin number of clusters

Depends on recombination andmutation rates; parameters are fitusing an expectation–maximizationalgorithm

IMPUTE 2007 All genotype configurationsfrom all reference haplotypes

Quadratic in number ofhaplotypes

Depends on a fine-scale recombinationmap that is fixed and providedinternally by the program

Beagle 2007 All genotype configurationsfrom a variable number oflocalized haplotype clusters

Quadratic in number ofhaplotypes

Empirical model with no explicitparameter functions

IMPUTE2 2009 All reference haplotypes Phasing quadratic innumber of haplotypes,imputation linear innumber of haplotypes

Same as IMPUTE

MaCH 2010 All genotype configurationsfrom all reference haplotypes

Quadratic in number ofhaplotypes

Depends on recombination rate,mutation rate, and genotyping error;parameters are fit using a Markovchain Monte Carlo orexpectation–maximization algorithm

Minimac andMinimac2

2012 All reference haplotypes Linear in number ofhaplotypes

Same as MaCH

Minimac3 2016 All unique allele sequencesobserved in reference data in asmall genomic segment

Linear in number ofhaplotypes

Same as MaCH, but parameterestimates are precalculated and fixed

Beagle 4.1 2016 All reference haplotypes atgenotyped markers

Linear in number ofhaplotypes

Depends on recombination rates anderror rates, which are precalculatedand fixed

Minimac4 2017 Collapsed allele sequences fromreference data that match atgenotyped positions in smallgenomic segments

Linear in number ofhaplotypes

Same as Minimac3

IMPUTE4a 2017 All possible referencehaplotypes

Linear in number ofhaplotypes

Same as IMPUTE2

Beagle 5.0 2018 A user-specified number ofreference haplotypes

Linear in number ofhaplotypes

Same as Beagle 4.1

This table describes the typical state space and parameter functions used to model the Li and Stephens framework. Minimac and IMPUTE2 were the firsttools to use the prephasing approach. Minimac3 and Beagle 4.1 exploit local haplotype redundancy to reduce the size of the state space and hence thecomputational burden.aIMPUTE4 uses the same HMM as IMPUTE2; however, to reduce memory usage and increase speed, it uses compact binary data structures and takesadvantage of high correlations between inferred copying states in the HMM to reduce computation.

computationally efficient, in current panels, their imputation accuracy is reduced because themethods do not integrate over all possible mosaic configurations but instead use the longest hap-lotype matches flanking each location to impute each genotype. Additionally, some other genotypeimputation methods [e.g., PLINK (79), SNPMStat (59), TUNA (72), and UNPHASED (26)] useSNP-tagging approaches to carry out imputation. Although these methods are simpler and can

www.annualreviews.org • Genotype Imputation 79

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 8: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

be very fast, they do not utilize information from the entire chromosome to perform imputationand hence generally provide less accurate allele estimates.

Prephasing

Genotype imputation is a highly computationally intensive process because of the probabilisticframework and a high rate of missing data. One of the major milestones to reduce the computa-tional burden in the Li and Stephens framework was the introduction of prephasing. This ideainvolves a two-step imputation process: the initial step of prephasing (i.e., haplotype estimation)of the GWAS genotypes and a subsequent step of imputation into the estimated study haplotypes(41). These separate steps benefited researchers in several ways. First, the decomposed haplotypescould be reused for imputation from different reference panels, allowing researchers to conve-niently explore the trade-offs of different imputation strategies. Second, separating the phasing (orhaplotype estimation step) from imputation allowed researchers to quickly benefit from separateadvances in phasing and imputation technology, which no longer needed to be tightly integrated.Third, it reduced the complexity of the imputation step from quadratic to linear in the numberof reference haplotypes, because prephasing allowed matches to be found by comparison againstphased haplotypes rather than against all pairs of haplotypes. Although splitting up the processdoes marginally reduce the imputation accuracy in some populations, such as African Americans(41), the ability to use much larger reference panels with prephasing makes it possible to attaingreater imputation accuracy. The current versions of Minimac, IMPUTE, and Beagle all employthis prephasing approach.

Public Reference Panels

Over the years, the quality of genotype imputation has benefited greatly from improved genotyp-ing technologies (e.g., high-density genotyping arrays) and more efficient analytical methods (e.g.,prephasing) but most notably from the increase of genetic information in publicly available datasets. Examples of such data sets include those from the International HapMap Project (45–47),the 1000 Genomes Project (1000G) (1), the UK10K Project (43), the Haplotype Reference Con-sortium (HRC) (69), and the Trans-Omics for Precision Medicine (TOPMed) program (71). Thedevelopment of next-generation sequencing technologies has led to a rapid increase in the sizes ofdata sets used as reference panels for genotype imputation. For example, while the first release ofthe International HapMap Project panel had 269 individuals (45), the subsequent 1000G phase 3panel had 2,504 individuals with low-coverage sequence data (3), and upcoming reference panelsfrom the TOPMed program are expected to soon include more than 100,000 deeply sequencedsamples. Table 2 summarizes the major public reference panels.

For association studies, the immediate benefits of a larger panel include a more detailed catalogof genetic variants, which increases the chance of imputing a causal variant, and better imputationaccuracy, which improves the power of downstream association analyses, especially for rare variants(58). Panels derived from disease-focused sequencing efforts often have restrictions that limitdirect access to the underlying haplotypes (e.g., HRC and TOPMed), making their use as abroadly distributed imputation resource challenging. These challenges led to the evolution ofweb imputation servers, another major milestone in the field of genotype imputation.

Web Imputation Servers

Imputation servers enable users to upload GWAS data to a remote server through secure filetransfer protocols. The server then carries out imputation along with automated quality control,

80 Das · Abecasis · Browning

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 9: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

Table 2 The most commonly used public reference panels to date

Referencepanel

Number ofreferencesamples

Number of sites(autosomes + Xchromosome)

Averagesequencing

coverageAncestry

distributionPubliclyavailable

Indelsavailable Reference

InternationalHapMapProject phase 3

1,011 1.4 million NAa Multiethnic Yes No 47

1000G phase 1 1,092 28.9 million 2–6× Multiethnic Yes Yes 1

1000G phase 3 2,504 81.7 million 7× genomes,65× exomes

Multiethnic Yes Yes 3

UK10K Project 3,781 42.0 million 7× genomes,80× exomes

European Yes Yes 89

HRC 32,470 40.4 million 4–8×b PredominantlyEuropeanc

Partiallyd No 69

TOPMed 60,039 239.7 million 30× Multiethnic Partiallye Yes 71

Abbreviations: 1000G, 1000 Genomes Project; HRC, Haplotype Reference Consortium; indel, insertion or deletion; NA, not applicable; TOPMed,Trans-Omics for Precision Medicine.aThe International HapMap Project phase 3 data were genotyped on the Illumina Human1M and Affymetrix 6.0 SNP arrays.bThe HRC panel was obtained by combining sequencing data across many low-coverage (4–8×) and a few high-coverage sequencing studies.cThe only non-European samples in the HRC panel are through the 1000G reference panel (which was a contributing study).dMost of the HRC samples (∼27,000) are available for download through controlled access from the European Genome-Phenome Archive.eSome of the TOPMed samples (∼18,000) are available for download through controlled access from the Database of Genotypes and Phenotypes (dbGaP).

such as checking strand orientation, allele labels, file integrity, minor allele frequency distribution,and per-sample missingness (see Figure 3). After the analysis is complete, users receive a notifi-cation and download link to the imputed data, which are encrypted with a onetime password. Amajor benefit of imputation servers is that they allow researchers to spend more time on analyzingand interpreting their data instead of learning about imputation tools and data preprocessing. Inaddition, they provide a uniform platform for comparing and consolidating results across studies,thereby aiding collaborative efforts. Finally, for reference panels derived from disease-focused se-quencing studies or from other studies that also have data-sharing restrictions, keeping referencepanel data behind a firewall, where it can be used for low-risk analyses without exposing individual-level data, greatly increases the number of users who can benefit from these panels. Imputationservers have also motivated the development of remote web servers for other genomic analyses,such as estimation of principal components [Locating Ancestry from Sequence Reads (LASER)](88) and phenome-wide analysis of participants in electronic health record–based studies [suchas those in the Michigan Genomics Initiative (MGI) PheWeb (http://pheweb.sph.umich.edu)].Currently, the University of Michigan and the Wellcome Sanger Institute host web imputationservers, using the underlying imputation engines Minimac3 and PBWT, respectively (19, 27). TheMichigan server has imputed more than 18 million genomes for ∼3,000 registered users, while theSanger server has imputed ∼9 million genomes and has ∼550 users. The current throughput ofthe Michigan server, which consists of 23 multiprocessor computers (∼600 cores in total), is∼7 million genomes per month using the International HapMap Project phase 2 panel and∼900,000 genomes per month using the HRC panel [estimates include time to prephase usingEagle (61)].

www.annualreviews.org • Genotype Imputation 81

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 10: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

Researcher Imputation server Researcher

The researcher uploadsgenome-wide associationstudy data using SFTP orGlobus.

The server performs qualitycontrol, prephasing, imputation,and encryption.

When the data are ready,the researcher receives anotification email with adownload link and a onetimepassword for decryption.The data are deleted from theserver after seven days.

Figure 3An outline of the pipeline in the Michigan imputation server.

COMPUTATIONAL METHODS FOR LARGE REFERENCE PANELS

In the last 15 years, the cost of DNA sequencing has decreased by five orders of magnitude (35).The super-exponential decrease in the cost of genome sequencing is widely known. Less wellknown is the even more rapid decrease in the cost of genotype imputation since 2009. Timingresults in 2009 for one of the first imputation engines (IMPUTE 0.5.0) showed an imputationtime of 0.43 s per genotype per target sample when imputing from 1,200 reference samples (8).Timing results in 2015 for Minimac3 and Beagle 4.1 showed imputation times of 6.4 × 10−6 and7.5 × 10−6 s per genotype per sample, respectively, when imputing from 2,452 reference samplesfrom the 1000G phase 3 panel (9). During this time, these imputation methods also increasedthe number of reference samples that could be included in an imputation analysis by two to threeorders of magnitude. In this section, we discuss computational methods for reducing computationtime and memory requirements. The methods described here can be combined to produce a largecumulative effect.

Use of a Custom Subset of the Reference Panel

One of the first strategies for reducing computation time was to use only a subset of the availablereference haplotypes. The idea is to select a small subset of reference haplotypes that appear to beclosely related to the sample of interest in a genomic region and to impute genotypes in the targethaplotype in that region using the small custom reference panel instead of the full reference panel.This technique was first implemented in IMPUTE2 (40, 42). If the custom reference panel for aregion contains the reference haplotypes that are most closely related to the target haplotype andif the algorithm for selecting the custom reference panel is sufficiently fast, genotype imputationusing the custom reference panel can be much faster than imputation from the full reference paneland have similar accuracy (40, 42). The first method for selecting closely related haplotypes wasbased on Hamming distance (40), but more recent approaches select closely related haplotypesbased on long identity-by-state segments (43). Typically, a different custom subset of the referencepanel is selected for each observed haplotype and each region of interest.

82 Das · Abecasis · Browning

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 11: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

Prephasing Target Data

The first genotype imputation methods assumed unphased target genotype data and internallyestimated the target genotype phase during genotype imputation (8, 57, 67). The first step awayfrom this paradigm was the realization that, instead of imputing unphased genotypes, one couldimpute alleles directly onto the estimated target haplotypes (40). Once methods began performinghaploid imputation, the next step was to separate the genotype phasing and imputation steps andto require the input target data to be phased (41).

Unlike most optimizations, which trade increased algorithmic and software complexity forreduced run time, the use of prephased target data simplifies the algorithms and software.This reduced complexity makes it easier to discover and implement additional computationaloptimizations.

Specialized Input Formats for Reference Data

One of the challenges of imputation from large reference panels is reading and storing the refer-ence genotype data. A genotype is stored in 4 bytes in the Variant Call Format (VCF) (17). If thereare 1 million reference samples, VCF storage requirements can exceed 1 TB per megabase of chro-mosome. Consequently, genotype data for large reference panels are typically stored on hard diskdrives in compressed form. General-purpose compression algorithms, such as gzip (22), greatlyreduce the reference panel file size but do not address the problem of storing reference genotypedata in memory during analysis. In addition, the time required for decompressing large referencepanels compressed with gzip is substantial and can exceed the time required for imputation. Theselimitations of general-purpose compression algorithms have motivated the development of spe-cialized compression formats for reference genotype data that achieve high data compression orpermit fast retrieval of individual genotypes from the compressed representation so that referencehaplotypes can be stored in compressed form in memory when imputing genotypes.

One of the first specialized compression formats to be developed for reference data was basedon the Burrows–Wheeler transform (27). Two additional compression formats were developedspecifically for genotype imputation. One of these formats (M3VCF) (19) exploits local redundancyamong haplotypes. Rather than storing all haplotypes in a chromosome segment, only the uniqueallele sequences are stored, along with a map from the reference haplotypes to the allele sequencecarried by each haplotype. A second compression format, called binary reference format (bref )(9), employs this same general strategy for markers with a high nonmajor allele frequency butuses an alternate compression scheme for markers with a low nonmajor allele frequency, whichare the bulk of markers in large reference panels. For a low-frequency marker, bref stores a list ofreference haplotypes that carry the minor allele. If the marker has more than two alleles, it storesa separate list for each nonmajor allele. The allele on a given haplotype can be found by searchingthe lists of haplotypes. If the haplotype is found in a list, the haplotype carries the correspondingnonmajor allele. If the haplotype is not found in any list, the haplotype carries the major allele.For low-frequency markers, bref increases the time required to query the allele carried by a givenhaplotype, but the query time is not prohibitive if the lists of haplotypes are sorted in increasingorder and a binary search algorithm is used. For reference panels with more than 100,000 samples,the use of the bref format reduces computation time by more than 30% and the use of the M3VCFformat reduces computation time by more than 90% compared with the use of the VCF format (9).

Clustering of Identical Reference Haplotype Segments

In short segments of the genome, the same allele sequence can be carried by many haplotypes inthe reference panel. HMM states that correspond to identical reference haplotypes in a genomic

www.annualreviews.org • Genotype Imputation 83

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 12: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

segment will have the same emission probabilities at each marker in the segment. Consequently,in a short region, the identical haplotypes can be clustered together, and the HMM forward–backward algorithm calculations can traverse the relatively small number of unique referenceallele sequences rather than all reference haplotypes. This speeds up HMM calculations withinsmall regions with no loss in accuracy. However, the full set of reference haplotypes must still betraversed when performing the HMM forward and backward algorithm calculations across regionboundaries. Minimac3 implements a version of local clustering that is based exclusively on thereference panel, which allows the local clustering to be precomputed for a reference panel andused in all subsequent imputation analyses (19). This precomputed clustering also provides thebasis for the Minimac3 M3VCF format (19). Beagle 4.1 performs local clustering on the fly duringimputation because the short regions used for clustering are defined by the genotyped markers inthe target samples (9).

Imputation via Linear Interpolation

Although it is most natural to calculate HMM state probabilities in one forward pass and onebackward pass through the markers in the reference panel, it is also possible to perform thesecalculations in two stages. In a two-stage approach, the HMM forward–backward algorithm isperformed first using only the markers that are genotyped in the target samples. In the secondstage, the forward–backward algorithm is performed in each chromosome interval bounded by twoadjacent genotyped markers or by a chromosome boundary and the adjacent genotyped marker.For the Li and Stephens model, the one-stage and two-stage approaches produce identical HMMstate probabilities.

The two-stage approach to HMM calculations permits two optimizations based on linearinterpolation. The first optimization uses linear interpolation in the second stage instead of theforward–backward algorithm to calculate HMM state probabilities at imputed markers. Afterthe first stage, HMM state probabilities are calculated for genotyped markers, and HMM stateprobabilities at imputed markers are estimated by linear interpolation on genetic distance. Overshort genetic distances, linear interpolation generally provides an accurate approximation of theHMM state probability (9).

A second optimization arises from the observation that interpolated HMM state probabilitiesin an interval are bounded by the HMM state probabilities at the bounding genotyped markers.If the state probabilities at the bounding genotyped markers are sufficiently small for a referencehaplotype, the interpolated HMM state probabilities in the interval can be approximated by 0,and the linear interpolation step can be skipped altogether.

These optimizations based on linear interpolation can also be combined with clustering ofidentical haplotype segments. In particular, one can cluster reference haplotypes that have identicalallele sequences between the two genotyped markers before performing linear interpolation.

Reducing Memory Requirements

Imputation from large reference panels must be performed within the constraints imposed by theavailable computer memory. A standard approach to reducing memory requirements for genome-wide imputation is to divide the genome into small overlapping genomic windows and performimputation in each window separately. Although one could compensate for increasing referencepanel size by decreasing the length of the analysis window, there are limits to this strategy. If thewindow is too short, imputation accuracy will suffer if information from outside the window isignored. This is most evident when imputing rare variants from large reference panels. Individuals

84 Das · Abecasis · Browning

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 13: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

who share a rare variant will typically share a long haplotype around the variant, and a majorbenefit of large panels is their ability to facilitate identification of these long shared haplotypes.If the genomic window is too short, the shared long haplotype containing the rare variant will betruncated by the window boundary, and imputation accuracy will decrease.

Another approach to reducing memory requirements is multithreading, which allows multipleCPU cores to share a single copy of the reference and target genotype data. This provides somereduction in memory use, but memory requirements generally will still increase linearly with thenumber of computational threads because each thread must allocate memory for its probabilitycalculations.

Fortunately, all of the above-described optimizations that reduce compute time have the addedbenefit of reducing memory requirements. This makes it possible to use relatively long analysiswindows (say, >10 cM in length) when imputing from large reference panels.

MEASURING IMPUTATION ACCURACY

Most imputation methods estimate a probability distribution for the allele carried by each haplo-type at each imputed marker. Posterior genotype probabilities can be derived from the posteriorallele probabilities under the assumption of Hardy–Weinberg equilibrium (40). One of the mostcommon uses of imputed genotype data is to test each imputed marker for association with a trait.Standard regression-based approaches for genetic association studies, including generalized linearmodels and linear mixed models, extend naturally to imputed data by replacing the allele dose of theobserved genotype with the expected allele dose of the imputed genotype, which is the sum of theposterior allele probabilities for each haplotype (66). In this context, it is helpful to assess accuracyfor imputed markers so that poorly imputed markers may be excluded prior to association testing.

The most interpretable measures of imputation accuracy are based on the correlation betweenthe imputed and true dose of an allele. One correlation-based measure is the r2 measure (41),which is the squared correlation between the true and estimated dose of an allele across all imputedsamples. Note that the MaCH r2 measure (57) is slightly different from the Minimac r2 measure(which is also reported by Beagle) since the former is based on correlation between true andestimated diploid dosages across all samples instead of haploid dosages across all haplotypes (theyare equivalent under assumptions of Hardy–Weinberg equilibrium). For each haplotype, the trueallele dose is 0 or 1, and the estimated allele dose is the posterior allele probability. A formula forthe Minimac r2 measure is derived in the sidebar titled Estimating r2.

The r2 measure has two attractive features: It can be estimated from posterior allele probabilitieswithout knowledge of the true allele dose on each chromosome if the allele probabilities are wellcalibrated (see the sidebar titled Estimating r2), and it has a useful interpretation in terms ofsample size and statistical power when testing a binary trait. This relationship is as follows: Ifr2 = r2

0 for an imputed marker, the power of an allelic test with N samples and imputed allelesis approximately equal to the power of an allelic test with r2

0 N samples and true alleles. Thus,the r2 measure can be interpreted as the effective reduction in sample size when testing imputedalleles rather than the true alleles for association with a binary trait. Pritchard & Przeworski (78)gave a derivation of this result in the context of two correlated markers. This result generalizesto the correlation between the estimated allele dose and the true allele dose (18). It is commonto exclude poorly imputed markers from downstream analysis by requiring the r2 measure foran imputed variant to exceed some threshold. Thresholds of 0.3 or larger are commonly used.An r2 threshold of 0.3 means that one is willing to accept an effective reduction in sample sizeof approximately two-thirds when performing an allelic test with imputed alleles. Since r2 is acorrelation, it is defined in terms of the variance of the true allele dose, and thus if it is correctly

www.annualreviews.org • Genotype Imputation 85

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 14: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

ESTIMATING r2

One attractive feature of r2, the squared correlation between true and imputed allele dose, is that it can be estimatedfrom posterior allele probabilities without knowing the true allele on each chromosome. Here, we derive an estimateof r2 in terms of the posterior allele probabilities.

Let X be 1 if a chromosome carries the allele of interest and be 0 otherwise, and let Z be the estimated posteriorallele probability that X = 1. Then r2 is defined to be the squared correlation of X and Z. We say that the posteriorallele probabilities are correctly calibrated if E[X | Z] = Z. If the posterior allele probabilities are correctly calibrated,we can use the law of total expectation and the fact that X2 = X to obtain

E[X 2] = E[X ] = E[E[X |Z]] = E[Z]

Var(X ) = E[X 2] − E[X ]2

= E[Z] − E[Z]2

and

Cov(X , Z) = E[XZ] − E[X ]E[Z]

= E[E[XZ|Z]] − E[E[X |Z]] E[Z]

= E[Z2] − E[Z]E[Z]

= Var(Z).

Consequently,

r2 = (Cov(X , Z))2

Var(X )Var(Z)

= Var(Z)Var(X )

= E[Z2] − E[Z]2

E[Z] − E[Z]2

If there are n imputed chromosomes and zi is the estimated reference allele probability in the ith haplotype, onecan estimate E[Zk] as E[Zk] ≈ (1/n)

∑zk

i and r2 as

r2 ≈ n∑

z2i − (

∑zi )

2

n∑

zi − (∑

zi )2 .

estimated, its interpretation does not depend on allele frequency. However, the estimate of r2

becomes noisier (i.e., larger standard error) (see Figure 4). Consequently, one could considerapplying a frequency-dependent r2 threshold for marker filtering. One limitation of r2 is thatthere must be enough imputed samples so that it can be accurately estimated. Another is that itis defined in terms of a particular allele. If a marker has only two alleles, the estimated r2 will bethe same for both alleles. If a marker has more than two alleles, one could combine the alternatealleles into a single composite allele or estimate r2 separately for each allele.

A second correlation-based measure of imputation accuracy that is closely related to r2 is allelicR2 (8). The allelic R2 measure differs from r2 in that it estimates the correlation between the trueallele dose and the most probable (i.e., best guess) allele dose instead of the estimated allele dose.Thus, the r2 measure is more closely aligned than allelic R2 to the power of downstream analyses

86 Das · Abecasis · Browning

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 15: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

Alternate allele frequency (log10 scale)

Esti

mat

ed r2

– tr

ue r2

0.0005 0.001 0.002 0.005 0.01 0.02 0.05 0.1 0.2 0.5

−1.0

−0.8

−0.6

−0.4

−0.2

0

0.2

0.4

0.6

0.8

1.0

Figure 4Plot of noise in estimated imputation r2 as a function of nonreference allele frequency when imputing 10European-ancestry genome-wide association study samples from the Trans-Omics for Precision Medicinereference panel (n = 18,000). The x axis is the alternate allele frequency on a log10 scale. The y axis is thedifference between the estimated imputation accuracy (from Minimac) and the true imputation r2. The blueshaded area shows the smoothed scatter plot. The solid line is the mean difference at each frequency, and theupper and lower dashed lines are the 5% and 95% quantiles, respectively, for the difference at each frequency.

that use estimated allele dose instead of directly observed genotypes. Allelic R2 more closelymirrors the loss in power one might expect when using best-guess genotypes in analyses instead ofestimated allele doses. One particular limitation of allelic R2 is that it cannot be computed whenthe most probable target allele is the same for all target haplotypes.

A third commonly used measure of imputation accuracy that is not directly based on correlationis IMPUTE’s info measure, which is an estimate of the ratio of statistical information about thepopulation allele frequency in the imputed genotypes and in the true genotypes (66). However,it can be shown that if the Hardy–Weinberg equilibrium holds, then IMPUTE’s info measure isequal to the Minimac r2 measure (where the genotype probabilities to calculate IMPUTE’s infomeasure are estimated under Hardy–Weinberg equilibrium).

FACTORS AFFECTING IMPUTATION ACCURACY

Multiple factors can affect imputation accuracy:

� Size of reference panel: Increasing the size of the reference panel generally increases imputationaccuracy, especially for rarer variants, provided that the level of genetic similarity between

www.annualreviews.org • Genotype Imputation 87

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 16: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

Impu

tati

on r2

Impu

tati

on r2

Impu

tati

on r2

Alternate allele frequency

Alternate allele frequency

TOPMedHRC1000G phase 3

a b

c d

e

1.0

0.8

0.6

0.4

0.2

1.0

0.8

0.6

0.4

0.2

1.0

0.8

0.6

0.4

1.0

0.9

0.8

0.7

0.6

1.0

0.8

0.6

0.4

0.2

10–4 10–3 10–2 10–1 1 10–3 10–2 10–1 1

10–3 10–2 10–1 1

10–3 10–2 10–1 1

10–3 10–2 10–1 1

European Admixed American

East Asian Southeast Asian

African

Figure 5Imputation accuracy for five ancestries: (a) European, (b) admixed American, (c) East Asian, (d ) Southeast Asian, and (e) African. Weextracted 10 samples from each of these ancestries from the 1000 Genomes Project (1000G) phase 3 data, masked all variants exceptthose on the Illumina 1M chip, and imputed them using the Trans-Omics for Precision Medicine (TOPMed) (with 18,000 samples),Haplotype Reference Consortium (HRC), and 1000G phase 3 (after removing overlaps) reference panels. The aggregate r2 (measuringthe imputation accuracy) is plotted as a function of the alternate allele frequency.

the reference panel and study samples is maintained. A larger panel provides a larger set oftemplate haplotypes to match against, which improves the imputation accuracy. For example,for imputing into European samples, the HRC panel, which has 33,000 samples, is a bettera choice than the 1000G phase 3 panel, which has 2,500 samples (see Figure 5a). However,expanding the reference panel to include samples with little genetic similarity to the targetpanel or less accurately estimated haplotypes may decrease imputation accuracy. This isevident from Figure 5b–e, where the HRC panel, despite being a superset of the 1000Gphase 3 panel, decreases the imputation accuracy in non-European samples.

� Density of genotyping array: A denser genotyping array increases the number of sites to matchwith, thereby improving the chances of finding shared haplotype segments. For example, theIllumina Human Omni5Exome array (∼5 million variants) is a better choice for imputationthan the Human Omni2.5Exome array (∼2.5 million variants). The Human Omni5Exomearray increases the average imputation accuracy (r2) by 0.1 for variants with a minor allele

88 Das · Abecasis · Browning

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 17: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

frequency of approximately 0.01 when imputing into European samples using the HRCreference panel.

� Minor allele frequency of variant being imputed (in the reference panel): Rare variants are harderto impute than common variants because rare variants will generally be observed only afew times in the reference panel. This makes it harder to establish the haplotype backgroundfor these variants and reduces the set of template haplotypes available for matching. Overall,this is expected to lower imputation accuracy. The plots of imputation accuracy (Figure 5)demonstrate this feature, as we see that the imputation accuracy always increases with theminor allele frequency (controlling all other factors).

� Haplotype accuracy in reference and study samples: Genotype imputation depends on findinghaplotype segments that are shared between reference haplotypes and a target haplotype.Haplotype phase error tends to break up shared haplotype segments. Increases in refer-ence panel size over time should lead to lower haplotype phase error rates (62, 74). Large,accurately phased reference panels will also make it possible to improve haplotype phaseestimates in the study samples (61).

� Sequencing coverage of reference panel: The sequencing coverage directly correlates with theaccuracy of the genotypes in the reference panel, which in turn correlates with the accuracyof the inferred haplotypes. For example, in Figure 5a, the pilot TOPMed panel givesconsiderably higher accuracy than the HRC panel, although the latter is twice as large. Thisis most likely because the TOPMed panel was generated with high-resolution sequencing(an average of 40× coverage), whereas the HRC panel was generated by combining low-coverage sequencing data across different studies.

Imputation accuracy depends strongly on the number of copies of the allele in the referencepanel. In a study with simulated UK European data, mean imputation accuracy for a fixed minorallele count remained relatively constant as the reference panel size increased from 50,000 to200,000 samples (9). However, this stable relationship between minor allele count and imputationaccuracy may not hold in other contexts. For example, as reference panels grow in size, more alleleswill arise from recurrent mutation. Recurrent mutation introduces the same allele on differenthaplotypic backgrounds. For a recurrent mutation, the number of copies of the allele that sharethe same haplotypic background as the target haplotype is more relevant to imputation accuracythan the minor allele count in the reference panel. Similarly, genotype imputation methods cannotimpute de novo variants. If a de novo variant is also present in the reference panel owing to recurrentmutation, it generally will not be on the same haplotypic background in both the reference paneland the target sample.

IMPUTATION FROM MULTIPLE REFERENCE PANELS

The most common strategy for imputation is to use a single publicly available data set (suchas the 1000G or HRC data set), which works quite well for studies with samples of Europeanancestry. However, studies on non-European populations or relatively homogeneous Europeansubpopulations often benefit more from using custom reference panels of samples with greatergenetic similarity. For instance, in a study on samples in Sardinia, an island in Italy with a geneticallyisolated population, a custom reference of Sardinian samples provided better imputation accuracythan the 1000G and other large European panels (77). This result has also been replicated in otherstudies on homogeneous European subpopulations (21) and non-European populations, such asAfrican Americans (25). Additionally, disease-specific studies often benefit from a hybrid approachthat combines a custom reference panel with an existing public reference panel. This approach isparticularly beneficial for disease studies because it significantly increases the accuracy of imputing

www.annualreviews.org • Genotype Imputation 89

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 18: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

a causal variant (since the hybrid panel is enriched with more copies of rare alleles). One examplehighlighting this feature is a recent study on prostate cancer that was eventually able to imputethe rare HOXB13 G84E variant only after using a hybrid imputation approach that combinedthe 1000G data with an enriched set of cases carrying the mutation (38), even though previousstudies had suggested that it might not be possible to impute this variant (82). The hybrid methodis also useful in studies with homogeneous subpopulations, given that it enriches the panel withgenetically similar samples while retaining the benefits of large public panels.

The hybrid approach currently requires studies to merge multiple reference panels, and thestreamlining of this process is still an active field of research. One proposed approach has been tocombine the reference panels by first treating them as reference panels for each other and thencross-imputing the missing variants (40). Although this approach enables one to use all variantsfound in any panel, the performance of this method has not been fully evaluated. One studyfound this approach to be neither helpful nor harmful for large, population-specific panels (43).At present, reference panels do not generally include monomorphic markers. The merging ofreference panels would be facilitated if reference panels included auxiliary data identifying themarkers that are monomorphic and the observed allele at each monomorphic marker.

Some studies have repeated the imputation process for each reference panel. For example,a study repeatedly imputed against multiple reference panels to reveal a missense variant in theMYH6 gene (c.2161C>T) associated with high risk of sick sinus syndrome (39). The ideal solutionwith multiple reference panels would be to call the variants in all reference panels jointly fromtheir respective sequence alignment files for all the samples. The HRC panel was generated byjointly calling variants from the respective sequence alignment files from 20 contributing studies.However, variant calling is highly computationally intensive and may not be a feasible or practicalsolution for merging large reference panels. In the era of imputation servers, a question still tobe answered is whether future methods will be able to combine imputation results from multipleservers (each using its own private panel) into a better set of posterior genotype probabilities foreach sample than could be derived using a simple panel.

FUTURE DIRECTIONS

Genotype imputation is now a standard part of the human geneticist’s toolbox. When combinedwith a large set of sequenced genomes, genotype imputation extends the resolution of genotypingarrays, allowing many variants that are not directly genotyped (particularly including rare variantsand non-SNP variants) to be studied at scale. We expect that this capability will accelerate studiesof rare variants, shortening the time between when whole-genome and whole-exome sequenc-ing experiments can discover a variant and when scientists can explore its effect by examiningdownstream phenotypic consequences in large numbers of carriers.

As reference panels and human genetic studies continue to grow in scale, we expect con-tinued research in imputation methods. There are several interesting challenges, including thepotential for general-purpose machine learning methods—which are evolving rapidly, with theemergence of deep convolution neural networks and other highly efficient and effective computa-tional techniques—to eventually compete with or displace the current methods based on HMMand positional Burrows–Wheeler transformation, which have been crafted specifically to modelfeatures and properties of human genetic data.

Imputation of human genetic variation uses detailed observation of a reference set of samplesto enable understanding of unobserved genotypes in samples where only a scaffold of markers ismeasured. Imputation experiments are increasingly being attempted in other areas of biology andhuman genetics, and it will be interesting to see whether any of these methods and approaches

90 Das · Abecasis · Browning

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 19: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

become as useful and ubiquitous as genotype imputation is in GWASs. Notable examples includeongoing attempts to impute gene expression levels and other types of genomic variation usingGWAS array data (31, 90, 99) in order to search for associations between expression levels, pro-tein levels, methylation patterns (81), and other genomic states and human diseases and traitscharacterized in large sets of genotyped individuals. Another notable example is the imputationof epigenomic states in detailed assessments of cellular function and biology (28). Here, a largeset of epigenomic assays for expression levels, methylation patterns, transcription factor binding,and other information are carried out in a small set of cell lines; in parallel, a subset of theseexperiments are carried out in additional cell lines, forming a scaffold that can be used to predictresults of missing assays.

DISCUSSION

In the last decade, the cost of sequencing a human genome has dropped from $10,000,000 to$1,000 (35). At the same time, continued improvements in genotype imputation methods haveenabled genotype imputation to remain an attractive low-cost alternative to sequencing. Thedecrease in sequencing cost enhances the advantages of genotype imputation because large-scalehuman sequencing will make it possible to assemble reference panels of hundreds of thousands(and eventually millions) of individuals. These enormous reference panels will make it possible toaccurately impute variants with very low population frequency. In turn, imputation will providean accessible strategy for studying the variants discovered by sequencing in tens of millions ofgenotyped samples.

Imputation methods are now mature and highly efficient, which permits genotype imputationto be offered as a free web-based service (19). Some related areas that require further developmentinclude methods for merging reference panels and methods for merging multiple sets of imputeddata when an individual has been imputed from multiple, possibly overlapping reference panels.In addition, the ability of imputation methods to impute non-single-nucleotide variation andstructural variation will need to be continually reassessed as sequencing technology improvesthe number and genotype accuracy of these variants in reference panels. Current methods aremore than four orders of magnitude faster than the first imputation methods based on the Li andStephens haplotype mosaic model. These methods have evolved through successive iterations andrefinements, and we anticipate that further iterations and refinements will be needed as the scaleof human genetic data sets continues to increase.

Genotype imputation methodology has outstripped advances in reference panels (9), and thereis a clear need for large reference panels that can take full advantage of the capabilities of cur-rent imputation methods. One obstacle to creating large reference panels is the computationalchallenge of jointly calling genotypes in large samples of sequenced individuals. This is an activearea of methodological research, with recent methods leveraging cloud computing to address thecomputational demands (50). A second challenge is the scarcity of sequenced samples that areconsented for general research use. In October 2017, only 1,245 individuals in the Database ofGenotypes and Phenotypes (dbGaP) (64) compilation of individual-level genomic data for generalresearch use (accession phs000688.v1.p1) had whole-genome sequence data. One approach to in-cluding samples with restrictive sample consents is to include restrictions on the use of referencepanels. The sequence data for most of the individuals in the HRC reference panel are availablevia application to the European Genome-Phenome Archive (accession EGAD00001002729) (54).The HRC reference panel may be used for genotype phasing and imputation but not for anyother purpose. Phasing and imputation servers provide an alternative way to perform phasing andimputation from large reference panels that have restrictive sample consents (19).

www.annualreviews.org • Genotype Imputation 91

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 20: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

Beyond simply increasing reference sample size, there is also a need for reference panels tobetter represent human genetic variation (3) through the inclusion of deeply sequenced individualsfrom diverse populations. Large, diverse reference panels will ensure that the benefits of genotypeimputation are available to all.

DISCLOSURE STATEMENT

G.R.A. is a member of the scientific advisory boards for 23andMe and Regeneron Pharmaceuticalsand a consultant for Merck.

ACKNOWLEDGMENTS

Research reported in this publication was supported by the National Human Genome ResearchInstitute of the National Institutes of Health under award number R01HG008359 to B.L.B. Thecontent is solely the responsibility of the authors and does not necessarily represent the officialviews of the National Institutes of Health.

LITERATURE CITED

1. 1000 Genomes Proj. Consort. 2010. A map of human genome variation from population-scale sequencing.Nature 467:1061–73

2. 1000 Genomes Proj. Consort. 2012. An integrated map of genetic variation from 1,092 human genomes.Nature 491:56–65

3. 1000 Genomes Proj. Consort. 2015. A global reference for human genetic variation. Nature 526:68–744. Al Olama AA, Kote-Jarai Z, Berndt SI, Conti DV, Schumacher F, et al. 2014. A meta-analysis of 87,040

individuals identifies 23 new susceptibility loci for prostate cancer. Nat. Genet. 46:1103–95. Alioto TS, Buchhalter I, Derdak S, Hutter B, Eldridge MD, et al. 2015. A comprehensive assessment of

somatic mutation detection in cancer using whole-genome sequencing. Nat. Commun. 6:100016. Anderson CA, Pettersson FH, Barrett JC, Zhuang JJ, Ragoussis J, et al. 2008. Evaluating the effects of

imputation on the power, coverage, and cost efficiency of genome-wide SNP platforms. Am. J. Hum.Genet. 83:112–19

7. Barrett JC, Hansoul S, Nicolae DL, Cho JH, Duerr RH, et al. 2008. Genome-wide association definesmore than 30 distinct susceptibility loci for Crohn’s disease. Nat. Genet. 40:955–62

8. Browning BL, Browning SR. 2009. A unified approach to genotype imputation and haplotype-phaseinference for large data sets of trios and unrelated individuals. Am. J. Hum. Genet. 84:210–23

9. Browning BL, Browning SR. 2016. Genotype imputation with millions of reference samples. Am. J. Hum.Genet. 98:116–26

10. Browning BL, Yu Z. 2009. Simultaneous genotype calling and haplotype phasing improves genotypeaccuracy and reduces false-positive associations for genome-wide association studies. Am. J. Hum. Genet.85:847–61

11. Browning SR. 2008. Missing data imputation and haplotype phase inference for genome-wide associationstudies. Hum. Genet. 124:439–50

12. Browning SR, Browning BL. 2007. Rapid and accurate haplotype phasing and missing-data inference forwhole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81:1084–97

13. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, et al. 2017. Genome-wide genetic data on ∼500,000UK Biobank participants. bioRxiv 166298. https://doi.org/10.1101/166298

14. Chang D, Nalls MA, Hallgrimsdottir IB, Hunkapiller J, van der Brug M, et al. 2017. A meta-analysis ofgenome-wide association studies identifies 17 new Parkinson’s disease risk loci. Nat. Genet. 49:1511–16

15. Coon KD, Myers AJ, Craig DW, Webster JA, Pearson JV, et al. 2007. A high-density whole-genomeassociation study reveals that APOE is the major susceptibility gene for sporadic late-onset Alzheimer’sdisease. J. Clin. Psychiatry 68:613–18

92 Das · Abecasis · Browning

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 21: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

16. Cooper JD, Smyth DJ, Smiles AM, Plagnol V, Walker NM, et al. 2008. Meta-analysis of genome-wideassociation study data identifies additional type 1 diabetes risk loci. Nat. Genet. 40:1399–401

17. Danecek P, Auton A, Abecasis GR, Albers CA, Banks E, et al. 2011. The variant call format and VCFtools.Bioinformatics 27:2156–58

18. Das S. 2017. Next generation of genotype imputation methods. PhD Thesis, Dep. Biostat., Univ. Mich., AnnArbor

19. Das S, Forer L, Schonherr S, Sidore C, Locke AE, et al. 2016. Next-generation genotype imputationservice and methods. Nat. Genet. 48:1284–87

20. Davies RW, Flint J, Myers S, Mott R. 2016. Rapid genotype imputation from sequence without referencepanels. Nat. Genet. 48:965–69

21. Deelen P, Menelaou A, van Leeuwen EM, Kanterakis A, van Dijk F, et al. 2014. Improved imputationquality of low-frequency and rare variants in European samples using the ‘Genome of The Netherlands.’Eur. J. Hum. Genet. 22:1321–26

22. Deutsch LP. 1996. GZIP file format specification version 4.3. RFC 1952, Netw. Work. Group, Internet Eng.Task Force, Fremont, CA. https://tools.ietf.org/search/rfc1952

23. Diabetes Genet. Replication Meta-Anal. (DIAGRAM) Consort., Asian Genet. Epidemiol. Netw. Type 2Diabetes (AGEN-T2D) Consort., South Asian Type 2 Diabetes (SAT2D) Consort., Mex. Am. Type 2Diabetes (MAT2D) Consort., Type 2 Diabetes Genet. Explor. Next-Gener. Seq. Multi-Ethnic Samples(T2D-GENES) Consort., et al. 2014. Genome-wide trans-ancestry meta-analysis provides insight intothe genetic architecture of type 2 diabetes susceptibility. Nat. Genet. 46:234–44

24. Dilthey A, Leslie S, Moutsianas L, Shen J, Cox C, et al. 2013. Multi-population classical HLA typeimputation. PLOS Comput. Biol. 9:e1002877

25. Duan Q, Liu EY, Auer PL, Zhang G, Lange EM, et al. 2013. Imputation of coding variants in AfricanAmericans: better performance using data from the exome sequencing project. Bioinformatics 29:2744–49

26. Dudbridge F. 2008. Likelihood-based association analysis for nuclear families and unrelated subjects withmissing genotype data. Hum. Hered. 66:87–98

27. Durbin R. 2014. Efficient haplotype matching and storage using the positional Burrows-Wheeler trans-form (PBWT). Bioinformatics 30:1266–72

28. Ernst J, Kellis M. 2015. Large-scale imputation of epigenomic datasets for systematic annotation of diversehuman tissues. Nat. Biotechnol. 33:364–76

29. Fisher RA. 1918. The correlation between relatives on the supposition of Mendelian inheritance. Philos.Trans. R. Soc. Edinb. 52:399–433

30. Fuchsberger C, Abecasis GR, Hinds DA. 2015. minimac2: faster genotype imputation. Bioinformatics31:782–84

31. Gamazon ER, Wheeler HE, Shah KP, Mozaffari SV, Aquino-Michaels K, et al. 2015. A gene-basedassociation method for mapping traits using reference transcriptome data. Nat. Genet. 47:1091–98

32. Gibson G. 2012. Rare and common variants: twenty arguments. Nat. Rev. Genet. 13:135–4533. Glob. Lipids Genet. Consort. 2013. Discovery and refinement of loci associated with lipid levels. Nat.

Genet. 45:1274–8334. Goodwin S, McPherson JD, McCombie WR. 2016. Coming of age: ten years of next-generation sequenc-

ing technologies. Nat. Rev. Genet. 17:333–5135. Green ED, Rubin EM, Olson MV. 2017. The future of DNA sequencing. Nature 550:179–8136. Handsaker RE, Van Doren V, Berman JR, Genovese G, Kashin S, et al. 2015. Large multiallelic copy

number variations in humans. Nat. Genet. 47:296–30337. Hao K, Chudin E, McElwee J, Schadt EE. 2009. Accuracy of genome-wide imputation of untyped markers

and impacts on statistical power for association studies. BMC Genet. 10:2738. Hoffmann TJ, Sakoda LC, Shen L, Jorgenson E, Habel LA, et al. 2015. Imputation of the rare HOXB13

G84E mutation and cancer risk in a large population-based cohort. PLOS Genet. 11:e100493039. Holm H, Gudbjartsson DF, Sulem P, Masson G, Helgadottir HT, et al. 2011. A rare variant in MYH6 is

associated with high risk of sick sinus syndrome. Nat. Genet. 43:316–2040. Howie BN, Donnelly P, Marchini J. 2009. A flexible and accurate genotype imputation method for the

next generation of genome-wide association studies. PLOS Genet. 5:e1000529

www.annualreviews.org • Genotype Imputation 93

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 22: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

41. Howie BN, Fuchsberger C, Stephens M, Marchini J, Abecasis GR. 2012. Fast and accurate genotypeimputation in genome-wide association studies through pre-phasing. Nat. Genet. 44:955–59

42. Howie BN, Marchini J, Stephens M. 2011. Genotype imputation with thousands of genomes. G3 1:457–7043. Huang J, Howie B, McCarthy S, Memari Y, Walter K, et al. 2015. Improved imputation of low-frequency

and rare variants using the UK10K haplotype reference panel. Nat. Commun. 6:457–7044. Int. HapMap Consort. 2003. The International HapMap Consortium. Nature 426:789–9645. Int. HapMap Consort. 2005. A haplotype map of the human genome. Nature 437:1299–32046. Int. HapMap Consort. 2007. A second generation human haplotype map of over 3.1 million SNPs. Nature

449:851–6147. Int. HapMap 3 Consort. 2010. Integrating common and rare genetic variation in diverse human popula-

tions. Nature 467:52–5848. Int. Hum. Genome Seq. Consort. 2004. Finishing the euchromatic sequence of the human genome.

Nature 431:931–4549. Jia X, Han B, Onengut-Gumuscu S, Chen WM, Concannon PJ, et al. 2013. Imputing amino acid poly-

morphisms in human leukocyte antigens. PLOS ONE 8:e6468350. Jun G, Wing MK, Abecasis GR, Kang HM. 2015. An efficient and scalable analysis framework for variant

extraction and refinement from population-scale DNA sequence data. Genome Res. 25:918–2551. Klein AP, Wolpin BM, Risch HA, Stolzenberg-Solomon RZ, Mocci E, et al. 2018. Genome-wide meta-

analysis identifies five new susceptibility loci for pancreatic cancer. Nat. Commun. 9:55652. Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, et al. 2005. Complement factor H polymorphism in

age-related macular degeneration. Science 308:385–8953. Lango Allen H, Estrada K, Lettre G, Berndt SI, Weedon MN, et al. 2010. Hundreds of variants clustered

in genomic loci and biological pathways affect human height. Nature 467:832–3854. Lappalainen I, Almeida-King J, Kumanduri V, Senf A, Spalding JD, et al. 2015. The European Genome-

phenome Archive of human data consented for biomedical research. Nat. Genet. 47:692–9555. Leslie S, Donnelly P, McVean G. 2008. A statistical method for predicting classical HLA alleles from

SNP data. Am. J. Hum. Genet. 82:48–5656. Li N, Stephens M. 2003. Modeling linkage disequilibrium and identifying recombination hotspots using

single-nucleotide polymorphism data. Genetics 165:2213–3357. Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR. 2010. MaCH: using sequence and genotype data to

estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 34:816–3458. Li Y, Willer CJ, Sanna S, Abecasis GR. 2009. Genotype imputation. Annu. Rev. Genom. Hum. Genet.

10:387–40659. Lin DY, Hu Y, Huang BE. 2008. Simple and efficient analysis of disease association with missing genotype

data. Am. J. Hum. Genet. 82:444–5260. Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, et al. 2015. Genetic studies of body mass index yield

new insights for obesity biology. Nature 518:197–20661. Loh P-R, Danecek P, Palamara PF, Fuchsberger C, Reshef YA, et al. 2016. Reference-based phasing

using the Haplotype Reference Consortium panel. Nat. Genet. 48:1443–4862. Loh P-R, Palamara PF, Price AL. 2016. Fast and accurate long-range phasing in a UK Biobank cohort.

Nat. Genet. 48:811–1663. Mahajan A, Taliun D, Thurner M, Robertson NR, Torres JM, Rayner NW. 2018. Fine-mapping of

an expanded set of type 2 diabetes loci to single-variant resolution using high-density imputation andislet-specific epigenome maps. bioRxiv 245506. https://doi.org/10.1101/245506

64. Mailman MD, Feolo M, Jin Y, Kimura M, Tryka K, et al. 2007. The NCBI dbGaP database of genotypesand phenotypes. Nat. Genet. 39:1181–86

65. Marchini J, Cutler D, Patterson N, Stephens M, Eskin E, et al. 2006. A comparison of phasing algorithmsfor trios and unrelated individuals. Am. J. Hum. Genet. 78:437–50

66. Marchini J, Howie B. 2010. Genotype imputation for genome-wide association studies. Nat. Rev. Genet.11:499–511

67. Marchini J, Howie B, Myers S, McVean G, Donnelly P. 2007. A new multipoint method for genome-wideassociation studies by imputation of genotypes. Nat. Genet. 39:906–13

94 Das · Abecasis · Browning

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 23: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

68. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, et al. 2008. Genome-wide associationstudies for complex traits: consensus, uncertainty and challenges. Nat. Rev. Genet. 9:356–69

69. McCarthy S, Das S, Kretzschmar W, Delaneau O, Wood AR, et al. 2016. A reference panel of 64,976haplotypes for genotype imputation. Nat. Genet. 48:1279–83

70. Nair RP, Duffin KC, Helms C, Ding J, Stuart PE, et al. 2009. Genome-wide scan reveals association ofpsoriasis with IL-23 and NF-κB pathways. Nat. Genet. 41:199–204

71. Natl. Heart Lung Blood Inst. 2018. Trans-Omics for Precision Medicine (TOPMed) program. https://www.nhlbi.nih.gov/science/trans-omics-precision-medicine-topmed-program

72. Nicolae DL. 2006. Testing untyped alleles (TUNA)—applications to genome-wide association studies.Genet. Epidemiol. 30:718–27

73. Nikpay M, Goel A, Won HH, Hall LM, Willenborg C, et al. 2015. A comprehensive 1,000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nat. Genet. 47:1121–30

74. O’Connell J, Sharp K, Shrine N, Wain L, Hall I, et al. 2016. Haplotype estimation for biobank-scale datasets. Nat. Genet. 48:817–20

75. Okada Y, Wu D, Trynka G, Raj T, Terao C, et al. 2014. Genetics of rheumatoid arthritis contributes tobiology and drug discovery. Nature 506:376–81

76. Orho-Melander M, Melander O, Guiducci C, Perez-Martinez P, Corella D, et al. 2008. Common missensevariant in the glucokinase regulatory protein gene is associated with increased plasma triglyceride and C-reactive protein but lower fasting glucose concentrations. Diabetes 57:3112–21

77. Pistis G, Porcu E, Vrieze SI, Sidore C, Steri M, et al. 2015. Rare variant genotype imputation withthousands of study-specific whole-genome sequences: implications for cost-effective study designs. Eur.J. Hum. Genet. 23:975–83

78. Pritchard JK, Przeworski M. 2001. Linkage disequilibrium in humans: models and data. Am. J. Hum.Genet. 69:1–14

79. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, et al. 2007. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81:559–75

80. Rabiner LR. 1989. A tutorial on hidden Markov models and selected applications in speech recognition.Proc. IEEE 77:257–86

81. Rawlik K, Rowlatt A, Tenesa A. 2016. Imputation of DNA methylation levels in the brain implicates arisk factor for Parkinson’s disease. Genetics 204:771–81

82. Saunders EJ, Dadaev T, Leongamornlert DA, Jugurnauth-Little S, Tymrakiewicz M, et al. 2014. Fine-mapping the HOXB region detects common variants tagging a rare coding allele: evidence for syntheticassociation in prostate cancer. PLOS Genet. 10:e1004129

83. Scheet P, Stephens M. 2006. A fast and flexible statistical model for large-scale population genotype data:applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 78:629–44

84. Schully SD, Yu W, McCallum V, Benedicto CB, Dong LM, et al. 2011. Cancer GAMAdb: database ofcancer genetic associations from meta-analyses and genome-wide association studies. Eur. J. Hum. Genet.19:928–30

85. Scott LJ, Mohlke KL, Bonnycastle LL, Willer CJ, Li Y, et al. 2007. A genome-wide association study oftype 2 diabetes in Finns detects multiple susceptibility variants. Science 316:1341–45

86. Speliotes EK, Willer CJ, Berndt SI, Monda KL, Thorleifsson G, et al. 2010. Association analyses of249,796 individuals reveal 18 new loci associated with body mass index. Nat. Genet. 42:937–48

87. Spencer CC, Su Z, Donnelly P, Marchini J. 2009. Designing genome-wide association studies: samplesize, power, imputation, and the choice of genotyping chip. PLOS Genet. 5:e1000477

88. Taliun D, Chothani SP, Schonherr S, Forer L, Boehnke M, et al. 2017. LASER server: ancestry tracingwith genotypes or sequence reads. Bioinformatics 33:2056–58

89. UK10K Consort. 2015. The UK10K project identifies rare variants in health and disease. Nature 526:82–90

90. Wang J, Gamazon ER, Pierce BL, Stranger BE, Im HK, et al. 2016. Imputing gene expression in uncol-lected tissues within and beyond GTEx. Am. J. Hum. Genet. 98:697–708

91. Wang Y, Lu J, Yu J, Gibbs RA, Yu F. 2013. An integrative variant analysis pipeline for accurate genotype/haplotype inference in population NGS data. Genome Res. 23:833–42

www.annualreviews.org • Genotype Imputation 95

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 24: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19CH04_Browning ARI 25 July 2018 17:44

92. Wellcome Trust Case Control Consort. 2007. Genome-wide association study of 14,000 cases of sevencommon diseases and 3,000 shared controls. Nature 447:661–78

93. Willer CJ, Sanna S, Jackson AU, Scuteri A, Bonnycastle LL, et al. 2008. Newly identified loci that influencelipid concentrations and risk of coronary artery disease. Nat. Genet. 40:161–69

94. Willer CJ, Speliotes EK, Loos RJ, Li S, Lindgren CM, et al. 2009. Six new loci associated with body massindex highlight a neuronal influence on body weight regulation. Nat. Genet. 41:25–34

95. Wood AR, Esko T, Yang J, Vedantam S, Pers TH, et al. 2014. Defining the role of common variation inthe genomic and biological architecture of adult human height. Nat. Genet. 46:1173–86

96. Yu Z, Schaid DJ. 2007. Methods to impute missing genotypes for population data. Hum. Genet. 122:495–504

97. Zhang J, Jiang K, Lv L, Wang H, Shen Z, et al. 2015. Use of genome-wide association studies for cancerresearch and drug repositioning. PLOS ONE 10:e0116477

98. Zheng X, Shen J, Cox C, Wakefield JC, Ehm MG, et al. 2014. HIBAG—HLA genotype imputation withattribute bagging. Pharmacogenom. J. 14:192–200

99. Zhou W, Han L, Altman RB. 2016. Imputing gene expression to maximize platform compatibility.Bioinformatics 33:522–28

96 Das · Abecasis · Browning

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 25: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19-TOC ARI 11 June 2018 13:10

Annual Review ofGenomics andHuman Genetics

Volume 19, 2018

Contents

From a Single Child to Uniform Newborn Screening:My Lucky Life in Pediatric Medical GeneticsR. Rodney Howell � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 1

Single-Cell (Multi)omics TechnologiesLia Chappell, Andrew J.C. Russell, and Thierry Voet � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �15

Editing the Epigenome: Reshaping the Genomic LandscapeLiad Holtzman and Charles A. Gersbach � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �43

Genotype Imputation from Large Reference PanelsSayantan Das, Goncalo R. Abecasis, and Brian L. Browning � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �73

Rare-Variant Studies to Complement Genome-WideAssociation StudiesA. Sazonovs and J.C. Barrett � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �97

Sickle Cell Anemia and Its PhenotypesThomas N. Williams and Swee Lay Thein � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 113

Common and Founder Mutations for Monogenic Traits inSub-Saharan African PopulationsAmanda Krause, Heather Seymour, and Michele Ramsay � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 149

The Genetics of Primary MicrocephalyDivya Jayaraman, Byoung-Il Bae, and Christopher A. Walsh � � � � � � � � � � � � � � � � � � � � � � � � � � � 177

Cystic Fibrosis Disease Modifiers: Complex Genetics Defines thePhenotypic Diversity in a Monogenic DiseaseWanda K. O’Neal and Michael R. Knowles � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 201

The Genetics and Genomics of AsthmaSaffron A.G. Willis-Owen, William O.C. Cookson, and Miriam F. Moffatt � � � � � � � � � � � 223

Does Malnutrition Have a Genetic Component?Priya Duggal and William A. Petri Jr. � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 247

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.

Page 26: Genotype Imputation from Large Reference PanelsGenotype Imputation from Large Reference Panels Sayantan Das,1 Gonc¸alo R. Abecasis,1 and Brian L. Browning2 1Center for Statistical

GG19-TOC ARI 11 June 2018 13:10

Small-Molecule Screening for Genetic DiseasesSarine Markossian, Kenny K. Ang, Christopher G. Wilson, and Michelle R. Arkin � � � � 263

Using Full Genomic Information to Predict Disease: Breaking Downthe Barriers Between Complex and Mendelian DiseasesDaniel M. Jordan and Ron Do � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 289

Inferring Causal Relationships Between Risk Factors and Outcomesfrom Genome-Wide Association Study DataStephen Burgess, Christopher N. Foley, and Verena Zuber � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 303

Drug-Induced Stevens–Johnson Syndrome and Toxic EpidermalNecrolysis Call for Optimum Patient Stratification and Theranosticsvia PharmacogenomicsChonlaphat Sukasem, Theodora Katsila, Therdpong Tempark,

George P. Patrinos, and Wasun Chantratita � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 329

Population Screening for HemoglobinopathiesH.W. Goonasekera, C.S. Paththinige, and V.H.W. Dissanayake � � � � � � � � � � � � � � � � � � � � � � � � 355

Ancient Genomics of Modern Humans: The First DecadePontus Skoglund and Iain Mathieson � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 381

Tales of Human Migration, Admixture, and Selection in AfricaCarina M. Schlebusch and Mattias Jakobsson � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 405

The Genomic CommonsJorge L. Contreras and Bartha M. Knoppers � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 429

Errata

An online log of corrections to Annual Review of Genomics and Human Genetics articlesmay be found at http://www.annualreviews.org/errata/genom

ii

Ann

u. R

ev. G

enom

. Hum

. Gen

et. 2

018.

19:7

3-96

. Dow

nloa

ded

from

ww

w.a

nnua

lrev

iew

s.or

g A

cces

s pr

ovid

ed b

y U

nive

rsity

of

Was

hing

ton

on 0

8/23

/19.

For

per

sona

l use

onl

y.