Introduction to QTL mapping in model organismskbroman/talks/Louisville/qtllect_ho.pdf · Interval...

Introduction to QTL mapping

in model organisms

Karl W Broman

Department of BiostatisticsJohns Hopkins University

kbroman@jhsph.edu

www.biostat.jhsph.edu/˜kbroman

Outline

• Experiments and data

•Models

• ANOVA at marker loci

• Interval mapping

• Haley-Knott regression

• LOD thresholds

•Multiple QTL mapping

• CIs for QTL location

• Selection bias

• The X chromosome

• Selective genotyping

• Covariates

• Non-normal traits

• The need for good data

Backcross experiment

BCBC BC BC

Intercross experiment

F2F2 F2 F2

Phenotype distributions

• Within each of the parentaland F1 strains, individuals aregenetically identical.

• Environmental variation mayor may not be constant withgenotype.

• The backcross generation ex-hibits genetic as well as envi-ronmental variation.

Parental strains

0 20 40 60 80 100

F1 generation

0 20 40 60 80 100

Phenotype

Backcross generation

Phenotypes: yi = trait value for individual i

Genotypes: xij = 0/1 if mouse i is BB/AB at marker j

(or 0/1/2, in an intercross)

Genetic map: Locations of markers

Chromosome

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X

Genetic map

20 40 60 80 100 120

Markers

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X

Genotype data

•Detect QTLs (and interactions between QTLs)

•Confidence intervals for QTL location

•Estimate QTL effects (effects of allelic substitution)

Statistical structure

Markers Phenotype

Covariates

The missing data problem:

Markers ←→ QTL

The model selection problem:

QTL, covariates −→ phenotype

Models: Recombination

We assume no crossover interference.

=⇒ Points of exchange (crossovers) are accordingto a Poisson process.

=⇒ The {xij} (marker genotypes) form a Markovchain

Example

A B − B A A

B A A − B A

A A A A A B

Models: Genotype ←→ Phenotype

Let y = phenotypeg = whole genome genotype

Imagine a small number of QTLs with genotypes g1, . . . , gp.(2p distinct genotypes)

E(y|g) = µg1,...,gpvar(y|g) = σ2

g1,...,gp

Models: Genotype ←→ Phenotype

Homoscedasticity (constant variance): σ2g ≡ σ2

Normally distributed residual variation: y|g ∼ N (µg, σ2).

Additivity: µg1,...,gp= µ +

∑pj=1 ∆j gj (gj = 1 or 0)

Epistasis: Any deviations from additivity.

The simplest method: ANOVA

• Also known as markerregression.

• Split mice into groupsaccording to genotypeat a marker.

• Do a t-test / ANOVA.

• Repeat for each marker.40

Genotype at D1M30

Genotype at D2M99

ANOVA at marker loci

Advantages• Simple.

• Easily incorporatescovariates.

• Easily extended to morecomplex models.

• Doesn’t require a geneticmap.

Disadvantages•Must exclude individuals

with missing genotype data.

• Imperfect information aboutQTL location.

• Suffers in low density scans.

• Only considers one QTL at atime.

Interval mapping (IM)

Lander & Botstein (1989)

• Take account of missing genotype data

• Interpolate between markers

•Maximum likelihood under a mixture model

0 20 40 60

Map position (cM)

Interval mapping (IM)

Lander & Botstein (1989)

• Assume a single QTL model.

• Each position in the genome, one at a time, is posited as theputative QTL.

• Let z = 1/0 if the (unobserved) QTL genotype is BB/AB.Assume y ∼ N (µz, σ)

• Given genotypes at linked markers, y ∼mixture of normal dist’nswith mixing proportion Pr(z = 1|marker data):

QTL genotypeM1 M2 BB ABBB BB (1− rL)(1− rR)/(1− r) rLrR/(1− r)BB AB (1− rL)rR/r rL(1− rR)/rAB BB rL(1− rR)/r (1− rL)rR/rAB AB rLrR/(1− r) (1− rL)(1− rR)/(1− r)

The normal mixtures

M1 M2Q

7 cM 13 cM

• Two markers separated by 20 cM,with the QTL closer to the leftmarker.

• The figure at right show the dis-tributions of the phenotype condi-tional on the genotypes at the twomarkers.

• The dashed curves correspond tothe components of the mixtures.

20 40 60 80 100

Phenotype

Interval mapping (continued)

Let pi = Pr(zi = 1|marker data)

yi|zi ∼ N (µzi, σ2)

Pr(yi|marker data, µ0, µ1, σ) = pif (yi; µ1, σ) + (1− pi)f (yi; µ0, σ)

where f (y; µ, σ) = exp[−(y − µ)2/(2σ2)]/√

2πσ2

Log likelihood: l(µ0, µ1, σ) =∑

i log Pr(yi|marker data, µ0, µ1, σ)

Maximum likelihood estimates (MLEs) of µ0, µ1, σ:

values for which l(µ0, µ1, σ) is maximized.

EM algorithm

Dempster et al. (1977)

E step:

Let w(k+1) = Pr(zi = 1|yi, marker data, µ(k)0 , µ

(k)1 , σ(k))

= pif(yi;µ(k)1 ,σ(k))

pif(yi;µ(k)1 ,σ(k))+(1−pi)f(yi;µ

(k)0 ,σ(k))

M step:

Let µ(k+1)1 =

i yiw(k+1)i /

i w(k+1)i

µ(k+1)0 =

i yi(1− w(k+1)i )/

i(1− w(k+1)i )

σ(k+1) = [not worth writing down]

The algorithm:

Start with w(1)i = pi; iterate the E & M steps until convergence.

Example

Iteration µ0 µ1 σ log likelihood

1 5.903 6.492 1.668 –770.752

2 5.835 6.562 1.654 –770.291

3 5.818 6.579 1.651 –770.264

4 5.815 6.583 1.650 –770.262

... ... ... ... ...

∞ 5.813 6.584 1.649 –770.262

LOD scores

The LOD score is a measure of the strength of evidence for thepresence of a QTL at a particular location.

LOD(γ) = log10 likelihood ratio comparing the hypothesis of aQTL at position γ versus that of no QTL

= log10

Pr(y|QTL at γ,µ0γ,µ1γ,σγ)Pr(y|no QTL,µ,σ)

µ0γ, µ1γ, σγ are the MLEs, assuming a single QTL at position γ.

No QTL model: The phenotypes are independent and identicallydistributed (iid) N (µ, σ2).

An example LOD curve

0 20 40 60 80 100

Chromosome position (cM)

0 500 1000 1500 2000 2500

Map position (cM)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19LOD curves

Interval mapping

Advantages• Takes proper account of

missing data.

• Allows examination ofpositions between markers.

• Gives improved estimatesof QTL effects.

• Provides pretty graphs.

Disadvantages• Increased computation

• Requires specializedsoftware.

• Difficult to generalize.

• Only considers one QTL ata time.

Haley-Knott regression

A quick approximation to Interval Mapping.

E(y | QTL = q) = µ + β 1{q = AB}

E(y | marker data) = µ + β Pr(QTL = AB | marker data)

• Regress y on Pr(QTL = AB | marker data).

• Pretend that the residual variation is normally distributed.

• Calculate

LOD(γ) = (n/2) log10

RSSa(γ)

Example

5 10 15 20

Markers

Missing genotypes

0 20 40 60

Map position (cM)

IMH−K

5 10 15 20

Markers

Missing genotypes

0 20 40 60

Map position (cM)

IMH−K

LOD thresholds

Large LOD scores indicate evidence for the presence of a QTL.

Q: How large is large?

→ We consider the distribution of the LOD score under the nullhypothesis of no QTL.

Key point: We must make some adjustment for our examination ofmultiple putative QTL locations.

→We seek the distribution of the maximum LOD score, genome-wide. The 95th %ile of this distribution serves as a genome-wideLOD threshold.

Estimating the threshold: simulations, analytical calculations, per-mutation (randomization) tests.

Null distribution of the LOD score

• Null distribution derived bycomputer simulation of backcrosswith genome of typical size.

• Solid curve: distribution of LODscore at any one point.

• Dashed curve: distribution ofmaximum LOD score,genome-wide.

0 1 2 3 4

LOD score

Permutation tests

markers

genotypedata

phenotypes

LOD(z)

(a set of curves)-

maxz LOD(z)

• Permute/shuffle the phenotypes; keep the genotype data intact.

• Calculate LOD?(z) −→ M ? = maxz LOD?(z)

• We wish to compare the observed M to the distribution of M ?.

• Pr(M ? ≥M) is a genome-wide P-value.

• The 95th %ile of M ? is a genome-wide LOD threshold.

• We can’t look at all n! possible permutations, but a random set of 1000 is feasi-ble and provides reasonable estimates of P-values and thresholds.

• Value: conditions on observed phenotypes, marker density, and pattern of miss-ing data; doesn’t rely on normality assumptions or asymptotics.

Permutation distribution

maximum LOD score

0 1 2 3 4 5 6 7

95th percentile

Multiple QTL methods

Why consider multiple QTLs at once?

• Reduce residual variation.

• Separate linked QTLs.

• Investigate interactions between QTLs (epistasis).

Epistasis in a backcross

Additive QTLs

Interacting QTLs

A HQTL 1

Epistasis in an intercross

Additive QTLs

Interacting QTLs

A H BQTL 1

BQTL 2

A H BQTL 1

BQTL 2

Two-dimensional genome scan

Consider each pair of positions, (γ1, γ2)

Models

• Full

• Additive

• QTL 1

• QTL 2

• Null

Possible comparisons

• Full vs. null

• Full vs. additive

• Full vs. Best of QTL 1 & 2

• Add’ve vs. Best of QTL 1 & 2

Example

Chromosome

Model selection

•Select class of models– Additive models

– Add’ve plus pairwise interactions

– Regression trees

•Compare models

– BICδ(γ) = log RSS(γ) + |γ| (δ log n

– Sequential permutation tests

•Search model space– Forward selection (FS)

– Backward elimination (BE)

– FS followed by BE

– MCMC

•Assess performance– Maximize no. QTLs found;

control false positive rate

What’s different here?

• Association among the covariates.

•Missing covariate information.

• Find the model rather than minimize prediction error.

1.5-LOD support interval

0 5 10 15 20 25 30 35

Map position (cM)

1.5−LOD support interval

Selection bias

• The estimated effect of a QTL willvary somewhat from its true effect.

• Only when the estimated effect islarge will the QTL be detected.

• Among those experiments in whichthe QTL is detected, the estimatedQTL effect will be, on average,larger than its true effect.

• This is selection bias.

• Selection bias is largest in QTLswith small or moderate effects.

• The true effects of QTLs that weidentify are likely smaller than wasobserved.

Estimated QTL effect

0 5 10 15

QTL effect = 5Bias = 79%

0 5 10 15

Implications of selection bias

•Estimated % variance explained by identified QTLs

•Repeating an experiment

•Congenics

•Marker-assisted selection

The X chromosome

In a backcross, the X chromosome may or may not besegregating.

(A × B) × A

Females: XA·B XA

Males: XA·B YA

A × (A × B)

Females: XA XA

Males: XA YB

The X chromosome

In an intercross, one must pay attention to the paternalgrandmother’s genotype.

(A × B) × (A × B) or (B × A) × (A × B)

Females: XA·B XA

Males: XA·B YB

(A × B) × (B × A) or (B × A) × (B × A)

Females: XA·B XB

Males: XA·B YA

Selective genotyping

• Save effort by only typing themost informative individuals(say, top & bottom 10%).

• Useful in context of a single,inexpensive trait.

• Tricky to estimate the effects ofQTLs: use IM with allphenotypes.

• Can’t get at interactions.

• Likely better to also genotypesome random portion of therest of the individuals.

All genotypes

Top and botton 10%

Covariates

• Examples: treatment, sex,litter, lab, age.

• Control residual variation.

• Avoid confounding.

• Look for QTL × environ’tinteractions

• Adjust before intervalmapping (IM) versus adjustwithin IM.

0 10 20 30 40 50

Map position (cM)

Chromosome 6

malefemale

AA AB BB AA AB BB

Males Females

D6Mit373

Non-normal traits

• Standard interval mapping assumes normally distributedresidual variation. (Thus the phenotype distribution is amixture of normals.)

• In reality: we see dichotomous traits, counts, skeweddistributions, outliers, and all sorts of odd things.

• Interval mapping, with LOD thresholds derived frompermutation tests, generally performs just fine anyway.

• Alternatives to consider:– Nonparametric approaches (Kruglyak & Lander 1995)

– Transformations (e.g., log, square root)

– Specially-tailored models (e.g., a generalized linear model, the Coxproportional hazard model, and the model in Broman et al. 2000)

Check data integrity

The success of QTL mapping depends crucially on the integrity ofthe data.

• Segregation distortion

• Genetic maps / marker positions

• Genotyping errors (tight double crossovers)

• Phenotype distribution / outliers

• Residual analysis

Summary I

• ANOVA at marker loci (aka marker regression) is simple and easily extendedto include covariates or accommodate complex models.

• Interval mapping improves on ANOVA by allowing inference of QTLs topositions between markers and taking proper account of missing genotypedata.

• ANOVA and IM consider only single-QTL models. Multiple QTL methods allowthe better separation of linked QTLs and are necessary for the investigation ofepistasis.

• Statistical significance of LOD peaks requires consideration of the maximumLOD score, genome-wide, under the null hypothesis of no QTLs. Permutationtests are extremely useful for this.

• 1.5-LOD support intervals indicate the plausible location of a QTL.

• Estimates of QTL effects are subject to selection bias. Such estimated effectsare often too large.

Summary II

• The X chromosome must be dealt with specially, and can be tricky.

• Study your data. Look for errors in the genetic map, genotyping errors andphenotype outliers. But don’t worry about them too much.

• Selective genotyping can save you time and money, but proceed with caution.

• Study your data. The consideration of covariates may reveal extremelyinteresting phenomena.

• Interval mapping works reasonably well even with non-normal traits. Butconsider transformations or specially-tailored models. If interval mappingsoftware is not available for your preferred model, start with some version ofANOVA.

References

• Broman KW (2001) Review of statistical methods for QTL mapping in experimentalcrosses. Lab Animal 30(7):44–52

A review for non-statisticians.

• Doerge RW (2002) Mapping and analysis of quantitative trait loci in experimental popu-lations. Nat Rev Genet 3:43–52

A very recent review.

• Doerge RW, Zeng Z-B, Weir BS (1997) Statistical issues in the search for genes affectingquantitative traits in experimental populations. Statistical Science 12:195–219

Review paper.

• Jansen RC (2001) Quantitative trait loci in inbred lines. In Balding DJ et al., Handbookof statistical genetics, John Wiley & Sons, New York, chapter 21

Review in an expensive but rather comprehensive and likely useful book.

• Lynch M, Walsh B (1998) Genetics and analysis of quantitative traits. Sinauer Asso-ciates, Sunderland, MA, chapter 15

Chapter on QTL mapping.

• Lander ES, Botstein D (1989) Mapping Mendelian factors underlying quantitative traitsusing RFLP linkage maps. Genetics 121:185–199

The seminal paper.

• Churchill GA, Doerge RW (1994) Empirical threshold values for quantitative trait map-ping. Genetics 138:963–971

LOD thresholds by permutation tests.

• Kruglyak L, Lander ES (1995) A nonparametric approach for mapping quantitative traitloci. Genetics 139:1421–1428

Non-parameteric interval mapping.

• Broman KW (2003) Mapping quantitative trait loci in the case of a spike in the phenotypedistribution. Genetics 163:1169–1175

QTL mapping with a special model for a non-normal phenotype.

• Miller AJ (2002) Subset selection in regression, 2nd edition. Chapman & Hall, New York.

A good book on model selection in regression.

• Strickberger MW (1985) Genetics, 3rd edition. Macmillan, New York, chapter 11.

An old but excellent general genetics textbook with a very interesting discussion of epis-tasis.

R/qtl: An extensible QTL mapping environment

Karl W. BromanDepartment of Biostatistics, Johns Hopkins University

Hao Wu, Gary A. ChurchillThe Jackson Laboratory

Saunak SenUniversity of California, San Francisco

http://www.biostat.jhsph.edu/˜kbroman/qtl

Why R/qtl?

• Iteractive QTL mapping environment.

• Allow user to focus on modeling rather than computing.

• Embedded within general data analysis environment, R.

• Access to a variety of QTL mapping approaches, includingsophisticated multiple QTL methods (soon, anyway).

• Includes functions for estimating genetic maps, identifyinggenotyping errors, and visualizing data.

• Easy extensibility for use with specialized crosses orspecially-tailored models.

• Available for Unix, Windows, and MacOS.

About R

• Open-source implementation of the S language. (LikeS-PLUS, and sort of like Matlab, but free.)

• Language and environment for statistical computing andgraphics.

• Provides a wide variety of statistical and graphicaltechniques (including linear and nonlinear modelling,statistical tests, time series analysis, classification,clustering).

• Available for UNIX, Windows and MacOS.

Functionality

Currently• Analysis of intercross, backcross

and 4-way cross.

• One- and two-dimensional scansby interval mapping, imputationand Haley-Knott regression, withcovariates.

• Permutation tests.

• Re-estimation of linkage map.

• “Ripple” marker order.

• Calculation of Lincoln & Landererror LOD scores.

• Visualization of genotype data.

Soon• AILs, RIs, and more complex

types of crosses.

• Analysis of multiple QTL models(by MIM or imputation).

• Sophisticated model searchtechniques.

• Advanced phenotype models,such as generalized linearmodels or Cox models

• Analysis of (and under)crossover interference.

• Graphical user interface (GUI)

Introduction to QTL mapping in model organismskbroman/talks/Louisville/qtllect_ho.pdf · Interval...

Documents

Mapping Quantitative Trait Loci (QTL) in sheep. III. QTL for carcass

Molecular Markers and QTL Mapping; An Introduction, …maizeandgenetics.tamu.edu/SethCMurray/Teaching/Lectures/Molecular... · Molecular Markers and QTL Mapping; An Introduction,

A guide to QTL mapping with R/qtl

Plant Molecular Breeding - download.e-bookshelf.de€¦ · 1.4 Statistical principles and methods for mapping QTL 5 1.4.1 Molecular markers for QTL mapping 6 1.4.2 QTL mapping in

QTL Mapping

Genetic analysis, QTL mapping and gene expression analysis

QTL mapping for physiology, yield and plant architecture

Introduction to Linkage and QTL mapping

Functional Mapping of QTL and Recent Developments

QTL mapping in micekbroman/presentations/qtl... · 2003. 1. 14. · References C Broman KW (2001) Review of statistical methods for QTL mapping in experi- mental crosses. Lab Animal

RESEARCH ARTICLE Open Access QTL mapping of the …

Eucalyptus genomics: Linkage mapping, QTL analysis … · Eucalyptus genomics: Linkage mapping, QTL analysis and population genomic ... January 2012 . i ... This thesis reports on

What is a QTL? What are QTL?. Current methods for QTL Single Marker Methods ( Student, 17?? ) t-tests Interval Mapping Method (Lander and Botstein,

Qtl mapping sachin pbt

QTL MAPPING OF WHEAT (TRITICUM AESTIVUM L) IN RESPONSE …. QTL... · 2016-06-10 · QTL MAPPING OF WHEAT (TRITICUM AESTIVUM L.) IN RESPONSE TO SALT STRESS AYMAN Y. AMIN1 & AYMAN

Physiology, Genetics and QTL Mapping of Salt …research-repository.uwa.edu.au/files/3232922/Samineni_Srinivasan...Physiology, Genetics and QTL Mapping of ... generation mean analysis

Linkage Analysis and QTL Mapping - University of Washington

QTL mapping in animals

Roslin qtl mapping

Introduction to QTL mapping in model organismskbroman/teaching/... · Interval mapping (IM) Lander & Botstein (1989) •Assume a single QTL model. •Each position in the genome,