From Genotype to Phenotype
homes.cs.washington.edu/~suinlee/genome541/lecture2-genetics.pdf


Lecture 2 – Apr 28, 2011
GENOME 541, Introduction to Computational Molecular Biology, Spring 2011

Instructor: Su-In Lee
University of Washington, Seattle

From Genotype to Phenotype

Why are we so different?
Human genetic diversity

Different “phenotype”
  Appearance
  Disease susceptibility
  Drug responses
  :

Different “genotype”
  Individual-specific DNA
  3 billion-long string

……ACTGTTAGGCTGAGCTAGCCCAAAATTTATAGCGTCGACTGCAGGGTCCACCAAAGCTCGACTGCAGTCGACGACCTAAAATTTAACCGACTACGAGATGGGCACGTCACTTTTACGCAGCTTGATGATGCTAGCTGATCGTAGCTAAATGCATCAGCTGATGATCGTAGCTAAATGCATCAGCTGATGATCGTAGCTAAATGCATCAGCTGATGATCGTAGCTAAATGCATCAGCTGATTCACTTTTACGCAGCTTGATGACGACTACGAGATGGGCACGTTCACCATCTACTACTACTCATCTACTCATCAACCAAAAACACTACTCATCATCATCATCTACATCTATCATCATCACATCTACTGGGGGTGGGATAGATAGTGTGCTCGATCGATCGATCGTCAGCTGATCGACGGCAG……

Any observable characteristic or trait

TGATCGAAGCTAAATGCATCAGCTGATGATCCTAGC…

TGATCGTAGCTAAATGCATCAGCTGATGATCGTAGC…

TGATCGCAGCTAAATGCAGCAGCTGATGATCGTAGC…


Motivation
Which sequence variation affects a trait?
  Better understanding disease mechanisms
  Personalized medicine

A person
  Obese? 15%
  Bold? 30%
  Diabetes? 6.2%
  Parkinson’s disease? 0.3%
  Heart disease? 20.1%
  Colon cancer? 6.5%
  :

[Figure: a person's DNA (3 billion long!), e.g. ACTTCGGAACATATCAAATCCAACGC, read as an instruction. Sequence variations (marked X) give a different instruction in a different person.]

Appearance, Personality, Disease susceptibility, Drug responses, …

From DNA to Trait

[Figure: DNA sequences of N individuals (Individual 1, …, Individual N; N instances) at sites s1, s2, …, sp, next to a trait (obesity).]
  …ACTCGGTAGACCTAAATTCGGCCCGG…
  …ACCCGGTAGACCTTTATTCGGCCCGG…
  …ACCCGGTAGACCTTAATTCGGCCGGG…
  :
  …ACCCGGTAGTCCTATATTCGGCCCGG…
  …ACTCGGTAGTCCTATATTCGGCCGGG…

Single nucleotide polymorphism (SNP) [snip] = a variation at a single site in DNA

$p \approx 10^6$ !

Feature selection problem!
Standard approach: find a simple rule, e.g. A ⇒ thin, T ⇒ fat
  Can explain only 5% of the trait variation
  Why?
    Cell, a complex system
    Environmental factors

Causality? Predictive?

Outline
Statistical methods for mapping QTL
  What is QTL?
  Experimental animals
  Analysis of variance (marker regression)
  Interval mapping

Learning regulatory networks from a genetically diverse set of individuals
  Understand how DNA variations perturb the network

Quantitative Trait Locus (QTL)
Definition of QTLs
  The genomic regions that contribute to variation in a quantitative phenotype (e.g. blood pressure)

Mapping QTLs
  Finding QTLs from data

Experimental animals
  Backcross experiment
  F2 intercross experiment

QTL mapping
Data
  Phenotypes: $y_i$ = trait value for mouse i
  Genotypes: $x_{ik}$ = 1/0 (i.e. AB/AA) of mouse i at marker k
  Genetic map: locations of genetic markers

Goals: Identify the genomic regions (QTLs) contributing to variation in the phenotype.

[Figure: genotype data as a binary matrix of mouse individuals by 3,000 markers (rows such as 0101100100…011), next to the phenotype data for the same mice.]

Backcross experiment

Inbred strains
  Homozygous genomes
Advantage
  Only two genotypes
Disadvantage
  Relatively less genetic diversity

[Figure: backcross scheme showing the parental generation, the first filial (F1) generation (AB), the gametes, and backcross offspring with genotype AB or AA.]

Karl Broman, Review of statistical methods for QTL mapping in experimental crosses

F2 intercross experiment

[Figure: F2 intercross scheme showing the parental generation (AA × BB), the F1 generation (AB), the gametes, and the F2 generation.]

Karl Broman, Review of statistical methods for QTL mapping in experimental crosses

Trait distributions: a classical view

QTL mapping
Data
  Phenotypes: $y_i$ = trait value for mouse i
  Genotypes: $x_{ik}$ = 1/0 (i.e. AB/AA) of mouse i at marker k (backcross)
  Genetic map: locations of genetic markers

Goals
  Identify the genomic regions (QTLs) contributing to variation in the phenotype.
  Identify at least one QTL.
  Form a confidence interval for the QTL location.
  Estimate QTL effects.

The simplest method: ANOVA

“Analysis of variance”: assumes the presence of a single QTL

For each marker:
  Split mice into groups according to their genotypes at that marker.
  Do a t-test (backcross) or F-test (intercross).
  Repeat for each typed marker.

The t-test/F-statistic tells us whether there is sufficient evidence to believe that measurements from one condition (i.e. genotype) are significantly different from another.

LOD score (“logarithm of the odds favoring linkage”)
  = $\log_{10}$ likelihood ratio, comparing the single-QTL model to the “no QTL anywhere” model:

$$\mathrm{LOD} = \log_{10}\left[\frac{P(D \mid \text{QTL at the marker})}{P(D \mid \text{no QTL})}\right]$$
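The marker-by-marker scan can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a backcross with complete 0/1 genotypes and normally distributed residuals with a common variance; under those assumptions the log10 likelihood ratio reduces to (n/2) log10(RSS0/RSS1). The function name `marker_lod` and the simulated data are hypothetical and not part of the lecture.

```python
import numpy as np

def marker_lod(genotypes, phenotype):
    """LOD score at each marker for a backcross (genotypes coded 0/1).

    genotypes: (n_mice, n_markers) array of 0/1 values, no missing data.
    phenotype: (n_mice,) array of trait values.
    """
    x = np.asarray(genotypes, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    n = y.shape[0]
    # Residual sum of squares under the "no QTL" model (one common mean).
    rss0 = np.sum((y - y.mean()) ** 2)
    lods = np.zeros(x.shape[1])
    for k in range(x.shape[1]):
        g = x[:, k]
        rss1 = 0.0
        # Single-QTL model: a separate mean for each genotype group.
        for value in (0.0, 1.0):
            group = y[g == value]
            if group.size > 0:
                rss1 += np.sum((group - group.mean()) ** 2)
        lods[k] = 0.5 * n * np.log10(rss0 / rss1)
    return lods

# Illustrative simulated data (assumption): marker 3 carries a real effect.
rng = np.random.default_rng(0)
geno = rng.integers(0, 2, size=(200, 10))
pheno = 1.0 * geno[:, 3] + rng.normal(0.0, 1.0, size=200)
lods = marker_lod(geno, pheno)
print(np.round(lods, 2))
```

In this simulation the largest LOD score should sit at the marker that was given an effect; the other markers show what chance fluctuation looks like.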

ANOVA at marker loci
Advantages
  Simple.
  Easily incorporates covariates (e.g. environmental factors, sex, etc).
  Easily extended to more complex models.

Disadvantages
  Must exclude individuals with missing genotype data.
  Imperfect information about QTL location.
  Suffers in low density scans.
  Only considers one QTL at a time (assumes the presence of a single QTL).

Interval mapping [Lander and Botstein, 1989]

Consider any one position in the genome as the location for a putative QTL.

For a particular mouse, let z = 1/0 if the (unobserved) genotype at the QTL is AB/AA.

Calculate P(z = 1 | marker data).
  Assume no meiotic interference.
  Need only consider flanking typed markers.
  May allow for the presence of genotypic errors.

Given the genotype at the QTL, the phenotype is distributed as $N(\mu + \Delta z, \sigma^2)$.

Given marker data, the phenotype follows a mixture of normal distributions.

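IM: the mixture model

Let’s say that the mice with QTL genotype AA have average phenotype $\mu_A$, while the mice with QTL genotype AB have average phenotype $\mu_B$.
The QTL has effect $\Delta = \mu_B - \mu_A$.
What are the unknowns?
  $\mu_A$ and $\mu_B$
  Genotype of the QTL

[Figure: a putative QTL at position 7 between its nearest flanking markers M1 (position 0) and M2 (position 20). Probability of the QTL genotype given the flanking marker genotypes M1/M2:]
  AB/AB: 99% AB
  AB/AA: 65% AB, 35% AA
  AA/AB: 35% AB, 65% AA
  AA/AA: 99% AA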
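The percentages in the figure can be reproduced with a short calculation. The sketch below assumes the positions 0, 7 and 20 are in cM and uses the Haldane map function, which corresponds to the no-interference assumption; the function names are hypothetical.

```python
import math

def haldane_rf(distance_cm):
    """Recombination fraction for a map distance in cM (Haldane map
    function, i.e. assuming no crossover interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))

def prob_qtl_ab(m1, m2, pos_m1, pos_qtl, pos_m2):
    """P(z = 1 | flanking marker genotypes) in a backcross.

    m1, m2: genotypes at the flanking markers (1 = AB, 0 = AA).
    Positions are in cM with pos_m1 <= pos_qtl <= pos_m2.
    """
    r1 = haldane_rf(pos_qtl - pos_m1)  # marker 1 to QTL
    r2 = haldane_rf(pos_m2 - pos_qtl)  # QTL to marker 2

    def joint(z):
        # P(z | m1) * P(m2 | z); P(m1) is shared by z = 0 and z = 1.
        p1 = (1.0 - r1) if z == m1 else r1
        p2 = (1.0 - r2) if z == m2 else r2
        return p1 * p2

    return joint(1) / (joint(0) + joint(1))

for m1, m2 in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    p = prob_qtl_ab(m1, m2, 0.0, 7.0, 20.0)
    print(f"M1={m1} M2={m2}: P(QTL is AB) = {p:.2f}")
```

Because the QTL is closer to M1 than to M2, a mouse with discordant flanking markers is more likely to carry the M1 genotype at the QTL, which is where the 65%/35% split comes from.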

IM: estimation and LOD scores
Use a version of the EM algorithm (an iterative algorithm) to obtain estimates of $\mu_A$, $\mu_B$, $\sigma$ and the expectation of z.

Calculate the LOD score.

Repeat for all other genomic positions (in practice, at 0.5 cM steps along the genome).
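One way to picture the estimation step at a single putative QTL position is the following EM sketch. It assumes a backcross, a common standard deviation for both genotype groups, and that P(z = 1 | marker data) has already been computed for each mouse; the function names and the simulated data are hypothetical, and real software handles many details omitted here.

```python
import numpy as np

def normal_pdf(y, mean, sd):
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def em_interval_mapping(y, p_ab, n_iter=200):
    """EM for the two-component normal mixture at one putative QTL position.

    y:    (n,) phenotypes.
    p_ab: (n,) prior P(z_i = 1 | marker data) for each mouse.
    Returns (mu_a, mu_b, sigma, lod).
    """
    y = np.asarray(y, dtype=float)
    p_ab = np.asarray(p_ab, dtype=float)
    # Start from the prior-weighted group means.
    mu_a = np.sum((1 - p_ab) * y) / np.sum(1 - p_ab)
    mu_b = np.sum(p_ab * y) / np.sum(p_ab)
    sigma = y.std()
    for _ in range(n_iter):
        # E-step: posterior probability that mouse i has QTL genotype AB.
        like_b = p_ab * normal_pdf(y, mu_b, sigma)
        like_a = (1 - p_ab) * normal_pdf(y, mu_a, sigma)
        w = like_b / (like_a + like_b)
        # M-step: weighted means and a pooled standard deviation.
        mu_a = np.sum((1 - w) * y) / np.sum(1 - w)
        mu_b = np.sum(w * y) / np.sum(w)
        sigma = np.sqrt(np.mean(w * (y - mu_b) ** 2 + (1 - w) * (y - mu_a) ** 2))
    # log10 likelihood of the mixture versus a single normal ("no QTL").
    mix = p_ab * normal_pdf(y, mu_b, sigma) + (1 - p_ab) * normal_pdf(y, mu_a, sigma)
    null = normal_pdf(y, y.mean(), y.std())
    lod = np.sum(np.log10(mix)) - np.sum(np.log10(null))
    return mu_a, mu_b, sigma, lod

# Illustrative simulation (assumption): true effect of 2, sigma of 1.
rng = np.random.default_rng(1)
n = 400
p_ab = rng.choice([0.99, 0.65, 0.35, 0.01], size=n)
z = rng.random(n) < p_ab
y = 10.0 + 2.0 * z + rng.normal(0.0, 1.0, size=n)
mu_a, mu_b, sigma, lod = em_interval_mapping(y, p_ab)
print(f"mu_A={mu_a:.2f} mu_B={mu_b:.2f} sigma={sigma:.2f} LOD={lod:.1f}")
```

A genome scan would repeat this at each grid position, recomputing `p_ab` from the flanking markers each time.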

A simulated example

[Figure: LOD score curves along the genome, with the genetic markers indicated.]

Interval mapping
Advantages
  Takes proper account of missing data
  Can allow for the presence of genotypic errors
  Pretty pictures
  High power in low-density scans
  Improved estimate of QTL location

Disadvantages
  Greater computational effort (doing EM for each position)
  Requires specialized software
  More difficult to include covariates
  Only considers one QTL at a time

Statistical significance
Large LOD score → evidence for QTL
Question: How large is large?
Answer 1: Consider the distribution of the LOD score if there were no QTL.
Answer 2: Consider the distribution of the maximum LOD score.

Null hypothesis: assume that there are no QTLs segregating in the population.

$$\mathrm{LOD} = \log_{10}\left[\frac{P(D \mid \text{QTL at the position})}{P(D \mid \text{no QTL})}\right]$$

[Figure: null distribution of the LOD scores at a particular genomic position (solid curve).]

Under the null hypothesis, there is only a ~3% chance that a given genomic position gets a LOD score ≥ 1.

[Figure: null distribution of the LOD scores at a particular genomic position (solid curve) and of the maximum LOD score from a genome scan (dashed curve).]

LOD thresholds
To account for the genome-wide search, compare the observed LOD scores to the null distribution of the maximum LOD score, genome-wide, that would be obtained if there were no QTL anywhere.

LOD threshold = 95th percentile of the distribution of the genome-wide maximum LOD, when there are no QTL anywhere.

Methods for obtaining thresholds
  Analytical calculations (assuming a dense map of markers) (Lander & Botstein, 1989)
  Computer simulations
  Permutation/randomization test (Churchill & Doerge, 1994)
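The permutation approach can be sketched as follows. This is a simplified illustration, assuming a backcross, single-marker LOD scores, and simulated markers that are independent of each other (real markers are linked); the function names, the number of permutations, and the data are hypothetical.

```python
import numpy as np

def max_lod(genotypes, phenotype):
    """Largest single-marker LOD score in a backcross genome scan."""
    n = phenotype.shape[0]
    rss0 = np.sum((phenotype - phenotype.mean()) ** 2)
    best = 0.0
    for k in range(genotypes.shape[1]):
        g = genotypes[:, k]
        rss1 = 0.0
        for value in (0, 1):
            group = phenotype[g == value]
            if group.size > 0:
                rss1 += np.sum((group - group.mean()) ** 2)
        best = max(best, 0.5 * n * np.log10(rss0 / rss1))
    return best

def permutation_threshold(genotypes, phenotype, n_perm=200, level=0.95, seed=0):
    """Permutation estimate of the genome-wide LOD threshold."""
    rng = np.random.default_rng(seed)
    null_max = np.empty(n_perm)
    for b in range(n_perm):
        # Shuffling phenotypes breaks any genotype-phenotype association
        # while keeping the correlation structure among markers.
        null_max[b] = max_lod(genotypes, rng.permutation(phenotype))
    return np.quantile(null_max, level)

# Illustrative simulated data (assumption): marker 10 carries a real effect.
rng = np.random.default_rng(42)
geno = rng.integers(0, 2, size=(100, 50))
pheno = 1.5 * geno[:, 10] + rng.normal(0.0, 1.0, size=100)
threshold = permutation_threshold(geno, pheno)
observed = max_lod(geno, pheno)
print(f"threshold = {threshold:.2f}, observed max LOD = {observed:.2f}")
```

Taking the maximum over markers inside each permutation is what makes the threshold genome-wide rather than pointwise.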

More on LOD thresholds
Appropriate threshold depends on:
  Size of genome
  Number of typed markers
  Pattern of missing data
  Stringency of significance threshold
  Type of cross (e.g. F2 intercross vs backcross)
  Etc.

An example

[Figure: permutation distribution for a trait.]

Modeling multiple QTLs
Advantages
  Reduce the residual variation (trait variation that is not explained by a detected putative QTL) and obtain greater power to detect additional QTLs.
  Identification of (epistatic) interactions between QTLs requires the joint modeling of multiple QTLs.

Interactions between two loci
  [Figure panel, additive: the effect of QTL1 is the same, irrespective of the genotype of QTL2, and vice versa.]
  [Figure panel, epistatic: the effect of QTL1 depends on the genotype of QTL2, and vice versa.]

Multiple marker model
Let y = phenotype,
    x = genotype data.

Imagine a small number of QTL with genotypes $x_1, \dots, x_p$
  $2^p$ or $3^p$ distinct genotypes for backcross and intercross, respectively

We assume that
  $E(y \mid x) = \mu(x_1, \dots, x_p)$, $\mathrm{var}(y \mid x) = \sigma^2(x_1, \dots, x_p)$

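Multiple marker model
Constant variance
  $\sigma^2(x_1, \dots, x_p) = \sigma^2$

Assuming normality
  $y \mid x \sim N(\mu_g, \sigma^2)$, where $\mu_g = \mu(x_1, \dots, x_p)$ is the mean for joint genotype $g$

Additivity
  $\mu(x_1, \dots, x_p) = \mu + \sum_j \Delta_j x_j$

Epistasis
  $\mu(x_1, \dots, x_p) = \mu + \sum_j \Delta_j x_j + \sum_{j,k} w_{j,k} x_j x_k$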
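A tiny numerical example may help separate the two mean models. The sketch below uses two backcross QTL with made-up parameter values (all numbers are assumptions chosen for illustration) and compares the effect of QTL1 for each genotype of QTL2.

```python
def mean_phenotype(x1, x2, mu, d1, d2, w12=0.0):
    """Mean phenotype for two backcross QTL with genotypes x1, x2 in {0, 1}."""
    return mu + d1 * x1 + d2 * x2 + w12 * x1 * x2

def effect_of_qtl1(x2, **params):
    """Change in mean phenotype when QTL1 goes from AA (0) to AB (1)."""
    return mean_phenotype(1, x2, **params) - mean_phenotype(0, x2, **params)

# Hypothetical parameter values; w12 is the pairwise interaction weight.
additive = dict(mu=10.0, d1=2.0, d2=1.0, w12=0.0)
epistatic = dict(mu=10.0, d1=2.0, d2=1.0, w12=3.0)

for name, params in [("additive", additive), ("epistatic", epistatic)]:
    e0 = effect_of_qtl1(0, **params)
    e1 = effect_of_qtl1(1, **params)
    print(f"{name}: effect of QTL1 is {e0} when QTL2 = AA and {e1} when QTL2 = AB")
```

With the interaction weight set to zero the effect of QTL1 does not depend on QTL2; with a nonzero weight it does, which is the situation described as epistasis.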

Computational problem
N backcross individuals, M markers in all, with at most a handful expected to be near QTL

  $x_{ij}$ = genotype (0/1) of mouse i at marker j
  $y_i$ = phenotype (trait value) of mouse i

Assuming additivity,
  $y_i = \mu + \sum_j \Delta_j x_{ij} + e$; which $\Delta_j \neq 0$?
Variable selection in linear regression models

Mapping QTL as model selection
Select the class of models
  Additive models
  Additive with pairwise interactions
  Regression trees

Linear Regression
[Figure: inputs $x_1, \dots, x_N$ connected to the phenotype y through weights (parameters) $w_1, \dots, w_N$.]
  $y = w_1 x_1 + \dots + w_N x_N + \varepsilon$
  $\text{minimize}_w \; (w_1 x_1 + \dots + w_N x_N - y)^2$ ?
  $\text{minimize}_w \; (w_1 x_1 + \dots + w_N x_N - y)^2 + \text{model complexity}$

Search model space
  Forward selection (FS)
  Backward elimination (BE)
  FS followed by BE
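Forward selection with a complexity penalty can be sketched as below. The slides do not say which penalty is meant by "model complexity"; this example assumes a BIC-style penalty of log(n) per selected marker, and the function names and simulated data are hypothetical.

```python
import numpy as np

def rss_of_fit(x, y, columns):
    """Residual sum of squares of least squares on an intercept plus columns."""
    design = np.column_stack([np.ones(len(y))] + [x[:, j] for j in columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ coef) ** 2))

def bic(x, y, columns):
    """n*log(RSS/n) plus a log(n) penalty per selected marker (assumed penalty)."""
    n = len(y)
    return n * np.log(rss_of_fit(x, y, columns) / n) + len(columns) * np.log(n)

def forward_selection(x, y, max_terms=10):
    """Greedily add the marker that most improves the penalized fit."""
    selected = []
    best = bic(x, y, selected)
    while len(selected) < max_terms:
        candidates = [j for j in range(x.shape[1]) if j not in selected]
        if not candidates:
            break
        scores = [bic(x, y, selected + [j]) for j in candidates]
        if min(scores) >= best:
            break  # no candidate improves the criterion
        best = min(scores)
        selected.append(candidates[int(np.argmin(scores))])
    return selected

# Illustrative simulated backcross (assumption): QTL at markers 3 and 17.
rng = np.random.default_rng(3)
x = rng.integers(0, 2, size=(150, 30)).astype(float)
y = 2.0 * x[:, 3] - 1.5 * x[:, 17] + rng.normal(0.0, 1.0, size=150)
selected = forward_selection(x, y)
print("selected markers:", selected)
```

Backward elimination would run the same comparison in reverse, dropping the term whose removal most improves the criterion.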


Lasso* (L1) Regression
  $\text{minimize}_w \; (w_1 x_1 + \dots + w_N x_N - y)^2 + \sum_i C\,|w_i|$

Induces sparsity in the solution w (many $w_i$’s set to zero)
Provably selects “right” features when many features are irrelevant

Convex optimization problem
  No combinatorial search
  Unique global optimum
  Efficient optimization
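The sparsity effect is easy to see numerically. The sketch below minimizes the objective above with cyclic coordinate descent, which is one common solver and not necessarily the one used in the lecture; the function names, the value of C, and the simulated data are assumptions.

```python
import numpy as np

def soft_threshold(value, threshold):
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(x, y, c, n_sweeps=200):
    """Minimize sum_i (x_i . w - y_i)^2 + c * sum_j |w_j| by coordinate descent."""
    n, p = x.shape
    w = np.zeros(p)
    col_sq = np.sum(x ** 2, axis=0)
    residual = y - x @ w
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Remove feature j's current contribution from the fit.
            residual += x[:, j] * w[j]
            rho = x[:, j] @ residual
            # Exact minimizer over w_j with all other weights fixed.
            w[j] = soft_threshold(rho, c / 2.0) / col_sq[j]
            residual -= x[:, j] * w[j]
    return w

# Illustrative simulated data (assumption): only features 0 and 5 matter.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[0], true_w[5] = 2.0, -3.0
y = x @ true_w + 0.1 * rng.normal(size=100)
w = lasso_coordinate_descent(x, y, c=20.0)
print(np.round(w, 2))
```

The soft-threshold step is what sets a weight exactly to zero whenever a feature's correlation with the current residual falls below C/2; the surviving weights are shrunk slightly toward zero.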

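[Figure: linear model with inputs $x_1, \dots, x_N$, weights (parameters) $w_1, \dots, w_N$ and the phenotype y as output; comparison of the L2 and L1 penalties, with the L1 term highlighted.]

* Tibshirani, 1996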

Model selection
Compare models
  Likelihood function + model complexity (e.g. # QTLs)
  Cross-validation test
  Sequential permutation tests

Assess performance
  Maximize the number of QTL found
  Control the false positive rate

Interval mapping for multiple QTLs
Composite interval mapping (CIM)
  Has been widely used in practice.
  Performs IM using a subset of marker loci as covariates.
  The key problem concerns the choice of suitable marker loci to serve as covariates.

Multiple interval mapping
  Allows interactions between QTLs


From DNA to Trait

[Figure: DNA sequences of Individual 1, …, Individual N at sites s1, s2, …, sP ($P \approx 3 \times 10^6$ !) are linked to a trait (obesity) through the cell, a complex system. Annotation in the figure: too faint to be detected.]

Learn the complex web of interactions from data?
  Better understand the trait
  Better detect the causative $s_i$’s

What training data (instances)?
  RNA level

Model Organism [Brem et al, Science 2002]

[Figure: Strain 0 × Strain 1 → 112 progeny (genetic perturbation). Genotyping gives the genotype data (individuals by 3,000 markers, binary rows such as 0101100100…011). Expression profiling (gene → RNA → protein) gives the expression data (individuals by 6,000 genes).]

Single-marker expression quantitative trait loci (eQTL) mapping

For each gene, find the marker that is most predictive of its expression level [Yvert G et al. (2003) Nat Gen].
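A minimal sketch of this single-marker scan follows. It assumes "most predictive" is measured by absolute Pearson correlation between a 0/1 marker and the expression level, which for a two-group marker ranks markers the same way as a t-test; the function name and the simulated data are hypothetical.

```python
import numpy as np

def best_marker_per_gene(genotypes, expression):
    """For each gene, return the index of the most correlated marker.

    genotypes:  (n_individuals, n_markers) array of 0/1 values.
    expression: (n_individuals, n_genes) array of expression levels.
    """
    g = genotypes - genotypes.mean(axis=0)
    e = expression - expression.mean(axis=0)
    # Standardize columns; constant columns get correlation 0.
    g_norm = np.linalg.norm(g, axis=0)
    e_norm = np.linalg.norm(e, axis=0)
    g_norm[g_norm == 0] = np.inf
    e_norm[e_norm == 0] = np.inf
    corr = (g / g_norm).T @ (e / e_norm)  # (n_markers, n_genes)
    best = np.argmax(np.abs(corr), axis=0)
    return best, corr[best, np.arange(corr.shape[1])]

# Illustrative simulated data (assumption): 112 individuals, 30 markers, 5 genes.
rng = np.random.default_rng(7)
geno = rng.integers(0, 2, size=(112, 30))
expr = rng.normal(size=(112, 5))
expr[:, 0] += 2.0 * geno[:, 4]    # gene 0 induced by marker 4
expr[:, 1] -= 2.0 * geno[:, 12]   # gene 1 repressed by marker 12
best, best_corr = best_marker_per_gene(geno, expr)
print(best, np.round(best_corr, 2))
```

The sign of the returned correlation distinguishes genes that are induced from genes that are repressed by the marker's alternative genotype.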

Traditional Approach: Single Marker

[Figure: genotype data (markers by individuals, binary) and expression data (genes by individuals; mRNA levels shown as induced or repressed). The genotype of marker j, $G_{\text{marker } j}$, is tested as a predictor of the expression of gene i, $E_{\text{gene } i}$.]

Genetic variation and regulation

[Figure: a Regulator and its Targets on the DNA sequence AGTCTTAACGTTTGACCGCTAATT.]

Activity level of Regulator changes the expression levels of Targets it binds to.
Regulator’s expression ($E_{\text{Regulator}}$) is predictive of Targets’ expression ($E_{\text{Targets}}$).

Segal et al., Nature Genetics 2003; Lee et al., PNAS 2006

Regulation variation & mechanisms

Regulator SNPs ⇒ change in regulator function

Regulator’s genotype ($G_{\text{Regulator}}$) is predictive of Targets’ expression ($E_{\text{Targets}}$).

[Figure: a Regulator carrying a sequence variation (marked X in AGTCTTAACGTTTGACCGCTAAXC), its Targets (co-regulated genes, a module), and the variables $G_{\text{Regulator}}$, $E_{\text{Regulator}}$ and $E_{\text{Targets}}$.]

Modeling assumptions [Segal E et al. (2003) Nat Gen]:
  Genes are organized into co-regulated groups of genes (i.e. modules)
  Each module has its own “regulatory program”

Modularity …
  Multiple genes are regulated by the same regulators
  Co-regulated genes have a similar “regulation program”

Regulatory network
[Figure: regulatory network. Nodes are either the variation of a certain site on DNA (S1, S22, S120, S321, S1011) or the expression level of a gene (PHO5, PHM6, SPL2, PHO3, PHO84, VTC3, GIT1, PHO2, TEC1, GPA1, ECM18, UTH1, MEC3, MFA1, SAS5, SEC59, SGS1, PHO4, ASG7, RIM15, HAP1). Targets are grouped into modules. Edges from A and B to C mean that A and B regulate the expression of C (A and B are regulators of C).]

Candidate regulators ($x_1, \dots, x_N$):
  Sequence variations
  Expression levels of genes that have regulatory roles

“Regulation program”?

Segal et al., Nature Genetics 2003; Lee et al., PNAS 2006

Regulation as Linear Regression
  $E_{\text{Targets}} = w_1 x_1 + \dots + w_N x_N + \varepsilon$
  $\text{minimize}_w \; (w_1 x_1 + \dots + w_N x_N - E_{\text{Targets}})^2$

But we often have very large N… and linear regression gives them all nonzero weight!

Problem: This objective learns too many regulators

[Figure: candidate regulators $x_1, \dots, x_N$ with weights (parameters) $w_1, \dots, w_N$ predicting $E_{\text{Module}}$, next to the regulatory network of sequence variations (S1, S22, S120, S321, S1011) and genes.]

Lasso* (L1) Regression
  $\text{minimize}_w \; (w_1 x_1 + \dots + w_N x_N - E_{\text{Module}})^2 + \sum_i C\,|w_i|$

Induces sparsity in the solution w (many $w_i$’s set to zero)
Provably selects “right” features when many features are irrelevant

Convex optimization problem
  No combinatorial search
  Unique global optimum
  Efficient optimization

[Figure: candidate regulators $x_1, \dots, x_N$ with weights (parameters) $w_1, \dots, w_N$ predicting $E_{\text{Module}}$; comparison of the L2 and L1 penalties, with the L1 term highlighted.]

* Tibshirani, 1996

Learning regulatory network
  Cluster genes into modules
  Learn a regulatory program for each module

L1 regression
  $\text{minimize}_w \; (\sum_i w_i x_i - E_{\text{Targets}})^2 + \sum_i C\,|w_i|$

[Figure: learned regulatory network over the sequence variations (S1, S22, S120, S321, S1011) and genes. An example regulatory program expresses a module's expression as a weighted sum of regulators (labels in the figure: S321, M120, MFA1, GPA1; weights −3, 0.5, −1.2).]

Is this predicted relationship “real”?

Lee et al., PLoS Genet 2009

Challenges?
Too large N!
  # regulatory genes + # sequence variations
  For human: 2,000 + 1,000,000

Redundant features
  $\{x_i, x_j, \dots, x_k\}$ are perfectly correlated

Learning the regulatory network
Multiple regression tasks
  $\text{minimize}_{w_1} \; (\sum_n w_{1n} x_n - E_{\text{module } 1})^2 + \sum_n C\,|w_{1n}|$
  :
  $\text{minimize}_{w_M} \; (\sum_n w_{Mn} x_n - E_{\text{module } M})^2 + \sum_n C\,|w_{Mn}|$

[Figure: one regression per module (Module 1, …, Module M), each predicting $E_{\text{module } m}$ from $x_1, \dots, x_N$ with weights $w_{m1}, \dots, w_{mN}$; many weights are set to 0.]

What Regulates the P-bodies?
Bad news
  A marker that covers a large region in Chr 14.
  Region contains ~30 genes and ~318 SNPs.
  318 redundant features!

[Figure: the region ChrXIV:449,639-502,316 in Strain 0 and Strain 1, and the regulators of the Puf3 module; gene labels in the figure: DHH1, GCN20, GCN1, KEM1, BLM3.]

Challenge: redundant features!

The 318 selected sequence variations are perfectly correlated
  Which of the 318 is the real causative variation?
  Experiments for all 318 variations are not feasible!

[Figure: the RNA degradation module (gene labels in the figure: PHO5, PHM6, SPL2, PHO3, PHO84, VTC3, GIT1, MFA1, PHO4) regulated by S1-318 in ChrXIV:449639-502316.]

  s1 s2 … s318
  …ACTCGGTAGCCC…TACATTCGGCCCGG…   “Type A”
  …ACTCGGTAGCCC…TACATTCGGCCCGG…   “Type A”
  …ACCCGGTAGACC…TTAATTCGGCCGGG…   “Type B”
  :
  …ACCCGGTAGACC…TTAATTCGGCCGGG…   “Type B”

All individuals have either TC…ACC (Type A) or CA…TAG (Type B) for S1~S318

Lee et al., PLOS Genetics 2009

Motivation
Not all sequence variations are equally likely to be causal.

[Figure: SNPs along a gene in ChrXIV: 449,639-502,316 (TACGTAGGAACCTGTACCA … GGAAAATATCAAATCCAACGACGTTAGCCAATGCGATCGAATGGGAACGTA).]
  S1: On the protein coding region of a gene involved in RNA degradation
  S2: Not on any gene or a regulatory region

“Regulatory features” F, with weights $\beta_1, \beta_2, \beta_3, \beta_4, \beta_5, \dots$
  1. Inside a gene region?
  2. Protein coding region?
  3. Change the protein letter?
  4. Create a stop codon?
  5. Strong conservation?
  :

Redundant features …
Idea: Prioritize SNPs that have “good” regulatory features
Problem: How much weight do we give to different regulatory features?
  Too many weights to estimate using cross-validation

Metaprior Learning
Iterate steps 1, 2 & 3 until convergence.
  1. Compute regulatory potential of SNPs
  2. Learn w: “weighted” L1 regularization
       Maximize $\log P(E \mid X, W) + \log P(W \mid \beta, F)$
  3. Learn β
       Maximize $\log P(W \mid \beta, F) + \log P(\beta)$

[Figure: regulatory features F of each SNP (e.g. non-synonymous, conservation, AA small ↔ large, cell cycle; illustrated with the Strain 0 and Strain 1 protein sequences MVLT ELVQ VSDASKQLWDI and an L/D amino acid difference) are combined with the regulatory weights β to give regulatory potentials (e.g. 0.3, 0.9), which feed the regulatory programs of Module i, Module i+1, Module i+2, each predicting $E_i$ from $x_1, x_2, \dots, x_N$.]

Lee et al. PLoS Genet 2009

Learned regulatory weights

[Figure: bar charts of the learned regulatory weights (axis from 0 to 0.5) for yeast and human. Regulatory features are grouped as: location, AA property change, gene function, pairwise feature.]

Yeast regulatory weights, features shown: Non-synonymous coding; Stop codon; Synonymous coding; 3' UTR; 500 bp upstream; 5' UTR; 500 bp downstream; Conservation score; Cis-regulation; Change of average mass (Da); Change of isoelectric point (pI); Change of pK1; Change of pK2; Change of hydrophobicity; Change of pKa; Change of polarity; Change of pH; Change of van der Waals; Transcription regulator activity; Telomere organization and; Protein folding; Glucose metabolic process; RNA modification; Hexose metabolic process; Cell cycle checkpoint; Proteolysis; same GO process; same GO function; ChIP-chip binding

Human regulatory weights, features shown: Non-synonymous coding; Synonymous coding; Intron; Locus region; Splice site; UTR region; Conservation score; Cis-regulation; Change of average mass (Da); Change of isoelectric point; Change of pK1; Change of pKa; Change of polarity; Change of pH; Change of van der Waals volume; cell communication; signal transduction; transcription

What Regulates the RNA degradation module?

[Figure: the regulatory potential (axis from 0.4 to 0.7) over all 318 variations in the region, plotted along ChrXIV:415,000-495,000 with gene annotation from the Saccharomyces Genome Database (SGD); MKT1 and position ChrXIV:449639 are labeled.]

Biological validation succeeded!

Lee et al., PLOS Genetics 2009

Predicting Causal Regulators

Region | Zhu et al [Nat Genet 08] | Lirnet (top 3 are considered)
1 | None | SEC18, RDH54, SPT7
2 | TBS1, TOS1, ARA1, CSH1, SUP45, CNS1, AMN1 | AMN1, CNS1, TOS1
3 | None | TRS20, ABD1, PRP5
4 | LEU2, ILV6, NFS1, CIT2, MATALPHA1 | LEU2, PGS1, ILV6
5 | MATALPHA1 | MATALPHA1, MATALPHA2, RBK1
6 | URA3 | URA3, NPP2, PAC2
7 | GPA1 | STP2, GPA1, NEM1
8 | HAP1 | HAP1, NEJ1, GSY2
9 | YRF1-4, YRF1-5, YLR464W | SIR3, HMG2, ECM7
10 | None | ARG81, TAF13, CAC2
11 | SAL1, TOP2 | MKT1, TOP2, MSK1
12 | PHM7 | PHM7, ATG19, BRX1
13 | None | ADE2, ORT1, CAT5

Experimentally supported regulators: GENE

8 validated regulators in 7 regions

14 validated regulators in 11 regions

Summary: From Genotype To Phenotype
Statistical methods for mapping QTL
  What is QTL?
  Experimental animals
  Analysis of variance (marker regression)
  Interval mapping
  Multi-marker models
