Lecture 2 – Apr 28, 2011
GENOME 541, Introduction to Computational Molecular Biology, Spring 2011
Instructor: Su-In Lee, University of Washington, Seattle
From Genotype to Phenotype
Why are we so different?
Human genetic diversity
Different “phenotype”: appearance, disease susceptibility, drug responses, ...
Different “genotype”: individual-specific DNA, a 3-billion-long string
……ACTGTTAGGCTGAGCTAGCCCAAAATTTATAGCGTCGACTGCAGGGTCCACCAAAGCTCGACTGCAGTCGACGACCTAAAATTTAACCGACTACGAGATGGGCACGTCACTTTTACGCAGCTTGATGATGCTAGCTGATCGTAGCTAAATGCATCAGCTGATGATCGTAGCTAAATGCATCAGCTGATGATCGTAGCTAAATGCATCAGCTGATGATCGTAGCTAAATGCATCAGCTGATTCACTTTTACGCAGCTTGATGACGACTACGAGATGGGCACGTTCACCATCTACTACTACTCATCTACTCATCAACCAAAAACACTACTCATCATCATCATCTACATCTATCATCATCACATCTACTGGGGGTGGGATAGATAGTGTGCTCGATCGATCGATCGTCAGCTGATCGACGGCAG……
Phenotype: any observable characteristic or trait
TGATCGAAGCTAAATGCATCAGCTGATGATCCTAGC…
TGATCGTAGCTAAATGCATCAGCTGATGATCGTAGC…
TGATCGCAGCTAAATGCAGCAGCTGATGATCGTAGC…
Motivation
Which sequence variation affects a trait?
  Better understanding of disease mechanisms
  Personalized medicine
[Figure: a person and a different person, each with DNA that is 3 billion long (e.g. ACTTCGGAACATATCAAATCCAACGC). Sequence variations (e.g. A vs. G) turn an "instruction" into a "different instruction" and lead to differences in appearance, personality, disease susceptibility, drug responses, ... Example risk profile for one person: Obese? 15%; Bold? 30%; Diabetes? 6.2%; Parkinson’s disease? 0.3%; Heart disease? 20.1%; Colon cancer? 6.5%.]
From DNA to Trait
[Figure: N instances (Individual 1, ..., Individual N). Each has a DNA sequence, summarized by sequence variations s1, s2, …, sp with p ≈ 10^6, and a trait (obesity). Example rule at a single site: A ⇒ thin, T ⇒ fat. Between DNA sequence and trait sits the cell, a complex system (??), together with environmental factors.]
…ACTCGGTAGACCTAAATTCGGCCCGG…
…ACCCGGTAGACCTTTATTCGGCCCGG…
…ACCCGGTAGACCTTAATTCGGCCGGG…
:
…ACCCGGTAGTCCTATATTCGGCCCGG…
…ACTCGGTAGTCCTATATTCGGCCGGG…
Single nucleotide polymorphism (SNP) [snip] = a variation at a single site in DNA
Feature selection problem!
Standard approach: find a simple rule!
  Can explain only 5% of the trait variation. Why?
  Causality? Predictive?
Outline
Statistical methods for mapping QTL
  What is QTL?
  Experimental animals
  Analysis of variance (marker regression)
  Interval mapping
Learning regulatory networks from genetically diverse set of individuals
  Understand how DNA variations perturb the network
Quantitative Trait Locus (QTL)
Definition of QTLs
  The genomic regions that contribute to variation in a quantitative phenotype (e.g. blood pressure)
Mapping QTLs
  Finding QTLs from data
Experimental animals
  Backcross experiment
  F2 intercross experiment
QTL mapping
Data
  Phenotypes: $y_i$ = trait value for mouse $i$
  Genotypes: $x_{ik}$ = 1/0 (i.e. AB/AA) of mouse $i$ at marker $k$
  Genetic map: locations of genetic markers
Goals: Identify the genomic regions (QTLs) contributing to variation in the phenotype.
[Figure: genotype data, a 0/1 matrix of mouse individuals by 3,000 markers (markers numbered 1, 2, 3, 4, 5, …, 3,000), next to the phenotype data, one value per mouse.]
Backcross experiment
Inbred strains
  Homozygous genomes
Advantage
  Only two genotypes
Disadvantage
  Relatively less genetic diversity
[Figure: parental generation, first filial (F1) generation (AB), gametes, and backcross progeny (AB or AA).]
Source: Karl Broman, Review of statistical methods for QTL mapping in experimental crosses
F2 intercross experiment
[Figure: parental generation (AA, BB), F1 generation (AB), gametes, and F2 generation.]
Source: Karl Broman, Review of statistical methods for QTL mapping in experimental crosses
Trait distributions: a classical view
QTL mapping
Data
  Phenotypes: $y_i$ = trait value for mouse $i$
  Genotypes: $x_{ik}$ = 1/0 (i.e. AB/AA) of mouse $i$ at marker $k$ (backcross)
  Genetic map: locations of genetic markers
Goals
  Identify the genomic regions (QTLs) contributing to variation in the phenotype.
  Identify at least one QTL.
  Form confidence interval for QTL location.
  Estimate QTL effects.
The simplest method: ANOVA
“Analysis of variance”: assumes the presence of a single QTL
For each marker:
  Split mice into groups according to their genotypes at that marker.
  Do a t-test (backcross) / F-test (intercross).
  Repeat for each typed marker.
The t-test/F-statistic tells us whether there is sufficient evidence to believe that measurements from one condition (i.e. genotype) are significantly different from another.
LOD score (“Logarithm of the odds favoring linkage”) = log10 likelihood ratio, comparing the single-QTL model to the “no QTL anywhere” model.
$$\mathrm{LOD} = \log_{10}\left[\frac{P(D \mid \text{QTL at the marker})}{P(D \mid \text{no QTL})}\right]$$
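A minimal sketch of the marker-by-marker scan for a backcross (genotypes coded 0 = AA, 1 = AB). It assumes normally distributed residuals with a common variance, in which case the log10 likelihood ratio reduces to (n/2)·log10(RSS0/RSS1). The function names and the toy data are illustrative and not from the lecture.

```python
import math

def marker_lod(genotypes, phenotypes):
    """LOD score at one marker: one-mean model vs. one mean per genotype group."""
    n = len(phenotypes)
    grand_mean = sum(phenotypes) / n
    # Residual sum of squares under the "no QTL" model (a single mean).
    rss0 = sum((y - grand_mean) ** 2 for y in phenotypes)
    # Residual sum of squares under the "QTL at the marker" model.
    rss1 = 0.0
    for g in (0, 1):
        group = [y for x, y in zip(genotypes, phenotypes) if x == g]
        if group:
            m = sum(group) / len(group)
            rss1 += sum((y - m) ** 2 for y in group)
    return (n / 2.0) * math.log10(rss0 / rss1)

def scan_markers(genotype_matrix, phenotypes):
    """genotype_matrix[i][k] is the genotype of mouse i at marker k."""
    n_markers = len(genotype_matrix[0])
    return [marker_lod([row[k] for row in genotype_matrix], phenotypes)
            for k in range(n_markers)]

# Toy data: 6 mice, 2 markers; the trait tracks marker 0 but not marker 1.
geno = [[0, 0], [0, 1], [1, 0], [1, 1], [0, 1], [1, 0]]
pheno = [1.0, 1.2, 3.1, 2.9, 0.9, 3.0]
lods = scan_markers(geno, pheno)
print([round(v, 2) for v in lods])
```

On this toy data the first marker gets a much larger LOD score than the second, which is the pattern a genome scan looks for.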
ANOVA at marker loci
Advantages
  Simple.
  Easily incorporate covariates (e.g. environmental factors, sex, etc).
  Easily extended to more complex models.
Disadvantages
  Must exclude individuals with missing genotype data.
  Imperfect information about QTL location.
  Suffers in low density scans.
  Only considers one QTL at a time (assumes the presence of a single QTL).
Interval mapping [Lander and Botstein, 1989]
Consider any one position in the genome as the location for a putative QTL.
For a particular mouse, let $z$ = 1/0 if the (unobserved) genotype at the QTL is AB/AA.
Calculate $P(z = 1 \mid \text{marker data})$.
  Assume no meiotic interference.
  Need only consider flanking typed markers.
  May allow for the presence of genotypic errors.
Given the genotype at the QTL, the phenotype is distributed as $N(\mu + \Delta z, \sigma^2)$.
Given marker data, the phenotype follows a mixture of normal distributions.
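IM: the mixture model
Let’s say that the mice with QTL genotype AA have average phenotype $\mu_A$ while the mice with QTL genotype AB have average phenotype $\mu_B$.
The QTL has effect $\Delta = \mu_B - \mu_A$.
What are unknowns?
  $\mu_A$ and $\mu_B$
  Genotype of QTL
[Figure: nearest flanking markers M1 (position 0) and M2 (position 20) with a putative QTL at position 7. Depending on the flanking marker genotypes, the QTL genotype is 99% AB; 65% AB / 35% AA; 35% AB / 65% AA; or 99% AA.]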
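The percentages in the figure can be reproduced with a short calculation. This sketch assumes that the positions 0, 7 and 20 are in cM and that recombination follows Haldane's map function, which is the usual reading of "no meiotic interference". Neither detail is stated on the slide, and the function names are illustrative.

```python
import math

def haldane_recomb(d_cm):
    """Recombination fraction for a map distance in cM (Haldane, no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def prob_qtl_ab(m1, m2, pos_m1, pos_qtl, pos_m2):
    """P(z = 1 | flanking genotypes) in a backcross; genotypes are 0 (AA) or 1 (AB)."""
    r1 = haldane_recomb(pos_qtl - pos_m1)  # between M1 and the QTL
    r2 = haldane_recomb(pos_m2 - pos_qtl)  # between the QTL and M2

    def joint(z):
        # P(z, m2 | m1): no recombination keeps the genotype, recombination flips it.
        p_z = (1.0 - r1) if z == m1 else r1
        p_m2 = (1.0 - r2) if m2 == z else r2
        return p_z * p_m2

    return joint(1) / (joint(1) + joint(0))

for m1, m2 in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print(m1, m2, round(prob_qtl_ab(m1, m2, 0.0, 7.0, 20.0), 3))
```

Under these assumptions the four flanking-genotype combinations give roughly 0.99, 0.65, 0.35 and 0.01 for P(QTL = AB), matching the figure.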
IM: estimation and LOD scores
Use a version of the EM algorithm (an iterative algorithm) to obtain estimates of $\mu_A$, $\mu_B$, $\sigma$ and the expectation of $z$.
Calculate the LOD score
Repeat for all other genomic positions (in practice, at 0.5 cM steps along genome).
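A hedged sketch of the EM step at a single putative QTL position. It takes as given the per-mouse prior probabilities p_i = P(z_i = 1 | marker data) and fits the two-component normal mixture with a shared σ. The starting values, iteration count and toy data are assumptions made for illustration.

```python
import math

def normal_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def em_interval_lod(y, p, n_iter=200):
    """EM for y_i ~ p_i N(mu_B, sigma^2) + (1 - p_i) N(mu_A, sigma^2); returns LOD too."""
    n = len(y)
    mean = sum(y) / n
    sigma0 = math.sqrt(sum((v - mean) ** 2 for v in y) / n)
    # Log-likelihood of the "no QTL" model: one normal with MLE mean and variance.
    loglik0 = sum(math.log(normal_pdf(v, mean, sigma0)) for v in y)
    # Arbitrary starting point with mu_A below mu_B.
    mu_a, mu_b, sigma = mean - 0.5 * sigma0, mean + 0.5 * sigma0, sigma0
    for _ in range(n_iter):
        # E-step: posterior probability that each mouse is AB at the QTL.
        w = []
        for v, pi in zip(y, p):
            b = pi * normal_pdf(v, mu_b, sigma)
            a = (1.0 - pi) * normal_pdf(v, mu_a, sigma)
            w.append(b / (a + b))
        # M-step: weighted means and pooled variance.
        sw = sum(w)
        mu_b = sum(wi * v for wi, v in zip(w, y)) / sw
        mu_a = sum((1.0 - wi) * v for wi, v in zip(w, y)) / (n - sw)
        sigma = math.sqrt(sum(wi * (v - mu_b) ** 2 + (1.0 - wi) * (v - mu_a) ** 2
                              for wi, v in zip(w, y)) / n)
    loglik1 = sum(math.log(pi * normal_pdf(v, mu_b, sigma)
                           + (1.0 - pi) * normal_pdf(v, mu_a, sigma))
                  for v, pi in zip(y, p))
    lod = (loglik1 - loglik0) / math.log(10.0)
    return mu_a, mu_b, sigma, lod

# Toy data: phenotypes and P(z = 1 | flanking markers) for 10 mice.
y = [1.0, 1.3, 0.8, 1.1, 3.0, 2.7, 3.2, 2.9, 1.2, 2.8]
p = [0.01, 0.01, 0.35, 0.01, 0.99, 0.99, 0.65, 0.99, 0.35, 0.65]
mu_a, mu_b, sigma, lod = em_interval_lod(y, p)
print(round(mu_a, 2), round(mu_b, 2), round(sigma, 2), round(lod, 2))
```

In a genome scan this routine would be called once per position, with p recomputed from the flanking markers each time.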
A simulated example
[Figure: LOD score curves along the genome, with the genetic markers indicated.]
Interval mapping
Advantages
  Takes proper account of missing data
  Can allow for the presence of genotypic errors
  Pretty pictures
  High power in low-density scans
  Improved estimate of QTL location
Disadvantages
  Greater computational effort (doing EM for each position)
  Requires specialized software
  More difficult to include covariates
  Only considers one QTL at a time
Statistical significance
Large LOD score → evidence for QTL
Question: How large is large?
Answer 1: Consider the distribution of the LOD score if there were no QTL.
Answer 2: Consider the distribution of the maximum LOD score.
Null hypothesis: there are no QTLs segregating in the population.
$$\mathrm{LOD} = \log_{10}\left[\frac{P(D \mid \text{QTL at the position})}{P(D \mid \text{no QTL})}\right]$$
Under the null hypothesis there is only a ~3% chance that a given genomic position gets a LOD score ≥ 1.
[Figure: null distribution of the LOD scores at a particular genomic position (solid curve) and of the maximum LOD score from a genome scan (dashed curve).]
LOD thresholds
To account for the genome-wide search, compare the observed LOD scores to the null distribution of the maximum LOD score, genome-wide, that would be obtained if there were no QTL anywhere.
LOD threshold = 95th percentile of the distribution of genome-wide max LOD, when there are no QTL anywhere.
Methods for obtaining thresholds
  Analytical calculations (assuming dense map of markers) (Lander & Botstein, 1989)
  Computer simulations
  Permutation/randomization test (Churchill & Doerge, 1994)
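The permutation approach can be sketched as follows: shuffle the phenotypes relative to the genotypes, record the genome-wide maximum LOD for each shuffle, and take the 95th percentile. As a simplification, the simulated markers here are independent (no linkage) and the LOD is the marker-regression version. The data, seeds and parameter names are assumptions.

```python
import math
import random

def marker_lod(x, y):
    """Marker-regression LOD for a backcross marker coded 0/1."""
    n = len(y)
    mean = sum(y) / n
    rss0 = sum((v - mean) ** 2 for v in y)
    rss1 = 0.0
    for g in (0, 1):
        grp = [v for xi, v in zip(x, y) if xi == g]
        if grp:
            m = sum(grp) / len(grp)
            rss1 += sum((v - m) ** 2 for v in grp)
    return (n / 2.0) * math.log10(rss0 / rss1)

def max_lod(columns, y):
    """Genome-wide maximum LOD; columns[k] holds the genotypes at marker k."""
    return max(marker_lod(col, y) for col in columns)

def permutation_threshold(columns, y, n_perm=200, level=0.95, seed=0):
    rng = random.Random(seed)
    shuffled = list(y)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype-phenotype association
        null_max.append(max_lod(columns, shuffled))
    null_max.sort()
    return null_max[math.ceil(level * n_perm) - 1]

# Simulated backcross: 60 mice, 20 markers, one planted QTL at marker 5.
rng = random.Random(1)
n_mice, n_markers = 60, 20
columns = [[rng.randint(0, 1) for _ in range(n_mice)] for _ in range(n_markers)]
y = [3.0 * columns[5][i] + rng.gauss(0.0, 1.0) for i in range(n_mice)]
threshold = permutation_threshold(columns, y)
observed = max_lod(columns, y)
print(round(threshold, 2), round(observed, 2))
```

Because each permutation keeps the genotype matrix intact, the threshold automatically reflects the number of markers and the pattern of missing or correlated genotypes in the actual data.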
More on LOD thresholds
Appropriate threshold depends on:
  Size of genome
  Number of typed markers
  Pattern of missing data
  Stringency of significance threshold
  Type of cross (e.g. F2 intercross vs backcross)
  Etc
An example
[Figure: permutation distribution for a trait.]
Modeling multiple QTLs
Advantages
  Reduce the residual variation (trait variation that is not explained by a detected putative QTL) and obtain greater power to detect additional QTLs.
  Identification of (epistatic) interactions between QTLs requires the joint modeling of multiple QTLs.
Interactions between two loci
  Additive: the effect of QTL 1 is the same, irrespective of the genotype of QTL 2, and vice versa.
  Epistatic: the effect of QTL 1 depends on the genotype of QTL 2, and vice versa.
Multiple marker model
Let $y$ = phenotype, $x$ = genotype data.
Imagine a small number of QTL with genotypes $x_1, \dots, x_p$
  $2^p$ or $3^p$ distinct genotypes for backcross and intercross, respectively
We assume that
  $E(y \mid x) = \mu(x_1, \dots, x_p)$, $\mathrm{var}(y \mid x) = \sigma^2(x_1, \dots, x_p)$
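Multiple marker model
Constant variance
  $\sigma^2(x_1, \dots, x_p) = \sigma^2$
Assuming normality
  $y \mid x \sim N(\mu_g, \sigma^2)$
Additivity
  $\mu(x_1, \dots, x_p) = \mu + \sum_j \Delta_j x_j$
Epistasis
  $\mu(x_1, \dots, x_p) = \mu + \sum_j \Delta_j x_j + \sum_{j,k} w_{j,k} x_j x_k$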
Computational problem
N backcross individuals, M markers in all, with at most a handful expected to be near QTL
  $x_{ij}$ = genotype (0/1) of mouse $i$ at marker $j$
  $y_i$ = phenotype (trait value) of mouse $i$
Assuming additivity,
  $y_i = \mu + \sum_j \Delta_j x_{ij} + e$; which $\Delta_j \neq 0$?
Variable selection in linear regression models
Mapping QTL as model selection
Select the class of models
  Additive models
  Additive with pairwise interactions
  Regression trees
[Figure: linear model with inputs x1, x2, …, xN, weights w1, w2, …, wN, and output Phenotype (y).]
  $y = w_1 x_1 + \dots + w_N x_N + \varepsilon$
  $\text{minimize}_w \; (w_1 x_1 + \dots + w_N x_N - y)^2$ ?
Linear Regression
  $\text{minimize}_w \; (w_1 x_1 + \dots + w_N x_N - y)^2 + \text{model complexity}$
Search model space
  Forward selection (FS)
  Backward elimination (BE)
  FS followed by BE
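Forward selection can be sketched as a greedy loop: at each step add the marker that most improves a penalized fit, and stop when no marker helps. The BIC-style score n·log(RSS/n) + k·log(n) is one possible choice of "model complexity" and is an assumption here, as are the simulated data and the helper names.

```python
import math
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bv] for row, bv in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def rss(X, y, cols):
    """Residual sum of squares of least squares on an intercept plus chosen columns."""
    rows = [[1.0] + [row[j] for j in cols] for row in X]
    p = len(cols) + 1
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    w = solve(xtx, xty)
    return sum((yi - sum(wa * ra for wa, ra in zip(w, r))) ** 2
               for r, yi in zip(rows, y))

def forward_selection(X, y):
    n, m = len(y), len(X[0])

    def score(cols):
        # Fit term plus a BIC-style complexity penalty (an assumed choice).
        return n * math.log(rss(X, y, cols) / n) + len(cols) * math.log(n)

    chosen, best = [], score([])
    while len(chosen) < m:
        cand = min((j for j in range(m) if j not in chosen),
                   key=lambda j: score(chosen + [j]))
        s = score(chosen + [cand])
        if s >= best:
            break  # no remaining marker improves the penalized score
        chosen.append(cand)
        best = s
    return chosen

# Simulated backcross: 80 mice, 8 markers, true effects at markers 1 and 4.
rng = random.Random(0)
X = [[rng.randint(0, 1) for _ in range(8)] for _ in range(80)]
y = [2.0 * row[1] - 1.5 * row[4] + rng.gauss(0.0, 0.5) for row in X]
chosen = forward_selection(X, y)
print(chosen)
```

Backward elimination would run the same scoring in reverse, starting from the selected set and dropping markers while the score improves.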
Lasso* (L1) Regression
  $\text{minimize}_w \; (w_1 x_1 + \dots + w_N x_N - y)^2 + \sum_i C\,|w_i|$
Induces sparsity in the solution $w$ (many $w_i$’s set to zero)
Provably selects “right” features when many features are irrelevant
Convex optimization problem
  No combinatorial search
  Unique global optimum
  Efficient optimization
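To make the sparsity claim concrete, here is a small coordinate-descent sketch for the objective above, summed over individuals and with no intercept, as on the slide. Each coordinate update is a soft-thresholding step. Coordinate descent is one of several ways to solve this convex problem and is used here only for illustration; the toy design is chosen to be orthogonal so the result is easy to check by hand.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values within [-t, t] become exactly 0."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def lasso(X, y, C, n_sweeps=200):
    """Minimize sum_i (y_i - sum_j w_j x_ij)^2 + C * sum_j |w_j| by coordinate descent."""
    m = len(X[0])
    w = [0.0] * m
    resid = list(y)  # current residual y - Xw
    for _ in range(n_sweeps):
        for j in range(m):
            col = [row[j] for row in X]
            z = sum(v * v for v in col)
            if z == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(v * (r + w[j] * v) for v, r in zip(col, resid))
            new_w = soft_threshold(rho, C / 2.0) / z
            resid = [r + (w[j] - new_w) * v for v, r in zip(col, resid)]
            w[j] = new_w
    return w

# Orthogonal toy design: y = 2*x1 + 0.1*x3, and x2 is irrelevant.
X = [[1, 1, 1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]
y = [2.1, -2.1, 1.9, -1.9]
w_ols = lasso(X, y, C=0.0)    # plain least squares
w_lasso = lasso(X, y, C=2.0)  # L1-penalized
print(w_ols, w_lasso)
```

With C = 0 the small effect on x3 keeps a nonzero weight. With C = 2 it is set exactly to zero and the large weight is shrunk from 2 to 1.75, which shows both the sparsity and the shrinkage of the L1 term.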
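[Figure: linear model from x1, x2, …, xN with parameters w1, w2, …, wN to Phenotype (y); panels labelled L2 and L1 over the parameters w1 and w2, with the L1 term highlighted.]
* Tibshirani, 1996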
Model selection
Compare models
  Likelihood function + model complexity (e.g. # QTLs)
  Cross validation test
  Sequential permutation tests
Assess performance
  Maximize the number of QTL found
  Control the false positive rate
Interval mapping for multiple QTLs
Composite interval mapping (CIM)
  Has been widely used in practice.
  Performs IM using a subset of marker loci as covariates.
  The key problem concerns the choice of suitable marker loci to serve as covariates.
Multiple interval mapping
  Allows interactions between QTLs
Outline
Statistical methods for mapping QTL
  What is QTL?
  Experimental animals
  Analysis of variance (marker regression)
  Interval mapping
Learning regulatory networks from genetically diverse set of individuals
  Understand how DNA variations perturb the network
From DNA to Trait
…ACTCGGTAGACCTAAATTCGGCCCGG…
…ACCCGGTAGACCTTTATTCGGCCCGG…
…ACCCGGTAGACCTTAATTCGGCCGGG…
:
…ACCCGGTAGTCCTATATTCGGCCCGG…
…ACTCGGTAGTCCTATATTCGGCCGGG…
[Figure: DNA sequence → Trait for Individual 1, ..., Individual N. Each DNA sequence is summarized by sequence variations s1, s2, …, sP with P ≈ 3×10^6; the trait is obesity; in between sits the cell, a complex system (??). Callout: "RNA level", "too faint to be detected".]
Learn the complex web of interactions from data?
  Better understand the trait
  Better detect the causative $s_i$’s
What training data (instances)?
Model Organism [Brem et al, Science 2002]
Genetic perturbation: Strain 0 × Strain 1 → 112 progeny
Genotyping → genotype data: a 0/1 matrix of individuals by 3,000 markers
Expression profiling → expression data: individuals by 6,000 genes
[Figure: Gene → RNA → Protein, with the genotype matrix (markers 1, 2, 3, 4, 5, …, 3,000) and the expression matrix side by side.]
Single-marker expression quantitative trait loci (eQTL) mapping
For each gene, find the marker that is most predictive of its expression level [Yvert G et al. (2003) Nat Gen].
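A minimal sketch of the single-marker scan: for each gene, compute the association between its expression and every marker, and keep the marker with the strongest one. Pearson correlation is used as the measure of "most predictive" here, which is an assumption; a LOD score or t-statistic would rank markers the same way for 0/1 genotypes. The toy matrices are illustrative.

```python
import math

def pearson(a, b):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0.0 or vb == 0.0:
        return 0.0
    return cov / math.sqrt(va * vb)

def best_marker_per_gene(genotypes, expression):
    """genotypes[i][j]: individual i, marker j; expression[i][g]: individual i, gene g."""
    n_markers, n_genes = len(genotypes[0]), len(expression[0])
    marker_cols = [[row[j] for row in genotypes] for j in range(n_markers)]
    result = []
    for g in range(n_genes):
        expr = [row[g] for row in expression]
        corrs = [pearson(col, expr) for col in marker_cols]
        j_best = max(range(n_markers), key=lambda j: abs(corrs[j]))
        result.append((j_best, corrs[j_best]))
    return result

# Toy data: 6 individuals, 3 markers, 2 genes.
G = [[0, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 1], [0, 0, 1], [1, 0, 0]]
E = [[0.1, 3.0], [0.0, 1.0], [2.1, 2.9], [1.9, 1.1], [2.0, 3.1], [0.2, 0.9]]
best = best_marker_per_gene(G, E)
print(best)
```

In the toy data gene 0 is induced by marker 2 (positive correlation) and gene 1 is repressed by marker 0 (negative correlation).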
Traditional Approach: Single Marker
[Figure: genotype data (individuals × markers, 0/1 entries) and expression data (individuals × genes, shown as induced/repressed). The genotype G of marker j is tested against the expression E of gene i. Inset: Marker → Gene → mRNA.]
Genetic variation and regulation
Activity level of Regulator changes the expression levels of the Targets it binds to.
Regulator’s expression is predictive of Targets’ expression ($E_{\text{Regulator}}$ → $E_{\text{Targets}}$)
[Figure: a Regulator and its Targets along a DNA sequence.]
Segal et al., Nature Genetics 2003; Lee et al., PNAS 2006
Regulation variation & mechanisms
Regulator SNPs ⇒ change in regulator function
Regulator’s genotype is predictive of Targets’ expression ($G_{\text{Regulator}}$ → $E_{\text{Targets}}$, alongside $E_{\text{Regulator}}$ → $E_{\text{Targets}}$)
Modeling assumptions [Segal E et al. (2003) Nat Gen]:
  Genes are organized into co-regulated groups of genes (i.e. modules)
  Each module has its own “regulatory program”
[Figure: a Regulator carrying a sequence variation (A vs. C at one site) and its Targets, the co-regulated genes (module).]
Modularity …
  Multiple genes are regulated by the same regulators
  Co-regulated genes have a similar “regulation program”
Candidate regulators (x1, …, xN):
  Sequence variations
  Expression levels of genes that have regulatory roles
“Regulation program” ?
[Figure: regulatory network. Target genes (PHO5, PHM6, SPL2, PHO3, PHO84, VTC3, GIT1, PHO2, TEC1, GPA1, ECM18, UTH1, MEC3, MFA1, SAS5, SEC59, SGS1, PHO4, ASG7, RIM15, HAP1) are grouped into modules. Regulator nodes are either S-nodes (S1, S22, S120, S321, S1011: variation of a certain site on DNA) or gene nodes (expression level of a gene). Legend: arrows from A and B to C mean that A and B regulate the expression of C (A and B are regulators of C).]
Segal et al., Nature Genetics 2003; Lee et al., PNAS 2006
Regulation as Linear Regression
  $\text{minimize}_w \; (w_1 x_1 + \dots + w_N x_N - E_{\text{Targets}})^2$
  $E_{\text{Targets}} = w_1 x_1 + \dots + w_N x_N + \varepsilon$
But we often have very large N … and linear regression gives them all nonzero weight!
Problem: This objective learns too many regulators
[Figure: linear model from candidate regulators x1, x2, …, xN (parameters w1, w2, …, wN) to E_Module, drawn over the regulatory network of modules and sequence variations (S1, S22, S120, S321, S1011).]
Lasso* (L1) Regression
  $\text{minimize}_w \; (w_1 x_1 + \dots + w_N x_N - E_{\text{Module}})^2 + \sum_i C\,|w_i|$
Induces sparsity in the solution $w$ (many $w_i$’s set to zero)
Provably selects “right” features when many features are irrelevant
Convex optimization problem
  No combinatorial search
  Unique global optimum
  Efficient optimization
[Figure: linear model from x1, x2, …, xN with parameters w1, w2, …, wN to E_Module; panels labelled L2 and L1 over the parameters w1 and w2, with the L1 term highlighted.]
* Tibshirani, 1996
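Learning regulatory network
  Cluster genes into modules
  Learn a regulatory program for each module
L1 regression
  $\text{minimize}_w \; (\sum_i w_i x_i - E_{\text{Targets}})^2 + \sum_i C\,|w_i|$
[Figure: regulatory network of modules (genes such as PHO5, PHM6, SPL2, PHO3, PHO84, VTC3, GIT1, PHO2, TEC1, GPA1, ECM18, UTH1, MEC3, MFA1, SAS5, SEC59, SGS1, PHO4, ASG7, RIM15, HAP1) and sequence variations (S1, S22, S120, S321, S1011). Example learned program: module M120 written as a weighted sum with coefficients -3, 0.5 and -1.2 over the regulators S321, MFA1 and GPA1; the pairing of coefficients to regulators is not recoverable from the extracted text.]
Is this predicted relationship “real”?
Lee et al., PLoS Genet 2009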
Learning the regulatory network
Challenges?
  Too large N!
    # regulatory genes + # sequence variations
    For human: 2,000 + 1,000,000
  Redundant features
    $\{x_i, x_j, \dots, x_k\}$ are perfectly correlated
Multiple regression tasks
  Module 1: $\text{minimize}_{w_1} \; (\sum_n w_{1n} x_n - E_{\text{module } 1})^2 + \sum_n C\,|w_{1n}|$
  :
  Module M: $\text{minimize}_{w_M} \; (\sum_n w_{Mn} x_n - E_{\text{module } M})^2 + \sum_n C\,|w_{Mn}|$
[Figure: one linear model per module, from x1, x2, …, xN to E_module 1 (weights w11, w12, …, w1N) through E_module M (weights wM1, wM2, …, wMN), with many weights equal to 0, drawn over the regulatory network.]
What Regulates the P-bodies?
Bad news
  A marker that covers a large region in Chr 14.
  Region contains ~30 genes and ~318 SNPs.
  318 redundant features!
[Figure: region ChrXIV:449,639-502,316 in Strain 0 and Strain 1; regulators of the Puf3 module; genes labelled DHH1, GCN20, GCN1, KEM1, BLM3.]
Challenge: redundant features!
  The 318 selected sequence variations are perfectly correlated.
  Which of the 318 is the real causative variation?
  Experiments for all 318 variations are not feasible!
[Figure: RNA degradation module (genes shown: PHO5, PHM6, SPL2, PHO3, PHO84, VTC3, GIT1, MFA1, PHO4) regulated by S1-318 in ChrXIV:449639-502316.]
s1 s2 … s318
…ACTCGGTAGCCC…TACATTCGGCCCGG…   (“Type A”)
…ACTCGGTAGCCC…TACATTCGGCCCGG…   (“Type A”)
…ACCCGGTAGACC…TTAATTCGGCCGGG…   (“Type B”)
:
…ACCCGGTAGACC…TTAATTCGGCCGGG…   (“Type B”)
All individuals have either TC…ACC (Type A) or CA…TAG (Type B) for S1~S318.
Lee et al., PLOS Genetics 2009
Motivation
Not all sequence variations are equally likely to be causal.
[Figure: a gene and SNPs along ChrXIV: 449,639-502,316.]
  S1: On the protein coding region of a gene involved in RNA degradation
  S2: Not on any gene or a regulatory region
“Regulatory features” F, with one weight (β1, β2, β3, β4, β5, …) per feature
  1. Inside a gene region?
  2. Protein coding region?
  3. Change the protein letter?
  4. Create a stop codon?
  5. Strong conservation?
  :
Redundant features …
Idea: Prioritize SNPs that have “good” regulatory features
Problem: How much weight do we give to different regulatory features?
  Too many weights to estimate using cross-validation
Metaprior Learning
Iterate steps 1, 2 & 3 until convergence.
  1. Compute regulatory potential of SNPs (from regulatory features F and regulatory weights β)
  2. Learn w: “weighted” L1 regularization
     Maximize log P(E|X,W) + log P(W|β,F)
  3. Learn β
     Maximize log P(W|β,F) + log P(β)
[Figure: regulatory features F (e.g. non-synonymous, conservation, AA small ↔ large, cell cycle), shown as 0/1 indicators per SNP, are combined with regulatory weights β (bar chart, axis 0 to 0.3) into regulatory potentials (e.g. 0.3, 0.9). Regulatory programs for Module i, Module i+1 and Module i+2 map x1, x2, …, xN to E_i. Inset: protein sequence alignment between Strain 0 and Strain 1 (MVLT ELVQ VSDASKQLWDI) with a SNP (A vs. G) and an amino acid change (L vs. D).]
Lee et al. PLoS Genet 2009
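The effect of a weighted L1 penalty on perfectly correlated features can be illustrated with a toy sketch. The slide does not give the functional form that links regulatory potential to the penalty, so the per-feature penalties below are hypothetical numbers, and the coordinate-descent solver is an illustration rather than the method of Lee et al.

```python
def soft_threshold(rho, t):
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def weighted_lasso(X, y, penalties, n_sweeps=100):
    """Minimize sum_i (y_i - sum_j w_j x_ij)^2 + sum_j penalties[j] * |w_j|."""
    m = len(X[0])
    w = [0.0] * m
    resid = list(y)  # current residual y - Xw
    for _ in range(n_sweeps):
        for j in range(m):
            col = [row[j] for row in X]
            z = sum(v * v for v in col)
            rho = sum(v * (r + w[j] * v) for v, r in zip(col, resid))
            new_w = soft_threshold(rho, penalties[j] / 2.0) / z
            resid = [r + (w[j] - new_w) * v for v, r in zip(col, resid)]
            w[j] = new_w
    return w

# Two perfectly correlated (identical) features; y = 2 * x.
X = [[1, 1], [-1, -1], [1, 1], [-1, -1]]
y = [2.0, -2.0, 2.0, -2.0]

# Equal penalties: the data cannot distinguish the duplicates.
w_equal = weighted_lasso(X, y, penalties=[1.0, 1.0])
# Hypothetical penalties: feature 1 has the higher regulatory potential,
# so it is given the smaller penalty.
w_weighted = weighted_lasso(X, y, penalties=[4.0, 1.0])
print(w_equal, w_weighted)
```

With equal penalties the solver keeps whichever duplicate it visits first, which is arbitrary. With the smaller penalty on the second feature, all the weight moves to it, which is the sense in which a prior over features can break ties among redundant SNPs.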
Learned regulatory weights
[Figure: two bar charts of learned regulatory weights (x-axis 0 to 0.5), one bar per regulatory feature. Feature groups: Location, AA property change, Gene function, Pairwise feature.]
Yeast regulatory weights (features shown)
  Non-synonymous coding; Stop codon; Synonymous coding; 3' UTR; 500 bp upstream; 5' UTR; 500 bp downstream; Conservation score; Cis-regulation
  Change of average mass (Da); Change of isoelectric point (pI); Change of pK1; Change of pK2; Change of hydrophobicity; Change of pKa; Change of polarity; Change of pH; Change of van der Waals
  Transcription regulator activity; Telomere organization and; Protein folding; Glucose metabolic process; RNA modification; Hexose metabolic process; Cell cycle checkpoint; Proteolysis
  same GO process; same GO function; ChIP-chip binding
Human regulatory weights (features shown)
  Non-synonymous coding; Synonymous coding; Intron; Locus region; Splice site; UTR region; Conservation score; Cis-regulation
  Change of average mass (Da); Change of isoelectric point; Change of pK1; Change of pKa; Change of polarity; Change of pH; Change of van der Waals volume
  cell communication; signal transduction; transcription
What Regulates the RNA degradation module?
[Figure: regulatory potential (y-axis, ticks 0.4 to 0.7) over all 318 variations in the region, plotted along ChrXIV:415,000-495,000 with gene annotations from the Saccharomyces Genome Database (SGD); MKT1 and position ChrXIV:449639 are highlighted.]
Biological validation succeeded!
Lee et al., PLOS Genetics 2009
Predicting Causal Regulators

Region | Zhu et al [Nat Genet 08] | Lirnet (top 3 are considered)
1 | None | SEC18, RDH54, SPT7
2 | TBS1, TOS1, ARA1, CSH1, SUP45, CNS1, AMN1 | AMN1, CNS1, TOS1
3 | None | TRS20, ABD1, PRP5
4 | LEU2, ILV6, NFS1, CIT2, MATALPHA1 | LEU2, PGS1, ILV6
5 | MATALPHA1 | MATALPHA1, MATALPHA2, RBK1
6 | URA3 | URA3, NPP2, PAC2
7 | GPA1 | STP2, GPA1, NEM1
8 | HAP1 | HAP1, NEJ1, GSY2
9 | YRF1-4, YRF1-5, YLR464W | SIR3, HMG2, ECM7
10 | None | ARG81, TAF13, CAC2
11 | SAL1, TOP2 | MKT1, TOP2, MSK1
12 | PHM7 | PHM7, ATG19, BRX1
13 | None | ADE2, ORT1, CAT5

Experimentally supported regulators: GENE (highlighting in the original table is not recoverable here)
8 validated regulators in 7 regions
14 validated regulators in 11 regions
Summary: From Genotype To Phenotype
Statistical methods for mapping QTL
  What is QTL?
  Experimental animals
  Analysis of variance (marker regression)
  Interval mapping
  Multi-marker models
Learning regulatory networks from genetically diverse set of individuals
  Understand how DNA variations perturb the network