Genotype-Phenotype Associationusers.umiacs.umd.edu/.../lectures/lect16/gwas.pdf · nonparametric risk models. Our goal is investigating the relative importance of these three sources

Genotype-Phenotype AssociationCMSC858P Spring 2012Hector Corrada BravoUniversity of Maryland

GWAS• Genome-wide association studies

• Scans for SNPs (or other structural variants)

• that show association with some phenotype

• categorical phenotypes: age-related macular degeneration

• continuous phenotypes (QTL): blood pressure

• Commonly: 10^3 samples, 10^6 SNPs

logistic regression

• Estimate log odds ratio

• f is linear

θ(x) = Pr{y = 1|x}

f(x) = logθ(x)

1− θ(x)

Predictors (genotypes)

Binary outcome, disease/no

disease

logistic regressionf(x) = log

θ(x)1− θ(x)

= β0 + β1x

Encoding genotype data • We usually think of major/minor alleles, where minor allele occurs at a less frequency in the population (e.g., 5%)

• haplotype:minor allele: AA, Aa -> x=0; aa -> x=1major allele: AA,Aa -> x=1; aa -> x=0both:AA->x1=1,x2=1;Aa->x1=1,x2=0,etc...

• genotype (dosage):AA -> x=0; Aa -> x=1; aa-> x=2

Interpretation

P (Y = 1|X = 0)P (Y = 0|X = 0)

= eβ0

P (Y = 1|X = 1)P (Y = 0|X = 1)

= eβ0+β1

Odds of outcome for, e.g, genotype AA

Odds of outcome for, e.g, genotype Aa

Odds-ratio

P (Y = 1|X = 1)/P (Y = 0|X = 1)P (Y = 1|X = 0)/P (Y = 0|X = 0)

= eβ1

GWAS

Discovering association: how unexpected is this odds ratio?

gwas

• Expensive and pervasive...

NHGRI GWA Catalogwww.genome.gov/GWAStudies

Published Genome-Wide Associations through 12/2010, 1212 published GWA at p<5x10-8 for 210 traits

!"#$%&''(%)*+&(%

0 3 6 9 12 16 20 24 28 32 36 40 44

0

10

20

30

40

50

60

,)-./01

'-23'%4053)%016-)7%%

84-3

%5(9%8)&:1%373(%%

;3(<&1

(3%=&

%>3<

*++(%?%=)3*=@

31=%

!4*-2&@*%A3BC&40*+&

1D%

$E01%<0.@31

=*+&

1%

;3'%5(%1&1

/)3'

%F*0)%2&4&)%%

G)32E43(%

>30.F=%

Most diseases here

GWAS

• Testing for marginal effects is limited

• Epistasis, interactions

• Environment/risk factors, unaccounted dependencies

• Not all SNPs are created equal (annotation)

Examining the relative influence of familial, genetic,and environmental covariate information in flexiblerisk modelsHéctor Corrada Bravoa,1, Kristine E. Leeb, Barbara E. K. Kleinb, Ronald Kleinb, Sudha K. Iyengarc, and Grace Wahbad,1

aDepartment of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205; bDepartment of Ophthalmology and Visual Science,University of Wisconsin, Madison, WI 53706; cDepartments of Epidemiology and Biostatistics, Genetics, and Ophthalmology, Case Western Reserve Univer-sity, Cleveland, OH 44106; and dDepartments of Statistics, Biostatistics and Medical Informatics, and Computer Sciences, University of Wisconsin, Madison,WI 53706b;

Contributed by Grace Wahba, March 19, 2009 (sent for review February 22, 2009)

We present a method for examining the relative influence of famil-ial, genetic, and environmental covariate information in flexiblenonparametric risk models. Our goal is investigating the relativeimportance of these three sources of information as they areassociated with a particular outcome. To that end, we developeda method for incorporating arbitrary pedigree information in asmoothing spline ANOVA (SS-ANOVA) model. By expressing pedi-gree data as a positive semidefinite kernel matrix, the SS-ANOVAmodel is able to estimate a log-odds ratio as a multicomponentfunction of several variables: one or more functional componentsrepresenting information from environmental covariates and/orgenetic marker data and another representing pedigree relation-ships. We report a case study on models for retinal pigmentaryabnormalities in the Beaver Dam Eye Study. Our model verifiesknown facts about the epidemiology of this eye lesion—foundin eyes with early age-related macular degeneration—and showssignificantly increased predictive ability in models that include allthree of the genetic, environmental, and familial data sources.The case study also shows that models that contain only twoof these data sources, that is, pedigree-environmental covariates,or pedigree-genetic markers, or environmental covariates-geneticmarkers, have comparable predictive ability, but less than themodel with all three. This result is consistent with the notions thatgenetic marker data encode—at least in part—pedigree data, andthat familial correlations encode shared environment data as well.

SS-ANOVA | retinal pigmentary abnormalities | RKHS | pedigrees

S moothing spline ANOVA (SS-ANOVA) models (1–4) havea successful history modeling ocular traits. In particular, an

SS-ANOVA model of retinal pigmentary abnormalities,! definedby the presence of retinal depigmentation and increased retinalpigmentation (5, 6), was able to show a nonlinear protective effectof high total serum cholesterol for a cohort of subjects in theBeaver Dam Eye Study (BDES) (2). However, multiple studieshave reported that risk variants at two loci, near the CFH andARMS2 genes, show strong association with the development ofage-related macular degeneration (AMD) (7–18), a leading causeof blindness and visual disability (19). Because retinal pigmentaryabnormalities are an early sign of age-related macular degenera-tion, a leading cause of blindness and visual disability in its latestages (19), we want to make use of genotype data for these twogenes to extend the SS-ANOVA model for pigmentary abnormal-ities risk. For example, by extending the SS-ANOVA model of Linet al. (2) with SNP rs10490924 in the ARMS2 gene region, we wereable to see that the protective effect of total serum cholesteroldisappears in older subjects that have the risk variant of this SNP.The supporting information (SI) Appendix replicates the model ofLin et al. (2) and shows the extended model, including the SNPdata. Smoothing spline logistic regression models are able to teaseout these types of complex nonlinear relationships that would notbe detected by more traditional parametric models—linear, or ofprespecified form.

Beyond genetic and environmental effects, we want to extendthe SS-ANOVA model for pigmentary abnormalities with famil-ial data. For instance, pedigrees (see Representing Pedigree Dataas Kernels) have been ascertained for a large number of subjectsof the BDES. In this article we present a general method that isable to incorporate arbitrary relationships encoded as a graph,e.g., pedigree data, into SS-ANOVA models. This method allowsone to examine the importance of relationships between subjectsrelative to other model terms in a predictive model.

We estimate SS-ANOVA models of the log-odds of pigmentaryabnormality risk of the form

f (ti) = µ + g1(ti) + g2(ti) + h(z(ti)),

where g1 is a term that includes only genetic marker data (e.g.,SNPs), g2 is a term containing only environmental covariate data,and h is a smooth function over a space that encodes relationshipsbetween subjects. In this relationship space, each subject ti may bethought of as being represented by a “pseudo-attribute” z(ti). Inthe remainder of the article we will refer to model terms g1 andg2 as S (for SNP) and C (for covariates), respectively, and term has P (for pedigrees); so, a model containing all three componentswill be referred to as S+C+P.

Formally, this SS-ANOVA model is defined over the tensorsum of multiple reproducing kernel Hilbert spaces: one or morecomponents representing information from environmental and/orgenetic covariates for each subject (corresponding to terms g1 andg2 above) and another representing pedigree relationships. Themodel is estimated as the solution of a penalized likelihood prob-lem with an additive penalty including a term for each reproducingkernel Hilbert space (RKHS) in the ANOVA decomposition, eachweighted by a coefficient. From this decomposition we can mea-sure the relative importance of each model component (S, C, orP). Our main tool in extending SS-ANOVA models with pedi-gree data is the Regularized Kernel Estimation framework (20).More complex models involving interactions between these threesources of information are possible but beyond the scope of thisarticle.

In Smoothing-Spline ANOVA Models we discuss the semiparam-etric risk models we use in this article; in Representing PedigreeData as Kernels we define pedigrees and introduce our method

Author contributions: K.E.L., B.E.K.K., R.K., and S.K.I. designed research; K.E.L., B.E.K.K.,R.K., and S.K.I. performed research; H.C.B. and G.W. contributed new reagents/analytictools; H.C.B., K.E.L., and G.W. analyzed data; and H.C.B. and G.W. wrote the paper.

The authors declare no conflict of interest.

Freely available online through the PNAS open access option.1To whom correspondence may be addressed. E-mail: [email protected] [email protected].!Hereafter, we will use the term pigmentary abnormalities when referring to retinal pig-mentary abnormalities.

This article contains supporting information online at www.pnas.org/cgi/content/full/0902906106/DCSupplemental.

8128–8133 PNAS May 19, 2009 vol. 106 no. 20 www.pnas.org / cgi / doi / 10.1073 / pnas.0902906106

• Environment/risk factors, unaccounted dependencies

• How to incorporate subject dependence

Splines and BDES

• History of Smoothing Spline (SS) models for analyzing BDES data

[Wahba et al. 1998a,b,1999,2000,2002,2006]

• In particular, SS-ANOVA model of pigmentary abnormalities

[Ann. Statistics 28 (2000)]

SS-ANOVA• [Ann. Statistics 28 (2000)]

• Model for pigmentary abnormalities (PA), female BDES I subjects

f(t) = µ + f1(sysbp) + f2(chol) + f12(sysbp, chol) + dage · age + dbmi · bmi +dhorm · I1(horm) + dhist · I2(hist),

hormone replacement yes/

nohistory of heavy

drinking

SS-ANOVA

cholesterol

prob

abilit

y

0.20.40.60.8

100 200 300 400 500

: age 55 : bmi 24.6

: age 66 : bmi 24.6

100 200 300 400 500

: age 73 : bmi 24.6

: age 55 : bmi 28

: age 66 : bmi 28

0.20.40.60.8

: age 73 : bmi 28

0.20.40.60.8

: age 55 : bmi 32.2

100 200 300 400 500

: age 66 : bmi 32.2

: age 73 : bmi 32.2

sysbp = 109sysbp = 124sysbp = 139sysbp = 160

nonlinear protective effect of

cholesterol

SS-ANOVA

• Recent results linking variation in specific genetic regions and AMD (age-related macular degeneration)

• In particular, CFH and LOC387715 (ARMS2) genes

SS-ANOVA (w/ ARMS2)

cholesterol

prob

abilit

y

0.20.40.60.8

100 200 300 400

: age 48.5 : snp2 11

: age 59.5 : snp2 11

100 200 300 400

: age 69.5 : snp2 11

: age 80.5 : snp2 11

: age 48.5 : snp2 12

: age 59.5 : snp2 12

: age 69.5 : snp2 12

0.20.40.60.8

: age 80.5 : snp2 12

0.20.40.60.8

: age 48.5 : snp2 22

100 200 300 400

: age 59.5 : snp2 22

: age 69.5 : snp2 22

100 200 300 400

: age 80.5 : snp2 22

sysbp = 109sysbp = 124sysbp = 139sysbp = 160

protective effect gone

Pedigrees100026

100012

100013

100019

100025

100011

100014

100020

101430

100024

100015

100021

101429

100023

100016

100113

101184

100017

100114

101183

100018

100115

101432

100022

101441

101431

100116

101442

101434

101185

101190

101433

101435

101187

101436

101186

101437101439101438101188101189101440 101189100116

male female

PA present

PA absent

Pedigree Distance

• Use Malecot’s kinship coefficient (φ):

• for subjects i and j: the probability that randomly chosen alleles, one from each subject, are identical by descent

• e.g. parent-offspring: 1/4

• e.g. siblings: 1/4

• Pedigree distance: (1-2 φ)

Relationship Graph

We will extend the SS-ANOVA model with an encoding of this relationship graph

Example Pedigree Graph

101188

101440

100020

101435

100018

0.875

1

0.5

1 1

0.875

11

0.75

1

Relationship Distance

sibsavuncular

first-cousinsunrelated

0.5000.7500.875

1

Metric embeddings

Interpretation: embedding gives relationship pseudo-attributes over which smooth functions can be estimated

−1000 −500 0 500 1000 1500−0.2

0−0

.15−0

.10−0

.05

0.0

0 0

.05

0.1

0 0

.15

0.2

0

−1.5−1.0

−0.5 0.0

0.5 1.0

1.5

!

!

!

!

!

26

40

10

35

8

Extensions

f(t) = µ + dSNP1,1 · I(X1 = 12) + dSNP1,2 · I(X1 = 22) +dSNP2,1 · I(X2 = 12) + dSNP2,2 · I(X2 = 22) +

f1(sysbp) + f2(chol) + f12(sysbp, chol) +dage · age + dbmi · bmi + dhorm · I1(horm) +

dhist · I2(hist) + dsmoke · I3(smoke) +h(z(t))

SNPdata

environmentalcovariates

pedigreedata

Comparison to Covariate-Only Model

S−only S+C P−only S+P C+P S+C+P

Percent change in mean AUC w.r.t. C−only model

!!AU

CC""only

−10

−50

510

[Corrada Bravo, et al., PNAS 2009]

Epistasis

• Testing marginal effects is limited

• We want to test interactions (epistasis)

• Modeling is straightforward:

• add non-linear interaction terms to logistic regression model

• Computationally, it’s a problem

• we started with 10^6 SNPs....

[12:05 19/10/2010 Bioinformatics-btq529.tex] Page: 2856 2856–2862

BIOINFORMATICS ORIGINAL PAPER Vol. 26 no. 22 2010, pages 2856–2862doi:10.1093/bioinformatics/btq529

Genetics and population analysis Advance Access publication September 24, 2010

RAPID detection of gene–gene interactions in genome-wideassociation studiesDumitru Brinza1, Matthew Schultz2, Glenn Tesler3 and Vineet Bafna4,!1Life Technologies, Foster City, CA, 2Graduate Bioinformatics Program, 3Department of Mathematics and4Department of Computer Science and Engineering, Institute for Genomic Medicine, University of California, SanDiego, CA, USAAssociate Editor: Jeffrey Barrett

ABSTRACTMotivation: In complex disorders, independently evolving locuspairs might interact to confer disease susceptibility, with only amodest effect at each locus. With genome-wide association studieson large cohorts, testing all pairs for interaction confers a heavycomputational burden, and a loss of power due to large Bonferroni-like corrections. Correspondingly, limiting the tests to pairs that showmarginal effect at either locus, also has reduced power. Here, wedescribe an algorithm that discovers interacting locus pairs withoutexplicitly testing all pairs, or requiring a marginal effect at eachlocus. The central idea is a mathematical transformation that maps‘statistical correlation between locus pairs’ to ‘distance between twopoints in a Euclidean space’. This enables the use of geometricproperties to identify proximal points (correlated locus pairs), withouttesting each pair explicitly. For large datasets ("106 SNPs), thisreduces the number of tests from 1012 to 106, significantly reducingthe computational burden, without loss of power. The speed ofthe test allows for correction using permutation-based tests. Thealgorithm is encoded in a tool called RAPID (RApid Pair IDentification)for identifying paired interactions in case–control GWAS.Results: We validated RAPID with extensive tests on simulatedand real datasets. On simulated models of interaction, RAPID

easily identified pairs with small marginal effects. On thebenchmark disease, datasets from The Wellcome Trust CaseControl Consortium, RAPID ran in about 1 CPU-hour per dataset,and identified many significant interactions. In many cases, theinteracting loci were known to be important for the disease, but werenot individually associated in the genome-wide scan.Availability: http://bix.ucsd.edu/projects/rapidContact: [email protected] information: Supplementary data are available atBioinformatics online.

Received on July 12, 2010; revised on September 3, 2010; acceptedon September 12, 2010

1 INTRODUCTIONRecent technological developments in sequencing and genotypinghave made it feasible to conduct genome-wide scans of largepopulation cohorts to find genetic markers for common diseases (TheWellcome Trust Case Control Consortium, 2007). Nevertheless,significant challenges remain. Many genome-wide association

!To whom correspondence should be addressed.

studies (GWASs) seek to associate each marker with the diseasephenotype. As multiple hypotheses are generated, individualassociations must have large effect to show up as significant.In complex disorders, many independently evolving loci mightinteract to confer disease susceptibility, with only a modest effect ateach locus. Here, we focus on detecting such interactions.

Detecting k-locus interactions in GWAS on large populationsis computationally and statistically challenging, even when k =2.A test involving all pairs of m markers, with a case–controlpopulation of n individuals, involves O(nm2) computations. ForGWAS, it is not atypical to have n"103, m"106 makingthese computations, especially with permutation-based testsof significance, intractable. A straightforward (Bonferroni-like)correction for the multiple tests would result in significant loss ofsensitivity.

Therefore, many strategies for two-locus interaction testing arebased on a two-stage, filtering approach. In the first stage (the filterstage), the objective is to discard the vast majority of locus pairs,while retaining the truly interacting pairs. If the filtering stage isfast and efficient (only a small fraction of all pairs are retained),then computationally intensive tests of association can be performedon the few remaining candidate pairs in a second, scoring, stage.For a filtering algorithm to be effective, it must have (a) speed, inthat the number of computations scale linearly with the size of thedata; (b) sensitivity/power (truly interacting pairs are retained); and,(c) efficiency (most pairs are discarded). Fast and efficient filtersallow non-parametric permutation tests to be employed to assesssignificance. With the advent of deep sequencing, the number ofvariants considered will grow far beyond 106 markers, and designsof filters will be critical to GWAS analysis of interactions.

While many approaches to detecting interactions have beenproposed (see Cordell, 2009, for an excellent review), the design offilters has not been investigated explicitly. A recent approach, bothpragmatic and effective for filtering, is based on the assumptionthat interacting pairs of loci should also show a marginal effect ateach locus (Marchini et al., 2005). Here, the filtering stage consistsof single marker tests at each locus. The scoring stage is thenlimited to pairs in which either one, or both loci, are individuallyassociated. In either strategy, the filter speed is high, as single-markeranalysis scales linearly with the number of loci and individuals.Empirical results show that only a small fraction of the loci showa marginal effect, leading to high efficiency. However, as Marchiniet al. point out, there is some loss of power in employing thesefilters, particularly in interaction models where the marginal effectsof the individual loci are small. Figure 1a provides a cartoon

2856 © The Author 2010. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

A filtering approach:

• Discover possible interactions quickly

• Test good candidates completely

RAPID

• If two SNPs (x and y) associate with disease (d) then at least one of the following must hold:

1. x associates with d

2. y associates with d

3. x associates with y in cases

4. x associates with y in controls

• RAPID finds SNPs where 3 holds

RAPID

• Look at cases only, and define vector for each SNP as:

vx(a) =a− Px√

n�

Px(1− Px)

0,1Proportion

of 1s

RAPID RAPID

dist(vx, vy) =�

2− 2�

χ2x,y/n

Associationbetween x and y

Statistical association is now a geometric problem

RAPID• Use random projections to find possible

interacting pairs

Hash(x, r, B) =

�|vx · r|

B

�

RAPID• Do this repeatedly, to avoid false positives

Interactions/Epistasis

• A MAJOR problem

• Inherently computational and statistical

• We are nowhere close

• We will be inundated with data (sequencing)

Learning a Prior on Regulatory Potential from eQTL DataSu-In Lee1, Aimee M. Dudley2, David Drubin3, Pamela A. Silver3, Nevan J. Krogan4, Dana Pe’er5, Daphne

Koller1*

1Computer Science Department, Stanford University, Stanford, California, United States of America, 2 Institute for Systems Biology, Seattle, Washington, United States of

America, 3Department of Systems Biology, Harvard Medical School, Boston, Massachusetts, United States of America, 4Department of Cellular and Molecular

Pharmacology, University of California San Francisco, San Francisco, California, United States of America, 5Department of Biological Sciences, Columbia University, New

York, New York, United States of America

Abstract

Genome-wide RNA expression data provide a detailed view of an organism’s biological state; hence, a dataset measuringexpression variation between genetically diverse individuals (eQTL data) may provide important insights into the genetics ofcomplex traits. However, with data from a relatively small number of individuals, it is difficult to distinguish true causalpolymorphisms from the large number of possibilities. The problem is particularly challenging in populations withsignificant linkage disequilibrium, where traits are often linked to large chromosomal regions containing many genes. Here,we present a novel method, Lirnet, that automatically learns a regulatory potential for each sequence polymorphism,estimating how likely it is to have a significant effect on gene expression. This regulatory potential is defined in terms of‘‘regulatory features’’—including the function of the gene and the conservation, type, and position of geneticpolymorphisms—that are available for any organism. The extent to which the different features influence the regulatorypotential is learned automatically, making Lirnet readily applicable to different datasets, organisms, and feature sets. Weapply Lirnet both to the human HapMap eQTL dataset and to a yeast eQTL dataset and provide statistical and biologicalresults demonstrating that Lirnet produces significantly better regulatory programs than other recent approaches. Wedemonstrate in the yeast data that Lirnet can correctly suggest a specific causal sequence variation within a large, linkedchromosomal region. In one example, Lirnet uncovered a novel, experimentally validated connection between Puf3—asequence-specific RNA binding protein—and P-bodies—cytoplasmic structures that regulate translation and RNA stability—as well as the particular causative polymorphism, a SNP in Mkt1, that induces the variation in the pathway.

Citation: Lee S-I, Dudley AM, Drubin D, Silver PA, Krogan NJ, et al. (2009) Learning a Prior on Regulatory Potential from eQTL Data. PLoS Genet 5(1): e1000358.doi:10.1371/journal.pgen.1000358

Editor: Alan M. Moses, University of Toronto, Canada

Received June 27, 2008; Accepted December 29, 2008; Published January 30, 2009

Copyright: ! 2009 Lee et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricteduse, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by an NIH/NHGRI Genome Scholar/ Faculty Transition award and U.S. Department of Energy Genomes to Life Grant (AD),National Science Foundation Award DBI-0345474 (SL and DK), Burroughs Welcome Fund CASI award and an NIGMS Center of Excellence grant (DP), an NRSApostdoctoral fellowship (DD), funding from the NIH (PS), and a Sandler Family Fellowship (NJK).

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected]

Introduction

The potential for using comprehensive data sets, such as RNAexpression data, as a means for uncovering complex genetic traits hasled to the production of eQTL data – gene expression data across apopulation of genetically diverse individuals – in a variety of differentorganisms [1–6]. One application of this data type is the use of subtleperturbations in the regulatory network induced by natural geneticvariations to reveal the regulatory interactions and influences. Thesedata thus provide a unique opportunity for uncovering the cell’sregulatory structure, and for revealing the genetic basis forphenotypic traits. Many approaches have been developed thatattempt to identify one or more genetic regions containingpolymorphism(s) that cause a change in gene expression [1–3,7,8].Some approaches [7,8] expand these ideas by searching for a moreintegrated regulatory network, where targets are viewed as affectednot only by changes in genotype, but also by changes in the activitylevel of regulatory proteins, estimated by their mRNA levels. Thesemethods have been used successfully to identify important regulatoryrelationships, including some that underlie key phenotypic traits.A key challenge in the application of these methods is that the

number of candidate regulators is enormous relative to the amount

of available data, making it difficult to robustly identify the correctregulator. This problem is exacerbated when multiple regulatorsare correlated, and therefore many regulators have similarpotential to explain the expression of their targets. Unfortunately,correlated regulators are the rule, rather than the exception, bothfor sequence polymorphisms (due to linkage disequilibrium) andfor regulatory genes identified by gene expression signature (due toco-expression). In these cases, methods that attempt to recognizeregulatory relationships are often forced to make choices that arearbitrary, misleading, or non-specific. For example, most linkage-based approaches identify only a chromosomal region, leaving ahuman to predict the true causal polymorphism(s) within theregion. This approach results in a large number of hypotheses forexperimental testing, especially in higher organisms, such ashumans where chromosomal regions are often very large andmethods of experimental validation are time and labor intensive.In this study, we propose a novel approach based on the

observation that not all candidate regulators are equally likely toplay a causative role in gene expression. Indeed, researchers oftenmanually select among candidate polymorphisms, favoring thosethat are in conserved regions, those that produce significantchanges in protein sequence, or those that lie in functionally

PLoS Genetics | www.plosgenetics.org 1 January 2009 | Volume 5 | Issue 1 | e1000358

SNP annotation•Outcome is gene expression (eQTL)•The goal is to learn regulatory programs•Potential for a mutation to have an effect on expression depends on SNP features

experimental data supporting Lirnet’s computational prediction,including the causal role of Mkt1.The resulting regulatory network for the yeast data and the

software are freely available on our website http://dags.stanford.edu/lirnet/; the learned network can be effectively explored usingour visualization tool, downloadable from the same website.

Results

Method OverviewWe briefly review the Lirnet method, referring to the Methods for

a full description. Lirnet uses genotype and expression data ofgenetically diverse individuals (eQTL data) and aims to learn a

regulatory prior concurrently with reconstructing a regulatorynetwork. Building on earlier work [7,14], Lirnet clusters genes intomoduleswith the assumption that expression of the target genes in eachmodule is governed by the same regulatory program. As with severalother methods for the reconstruction of regulatory networks, Lirnetcan accommodate two types of regulators: values of genotype markers(genotype regulators), representing genetic polymorphisms onchromosomal regions [1–3]; and expression levels of genes that areknown to have regulatory roles (expression regulators), representingactivity levels of genes that might regulate that module [7,9,14,15].Lirnet’s regulatory programs are based on linear regression, a

choice designed to allow for the incorporation and learning ofregulatory potentials. For each module m, the expression levels of

Figure 1. Outline of our approach. Our algorithm, called Lirnet, aims to learn the regulatory potential of an individual SNP, simultaneously withthe regulatory network from an eQTL data set. The regulatory potential of a regulator is defined as a function of its regulatory features, such as theconservation of a SNP or the function of a gene (Figure 2, Tables S1, S2, S3). The weight of each regulatory feature is called the regulatory prior. Allthree components – the regulatory programs, the regulatory potentials, and the regulatory priors – are learned from data, in an unbiased way, byiterating the following three steps: (i) Lirnet takes as input the regulatory potentials for each regulator, and constructs a set of regulatory programsfor the genes in the data, using the regulatory potentials to bias the choice of active regulators used. In the first iteration, the regulatory potentialsare taken to be uniform. (ii) Lirnet takes as input the regulatory programs, and learns which types of regulators are more predictive of their putativetargets (which ones occur more often in the learned regulatory programs), and adjusts the regulatory prior to match the observed trends. (iii) Lirnettakes as input the regulatory priors, and computes the regulatory potential of each SNP by computing the total contribution of its regulatoryfeatures, weighted by the learned regulatory priors. The regulatory potential of each chromosomal region (genotype regulator) is then computed byaggregating the contributions of the individual SNPs in the region.doi:10.1371/journal.pgen.1000358.g001

Learning Regulatory Potential from eQTL Data


Regulatory programs

Regulatory Potentials for Regions (G-Regulators)Based on the regulatory features of each individual SNP, we

modeled the regulatory potential of each genetic marker, representinghow likely sequence variations on the marker’s chromosomalregion regulate expression levels of genes. We also defined aregulatory potential for each e-regulator, representing how likelythe regulator’s expression is to regulate other genes’ expression.These potentials are based on the regulatory potential of

individual SNPs. We model the probability that each SNP n causesexpression variation (regulatory potential of n) as:

Pr SNP n causes variation in expression levels of genes! "

~sigmoidX

kbkfn,k

! ",

!1"

sigmoid t! "~1= 1zexp {t! "! ",

where bk is the parameter called regulatory prior that determines theimpact of each regulatory feature on the regulatory potential:higher values of bk encode the fact that the presence of the featurefnk increases the probability of having a regulatory effect. Thelearning algorithm of Lirnet automatically estimates the value ofthe b parameters from data. In our analysis, we focus only onregulatory features that are likely to increase the regulatorypotential, and hence restrict bk to be non-negative; this assumptioncan easily be relaxed in the context of other feature sets. A SNPwith many important regulatory features (with high b’s) will have ahigher regulatory prior, but the sigmoid function introduces asaturation effect, preventing the regulatory potential fromincreasing unboundedly and swamping the data.Due to linkage disequilibrium, each marker i can represent

genotypes of the chromosomal region where it resides. Wetherefore define the regulatory potential of each genetic markeras an aggregate of the regulatory potentials of the individual SNPsin the corresponding chromosomal region. We assigned each SNPto the region associated with its nearest genotyped marker (or tagSNP). Then, for each region r, we aggregated the contributions ofall SNPs (in the region), each modeled based on (1), by summingthem up and taking a sigmoid function:

Pr Region r is causal! "~

sigmoidX

n[ SNPs in region rf g2|sigmoid

Xkbkfn,k

! "{1

! "! ":!2"

Therefore, a region that contains a number of SNPs with highregulatory potentials is likely to have a high regulatory potential,but the outer-most sigmoid function again prevents it fromincreasing unboundedly. We note that a region that contains alarge number of moderately relevant SNPs can also achieve a highregulatory potential. This method of aggregation tends to preferregions with more SNPs, which is arguably justified, as they arealso more likely to contain a causal polymorphism. However,other methods of aggregation are also plausible. We experimentedwith several other approaches; the one selected achieved thehighest performance in prediction of expression profiles in testdata not used for training the model.

Regulatory Potentials for Expression RegulatorsWe also model the regulatory potential of candidate expression

regulators based on their regulatory features. We used theregulatory features of SNPs (Table S1) for constructing those of

an expression regulator. The regulatory features consist of fivecategories: (1) 7 features each representing the number of SNPs inthe gene region having one of the features 1 & 12–17 in Table S1;(2) 1 feature representing the conservation score of the gene region(analogous to 18 in Table S1); (3) 1 binary feature indicatingwhether the gene is cis-regulated (analogous to 19 in Table S1); (4)87 (for yeast)/ 48 (for human) binary features indicating whetherthe gene belongs to each of the GO categories listed in Table S12(analogous to 20 in Table S1); and (5) three pairwise binaryfeatures indicating whether the gene belongs to a GO processcategory enriched in the module, whether the gene belongs to aGO function category enriched in the module and whether thegene is the TF whose putative binding targets are enriched in themodule (analogous to 21–23 in Table S1).We define the regulatory potential of r to be the probability that

each candidate e-regulator r causes expression variation, which wemodel as follows:

Pr r is causal! "~sigmoidX

kakgr,k

! ", !3"

where grk represents the k’th regulatory feature of e-regulator r(explained above) and ak is the weight assigned to the k’thregulatory feature.

Learning Regulatory Programs using the Lirnet AlgorithmLirnet attempts to reconstruct regulatory programs that define

the regulatory interactions between each group of co-regulatedgenes (called a module) and its regulatory factors (regulators).Candidate regulators of a module consist of binary genotypevalues of genetic markers and expression levels of genes that arenot in the module. We model the expression level of each gene g ina module m (denoted by ym,g) as a linear combination of thepotential regulators (denoted by x1,…,xn):

ym,g~wm,1x1zwm,2x2z . . .zwm,nxnze,

for all g0s, in module m,!4"

where e represents a zero mean Gaussian noise, and x and y arestandardized.Our objective is to estimate the weights (wm,1,…,wm,n) for each

module m, from the data that best reflect the regulatoryrelationship between x’s and y. More precisely, given x and y, weaim to construct the network by maximizing the joint log-likelihood Log P(w,y|x) = Log P(y|x,w)+Log P(w), where for eachmodule and its regulators P(y|x,w),N(Siwixi,s

2) and P(w)represents the prior probability distribution of w. We model theprior probabilities on the weights based on the regulator’sregulatory potential: For a regulator r, which can be either aregion or an e-regulator, the prior probability is modeled as:

Pr wr! "!exp {Cr wrj j! ", !5"

Cr~C1 Pr Regulator r is causal! "

zC0 1{Pr Regulator r is causal! "# $:

The regulatory potentials, Pr(Regulator r is causal), are definedin (2) and (3) as functions of b and a, respectively. C0 and C1

represent the maximum and minimum regularization parametersCi, respectively. The prior on the weight wr is an L1 prior, whichtends to move the weights of less relevant coefficients to 0 [16]; thelarger Cr, the stronger the bias towards 0. As the regulatory



gene exp. SNPs

linear regression

Regulatory Potential





~sigmoidX

kbkfn,k

! ",

!1"





sigmoidX


Xkbkfn,k

! "{1

! "! ":!2"







kakgr,k

! ", !3"
















SNP features

Figure 2. Learned regulatory priors in yeast and human. Shown are the maximum effect of the different regulatory features, given the learnedregulatory priors for the different regulatory features, in (A) yeast and (B) human (CEU) data sets. Each bar lists the maximum contribution that a givenregulatory feature can make to the regulatory potential: the feature’s regulatory prior, multiplied by the difference between its maximal and minimalvalue. For clarity, only the regulatory features whose regulatory priors are greater than 0.05 are shown in this graph. The full list of regulatory priors,including that for human YRI dataset, can be found in Tables S2, S3. (*) As described in Methods, the pairwise features are constructed based on -log(p-value), indicating the enrichment of the corresponding regulator’s putative targets in the module. Since these values have much highervariation than others, for a more clear and intuitive presentation, we report the amount of contribution made by an increase of the 2log10(p-value)by 3.doi:10.1371/journal.pgen.1000358.g002



SNP features

Yeast dataset

Regulatory programs





~sigmoidX

kbkfn,k

! ",

!1"





sigmoidX


Xkbkfn,k

! "{1

! "! ":!2"







kakgr,k

! ", !3"
















gene exp. SNPs

linear regression





~sigmoidX

kbkfn,k

! ",

!1"





sigmoidX


Xkbkfn,k

! "{1

! "! ":!2"







kakgr,k

! ", !3"
















Hierarchical model

Estimationminimize

�

module m

�

gene g

�ym,g −

�

regulator r

wm,rxr

�2+

�

regulator r

Cr|wm,r| + D�

regulator r

w2m,r

+ E�

k

β2k

expression fit SNP selectionuses regulatory

potential

Parameters estimated iteratively

(Figure 8B; Table S9), making a systematic evaluation of thecandidates a significant effort, even in a genetically tractableorganism like yeast. We therefore used Lirnet to evaluate the genesin the region based on their regulatory potential for the PTRmodule. As shown in Figure 8B, MKT1 is ranked as the highestscoring gene in terms of the learned regulatory potential. The

regulatory potential of MKT1 is within the top 1% over all genes(Figure S2).Mkt1 interacts with the Poly(A)-binding protein associated

factor, Pbp1, and is present at the 39 end of RNA transcriptsduring translation [13]. Mkt1 contains a highly-conservednuclease domain, homologous to the human XPG endonuclease

Figure 5. The Zap1 module. (A) (i) The mRNA expression profiles (log2 ratios) of the module’s 10 target genes, where the rows are genes and thecolumns are strains. (ii) The module is regulated by five predicted regulators, where the two that have the most significant coefficients are theexpression pattern of ZAP1 and a genetic region on chromosome 10 containing ZAP1. The bar on the left of each regulator represents its coefficientin the regulatory program: the length encodes its absolute value, purple represents a negative weight and blue a positive one. (iii) Six of the targetgenes (ADH4, ZTR3, YNL254C, YGL258W, ZPS1/YOL154W, and YOR387C) were identified as probable Zap1 targets based on the presence of aconsensus ZRE element and RNA expression patterns [27]. (B) The genetic region on chromosome 10, with the inferred regulatory potentials for eachof the SNPs it contains (Table S7). Also shown are the regulatory features that contributed the most to the selection of a SNP in Zap1 as the causalpolymorphism: a known binding relationship between Zap1 and two of the target genes, the presence of non-synonymous coding changes and theireffect on various protein properties, and the gene’s annotation as having transcriptional regulator activity. All the other minor regulators of thismodule (Dhh1, Gcr1 and Gis2) are not located in this region; they are in chr 4, 16 and 14, respectively.doi:10.1371/journal.pgen.1000358.g005



Where are SNPs with largest regulatory potential?

Documents

Genotype-Phenotype Associationusers.umiacs.umd.edu/.../lectures/lect16/gwas.pdf · nonparametric risk models. Our goal is investigating the relative importance of these three sources