View
0
Download
0
Category
Preview:
Citation preview
1
Transethnicgeneticcorrelationestimatesfromsummarystatistics
BrielinC.Brown,AsianGeneticEpidemiologyNetwork-Type2Diabetes(AGEN-T2D)Consortium,ChunJimmieYe,AlkesL.Price,NoahZaitlen
Abstract
Theincreasingnumberofgeneticassociationstudiesconductedinmultiple
populationsprovidesunprecedentedopportunitytostudyhowthegeneticarchitectureof
complexphenotypesvariesbetweenpopulations,aproblemimportantforbothmedicaland
populationgenetics.Herewedevelopamethodforestimatingthetransethnicgenetic
correlation:thecorrelationofcausalvarianteffectsizesatSNPscommoninpopulations.We
takeadvantageoftheentirespectrumofSNPassociationsanduseonlysummary-level
GWASdata.Thisavoidsthecomputationalcostsandprivacyconcernsassociatedwith
genotype-levelinformationwhileremainingscalabletohundredsofthousandsof
individualsandmillionsofSNPs.Weapplyourmethodtogeneexpression,rheumatoid
arthritis,andtype-twodiabetesdataandoverwhelminglyfindthatthegeneticcorrelationis
significantlylessthan1.Ourmethodisimplementedinapythonpackagecalledpopcorn.
Introduction
Manycomplexhumanphenotypesvarydramaticallyintheirdistributionsbetween
populationsduetoacombinationofgeneticandenvironmentaldifferences.Forexample,
northernEuropeansareonaveragetallerthansouthernEuropeans1andAfricanAmericans
haveanincreasedrateofhypertensionrelativetoEuropeanAmericans2.Thegenetic
contributiontopopulationphenotypicdifferentiationisdrivenbydifferencesincausal
allelefrequencies,effectsizes,andgeneticarchitectures.Understandingtherootcausesof
phenotypicdifferencesworldwidehasprofoundimplicationsforbiomedicalandclinical
practiceindiversepopulations,thetransferabilityofepidemiologicalresults,aidingmulti-
ethnicdiseasemapping3,4,assessingthecontributionofnon-additiveandrarevariant
effects,andmodelingthegeneticarchitectureofcomplextraits.Inthisworkweconsidera
centralquestionintheglobalstudyofphenotype:dogeneticvariantshavethesame
phenotypiceffectsindifferentpopulations?
WhilethevastmajorityofGWAShavebeenconductedinEuropeanpopulations5,
thegrowingnumberofnon-Europeanandmulti-ethnicstudies4,6,7provideanopportunity
tostudygeneticeffectdistributionsacrosspopulations.Forexample,onerecentstudyused
mixed-modelbasedmethodstoshowthatthegenome-widegeneticcorrelationof
.CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint
https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/
2
schizophreniabetweenEuropeanandAfricanAmericansisnonzero8.Whilepowerful,
computationalcostsandprivacyconcernslimittheutilityofgenotype-basedmethods.In
thiswork,wemaketwosignificantcontributionstostudiesoftransethnicgenetic
correlation.First,weexpandthedefinitionofgeneticcorrelationtobetteraccountfora
transethniccontext.Second,wedevelopanapproachtoestimatinggeneticcorrelation
acrosspopulationsthatusesonlysummarylevelGWASdata.Similartootherrecent
summarystatisticsbasedmethods9–20,ourapproachsupplementssummaryassociation
datawithlinkagedisequilibrium(LD)informationfromexternalreferencepanels,avoids
privacyconcerns,andisscalabletohundredsofthousandsofindividualsandmillionsof
markers.UnliketraditionalapproachesthatfocusonthesimilarityofGWASresults21–25we
utilizetheentirespectrumofGWASassociationswhileaccountingforLDinordertoavoid
filteringcorrelatedSNPs.
Inasinglepopulation,thegeneticcorrelationoftwophenotypesisdefinedasthe
correlationcoefficientofSNPeffectsizes18,26.Inmultiplepopulations,differencesinallele
frequencymotivatemultiplepossibledefinitionsofgeneticcorrelation.Becauseavariant
mayhaveahighereffectsizebutlowerfrequencyinonepopulation,weconsiderboththe
correlationofalleleeffectsizesaswellasthecorrelationofallelicimpact.Wedefinethe
transethnicgeneticeffectcorrelation(ρge,previouslydefinedbyLeeetal26andimplemented
inGCTA)asthecorrelationcoefficientoftheper-alleleSNPeffectsizes,andthetransethnic
geneticimpactcorrelation(ρgi)asthecorrelationcoefficientofthepopulation-specificallele
variancenormalizedSNPeffectsizes.
Intuitively,thegeneticeffectcorrelationmeasurestheextenttowhichthesame
varianthasthesamephenotypicchange,whilethegeneticimpactcorrelationgivesmore
weighttocommonallelesthanrareonesseparatelyineachpopulation.Considerthecaseof
aSNPthatisrareinpopulation1butcommoninpopulation2,andhasanidenticaleffect
sizeinbothpopulations.Inthiscase,thecorrelationofeffectsizes(thegeneticeffect
correlationρge)is1,butthisprovidesanincompletepictureoftherelationshipbetweenthe
twopopulations,astheallelehasamuchbiggerimpactonthedistributionofthephenotype
inpopulation2.Therefore,wedefinethegeneticimpactcorrelationρgiasthecorrelationof
effectsizesafternormalizinggenotypestohavemean0andvariance1.Inourhypothetical
caseρgi
3
twopopulationswillbesimilarandthereforeρgiwillbecloseto1.Whileotherdefinitionsof
thegeneticcorrelationarepossible(seediscussion),thesequantitiescapturetwoimportant
questionsaboutthestudyofdiseaseinmultiplepopulations:towhatextentdothesame
mutationsinmultiplepopulationsdifferintheirphenotypiceffects?And,towhatextentare
thesedifferencesmitigatedorexacerbatedbydifferencesinallelefrequency?
Toestimategeneticcorrelation,wetakeaBayesianapproachwhereinweassume
genotypesaredrawnseparatelyfromwithineachpopulationandeffectsizeshaveanormal
prior(theinfinitesimalmodel27).Whileunlikelytorepresentreality,thismodelhasbeen
usedsuccessfullyinpractice8,16,17,28,29.Theinfinitesimalassumptionyieldsamultivariate
normaldistributionontheobservedteststatistics(Z-scores),whichisafunctionofthe
heritabilityandgeneticcorrelation.RatherthanpruningSNPsinLD10,30,31,thisallowsusto
explicitlymodeltheresultinginflationofZ-scores.Wethenmaximizeanapproximate
weightedlikelihoodfunctiontofindtheheritabilityandgeneticcorrelation.Thismethodis
implementedinapythonpackagecalledpopcorn.Thoughderivedforquantitative
phenotypes,popcornextendseasilytobinaryphenotypesundertheliabilitythreshold
model.Weshowviaextensivesimulationthatpopcornproducesunbiasedestimatesofthe
geneticcorrelationandthepopulationspecificheritabilities,withastandarderrorthat
decreasesasthenumberofSNPsandindividualsinthestudiesincreases.Furthermore,we
showthatourapproachisrobusttoviolationsoftheinfinitesimalassumption.
WeapplypopcorntoEuropeanandYorubangeneexpressiondata32aswellasGWAS
summarystatisticsfromEuropeanandEastAsianrheumatoidarthritisandtype-two
diabetescohorts,33,34.OuranalysisofgEUVADISshowsthatoursummarystatisticbased
estimatorisconcordantwiththemixedmodelbasedestimator.Wefindthatthemean
transethnicgeneticcorrelationacrossallgenesislow(ρge=0.320(0.009)),butincreases
substantiallywhenthegeneishighlyheritableinbothpopulations(ρge=0.772(0.017)).in
RAandT2D,wefindthegeneticeffectcorrelationtobe0.463(0.058)and0.621(0.088),
respectively.
Acrossallphenotypesconsidered,weoverwhelminglyfindthatthetransethnic
geneticcorrelationissignificantlylessthanone.Thisobservationhighlightstheneedto
studyphenotypesinmultiplepopulationsasitimpliesthat,uptotheeffectsofun-observed
variants,effectsizesatcommonSNPstendtodifferbetweenpopulations.Thisindicates
thatGWASresultsmaynottransferbetweenpopulations,andthereforediseaserisk
predictioninnon-EuropeansbasedoncurrentGWASresultsmaybeproblematic,
.CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint
https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/
4
necessitatingamulti-populationapproachtogaininsightintointer-populationdifferences
inthegeneticarchitectureofcomplextraits.
Methods
Ourmethodtakesasinputsummaryassociationstatisticsfromtwostudiesofa
phenotypeintwodifferentpopulations,alongwithtwosetsofreferencegenotypeseach
matchingoneofthepopulationsinthestudy.Ourmethodhastwosteps:first,weestimate
thediagonalelementsoftheLDmatrixproductsΣ12,Σ22andΣ1Σ2,then,usingthese
estimates,wefindthemaximumlikelihoodvaluesandestimatestandarderrorsofthe
parametersofinterest:h12,h22andρgeorρgi.Thedetailsfollow.
ConsiderGWASofaphenotypeconductedintwodifferentpopulations.Assumewe
haveN1individualsgenotypedonMSNPsinstudyoneandN2individualsgenotypedonthe
sameSNPsstudytwo.LetX1,X2bethematricesofmean-centeredgenotypesinstudyone
andstudytwo,respectively,andletY1,Y2betheirnormalizedphenotypes.Letf1,f2be
vectorsoftheallelefrequenciesoftheMSNPscommontobothpopulations.Assuming
Hardy-Weinbergequilibriumwithineachpopulationseparately,theallelevariancesareσ12
=2f1(1–f1),σ22=2f2(1–f2).Letβ1,β2bethe(unobserved)per-alleleeffectsizesforeachSNP
instudiesoneandtwo,respectively.Theheritabilityinstudyoneisthenh12=Σiσ1i2β1i2
(andlikewiseforstudytwo).Theobjectiveofthisworkistoestimatetransethnicgenetic
correlationfromsummarystatisticsofcommonvariants (andlikewise
forstudytwo)andestimatesofpopulationLDmatrices(Σ1andΣ2)fromexternalreference
panels.Definethegeneticeffectcorrelationρge=Cor(β1,β2)andthegeneticimpact
correlationρgi=Cor(σ1β1,σ2β2).
Weassumethegenotypesaredrawnrandomlyfromeachpopulationandthat
phenotypesaregeneratedbythelinearmodelY1=X1β1+ε1(likewiseforphenotypetwo).
Wheneffectsizesβareassumedinverselyproportionaltoallelefrequency,asiscommonly
done16,29,weshow(Appendix)thatunderthelinearinfinitesimalgeneticarchitecture,the
jointdistributionoftheZ-scoresfromeachstudyisasymptoticallymultivariatenormalwith
mean andvariance:
(1)
Z1 =(X1/�1)>Y1p
N1
�!0
Var(Z) =
"⌃1 +
N1+1M h
21⌃
21 ⇢gi
ph21h
22
pN1N2M ⌃1⌃2
⇢gip
h21h22
pN1N2M ⌃2⌃1 ⌃2 +
N2+1M h
22⌃
22
#
.CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint
https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/
5
However,wheneffectsizesareassumedindependentofallelefrequencyweshow:
(2)
Giventheseequationsforvariance,thequantitiesρgiorρgeandh12,h22canbeestimatedby
maximizingthemultivariatenormallikelihood,
,whereCiseitheroftheabovecovariance
matrices(1)and(2).BecauseΣ1andΣ2areestimatedfromfiniteexternalreferencepanels,
maximumlikelihoodestimationoftheabovemultivariatenormalleadstoover-fitting.We
employtwooptimizationstoavoidthisproblem.First,wemaximizeanapproximate
weightedlikelihoodthatusesonlythediagonalelementsofeachblockofVar(Z).This
allowsustoaccountfortheLD-inducedinflationoftestsstatistics,butdiscardscovariance
informationbetweenpairsofZ-scores,andthereforeleadstoover-countingZ-scoresof
SNPsinhighLD.Tocompensateforthis,wedownweightZ-scoresofSNPsinproportion
theirLD.Second,ratherthancomputethefullproductsΣ12,Σ22andΣ1Σ2overallMSNPsin
thegenome,wechooseawindowsizeWandapproximatetheproductby
.TheseoptimizationsaresimilartothoseemployedbyLDscore
regression16.Thefulldetailsofthederivationandoptimizationareprovidedinthe
appendix.
Results
SimulatedGenotypesandSimulatedPhenotypes
Wesimulated50,000European-like(EUR)and50,000EastAsian-like(EAS)
individualsat248,953SNPsfromchromosomes1-3withallelefrequencyabove1%inboth
EuropeanandEastAsianHapMap3populationswithHapGen235.HapGen2implementsa
haplotyperecombinationwithmutationmodelthatresultsinexcesslocalrelatedness
amongthesimulatedindividuals.Toaccountforthislocalstructure,weusedPlink236to
filterindividualswithgeneticrelatednessabove0.05,resultingin4499EUR-likeindivudals
and4837EAS-likeindividuals.Fromthesesimulatedindividuals,500perpopulationwere
chosenuniformlyatrandomtoserveasanexternalreferencepanelforestimatingΣ1andΣ2.
l(⇢g{i,e}, h21, h
22|Z,⌃,�) / �ln(|C|)� Z>C�1Z
(⌃a⌃b)ii =w=i+WX
w=i�Wraiwrbiw
Var(Z) =
2
4⌃1 +
N1+1k�21k1
h21⌃1�21⌃1 ⇢ge
ph21h
22
pN1N2p
k�21k1k�22k1⌃1
p�21�
22⌃2
⇢gep
h21h22
pN1N2p
k�21k1k�22k1⌃2
p�22�
21⌃1 ⌃2 +
N2+1k�22k1
h22⌃2�22⌃2
3
5
.CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint
https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/
6
Ineachsimulationeffectsizesweredrawnfroma“spikeandslab”model,where
withprobabilitypand with
probability1-p.ρgiwasanalyticallycomputedfromthesimulatedeffectsizesandallele
frequenciesinthesimulatedreferencegenotypes.Quantitativephenotypesweregenerated
underalinearmodelwithi.i.d.noiseandnormalizedtohavemean0andvariance1,while
binaryphenotypesweregeneratedunderaliabilitythresholdmodelwhereindividualsare
labeledcaseswhentheirliabilityexceedsathreshold ,withKthe
populationdiseaseprevalence37.
Wevariedh12,h22,ρge,andρgi,aswellasthenumberofindividualsineachstudy(N1,
N2),thenumberofSNPs(M),thepopulationprevalenceK,andproportionofcausalvariants
(p)inthesimulatedGWASandgeneratedsummarystatisticsforeachstudy.Theresults
showninFigure1andFigureS1demonstratethattheestimatorsarenearlyunbiasedasthe
geneticcorrelationandheritabilitiesvary.Furthermore,byvaryingtheproportionofcausal
variantspweshowthatourestimatorisrobusttoviolationsoftheinfinitesimalassumption
(FigureS2).InfigureS3,weshowthatthestandarderroroftheestimatordecreasesasthe
numberofSNPsandindividualsinthestudyincreases.Finally,weshowinTableS1thatour
estimatesoftheheritabilityofliabilityincasecontrolstudiesarenearlyunbiased.
Simulationswithnonstandarddiseasemodels
Ourapproach,aswellasgenotype-basedmethodssuchasGCTA,makes
assumptionsaboutthegeneticarchitectureofcomplextraits.Previousworkhasshownthat
violationsoftheseassumptionscanleadtobiasinheritabilityestimation38,thereforewe
soughttoquantifytheextentthatthisbiasmayeffectourestimates.Wesimulated
phenotypesundersixdifferentdiseasemodels.Independent:effectsizeindependentof
allelefrequency.Inverse:effectsizeinverselyproportionaltoallelefrequency.Rare:only
SNPswithallelefrequencyunder10%affectthetrait.Common:onlySNPswithallele
frequencybetween40%and50%affectthetrait.Difference:effectsizeproportionalto
differenceinallelefrequency.Adversarial:differencemodelwithsignofbetasettoincrease
thephenotypeinthepopulationwherethealleleismostcommon.Additionalgenetic
architecturesarepossible,includingoneswhereeffectsizesarenotadirectfunctionof
MAF39.
�1i,�2i ⇠ N✓0,
h21 ⇢ge
ph21h
22
⇢geph21h
22 h
22
�◆
�1i,�2i = (0, 0)
⌧ = ��1(1�K)
.CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint
https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/
7
Wesimulatedphenotypesusinggenotypeswithallelefrequencyabove1%or5%
andcomparedthetrueandestimatedgeneticimpactandeffectcorrelationamongall
models(Table1).WefindthatwhenonlySNPswithfrequencyabove5%inboth
populationsareused,thedifferenceinρgeandρgiisminimalexceptinthemostadversarial
cases.Evenintheadversarialmodel,thetruedifferenceisonly7%.Thoughunlikelyto
representreality,thefournonstandarddiseasemodelsresultinsubstantialbiasinour
estimators.WhenSNPswithallelefrequencyabove1%inbothpopulationsareincluded,
thedifferencesaremorepronounced.Thisisbecausethenormalizingconstant1/σrapidly
increasesastheSNPbecomesmorerare.Indeed,asSNPsbecomemorerarehavingan
accuratediseasemodelbecomesincreasinglyimportant.Thereforeweproceedwitha5%
MAFcutoffinouranalysisofrealdata,andusethenotationhc2torefertotheheritabilityof
SNPswithallelefrequencyabove5%inbothpopulations(thecommon-SNPheritability).
Note,however,thatoneoftheadvantagesofmaximumlikelihoodestimationingeneralis
thatthelikelihoodcanbereformulatedtomimicthediseasemodelofinterest.
ValidationofPopcornusinggeneexpressioninGEUVADIS
Wecomparedthecommon-SNPheritability(hc2)andgeneticcorrelationestimates
ofpopcorntoGCTAinthegEUVADISdatasetforwhichrawgenotypesarepubliclyavailable.
gEUVADISconsistsofRNA-seqdatafor464lymphoblastoidcellline(LCL)samplesfrom
fivepopulationsinthe1000genomesproject.Ofthese,375areofEuropeanancestry(CEU,
FIN,GBR,TSI)and89areofAfricanancestry(YRI).RawRNA-sequencingreadsobtained
fromtheEuropeanNucleotideArchivewerealignedtothetranscriptomeusingUCSC
annotationsmatchinghg19coordinates.RSEMwasusedtoestimatetheabundancesofeach
annotatedisoformandtotalgeneabundanceiscalculatedasthesumofallisoform
abundancesnormalizedtoonemilliontotalcountsortranscriptspermillion(TPM).For
eQTLmapping,CaucasianandYorubansampleswereanalyzedseparately.Foreach
population,TPMsweremedian-normalizedtoaccountfordifferencesinsequencingdepth
ineachsampleandstandardizedtomean0andvariance1.Ofthe29763totalgenes,9350
withTPM>2inbothpopulationswerechosenforthisanalysis.
Foreachgene,weconductedacis-eQTLassociationstudyatallSNPswithin1
megabaseofthegenebodywithallelefrequencyabove5%inbothpopulationsusing30
principalcomponentsascovariates.WefoundthatGCTAandpopcornagreeontheglobal
distributionofheritability(FigureS3),andthatGCTA’sestimatesofgeneticcorrelation
.CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint
https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/
8
haveasimilardistributiontopopcorn’sgeneticeffectandgeneticimpactcorrelation
estimates(Figure2).WhilethenumberofSNPsandindividualsincludedineachgene
analysisaretoosmalltoobtainaccuratepointestimatesofthegeneticcorrelationonaper-
genebasis(N=464,M=4279.5),thelargenumberofgenesenablesaccurateestimationof
theglobalmeanheritabilityandgeneticcorrelation.
Common-SNPheritabilityandgeneticcorrelationofgeneexpressioningEUVADIS
Wefindthattheaveragecis-hc2oftheexpressionofthegenesweanalyzedwas
0.093(0.002)inEURand0.088(0.002)inYRI.Ourestimatesarehigherthanpreviously
reportedaveragecis-heritabilityestimatesof0.055inwholebloodand0.057inadipose40,
whichcouldariseforseveralreasons.First,weremove68%ofthetranscriptsthatare
lowlyexpressedineithertheYRIorEURdata.Second,estimatesfromRNA-seqanalysisof
celllinesmightnotbedirectlycomparabletomicroarraydatafromtissue.
Theaveragegeneticeffectcorrelationwas0.320(0.010),whiletheaveragegenetic
impactcorrelationwas0.313(0.010).Notably,thegeneticcorrelationincreasesasthecis-
hc2ofexpressioninbothpopulationsincreases(Figure3).Inparticular,whenthecis-hc2of
thegeneisatleast0.2inbothpopulationsthegeneticeffectcorrelationwas0.772(0.017)
whilethegeneticimpactcorrelationwas0.753(0.018).
Inordertoverifythattherewerenosmall-samplesizeorconditioningbiasesinour
analysis,weanalyzedthegeneticcorrelationofsimulatedphenotypesoverthegEUVADIS
genotypes.Wesampledpairsofheritabilitiesfromtheestimatedexpressionheritability
distributionandsimulatedpairsofphenotypestohavethegivenheritabilityandagenetic
effectcorrelationof0.0overrandomlychosen4000baseregionsfromchromosome1ofthe
gEUVADISgenotypes.Withoutconditioning,theaverageestimatedgeneticeffect
correlationwas-0.002(0.003),indicatingthattheestimatorremainedunbiased.
Furthermore,theaverageestimatedgeneticeffectcorrelationwasnotsignificantlydifferent
from0.0conditionalontheestimatesofheritabilitybeingaboveacertainthreshold(Figure
S4).
Wefindthatwhiletheaveragegeneticcorrelationislow,thegeneticcorrelation
increaseswiththecis-hc2ofthegene,indicatingthatascis-geneticregulationofgene
expressionincreasesitdoessosimilarlyinbothYRIandEURpopulations.Thismayhelp
interprettherecentobservationthatwhiletheglobalgeneticcorrelationofgeneexpression
acrosstissuesislow40,cis-eQTL’stendtoreplicateacrosstissues41.Asthepresenceofacis-
.CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint
https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/
9
eQTLindicatessubstantialcis-geneticregulation,ananalysisofeQTLreplicationacross
tissuesisimplicitlyconditioningontheheritabilityofgeneexpressionbeinghighand
thereforemayindicatemuchhighergeneticcorrelationthanaverage.
SummarystatisticsofRAandT2D
Finally,wesoughttoexaminethetransethnicρgiandρgeinRAandT2Dcohortsfor
whichrawgenotypesarenotavailable.WeobtainedsummarystatisticsofGWASfor
rheumatoidarthritisandtype-2diabetesconductedinEuropeanandEastAsian
populations.Weusedgenotypesfrom504EastAsianand503Europeanindividuals
sequencedaspartofthe1000genomesprojectaspopulation-specificexternalreference
panelsforourEASandEURsummarystatistics,respectively.WeremovedtheMHCregion
(chromosome6,25–35Mb)fromtheRAsummarystatistics.Weestimatedthecommon-
SNPheritabilityandgeneticcorrelationusing2,539,629SNPsgenotypedorimputedinboth
RAstudiesand1,054,079SNPsgenotypedorimputedinbothT2Dstudieswithallele
frequencyabove5%in1000genomesEURandEASpopulations.Thehc2andgenetic
correlationestimatesarepresentedinTable2.OurRAhc2estimatesof0.177(0.015)and
0.221(0.026)forEURandEAS,respectively,arelowerthanapreviouslyreportedmixed-
modelbasedheritabilityestimatesof0.32(0.037)inEuropeans42.Similarly,ourT2Dhc2
estimatesof0.242(0.013)and0.105(0.021)forEURandEAS,respectively,arelowerthan
apreviouslyreportedmixed-modelbasedestimateof0.51(0.065)inEuropeans42.We
stressthatthisdiscrepancyislikelyduetothedifferencebetweencommon-SNPheritability
hc2andtotalnarrow-senseheritabilityhg2.Furthermore,estimatesoftheheritabilityofT2D
fromfamilystudiescanvarysignificantly43,44.
WefindthegeneticeffectcorrelationinRAandT2Dtobe0.463(0.058)and0.621
(0.088),respectively,whilethegeneticimpactcorrelationisnotsignificantlydifferentat
0.455(0.056)and0.606(0.083).Thetransethnicgeneticimpactandeffectcorrelationfor
bothphenotypesaresignificantlydifferentfromboth1and0(Table2),showingthatwhile
thereiscleargeneticoverlapbetweenthephenotypes,theperalleleeffectsizesdiffer
significantlybetweenthetwopopulations.
Discussion
Wehavedevelopedthetransethnicgeneticeffectandgeneticimpactcorrelationand
providedanestimatorforthesequantitiesbasedonlyonsummary-levelGWASinformation
.CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint
https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/
10
andsuitablereferencepanels.Wehaveappliedourestimatortoseveralphenotypes:
rheumatoidarthritis,type-2diabetesandgeneexpression.WhilethegEUVADISdataset
lacksthepowerrequiredtomakeinferencesaboutthegeneticcorrelationofsingleorsmall
subsetsofgenes,wecanmakeinferencesabouttheglobalstructureofgeneticcorrelationof
geneexpression.Wefindthattheglobalmeangeneticcorrelationislow,butthatit
increasessubstantiallywhentheheritabilityishighinbothpopulations.Inallphenotypes
analyzed,thegeneticcorrelationissignificantlydifferentfromboth0and1.Ourresults
showthatglobaldifferencesinSNPeffectsizeofcomplextraitscanbelarge.Incontrast,
effectsizesofgeneexpressionappeartobemoreconservedwherethereisstronggenetic
regulation.
Itisnotpossibletodrawconclusionsaboutpolygenicselectionfromestimatesof
transethnicgeneticcorrelation.Theeffectsizesmaybeidentical(!!" = 1)whilepolygenicselectionactstochangeonlytheallelefrequencies.Similarly,theeffectsizesmaybe
different(!!" < 1)withoutselection.DifferencesineffectsizesatcommonSNPscanresultfrommanyphenomena.Weexpectun-typedandun-imputedvariantsdifferentiallylinked
toobservedSNPstocontributesignificantly,alongwithrareorpopulation-specificvariants
differentiallylinkedtoobservedSNPs.Ifagene-geneorgene-environmentinteraction
exists,butonlymarginaleffectsaretested,theobservedmarginaleffectsmaybedifferentin
eachpopulationduetoallelefrequencydifferenceseveniftheinteractioneffectisthesame
inbothpopulations,andthiswillresultindecreasedgeneticcorrelation.Whilewithin-locus
(dominance)interactionsmayalsoplayarole45,themagnitudeofthiseffecthasbeen
debated46.Weemphasizethatwecannotdifferentiatebetweentheseeffectsonthebasisof
thisanalysisalone,andfurtherresearchisrequiredtoestablishthemagnitudeofthe
contributionofeachoftheseeffectstointer-populationeffectsizedifferences.
Estimatesofthetransethnicgeneticcorrelationareimportantforseveralreasons.
Theymayhelpinformbestpracticesfortransethnicmeta-analysis,potentiallyoffering
improvementsovercurrentmethodsthatuseFsttoclusterpopulationsforanalysis4.
Further,thetransethnicgeneticcorrelationconstrainsthelimitofoutofsamplephenotype
predictivepower.IfthemaximumwithinpopulationcorrelationofpredictedphenotypeP
totruephenotypeYis ,thenthemaximumoutofpopulationcorrelationis
(Appendix).OurobservationthatforRA,T2D,andgeneexpressionthe
geneticcorrelationislowshowsthatoutofpopulationphenotypicpredictivepowerisquite
⇢maxY P
=q
h21
⇢maxY P
= ⇢ge
qh21
.CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint
https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/
11
low.Similarly,itimpliesthatdiseaseriskassessmentinnon-Europeansbasedoncurrent
GWASresultsmaybeproblematic,necessitatingincreasedstudyofdiseaseinmany
populationstogaininsightintodifferencesingeneticarchitectureandimproverisk
assessment.
Whilethegeneticcorrelationofmultiplephenotypesinonepopulationhasa
relativelystraightforwarddefinition,extendingthistomultiplepopulationsmotivates
multiplepossibleextensions.Inthisworkwehaveprovidedestimatorsforthecorrelation
ofgeneticeffectandgeneticimpactbutotherquantitiesrelatedtothesharedgeneticsof
complextraitsbetweenpopulationsincludethecorrelationofvarianceexplained
andproportionofsharedcausalvariantsbetweenthetwo
populations.Interestingly,whileourgoalwastoconstructanestimatorthatdeterminedthe
extentofgeneticsharingindependentofallelefrequency,weobservethatthecorrelationof
geneticeffectandgeneticimpactaresimilar.Furthermore,oursimulationsshowthatunder
arandomeffectsmodelutilizingonlySNPswithallelefrequencyabove5%inboth
populationsthetruegeneticeffectandgeneticimpactcorrelationaresimilar.Weconclude
thatatvariantscommoninbothpopulations,differencesineffectsizeandnotallele
frequencyaredrivingthetransethnicphenotypicdifferencesinthesetraits.
Ourapproachtoestimatinggeneticcorrelationhastwomajoradvantagesover
mixed-modelbasedapproaches.First,utilizingsummarystatisticsallowsapplicationofthe
methodwithoutdata-sharingandprivacyconcernsthatcomewithrawgenotypes.Second,
ourapproachislinearinthenumberofSNPsavoidingthecomputationalbottleneck
requiredtoestimatethegeneticrelationshipmatrix.Conceptually,ourapproachisvery
similartothattakenbyLDscoreregression.Indeed,thediagonaloftheLDmatrixproduct
inonepopulationareexactlytheLD-scores( ).Onecouldignoreourlikelihood-
basedapproachanddefinecross-populationscores inordertoexploitthe
linearrelationship (asimilarapproachcanbetakenfor
thegeneticeffectcorrelation).SinceLD-scoreregressionhasbeensuccessfullyusedto
computethegeneticcorrelationoftwophenotypesinasinglepopulation,thisderivation
canbeviewedasanextensionofLD-scoreregressiontoonephenotypeintwodifferent
populations.Themaindifferenceinourapproachischoosingmaximumlikelihoodrather
⇢ge = Cor(�21�
21 ,�
22�
22)
⌃21ii = li
ci =X
m
r1imr2im
E[Z1iZ2i] =pN1N2M
⇢gi
qh21h
22ci
.CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint
https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/
12
thanregressioninordertofitthemodel.Acomparisonofourmethodtotheldscsoftware
showstheyperformsimilarlyasheritabilityestimators(FigureS5).
Ofcourse,ourmethodisnotwithoutdrawbacks.First,itrequiresalargesample
sizeandlargenumberofSNPstoachievestandarderrorslowenoughtogenerateaccurate
estimates.UntilrecentlylargesampleGWAShavebeenrareinnon-Europeanpopulations,
thoughtheyarebecomingmorecommon.Similarly,referencepanelqualitymaysufferin
non-Europeanpopulationsandthismayimpactdownstreamanalysis47.Second,itislimited
toanalyzingrelativelycommonSNPs,bothbecausehavinganaccuratediseasemodelis
importantfortheanalysisofrarevariantsandbecauseeffectsizeandcorrelationcoefficient
estimateshaveahighstandarderroratrareSNPs16.Third,ouranalysisiscurrentlylimited
toSNPsthatarepresentinbothpopulations.Indeeditiscurrentlyunclearhowbestto
handlepopulation-specificvariantsinthisframework.Fourth,ourestimatorofρis
boundedbetween-1and1.Thismayinducebiaswhenthetruevalueisclosetothe
boundaryandthesamplesizeissmall.Finally,admixedpopulationsinduceverylong-range
LDthatisnotaccountedforinourapproachandwearethereforelimitedtoun-admixed
populations16.
Ouranalysisleavesopenseveralavenuesforfuturework.Weapproximately
maximizethelikelihoodofan!×!multivariatenormaldistributionviaamethodthatusesonlythediagonalelementsofeachblock,discardingcovarianceinformationbetweenZ-
scores.Abetterapproximationmaylowerthestandarderroroftheestimator,facilitating
ananalysisofthegeneticcorrelationoffunctionalcategories,pathwaysandgeneticregions.
Wewouldalsoliketoextendouranalysistoincludepopulationspecificvariantsaswellas
variantsatfrequenciesbetween1-5%orlowerthan1%.Oursimulationsindicatethat
havinganaccuratediseasemodelisimportantfordeterminingthedifferencebetweenthe
geneticeffectandgeneticimpactcorrelationwhenrarevariantsareincluded.Maximum
likelihoodapproachesarewellsuitedtodifferentgeneticarchitectures.Forexample,one
couldestimateboththeglobalrelationshipbetweenallelefrequencyandeffectsizeandthe
globalrelationshipbetweenper-SNPFSTandgeneticcorrelationbyincorporating
parametersαandϒintothepriordistributionoftheeffectsizes,
.Weexpectthatincorporating
theseparameterswillimproveestimatesofheritabilityandgeneticcorrelationwhile
revealingimportantbiologicalinsights.
�1i,�2i ⇠ N✓0,
h21�
↵1i ⇢ge
ph21h
22F
�STi
⇢gep
h21h22F
�STi h
22�
↵1i
�◆
.CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint
https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/
Appendix
Consider two GWAS of a phenotype conducted in di↵erent populations populations. Assume we have N1
individuals genotyped or imputed to M SNPs in study one and N2 individuals genotyped or imputed to
M SNPs in study two. Let X1, X2 and Y1, Y2 be the matrices of mean-centered genotypes and phenotypes
of the individuals in study one and two, respectively, with f1, f2 the allele frequencies of the M SNPs
common to both populations. Assuming Hardy-Weinberg equilibrium, the allele variances are �21 = 2f1(1�
f1),�22 = 2f2(1� f2). Let �1,�2 be the (unobserved) per-allele e↵ect sizes for each SNP in studies one and
two, respectively. Define the genetic impact correlation ⇢gi
= Cor(p�21�1,
p�22�2) and the genetic e↵ect
correlation ⇢ge
= Cor(�1,�2). We present a maximum likelihood framework for estimating the heritability of
the phenotype in study 1 and it’s standard error, the heritability of the phenotype in study 2 and it’s standard
error, and the genetic e↵ect and impact correlation of the phenotype between the studies and it’s standard
error given only the summary statistics Z1, Z2 and reference genotypes G1, G2 representing the populations
in the studies. We assume that genotypes are drawn randomly from populations with expected correlation
matrices ⌃1 (and similarly for study two), and that every SNP is causal with a normally distributed e↵ects
size (though this assumption is not necessary in practice, see Figure S1).
Genetic impact correlation
Let X 01 =X1p�
21
(and similarly for study 2) be normalized genotype matrices. We consider the standard linear
model for generation of the phenotypes, where
Y1 = X01�1 + ✏1
Y2 = X02�2 + ✏2
For convenience of notation let h2ix
= ⇢gi
ph21h
22. We assume the SNP e↵ects follow the infinitesimal model,
where every SNP has an e↵ect size drawn from the normal distribution, and that the residuals are independent
for each individual and normally distributed:
0
@ �1�2
1
A ⇠ N
0
@
2
4 0
0
3
5 , 1M
2
4 h21IM h2
ix
IM
h2ix
IM
h22IM
3
5
1
A (1)
0
@ ✏1✏2
1
A ⇠ N
0
@
2
4 0
0
3
5 ,
2
4 (1� h21)IM 0
0 (1� h22)IM
3
5
1
A (2)
1
.CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint
https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/
where h21, h22 are the heritability of the disease in study one and two, respectively, and ⇢gi is the genetic
impact correlation.
Using the above model, we compute the distribution of the observed Z scores as a function of the reference
panel correlations and the model parameters (h21, h22, ⇢gi). Given a distribution for Z and an observation of
Z we can then choose parameters which give the highest probability of observing Z. First, we compute the
distribution of Z. It is well known that the Z-scores of a linear regression are normally distributed given
� when the sample size is large enough. Since P(Z) / P(Z|�)P(�) and the product of normal distributions
is normal, we only need to compute the unconditional mean and variance of Z to know its distribution.
Specifically, let Z = [Z>1 , Z>2 ]
>, then it’s mean is
E[Z] = E
2
4X
0>1 Y1pN1
X
0>2 Y2pN2
3
5 =
2
41pN1
(E[X 0>1 X 01]E[�1] + E[X 0>1 ]E[✏1])1pN2
(E[X 0>2 X 02]E[�2] + E[X 0>2 ]E[✏2])
3
5 = 0
The within-population variance is:
Cov[Z1i, Z1j ] = E[Z1iZ1j ] = EX,�,✏[E[Z1iZ1j |X,�, ✏]]
=1
N1EX,�,✏
[X 0>1i (X01�1 + ✏1)(X
01�1 + ✏1)
>X 01j ]
=1
N1EX,�
[X 0>1i X01�1�
>1 X
0>1 X2j ] +
1
N1EX,✏
[X 0>1i ✏1✏>1 X
01j ]
=h21
MN1EX
[X 0>1i X01X
0>1 X
01j ] +
1� h21N1
EX
[X 0>1i X01j ]
=h21
MN1(N1Mr1ij +N1
MX
m=1
r1imr1jm +N21
MX
m=1
r1imr1jm) +1� h21N1
r1ij
= r1ij +N1 + 1
Mh21⌃1(i)⌃
(j)1
where rpij
= ⌃pij
is the correlation coe�cient of SNP i and j in population p. Similarly, the between-
population variance is:
Cov[Z1i, Z2j ] =1p
N1N2EX,�
[X 0>1i X01�1�
>2 X
0>2 X
02j ] +
1pN1N2
EX,✏
[X 0>1i ✏1✏>2 X
02j ]
=h2ix
MpN1N2
EX
[X 0>1i X01X
0>2 X
02j ]
=h2ix
MpN1N2
(N1N2
MX
m=1
r1imr2jm)
=
pN1N2M
h2ix
⌃1(i)⌃(j)2
where ⌃(i) denotes the i’th row of ⌃ and ⌃(j) denotes the j’th column. The covariance of the Z-scores is
2
.CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint
https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/
thus
C = Var(Z) =
2
4 ⌃1 +N1+1M
h21⌃21 h
2gx
pN1N2
M
⌃1⌃2
h2gx
pN1N2
M
⌃2⌃1 ⌃2 +N2+1M
h22⌃22
3
5 (3)
and Z ⇠ N (0, C).
Genetic e↵ect correlation
Let hex
= ⇢ge
ph21h
22. We modify the procedure above to use mean-centered instead of normalized genotype
matrices and model the distribution of the e↵ect sizes as
0
@ �1�2
1
A ⇠ N
0
B@
2
4 0
0
3
5 ,
2
64h
21
k�21k1IM
h
expk�21k1k�22k1
IM
h
expk�21k1k�22k1
IM
h
22
k�22k1IM
3
75
1
CA (4)
Notice that a linear model with e↵ects sizes acting on un-normalized genotypes is the same as a linear model
with e↵ect sizes acting on normalized genotypes under the substitution �1,2 !q
�21,2�1,2. Therefore the
covariance of Z-scores on the per allele scale can be immediately inferred from the prior derivation
C = Var(Z) =
2
64⌃1 +
N1+1k�21k1
h21⌃1�21⌃1 h
2gx
pN1N2p
k�21k1k�22k1⌃1
p�21�
22⌃2
h2gx
pN1N2p
k�21k1k�22k1⌃2
p�22�
21⌃1 ⌃2 +
N2+1k�22k1
h22⌃2�22⌃2
3
75
Approximate maximum likelihood estimation
Let C =
2
4 C11 C12C21 C22
3
5 be either of the above covariance matrices written in block form. We approximately
optimize the above likelihood as follows: first we find h21 and h22 by maximizing the likelihood corresponding
to C11 and C22, then we find ⇢gi or ⇢ge by maximizing the likelihood corresponding to C12:
l(h21|Z1,⌃,�) ⇡ �MX
i=1
w11i
✓ln(C11ii) +
Z21iC11ii
◆
l(h22|Z2,⌃,�) ⇡ �MX
i=1
w22i
✓ln(C22ii) +
Z22iC22ii
◆
l(⇢g{i,e}|Z, ĥ21, ĥ22,⌃,�) ⇡ �
MX
i=1
w12i
✓ln(C12ii) +
Z1iZ2iC12ii
◆
Because we are discarding between-SNP covariance information (Cov(Z1i, Z1j)), highly correlated SNPs will
be overcounted in our approximate likelihood. As a simple example, notice that two SNPs in perfect LD
will each contribute identical terms to the approximate likelihood, and therefore should be downweighted
by a factor of 1/2. The extent to which SNP i is over-counted is exactly the i’th entry in it’s corresponding
3
.CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint
https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/
LD-matrix product. Therefore we let wgijki
= 1/ (⌃j
⌃k
)ii
and wgejki
= 1/⇣⌃
j
q�2j
�2k
⌃k
⌘
ii
to reduce the
variance in our estimates of the parameters h21, h22, ⇢gi and ⇢ge.
Furthermore, rather than compute the full products ⌃21, ⌃22 and ⌃1⌃2 over all M SNPs in the genome, we
choose a window size W and approximate the product by (⌃a
⌃b
)ii
=P
w=i+Ww=i�W raiwrbiw. Though maximum
likelihood estimation admits a straightforward estimate of the standard error via the fisher information, we
found these estimates to be inaccurate in practice. Instead, we use block jackknife with block size equal to
min(100, M200 ) SNPs to ensure that blocks are large enough to remove residual correlations.
Out of population prediction of phenotypic values
Consider using the results of a GWAS with perfect power in population 2 to predict the phenotypic values of
a set of individuals from population 1. This defines the upper limit of the correlation of true and predicted
phenotypic values. Let the true values of the e↵ects sizes in population 2 be �2. Let the true phenotypes
in population 1 be Y = X1�1 + ✏1 while the predicted phenotypes are P = X1�2. We are interested in the
correlaiton of the predicted and true phenotypes ⇢MAXY P
= Cor(Y, P ). Notice that, given X, the true and
predicted phenotype of each individual is an a�ne transformation of a multivariate normal random variable
(�) 2
4 YiPi
3
5 =
2
4 X(i) 0M0M
X(i)
3
5
2
4 �1�2
3
5+
2
4 ✏i0
3
5
and therefore (Yi
, Pi
) for individual i is multivariate normal with expected covariance matrix
EX
[Cov(Yi
, Pi
)] = EX
2
4 X(i) 0M0M
X(i)
3
5
2
641
k�21k1IM
h
expk�21k1k�22k1
IM
h
expk�21k1k�22k1
IM
h
22
k�22k1IM
3
75
2
4 X(i) 0M0M
X(i)
3
5>
= EX
2
64
Pm
X
2im
k�21k1h
ex
Pm
X
2imp
k�21k1k�22k1h
ex
Pm
X
2imp
k�21k1k�22k1h
22
Pm
X
2im
k�22k1
3
75
=
2
4 1 hexq
k�21k1k�22k1
hex
qk�21k1k�22k1
h22k�21k1k�22k1
3
5
and therefore the expected correlation E[Cor(Yi
, Pi
)] is hexph
22
qk�21k1k�22k1
k�22k1k�21k1
= ⇢ge
ph21. The expected popula-
tion correlation tends to the sample correlation as the number of samples increases, therefore
⇢MAXY P
= Cor(Y, P ) ! ⇢ge
qh21 (5)
as N ! 1
4
.CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint
https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/
13
DescriptionofSupplementaldata
Supplementaldataincludesixfiguresandonetable.
Acknowledgements
TheauthorswouldliketoacknowledgeLiorPachterandHilaryFinucaneforinsightful
discussionabouttheproblem.BCBissupportedbytheNSFGRFP.ALPissupportedbyNIH
grantR01HG006399.NZissupportedbyNIHgrantK25HL121295.
WebResources
Popcornisavailableathttps://github.com/brielin/popcorn
References
1. Robinson, M.R., Hemani, G., Medina-Gomez, C., Mezzavilla, M., Esko, T., Shakhbazov, K., Powell, J.E., Vinkhuyzen, A., Berndt, S.I., Gustafsson, S., et al. (2015). Population genetic differentiation of height and body mass index across Europe. Nat. Genet. 47, 1357–1362. 2. Burt, V.L., Whelton, P., Roccella, E.J., Brown, C., Cutler, J.A., Higgins, M., Horan, M.J., and Labarthe, D. (1995). Prevalence of hypertension in the US adult population. Results from the Third National Health and Nutrition Examination Survey, 1988-1991. Hypertension 25, 305–313. 3. Coram, M.A., Candille, S.I., Duan, Q., Chan, K.H.K., Li, Y., Kooperberg, C., Reiner, A.P., and Tang, H. (2015). Leveraging Multi-ethnic Evidence for Mapping Complex Traits in Minority Populations: An Empirical Bayes Approach. Am. J. Hum. Genet. 96, 740–752. 4. Morris, A.P. (2011). Transethnic Meta-Analysis of Genomewide Association Studies. Genet. Epidemiol. 35, 809–822. 5. Bustamante, C.D., De La Vega, F.M., and Burchard, E.G. (2011). Genomics for the world. Nature 475, 163–165. 6. (2011). A genome-wide association study in Europeans and South Asians identifies five new loci for coronary artery disease. Nat Genet 43, 339–344. 7. Okada, Y., Wu, D., Trynka, G., Raj, T., Terao, C., Ikari, K., Kochi, Y., Ohmura, K., Suzuki, A., Yoshida, S., et al. (2014). Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature 506, 376–381. 8. de Candia, T.R., Lee, S.H., Yang, J., Browning, B.L., Gejman, P.V., Levinson, D.F., Mowry, B.J., Hewitt, J.K., Goddard, M.E., O’Donovan, M.C., et al. (2013). Additive Genetic Variation in Schizophrenia Risk Is Shared by Populations of African and European Descent. Am. J. Hum. Genet. 93, 463–470. 9. Lee, S., Teslovich, T.M., Boehnke, M., and Lin, X. (2013). General Framework for Meta-analysis of Rare Variants in Sequencing Association Studies. Am. J. Hum. Genet. 93, 42–53. 10. Palla, L., and Dudbridge, F. (2015). A Fast Method that Uses Polygenic Scores to Estimate the Variance Explained by Genome-wide Marker Panels and the Proportion of Variants Affecting a Trait. Am. J. Hum. Genet. 97, 250–259. 11. Yang, J., Ferreira, T., Morris, A.P., Medland, S.E., Madden, P.A.F., Heath, A.C., Martin, N.G., Montgomery, G.W., Weedon, M.N., Loos, R.J., et al. (2012). Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat. Genet. 44, 369–375. 12. Pasaniuc, B., Zaitlen, N., Shi, H., Bhatia, G., Gusev, A., Pickrell, J., Hirschhorn, J., Strachan, D.P., Patterson, N., and Price, A.L. (2014). Fast and accurate imputation of summary statistics enhances evidence of functional enrichment. Bioinformatics 30, 2906–2914.
.CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint
https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/
14
13. Hormozdiari, F., Kostem, E., Kang, E.Y., Pasaniuc, B., and Eskin, E. (2014). Identifying Causal Variants at Loci with Multiple Signals of Association. Genetics 198, 497–508. 14. Hormozdiari, F., Kichaev, G., Yang, W.-Y., Pasaniuc, B., and Eskin, E. (2015). Identification of causal genes for complex traits. Bioinformatics 31, i206–i213. 15. Kichaev, G., Yang, W.-Y., Lindstrom, S., Hormozdiari, F., Eskin, E., Price, A.L., Kraft, P., and Pasaniuc, B. (2014). Integrating Functional Data to Prioritize Causal Variants in Statistical Fine-Mapping Studies. PLoS Genet. 10, e1004722. 16. Bulik-Sullivan, B.K., Loh, P.-R., Finucane, H.K., Ripke, S., Yang, J., Schizophrenia Working Group of the Psychiatric Genomics Consortium, Patterson, N., Daly, M.J., Price, A.L., and Neale, B.M. (2015). LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet 47, 291–295. 17. Finucane, H.K., Bulik-Sullivan, B., Gusev, A., Trynka, G., Reshef, Y., Loh, P.-R., Anttila, V., Xu, H., Zang, C., Farh, K., et al. (2015). Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat Genet 47, 1228–1235. 18. Bulik-Sullivan, B., Finucane, H.K., Anttila, V., Gusev, A., Day, F.R., Loh, P.-R., ReproGen Consortium, Psychiatric Genomics Consortium, Genetic Consortium for Anorexia Nervosa of the Wellcome Trust Case Control Consortium 3, Duncan, L., et al. (2015). An atlas of genetic correlations across human diseases and traits. Nat Genet 47, 1236–1241. 19. Park, D.S., Brown, B., Eng, C., Huntsman, S., Hu, D., Torgerson, D.G., Burchard, E.G., and Zaitlen, N. (2015). Adapt-Mix: learning local genetic correlation structure improves summary statistics-based analyses. Bioinformatics 31, i181–i189. 20. Xu, Z., Duan, Q., Yan, S., Chen, W., Li, M., Lange, E., and Li, Y. (2015). DISSCO: direct imputation of summary statistics allowing covariates. Bioinformatics 31, 2434–2442. 21. International Schizophrenia Consortium (2009). Common polygenic variation contributes to risk of schizophrenia that overlaps with bipolar disorder. Nature 460, 748–752. 22. Zuo, L., Zhang, C.K., Wang, F., Li, C.-S.R., Zhao, H., Lu, L., Zhang, X.-Y., Lu, L., Zhang, H., Zhang, F., et al. (2011). A Novel, Functional and Replicable Risk Gene Region for Alcohol Dependence Identified by Genome-Wide Association Study. PLoS ONE 6, e26726. 23. Fesinmeyer, M., North, K., Ritchie, M., Lim, U., Franceschini, N., Wilkens, L., Gross, M., Bůžková, P., Glenn, K., Quibrera, P., et al. (2013). Genetic risk factors for body mass index and obesity in an ethnically diverse population: results from the Population Architecture using Genomics and Epidemiology (PAGE) Study. Obes. Silver Spring Md 21, 10.1002/oby.20268. 24. Chang, M., Ned, R.M., Hong, Y., Yesupriya, A., Yang, Q., Liu, T., Janssens, A.C.J.W., and Dowling, N.F. (2011). Racial/Ethnic Variation in the Association of Lipid-Related Genetic Variants With Blood Lipids in the US Adult Population. Circ. Cardiovasc. Genet. 4, 523–533. 25. Waters, K.M., Stram, D.O., Hassanein, M.T., Le Marchand, L., Wilkens, L.R., Maskarinec, G., Monroe, K.R., Kolonel, L.N., Altshuler, D., Henderson, B.E., et al. (2010). Consistent Association of Type 2 Diabetes Risk Variants Found in Europeans in Diverse Racial and Ethnic Groups. PLoS Genet. 6, e1001078. 26. Lee, S.H., Yang, J., Goddard, M.E., Visscher, P.M., and Wray, N.R. (2012). Estimation of pleiotropy between complex diseases using single-nucleotide polymorphism-derived genomic relationships and restricted maximum likelihood. Bioinformatics 28, 2540–2542. 27. Falconer, D.S., and Mackay, T.F.C. (1996). Introduction to quantitative genetics (Essex, England: Longman). 28. Loh, P.-R., Tucker, G., Bulik-Sullivan, B.K., Vilhjálmsson, B.J., Finucane, H.K., Salem, R.M., Chasman, D.I., Ridker, P.M., Neale, B.M., Berger, B., et al. (2015). Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47, 284–290. 29. Yang, J., Benyamin, B., McEvoy, B.P., Gordon, S., Henders, A.K., Nyholt, D.R., Madden, P.A., Heath, A.C., Martin, N.G., Montgomery, G.W., et al. (2010). Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42, 565–569. 30. So, H.-C., Li, M., and Sham, P.C. (2011). Uncovering the total heritability explained by all true susceptibility variants in a genome-wide association study. Genet. Epidemiol. n/a – n/a. 31. Vattikuti, S., Guo, J., and Chow, C.C. (2012). Heritability and Genetic Correlations Explained by Common SNPs for Metabolic Syndrome Traits. PLoS Genet. 8, e1002637.
.CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint
https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/
15
32. ’t Hoen, P.A.C., Friedländer, M.R., Almlöf, J., Sammeth, M., Pulyakhina, I., Anvar, S.Y., Laros, J.F.J., Buermans, H.P.J., Karlberg, O., Brännvall, M., et al. (2013). Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nat. Biotechnol. 31, 1015–1022. 33. Morris, A.P., Voight, B.F., Teslovich, T.M., Ferreira, T., Segrè, A.V., Steinthorsdottir, V., Strawbridge, R.J., Khan, H., Grallert, H., Mahajan, A., et al. (2012). Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat. Genet. 44, 981–990. 34. Cho, Y.S., Chen, C.-H., Hu, C., Long, J., Hee Ong, R.T., Sim, X., Takeuchi, F., Wu, Y., Go, M.J., Yamauchi, T., et al. (2012). Meta-analysis of genome-wide association studies identifies eight new loci for type 2 diabetes in east Asians. Nat Genet 44, 67–72. 35. Su, Z., Marchini, J., and Donnelly, P. (2011). HAPGEN2: simulation of multiple disease SNPs. Bioinformatics 27, 2304–2305. 36. Chang, C.C., Chow, C.C., Tellier, L.C., Vattikuti, S., Purcell, S.M., and Lee, J.J. (2015). Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4,. 37. Lee, S.H., Wray, N.R., Goddard, M.E., and Visscher, P.M. (2011). Estimating Missing Heritability for Disease from Genome-wide Association Studies. Am. J. Hum. Genet. 88, 294–305. 38. Speed, D., Hemani, G., Johnson, M.R., and Balding, D.J. (2012). Improved Heritability Estimation from Genome-wide SNPs. Am. J. Hum. Genet. 91, 1011–1021. 39. Yang, J., Bakshi, A., Zhu, Z., Hemani, G., Vinkhuyzen, A.A.E., Lee, S.H., Robinson, M.R., Perry, J.R.B., Nolte, I.M., van Vliet-Ostaptchouk, J.V., et al. (2015). Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nat. Genet. 47, 1114–1120. 40. Price, A.L., Helgason, A., Thorleifsson, G., McCarroll, S.A., Kong, A., and Stefansson, K. (2011). Single-Tissue and Cross-Tissue Heritability of Gene Expression Via Identity-by-Descent in Related or Unrelated Individuals. PLoS Genet. 7, e1001317. 41. Gaffney, D.J. (2013). Global Properties and Functional Complexity of Human Gene Regulatory Variation. PLoS Genet. 9, e1003501. 42. Stahl, E.A., Wegmann, D., Trynka, G., Gutierrez-Achury, J., Do, R., Voight, B.F., Kraft, P., Chen, R., Kallberg, H.J., Kurreeman, F.A.S., et al. (2012). Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nat. Genet. 44, 483–489. 43. Marenberg, M.E., Risch, N., Berkman, L.F., Floderus, B., and de Faire, U. (1994). Genetic susceptibility to death from coronary heart disease in a study of twins. N. Engl. J. Med. 330, 1041–1046. 44. Nora, J.J., Lortscher, R.H., Spangler, R.D., Nora, A.H., and Kimberling, W.J. (1980). Genetic--epidemiologic study of early-onset ischemic heart disease. Circulation 61, 503–508. 45. Chen, X., Kuja-Halkola, R., Rahman, I., Arpegård, J., Viktorin, A., Karlsson, R., Hägg, S., Svensson, P., Pedersen, N.L., and Magnusson, P.K.E. (2015). Dominant Genetic Variation and Missing Heritability for Human Complex Traits: Insights from Twin versus Genome-wide Common SNP Models. Am. J. Hum. Genet. 97, 708–714. 46. Zhu, Z., Bakshi, A., Vinkhuyzen, A.A.E., Hemani, G., Lee, S.H., Nolte, I.M., van Vliet-Ostaptchouk, J.V., Snieder, H., Esko, T., Milani, L., et al. (2015). Dominance Genetic Variation Contributes Little to the Missing Heritability for Human Complex Traits. Am. J. Hum. Genet. 96, 377–385. 47. Marchini, J., and Howie, B. (2010). Genotype imputation for genome-wide association studies. Nat Rev Genet 11, 499–511. 48. Ma, R.C.W., and Chan, J.C.N. (2013). Type 2 diabetes in East Asians: similarities and differences with populations in Europe and the United States: Diabetes in East Asians. Ann. N. Y. Acad. Sci. 1281, 64–91.
FigureTitlesandLegends
Figure1:Trueandestimatedgeneticimpactandeffectcorrelation.Allsimulations
conductedwithsimulatedEURandEASheritabilityof0.5using4499simulatedEURand
4836simulatedEASindividualsat248,953SNPs.
Figure2:DistributionofgeneticcorrelationcomparisonbetweenpopcornandGCTA.
Distributionwascomputedusingagaussiankdeonthesetofgeneticcorrelationestimates.
.CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint
https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/
16
Figure3:Geneticcorrelationasafunctionofheritabilityforgeneexpression.Themeanand
standarderrorofthegeneticcorrelationofthesetofgeneswithh12andh22exceeding
thresholdhineachanalysis(y-axis)isplottedagainsth(x-axis).
Tables
MAF>0.01 MAF>0.05
Model Independent 0.500 0.478 0.500 0.460 0.500 0.488 0.509 0.469
Inverse 0.431 0.500 0.567 0.496 0.479 0.500 0.555 0.482
Rare 0.500 0.467 0.382 0.863 0.500 0.496 0.998 0.756
Common 0.500 0.500 0.522 0.493 0.500 0.500 0.502 0.496
Difference 0.500 0.416 0.354 0.435 0.500 0.461 0.410 0.412
Adversarial 0.710 0.604 0.525 0.651 0.714 0.667 0.601 0.675
Table1:Trueandestimatedvaluesofthegeneticimpactandeffectcorrelationinsimulated
EUR-likeandEAS-likegenotypes.Resultsaretheaverageof100simulationswith
phenotypeheritabilityof0.5ineachpopulation.
Table2:HeritabilityandgeneticcorrelationofRAandT2DbetweenEURandEAS
populations.EURRAdatacontained8,875casesand29,367controlsforastudyprevalence
of0.23.EASRAdatacontained4,873casesand17,642controlsforastudyprevalenceof
0.22.RAdiseaseprevalencewasassumedtobe0.5%inbothpopulations7.T2DEURdata
contained12171casesand56862controlsforastudyprevalenceof0.18.T2DEASdata
⇢ge ⇢gi ⇢̂ge ⇢̂gi ⇢ge ⇢gi ⇢̂ge ⇢̂gi
hEUR2liability hEAS2liability ρge ρgi
RA Est.(SE) 0.18(0.02) 0.22(0.03) 0.46(0.06) 0.46(0.06)
95%CI [0.15,0.21] [0.16,0.28] [0.34, 0.58] [0.34, 0.58]
p>0 3.90e-32 1.89e-17 1.37e-15 8.16e-16
p0 2.41e-77 5.73e-7 1.70e-12 2.85e-13
p
17
contained6952casesand11865controlsforastudyprevalenceof0.37.T2DEUR
prevalencewasassumedtobe8%33whileT2DEASprevalencewasassumedtobe9%48.
.CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint
https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/
.CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint
https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/
.CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint
https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/
.CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint
https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/Recommended