Genetic Mapping by Bulk Segregant Analysis in Drosophila: Experimental Design and Simulation-Based Inference

John E. Pool

Laboratory of Genetics, University of Wisconsin–Madison, Madison, WI, 53706

Running Title: Bulk Segregant Mapping in Drosophila

Keywords: bulk segregant analysis, quantitative trait locus mapping, simulation, experimental design, Drosophila

Contact:
425-G Henry Mall
Madison, WI 53706
+1-608-265-1036

bioRxiv preprint doi: https://doi.org/10.1101/057984; this version posted June 9, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.
ABSTRACT:

Identifying the genomic regions that underlie complex phenotypic variation is a key challenge in modern biology. Many approaches to quantitative trait locus mapping in animal and plant species suffer from limited power and genomic resolution. Here, I investigate whether bulk segregant analysis (BSA), which has been successfully applied for yeast, may have utility in the genomic era for trait mapping in Drosophila (and other organisms that can be experimentally bred in similar numbers). I perform simulations to investigate the statistical signal of a quantitative trait locus (QTL) in a wide range of BSA and introgression mapping (IM) experiments. BSA consistently provides more accurate mapping signals than IM (in addition to allowing the mapping of multiple traits from the same experimental population). The performance of BSA and IM is maximized by having multiple independent crosses, more generations of interbreeding, larger numbers of breeding individuals, and greater genotyping effort, but is less affected by the proportion of individuals selected for phenotypic extreme pools. I also introduce a prototype analysis method for Simulation-based Inference for BSA Mapping (SIBSAM). This method identifies significant QTLs and estimates their genomic confidence intervals and relative effect sizes. Importantly, it also tests whether overlapping peaks should be considered as two distinct QTLs. This approach will facilitate improved trait mapping in Drosophila and other species for which hundreds or thousands of offspring (but not millions) can be studied.

INTRODUCTION:

Connecting phenotypic diversity to the genetic variants that encode it is a fundamental challenge for modern biology. In evolutionary research, there is strong interest in revealing the genetic architecture of adaptive phenotypic change, including the number of causative genes and mutations, and their functional and population genetic properties. In molecular genetics, the mapping of phenotypic differences from natural or induced mutations has great utility for elucidating genetic pathways that underlie specific biological processes. In animal and plant breeding, localizing the genes underlying agronomically important trait variation can be a key step toward genetic improvement.
Especially in species that can be experimentally crossed, quantitative trait locus (QTL) mapping provides an important tool for identifying genomic regions that contain causative genetic variants underlying a trait difference. Often, the F2 or later offspring of a cross between phenotypically contrasting parental strains are genotyped, individually or in groups, to identify sections of the genome that were inherited non-randomly with respect to the phenotype (often on the megabase scale). The simplest example of QTL analysis is F2 mapping, in which individual second generation offspring are phenotyped and genotyped. To achieve much genomic precision, however, this method requires the individual genotyping of a large number of F2 offspring. Preparing many genomic DNA libraries for next generation sequencing is often a time- and resource-intensive proposition, although progress has been made in this regard (Andolfatto et al. 2011).
Introgression mapping (IM) provides another alternative for QTL analysis. Here, following an initial cross between parental strains A and B, offspring of subsequent generations are repeatedly selected for strain A’s phenotype, but are back-crossed to strain B (Figure 1). To allow recessive variants to be selected, this selection and introgression can be performed in every second generation. The desired result is an introgression line that is largely similar to strain B across the genome, but that matches strain A at loci that were selected along with the phenotype. A notable modern example of this approach is described by Early and Jones (2011), who introgressed a behavioral difference from D. simulans into D. sechellia. Here, 30 F2 females were tested for simulans-like behavior, and a subset was then back-crossed to D. sechellia. After repeating this process for 15 generations, next-generation sequencing was used to identify genomic regions that introgressed with the trait from D. simulans.
In bulk segregant analysis (BSA), large numbers of progeny (from F2 or later generations) are sorted/selected by phenotype, then contrasting phenotypic pools of individuals are each genotyped (Figure 1) (Michelmore et al. 1991). Compared to IM, BSA may allow for a larger number of unique recombination events to be generated and sampled, which could yield sharper QTL peaks. Like IM, BSA does not require large numbers of offspring to be individually genotyped; instead, each phenotypic extreme can be sequenced as a single pool. BSA has been applied very successfully for selectable traits in yeast (e.g. Ehrenreich et al. 2010; Magwene et al. 2011; Parts et al. 2011), facilitated by a small genome and the ease of generating millions of segregants. BSA has also seen diverse applications to trait mapping in multicellular organisms (e.g. Michelmore et al. 1991; Wicks et al. 2001; Baird et al. 2008; Van Leeuwen et al. 2012; Haase et al. 2015), including Drosophila (Lai et al. 2007).
Here, I use simulations to examine the mapping signals of BSA and IM under a wide range of experimental parameters for the mapping of multi-gene traits. I find that BSA produces stronger and better-localized mapping signals for all studied experimental designs. The tradeoffs of effort and performance indicated by these results, along with the new simulation programs that produced them, will help researchers design more effective mapping experiments.
I also use this BSA simulation approach to devise a new QTL inference method. Existing BSA analysis methods effectively identify QTLs from yeast data (e.g. Magwene et al. 2011; Edwards et al. 2012). However, these methods do not allow the discrimination of two nearby QTL peaks versus a single peak with noisy, ragged contours, an issue that may be more problematic for organisms in which many fewer segregants can be surveyed relative to yeast. These methods also do not estimate the relative strength of each QTL. The BSA inference method proposed here uses a multi-step simulation process to (1) identify significant QTLs and their genomic confidence intervals, (2) separate single from multiple linked QTLs, and (3) provide a rough estimate of the effect sizes of the identified QTLs. This method is validated using simulations in the present study, and applied to data in an accompanying article (Bastide et al. 2016).

MATERIALS AND METHODS

Preliminary simulations for BSA and IM
Simulation programs were written to assess the QTL signals of BSA and IM (software related to this article is available at https://github.com/JohnEPool/SIBSAM1). BSA simulation analyses focused on a summary statistic, “ancestry difference” (ad). For a given genetic marker locus or genomic window of sequence, ad refers to the difference between the high and low phenotypic pools in the proportion of ancestry from the parental strain with the higher phenotypic value. For example, if the high phenotypic pool is estimated to have 60% of its ancestry from this parental strain at a particular locus, and the low phenotypic pool 40%, then ad = 0.6 − 0.4 = 0.2. For IM, the proportion of ancestry in the mapping population from the non-backcross parental strain (ap) was evaluated. This quantity may approach zero for non-causative loci after many generations of back-crossing to the other parental strain. For each statistic, I examined how often the tallest local QTL peak was observed within 0.5 centiMorgans (cM) of the true simulated target locus, and the average (median) distance between the QTL peak and the target locus.
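As a minimal illustration of the ad statistic defined above, the following Python sketch computes it for one window. The function name and the input validation are illustrative assumptions, not part of the released SIBSAM software.

```python
def ancestry_difference(high_pool_fraction, low_pool_fraction):
    """Return ad: the high-pool minus low-pool proportion of ancestry
    from the parental strain with the higher phenotypic value."""
    for value in (high_pool_fraction, low_pool_fraction):
        if not 0.0 <= value <= 1.0:
            raise ValueError("ancestry proportions must lie in [0, 1]")
    return high_pool_fraction - low_pool_fraction


# Worked example from the text: 60% vs. 40% high-parent ancestry.
ad = ancestry_difference(0.6, 0.4)
print(round(ad, 3))  # 0.2
```

Under the null expectation of no causative locus, both pools carry the same ancestry proportion and ad is zero.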
The BSA and IM simulators are largely similar. These programs track parental strain ancestry along the chromosomes of each individual in the mapping population, from the F1 generation until the end of the experiment. A Poisson-distributed number of recombination events happen each generation, with the expected number for each chromosome being its length in Morgans (interference is not modeled). To focus on the case of Drosophila, chromosomes X, 2, and 3 were explicitly simulated, and no recombination was allowed in males. A total of 5,000 markers/windows were simulated on each chromosome. In the BSA simulation, a specified number of individuals exist in each new generation, and each one draws random parents from the previous generation, with no phenotypic selection until the last generation. In the IM simulations, individuals were subject to phenotypic selection in every second generation.
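The recombination step described above might be sketched as follows. This is an illustrative Python reconstruction, not the author's code: the function names are hypothetical, ancestry is stored as one 0/1 label per window, and breakpoints are assumed to fall uniformly among windows (that is, windows of equal genetic length).

```python
import math
import random


def poisson_draw(mean, rng):
    """Draw a Poisson variate with Knuth's multiplication method
    (adequate for the small per-chromosome means used here)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def make_gamete(hap_a, hap_b, length_morgans, is_male, rng):
    """Build one gamete from a parent's two haplotypes.

    Each haplotype is a list of per-window ancestry labels (0 or 1).
    Males transmit one intact haplotype; females switch between
    haplotypes at a Poisson-distributed number of breakpoints.
    """
    n_windows = len(hap_a)
    # Start on a randomly chosen parental haplotype.
    current, other = (hap_a, hap_b) if rng.random() < 0.5 else (hap_b, hap_a)
    if is_male:
        return list(current)
    n_breaks = poisson_draw(length_morgans, rng)
    breaks = sorted(rng.randrange(1, n_windows) for _ in range(n_breaks))
    gamete, start = [], 0
    for point in breaks + [n_windows]:
        gamete.extend(current[start:point])
        current, other = other, current  # cross over at this breakpoint
        start = point
    return gamete
```

An offspring would then be formed from one maternal and one paternal gamete per chromosome, with parents drawn at random from the previous generation.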
Phenotypes for each individual were modeled based on genotypes and random variance (the latter may stem from environmental effects, measurement error, or other causes). For most of these preliminary simulations, the same number of equal-effect loci were simulated on each chromosome arm (X, 2L, 2R, 3L, 3R). Random variance was added by modifying each individual’s phenotypic value by a normally-distributed random effect with mean zero and standard deviation (SD) equal to the average trait value. For example, if each of the five arms holds a single QTL that adds 1 to a diploid individual’s phenotypic value for each allele inherited from the high parental strain, the genetic contributions could range from 0 to 10, with a mean of 5, and the SD for environmental variance would also be 5. Phenotypic selection was then based on choosing a defined quantile (q) of individuals from the mapping population with the highest and the lowest phenotypic values.
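The phenotype model and tail selection described above can be sketched in Python as follows. The function names and data layout are assumptions made for illustration; the noise SD is set to the expected mean genetic value, matching the five-locus example in the text.

```python
import random


def simulate_phenotypes(high_allele_counts, effect_per_allele, rng):
    """Phenotype = additive genetic value + Gaussian noise.

    high_allele_counts[i] lists, for individual i, the number of
    high-parent alleles (0, 1, or 2) carried at each QTL. The noise SD
    equals the mean genetic value expected at 50% ancestry.
    """
    n_loci = len(high_allele_counts[0])
    noise_sd = n_loci * effect_per_allele  # midpoint of 0 .. 2 * n_loci * effect
    return [
        effect_per_allele * sum(counts) + rng.gauss(0.0, noise_sd)
        for counts in high_allele_counts
    ]


def select_tails(phenotypes, q):
    """Return index lists for the lowest and highest fraction q."""
    order = sorted(range(len(phenotypes)), key=phenotypes.__getitem__)
    n_tail = max(1, int(round(q * len(phenotypes))))
    return order[:n_tail], order[-n_tail:]
```

With five loci and an effect of 1 per allele, `noise_sd` evaluates to 5, as in the example above.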
For BSA, phenotypic selection happens only at the end of the experiment, followed by sequencing/genotyping of both high and low phenotypic pools. For IM, the last batch of selected individuals is sequenced and compared against the parental strains. The simulations model “depth” of sequencing coverage (or genotype sampling), drawing an appropriate number of random ancestry-informative reads from the selected pool of individuals for each window/marker. The proportion of ancestry from each parental strain is then calculated, and thus depends on both the sampling of individuals and the sampling of sequence reads.
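The read-sampling step might look like the following Python sketch. It assumes that each informative read comes from a randomly chosen chromosome copy in the selected pool, sampled with replacement; the function name is hypothetical.

```python
import random


def sample_pool_ancestry(pool_allele_labels, depth, rng):
    """Estimate a pool's ancestry proportion at one window from reads.

    pool_allele_labels holds one label per chromosome copy in the
    selected pool (1 = high-parent ancestry, 0 = low-parent ancestry).
    """
    reads = rng.choices(pool_allele_labels, k=depth)
    return sum(reads) / depth


rng = random.Random(3)
pool = [1] * 60 + [0] * 40  # 60% high-parent ancestry among pooled copies
estimate = sample_pool_ancestry(pool, 300, rng)
```

The estimate thus carries two layers of sampling noise: which individuals entered the pool, and which of their alleles were read.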
To facilitate consistent analysis, QTLs in these preliminary simulations were spaced uniformly and each was assigned a specific analysis zone along the chromosome. For example, if the X chromosome had five QTLs, they would be placed at relative positions 0.1, 0.3, 0.5, 0.7, and 0.9 (representing the chromosome as a 0 to 1 interval). Their zones of analysis would then be 0 to 0.2, 0.2 to 0.4, and so on. The assessment of QTL signal strength and precision was based on the location within its zone of the highest QTL peak (i.e. the maximum ad or ap), relative to the true QTL position.
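The uniform placement and zone-based scoring can be sketched as below. This Python example uses hypothetical function names and, for simplicity, measures distance as a fraction of the chromosome rather than in cM.

```python
def qtl_layout(n_qtl):
    """Return (positions, zones) for n_qtl evenly spaced loci on a
    chromosome represented as the interval 0 to 1."""
    width = 1.0 / n_qtl
    positions = [(i + 0.5) * width for i in range(n_qtl)]
    zones = [(i * width, (i + 1) * width) for i in range(n_qtl)]
    return positions, zones


def peak_offset(stat_values, zone, true_position):
    """Distance between the true QTL and the window with the maximum
    statistic (ad or ap) inside the QTL's analysis zone."""
    n = len(stat_values)
    # Window i is assigned to a zone by its midpoint, (i + 0.5) / n.
    in_zone = [i for i in range(n) if zone[0] <= (i + 0.5) / n < zone[1]]
    best = max(in_zone, key=stat_values.__getitem__)
    return abs((best + 0.5) / n - true_position)


positions, zones = qtl_layout(5)  # positions near 0.1, 0.3, 0.5, 0.7, 0.9
```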
Most simulation analyses assumed that each mapping experiment would be analyzed separately. However, I also investigated cases where multiple independent mapping populations were constructed from parental strains sharing the same causative genetic differences. Here, ad or ap for each window was summed across replicated mapping populations.
For a wide variety of experimental parameter combinations, 1,000 independent replicates were simulated and analyzed, and statistical performance was compared between these scenarios to aid in the optimization of experimental design.

Simulation-based inference of QTL from BSA: Overview
Preliminary empirical BSA data from the Pool laboratory indicated the need for a QTL inference method capable of dealing with neighboring QTLs that have wide, overlapping statistical signals. Such scenarios are difficult to account for in most analysis approaches, but the simulation framework described above offers a potentially flexible foundation for QTL inference. I therefore developed a method of “Simulation-based Inference for Bulk Segregant Analysis Mapping” (SIBSAM). SIBSAM uses BSA simulations analogous to those described above, in order to identify and localize significant QTLs, estimate their strength, and distinguish individual QTL among clusters of linked causative loci.
Throughout the SIBSAM pipeline, the distinction between primary QTL peaks and secondary QTL peaks is relevant. A primary QTL peak is defined based on the highest value of ad across a continuous interval in which this statistic remains above zero (which is the null value expected in the absence of causative loci). A secondary QTL peak within that same interval has a lower height than its associated primary peak. An important quantity in assessing the significance of a secondary peak is its “secondary deviation” (v), defined as the difference between secondary peak height and the minimum ad value between the primary and secondary peaks (Figure 2). Multiple secondary peaks may be associated with the same primary peak, impacting the calculation of v, as discussed below.
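For the simple case of one secondary peak, the secondary deviation v defined above reduces to a short calculation. The Python sketch below is illustrative only; the function name and the example ad values are invented.

```python
def secondary_deviation(ad_values, primary_index, secondary_index):
    """Return v: secondary peak height minus the minimum ad found
    between the primary and secondary peak windows."""
    lo, hi = sorted((primary_index, secondary_index))
    valley = min(ad_values[lo:hi + 1])
    return ad_values[secondary_index] - valley


# Hypothetical ad profile: primary peak at index 2, secondary at index 6.
ad = [0.1, 0.3, 0.5, 0.35, 0.2, 0.3, 0.4, 0.25]
v = secondary_deviation(ad, 2, 6)  # 0.4 minus the valley of 0.2
```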
A schematic of the SIBSAM pipeline is illustrated in Figure 3. First, primary and secondary peaks of ad are identified from the empirical data. To determine which primary peaks are unexpected in the absence of true QTLs, null simulations are conducted in which phenotypes are determined by non-genetic factors only. P values can then be obtained for each primary peak. Next, simulations with a single causative QTL are conducted. Based on a rejection sampling approach, estimates of the strength and genomic confidence intervals of each significant primary peak are obtained, along with a P value for each secondary peak. Lastly, simulations involving a cluster of linked QTLs are conducted, reflecting a primary peak and its associated secondary peak(s). This phase allows for the refinement of strength estimates and genomic confidence intervals for each peak in the cluster.
All of the above simulations operate with user-defined windows of variable cM length. These windows could also be viewed as markers separated by various cM distances, but this article’s terminology mainly assumes that QTL mapping data comes from the full resequencing of mapping population genomes. In the examples presented here, the window bp spans were based on D. melanogaster polymorphism data (Lack et al. 2015) and cM distances were calculated from empirical recombination rate estimates (Comeron et al. 2012). Windows were defined to each contain 200 non-singleton variable sites from the Zambia-Siavonga population sample. The user can also define the “informative depth” for each window in each phenotypic pool. This quantity refers to the number of sequence reads that contain information about parental strain ancestry. The simulator will draw a corresponding number of alleles at this window for ancestry proportion calculations.

SIBSAM identification of primary and secondary peaks from empirical data
Primary and secondary peaks of ad are identified from data based on preliminary thresholds for primary peak height and secondary peak deviation (ad_t and v_t, respectively), plus an optional smoothing step. The two thresholds should represent values low enough that no shorter peak would be statistically significant (the default value for both is 0.1).
The smoothing enabled here is a simple weighted average. On each side of the focal window, m flanking windows are included (the default used here is m = 4). The focal window receives a weight of m + 1, the adjacent window on each side receives a weight of m, the next windows receive a weight of m − 1, and so on until the mth window to each side receives a weight of 1. Alternative smoothing schemes are not a focus of this study; the optimal strategy should depend on the data being analyzed. Empirical and simulated ad values must be smoothed using the same procedure.
Primary peak identification is straightforward: the highest value of ad in a continuous block of windows with ad > 0, conditional on the peak value of ad exceeding ad_t. To identify secondary peaks, local minima and maxima of ad moving away from the primary peak are noted. A recovery, beyond v_t, from the low point since the last peak signifies a new secondary peak. When ad drops more than v_t below this secondary peak’s maximum value, this peak ends and its maximum value and associated window position are noted. Statistical significance of these primary and secondary peaks, along with their confidence intervals and relative strengths, will be assessed in subsequent stages of this pipeline.
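One possible reading of the peak-calling rules above is sketched below in Python. The function names are hypothetical, only the rightward scan from the primary peak is shown (a leftward scan would mirror it), and a secondary peak still open at the end of the block is assumed to be reported.

```python
def find_primary_peaks(ad_values, ad_threshold=0.1):
    """Return (peak_index, peak_height, block_start, block_end) for each
    continuous block of windows with ad > 0 whose maximum exceeds
    ad_threshold. ad_values is a list; block_end is exclusive."""
    peaks, start = [], None
    # A trailing 0.0 sentinel closes a block that runs to the last window.
    for i, value in enumerate(list(ad_values) + [0.0]):
        if value > 0 and start is None:
            start = i
        elif value <= 0 and start is not None:
            best = max(range(start, i), key=ad_values.__getitem__)
            if ad_values[best] > ad_threshold:
                peaks.append((best, ad_values[best], start, i))
            start = None
    return peaks


def find_secondary_peaks_right(ad_values, primary_index, block_end,
                               v_threshold=0.1):
    """Scan rightward from the primary peak and return (index, height)
    for each secondary peak."""
    peaks = []
    low = ad_values[primary_index]  # lowest ad seen since the last peak
    in_peak, best = False, None
    for i in range(primary_index + 1, block_end):
        value = ad_values[i]
        if not in_peak:
            low = min(low, value)
            if value - low > v_threshold:  # recovery beyond v_t
                in_peak, best = True, i
        elif value > ad_values[best]:
            best = i
        elif ad_values[best] - value > v_threshold:  # the peak has ended
            peaks.append((best, ad_values[best]))
            in_peak, low = False, value
    if in_peak:
        peaks.append((best, ad_values[best]))
    return peaks
```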

SIBSAM identification of significant primary peaks
The false positive probability (P) for each primary peak is estimated by comparing empirical peak heights against simulations under the null hypothesis of no true QTLs, in which all phenotypic variance in the mapping population is random with respect to genotype. All primary peaks exceeding ad_t from each simulation replicate are noted. The enrichment (e) of peaks equal to or greater than a given peak’s height in the real data is given by the ratio of the frequency of peaks of this height in the real data relative to the simulated data. If there is an enrichment (e > 1), an estimate of the proportion of real peaks of this height representing false positives is then given by 1/e. For example, if ad peaks of at least 0.2 in height are three times more common in the empirical data than in null simulations, then on average one out of three such empirical peaks can be explained by the expected false positive rate. Primary peaks with an estimated P less than some threshold (by default, 0.05) are carried forward for subsequent analysis.
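The enrichment calculation might be implemented along the following lines. This Python sketch assumes that "frequency" means the number of peaks at or above the focal height per genome-wide data set, averaged over null replicates; the function name and the handling of edge cases (no null peaks, no enrichment) are illustrative choices.

```python
def primary_peak_p_value(peak_height, empirical_heights,
                         null_heights_by_replicate):
    """Estimate P for one primary peak as 1 / enrichment.

    empirical_heights: primary peak heights in the real data.
    null_heights_by_replicate: one list of primary peak heights per
    null simulation replicate.
    """
    n_real = sum(h >= peak_height for h in empirical_heights)
    n_null = sum(h >= peak_height
                 for replicate in null_heights_by_replicate
                 for h in replicate)
    null_rate = n_null / len(null_heights_by_replicate)
    if null_rate == 0:
        return 0.0  # no null peak this tall; P is below simulation resolution
    if n_real == 0:
        return 1.0
    enrichment = n_real / null_rate
    return min(1.0, 1.0 / enrichment)
```

In the example from the text, three empirical peaks of at least 0.2 against an average of one such peak per null replicate gives e = 3 and P = 1/3.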

SIBSAM inferences from single QTL simulations
Genomic simulations with a single QTL are used to estimate the genomic confidence intervals and strength of each primary peak, along with a P value for each secondary peak. Single QTL simulations are performed with fixed genomic positions corresponding to the window with the peak maximum ad for each QTL, thus conserving local window patterns of depth and cM distance. For a given set of simulated genomes from the mapping population (pre-selection), a random QTL effect size is drawn. Such a QTL is then separately simulated at each position corresponding to an empirical primary peak, with phenotype simulation and read sampling performed separately in each case. The simulated cage ancestries are reused for each separate QTL simulation as a time-saving efficiency.
The simulated QTL strength, s, ranging from 0 to 1, is the estimated proportion of variance that a QTL explains among the mapping population individuals. In these single locus simulations, all other phenotypic contributions are modeled as random variance, which here is intended to encompass the effects of unlinked QTLs in addition to non-genetic effects on phenotypic measurements. The amount of random variance simulated is fixed to approximate the variance contributed by a codominant locus in which each allele adds 1 to the phenotypic score. This effect was implemented by obtaining Gaussian random values with mean 0 and standard deviation 1, and then multiplying each value by 0.5 to obtain the random variance effect on each individual’s phenotypic score. The simulated effect size of each QTL, e, describes the quantity that each allele of this locus (inherited from the high parental strain) adds to an individual’s phenotypic score. Since random effects correspond to the variance contributed by a locus with e = 1, the proportion of variance contributed by a single QTL (s) is equal to e/(1 + e). Correspondingly, a single QTL intended to have strength s is simulated with an effect size e = s/(1 − s).
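The conversion between effect size and strength can be checked numerically. The Python sketch below takes the relationship s = e/(1 + e) stated above as given and inverts it algebraically, which yields e = s/(1 − s); the function names are illustrative.

```python
def strength_from_effect(e):
    """Proportion of variance s for a per-allele effect size e,
    using the relationship s = e / (1 + e)."""
    return e / (1.0 + e)


def effect_from_strength(s):
    """Invert s = e / (1 + e): s + s*e = e, so e = s / (1 - s)."""
    if not 0.0 <= s < 1.0:
        raise ValueError("strength must lie in [0, 1)")
    return s / (1.0 - s)


# A locus with e = 1 matches the random variance, so s = 0.5.
half = strength_from_effect(1.0)
```

A round trip through both functions should return the starting value, which offers a quick check on any reimplementation.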
For each simulated replicate, the simulated strength is recorded, along with each QTL’s maximum height, peak window location, and maximum secondary deviation. To analyze the one locus simulation data for each primary peak, a rejection sampling approach is used to identify simulation replicates in which maximum ad falls within a specified tolerance (default 0.025) of the empirical peak’s maximum ad. For each accepted simulation replicate, the strength of the simulated locus goes into the posterior distribution for the empirical QTL’s strength (from which strength values corresponding to the 0.05, 0.5, and 0.95 quantiles are returned). A genomic confidence interval is similarly obtained by examining the far left and far right quantiles for the simulated peak locations resulting from a QTL simulated at the empirical peak location. This assumes a certain transitivity. Here, we are simulating QTLs with fixed positions and observing how far away the maximum ad falls in these simulations. In the empirical data, we observe the location of the maximum ad, and we’d like to know how far from this window the true QTL might be. Thus, we assume the distances from true QTL to maximum ad in the simulated data are a good proxy for the distances between maximum ad and true QTL in the empirical data.
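The rejection step for QTL strength can be sketched as follows. This Python example is a simplified illustration: the function name is hypothetical, and the nearest-rank quantile rule is an assumption, since the text does not state how quantiles are computed.

```python
def rejection_posterior(empirical_peak_ad, sim_strengths, sim_peak_ads,
                        tolerance=0.025, quantiles=(0.05, 0.5, 0.95)):
    """Keep simulated strengths whose peak ad lies within `tolerance`
    of the empirical peak ad, then summarize them by quantiles.

    Returns (quantile_values, number_accepted).
    """
    accepted = sorted(
        s for s, ad in zip(sim_strengths, sim_peak_ads)
        if abs(ad - empirical_peak_ad) <= tolerance
    )
    if not accepted:
        raise ValueError("no simulations accepted; add replicates "
                         "or widen the tolerance")

    def quantile(q):
        # Simple nearest-rank rule on the sorted accepted values.
        return accepted[min(len(accepted) - 1, int(q * len(accepted)))]

    return [quantile(q) for q in quantiles], len(accepted)
```

The same accepted replicates would supply the simulated peak locations from which the genomic confidence interval is taken.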
Lastly, the secondary deviations from each accepted simulation enable P values to be calculated for each of the empirical primary peak’s associated secondary peaks. If more than one secondary peak is present on the same side of the primary peak in the empirical data, the tallest secondary peak is tested first, and its v is based on the difference between its height and the lowest ad value between itself and the primary peak (even if other secondary peaks exist between this peak and valley; Figure 2). For a shorter secondary peak between a primary peak and a taller secondary peak, v would be defined as the difference between its height and the higher of the valleys on either side of it. Giving taller peaks this priority avoids the situation of a shorter secondary peak being deemed significant and a taller peak beyond it missing this threshold (as might occur if secondary peaks were simply evaluated sequentially by position). After such adjustments, each secondary peak deviation in the empirical data associated with this primary peak is compared to the distribution of v from accepted simulations. The proportion of simulations with a v greater than observed for a given empirical secondary peak becomes the P value for that peak (i.e. the probability of getting a secondary deviation this extreme when the true model is a single QTL of the observed magnitude).
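The final comparison described above is a simple tail proportion. In the Python sketch below, the function name is hypothetical and the simulated values are invented for illustration.

```python
def secondary_peak_p_value(observed_v, accepted_sim_v):
    """Proportion of accepted single-QTL simulations whose maximum
    secondary deviation exceeds the observed deviation."""
    if not accepted_sim_v:
        raise ValueError("no accepted simulations to compare against")
    exceed = sum(v > observed_v for v in accepted_sim_v)
    return exceed / len(accepted_sim_v)


# Ten hypothetical accepted replicates; two exceed the observed v of 0.2.
sim_v = [0.05, 0.1, 0.25, 0.15, 0.3, 0.02, 0.0, 0.12, 0.08, 0.11]
p = secondary_peak_p_value(0.2, sim_v)
```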

SIBSAM inferences from QTL cluster simulations
In cases where an empirical primary peak is accompanied by one or more statistically significant secondary peaks, the strengths and confidence intervals of all peaks in this “QTL cluster” are best approximated from simulations that include each member QTL. For example, a pair of nearby QTLs may each add to the ad peak height of the other, leading to overestimates of effect size. Therefore, multi-QTL simulations are conducted separately for each QTL cluster inferred from the empirical data. For simplicity, the window position of each simulated QTL is fixed according to the windows showing maximum ad for each significant peak in the empirical cluster. To examine each QTL separately, each is assigned an analysis zone with boundaries corresponding to the empirical valleys (local minima) between peaks. Moving away from the outer peaks in the cluster, this analysis zone is bounded only by the ends of the chromosome.
For each cluster simulation replicate, a random strength value is first drawn for the full cluster (representing the cumulative proportion of phenotypic variance explained by the QTLs in this cluster). That cluster strength is randomly apportioned among the QTLs, and each peak’s strength is then translated into the simulated effect size as described above.
A cluster simulation replicate is accepted only if the local maximum ad in every QTL’s analysis zone falls within a tolerance of the corresponding empirical peak heights. Here, it could be necessary to use a slightly higher tolerance value to accrue enough accepted simulations (default ad tolerance 0.05). This or any other simulation step in SIBSAM can be parallelized to increase the number of replicates, followed by joint analysis of multiple simulation output files (Figure 3).
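The cluster step can be sketched in Python as shown below. Both function names are hypothetical. The text does not say how the cluster strength is apportioned, so the uniform "broken stick" split used here is purely an assumption; the acceptance rule follows the description above.

```python
import random


def apportion_strength(total_strength, n_qtl, rng):
    """Split a cluster's total strength among n_qtl loci at random.
    Uniform break points are an assumed, illustrative choice."""
    cuts = sorted(rng.random() for _ in range(n_qtl - 1))
    edges = [0.0] + cuts + [1.0]
    return [total_strength * (b - a) for a, b in zip(edges, edges[1:])]


def accept_cluster_replicate(sim_zone_maxima, empirical_heights,
                             tolerance=0.05):
    """Accept only if every zone's simulated maximum ad is within
    `tolerance` of the matching empirical peak height."""
    return all(abs(sim - emp) <= tolerance
               for sim, emp in zip(sim_zone_maxima, empirical_heights))
```

Because every peak must match at once, the acceptance rate falls quickly as cluster size grows, which is consistent with the wider default tolerance noted above.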
The estimated strength of each peak in the cluster, along with confidence intervals of strength and genomic position, is obtained from a similar rejection process as described for the one locus simulations (based on the distribution of strength values and peak locations for that peak among the accepted simulations). Thus, the cluster QTL simulations provide estimates of effect size and genomic confidence intervals for all significant secondary peaks. They also replace prior estimates of these quantities for the associated primary peaks, since cluster estimates that account for the effects of linked QTLs should be more accurate.
The final SIBSAM output file contains, for each significant primary and secondary peak, its P value, the genomic coordinates of the peak window and the confidence interval for the QTL’s genomic location, and the point estimate and confidence interval for QTL strength. Information such as P values for non-significant peaks can be found in the intermediate files produced at different stages of the SIBSAM pipeline (Figure 3).

Simulations testing the performance of SIBSAM
Simulation testing of SIBSAM was performed to test its QTL detection power under different scenarios, and to confirm that estimates and confidence intervals of genomic location and QTL strength were performing in line with expectations. Although a nearly infinite range of scenarios could potentially be investigated, I focused on experimental parameters relevant to our current empirical applications in Drosophila (e.g. Bastide et al. 2016), in which 1,200 individuals interbreed for 16 generations, and 10% phenotypic tails are selected for sequencing. Test simulations sampled 1,000 informative reads for each window for each phenotypic pool, which is about half the median depth per window from current empirical applications (e.g. Bastide et al. 2016). Windows were designed to each contain 200 non-singleton variable sites in the Zambia-Siavonga population genomic data described by Lack et al. (2015). These 14,107 windows had a median length of 6.8 kb.
Simulations with one genuine QTL were performed with varying locus strengths (s = 0.05, 0.1, 0.15, 0.2, 0.25, 0.33, 0.5). These initial test simulations used fixed genomic positions corresponding to the locations of Drosophila pigmentation genes tan (on the X chromosome) and ebony (on arm 3R). Additional 3R scenarios with s = 0.2 investigated the consequences of the remaining variance being due to unlinked QTLs (1 with s = 0.8 or else 4 others with s = 0.2) instead of random Gaussian variance. Comparing each test replicate against SIBSAM null simulations revealed the true positive rate for QTL detection. Running the test replicates through the SIBSAM one locus simulation analysis indicated the frequency at which secondary QTLs were falsely inferred, and allowed the inferred distributions of QTL strength and genomic location to be compared against known true values.
Additional simulations were conducted (focusing on the 3R location) to investigate SIBSAM’s performance in the presence of two linked QTLs. Scenarios with symmetric QTL strength (s = 0.15 or 0.3) and asymmetric QTL strength (s = 0.15 and 0.3) were investigated. The distance between the two QTLs was varied at 2.5, 5, 10, and 25 cM. The test replicates were then evaluated with SIBSAM to (1) test the power to detect one or both QTLs, (2) test the rate of falsely detecting three or more QTLs, (3) evaluate the performance of QTL localization, and (4) evaluate the performance of QTL size estimation.

RESULTS:

Initial Simulation Study of BSA and IM
Simulations were performed to examine the properties of QTL signals under BSA and IM approaches. Importantly, these exploratory simulations are not connected to any formal QTL inference. Instead, they focus on the performance of summary statistics related to the signature of a QTL. For BSA, I examine ancestry difference (ad), the difference between high and low phenotypic pools in the proportion of ancestry sampled from the parental strain with the higher phenotypic value (at a particular genomic locus). For IM, I examine ancestry proportion (ap), the proportion of the mapping population’s ancestry that derives from the non-backcross parental strain. Rather than focusing on the raw values of these statistics, I assess the performance of BSA and IM by examining the genetic distance between a true simulated QTL and the “QTL peak” (the maximum value of ad or ap in this part of the genome).
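As an illustration of these two summary quantities, the following minimal sketch computes ad per window from two pools and then the genetic distance between the ad maximum and a known QTL position. The window positions, pool frequencies, and function names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def ancestry_difference(high_pool_freq, low_pool_freq):
    """Per-window ad: high-pool minus low-pool frequency of high-strain ancestry."""
    return [h - l for h, l in zip(high_pool_freq, low_pool_freq)]


def peak_to_qtl_distance(stat_values, window_cm, qtl_cm):
    """Genetic distance (cM) between the window with the maximum statistic and the QTL."""
    peak_index = max(range(len(stat_values)), key=lambda i: stat_values[i])
    return abs(window_cm[peak_index] - qtl_cm)


# Hypothetical data for five windows (genetic map positions in cM).
window_cm = [10.0, 10.5, 11.0, 11.5, 12.0]
high = [0.55, 0.62, 0.70, 0.66, 0.58]   # high-strain ancestry in the high pool
low = [0.47, 0.40, 0.36, 0.41, 0.45]    # high-strain ancestry in the low pool

ad = ancestry_difference(high, low)
distance = peak_to_qtl_distance(ad, window_cm, qtl_cm=11.5)
print(f"peak-to-QTL distance: {distance} cM")
```

The same distance function would apply to ap under IM, since only the per-window statistic changes.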
The above approach allows a wider range of scenarios to be examined than would be computationally feasible under the full SIBSAM inference process. Beyond a tentative comparison of the genomic precision of BSA vs. IM, an important goal here is to optimize critical experimental parameters to improve the outcomes of future trait mapping studies. As a point of reference, these simulations began with a “default” scenario in which 600 individuals were bred each generation, for 10 total generations, phenotypic selection retained the 20% most extreme individuals in each direction, and each window/locus had a sequencing depth of 300. Individual parameters were then varied, alone or in combination, and the accuracy of the ad or ap signal was examined.
First, performance was examined when tandemly varying the number of QTLs and the number of independent crosses. Within each simulation case, all QTLs were of equal magnitude and explained 5/6 of total phenotypic variance. Independent crosses were simulated under the assumption that all pairs of parental strains share a given QTL difference between them. When multiple crosses were analyzed together, ad or ap values were summed across crosses for each genomic window to test whether a more precise localization emerged from this joint signal. Three primary themes emerged from this analysis. First, BSA outperformed IM for any given combination of crosses and loci (Figure 4). Second, combining data from multiple crosses had a markedly positive effect on the accuracy of these ancestry signals. Third, performance showed a predictable decline for more/weaker QTLs. Even so, cases with multiple crosses managed relatively stronger performance for more polygenic scenarios (Figure 4), particularly in the case of BSA. For simplicity, the remaining simulations below will focus on a single cross replicate and a scenario with five QTLs.
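The joint analysis of independent crosses described above can be sketched as a per-window sum. This toy example assumes both crosses use identical windows and share a QTL in the third window (index 2). All values and function names are hypothetical.

```python
def combine_crosses(per_cross_values):
    """Sum a per-window statistic (ad or ap) across independent crosses."""
    n_windows = len(per_cross_values[0])
    if any(len(values) != n_windows for values in per_cross_values):
        raise ValueError("all crosses must use the same set of windows")
    return [sum(window) for window in zip(*per_cross_values)]


def peak_window(values):
    """Index of the window with the maximum statistic."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical ad values; the shared QTL is assumed to lie in window index 2.
cross_a = [0.10, 0.30, 0.28, 0.12, 0.05]   # noisy peak one window to the left
cross_b = [0.04, 0.11, 0.27, 0.31, 0.09]   # noisy peak one window to the right

joint = combine_crosses([cross_a, cross_b])
print(peak_window(cross_a), peak_window(cross_b), peak_window(joint))
```

Each cross alone places its peak one window away from the QTL, whereas the summed signal peaks at the shared location, which is the intuition behind the improved localization reported for multiple crosses.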
The number of generations before genotyping/sequencing was also varied. Strong performance improvement was observed by increasing the number of generations to 8 or 10, with further increases yielding ongoing but diminishing improvements (Figure 5A). Additional generations allow more recombination between parental genetic backgrounds, which should lead to sharper QTL peaks.
Past results indicate that selecting only the most extreme individuals is not optimal for BSA (Magwene et al. 2011). Concordantly, for the focal simulation scenario studied here, optimum bulk proportions were around 10-15% for each BSA pool, and 20% for the single IM pool (Figure 5B). These results appear to reflect a balance between enriching for causative genotypes (favoring fewer individuals) and minimizing the effects of random sampling variance (favoring more individuals). Thus, both BSA and IM studies may benefit from selecting substantial numbers of individuals, which should help to maximize the diversity of recombination breakpoints represented in the final data.
Related to the issue of sampling variance are parameters such as the number of individuals present in each generation and the number of genotypes sampled in the data (e.g. sequencing depth). When simulations jointly scaled up the number of individuals present in each generation, the number sampled for sequencing, and the sequencing depth, performance improved considerably (Figure 5C). The number of individuals sampled in the final generation made a particular difference, at least if depth was scaled up linearly (Figure S1). Increasing sequence depth consistently led to better performance (via a reduction in sampling variance), although with some diminishing returns (Figure 6).
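The diminishing returns from added depth can be illustrated with a standard two-stage binomial approximation, which is not the simulation model used in this study. The sketch assumes that a pool of diploid individuals contributes equally to the sequenced DNA, that the locus is autosomal, and that reads are drawn with replacement. Under those assumptions the variance of a read-based ancestry frequency is p(1-p)[1/m + (1-1/m)/d] for m sampled chromosomes and depth d. The function name and parameter values are hypothetical, loosely following the default scenario (120 individuals per pool).

```python
def pool_frequency_variance(p, n_chromosomes, depth):
    """Variance of a read-based ancestry frequency under two-stage binomial sampling.

    Stage 1 draws n_chromosomes from a population with ancestry frequency p;
    stage 2 draws `depth` reads (with replacement) from those chromosomes.
    """
    individual_term = 1.0 / n_chromosomes
    read_term = (1.0 - individual_term) / depth
    return p * (1.0 - p) * (individual_term + read_term)


p = 0.5                      # assumed ancestry frequency in the pool's source
n_chromosomes = 2 * 120      # 120 diploid individuals (assumption: autosomal locus)
depths = [75, 150, 300, 600, 1200]

variances = [pool_frequency_variance(p, n_chromosomes, d) for d in depths]
floor = p * (1.0 - p) / n_chromosomes   # limit as depth grows without bound

for depth, variance in zip(depths, variances):
    print(f"depth {depth:>5}: variance {variance:.5f} (floor {floor:.5f})")
```

Each doubling of depth removes a smaller amount of variance, and the variance never drops below the floor set by the number of sampled chromosomes, which is consistent with the importance of the number of individuals in each pool.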
Simulations also considered the interaction between selection proportion and population size. The optimal selection proportion (s) tends to scale inversely with population size (N). For BSA population sizes between 100 and 2,400, there was a relative stability in the optimal number of sampled individuals for sequencing (Ns), with this quantity ranging only from 35 to 60 (Table S1). In line with the findings of Magwene et al. (2011), this result suggests that reducing sampling variance is of primary importance, whereas enriching for the most phenotypically extreme individuals is a secondary priority.

Simulation testing of the SIBSAM pipeline
As elaborated in the Materials and Methods section, I developed a prototype method for Simulation-based Inference for BSA Mapping (SIBSAM). The flexibility of this simulation-driven pipeline allows a range of inferences, including for challenging cases in which two or more QTLs are part of the same complex peak (Figure 2). The goals of SIBSAM include assessing the significance of peaks, and estimating the strength and genomic confidence interval of significant QTL. The performance of SIBSAM was assessed via a series of test simulations with one or more QTLs. While a vast range of QTL and experimental scenarios could potentially be examined, I focus here on parameters relevant to ongoing empirical work in Drosophila (Bastide et al. 2016). The BSA experimental design simulated here ran for 16 generations with 1,200 individuals in each generation; 600 females were phenotyped in the last generation, 10% pools were selected, and 1,000 informative sequence reads were sampled for each genomic window.
For the above scenario, SIBSAM’s QTL detection power went from weak for a QTL explaining 10% of the experimental population’s phenotypic variance (with the remainder due to random environmental or measurement variance) to strong for a 20% QTL, with intermediate power for a 15% QTL (Figure 7). As illustrated by the exploratory simulations above, the performance of QTL mapping is likely to be improved by increasing the number of generations, the population size, sequencing depth, and/or the number of independent crosses.
The estimation of QTL strength for significant peaks was quite accurate for intermediate strength QTL (15% to 33%) when the remaining phenotypic variance was random and normally distributed (Figure 7). However, in other scenarios the strength estimate became upwardly biased. For a weaker QTL (e.g. 10% in this example), there appears to be a “detection bias” in which only the test replicates giving the tallest peaks were deemed significant, and since these peaks are unusually high for an s = 10% QTL, their strength was typically overestimated. If strength estimates for non-significant peaks were included, there was no directional bias. The highest QTL strength (50%) also showed upward bias, which may reflect a “saturation effect” of the ad statistic. Here, peak heights were very close to 1 (individuals were well-sorted into the extreme pools based on QTL genotype), which is the same outcome produced by a QTL with s > 50%. Upward strength bias was also observed if the remaining phenotypic variance was produced by other strong QTLs, rather than normally distributed random variance. If a 20% QTL was accompanied by an unlinked 80% QTL (with no environmental/measurement variance), the median estimate of s was 24.2%. If a 20% QTL was accompanied by four unlinked QTLs of equal strength, the median estimate of s was 31.4% (although power increased from 94% to 100% for both of these cases). In light of the recurrent bias in effect size estimation, the reported quantities are best viewed as rough estimates of QTL strength. Future methodological studies may explore alternative approaches to the estimation of QTL strength in a simulation framework.
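The “detection bias” described above can be reproduced in a toy setting that is independent of SIBSAM. The sketch assumes that observed peak heights scatter normally around a true expected height and that only peaks above a fixed significance threshold are retained. The true height, noise level, threshold, and function name are all hypothetical.

```python
import random
from statistics import mean


def simulate_peak_heights(true_height, noise_sd, n_replicates, seed=1):
    """Draw noisy peak heights around a true expected height (toy model)."""
    rng = random.Random(seed)
    return [rng.gauss(true_height, noise_sd) for _ in range(n_replicates)]


true_height = 0.30   # hypothetical expected peak height for a weak QTL
threshold = 0.40     # hypothetical significance cutoff from null simulations

heights = simulate_peak_heights(true_height, noise_sd=0.10, n_replicates=10000)
significant = [h for h in heights if h > threshold]

mean_all = mean(heights)
mean_significant = mean(significant)
power = len(significant) / len(heights)

print(f"power: {power:.2f}")
print(f"mean height, all replicates: {mean_all:.3f}")
print(f"mean height, significant only: {mean_significant:.3f}")
```

Averaging over all replicates recovers the true height, while averaging over significant replicates alone must exceed the threshold and therefore overstates it, mirroring the pattern reported for the 10% QTL.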
Other aspects of SIBSAM inference performed largely as expected on the simulated data. For significant QTL, only around 5% had a false positive secondary peak (in line with null expectations; Figure S2). For QTL strengths with adequate power, approximately the predicted proportion of loci fell within the provided confidence intervals for QTL strength and genomic position (Figure S2), with performance only declining for the weaker s = 10% case that was rarely detected for this scenario.
Detection power was also examined for cases involving two linked QTLs (of strength 15% and/or 30%) separated by various distances (2.5 cM, 5 cM, 10 cM, 25 cM). For QTL of equal strength, the 25 cM linkage had no adverse effect on QTL detection. Power was actually slightly higher in the case of 15% QTL separated by 25 cM (relative to the unlinked case), even though 55% of these test replicates had one of the QTLs as a secondary peak. Power to detect a second peak fell substantially as the distance between QTL decreased to 10 cM and 5 cM (Figure 8). In the case where one QTL had s = 30% and the other had s = 15%, power remained high for the stronger QTL at all distances, but was low for the weaker QTL at 10 cM or closer (Figure 8).

DISCUSSION
Mapping the genetic architecture of phenotypic trait differences remains a challenging but critical problem in the fields of genetics and evolutionary biology. Above, I have compared the behavior of bulk segregant analysis and introgression mapping, while assessing the experimental parameters that modulate their outcomes. I then offered a new simulation-based approach to BSA inference, geared toward systems like Drosophila in which hundreds or thousands (but not millions) of individuals can be examined, and in which BSA QTL signals may sometimes overlap each other.
A general principle of QTL mapping is that performance is enhanced by sampling a diverse range of recombinant genotypes. Thus, simulation results suggest that BSA and IM should both be more successful when more generations of interbreeding occur, when larger numbers of individuals are present in the mapping population, and when greater sequencing effort is employed. The importance of sampling at least a few tens of individuals in phenotypically-selected pools is clear as well. These results suggest that the typical method of introgression mapping, in which small numbers of individuals are phenotypically selected every generation or two, is not advisable for mapping oligogenic traits (and is not ideal for monogenic traits either; Figure 4). Instead, if IM is used, larger numbers of phenotyped and retained individuals are desirable. However, based on the criteria employed here, BSA gave a more precise mapping signal than IM for every combination of experimental and QTL parameters examined. This finding may again relate to the principle of maintaining a diversity of recombination breakpoints, which is maximized by avoiding IM’s population bottlenecks associated with phenotypic selection during the intermediate generations of interbreeding.
The tradeoffs among BSA, IM, and other mapping approaches are complex and merit further attention. A compelling advantage of BSA is that the same experimental population may be used to map multiple trait differences (e.g. once the adults have already reproduced, select for one trait in generation 12, another trait in generation 13, etc.). For the same set of experimental parameters as defined here, BSA actually requires less effort than IM during the experiment, since phenotyping must be performed only in the last generation. BSA does require the sequencing of two phenotypic pools (high and low), whereas IM requires just one phenotypic pool to be sequenced (note however that doubling IM depth does not allow it to match BSA’s performance; Figure S1). Because both parental strains’ genotypes are present across the genomes of mapping population individuals, BSA may be more influenced by the complexities of epistatic interactions.
In the course of a BSA experiment, parental strain ancestry frequencies in the mapping population could deviate from 50%. The effects of genetic drift should be modest when the population size is vastly greater than the number of generations of interbreeding, and SIBSAM allows for drift’s occurrence. Although not modeled here, inadvertent laboratory selection could also shift mapping population ancestry frequencies. In general, such ancestry shifts should not lead to false positive QTL, because both phenotypic pools will be equally affected. If ancestry frequencies become extreme, the response of ad to a QTL could be dampened, leading to reduced power and underestimation of QTL strength. Hence, it may be worthwhile to collect BSA sequence data before an excessive number of generations have elapsed. Genomic regions found to show ancestry shifts could be interesting in their own right, since they may contain drivers of laboratory adaptation, differential mating success, or segregation distortion.
It is more challenging to compare BSA or IM against alternative mapping methods such as those involving individual genotyping (e.g. Andolfatto et al. 2011) or the generation of recombinant inbred lines (e.g. King et al. 2012). However, it may be worth evaluating the benefits of combining elements of BSA with these approaches. Following multiple generations in a large mapping population, offspring with extreme phenotypes could be individually genotyped. Or, the mapping population could be used to found a large number of recombinant inbred lines (RILs), with BSA and RIL mapping potentially integrated.
The mapping approach and method described here require a moderate investment of researcher time and funding, and deliver a range of QTL inferences. While useful in its current form, SIBSAM may also motivate future simulation-based mapping methods. Although motivated by Drosophila QTL mapping, this approach may prove broadly useful for non-model insects and other smaller organisms with short generation times.

ACKNOWLEDGEMENTS
I thank Karl Broman for helpful discussions regarding SIBSAM and David Begun for suggesting BSA to me several years ago. The UW-Madison Center for High Throughput Computing provided computational assistance and resources for this work. This project was funded by NIH grant R01 GM111797.

LITERATURE CITED
Andolfatto, P., D. Davison, D. Erezyilmaz, T. T. Hu, J. Mast, et al., 2011 Multiplexed shotgun genotyping for rapid and efficient genetic mapping. Genome Res. 21: 610-617.
Baird, N. A., P. D. Etter, T. S. Atwood, M. C. Currey, A. L. Shiver, et al., 2008 Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS ONE 3: e3376.
Bastide, H., J. D. Lange, J. B. Lack, A. Yassin, and J. E. Pool, 2016 Oligogenic adaptation, soft sweeps, and parallel melanic evolution in Drosophila melanogaster. Accompanying manuscript.
Comeron, J., R. Ratnappan, and S. Bailin, 2012 The many landscapes of recombination in Drosophila melanogaster. PLoS Genet. 8: e1002905.
Earley, E. J., and C. D. Jones, 2011 Next-generation mapping of complex traits with phenotype-based selection and introgression. Genetics 189: 1203-1209.
Edwards, M. D., and D. K. Gifford, 2012 High-resolution genetic mapping with pooled sequencing. BMC Bioinformatics 13: S8.
Ehrenreich, I. M., J. Bloom, N. Torabi, X. Wang, Y. Jia, et al., 2012 Genetic architecture of highly complex chemical resistance traits across four yeast strains. PLoS Genet. 8: e1002570.
Haase, N. J., T. Beissinger, C. N. Hirsch, B. Vaillancourt, S. Deshpande, et al., 2015 Shared genomic regions between derivatives of a large segregating population of maize identified using bulked segregant analysis sequencing and traditional linkage analysis. G3 (Bethesda) 5: 1593-1602.
King, E. G., S. J. Macdonald, and A. D. Long, 2012 Properties and power of the Drosophila Synthetic Population Resource for the routine dissection of complex traits. Genetics 191: 935-949.
Lack, J. B., C. M. Cardeno, M. W. Crepeau, W. Taylor, R. B. Corbett-Detig, et al., 2015 The Drosophila genome nexus: a population genomic resource of 623 Drosophila melanogaster genomes, including 197 from a single ancestral range population. Genetics 199: 1229-1241.
Lai, C. Q., J. Leips, W. Zou, J. F. Roberts, K. R. Wollenberg, et al., 2007 Speed-mapping quantitative trait loci using microarrays. Nat. Methods 4: 839-841.
Magwene, P. M., J. H. Willis, and J. K. Kelly, 2011 The statistics of bulk segregant analysis using next generation sequencing. PLoS Comput. Biol. 7: e1002255.
Michelmore, R. W., I. Paran, and R. V. Kesseli, 1991 Identification of markers linked to disease-resistance genes by bulked segregant analysis: a rapid method to detect markers in specific genomic regions by using segregating populations. Proc. Natl. Acad. Sci. USA 88: 9828-9832.
Parts, L., F. A. Cubillos, J. Warringer, K. Jain, F. Salinas, et al., 2011 Revealing the genetic structure of a trait by sequencing a population under selection. Genome Res. 21: 1131-1138.
Van Leeuwen, T., P. Demaeght, E. J. Osborne, W. Dermauw, S. Gohlke, et al., 2012 Population bulk segregant mapping uncovers resistance mutations and the mode of action of a chitin synthesis inhibitor in arthropods. Proc. Natl. Acad. Sci. USA 109: 4407-4412.
Wicks, S. R., R. T. Yeh, W. R. Gish, R. H. Waterston, and R. H. Plasterk, 2001 Rapid gene mapping in Caenorhabditis elegans using a high density polymorphism map. Nat. Genet. 28: 160-164.

Figure 1. The investigated experimental designs for BSA and IM are illustrated. In BSA, offspring of reciprocal parental strain crosses are combined and allowed to breed without trait selection for a number of generations. Phenotyping occurs only in the final generation, and pools of individuals with the highest and lowest trait values are each sequenced. The IM framework investigated here involves trait selection and parental strain backcrossing every second generation (allowing recessive genotypes to be expressed). In the last generation, one phenotypic extreme is sequenced and compared against the backcross parental strain genome.
[Figure 1 schematic not reproduced. Panel "BULK SEGREGANT ANALYSIS (BSA)": stages Parental (High Strain ♀♀ × Low Strain ♂♂; Low Strain ♀♀ × High Strain ♂♂), F1, Several Generations (interbreeding without trait selection), Last Generation (high trait ♀♀ sequenced; mid trait ♀♀ discarded; low trait ♀♀ sequenced). Panel "INTROGRESSION MAPPING (IM)": stages Parental (High Strain ♀♀ × Low Strain ♂♂), Odd Generations, Even Generations, Last Even Generation; labels include interbreeding without selection, selection for high trait, High-selected ♀♀ × Low Strain ♂♂ (backcross), high trait ♀♀ sequenced, all others discarded.]
Figure 2. Definitions of primary and secondary peaks, along with secondary deviation, used by SIBSAM are illustrated here. Among a group of contiguous windows with smoothed ad values above zero, the primary peak is defined by the window with the highest value. Secondary peaks represent other local maxima, and their significance is judged based on secondary deviation (v). Secondary deviation is determined by the difference in ad between the secondary peak’s maximum value and the minimum value between that peak and the primary peak (or a taller secondary peak, whichever minimum is greater).
[Figure 2 plot not reproduced. Labeled elements: Primary Peak, Secondary Peak 1, Secondary Peak 2, and secondary deviations v1 and v2.]
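The definition of secondary deviation in the Figure 2 caption can be written out as a minimal sketch. It assumes the input is already one group of contiguous windows with smoothed ad values above zero, and it treats any window taller than both neighbors as a local maximum. The function names and example values are hypothetical, and the SIBSAM scripts may identify peaks differently.

```python
def local_maxima(values):
    """Indices of windows that are strictly higher than both neighbors."""
    peaks = []
    for i, value in enumerate(values):
        left = values[i - 1] if i > 0 else float("-inf")
        right = values[i + 1] if i < len(values) - 1 else float("-inf")
        if value > left and value > right:
            peaks.append(i)
    return peaks


def secondary_deviations(values):
    """Return the primary peak index and a dict of secondary peak index -> v."""
    peaks = local_maxima(values)
    primary = max(peaks, key=lambda i: values[i])
    deviations = {}
    for i in peaks:
        if i == primary:
            continue
        # Reference peaks: the primary peak plus any taller secondary peak.
        references = [primary] + [
            j for j in peaks if j not in (i, primary) and values[j] > values[i]
        ]
        # Lowest ad value in the valley between peak i and each reference peak.
        valley_minima = [
            min(values[min(i, j):max(i, j) + 1]) for j in references
        ]
        # Use whichever valley minimum is greater, as in the caption.
        deviations[i] = values[i] - max(valley_minima)
    return primary, deviations


# Hypothetical smoothed ad values for nine contiguous windows.
smoothed_ad = [0.05, 0.20, 0.12, 0.40, 0.08, 0.30, 0.10, 0.18, 0.06]
primary, v = secondary_deviations(smoothed_ad)
print(primary, {i: round(dev, 2) for i, dev in v.items()})
```

In this example the rightmost secondary peak (window index 7) is judged against the valley separating it from the taller secondary peak at index 5, not against the deeper valley next to the primary peak, because the former minimum is greater.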
Figure 3. A flowchart illustrating the SIBSAM analysis pipeline is shown. A single input file contains physical and genetic map positions of window boundaries for all chromosomes, along with ancestry difference values and informative depth (the number of reads with information about parental strain ancestry) for each window. Null simulations with no true QTLs are used to identify significant primary peaks in the empirical data. Simulations with one QTL (matching a primary peak location) are then used to estimate confidence intervals for primary peak effect size and genomic location, while also identifying significant secondary peaks. For any primary peak with significant secondary peaks, cluster simulations are conducted with QTLs at each peak’s location, in order to generate final confidence intervals for effect size and genomic location. These analyses are summarized into a single output file containing all relevant inferences for each significant peak.
[Figure 3 flowchart not reproduced. Elements shown: Input File (Window Data, Depth, Ancestry); Null Simulations (sibsam_null.pl, sibsam_null_analysis.pl; Null Sim. Output File(s), Primary Peak P Value File); One Locus Simulations (sibsam_1locus.pl, sibsam_1locus_analysis.pl; 1 Locus Sim. Output File(s), Primary Peak Results File, Secondary Peak File, Secondary Peak P Value File); Cluster Simulations (sibsam_cluster_prepare.pl, sibsam_cluster.pl, sibsam_cluster_analysis.pl; Cluster Peak Data File(s), Cluster Sim. Output File(s), Cluster Output File(s)); sibsam_summarize.pl (Final Output File).]
Figure 4. Results are shown for exploratory BSA and IM simulations with varying numbers of QTLs and numbers of jointly-analyzed independent crosses. As a proxy for method performance, the median centiMorgan distance between the true QTL and the statistic maximum (of ad for BSA or ap for IM) is shown. The null expectation for a randomly located peak within a QTL’s analysis window is also shown (gray). These results indicate: (1) the increasing challenge of more polygenic scenarios for all approaches, (2) a general advantage of BSA over IM, and (3) the utility of combining data from independent crosses that all share a given QTL in common.
[Figure 4 plot not reproduced. Values printed in parentheses on the plot: 26.8, 24.3, 14.6, 7.3.]
Figure 5. The results of exploratory BSA and IM simulations are shown in which one or more experimental variables were manipulated. (A) Increasing the total number of generations in the experiment reduces the median centiMorgan distance between the true QTLs and the observed peak. (B) A broad optimal range of selection proportion exists for the focal BSA and IM scenarios. (C) Scaling up the experimental population size (and hence the number of phenotyped individuals), along with the sequencing depth, leads to improved statistical performance.
[Figure 5 plots not reproduced. All panels plot cM from Peak to QTL (y axis) for IM and BSA. Panel A x axis: Generations (0 to 20). Panel B x axis: Selection Proportion (0 to 0.5). Panel C x axis: Pop. Size, # Phenotyped, & (Depth x 2), with values 75, 150, 300, 600, 1200.]
Figure 6. Outcomes of exploratory BSA and IM simulations with variable sequencing depth are shown. To more clearly illustrate the influence of depth on sampling variance, depth is plotted in terms of the average number of reads for each individual in a phenotypically selected pool. From a group of 600 phenotyped individuals, results for a series of selection proportions are illustrated. Results illustrate the advantage of increased sequencing depth, with some diminishing returns.
[Figure 6 plots not reproduced. Two panels (BSA and IM), each plotting Distance from peak to QTL (cM) against Depth per sampled individual (0.625, 1.25, 2.5, 5, 10) for selection proportions of 5% (n=30), 10% (n=60), 20% (n=120), 35% (n=210), and 50% (n=300). Values printed in parentheses on the plots: 11.9, 3.5.]
Figure 7. Results of one locus test simulations assessing the performance of the SIBSAM pipeline are shown for QTLs on the autosomes (red diamond) and X chromosome (blue X). As shown in the top panel, the scenario investigated here (involving a population of 1,200 individuals with 600 phenotyped after 16 generations and 10% retained in each phenotypic pool) had intermediate power for a QTL explaining 15% of phenotypic variance in the experimental population, with low/high power below/above that mark. As illustrated in the bottom panel and discussed in the text, some upward bias in effect size estimation was observed for the weakest and strongest QTLs examined.
[Figure 7 plots not reproduced. Top panel: Power to Detect (0 to 1) against QTL Effect Size (0 to 0.5). Bottom panel: Estimated Effect Size (0 to 1) against QTL Effect Size (0 to 0.5).]
Figure 8. The detection power of SIBSAM in test simulations with two linked QTLs is illustrated. The top panels illustrate the power to detect the second of two linked QTLs of equal magnitude, conditional on detecting the first. The bottom panel illustrates the power to detect either the weaker or the stronger of two linked QTLs of unequal sizes.
[Figure 8 plots not reproduced. Top panels, "Linked QTLs of Equal Strength": Power to Detect Second QTL (0 to 1) for Strength 30% and Strength 15%, at distances Unlinked, 25 cM, 10 cM, 5 cM, and 2.5 cM. Bottom panel, "Linked QTLs of Unequal Strength": Power to Detect Each QTL (0 to 1) for Strength 15% and Strength 30% at the same distances.]