Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Waste Not, Want Not
SusanHolmeshttp://www-stat.stanford.edu/˜susan/
Bio-X andStatistics, StanfordUniversity ABabcdfghiejklAugust9, 2015
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Challengesforthoseofusworkingfromthegroundup
Heteroscedasticity. InformationLeaks.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Part I
Heteroscedasticity: Mixturesand how to Normalize them
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Some real data (Caporoso et al, 2011)
> GlobalPatterns
phyloseq-class experiment-level object
otu_table() OTU Table: [ 19216 taxa and 26 samples ]
sample_data()Sample Data: [ 26 samples by 7 sample variables ]
tax_table()Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree:[ 19216 tips and 19215 internal nodes ]
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
otu_table(GlobalPatterns)[45:55,1:10]
OTU Table: [11 taxa and 10 samples]
taxa are rows
CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr M31Tong M11Tong
573586 0 0 0 0 0 0 0 0 0 0
568724 0 0 0 0 0 0 0 0 0 0
175045 0 0 0 0 1 0 0 2 0 0
552540 0 0 0 0 0 0 0 0 0 0
546313 72 153 11232 0 1 1 0 1 1 0
548602 0 0 16 0 0 0 0 0 0 0
564501 0 0 3 0 0 0 0 0 0 0
47778 1 14 207 0 0 0 5 0 0 0
54107 2 87 746 0 0 0 3 35 0 0
25116 1 4 169 0 0 0 0 1 0 0
71074 93 341 11788 1 0 23 48 58 2 0
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
> sample_sums(GlobalPatterns)
CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr
864077 1135457 697509 1543451 2076476 718943 433894 186297
.....
NP3 NP5 TRRsed1 TRRsed2 TRRsed3 TS28 TS29 Even1
1478965 1652754 58688 493126 279704 937466 1211071 1216137
> summary(sample_sums(GlobalPatterns))
Min. 1st Qu. Median Mean 3rd Qu. Max.
58690 567100 1107000 1085000 1527000 2357000
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Howtodealwithdifferentnumbersofreads?
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Verypopular: qiime
2,300citations.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
CurrentMethod: Rarefying
Adhoc librarysizenormalizationbyrandomsubsamplingwithoutreplacement.1. Selectaminimumlibrarysize, NL,min. Thishasalsobeen
calledthe rarefactionlevel.2. Discardlibraries(microbiomesamples)thathavefewer
readsthan NL,min.3. Subsampletheremaininglibrarieswithoutreplacement
suchthattheyallhavesize NL,min.Often NL,min ischosentobeequaltothesizeofthesmallestlibrarythatisnotconsidered defective, andtheprocessofidentifyingdefectivesamplescomeswithariskofsubjectivityandbias. Inmanycasesresearchershavealsofailedtorepeattherandomsubsamplingstep(3)orrecordthepseudorandomnumbergenerationseed/process---bothofwhichareessentialforreproducibility.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
ReductionofDatatoProportions
Manysoftwareprogramsautomaticallyreducethedatatorelativeproportions, losingtheinformationaboutlibrarysizesorreadcounts.Thismakescomparisonsverydifficult.
StatisticalFormulation: Whenmakinga(testing)decision,reducingresultsfromaBinomialdistributionintoaproportiondoesnotgivean admissible procedure.Definition:Anadmissibleruleisanoptimalruleformakingadecisioninthesensethatthereisnootherrulethatisalwaysbetterthanit.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Howtocompressthedata?
.
LOST INFORMATION
Original Data
FinalResults
...withoutlosingtoomuchinformation?
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Theproportionisnota sufficient statisticfortheBinomial.
A statisticT(X) iscalledsufficientfor θ ifitcontainsalltheinformationinX about θ.Standardstatisticalviewpoint:Thejointprobabilitydistributionofthedataconditionalonthevalueofasufficientstatisticforaparameter, doesnotdependonthatparameter: Pθ(X|T(X) = T) doesnotdependon θ. Wiki
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
EquivalentDefinitions
MutualInformation:
I(X,Y) =∑x∈X
∑y∈Y
P(x, y)log P(x, y)P(x)P(y) = K(P(x, y),P(x)P(y))
A functionofthedata T(X) isasufficientstatisticforthedistributionif
I(θ,X) = I(θ,T(X))
foralldistributionson θ.Note:ForaBayesian, nomatterwhatprioroneuses, oneonlyhastoconsiderthesufficientstatisticformakinginference,becausetheposteriordistributiongiven T = T(x) isthesameastheposteriorgiventhedata X = x.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Aimofthestudies: DifferentialAbundance
Likedifferentiallyexpressedgenes, aspecies/OTU isconsidereddifferentiallyabundantifitsmeanproportionissignificantlydifferentbetweentwoormoresampleclassesintheexperimentaldesign.OptimalityCriteria:SensititivityorPower TruePositiveRate.Specificity TrueNegativeRate.
Wehavetocorrectformanysourcesoferror(blocking,modeling, control, etc..)
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
RarefactionandReductiontoProportionsareInadmissible
Thefollowingisaminimalexampletoexplainwhyrarefyingisstatisticallyinadmissible, especiallywithregardstovariancestabilization.Supposewewanttocomparetwodifferentsamples, called Aand B,comprisedof100and1000reads, respectively. Inthesehypotheticalcommunitiesonlytwotypesofmicrobeshavebeenobserved, OTU1 and OTU2
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
AccordingtoTable 1, Left.
Table: A minimalexampleoftheeffectofrarefyingonpower.
OriginalAbundanceA B
OTU1 62 500OTU2 38 500Total 100 1000
RarefiedAbundanceA B
OTU1 62 50OTU2 38 50
100 100StandardTestsforDifference
P-value χ2 Prop FisherOriginal 0.0290 0.0290 0.0272Rarefied 0.1171 0.1171 0.1169
Hypotheticalabundancedatainitsoriginal(Top-Left)andrarefied(Top-Right)form, withcorrespondingformaltestresultsfordifferentiation(Bottom).
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Formallycomparingthetwoproportionsaccordingtoastandardtestisdoneeitherusinga χ2 test(equivalenttoatwosampleproportiontesthere)oraFisherexacttest. Thisrequiresknowledgeofthenumberoftrials.Byrarefying(Table 1, top-right)sothatbothsampleshavethesamenumberofcounts, wearenolongerabletodifferentiatebetweenthem.Thislossofpoweriscompletelyattributabletoreducingthesizeof B byafactorof10, whichalsoincreasestheconfidenceintervalscorrespondingtoeachproportionsuchthattheyarenolongerdistinguishablefromthosein A,eventhoughtheyaredistinguishableintheoriginaldata.Thevarianceoftheproportion'sestimate p ismultipliedby10whenthetotalcountisdividedby10.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Equalizationofvariances
Inthisbinomialexamplethevarianceoftheproportionestimateis Var(Xn ) =
pqn =
qnE(
Xn ), afunctionofthemean.
Thisisacommonoccurrenceandonethatistraditionallydealtwithinstatisticsbyapplyingvariance-stabilizingtransformations.However, inordertofindtherighttransformation, weneedagoodmodelfortheerror.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
VarianceStabilization
Prefertodealwitherrorsacrosssampleswhichareindependentandidenticallydistributed.Inparticularhomoscedasticity(equalvariances)acrossallthenoiselevels.Thisisnotthecasewhenwehaveunequalsamplesizesandvariationsintheaccuracyacrossinstruments.A standardwayofdealingwithheteroscedasticnoiseistotrytodecomposethesourcesofheterogeneityandapplytransformationsthatmakethenoisevariancealmostconstant.Thesearecalled variancestabilizingtransformations.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
TakeforinstancedifferentPoissonvariableswithmean µi.Theirvariancesarealldifferentifthe µi aredifferent.However, ifthesquareroottransformationisappliedtoeachofthevariables, thenthetransformedvariableswillhaveapproximatelyconstantvariance.Actuallyifwetakethetransformation x −→ 2
√x weobtaina
varianceapproximatelyequalto1..
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
var = µ+ cµ2
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
...
WolfgangHuber . .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Modelingreadcounts
Iftechnicalreplicateshavesamenumberofreads: sj.Poissonvariationwithmean µ = sjui.Taxa i havinganincidenceproportion ui.Numberofreadsforthesample j andtaxa i wouldbe
Kij ∼ Poisson (sjui)
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
NegativeBinomialwiththetwoparameters: themean m andr = 1−p
p m, thentheprobabilityis:
X ∼ NB(m; r)
P(X = k) =
(k+ r− 1
k
)(r
r+ m
)r( mr+ m
)k
=Γ(k+ r)k!Γ(r)
(r
r+ m
)r( mr+ m
)k
Thevarianceis Var(X) = m(m+r)r = m+ m2
r , wewillalsouseϕ = 1
r andcallthistheoverdispersionparameter, givingVar(X) = m+ ϕm2. When ϕ = 0 thedistributionof X willbePoisson(m). Thisisthe(mean=m,overdispersion=ϕ)parametrizationwewillusefromnowon.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
ModelingCounts
Forbiologicalreplicateswithinthesamegroup--suchastreatmentorcontrolgroupsorthesameenvironments--theproportions ui willbevariablebetweensamples.Callthetwoparameters ri and ϕ =
pi1−pi .
Sothat Uij theproportionoftaxa i insample j isdistributedaccordingtoGamma(ri, ϕ =
pi1−pi ).
Kij haveaPoisson-GammamixtureofdifferentPoissonvariableseachwithitsownparametergeneratedfromthePoisson.ThisgivestheNegativeBinomialwithparameters (m = uisj)and ϕi asasatisfactorymodelofthevariability.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
DifferentConditionsSamplesbelongtodifferentconditionssuchastreatmentandcontrolordifferentenvironments.Estimatethevaluesoftheparametersseparatelyforeachofthedifferentbiologicalreplicateconditions/classes.Usetheindex c forthedifferentconditions, wethenhavethecountsforthetaxa i andsample j incondition c havingaNegativeBinomialdistributionwith mc = uicsj and ϕic sothatthevarianceiswritten
uicsj + ϕics2ju2ic (1)Estimatetheparameters uic and ϕic fromthedataforeachOTU andsamplecondition.Theendresultprovidesavariancestabilizingtransformationofthedatathatallowsastatisticallyefficientcomparisonsbetweenconditions.Thisapplicationofahierarchicalmixturemodelisverysimilartotherandomeffectsmodelsusedinthecontextofanalysisofvariance. . .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
UsingRNA-seqimplementation: DESeq2
McMurdieandHolmes(2014)``WasteNot, WantNot: Whyrarefyingmicrobiomedataisinadmissible'', PLOSComputationalBiology, Methods.ExamplesofOverdispersioninMicrobiomeData.Common-ScaleVarianceversusMeanforMicrobiomeData.EachpointineachpanelrepresentsadifferentOTU'smean/varianceestimateforabiologicalreplicateandstudy.Thedatainthisfigurecomefromthe GlobalPatternssurveyandthe Long-TermDietaryPatterns study(Right)Varianceversusmeanabundanceforrarefiedcounts.(Left)Common-scalevariancesandcommon-scalemeans,estimatedaccordingtotheDESeq2package.Thedashedgraylinedenotesthe σ2 = µ case(Poisson;ϕ = 0). ThecyancurvedenotesthefittedvarianceestimateusingDESeq.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Code
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Common Scale Rarefied
1e+02
1e+05
1e+08
1e+02
1e+05
1e+08
1e+02
1e+05
1e+08
DietPatternsGlobalPatterns
GlobalPatterns
2004 HighFatfreshwater
marine
10 1000 10 1000Mean
Varia
nce
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
38 10 6 12 15 14 26 913 13 0 11 4 3 13 715 10 1 13 9 8 24 647 21 7 39 23 17 42 2398 48 11 70 49 36 108 3625 12 3 20 14 8 23 13
380 100 60 120 15 14 26 913 13 0 11 4 3 13 715 10 1 13 9 8 24 6470 210 70 390 23 17 42 2398 48 11 70 49 36 108 3625 12 3 20 14 8 23 13
samples
OTU
stest null
OTU
s
5028368918047
34 1 154 20 429 1 61 85 3
161 6 1342 2 3
191163173123010
574851357899307
191163173000
000
357899307
15 15 161 0 0 0 0
87 4 72 0 0 0 0
10 148 15 0 0 0 0
0 0 0 82 244 7 24
0 0 0 354 452 92 1
0 0 0 14 9 33 251
samplesO
TUs
Ocean Feces
OTU
s
Ocean Feces
1. Sum rows. A multinomial for each sample class.
2. Deterministic mixing. Mix multinomials in precise proportion.
Ocean Feces
Microbiome Clustering Simulationsamples
OTU
s
Environment
OTU
s
1. Sum rows for each environment.
2. Sample from multinomial.
Differential Abundance Simulation
3. Multiply randomly
selected OTUs within test class by
effect size.
4. Perform differential abundance tests,
evaluate performance.
158 56 214 39 47 4 11 11 5 3
124 54 212 29 40 3 10 7 8 6
129 46 216 33 42 4 13 7 3 6
11 3 14 3 1 39 95 63 29 37
19 7 34 7 0 88 237 137 73 86
9 1 15 1 2 29 84 51 14 29
OTU
s
samples
Simulated Ocean Simulated Feces
3. Sample from these multinomials.
4. Perform clustering, evaluate accuracy.
BA
Microbiome count data from the Global
Patterns dataset
Repeat for each effect size and media library size.
Repeat for each environ-ment, number of sam-ples, effect size, and median library size.
Amount added is library size / effect size
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
NormalizationsinSimulationForeachsimulatedexperimentweusedthefollowingnormalizationmethodspriortocalculatingsample-wisedistances.1. DESeqVS.VarianceStabilizationimplementedinthe
DESeqpackage.2. None. Countsnottransformed. Differencesintotal
librarysizecouldaffectthevaluesofsomedistancemetrics.
3. Proportion. Countsaredividedbytotallibrarysize.4. Rarefy. Rarefyingisperformedasdefinedinthe
introduction, using rarefy_even_depth implementedinthephyloseqpackage. with NL,min settothe15th-percentileoflibrarysizeswithineachsimulatedexperiment.
5. UQ-logFC.The Upper-QuartileLog-FoldChangenormalizationimplementedintheedgeR package, coupledwiththe top-MSD distance.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
DistancesinSimulationForeachofthepreviousnormalizationswecalculatedsample-wisedistance/dissimilaritymatricesusingthefollowingmethods, ifapplicable.1. Bray-Curtis. TheBray-Curtisdissimilarityfirstdefinedin
1957forforestecology.2. Euclidean. TheeuclideandistancetreatingeachOTU as
adimension.√∑n
i=1(Ki1 − Ki2)2, isthedistancebetweensamples 1 and 2,n thenumberofdistinctOTUs.
3. PoissonDist. Ourabbreviationof PoissonDistance, asample-wisedistanceimplementedinthePoiClaClupackage (Witten,2011).
4. top-MSD.ThemeansquareddifferenceoftopOTUs, asimplementedinedgeR.
5. UniFrac-u. TheUnweightedUniFracdistance (Lozupone,2005).
6. UniFrac-w. TheWeightedUniFracdistance (Lozupone,2007). . .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Inordertoconsistentlyevaluateperformanceinthisregard,wegeneratedmicrobiomecountsbysamplingfromtwodifferentmultinomialsthatwerebasedoneitherthe Oceanor Feces microbiomesofthe GlobalPatterns empiricaldataset.Anequalnumberofsimulatedmicrobiomesampleswasgeneratedfromeachmultinomial. The Ocean and FecessampleclasseshavenegligibleoverlappingOTUs.Mixingthembyadefinedproportionallowscontroloverthedifficultyoftheclusteringtaskfromtrivial(nomixing)toimpossible(bothmultinomialsevenlymixed).Clusteringwasperformedindependentlyforeachcombinationofsimulatedexperiment, normalizationmethod, anddistancemeasureusingpartitioningaroundmedoids(PAM).Theaccuracyisthefractionofsimulatedsamplescorrectlyclustered; worstpossibleaccuracyis50%ifallsamplesareclustered. (Rarefyingprocedureomitssamples, soitsaccuracycanbebelow50%)
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
ImprovementinPowerandFDR
Performanceofdifferentialabundancedetectionwithandwithoutrarefyingsummarizedby“AreaUndertheCurve”(AUC) metricofaReceiverOperatorCurve(ROC) (verticalaxis).Briefly, theAUC valuevariesfrom0.5(random)to1.0(perfect).Thehorizontalaxisindicatestheeffectsize, shownasthefactorappliedtoOTU countstosimulateadifferentialabundance.Eachcurvetracestherespectivenormalizationmethod’smeanperformanceofthatpanel, withaverticalbarindicatingastandarddeviationinperformanceacrossallreplicatesandmicrobiometemplates.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Theright-handsideofthepanelrowsindicatesthemedianlibrarysize, N,whilethedarknessoflineshadingindicatesthenumberofsamplespersimulatedexperiment.Colorshadeandshapeindicatethenormalizationmethod.DetectionamongmultipletestswasdefinedusingaFalseDiscoveryRate(Benjamini-Hochberg)significancethresholdof0.05.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
DESeq2 − nbinomWaldTest DESeq − nbinomTest edgeR − exactTest metagenomeSeq − fitZig two sided Welch t−test
0.6
0.8
1.0
0.6
0.8
1.0
N ~L
=2
00
0N ~
L=
50
00
0
AU
C
Number Samples per Class: 3 5 10 Normalization Method: Model/None Rarefied Proportion
0.6
0.8
1.0
N ~L
=5
00
00
Sp
eci
ficity
0.2
0.5
0.8
1.0
N ~L
=5
00
00
5 10 15 20 5 10 15 20 5 10 15 20 5 10 15 20 5 10 15 20Effect Size
Se
nsi
tivity
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
ImprovementsofDistancebasedclustering
Clusteringaccuracyinsimulatedtwo-classmixing.Clusteringaccuracy(withPAM:verticalaxis)followingdifferentnormalizationanddistancemethods.Pointsdenotethemeanvaluesofreplicates, withaverticalbarrepresentingonestandarddeviationaboveandbelow.Thehorizontalaxisistheeffectsize.EachmultinomialisderivedfromtwomicrobiomesthathavenegligibleoverlappingOTUs(FecalandOceanmicrobiomesintheGlobalPatternsdataset).Highervaluesofeffectsizeindicateaneasierclusteringtask.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Bray − Curtis Euclidean PoissonDist top − MSD UniFrac − u UniFrac − w
0.6
0.8
1.0
0.6
0.8
1.0
0.6
0.8
1.0
N ~L
=1
00
0N ~
L=
20
00
N ~L
=1
00
00
1.15 2.00 3.00 1.15 2.00 3.00 1.15 2.00 3.00 1.15 2.00 3.00 1.15 2.00 3.00 1.15 2.00 3.00
Effect Size
Accu
racy
Normalization Method: DESeqVS None Proportion Rarefy UQ−logFC
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
ExamplesusingPhyloseq:http://joey711.github.io/phyloseq-extensions/
DESeq2.html
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
BenefittingfromthetoolsandschoolsofStatisticians.......
Thankstothe R community:Chessel, Jombart, Dray, Thioulouse ade4 , WolfgangHuber,MichaelLovefor DESeq2, GordonSmythandhisteamforedgeR andEmmanuelParadisfor ape.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Collaborators:
DavidRelman AlfredSpormann ElizabethPurdomJustinSonnenburgandPersiDiaconis.
PostdoctoralFellows Paul(Joey)McMurdie, AlexAlekseyenko(NYU),BenCallahan.Students: MilingShen, DianaProctor, AldenTimme, KatieShelef, YanaHoy, JohnChakerian, JuliaFukuyama, KrisSankaran.Fundingfrom NIH-TR01, NIH/NIGMS R01, NSF-VIGRE andNSF-DMS.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
phyloseq
JoeyMcMurdie(joey711ongithub).AvailableinBioconductor.HowcanI (mystudents, mypostdocs...) learnmore?Google: wikiphyloseqdeseq2http://www-stat.stanford.edu/˜susan/
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .