44
Waste Not, Want Not Susan Holmes http://www-stat.stanford.edu/˜susan/ Bio-X and Statistics, Stanford University ABabcdfghiejkl August 9, 2015 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

Waste Not, Want Not

SusanHolmeshttp://www-stat.stanford.edu/˜susan/

Bio-X andStatistics, StanfordUniversity ABabcdfghiejklAugust9, 2015

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 2: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 3: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

Challengesforthoseofusworkingfromthegroundup

Heteroscedasticity. InformationLeaks.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 4: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

Part I

Heteroscedasticity: Mixturesand how to Normalize them

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 5: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

Some real data (Caporoso et al, 2011)

> GlobalPatterns

phyloseq-class experiment-level object

otu_table() OTU Table: [ 19216 taxa and 26 samples ]

sample_data()Sample Data: [ 26 samples by 7 sample variables ]

tax_table()Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]

phy_tree() Phylogenetic Tree:[ 19216 tips and 19215 internal nodes ]

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 6: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

otu_table(GlobalPatterns)[45:55,1:10]

OTU Table: [11 taxa and 10 samples]

taxa are rows

CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr M31Tong M11Tong

573586 0 0 0 0 0 0 0 0 0 0

568724 0 0 0 0 0 0 0 0 0 0

175045 0 0 0 0 1 0 0 2 0 0

552540 0 0 0 0 0 0 0 0 0 0

546313 72 153 11232 0 1 1 0 1 1 0

548602 0 0 16 0 0 0 0 0 0 0

564501 0 0 3 0 0 0 0 0 0 0

47778 1 14 207 0 0 0 5 0 0 0

54107 2 87 746 0 0 0 3 35 0 0

25116 1 4 169 0 0 0 0 1 0 0

71074 93 341 11788 1 0 23 48 58 2 0

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 7: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

> sample_sums(GlobalPatterns)

CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr

864077 1135457 697509 1543451 2076476 718943 433894 186297

.....

NP3 NP5 TRRsed1 TRRsed2 TRRsed3 TS28 TS29 Even1

1478965 1652754 58688 493126 279704 937466 1211071 1216137

> summary(sample_sums(GlobalPatterns))

Min. 1st Qu. Median Mean 3rd Qu. Max.

58690 567100 1107000 1085000 1527000 2357000

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 8: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

Howtodealwithdifferentnumbersofreads?

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 9: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

Verypopular: qiime

2,300citations.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 10: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

CurrentMethod: Rarefying

Adhoc librarysizenormalizationbyrandomsubsamplingwithoutreplacement.1. Selectaminimumlibrarysize, NL,min. Thishasalsobeen

calledthe rarefactionlevel.2. Discardlibraries(microbiomesamples)thathavefewer

readsthan NL,min.3. Subsampletheremaininglibrarieswithoutreplacement

suchthattheyallhavesize NL,min.Often NL,min ischosentobeequaltothesizeofthesmallestlibrarythatisnotconsidered defective, andtheprocessofidentifyingdefectivesamplescomeswithariskofsubjectivityandbias. Inmanycasesresearchershavealsofailedtorepeattherandomsubsamplingstep(3)orrecordthepseudorandomnumbergenerationseed/process---bothofwhichareessentialforreproducibility.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 11: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

ReductionofDatatoProportions

Manysoftwareprogramsautomaticallyreducethedatatorelativeproportions, losingtheinformationaboutlibrarysizesorreadcounts.Thismakescomparisonsverydifficult.

StatisticalFormulation: Whenmakinga(testing)decision,reducingresultsfromaBinomialdistributionintoaproportiondoesnotgivean admissible procedure.Definition:Anadmissibleruleisanoptimalruleformakingadecisioninthesensethatthereisnootherrulethatisalwaysbetterthanit.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 12: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

Howtocompressthedata?

.

LOST INFORMATION

Original Data

FinalResults

...withoutlosingtoomuchinformation?

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 13: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

Theproportionisnota sufficient statisticfortheBinomial.

A statisticT(X) iscalledsufficientfor θ ifitcontainsalltheinformationinX about θ.Standardstatisticalviewpoint:Thejointprobabilitydistributionofthedataconditionalonthevalueofasufficientstatisticforaparameter, doesnotdependonthatparameter: Pθ(X|T(X) = T) doesnotdependon θ. Wiki

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 14: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

EquivalentDefinitions

MutualInformation:

I(X,Y) =∑x∈X

∑y∈Y

P(x, y)log P(x, y)P(x)P(y) = K(P(x, y),P(x)P(y))

A functionofthedata T(X) isasufficientstatisticforthedistributionif

I(θ,X) = I(θ,T(X))

foralldistributionson θ.Note:ForaBayesian, nomatterwhatprioroneuses, oneonlyhastoconsiderthesufficientstatisticformakinginference,becausetheposteriordistributiongiven T = T(x) isthesameastheposteriorgiventhedata X = x.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 15: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

Aimofthestudies: DifferentialAbundance

Likedifferentiallyexpressedgenes, aspecies/OTU isconsidereddifferentiallyabundantifitsmeanproportionissignificantlydifferentbetweentwoormoresampleclassesintheexperimentaldesign.OptimalityCriteria:SensititivityorPower TruePositiveRate.Specificity TrueNegativeRate.

Wehavetocorrectformanysourcesoferror(blocking,modeling, control, etc..)

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 16: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

RarefactionandReductiontoProportionsareInadmissible

Thefollowingisaminimalexampletoexplainwhyrarefyingisstatisticallyinadmissible, especiallywithregardstovariancestabilization.Supposewewanttocomparetwodifferentsamples, called Aand B,comprisedof100and1000reads, respectively. Inthesehypotheticalcommunitiesonlytwotypesofmicrobeshavebeenobserved, OTU1 and OTU2

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 17: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

AccordingtoTable 1, Left.

Table: A minimalexampleoftheeffectofrarefyingonpower.

OriginalAbundanceA B

OTU1 62 500OTU2 38 500Total 100 1000

RarefiedAbundanceA B

OTU1 62 50OTU2 38 50

100 100StandardTestsforDifference

P-value χ2 Prop FisherOriginal 0.0290 0.0290 0.0272Rarefied 0.1171 0.1171 0.1169

Hypotheticalabundancedatainitsoriginal(Top-Left)andrarefied(Top-Right)form, withcorrespondingformaltestresultsfordifferentiation(Bottom).

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 18: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

Formallycomparingthetwoproportionsaccordingtoastandardtestisdoneeitherusinga χ2 test(equivalenttoatwosampleproportiontesthere)oraFisherexacttest. Thisrequiresknowledgeofthenumberoftrials.Byrarefying(Table 1, top-right)sothatbothsampleshavethesamenumberofcounts, wearenolongerabletodifferentiatebetweenthem.Thislossofpoweriscompletelyattributabletoreducingthesizeof B byafactorof10, whichalsoincreasestheconfidenceintervalscorrespondingtoeachproportionsuchthattheyarenolongerdistinguishablefromthosein A,eventhoughtheyaredistinguishableintheoriginaldata.Thevarianceoftheproportion'sestimate p ismultipliedby10whenthetotalcountisdividedby10.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 19: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

Equalizationofvariances

Inthisbinomialexamplethevarianceoftheproportionestimateis Var(Xn ) =

pqn =

qnE(

Xn ), afunctionofthemean.

Thisisacommonoccurrenceandonethatistraditionallydealtwithinstatisticsbyapplyingvariance-stabilizingtransformations.However, inordertofindtherighttransformation, weneedagoodmodelfortheerror.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 20: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

VarianceStabilization

Prefertodealwitherrorsacrosssampleswhichareindependentandidenticallydistributed.Inparticularhomoscedasticity(equalvariances)acrossallthenoiselevels.Thisisnotthecasewhenwehaveunequalsamplesizesandvariationsintheaccuracyacrossinstruments.A standardwayofdealingwithheteroscedasticnoiseistotrytodecomposethesourcesofheterogeneityandapplytransformationsthatmakethenoisevariancealmostconstant.Thesearecalled variancestabilizingtransformations.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 21: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

TakeforinstancedifferentPoissonvariableswithmean µi.Theirvariancesarealldifferentifthe µi aredifferent.However, ifthesquareroottransformationisappliedtoeachofthevariables, thenthetransformedvariableswillhaveapproximatelyconstantvariance.Actuallyifwetakethetransformation x −→ 2

√x weobtaina

varianceapproximatelyequalto1..

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 22: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

var = µ+ cµ2

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 23: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 24: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

...

WolfgangHuber . .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 25: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

Modelingreadcounts

Iftechnicalreplicateshavesamenumberofreads: sj.Poissonvariationwithmean µ = sjui.Taxa i havinganincidenceproportion ui.Numberofreadsforthesample j andtaxa i wouldbe

Kij ∼ Poisson (sjui)

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 26: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

NegativeBinomialwiththetwoparameters: themean m andr = 1−p

p m, thentheprobabilityis:

X ∼ NB(m; r)

P(X = k) =

(k+ r− 1

k

)(r

r+ m

)r( mr+ m

)k

=Γ(k+ r)k!Γ(r)

(r

r+ m

)r( mr+ m

)k

Thevarianceis Var(X) = m(m+r)r = m+ m2

r , wewillalsouseϕ = 1

r andcallthistheoverdispersionparameter, givingVar(X) = m+ ϕm2. When ϕ = 0 thedistributionof X willbePoisson(m). Thisisthe(mean=m,overdispersion=ϕ)parametrizationwewillusefromnowon.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 27: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

ModelingCounts

Forbiologicalreplicateswithinthesamegroup--suchastreatmentorcontrolgroupsorthesameenvironments--theproportions ui willbevariablebetweensamples.Callthetwoparameters ri and ϕ =

pi1−pi .

Sothat Uij theproportionoftaxa i insample j isdistributedaccordingtoGamma(ri, ϕ =

pi1−pi ).

Kij haveaPoisson-GammamixtureofdifferentPoissonvariableseachwithitsownparametergeneratedfromthePoisson.ThisgivestheNegativeBinomialwithparameters (m = uisj)and ϕi asasatisfactorymodelofthevariability.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 28: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

DifferentConditionsSamplesbelongtodifferentconditionssuchastreatmentandcontrolordifferentenvironments.Estimatethevaluesoftheparametersseparatelyforeachofthedifferentbiologicalreplicateconditions/classes.Usetheindex c forthedifferentconditions, wethenhavethecountsforthetaxa i andsample j incondition c havingaNegativeBinomialdistributionwith mc = uicsj and ϕic sothatthevarianceiswritten

uicsj + ϕics2ju2ic (1)Estimatetheparameters uic and ϕic fromthedataforeachOTU andsamplecondition.Theendresultprovidesavariancestabilizingtransformationofthedatathatallowsastatisticallyefficientcomparisonsbetweenconditions.Thisapplicationofahierarchicalmixturemodelisverysimilartotherandomeffectsmodelsusedinthecontextofanalysisofvariance. . .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 29: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

UsingRNA-seqimplementation: DESeq2

McMurdieandHolmes(2014)``WasteNot, WantNot: Whyrarefyingmicrobiomedataisinadmissible'', PLOSComputationalBiology, Methods.ExamplesofOverdispersioninMicrobiomeData.Common-ScaleVarianceversusMeanforMicrobiomeData.EachpointineachpanelrepresentsadifferentOTU'smean/varianceestimateforabiologicalreplicateandstudy.Thedatainthisfigurecomefromthe GlobalPatternssurveyandthe Long-TermDietaryPatterns study(Right)Varianceversusmeanabundanceforrarefiedcounts.(Left)Common-scalevariancesandcommon-scalemeans,estimatedaccordingtotheDESeq2package.Thedashedgraylinedenotesthe σ2 = µ case(Poisson;ϕ = 0). ThecyancurvedenotesthefittedvarianceestimateusingDESeq.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 30: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

Code

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 31: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

Common Scale Rarefied

1e+02

1e+05

1e+08

1e+02

1e+05

1e+08

1e+02

1e+05

1e+08

DietPatternsGlobalPatterns

GlobalPatterns

2004 HighFatfreshwater

marine

10 1000 10 1000Mean

Varia

nce

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 32: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

38 10 6 12 15 14 26 913 13 0 11 4 3 13 715 10 1 13 9 8 24 647 21 7 39 23 17 42 2398 48 11 70 49 36 108 3625 12 3 20 14 8 23 13

380 100 60 120 15 14 26 913 13 0 11 4 3 13 715 10 1 13 9 8 24 6470 210 70 390 23 17 42 2398 48 11 70 49 36 108 3625 12 3 20 14 8 23 13

samples

OTU

stest null

OTU

s

5028368918047

34 1 154 20 429 1 61 85 3

161 6 1342 2 3

191163173123010

574851357899307

191163173000

000

357899307

15 15 161 0 0 0 0

87 4 72 0 0 0 0

10 148 15 0 0 0 0

0 0 0 82 244 7 24

0 0 0 354 452 92 1

0 0 0 14 9 33 251

samplesO

TUs

Ocean Feces

OTU

s

Ocean Feces

1. Sum rows. A multinomial for each sample class.

2. Deterministic mixing. Mix multinomials in precise proportion.

Ocean Feces

Microbiome Clustering Simulationsamples

OTU

s

Environment

OTU

s

1. Sum rows for each environment.

2. Sample from multinomial.

Differential Abundance Simulation

3. Multiply randomly

selected OTUs within test class by

effect size.

4. Perform differential abundance tests,

evaluate performance.

158 56 214 39 47 4 11 11 5 3

124 54 212 29 40 3 10 7 8 6

129 46 216 33 42 4 13 7 3 6

11 3 14 3 1 39 95 63 29 37

19 7 34 7 0 88 237 137 73 86

9 1 15 1 2 29 84 51 14 29

OTU

s

samples

Simulated Ocean Simulated Feces

3. Sample from these multinomials.

4. Perform clustering, evaluate accuracy.

BA

Microbiome count data from the Global

Patterns dataset

Repeat for each effect size and media library size.

Repeat for each environ-ment, number of sam-ples, effect size, and median library size.

Amount added is library size / effect size

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 33: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

NormalizationsinSimulationForeachsimulatedexperimentweusedthefollowingnormalizationmethodspriortocalculatingsample-wisedistances.1. DESeqVS.VarianceStabilizationimplementedinthe

DESeqpackage.2. None. Countsnottransformed. Differencesintotal

librarysizecouldaffectthevaluesofsomedistancemetrics.

3. Proportion. Countsaredividedbytotallibrarysize.4. Rarefy. Rarefyingisperformedasdefinedinthe

introduction, using rarefy_even_depth implementedinthephyloseqpackage. with NL,min settothe15th-percentileoflibrarysizeswithineachsimulatedexperiment.

5. UQ-logFC.The Upper-QuartileLog-FoldChangenormalizationimplementedintheedgeR package, coupledwiththe top-MSD distance.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 34: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

DistancesinSimulationForeachofthepreviousnormalizationswecalculatedsample-wisedistance/dissimilaritymatricesusingthefollowingmethods, ifapplicable.1. Bray-Curtis. TheBray-Curtisdissimilarityfirstdefinedin

1957forforestecology.2. Euclidean. TheeuclideandistancetreatingeachOTU as

adimension.√∑n

i=1(Ki1 − Ki2)2, isthedistancebetweensamples 1 and 2,n thenumberofdistinctOTUs.

3. PoissonDist. Ourabbreviationof PoissonDistance, asample-wisedistanceimplementedinthePoiClaClupackage (Witten,2011).

4. top-MSD.ThemeansquareddifferenceoftopOTUs, asimplementedinedgeR.

5. UniFrac-u. TheUnweightedUniFracdistance (Lozupone,2005).

6. UniFrac-w. TheWeightedUniFracdistance (Lozupone,2007). . .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 35: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

Inordertoconsistentlyevaluateperformanceinthisregard,wegeneratedmicrobiomecountsbysamplingfromtwodifferentmultinomialsthatwerebasedoneitherthe Oceanor Feces microbiomesofthe GlobalPatterns empiricaldataset.Anequalnumberofsimulatedmicrobiomesampleswasgeneratedfromeachmultinomial. The Ocean and FecessampleclasseshavenegligibleoverlappingOTUs.Mixingthembyadefinedproportionallowscontroloverthedifficultyoftheclusteringtaskfromtrivial(nomixing)toimpossible(bothmultinomialsevenlymixed).Clusteringwasperformedindependentlyforeachcombinationofsimulatedexperiment, normalizationmethod, anddistancemeasureusingpartitioningaroundmedoids(PAM).Theaccuracyisthefractionofsimulatedsamplescorrectlyclustered; worstpossibleaccuracyis50%ifallsamplesareclustered. (Rarefyingprocedureomitssamples, soitsaccuracycanbebelow50%)

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 36: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

ImprovementinPowerandFDR

Performanceofdifferentialabundancedetectionwithandwithoutrarefyingsummarizedby“AreaUndertheCurve”(AUC) metricofaReceiverOperatorCurve(ROC) (verticalaxis).Briefly, theAUC valuevariesfrom0.5(random)to1.0(perfect).Thehorizontalaxisindicatestheeffectsize, shownasthefactorappliedtoOTU countstosimulateadifferentialabundance.Eachcurvetracestherespectivenormalizationmethod’smeanperformanceofthatpanel, withaverticalbarindicatingastandarddeviationinperformanceacrossallreplicatesandmicrobiometemplates.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 37: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

Theright-handsideofthepanelrowsindicatesthemedianlibrarysize, N,whilethedarknessoflineshadingindicatesthenumberofsamplespersimulatedexperiment.Colorshadeandshapeindicatethenormalizationmethod.DetectionamongmultipletestswasdefinedusingaFalseDiscoveryRate(Benjamini-Hochberg)significancethresholdof0.05.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 38: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

DESeq2 − nbinomWaldTest DESeq − nbinomTest edgeR − exactTest metagenomeSeq − fitZig two sided Welch t−test

0.6

0.8

1.0

0.6

0.8

1.0

N ~L

=2

00

0N ~

L=

50

00

0

AU

C

Number Samples per Class: 3 5 10 Normalization Method: Model/None Rarefied Proportion

0.6

0.8

1.0

N ~L

=5

00

00

Sp

eci

ficity

0.2

0.5

0.8

1.0

N ~L

=5

00

00

5 10 15 20 5 10 15 20 5 10 15 20 5 10 15 20 5 10 15 20Effect Size

Se

nsi

tivity

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 39: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

ImprovementsofDistancebasedclustering

Clusteringaccuracyinsimulatedtwo-classmixing.Clusteringaccuracy(withPAM:verticalaxis)followingdifferentnormalizationanddistancemethods.Pointsdenotethemeanvaluesofreplicates, withaverticalbarrepresentingonestandarddeviationaboveandbelow.Thehorizontalaxisistheeffectsize.EachmultinomialisderivedfromtwomicrobiomesthathavenegligibleoverlappingOTUs(FecalandOceanmicrobiomesintheGlobalPatternsdataset).Highervaluesofeffectsizeindicateaneasierclusteringtask.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 40: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

Bray − Curtis Euclidean PoissonDist top − MSD UniFrac − u UniFrac − w

0.6

0.8

1.0

0.6

0.8

1.0

0.6

0.8

1.0

N ~L

=1

00

0N ~

L=

20

00

N ~L

=1

00

00

1.15 2.00 3.00 1.15 2.00 3.00 1.15 2.00 3.00 1.15 2.00 3.00 1.15 2.00 3.00 1.15 2.00 3.00

Effect Size

Accu

racy

Normalization Method: DESeqVS None Proportion Rarefy UQ−logFC

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 41: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

ExamplesusingPhyloseq:http://joey711.github.io/phyloseq-extensions/

DESeq2.html

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 42: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

BenefittingfromthetoolsandschoolsofStatisticians.......

Thankstothe R community:Chessel, Jombart, Dray, Thioulouse ade4 , WolfgangHuber,MichaelLovefor DESeq2, GordonSmythandhisteamforedgeR andEmmanuelParadisfor ape.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 43: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

Collaborators:

DavidRelman AlfredSpormann ElizabethPurdomJustinSonnenburgandPersiDiaconis.

PostdoctoralFellows Paul(Joey)McMurdie, AlexAlekseyenko(NYU),BenCallahan.Students: MilingShen, DianaProctor, AldenTimme, KatieShelef, YanaHoy, JohnChakerian, JuliaFukuyama, KrisSankaran.Fundingfrom NIH-TR01, NIH/NIGMS R01, NSF-VIGRE andNSF-DMS.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 44: Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. · > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077

phyloseq

JoeyMcMurdie(joey711ongithub).AvailableinBioconductor.HowcanI (mystudents, mypostdocs...) learnmore?Google: wikiphyloseqdeseq2http://www-stat.stanford.edu/˜susan/

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .