Waste Not, Want Notweb.stanford.edu/class/bios221/WasteNotWantNot-STAMPS.pdf · 2015. 8. 9. ·...

Preview:

Citation preview

Waste Not, Want Not

SusanHolmeshttp://www-stat.stanford.edu/˜susan/

Bio-X andStatistics, StanfordUniversity ABabcdfghiejklAugust9, 2015

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Challengesforthoseofusworkingfromthegroundup

Heteroscedasticity. InformationLeaks.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Part I

Heteroscedasticity: Mixturesand how to Normalize them

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Some real data (Caporoso et al, 2011)

> GlobalPatterns

phyloseq-class experiment-level object

otu_table() OTU Table: [ 19216 taxa and 26 samples ]

sample_data()Sample Data: [ 26 samples by 7 sample variables ]

tax_table()Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]

phy_tree() Phylogenetic Tree:[ 19216 tips and 19215 internal nodes ]

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

otu_table(GlobalPatterns)[45:55,1:10]

OTU Table: [11 taxa and 10 samples]

taxa are rows

CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr M31Tong M11Tong

573586 0 0 0 0 0 0 0 0 0 0

568724 0 0 0 0 0 0 0 0 0 0

175045 0 0 0 0 1 0 0 2 0 0

552540 0 0 0 0 0 0 0 0 0 0

546313 72 153 11232 0 1 1 0 1 1 0

548602 0 0 16 0 0 0 0 0 0 0

564501 0 0 3 0 0 0 0 0 0 0

47778 1 14 207 0 0 0 5 0 0 0

54107 2 87 746 0 0 0 3 35 0 0

25116 1 4 169 0 0 0 0 1 0 0

71074 93 341 11788 1 0 23 48 58 2 0

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

> sample_sums(GlobalPatterns)

CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr

864077 1135457 697509 1543451 2076476 718943 433894 186297

.....

NP3 NP5 TRRsed1 TRRsed2 TRRsed3 TS28 TS29 Even1

1478965 1652754 58688 493126 279704 937466 1211071 1216137

> summary(sample_sums(GlobalPatterns))

Min. 1st Qu. Median Mean 3rd Qu. Max.

58690 567100 1107000 1085000 1527000 2357000

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Howtodealwithdifferentnumbersofreads?

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Verypopular: qiime

2,300citations.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

CurrentMethod: Rarefying

Adhoc librarysizenormalizationbyrandomsubsamplingwithoutreplacement.1. Selectaminimumlibrarysize, NL,min. Thishasalsobeen

calledthe rarefactionlevel.2. Discardlibraries(microbiomesamples)thathavefewer

readsthan NL,min.3. Subsampletheremaininglibrarieswithoutreplacement

suchthattheyallhavesize NL,min.Often NL,min ischosentobeequaltothesizeofthesmallestlibrarythatisnotconsidered defective, andtheprocessofidentifyingdefectivesamplescomeswithariskofsubjectivityandbias. Inmanycasesresearchershavealsofailedtorepeattherandomsubsamplingstep(3)orrecordthepseudorandomnumbergenerationseed/process---bothofwhichareessentialforreproducibility.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

ReductionofDatatoProportions

Manysoftwareprogramsautomaticallyreducethedatatorelativeproportions, losingtheinformationaboutlibrarysizesorreadcounts.Thismakescomparisonsverydifficult.

StatisticalFormulation: Whenmakinga(testing)decision,reducingresultsfromaBinomialdistributionintoaproportiondoesnotgivean admissible procedure.Definition:Anadmissibleruleisanoptimalruleformakingadecisioninthesensethatthereisnootherrulethatisalwaysbetterthanit.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Howtocompressthedata?

.

LOST INFORMATION

Original Data

FinalResults

...withoutlosingtoomuchinformation?

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Theproportionisnota sufficient statisticfortheBinomial.

A statisticT(X) iscalledsufficientfor θ ifitcontainsalltheinformationinX about θ.Standardstatisticalviewpoint:Thejointprobabilitydistributionofthedataconditionalonthevalueofasufficientstatisticforaparameter, doesnotdependonthatparameter: Pθ(X|T(X) = T) doesnotdependon θ. Wiki

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

EquivalentDefinitions

MutualInformation:

I(X,Y) =∑x∈X

∑y∈Y

P(x, y)log P(x, y)P(x)P(y) = K(P(x, y),P(x)P(y))

A functionofthedata T(X) isasufficientstatisticforthedistributionif

I(θ,X) = I(θ,T(X))

foralldistributionson θ.Note:ForaBayesian, nomatterwhatprioroneuses, oneonlyhastoconsiderthesufficientstatisticformakinginference,becausetheposteriordistributiongiven T = T(x) isthesameastheposteriorgiventhedata X = x.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Aimofthestudies: DifferentialAbundance

Likedifferentiallyexpressedgenes, aspecies/OTU isconsidereddifferentiallyabundantifitsmeanproportionissignificantlydifferentbetweentwoormoresampleclassesintheexperimentaldesign.OptimalityCriteria:SensititivityorPower TruePositiveRate.Specificity TrueNegativeRate.

Wehavetocorrectformanysourcesoferror(blocking,modeling, control, etc..)

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

RarefactionandReductiontoProportionsareInadmissible

Thefollowingisaminimalexampletoexplainwhyrarefyingisstatisticallyinadmissible, especiallywithregardstovariancestabilization.Supposewewanttocomparetwodifferentsamples, called Aand B,comprisedof100and1000reads, respectively. Inthesehypotheticalcommunitiesonlytwotypesofmicrobeshavebeenobserved, OTU1 and OTU2

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

AccordingtoTable 1, Left.

Table: A minimalexampleoftheeffectofrarefyingonpower.

OriginalAbundanceA B

OTU1 62 500OTU2 38 500Total 100 1000

RarefiedAbundanceA B

OTU1 62 50OTU2 38 50

100 100StandardTestsforDifference

P-value χ2 Prop FisherOriginal 0.0290 0.0290 0.0272Rarefied 0.1171 0.1171 0.1169

Hypotheticalabundancedatainitsoriginal(Top-Left)andrarefied(Top-Right)form, withcorrespondingformaltestresultsfordifferentiation(Bottom).

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Formallycomparingthetwoproportionsaccordingtoastandardtestisdoneeitherusinga χ2 test(equivalenttoatwosampleproportiontesthere)oraFisherexacttest. Thisrequiresknowledgeofthenumberoftrials.Byrarefying(Table 1, top-right)sothatbothsampleshavethesamenumberofcounts, wearenolongerabletodifferentiatebetweenthem.Thislossofpoweriscompletelyattributabletoreducingthesizeof B byafactorof10, whichalsoincreasestheconfidenceintervalscorrespondingtoeachproportionsuchthattheyarenolongerdistinguishablefromthosein A,eventhoughtheyaredistinguishableintheoriginaldata.Thevarianceoftheproportion'sestimate p ismultipliedby10whenthetotalcountisdividedby10.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Equalizationofvariances

Inthisbinomialexamplethevarianceoftheproportionestimateis Var(Xn ) =

pqn =

qnE(

Xn ), afunctionofthemean.

Thisisacommonoccurrenceandonethatistraditionallydealtwithinstatisticsbyapplyingvariance-stabilizingtransformations.However, inordertofindtherighttransformation, weneedagoodmodelfortheerror.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

VarianceStabilization

Prefertodealwitherrorsacrosssampleswhichareindependentandidenticallydistributed.Inparticularhomoscedasticity(equalvariances)acrossallthenoiselevels.Thisisnotthecasewhenwehaveunequalsamplesizesandvariationsintheaccuracyacrossinstruments.A standardwayofdealingwithheteroscedasticnoiseistotrytodecomposethesourcesofheterogeneityandapplytransformationsthatmakethenoisevariancealmostconstant.Thesearecalled variancestabilizingtransformations.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

TakeforinstancedifferentPoissonvariableswithmean µi.Theirvariancesarealldifferentifthe µi aredifferent.However, ifthesquareroottransformationisappliedtoeachofthevariables, thenthetransformedvariableswillhaveapproximatelyconstantvariance.Actuallyifwetakethetransformation x −→ 2

√x weobtaina

varianceapproximatelyequalto1..

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

var = µ+ cµ2

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

...

WolfgangHuber . .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Modelingreadcounts

Iftechnicalreplicateshavesamenumberofreads: sj.Poissonvariationwithmean µ = sjui.Taxa i havinganincidenceproportion ui.Numberofreadsforthesample j andtaxa i wouldbe

Kij ∼ Poisson (sjui)

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

NegativeBinomialwiththetwoparameters: themean m andr = 1−p

p m, thentheprobabilityis:

X ∼ NB(m; r)

P(X = k) =

(k+ r− 1

k

)(r

r+ m

)r( mr+ m

)k

=Γ(k+ r)k!Γ(r)

(r

r+ m

)r( mr+ m

)k

Thevarianceis Var(X) = m(m+r)r = m+ m2

r , wewillalsouseϕ = 1

r andcallthistheoverdispersionparameter, givingVar(X) = m+ ϕm2. When ϕ = 0 thedistributionof X willbePoisson(m). Thisisthe(mean=m,overdispersion=ϕ)parametrizationwewillusefromnowon.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

ModelingCounts

Forbiologicalreplicateswithinthesamegroup--suchastreatmentorcontrolgroupsorthesameenvironments--theproportions ui willbevariablebetweensamples.Callthetwoparameters ri and ϕ =

pi1−pi .

Sothat Uij theproportionoftaxa i insample j isdistributedaccordingtoGamma(ri, ϕ =

pi1−pi ).

Kij haveaPoisson-GammamixtureofdifferentPoissonvariableseachwithitsownparametergeneratedfromthePoisson.ThisgivestheNegativeBinomialwithparameters (m = uisj)and ϕi asasatisfactorymodelofthevariability.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

DifferentConditionsSamplesbelongtodifferentconditionssuchastreatmentandcontrolordifferentenvironments.Estimatethevaluesoftheparametersseparatelyforeachofthedifferentbiologicalreplicateconditions/classes.Usetheindex c forthedifferentconditions, wethenhavethecountsforthetaxa i andsample j incondition c havingaNegativeBinomialdistributionwith mc = uicsj and ϕic sothatthevarianceiswritten

uicsj + ϕics2ju2ic (1)Estimatetheparameters uic and ϕic fromthedataforeachOTU andsamplecondition.Theendresultprovidesavariancestabilizingtransformationofthedatathatallowsastatisticallyefficientcomparisonsbetweenconditions.Thisapplicationofahierarchicalmixturemodelisverysimilartotherandomeffectsmodelsusedinthecontextofanalysisofvariance. . .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

UsingRNA-seqimplementation: DESeq2

McMurdieandHolmes(2014)``WasteNot, WantNot: Whyrarefyingmicrobiomedataisinadmissible'', PLOSComputationalBiology, Methods.ExamplesofOverdispersioninMicrobiomeData.Common-ScaleVarianceversusMeanforMicrobiomeData.EachpointineachpanelrepresentsadifferentOTU'smean/varianceestimateforabiologicalreplicateandstudy.Thedatainthisfigurecomefromthe GlobalPatternssurveyandthe Long-TermDietaryPatterns study(Right)Varianceversusmeanabundanceforrarefiedcounts.(Left)Common-scalevariancesandcommon-scalemeans,estimatedaccordingtotheDESeq2package.Thedashedgraylinedenotesthe σ2 = µ case(Poisson;ϕ = 0). ThecyancurvedenotesthefittedvarianceestimateusingDESeq.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Code

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Common Scale Rarefied

1e+02

1e+05

1e+08

1e+02

1e+05

1e+08

1e+02

1e+05

1e+08

DietPatternsGlobalPatterns

GlobalPatterns

2004 HighFatfreshwater

marine

10 1000 10 1000Mean

Varia

nce

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

38 10 6 12 15 14 26 913 13 0 11 4 3 13 715 10 1 13 9 8 24 647 21 7 39 23 17 42 2398 48 11 70 49 36 108 3625 12 3 20 14 8 23 13

380 100 60 120 15 14 26 913 13 0 11 4 3 13 715 10 1 13 9 8 24 6470 210 70 390 23 17 42 2398 48 11 70 49 36 108 3625 12 3 20 14 8 23 13

samples

OTU

stest null

OTU

s

5028368918047

34 1 154 20 429 1 61 85 3

161 6 1342 2 3

191163173123010

574851357899307

191163173000

000

357899307

15 15 161 0 0 0 0

87 4 72 0 0 0 0

10 148 15 0 0 0 0

0 0 0 82 244 7 24

0 0 0 354 452 92 1

0 0 0 14 9 33 251

samplesO

TUs

Ocean Feces

OTU

s

Ocean Feces

1. Sum rows. A multinomial for each sample class.

2. Deterministic mixing. Mix multinomials in precise proportion.

Ocean Feces

Microbiome Clustering Simulationsamples

OTU

s

Environment

OTU

s

1. Sum rows for each environment.

2. Sample from multinomial.

Differential Abundance Simulation

3. Multiply randomly

selected OTUs within test class by

effect size.

4. Perform differential abundance tests,

evaluate performance.

158 56 214 39 47 4 11 11 5 3

124 54 212 29 40 3 10 7 8 6

129 46 216 33 42 4 13 7 3 6

11 3 14 3 1 39 95 63 29 37

19 7 34 7 0 88 237 137 73 86

9 1 15 1 2 29 84 51 14 29

OTU

s

samples

Simulated Ocean Simulated Feces

3. Sample from these multinomials.

4. Perform clustering, evaluate accuracy.

BA

Microbiome count data from the Global

Patterns dataset

Repeat for each effect size and media library size.

Repeat for each environ-ment, number of sam-ples, effect size, and median library size.

Amount added is library size / effect size

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

NormalizationsinSimulationForeachsimulatedexperimentweusedthefollowingnormalizationmethodspriortocalculatingsample-wisedistances.1. DESeqVS.VarianceStabilizationimplementedinthe

DESeqpackage.2. None. Countsnottransformed. Differencesintotal

librarysizecouldaffectthevaluesofsomedistancemetrics.

3. Proportion. Countsaredividedbytotallibrarysize.4. Rarefy. Rarefyingisperformedasdefinedinthe

introduction, using rarefy_even_depth implementedinthephyloseqpackage. with NL,min settothe15th-percentileoflibrarysizeswithineachsimulatedexperiment.

5. UQ-logFC.The Upper-QuartileLog-FoldChangenormalizationimplementedintheedgeR package, coupledwiththe top-MSD distance.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

DistancesinSimulationForeachofthepreviousnormalizationswecalculatedsample-wisedistance/dissimilaritymatricesusingthefollowingmethods, ifapplicable.1. Bray-Curtis. TheBray-Curtisdissimilarityfirstdefinedin

1957forforestecology.2. Euclidean. TheeuclideandistancetreatingeachOTU as

adimension.√∑n

i=1(Ki1 − Ki2)2, isthedistancebetweensamples 1 and 2,n thenumberofdistinctOTUs.

3. PoissonDist. Ourabbreviationof PoissonDistance, asample-wisedistanceimplementedinthePoiClaClupackage (Witten,2011).

4. top-MSD.ThemeansquareddifferenceoftopOTUs, asimplementedinedgeR.

5. UniFrac-u. TheUnweightedUniFracdistance (Lozupone,2005).

6. UniFrac-w. TheWeightedUniFracdistance (Lozupone,2007). . .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Inordertoconsistentlyevaluateperformanceinthisregard,wegeneratedmicrobiomecountsbysamplingfromtwodifferentmultinomialsthatwerebasedoneitherthe Oceanor Feces microbiomesofthe GlobalPatterns empiricaldataset.Anequalnumberofsimulatedmicrobiomesampleswasgeneratedfromeachmultinomial. The Ocean and FecessampleclasseshavenegligibleoverlappingOTUs.Mixingthembyadefinedproportionallowscontroloverthedifficultyoftheclusteringtaskfromtrivial(nomixing)toimpossible(bothmultinomialsevenlymixed).Clusteringwasperformedindependentlyforeachcombinationofsimulatedexperiment, normalizationmethod, anddistancemeasureusingpartitioningaroundmedoids(PAM).Theaccuracyisthefractionofsimulatedsamplescorrectlyclustered; worstpossibleaccuracyis50%ifallsamplesareclustered. (Rarefyingprocedureomitssamples, soitsaccuracycanbebelow50%)

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

ImprovementinPowerandFDR

Performanceofdifferentialabundancedetectionwithandwithoutrarefyingsummarizedby“AreaUndertheCurve”(AUC) metricofaReceiverOperatorCurve(ROC) (verticalaxis).Briefly, theAUC valuevariesfrom0.5(random)to1.0(perfect).Thehorizontalaxisindicatestheeffectsize, shownasthefactorappliedtoOTU countstosimulateadifferentialabundance.Eachcurvetracestherespectivenormalizationmethod’smeanperformanceofthatpanel, withaverticalbarindicatingastandarddeviationinperformanceacrossallreplicatesandmicrobiometemplates.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Theright-handsideofthepanelrowsindicatesthemedianlibrarysize, N,whilethedarknessoflineshadingindicatesthenumberofsamplespersimulatedexperiment.Colorshadeandshapeindicatethenormalizationmethod.DetectionamongmultipletestswasdefinedusingaFalseDiscoveryRate(Benjamini-Hochberg)significancethresholdof0.05.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

DESeq2 − nbinomWaldTest DESeq − nbinomTest edgeR − exactTest metagenomeSeq − fitZig two sided Welch t−test

0.6

0.8

1.0

0.6

0.8

1.0

N ~L

=2

00

0N ~

L=

50

00

0

AU

C

Number Samples per Class: 3 5 10 Normalization Method: Model/None Rarefied Proportion

0.6

0.8

1.0

N ~L

=5

00

00

Sp

eci

ficity

0.2

0.5

0.8

1.0

N ~L

=5

00

00

5 10 15 20 5 10 15 20 5 10 15 20 5 10 15 20 5 10 15 20Effect Size

Se

nsi

tivity

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

ImprovementsofDistancebasedclustering

Clusteringaccuracyinsimulatedtwo-classmixing.Clusteringaccuracy(withPAM:verticalaxis)followingdifferentnormalizationanddistancemethods.Pointsdenotethemeanvaluesofreplicates, withaverticalbarrepresentingonestandarddeviationaboveandbelow.Thehorizontalaxisistheeffectsize.EachmultinomialisderivedfromtwomicrobiomesthathavenegligibleoverlappingOTUs(FecalandOceanmicrobiomesintheGlobalPatternsdataset).Highervaluesofeffectsizeindicateaneasierclusteringtask.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Bray − Curtis Euclidean PoissonDist top − MSD UniFrac − u UniFrac − w

0.6

0.8

1.0

0.6

0.8

1.0

0.6

0.8

1.0

N ~L

=1

00

0N ~

L=

20

00

N ~L

=1

00

00

1.15 2.00 3.00 1.15 2.00 3.00 1.15 2.00 3.00 1.15 2.00 3.00 1.15 2.00 3.00 1.15 2.00 3.00

Effect Size

Accu

racy

Normalization Method: DESeqVS None Proportion Rarefy UQ−logFC

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

ExamplesusingPhyloseq:http://joey711.github.io/phyloseq-extensions/

DESeq2.html

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

BenefittingfromthetoolsandschoolsofStatisticians.......

Thankstothe R community:Chessel, Jombart, Dray, Thioulouse ade4 , WolfgangHuber,MichaelLovefor DESeq2, GordonSmythandhisteamforedgeR andEmmanuelParadisfor ape.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Collaborators:

DavidRelman AlfredSpormann ElizabethPurdomJustinSonnenburgandPersiDiaconis.

PostdoctoralFellows Paul(Joey)McMurdie, AlexAlekseyenko(NYU),BenCallahan.Students: MilingShen, DianaProctor, AldenTimme, KatieShelef, YanaHoy, JohnChakerian, JuliaFukuyama, KrisSankaran.Fundingfrom NIH-TR01, NIH/NIGMS R01, NSF-VIGRE andNSF-DMS.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

phyloseq

JoeyMcMurdie(joey711ongithub).AvailableinBioconductor.HowcanI (mystudents, mypostdocs...) learnmore?Google: wikiphyloseqdeseq2http://www-stat.stanford.edu/˜susan/

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Recommended