Statistics for Genomics (140.688)
Instructor: Jeff Leek
Slide Credits: Rafael Irizarry, John Storey
No announcements today.
Hypothesis testing
• Once you have a given score for each gene, how do you decide on a cut-off?
• p-values are popular.
• But how do we decide on a cut-off?
• Are 0.05 and 0.01 appropriate?
• Are the p-values correct?
P-values
[Figure: two histograms of p-values (x-axis: P-value, 0.0 to 1.0; y-axis: Frequency). Panels: No Differential Expression; A Lot of Differential Expression.]
Multiple Comparison Problem
• If we do have useful approximations of our p-values, we still face the multiple comparison problem
• When performing many independent tests, p-values no longer have the same interpretation
Hypothesis Testing
• Test for each gene null hypothesis: no differential expression.
• Two types of errors can be committed
– Type I error or false positive (say that a gene is differentially expressed when it is not, i.e., reject a true null hypothesis).
– Type II error or false negative (fail to identify a truly differentially expressed gene, i.e., fail to reject a false null hypothesis).
Hypothetical Example
• Microarray with 10,000 genes
• Calculate 10,000 p-values
• Call genes "significant" if p-value < 0.05
• Expected Number of False Positives:
10,000 × 0.05 = 500 False Positives
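The arithmetic above can be checked with a small simulation. This is only a sketch under a stated assumption: every one of the 10,000 genes is truly null, so each p-value is uniform on [0, 1]. The function names and the seed are illustrative and do not come from the slides.

```python
import random


def expected_false_positives(m, alpha):
    """Expected count of p-values below alpha when all m tests are null."""
    return m * alpha


def simulate_false_positives(m, alpha, seed=0):
    """Draw m null p-values (uniform on [0, 1)) and count those below alpha.

    Assumption: all m genes are truly null and the tests are independent.
    """
    rng = random.Random(seed)
    return sum(1 for _ in range(m) if rng.random() < alpha)


expected = expected_false_positives(10_000, 0.05)
observed = simulate_false_positives(10_000, 0.05)
print("expected:", expected)
print("observed in one simulated experiment:", observed)
```

The simulated count varies from run to run (and with the seed) but should land near 500.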
Multiple Hypothesis Testing
• What happens if we call all genes significant with p-values ≤ 0.05, for example?

               Called Significant   Not Called Significant   Total
Null True      V                    m0 – V                   m0
Altern. True   S                    m1 – S                   m1
Total          R                    m – R                    m

Null = Equivalent Expression; Alternative = Differential Expression
Error Rates
• Per comparison error rate (PCER): the expected value of the number of Type I errors over the number of hypotheses
PCER = E(V)/m
• Per family error rate (PFER): the expected number of Type I errors
PFER = E(V)
• Family-wise error rate (FWER): the probability of at least one Type I error
FWER = Pr(V ≥ 1)
• False discovery rate (FDR): rate that false discoveries occur
FDR = E(V/R; R > 0) = E(V/R | R > 0) Pr(R > 0)
• Positive false discovery rate (pFDR): rate that discoveries are false
pFDR = E(V/R | R > 0)
Multiple Comparison Error Rates
• Family-wise error rate:
Pr(# False Positives ≥ 1)
• False discovery rate:
$$E\left[\frac{\#\text{ False Positives}}{\#\text{ of Discoveries}}\right]$$
False Discovery Rate
• The "false discovery rate" measures the proportion of false positives among all genes called significant:
$$\frac{\#\text{ false positives}}{\#\text{ called significant}} = \frac{V}{R} = \frac{V}{V+S}$$
• This is usually appropriate because one wants to find as many truly differentially expressed genes as possible with relatively few false positives
• The false discovery rate gives the rate at which further biological verification will result in dead-ends
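False Positive Rate versus False Discovery Rate
• False positive rate is the rate at which truly null genes are called significant
$$\mathrm{FPR} \approx \frac{\#\text{ false positives}}{\#\text{ truly null}} = \frac{V}{m_0}$$
• False discovery rate is the rate at which significant genes are truly null
$$\mathrm{FDR} \approx \frac{\#\text{ false positives}}{\#\text{ called significant}} = \frac{V}{R}$$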
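To make the difference in denominators concrete, here is a minimal sketch that tallies V, S, R and m0 from the table notation for a toy data set and reports both proportions. The function name, the dictionary keys and the eight-gene example are hypothetical. In real data the truth labels are unknown, so this only works in simulations.

```python
def error_proportions(is_null, called_significant):
    """Tally V, S, R, m0 and the realized false positive / false discovery proportions.

    is_null[i] is True if gene i is truly null (known only in simulations);
    called_significant[i] is True if gene i was called significant.
    """
    m0 = sum(is_null)                      # number of truly null genes
    R = sum(called_significant)            # number called significant
    V = sum(1 for null, sig in zip(is_null, called_significant) if null and sig)
    S = R - V                              # true discoveries
    fpr = V / m0 if m0 > 0 else 0.0        # false positives / truly null
    fdp = V / R if R > 0 else 0.0          # false positives / called significant
    return {"V": V, "S": S, "R": R, "m0": m0, "FPR": fpr, "FDP": fdp}


# Toy example: 8 genes, 6 truly null, 3 called significant (1 of them null).
is_null = [True, True, True, True, True, True, False, False]
called = [True, False, False, False, False, False, True, True]
result = error_proportions(is_null, called)
print(result)
```

Here the same single false positive gives a proportion of 1/6 relative to the truly null genes but 1/3 relative to the genes called significant.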
Difference in Interpretation
Suppose 550 out of 10,000 genes are significant at 0.05 level
P-value < 0.05: Expect 0.05 × 10,000 = 500 false positives
False Discovery Rate < 0.05: Expect 0.05 × 550 = 27.5 false positives
Family Wise Error Rate < 0.05: The probability of at least 1 false positive ≤ 0.05
Controlling Error Rates
Corrections when doing m tests:
Bonferroni Correction (FWER)
P-values less than α/m are significant
Benjamini-Hochberg Correction (FDR)
Order the p-values: p(1), …, p(m)
Find the largest i with p(i) ≤ α × i/m; then p(1), …, p(i) are significant
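The two corrections can be sketched in a few lines of Python. This is an illustrative implementation, not code from the course: the function names and the example p-values are made up, and the Benjamini-Hochberg function uses the standard step-up form, in which every p-value up to the largest rank i satisfying p(i) ≤ α × i/m is called significant.

```python
def bonferroni(pvalues, alpha=0.05):
    """Call a test significant if its p-value is less than alpha / m."""
    m = len(pvalues)
    return [p < alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Step-up Benjamini-Hochberg: find the largest rank i with
    p(i) <= alpha * i / m and call the i smallest p-values significant."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            largest_rank = rank
    significant = [False] * m
    for idx in order[:largest_rank]:
        significant[idx] = True
    return significant


# Hypothetical p-values for m = 10 tests.
pvalues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print("Bonferroni:", sum(bonferroni(pvalues)), "significant")
print("Benjamini-Hochberg:", sum(benjamini_hochberg(pvalues)), "significant")
```

On this toy input Bonferroni keeps only the smallest p-value, while Benjamini-Hochberg keeps the two smallest, which reflects the less stringent error rate it controls.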
False Positive Rate and P-values
• The p-value is a measure of significance in terms of the false positive rate (aka Type I error rate)
• P-value is defined to be the minimum false positive rate at which the statistic can be called significant
• Can be described as the probability a truly null statistic is "as or more extreme" than the observed one
False Discovery Rate and Q-values
• The q-value is a measure of significance in terms of the false discovery rate
• Q-value is defined to be the minimum false discovery rate at which the statistic can be called significant
• Can be described as the probability a statistic "as or more extreme" is truly null
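Estimate of FDR
• We begin by estimating FDR when calling all genes significant with p-values ≤ t
• Heuristic motivation:
$$\mathrm{FDR}(t) \approx \frac{E[V(t)]}{E[R(t)]} = \frac{E[\#\{\text{null } p_i \le t\}]}{E[\#\{p_i \le t\}]}, \qquad E[\#\{\text{null } p_i \le t\}] = m_0 t$$
$$\widehat{\mathrm{FDR}}(t) = \frac{\hat{m}_0 \cdot t}{\#\{p_i \le t\}}$$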
Estimate of π0
• We first estimate the more easily interpreted π0 = m0/m, the proportion of truly null (non-differentially expressed) genes:
$$\hat{\pi}_0(\lambda) = \frac{\#\{p_i > \lambda\}}{m \cdot (1 - \lambda)}$$
• Then clearly
$$\hat{m}_0 = \hat{\pi}_0 \cdot m$$
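The estimates of π0 and of FDR at a threshold t, and the q-value taken as the minimum estimated FDR over all thresholds at which a gene would still be called significant, can be sketched as follows. Several things here are assumptions rather than content from the slides: the function names, the tuning value λ = 0.5, the capping of both estimates at 1, and the example p-values.

```python
def estimate_pi0(pvalues, lam=0.5):
    """pi0_hat(lambda) = #{p_i > lambda} / (m * (1 - lambda)).

    lam = 0.5 is an assumed default; capping at 1 is also an assumption.
    """
    m = len(pvalues)
    pi0 = sum(1 for p in pvalues if p > lam) / (m * (1.0 - lam))
    return min(pi0, 1.0)


def estimate_fdr(pvalues, t, lam=0.5):
    """FDR_hat(t) = m0_hat * t / #{p_i <= t}, with m0_hat = pi0_hat * m."""
    m = len(pvalues)
    m0_hat = estimate_pi0(pvalues, lam) * m
    n_called = sum(1 for p in pvalues if p <= t)
    if n_called == 0:
        return 0.0  # nothing called significant, so no false discoveries
    return min(m0_hat * t / n_called, 1.0)


def qvalues(pvalues, lam=0.5):
    """q-value of gene i: the smallest FDR_hat(t) over thresholds t >= p_i."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    q = [0.0] * m
    running_min = 1.0
    for idx in order:  # walk from the largest p-value down
        running_min = min(running_min, estimate_fdr(pvalues, pvalues[idx], lam))
        q[idx] = running_min
    return q


# Hypothetical p-values for m = 10 genes.
pvalues = [0.001, 0.002, 0.01, 0.03, 0.2, 0.4, 0.6, 0.7, 0.8, 0.9]
pi0_hat = estimate_pi0(pvalues)
fdr_at_005 = estimate_fdr(pvalues, 0.05)
q = qvalues(pvalues)
print("pi0_hat:", pi0_hat)
print("FDR_hat(0.05):", fdr_at_005)
print("q-values:", q)
```

The q-value loop recomputes the FDR estimate for each gene, which is quadratic in m. That is fine for a sketch but would be replaced by a single sorted pass for 10,000 genes.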
The Effect of Heterogeneity
[Figure: three panels. Color = Environment (Idaghdour et al. 2008); Color = Processing Year (Cheung et al. 2008); Color = Allele (Brem et al. 2005).]
Null P-Value Distributions
[Figure: histograms of null p-values (x-axis: P-value; y-axis: Frequency), four panels per row. Rows: Independent E; Dependent E.]
Null P-Value Distributions
[Figure: histograms of null p-values (x-axis: P-value; y-axis: Frequency), four panels per row, labeled by correlation: |ρ| = 0.40, |ρ| = 0.31, |ρ| = 0.10, |ρ| = 0.00. Rows: Independent E; Dependent E.]
Batch and rankings
The Model: expression = b0 + b1 × group + b2 × batch + noise
Gene 1: expression = b0 + 3 × group + 10 × batch + noise
Gene 2: expression = b0 + 1 × group + 1 × batch + noise
Principal Components Analysis / Singular Value Decomposition
• A method to identify patterns in the data that explain a large percentage of the variation
• PCA and SVD have different mathematical goals but end up estimating the same thing
• First proposed for genomics by Alter et al. (2000) PNAS
Singular Value Decomposition
Data (genes × samples) = U × D × V^T
• U: eigenarrays / left singular vectors / loadings
• D: singular values
• V^T: eigengenes / right singular vectors / principal components
Properties of SVD
Data (genes × samples) = U × D × V^T
• Columns of V^T describe patterns across genes
• Columns of U describe patterns across arrays
• $d_i^2 / \sum_{j=1}^{n} d_j^2$ is the percent of variation explained by the ith column of V
• Columns of V^T / rows of U are orthogonal and calculated one at a time
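The variation-explained formula is easy to evaluate once the singular values are in hand. The sketch below assumes the singular values have already been computed by some SVD routine; the function name and the four example values are hypothetical.

```python
def percent_variation_explained(singular_values):
    """Return 100 * d_i^2 / sum_j d_j^2 for each singular value d_i."""
    squared = [d * d for d in singular_values]
    total = sum(squared)
    return [100.0 * s / total for s in squared]


# Hypothetical singular values from a decomposition with n = 4 samples.
d = [4.0, 2.0, 1.0, 1.0]
pct = percent_variation_explained(d)
for i, value in enumerate(pct, start=1):
    print(f"component {i}: {value:.1f}% of variation")
```

Because the squared singular values are what enter the formula, a first singular value only twice the size of the second already accounts for four times as much variation.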
SVA Adjusted Gene by Gene Model
expression = b0 + b1 × group + surrogates + noise
Test whether b1 = 0
Calculate a P-value
False Discovery Rate Estimates
[Figure: three panels with axes labeled True False Discovery Rate and Q-value. Panels: Independent E; Dependent E; Dependent E + IRW-SVA.]