Page 1: Adaptive FDR Estimation for High Dimensional Discrete Data

Adaptive FDR Estimation for High Dimensional Discrete Data

Naomi Altman (1) & Isaac Dialsingh (2)

ADNAT 2012 - Hyderabad
February 5, 2013

1. The Pennsylvania State University
2. The University of the West Indies

[email protected] [email protected]

Altman & Dialsingh (Penn State) Discrete FDR February 5, 2013 1 / 34

Page 2: Adaptive FDR Estimation for High Dimensional Discrete Data

False Discovery Rate

Controlling error rates is essential for high dimensional “omics” data.

Benjamini & Hochberg (1995) realized that when testing 1000s of hypotheses a few errors could be tolerated.

False Discovery Rate (FDR) is the expected percentage of null hypotheses among the statistically significant tests.

Table: Outcomes of m tests.

              Not Significant   Significant   Total
True Null     U                 V             m0
False Null    T                 S             m1
Total         W                 R             m

$\mathrm{FDR} = E\left(\frac{V}{R} \,\middle|\, R > 0\right) P(R > 0)$

R: number of rejections
V: number of false rejections

Page 5: Adaptive FDR Estimation for High Dimensional Discrete Data

Adaptive FDR

Table: Outcomes of m tests.

              Not Significant   Significant   Total
True Null     U                 V             m0
False Null    T                 S             m1
Total         W                 R             m

$\pi_0 = \frac{m_0}{m}$

m: number of tests
m0: number of null tests

Since we are trying to control false discoveries, we do not need to control for the truly non-null tests.
Adaptive FDR methods use an estimate of π0 to improve the power of the multiple comparisons adjustments.

Page 8: Adaptive FDR Estimation for High Dimensional Discrete Data

Implementing FDR procedures

Compute a test statistic for each hypothesis.
We might use the p-value as the test statistic.

Order the hypotheses from most to least significant, so that H0k has the k-th most significant test statistic.
Estimate FDR(k) if we reject H01, ..., H0k.
Either
  Pick a level q and reject H01, ..., H0k if FDR(k) < q, OR
  Pick a p-value α and reject H0i if its p-value is less than α. Then estimate FDR.

[Figure: "BH Heuristic". Sorted p-values (y-axis, 0.00 to 0.10) against sorted hypothesis number (x-axis, 0 to 10000). Reference lines: BH, q=0.05; Adaptive BH, q=0.05, pi0=.6; reject at p<.02, q=0.133.]

Page 11: Adaptive FDR Estimation for High Dimensional Discrete Data

Estimating FDR

For discrete test statistics we want to:
  Estimate π0.
  Estimate FDR.

Discrete test statistics arise from binary and count data such as
  read counts in RNA-seq and ChIP-seq
  SNP studies
  thresholding (above/below)
  multiple 2-way tables (e.g. surveys)

Page 13: Adaptive FDR Estimation for High Dimensional Discrete Data

Why does discreteness matter?

[Figure, left panel: Continuous p-values, π0 = 0.8. Histogram of count against pValues, shaded by h0 (Null, Non-null).]

[Figure, right panel: Discrete p-values, π0 = 0.8. Histogram of count against pValues, shaded by h0 (Null, Non-null).]

The histogram on the left represents 10000 t-tests.
The histogram on the right represents p-values from 10000 Fisher exact tests.

Page 14: Adaptive FDR Estimation for High Dimensional Discrete Data

Why does discreteness matter?

[Figure, left panel: "edgeR 3 samples/trt". Histogram of frequency against p-values (0.0 to 1.0).]
P-values from an RNA-seq study of 2 maize genotypes with 3 biological replicates.

[Figure, right panel: "LIMMA". Histogram of frequency against p-values (0.0 to 1.0).]
P-values from a microarray study in poppy tissues with 4 biological replicates.

Page 15: Adaptive FDR Estimation for High Dimensional Discrete Data

Why does discreteness matter?

We will use the same heuristics for discrete and continuous tests. BUT

Table: Distribution of p-values.

                                                      Continuous   Discrete
null p-value distribution                             uniform      depends on an ancillary
Prob(p=1)                                             0            >0
percent of support points with positive probability   0%           100%
minimum achievable p-value                            0            >0

Page 16: Adaptive FDR Estimation for High Dimensional Discrete Data

Estimating π0

Estimating π0 from continuous p-values

Storey (2002) estimates the height of the flat part of the histogram.
Nettleton et al (2006) estimate the heights of the bins in excess of expected given π̂0.
Pounds and Cheng (2004) assume all true non-nulls have p=0, so $2\bar{p} \approx \pi_0$.

[Figure: "LIMMA". Histogram of frequency against p-values (0.0 to 1.0).]
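The first and third estimators above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the choice λ = 0.5 for the Storey-type estimate, and the cap at 1 are assumptions made here, and the toy p-values are constructed by hand so that both estimates should land near the true π0 = 0.8.

```python
def storey_pi0(pvalues, lam=0.5):
    """Storey-type estimate: fraction of p-values above `lam`, rescaled by
    the width (1 - lam) of that region and capped at 1 (lam is an assumption)."""
    m = len(pvalues)
    above = sum(1 for p in pvalues if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


def pounds_cheng_pi0(pvalues):
    """Pounds-Cheng-type estimate: twice the mean p-value, capped at 1."""
    return min(1.0, 2.0 * sum(pvalues) / len(pvalues))


# Toy data: 800 null p-values on an even grid in (0, 1), 200 non-nulls at p = 0.
null_p = [(i + 0.5) / 800 for i in range(800)]
pvalues = null_p + [0.0] * 200

print(storey_pi0(pvalues))        # close to 0.8
print(pounds_cheng_pi0(pvalues))  # close to 0.8
```

Both estimates rely on the null p-values being uniform, which is the assumption that fails for discrete tests.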

Page 17: Adaptive FDR Estimation for High Dimensional Discrete Data

Estimating π0

Estimating π0 from discrete p-values
These methods seem less plausible since low power non-null tests may have p-values far from 0.
As well, both null and non-null tests have p-values with mass at 1, leading to a peak at p=1.
We add 3 new methods.

[Figure: "edgeR 3 samples/trt". Histogram of frequency against p-values (0.0 to 1.0).]

Page 18: Adaptive FDR Estimation for High Dimensional Discrete Data

Estimating π0

Mixture distribution of p-values
We use the mixture distribution

$f(p) = \pi_0 f_0(p) + (1 - \pi_0) f_A(p)$

where
f is the distribution of the p-values,
f0 is the distribution of p-values for the hypotheses that are truly null, and
fA is the distribution of p-values for the hypotheses that are truly not null.

Page 19: Adaptive FDR Estimation for High Dimensional Discrete Data

Estimating π0

Estimating π0 from discrete p-values

There is often an ancillary statistic which determines the distribution of the test statistic, e.g. row totals.
If the ancillary statistic is known for each test, the distribution of p-values under the null f0(p) is known.
If there are many tests with the same value of the ancillary, f̂(p), the empirical distribution of the p-values, can be estimated by the observed frequencies.
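To make "the null distribution is known given the ancillary" concrete, the sketch below enumerates the two-sided Fisher exact p-values for a 2x2 table with fixed margins and attaches the hypergeometric null probability to each. The function name and the "sum the tables no more likely than the observed one" definition of the two-sided p-value are choices made for this illustration; the slides do not say which two-sided convention was used.

```python
from math import comb


def fisher_null_pvalue_distribution(row1, row2, col1):
    """Return {p-value: null probability} for a two-sided Fisher exact test
    on a 2x2 table with row totals row1, row2 and first-column total col1."""
    total = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Hypergeometric probability of each possible top-left cell count.
    pmf = {x: comb(row1, x) * comb(row2, col1 - x) / total
           for x in range(lo, hi + 1)}
    dist = {}
    for x, prob in pmf.items():
        # Two-sided p-value: total probability of tables no more likely than x.
        pval = sum(q for q in pmf.values() if q <= prob * (1 + 1e-9))
        key = round(min(pval, 1.0), 12)
        dist[key] = dist.get(key, 0.0) + prob
    return dict(sorted(dist.items()))


f0 = fisher_null_pvalue_distribution(3, 3, 3)
print(f0)  # two achievable p-values, roughly 0.1 and 1.0
```

With all margins equal to 3, only two p-values are achievable, and p = 1 carries about 90% of the null mass. This is the peak at p = 1 that the histogram-based estimators for continuous data mistake for an excess of nulls.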

Page 22: Adaptive FDR Estimation for High Dimensional Discrete Data

Estimating π0 using f (p) = π0f0(p) + (1− π0)fA(p)

Regression Method

Useful when we have many tests with the same ancillary statistic.
Regression method: regress empirical frequencies of p-values against expected frequencies under H0.
The slope is approximately π0.
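A rough sketch of the regression idea follows. The slide does not say whether the fit includes an intercept or how bins are weighted, so the least-squares fit through the origin, the cap at 1, and the function name are all assumptions of this example. The toy alternative distribution fA is hypothetical.

```python
def regression_pi0(observed, expected_null):
    """Least-squares slope through the origin of observed p-value frequencies
    on the frequencies expected under H0, capped at 1 (an assumed variant)."""
    points = sorted(expected_null)
    num = sum(observed.get(p, 0.0) * expected_null[p] for p in points)
    den = sum(expected_null[p] ** 2 for p in points)
    return min(1.0, num / den)


# Toy example with two achievable p-values (0.1 and 1.0).
f0 = {0.1: 0.1, 1.0: 0.9}   # null distribution, known from the ancillary
fA = {0.1: 1.0, 1.0: 0.0}   # hypothetical alternative: all mass at p = 0.1
pi0 = 0.8
observed = {p: pi0 * f0[p] + (1 - pi0) * fA[p] for p in f0}

estimate = regression_pi0(observed, f0)
print(round(estimate, 3))  # 0.824
```

Substituting the mixture f = π0 f0 + (1 − π0) fA into this slope gives π0 plus a non-negative term, so in this form the estimate errs on the conservative side whenever fA and f0 share support points.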

Page 23: Adaptive FDR Estimation for High Dimensional Discrete Data

Estimating π0 using the histogram of p-values

[Figure: density (y-axis, 0.00 to 1.25) against p (x-axis, 0.00 to 1.00), overlaying the expected and observed histograms.]

Histogram method
Using the ancillary, we compute the expected frequency of the p-values under the null.
We use the area A between the observed histogram and the histogram expected under the null.
$A \le 2(1 - \pi_0)$
$\hat{\pi}_0 = 1 - \frac{A}{2}$ has expectation at least as big as π0.
The method is sensitive to the choice of bin boundaries.
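The histogram estimate can be written directly from the two formulas above. In the sketch below the bins are simply the achievable p-values and the frequencies are bin probabilities, which is one possible binning rather than the one used in the talk. The bound on A follows from the mixture model: f − f0 = (1 − π0)(fA − f0), and the total absolute difference between two distributions is at most 2.

```python
def histogram_pi0(observed, expected_null):
    """Estimate pi0 as 1 - A/2, where A is the total absolute difference
    between observed and null-expected bin frequencies."""
    bins = set(observed) | set(expected_null)
    area = sum(abs(observed.get(b, 0.0) - expected_null.get(b, 0.0))
               for b in bins)
    return 1.0 - area / 2.0


# Same toy setting as a mixture with pi0 = 0.8 and all non-null mass at p = 0.1.
f0 = {0.1: 0.1, 1.0: 0.9}
observed = {0.1: 0.28, 1.0: 0.72}

estimate = histogram_pi0(observed, f0)
print(round(estimate, 3))  # 0.82
```

Here A = 0.36, so the estimate 0.82 sits slightly above the true 0.8, consistent with the "expectation at least as big as π0" statement.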

Page 26: Adaptive FDR Estimation for High Dimensional Discrete Data

Estimating π0 by removing zero-power tests

Minimum achievable p-value
For a given test statistic, there is some set ψ1 < · · · < ψk = 1 of achievable p-values.
ψ1 is the minimal achievable p-value for the test.
Select a level α, the maximum p-value at which to reject the null hypothesis.
If ψ1 > α then the test has zero power.
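For Fisher's exact test on a 2x2 table, ψ1 depends only on the margins, so zero-power tests can be flagged before looking at the data. The following is a sketch under the same assumed two-sided convention (sum of the probabilities of tables no more likely than the observed one); the function names are illustrative.

```python
from math import comb


def min_achievable_pvalue(row1, row2, col1):
    """Smallest two-sided Fisher exact p-value attainable with these margins."""
    total = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    pmf = [comb(row1, x) * comb(row2, col1 - x) / total
           for x in range(lo, hi + 1)]
    least = min(pmf)
    # The least likely table(s) give the smallest p-value; ties are summed.
    return sum(q for q in pmf if q <= least * (1 + 1e-9))


def has_zero_power(row1, row2, col1, alpha=0.01):
    """True when even the most extreme table cannot reach significance."""
    return min_achievable_pvalue(row1, row2, col1) > alpha


print(has_zero_power(3, 3, 3))     # True
print(has_zero_power(10, 10, 10))  # False
```

With all margins equal to 3 the smallest p-value is 0.1, so the test can never reject at α = 0.01. With margins of 10 the smallest p-value is 2/184756, far below α.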

Page 27: Adaptive FDR Estimation for High Dimensional Discrete Data

Estimating π0 by removing zero-power tests

Tarone (1990) noted that we need not consider tests with zero power.
We call a method a “T” method if we remove the zero-power tests and then proceed with a method for continuous data.
“T” methods remove some of the excess mass at p = 1 and make the histogram of p-values more uniform.
In this talk, we use the Storey-T method.
We remove the tests with zero power at α = 0.01 and then use Storey’s method on the remaining tests (right plot).

[Figure, left panel: "p-values pi0=0.9". Histogram of frequency against p-values (0.0 to 1.0).]
[Figure, right panel: "p-values pi0=0.9 power>0". Histogram of frequency against p-values (0.0 to 1.0).]
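A Storey-T estimate can be sketched as a filter followed by a Storey-type estimate. The code below assumes each test comes with its minimum achievable p-value ψ1 (the `min_pvalues` argument, a name introduced here) and uses λ = 0.5, which is an assumption rather than a setting taken from the talk. The p-values are made up for illustration.

```python
def storey_pi0(pvalues, lam=0.5):
    """Storey-type estimate of pi0, capped at 1."""
    m = len(pvalues)
    above = sum(1 for p in pvalues if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


def storey_t_pi0(pvalues, min_pvalues, alpha=0.01, lam=0.5):
    """Drop tests whose minimum achievable p-value exceeds alpha (zero power),
    then apply the Storey-type estimate to the remaining tests."""
    kept = [p for p, psi1 in zip(pvalues, min_pvalues) if psi1 <= alpha]
    if not kept:
        raise ValueError("no tests with positive power at this alpha")
    return storey_pi0(kept, lam)


# Ten tests that can reach significance, plus four zero-power tests stuck at p = 1.
pvalues = [0.001, 0.002, 0.003, 0.004, 0.1, 0.3, 0.4, 0.6, 0.7, 0.9] + [1.0] * 4
min_pvalues = [1e-5] * 10 + [0.1] * 4

print(storey_pi0(pvalues))                 # 1.0
print(storey_t_pi0(pvalues, min_pvalues))  # 0.6
```

In this toy case the four zero-power tests at p = 1 push the unfiltered estimate to its cap, while the filtered estimate drops to 0.6.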

Page 28: Adaptive FDR Estimation for High Dimensional Discrete Data

Simulated RNA-seq data

Data
We simulated RNA-seq data assuming:
  m (number of tests) = 1000 or 10,000.
  π0 = 0.1, 0.2, ..., 0.8, 0.9, 0.95, 1.0
  Two different discretized log-Normal distributions for total reads/feature estimated from real data.
  Features are independent within sample.
  We used 2 treatments with no replication.
  The statistic was Fisher’s Exact Test.

[Figure: "log-Normal distributions". Density against x value (0 to 1000) for lognormal parameters (3,2) and (4,2).]

Configuration   % 0 or 1 total
1               0.9%
2               3.2%

Page 29: Adaptive FDR Estimation for High Dimensional Discrete Data

Estimated π0 with m=10,000

[Figure, left panel: "Scenario 1, m=10000", logNormal(3,2) (few small counts). Estimated pi0 (y-axis, 0.0 to 1.0) against pi0 (x-axis, 0.0 to 1.0). Legend: pi0, Histogram, Regression, Nettleton, Pound, Storey, Storey-T.]

[Figure, right panel: "Scenario 2, m=10000", logNormal(4,2) (many small counts). Same axes and legend.]

Page 30: Adaptive FDR Estimation for High Dimensional Discrete Data

Estimating FDR

Benjamini and Hochberg (1995) suggest an algorithm for controlling FDR at level q:

Find the maximal i such that $p_{(i)} \le \frac{i \times q}{m}$, and reject the hypotheses with the i smallest p-values.

It has been shown that when the test statistics are continuous and independent, this algorithm controls the FDR at level π0 q.

The BH method is known to be conservative with discrete tests.
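The BH rule above is short enough to write out in full. The sketch below is a plain-Python illustration (the function name and return convention are choices made here): it sorts the p-values, finds the largest rank i with p(i) ≤ iq/m, and rejects every hypothesis up to that rank.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return the indices (into `pvalues`) rejected by the BH step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda j: pvalues[j])
    cutoff = 0
    for rank, j in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its threshold.
        if pvalues[j] <= rank * q / m:
            cutoff = rank
    return sorted(order[:cutoff])


print(benjamini_hochberg([0.5, 0.001, 0.045, 0.012, 0.025]))  # [1, 3, 4]
print(benjamini_hochberg([0.02, 0.021, 0.022, 0.9]))          # [0, 1, 2]
```

The second call shows the step-up behaviour: the smallest p-value, 0.02, exceeds its own threshold of 0.0125, but it is still rejected because the third-smallest p-value clears the rank-3 threshold of 0.0375.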

Page 32: Adaptive FDR Estimation for High Dimensional Discrete Data

Estimating FDR

Adaptive FDR methods control the FDR at approximate level q using an estimate of π0.
For example, the adaptive Benjamini and Hochberg method is

Find the maximal i such that $p_{(i)} \le \frac{i \times q}{m \hat{\pi}_0}$.

When the test statistics are continuous and independent, this algorithm controls the FDR at approximately level q.
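The adaptive rule differs from plain BH only in the denominator of the threshold. The sketch below takes the estimate π̂0 as an argument (`pi0_hat`, a name introduced here) and does not prescribe how it is obtained. The p-values are made up for illustration.

```python
def adaptive_bh(pvalues, pi0_hat, q=0.05):
    """BH step-up rule with thresholds i * q / (m * pi0_hat)."""
    if not 0.0 < pi0_hat <= 1.0:
        raise ValueError("pi0_hat must lie in (0, 1]")
    m = len(pvalues)
    order = sorted(range(m), key=lambda j: pvalues[j])
    cutoff = 0
    for rank, j in enumerate(order, start=1):
        if pvalues[j] <= rank * q / (m * pi0_hat):
            cutoff = rank
    return sorted(order[:cutoff])


pvalues = [0.5, 0.001, 0.045, 0.012, 0.025]
print(adaptive_bh(pvalues, pi0_hat=1.0))  # [1, 3, 4]
print(adaptive_bh(pvalues, pi0_hat=0.6))  # [1, 2, 3, 4]
```

With π̂0 = 1 the rule reduces to ordinary BH. With π̂0 = 0.6 every threshold grows by a factor of 1/0.6, which is enough here to admit the p-value 0.045 as a fourth rejection.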

Page 34: Adaptive FDR Estimation for High Dimensional Discrete Data

Estimating FDR

Gilbert (2005) uses Tarone’s idea of removing tests which have zero power to achieve significance at level α.
Gilbert filters zero power tests then applies the BH method to the remaining mF tests.
We suggest an adaptive Gilbert method that uses an estimate of π0 with Gilbert’s method.
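The filter-then-BH idea can be sketched as below. This is an illustration rather than Gilbert's published procedure in full detail: the function name, the `min_pvalues` input (the minimum achievable p-value of each test), and in particular the way π̂0 enters the threshold, as i q / (mF π̂0), are assumptions made by analogy with the adaptive BH rule.

```python
def gilbert_bh(pvalues, min_pvalues, alpha=0.01, q=0.05, pi0_hat=1.0):
    """Drop tests that cannot reach significance at alpha, then run a BH-type
    step-up rule on the remaining m_F tests; pi0_hat < 1 gives an adaptive
    variant (assumed form)."""
    kept = [j for j, psi1 in enumerate(min_pvalues) if psi1 <= alpha]
    m_f = len(kept)
    kept.sort(key=lambda j: pvalues[j])
    cutoff = 0
    for rank, j in enumerate(kept, start=1):
        if pvalues[j] <= rank * q / (m_f * pi0_hat):
            cutoff = rank
    return sorted(kept[:cutoff])


pvalues = [0.001, 0.012, 0.025, 0.045, 0.5] + [1.0] * 5
min_pvalues = [1e-4] * 5 + [0.1] * 5

print(gilbert_bh(pvalues, min_pvalues))               # [0, 1, 2]
print(gilbert_bh(pvalues, min_pvalues, pi0_hat=0.6))  # [0, 1, 2, 3]
```

Setting `alpha=1.0` disables the filter and gives plain BH on all ten tests, which rejects only index 0 in this toy example. Removing the five zero-power tests halves m and recovers two more rejections.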

Page 36: Adaptive FDR Estimation for High Dimensional Discrete Data

Simulation Results

Using the same simulation scenario as before we implemented
  Benjamini and Hochberg’s 1995 method
  Gilbert’s (2005) method using α = 0.01
  Adaptive versions of BH and Gilbert using
    true π0
    estimated π0 using the Storey-T method

We considered error rates for
  false detection
  false nondetection (including nondetection due to zero power)
  total errors

Page 39: Adaptive FDR Estimation for High Dimensional Discrete Data

Results m=10,000 few small margins

[Figure: four panels for Scenario 1, m=10000, each plotted against pi0 (x-axis, 0.0 to 1.0): "Total Rejections" (mean total rejections), "False Rejections" (mean false rejections), "E(V)/E(R)", and "Total Errors" (mean total errors). Legend: NonNull (Total Rejections panel only), BH, BH-True, BH-T, Gilbert, Gilbert-True, Gilbert-T.]

Page 43: Adaptive FDR Estimation for High Dimensional Discrete Data

Results m=10,000 many small margins

[Figure: Scenario 2, m=10000. Four panels plotted against pi0: Mean Total Rejections, Mean False Rejections, E(V)/E(R), and Mean Total Errors. Legend: NonNull (Total Rejections panel only), BH, BH-True, BH-T, Gilbert, Gilbert-True, Gilbert-T.]


Does it matter?

[Figure: Scenario 1, m=10000, Difference in Total Errors. Mean difference in total errors plotted against pi0. Legend: BH-True, BH-T, Gilbert, Gilbert-True, Gilbert-T.]

[Figure: Scenario 1, m=10000, pi0=0.7. Boxplots by method: BH, BHT, BHTr, G, GT, GTr.]

[Figure: Scenario 2, m=10000, Difference in Total Errors. Mean difference in total errors plotted against pi0. Legend: BH-True, BH-T, Gilbert, Gilbert-True, Gilbert-T.]

[Figure: Scenario 2, m=10000, pi0=0.7. Boxplots by method: BH, BHT, BHTr, G, GT, GTr.]


Blekhman Primate Liver Data

Blekhman et al. (2010) used RNA-seq to interrogate liver samples from male and female humans, chimpanzees and rhesus monkeys.

There were 20689 features, but 2803 had no reads and a further 907 had only 1 read across the 18 samples.
These 3710 features were removed, leaving 16979 features.
There were 3 biological samples for each species-by-gender combination.
Each sample was divided into 2 sequencing lanes.
The 2 lanes were combined to obtain the total reads for each feature in each biological sample.
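The preprocessing described above can be sketched in a few lines. This is only an illustration under assumptions: the function name `combine_and_filter`, the dictionary layout, and the toy counts are all hypothetical and are not taken from the original analysis.

```python
def combine_and_filter(lane_counts, min_total=2):
    """Sum the two lanes per biological sample, then drop sparse features.

    lane_counts maps a feature name to a list of (lane1, lane2) read counts,
    one pair per biological sample (hypothetical layout, for illustration).
    Features whose total across all samples is below min_total are removed,
    which with min_total=2 drops features having 0 or 1 reads overall.
    """
    kept = {}
    for feature, pairs in lane_counts.items():
        totals = [lane1 + lane2 for lane1, lane2 in pairs]
        if sum(totals) >= min_total:
            kept[feature] = totals
    return kept


# Toy input with two biological samples per feature (made-up numbers).
toy = {
    "geneA": [(5, 7), (0, 3)],   # 15 reads in total: kept
    "geneB": [(0, 0), (0, 0)],   # no reads: removed
    "geneC": [(1, 0), (0, 0)],   # a single read: removed
}
filtered = combine_and_filter(toy)
```

Applied to the counts quoted on the slide, this rule is what takes 20689 features down to 16979 (2803 + 907 = 3710 removed).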


Blekhman Primate Liver Data

We look at 2 comparisons:

Comparison                        Test
2 lanes, same human               Fisher’s exact test
male human versus chimpanzee      moderated Negative Binomial test

It is difficult to compute expected counts for the moderated Negative Binomial test, so we use the T-method.
We use Fisher’s exact test to estimate the minimal achievable p-value. It is conservative.
Data are normalized using the TMM method.
Analysis is done using edgeR in Bioconductor.
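To make the "minimal achievable p-value" concrete: with the margins of a 2x2 table held fixed, Fisher's exact test can take only finitely many p-values, so the smallest can be found by enumeration. The sketch below is an assumption-laden illustration, not the authors' code. The name `min_achievable_pvalue` and the argument layout are hypothetical, and the two-sided p-value is taken as the total probability of all tables no more probable than the one considered.

```python
from math import comb


def hypergeom_pmf(x, row1, row2, col1):
    """P(top-left cell = x) for a 2x2 table with row totals row1, row2
    and first-column total col1 (hypergeometric distribution)."""
    return comb(row1, x) * comb(row2, col1 - x) / comb(row1 + row2, col1)


def min_achievable_pvalue(row1, row2, col1):
    """Smallest two-sided Fisher exact p-value attainable with these margins."""
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    probs = [hypergeom_pmf(x, row1, row2, col1) for x in range(lo, hi + 1)]
    smallest = min(probs)
    # The least probable table has the smallest p-value; tables tied with it
    # (up to rounding) contribute to that p-value as well.
    return sum(p for p in probs if p <= smallest * (1 + 1e-9))


# Two lanes with equal (made-up) library sizes and a feature with few reads:
few_reads = min_achievable_pvalue(1000, 1000, 3)
more_reads = min_achievable_pvalue(1000, 1000, 7)
```

Features with few reads cannot reach small p-values whatever the data show, which is why the minimal achievable p-value grows as the margin shrinks.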


Human Male 1

We compared the two lanes of sequencing data for Human male 1.

13553 features were detected by at least 1 read.
10359 features were detected by at least 7 reads (giving minimum achievable p-value > 0.01).
We do not expect any differences between the two lanes.

[Figure: Lanes 1 and 2 of Human Male 1. Scatter plot of Log2(Lane 2+.5) against Log2(Lane 1+.5), with a Counts legend.]


Human Male 1

[Figure: HS Male 1 p-values. Histogram of all p-values (Frequency against p-value).]

[Figure: HS Male 1 Filtered p-values. Histogram of filtered p-values (Frequency against p-value).]

π̂0 = 1.0 using both Storey’s method and the Storey-T method.

Method                  Number Significant (FDR<0.05)
Benjamini & Hochberg    3
Gilbert                 5
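The two quantities reported above can be sketched as follows. This is a minimal illustration under assumptions: the tuning value `lam=0.5` in Storey's estimator is a common default rather than a value stated on the slides, the function names are hypothetical, and the p-values are made up.

```python
def storey_pi0(pvalues, lam=0.5):
    """Storey-type estimate of the proportion of true nulls:
    the share of p-values above lam, rescaled by (1 - lam), capped at 1."""
    m = len(pvalues)
    return min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))


def bh_rejections(pvalues, alpha=0.05):
    """Number of rejections of the Benjamini-Hochberg step-up procedure:
    the largest k such that the k-th smallest p-value is <= k * alpha / m."""
    m = len(pvalues)
    k_max = 0
    for k, p in enumerate(sorted(pvalues), start=1):
        if p <= k * alpha / m:
            k_max = k
    return k_max


# Made-up p-values, for illustration only.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6, 0.7, 0.8, 0.9]
pi0_hat = storey_pi0(pvals)
n_significant = bh_rejections(pvals)
```

When the two lanes of one sample are compared, no true differences are expected, so an estimate of π̂0 = 1.0 and only a handful of rejections are the plausible outcome.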


Human Males versus Chimpanzee Males

The data were normalized, and the dispersion shrinkage factors were computed.
There are 3 biological replicates of each.
16375 features were detected with at least 1 read.
13809 features were detected with at least 7 reads.

[Figure: HS Male Vs Chimp Male. Histogram of all p-values (Frequency against p-value).]

[Figure: HS Male Vs Chimp Male. Histogram of filtered p-values (Frequency against p-value).]


Human Males versus Chimpanzee Males

π̂0 = 1.0 using Storey’s method and 0.87 using the Storey-T method.

Method                  Number Significant    Number Significant
                        (Non-adaptive)        (Adaptive)
Benjamini & Hochberg    1166                  1239
Gilbert                 1251                  1325

Note: 1166/16375 = 7.1%, so π0 = 1 is not reasonable.
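The gain from the adaptive columns can be illustrated with a small sketch. It assumes the usual form of adaptive BH, in which the step-up procedure is run at level alpha / π̂0 (equivalently, with m replaced by π̂0·m); the slides do not spell out the exact variant used, and the function name and p-values below are hypothetical.

```python
def adaptive_bh_rejections(pvalues, pi0_hat, alpha=0.05):
    """BH step-up procedure run at the inflated level alpha / pi0_hat.

    With pi0_hat = 1 this is ordinary (non-adaptive) BH.
    """
    if not 0.0 < pi0_hat <= 1.0:
        raise ValueError("pi0_hat must lie in (0, 1]")
    m = len(pvalues)
    level = alpha / pi0_hat
    k_max = 0
    for k, p in enumerate(sorted(pvalues), start=1):
        if p <= k * level / m:
            k_max = k
    return k_max


# Made-up p-values, for illustration only.
pvals = [0.001, 0.008, 0.019, 0.03, 0.6, 0.7, 0.8, 0.9]
non_adaptive = adaptive_bh_rejections(pvals, pi0_hat=1.0)
adaptive = adaptive_bh_rejections(pvals, pi0_hat=0.5)
```

An estimate of π̂0 = 1.0 leaves the threshold unchanged, so only an estimate below 1 (here 0.87 from the Storey-T method) can add rejections.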


Summary:

What did we learn?

For tabular count data (e.g. RNA-seq, SNP):

It is important to have an estimate of π0.
When most counts are big, remove features with small margins and use methods for continuous data.
When many counts are small, use the regression method.

Adaptive methods
  do not help much for π0 > 0.9.
  can significantly reduce total errors when π0 < 0.5.

Gilbert’s method is preferable to vanilla BH.
Happily, Gilbert’s method is equivalent to removing features with small margins and then using BH.

Summary
For RNA-seq and SNP data, remove features with small margins and proceed as if p-values were continuous.
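The "remove features with small margins, then use BH" recipe can be sketched as below. This follows our reading of Tarone (1990) and Gilbert (2005) and should be treated as an assumption rather than the authors' implementation: choose the smallest K for which at most K tests have a minimal achievable p-value below alpha / K, keep only those tests, and run BH on them. All function names and numbers are hypothetical.

```python
def bh_count(pvalues, alpha=0.05):
    """Number of Benjamini-Hochberg rejections at level alpha."""
    m = len(pvalues)
    k_max = 0
    for k, p in enumerate(sorted(pvalues), start=1):
        if p <= k * alpha / m:
            k_max = k
    return k_max


def tarone_k(min_pvalues, alpha=0.05):
    """Smallest K such that at most K tests can reach a p-value below alpha / K."""
    m = len(min_pvalues)
    for k in range(1, m + 1):
        if sum(q < alpha / k for q in min_pvalues) <= k:
            return k
    return m


def gilbert_count(pvalues, min_pvalues, alpha=0.05):
    """BH applied only to tests that survive the Tarone-style filter."""
    k = tarone_k(min_pvalues, alpha)
    kept = [p for p, q in zip(pvalues, min_pvalues) if q < alpha / k]
    return bh_count(kept, alpha) if kept else 0


# Made-up example: three informative tests and three that can never be significant.
pvals = [0.004, 0.02, 0.5, 1.0, 1.0, 1.0]
min_pvals = [0.001, 0.001, 0.001, 0.5, 0.5, 0.5]
plain = bh_count(pvals)
filtered = gilbert_count(pvals, min_pvals)
```

Dropping tests that cannot reach significance shrinks the multiplicity correction, so the filtered procedure can reject more hypotheses than BH applied to all tests.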


Many thanks

Thanks for your attention

Thanks to NSF
NSF DMS 1007801 (Altman, PI)
NSF IOS 0820729 (Altman, subcontract from McSteen, PI)

Main Reference:
Dialsingh, I. (2011). False Discovery Rates when the Statistics are Discrete. PhD Dissertation, Dept. of Statistics, Penn State University.


References

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289-300.

Benjamini, Y. and Hochberg, Y. (2000). On the adaptive control of the false discovery rate in multiple testing with independent statistics. Journal of Educational and Behavioral Statistics, 25, 60-83.

Gilbert, P.B. (2005). A modified false discovery rate multiple comparisons procedure for discrete data, applied to human immunodeficiency virus genetics. Applied Statistics (Journal of the Royal Statistical Society Series C), 54, 143-158.

Nettleton, D., Hwang, J.T.G., Caldo, R.A. and Wise, R.P. (2006). Estimating the number of true null hypotheses from a histogram of p-values. Journal of Agricultural, Biological, and Environmental Statistics, 11, 337-356.

Pounds, S. and Cheng, C. (2004). Improving false discovery rate estimation. Bioinformatics, 20, 1737-1745.

Storey, J.D. (2003). The positive false discovery rate: A Bayesian interpretation and the q-value. Annals of Statistics, 31, 2013-2035.

Tarone, R.E. (1990). A modified Bonferroni method for discrete data. Biometrics, 46, 515-522.
