Buried treasures: Old statistics in new contexts
“If I have seen further it is by standing on the shoulders of giants”
- Isaac Newton
One form of the past effect
You are dealing with a statistical problem in a special context.
You solve it by realizing a new interpretation of an old, interesting, but uncelebrated result, which was developed in a completely different context.
Three vignettes
V1: Genomics meets sample surveys (methodology)
V2: Bootstrapping and rank statistics (theory)
V3: Cancer genetics and stochastic geometry (application)
John Tukey
V1: Genomics meets sample surveys
Context
Second-order gene-set enrichment analysis
Buried treasure
J.W. Tukey, 1950, Some sampling simplified. J. Amer. Statist. Assoc., 45, 501-519.
Context
D Pyeon, MA Newton, PF Lambert, JA den Boon, S Sengupta, CJ Marsit, CD Woodworth, JP Connor, TH Haugen, EM Smith, KT Kelsey, LP Turek and P Ahlquist (2007).
Fundamental Differences in Cell Cycle Deregulation in Human Papillomavirus Positive and Human Papillomavirus Negative Head/Neck and Cervical Cancers. Cancer Research, 67, 4605-4619.
MA Newton, X Ma, D Sarkar, D Pyeon, and P Ahlquist (2007).
Second order enrichment analysis of microarray expression data reveals gene sets with heterogeneous activation states. Submitted.
[Figure: Slice of expression data from Pyeon et al. 2007. Rows: genes (a few); columns: tissue samples, HPV+ and HPV-.]
[Figure: Fold changes between HPV+ and HPV- (all genes). Density plot; x-axis: log2 [ HPV+ / HPV- ], from -2 to 2.]
The post-processing problem
expression results + exogenous biology
Exogenous biology
B = { c: c = {genes with specific property} }
e.g.
- Gene Ontology (GO)
- Kyoto Encyclopedia of Genes and Genomes (KEGG)
In the HPV example, cell cycle may be an interesting gene set:
- Large sample variance (largest in KEGG, GO)
- Excess differential expression in both directions
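Expression results:
\[ s = (s_1, s_2, \ldots, s_G) \]
Gene set:
\[ c \subset \{1, 2, \ldots, G\}, \qquad c \in B \]
Gene set variance:
\[ u(s,c) = \frac{1}{m-1} \sum_{g \in c} \left( s_g - \bar{s}_c \right)^2 \]
Standardized statistic:
\[ z(s,c) = \frac{u(s,c) - E\{u(s,C)\}}{\sqrt{\mathrm{var}\{u(s,C)\}}} \]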
Centering:
\[ E\{u(s,C)\} = \frac{1}{G-1} \sum_{g=1}^{G} \left( s_g - \bar{s} \right)^2 \]
Connection: C indexes a simple random sample of genes, i.e. finite population sampling.
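The centering identity can be checked numerically on a toy problem. The sketch below assumes that m is the number of genes in the set and that C is drawn uniformly from all size-m subsets; the scores are hypothetical numbers chosen for illustration, and the function names are not from the talk.

```python
from itertools import combinations

import numpy as np


def gene_set_variance(s, c):
    """u(s, c): sample variance (divisor m - 1) of the scores s over gene set c."""
    return np.var(np.asarray(s, dtype=float)[list(c)], ddof=1)


def expected_set_variance(s, m):
    """Average u(s, C) over every gene set C of size m (exhaustive; small G only)."""
    G = len(s)
    values = [gene_set_variance(s, c) for c in combinations(range(G), m)]
    return float(np.mean(values))


# Hypothetical scores for G = 6 genes (illustrative, not data from the study).
s = [0.3, -1.2, 0.8, 2.1, -0.4, 0.0]

# Under simple random sampling the two numbers should agree.
print(expected_set_variance(s, 3), np.var(s, ddof=1))
```

Exhaustive enumeration is only feasible for small G; it is used here because it makes the finite-population expectation exact rather than simulated.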
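Scaling:
\[ \mathrm{var}\{u(s,C)\} = ?? \]
We get:
\[ \mathrm{var}\{u(s,C)\} = \left( \frac{1}{m} - \frac{1}{G} \right) b_1^T \delta(s) + \left( \frac{2}{m-1} - \frac{2}{G-1} \right) b_2^T \delta(s) \]
following Tukey's 1950 calculation involving "K" functions: set-level statistics whose expected value equals the same statistic computed on the whole population
where
\[ \delta(s) = \begin{pmatrix}
\frac{1}{G} \gamma_4 \\
\frac{1}{G(G-1)} \left( \gamma_2^2 - \gamma_4 \right) \\
\frac{1}{G(G-1)} \left( \gamma_1 \gamma_3 - \gamma_4 \right) \\
\frac{1}{G(G-1)(G-2)} \left( \gamma_1^2 \gamma_2 - 2\gamma_1\gamma_3 - \gamma_2^2 + 2\gamma_4 \right) \\
\frac{1}{G(G-1)(G-2)(G-3)} \left( \gamma_1^4 + 8\gamma_1\gamma_3 + 3\gamma_2^2 - 6\gamma_1^2\gamma_2 - 6\gamma_4 \right)
\end{pmatrix}, \qquad
b_1 = \begin{pmatrix} 1 \\ -3 \\ -4 \\ 12 \\ -6 \end{pmatrix}, \qquad
b_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \\ -2 \\ 1 \end{pmatrix} \]
and
\[ \gamma_k = \sum_{g=1}^{G} s_g^k \]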
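The scaling formula lends itself to a direct numerical check. The following is a minimal sketch, assuming m is the gene-set size and that the standardized statistic divides by the square root of the variance; the helper names (`delta`, `var_set_variance`, `z_score`, `brute_force_var`) and the scores are hypothetical.

```python
from itertools import combinations

import numpy as np

B1 = np.array([1, -3, -4, 12, -6])
B2 = np.array([0, 1, 0, -2, 1])


def delta(s):
    """The five-component vector delta(s), built from power sums gamma_k."""
    s = np.asarray(s, dtype=float)
    G = s.size
    g1, g2, g3, g4 = (np.sum(s ** k) for k in (1, 2, 3, 4))
    return np.array([
        g4 / G,
        (g2**2 - g4) / (G * (G - 1)),
        (g1 * g3 - g4) / (G * (G - 1)),
        (g1**2 * g2 - 2 * g1 * g3 - g2**2 + 2 * g4) / (G * (G - 1) * (G - 2)),
        (g1**4 + 8 * g1 * g3 + 3 * g2**2 - 6 * g1**2 * g2 - 6 * g4)
        / (G * (G - 1) * (G - 2) * (G - 3)),
    ])


def var_set_variance(s, m):
    """var{u(s, C)} for a simple random sample C of m genes (needs G >= 4, m >= 2)."""
    G = len(s)
    d = delta(s)
    return (1 / m - 1 / G) * (B1 @ d) + (2 / (m - 1) - 2 / (G - 1)) * (B2 @ d)


def z_score(s, c):
    """Standardized gene-set variance z(s, c)."""
    s = np.asarray(s, dtype=float)
    u = np.var(s[list(c)], ddof=1)
    return (u - np.var(s, ddof=1)) / np.sqrt(var_set_variance(s, len(c)))


def brute_force_var(s, m):
    """Variance of u(s, C) over all size-m subsets (exhaustive; small G only)."""
    s = np.asarray(s, dtype=float)
    u = [np.var(s[list(c)], ddof=1) for c in combinations(range(s.size), m)]
    return float(np.var(u))


# Hypothetical scores for G = 7 genes.
s = [0.3, -1.2, 0.8, 2.1, -0.4, 0.0, 1.5]
print(var_set_variance(s, 3), brute_force_var(s, 3))
print(z_score(s, [1, 3, 6]))
```

The closed form costs a handful of power sums per score vector, whereas the brute-force reference grows combinatorially in G, which is presumably why the closed form matters when B contains thousands of gene sets.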
V2: Bootstrapping and rank statistics
Context
Mason and Newton, 1992, A rank statistics approach to the consistency of a general bootstrap. Ann. Statist., 20, 1611-24.
Buried treasure
J. Hajek, 1961, Some extensions of the Wald-Wolfowitz-Noether theorem. Ann. Math. Statist., 32, 506-523.
Jaroslav Hajek
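Data:
\[ X = (X_1, X_2, \ldots) \ \text{iid} \ (\mu, \sigma^2) \]
CLT:
\[ \frac{\sqrt{n} \left( \bar{X}_n - \mu \right)}{\sigma} \Rightarrow N[0,1] \]
Bootstrap mean (the weights \( M_{n,i} \) are multinomials):
\[ \bar{X}_n^* = \frac{1}{n} \sum_{i=1}^{n} M_{n,i} x_i \]
Bootstrap CLT:
\[ \frac{\sqrt{n} \left( \bar{X}_n^* - \bar{x}_n \right)}{s_n} \Rightarrow N[0,1] \quad \text{a.s. } x \]
Generalized bootstrap: exchangeable weights
\[ \bar{X}_n^W = \frac{1}{n} \sum_{i=1}^{n} W_{n,i} x_i \]
Mason and Newton asked: what is the CLT for this case?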
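To make the weighted-mean form concrete, here is a small sketch that draws the multinomial weights of the ordinary bootstrap and, as one assumed example of exchangeable weights, scaled Dirichlet weights. The Dirichlet choice, the function names and the simulated data are illustrative assumptions, not taken from the talk.

```python
import numpy as np


def multinomial_weights(n, rng):
    """Ordinary bootstrap weights: M_n ~ Multinomial(n; 1/n, ..., 1/n)."""
    return rng.multinomial(n, np.full(n, 1.0 / n))


def dirichlet_weights(n, rng):
    """An assumed exchangeable alternative: n * Dirichlet(1, ..., 1)."""
    return n * rng.dirichlet(np.ones(n))


def weighted_mean(w, x):
    """(1/n) * sum_i w_i x_i, the generalized bootstrap mean."""
    return float(np.dot(w, x) / len(x))


rng = np.random.default_rng(0)
x = rng.normal(size=50)  # simulated data, for illustration only

for make in (multinomial_weights, dirichlet_weights):
    draws = np.array([weighted_mean(make(len(x), rng), x) for _ in range(2000)])
    # Standardize as in the bootstrap CLT and summarize the simulated draws.
    z = np.sqrt(len(x)) * (draws - x.mean()) / x.std()
    print(make.__name__, round(float(z.mean()), 2), round(float(z.std()), 2))
```

Both weight vectors sum to n, so each weighted mean is a reweighting of the same observed data; only the joint law of the weights changes.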
Consider two triangular arrays of numbers
\[ \{ a_{n,i} : i = 1, 2, \ldots, n \}, \qquad \{ b_{n,i} : i = 1, 2, \ldots, n \} \]
For a random permutation
\[ (\pi_{n,1}, \pi_{n,2}, \ldots, \pi_{n,n}) \]
And the sum
\[ T_n = \sum_{i=1}^{n} a_{n,\pi_{n,i}} b_{n,i} \]
Notes about \( T_n \):
- Linear rank statistic; studied in nonparametrics.
- Hajek 1961 gives weak conditions for asymptotic normality (AN).
Back to the general bootstrap problem:
Key fact:
\[ \bar{X}_n^W =_D \bar{X}_n^{W\pi} = \frac{1}{n} \sum_{i=1}^{n} W_{n,\pi_{n,i}} x_i \quad \text{(random permutation } \pi_n \text{)} \]
Now condition on both data \( X = x \) and weights \( W = w \):
\[ T_n = \frac{1}{n} \sum_{i=1}^{n} w_{n,\pi_{n,i}} x_i \]
This is precisely a linear rank statistic, and Hajek (1961) gives general conditions for its asymptotic normality.
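With data and weights held fixed, the only randomness left in T_n is the permutation, and its first two moments have a standard closed form. The sketch below uses the textbook permutation-moment formulas for a linear rank statistic (not stated in the slides) and verifies them by enumerating every permutation of a small hypothetical example.

```python
from itertools import permutations

import numpy as np


def rank_statistic(w, x, perm):
    """T_n = (1/n) * sum_i w[perm[i]] * x[i] for one permutation."""
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    return float(np.dot(w[list(perm)], x) / len(x))


def permutation_moments(w, x):
    """Exact mean and variance of T_n under a uniformly random permutation."""
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean = w.mean() * x.mean()
    var = np.sum((w - w.mean()) ** 2) * np.sum((x - x.mean()) ** 2) / (n**2 * (n - 1))
    return float(mean), float(var)


# Hypothetical fixed weights (summing to n) and data, n = 5.
w = [2, 0, 1, 1, 1]
x = [0.4, -1.0, 2.2, 0.1, 0.9]

values = [rank_statistic(w, x, p) for p in permutations(range(len(x)))]
print(np.mean(values), np.var(values))
print(permutation_moments(w, x))
```

The moment formulas only describe location and scale; the asymptotic normality itself is what Hajek's conditions deliver.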
V3: Cancer genetics and stochastic geometry
Context
Cellular events during tumor initiation, intestinal cancer
Buried treasure
P. Armitage, 1949, An overlap problem arising in particle counting. Biometrika, 36, 257-266.
Peter Armitage
Context
AT Thliveris, RB Halberg, L Clipson, WF Dove, R Sullivan, MK Washington, S Stanhope, and MA Newton (2005).
Polyclonality of familial murine adenomas: Analyses of mouse chimeras with low tumor multiplicity suggest short-range interactions. PNAS, 102, 6960-6965.
MA Newton, L Clipson, AT Thliveris and RB Halberg (2006).
A statistical test of the hypothesis that polyclonal intestinal tumors arise by random collision of initiated clones. Biometrics, 62, 721-7.
MA Newton (2006).
On estimating the polyclonal fraction in lineage marker studies of tumor origin. Biostatistics, 7, 503-14.
Monoclonal theory of tumor origin
- genetic defect appears in a cell
- aberrant cell divides and persists
Aggregation chimeras provide data on clonality.
B6 Apc Min/+ Mom1 R/R <--> B6 Apc Min/+ Mom1 R/R Rosa26/+
Heterotypic tumor!
Summary count data

mouse id   % blue tissue   total # tumors   heterotypic   pure blue   pure white   ambiguous
1          20              19               5             5           6            3
2          85              24               3             13          6            2
3          20              9                2             2           5            0
4          60              19               3             2           10           4
5          30              24               2             0           21           1
6          50              9                2             2           3            2
7          40              8                5             0           3            0
totals                     112              22            24          54           12
\( \exists \) many heterotypic tumors … but why?
\( H_0 \): random collision
\( H_A \): clonal cooperation - recruitment; selection
Key parameters:
\( N \) = # initiated clones
\( \delta \) = collision distance
Induced R.V.'s:
\( X_1 \) = # isolated clones
\( X_2 \) = # doublets
\( X_3 \) = # triplets
# tumors (one mouse) = \( X_1 + X_2 + X_3 + \cdots \)
Intractable distribution!!
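Although the joint distribution is intractable, the random-collision model is easy to simulate. The sketch below makes several assumptions that the slides do not state: clones land uniformly at random in a square, two clones collide when their centers lie within the collision distance, and a tumor is a connected cluster of colliding clones. All names and parameter values are hypothetical.

```python
from collections import Counter

import numpy as np


def clump_sizes(points, delta):
    """Return {clump size: number of clumps}; clones within delta are linked."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(pts[i] - pts[j]) <= delta:
                parent[find(i)] = find(j)

    sizes = Counter(find(i) for i in range(n))  # root -> clump size
    return dict(Counter(sizes.values()))        # clump size -> count


def simulate_mouse(N, delta, side, rng):
    """Scatter N clones uniformly on a side x side square and count clumps."""
    points = rng.uniform(0.0, side, size=(N, 2))
    return clump_sizes(points, delta)


rng = np.random.default_rng(0)
counts = simulate_mouse(N=100, delta=1.0, side=50.0, rng=rng)
print(counts)                # X_1, X_2, X_3, ... keyed by clump size
print(sum(counts.values()))  # number of tumors in this simulated mouse
```

Repeating `simulate_mouse` many times gives a Monte Carlo picture of how the number of doublets and triplets varies under random collision, at the cost of ignoring edge effects and the real geometry of the intestine.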
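But thanks to Armitage, 1949,
\[ E(X_1) \approx m_1 = N \exp(-4\psi) \]
\[ E(X_2) \approx m_2 = 2N \left( \psi - \frac{4\pi + 3\sqrt{3}}{\pi} \psi^2 \right) \]
\[ E(X_3) \approx m_3 = N \left( \frac{4 \left( 2\pi + 3\sqrt{3} \right)}{3\pi} \right) \psi^2 \]
where
\[ \psi = \frac{\pi N \delta^2}{4A} \]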
Armitage was studying dust particles … not cancer
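The approximations are simple to evaluate. The sketch below codes them as written, assuming A is the area of the region over which the N clones are scattered (the slides do not define A); the function name and inputs are hypothetical.

```python
import math


def armitage_expected_counts(N, delta, A):
    """Approximate expected numbers of isolated clones, doublets and triplets."""
    psi = math.pi * N * delta**2 / (4 * A)
    m1 = N * math.exp(-4 * psi)
    m2 = 2 * N * (psi - (4 * math.pi + 3 * math.sqrt(3)) / math.pi * psi**2)
    m3 = N * 4 * (2 * math.pi + 3 * math.sqrt(3)) / (3 * math.pi) * psi**2
    return m1, m2, m3


# Hypothetical inputs: 100 clones, unit collision distance, area 10000.
m1, m2, m3 = armitage_expected_counts(N=100, delta=1.0, A=10_000.0)
print(m1, m2, m3)

# Sanity check: clones are conserved up to the order of the approximation,
# so m1 + 2*m2 + 3*m3 should be close to N when psi is small.
print(m1 + 2 * m2 + 3 * m3)
```

The conservation check works because the psi and psi-squared terms of the three expressions cancel when weighted by clump size, so the residual shrinks like the cube of psi.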
• Lineage marking
• Unknown N’s
• Extra Poisson variation
Closing the inference loop
Conditional predictive p-values
One form of the past effect
You are dealing with a statistical problem in a special context.
You solve it by realizing a new interpretation of an old, interesting, but uncelebrated result, which was developed in a completely different context.
John Tukey (1915-2000), Jaroslav Hajek (1926-1974), Peter Armitage (1924-present)
8 943
# citations of key paper
2800 5300415
# citations of a book
“I seem to have been only like a child playing on the seashore, and diverting myself in now and then finding a smoother pebble or a prettier shell than ordinary, whilst the great ocean of truth lay all undiscovered before me.”
- Isaac Newton
Peter Armitage (1924 - present) worked with George Barnard and worked for the Medical Research Council from 1947-61.
From 1961-76 he was Professor of Medical Statistics at the London School of Hygiene and Tropical Medicine.
He then moved to Oxford as Professor of Biomathematics and became Professor of Applied Statistics and head of the new Department of Statistics, retiring in 1990.
He was president of the Royal Statistical Society in 1982-4.