Upload
doantruc
View
220
Download
3
Embed Size (px)
Citation preview
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Semiparametric Regression Based on MultipleSources
Benjamin Kedem
Department of Mathematics & Inst. for Systems ResearchUniversity of Maryland, College Park
“Give me a place to stand and rest my lever on, and I can move the Earth”,(Archimedes, 287-212 B.C.)
AISC, October 5-7, 2012
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
1 Cheng, K.F., and Chu, C.K. (2004), “Semiparametric density estimation under a two-sample density ratiomodel,” Bernoulli, 10(4), 583-604.
2 Fokianos, K. (2004). Merging information for semiparametric density estimation. Journal of the RoyalStatistical Society, Series B, 66, 941-958.
3 Fokianos, K., Kedem, B., Qin, J., Short, D.A. (2001). A semiparametric approach to the one-way layout.Technometrics, 43, 56-65.
4 McGlynn K.A., Sakoda1 L.C., Rubertone M.V., Sesterhenn I.A., Lyu C., Graubard B.I., Erickson R.L. (2007).Body size, dairy consumption, puberty, and risk of testicular germ cell tumors. American Journal ofEpidemiology, 165, 355-363.
5 Qin, J., Zhang, B. (1997). A goodness of fit test for logistic regression models based on case-control data.Biometrika, 84, 609-618.
6 Qin, J., and Zhang, B. (2005). Density Estimation Under A Two-Sample Semiparametric Model.Nonparametric Statistics, 1, 665-683.
7 Kedem, B., Kim, E-Y, Voulgaraki, A., and Graubard, B. (2009). Two-Dimensional Semiparametric DensityRatio Modeling of Testicular Germ Cell Data. Statistics in Medicine, 2009, 28, 2147-2159.
8 Voulgaraki, A., Kedem, B., and Graubard, B.(2012). "Semiparametric Regression in Testicular Germ CellData", Annals of Applied Statistics, 6, 1185-1208Supplement: DOI:10.1214/12-AOAS552SUPP
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Abstract
1 Information is combined from multiple multivariate sources.
2 Density ratio model: Multivariate distributions are “regressed" on a reference distribution.
3 Random cocariates.
4 Kernel density estimator from many data sources: more efficient than the traditional single-sample kerneldensity estimator.
5 Each multivariate distribution and the corresponding conditional expectation (regression) of interest areestimated from the combined data using all sources.
6 Graphical and quantitative diagnostic tools are suggested to assess model validity.
7 The method is applied in quantifying the effect of height and age on weight of germ cell testicular cancerpatients.
8 Comparison with multiple regression, GAM, and nonparametric kernel regression.
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Main PointsDensity Ratio ExamplesSemiparametric Statistical FormulationCombined semiparametric density estimatorsSome Simulation ResultsApplication to Testicular Germ Cell CancerSummary
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Main PointsDensity Ratio ExamplesSemiparametric Statistical FormulationCombined semiparametric density estimatorsSome Simulation ResultsApplication to Testicular Germ Cell CancerSummary
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Main PointsDensity Ratio ExamplesSemiparametric Statistical FormulationCombined semiparametric density estimatorsSome Simulation ResultsApplication to Testicular Germ Cell CancerSummary
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Main PointsDensity Ratio ExamplesSemiparametric Statistical FormulationCombined semiparametric density estimatorsSome Simulation ResultsApplication to Testicular Germ Cell CancerSummary
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Main PointsDensity Ratio ExamplesSemiparametric Statistical FormulationCombined semiparametric density estimatorsSome Simulation ResultsApplication to Testicular Germ Cell CancerSummary
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Main PointsDensity Ratio ExamplesSemiparametric Statistical FormulationCombined semiparametric density estimatorsSome Simulation ResultsApplication to Testicular Germ Cell CancerSummary
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Outline
1 Density Ratio Examples
2 Semiparametric Statistical Formulation
3 Combined semiparametric density estimators
4 Semiparametric regression
5 Application to Testicular Germ Cell Cancer
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Multiple filtering of a signal
f1(ω) = |H1(ω)|2f (ω).
. (1)
.
fq(ω) = |Hq(ω)|2f (ω)
That is, q “distortions” or multiple “tilting” of the same referencespectral density f .
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Linear system with Gaussian noise
X t = αX t−1 + εt , ε = (ε1t , ...., εqt , εmt)′
εjt ∼ gj ∼ N(0, σ2j ), j = 1, ...,q,m. Choose gm ≡ g. We get
many distortions of the same reference g:
g1(x) = eα1+β1x2g(x)
.
.
.
gq(x) = eαq+βqx2g(x)
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Analysis of variance
Consider the classical one-way ANOVA with m = q + 1independent normal random samples:
x11, . . . , x1n1∼g1(x).
.
.
xq1, . . . , xqnq∼gq(x)xm1, . . . , xmnm∼gm(x)
gj(x)∼N( µj, σ2), j = 1, ...,m.
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Then, holding gm(x) ≡ g(x) as a reference:
g1(x) = exp(α1 + β1x)g(x).
.
.
gq(x) = exp(αq + βqx)g(x)
αj =µ2
m− µ2j
2 σ2 , β j =µj− µm
σ2 , j = 1, ...,q
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
k -parameter exponential families
g(x ,θ) = d(θ)S(x)exp
{k∑
i=1
ci(θ)Ti(x)
}Let gj(x) ≡ g(x ,θj), g(x) ≡ g(x ,θm).Again, q distortions of a reference g(x):
g1(x) = exp{α1 + β′1h(x)}g(x).
.
.
gq(x) = exp{αq + β′qh(x)}g(x)
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Case-control: Multinomial logistic regression (Prenticeand Pyke 1979).
RV y s.t. P(y = j) = πj ,∑m
j=1 πj = 1.Assume: For j = 1, ...,m, and any h(x),
P(y = j |x) =exp( α∗j + β jh(x))
1+∑q
k=1 exp( α∗k+ βkh(x))
Define: f (x |y = j) = gj(x), j = 1, ...,m
Then with αj = α∗j + log[ πm/πj ], j = 1, ...,q, and gm ≡ g,
gj(x) = exp( αj + β jh(x))g(x), j = 1, ...,q.
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Weighted Distributions
Rao(1965) unified the concept of weighted distributions of theform:
pw (x ; θ, α) =W (x ;α)p(x ; θ)
E [W (X ;α)]
where W (x ;α) is known. This is an example of “tilting” of areference distribution. By changing the weight W (x ;α) weobtain different distortions of the same reference p(x ; θ).
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Biased sampling
Length-biased Sampling: Vardi (1982) introduced
F (y) = 1/µ∫ y
0xdG(x), y ≥ 0,
where µ =∫∞
0 xdG(x) <∞. This is a tilt model.Biased Sampling/Selection Bias: Vardi (1985), and Gill,Vardi, Wellner (1988) considered the more general biasedsampling model
F (y) = W (G)−1∫ y
−∞w(x)dG(x),
where w(x) is known and W (G) =∫∞−∞w(x)dG(x). This is
a tilt model. F ,G are obtained by NPMLE.
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Biased sampling
Length-biased Sampling: Vardi (1982) introduced
F (y) = 1/µ∫ y
0xdG(x), y ≥ 0,
where µ =∫∞
0 xdG(x) <∞. This is a tilt model.Biased Sampling/Selection Bias: Vardi (1985), and Gill,Vardi, Wellner (1988) considered the more general biasedsampling model
F (y) = W (G)−1∫ y
−∞w(x)dG(x),
where w(x) is known and W (G) =∫∞−∞w(x)dG(x). This is
a tilt model. F ,G are obtained by NPMLE.
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Comparison Distributions (Parzen 1977,...,2009)
Sequence of cdf’s: {F1, ...,Fq} � G, with cont. densitiesf1, ..., fq,g.Comparison Distributions are defined as:
Dj(u;G,Fj) = Fj(G−1(u)), 0 < u < 1, j = 1, ...,q
Then by differentiation, with x = G−1(u):
f1(x) = d(G(x);G,F1)g(x)···
fq(x) = d(G(x);G,Fq)g(x)
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Outline
1 Density Ratio Examples
2 Semiparametric Statistical Formulation
3 Combined semiparametric density estimators
4 Semiparametric regression
5 Application to Testicular Germ Cell Cancer
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
m = q + 1 independent p-dim multivariate samples:
xij = (xij1, xij2, . . . , xijp) ∼ gi(x1, . . . , xp), i = 1, . . . ,q,m, j = 1, . . . ,ni
andxi1, xi2, . . . , xini
iid∼ gi
Reference or baseline probability density function (pdf):
g ≡ gm(x) ≡ gm(x1, . . . , xp)
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Density ratio model:
gi(x)g(x)
= w(x,θi) (2)
or equivalentlygi(x) = w(x,θi)g(x) (3)
1. gi(x), g(x) are unknown.2. w is a known positive and continuous function.3. θi are unknown d-dimensional parameters.4. pdf’s may be continuous or discrete.5. Normal assumption is not made.
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Problem:Use the combined data from all the samples
x11, ...,x1n1 , ...xm1, ...,xmnm
of size n = n1 + · · ·+ nm to estimate all the parameters θi ,i = 1, ...,q, and all the densities g1, ...,gq and gm = g.
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
G(x) ≡ Gm(x), pij = dG(xij) = dGm(xij).
The empirical log-likelihood is given by:
l = log L =m∑
i=1
ni∑j=1
log(pij) +
q∑i=1
ni∑j=1
log(w(xij ,θi)) (4)
subject to the constraints:
pij ≥ 0,m∑
i=1
ni∑j=1
pij = 1,m∑
i=1
ni∑j=1
pijw(xij ,θk ) = 1 for k = 1, . . . ,q.
(5)
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
pij =1n
1
1 +∑q
k=1 µk
[w(xij , θk )− 1
] , (6)
G(x) = Gm(x) =m∑
i=1
ni∑j=1
pij I(xij ≤ x) (7)
More generally, for l = 1, . . . ,m and w(xij , θm) ≡ 1,
Gl(x) =m∑
i=1
ni∑j=1
pijw(xij , θl)I(xij ≤ x) (8)
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Outline
1 Density Ratio Examples
2 Semiparametric Statistical Formulation
3 Combined semiparametric density estimators
4 Semiparametric regression
5 Application to Testicular Germ Cell Cancer
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Traditional single sample kernel density estimator:
f (x) =1
nhpn
n∑i=1
K(
x− xi
hn
)(9)
hn is a sequence of bandwidths such that:
1. hn → 0, nhpn →∞ as n→∞.
2. K (x) is defined for p-dimensional x.3 K (x) ≥ 0, symmetric around 0.4.∫
Rp K (x)dx = 1.
Under certain conditions, f (x) is a consistent estimator of f (x)(Parzen 1962, Shao 2003).
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Fokianos (2004), Cheng and Chu (2004), Qin and Zhang(2005), Voulgaraki, Kedem, Graubard (VKG) (2012), use thethe probabilities pij instead of 1/n:
gl(x) =1hp
n
m∑i=1
ni∑j=1
pijwl(xij)K(
x− xij
hn
)(10)
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
VKG 2012: Assume that K (·) is a nonnegative boundedsymmetric function with
∫K (x)dx = 1,
∫x′xK (x)dx = k2 > 0.
Assume that gl is continuous at x and twice differentiable in aneighborhood of x. If, as n→∞, hn = O(n−
14+p ), then√
nhpn
(gl(x)− gl(x)−
12
h2n
∫u′∂2gl(x∗)∂x∂x′
uK (u)du)
D→ N(0, σ2(x))
whereσ2(x) =
wl(x)gl(x)∑mk=1 ζkwk (x)
∫K 2(u)du
for any fixed x.
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
The mean integrated square error (MISE) is defined as:
MISE(gl(x)) = E(∫ ∣∣gl(x)− gl(x)
∣∣2dx)
(11)
Minimizing the MISE with respect to hn, the optimal bandwidthis:
h∗n =
(p/n)∫
wl(x)gl(x)/[∑m
k=1 ζkwk (x)]
dx∫
K 2(u)du∫ (∫u′(∂2gl(x)/∂x∂x′)uK (u)du
)2
dx
1
4+p
(12)
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
For sufficient large n, an alternative way to find hn is byminimizing
h−pn
n∑i=1
n∑i ′=1
p(ti)wl(ti)p(ti ′)wl(ti ′)
∫K (z)K
(z +
ti − ti ′
hn
)dz
− 2nl(nl − 1)hp
n
∑i 6=j
K(
xli − xlj
hn
). (13)
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
VKG 2012: If f is the classical single-sample multivariate kerneldensity estimator of gl , then As n→∞, hn → 0 and nhp
n →∞,(a)
AMISE(gl) ≤ AMISE(f )
(b) Using optimal bandwidths, the proposed semiparametricdensity estimator gl(x) is more efficient than f (x), i.e forevery l
eff (f , gl) ≡AMISE∗(gl)
AMISE∗(f )≤ 1
where AMISE∗ is the optimal AMISE .
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Diagnostic Plots and Goodness of Fit
Define
R2α,k = 1− exp
{−(
xαni − xα
)k}
(14)
ni : Sample size of i th sample.
xα: The number of times the estimated semiparametric cdf falls in theestimated 1 − α confidence interval obtained from the correspondingempirical cdf, both evaluated at the sample points.
k > 0 abd α chosen by the user.
R2α,k takes values between 0 and 1.
Computing R2α,k is both simple and fast.
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Qin and Zhang (1997):
R23 = exp(−
√n ·max |Gi − Gi |) (15)
Clearly, R23 takes values between 0 and 1.
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Simulation Results: Gi vs. Gi , i = 1,2Simulation 1: g1 ∼ N((0, 0)′,Σ)) (case), g2 ∼ N((0, 0)′,Σ)) (control),
Σ =
(4 22 3
), n1 = 40, n2 = 30.
0.0 0.2 0.4 0.6 0.8
0.0
0.2
0.4
0.6
0.8
Simulation 1: G1
Joint Empcdf case
Join
t S
em
. C
DF
case
0.0 0.2 0.4 0.6 0.8
0.0
0.2
0.4
0.6
0.8
Simulation 1: G2
Joint Empcdf control
Join
t S
em
. re
f. C
DF
contr
ol
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Simulation 2: g1 ∼ N((0, 0)′,Σ)) (case), g2 ∼ N((1, 1)′,Σ)) (control) with
Σ =
(3 11 2
), n1 = 200, n2 = 200.
0.0 0.4 0.8
0.0
0.2
0.4
0.6
0.8
Simulation 2: G1
Joint Empcdf case
Jo
int
Se
m.
CD
F c
ase
0.0 0.4 0.8
0.0
0.2
0.4
0.6
0.8
1.0
Simulation 2: G2
Joint Empcdf control
Jo
int
Se
m.
ref.
CD
F c
on
tro
l
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Simulation 3: g1 from standard two dimensional Multivariate Cauchy (case) and g2
from two dimensional Multivariate Cauchy (control) with µ = (1, 1)′, V =
(5 55 10
),
n1 = 200, n2 = 200.
0.0 0.4 0.8
0.0
0.2
0.4
0.6
0.8
1.0
Simulation 3: G1
Joint Empcdf case
Jo
int
Se
m.
CD
F c
ase
0.0 0.4 0.8
0.0
0.2
0.4
0.6
0.8
1.0
Simulation 3: G2
Joint Empcdf control
Jo
int
Se
m.
ref.
CD
F c
on
tro
l
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Simulation 4: g1 from standard two dimensional Multivariate Cauchy (case) and g2from uniform distribution on the triangle (0, 0), (6, 0), (−3, 4) (control), and n1 = 200,n2 = 200.
0.0 0.4 0.8
0.0
0.2
0.4
0.6
0.8
1.0
Simulation 4: G1
Joint Empcdf case
Jo
int
Se
m.
CD
F c
ase
0.0 0.1 0.2 0.3 0.4
0.1
0.2
0.3
0.4
0.5
Simulation 4: G2
Joint Empcdf control
Jo
int
Se
m.
ref.
CD
F c
on
tro
l
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Table: Comparison of R23 and R2
.05,2 for 100 repetitions of case andcontrol.
Simulation Group R23 R2
.05,2
(1) Case 0.6307 1Control 0.5976 1
(2) Case 0.3912 0.9353Control 0.3766 0.9718
(3) Case 0.1080 0.3342Control 0.1129 0.3324
(4) Case 0.0507 0.3361Control 0.0495 0.0033
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Bandwidth (BW) selection using cross validation.
Case ControlSame BW Diff. BW’s Same BW Diff. BW’s
h h1 h2 h h1 h2
Leave One OutSimulation 1 0.61 0.90 0.40 0.59 0.31 0.61Simulation 2 0.38 0.50 0.20 0.61 0.36 0.71
Alternative (12)Simulation 1 0.64 0.90 0.50 0.63 0.21 0.71Simulation 2 0.30 0.40 0.20 0.74 0.11 0.96
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Outline
1 Density Ratio Examples
2 Semiparametric Statistical Formulation
3 Combined semiparametric density estimators
4 Semiparametric regression
5 Application to Testicular Germ Cell Cancer
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Suppose we have m multivariate samples of p-dim vectors:
(Random Covariates, Response) = (x1, . . . , x(p−1), y )
We wish to estimate m conditional expectations:
E(y |x1, . . . , x(p−1))
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Data:i = 1, . . . ,q,m, j = 1, . . . ,ni ,
(xij1, xij2, . . . , xij(p−1), yij) ∼ gi(x1, . . . , x(p−1), y).
Reference:g ≡ gm(x1, . . . , x(p−1), y)
Distortions:
gi(x1, . . . , x(p−1), y), i = 1, . . . ,q
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Model:
As in ratio of pdf’s from N(µ1,Σ), N(µ2,Σ) assume
gi(x)g(x)
= w(x, αi ,βi) ≡ exp(αi + β′ix), i = 1, ...,q (16)
where x = (x1, . . . , x(p−1), y)′ and βi = (βi1, . . . , βip)′.
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Let t denote the vector of combined data of lengthn = n1 + n2 + · · ·+ nm. Following the method of constrainedempirical likelihood we obtain score equations for αj and βj :
∂l∂αj
= −n∑
i=1
ρjwj(ti)
1 + ρ1w1(ti) + · · ·+ ρqwq(ti)+ nj = 0 (17)
∂l∂βj
= −n∑
i=1
ρjwj(ti)ti
1 + ρ1w1(ti) + · · ·+ ρqwq(ti)+
nj∑i=1
(xji1, . . . , yji)′ = 0(18)
j = 1, . . . ,q andρj = nj/nm
wj(ti) = exp(αj + β′j ti)
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
pi =1
nm· 1
1 + ρ1w1(ti) + · · ·+ ρqwq(ti)(19)
G(t) =1
nm·
n∑i=1
I(ti ≤ t)1 + ρ1w1(ti) + · · ·+ ρqwq(ti)
(20)
wherewj(ti) = exp(αj + β
′j ti)
As n→∞, the estimators θ = (α1, · · · , αq, β1, · · · , βq)′ are
asymptotically normal (Qin and Zhang 1997, Lu 2007).
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Under the p-dimensional density ratio model we can predict theresponse y given the covariate information x1, x2, . . . , x(p−1) forany of the m data sets as follows:
Ej(y | x1, . . . , x(p−1)) =
nj∑i
yigj(x1, . . . , x(p−1), yi)∑yi
gj(x1, . . . , x(p−1), yi), j = 1, . . . ,q,m.
(21)The gj in (21) are the semiparametric kernel density estimates.
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Assume that the data are bounded. Then:(a) As n→∞,h→ 0 and nhp →∞,∫
|E(y |x)− E(y |x)|g(x)dx→ 0
in the mean square sense.(b) If, in addition 0 < A < g(x), then∫
|E(y |x)− E(y |x)|dx→ 0
in the mean square sense.
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Outline
1 Density Ratio Examples
2 Semiparametric Statistical Formulation
3 Combined semiparametric density estimators
4 Semiparametric regression
5 Application to Testicular Germ Cell Cancer
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Testicular germ cell tumor (TGCT) is a common cancer among U.S.men, mainly in the age group of 15-35 years (McGlynn et al 2003).
Increased risk is significantly related to height, whereas body massindex was not found significant (McGlynn et al 2007).
Using a two dimensional semiparametric model, it was shown thatjointly height and weight are significant risk factors (Kedem et al 2009).
We use TGCT data with 3 variables: age, height (cm), weight (kg).
Case: n1 = 763. Control: n2 = 928. n = 1691 individuals.
We consider two cases:2D with variables height and weight3D with variables height, weight and age.In both cases the control distribution is the reference distribution.
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
2D Problem: E(Weight | Height)
Diagnostic plots of Gi versus Gi , i = 1, 2 evaluated at (height,weight) pairs.
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
Case Group
Joint Empirical CDF
Join
t S
em
. C
DF
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
Control Group
Joint Empirical CDF
Join
t S
em
. C
DF
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
2D Case Regression
160 170 180 190 200
60
80
100
120
2D TGCT: E[Weight/Height] for G1
height
weig
ht
Actual points Ref.
Regression
Semiparametric
GAM
NW
70 75 80 85 90 95
−30
−20
−10
010
20
30
40
Residual plot for 2D TGCT Case
Sem. E[weight/height]
hat(
e)
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
2D Control Regression
150 160 170 180 190 200 210
40
60
80
100
120
2D TGCT: E[Weight/Height] for G2
height
weig
ht
Actual points Ref.
Regression
Semiparametric
GAM
NW
55 60 65 70 75 80 85 90
−40
−20
020
40
Residual plot for 2D TGCT Control
Sem. E[weight/height]
hat(
e)
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
3D Problem: E(Weight | Height, Age)Diagnostic plots of Gi versus Gi , i = 1, 2 evaluated at (age,height,weight)triplets.
0.0 0.2 0.4 0.6 0.8
0.0
0.2
0.4
0.6
0.8
Case Group
Joint Empirical CDF
Join
t S
em
. C
DF
0.0 0.2 0.4 0.6 0.8
0.0
0.2
0.4
0.6
0.8
Control Group
Joint Empirical CDF
Join
t S
em
. C
DF
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
3D Residuals
Residual plots for the semiparametric model in the 3D TGCTdata set.
70 75 80 85 90 95
−2
0−
10
01
02
03
04
0
Residual plot for 3D TGCT Case
Sem. E[weight/height, age]
ha
t(e
)
55 60 65 70 75 80 85 90
−4
0−
20
02
04
0
Residual plot for 3D TGCT Control
Sem. E[weight/height, age]
ha
t(e
)
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Some joint probabilities of age, height and weight in the case and control groups.
Probability Case ControlPr(A≤ 45, H≤ 152.40, W≤ 58.967 ) 0.000378 0.000767Pr(A≤ 26, H≤ 165.10, W≤ 58.967 ) 0.004502 0.007074Pr(A≤ 29, H≤ 177.80, W≤ 65.317 ) 0.042723 0.054313Pr(A≤ 33, H≤ 185.42, W≤ 70.307 ) 0.157968 0.184774Pr(A≤ 34, H≤ 180.34, W≤ 79.832 ) 0.316077 0.362967Pr(A≤ 37, H≤ 180.34, W≤ 89.811 ) 0.513664 0.575512Pr(A≤ 40, H≤ 187.96, W≤ 94.801 ) 0.797157 0.833803Pr(A≤ 43, H≤ 200.66, W≤ 99.790 ) 0.943058 0.956300Pr(A≤ 45, H≤ 203.20, W≤ 117.934 ) 0.995010 0.996560
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Case-control weight and E [weight|height, age]. Empty entries in the table correspond to subjects with the sameheight and age, but possibly different weights.
Control CaseAge Height Weight E [W | H, A] Weight E [W | H, A]
27 162.56 58.967 69.08335 58.967 68.5365228 162.56 77.111 69.05132 65.771 68.59858
68.03930 165.10 68.039 72.20524 72.575 72.002837 165.10 69.40 72.42138 63.503 71.850425 167.64 86.183 73.68129 72.575 73.69978
90.71863.503
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Case-control weight and E [weight|height, age]. Empty entries in the table correspond to subjects with the sameheight and age, but possibly different weights.
Control CaseAge Height Weight E [W | H, A] Weight E [W | H, A]
30 167.64 72.575 74.81333 88.451 74.9354318 170.18 61.235 73.67032 72.575 73.6751832 170.18 70.307 76.53351 81.647 76.64543
63.50337 172.72 74.843 77.88598 88.451 77.941740 172.72 70.307 77.97789 90.718 78.0441
77.11122 175.26 77.111 76.62195 86.183 76.70862
65.771 65.77179.379 86.18383.91565.771
25 175.26 68.039 77.14234 79.379 77.2175583.915 72.57574.843 83.91583.915 74.84379.379 72.57586.183 74.843
61.23561.23565.77179.379
Benjamin Kedem AISC: October 5-7, 2012
UMinformal
Density Ratio ExamplesSemiparametric Statistical Formulation
Combined semiparametric density estimatorsSemiparametric regression
Application to Testicular Germ Cell Cancer
Case-control weight and E [weight|height, age]. Empty entries in the table correspond to subjects with the sameheight and age, but possibly different weights.
Control CaseAge Height Weight E [W | H, A] Weight E [W | H, A]
21 185.42 86.183 82.46773 79.379 82.7814072.575 77.111
102.058 97.52222 190.50 97.522 85.23493 86.183 85.64845
95.254 71.66831 190.50 102.058 86.05980 104.326 86.27744
74.84322 193.04 86.183 86.73352 102.058 87.18440
80.73924 193.04 99.337 87.50020 108.862 88.23938
86.18399.790
108.86234 193.04 113.398 87.72937 88.451 88.58960
117.93434 195.58 83.915 88.81524 89.811 89.036535
Benjamin Kedem AISC: October 5-7, 2012