Xiao-Li Meng's slides for his talks at Columbia, Sept. 2011, and ICERM, Nov. 2012
Let's Practice What We Preach: Likelihood Methods for Monte Carlo Data

Xiao-Li Meng
Department of Statistics, Harvard University
September 24, 2011

Based on: Kong, McCullagh, Meng, Nicolae, and Tan (2003, JRSS-B, with discussions); Kong, McCullagh, Meng, and Nicolae (2006, Doksum Festschrift); Tan (2004, JASA); ...; Meng and Tan (201X)
Xiao-Li Meng (Harvard) MCMC+likelihood September 24, 2011 1 / 23
Importance sampling (IS)

Estimand:
\[ c_1 = \int_\Gamma q_1(x)\,\mu(dx) = \int_\Gamma \frac{q_1(x)}{p_2(x)}\, p_2(x)\,\mu(dx). \]

Data: $\{X_{i2},\ i = 1, \dots, n_2\} \sim p_2 = q_2/c_2$

Estimating Equation (EE):
\[ r \equiv \frac{c_1}{c_2} = E_2\!\left[\frac{q_1(X)}{q_2(X)}\right]. \]

The EE estimator:
\[ \hat r = \frac{1}{n_2} \sum_{i=1}^{n_2} \frac{q_1(X_{i2})}{q_2(X_{i2})}. \]

Standard IS estimator for $c_1$ when $c_2 = 1$.
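The EE estimator above can be sketched numerically. The kernels below are illustrative assumptions, not from the slides: $q_1$ is an unnormalized $N(0, 0.5^2)$ kernel (so $c_1 = 0.5\sqrt{2\pi} \approx 1.2533$) and $q_2 = p_2$ is the standard normal density ($c_2 = 1$).

```python
import math, random

def q1(x):  # assumed target kernel: N(0, 0.5^2), so c1 = 0.5 * sqrt(2*pi)
    return math.exp(-x * x / (2 * 0.25))

def q2(x):  # standard normal density, already normalized (c2 = 1)
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def is_estimate(n2, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n2):
        x = rng.gauss(0.0, 1.0)   # X ~ p2
        total += q1(x) / q2(x)    # importance weight q1/q2
    return total / n2

r_hat = is_estimate(100_000)  # estimates r = c1/c2 = c1 ≈ 1.2533
```

Note the proposal is wider than the target here, so the weights $q_1/q_2$ are bounded and the estimator has finite variance; reversing the roles would give heavy-tailed weights.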
What about MLE?

The "likelihood" is:
\[ f(X_{12}, \dots, X_{n_2 2}) = \prod_{i=1}^{n_2} p_2(X_{i2}) \quad \text{--- free of the estimand } c_1! \]

So why are $\{X_{i2},\ i = 1, \dots, n_2\}$ even relevant? Violation of the likelihood principle?

What are we "inferring"? What is the "unknown" model parameter?
Bridge sampling (BS)

Data: $\{X_{ij},\ i = 1, \dots, n_j\} \sim p_j = q_j/c_j$, $j = 1, 2$

Estimating Equation (Meng and Wong, 1996):
\[ r \equiv \frac{c_1}{c_2} = \frac{E_2[\alpha(X)\, q_1(X)]}{E_1[\alpha(X)\, q_2(X)]}, \quad \forall\, \alpha: \ 0 < \Big|\int \alpha\, q_1 q_2\, d\mu\Big| < \infty \]

Optimal choice: $\alpha_O(x) \propto [n_1 q_1(x) + n_2 r q_2(x)]^{-1}$

Optimal estimator $\hat r_O$, the limit of
\[ \hat r_O^{(t+1)} = \frac{\dfrac{1}{n_2} \displaystyle\sum_{i=1}^{n_2} \frac{q_1(X_{i2})}{s_1 q_1(X_{i2}) + s_2\, \hat r_O^{(t)} q_2(X_{i2})}}{\dfrac{1}{n_1} \displaystyle\sum_{i=1}^{n_1} \frac{q_2(X_{i1})}{s_1 q_1(X_{i1}) + s_2\, \hat r_O^{(t)} q_2(X_{i1})}}, \qquad s_j = \frac{n_j}{n_1 + n_2}. \]
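The iteration for $\hat r_O$ can be sketched directly. This is a minimal sketch assuming $s_j = n_j/(n_1+n_2)$ and two toy unit-variance normal kernels with means 0 and 1, chosen so both normalizing constants equal $\sqrt{2\pi}$ and the true ratio is $r = 1$.

```python
import math, random

def q1(x):  # N(0,1) kernel; c1 = sqrt(2*pi)
    return math.exp(-x * x / 2)

def q2(x):  # N(1,1) kernel; c2 = sqrt(2*pi), so the true ratio is r = 1
    return math.exp(-(x - 1) ** 2 / 2)

def bridge_ratio(x1, x2, iters=30):
    n1, n2 = len(x1), len(x2)
    s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)
    r = 1.0  # any positive starting value works
    for _ in range(iters):
        num = sum(q1(x) / (s1 * q1(x) + s2 * r * q2(x)) for x in x2) / n2
        den = sum(q2(x) / (s1 * q1(x) + s2 * r * q2(x)) for x in x1) / n1
        r = num / den
    return r

rng = random.Random(1)
x1 = [rng.gauss(0.0, 1.0) for _ in range(5_000)]  # draws from p1
x2 = [rng.gauss(1.0, 1.0) for _ in range(5_000)]  # draws from p2
r_hat = bridge_ratio(x1, x2)  # ≈ 1
```

The iteration needs an estimate of $r$ inside the optimal $\alpha$, which is why it is run to a fixed point rather than evaluated once.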
What about MLE?

The "likelihood" is:
\[ \prod_{j=1}^{2} \prod_{i=1}^{n_j} \frac{q_j(X_{ij})}{c_j} \propto c_1^{-n_1} c_2^{-n_2} \quad \text{--- free of data!} \]

What went wrong: $c_j$ is not a "free parameter" because $c_j = \int_\Gamma q_j(x)\,\mu(dx)$ and $q_j$ is known.

So what is the "unknown" model parameter?

Turns out $\hat r_O$ is the same as Bennett's (1976) optimal acceptance ratio estimator, as well as Geyer's (1994) reversed logistic regression estimator.

So why is that? Can it be improved upon without any "sleight of hand"?
Pretending the measure is unknown!

Because
\[ c = \int_\Gamma q(x)\,\mu(dx), \]
and $q$ is known in the sense that we can evaluate it at any sample value, the only way to make $c$ "unknown" is to assume the underlying measure $\mu$ is "unknown".

This is natural because Monte Carlo simulation means we use samples to represent, and thus estimate/infer, the underlying population $q(x)\mu(dx)$, and hence to estimate/infer $\mu$, since $q$ is known.

Monte Carlo integration is about finding a tractable discrete $\hat\mu$ to approximate the intractable $\mu$.
Importance Sampling Likelihood

Estimand: $c_1 = \int_\Gamma q_1(x)\,\mu(dx)$

Data: $\{X_{i2},\ i = 1, \dots, n_2\} \stackrel{i.i.d.}{\sim} c_2^{-1} q_2(x)\,\mu(dx)$

Likelihood for $\mu$:
\[ L(\mu) = \prod_{i=1}^{n_2} c_2^{-1} q_2(X_{i2})\, \mu(X_{i2}) \]

Note that $c_2$ is a functional of $\mu$.

The nonparametric MLE of $\mu$ is
\[ \hat\mu(dx) = \frac{\hat P(dx)}{q_2(x)}, \quad \hat P \text{ --- the empirical measure.} \]
Importance Sampling Likelihood

Thus the MLE for $r \equiv c_1/c_2$ is
\[ \hat r = \int q_1(x)\,\hat\mu(dx) = \frac{1}{n_2} \sum_{i=1}^{n_2} \frac{q_1(X_{i2})}{q_2(X_{i2})}. \]

When $c_2 = 1$ ($q_2 = p_2$), the standard IS estimator for $c_1$ is obtained.

$\{X_{i2},\ i = 1, \dots, n_2\}$ is (minimal) sufficient for $\mu$ on $S_2 = \{x : q_2(x) > 0\}$, and hence $\hat c_1$ is guaranteed to be consistent only when $S_1 \subset S_2$.
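The support condition $S_1 \subset S_2$ can be seen failing in a toy example (my construction, not from the slides): take $q_1$ a full normal kernel on $\mathbb{R}$ but a half-normal proposal supported only on $\mathbb{R}_+$. The IS estimator then converges to the positive-half integral, not to $c_1$.

```python
import math, random

def q1(x):  # target kernel: full normal on R, c1 = sqrt(2*pi)
    return math.exp(-x * x / 2)

def q2(x):  # proposal: half-normal density, support S2 = R_+ only (c2 = 1)
    return 2 * math.exp(-x * x / 2) / math.sqrt(2 * math.pi) if x > 0 else 0.0

rng = random.Random(2)
n2 = 10_000
xs = [abs(rng.gauss(0.0, 1.0)) for _ in range(n2)]  # half-normal draws

r_hat = sum(q1(x) / q2(x) for x in xs) / n2
# Converges to the positive-half integral sqrt(2*pi)/2 ≈ 1.2533,
# not to c1 = sqrt(2*pi) ≈ 2.5066: the mass on S1 \ S2 is invisible.
```

No diagnostic computed from the draws alone can detect this: the data carry no information about $\mu$ outside $S_2$, which is exactly the sufficiency statement above.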
Bridge Sampling Likelihood

Estimand: $c_j = \int_\Gamma q_j(x)\,\mu(dx)$, $j = 1, \dots, J$.

Data: $\{X_{ij},\ 1 \le i \le n_j\} \sim c_j^{-1} q_j(x)\,\mu(dx)$, $1 \le j \le J$

Likelihood for $\mu$: $L(\mu) = \prod_{j=1}^{J} \prod_{i=1}^{n_j} c_j^{-1} q_j(X_{ij})\,\mu(X_{ij})$

Writing $\theta(x) = \log \mu(x)$, then
\[ \log L(\mu) = n \int_\Gamma \theta(x)\, d\hat P - \sum_{j=1}^{J} n_j \log c_j(\theta), \]
where $\hat P$ is the empirical measure on $\{X_{ij},\ 1 \le i \le n_j,\ 1 \le j \le J\}$.
Bridge Sampling Likelihood

The MLE for $\mu$ is given by equating the canonical sufficient statistic $\hat P$ to its expectation:
\[ n \hat P(dx) = \sum_{j=1}^{J} n_j \hat c_j^{-1} q_j(x)\, \hat\mu(dx), \quad \text{i.e.,} \quad \hat\mu(dx) = \frac{n \hat P(dx)}{\sum_{j=1}^{J} n_j \hat c_j^{-1} q_j(x)}. \tag{A} \]

Consequently, the MLE for $\{c_1, \dots, c_J\}$ must satisfy
\[ \hat c_r = \int_\Gamma q_r(x)\, d\hat\mu = \sum_{j=1}^{J} \sum_{i=1}^{n_j} \frac{q_r(x_{ij})}{\sum_{s=1}^{J} n_s \hat c_s^{-1} q_s(x_{ij})}. \tag{B} \]

(B) is the "dual" equation of (A), and is also the same as the equation for the optimal multiple bridge sampling estimator (Tan 2004).
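Equation (B) can be solved by fixed-point iteration. A sketch under assumed kernels (three unnormalized normals with means 0, 1, 2, all sharing the same normalizing constant): since (B) is scale-invariant in $\hat c$, only the ratios $\hat c_r/\hat c_1$ are identified, and here they should all be 1.

```python
import math, random

# Hypothetical toy setup: three unnormalized normal kernels with means 0, 1, 2.
mus = [0.0, 1.0, 2.0]
qs = [(lambda m: (lambda x: math.exp(-(x - m) ** 2 / 2)))(m) for m in mus]

rng = random.Random(3)
n = [4_000, 4_000, 4_000]
pooled = [rng.gauss(m, 1.0) for m, nj in zip(mus, n) for _ in range(nj)]

# Pre-evaluate every kernel at every pooled draw (the only costly step).
qvals = [[q(x) for x in pooled] for q in qs]

# Fixed-point iteration on equation (B); c is identified only up to scale.
c = [1.0] * len(qs)
for _ in range(50):
    denom = [sum(n[s] * qvals[s][i] / c[s] for s in range(len(qs)))
             for i in range(len(pooled))]
    c = [sum(qvals[r][i] / denom[i] for i in range(len(pooled)))
         for r in range(len(qs))]

ratios = [cj / c[0] for cj in c]  # all ≈ 1
```

Each draw enters every $\hat c_r$ through the common denominator $\sum_s n_s \hat c_s^{-1} q_s$, which is how information is shared across all $J$ samples at once.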
But We Can Ignore Less ...

To restrict the parameter space for $\mu$ by using some knowledge of the known $\mu$, that is, to set up a sub-model.

The new MLE has a smaller asymptotic variance under the sub-model than under the full model.

Examples:
- Group-invariance sub-model
- Linear sub-model
- Log-linear sub-model
A Universally Improved IS

Estimand: $r = c_1/c_2$; $c_j = \int_{\mathbb{R}^d} q_j(x)\,\mu(dx)$

Data: $\{X_{i2},\ i = 1, \dots, n_2\} \stackrel{i.i.d.}{\sim} c_2^{-1} q_2\,\mu(dx)$

Taking $G = \{I_d, -I_d\}$ leads to
\[ \hat r_G = \frac{1}{n_2} \sum_{i=1}^{n_2} \frac{q_1(X_{i2}) + q_1(-X_{i2})}{q_2(X_{i2}) + q_2(-X_{i2})}. \]

Because of the Rao-Blackwellization, $V(\hat r_G) \le V(\hat r)$.

It needs twice as many function evaluations, but typically this is a small insurance premium.

Consider $S_1 = \mathbb{R}$ and $S_2 = \mathbb{R}_+$. Then $\hat r_G$ is consistent for $r$:
\[ \hat r_G = \frac{1}{n_2} \sum_{i=1}^{n_2} \frac{q_1(X_{i2})}{q_2(X_{i2})} + \frac{1}{n_2} \sum_{i=1}^{n_2} \frac{q_1(-X_{i2})}{q_2(X_{i2})}. \]

But the standard IS $\hat r$ only estimates $\int_0^\infty q_1(x)\,\mu(dx)/c_2$.
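The $S_1 = \mathbb{R}$, $S_2 = \mathbb{R}_+$ case above can be checked numerically: with a half-normal proposal and a full-normal target kernel, the plain estimator settles at $\sqrt{2\pi}/2$ while the sign-symmetrized $\hat r_G$ recovers the full $r = \sqrt{2\pi}$.

```python
import math, random

def q1(x):  # target kernel on all of R; c1 = sqrt(2*pi)
    return math.exp(-x * x / 2)

def q2(x):  # half-normal proposal density, support R_+ (c2 = 1)
    return 2 * math.exp(-x * x / 2) / math.sqrt(2 * math.pi) if x > 0 else 0.0

rng = random.Random(4)
xs = [abs(rng.gauss(0.0, 1.0)) for _ in range(10_000)]
n2 = len(xs)

r_plain = sum(q1(x) / q2(x) for x in xs) / n2  # misses the negative axis
r_G = sum((q1(x) + q1(-x)) / (q2(x) + q2(-x)) for x in xs) / n2
# r_G recovers r = c1/c2 = sqrt(2*pi); r_plain gives only sqrt(2*pi)/2
```

Note the per-draw cost: $\hat r_G$ evaluates each kernel at both $x$ and $-x$, the "twice as many evaluations" insurance premium mentioned above.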
There are many more improvements ...

Define a sub-model by requiring $\mu$ to be $G$-invariant, where $G$ is a finite group on $\Gamma$.

The new MLE of $\mu$ is
\[ \hat\mu_G(dx) = \frac{n \hat P_G(dx)}{\sum_{j=1}^{J} n_j \hat c_j^{-1} q^G_j(x)}, \]
where $\hat P_G(A) = \mathrm{ave}_{g \in G}\, \hat P(gA)$ and $q^G_j(x) = \mathrm{ave}_{g \in G}\, q_j(gx)$.

When the draws are i.i.d. within each $p_j\,d\mu$,
\[ \hat\mu_G = E[\hat\mu \,|\, G_X], \]
i.e., the Rao-Blackwellization of $\hat\mu$ given the orbit.

Consequently,
\[ \hat c^G_j = \int_\Gamma q_j(x)\, \hat\mu_G(dx) = E[\hat c_j \,|\, G_X]. \]
Using Groups to model trade-off

If $G_1 \supseteq G_2$, then
\[ \mathrm{Var}\big(\hat{\vec c}^{\,G_1}\big) \le \mathrm{Var}\big(\hat{\vec c}^{\,G_2}\big). \]

The statistical efficiency increases with the size of $G_i$, but so does the computational cost needed for function evaluation (but not for sampling, because there are no additional samples involved).
Linear submodel: stratified sampling (Tan 2004)

Data: $\{X_{ij},\ 1 \le i \le n_j\} \stackrel{i.i.d.}{\sim} p_j(x)\,\mu(dx)$, $1 \le j \le J$.

The sub-model has parameter space
\[ \Big\{\mu : \int_\Gamma p_j(x)\,\mu(dx),\ 1 \le j \le J, \text{ are equal (to 1)}\Big\}. \]

Likelihood for $\mu$: $L(\mu) = \prod_{j=1}^{J} \prod_{i=1}^{n_j} p_j(X_{ij})\,\mu(X_{ij})$

The MLE is
\[ \hat\mu_{\mathrm{lin}}(dx) = \frac{\hat P(dx)}{\sum_{j=1}^{J} \hat\pi_j p_j(x)}, \]
where the $\hat\pi_j$ are MLEs from a mixture model: the data $\stackrel{i.i.d.}{\sim} \sum_{j=1}^{J} \pi_j p_j(\cdot)$ with the $\pi_j$ unknown.
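The mixture step can be sketched with EM: the strata labels are ignored, the pooled draws are treated as i.i.d. from $\sum_j \pi_j p_j$, and the weights $\pi_j$ are estimated. The two-component setup below ($p_1 = N(0,1)$, $p_2 = N(3,1)$, sampling proportions 0.25/0.75) is an illustrative assumption.

```python
import math, random

def norm_pdf(x, mu):  # unit-variance normal density
    return math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)

mus = [0.0, 3.0]  # known component densities p1, p2
rng = random.Random(5)
data = ([rng.gauss(mus[0], 1.0) for _ in range(2_000)] +
        [rng.gauss(mus[1], 1.0) for _ in range(6_000)])  # true proportions 0.25/0.75

pi = [0.5, 0.5]       # unknown mixture weights, to be estimated
for _ in range(100):  # EM: E-step posteriors, M-step averages them
    post = [0.0, 0.0]
    for x in data:
        w = [p * norm_pdf(x, m) for p, m in zip(pi, mus)]
        tot = sum(w)
        for j in range(2):
            post[j] += w[j] / tot
    pi = [s / len(data) for s in post]
# pi ≈ [0.25, 0.75]
```

The estimated $\hat\pi_j$ will generally differ from the design proportions $n_j/n$, and it is the $\hat\pi_j$ version that enters $\hat\mu_{\mathrm{lin}}$.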
logo
So why MLE?

Goal: to estimate c = ∫_Γ q(x) µ(dx).

For an arbitrary vector b, consider the control-variate estimator (Owen and Zhou 2000)

    ĉ_b ≡ ∑_{j=1}^{J} ∑_{i=1}^{n_j} [q(x_ji) − bᵀg(x_ji)] / ∑_{s=1}^{J} n_s p_s(x_ji),

where g = (p_2 − p_1, ..., p_J − p_1)ᵀ.

A more general class: for ∑_{j=1}^{J} λ_j(x) ≡ 1 and ∑_{j=1}^{J} λ_j(x) b_j(x) ≡ b, consider (Veach and Guibas 1995 for b_j ≡ 0; Tan 2004)

    ĉ_{λ,B} = ∑_{j=1}^{J} (1/n_j) ∑_{i=1}^{n_j} [λ_j(x_ji) q(x_ji) − b_jᵀ(x_ji) g(x_ji)] / p_j(x_ji)

Should ĉ_{λ,B} be more efficient than ĉ_b? Could there be something even more efficient?

Xiao-Li Meng (Harvard) MCMC+likelihood September 24, 2011 16 / 23
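ĉ_b is straightforward to compute once the densities are tabulated on the pooled draws. A minimal sketch, assuming each p_j can be evaluated pointwise (names are illustrative):

```python
import numpy as np

def cv_estimator(q_vals, P, n_sizes, b):
    """Owen-Zhou-style control-variate estimator over pooled draws (a sketch).

    q_vals  : (n,) values q(x_i) on the pooled draws from all J samplers.
    P       : (n, J) array with P[i, j] = p_j(x_i).
    n_sizes : (J,) design sizes n_j (so n = n_sizes.sum()).
    b       : (J-1,) coefficient vector; any b gives an unbiased estimator
              because each g_k = p_{k+1} - p_1 integrates to zero.
    """
    denom = P @ n_sizes                  # sum_s n_s p_s(x_i)
    g = P[:, 1:] - P[:, [0]]             # g = (p_2 - p_1, ..., p_J - p_1)
    return float(np.sum((q_vals - g @ b) / denom))
```

Since each component of g integrates to zero under µ, subtracting bᵀg changes the variance but not the expectation, which is why b is free to be chosen for efficiency.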
Three estimators for c = ∫_Γ q(x) µ(dx):

IS:   (1/n) ∑_{i=1}^{n} q(x_i) / ∑_{j=1}^{J} π_j p_j(x_i),
      where π_j = n_j/n are the true proportions.

Reg:  (1/n) ∑_{i=1}^{n} [q(x_i) − β̂ᵀg(x_i)] / ∑_{j=1}^{J} π_j p_j(x_i),
      where β̂ is the estimated regression coefficient, ignoring stratification.

Lik:  (1/n) ∑_{i=1}^{n} q(x_i) / ∑_{j=1}^{J} π̂_j p_j(x_i),
      where the π̂_j are the estimated proportions, ignoring stratification.

Which one is most efficient? Least efficient?

Xiao-Li Meng (Harvard) MCMC+likelihood September 24, 2011 17 / 23
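The three estimators differ only in the numerator correction and in which proportions enter the denominator, so they can share one implementation. A sketch under the same pooled-data setup; here β̂ is obtained by ordinary least squares of q/∑_j π_j p_j on the control variates, one common choice, and all names are mine:

```python
import numpy as np

def three_estimators(q_vals, P, n_sizes, pi_tilde):
    """IS / Reg / Lik estimators of c = integral of q dmu (a sketch).

    q_vals   : (n,) q(x_i);  P : (n, J) with P[i, j] = p_j(x_i);
    n_sizes  : (J,) design sizes;  pi_tilde : (J,) estimated proportions
               (e.g. mixture-model MLEs), used only by Lik.
    """
    n = q_vals.size
    pi = n_sizes / n                          # true proportions n_j / n
    denom = P @ pi
    y = q_vals / denom
    c_is = y.mean()                           # plain (multiple) IS
    # Reg: OLS of y on the control variates g_k / sum_j pi_j p_j (mean ~ 0)
    G = (P[:, 1:] - P[:, [0]]) / denom[:, None]
    X = np.column_stack([np.ones(n), G])
    beta = np.linalg.lstsq(X, y, rcond=None)[0][1:]
    c_reg = np.mean(y - G @ beta)
    # Lik: same form as IS but with estimated proportions in the denominator
    c_lik = np.mean(q_vals / (P @ pi_tilde))
    return c_is, c_reg, c_lik
```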
Let’s find it out ...

Γ = R^10 and µ is Lebesgue measure.

The integrand is

    q(x) = 0.8 ∏_{j=1}^{10} φ(x_j) + 0.2 ∏_{j=1}^{10} ψ(x_j; 4),

where φ(·) is the standard normal density and ψ(·; 4) is the t_4 density.

Two sampling designs:

(i) q_2(x) with n draws, or
(ii) q_1(x) and q_2(x) each with n/2 draws,

where

    q_1(x) = ∏_{j=1}^{10} φ(x_j),   q_2(x) = ∏_{j=1}^{10} ψ(x_j; 1),

with ψ(·; 1) the Cauchy density.

Xiao-Li Meng (Harvard) MCMC+likelihood September 24, 2011 18 / 23
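This design is easy to reproduce. Since q is a mixture of two normalized densities, the true value of ∫ q dµ is 0.8 + 0.2 = 1. A sketch of the integrand and of design (ii), assuming numpy and writing the t density explicitly to stay self-contained:

```python
import numpy as np
from math import gamma, sqrt, pi

def t_pdf(x, nu):
    """Density of Student's t with nu degrees of freedom; nu = 1 is Cauchy."""
    c = gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2))
    return c * (1 + x**2 / nu) ** (-(nu + 1) / 2)

def phi(x):
    """Standard normal density."""
    return np.exp(-x**2 / 2) / sqrt(2 * pi)

def q(X):
    """Integrand on R^10; X is (n, 10). True value of the integral is 1."""
    return 0.8 * np.prod(phi(X), axis=1) + 0.2 * np.prod(t_pdf(X, 4), axis=1)

def draw_design_ii(n, rng):
    """Design (ii): n/2 draws from q1 = prod phi, n/2 from q2 = prod Cauchy."""
    x1 = rng.standard_normal((n // 2, 10))
    x2 = rng.standard_t(1, size=(n // 2, 10))   # t_1 = Cauchy
    return np.vstack([x1, x2])
```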
A little surprise?

Table: Comparison of design and estimator

                  one sampler               two samplers
             IS      Reg      Lik      IS       Reg      Lik
  Sqrt MSE  .162   .00942   .00931   .0175    .00881   .00881
  Std Err   .162   .00919   .00920   .0174    .00885   .00884

Note: Sqrt MSE is √(mean squared error of the point estimates) and Std Err is √(mean of the variance estimates), from 10000 repeated simulations of size n = 500.

Xiao-Li Meng (Harvard) MCMC+likelihood September 24, 2011 19 / 23
Comparison of efficiency:

Statistical efficiency: IS < Reg ≈ Lik

IS is a stratified estimator, which uses only the labels.

Reg is the conventional method of control variates.

Lik is the constrained MLE, which uses the p_j's but ignores the labels; it is exact if q = p_j for any particular j.

Xiao-Li Meng (Harvard) MCMC+likelihood September 24, 2011 20 / 23
Building intuition ...

Suppose we make n = 2 draws, one from N(0, 1) and one from Cauchy(0, 1), hence π_1 = π_2 = 50%.

Suppose the draws are {1, 1}; what would be the MLE (π̂_1, π̂_2)?

Suppose the draws are {1, 3}; what would be the MLE (π̂_1, π̂_2)?

Suppose the draws are {3, 3}; what would be the MLE (π̂_1, π̂_2)?

Xiao-Li Meng (Harvard) MCMC+likelihood September 24, 2011 21 / 23
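A quick numerical way to work through these questions is to maximize the label-free likelihood π ↦ ∏_i [π φ(x_i) + (1 − π) Cauchy(x_i)] over a grid. This check is my addition, not from the slides, and the grid search is just the simplest thing that works for a scalar parameter:

```python
import numpy as np

def phi(x):
    """Standard normal density."""
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def cauchy(x):
    """Cauchy(0, 1) density, i.e. t_1."""
    return 1.0 / (np.pi * (1 + x**2))

def pi1_mle(draws, grid_size=100001):
    """Grid-search MLE of pi_1 (the N(0,1) weight) in the label-free
    mixture likelihood prod_i [pi_1 phi(x_i) + (1 - pi_1) cauchy(x_i)]."""
    x = np.asarray(draws, dtype=float)
    grid = np.linspace(0.0, 1.0, grid_size)
    mix = grid[:, None] * phi(x) + (1 - grid)[:, None] * cauchy(x)
    return grid[np.argmax(np.log(mix).sum(axis=1))]
```

The point of the exercise: the MLE ignores the known 50/50 design and infers the proportions from where the draws actually fell, which is exactly the "model what we ignore, not what we know" idea.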
What Did I Learn?

Model what we ignore, not what we know!

Model comparison/selection is not about which model is true (as all of them are “true”), but which model represents a better compromise among human, computational, and statistical efficiency.

There is a cure for our “schizophrenia” — we can now analyze Monte Carlo data using the same sound statistical principles and methods for analyzing real data.

Xiao-Li Meng (Harvard) MCMC+likelihood September 24, 2011 22 / 23
If you are looking for theoretical research topics ...

RE-EXAMINE OLD ONES AND DERIVE NEW ONES!

Prove it is the MLE, or a good approximation to the MLE. Or derive the MLE, or a cost-effective approximation to it.

Markov chain Monte Carlo (Tan 2006, 2008)

More ......

Xiao-Li Meng (Harvard) MCMC+likelihood September 24, 2011 23 / 23