Note: Code is provided in R for this homework. Ze Jin has provided code for Q1, and Giles has provided the solutions to Q4.
• Solution for Question 1, followed by its grading rubric.
• Solution for Question 2, followed by its grading rubric.
• Solution (sketch) for Question 3, followed by its grading rubric.
• Solution (sketch) for Question 4, followed by its grading rubric.
Scripting Techniques:
1. For while loops, especially if you are unsure whether the code is right, set a counter to break the while loop in case it runs forever.
2. For while / for loops, you can print out each iteration, or every kth iteration (e.g., use mod or similar). If each iteration is relatively fast, print out only every kth iteration, since printing every iteration will actually slow things down.
3. rbind / cbind are much slower than just creating a vector("numeric"). If your loop requires rbind or cbind, preallocate instead (see the sketch after this list).
4. For time-consuming simulations, save the output at every kth iteration. E.g., look at page 10 or page 13 in the solutions of HW1 for the syntax.
5. Print stuff out where you think errors may be - the best way to debug code.
6. Try not to name your variables after R functions.
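A minimal sketch (not part of the original notes) illustrating points 3 and 4: preallocate instead of growing objects with rbind/cbind, and checkpoint the output every kth iteration. The simulation body and the file name results_checkpoint.rds are placeholders.
n_sim <- 10000
out <- vector("numeric", n_sim)   # preallocate instead of growing with rbind()/c()
for (i in 1:n_sim) {
  out[i] <- mean(rnorm(100))      # stand-in for one expensive simulation
  if (i %% 1000 == 0) {
    print(i)                                      # progress every 1000th iteration
    saveRDS(out[1:i], "results_checkpoint.rds")   # checkpoint so a crash loses little
  }
}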
Plotting and misc
1. Cheap hack (for R): If you have too many observations in a plot, you'll find that the image is a few MB in size and takes ages to print out. Solution: Pretend you are an undergrad again, and take a screenshot of it.
2. To see if two outputs vary, it's sometimes nicer to compare the differences instead of plotting them side by side; a tiny example follows below.
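A tiny illustration of point 2 (not from the original notes); both vectors are made up for the example and differ only by noise of size 1e-12, so side-by-side plots would look identical while the difference plot exposes the discrepancy.
set.seed(1)
a <- qnorm(seq(0.01, 0.99, by = 0.01))
b <- a + rnorm(length(a), sd = 1e-12)     # hypothetical second output
plot(a - b, type = "l", ylab = "a - b")   # the tiny discrepancy is only visible here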
1. Variance Estimates and U Statistics
(a) incomplete_U_stat <- function(Z, h, k, L, B)
{
Z <- as.matrix(Z)
n <- nrow(Z)
# asymptotic variance
M <- B * L
H_1 <- c()
H_1_mean <- c()
H_2 <- c()
N <- matrix(0, n, M)
for(b in 1:B){
index_1 <- sample(n, 1, replace = FALSE)
for(l in 1:L){
index_2 <- sample(c(1:n)[-index_1], k - 1, replace = FALSE)
H_1 <- cbind(H_1, h(Z[c(index_1, index_2), ]))
}
if(nrow(H_1) == 1){
H_1_mean <- cbind(H_1_mean, mean(H_1[((b - 1)*L + 1):(b * L)]))
}
if(nrow(H_1) >= 2){
H_1_mean <- cbind(H_1_mean,
rowMeans(H_1[, ((b - 1)*L + 1):(b * L)]))
}
}
# infinitesimal jackknife
for(i in 1:M){
index <- sample(n, k, replace = FALSE)
H_2 <- cbind(H_2, h(Z[index, ]))
N[index, i] <- 1
}
zeta_1 <- apply(H_1_mean, 1, var)
zeta_k <- apply(H_1, 1, var)
sigma_square <- k^2 / n * zeta_1 + zeta_k / M
U_stat <- rowMeans(H_2)
sigma_hat_square <- 0
for(j in 1:n){
sigma_hat_square <- sigma_hat_square +
(cov(N[j, ], t(H_2)))^2
}
list(U = U_stat, sigma1 = as.vector(sigma_square),
sigma2 = as.vector(sigma_hat_square))
}
Now test the function:
k <- 2
B <- 20
L <- 100
h <- function(Z){
abs(Z[1] - Z[2])
}
Z <- rnorm(1000)
example_est <- incomplete_U_stat(Z, h, k, L, B)
example_est
(b) n_run <- 100
U_est <- rep(NA, n_run)
sigma1_est <- rep(NA, n_run)
sigma2_est <- rep(NA, n_run)
for(i in 1:n_run){
Z <- rnorm(1000)
est <- incomplete_U_stat(Z, h, k, L, B)
U_est[i] <- est$U
sigma1_est[i] <- est$sigma1
sigma2_est[i] <- est$sigma2
}
var_U_est <- var(U_est)
var_U_est
mean_sigma1_est <- mean(sigma1_est)
mean_sigma2_est <- mean(sigma2_est)
mean_sigma1_est
mean_sigma2_est
sd_sigma1_est <- sd(sigma1_est)
sd_sigma2_est <- sd(sigma2_est)
sd_sigma1_est
sd_sigma2_est
MSE_sigma1_est <- mean((sigma1_est - var_U_est)^2)
MSE_sigma2_est <- mean((sigma2_est - var_U_est)^2)
MSE_sigma1_est
MSE_sigma2_est
(c) n <- 500
k <- 50
L <- 250
B <- 25
library(rpart)  # needed for rpart() below
h <- function(Z){
Z <- as.data.frame(Z)
tree <- rpart(y ~ x1 + x2, data = Z,
control = rpart.control(minsplit = 10))
x1 <- rep(seq(-2, 2, length.out = 9), each = 9)
x2 <- rep(seq(-2, 2, length.out = 9), times = 9)
x <- data.frame(x1, x2)
predict(tree, x)
}
n_run <- 100
U_est <- matrix(NA, n_run, 81)
sigma1_est <- matrix(NA, n_run, 81)
sigma2_est <- matrix(NA, n_run, 81)
for(i in 1:n_run){
print(i)
x1 <- rnorm(n)
x2 <- rnorm(n)
epi <- rnorm(n)
y <- sqrt(abs(x1) + abs(x2)) + 0.2 * epi
Z <- cbind(x1, x2, y)
est <- incomplete_U_stat(Z, h, k, L, B)
U_est[i, ] <- est$U
sigma1_est[i, ] <- est$sigma1
sigma2_est[i, ] <- est$sigma2
}
var_U_est <- apply(U_est, 2, var)
mean_sigma1_est <- colMeans(sigma1_est)
mean_sigma2_est <- colMeans(sigma2_est)
diff_sigma1_est <- var_U_est - mean_sigma1_est
diff_sigma2_est <- var_U_est - mean_sigma2_est
filled.contour(seq(-2, 2, length.out = 9), seq(-2, 2, length.out = 9),
t(matrix(diff_sigma1_est, 9, 9)), xlab = "x1", ylab = "x2",
main = c("The difference between the between-simulation variance",
"and average asymptotic variance estimate"))
filled.contour(seq(-2, 2, length.out = 9), seq(-2, 2, length.out = 9),
t(matrix(diff_sigma2_est, 9, 9)), xlab = "x1", ylab = "x2",
main = c("The difference between the between-simulation variance",
"and average infinitesimal jacknife variance estimate"))
sd_sigma1_est <- apply(sigma1_est, 2, sd)
sd_sigma2_est <- apply(sigma2_est, 2, sd)
filled.contour(seq(-2, 2, length.out = 9), seq(-2, 2, length.out = 9),
t(matrix(sd_sigma1_est, 9, 9)), xlab = "x1", ylab = "x2",
main = c("The standard deviation of",
"asymptotic variance estimate"))
filled.contour(seq(-2, 2, length.out = 9), seq(-2, 2, length.out = 9),
t(matrix(sd_sigma2_est, 9, 9)), xlab = "x1", ylab = "x2",
main = c("The standard deviation of",
"infinitesimal jacknife variance estimate"))
[Figure: filled contour over x1 and x2 of the difference between the between-simulation variance and the average asymptotic variance estimate (values roughly -0.0015 to 0).]
[Figure: filled contour over x1 and x2 of the difference between the between-simulation variance and the average infinitesimal jackknife variance estimate (values roughly -1e-03 to 5e-04).]
[Figure: filled contour over x1 and x2 of the standard deviation of the asymptotic variance estimate (values roughly 0.001 to 0.004).]
[Figure: filled contour over x1 and x2 of the standard deviation of the infinitesimal jackknife variance estimate (values roughly 2e-04 to 1e-03).]
Grading rubric for Question 1
(a) i. Structure input and output correctly [5]
ii. Asymptotic variance [5]
iii. Infinitesimal jackknife [5]
(b) i. Calculate between simulation variance [2]
ii. Calculate average of both variance estimates [2]
iii. Calculate standard deviation of both variance estimates [2]
iv. Calculate mean squared error of both variance estimates [2]
v. Conclude [2]
(c) i. Set up trees [5]
ii. Conclude [10]
Since the simulations take a relatively long time to run, what I have done is to look at the code and check for certain things (e.g., that sampling for both methods is implemented correctly), and that the values are on roughly the same scale.
2. Newton Raphson Root Finding for Inverse-Transform Sampling
Note that for the normal distribution, we have that:
\[ \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt. \]
(a) The Newton-Raphson algorithm will consist of the following steps:
1. Set x_0 = x_0, tolerance = ε, p = p.
2. While |x_n − x_{n−1}| > ε:
3.   Compute x_n := x_{n−1} − f(x_{n−1}) / f′(x_{n−1}).
4. Return x_n.
where
\[ f(x) = \Phi(x) - p. \]
Note that:
\[ f'(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}. \]
i. Here is Matlab code:
function [myroot] = NRQ2a(tol, x_0, p, stop)
% Q2a) Newton Raphson Algorithm for standard Normal
% Inputs: tol, x_0, p, stop
% stop to break (infinite) while loops, just in
% case. Eg, setting stop to be 1000 will terminate
% the while loop after 1000 iterations.
counter = 0;
prevroot = x_0;
myroot = abs(2*tol) + abs(prevroot);
while counter < stop;
myroot = prevroot - (normcdf(prevroot) - p)/(1/sqrt(2*pi) * exp(-prevroot^2/2));
if ( abs(myroot - prevroot) < tol)
% terminate function here if we are happy with our root
break
end
% else continue on
prevroot = myroot;
counter = counter +1;
end
if counter == stop;
disp('Check the while loop, might have screwed up');
end
end
and an example:
> NRQ2a(10^(-8), 0, 0.975, 1000)
ans =
1.9600
ii. Here is R code:
NRQ2a<-function(tol, x_0, p, stop)
{
counter = 0
prevroot = x_0
myroot = abs(2*tol) + abs(prevroot)
while (counter < stop)
{
myroot = prevroot - ( pnorm(prevroot) - p)/(1/sqrt(2*pi) * exp(-prevroot^2/2))
if ( abs(myroot - prevroot) < tol)
{
break
}
prevroot = myroot
counter = counter +1
}
if (counter == stop)
{
print('Check the while loop, might have screwed up')
}
return(myroot)
}
and an example:
> NRQ2a(10^(-8),0,0.975,1000)
[1] 1.959964
(b) i. Here are the results in Matlab.
pvec = 0.5:0.01:0.98
for i = 1:length(pvec);
i
rootvec(i) = NRQ2a(10^(-8), 0, pvec(i), 1000);
comparevec(i) = norminv(pvec(i),0,1);
end
% Just see a straight line if you plot the two on the same scale
% so plot differences
figure;
hold on; grid on;
plot(pvec, (rootvec-comparevec))
title(’Difference between NR and norminv’);
xlabel(’p values’);
ylabel(’Differences in inverse’);
Doesn’t really seem to be much deviation (look at the scale!)
ii. and in R
pvec = seq(0.5, 0.98, length=49)
rootvec = vector("numeric")
comparevec = vector("numeric")
for (i in 1:length(pvec))
{
rootvec[i] = NRQ2a(10^(-8),0,pvec[i],1000)
comparevec[i] = qnorm(pvec[i])
}
plot(pvec, (rootvec - comparevec), type = "l", main = "Difference between NR and qnorm",
     ylab = "Differences in inverse", xlab = "p values")
[Figure: plot of rootvec − comparevec against p values, titled "Difference between NR and qnorm"; the differences are on the order of 1e-16.]
Doesn’t really seem to be much deviation (look at the scale!)
(c) There really shouldn't be much deviation (at least within the tolerance level) at all. The CDF of the normal distribution is injective¹, which implies that for each p there is only one root.
Assuming that x0 is chosen reasonably well², the conditions for Newton-Raphson are satisfied - thus our algorithm converges to the real root x for each p within the tolerance level (which is set low enough). It shouldn't be "exact" though, since qnorm is approximated differently.
(d) Suppose we generate a Uniform(0, 1) sample. Then:
i. Rejection sampling
+ Assuming the calculations are done correctly and optimized, it is relatively fast to check whether our sample comes from our distribution (or not).
- May take lots of Uniform(0, 1) samples to get one sample from our proposed distribution.
ii. NR algorithm
+ Will always get a sample from our proposed distribution for each U(0, 1) sample.
- But might take longer to converge if assumptions aren't met (e.g., some discrete distributions).
Generally NR is less efficient when the algorithm is "stuck", or takes a long time to converge. A rough sketch contrasting the two approaches follows the footnotes below.
¹ Well, bijective actually, but being injective suffices. Note that this is not necessarily true of all CDFs - think of discrete distributions or artificially constructed distributions.
² Look at the division by f′(x) in the update - too large or too small an x0 makes f′(x0) essentially zero, which results in an overflow.
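As a rough illustration of the trade-off in (d) (not part of the original solution), here is a sketch that draws from a standard normal truncated to [0, 2] both ways: inverse-transform sampling via NRQ2a() from part (a), and rejection sampling with a Uniform(0, 2) proposal. The truncation interval, n_draws and the proposal are choices made for this example only.
set.seed(1)
n_draws <- 1000
# Inverse-transform: one NR solve per uniform draw, never rejects.
u <- runif(n_draws, pnorm(0), pnorm(2))
x_inv <- sapply(u, function(p) NRQ2a(1e-8, 0, p, 1000))
# Rejection sampling: propose Uniform(0, 2), accept with probability dnorm(x)/dnorm(0).
x_rej <- numeric(n_draws)
accepted <- 0
n_proposed <- 0
while (accepted < n_draws) {
  x <- runif(1, 0, 2)
  n_proposed <- n_proposed + 1
  if (runif(1) < dnorm(x) / dnorm(0)) {
    accepted <- accepted + 1
    x_rej[accepted] <- x
  }
}
n_proposed   # exceeds n_draws: the cost of rejections vs. one NR solve per draw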
Grading rubric for Question 2
(a) Write the NR algorithm [10]
(b) i. Compute Φ⁻¹(x) [2]
ii. Plot solutions [2]
iii. Comment [1]
(c) i. Do they differ numerically? (yes) [1]
ii. What difference is there? [4]
? Bonus: without using function [2]
(d) Comment on comparative computational efficiency . . . [5]
? EC (added by Giles): Look at # of NR iterations [2]
?? Total: [10 + 5 + 5 + 2 + 5 + 2 = 29]
3. Domain of Convergence
(a) The following proof is adapted from The Theory and Applications of Iteration Methods (pg 144). Also note there is a typo in the question - the first derivative should be positive as well.
Proof. Let f be our multimodal function. From the conditions given, we have f ∈ C³. Then using Newton's method to optimize f is equivalent to using Newton's method to solve f′(x) = 0.
Denote g(x) = f′(x). Thus we have
\[ g'(x) > 0 \quad \text{and} \quad g''(x) > 0. \]
Since g′(x) > 0, g is increasing. Now suppose we have our optimum point s and our current starting point x0 > s, so that g is strictly increasing on [s, x0]. By definition of the Newton step, we have
\[ x_{k+1} = x_k - \frac{g(x_k)}{g'(x_k)}, \]
so
\[ x_1 = x_0 - \underbrace{\frac{g(x_0)}{g'(x_0)}}_{>\,0 \text{ since both are positive}}, \]
hence x1 < x0. Now define the Newton map N(x) = x − g(x)/g′(x) on [s, x0]. We have
\[ N'(x) = 1 - \frac{g'(x)^2 - g(x)\,g''(x)}{g'(x)^2} = 1 - 1 + \frac{g(x)\,g''(x)}{g'(x)^2} = \frac{g(x)\,g''(x)}{g'(x)^2} > 0, \]
hence N is strictly increasing on [s, x0]. Thus we have N(x1) < N(x0), i.e.
\[ x_2 = x_1 - \frac{g(x_1)}{g'(x_1)} < x_0 - \frac{g(x_0)}{g'(x_0)} = x_1, \]
hence by induction {x_n}_{n≥0} is a strictly decreasing sequence. Next we claim that s is a lower bound for {x_n}_{n≥0}, which we can also prove by induction. The base case k = 0 is immediate; now suppose the claim holds for k = n, i.e. x_k ≥ s. Then
\[ x_{k+1} = N(x_k) \ge N(s) = s - \frac{g(s)}{g'(s)} = s, \]
hence s is a lower bound for {x_n}_{n≥0}. So {x_n}_{n≥0} is strictly decreasing and bounded below by s; by the monotone convergence theorem for sequences, lim_{n→∞} x_n = s. Hence we are done, since we have shown that x_n converges to the optimum.
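A quick numerical check of the monotone decrease (not part of the original solution): g(x) = exp(x) − 1 satisfies g′ > 0 and g′′ > 0 with root s = 0, so Newton iterates started from x0 > 0 should decrease monotonically towards 0.
g  <- function(x) exp(x) - 1   # g' > 0 and g'' > 0 everywhere, root at s = 0
dg <- function(x) exp(x)
x <- 5
for (k in 1:10) {
  x <- x - g(x) / dg(x)        # Newton step for solving g(x) = 0
  print(x)                     # iterates decrease monotonically towards 0
}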
(b) (Bonus) This proof seems to have been covered in BTRY 7180 (GLMs) when taught last year by Jacob. Here is the relevant portion (might have typos).
Local convergence of Newton-Raphson
Theorem 0.1. Suppose there exists β̂ such that l(β̂) = max_β l(β), ∇²l(β̂) < 0 (negative definite), and ∇²l(β) is continuous over β ∈ D(β̂, δ) = {β ∈ R^p : ‖β − β̂‖₂ ≤ δ}.
i. There exists δ > 0 such that the Newton-Raphson step is defined, and Newton-Raphson converges at a linear rate if β̂^(0) ∈ D(β̂, δ).
ii. Suppose there exist L, M, δ > 0 such that for all β, β̃ ∈ D(β̂, δ):
A. ‖∇²l(β) − ∇²l(β̃)‖_op ≤ L‖β − β̃‖₂ (Lipschitz condition, in the operator norm);
B. ‖∇²l(β)⁻¹‖_op ≤ M, i.e. all eigenvalues of ∇²l(β) are ≤ −1/M.
Then
\[ \|\hat\beta^{(k+1)} - \hat\beta\| \le \frac{ML}{2}\,\|\hat\beta^{(k)} - \hat\beta\|^2, \]
which is called quadratic convergence (provided MLδ/2 < 1).
Note: ‖A‖_op = √(λ_max(AᵀA)) = σ_max(A), the largest singular value. If A < 0 is negative definite, write A = UDUᵀ with D_ii < 0 for all i; then A⁻¹ = UD⁻¹Uᵀ, and ‖A⁻¹‖_op ≤ M ⇔ D_ii ≤ −1/M for all i. Also ‖Av‖ ≤ ‖A‖_op‖v‖, with equality when v is the eigenvector corresponding to the largest eigenvalue.
Proof.
i. ∇²l(β̂) < 0, so by continuity ∇²l(β) < 0 for all β ∈ D(β̂, δ). Take M such that ‖∇²l(β)⁻¹‖_op ≤ M for all β ∈ D(β̂, δ). Then λ_max(∇²l(β)) ≤ −1/M, and:
\[
\|\hat\beta^{(k+1)} - \hat\beta\|
= \|\hat\beta^{(k)} - \hat\beta - \nabla^2 l(\hat\beta^{(k)})^{-1}\,\nabla l(\hat\beta^{(k)})\|
= \left\| \left[\nabla^2 l(\hat\beta^{(k)})\right]^{-1}\left(\nabla^2 l(\hat\beta^{(k)})(\hat\beta^{(k)} - \hat\beta) - \nabla l(\hat\beta^{(k)})\right) \right\|
\le M\,\|\nabla^2 l(\hat\beta^{(k)})(\hat\beta^{(k)} - \hat\beta) - \nabla l(\hat\beta^{(k)})\|.
\]
Note that
\[
\nabla l(\hat\beta^{(k)}) = \int_0^1 \nabla^2 l\big(\hat\beta + t(\hat\beta^{(k)} - \hat\beta)\big)\,dt\,(\hat\beta^{(k)} - \hat\beta) + \underbrace{\nabla l(\hat\beta)}_{=\,0 \text{ at the MLE}}
\]
by the fundamental theorem of calculus for gradients,
\[
\phi(y) - \phi(x) = \int_{\text{path } x \to y} \nabla\phi(r)^{\top}\,dr,
\]
applied with φ = ∂l/∂β_j along the straight path r = β̂ + t(β̂^(k) − β̂) from β̂ to β̂^(k). Hence
\[
\|\hat\beta^{(k+1)} - \hat\beta\|
\le M \left\| \int_0^1 \left[\nabla^2 l(\hat\beta^{(k)}) - \nabla^2 l\big(\hat\beta + t(\hat\beta^{(k)} - \hat\beta)\big)\right] dt \,(\hat\beta^{(k)} - \hat\beta) \right\|_2
\le M \int_0^1 \left\| \nabla^2 l(\hat\beta^{(k)}) - \nabla^2 l\big(\hat\beta + t(\hat\beta^{(k)} - \hat\beta)\big) \right\|_{\mathrm{op}} dt \cdot \|\hat\beta^{(k)} - \hat\beta\|
\]
by Cauchy-Schwarz. By continuity of ∇²l, there exists δ > 0 such that
\[
\left\| \nabla^2 l(\hat\beta^{(k)}) - \nabla^2 l\big(\hat\beta + t(\hat\beta^{(k)} - \hat\beta)\big) \right\|_{\mathrm{op}} < \frac{1}{2M}.
\]
Hence
\[
\|\hat\beta^{(k+1)} - \hat\beta\| \le M \int_0^1 \frac{1}{2M}\,dt \cdot \|\hat\beta^{(k)} - \hat\beta\| \le \frac{1}{2}\,\|\hat\beta^{(k)} - \hat\beta\| \le \frac{1}{2^{\,k+1}}\,\|\hat\beta^{(0)} - \hat\beta\|,
\]
which converges at a linear rate.
ii. We have:
\[
\|\hat\beta^{(k+1)} - \hat\beta\|
\le M \int_0^1 \left\| \nabla^2 l(\hat\beta^{(k)}) - \nabla^2 l\big(\hat\beta + t(\hat\beta^{(k)} - \hat\beta)\big) \right\|_{\mathrm{op}} dt \cdot \|\hat\beta^{(k)} - \hat\beta\|
\le ML \int_0^1 \left\| \hat\beta^{(k)} - \big(\hat\beta + t(\hat\beta^{(k)} - \hat\beta)\big) \right\|_2 dt \cdot \|\hat\beta^{(k)} - \hat\beta\|
= ML \int_0^1 (1 - t)\,dt \cdot \|\hat\beta^{(k)} - \hat\beta\|^2
= \frac{ML}{2}\,\|\hat\beta^{(k)} - \hat\beta\|^2.
\]
Now define the sequence of numbers
\[
a_k = \frac{ML}{2}\,\|\hat\beta^{(k)} - \hat\beta\|;
\]
then a_{k+1} ≤ a_k², which implies a_k ≤ a_0^(2^k). Given a_0 ≤ MLδ/2, we have
\[
\|\hat\beta^{(k)} - \hat\beta\| \le \frac{2}{ML}\left(\frac{ML\delta}{2}\right)^{2^k},
\]
so take MLδ/2 < 1.
For global convergence, use a modified Newton-Raphson algorithm. This is the same as Newton's method when the constraints hold; it needs to be modified for situations where the constraints fail.
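As an illustration only - the original does not spell out which modification is meant - one common modified Newton-Raphson scheme damps the step by halving it until the objective actually increases (a simple backtracking line search). Here l, grad and hess are placeholders for the objective, its gradient and its Hessian.
modified_newton_step <- function(beta, l, grad, hess) {
  direction <- -solve(hess(beta), grad(beta))   # full Newton direction
  alpha <- 1
  while (l(beta + alpha * direction) < l(beta) && alpha > 1e-10) {
    alpha <- alpha / 2                          # halve the step until l increases
  }
  beta + alpha * direction
}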
(c) Any counterexample, e.g. y = √|x| or similar, would suffice; a sketch is given below.
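A minimal sketch (not in the original solution), reading the counterexample as Newton root-finding on f(x) = √|x|: here f(x)/f′(x) = 2x, so the update x − f(x)/f′(x) = −x just flips the sign, and the iterates oscillate between x0 and −x0 forever instead of converging to the root at 0.
f  <- function(x) sqrt(abs(x))
df <- function(x) sign(x) / (2 * sqrt(abs(x)))
x <- 1
for (k in 1:6) {
  x <- x - f(x) / df(x)   # equals -x exactly, for any x != 0
  print(x)                # prints -1, 1, -1, 1, ...
}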
(d) Assuming we have the functions in R from http://faculty.bscb.cornell.edu/~hooker/BTRY6520/R_MultivariateOptimization.R, here are the plots for each algorithm. Note that 4 denotes one optimum, 5 denotes the other optimum, and × denotes that no optimum is reached.
i. [Figure: "Optimum for steepest descent", plotted over µ1, µ2 ∈ [2, 5].]
ii. [Figure: "Optimum for Newton Raphson", plotted over µ1, µ2 ∈ [2, 5].]
iii. [Figure: "Optimum for Levenberg-Marquardt", plotted over µ1, µ2 ∈ [2, 5].]
(e) [Figure: "Optimum for alternative method", plotted over µ1, µ2 ∈ [2, 5].]
The computational cost of this method is lower than that of the other methods, as we do not need to do a Golden Section search like Steepest Descent, or compute second order derivatives like Newton-Raphson and the Levenberg-Marquardt algorithm.
Grading rubric for Question 3
(a) Reasonable attempt at a proof³ [10]
(b) Bonus: Generalize . . . [5]
(c) Counter example [5]
(d) Plots of graphs [5]
(e) i. Code and plot of epsilon gradient [8]
ii. Commentary [2]
?? Total: [10 + 5 + 5 + 5 + 10 = 35]
³ bearing in mind that the question is phrased wrongly
4. Constrained Optimization
# Load in data from troutpcb.txt
troutpcb=read.table("troutpcb.txt", head=T)
n=length(troutpcb$Age)
par(mfrow=c(1,2))
plot(PCB~Age,type="o",data=troutpcb)
plot(log(PCB)~Age, type="o", data=troutpcb)
[Figure: left panel, PCB against Age; right panel, log(PCB) against Age.]
(a) # Implement functions to evaluate l(beta) and its gradient
reg.l=function(beta, x){
if (beta[2]==0) { pred=beta[1]
} else {pred=beta[1]+beta[2]*x/(1+beta[3]+x)}
return(pred)
}
l=function(beta, x=troutpcb$Age, y=troutpcb$PCB){
n=length(x)
l=-sum((y-reg.l(beta,x))^2)/(n-1)
return (l)
}
# Sample input: l(c(mean(troutpcb$PCB), 0))
# Output [1] -56.18434
l.grad=function(beta, x=troutpcb$Age, y=troutpcb$PCB){
n=length(x)
dbeta0=1
dbeta1=x/(1+beta[3]+x)
dbeta2=-beta[2]*x/((1+beta[3]+x)^2)
J=matrix(rep(0, n*3), ncol=3)
J[,1]=dbeta0
J[,2]=dbeta1
J[,3]=dbeta2
grad=2*t(J)%*%(y-reg.l(beta,x))/(n-1)
# the following way can also calculate gradient
#grad.0=2/(n-1)*sum(dbeta0*(y-reg.l(beta,x)))
#grad.1=2/(n-1)*sum(dbeta1*(y-reg.l(beta,x)))
#grad.2=2/(n-1)*sum(dbeta2*(y-reg.l(beta,x)))
#return(c(grad.0, grad.1, grad.2))
return(grad)
}
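A quick finite-difference check of l.grad (not part of the original solution); the analytic and numerical gradients should agree to several decimal places. The test point beta0 = c(-1, 6, 7) is just the starting value used later in part (b).
beta0 <- c(-1, 6, 7)
eps <- 1e-6
num.grad <- sapply(1:3, function(j) {
  e <- rep(0, 3); e[j] <- eps
  (l(beta0 + e) - l(beta0 - e)) / (2 * eps)   # central difference in coordinate j
})
cbind(analytic = as.vector(l.grad(beta0)), numerical = num.grad)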
(b) SteepestAscent.mod = function(beta,fn=l,dfn=l.grad,x=troutpcb$Age,y,
tol=1e-8,maxit=1000,maxtry=250)
{
tol.met=FALSE; iter = 0; flag=FALSE # flag for alpha
while(!tol.met){
iter = iter+1; oldbeta = beta; g = dfn(beta,x,y)
if (beta[3]>0 & g[3]==0){
#If it is true, then in GoldenSectionLineSearch
# the step cannot go across zero
flag=TRUE
} else if (beta[3]==0 & g[3]<0){
#If it is true, set g[3]=0.
g[3]=0
}
# Define a new function (g and alpha can be found from workspace)
fns = function(alpha,x,y){ return(fn(beta+alpha*g,x,y)) }
# Conduct the line search
linesearch = GoldenSectionLineSearch(flag,fns,x,y,tol,maxtry)
# And update the estimate
beta = beta + linesearch$xm*g
if( max(abs( beta-oldbeta )) < tol | iter > maxit){ tol.met=TRUE }
}
return(list(beta=beta,iter=iter))
}
GoldenSectionLineSearch = function(flag,fns,x,y,tol,maxtry=250)
{
# We need to work out where to put our limits for the Golden Section
# search.
if (flag==FALSE){
ar = c(0,2*tol,4*tol) # Start near tolerance
fval = c(fns(0,x,y),fns(2*tol,x,y),fns(4*tol,x,y))
try = 3 # Keep doubling
while( fval[try]>fval[try-1] & try < maxtry ){ # the end point
ar = c(ar,2*ar[try])
try = try+1
fval = c(fval,fns(ar[try],x,y))
}
if(try == maxtry){ al = ar[try-1]; ar = ar[try]
} else{ al = ar[try-2]; ar = ar[try] } # if not at a hump, otherwise
# last three.
# Now call GoldenSection for the line search
return( GoldenSection(fns,al,ar,al,1,x,y,tol=tol) )
} else {
#If flag is TRUE, then no step can be made large enough to cross 0.
ar = c(0,2*tol,4*tol) # Start near tolerance
fval = c(fns(0,x,y),fns(2*tol,x,y),fns(4*tol,x,y))
try = 3 # Keep doubling
while( fval[try]>fval[try-1] & try < maxtry
& ar[try]>(-beta[3]/g[3])){ # the end point
ar = c(ar,2*ar[try])
try = try+1
fval = c(fval,fns(ar[try],x,y))
}
if(try == maxtry){ al = ar[try-1]; ar = ar[try]
} else{ al = ar[try-2]; ar = ar[try] }
return( GoldenSection(fns,al,ar,al,1,x,y,tol=tol) )
}
}
GoldenSection = function(fn,betal,betar,beta,dim,x,y,tol=1e-8,maxit=250){
# Here beta is a vector of inputs into fn, betal and betar are upper and lower
# values for beta, also given as vectors. However, the function only works
# on the dimension dim.
gr = (1 + sqrt(5))/2 # Pre-calculate golden ratio
xl = beta # We’ll set xl, xr and xm to be vectors, but only
xl[dim] = betal[dim] # the dim’th entry will be different. Note that in
# all our updating rules, the other entries will
xr = beta # not be affected.
xr[dim] = betar[dim]
xm = beta
xm[dim] = betal[dim] + (betar[dim]-betal[dim])/(1+gr)
fl = fn(xl,x,y); fr = fn(xr,x,y); fm = fn(xm,x,y)
tol.met = FALSE # No tolerance met
iter = 0 # No iterations
while(!tol.met){ # Here we only need to check conditions on the
iter = iter + 1 # dim’th entry.
if( (xr[dim]-xm[dim]) > (xm[dim]-xl[dim]) ){ # Working on the right-hand side
temp = xm + (xr-xm)/(1+gr); ftemp = fn(temp,x,y);
if( ftemp > fm){ xl = xm; fl = fm; xm = temp; fm = ftemp }
else{ xr = temp; fr = ftemp }
}
else{
temp = xm - (xm-xl)/(1+gr); ftemp = fn(temp,x,y);
if( ftemp > fm){ xr = xm; fr = fm; xm = temp; fm = ftemp }
else{ xl = temp; fl = ftemp }
}
if( (xr[dim]-xm[dim]) < tol | iter > maxit ){ tol.met=TRUE }
}
return(list(xm=xm,iter=iter))
}
beta=c(-1,6,7)
x=troutpcb$Age
y=-1+5*x/(6+x)
SteepestAscent.mod(c(-1,6,7),l,l.grad,x,y)$beta
# [,1]
#[1,] -0.9887819
#[2,] 5.0176656
#[3,] 5.1051488
SteepestAscent.mod(c(-1,6,7),l,l.grad,x,y)$iter
# [1] 1001
(c) # Modify gradient function.
l.grad.mod=function(beta,x,y){
n=length(x)
dbeta0=1
dbeta1=x/(1+beta[3]+x)
dbeta2=-beta[2]*x/((1+beta[3]+x)^2)
J=matrix(rep(0, n*3), ncol=3)
J[,1]=dbeta0
J[,2]=dbeta1
J[,3]=dbeta2
grad=2*solve(t(J)%*%J)%*%t(J)%*%(y-reg.l(beta,x))/(n-1)
return(grad)
}
x=troutpcb$Age
y=-1+5*x/(6+x)
beta.est=SteepestAscent.mod(c(-1,6,7),l,l.grad.mod,x,y)$beta
#> beta.est
# [,1]
#[1,] -1.000000
#[2,] 5.000000
#[3,] 4.999999
beta.iter=SteepestAscent.mod(c(-1,6,7),l,l.grad.mod,x,y)$iter
#> beta.iter
#[1] 4
y.mod=-1+6*x/(0.6+x)
beta.est.mod=SteepestAscent.mod(c(-1,6,-0.4),l,l.grad.mod,x,y.mod)$beta
#> beta.est.mod
# [,1]
#[1,] -1
#[2,] 5
#[3,] 5
# The algorithm didn't stop at beta2=0.
(d) # Apply the modified Steepest Ascent to the trout PCB data set
beta.estimate=SteepestAscent.mod(c(-1,6,7),l, l.grad,troutpcb$Age,
troutpcb$PCB)$beta
#> beta.estimate
# [,1]
#[1,] -2.514947
#[2,] 69.942658
#[3,] 31.522597
(e) # Generate normal errors and add them to the predictions. Create 200
# simulated data sets and re-estimate the parameters.
error=matrix(rnorm(length(troutpcb$Age)*200), ncol=200)
beta.re=matrix(rep(0, 200*3), ncol=3)
for (k in 1:200){
pred.sim=reg.l(beta.estimate, x)+error[,k]
beta.re[k,]=SteepestAscent.mod(beta.estimate,l,l.grad,
troutpcb$Age,pred.sim)$beta
}
count=sum(beta.re[,3]<1e-8)
hist(beta.re[,3])
#count=0
[Figure: histogram of beta.re[, 3], ranging from roughly 10 to 50.]
(f) # null case is beta2=0. pred=beta0+beta1*x/(1+x)
beta.estimate[3]=0
beta.est3=matrix(rep(0, 200*3), ncol=3)
for (m in 1:200){
pred.sim=reg.l(beta.estimate,x)+error[,m]
beta.est3[m,]=SteepestAscent.mod(beta.estimate,l,l.grad,
troutpcb$Age, pred.sim)$beta
}
count.new=sum(beta.est3[,3]<1e-8)
# count.new=116
hist(beta.est3[,3])
# From the plot, we can see that beta2 has an asymmetric distribution.
# It is not reasonable to apply normal-theory confidence intervals
# in this case.
[Figure: histogram of beta.est3[, 3], ranging from roughly -0.1 to 0.3.]
(g) # Gauss-Newton optimization to estimate beta0, beta1, and r,
# where beta2=exp(r).
GaussNewton = function(beta,fn,dfn,Y,x,tol=1e-8,maxit=100){
tol.met=FALSE; iter = 0;
while(!tol.met){
iter = iter + 1
oldbeta = beta
f = fn(beta,x)
g = dfn(beta,x)
beta = beta + chol2inv(chol(t(g) %*% g)) %*% ( t(g) %*% (Y-f))/27
# beta = beta + solve(t(g)%*%g,t(g)%*%(Y-f))
if(max(abs( beta-oldbeta )) < tol | iter > maxit)
{ tol.met=TRUE }
}
return(list(beta=beta,iter=iter))
}
# the Jacobian matrix of f(para, x) (not the gradient)
dreg.f=function(para,x){
n=length(x)
dbeta0=1
dbeta1=x/(1+exp(para[3])+x)
dgamma= - para[2]*x*exp(para[3])/((1+x+exp(para[3]))^2)
J=matrix(rep(0, n*3), ncol=3)
J[,1]=dbeta0
J[,2]=dbeta1
J[,3]=dgamma
return(J)
}
l.gamma=function(para, x=troutpcb$Age, y=troutpcb$PCB){
n=length(x)
l=-sum((y-reg.gamma(para,x))^2)/(n-1)
return (l)
}
reg.gamma=function(para,x){
return(para[1]+para[2]*x/(1+exp(para[3])+x))}
para=c(-2,70,log(1000))
para.est=GaussNewton(para, reg.gamma, dreg.f, troutpcb$PCB, troutpcb$Age)$beta  # keep only the parameter estimates
# Employ Gauss-Newton optimization to estimate parameters in simulations
# in 4(f)
para.re=matrix(rep(0, 200*3), ncol=3)
for (s in 1:200){
pred.sim=reg.gamma(para.est,troutpcb$Age)+error[,s]  # simulate from the fitted model
para.re[s,]=GaussNewton(para, reg.gamma, dreg.f, pred.sim, troutpcb$Age)$beta  # Y = simulated response, x = Age
}
hist(exp(para.re[,3]))
(h) # bonus: null case is beta2=0. pred=beta0+beta1*x/(1+x)
para.re4=matrix(rep(0, 200*3), ncol=3)
para.est[3]=1
para.est4=matrix(rep(0, 200*3), ncol=3)
for (m in 1:200){
pred.sim=reg.gamma(para.est,troutpcb$Age)+error[,m]
para.re4[m,]=GaussNewton(para.est,reg.gamma,dreg.f,pred.sim,troutpcb$Age)$beta
}
count.new=sum(para.re4[,3]==0)
hist(exp(para.re4[,3]))  # beta2 = exp(gamma) for the null simulations
Grading rubric for Question 4
i. Plot of PCB against age [1]
ii. Plot of log PCB against age [1]
(a) i. Gradient function [2]
ii. Checking it [1]
(b) i. Modified steepest ascent function [5]
ii. Checking it [1]
(c) i. Changing the gradient [1]
ii. First test [1]
iii. Second test [1]
(d) i. Applying modified steepest ascent to data [2]
(e) i. Generating errors and estimating parameters [3]
ii. Reporting number of times β̂2 = 0 [1]
iii. Histogram of β2 [1]
(f) i. Changing response generation method [2]
ii. Reporting number of times β̂2 = 0 [1]
iii. Discussing if Normal Theory CIs are appropriate [1]
(g) i. Implement Gauss Newton [3]
ii. Histogram [1]
iii. Comparing to distribution [1]
? Bonus: Repeat for 4f) [5]