

Note: Code is provided in R for this homework. Ze Jin has provided code for Q1, and Giles has provided the solutions to Q4.

• Solution and grading rubric for Question 1.

• Solution and grading rubric for Question 2.

• Solution (a sketch) and grading rubric for Question 3.

• Solution (a sketch) and grading rubric for Question 4.

Scripting Techniques:

1. For while loops, especially if you are unsure whether the code is right, set a counter that breaks the while loop in case it runs forever (see the combined sketch after this list).

2. For while / for loops, you can print out each iteration, or every kth iteration (e.g., use mod or similar). If each iteration is relatively fast, print only every kth iteration, since printing every iteration will actually slow things down.

3. rbind / cbind are much slower than preallocating with vector("numeric"). If your loop requires rbind or cbind, do something else.

4. For time-consuming simulations, save the output at every kth iteration. E.g., look at page 10 or page 13 in the solutions of HW1 for the syntax.

5. Print values out where you think errors may be - this is the best way to debug code.

6. Try not to give your variables the same names as existing R functions.
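As an illustration of points 1-4, here is a minimal R sketch. The simulation inside the loop and the file name partial_results.rds are made up for the example, not part of any homework question.

max_iter <- 10000                      # point 1: hard cap so a buggy condition cannot run forever
n_sims <- 5000
out <- vector("numeric", n_sims)       # point 3: preallocate instead of growing with rbind/cbind
i <- 0
while (i < n_sims && i < max_iter) {
  i <- i + 1
  out[i] <- mean(rnorm(100))           # stand-in for the real per-iteration work
  if (i %% 500 == 0) print(i)          # point 2: report only every 500th iteration
  if (i %% 1000 == 0) saveRDS(out[1:i], "partial_results.rds")   # point 4: checkpoint periodically
}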

Plotting and misc

1. Cheap hack (for R): If you have too many observations in a plot, you'll find that the image size is a few MB, and it takes ages to print it out. Solution: Pretend you are an undergrad again, and take a screenshot of it.

2. To see if two outputs vary, sometimes it's nicer to compare the differences instead of plotting them side by side (see the small example below).
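For instance, with two hypothetical output vectors out1 and out2 of equal length (the second one is faked here just for illustration), plotting their difference makes tiny discrepancies visible that a side-by-side plot would hide:

out1 <- qnorm(seq(0.5, 0.98, by = 0.01))
out2 <- out1 + rnorm(length(out1), sd = 1e-8)   # pretend this came from a second method
plot(seq(0.5, 0.98, by = 0.01), out2 - out1, type = "l",
     xlab = "p", ylab = "difference between outputs")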


1. Variance Estimates and U Statistics

(a) incomplete_U_stat <- function(Z, h, k, L, B)
{
  Z <- as.matrix(Z)
  n <- nrow(Z)

  # asymptotic variance
  M <- B * L
  H_1 <- c()
  H_1_mean <- c()
  H_2 <- c()
  N <- matrix(0, n, M)
  for(b in 1:B){
    index_1 <- sample(n, 1, replace = FALSE)
    for(l in 1:L){
      index_2 <- sample(c(1:n)[-index_1], k - 1, replace = FALSE)
      H_1 <- cbind(H_1, h(Z[c(index_1, index_2), ]))
    }
    if(nrow(H_1) == 1){
      H_1_mean <- cbind(H_1_mean, mean(H_1[((b - 1)*L + 1):(b * L)]))
    }
    if(nrow(H_1) >= 2){
      H_1_mean <- cbind(H_1_mean, rowMeans(H_1[, ((b - 1)*L + 1):(b * L)]))
    }
  }

  # infinitesimal jackknife
  for(i in 1:M){
    index <- sample(n, k, replace = FALSE)
    H_2 <- cbind(H_2, h(Z[index, ]))
    N[index, i] <- 1
  }

  zeta_1 <- apply(H_1_mean, 1, var)
  zeta_k <- apply(H_1, 1, var)


  sigma_square <- k^2 / n * zeta_1 + zeta_k / M
  U_stat <- rowMeans(H_2)
  sigma_hat_square <- 0
  for(j in 1:n){
    sigma_hat_square <- sigma_hat_square + (cov(N[j, ], t(H_2)))^2
  }
  list(U = U_stat, sigma1 = as.vector(sigma_square),
       sigma2 = as.vector(sigma_hat_square))
}

Now test the function:

k <- 2

B <- 20

L <- 100

h <- function(Z){

abs(Z[1] - Z[2])

}

Z <- rnorm(1000)

example_est <- incomplete_U_stat(Z, h, k, L, B)

example_est

(b) n_run <- 100

U_est <- rep(NA, n_run)

sigma1_est <- rep(NA, n_run)

sigma2_est <- rep(NA, n_run)

for(i in 1:n_run){

Z <- rnorm(1000)

est <- incomplete_U_stat(Z, h, k, L, B)

U_est[i] <- est$U

sigma1_est[i] <- est$sigma1

sigma2_est[i] <- est$sigma2

}

var_U_est <- var(U_est)


var_U_est

mean_sigma1_est <- mean(sigma1_est)

mean_sigma2_est <- mean(sigma2_est)

mean_sigma1_est

mean_sigma2_est

sd_sigma1_est <- sd(sigma1_est)

sd_sigma2_est <- sd(sigma2_est)

sd_sigma1_est

sd_sigma2_est

MSE_sigma1_est <- mean((sigma1_est - var_U_est)^2)

MSE_sigma2_est <- mean((sigma2_est - var_U_est)^2)

MSE_sigma1_est

MSE_sigma2_est

(c) library(rpart)   # rpart() below needs this package loaded
n <- 500
k <- 50
L <- 250
B <- 25
h <- function(Z){
  Z <- as.data.frame(Z)
  tree <- rpart(y ~ x1 + x2, data = Z,
                control = rpart.control(minsplit = 10))
  x1 <- rep(seq(-2, 2, length.out = 9), each = 9)
  x2 <- rep(seq(-2, 2, length.out = 9), times = 9)
  x <- data.frame(x1, x2)
  predict(tree, x)
}

n_run <- 100

U_est <- matrix(NA, n_run, 81)

sigma1_est <- matrix(NA, n_run, 81)

sigma2_est <- matrix(NA, n_run, 81)

for(i in 1:n_run){

print(i)


x1 <- rnorm(n)

x2 <- rnorm(n)

epi <- rnorm(n)

y <- sqrt(abs(x1) + abs(x2)) + 0.2 * epi

Z <- cbind(x1, x2, y)

est <- incomplete_U_stat(Z, h, k, L, B)

U_est[i, ] <- est$U

sigma1_est[i, ] <- est$sigma1

sigma2_est[i, ] <- est$sigma2

}

var_U_est <- apply(U_est, 2, var)

mean_sigma1_est <- colMeans(sigma1_est)

mean_sigma2_est <- colMeans(sigma2_est)

diff_sigma1_est <- var_U_est - mean_sigma1_est

diff_sigma2_est <- var_U_est - mean_sigma2_est

filled.contour(seq(-2, 2, length.out = 9), seq(-2, 2, length.out = 9),

t(matrix(diff_sigma1_est, 9, 9)), xlab = "x1", ylab = "x2",

main = c("The difference between the between-simulation variance",

"and average asymptotic variance estimate"))

filled.contour(seq(-2, 2, length.out = 9), seq(-2, 2, length.out = 9),

t(matrix(diff_sigma2_est, 9, 9)), xlab = "x1", ylab = "x2",

main = c("The difference between the between-simulation variance",

"and average infinitesimal jacknife variance estimate"))

sd_sigma1_est <- apply(sigma1_est, 2, sd)

sd_sigma2_est <- apply(sigma2_est, 2, sd)

filled.contour(seq(-2, 2, length.out = 9), seq(-2, 2, length.out = 9),

t(matrix(sd_sigma1_est, 9, 9)), xlab = "x1", ylab = "x2",

main = c("The standard deviation of",

"asymptotic variance estimate"))

filled.contour(seq(-2, 2, length.out = 9), seq(-2, 2, length.out = 9),

t(matrix(sd_sigma2_est, 9, 9)), xlab = "x1", ylab = "x2",

main = c("The standard deviation of",

"infinitesimal jacknife variance estimate"))


[Figure: filled contour plots over x1, x2 in [-2, 2]. First plot: "The difference between the between-simulation variance and average asymptotic variance estimate" (scale roughly -0.0015 to 0). Second plot: "The difference between the between-simulation variance and average infinitesimal jackknife variance estimate" (scale roughly -1e-03 to 5e-04).]


[Figure: filled contour plots over x1, x2 in [-2, 2]. First plot: "The standard deviation of asymptotic variance estimate" (scale roughly 0.0010 to 0.0040). Second plot: "The standard deviation of infinitesimal jackknife variance estimate" (scale roughly 2e-04 to 1e-03).]


Grading rubric for Question 1

(a) i. Structure input and output correctly [5]

ii. Asymptotic variance [5]

iii. Infinitesimal jackknife [5]

(b) i. Calculate between simulation variance [2]

ii. Calculate average of both variance estimates [2]

iii. Calculate standard deviation of both variance estimates [2]

iv. Calculate mean squared error of both variance estimates [2]

v. Conclude [2]

(c) i. Set up trees [5]

ii. Conclude [10]

Since the simulations take a relatively long time to run, what I have done is to look at the code and check for certain things (e.g., that sampling for both methods is implemented correctly), and that the reported values are on roughly the same scale.


2. Newton Raphson Root Finding for Inverse-Transform Sampling

Note that for the normal distribution, we have that:

Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt

(a) The Newton-Raphson algorithm will consist of the following steps:

1. Initialise x_0, the tolerance ε, and p.
2. While |x_n − x_{n−1}| > ε:
3.     Compute x_n := x_{n−1} − f(x_{n−1}) / f′(x_{n−1})
4. Return x_n

where

f(x) = Φ(x) − p.

Note that

f′(x) = (1/√(2π)) e^{−x²/2}.

i. Here is Matlab code:

function [myroot] = NRQ2a(tol, x_0, p, stop)

% Q2a) Newton Raphson Algorithm for standard Normal

% Inputs: tol, x_0, p, stop

% stop to break (infinite) while loops, just in

% case. Eg, setting stop to be 1000 will terminate

% the while loop after 1000 iterations.

counter = 0;

prevroot = x_0;

myroot = abs(2*tol) + abs(prevroot);

while counter < stop;

myroot = prevroot - ( normcdf(prevroot) - p)/( 1/sqrt(2*pi) * exp(-prevroot^2/2));

if ( abs(myroot - prevroot) < tol)

% terminate function here if we are happy with our root

break

end

% else continue on

prevroot = myroot;


counter = counter +1;

end

if counter == stop;

disp('Check while loop, might have screwed up');

end

end

and an example:

> NRQ2a(10^(-8), 0, 0.975, 1000)

ans =

1.9600

ii. Here is R code:

NRQ2a<-function(tol, x_0, p, stop)

{

counter = 0

prevroot = x_0

myroot = abs(2*tol) + abs(prevroot)

while (counter < stop)

{

myroot = prevroot - ( pnorm(prevroot) - p)/(1/sqrt(2*pi) * exp(-prevroot^2/2))

if ( abs(myroot - prevroot) < tol)

{

break

}

prevroot = myroot

counter = counter +1

}

if (counter == stop)

{

print('Check while loop, might have screwed up')

}

return(myroot)

}

and an example:

> NRQ2a(10^(-8),0,0.975,1000)

[1] 1.959964

(b) i. Here are the results in Matlab.

pvec = 0.5:0.01:0.98

for i = 1:length(pvec);


i

rootvec(i) = NRQ2a(10^(-8), 0, pvec(i), 1000);

comparevec(i) = norminv(pvec(i),0,1);

end

% Just see a straight line if you plot the two on the same scale

% so plot differences

figure;

hold on; grid on;

plot(pvec, (rootvec-comparevec))

title('Difference between NR and norminv');
xlabel('p values');
ylabel('Differences in inverse');

There doesn't really seem to be much deviation (look at the scale!)

ii. and in R

pvec = seq(0.5, 0.98, length=49)

rootvec = vector("numeric")

comparevec = vector("numeric")

for (i in 1:length(pvec))

{

rootvec[i] = NRQ2a(10^(-8),0,pvec[i],1000)

comparevec[i] = qnorm(pvec[i])

}


plot(pvec, (rootvec-comparevec), type = "l", main = "Difference between NR and qnorm",
     ylab = "Differences in inverse", xlab = "p values")

[Figure: "Difference between NR and qnorm" - the differences in the inverse across p values from 0.5 to 0.98 are on the order of 1e-16.]

There doesn't really seem to be much deviation (look at the scale!)

(c) There really shouldn't be much deviation (at least within the tolerance level) at all. The CDF of the normal distribution is injective¹, which implies that for each p there is only one root.

Assuming that x₀ is chosen reasonably well², the conditions for Newton-Raphson are satisfied - thus our algorithm converges to the real root x for each p within the tolerance level (which is set low enough). It shouldn't be "exact" though, since qnorm is approximated differently.

(d) Suppose we generate a Uniform(0, 1) sample. Then:

i. Rejection sampling

+ Assuming the calculations are done correctly and optimized, it is relatively fast to check whether our sample comes from our distribution (or not).

- May take lots of Uniform(0, 1) samples to get one sample from our proposed distribution.

ii. NR algorithm

+ Will always get a sample from our proposed distribution for each U(0, 1) sample.

- But might take longer to converge if assumptions aren't met (e.g. some discrete distributions).

Generally NR is less efficient when the algorithm is "stuck", or takes a long time to converge. (A short sketch comparing the two approaches is given below, after the footnotes.)

¹Well, bijective actually, but being injective suffices. Note that this is not necessarily true of all CDFs - think of discrete distributions or artificially constructed distributions.

²Look at the denominator f′(x) in the Newton update - too large or too small an x₀ would result in overflow.
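To make the comparison in (d) concrete, here is a small R sketch, assuming the NRQ2a function above is in the workspace. The rejection sampler targets the standard normal with a Cauchy(0, 1) envelope; the envelope constant M below is an illustrative choice, not part of the original solutions.

# Inverse-transform sampling: one NR solve per Uniform(0, 1) draw
u <- runif(1000)
z_inv <- sapply(u, function(p) NRQ2a(10^(-8), 0, p, 1000))

# Rejection sampling: may discard many Cauchy proposals before accepting one
M <- sqrt(2 * pi / exp(1))            # bounds dnorm(x) / dcauchy(x) over all x
z_rej <- vector("numeric", 1000)
i <- 0
while (i < 1000) {
  x <- rcauchy(1)
  if (runif(1) <= dnorm(x) / (M * dcauchy(x))) {   # accept with probability f(x) / (M g(x))
    i <- i + 1
    z_rej[i] <- x
  }
}

c(mean(z_inv), sd(z_inv))             # both sets of draws should look standard normal
c(mean(z_rej), sd(z_rej))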


Grading rubric for Question 2

(a) Write the NR algorithm [10]

(b) i. Compute Φ⁻¹(p) [2]

ii. Plot solutions [2]

iii. Comment [1]

(c) i. Do they differ numerically? (yes) [1]

ii. What difference is there? [4]

⋆ Bonus: without using function [2]

(d) Comment on comparative computational efficiency . . . [5]

⋆ EC (added by Giles): Look at # of NR iterations [2]

⋆⋆ Total: [10 + 5 + 5 + 2 + 5 + 2 = 29]


3. Domain of Convergence

(a) The following proof is adapted from The Theory and Applications of Iteration Methods (pg 144). Also note there is a typo in the question - the first derivative should be positive as well.

Proof. Let f be our function. From the conditions given, we have f ∈ C³. Using Newton's method to optimize f is then equivalent to using Newton's method to solve f′(x) = 0.

Write g(x) = f′(x), so that

g′(x) > 0 and g″(x) > 0.

Since g′(x) > 0, g is increasing. Now suppose we have our optimum point s and our current starting point x₀ > s, so that g is strictly increasing on [s, x₀]. By definition of the Newton step,

x_{k+1} = x_k − g(x_k)/g′(x_k),

so we have

x₁ = x₀ − g(x₀)/g′(x₀) < x₀,

since both g(x₀) and g′(x₀) are positive. Now define the iteration map

N(x) = x − g(x)/g′(x)

on [s, x₀]. We have

N′(x) = 1 − [g′(x)² − g(x)g″(x)] / g′(x)² = 1 − 1 + g(x)g″(x)/g′(x)² = g(x)g″(x)/g′(x)² ≥ 0,

hence N is increasing on [s, x₀]. Thus N(x₁) ≤ N(x₀), i.e.

x₂ = x₁ − g(x₁)/g′(x₁) = N(x₁) ≤ N(x₀) = x₁,

and by induction {xₙ}ₙ≥₀ is a decreasing sequence. Next we claim that s is a lower bound for {xₙ}ₙ≥₀, which we can again prove by induction. The base case n = 0 is immediate; now suppose the claim holds for some n, i.e. xₙ ≥ s. Then, since N is increasing and g(s) = 0,

x_{n+1} = N(xₙ) ≥ N(s) = s − g(s)/g′(s) = s,

hence s is a lower bound for {xₙ}ₙ≥₀. So {xₙ}ₙ≥₀ is decreasing and bounded below by s, and by the monotone convergence theorem for sequences it has a limit; passing to the limit in the Newton update (g and g′ are continuous with g′ > 0) gives g(lim xₙ) = 0, so lim_{n→∞} xₙ = s. Hence we are done, since the iterates converge to the optimum.

(b) (Bonus) This proof seems to have been covered in BTRY 7180 (GLMs) when taught last year by Jacob. Here is the relevant portion (which might have typos).

Local convergence of Newton Raphson

Theorem 0.1. Suppose there exists β̂ such that l(β̂) = max_β l(β), ∇²l(β̂) < 0 (negative definite), and ∇²l(β) is continuous over β ∈ D(β̂, δ) = {β ∈ R^p : ‖β − β̂‖₂ ≤ δ}.

i. There exists δ > 0 such that the Newton-Raphson step is defined, and Newton-Raphson converges at a linear rate if β̂^(0) ∈ D(β̂, δ).

ii. Suppose there exist L, M, δ > 0 such that for all β, β̃ ∈ D(β̂, δ):

A. ‖∇²l(β) − ∇²l(β̃)‖_op ≤ L‖β − β̃‖₂ (Lipschitz condition);

B. ‖∇²l(β)⁻¹‖_op ≤ M, i.e. all eigenvalues of ∇²l(β) are ≤ −1/M.

Then ‖β̂^(k+1) − β̂‖ ≤ (LM/2)‖β̂^(k) − β̂‖². This is called quadratic convergence (provided LMδ/2 < 1).

Note: ‖A‖_op = √(λ_max(AᵀA)) = σ_max(A), the largest singular value. If A < 0 is negative definite, write A = UDUᵀ with D_ii < 0 for all i; then A⁻¹ = UD⁻¹Uᵀ, and ‖A⁻¹‖_op ≤ M ⇔ D_ii ≤ −1/M for all i. Also ‖Av‖ ≤ ‖A‖_op ‖v‖, with equality when v is the eigenvector corresponding to the largest eigenvalue.

Proof.

i. ∇²l(β̂) < 0, so by continuity ∇²l(β) < 0 for all β ∈ D(β̂, δ). Take M such that ‖∇²l(β)⁻¹‖_op ≤ M for all β ∈ D(β̂, δ); then λ_max(∇²l(β)) ≤ −1/M, and

‖β̂^(k+1) − β̂‖ = ‖β̂^(k) − β̂ − ∇²l(β̂^(k))⁻¹ ∇l(β̂^(k))‖
             = ‖[∇²l(β̂^(k))]⁻¹ ( ∇²l(β̂^(k))(β̂^(k) − β̂) − ∇l(β̂^(k)) )‖
             ≤ M ‖∇²l(β̂^(k))(β̂^(k) − β̂) − ∇l(β̂^(k))‖.

Note that

∇l(β̂^(k)) = ∫₀¹ ∇²l(β̂ + t(β̂^(k) − β̂)) dt · (β̂^(k) − β̂) + ∇l(β̂),   with ∇l(β̂) = 0 at the MLE,

by the fundamental theorem of calculus for gradients, φ(y) − φ(x) = ∫_{path x→y} ∇φ(r)ᵀ dr, applied with φ = ∂l/∂β_j along the straight path r = β̂ + t(β̂^(k) − β̂), t ∈ [0, 1], from β̂ to β̂^(k). Hence

‖β̂^(k+1) − β̂‖ ≤ M ‖ ∫₀¹ [∇²l(β̂^(k)) − ∇²l(β̂ + t(β̂^(k) − β̂))] dt · (β̂^(k) − β̂) ‖₂
             ≤ M ∫₀¹ ‖∇²l(β̂^(k)) − ∇²l(β̂ + t(β̂^(k) − β̂))‖_op dt · ‖β̂^(k) − β̂‖   (Cauchy-Schwarz).

By continuity of ∇²l, there exists δ > 0 such that

‖∇²l(β̂^(k)) − ∇²l(β̂ + t(β̂^(k) − β̂))‖_op < 1/(2M).

Hence

‖β̂^(k+1) − β̂‖ ≤ M ∫₀¹ 1/(2M) dt · ‖β̂^(k) − β̂‖ ≤ (1/2)‖β̂^(k) − β̂‖ ≤ (1/2)^(k+1) ‖β̂^(0) − β̂‖,

which converges at a linear rate.

ii. We have

‖β̂^(k+1) − β̂‖ ≤ M ∫₀¹ ‖∇²l(β̂^(k)) − ∇²l(β̂ + t(β̂^(k) − β̂))‖_op dt · ‖β̂^(k) − β̂‖
             ≤ ML ∫₀¹ ‖β̂^(k) − (β̂ + t(β̂^(k) − β̂))‖₂ dt · ‖β̂^(k) − β̂‖
             = ML ∫₀¹ (1 − t) dt · ‖β̂^(k) − β̂‖²
             = (ML/2) ‖β̂^(k) − β̂‖².

Define a_k = (ML/2)‖β̂^(k) − β̂‖. Then a_{k+1} ≤ a_k², which implies a_k ≤ a_0^(2^k). Given a_0 ≤ MLδ/2, we have

‖β̂^(k) − β̂‖ ≤ (2/(ML)) (MLδ/2)^(2^k),

which goes to zero provided MLδ/2 < 1.

For global convergence, use a modified Newton-Raphson algorithm. This is the same as Newton's method with the same constraints; it needs to be modified for situations where the constraints fail.
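A small numerical check of the quadratic rate, using an assumed one-dimensional log-likelihood l(β) = β − e^β (maximised at β̂ = 0, with l″ < 0 everywhere); this example is illustrative and not from the original notes:

l_grad <- function(b) 1 - exp(b)        # gradient of l(b) = b - exp(b)
l_hess <- function(b) -exp(b)           # Hessian; negative for every b
b <- 0.5
err <- abs(b)
for (k in 1:4) {
  b <- b - l_grad(b) / l_hess(b)        # Newton-Raphson update
  err <- c(err, abs(b))
}
err[-1] / err[-length(err)]^2           # ratios settle near 0.5: the error is roughly squared each step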

(c) Any counterexample, e.g. y = √|x| or similar, would suffice.
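A small numerical illustration (an assumed setup, applying the Newton optimization update x − y′(x)/y″(x) directly to y = √|x|; this is illustrative, not prescribed by the answer key): the update works out to 3x, so the iterates move away from the optimum at 0 regardless of how close they start.

y_grad <- function(x) sign(x) / (2 * sqrt(abs(x)))   # y'(x) for y = sqrt(|x|), x != 0
y_hess <- function(x) -1 / (4 * abs(x)^(3/2))        # y''(x), x != 0
x <- 0.1
for (i in 1:6) {
  x <- x - y_grad(x) / y_hess(x)                     # equals 3 * x: each step triples the distance from 0
  print(x)
}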

(d) Assuming we have the functions in R from http://faculty.bscb.cornell.edu/~hooker/BTRY6520/R_MultivariateOptimization.R, here are the plots for each algorithm. One plotting symbol denotes one optimum, a second symbol denotes the other optimum, and × denotes that no optimum is reached.

i. [Figure: "Optimum for steepest descent", plotted over µ1, µ2 in [2, 5].]

ii. [Figure: "Optimum for Newton Raphson", plotted over µ1, µ2 in [2, 5].]

iii. [Figure: "Optimum for Levenberg-Marquardt", plotted over µ1, µ2 in [2, 5].]

(e) [Figure: "Optimum for alternative method", plotted over µ1, µ2 in [2, 5].]

The computational cost of this method is lower than that of the other methods, as we do not need to do a Golden Section search like Steepest Descent, or compute second-order derivatives like Newton-Raphson and the Levenberg-Marquardt algorithm.


Grading rubric for Question 3

(a) Reasonable attempt at a proof³ [10]

(b) Bonus: Generalize . . . [5]

(c) Counter example [5]

(d) Plots of graphs [5]

(e) i. Code and plot of epsilon gradient [8]

ii. Commentary [2]

⋆⋆ Total: [10 + 5 + 5 + 5 + 10 = 35]

³bearing in mind that the question is phrased incorrectly


4. Constrained Optimization

# Load in data from troutpcb.txt

troutpcb=read.table("troutpcb.txt", head=T)

n=length(troutpcb$Age)

par(mfrow=c(1,2))

plot(PCB~Age,type="o",data=troutpcb)

plot(log(PCB)~Age, type="o", data=troutpcb)

[Figure: PCB against Age (left panel) and log(PCB) against Age (right panel).]

(a) # Implement functions to evaluate l(beta) and its gradient

reg.l=function(beta, x){

if (beta[2]==0) { pred=beta[1]

} else {pred=beta[1]+beta[2]*x/(1+beta[3]+x)}

return(pred)

}

l=function(beta, x=troutpcb$Age, y=troutpcb$PCB){

n=length(x)

l=-sum((y-reg.l(beta,x))^2)/(n-1)

return (l)

}

# Sample input: l(c(mean(troutpcb$PCB), 0))

# Output [1] -56.18434

l.grad=function(beta, x=troutpcb$Age, y=troutpcb$PCB){

n=length(x)


dbeta0=1

dbeta1=x/(1+beta[3]+x)

dbeta2=-beta[2]*x/((1+beta[3]+x)^2)

J=matrix(rep(0, n*3), ncol=3)

J[,1]=dbeta0

J[,2]=dbeta1

J[,3]=dbeta2

grad=2*t(J)%*%(y-reg.l(beta,x))/(n-1)

# the following way can also calculate gradient

#grad.0=2/(n-1)*sum(dbeta0*(y-reg.l(beta,x)))

#grad.1=2/(n-1)*sum(dbeta1*(y-reg.l(beta,x)))

#grad.2=2/(n-1)*sum(dbeta2*(y-reg.l(beta,x)))

#return(c(grad.0, grad.1, grad.2))

return(grad)

}

(b) SteepestAscent.mod = function(beta,fn=l,dfn=l.grad,x=troutpcb$Age,y,

tol=1e-8,maxit=1000,maxtry=250)

{

tol.met=FALSE; iter = 0; flag=FALSE # flag for alpha

while(!tol.met){

iter = iter+1; oldbeta = beta; g = dfn(beta,x,y)

if (beta[3]>0 & g[3]==0){

#If it is true, then in GoldenSectionLineSearch

# the step cannot go across zero

flag=TRUE

} else if (beta[3]==0 & g[3]<0){

#If it is true, set g[3]=0.

g[3]=0

}

# Define a new function (g and alpha can be found from workspace)

fns = function(alpha,x,y){ return(fn(beta+alpha*g,x,y)) }

# Conduct the line search

linesearch = GoldenSectionLineSearch(flag,fns,x,y,tol,maxtry)

# And update the estimate

beta = beta + linesearch$xm*g

if( max(abs( beta-oldbeta )) < tol | iter > maxit){ tol.met=TRUE }

}

return(list(beta=beta,iter=iter))


}

GoldenSectionLineSearch = function(flag,fns,x,y,tol,maxtry=250)

{

# We need to work out where to put our limits for the Golden Section

# search.

if (flag==FALSE){

ar = c(0,2*tol,4*tol) # Start near tolerance

fval = c(fns(0,x,y),fns(2*tol,x,y),fns(4*tol,x,y))

try = 3 # Keep doubling

while( fval[try]>fval[try-1] & try < maxtry ){ # the end point

ar = c(ar,2*ar[try])

try = try+1

fval = c(fval,fns(ar[try],x,y))

}

if(try == maxtry){ al = ar[try-1]; ar = ar[try]

} else{ al = ar[try-2]; ar = ar[try] } # if not at a hump, otherwise

# last three.

# Now call GoldenSection for the line search

return( GoldenSection(fns,al,ar,al,1,x,y,tol=tol) )

} else {

#If flag is TRUE, then no step can be made large to across 0.

ar = c(0,2*tol,4*tol) # Start near tolerance

fval = c(fns(0,x,y),fns(2*tol,x,y),fns(4*tol,x,y))

try = 3 # Keep doubling

while( fval[try]>fval[try-1] & try < maxtry

& ar[try]>(-beta[3]/g[3])){ # the end point

ar = c(ar,2*ar[try])

try = try+1

fval = c(fval,fns(ar[try],x,y))

}

if(try == maxtry){ al = ar[try-1]; ar = ar[try]

} else{ al = ar[try-2]; ar = ar[try] }

return( GoldenSection(fns,al,ar,al,1,x,y,tol=tol) )

}

}

GoldenSection = function(fn,betal,betar,beta,dim,x,y,tol=1e-8,maxit=250){

# Here beta is a vector of inputs into fn, betal and betar are upper and lower

# values for beta, also given as vectors. However, the function only works


# on the dimension dim.

gr = (1 + sqrt(5))/2 # Pre-calculate golden ratio

xl = beta # We’ll set xl, xr and xm to be vectors, but only

xl[dim] = betal[dim] # the dim’th entry will be different. Note that in

# all our updating rules, the other entries will

xr = beta # not be affected.

xr[dim] = betar[dim]

xm = beta

xm[dim] = betal[dim] + (betar[dim]-betal[dim])/(1+gr)

fl = fn(xl,x,y); fr = fn(xr,x,y); fm = fn(xm,x,y)

tol.met = FALSE # No tolerance met

iter = 0 # No iterations

while(!tol.met){ # Here we only need to check conditions on the

iter = iter + 1 # dim’th entry.

if( (xr[dim]-xm[dim]) > (xm[dim]-xl[dim]) ){ # Working on the right-hand side

temp = xm + (xr-xm)/(1+gr); ftemp = fn(temp,x,y);

if( ftemp > fm){ xl = xm; fl = fm; xm = temp; fm = ftemp }

else{ xr = temp; fr = ftemp }

}

else{

temp = xm - (xm-xl)/(1+gr); ftemp = fn(temp,x,y);

if( ftemp > fm){ xr = xm; fr = fm; xm = temp; fm = ftemp }

else{ xl = temp; fl = ftemp }

}

if( (xr[dim]-xm[dim]) < tol | iter > maxit ){ tol.met=TRUE }

}

return(list(xm=xm,iter=iter))

}

beta=c(-1,6,7)

x=troutpcb$Age

y=-1+5*x/(6+x)

SteepestAscent.mod(c(-1,6,7),l,l.grad,x,y)$beta

# [,1]

#[1,] -0.9887819

#[2,] 5.0176656


#[3,] 5.1051488

SteepestAscent.mod(c(-1,6,7),l,l.grad,x,y)$iter

# [1] 1001

(c) # Modify gradient function.

l.grad.mod=function(beta,x,y){

n=length(x)

dbeta0=1

dbeta1=x/(1+beta[3]+x)

dbeta2=-beta[2]*x/((1+beta[3]+x)^2)

J=matrix(rep(0, n*3), ncol=3)

J[,1]=dbeta0

J[,2]=dbeta1

J[,3]=dbeta2

grad=2*solve(t(J)%*%J)%*%t(J)%*%(y-reg.l(beta,x))/(n-1)

return(grad)

}

x=troutpcb$Age

y=-1+5*x/(6+x)

beta.est=SteepestAscent.mod(c(-1,6,7),l,l.grad.mod,x,y)$beta

#> beta.est

# [,1]

#[1,] -1.000000

#[2,] 5.000000

#[3,] 4.999999

beta.iter=SteepestAscent.mod(c(-1,6,7),l,l.grad.mod,x,y)$iter

#> beta.iter

#[1] 4

y.mod=-1+6*x/(0.6+x)

beta.est.mod=SteepestAscent.mod(c(-1,6,-0.4),l,l.grad.mod,x,y.mod)$beta

#> beta.est.mod

# [,1]

#[1,] -1

#[2,] 5

#[3,] 5

# The algorithm didn't stop at beta2=0.

(d) # Apply the modified Steepest Ascent to the trout PCB data set

beta.estimate=SteepestAscent.mod(c(-1,6,7),l, l.grad,troutpcb$Age,

troutpcb$PCB)$beta


#> beta.estimate

# [,1]

#[1,] -2.514947

#[2,] 69.942658

#[3,] 31.522597

(e) # Generate normal errors and add them to the predictions. Create 200
# simulated data sets and re-estimate parameters.

error=matrix(rnorm(length(troutpcb$Age)*200), ncol=200)

beta.re=matrix(rep(0, 200*3), ncol=3)

for (k in 1:200){

pred.sim=reg.l(beta.estimate, x)+error[,k]

beta.re[k,]=SteepestAscent.mod(beta.estimate,l,l.grad,

troutpcb$Age,pred.sim)$beta

}

count=sum(beta.re[,3]<1e-8)

hist(beta.re[,3])

#count=0

[Figure: histogram of beta.re[, 3], with values ranging from roughly 10 to 50.]


(f) # null case is beta2=0. pred=beta0+beta1*x/(1+x)

beta.estimate[3]=0

beta.est3=matrix(rep(0, 200*3), ncol=3)

for (m in 1:200){

pred.sim=reg.l(beta.estimate,x)+error[,m]

beta.est3[m,]=SteepestAscent.mod(beta.estimate,l,l.grad,

troutpcb$Age, pred.sim)$beta

}

count.new=sum(beta.est3[,3]<1e-8)

# count.new=116

hist(beta.est3[,3])

# From the plot, we can see that beta2 has an asymmetric distribution.

# It is not reasonable to apply normal-theory confidence intervals

# in this case.

[Figure: histogram of beta.est3[, 3], with values ranging from roughly -0.1 to 0.3.]

(g) # Gauss-Newton optimization to estimate beta0, beta1, and r,

# where beta2=exp(r).

GaussNewton = function(beta,fn,dfn,Y,x,tol=1e-8,maxit=100){

tol.met=FALSE; iter = 0;

while(!tol.met){

iter = iter + 1


oldbeta = beta

f = fn(beta,x)

g = dfn(beta,x)

beta = beta + chol2inv(chol(t(g) %*% g)) %*% ( t(g) %*% (Y-f))/27

# beta = beta + solve(t(g)%*%g,t(g)%*%(Y-f))

if(max(abs( beta-oldbeta )) < tol | iter > maxit)

{ tol.met=TRUE }

}

return(list(beta=beta,iter=iter))

}

# the Jacobian matrix of f(para, x) (not the gradient)

dreg.f=function(para,x){

n=length(x)

dbeta0=1

dbeta1=x/(1+exp(para[3])+x)

dgamma= - para[2]*x*exp(para[3])/((1+x+exp(para[3]))^2)

J=matrix(rep(0, n*3), ncol=3)

J[,1]=dbeta0

J[,2]=dbeta1

J[,3]=dgamma

return(J)

}

l.gamma=function(para, x=troutpcb$Age, y=troutpcb$PCB){

n=length(x)

l=-sum((y-reg.gamma(para,x))^2)/(n-1)

return (l)

}

reg.gamma=function(para,x){

return(para[1]+para[2]*x/(1+exp(para[3])+x))}

para=c(-2,70,log(1000))

para.est=GaussNewton(para, reg.gamma, dreg.f, troutpcb$PCB, troutpcb$Age)

# Employ Gauss-Newton optimization to estimate parameters in simulations

# in 4(f)

para.re=matrix(rep(0, 200*3), ncol=3)

for (s in 1:200){

pred.sim=reg.gamma(para,troutpcb$Age)+error[,s]

para.re[s,]=GaussNewton(para, reg.gamma, dreg.f, pred.sim, troutpcb$Age)$beta

}

hist(exp(para.re[,3]))

(h) # bonus: null case is beta2=0. pred=beta0+beta1*x/(1+x)
para.null=para.est$beta
para.null[3]=1
para.re4=matrix(rep(0, 200*3), ncol=3)
for (m in 1:200){
  pred.sim=reg.gamma(para.null,troutpcb$Age)+error[,m]
  para.re4[m,]=GaussNewton(para.null,reg.gamma,dreg.f,pred.sim,troutpcb$Age)$beta
}
count.new=sum(para.re4[,3]==0)
hist(para.re4[,3])


Grading rubric for Question 4

i. Plot of PCB against age [1]

ii. Plot of log PCB against age [1]

(a) i. Gradient function [2]

ii. Checking it [1]

(b) i. Modified steepest ascent function [5]

ii. Checking it [1]

(c) i. Changing the gradient [1]

ii. First test [1]

iii. Second test [1]

(d) i. Applying modified steepest ascent to data [2]

(e) i. Generating errors and estimating parameters [3]

ii. Reporting number of times β̂2 = 0 [1]

iii. Histogram of β2 [1]

(f) i. Changing response generation method [2]

ii. Reporting number of times β̂2 = 0 [1]

iii. Discussing if Normal Theory CIs are appropriate [1]

(g) i. Implement Gauss Newton [3]

ii. Histogram [1]

iii. Comparing to distribution [1]

⋆ Bonus: Repeat for 4(f) [5]