Simulation Methods for Optimization and Learning

Bernd Heidergott

Department of Econometrics and Operations Research, Vrije Universiteit Amsterdam

Simulation Refresher

1


Overview

Motivation

Designing a Simulation Experiment

Generating Random Variables

Discrete Event Simulation

2


Motivation

3


Stochastic Computer Simulation

• That means that there is some real-world system with intrinsic randomness, and you are asked to solve some kind of statistical, numerical, or optimal control problem.

• The problem cannot be solved exactly.

• However, you can mimic the system via a computer program.

• Hence, you can do repeated experiments with the system and generate many random outcomes for the problem.

• In this way you can get an estimated solution.

4


Motivation

• In this course, we are considering optimization problems.

• Canonical representation:

$$\min_{\theta \in \Theta} J(\theta).$$

• Furthermore, we consider an iterative solution:

$$\theta_{n+1} = \theta_n + \epsilon_n G(\theta_n).$$

• Typically, this is based on gradient descent:

$$G(\theta) = -\nabla J(\theta).$$
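The iteration above can be sketched in Python. The objective J(θ) = (θ − 2)², the constant step size 0.1, and the names `gradient_descent` and `grad_J` are assumptions chosen only for illustration; they do not come from the slides.

```python
def gradient_descent(grad, theta0, step, n_iter):
    """Iterate theta_{n+1} = theta_n + eps_n * G(theta_n) with G = -grad."""
    theta = theta0
    for n in range(n_iter):
        theta = theta + step(n) * (-grad(theta))
    return theta


def grad_J(theta):
    """Gradient of the illustrative objective J(theta) = (theta - 2)^2."""
    return 2.0 * (theta - 2.0)


# Constant step size eps_n = 0.1 (an assumption; other schedules are possible).
theta_hat = gradient_descent(grad_J, theta0=0.0, step=lambda n: 0.1, n_iter=200)
print(theta_hat)
```

Here the gradient is available in closed form, so the iterate approaches the minimizer θ = 2 without any simulation.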

5


Motivation (cont’d)

• What if the objective function is an expected value of some random experiment:

J(θ) = E[Y (θ)],

where J(θ) cannot be computed analytically or numerically; i.e., J(θ) is not given by a formula.

• We call Y(θ) the output of the random experiment.

• In practical situations, Y(θ) is typically a function of one or more random input variables X1, X2, . . . that depend on the control variable θ; thus X1(θ), X2(θ), . . ..

• Or the input variables may not depend on the control, but the function does.

• Examples and applications follow today, in the exercises, and in the next lectures.

6


Motivation (cont’d)

• Recall from your course in Probability:

SLLN: When Y1(θ), Y2(θ), . . . are iid replications of Y(θ), then

$$\frac{1}{n}\sum_{i=1}^{n} Y_i(\theta) \xrightarrow{\text{a.s.}} E[Y(\theta)] \quad (n \to \infty).$$

• Thus J(θ) can be estimated by stochastic simulation (aka Monte Carlo simulation):
  • Repeat the random experiment Y(θ) n times, independently.
  • Observe the n outcomes y1, y2, . . . , yn.
  • Take the average as an estimate: $\bar{y} \doteq \frac{1}{n}\sum_{i=1}^{n} y_i \approx J(\theta)$.

• Note that $\bar{y}$ is an observation (realization) of the so-called sample average estimator of J(θ):

$$\bar{Y}_n(\theta) \doteq \frac{1}{n}\sum_{i=1}^{n} Y_i(\theta).$$

More on the statistical analysis of the estimator later.
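A small Python sketch of the sample average estimate. The experiment is purely illustrative: Y is assumed to be exponentially distributed with mean 2, so that the true value J = 2 is known and the convergence can be observed. The names `run_experiment` and `sample_average` are hypothetical.

```python
import random


def run_experiment(rng):
    """One replication of an illustrative experiment: Y ~ Exp with mean 2."""
    return rng.expovariate(0.5)  # rate 0.5, hence E[Y] = 2


def sample_average(n, rng):
    """Average of n independent replications."""
    return sum(run_experiment(rng) for _ in range(n)) / n


rng = random.Random(12345)  # fixed seed so that the run is replicable
averages = {n: sample_average(n, rng) for n in (10, 1_000, 100_000)}
for n, y_bar in averages.items():
    print(f"n = {n:>6}: sample average = {y_bar:.4f}")
```

For growing n the printed averages should settle near 2, as the SLLN suggests.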

7


Motivation (cont’d)

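• For the gradient descent iteration for min_θ J(θ) (for ease, with one-dimensional θ):

$$\theta_{n+1} = \theta_n - \epsilon_n \nabla J(\theta_n) = \theta_n - \epsilon_n \frac{\partial}{\partial\theta} E[Y(\theta)]\Big|_{\theta=\theta_n}.$$

• If J(θ) can be computed only by simulation, how do we compute the gradient $\frac{\partial}{\partial\theta} E[Y(\theta)]$ at θ = θn?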

• This will be one of the topics in this course.

8


Designing a Simulation Experiment

9


Application: Appointment Scheduling

Mrs Quant, the OR-specialist of a local hospital, is asked to schedule the appointment times for patients who will consult one of the physicians, dr Who, the next day.

She knows:

• The number of patients for the next day will be d = 20.

• The consulting times will fluctuate (are random) with an average of 15 minutes.

• Whatever the schedule for the appointment times, all patients will be on time, and there will be no no-shows.

• Dr Who is available from 9:00 AM, and has no fixed time for ending his working day.

10


Objectives

Mrs Quant is asked to find a schedule that minimizes both

• the expected total waiting times of the patients,

• and the expected total idle time of dr Who.

11


Illustration

• Suppose d = 5 patients.

• Suppose that Mrs Quant schedules appointment times at

9:00, 9:15, 9:30, 9:45, 10:00 hr.

• Suppose that at the end of the day it turned out that the consulting times were

20, 5, 10, 25, 10 minutes,

respectively (and in order of arrival).

• Then the consecutive waiting times were

0, 5, 0, 0, 10 minutes.

• The consecutive idle times were

0, 5, 5, 0 minutes.

• This realisation has a total of 15 minutes of waiting time and 10 minutes of idle time.

12


Mathematics

Before you develop a simulation program, you analyse the model mathematically. Basically, you define the mathematical ingredients and relations.

• Consulting times can be considered to be iid random variables C1, C2, . . . with common cdf F(x) and pdf f(x).

• The control (decision) variables are the inter-appointment times θ1, θ2, . . ..

• For k = 1, . . . , d (number of patients for the next day), let

$$\tau_k = \text{arrival time of patient } k.$$

We assume that τ1 = 0, so that

$$\tau_{k+1} = \tau_k + \theta_k, \quad k = 1, \ldots, d-1.$$

13


Mathematics (cont’d)

• More notation

Wk = waiting time of patient k

Ik = idle time between patients k − 1 and k

• Then W1 = I1 = 0 and for k = 1, 2, . . . , d − 1 we get the Lindley equations:

$$W_{k+1} = \big((W_k + C_k) - \theta_k\big)^+, \qquad I_{k+1} = \big(\theta_k - (W_k + C_k)\big)^+,$$

where $(x)^+ = \max(x, 0)$.

• Note, formally the variables Wk and Ik depend on θk−1.

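[Figure: timeline from τk to τk+1 showing the waiting time Wk, the consulting time Ck, the inter-appointment time θk, and the resulting waiting time Wk+1.]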

14


Mathematics (cont’d)

• Given the control variables θk, k = 1, 2, . . . , d − 1, the output variable is

$$Y(\theta) = Y(\theta_1, \ldots, \theta_{d-1}) = \sum_{k=1}^{d} (W_k + I_k).$$

• Objective:

$$\min_{\theta} E[Y(\theta)].$$

• Today, set some policy θ = (θ1, . . . , θd−1), then estimate E[Y (θ)].
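As a check on the recursion, the following Python sketch applies the Lindley equations to the five-patient illustration (consulting times 20, 5, 10, 25, 10 minutes and appointments every 15 minutes). The function name `lindley` is hypothetical; idle times are stored for k = 1, . . . , d with I1 = 0.

```python
def lindley(consult_times, inter_appointment_times):
    """Return the waiting times W_1..W_d and idle times I_1..I_d."""
    d = len(consult_times)
    waiting = [0.0] * d  # W_1 = 0
    idle = [0.0] * d     # I_1 = 0
    for k in range(d - 1):
        slack = waiting[k] + consult_times[k] - inter_appointment_times[k]
        waiting[k + 1] = max(slack, 0.0)   # W_{k+1} = (W_k + C_k - theta_k)^+
        idle[k + 1] = max(-slack, 0.0)     # I_{k+1} = (theta_k - (W_k + C_k))^+
    return waiting, idle


consult = [20, 5, 10, 25, 10]   # consulting times from the illustration
theta = [15, 15, 15, 15]        # inter-appointment times
waiting, idle = lindley(consult, theta)
total = sum(waiting) + sum(idle)  # one realisation of Y(theta)
print(waiting, idle, total)
```

The run reproduces the 15 minutes of waiting and 10 minutes of idling from the illustration, so this realisation of Y(θ) equals 25.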

15


The Statistics of Stochastic Simulation

• The purpose of simulation is to estimate a performance measure E[Y] for some random output variable Y.

• For ease of notation, the θ-dependence is omitted.

• Denote J = E[Y] and σ² = Var[Y], both unknown.

• Let Y1, Y2, . . . be i.i.d. replications of Y.

• Define for n ∈ N

$$\bar{Y}_n = \frac{1}{n}\sum_{i=1}^{n} Y_i, \qquad S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y}_n)^2.$$

These are called the sample average and the sample variance, based on sample size n.

16


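The Statistics (cont’d)

Then

$$E[\bar{Y}_n] = J \tag{1}$$

$$\mathrm{Var}[\bar{Y}_n] = \frac{1}{n}\sigma^2 \tag{2}$$

$$P\Big(\lim_{n\to\infty} \bar{Y}_n = J\Big) = 1 \tag{3}$$

$$E[S_n^2] = \sigma^2 \tag{4}$$

$$\lim_{n\to\infty} P\big(|S_n^2 - \sigma^2| > \epsilon\big) = 0 \tag{5}$$

$$\lim_{n\to\infty} P\Big(\frac{\bar{Y}_n - J}{\sigma/\sqrt{n}} \le x\Big) = \Phi(x) \tag{6}$$

$$\text{if } \frac{\bar{Y}_n - J}{\sigma/\sqrt{n}} \overset{D}{=} N(0,1) \text{ then } P\Big(\frac{\bar{Y}_n - J}{S_n/\sqrt{n}} \le x\Big) = T_{n-1}(x) \tag{7}$$

$$\lim_{n\to\infty} P\Big(\frac{\bar{Y}_n - J}{S_n/\sqrt{n}} \le x\Big) = \Phi(x) \tag{8}$$

where Φ(·) is the standard normal cumulative distribution function, and T_{n−1}(·) is the cumulative distribution function of the standard Student's t-distribution with n − 1 degrees of freedom.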

17


The Statistics (cont’d)

• Equation (1) says that the sample average $\bar{Y}_n$ is an unbiased estimator of J (for any n).

• Equation (3) is the (strong) law of large numbers, which says that almost all realisations of the sample average are close to the unknown J (for large n).

• Equation (6) is the central limit theorem, which says that the sample average is approximately normally distributed (for large n). Written differently,

$$\sqrt{n}\,(\bar{Y}_n - J) \xrightarrow{D} N(0, \sigma^2).$$

• Equation (4) says that the sample variance $S_n^2$ is an unbiased estimator of σ².

• Equation (5) says that the sample variance $S_n^2$ is a consistent estimator of σ² (it converges to σ² in probability).

• Equation (8) follows from (5) and (6). It says that the statistic $\sqrt{n}(\bar{Y}_n - J)/S_n$ is asymptotically normally distributed. Also written as

$$\sqrt{n}\,(\bar{Y}_n - J)/S_n \xrightarrow{D} N(0, 1).$$

18


Performance of the Estimator

When you have run your simulations and computed the (sample average) estimator, you compute and report one or more of the following performance measures.

• standard error: $SE = \sqrt{\mathrm{Var}[\bar{Y}_n]}$

• relative error: $RE = SE / E[\bar{Y}_n]$

• 100(1 − α)% confidence interval: $(\bar{Y}_n - z_{1-\alpha/2}\, SE,\ \bar{Y}_n + z_{1-\alpha/2}\, SE)$

• work-normalized variance: $\mathrm{Var}[\bar{Y}_n] \times T[\bar{Y}_n]$

• coverage: see next slide

where $z_p$ is the p-th quantile of the standard normal distribution, i.e., $\Phi(z_p) = P(N(0,1) \le z_p) = p$, and T[X] is the computer time needed to compute X.

Assuming that $T[\frac{1}{n}\sum_{i=1}^{n} Y_i] \approx n\,T[Y]$, the work-normalized variance is approximately $(\sigma^2/n) \cdot n\,T[Y] = \sigma^2\, T[Y]$, which does not depend on the sample size!

19


Coverage

• The confidence interval is based on the central limit theorem:

$$P\Big(|\bar{Y}_n - J| \le z_{1-\alpha/2}\, \underbrace{\sigma/\sqrt{n}}_{=SE}\Big) \approx 1 - \alpha.$$

• The interpretation is: if you were to generate many independent estimates (each based on sample size n) and calculate the associated confidence intervals, then about 100(1 − α)% of these confidence intervals would cover the true value.

• Repeat the simulation experiment m times, each based on sample size n. The coverage is the fraction of the associated confidence intervals covering the true value.
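The coverage experiment can be sketched as follows. The setting is an assumption made for illustration: Y is exponential with mean 1, so that both J = 1 and σ = 1 are known and the interval with SE = σ/√n can be formed exactly as above. The names `confidence_interval` and `coverage`, and the values m = 1000 and n = 200, are hypothetical choices.

```python
import math
import random
from statistics import NormalDist, mean


def confidence_interval(sample, sigma, alpha=0.05):
    """CLT-based interval with known sigma: mean -/+ z_{1-alpha/2} * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = sigma / math.sqrt(len(sample))
    centre = mean(sample)
    return centre - z * se, centre + z * se


def coverage(m, n, rng, alpha=0.05):
    """Fraction of m independent intervals (sample size n each) covering J = 1."""
    hits = 0
    for _ in range(m):
        sample = [rng.expovariate(1.0) for _ in range(n)]  # Y ~ Exp(1): J = 1, sigma = 1
        low, high = confidence_interval(sample, sigma=1.0, alpha=alpha)
        hits += low <= 1.0 <= high
    return hits / m


rng = random.Random(2024)
cov = coverage(m=1000, n=200, rng=rng)
print(f"estimated coverage: {cov:.3f}")
```

With α = 0.05 the printed fraction should be in the neighbourhood of 0.95; a markedly lower value would indicate that n is too small for the normal approximation.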

20


Estimated Performances

Note that the performance measures on slide 19 assume that the mean and variance of the estimator are known.

However, these are estimated. Hence we actually get estimated performance measures:

• standard error: $SE = S/\sqrt{n}$

• relative error: $RE = SE/\bar{Y}_n$

• 100(1 − α)% confidence interval: $(\bar{Y}_n - t_{n-1,1-\alpha/2}\, SE,\ \bar{Y}_n + t_{n-1,1-\alpha/2}\, SE)$

• work-normalized variance: $(S^2/n) \times T[\bar{Y}_n]$

where $t_{n-1,p}$ is the p-th quantile of the t-distribution with n − 1 degrees of freedom.

21


The Monte-Carlo Algorithm

Summary. Suppose we wish to compute a performance measure J = E[Y], where the output variable Y = h(X) can be calculated as a function h of a random input vector X. Suppose σ² = Var[Y] is also unknown.

Monte Carlo algorithm

1. Repeat for i = 1, . . . , n:

(i). Generate $X_i$ independently of previous runs.

(ii). Compute output $Y_i = h(X_i)$.

2. Compute the sample average estimator for J: $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$.

3. Compute the sample variance estimator for σ²: $S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2$.

4. Report the estimate, confidence interval, (estimated) standard error and/or relative error, and work-normalized variance.
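A minimal Python sketch of the four steps, under stated assumptions: the input is a single U(0, 1) number and h(x) = x², chosen only because J = 1/3 is then known; the function name `monte_carlo` is hypothetical. The confidence interval uses the normal quantile from the standard library instead of the t-quantile, which is a reasonable approximation only for large n, and the timing needed for the work-normalized variance is left out.

```python
import math
import random
from statistics import NormalDist


def monte_carlo(h, generate_input, n, rng, alpha=0.05):
    """Estimate J = E[h(X)] from n independent replications."""
    outputs = [h(generate_input(rng)) for _ in range(n)]      # step 1
    y_bar = sum(outputs) / n                                  # step 2
    s2 = sum((y - y_bar) ** 2 for y in outputs) / (n - 1)     # step 3
    se = math.sqrt(s2 / n)
    q = NormalDist().inv_cdf(1 - alpha / 2)  # normal approximation to t_{n-1}
    return {                                                  # step 4
        "estimate": y_bar,
        "sample_variance": s2,
        "standard_error": se,
        "relative_error": se / y_bar,
        "confidence_interval": (y_bar - q * se, y_bar + q * se),
    }


rng = random.Random(7)
report = monte_carlo(h=lambda x: x * x,
                     generate_input=lambda r: r.random(),
                     n=10_000, rng=rng)
for name, value in report.items():
    print(f"{name:>20}: {value}")
```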

22


Relative Error

The relative error is used as a stopping criterion: keep generating new samples until RE < ε.

Then the confidence interval has relative width $2\,t_{n-1,1-\alpha/2}\,\epsilon$.

For instance, when the objective is to obtain a 95% confidence interval with relative width 10% (thus 5% on each side), you simulate until RE ≈ 0.025 = 2.5%.

Example. Python code for the appointment scheduling application is available in Canvas. There are 20 patients per day, all with exponential consulting times with mean 5. The control variables are set to θk = E[C] = 5. A sample size of n = 1000 for estimating J gave

estimate       : 180.831288
standard error : 4.589540
relative error : 2.538023 %
95% cfi        : [ 171.835791, 189.826786 ]
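The stopping rule can be sketched as follows. This is not the Canvas code, only a hypothetical reconstruction of the same setting (20 patients, exponential consulting times with mean 5, θk = 5) based on the Lindley equations; the names `simulate_day` and `simulate_until`, the seed, and the minimum sample size of 100 are assumptions.

```python
import math
import random


def simulate_day(d, mean_consult, theta, rng):
    """One replication of Y = sum_k (W_k + I_k) via the Lindley recursion."""
    waiting, total = 0.0, 0.0  # W_1 = I_1 = 0
    for _ in range(d - 1):
        slack = waiting + rng.expovariate(1.0 / mean_consult) - theta
        waiting, idle = max(slack, 0.0), max(-slack, 0.0)
        total += waiting + idle
    return total


def simulate_until(eps, rng, d=20, mean_consult=5.0, theta=5.0,
                   n_min=100, n_max=1_000_000):
    """Add replications until the estimated relative error drops below eps."""
    n, total, total_sq = 0, 0.0, 0.0
    y_bar, rel_err = float("nan"), float("inf")
    while n < n_max:
        y = simulate_day(d, mean_consult, theta, rng)
        n += 1
        total += y
        total_sq += y * y
        if n >= n_min:  # avoid stopping on a very small, unreliable sample
            y_bar = total / n
            s2 = (total_sq - n * y_bar ** 2) / (n - 1)
            rel_err = math.sqrt(s2 / n) / y_bar
            if rel_err < eps:
                break
    return n, y_bar, rel_err


n, estimate, rel_err = simulate_until(eps=0.025, rng=random.Random(1))
print(f"n = {n}, estimate = {estimate:.3f}, relative error = {100 * rel_err:.3f} %")
```

The required sample size is itself random here, since it depends on the realised sample variance.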

23


Generating Random Variates

24


The Kernel of Simulation

You need to generate independent observations of random variables.

For instance, $X \sim \mathrm{Exp}(\lambda)$ with cdf

$$F(x) = P(X \le x) = 1 - e^{-\lambda x}, \quad x \ge 0,$$

for some given parameter λ.

Suppose that your simulation algorithm produces a stream of numbers X1, X2, . . ..

The first requirement: these numbers should be distributed according to F(·). For this, there are statistical tests such as chi-squared, Kolmogorov-Smirnov, etc. Visual (p-p, q-q) or descriptive (moments, quantiles) checks could also be used.

The second requirement: independence of X1, X2, . . .. Tests have been developed for this as well.

Consult textbooks on simulation.

25


The Uniform Distribution

All simulation algorithms for generating random variates are mappings of one or more uniform random numbers U1, U2, . . ..

For instance, the exponentially distributed X of the previous slide is obtained by $X = -\ln(1 - U)/\lambda$.

Hence, it suffices to have an algorithm for generating uniform random numbers that satisfies the distributional and independence requirements.

Such an algorithm is called a Random Number Generator (RNG).

What is left is to prove that the outcome X of your algorithm is a perfect (or unbiased) sample of the required distribution.

For instance: if $U \sim U(0, 1)$, then for x ≥ 0

$$P\big(-\ln(1 - U)/\lambda \le x\big) = P\big(U \le 1 - e^{-\lambda x}\big) = 1 - e^{-\lambda x}.$$
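The mapping X = −ln(1 − U)/λ can be tried out directly. The sketch below compares the empirical cdf of the generated numbers with 1 − e^{−λx} at a single point; the name `exponential_variate`, the rate λ = 2, the check point x = 0.5, and the sample size are illustrative assumptions.

```python
import math
import random


def exponential_variate(lam, rng):
    """Map one uniform random number to an Exp(lam) variate."""
    u = rng.random()                  # U in [0, 1)
    return -math.log(1.0 - u) / lam   # 1 - u lies in (0, 1], so the log is finite


rng = random.Random(42)
lam = 2.0
sample = [exponential_variate(lam, rng) for _ in range(100_000)]

sample_mean = sum(sample) / len(sample)              # should be near 1 / lam
x = 0.5
empirical_cdf = sum(s <= x for s in sample) / len(sample)
exact_cdf = 1.0 - math.exp(-lam * x)
print(sample_mean, empirical_cdf, exact_cdf)
```

A single-point comparison is only a sanity check, not a substitute for the statistical tests mentioned earlier.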

26


From Random Numbers to Processes

[Figure: pipeline from random numbers, via distributions (plot of φ(x) against x), to simulation.]

27


Random Number Generator (RNG)

Definition. An RNG is an algorithm, implemented as a computer program, that produces finite sequences of numbers U1, U2, . . . , Un that ‘look like’ random numbers between 0 and 1.

[Figure: the integers 0 to 15 arranged on a circle, and a shift register holding the bits 1 0 1 0 0.]

LCG produces: 10, 13, 8, 11, 6, 9, 4, 7, 2, 5, 0, 3, 14, 1, 12, 15, 10, 13, 8, . . .

LFSRG produces: 10, 1, 2, 12, 15, 8, 13, 13, 4, 2, 5, 9, 15, 1, 11, 10, 8, 4, 11, 3, 14, 3, 7, 5, 0, 9, 6, 7, 12, 6, 14, 10, 1, 2, . . .

28


Requirements for RNG

1. The generated numbers should be i.i.d.

2. The generated sequence should be replicable.

3. Preferably, there should be no cycling (periodicity), but if there is cycling (as is most often the case), the period should be long.

4. Generation should be fast and memory efficient.

5. Ability of producing separate streams.

6. Portability: same sequence on other computers or operating systems.

A newly developed RNG is tested empirically and theoretically for these requirements.

29


The Period of Generators

Definition. The period length of an RNG is the number of values generated before the sequence starts repeating.

See slide 28 for toy examples.

• Classical RNGs (usually LCG-type, see next slide) had small periods ≈ 2^31 ≈ 2 · 10^9.

• Modern RNGs are generalized LCGs, e.g. MRG32k3a with period ≈ 3 · 10^57, and generalized LFSRGs, e.g. the Mersenne Twister with period 2^19937 − 1 ≈ 10^6001.

• The latter is the default in languages such as R, Matlab/Octave, Python, and Ox.

30


Linear Congruential Generators

Definition. A linear congruential generator (LCG) is defined by the recursion

$$X_{i+1} = (a X_i + c) \bmod m \quad (i = 0, 1, \ldots).$$

• An LCG produces integers in {0, 1, . . . , m − 1}, which are converted to pseudorandom numbers in [0, 1) by

$$U_i = X_i / m.$$

• Created by D.H. Lehmer 1951.
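A short Python sketch of the recursion. The parameters a = 9, c = 3, m = 16 and the seed X0 = 10 are not stated in the slides; they are an inference, chosen because they reproduce the toy LCG sequence 10, 13, 8, 11, . . . listed with the RNG definition. The generator name `lcg` is hypothetical, and yielding the seed as the first value is a design choice made here.

```python
from itertools import islice


def lcg(a, c, m, seed):
    """Yield X_0, X_1, X_2, ... with X_{i+1} = (a * X_i + c) mod m."""
    x = seed
    while True:
        yield x
        x = (a * x + c) % m


# Toy parameters (assumed, see the text above).
a, c, m, seed = 9, 3, 16, 10
integers = list(islice(lcg(a, c, m, seed), 19))
uniforms = [x / m for x in integers]  # U_i = X_i / m lies in [0, 1)
print(integers)
```

With these parameters every value in {0, . . . , 15} appears once before the sequence returns to 10, so the period equals m = 16.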

31


Non-Uniform Random Variates

Problem. Given a random variable X and its pdf f and/or its cdf F, how to generate a realisation x from its domain?

Algorithms:

• General: Inverse transform; Accept-reject.

• Special: Alias; Ratio-of-Uniforms, ...

• Ad-hoc for specific distributions (e.g. Ziggurat for standard normals).

32


Inverse Transform Method

Algorithm

1. Generate a uniform random number U from U(0, 1).

2. Return $X = F^{-1}(U)$.

It is easy to prove that this method is perfect.
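One way to see this, under the simplifying assumption that F is continuous and strictly increasing so that $F^{-1}$ is an ordinary inverse: since $P(U \le u) = u$ for $u \in [0, 1]$,

$$P(X \le x) = P\big(F^{-1}(U) \le x\big) = P\big(U \le F(x)\big) = F(x).$$

For a general cdf the same argument goes through with the generalized inverse $F^{-1}(u) = \inf\{x : F(x) \ge u\}$.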

33


Examples Inverse Transform Method

(i) Continuous distributions:

• Examples: Exponential (see slide 26), Weibull (exercise), Pareto (exercise).

• Uniform (a, b): X = a + (b − a)U.

(ii) Discrete distributions:

• Examples: Bernoulli (exercise), Binomial (exercise), Geometric (exercise), Poisson (exercise).

• Uniform {a, a + 1, . . . , b}: X = a + floor((b − a + 1)U).
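The two uniform formulas translate directly into Python. The function names, the interval (2, 5), and the die example {1, . . . , 6} are illustrative assumptions.

```python
import math
import random
from collections import Counter


def uniform_continuous(a, b, rng):
    """X = a + (b - a) U for a continuous uniform on (a, b)."""
    return a + (b - a) * rng.random()


def uniform_discrete(a, b, rng):
    """X = a + floor((b - a + 1) U) for a uniform on {a, a + 1, ..., b}."""
    return a + math.floor((b - a + 1) * rng.random())


rng = random.Random(3)
reals = [uniform_continuous(2.0, 5.0, rng) for _ in range(10_000)]
rolls = [uniform_discrete(1, 6, rng) for _ in range(60_000)]  # a fair die
frequencies = {face: count / len(rolls) for face, count in Counter(rolls).items()}
print(min(reals), max(reals), sorted(frequencies.items()))
```

Because U lies in [0, 1), the floor takes the values 0, . . . , b − a, each with probability 1/(b − a + 1).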

34


Accept-Reject Method

We describe this method for continuous random variables only.

Theorem. Let X be a continuous rv with cdf F and density f. Suppose that we can find a density g, say of a random variable Y, and a finite constant c such that

$$f(x) \le c\,g(x) \quad \text{for all } x \in \mathbb{R}.$$

Let U be a U(0, 1) random variable independent of Y. Then, by conditioning,

$$Y \,\Big|\, \Big(U \le \frac{f(Y)}{c\,g(Y)}\Big) \overset{D}{=} X.$$

Algorithm

1. Generate Y using g;

2. Generate U from U(0, 1);

3. If U ≤ f(Y)/(c g(Y)), return X = Y; else go to step 1.

35


Illustration

• Gamma pdf f with scale parameter λ = 0.8 and shape parameter α = 2.5.

• A-R with exponential g with scale parameter µ.

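• The number of iterations until acceptance is a geometric random variable with mean c.

• The criterion for choosing the majorant is minimal c.

• Example: for a Gamma density with shape α > 1 and λ = 1, the optimal exponential majorant has µ = 1/α, reading µ as the rate of g (for a general rate λ this becomes µ = λ/α).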
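A sketch of accept-reject for the Gamma illustration, under explicit assumptions: λ = 0.8 is read as the rate of the Gamma density, g is exponential with rate µ = λ/α (the choice that minimizes c under this reading), and the constant c = α^α e^{−(α−1)}/Γ(α) is the resulting maximum of f/g. The function names are hypothetical.

```python
import math
import random


def gamma_pdf(x, alpha, lam):
    """Gamma density with shape alpha and rate lam (assumed parametrization)."""
    return lam ** alpha * x ** (alpha - 1) * math.exp(-lam * x) / math.gamma(alpha)


def accept_reject_gamma(alpha, lam, rng):
    """Return (X, number of iterations) using an exponential majorant."""
    mu = lam / alpha                                                   # rate of g
    c = alpha ** alpha * math.exp(-(alpha - 1)) / math.gamma(alpha)    # sup f / g
    iterations = 0
    while True:
        iterations += 1
        y = rng.expovariate(mu)           # step 1: generate Y using g
        u = rng.random()                  # step 2: generate U from U(0, 1)
        g_y = mu * math.exp(-mu * y)
        if u <= gamma_pdf(y, alpha, lam) / (c * g_y):  # step 3: accept or retry
            return y, iterations


alpha, lam = 2.5, 0.8
rng = random.Random(1)
draws = [accept_reject_gamma(alpha, lam, rng) for _ in range(20_000)]
mean_x = sum(x for x, _ in draws) / len(draws)           # compare with alpha / lam
mean_iterations = sum(k for _, k in draws) / len(draws)  # compare with c
c = alpha ** alpha * math.exp(-(alpha - 1)) / math.gamma(alpha)
print(mean_x, alpha / lam, mean_iterations, c)
```

Under these assumptions c is roughly 1.66, so on average fewer than two proposals are needed per accepted variate.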

36


Discrete Event Simulation

37


Clocks and Events

Details on board!

38
