Statistical Methods Bayesian methods 4 Daniel Thorburn Stockholm University 2012-04-12


Page 1

Statistical Methods Bayesian methods 4

Daniel Thorburn

Stockholm University

2012-04-12

Page 2

2

Outline

14. MCMC – an introduction: Gibbs’ sampling
15. Exercises in subjective probability
16. Further examples of Gibbs’ sampling
17. Dynamic models (or time series analysis in a changing world)

Page 3

14. MCMC – an introduction: Gibbs’ sampling

Page 4

Simulate from posterior and predictive distributions

• In some cases there is no closed expression for the posterior distribution. In most of these cases it is nevertheless possible to simulate from the posterior.

• Last time we illustrated this in a simple situation (where it was also possible to get the posterior directly):

• (Y, X) is bivariate normal N((0,0), 1, 1, ρ)

Page 5

Standard way of simulating a multivariate normal (p-dimensional)
• For comparison. Variance Σ.
• Find R, the square root of Σ:
– e.g. diagonalise Σ = ADA’, where D is diagonal and A orthonormal
– Let root(D) be a diagonal matrix with the diagonal consisting of the positive roots of the elements in D
– Define A·root(D)·A’ = root(Σ) = R
– Check that RR’ = Σ
• Construct a vector Z of p iid N(0,1)
• Now RZ is multivariate normal with variance Σ
– since E(RZ(RZ)’) = R·E(ZZ’)·R’ = RR’ = Σ
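The recipe above is a few lines of NumPy. This is a sketch under assumptions: the 2×2 covariance matrix here is an illustrative example, not one from the slides (in practice a Cholesky factor is the cheaper square root, but the eigendecomposition route matches the slide).

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])          # illustrative covariance, p = 2

# Diagonalise Sigma = A D A' (eigh works because Sigma is symmetric)
D, A = np.linalg.eigh(Sigma)

# R = A root(D) A', so that R R' = Sigma
R = A @ np.diag(np.sqrt(D)) @ A.T
assert np.allclose(R @ R.T, Sigma)      # "Check that RR' = Sigma"

# Z: columns of iid N(0,1); then R Z is N(0, Sigma)
Z = rng.standard_normal((2, 100_000))
X = R @ Z
print(np.cov(X))                        # close to Sigma
```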

Page 6

With MCMC
• Instead of describing the distribution with the bivariate normal N(0,0,1,1,ρ), we describe it by the one-dimensional conditional distributions
• Y | X = x ~ Normal(ρx, 1−ρ²)
• X | Y = y ~ Normal(ρy, 1−ρ²)
• It may be shown that the pair of conditional distributions together describe the joint distribution completely

Page 7

Algorithm

• Starting value: anything, e.g. x0 = 0
• Take y1 from the conditional distribution of Y1 given that X1 = x0:
y1 = Normalrand(ρx0, 1−ρ²)
• Take x1 from the conditional distribution of X1 given that Y1 = y1:
x1 = Normalrand(ρy1, 1−ρ²)
• Continue like this.
• Take yi+1 from the conditional distribution of Yi+1 given that Xi+1 = xi:
yi+1 = Normalrand(ρxi, 1−ρ²)
• Take xi+1 from the conditional distribution of Xi+1 given that Yi+1 = yi+1:
xi+1 = Normalrand(ρyi+1, 1−ρ²)
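The loop above can be sketched directly in code. Assumptions: ρ = 0.8 and the iteration count are illustrative choices, and NumPy's `normal` takes a standard deviation, so the variance 1−ρ² enters as a square root.

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n_iter, burn_in = 0.8, 10_000, 100

x = 0.0                      # starting value x0 = 0 (anything works)
draws = []
for _ in range(n_iter):
    # Y | X = x  ~  Normal(rho*x, 1 - rho^2)
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    # X | Y = y  ~  Normal(rho*y, 1 - rho^2)
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    draws.append((y, x))

draws = np.array(draws)[burn_in:]        # drop the burn-in period
print(draws.mean(axis=0))                # close to (0, 0)
print(np.corrcoef(draws.T)[0, 1])        # close to rho
```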

Page 8

Algorithm
• The pairs (Y1,X1), (Y2,X2), (Y3,X3), … form a Markov chain, which converges in distribution to N((0,0), 1, 1, ρ) (convergence is exponentially fast, but slow for |ρ| close to one)

• Thus, after a ”burn-in” period (say B = 100), the pairs will be a sequence of (dependent) random variables from the correct distribution N(0,0,1,1,ρ). (See Excel sheet.)

• Histograms based on this sequence (Y101,X101), (Y102,X102), (Y103,X103), … describe the marginal distributions. In this way one can also construct scatter plots or describe the multivariate densities in other ways. You need more observations than with independent observations, but just let the computer work and you can get arbitrarily good precision.

• This was an example where only one-dimensional random variables were drawn but used to describe a multivariate distribution. More generally, lower-dimensional conditional distributions may be used to describe higher-dimensional situations.

• This was a simple example of an MCMC technique called Gibbs’ sampling. Usually burn-ins of more than 100 are used and the chain is run for at least 10 000 iterations (computer time is cheap). Special techniques are used to speed up the convergence. (For more details, see last time.)

Page 9

Next example
• Let 1, 8, 7, 9, 2, 5, 8, 9, 5, 6 be observations from a normal distribution with unknown mean μ and unknown variance σ². (n = 10, x̄ = 6, and Σx² = 430)

• The priors:
– the mean μ was guessed to be 4.5, with a precision corresponding to 4 measurements, and
– the variance σ² was guessed to be 6, with a precision corresponding to 5 d.f.

• The conditional posteriors are:
– μ given X and σ² is normal with mean 78/14 = 5.571 and variance σ²/14
– σ² given X and μ is inverse gamma with parameters 15 and (5·6 + 430 − 10·6² + 10(μ−6)² + 4(μ−4.5)²)/15 (d.f. and mean)

• We may simulate from these two distributions in order to get a sequence of parameter draws from their distribution given the data.

Page 10

• Start value: anything, say σ² = 6

• Given σ² = 6 we know that μ ~ Normal((60+18)/14, 6/14) = Normal(5.571, 0.429)
– Take a random number from this distribution
– The first random number in the first pair is 5.571 + 0.654·(−0.391) = 5.314

• Given μ = 5.314 we know that σ² is Inverse Gamma(15/2, 2/(5·6 + 430 − 10·6² + 10(μ−6)² + 4(μ−4.5)²)) = Γ⁻¹(7.5, 0.0197)
– Take a random number from this distribution
– A random number from Γ(7.5, 0.0197) = 0.148·χ²(15)/15 is 0.137. The second component of the first pair is thus σ² = 1/0.137 = 7.30.

• Given σ² = 7.30 we know that μ ~ Normal(5.571, 7.3/14).
– Take a random number from this distribution
– The first random number in the second pair is 5.571 + 0.722·1.324 = 6.527

• …
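The full iteration is a few lines of code. This is a sketch using the slides' data and priors; the seed and iteration count are arbitrary choices, and the inverse-gamma draw is implemented as S/χ²₁₅, matching the slides' "d.f. and mean" parametrisation.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.array([1, 8, 7, 9, 2, 5, 8, 9, 5, 6])
n, xbar = len(x), x.mean()                  # n = 10, xbar = 6
m0, n0 = 4.5, 4                             # prior mean, prior "measurements"
s0, nu0 = 6.0, 5                            # prior variance guess, prior d.f.

sigma2 = 6.0                                # start value, as on the slide
mus, sig2s = [], []
for _ in range(5_000):
    # mu | sigma^2 ~ Normal((n0*m0 + n*xbar)/(n0 + n), sigma^2/(n0 + n))
    mu = rng.normal((n0*m0 + n*xbar) / (n0 + n), np.sqrt(sigma2 / (n0 + n)))
    # sigma^2 | mu: scaled inverse chi-square with nu0 + n = 15 d.f.
    S = nu0*s0 + np.sum((x - mu)**2) + n0*(mu - m0)**2
    sigma2 = S / rng.chisquare(nu0 + n)
    mus.append(mu)
    sig2s.append(sigma2)

mus = np.array(mus[100:])                   # drop a burn-in of 100
sig2s = np.array(sig2s[100:])
print(np.percentile(mus, [2.5, 97.5]))      # roughly the slides' 4.0 to 7.0
print(np.percentile(sig2s, [2.5, 97.5]))    # roughly 3 to 18
```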

Page 11

First ten simulated values

Page 12

First 500 simulated values

Note that the simulated means look fairly independent but have skewed distributions. The two diagrams to the left are often used as diagnostics.

Page 13

Frequency polygon for the posterior distribution of the mean (based on 500 Gibbs iterations). A 95 % interval is 4.02 to 6.95 (usually longer simulations are done).

Page 14

Frequency polygon for the posterior distribution of the variance (based on 500 Gibbs iterations).

A 95 % interval is 3.32 to 18.04


Page 15

15. Exercises in subjective probability

Page 16

Results of your probability assessments

Assessed      Number of cases        Proportion
probability   occurred  not occurred
0.00–0.05         7         12       0.368
0.05–0.25         4         10       0.286
0.25–0.45         4         13       0.235
0.45–0.55        29         34       0.460
0.55–0.75        23          6       0.793
0.75–0.85         6          4       0.600
0.95–1.00         0          3       0.000


Based on your 155 assessments (13 persons, 13 questions, and a few item nonresponses).

If your assessments had been correct, the curve should have gone from corner (0,0) to corner (1,1), almost like a straight line.
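The proportions in the table above are simply occurred/(occurred + did not occur) within each assessed-probability bin; a small sketch of that calculation using the table's counts:

```python
bins     = ["0-0.05", "0.05-0.25", "0.25-0.45", "0.45-0.55",
            "0.55-0.75", "0.75-0.85", "0.95-1.00"]
occurred = [7, 4, 4, 29, 23, 6, 0]
not_occ  = [12, 10, 13, 34, 6, 4, 3]

# Observed proportion per bin; a well-calibrated assessor's proportions
# would track the bin midpoints.
for b, o, m in zip(bins, occurred, not_occ):
    print(f"{b:>9}: {o / (o + m):.3f}")
```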

Page 17

Correct answers

• My height 179 cm (3/7)
• Length of the Nile 6 670 km (3/6)
• Iron 7.86 g/cm³ (2/6)
• Mexico City 21.2 million inhabitants (1/6)
• Gotland area 2 994 km² (1/4)
• Countries in Africa 54 sovereign states (6/7)
• (Average) distance to the sun 8.19 light minutes (150 million km) (2/5)
• Students at Stockholm University 64 000 (0/6)
• Temperature on Easter day, 6 °C (4/6)

Page 18

Results of your probability assessments

• Only seven persons handed in. Their joint coverage rate was only 22 out of 53 answers (41 %), much less than the 95 % which we aimed for.

• Best was Jalnaz with 75 % correct intervals.
• I also measured it with min(error/(interval length/4), 8). (If your estimates had been normal, the expected value of the mean would be 2/π ≈ 0.65.)

• The average was 5.62 (without the minimum function, many far-off estimates would have increased it to 35.7).

• With this measure Heeva was best with 3.49.
• To summarise: you were far too optimistic.
• Your estimates were best on the question about the number of African states, where only one interval out of seven did not cover.
• You can easily train on probability assessment.

Page 19

16. Further examples of Gibbs’ sampling

Page 20

Example continued from last time

• Consider an agricultural study on six piglets with two explanatory variables which are strongly collinear. The piglets stem from three litters.

• Model Yi = a + b·x1i + c·x2i + ηk(i) + εi, where the error terms εi and the litter effects ηk are independent and Normal(0, σ²) resp. Normal(0, τ²)

• (Compared to last time we have added unknown variances.)

Page 21

Data

  y   x1   x2   Litter
 32    1   47   A
141    2   53   B
544    6   90   B
528    6   97   C
336    4   76   C
239    3   65   C

Parameters: a = 47, b = 129, c = −3; ηA, ηB, ηC = 0.87, −0.59, −1.2

Page 22

Priors

• a, b, c are independent normal with means (50, 1, 0) and variances 1000, 1000, 50

• τ² and σ² are independent and Inverse Gamma with parameters (2, 1) resp. (10, 1)

Page 23

• For this problem three conditional distributions will be used in the posterior:

1. For (a, b, c, η1, η2, η3) given Y, σ² and τ²

2. For τ² given Y, a, b, c, η1, η2, η3 and σ², i.e. for τ² given η1, η2 and η3

3. For σ² given Y, a, b, c, η1, η2, η3 and τ², i.e. for σ² given ei = yi − a − b·x1i − c·x2i − ηk(i) for i = 1, …, 6

• These posterior conditionals are obtained

– in exactly the same way as when we did it last time with known variances (normal with mean m + ΣA’(AΣA’)⁻¹(x − Am) and variance Σ − ΣA’(AΣA’)⁻¹AΣ)

– using the theory of conjugate distributions for chi-2 distributions, i.e. Inverse Gamma with parameters 2 + 3/2 and (2 + Σηi²)/3.5

– using the theory of conjugate distributions for chi-2 distributions, i.e. Inverse Gamma with parameters 10 + 6/2 and (10 + Σei²)/13

• One may thus simulate from the posterior distribution by running through these steps iteratively.

Page 24

Predictive distributions
• The only motivation for models is their potential for predicting further observations – a model which cannot predict anything is useless.

• Let us suppose that one is interested in two more piglets, one from a new litter and one from litter one.

• It is easy to get the predictive distributions for them.

• In the simulation, put in one extra step after each iteration:
– Take a random observation from the distribution of the data given a, b, c, η1, η2, η3, σ² and τ².
– After the iterations, plot the distributions.
– E.g. if you want 95 % prediction intervals for those piglets, just take the 2.5 and 97.5 percentiles from the simulated distributions, since all observations are drawn from the same distribution as the true values given the data.

Page 25

Example: Missing data or Item Non-response

• Missing data
– Suppose we should have observed two quantities X and Y for a sample of 100.
– Unfortunately some y-values are missing. We have observed 100 x-values, but 5 y-values are missing due to chance.

• Model Yi = a + b·xi + εi, where the εi are iid normal, i = 1, …, 100
– Put (vague?) priors on a, b and σ².
– Derive the posterior distributions for σ² and for a, b given σ².

Page 26

• Draw a random triplet of parameters from the posterior
– First for σ²
– Then for a and b given σ².

• Draw a random five-tuple from the predictive distribution of the missing y-values given the three parameters and their resp. x-values.

• Repeat B times (e.g. B = 5 or 100). You have now B sets with complete data.

• You may now do the analysis that you intended B times on the sets of completed data

Page 27

• Classical continuation:
– The result will be B estimated values θ*b and B estimated variances Vb, b = 1, …, B
– Estimate θ by the mean Σθ*b/B
– Estimate the variance that you would have got without nonresponse by V1 = ΣVb/B
– Estimate the variance due to imputation by the variance between the B estimates, V2 = Σ(θ*b − θ̄*)²/(B−1)
– The total variance is V = V1 + (B+1)V2/B

• Bayesian continuation:
– For each completed set b, find the posterior of θ.
– Draw one (or some, m) random observation(s) from the posterior.
– The set of B (or mB) random drawings is a sample drawn from the true distribution of θ given the data (and the model).
– Plot the posterior.

• Similar approaches may be used for other types of problems/analyses (e.g. factor analysis or life table models), but with suitable modifications.

Page 28

17. Dynamic models (or time series analysis in a

changing world)

Page 29

AR-process

• Standard classical model: Yt = a + b1·yt−1 + b2·yt−2 + εt, where εt is normal with mean 0 and variance σ². (A second-order AR process.)

• Usually one looks at the economy and analyses the longest series that can be thought of as coming from a stationary economy without any important changes.

• Bayesian model: the parameters a, b1, b2 and σ² are random variables.

• Dynamic model: the parameters also change with time according to some specified model, e.g. at = at−1 + ηt. A model identified ten or twenty years ago is probably not fully adequate today, since the economy/society always changes.

Page 30

A simple dynamic model

• Yt = at + b1t·yt−1 + b2t·yt−2 + εt, where
– (at, b1t, b2t) = (at−1, b1t−1, b2t−1) + ηt
– the ηt are trivariate normal with mean (0,0,0) and variance Ω, and independent for different t
– εt is normal with mean 0 and variance σ², and independent for different t

• The prior is that (a0, b10, b20) has a trivariate normal distribution with mean (α0, β10, β20) and variance T0 (for simplicity we assume that all variances are known)

• Analysis next page – fairly straightforward. Don’t worry about the formulas.

Page 31

• The prior: (a0, b10, b20) has a trivariate normal distribution with mean (α0, β10, β20) and variance T0

• At time 1, (a1, b11, b21) is normal with mean (α0, β10, β20) and variance T0 + Ω before observing y1.

• When y1 has been observed, the distribution of (a1, b11, b21) is updated through Bayes’ theorem, being normal
– with mean (α1, β11, β21) = (α0, β10, β20) + (T0+Ω)X’(X(T0+Ω)X’ + σ²)⁻¹(y1 − X(α0, β10, β20)’)
– and variance T1 = T0 + Ω − (T0+Ω)X’(X(T0+Ω)X’ + σ²)⁻¹X(T0+Ω)
– (formulas which we gave for linear models – but do not worry about them)

• At time 2, (a2, b12, b22) is normal with mean (α1, β11, β21) and variance T1 + Ω before observing y2.

• When y2 has been observed, the distribution of (a2, b12, b22) is updated through Bayes’ theorem, being normal with mean (α2, β12, β22) and variance T2.

• And so on.

Page 32

Prediction
• It is fairly easy to predict future values, taking into account also that the parameters will change in the future.

• It is not so important to use a time period where the parameters are fairly stable – this analysis will take care of that, down-weighting the information from old data points if the parameters change.

• With this model there is no need to make the standard assumption that the process is stationary (eigenvalues less than one). The process may well be explosive for a short period.

• Often observation errors are also added to the model: Zt = Yt + δt = at + b1t·yt−1 + b2t·yt−2 + εt + δt. This analysis is not more difficult in this setting than without the extra term – but impossible to make within the standard ARIMA framework. The updating formula is changed and Yt has to be estimated (its posterior is found).

Page 33

E.g. party preferences

• The following is an analysis of the party preferences (bourgeois alliance vs. red-green block).

• All measurements have varying precision, depending on sample size.

• It is assumed that no other mechanism, such as protest preferences during periods between elections, is at work. The model is a little more complicated due to non-normality (e.g. the sample space is the interval (0,1)). The model also includes both level and trend terms.

Page 34

Party preference studies, Sept 2006 – Sept 2010

Data


Page 35

Forecasts of the election result for the period between the last two elections

Percentage of the red-green coalition. (A new forecast is made after each party preference study. The time scale here is the number of party-preference studies.)


Page 36

Forecast of the election result for the period between the last two elections (probability calculated after each study)


Probability of red-green majority in the Swedish Parliament

Page 37

Thank you for your patience.

Good luck!