
MATLAB TUTORIALS ON STATISTICS, PROBABILITY & RELIABILITY

Table of Contents

1 GENERATION OF PSEUDORANDOM NUMBERS
1.1 Uniformly distributed numbers
1.2 Normally distributed numbers
2 SIMPLE LINEAR REGRESSION
2.1 Scatterplots, covariance, and correlation coefficient
2.2 Ordinary Least Squares Regression
3 THE CENTRAL LIMIT THEOREM (CLT)
4 MULTIPLE LINEAR REGRESSION
4.1 Simple linear regression in matrix form
4.2 Multiple Linear Regression
5 CONFIDENCE INTERVALS
5.1 Sampling distribution of the sample mean (X̄)
5.2 Confidence Intervals for the population mean (μ)
5.2.1 Two-sided CIs for population mean μ when σ is known
5.2.2 Two-sided CIs for population mean μ when σ is unknown
5.3 Confidence intervals for regression coefficients
5.4 Confidence intervals for ŷ and prediction intervals for y
6 MONTE CARLO SIMULATION
6.1 Prerequisites
6.2 Background
6.3 A simple example: estimation of the value of π
7 NONLINEAR REGRESSION
7.1 Nonlinear Transformations
7.2 Polynomial fitting


1 GENERATION OF PSEUDORANDOM NUMBERS

1.1 Uniformly distributed numbers

The command rand generates uniformly distributed pseudorandom numbers:

help rand

Used alone (without an argument), rand generates a single number between 0 and 1 from a uniform distribution:

u=rand

Each time the command is used, a different number will be generated. The "random" numbers generated by Matlab (and others) are actually pseudorandom numbers, as they are computed using a deterministic algorithm. The algorithm, however, is very complicated, and the output does not appear to follow a predictable pattern. For this reason, the output can be treated as random for most practical purposes. The same sequence of numbers will not be generated unless the same starting point is used. This starting point is called the "seed". Each time you start Matlab, the random number generator is initialized to the same seed value. The current seed value can be seen using

s = rand('seed')

By setting a seed value, we ensure that the same results will be produced each time the script is executed. The seed can be set to a value (say, 1234) as follows:

rand('seed',1234)

The purpose here is to make sure that the program starts from the same seed; the value of the seed itself is not important.

The rand command, when used with a single argument, creates a square matrix where each entry is drawn from a uniform distribution between 0 and 1:

n=5;
u=rand(n) % square matrix, size 5 x 5

To generate a row vector of 100 uniformly distributed numbers between 0 and 1 (or the corresponding column vector), use

n=100;
u=rand(1,n) % row vector, size 1 x n
u=rand(n,1) % column vector, size n x 1

A quick way to see the distribution of the numbers is the hist command:

hist(u)

To generate uniformly distributed numbers between x1 and x2, a transformation is needed. For example, to generate uniformly distributed numbers between 10 and 20:

x1=10;
x2=20;
u=x1+(x2-x1)*rand(n,1);
hist(u)

To simulate a die tossing experiment, we need to generate the integers 1 to 6. In this case, the command randi can be used:

n=100;
u= randi(6,1,n); % generate n integers between 1 and 6

A bar plot of the simulated outcomes can be generated as follows:

afreq= hist(u, 1:6); % absolute frequency
bar(1:6, afreq); % bar plot
xlabel('Outcome')
ylabel('Absolute Frequency');
title('Die rolling experiment');

As the number of generated numbers (n) increases, the bar plot becomes more uniform.

1.2 Normally distributed numbers

The command randn generates normally distributed pseudorandom numbers:

help randn


This command has many useful applications, one of which is the generation of Gaussian white noise. To generate a column vector of length 500, use

n=500;
g=randn(n,1) % column vector, size n x 1

The distribution of these numbers can be visualized using the hist command:

hist(g)

The randn command generates numbers from a standard normal distribution (mean = 0, standard deviation = 1). To get normally distributed numbers with mean m and standard deviation s, we use:

m= 12; % mean
s=6; % standard deviation
g=m + s*randn(n,1); % transform
hist(g) % to see the distribution


2 SIMPLE LINEAR REGRESSION

2.1 Scatterplots, covariance, and correlation coefficient

A bivariate scatterplot is a simple plot of one variable (y) against another (x), and is a convenient first step for visualizing the relationship between the two variables.

Assume that we have two variables that are linearly related, except for a Gaussian noise term with mean 0 and standard deviation 1:

𝑦 = 10𝑥 + 3 + 𝑛𝑜𝑖𝑠𝑒

Assuming that the variable x is a linearly spaced row vector of length 50, between 0 and 1, generate

the y vector:

In a bivariate scatterplot (x, y), the point with coordinates (mean(x), mean(y)) is known as the point of averages.

randn('seed',1) % specify a seed (optional)

n=50; % number of observations

x=linspace(0,1,n); % linearly spaced vector of length n

y= 10*x + 3 + randn(1,n);

plot(x,y,'.')

xlabel('x')

ylabel('y')

mx=mean(x);

my=mean(y);

hold on;

plot(mx,my, 'ro', 'markerfacecolor','r')

legend('data', 'point of averages')


Covariance:

The covariance between vectors x and y can be computed in "biased" and "unbiased" versions:

c= mean((x-mx).*(y-my)) % covariance (biased)
n=length(x);
cs= c*n/(n-1) % sample covariance (unbiased)

Correlation coefficient:

The correlation coefficient between two variables is a measure of the linear relationship between them. It can be found as the average of the product of the z-scores of x and y. The "biased" version is

zx=zscore(x,1)
zy=zscore(y,1)
r=mean(zx.*zy)

The correlation coefficient can also be computed from the covariance, as follows:

sx=std(x,1);
sy=std(y,1);
r=c/(sx*sy)

The "unbiased" version (sample correlation coefficient) is computed the same way, except that the flag "1" is replaced by "0".

Add a title that shows the correlation coefficient to the previous plot. For this, we need to convert the numerical value to a string, using the num2str command:

title(['Correlation coefficient=',num2str(r)])


The correlation coefficient is sensitive to outliers. To see this, change the first element of y to 40 and recompute the correlation coefficient:

y(1)=40;
zx=zscore(x,1)
zy=zscore(y,1)
r=mean(zx.*zy)

Notice that a single outlier has significantly reduced the correlation coefficient.

2.2 Ordinary Least Squares Regression

Regression is a way to understand the mathematical relationship between variables. This relationship can then be used to

- describe the linear dependence of one variable on another;
- predict values of one variable from values of another;
- correct for the linear dependence of one variable on another, in order to clarify other features of its variability.

Unlike the correlation coefficient, which measures the strength of a linear relationship, regression focuses on the mathematical form of the relationship.

In simple linear regression, the mathematical problem is as follows. We are given a set of $k$ points $(x_i, y_i)$, $i = 1, 2, \ldots, k$, related through the equation $y_i = b_0 + b_1 x_i + n_i$, where $b_0$ and $b_1$ are constant (unknown) coefficients and $n_i$ is a realization of zero-mean Gaussian noise with variance $\sigma^2$; that is, $n_i \sim N(0, \sigma^2)$. As the noise term $n_i$ is a realization of a random variable, so is $y_i$. Because of the random noise, the coefficients $b_0$ and $b_1$ cannot be determined with certainty. Our goal is to find the best-fit line $\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_i$ minimizing the sum of squared errors

$$S = \sum_{i=1}^{k} (y_i - \hat{y}_i)^2$$

The $\hat{b}_1$ and $\hat{b}_0$ values minimizing $S$ are found by setting $\partial S / \partial \hat{b}_1 = 0$ and $\partial S / \partial \hat{b}_0 = 0$. The result is

$$\hat{b}_1 = \frac{\text{covariance between } x \text{ and } y}{\text{variance of } x}, \qquad \hat{b}_0 = (\text{mean of } y) - \hat{b}_1 (\text{mean of } x)$$


These $\hat{b}_1$ and $\hat{b}_0$ values are the Ordinary Least Squares (OLS) estimates of $b_1$ and $b_0$, respectively. The equation of the regression line (also known as the "best-fit line") is then $\hat{y} = \hat{b}_0 + \hat{b}_1 x$.

Plot the regression line in red, and update the legend and the title:

bh1=c/sx^2; % covariance divided by variance of x
bh0=my-bh1*mx;
yhat=bh0+bh1*x; % regression line
plot(x,yhat,'r')
legend('data', 'point of averages','regression line')
title(['Regression line: yhat=',num2str(bh1),'*x+',num2str(bh0)])

Note that the regression line passes through the point of averages. The equation of the regression line shown in the title should be close to the original equation from which the data was generated: $y = 10x + 3 + noise$.

Because of the noise, the predictions will not exactly coincide with the observations. The residuals $e_i$ are defined as the deviations of each observation from its estimate:

$$e_i = y_i - \hat{y}_i$$

Ideally, the residuals should be more or less symmetrically distributed around zero (have mean ≈ 0):

e=y-yhat; % residuals
figure;
plot(x,e,'.')
me=mean(e) % average residual

In addition, the amount of scatter should not show a systematic increase or decrease with increasing values of x; in other words, the residual scatterplot should be homoscedastic, not heteroscedastic. The variance of the noise can be estimated from the residuals as follows:

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} e_i^2}{n - 2}$$


The $n-2$ in the denominator is known as the "degrees of freedom", and is computed by subtracting the number of estimated parameters ($b_0$ and $b_1$) from the number of observations.

varhnoise=sum(e.^2)/(n-2) % OLS estimator of noise variance

The estimated noise variance for this particular problem should be close to 1, which is the variance of the noise used in generating the data.

The coefficient of determination ($R^2$) is a measure of how well the regression line represents the data. It is defined as

$$R^2 = 1 - \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \quad \text{where } \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$

In simple linear regression, $R^2$ is equal to the square of the correlation coefficient ($r^2$) between x and y. If $r = 0.9$, then $R^2 = r^2 = 0.81$, which means that 81% of the total variation in y can be explained by the linear relationship between x and y. The other 19% of the total variation in y remains unexplained.

R2=1-sum(e.^2)/sum((y-my).^2) % coefficient of determination
r2=r^2 % correlation coefficient squared


3 THE CENTRAL LIMIT THEOREM (CLT)

Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (i.i.d.) random variables with $E(X) = \mu$ and $Var(X) = \sigma^2$, and consider the sample mean

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i.$$

The Central Limit Theorem states that the distribution of the standardized sample mean $\dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ becomes standard normal as $n$ approaches infinity:

$$\text{The Central Limit Theorem:} \quad \lim_{n \to \infty} \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$

In other words, sample means are approximately normally distributed regardless of the shape of the underlying population, provided the sample size is sufficiently large.

The Central Limit Theorem can also be stated as follows:

As $n \to \infty$, the distribution of the sample mean becomes normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$:

$$\lim_{n \to \infty} \bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$

As $n \to \infty$, the distribution of the sample sum becomes normal with mean $n\mu$ and standard deviation $\sqrt{n}\,\sigma$:

$$\lim_{n \to \infty} \sum_{i=1}^{n} X_i \sim N(n\mu, \sqrt{n}\,\sigma)$$

Example: Consider the coin flip experiment with two equally likely outcomes. Let $X$ be the outcome, and assign numerical values to the outcomes as follows: Head = 1, Tail = 0. The outcome has a Bernoulli distribution, $X \sim Bern(p)$. Suppose we simultaneously flip $n$ coins, take the sample sum, and call it $Y$:

$$Y = \sum_{i=1}^{n} X_i$$


We repeat this experiment many times (say, 10000 times), obtaining outcomes $Y_1, Y_2, \ldots, Y_{10000}$. We are interested in the distribution of $Y$. What does it look like?

The distribution of the sample sum is binomial, $Y \sim Bin(n, p)$ with $p = 0.5$. The sample sum has expected value $n\mu = np = 0.5n$ and standard deviation $\sqrt{n}\,\sigma = \sqrt{np(1-p)} = 0.5\sqrt{n}$.

According to the CLT, as $n \to \infty$ the distribution of $Y$ converges to $N(n\mu, \sqrt{n}\,\sigma) = N(0.5n, 0.5\sqrt{n})$.

% Central Limit Theorem

% Demonstration with coins

ncoin = [1 3 5 10 20 50];

nroll=10000; % number of repetitions of the experiment

for i=1:length(ncoin),

ni=ncoin(i);

x=randi([0,1],ni,nroll) ; % coin flip: Head = 1 Tail =0

y=sum(x,1); % sample sum.

edges=min(y):max(y);

af=histc(y,edges); %absolute

rf=af/nroll; % relative

% plot figure

subplot(3,2,i)

stem(edges,rf,'filled')

title(['Number of Coins: n = ',num2str(ni)]);

xlabel('sample sum');

ylabel('rel. freq.');

end


The output of the code is shown below:

[Figure: six stem plots of relative frequency vs. sample sum, one for each Number of Coins n = 1, 3, 5, 10, 20, 50.]


Example: Modify the code used above to show the distribution of the sample sum of

[1,3,5,10,20,50] fair dice, instead of coins.
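One possible modification (a sketch; the key change from the coin version is drawing faces 1 to 6 with randi instead of 0/1 flips):

ndice = [1 3 5 10 20 50];
nroll=10000; % number of repetitions of the experiment
for i=1:length(ndice)
ni=ndice(i);
x=randi([1,6],ni,nroll); % die roll: faces 1 to 6
y=sum(x,1); % sample sum
edges=min(y):max(y);
af=histc(y,edges); % absolute frequency
rf=af/nroll; % relative frequency
subplot(3,2,i)
stem(edges,rf,'filled')
title(['n dice = ',num2str(ni)]);
xlabel('sample sum'); ylabel('rel. freq.');
end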

Example: Consider an unfair die, where $P(1) = P(3) = q$; $P(2) = P(5) = 0$; and $P(4) = P(6) = 3q$. Simulate 2000 tosses of this die, and plot the relative frequency as a stem plot.

[Figure: stem plots of relative frequency vs. sample sum for n dice = 1, 3, 5, 10, 20, 50.]

[Figure: "Quiz part (b): Biased die" - relative frequency of outcomes 1 to 6.]
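One way to simulate the biased die (a sketch with my own variable names; the tutorial leaves the implementation to the reader): since the probabilities must sum to 1, $2q + 6q = 8q = 1$, so $q = 1/8$, and each face can be listed with a multiplicity proportional to its probability.

faces = [1 3 4 4 4 6 6 6]; % 1 and 3 once (prob q = 1/8 each), 4 and 6 three times (3q each)
n = 2000; % number of tosses
u = faces(randi(numel(faces),1,n)); % each list entry is equally likely
rf = histc(u,1:6)/n; % relative frequency of each outcome
stem(1:6,rf,'filled')
xlabel('Outcome'), ylabel('Relative frequency')
title('Quiz part (b): Biased die')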


4 MULTIPLE LINEAR REGRESSION

4.1 Simple linear regression in matrix form

Consider the simple linear regression equation $\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_i$. Note that the same equation can be written as

$$\hat{y}_i = [1 \;\; x_i] \begin{bmatrix} \hat{b}_0 \\ \hat{b}_1 \end{bmatrix}.$$

This means that if the two coefficients are combined into a single column vector $\hat{b} = [\hat{b}_0, \hat{b}_1]^T$, and the independent variable is augmented by adding a 1 to the front, $\check{x}_i = [1 \;\; x_i]$, the $i$th predicted value can be computed as $\hat{y}_i = \check{x}_i \hat{b}$. For the entire set of observations, we can write

$$\hat{Y} = \check{X} \hat{b}$$

where $\hat{Y}$ is a column of predicted values and $\check{X}$ is the design matrix, whose first column consists of ones and whose second column contains the values of the independent variable.

The OLS (ordinary least squares) estimate of the regression coefficients is given by

$$\hat{b} = (\check{X}^T \check{X})^{-1} \check{X}^T Y.$$

Recall the simple linear regression data generated from $y = 10x + 3 + noise$:

n=50;
x=linspace(0,1,n); % linearly spaced vector of length n
y= 10*x + 3 + randn(1,n);
mx=mean(x), my=mean(y), sx=std(x,1);
c= mean((x-mx).*(y-my)) % covariance
bh1=c/sx^2
bh0=my-bh1*mx
yhat=bh0+bh1*x; % regression line
figure;
plot(x,y,'.')
hold on
plot(x,yhat,'r')
xlabel('x'), ylabel('y')
title(['Regression yhat=',num2str(bh1),'*x+',num2str(bh0)])

The same estimates of the regression coefficients can be obtained using the matrix form:

x=x(:); % make x a column
y=y(:); % make y a column
XX=[ones(n,1),x]; % create the design matrix
bh=(XX'*XX)^-1*XX'*y % OLS estimate of b
yhat=XX*bh;
hold on
plot(x,yhat,'g+','linewidth',2)

The $\hat{b}$ vector should contain the previously computed $\hat{b}_0$ and $\hat{b}_1$ values, and the new regression line should coincide with the previous line.

The residuals and the estimated noise variance are computed as

e=y-yhat; % residuals
dof= n-rank(XX); % degrees of freedom
vn=sum(e.^2)/dof % estimated noise variance


Save the code as SimpleRegressionTutorial4.m. This file will be used in future tutorials.

4.2 Multiple Linear Regression

In multiple linear regression, the regression equation is

$$\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_{1i} + \hat{b}_2 x_{2i} + \cdots + \hat{b}_k x_{ki}$$

and each observation is equal to the predicted value plus a residual term $e_i$:

$$y_i = \hat{y}_i + e_i.$$

The matrix-based analysis presented in the previous section is equally applicable to multiple independent variables. For each additional independent variable, another column is added to the design matrix $\check{X}$. With $k$ independent variables, the design matrix contains $k+1$ columns, the first column containing ones. One difficulty with multiple independent variables is that the entire analysis cannot be summarized in a single figure; the residuals need to be plotted against each independent variable separately.

The coefficient of determination ($R^2$) is computed the same way as in the simple linear case:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \quad \text{where } \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$

The $R^2$ value in multiple linear regression is often called the "coefficient of multiple determination."

e=y-yhat; % residuals
R2=1-sum(e.^2)/sum((y-my).^2)
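As an illustrative sketch (the synthetic data and coefficients below are my own, not from the tutorial), a regression with two independent variables uses a three-column design matrix but is otherwise identical:

n=50;
x1=linspace(0,1,n)'; % first independent variable
x2=rand(n,1); % second independent variable (synthetic)
y=3 + 10*x1 - 4*x2 + randn(n,1); % assumed "true" model plus noise
XX=[ones(n,1), x1, x2]; % design matrix: k+1 = 3 columns
bh=(XX'*XX)^-1*XX'*y % OLS estimates of b0, b1, b2
yhat=XX*bh;
e=y-yhat; % residuals
R2=1-sum(e.^2)/sum((y-mean(y)).^2) % coefficient of multiple determination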


5 CONFIDENCE INTERVALS

A confidence interval (CI) is an interval that is likely to contain a parameter of interest. Confidence

intervals can be one-sided or two-sided. A two-sided confidence interval has the following form:

(point estimate) ± (margin of error)

width of the CI: 2 × (margin of error)

The confidence level indicates the proportion of intervals that contain the true value of the parameter,

if the experiment is repeated a large number of times. For example, a 95% confidence level means

that if we repeat the experiment many times and construct the 95% CI for each experiment, we expect

that 95% of these intervals contain the parameter and 5% do not. Note that we cannot say

that the parameter lies in the interval with probability 95%. Once the CI is constructed, the probability

of the parameter lying in the interval is either 0 (if the parameter is outside the interval) or 1 (if the

parameter is inside the interval).

Wrong statement: the parameter lies in the interval with probability 95%.

Correct statement: the parameter lies in the interval with confidence level 95%.

5.1 Sampling distribution of the sample mean (X̄)

Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (i.i.d.) random variables with $E(X) = \mu$ and $Var(X) = \sigma^2$. Consider the expectation and standard deviation of the sample mean $\bar{X}$:

$$\bar{X} = \frac{1}{n}(X_1 + X_2 + X_3 + \cdots + X_n)$$

$$E(\bar{X}) = \frac{1}{n}\big(E(X_1) + E(X_2) + \cdots + E(X_n)\big) = \mu$$

$$Var(\bar{X}) = \frac{1}{n^2}\big(Var(X_1) + Var(X_2) + \cdots + Var(X_n)\big) = \frac{\sigma^2}{n}$$

The standard deviation of the sample mean (also known as the standard error of the mean) is

$$\text{standard error of the mean:} \quad sem = SD(\bar{X}) = \frac{\sigma}{\sqrt{n}}.$$

We can standardize (compute the z-score of) the sample mean:

$$\text{Standardized sample mean:} \quad Z_{\bar{X}} = \frac{\bar{X} - E(\bar{X})}{SD(\bar{X})} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

The Central Limit Theorem (CLT) tells us that the distribution of the standardized sample mean becomes standard normal as $n$ approaches infinity:

$$\lim_{n \to \infty} \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$

If the population is normally distributed, then $Z_{\bar{X}} \sim N(0, 1)$ regardless of the sample size.
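A quick numerical check of the sem formula (my own illustrative snippet, not part of the original tutorial):

mu=12; sigma=10; n=25; % assumed population parameters and sample size
xbar=mean(mu + sigma*randn(n,10000)); % 10000 sample means, each from a sample of size n
std(xbar,1) % empirical standard deviation of the sample mean
sigma/sqrt(n) % theoretical sem; the two values should be close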

5.2 Confidence Intervals for the population mean (μ)

Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (i.i.d.) random variables with $E(X) = \mu$ and $Var(X) = \sigma^2$. We know from the CLT that

$$Z_{\bar{X}} = \frac{\bar{X} - E(\bar{X})}{SD(\bar{X})} \sim N(0,1)$$

provided that $n$ is large. What constitutes a large sample depends on the distribution of $X$; a very rough guideline is $n \geq 30$. If $X$ is normally distributed, then $Z_{\bar{X}} \sim N(0,1)$ regardless of the sample size $n$.

5.2.1 Two-sided CIs for population mean μ when σ is known

A two-sided CI for the population mean $\mu$ has the following form:

$$\text{Two-sided CI for } \mu: \quad \bar{X} \pm z_{cr}\,SD(\bar{X}), \quad \text{where } SD(\bar{X}) = \frac{\sigma}{\sqrt{n}}$$

$$\text{margin of error:} \quad z_{cr}\,SD(\bar{X}) = z_{cr}\frac{\sigma}{\sqrt{n}}$$

where $z_{cr}$ is the critical value of the standard normal distribution, which depends on the confidence level: for $(1-\alpha)\times 100\%$ confidence, $z_{cr} = \Phi^{-1}(1 - \alpha/2)$, where $\Phi$ is the standard normal CDF.


$$P(-z_{cr} \leq Z_{\bar{X}} \leq z_{cr}) = 1 - \alpha$$

For 95% confidence ($\alpha = 0.05$), $z_{cr} = 1.96$.

[Figure: standard normal density; two-sided 95% critical values ±1.96 (tail areas p = 0.025 each) and one-sided critical values -1.6449 and 1.6449 (tail area p = 0.05).]

Example: Generate a single sample of 4 observations from a normal distribution with mean 12 and

standard deviation 10. Pretending that the population mean is unknown, construct a two-sided, 95%

confidence interval for the population mean, and then check whether the confidence interval

captures the population mean.

You can skip the comments after the % sign. The line numbers (e.g. *Line 1*) will be needed when

modifying the code.

Plot the observation points as black dots scattered vertically. On the same plot, show the confidence

limits as green pluses. Check whether the confidence interval was able to capture the true mean. If

the confidence interval has missed the true mean, show a red asterisk where the true mean is:


n=4; % *Line 1*

pm=12; ps=10; % population mean and std *Line 2*

figure; hold on; % start an empty figure *Line 3*

%*Line 4*

i=1; % *Line 5*

x=pm+ps*randn(n,1); % *Line 6*

mx=mean(x); % *Line 7*

zcr=1.96; % *Line 8*

sem=ps/sqrt(n); % standard error of the mean *Line 9*

me=zcr*sem; % margin of error, *Line 10*

CI1= mx-me; % lower CI bound *Line 11*

CI2= mx+me; % upper CI bound *Line 12*


plot(i*ones(1,n), x, 'k.','markersize',5); % show points *Line 13*
plot(i*ones(1,2),[CI1, CI2],'g+') % show CI *Line 14*
if pm<CI1 || pm>CI2 % missed *Line 15*
plot(i,pm,'r*','markersize',10) % *Line 16*
% *Line 17*
end % *Line 18*
% *Line 19*
xlabel('experiment number') % *Line 20*
ylabel('observations') % *Line 21*
title(['95% CI: ',num2str(mx,3),'\pm',num2str(me,3)]), shg % *Line 22*

Run this code several times, until you get at least one unsuccessful interval (shown as a red asterisk). On a very rough average, you need 20 trials to get one unsuccessful interval (5% of the time). For each trial, check the sample mean (the part before ±) and the margin of error (the part after ±) in the title. The sample mean should change every time you run the code, while the margin of error should not.

The margin of error $z_{cr}\sigma/\sqrt{n}$ is inversely proportional to the square root of the sample size, so to halve the margin of error we need to quadruple the sample size. Modify the sample size (Line 1, e.g. n=16) and confirm that the margin of error is halved.

To see what happens when we repeat the experiment many times, we will use a for loop (unless you are willing to run the code by hand many times). We will start with a relatively small number of repetitions (20). Modify Lines 5 and 19 as follows:

for i=1:20 % Line 5
end % Line 19

Count the number of misses (shown as red *). The number of experiments (20) is still not very large, and the number of misses will depend on our luck (or, on what the current "seed" is). On average, we expect to see one unsuccessful interval every time we run the code. Run the code several times to see how the number of misses changes.


If we perform the experiment a large number of times, we expect 95% of our intervals to contain the population mean; this is what 95% confidence means. If we construct 1000 confidence intervals, approximately 950 should contain the population mean, and approximately 50 should miss it.

To test this, we will count the number of misses and show it in the title:

miss_count=0; % Line 4
for i=1:1000 % Line 5
miss_count=miss_count+1; % Line 17
title([num2str(miss_count),' misses out of 1000']) % Line 22

As we increase the number of experiments (and thus the number of confidence intervals), the proportion of misses should get closer to 5%.

The previous example was just an academic exercise explaining the meaning of confidence intervals. In real life, we face two challenges:

- We do not know the population standard deviation. If we knew it, we would probably know the population mean as well, and there would be no need to construct a confidence interval.
- We do not have the opportunity to repeat the experiment many times; we have to rely on a single confidence interval, based on a single sample of $n$ observations. We will not know whether we were lucky with the particular confidence interval we computed; we are just confident that if we repeated the experiment many times, a certain percentage of such intervals would contain the population mean.

5.2.2 Two-sided CIs for population mean μ when σ is unknown

If we do not know the population standard deviation ($\sigma$), we can substitute the sample standard deviation ($s$):


$$\text{Sample standard deviation:} \quad s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}.$$

However, when we standardize the sample mean using $s/\sqrt{n}$ instead of $\sigma/\sqrt{n}$, the resulting statistic is no longer a z-score, but a t-score,

$$T_{\bar{X}} = \frac{\bar{X} - \mu}{s/\sqrt{n}},$$

which has a Student-t distribution with $n-1$ degrees of freedom.

A two-sided CI for the population mean $\mu$, when we do not know $\sigma$, has the following form:

$$\text{Two-sided CI for } \mu \text{ when } \sigma \text{ is unknown:} \quad \bar{X} \pm t_{cr}\frac{s}{\sqrt{n}}$$

$$\text{margin of error:} \quad t_{cr}\frac{s}{\sqrt{n}}$$

where $t_{cr}$ is the critical value of the Student-t distribution, which depends on the confidence level and the degrees of freedom $(n-1)$. For $(1-\alpha)\times 100\%$ confidence, $t_{cr} = F^{-1}(1-\alpha/2)$, where $F$ is the CDF of the Student t-distribution with $n-1$ degrees of freedom, and $F^{-1}$ is the inverse of $F$.

The critical t value satisfies $P(-t_{cr} \leq T_{\bar{X}} \leq t_{cr}) = 1 - \alpha$. Because the Student t-distribution is symmetric, we can also write $P(T_{\bar{X}} < -t_{cr}) = \alpha/2$ and $P(T_{\bar{X}} > t_{cr}) = \alpha/2$.

The $t_{cr}$ and $z_{cr}$ values can be determined as follows:

alpha=0.05; % for 95% confidence level

zcr=norminv(1-alpha/2) % zcr for two-sided CI

zcr=norminv(1-alpha) % zcr for one-sided CI

n=28; % sample size

tcr=tinv(1-alpha/2, n-1) % tcr for two-sided CI

tcr=tinv(1-alpha, n-1) % tcr for one-sided CI


Example: The cross-sectional area measurements (in²) from 7 steel bars are as follows:

x = [135.3, 90.2, 99.3, 115.0, 100.7, 125.8, 102.4].

Assuming that the cross sectional areas of the bars are normally distributed, construct a two-sided

90% confidence interval for the mean cross sectional area of the entire batch.

x=[135.3, 90.2, 99.3, 115.0, 100.7, 125.8, 102.4] % measurements
n=length(x);
mx=mean(x);
s=std(x);
%------ determine tcr -----
alpha=0.1; % for 90% confidence
tcr=tinv(1-alpha/2, n-1); % critical t value
%--------------------------
me=tcr*s/sqrt(n); % margin of error
CI1= mx-me; % lower CI bound
CI2= mx+me; % upper CI bound

5.3 Confidence intervals for regression coefficients

In Tutorial 4, we wrote the regression equation as $\hat{Y} = \check{X}\hat{b}$, where $\hat{Y}$ is a column of predicted values, $\check{X}$ is the design matrix (whose first column consists of ones and whose remaining columns contain the independent variables), and $\hat{b}$ is the column vector containing the OLS estimates of the regression coefficients, computed as

$$\hat{b} = (\check{X}^T \check{X})^{-1} \check{X}^T Y.$$

To predict the $y_i$ value corresponding to $x_i = [x_{1i}, x_{2i}, \ldots, x_{ki}]$ we use

$$\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_{1i} + \hat{b}_2 x_{2i} + \cdots + \hat{b}_k x_{ki}$$

Two-sided confidence intervals for the regression coefficients are defined by $\hat{b} \pm t_{cr}\,s_{\hat{b}}$, where

$\hat{b}$: the vector of OLS estimates of the true, but unobservable, coefficients $b$.

$t_{cr}$: the critical t-value corresponding to the degrees of freedom ($df$) and the level of confidence. The degrees of freedom are computed as $df = \text{length}(Y) - \text{rank}(\check{X})$.


$s_{\hat{b}}$: the vector of standard deviations of $\hat{b}$ (also called the "standard error of $b$").

The interval $\hat{b}(i) \pm t_{cr}\,s_{\hat{b}}(i)$ contains the true coefficient $b_i$ with $(1-\alpha)\times 100\%$ confidence. Mathematically, we write

$$P\big(\hat{b}(i) - t_{cr}\,s_{\hat{b}}(i) \leq b_i \leq \hat{b}(i) + t_{cr}\,s_{\hat{b}}(i)\big) = 1 - \alpha$$

For a 95% confidence level, $\alpha = 0.05$.

Save the file you created in Tutorial 4 (SimpleRegressionTutorial4.m) under another filename. Run the file several times, and look at the $\hat{b}$ values. Every time the script is run, a different vector of $\hat{b}$ values is computed. By analogy: if we make concrete every month using the same mix design, test the compressive strength of two cylinders, and report the average, the reported value will change every month, so the reported strength is a random variable. Similarly, the OLS estimates of the regression coefficients $\hat{b}_i$ are random variables. To construct the confidence intervals, we need to compute the standard errors of the regression coefficients.

To get the standard errors, we first need to compute the covariance matrix:

K=(XX'*XX)^-1*XX'; % bh = K*y, so cov(bh) = K*cov(y)*K'
cbh=K*vn*K'; % covariance matrix of bhat values

The main diagonal of the covariance matrix contains the variances of the estimated regression coefficients. The standard errors of the regression coefficients $b_i$ are equal to the standard deviations of $\hat{b}_i$:


vbh= diag(cbh); % variances of bhat values
sbh= sqrt(vbh); % standard deviations of bhat values

Next, using the inverse CDF of the t-distribution, we determine the critical t value corresponding to the degrees of freedom in the problem and the desired level of confidence (95%):

alpha=0.05; % for 95% confidence
p=1-alpha/2; % probability to be used in the CDF
df=n-rank(XX); % degrees of freedom
t=tinv(p,df); % critical t value

Finally, we determine the lower and upper confidence limits for the $b$ values as follows:

ci_lower= bh-t*sbh; % lower confidence bound for b
ci_upper= bh+t*sbh; % upper confidence bound for b

5.4 Confidence intervals for ŷ and prediction intervals for y

The confidence interval for the conditional expectation of $y_i$ given $x_i$ (ignoring the noise) is constructed as follows:

cyh= XX*cbh*XX'; % covariance matrix of yhat
vyh=diag(cyh); % variances of yhat values
syh= sqrt(vyh); % standard deviations
ci_lower= yhat-t*syh; % lower confidence bound
ci_upper= yhat+t*syh; % upper confidence bound

The 95% prediction interval for y (including the noise) is constructed as follows:

s= sqrt(vyh+vn); % standard error of prediction
pi_lower = yhat-t*s; % lower prediction bound
pi_upper = yhat+t*s; % upper prediction bound


Plot the data, regression line, and 95% confidence and prediction intervals on the same figure:

figure

plot(x,y,'.')

hold on

plot(x,yhat,'k','linewidth',2)

plot(x, ci_lower,'k:','linewidth',2)

plot(x, pi_lower,'mp')

plot(x, ci_upper,'k:','linewidth',2)

plot(x, pi_upper,'mp')

legend('data','regression','95% CI',... % line too long

'95% PI','location','northwest')

title('95% confidence and prediction intervals')

xlabel('x')

ylabel('y')


6 MONTE CARLO SIMULATION

6.1 Prerequisites

To complete this tutorial, the student must be familiar with the following

Generation of random variables with a given probability distribution

Construction of pointwise confidence intervals

6.2 Background

The term "Monte Carlo method" describes a large class of approaches for approximating the solution of a complex problem when an analytical (exact) solution cannot be obtained, or is infeasible to obtain, with a deterministic algorithm. Monte Carlo methods are computer-based techniques for the numerical simulation of probabilistic processes, and are especially useful for modeling events with significant uncertainty in the inputs. They are very versatile, but are often slower than other methods because of their reliance on repeated computation and random numbers. Monte Carlo methods permit direct computation of uncertainty even when the model $f$ is ill-behaved (e.g., contains discontinuities or extreme nonlinearities) and when the input variables do not have well-defined mean or variance. Monte Carlo approaches tend to follow a particular pattern (a generic sketch of the pattern follows the list):

1. Define a domain of possible inputs.

2. Generate a large number of realizations of random variables (or vectors) $x_1, x_2, \ldots, x_n$ from the

domain using the specified probability distribution (joint, marginal, or conditional).

3. For each realization, compute the output of interest using a deterministic equation 𝑓(𝑥).

Deterministic means that given a particular input 𝑓(𝑥) will always produce the same output.

4. Analyze the results using histograms, summary statistics, confidence intervals, etc.
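As a generic sketch of this pattern (every name and distribution here is a placeholder for illustration; the tutorial's concrete example follows in the next section):

n = 1e5; % number of realizations
x = rand(n,1); % step 2: inputs drawn from an assumed uniform distribution
f = @(x) x.^2; % step 3: a placeholder deterministic model
y = f(x); % model outputs
mean(y) % step 4: summary statistics ...
hist(y) % ... and a histogram of the output distribution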

The figure below shows the basic principle behind the Monte Carlo approach.

[Figure: random inputs x1, x2, x3, ... are passed through the model f(x) to produce outputs y1, y2, ...]

6.3 A simple example: estimation of the value of 𝛑

Suppose that we do not know the value of 𝜋, and we wish to estimate it based on our knowledge

that it represents the area of the unit circle (a circle with radius 1). We also would like to construct

confidence intervals around our estimate. The figure below shows a unit circle circumscribed by a


square. Because of the symmetry, we consider only one quadrant of the circle (shaded area). The area of a quadrant of the circle with unit radius is $A_1 = \pi/4$, while the area of the unit square is $A = 1$.

[Figure: unit square with corners (0,0), (1,0), (0,1), (1,1); the shaded quarter circle of radius 1 has area π/4.]

If we generate $n$ points $(x, y)$ uniformly distributed over the square, the expected number of points inside the quadrant ($n_1$) satisfies

$$\frac{A_1}{A} = \frac{n_1}{n}$$

Therefore, $\pi$ can be estimated from

$$\frac{\pi}{4} = \frac{n_1}{n} \ \Rightarrow\ \hat{\pi} = \frac{4 n_1}{n}$$

The hat over $\pi$ indicates that what we compute is an estimate of $\pi$, which will get close to the true value as $n$ approaches infinity.

If we create an indicator vector $b$ of length $n$, which has the value 1 if the point is inside the circle and 0 if outside, we can write

$$\hat{\pi} = \frac{4}{n} \sum_{i=1}^{n} b_i = 4\bar{b}$$

Note that each element of $b$ is a realization of a Bernoulli random variable $B \sim Bern(p)$ with $p = \pi/4$. The expected value and variance of the Bernoulli random variable $B$ are $E(B) = \pi/4$ and $Var(B) = (\pi/4)(1 - \pi/4)$.

If we define a new random variable $X = 4B$, then we can write

$$E(X) = 4E(B) = \pi$$

$$Var(X) = 16\,Var(B) = 16\,\frac{\pi}{4}\left(1 - \frac{\pi}{4}\right)$$

Pretending that we do not know the value of $\pi$, we will use

$$\hat{\pi} = \bar{x} = 4\bar{b} \quad (1)$$

$$s^2 = 16\,s_B^2 \quad (2)$$

where $\bar{b}$ and $s_B^2$ denote the sample mean and sample variance of $b$, respectively.

The Monte Carlo procedure for estimating $\pi$ is then:

1. Define a domain of possible inputs: $0 \leq x \leq 1$, $0 \leq y \leq 1$.

2. Generate inputs randomly from the domain using a certain specified probability distribution. Ultimately, we need realizations of the Bernoulli random variable $B \sim Bern(\pi/4)$. Because we are not allowed to use $\pi$, we take the following approach:

a. Generate $n$ points, uniformly distributed over the unit square.

b. Create an indicator vector $b$ of length $n$, which takes the value 1 if the point is inside the quadrant ($\sqrt{x_i^2 + y_i^2} \leq 1$) and zero otherwise. The vector $b$ is a realization of the random variable $B$.

3. Perform a deterministic computation: compute $\hat{\pi}$ (the estimator of $\pi$) using Eq. (1).

4. Analyze the results: derive 95% confidence intervals for $\pi$ for different values of $n$.

Below is a piece of code that estimates $\pi$ using a single sample of 500 points. Start a new *.m file and type the following:


clear all; close all; clc
n=500; rand('seed',0)
xy=rand(n,2);
x=xy(:,1); y=xy(:,2);
inside=@(x,y)sqrt(x.^2+y.^2)<1;
b=inside(x,y);
mypi=4*mean(b); % estimated value of pi
figure
plot(x,y,'.')
hold on
plot(x(b),y(b),'r.'); % plot the points inside
axis equal
xlabel('x'), xlim([0 1])
ylabel('y'), ylim([0 1])
title(['estimated \pi =',num2str(mypi,5)])

The resulting figure is shown below:

[Figure: 500 uniform random points on the unit square; points inside the quarter circle shown in red; estimated π = 3.248.]

Modify the code to compute the 95% confidence interval (CI) for $\pi$:

$$\text{Two-sided CI:} \quad (\text{sample mean}) \pm (\text{margin of error}), \qquad \text{margin of error:} \quad t_{cr}\,s/\sqrt{n}$$

Here $s^2$ is the sample variance from Eq. (2). For $(1-\alpha)\times 100\%$ confidence, $t_{cr} = F^{-1}(1-\alpha/2)$, where $F$ is the CDF of the Student t-distribution with $n-1$ degrees of freedom. (For this problem, $t_{cr}$ will be close to $z_{cr} = 1.96$, since $n$ is large.)

The following lines will compute the confidence limits:

s2=16*var(double(b)); % sample variance, Eq. (2); cast the logical vector b to double
alpha=0.05; % for 95% confidence level
tcr=tinv(1-alpha/2, n-1); % critical t value for a two-sided CI
me= tcr*sqrt(s2/n); % margin of error
CI95_lower=mypi-me; % lower confidence bound
CI95_upper=mypi+me; % upper confidence bound


Run the code using $n = 500$, $n = 2000$, $n = 8000$, and $n = 32000$ samples. The table below shows the estimated value, the margin of error, and the lower and upper confidence bounds. Your answers should be the same as long as you use the same seed value (0).

95% Confidence Interval for π (seed = 0)

n          π̂ = x̄      Margin of error    Lower bound    Upper bound
500        3.248       0.137458           3.11054        3.38456
2000       3.152       0.0717126          3.08029        3.22371
8000       3.1305      0.0361608          3.09434        3.16666
32000      3.13837     0.0180181          3.12036        3.15639

The figure below shows the simulation with $n = 32000$ samples.

[Figure: 32000 uniform random points on the unit square; estimated π = 3.1384.]

Note that the margin of error varies as $1/\sqrt{n}$: quadrupling the sample size halves the margin of error and the width of the confidence interval.


7 NONLINEAR REGRESSION

When the relationship between the independent variable(s) and the dependent variable cannot be

approximated as a line (or a hyperplane), approaches beyond linear regression are needed. There are

many different methods for dealing with nonlinear relationships, but we will focus on two

approaches: (a) Using a nonlinear transformation which makes the data approximately linear; (b)

Polynomial fitting.

7.1 Nonlinear Transformations

Sometimes a non-linear relationship can be transformed into a linear one by a mathematical transformation. Examples include the exponential growth equation

$$y = A e^{bx} u \ \Leftrightarrow\ \log(y) = \log(A) + bx + \log(u)$$

and the constant-elasticity equation

$$y = A x^{b} u \ \Leftrightarrow\ \log(y) = \log(A) + b\log(x) + \log(u)$$

Linear regression can now be performed using the transformed variables.

Example: The table below shows data to test the relationship between porosity and sandstone strength.

x = porosity    y = unconfined strength (psi)
12.32           2636
13.94           3162
6.94            7580
4.0             16899
2.94            23739
0.86            14224

Source: Hale, P. A. & Shakoor, A., 2003, A laboratory investigation of the effects of cyclic heating and cooling, wetting and drying, and freezing and thawing on the compressive strength of selected sandstones: Environmental and Engineering Geoscience, vol. IX, p. 117-130.

x=[12.32,13.94,6.94,4,2.94,0.86];

y=[ 2636, 3162, 7580, 16899, 23739, 14224];

x=x(:); y=y(:);

n=length(x);

XX=[ones(n,1),x];

b=(XX'*XX)^-1*XX'*y;


Plot the data and the regression line, and compute the coefficient of determination.
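One way to do this, following the plotting pattern of the earlier tutorials (a sketch; the tutorial does not list this code explicitly):

yhat=XX*b; % predicted strengths from the linear fit
[xs,idx]=sort(x); % sort by porosity so the line plots cleanly
figure; plot(x,y,'.'); hold on
plot(xs,yhat(idx),'r')
xlabel('porosity'), ylabel('unconfined strength (psi)')
e=y-yhat; % residuals
R2=1-sum(e.^2)/sum((y-mean(y)).^2) % coefficient of determination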

The coefficient of determination is $R^2 = 0.72$, indicating that the regression equation can explain 72% of the variation in unconfined strength.

Repeat the same analysis, using a nonlinear transformation:

y=log(y) % transform y, then rerun the regression and plotting code above

The coefficient of determination has increased to $R^2 = 0.87$.

There are a few points to keep in mind when using this method. First, we are assuming that the errors in the transformed equation follow a zero-mean Gaussian distribution, which may not be a reasonable assumption. Second, once we get the estimates from the transformed equation, going back to the original equation can be tricky: some parameter estimates are biased, and the confidence intervals are no longer symmetric around the predicted values. We need to compute the confidence interval in the transformed equation and then transform the bounds back.

7.2 Polynomial fitting

The commands polyfit and polyval can be used whenever the data can be approximated by a polynomial.

help polyfit
help polyval


Consider the following nonlinear system:

randn('seed', 1);
x=(1:50)';
y = sin(x/50)./x + 0.002*randn(50,1);

Fit a polynomial of order 5:

order=5;
poly = polyfit(x, y, order);

Evaluate the polynomial at the data points:

yhat = polyval(poly,x);

An approximate 95% prediction interval for y (including the noise) can be constructed as follows:

[poly, model] = polyfit(x, y, order); % fit again, returning the error-estimate structure
[yhat, s] = polyval(poly, x, model); % evaluate; s estimates the prediction standard error
alpha=0.05; % for 95% confidence
p=1-alpha/2; % probability to be used in the CDF
df=50-(5+1); % degrees of freedom: n minus the number of polynomial coefficients
t=tinv(p,df); % critical t value
PI_lower=yhat-t*s;
PI_upper=yhat+t*s;


figure;

plot(x,y,'.')

hold on

plot(x,yhat, 'r')

plot(x, PI_lower, 'r:')

plot(x, PI_upper, 'r:')

legend('data','regression','95% PI')

xlabel('x'), ylabel('y')