We’ll consider here the problem of paired data. There are two common notations.

(x1, y1), (x2, y2), …, (xn, yn) shows the data as n points in two-space.

X     Y
x1    y1
x2    y2
x3    y3
…     …
xn    yn

This is the spreadsheet form.

PowerPoint show prepared by Gary Simon, 11 MARCH 2008.

The separate points are assumed independent.

We wish to find a relationship between variable X and variable Y.

We have here a data set on eye response to different types of drops, but for now we’ll look at just a few simple items of information.

DP0OD Pupil diameter, start of experiment, right eye

DP0OS Pupil diameter, start of experiment, left eye

AGE Subject age

There are altogether 100 subjects.

Let’s consider the relationship between pupil diameter in the eyes.

An obvious first step is making a scatterplot showing all 100 people.

Let’s put the right eye on the horizontal axis and the left eye on the vertical axis. This is not a critical decision.

[Figure: Scatterplot of DP0OD vs DP0OS. Both axes run from 3 to 8.]

This graph shows that the points cluster near a diagonal line. This is not a surprise.

Here’s the same picture with the Y = X line superimposed:

[Figure: Scatterplot of DP0OD vs DP0OS with the Y = X line superimposed. Both axes run from 3 to 8.]

The points cling close to the line.

There are a few simple ways to summarize this situation. Perhaps the best is the correlation. Here r = 0.96.
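As a minimal sketch of how a correlation such as r = 0.96 is computed, the function below evaluates the sample correlation coefficient for paired data. The name `pearson_r` and the six data pairs are made up for illustration; they are not the study data.

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient r for paired data (x_i, y_i)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products about the means
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical right-eye / left-eye pupil diameters in mm (NOT the study data).
right_eye = [4.1, 5.0, 5.8, 6.3, 7.2, 6.6]
left_eye = [4.3, 4.9, 6.0, 6.1, 7.0, 6.9]
r = pearson_r(right_eye, left_eye)
```

A value of r near 1 corresponds to points that cling close to an upward-sloping line.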

Now let’s complicate this a bit. Suppose that we want to check on the relationship between DP0OS (pupil diameter, left eye) and AGE.

These two variables are not symmetric.

We’ll think of the variable AGE as “logically earlier.”

This means that we obtain it easily, reliably, and (probably) earlier than the pupil diameter. Also, it’s logical to think of using AGE to predict pupil diameter.

We will designate AGE as the independent variable, we will identify it with the symbol X, and we will place it on the horizontal axis of the coming scatterplot.

We’ll think of the variable DP0OS as “logically later.”

This information is obtained with some difficulty, with possible measurement error, and (probably) later than the age.

We will designate DP0OS as the dependent variable, we will identify it with the symbol Y, and we will place it on the vertical axis of the coming scatterplot.

The scatterplot is next. Before it’s shown, we should ask ourselves whether

* pupil diameter generally rises with age

* pupil diameter is unrelated to age

* pupil diameter generally decreases with age

What do you think?

Here is the scatterplot:

[Figure: Scatterplot of DP0OS vs AGE. Horizontal axis: AGE (20 to 70); vertical axis: DP0OS (3 to 8).]

Suppose that you would like to summarize the relationship between the two variables. You would like to write

Pupil Diameter = Y = dependent variable

= f(AGE) = f(X) = f(independent variable)

for some function f .

The problem is that you’ll never find a believable function to go through all the dots on the scatterplot. There is too much statistical noise.

The expression of the model will be revised to

Y = f(X) + ε

The symbol ε represents statistical noise. It may involve random errors in measuring Y, or it may just represent variability that we don’t know how to account for.

One could also have made “multiplicative noise” in the form Y = f(X) × ε. In some cases, this is useful. For now, we’ll stick with the “additive noise” with the + sign.

We will have a lot to say about the ε term. For now, we’ll just assume that it is independent over the data points.

What form should we use for the function f ?

How about f(X) = log X ?

How about f(X) = a X^2 + b X + c ?

How about f(X) = tan( a X^2 + h) ?

How about f(X) = [a more elaborate expression involving tanh, log, |X|, and the constants a, b, c] ?

We will start with the simplest function, the straight line. This is f(X) = β0 + β1 X .

The symbols β0 and β1 are parameters.

β0 is the intercept, also called Y-intercept.

β1 is the slope.

In nearly all cases, β0 and β1 are not known, and we have to estimate them from data.

The notation is not universal. You will also see

f(X) = α + β X     (This is OK.)

f(X) = a + b X     (Use of Roman letters is not recommended.)

For reasons related to which symbols are fixed and which are random, we will prefer f(x) = β0 + β1 x . That is, we will prefer lower-case x.

It is however impossible to enforce distinctions between x and X and also between y and Y. We can’t be too dogmatic about the notation.

The relationship between Y and X will be described through the simple linear regression model

Y = β0 + β1 x + ε

This is made more direct by adding the subscript i to label individual data points. Our preferred form for the simple linear regression model is

Yi = β0 + β1 xi + εi

with i = 1, 2, …, n.

The simple linear regression model also includes these assumptions about the noise terms ε1 , ε2 , ε3 , … , εn :

The ε’s are independent of each other and also independent of the x’s.

The ε’s are sampled from a hypothetical population in which the mean is zero and the standard deviation is σ.

In some cases, we may add in the further assumption that the ε’s are sampled from a normal population.
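To make the model and its noise assumptions concrete, here is a minimal simulation sketch. It assumes the optional normal-noise case; the function name `simulate_slr` and all parameter values are hypothetical and chosen only for illustration.

```python
import random

def simulate_slr(x_values, beta0, beta1, sigma, seed=0):
    """Generate Y_i = beta0 + beta1 * x_i + eps_i for fixed x values.

    The eps_i are independent draws with mean 0 and standard deviation sigma.
    Normal noise is used here, which is the optional extra assumption.
    """
    rng = random.Random(seed)  # seeded so that the run is reproducible
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in x_values]

# Hypothetical parameter values (assumptions, not estimates from the study).
ages = [20, 30, 40, 50, 60, 70]
simulated = simulate_slr(ages, beta0=7.0, beta1=-0.04, sigma=0.8)
```

Rerunning with a different seed gives a different scatter around the same underlying line, which is the sense in which β0, β1, and σ describe a population rather than one data set.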

The simple linear regression model Yi = β0 + β1 xi + εi has three unknown parameters: β0 , β1 , and σ .

Estimating these parameters is an important part of the regression task.

Estimating β0 and β1 is equivalent to drawing a line on the scatterplot. The estimate of σ tells us how well the line describes the set of points on the scatterplot.

The estimate of β0 is written b0 .

The estimate of β1 is written b1 .

The estimate of σ is written s .

You’ll also see sε or sY | x .

Note this consistent pattern of usage:

Model parameters are Greek letters.

Data-based estimates are corresponding Latin letters.

Be aware that other schemes exist.

Someone who writes the model as Yi = α + β xi + εi will use a for the estimate of α and will use b for the estimate of β.

Someone who writes the model as Yi = a + b xi + εi will use â for the estimate of a and will use b̂ for the estimate of b.

For our problem, the model is

DPi = β0 + β1 AGEi + εi

The pupil diameter DP is in units of mm (millimeters). The variable AGE is in units of years.

Therefore, β0 and its estimate b0 are in units of mm.

Also, the ε’s and their standard deviation σ are in units of mm. The estimate of σ is also in units of mm.

The slope β1 and its estimate b1 are in units of mm/year.

How should we estimate β0 and β1 ?

We could guess.

We could draw a nice-looking line on the scatterplot and then use that line to get the estimates.

These are not necessarily bad methods, but they are not reproducible. This means that different people get different answers. Worse yet, the same person on two occasions will produce different answers.

We will instead propose that the estimates be done by minimizing a mathematical function.

Many proposals have been made, but the nearly universal choice is least squares. Choose b0 and b1 to minimize the function

$$Q = \sum_{i=1}^{n} \left( Y_i - b_0 - b_1 x_i \right)^2$$
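The least squares criterion is easy to evaluate for any candidate line. The sketch below computes Q for two candidate lines; the function name `sum_of_squares`, the six data pairs, and both candidate lines are hypothetical.

```python
def sum_of_squares(b0, b1, x, y):
    """Q(b0, b1): sum of squared vertical distances from the points to the line."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

# Hypothetical (age, pupil diameter) pairs, for illustration only.
ages = [20, 30, 40, 50, 60, 70]
diameters = [6.5, 6.1, 5.4, 5.2, 4.6, 4.3]

q_flat = sum_of_squares(5.35, 0.0, ages, diameters)      # horizontal line at the mean of y
q_sloped = sum_of_squares(7.4, -0.045, ages, diameters)  # a downward-sloping candidate
```

For these made-up points the sloped candidate gives a much smaller Q than the flat one. Least squares chooses the pair (b0, b1) that makes Q as small as possible.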

How should this minimization be done?

The solution is by (mindless and routine) differentiation. That is, solve the system

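$$\frac{\partial Q}{\partial b_0} \overset{\text{let}}{=} 0, \qquad \frac{\partial Q}{\partial b_1} \overset{\text{let}}{=} 0$$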

This results in two linear equations in the two unknowns b0 and b1 .
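For reference, here is a sketch of where the two linear equations come from, obtained by differentiating Q = Σ (Yi - b0 - b1 xi)^2 with respect to each unknown. This is the standard derivation, written out here as a worked step and not copied from the slides.

```latex
\begin{aligned}
\frac{\partial Q}{\partial b_0} &= -2 \sum_{i=1}^{n} \left( Y_i - b_0 - b_1 x_i \right) = 0
  &&\Longrightarrow\quad n\, b_0 + b_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} Y_i \\
\frac{\partial Q}{\partial b_1} &= -2 \sum_{i=1}^{n} x_i \left( Y_i - b_0 - b_1 x_i \right) = 0
  &&\Longrightarrow\quad b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i Y_i
\end{aligned}
```

Both equations are linear in b0 and b1, with coefficients that are simple sums over the data.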

The solution method described on the previous slide works, but it’s clumsy. Here is a cleaner way to do this.

(1) Find the five sums $\sum_{i=1}^{n} x_i$, $\sum_{i=1}^{n} y_i$, $\sum_{i=1}^{n} x_i^2$, $\sum_{i=1}^{n} y_i^2$, $\sum_{i=1}^{n} x_i y_i$.

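(2) Next find these quantities: $\bar{x}$, $\bar{y}$, and

$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2$$

$$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} y_i \right)^2$$

$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)$$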

(3) Find b1 (the estimate of the slope β1) as

$$b_1 = \frac{S_{xy}}{S_{xx}}$$

(4) Find b0 (the estimate of the intercept β0) as

$$b_0 = \bar{y} - b_1 \bar{x}$$

Note that b1 is found before b0 .

(5) Finally, calculate

$$S_{yy|x} = S_{yy} - \frac{\left( S_{xy} \right)^2}{S_{xx}}$$

We’ll use this later in the estimation of σ, the standard deviation of the noise.
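Steps (1) through (5) can be sketched in a few lines of code. The function name `least_squares_from_sums` and the dictionary keys are hypothetical; the arithmetic follows the five steps as listed.

```python
def least_squares_from_sums(x, y):
    """Follow steps (1)-(5): five sums, then Sxx, Syy, Sxy, then b1, b0, Syy|x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")

    # (1) The five sums
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xx = sum(xi * xi for xi in x)
    sum_yy = sum(yi * yi for yi in y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))

    # (2) Means and corrected sums of squares and cross-products
    x_bar = sum_x / n
    y_bar = sum_y / n
    sxx = sum_xx - sum_x ** 2 / n
    syy = sum_yy - sum_y ** 2 / n
    sxy = sum_xy - sum_x * sum_y / n
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    b1 = sxy / sxx                    # (3) slope estimate, found first
    b0 = y_bar - b1 * x_bar           # (4) intercept estimate uses b1
    syy_given_x = syy - sxy ** 2 / sxx  # (5) used later to estimate sigma

    return {"b0": b0, "b1": b1, "Sxx": sxx, "Syy": syy, "Sxy": sxy,
            "Syy_given_x": syy_given_x}

# Points lying exactly on Y = 1 + 2x, so the fit should recover that line.
fit = least_squares_from_sums([1, 2, 3, 4], [3, 5, 7, 9])
```

With points that lie exactly on a line, Syy|x comes out as zero, since there is no scatter left over after fitting.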

While it’s possible to do this for our problem of pupil diameter versus age with just the use of a calculator, there are too many steps and we are likely to make errors.

We’ll give this to the Minitab function

Stat > Regression > Regression.

The Minitab output is extensive, but from it we find

Regression Analysis: DP0OD versus AGE

The regression equation is
DP0OD = 7.27 - 0.0430 AGE

This is called the fitted regression equation. This identifies for us b0 = 7.27 and b1 = -0.0430.

Here is a reprise of the scatterplot, now shown with the fitted regression line.

[Figure: Fitted Line Plot, DP0OD = 7.269 - 0.04295 AGE. Horizontal axis: AGE (20 to 70); vertical axis: DP0OD (3 to 8). S = 0.832776, R-Sq = 35.6%, R-Sq(adj) = 34.9%.]

This was made in Minitab with Stat > Regression > Fitted Line Plot.

This has reported also sε = 0.832776, the estimate of σ.
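The slides defer the formula for the estimate of σ. A common choice, and the one assumed in this sketch, is s = sqrt(Syy|x / (n - 2)). The function name `noise_sd` is hypothetical, and the back-calculation assumes that all 100 subjects entered the fit.

```python
import math

def noise_sd(syy_given_x, n):
    """Estimate of sigma as sqrt(Syy|x / (n - 2)); assumed formula, see lead-in."""
    if n <= 2:
        raise ValueError("need more than 2 data points")
    return math.sqrt(syy_given_x / (n - 2))

# Back out the Syy|x implied by the reported s = 0.832776,
# ASSUMING n = 100 (all subjects were used in the regression).
implied_syy_given_x = 0.832776 ** 2 * (100 - 2)
s_check = noise_sd(implied_syy_given_x, 100)  # round trip back to the reported s
```

The divisor is n - 2 because two parameters, β0 and β1, were estimated from the data before measuring the leftover scatter.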

It’s important to distinguish population quantities from sample quantities.

The process of regression is not simply “numbers in, numbers out.”

The simple linear regression model is

Yi = β0 + β1 xi + εi

If you are asked to graph the line Y = β0 + β1 x . . .

Please refuse! You cannot graph this line because β0 and β1 are unknown population parameters.

With data, you will get the estimates b0 and b1.

The fitted regression line is Ŷ = b0 + b1 x .

The “hat” on Ŷ is helpful, but it’s a typesetting nuisance. The fitted line is often given without the “hat.”

For the pupil diameter problem, the fitted line is

Ŷ = 7.27 - 0.0430 AGE

The interpretation of -0.0430 is . . .

that each year of age is associated with a reduction of 0.0430 mm in pupil diameter.

The interpretation of 7.27 is . . .

to be avoided. It’s tempting to say that it’s an assessment of pupil diameter at birth. The data set did not have anyone younger than 18, so we won’t force an interpretation.
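The two interpretations can be illustrated by evaluating the fitted line at a few ages. The coefficients below are the reported estimates; the function name `fitted_diameter` and the particular ages are chosen only for illustration.

```python
B0 = 7.27     # intercept estimate, in mm
B1 = -0.0430  # slope estimate, in mm per year

def fitted_diameter(age_years):
    """Fitted pupil diameter (mm) at a given age, from the reported line."""
    return B0 + B1 * age_years

# Ten additional years of age shift the fitted value by 10 * B1 mm.
ten_year_change = fitted_diameter(50) - fitted_diameter(40)

# Fitted value at age 20, inside the age range shown on the scatterplot.
at_20 = fitted_diameter(20)
```

Evaluating the line at age 0 would simply return 7.27, but that is an extrapolation far outside the observed ages, which is why the intercept is left uninterpreted.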

The estimate of the noise standard deviation was calculated as sε = 0.832776. This is about 0.83 mm, which is rather large for this context.

What are we to make of this large value?

This is saying that AGE is far from a perfect predictor of pupil diameter.

We still have to decide

* Is there an objective way to decide if this whole activity was worth doing?

* Is there an objective way to decide if the model Yi = β0 + β1 xi + εi was a good choice?
