ISyE 6414: Regression Analysis
Lectures: MWF 8:00-10:30, MRDC #2404
Early five-week session: May 14 - June 15
(8:00-9:10; 10-min break; 9:20-10:30)
Instructor: Dr. Yajun Mei (“YA_JUNE MAY”)
• Email: [email protected]; Tel: 404-894-2334 (O)
• Office Hours: MWF 10:30-11:00, after class or in Groseclose #343
• Course Homepage: Canvas (all HWs are due on Canvas); backup: http://www.isye.gatech.edu/~ymei/6414
• HW#1 is due on Friday, May 18 for on-campus students, and on Wednesday, May 23 for distance learning students
My academic pathway
• Undergraduate: Math, Peking Univ., BS in 1996
• Worked as a computer programmer in a Chinese bank, 1996-1998
• Graduate: PhD in Math with a minor in EE, Caltech, 1998-2003 (advisor: Dr. Gary Lorden)
• Postdoc in biostatistics: FHCRC, Seattle, 2003 - Sep 2005 (supervisor: Dr. Sarah Holte)
• New Research Fellow: SAMSI & Duke Univ., Fall 2005
• Joined ISyE at GT in Jan 2006. Currently a tenured associate professor.
About this course
• Regression Analysis is the key building block for many modern Machine Learning, Artificial Intelligence, and Business Analytics techniques and methods (such as Neural Networks, Deep Learning, Boosting, Random Forests, etc.)
• This course aims to help you
  - Understand its theoretical aspects (HW#1, #2, #4, and a midterm)
  - Understand its computational aspects (HW#3, and a course project)
Organization of the Course
Textbooks (Notes/slides provided):
• Kutner, Nachtsheim, Neter and Li, Applied Linear Statistical Models, 5th ed.
  www.abebooks.com/servlet/SearchResults?isbn=9780073108742
• Faraway, Practical Regression and ANOVA using R (freely downloadable online)
Topics:
• Simple Linear Regression (Ch 1-4)
• Multiple Linear Regression (Ch 5-11) (2 weeks, Midterm)
• Advanced Regression (Ch 13-14) (2 weeks)
• Design of Experiments (Ch 13, 14)
Organization of the Course
Grading Policy (the past AVG GPA is [3.7, 3.9]):
• Class attendance (5%)
• Homework (4*10%=40%): Collaboration is encouraged, but you cannot look at any other solutions before submitting.
• One in-class Midterm (25%): 9:15am-10:30am, Friday, May 25 (happy Memorial weekend)
• Class project (30%): a team of 2-4 or by yourself. See the handout for possible project topics.
  - Proposal (1-3 pages): May 30 (Wed)
  - Presentation file: due 7am on June 13 (Wed) (only for on-campus students, not required for DL students)
  - Final report: June 15 (Friday)
[Only for the Distance Learning students: two-lecture delay for homework and the class project proposal, and one-week delay for the midterm and the final report.]
Part A
• Basic Background on probability and statistics.
We might not discuss this background part in detail, but I list some slides here so that you can refresh your memory if necessary.
• Three key Probability distributions: Binomial, Poisson, and Normal.
Probability Review
See Appendix A of our text.
• Probability
• Discrete Random Variable
• Continuous Random Variables
• Joint Distribution
Probability
Basics of Probability Theory
• Random Experiments, e.g., flip a fair coin three times, and observe “Heads” or “Tails”
• Sample space: the set of all possible outcomes, e.g., S = {HHH, THH, HTH, HHT, HTT, THT, TTH, TTT}
• An Event: a subset of the sample space of a random experiment, e.g., observe one “Heads”
• Union/Intersection/Complement of events; Counting Techniques; Axioms of Probability; Conditional Probability; Independence; Bayes’ Theorem
Random Variable
• A random variable is a function that assigns a real number to each outcome in the sample space of a random experiment.
• Example: Let X be the number of heads when flipping a fair coin three times. Rigorously,

w      HHH  HHT  HTH  THH  HTT  THT  TTH  TTT
X(w)    3    2    2    2    1    1    1    0
Discrete Random Variable X
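X with countable possible values
• Probability mass function: $f(x) = P(X = x)$
• Cumulative distribution function: $F(x) = P(X \le x) = \sum_{t \le x} f(t)$
• Mean: $\mu = E(X) = \sum_x x f(x)$
• Variance: $\sigma^2 = Var(X) = \sum_x (x - \mu)^2 f(x)$
• Standard Deviation: $\sigma = \sqrt{\sigma^2}$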
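As a minimal sketch of the quantities listed for a discrete random variable, the R code below enumerates the three-coin-flip sample space from the earlier example and computes the pmf, CDF, mean, variance, and standard deviation of X = number of heads. The variable names (`outcomes`, `pmf`, `mu`, `sigma2`) are illustrative choices, not part of the course materials.

```r
# Enumerate all 2^3 = 8 equally likely outcomes of three fair coin flips
outcomes <- expand.grid(flip1 = c("H", "T"),
                        flip2 = c("H", "T"),
                        flip3 = c("H", "T"),
                        stringsAsFactors = FALSE)

# X(w) = number of heads in outcome w
X <- rowSums(outcomes == "H")

# Probability mass function f(x) = P(X = x), each outcome has probability 1/8
x_vals <- sort(unique(X))
pmf <- sapply(x_vals, function(v) mean(X == v))

# Cumulative distribution function F(x) = P(X <= x)
cdf <- cumsum(pmf)

# Mean, variance, and standard deviation computed from their definitions
mu     <- sum(x_vals * pmf)
sigma2 <- sum((x_vals - mu)^2 * pmf)
sigma  <- sqrt(sigma2)

print(data.frame(x = x_vals, pmf = pmf, cdf = cdf))
print(c(mean = mu, variance = sigma2, sd = sigma))
```

Here X is Binomial(3, 0.5), so the results can be compared with `dbinom(0:3, 3, 0.5)`.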
Important discrete RVs
Discrete Uniform
Binomial(n,p)
Geometric(p)
Poisson(\lambda)
• What are the mean and Var/SD?
Continuous Random Variable
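• Probability density function: $f(x)$, with $P(a \le X \le b) = \int_a^b f(x)\,dx$
• Cumulative distribution function: $F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$
• Mean: $\mu = E(X) = \int x f(x)\,dx$
• Variance: $\sigma^2 = Var(X) = \int (x - \mu)^2 f(x)\,dx$
• Standard Deviation: $\sigma = \sqrt{\sigma^2}$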
Important Continuous RVs
• Gamma/Weibull/Lognormal/Beta distribution
• What are the mean and Var/SD?
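Central Limit Theorem
a. If X is Binomial(n,p), then
$$Z = \frac{X - np}{\sqrt{np(1-p)}} \approx N(0,1)$$
(continuity correction)
b. If $X_1, X_2, \ldots, X_n$ are iid with mean $\mu$ and variance $\sigma^2$, then
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \approx N(0,1)$$
(or $Z = \frac{X_1 + \cdots + X_n - n\mu}{\sqrt{n}\,\sigma} \approx N(0,1)$)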
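To illustrate part (a), here is a small sketch comparing an exact binomial probability with its normal approximation, with and without the continuity correction. The values of `n`, `p`, and `k` are hypothetical and chosen only for illustration.

```r
# Hypothetical setting: X ~ Binomial(n, p), approximate P(X <= k)
n <- 50
p <- 0.3
k <- 20

# Exact binomial probability
exact <- pbinom(k, size = n, prob = p)

# Normal approximation without continuity correction
approx_plain <- pnorm((k - n * p) / sqrt(n * p * (1 - p)))

# Normal approximation with continuity correction: P(X <= k) ~ P(Z <= (k + 0.5 - np)/sd)
approx_cc <- pnorm((k + 0.5 - n * p) / sqrt(n * p * (1 - p)))

print(c(exact = exact, normal = approx_plain, normal_cc = approx_cc))
```

The continuity correction replaces k by k + 0.5 because the discrete event {X <= k} corresponds to the interval up to k + 0.5 on the continuous scale.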
Statistical Review
• Population parameter vs. Sample statistic
• Point Estimation
• Confidence Interval
• Hypothesis Testing
Population Parameter vs Sample Statistic
• Population: a set of entities about which statistical inferences are to be drawn. Typically the population is very large, making a census or a complete enumeration of all the values in the population impractical or impossible.
• Sample: a subset of observed objects from the population. The sample represents a subset of manageable size (possibly massive).
• Parameter: a (typically unobservable) quantity that indexes a family of probability distributions. It can be regarded as a numerical characteristic of a population or a model.
• Statistic: a measure of some attribute of a sample. It is calculated by applying a function to the values of the items comprising the sample.
[Population parameter vs. Sample statistic]
Important Sample statistics
• Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
• Sample standard deviation: $s = \sqrt{s^2}$
• Sample range: $r = \max(x_i) - \min(x_i)$
• Quartiles:
  - The lower quartile $q_1$: 25% of the data is less than $q_1$
  - The median $q_2$: 50% of the data is less than $q_2$
  - The upper quartile $q_3$: 75% of the data is less than $q_3$
• As a measure of variability, the interquartile range (IQR) is defined as IQR = $q_3 - q_1$
• Plots: Stem-and-Leaf Diagram/Plot, Histogram, Box Plots, Probability Plots (or Normal QQ plots)
Normal Distribution
Assume $X_1, X_2, \ldots, X_n$ are iid normal with mean $\mu$ and variance $\sigma^2$.
• Sample mean: $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$, or $\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \sim N(0,1)$
• Sample variance $S^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}$ satisfies $\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$ (Chi-square distribution)
Normal Distribution (Cont.)
Assume $X_1, X_2, \ldots, X_n$ are iid normal with mean $\mu$ and variance $\sigma^2$.
• Sample mean $\bar{X}$ is independent of sample variance $S^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}$. Moreover,
$$\frac{\sqrt{n}(\bar{X} - \mu)}{S} = \frac{N(0,1)}{\sqrt{\chi^2_{n-1}/(n-1)}}$$
has a t-distribution with df = n-1. [In many cases, $\frac{\hat\theta - \theta}{s.e.(\hat\theta)}$ has a t-distribution.]
• In Appendix B on page 1317, for the t-distribution, the critical point is $t_{\alpha, df} = t(A; df)$ with $A = 1 - \alpha$, so $t_{0.025, 18} = t(0.975; 18) = 2.101$.
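As a quick check of the table lookup, the critical point can also be obtained in R with the base function `qt()`. This is only a sketch; the names `alpha`, `df`, and `t_crit` are illustrative.

```r
# Upper-tail critical point t_{alpha, df} = t(A; df) with A = 1 - alpha
alpha <- 0.025
df <- 18
t_crit <- qt(1 - alpha, df)
print(round(t_crit, 3))   # 2.101

# Sanity check: the upper-tail probability beyond t_crit equals alpha
upper_tail <- pt(t_crit, df, lower.tail = FALSE)
print(upper_tail)
```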
Point Estimation
• The bias of the estimator $\hat\theta$ is $Bias(\hat\theta) = E(\hat\theta) - \theta$. An estimator is unbiased if the bias is 0.
• The variance of the estimator $\hat\theta$ is $Var(\hat\theta)$.
• The mean square error of the estimator $\hat\theta$ is $MSE(\hat\theta) = E(\hat\theta - \theta)^2 = Var(\hat\theta) + [Bias(\hat\theta)]^2$
• The standard error of $\hat\theta$ is $s.e. = \sqrt{Var(\hat\theta)}$
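The decomposition of the mean square error into variance plus squared bias follows by adding and subtracting $E(\hat\theta)$; a short derivation, using only the definitions above, is:

```latex
\begin{aligned}
MSE(\hat\theta) &= E\big(\hat\theta - \theta\big)^2
  = E\Big[\big(\hat\theta - E(\hat\theta)\big) + \big(E(\hat\theta) - \theta\big)\Big]^2 \\
 &= E\big(\hat\theta - E(\hat\theta)\big)^2
  + 2\big(E(\hat\theta) - \theta\big)\,E\big(\hat\theta - E(\hat\theta)\big)
  + \big(E(\hat\theta) - \theta\big)^2 \\
 &= Var(\hat\theta) + 0 + \big[Bias(\hat\theta)\big]^2 .
\end{aligned}
```

The cross term vanishes because $E\big(\hat\theta - E(\hat\theta)\big) = 0$.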
Methods of Point Estimation
• There are three methodologies to create point estimates of a population parameter:
  A. Method of moments (MOM)
  B. Method of maximum likelihood (MLE)
  C. Bayesian estimation of parameters
MOM & MLE
• The method of moments (MOM) estimators are found by equating the population moments to the sample moments and solving the resulting equations, e.g.,
$$h(\theta) = E(X) = \bar{X} = \frac{X_1 + \cdots + X_n}{n}.$$
• The maximum likelihood estimator (MLE) is the value of θ that maximizes the likelihood function
$$L(\theta) = f(x_1) f(x_2) \cdots f(x_n)$$
If the domain of f(x) does not depend on θ, solving $\frac{d \log L(\theta)}{d\theta} = 0$ yields the MLE. Otherwise, plot L(θ) and find the maximum.
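A minimal sketch of maximum likelihood in R, assuming a small hypothetical sample of counts modeled as iid Poisson(λ). The data vector `counts` and the search interval are made up for illustration. For the Poisson model, solving d log L(λ)/dλ = 0 gives the sample mean, which is also the MOM estimator since E(X) = λ, so the numerical maximizer should agree with `mean(counts)`.

```r
# Hypothetical count data, assumed iid Poisson(lambda)
counts <- c(2, 0, 3, 1, 4, 2, 1, 3)

# Log-likelihood: log L(lambda) = sum_i log f(x_i; lambda)
loglik <- function(lambda) sum(dpois(counts, lambda, log = TRUE))

# Numerical maximization of the log-likelihood over a bracketing interval
fit <- optimize(loglik, interval = c(0.01, 20), maximum = TRUE)
mle_numeric <- fit$maximum

# Closed form from d log L / d lambda = 0: lambda_hat = sample mean
mle_closed <- mean(counts)

print(c(numeric = mle_numeric, closed_form = mle_closed))
```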
Confidence Interval & Hypothesis Testing
One sample:
1. Normal mean with known variance (one-sided)
2. Normal mean with unknown variance
3. Normal variance
4. Proportion of Binomial Distribution
Two samples: inference on mean difference
5. Two independent normal distributions: variances known
6. Two independent normal distributions: unknown and equal variances
7. Two independent normal distributions: unknown and unequal variances
8. Paired Samples
Part B
• Overview of Supervised Learning
• Simple Linear Regression
Overview of Supervised Learning
Supervised Learning (directed data mining, learning with a teacher):
• The observed data is of the form $(Y_i, X_{i1}, \ldots, X_{ip})$ for $i = 1, \ldots, n$, where the variables can be split into two groups:
  - independent variables (explanatory variables, inputs, predictors) $X = (X_1, \ldots, X_p)$, and
  - one (or more) dependent variable (output, response) Y.
• The objective is to predict Y given values of the input X.
Supervised Learning
• Observed Data (Training Data): $(Y_i, X_{i1}, \ldots, X_{ip})$ for $i = 1, \ldots, n$
• Objective: find a function $f(x_{new}) = f(x_1, \ldots, x_p)$ that can predict $Y$ well for any given input $x_{new} = (x_1, \ldots, x_p)$.
• Deterministic relationship? (many classification tasks in machine learning)
The Additive Error Model
• Key Statistical Idea: Observed Data = True Value + Noise
• For the observed training data,
$$Y_i = f(x_{i1}, \ldots, x_{ip}) + \epsilon_i$$
for $i = 1, \ldots, n$, where the errors $\epsilon_i$'s are iid with mean 0 and are independent of the $X$'s.
• Find the function $f(x_1, \ldots, x_p)$ or find its approximation! (Generative vs. Predictive models)
• The simplest case: when $p = 1$, $f(x) = \beta_0 + \beta_1 x$
Simple linear regression: $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
The first Main Topic
• Simple linear regression
Empirical Models: Regression
• Many engineering and scientific problems are concerned with determining a relationship between a set of variables.
• For example: Y = college GPA at 1st year; X = high school GPA. Or Y = Mortality rate; X = Immunization rate.
• Knowledge of such a relationship would enable us to predict the output Y.
• Regression analysis is a statistical technique that is very useful for these types of problems, as it can be used to build a model to predict Y at a given X value.
Example: Immunized and Mortality
• Suppose one wants to investigate the relationship between the percentage of children who have been immunized against the infectious diseases diphtheria, pertussis, and tetanus (DPT) in a given country and the corresponding mortality rate for children under five years of age in that country.
• The UN Children’s Fund (UNICEF) considers the under-five mortality rate to be one of the most important indicators of the level of well-being for children.
Data
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992

Nation           X    Y    Nation     X    Y    Nation    X    Y
Bolivia         77  118    Ethiopia  13  208    Mexico   91   33
Brazil          69   65    Finland   95    7    Poland   98   16
Cambodia        32  184    France    95    9    Russian  73   32
Canada          85    8    Greece    54    9    Senegal  47  145
China           94   43    India     89  124    Turkey   76   87
Czech Republic  99   12    Italy     95   10    UK       90    9
Egypt           89   55    Japan     87    6
Look at Scatter Plot
The plot shows that the mortality rate tends to decrease as the percentage of children immunized increases.
Question
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
Question:
• Are Y and X related (associated), and how?
• Does better immunization lower the mortality rate?
• Can we use the data to develop a model for predicting the under-five mortality rate from the percentage of children immunized against DPT?
Linear Regression
• It is interesting both theoretically, because of the elegance of the underlying theory, and from an applied point of view, because of the wide variety of uses.
• Fit a model for a dependent variable as a function of one or more independent variables
• We will talk about
  - Building models
  - Assessing fit and reliability
  - Drawing conclusions
A Simple Linear Regression
• We are interested in developing a linear equation that best summarizes the relationship in a sample between the response variable (Y) and the predictor variable (or independent variable) x:
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
where the $\epsilon_i$’s are independent with mean 0 and variance $\sigma^2$.
• The equation is also used to predict Y from X
(a) How to estimate the $\beta$’s
• Observe n data points $(Y_i, x_i)$, and assume
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
where the $\epsilon_i$’s are independent with mean 0 and variance $\sigma^2$.
• How to estimate the $\beta$’s?
Method of Least Squares
• The (ordinary) least squares estimator: choose β0 and β1 to minimize the residual sum of squares (RSS)
$$RSS = \sum_{i=1}^{n} \left(Y_i - \beta_0 - \beta_1 x_i\right)^2$$
Why Least Squares?
• It gives the Maximum Likelihood Estimators (MLE) of β0 and β1 when the errors $\epsilon_i$’s are iid $N(0, \sigma^2)$.
• It leads to the best linear unbiased estimators (BLUE) of β0 and β1, no matter whether the errors $\epsilon_i$’s are normally distributed or not.
[A linear estimator is of the form $\sum_{i=1}^{n} c_i Y_i$. The meaning of BLUE for β1: minimize $Var\left(\sum c_i Y_i\right) = \sigma^2 \sum c_i^2$ subject to $E\left(\sum c_i Y_i\right) = \sum c_i (\beta_0 + \beta_1 x_i) = \beta_1$ for all β0 and β1, i.e., subject to $\sum c_i = 0$ and $\sum c_i x_i = 1$.]
Method of Least Squares
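• When minimizing the residual sum of squares (RSS), the solutions are:
$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}}, \quad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$$
where $S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2$ and $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}$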
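For reference, the least squares solutions can be obtained by setting the partial derivatives of RSS to zero (the normal equations). The following is a standard derivation sketch in the notation of these slides:

```latex
\begin{aligned}
RSS(\beta_0, \beta_1) &= \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2 \\
\frac{\partial RSS}{\partial \beta_0} &= -2 \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right) = 0
  \;\Longrightarrow\; \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} \\
\frac{\partial RSS}{\partial \beta_1} &= -2 \sum_{i=1}^{n} x_i \left(y_i - \beta_0 - \beta_1 x_i\right) = 0
  \;\Longrightarrow\; \sum x_i y_i - n\bar{x}\bar{y} = \hat\beta_1 \left(\sum x_i^2 - n\bar{x}^2\right) \\
 &\Longrightarrow\; \hat\beta_1 = \frac{S_{xy}}{S_{xx}}
\end{aligned}
```

The third line substitutes $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$ into the second equation.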
Example (Cont.)
Answer
• For our data
$n = 20$, $\bar{x} = 77.4$, $\bar{y} = 59$, $\sum x_i^2 = 130446$, $\sum x_i y_i = 68626$
$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2 = 10630.8$
$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y} = -22706$
$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{-22706}{10630.8} = -2.1359$;
$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = 59 + 2.1359 \times 77.4 = 224.3163$
• Thus, the fitted (simple linear regression) model is
$Y = 224.3163 - 2.1359\,x + \epsilon$ or $E(Y) = 224.3163 - 2.1359\,x$.
(b) Example (Cont.)
The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$
• Estimate the mean under-five mortality rate per 1000 live births when x = 10.
• Repeat the question when x = 90.
[202.9573; 32.0853]
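The two estimated mean responses can be reproduced by plugging x = 10 and x = 90 into the fitted line. A small sketch, using the rounded coefficients from the slides (the variable names are illustrative):

```r
# Rounded least squares estimates from the fitted model
beta0hat <- 224.3163
beta1hat <- -2.1359

# Estimated mean under-five mortality rate at 10% and 90% immunization
xnew <- c(10, 90)
mean_response <- beta0hat + beta1hat * xnew
print(mean_response)   # 202.9573  32.0853
```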
(c) How to estimate $\sigma^2$?
• Recall that the model is $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ where the $\epsilon_i$’s are iid with mean 0 and variance $\sigma^2$
• We got the estimators $\hat\beta_0$, $\hat\beta_1$; how do we estimate the third parameter, $\sigma^2$?
Answer:
• It is natural to use the observed fitting error $e_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i)$ and the residual sum of squares $RSS = \sum_{i=1}^{n} e_i^2$
• The estimator of σ2 is $\hat\sigma^2 = \frac{RSS}{n-2}$ [and $\frac{(n-2)\hat\sigma^2}{\sigma^2} \sim \chi^2_{n-2}$]
• In practice, it is easier to compute RSS as follows:
$$RSS = \sum_{i=1}^{n} e_i^2 = S_{yy} - \hat\beta_1 S_{xy} = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$$
Example (Cont.)
In our example, the fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$. Find an estimate of $\sigma^2 = Var(\epsilon)$.
• Two ways to calculate the residual sum of squares RSS:
  - Calculate the observed fitting error (residual) $e_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i)$ and then $RSS = \sum_{i=1}^{n} e_i^2 = 29000.95$
  - Use Sxx = 10630.8, Sxy = -22706, Syy = 77498, and
$$RSS = S_{yy} - \hat\beta_1 S_{xy} = S_{yy} - \frac{S_{xy}^2}{S_{xx}} = 77498 - \frac{(-22706)^2}{10630.8} = 29000.95$$
• The estimator of σ2 is $\hat\sigma^2 = \frac{RSS}{n-2} = 1611.164$ (or $\hat\sigma = \sqrt{1611.164} = 40.1393$).
R code (calculator-type)
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)

# Corrected sums of squares and cross-products
Sxx <- sum(x * x) - length(x) * (mean(x))^2
Sxy <- sum(x * y) - length(x) * mean(x) * mean(y)
Syy <- sum(y * y) - length(y) * (mean(y))^2

# Least squares estimates of the slope and intercept
beta1hat <- Sxy / Sxx
beta0hat <- mean(y) - beta1hat * mean(x)

### Two ways to compute RSS
error <- y - (beta0hat + beta1hat * x)
RSS <- sum(error * error)
### Or
RSS <- Syy - Sxy^2 / Sxx
sigma2hat <- RSS / (length(x) - 2)
c(beta0hat, beta1hat, sigma2hat)
(d) Properties of OLS estimators
• To derive the statistical inference of the (ordinary) least squares estimators $\hat\beta_1$ and $\hat\beta_0$, we need to find $E(\hat\beta_i)$ and $Var(\hat\beta_i)$. Then by the central limit theorem, asymptotically
$$\frac{\hat\beta_i - E(\hat\beta_i)}{\sqrt{Var(\hat\beta_i)}} \approx N(0,1)$$
Key Steps
$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2$, $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}$
Assumption: the $x_i$’s are constants, and the $Y_i$’s are independent with $E(Y_i) = \beta_0 + \beta_1 x_i$ and $Var(Y_i) = \sigma^2$.
• $\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \sum_{i=1}^{n} c_i Y_i$, where $c_i = \frac{x_i - \bar{x}}{S_{xx}}$ satisfies the following three properties:
  $\sum_{i=1}^{n} c_i = 0$
  $\sum_{i=1}^{n} c_i x_i = 1$
  $\sum_{i=1}^{n} c_i^2 = \frac{1}{S_{xx}}$
• $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = \sum_{i=1}^{n} \left(\frac{1}{n} - c_i \bar{x}\right) Y_i$
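The three properties of the weights $c_i$, and the representation of both estimators as linear combinations of the $Y_i$, can be checked numerically on the immunization data. This is only a sketch; the name `c_i` is an illustrative choice.

```r
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)
n <- length(x)

Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))

# Weights c_i = (x_i - xbar) / Sxx
c_i <- (x - mean(x)) / Sxx

# The three properties: sum c_i = 0, sum c_i x_i = 1, sum c_i^2 = 1 / Sxx
print(c(sum_c = sum(c_i), sum_cx = sum(c_i * x), Sxx_times_sum_c2 = Sxx * sum(c_i^2)))

# Both estimators written as linear combinations of the responses
beta1hat <- sum(c_i * y)
beta0hat <- sum((1 / n - c_i * mean(x)) * y)
print(c(beta0hat = beta0hat, beta1hat = beta1hat))
```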
(d) Properties of OLS
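• Unbiased: $E(\hat\beta_0) = \beta_0$ and $E(\hat\beta_1) = \beta_1$
• Variance: $Var(\hat\beta_1) = \frac{\sigma^2}{S_{xx}}$ and $Var(\hat\beta_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)$
where $S_{xx} = \sum (x_i - \bar{x})^2$
• Note that they are correlated: $Cov(\hat\beta_0, \hat\beta_1) = -\frac{\sigma^2 \bar{x}}{S_{xx}}$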
CI and Tests
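• Since σ2 is unknown, consider $\hat\sigma^2 = \frac{RSS}{n-2}$ and thus $se(\hat\beta_1) = \frac{\hat\sigma}{\sqrt{S_{xx}}}$ and $se(\hat\beta_0) = \hat\sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$
• Then $\frac{\hat\beta_1 - \beta_1}{se(\hat\beta_1)}$ and $\frac{\hat\beta_0 - \beta_0}{se(\hat\beta_0)}$ have a t-distribution with n-2 degrees of freedom.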
(d1) Inference on $\beta_1$
• When testing $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \ne 0$, the test statistic is
$$T_{obs} = \frac{\hat\beta_1}{se(\hat\beta_1)} = \frac{\hat\beta_1}{\hat\sigma/\sqrt{S_{xx}}}$$
and we reject $H_0$ if $|T_{obs}| \ge t_{\alpha/2, n-2}$
• A $1 - \alpha$ confidence interval on $\beta_1$ is
$$\hat\beta_1 \pm t_{\alpha/2, n-2} \frac{\hat\sigma}{\sqrt{S_{xx}}}$$
Example (Cont.)
The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$
• Test $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \ne 0$ at the $\alpha = 0.05$ level.
[Recall $S_{xx} = 10630.8$, $\hat\sigma = 40.1393$, $t_{\alpha/2, n-2} = t_{0.025, 18} = 2.101$.
$$T_{obs} = \frac{\hat\beta_1}{\hat\sigma/\sqrt{S_{xx}}} = \frac{-2.1359}{40.1393/\sqrt{10630.8}} = -5.486$$
Since $|T_{obs}| = 5.486 \ge 2.101$, we reject $H_0$.]
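The test of $H_0: \beta_1 = 0$ can be reproduced step by step in R from the summary quantities. A sketch under the same model assumptions; the variable names (`se_beta1`, `T_obs`, `p_value`) are illustrative.

```r
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)
n <- length(x)

Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Syy <- sum((y - mean(y))^2)

# Slope estimate and residual standard error
beta1hat <- Sxy / Sxx
sigmahat <- sqrt((Syy - Sxy^2 / Sxx) / (n - 2))

# Standard error of the slope and observed t statistic
se_beta1 <- sigmahat / sqrt(Sxx)
T_obs <- beta1hat / se_beta1

# Two-sided test at alpha = 0.05 with n - 2 degrees of freedom
alpha <- 0.05
t_crit <- qt(1 - alpha / 2, df = n - 2)
p_value <- 2 * pt(-abs(T_obs), df = n - 2)
reject <- abs(T_obs) >= t_crit

print(c(T_obs = T_obs, t_crit = t_crit, p_value = p_value))
print(reject)
```

The t statistic and p-value agree with the `x` row of the `summary(fm1)` coefficient table.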
Example (Cont.)
The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$
• Find a 95% confidence interval on $\beta_1$.
[Recall $S_{xx} = 10630.8$, $\hat\sigma = 40.1393$, $t_{\alpha/2, n-2} = t_{0.025, 18} = 2.101$. So
$$\hat\beta_1 \pm t_{\alpha/2, n-2} \frac{\hat\sigma}{\sqrt{S_{xx}}} = -2.1359 \pm 0.8179 = [-2.9538, -1.3180].]$$
(d2) Inference on $\beta_0$
• When testing $H_0: \beta_0 = b_0$ versus $H_1: \beta_0 \ne b_0$, the test statistic is
$$T_{obs} = \frac{\hat\beta_0 - b_0}{se(\hat\beta_0)} = \frac{\hat\beta_0 - b_0}{\hat\sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}}$$
and we reject $H_0$ if $|T_{obs}| \ge t_{\alpha/2, n-2}$
• A $1 - \alpha$ confidence interval on $\beta_0$ is
$$\hat\beta_0 \pm t_{\alpha/2, n-2}\, \hat\sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$$
Example (Cont.)
The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$
• Test $H_0: \beta_0 = 210$ versus $H_1: \beta_0 \ne 210$ at the $\alpha = 0.05$ level.
[Recall $S_{xx} = 10630.8$, $\hat\sigma = 40.1393$, $t_{\alpha/2, n-2} = t_{0.025, 18} = 2.101$.
$$T_{obs} = \frac{\hat\beta_0 - b_0}{\hat\sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}} = 0.455$$
Since $|T_{obs}| = 0.455 < 2.101$, we fail to reject $H_0$.]
Example (Cont.)
The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$
• Find a 95% confidence interval on $\beta_0$.
[Recall $S_{xx} = 10630.8$, $\hat\sigma = 40.1393$, $t_{\alpha/2, n-2} = t_{0.025, 18} = 2.101$. So
$$\hat\beta_0 \pm t_{\alpha/2, n-2}\, \hat\sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}} = 224.3163 \pm 66.0562 = [158.26, 290.37].]$$
(d3) Inference on $\beta_0 + \beta_1 x_{new}$
For the simple linear regression model
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
For a given $x_{new}$, what is the confidence interval for the mean response $E(Y) = \beta_0 + \beta_1 x_{new}$?
Point estimator: $\hat{Y} = \hat\beta_0 + \hat\beta_1 x_{new} = \sum_{i=1}^{n} \left[\frac{1}{n} + c_i (x_{new} - \bar{x})\right] Y_i$
• $E(\hat\beta_0 + \hat\beta_1 x_{new}) = \beta_0 + \beta_1 x_{new}$
• $Var(\hat\beta_0 + \hat\beta_1 x_{new}) = \sigma^2\left[\frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}\right]$
• The $1 - \alpha$ confidence interval on the mean response is
$$\hat\beta_0 + \hat\beta_1 x_{new} \pm t_{\alpha/2, n-2}\, \hat\sigma\sqrt{\frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}$$
Example (Cont.)
The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$
• Find a 95% confidence interval on the mean under-five mortality rate when x = 10.
[Recall $S_{xx} = 10630.8$, $\hat\sigma = 40.1393$, $\bar{x} = 77.4$, $t_{0.025, 18} = 2.101$.
$$\hat{Y} \pm t_{\alpha/2, n-2}\, \hat\sigma\sqrt{\frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}} = 202.9573 \pm 58.2641 = [144.6932, 261.2214]]$$
(e) Prediction on new Observation
For the simple linear regression model
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
how do we predict a future observation Y corresponding to a given $x_{new}$?
• Point estimator: $\hat{Y} = \hat\beta_0 + \hat\beta_1 x_{new}$
• How about a confidence interval on Y? This is often called a prediction interval.
Key Idea
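For the future response $Y = \beta_0 + \beta_1 x_{new} + \epsilon_{future}$
Consider the estimator $\hat{Y} = \hat\beta_0 + \hat\beta_1 x_{new}$. Then
• $E(Y - \hat{Y}) = 0$
• Because $\epsilon_{future}$ is independent of the training data used to compute $\hat\beta_0$ and $\hat\beta_1$,
$$\begin{aligned}
Var(Y - \hat{Y}) &= Var\left(\beta_0 + \beta_1 x_{new} + \epsilon_{future} - (\hat\beta_0 + \hat\beta_1 x_{new})\right) \\
&= Var(\epsilon_{future}) + Var(\hat\beta_0 + \hat\beta_1 x_{new}) \\
&= \sigma^2 + \frac{\sigma^2}{n} + (x_{new} - \bar{x})^2 \frac{\sigma^2}{S_{xx}}
\end{aligned}$$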
Key Idea (Cont.)
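For the future response $Y = \beta_0 + \beta_1 x_{new} + \epsilon$
Consider the estimate $\hat{Y} = \hat\beta_0 + \hat\beta_1 x_{new}$. Then
•
$$\frac{Y - \hat{Y}}{\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}} \sim N(0,1)$$
• So
$$\frac{Y - \hat{Y}}{\hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}} \sim T_{n-2}$$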
Prediction Interval
For the simple linear regression model
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
how do we predict a future observation Y corresponding to a given $x_{new}$?
• Point estimator: $\hat{Y} = \hat\beta_0 + \hat\beta_1 x_{new}$
• The $1 - \alpha$ prediction interval is
$$\hat{Y} \pm t_{\alpha/2, n-2}\, \hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}$$
Example (Cont.)
The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$
• Find a 95% prediction interval on Y when x = 10.
[Recall $S_{xx} = 10630.8$, $\hat\sigma = 40.1393$, $\bar{x} = 77.4$, $t_{0.025, 18} = 2.101$.
$$\hat{Y} \pm t_{\alpha/2, n-2}\, \hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}} = 202.9573 \pm 102.5022 = [100.4551, 305.4595]]$$
Example (Cont.)
The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$
• Find a 95% prediction interval on Y when x = 90.
[Recall $S_{xx} = 10630.8$, $\hat\sigma = 40.1393$, $\bar{x} = 77.4$, $t_{0.025, 18} = 2.101$.
$$\hat{Y} \pm t_{\alpha/2, n-2}\, \hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}} = 32.0853 \pm 87.0276 = [-54.9423, 119.1129]]$$
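The confidence and prediction intervals in the worked examples can be computed by hand in R. The helper `interval_at()` below is a hypothetical function written for this sketch, not part of base R or the course code. It uses the exact t quantile from `qt()` and unrounded coefficients, so its results match the `predict()` output and differ from the slide values (which use 2.101 and rounded coefficients) only in the third decimal.

```r
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)

# Hypothetical helper: confidence interval on the mean response and
# prediction interval for a future observation at each value in xnew
interval_at <- function(xnew, x, y, level = 0.95) {
  n <- length(x)
  Sxx <- sum((x - mean(x))^2)
  Sxy <- sum((x - mean(x)) * (y - mean(y)))
  beta1hat <- Sxy / Sxx
  beta0hat <- mean(y) - beta1hat * mean(x)
  sigmahat <- sqrt(sum((y - beta0hat - beta1hat * x)^2) / (n - 2))
  t_crit <- qt(1 - (1 - level) / 2, df = n - 2)

  fit <- beta0hat + beta1hat * xnew
  # Common term 1/n + (xnew - xbar)^2 / Sxx
  lev <- 1 / n + (xnew - mean(x))^2 / Sxx
  half_ci <- t_crit * sigmahat * sqrt(lev)       # mean response
  half_pi <- t_crit * sigmahat * sqrt(1 + lev)   # future observation

  data.frame(xnew = xnew, fit = fit,
             ci_lwr = fit - half_ci, ci_upr = fit + half_ci,
             pi_lwr = fit - half_pi, pi_upr = fit + half_pi)
}

res <- interval_at(c(10, 90), x, y)
print(res)
```

The prediction interval is always wider than the confidence interval at the same $x_{new}$ because of the extra "1 +" term for the future error. Note that the lower prediction limit at x = 90 is negative even though a mortality rate cannot be.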
Summary (I): point estimation
Assume that we observe $(x_i, y_i)$ for i = 1, ..., n, and we consider the simple linear regression model
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
where the $\epsilon_i$’s are iid with mean 0 and variance $\sigma^2$.
• Define
$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2$,
$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}$,
$S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2$
• The least squares estimators are
$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}}, \quad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$$
Summary (II) : Estimation of σ2 and Inference
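• The estimator of σ2 is $\hat\sigma^2 = \frac{RSS}{n-2}$, where $RSS = \sum_{i=1}^{n} e_i^2$ and the residuals are $e_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i)$. In practice, it is better to use
$$RSS = \sum_{i=1}^{n} e_i^2 = S_{yy} - \hat\beta_1 S_{xy} = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$$
• $\frac{\hat\beta_1 - \beta_1}{se(\hat\beta_1)} \sim T_{n-2}$; $se(\hat\beta_1) = \frac{\hat\sigma}{\sqrt{S_{xx}}}$
• $\frac{\hat\beta_0 - \beta_0}{se(\hat\beta_0)} \sim T_{n-2}$; $se(\hat\beta_0) = \hat\sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$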
Summary III: Inference
At a given $x_{new}$
• The point estimator of Y is $\hat{Y} = \hat\beta_0 + \hat\beta_1 x_{new}$
• A $1 - \alpha$ confidence interval on the mean response E(Y) is
$$\hat{Y} \pm t_{\alpha/2, n-2}\, \hat\sigma\sqrt{\frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}$$
• A $1 - \alpha$ prediction interval on the future observation is
$$\hat{Y} \pm t_{\alpha/2, n-2}\, \hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}$$
(appropriate for testing data)
Part C
• Introduction to R
What is R
• R is a system for statistical computation and graphics
• It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files
• Free software
• OS: Windows, Unix, Linux
• Homepage: http://www.r-project.org
Installing R Under Windows
• Need Windows OS (32/64 bits)
• Go to any CRAN site (see http://cran.r-project.org/mirrors.html for a list), and follow the instructions
• Download R 3.1.0 for Windows, “R-3.1.0-win.exe” (Size: 54Mb), double-click on the icon, and follow the instructions to install
Data With R
• Objects: vector, factor, array, matrix, data.frame, ts, list
• Mode (numerical, character, complex, and logical); Length
• Read data stored in text (ASCII) files: read.table(), scan(), and read.fwf()
• Saving data: write(x, file="data.txt"); write.table() writes a data.frame to a file
• Generating data
Linear Regression in R
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)
fm1 <- lm(y ~ x)
fm1

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
    224.316       -2.136
summary(fm1)
> summary(fm1)

Call:
lm(formula = y ~ x)

Residuals:
      Min        1Q    Median        3Q       Max
-99.97934 -16.57854   0.06684  20.84946  89.77608

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 224.3163    31.4403   7.135 1.20e-06 ***
x            -2.1359     0.3893  -5.486 3.28e-05 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 40.14 on 18 degrees of freedom
Multiple R-Squared: 0.6258, Adjusted R-squared: 0.605
F-statistic: 30.1 on 1 and 18 DF, p-value: 3.281e-05
Confidence Interval on coefficients
> confint(fm1)
                 2.5 %     97.5 %
(Intercept) 158.262579 290.369998
x            -2.953763  -1.317976

> confint(fm1, level = 0.99)
                 0.5 %     99.5 %
(Intercept) 133.817133 314.815444
x            -3.256453  -1.015286
Intervals for xnew
> xnew <- data.frame(x = c(10, 90))
## Confidence intervals on the mean response
> predict(fm1, xnew, interval="confidence", level=0.95)
        fit       lwr       upr
1 202.95759 144.69566 261.21953
2  32.08805  10.59907  53.57702

## Prediction intervals for future observations
> predict(fm1, xnew, interval="prediction", level=0.95)
        fit       lwr      upr
1 202.95759 100.45917 305.4560
2  32.08805 -54.93637 119.1125