ISyE 6414: Regression Analysis

Lectures: MWF 8:00-10:30, MRDC #2404 (8:00-9:10; 10-min break; 9:20-10:30)
Early five-week session: May 14 - June 15
Instructor: Dr. Yajun Mei (“YA_JUNE MAY”)
• Email: [email protected]; Tel: 404-894-2334 (O)
• Office Hours: MWF 10:30-11:00, after class or Groseclose #343
• Course Homepage: Canvas (all HWs due on Canvas); backup: http://www.isye.gatech.edu/~ymei/6414
• HW#1 due on Friday, May 18 for on-campus students, and on Wednesday, May 23 for distance learning students

ISyE 6414: Regression Analysisymei/6414/Handouts/Lecture01.pdf · 10. Important discrete RVs 11 Discrete Uniform. Binomial(n,p) Geometric(p) ... infectious disease diphtheria, pertussis,



My academic pathway

• Undergraduate: Math, Peking Univ., BS in 1996
• Worked as a computer programmer in a Chinese bank, 1996-1998
• Graduate: PhD in Math with a minor in EE, Caltech, 1998-2003 (advisor: Dr. Gary Lorden)
• Postdoc in biostatistics: FHCRC, Seattle, 2003 - Sep 2005 (supervisor: Dr. Sarah Holte)
• New Research Fellow: SAMSI & Duke Univ., Fall 2005
• Joined ISyE at GT in Jan 2006. Currently a tenured associate professor.

About this course

• Regression analysis is a key building block for many modern machine learning, artificial intelligence, and business analytics techniques and methods (such as neural networks, deep learning, boosting, random forests, etc.)
• This course aims to help you
    - understand its theoretical aspects (HW#1, #2, #4, and a midterm)
    - understand its computational aspects (HW#3 and a course project)

Organization of the Course

Textbooks (notes/slides provided):
• Kutner, Nachtsheim, Neter and Li, Applied Linear Statistical Models, 5th ed. (www.abebooks.com/servlet/SearchResults?isbn=9780073108742)
• Faraway, Practical Regression and ANOVA using R (freely downloadable online)

Topics:
• Simple Linear Regression (Ch 1-4)
• Multiple Linear Regression (Ch 5-11) (2 weeks, Midterm)
• Advanced Regression (Ch 13-14) (2 weeks)
• Design of Experiments (Ch 13, 14)

Organization of the Course

Grading Policy (the past AVG GPA is [3.7, 3.9]):
• Class attendance (5%)
• Homework (4*10% = 40%): collaboration is encouraged, but you cannot look at any other solutions before submitting.
• One in-class midterm (25%): 9:15am-10:30am, Friday, May 25 (happy Memorial weekend)
• Class project (30%): a team of 2-4 or by yourself. See the handout for possible project topics.
    - Proposal (1-3 pages): May 30 (Wed)
    - Presentation file: due 7am on June 13 (Wed) (only for on-campus students, not required for DL students)
    - Final report: June 15 (Friday)

[Only for the Distance Learning students: two-lecture delay for homework and the class project proposal, and one-week delay for the midterm and the final report.]

Part A

• Basic background on probability and statistics. We might not discuss this background in detail, but I have listed some slides here so that you can refresh your memory if necessary.
• Three key probability distributions: Binomial, Poisson, and Normal.

Probability Review

See Appendix A of our text.

• Probability
• Discrete Random Variables
• Continuous Random Variables
• Joint Distribution

Probability

Basics of Probability Theory
• Random experiments, e.g., flip a fair coin three times, and observe “Heads” or “Tails”
• Sample space: the set of all possible outcomes, e.g., S = {HHH, THH, HTH, HHT, HTT, THT, TTH, TTT}
• An event: a subset of the sample space of a random experiment, e.g., observe one “Heads”
• Union/Intersection/Complement of events; Counting Techniques; Axioms of Probability; Conditional Probability; Independence; Bayes’ Theorem

Random Variable

• A random variable is a function that assigns a real number to each outcome in the sample space of a random experiment.
• Example: Let X be the number of heads when flipping a fair coin three times. Rigorously,

w      HHH  HHT  HTH  THH  HTT  THT  TTH  TTT
X(w)     3    2    2    2    1    1    1    0
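The mapping from outcomes to values of X can be reproduced by enumerating the sample space. The following is a minimal sketch in base R, assuming a fair coin so that all eight outcomes are equally likely; the names `flips`, `outcomes`, and `pmf` are illustrative and do not come from the slides.

```r
# Enumerate all 2^3 outcomes of three coin flips
flips <- expand.grid(first = c("H", "T"), second = c("H", "T"),
                     third = c("H", "T"), stringsAsFactors = FALSE)
outcomes <- apply(flips, 1, paste, collapse = "")

# X(w) = number of heads in outcome w
X <- rowSums(flips == "H")
names(X) <- outcomes

# Probability mass function of X when outcomes are equally likely
pmf <- table(X) / length(X)
print(X)
print(pmf)
```

Because X only counts heads, several outcomes map to the same value, which is why the probability mass concentrates on X = 1 and X = 2.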

Discrete Random Variable X

X with countably many possible values
• Probability mass function: $f(x) = P(X = x)$
• Cumulative distribution function: $F(x) = P(X \le x) = \sum_{x_i \le x} f(x_i)$
• Mean: $\mu = E(X) = \sum_x x f(x)$
• Variance: $\sigma^2 = Var(X) = \sum_x (x - \mu)^2 f(x) = E(X^2) - \mu^2$
• Standard deviation: $\sigma = \sqrt{\sigma^2}$

Important discrete RVs

• Discrete Uniform
• Binomial(n,p)
• Geometric(p)
• Poisson($\lambda$)
• What are the mean and Var/SD?
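One way to check an answer to the mean and variance question is to sum directly over the probability mass function. The sketch below does this in base R for the Binomial and Poisson cases; the parameter values (n = 10, p = 0.3, lambda = 2.5) are arbitrary choices for illustration, and the Poisson sum is truncated at 200, where the remaining tail mass is negligible.

```r
# Binomial(n, p): sum over the full support 0..n
n <- 10; p <- 0.3
k <- 0:n
binom_mean <- sum(k * dbinom(k, n, p))
binom_var  <- sum((k - binom_mean)^2 * dbinom(k, n, p))

# Poisson(lambda): truncate the infinite support (assumption: tail beyond 200 is negligible)
lambda <- 2.5
j <- 0:200
pois_mean <- sum(j * dpois(j, lambda))
pois_var  <- sum((j - pois_mean)^2 * dpois(j, lambda))

# Compare with the closed forms n*p, n*p*(1-p) and lambda, lambda
c(binom_mean, n * p, binom_var, n * p * (1 - p))
c(pois_mean, lambda, pois_var, lambda)
```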

Continuous Random Variable

• Probability density function: $f(x) \ge 0$ with $P(a \le X \le b) = \int_a^b f(x)\,dx$
• Cumulative distribution function: $F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\,du$
• Mean: $\mu = E(X) = \int x f(x)\,dx$
• Variance: $\sigma^2 = Var(X) = \int (x - \mu)^2 f(x)\,dx = E(X^2) - \mu^2$
• Standard deviation: $\sigma = \sqrt{\sigma^2}$

Important Continuous RVs

• Gamma/Weibull/Lognormal/Beta distribution
• What are the mean and Var/SD?

Central Limit Theorem

a. If X is Binomial(n,p), then
$$Z = \frac{X - np}{\sqrt{np(1-p)}} \approx N(0,1)$$
(with continuity correction)

b. If $X_1, X_2, \ldots, X_n$ are iid with mean $\mu$ and variance $\sigma^2$, then
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \approx N(0,1) \quad \left(\text{or } Z = \frac{X_1 + \cdots + X_n - n\mu}{\sqrt{n}\,\sigma} \approx N(0,1)\right)$$
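The effect of the continuity correction in part (a) can be seen by comparing an exact binomial probability with its two normal approximations. This is a small sketch in base R; the values n = 100, p = 0.3, and the cutoff k = 35 are assumptions chosen only for illustration.

```r
# Exact P(X <= k) for X ~ Binomial(n, p)
n <- 100; p <- 0.3; k <- 35
exact <- pbinom(k, n, p)

# Normal approximation with mean n*p and sd sqrt(n*p*(1-p))
mu <- n * p
sigma <- sqrt(n * p * (1 - p))
approx_plain <- pnorm((k - mu) / sigma)        # no continuity correction
approx_cc    <- pnorm((k + 0.5 - mu) / sigma)  # continuity correction: use k + 0.5

c(exact = exact, plain = approx_plain, corrected = approx_cc)
```

The correction replaces the integer cutoff k by k + 0.5 because the discrete event {X <= k} corresponds to the interval up to k + 0.5 on the continuous scale.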

Statistical Review

• Population parameter vs. Sample statistic
• Point Estimation
• Confidence Interval
• Hypothesis Testing

Population Parameter vs Sample Statistic

• Population: a set of entities concerning which statistical inferences are to be drawn. Typically the population is very large, making a census or a complete enumeration of all the values in the population impractical or impossible.
• Sample: a subset of observed objects from the population. The sample represents a subset of manageable size (possibly massive).
• Parameter: a (typically unobservable) quantity that indexes a family of probability distributions. It can be regarded as a numerical characteristic of a population or a model.
• Statistic: a measure of some attribute of a sample. It is calculated by applying a function to the values of the items comprising the sample.

[Population parameter vs. Sample statistic]

Important Sample Statistics

• Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
• Sample standard deviation: $s = \sqrt{s^2}$
• Sample range: $r = \max(x_i) - \min(x_i)$
• Quartiles:
    - The lower quartile: 25% of the data is less than $q_1$
    - The median: 50% of the data is less than $q_2$
    - The upper quartile: 75% of the data is less than $q_3$
• As a measure of variability, the interquartile range (IQR) is defined as: IQR = $q_3 - q_1$
• Plots: Stem-and-Leaf Diagram/Plot, Histogram, Box Plots, Probability Plots (or Normal QQ plots)

Normal Distribution

Assume $X_1, X_2, \ldots, X_n$ are iid normal with mean $\mu$ and variance $\sigma^2$.

• Sample mean: $\bar{X} \sim N(\mu, \sigma^2/n)$, or $\sqrt{n}(\bar{X} - \mu)/\sigma \sim N(0,1)$
• Sample variance: $S^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}$ satisfies $\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$ (Chi-square distribution)

Normal Distribution (Cont.)

Assume $X_1, X_2, \ldots, X_n$ are iid normal with mean $\mu$ and variance $\sigma^2$.

• The sample mean $\bar{X}$ is independent of the sample variance $S^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}$. Moreover,
$$\frac{\sqrt{n}(\bar{X} - \mu)}{S} = \frac{N(0,1)}{\sqrt{\chi^2_{n-1}/(n-1)}}$$
has a t-distribution with df = n-1. [In many cases, $\frac{\hat{\theta} - \theta}{s.e.(\hat{\theta})}$ often has a t-distribution.]
• In Appendix B on page 1317, for the t-distribution, the critical point is $t_{\alpha, df} = t(A; df)$ with $A = 1 - \alpha$, so $t_{0.025, 18} = t(0.975; 18) = 2.101$.
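The tabulated critical point can also be obtained from the t quantile function instead of a printed table. A minimal sketch in base R, using `qt()` with the same alpha = 0.05 and df = 18 as in the example; the variable names are illustrative.

```r
# Upper alpha/2 critical point of the t-distribution: t_{alpha/2, df} = t(1 - alpha/2; df)
alpha <- 0.05
df <- 18
t_crit <- qt(1 - alpha / 2, df)
round(t_crit, 3)
```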

Point Estimation

• The bias of the estimator $\hat{\theta}$ is $Bias(\hat{\theta}) = E(\hat{\theta}) - \theta$. An estimator is unbiased if the bias is 0.
• The variance of the estimator $\hat{\theta}$ is $Var(\hat{\theta})$.
• The mean square error of the estimator $\hat{\theta}$ is $MSE(\hat{\theta}) = E(\hat{\theta} - \theta)^2 = Var(\hat{\theta}) + [Bias(\hat{\theta})]^2$
• The standard error of $\hat{\theta}$ is $s.e. = \sqrt{Var(\hat{\theta})}$
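The decomposition of the mean square error into variance plus squared bias follows from adding and subtracting $E(\hat{\theta})$ inside the square. A short derivation sketch, which assumes only that $\hat{\theta}$ has a finite second moment:

```latex
\begin{aligned}
MSE(\hat{\theta}) &= E\big[(\hat{\theta} - \theta)^2\big] \\
 &= E\big[\big((\hat{\theta} - E\hat{\theta}) + (E\hat{\theta} - \theta)\big)^2\big] \\
 &= E\big[(\hat{\theta} - E\hat{\theta})^2\big]
    + 2\,(E\hat{\theta} - \theta)\,E\big[\hat{\theta} - E\hat{\theta}\big]
    + (E\hat{\theta} - \theta)^2 \\
 &= Var(\hat{\theta}) + \big[Bias(\hat{\theta})\big]^2 ,
\end{aligned}
```

where the cross term vanishes because $E[\hat{\theta} - E\hat{\theta}] = 0$.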

Methods of Point Estimation

• There are three methodologies to create point estimates of a population parameter:
    A. Method of moments (MOM)
    B. Method of maximum likelihood (MLE)
    C. Bayesian estimation of parameters

MOM & MLE

• The method of moments (MOM) estimators are found by equating the population moments to the sample moments and solving the resulting equations, e.g.,
$$h(\theta) = E(X) = \bar{X} = \frac{X_1 + \cdots + X_n}{n}.$$
• The maximum likelihood estimator (MLE) is the value of $\theta$ that maximizes the likelihood function $L(\theta) = f(x_1) f(x_2) \cdots f(x_n)$.
    - If the domain of f(x) does not depend on $\theta$, solving $\frac{d \log L(\theta)}{d\theta} = 0$ yields the MLE.
    - Otherwise, plot $L(\theta)$ and find the maximum.
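When the likelihood equation is awkward to solve by hand, the log-likelihood can be maximized numerically. The sketch below uses base R's `optimize()` on a Poisson model; the count data are hypothetical and the Poisson model is an assumption made for illustration. For this model the MLE and the MOM estimator both equal the sample mean, which gives a closed form to compare against.

```r
# Hypothetical count data, assumed iid Poisson(lambda)
counts <- c(2, 0, 3, 1, 4, 2, 1)

# Log-likelihood: log L(lambda) = sum of log f(x_i; lambda)
loglik <- function(lambda) sum(dpois(counts, lambda, log = TRUE))

# Numerical maximization over a bracketing interval (the interval is an assumption)
fit <- optimize(loglik, interval = c(0.01, 10), maximum = TRUE, tol = 1e-8)

# Compare with the closed-form MLE (and MOM estimator), the sample mean
c(numerical = fit$maximum, closed_form = mean(counts))
```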

Confidence Interval & Hypothesis Testing

One sample:
1. Normal mean with known variance (one-sided)
2. Normal mean with unknown variance
3. Normal variance
4. Proportion of Binomial distribution

Two samples: inference on mean difference
5. Two independent normal distributions: variances known
6. Two independent normal distributions: unknown and equal variances
7. Two independent normal distributions: unknown and unequal variances
8. Paired samples

Part B

• Overview of Supervised Learning
• Simple Linear Regression

Overview of Supervised Learning

Supervised learning (directed data mining, learning with a teacher):
• The observed data are of the form $(Y_i, X_{i1}, \ldots, X_{ip})$ for $i = 1, \ldots, n$, where the variables can be split into two groups:
    - independent variables (explanatory variables, inputs, predictors) $X = (X_1, \ldots, X_p)$, and
    - one (or more) dependent variable (output, response) Y.
• The objective is to predict Y given values of the input X.

Supervised Learning

• Observed data (training data): $(Y_i, X_{i1}, \ldots, X_{ip})$ for $i = 1, \ldots, n$
• Objective: find a function $f(x_{new}) = f(x_1, \ldots, x_p)$ that can predict Y well for any given input $x_{new} = (x_1, \ldots, x_p)$.
• Deterministic relationship? (many classification tasks in machine learning)

The Additive Error Model

• Key statistical idea: Observed Data = True Value + Noise
• For the observed training data,
$$Y_i = f(x_{i1}, \ldots, x_{ip}) + \epsilon_i$$
for $i = 1, \ldots, n$, where the errors $\epsilon_i$ are iid with mean 0 and are independent of the X's.
• Find the function $f(x_1, \ldots, x_p)$ or find its approximation! (Generative vs. predictive models)
• The simplest case: when $p = 1$, $f(x) = \beta_0 + \beta_1 x$.
  Simple linear regression: $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$

The First Main Topic

• Simple linear regression

Empirical Models: Regression

• Many engineering and scientific problems are concerned with determining a relationship between a set of variables.
• For example: Y = college GPA in the 1st year; X = high school GPA. Or Y = mortality rate; X = immunization rate.
• Knowledge of such a relationship would enable us to predict the output Y.
• Regression analysis is a statistical technique that is very useful for these types of problems, as it can be used to build a model to predict Y at a given X value.

Example: Immunization and Mortality

• Suppose one wants to investigate the relationship between the percentage of children who have been immunized against the infectious diseases diphtheria, pertussis, and tetanus (DPT) in a given country and the corresponding mortality rate for children under five years of age in that country.
• The UN Children’s Fund (UNICEF) considers the under-five mortality rate to be one of the most important indicators of the level of well-being for children.

Data

X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992

Nation            X    Y    Nation     X    Y    Nation    X    Y
Bolivia          77  118    Ethiopia  13  208    Mexico   91   33
Brazil           69   65    Finland   95    7    Poland   98   16
Cambodia         32  184    France    95    9    Russia   73   32
Canada           85    8    Greece    54    9    Senegal  47  145
China            94   43    India     89  124    Turkey   76   87
Czech Republic   99   12    Italy     95   10    UK       90    9
Egypt            89   55    Japan     87    6

Look at Scatter Plot

The plot shows that the mortality rate tends to decrease as the percentage of children immunized increases.
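The scatter plot and the direction of the association can be reproduced from the data. A minimal sketch in base R: the vectors `x` and `y` hold the 20 observations from the data slide, the axis labels are illustrative, and the plot is drawn only in an interactive session. The sample correlation is computed as one numerical summary of the decreasing trend; it is not part of the original slide.

```r
# DPT immunization percentage (x) and under-five mortality per 1000 live births (y), 1992
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91,
       98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16,
       32, 145, 87, 9)

# Scatter plot (skipped when run non-interactively)
if (interactive()) {
  plot(x, y, xlab = "Percentage immunized against DPT",
       ylab = "Under-five mortality rate per 1000 live births")
}

# Sample correlation: negative when y tends to decrease as x increases
r <- cor(x, y)
r
```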

Question

X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992

Questions:
• Are Y and X related (associated), and how?
• Does better immunization improve the mortality rate?
• Can we use the data to develop a model for predicting the under-five mortality rate from the percentage of children immunized against DPT?

Linear Regression

• It is interesting both theoretically, because of the elegance of the underlying theory, and from an applied point of view, because of the wide variety of uses.
• Fit a model for a dependent variable as a function of one or more independent variables.
• We will talk about
    - building models
    - assessing fit and reliability
    - drawing conclusions

A Simple Linear Regression

• We are interested in developing a linear equation that best summarizes the relationship in a sample between the response variable (Y) and the predictor variable (or independent variable) x:
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
where the $\epsilon_i$'s are independent with mean 0 and variance $\sigma^2$.
• The equation is also used to predict Y from X.

(a) How to estimate the $\beta$'s

• Observe n data points $(Y_i, x_i)$, and assume
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
where the $\epsilon_i$'s are independent with mean 0 and variance $\sigma^2$.
• How to estimate the $\beta$'s?

Method of Least Squares

• The (ordinary) least squares estimator: choose $\beta_0$ and $\beta_1$ to minimize the residual sum of squares (RSS)
$$RSS(\beta_0, \beta_1) = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_i)^2$$

Why Least Squares?

• It gives the maximum likelihood estimators (MLE) of $\beta_0$ and $\beta_1$ when the errors $\epsilon_i$ are iid $N(0, \sigma^2)$.
• It leads to the best linear unbiased estimators (BLUE) of $\beta_0$ and $\beta_1$, whether or not the errors $\epsilon_i$ are normally distributed.

[A linear estimator is of the form $\sum_{i=1}^{n} c_i Y_i$. The meaning of BLUE for $\beta_1$: minimize $Var(\sum c_i Y_i) = \sigma^2 \sum c_i^2$ subject to $E(\sum c_i Y_i) = \sum c_i (\beta_0 + \beta_1 x_i) = \beta_1$ for all $\beta_0$ and $\beta_1$, i.e., subject to $\sum c_i \beta_0 = 0$ (so $\sum c_i = 0$) and $\sum c_i x_i = 1$.]

Method of Least Squares

• When minimizing the residual sum of squares (RSS), the solutions are
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},$$
where $S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2$ and $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}$.
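The least squares solutions can be obtained by setting the partial derivatives of the residual sum of squares to zero. A short derivation sketch, writing $RSS(\beta_0, \beta_1) = \sum (y_i - \beta_0 - \beta_1 x_i)^2$ and assuming the $x_i$ are not all equal (so that $S_{xx} > 0$):

```latex
\begin{aligned}
\frac{\partial RSS}{\partial \beta_0} &= -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0
  \;\Longrightarrow\; \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \\
\frac{\partial RSS}{\partial \beta_1} &= -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0
  \;\Longrightarrow\; \sum_{i=1}^{n} x_i (y_i - \bar{y}) = \hat{\beta}_1 \sum_{i=1}^{n} x_i (x_i - \bar{x}).
\end{aligned}
```

Since $\sum x_i (y_i - \bar{y}) = \sum (x_i - \bar{x})(y_i - \bar{y}) = S_{xy}$ and $\sum x_i (x_i - \bar{x}) = \sum (x_i - \bar{x})^2 = S_{xx}$, the second equation gives $\hat{\beta}_1 = S_{xy}/S_{xx}$.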

Example (Cont.)

X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992 (same 20 nations as in the Data slide).

Answer

• For our data,
$$n = 20, \quad \bar{x} = 77.4, \quad \bar{y} = 59, \quad \sum x_i^2 = 130446, \quad \sum x_i y_i = 68626$$
$$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2 = 10630.8$$
$$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y} = -22706$$
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{-22706}{10630.8} = -2.1359; \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 59 + 2.1359 \times 77.4 = 224.3163$$
• Thus, the fitted (simple linear regression) model is
$$Y = 224.3163 - 2.1359\,x + \epsilon \quad \text{or} \quad E(Y) = 224.3163 - 2.1359\,x.$$

(b) Example (Cont.)

The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$

• Estimate the mean under-five mortality rate per 1000 live births when x = 10.
• Repeat the question when x = 90.

[Answers: 202.9573; 32.0853]
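The two estimates can be checked by fitting the line with base R's `lm()` and calling `predict()` at the new x values. This is a sketch rather than the slide's own calculation: it uses unrounded coefficients, so the results can differ from the slide's figures in the third decimal place. The name `new_x` is illustrative.

```r
# Data from the immunization example
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91,
       98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16,
       32, 145, 87, 9)

# Fit the simple linear regression and estimate the mean response at x = 10 and x = 90
fit <- lm(y ~ x)
new_x <- data.frame(x = c(10, 90))
est <- predict(fit, newdata = new_x)
est
```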

(c) How to estimate $\sigma^2$?

• Recall that the model is $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where the $\epsilon_i$'s are iid with mean 0 and variance $\sigma^2$.
• We obtained the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$; how do we estimate the third parameter, $\sigma^2$?

Answer:
• It is natural to use the observed fitting errors (residuals) $e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$ and the residual sum of squares $RSS = \sum_{i=1}^{n} e_i^2$.
• The estimator of $\sigma^2$ is $\hat{\sigma}^2 = \frac{RSS}{n-2}$ [and, when the errors are normal, $\frac{(n-2)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-2}$].
• In practice, it is easier to compute RSS as follows:
$$RSS = \sum_{i=1}^{n} e_i^2 = S_{yy} - \hat{\beta}_1 S_{xy} = S_{yy} - \frac{S_{xy}^2}{S_{xx}},$$
where $S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2$.

Example (Cont.)

In our example, the fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$. Find an estimate of $\sigma^2 = Var(\epsilon)$.

• Two ways to calculate the residual sum of squares RSS:
    - Calculate the observed fitting errors (residuals) $e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$ and then $RSS = \sum_{i=1}^{n} e_i^2 = 29000.95$
    - Use $S_{xx} = 10630.8$, $S_{xy} = -22706$, $S_{yy} = 77498$, and
$$RSS = S_{yy} - \hat{\beta}_1 S_{xy} = S_{yy} - \frac{S_{xy}^2}{S_{xx}} = 77498 - \frac{(-22706)^2}{10630.8} = 29000.95$$
• The estimator of $\sigma^2$ is $\hat{\sigma}^2 = \frac{RSS}{n-2} = 1611.164$ (or $\hat{\sigma} = \sqrt{1611.164} = 40.1393$).

R code (calculator-type)

x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91,
       98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16,
       32, 145, 87, 9)

# Corrected sums of squares and cross-products
Sxx <- sum(x * x) - length(x) * (mean(x))^2
Sxy <- sum(x * y) - length(x) * mean(x) * mean(y)
Syy <- sum(y * y) - length(y) * (mean(y))^2

# Least squares estimates of slope and intercept
beta1hat <- Sxy / Sxx
beta0hat <- mean(y) - beta1hat * mean(x)

### Two ways to compute RSS
error <- y - (beta0hat + beta1hat * x)
RSS <- sum(error * error)
### Or
RSS <- Syy - Sxy^2 / Sxx
sigma2hat <- RSS / (length(x) - 2)

c(beta0hat, beta1hat, sigma2hat)
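The calculator-type results can be cross-checked against base R's built-in `lm()` fit. A minimal sketch: `coef()` returns the intercept and slope, and `summary(fit)$sigma` is the residual standard error, so its square should match the estimate of $\sigma^2$ with n - 2 degrees of freedom. The object name `fit` is illustrative.

```r
# Same data as in the calculator-type code
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91,
       98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16,
       32, 145, 87, 9)

# Fit y = beta0 + beta1 * x by ordinary least squares
fit <- lm(y ~ x)

coef(fit)                              # (Intercept) = beta0hat, x = beta1hat
sigma2hat_lm <- summary(fit)$sigma^2   # RSS / (n - 2)
sigma2hat_lm
```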

(d) Properties of OLS estimators

• To derive statistical inference for the (ordinary) least squares estimators $\hat{\beta}_1$ and $\hat{\beta}_0$, we need to find $E(\hat{\beta}_i)$ and $Var(\hat{\beta}_i)$. Then by the central limit theorem, asymptotically
$$\frac{\hat{\beta}_i - E(\hat{\beta}_i)}{\sqrt{Var(\hat{\beta}_i)}} \approx N(0,1)$$

Key Steps

$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2$, $\quad S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}$

Assumption: the $x_i$'s are constants, and the $Y_i$'s are independent with $E(Y_i) = \beta_0 + \beta_1 x_i$ and $Var(Y_i) = \sigma^2$.

• $\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \sum_{i=1}^{n} c_i Y_i$, where $c_i = \frac{x_i - \bar{x}}{S_{xx}}$ satisfies the following three properties:
    $\sum_{i=1}^{n} c_i = 0$
    $\sum_{i=1}^{n} c_i x_i = 1$
    $\sum_{i=1}^{n} c_i^2 = \frac{1}{S_{xx}}$
• $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \sum_{i=1}^{n} \left(\frac{1}{n} - c_i \bar{x}\right) Y_i$
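The three properties of the weights $c_i$, and the representation of both estimators as linear combinations of the $Y_i$, can be verified numerically on the immunization data. A minimal sketch in base R; the names `ci`, `beta1_lin`, and `beta0_lin` are illustrative.

```r
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91,
       98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16,
       32, 145, 87, 9)
n <- length(x)

# Weights c_i = (x_i - xbar) / Sxx
Sxx <- sum((x - mean(x))^2)
ci <- (x - mean(x)) / Sxx

# The three properties: sum c_i = 0, sum c_i x_i = 1, sum c_i^2 = 1 / Sxx
c(sum(ci), sum(ci * x), sum(ci^2), 1 / Sxx)

# Both estimators as linear combinations of the responses
beta1_lin <- sum(ci * y)
beta0_lin <- sum((1 / n - ci * mean(x)) * y)
c(beta0_lin, beta1_lin)
```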

(d) Properties of OLS

• Unbiased: $E(\hat{\beta}_0) = \beta_0$ and $E(\hat{\beta}_1) = \beta_1$
• Variance: $Var(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}$ and $Var(\hat{\beta}_0) = \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)$, where $S_{xx} = \sum (x_i - \bar{x})^2$
• Note that they are correlated: $Cov(\hat{\beta}_0, \hat{\beta}_1) = -\frac{\sigma^2 \bar{x}}{S_{xx}}$

49

CI and Tests

• Since σ2 is unknown, consider and thus

• Then and

have t-distribution with n-2 degree of freedom.

(d1) Inference on $\beta_1$

• When testing $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \ne 0$, the test statistic is
$$T_{obs} = \frac{\hat{\beta}_1}{se(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{\hat{\sigma}/\sqrt{S_{xx}}}$$
and we reject $H_0$ if $|T_{obs}| \ge t_{\alpha/2, n-2}$.
• A $1 - \alpha$ confidence interval on $\beta_1$ is
$$\hat{\beta}_1 \pm t_{\alpha/2, n-2} \frac{\hat{\sigma}}{\sqrt{S_{xx}}}$$

Example (Cont.)

The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$

• Test $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \ne 0$ at the $\alpha = 0.05$ level.

[Recall $S_{xx} = 10630.8$, $\hat{\sigma} = 40.1393$, $t_{\alpha/2, n-2} = t_{0.025, 18} = 2.101$.
$$T_{obs} = \frac{\hat{\beta}_1}{\hat{\sigma}/\sqrt{S_{xx}}} = \frac{-2.1359}{40.1393/\sqrt{10630.8}} = -5.486$$
Since $|T_{obs}| = 5.486 \ge 2.101$, we reject $H_0$.]

Example (Cont.)

The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$

• Find a 95% confidence interval on $\beta_1$.

[Recall $S_{xx} = 10630.8$, $\hat{\sigma} = 40.1393$, $t_{\alpha/2, n-2} = t_{0.025, 18} = 2.101$. So
$$\hat{\beta}_1 \pm t_{\alpha/2, n-2} \frac{\hat{\sigma}}{\sqrt{S_{xx}}} = -2.1359 \pm 0.8179 = [-2.9538, -1.3180].]$$
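The slope test and the confidence interval on $\beta_1$ can both be computed from the formulas above and compared with what base R reports. A sketch, assuming the two-sided test at the 0.05 level from the example; `summary(fit)$coefficients` and `confint()` are standard base R outputs, while the other variable names are illustrative.

```r
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91,
       98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16,
       32, 145, 87, 9)
n <- length(x)
fit <- lm(y ~ x)

# Standard error of the slope: sigma_hat / sqrt(Sxx)
sigma_hat <- summary(fit)$sigma
Sxx <- sum((x - mean(x))^2)
se_b1 <- sigma_hat / sqrt(Sxx)

# Test statistic for H0: beta1 = 0 and the two-sided critical value
b1 <- coef(fit)[["x"]]
t_obs <- b1 / se_b1
t_crit <- qt(0.975, n - 2)
reject <- abs(t_obs) >= t_crit

# 95% confidence interval on beta1
ci_b1 <- b1 + c(-1, 1) * t_crit * se_b1

# Cross-check with built-in summaries
summary(fit)$coefficients["x", ]
confint(fit, "x", level = 0.95)
```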

(d2) Inference on $\beta_0$

• When testing $H_0: \beta_0 = b_0$ versus $H_1: \beta_0 \ne b_0$, the test statistic is
$$T_{obs} = \frac{\hat{\beta}_0 - b_0}{se(\hat{\beta}_0)} = \frac{\hat{\beta}_0 - b_0}{\hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}}$$
and we reject $H_0$ if $|T_{obs}| \ge t_{\alpha/2, n-2}$.
• A $1 - \alpha$ confidence interval on $\beta_0$ is
$$\hat{\beta}_0 \pm t_{\alpha/2, n-2}\, \hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$$

Example (Cont.)

The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$

• Test $H_0: \beta_0 = 210$ versus $H_1: \beta_0 \ne 210$ at the $\alpha = 0.05$ level.

[Recall $S_{xx} = 10630.8$, $\hat{\sigma} = 40.1393$, $t_{\alpha/2, n-2} = t_{0.025, 18} = 2.101$.
$$T_{obs} = \frac{\hat{\beta}_0 - b_0}{\hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}} = 0.455$$
Since $|T_{obs}| = 0.455 < 2.101$, we fail to reject $H_0$.]

Example (Cont.)

The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$

• Find a 95% confidence interval on $\beta_0$.

[Recall $S_{xx} = 10630.8$, $\hat{\sigma} = 40.1393$, $t_{\alpha/2, n-2} = t_{0.025, 18} = 2.101$. So
$$\hat{\beta}_0 \pm t_{\alpha/2, n-2}\, \hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}} = 224.3163 \pm 66.0562 = [158.26, 290.37].]$$
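The intercept test and interval follow the same pattern as for the slope, with the standard error $\hat{\sigma}\sqrt{1/n + \bar{x}^2/S_{xx}}$. A sketch in base R, assuming the null value 210 and the 0.05 level used in the example; variable names such as `se_b0` are illustrative.

```r
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91,
       98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16,
       32, 145, 87, 9)
n <- length(x)
fit <- lm(y ~ x)

# Standard error of the intercept
sigma_hat <- summary(fit)$sigma
Sxx <- sum((x - mean(x))^2)
se_b0 <- sigma_hat * sqrt(1 / n + mean(x)^2 / Sxx)

# Test H0: beta0 = 210 (null value taken from the example)
b0 <- coef(fit)[["(Intercept)"]]
t_obs <- (b0 - 210) / se_b0
t_crit <- qt(0.975, n - 2)
reject <- abs(t_obs) >= t_crit

# 95% confidence interval on beta0, with a built-in cross-check
ci_b0 <- b0 + c(-1, 1) * t_crit * se_b0
ci_b0
confint(fit, "(Intercept)", level = 0.95)
```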

Page 56: ISyE 6414: Regression Analysisymei/6414/Handouts/Lecture01.pdf · 10. Important discrete RVs 11 Discrete Uniform. Binomial(n,p) Geometric(p) ... infectious disease diphtheria, pertussis,

(d3) Inference on $\beta_0 + \beta_1 x_{new}$

For the simple linear regression model

$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

and a given $x_{new}$, what is the confidence interval for the mean response $E(Y) = \beta_0 + \beta_1 x_{new}$?

Point estimator: $\hat Y = \hat\beta_0 + \hat\beta_1 x_{new} = \sum_{i=1}^{n}\left[\frac{1}{n} + c_i\,(x_{new} - \bar x)\right] Y_i$, where $c_i = (x_i - \bar x)/S_{xx}$.

• $E(\hat\beta_0 + \hat\beta_1 x_{new}) = \beta_0 + \beta_1 x_{new}$

• $\mathrm{Var}(\hat\beta_0 + \hat\beta_1 x_{new}) = \sigma^2\left[\frac{1}{n} + \frac{(x_{new} - \bar x)^2}{S_{xx}}\right]$

• The $1-\alpha$ confidence interval on the mean response is

$$\hat\beta_0 + \hat\beta_1 x_{new} \pm t_{\alpha/2,\,n-2}\,\hat\sigma\sqrt{\frac{1}{n} + \frac{(x_{new} - \bar x)^2}{S_{xx}}}$$


Example (Cont.)

The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$.

• Find a 95% confidence interval on the mean under-five mortality rate when $x = 10$.

[Recall $S_{xx} = 10630.8$, $\hat\sigma = 40.1393$, $\bar x = 77.4$, $t_{0.025,\,18} = 2.101$.

$$\hat Y \pm t_{\alpha/2,\,n-2}\,\hat\sigma\sqrt{\frac{1}{n} + \frac{(x_{new} - \bar x)^2}{S_{xx}}} = 202.9573 \pm 58.2641 = [144.6932,\ 261.2214]$$]
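The confidence interval on the mean response can be computed directly from the formula. The sketch below assumes the data from the "Linear Regression in R" slide; `se_mean` and `ci_mean` are illustrative names. It uses the exact quantile from `qt()` and unrounded coefficients, so the result matches the `predict()` output shown at the end of this lecture rather than a hand calculation with rounded inputs.

```r
# Data copied from the "Linear Regression in R" slide of this lecture.
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)

n    <- length(x)
xbar <- mean(x)
Sxx  <- sum((x - xbar)^2)
b1   <- sum((x - xbar) * (y - mean(y))) / Sxx
b0   <- mean(y) - b1 * xbar
sigma_hat <- sqrt(sum((y - b0 - b1 * x)^2) / (n - 2))

x_new <- 10
y_hat <- b0 + b1 * x_new

# Standard error of the estimated mean response at x_new
se_mean <- sigma_hat * sqrt(1 / n + (x_new - xbar)^2 / Sxx)
t_crit  <- qt(0.975, df = n - 2)

ci_mean <- c(fit = y_hat,
             lwr = y_hat - t_crit * se_mean,
             upr = y_hat + t_crit * se_mean)
print(ci_mean)
```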


(e) Prediction of a New Observation

For the simple linear regression model $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$:

How do we predict a future observation $Y$ corresponding to a given $x_{new}$?

• Point estimator: $\hat Y = \hat\beta_0 + \hat\beta_1 x_{new}$
• How about a confidence interval on $Y$?

This is often called a prediction interval.


Key Idea

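For the future response $Y = \beta_0 + \beta_1 x_{new} + \epsilon_{future}$

Consider the estimator $\hat Y = \hat\beta_0 + \hat\beta_1 x_{new}$. Then

• $E(Y - \hat Y) = 0$
• Because $\epsilon_{future}$ is independent of the observed data (and hence of $\hat\beta_0$ and $\hat\beta_1$),

$$\begin{aligned}
\mathrm{Var}(Y - \hat Y) &= \mathrm{Var}\big(\beta_0 + \beta_1 x_{new} + \epsilon_{future} - (\hat\beta_0 + \hat\beta_1 x_{new})\big) \\
&= \mathrm{Var}(\epsilon_{future}) + \mathrm{Var}(\hat\beta_0 + \hat\beta_1 x_{new}) \\
&= \sigma^2 + \frac{\sigma^2}{n} + (x_{new} - \bar x)^2\,\sigma^2\,\frac{1}{S_{xx}}
\end{aligned}$$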
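The variance formula for $Y - \hat Y$ can be checked numerically with a small Monte Carlo experiment. This is only a sketch: the "true" parameter values `beta0`, `beta1`, and `sigma`, the seed, and the number of replications are hypothetical choices made for illustration, and the errors are assumed normal. Only the x values come from the lecture's data.

```r
set.seed(6414)   # arbitrary seed so the run is reproducible

# Fixed design: the x values from the "Linear Regression in R" slide.
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
n    <- length(x)
xbar <- mean(x)
Sxx  <- sum((x - xbar)^2)

# Hypothetical true model used only for the simulation
beta0 <- 224; beta1 <- -2; sigma <- 40
x_new <- 10
reps  <- 20000

# Each column of Y is one simulated sample of n responses
Y <- (beta0 + beta1 * x) + matrix(rnorm(n * reps, sd = sigma), nrow = n)

# Least squares estimates for every simulated sample at once
b1_hat <- colSums((x - xbar) * Y) / Sxx
b0_hat <- colMeans(Y) - b1_hat * xbar
y_hat  <- b0_hat + b1_hat * x_new

# Independent future observations at x_new
y_future <- beta0 + beta1 * x_new + rnorm(reps, sd = sigma)

emp_var  <- var(y_future - y_hat)
theo_var <- sigma^2 * (1 + 1 / n + (x_new - xbar)^2 / Sxx)

print(c(empirical = emp_var, theoretical = theo_var))
```

With this many replications the empirical variance should land within a few percent of the theoretical value, and the mean of `y_future - y_hat` should be close to zero.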


Key Idea (Cont.)

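For the future response $Y = \beta_0 + \beta_1 x_{new} + \epsilon$

Consider the estimate $\hat Y = \hat\beta_0 + \hat\beta_1 x_{new}$. Then

• $$\frac{Y - \hat Y}{\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar x)^2}{S_{xx}}}} \sim N(0,1)$$

• So $$\frac{Y - \hat Y}{\hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar x)^2}{S_{xx}}}} \sim T_{n-2}$$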
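The step from the normal to the $T_{n-2}$ distribution can be sketched as follows. It relies on the normal-error assumption that the $N(0,1)$ statement above already uses, and on the standard fact that $(n-2)\hat\sigma^2/\sigma^2$ is chi-square with $n-2$ degrees of freedom and independent of $\hat\beta_0$, $\hat\beta_1$, and the future error. The symbols $Z$ and $W$ are shorthand introduced here.

```latex
\[
Z = \frac{Y - \hat Y}{\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar x)^2}{S_{xx}}}} \sim N(0,1),
\qquad
W = \frac{(n-2)\,\hat\sigma^2}{\sigma^2} \sim \chi^2_{n-2},
\qquad
Z \text{ independent of } W,
\]
\[
\frac{Y - \hat Y}{\hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar x)^2}{S_{xx}}}}
= \frac{Z}{\sqrt{W/(n-2)}} \sim T_{n-2}.
\]
```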


Prediction Interval

For the simple linear regression model $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$:

How do we predict a future observation $Y$ corresponding to a given $x_{new}$?

• Point estimator: $\hat Y = \hat\beta_0 + \hat\beta_1 x_{new}$
• The $1-\alpha$ prediction interval is

$$\hat Y \pm t_{\alpha/2,\,n-2}\,\hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar x)^2}{S_{xx}}}$$


Example (Cont.)

The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$.

• Find a 95% prediction interval on $Y$ when $x = 10$.

[Recall $S_{xx} = 10630.8$, $\hat\sigma = 40.1393$, $\bar x = 77.4$, $t_{0.025,\,18} = 2.101$.

$$\hat Y \pm t_{\alpha/2,\,n-2}\,\hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar x)^2}{S_{xx}}} = 202.9573 \pm 102.5022 = [100.4551,\ 305.4595]$$]


Example (Cont.)

The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$.

• Find a 95% prediction interval on $Y$ when $x = 90$.

[Recall $S_{xx} = 10630.8$, $\hat\sigma = 40.1393$, $\bar x = 77.4$, $t_{0.025,\,18} = 2.101$.

$$\hat Y \pm t_{\alpha/2,\,n-2}\,\hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar x)^2}{S_{xx}}} = 32.0853 \pm 87.0276 = [-54.9423,\ 119.1129]$$]
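Both prediction intervals can be computed in one pass from the formula. The sketch assumes the data from the "Linear Regression in R" slide; `se_pred` and `pred_int` are illustrative names. Because it uses unrounded coefficients and the exact quantile from `qt()`, the output agrees with `predict(..., interval = "prediction")` and may differ slightly in the later decimals from a hand calculation with rounded inputs.

```r
# Data copied from the "Linear Regression in R" slide of this lecture.
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)

n    <- length(x)
xbar <- mean(x)
Sxx  <- sum((x - xbar)^2)
b1   <- sum((x - xbar) * (y - mean(y))) / Sxx
b0   <- mean(y) - b1 * xbar
sigma_hat <- sqrt(sum((y - b0 - b1 * x)^2) / (n - 2))
t_crit    <- qt(0.975, df = n - 2)

x_new <- c(10, 90)
y_hat <- b0 + b1 * x_new

# The extra "1 +" under the square root accounts for the future error term
se_pred <- sigma_hat * sqrt(1 + 1 / n + (x_new - xbar)^2 / Sxx)

pred_int <- cbind(fit = y_hat,
                  lwr = y_hat - t_crit * se_pred,
                  upr = y_hat + t_crit * se_pred)
rownames(pred_int) <- paste0("x=", x_new)
print(pred_int)
```

The interval at x = 90 is narrower than the one at x = 10 because 90 is much closer to the sample mean of x.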


Summary (I): point estimation

Assume that we observe $(x_i, Y_i)$ for $i = 1, \ldots, n$, and we consider the simple linear regression model

$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

where the $\epsilon_i$'s are iid with mean 0 and variance $\sigma^2$.

• Define

$$S_{xx} = \sum (x_i - \bar x)^2 = \sum x_i^2 - n\,\bar x^2$$
$$S_{xy} = \sum (x_i - \bar x)(Y_i - \bar Y) = \sum x_i Y_i - n\,\bar x\,\bar Y$$
$$S_{yy} = \sum (Y_i - \bar Y)^2 = \sum Y_i^2 - n\,\bar Y^2$$

• The least squares estimators are

$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat\beta_0 = \bar Y - \hat\beta_1\,\bar x$$
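The definitions and shortcut formulas above can be verified numerically. This is a minimal sketch using the x and y vectors from the "Linear Regression in R" slide; the `_def` and `_short` suffixes are naming choices made here to distinguish the two forms of each sum.

```r
# Data copied from the "Linear Regression in R" slide of this lecture.
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)

n    <- length(x)
xbar <- mean(x)
ybar <- mean(y)

# Centered (definition) form
Sxx_def <- sum((x - xbar)^2)
Sxy_def <- sum((x - xbar) * (y - ybar))
Syy_def <- sum((y - ybar)^2)

# Shortcut (computational) form
Sxx_short <- sum(x^2) - n * xbar^2
Sxy_short <- sum(x * y) - n * xbar * ybar
Syy_short <- sum(y^2) - n * ybar^2

# Least squares estimators
b1 <- Sxy_def / Sxx_def
b0 <- ybar - b1 * xbar

print(c(Sxx = Sxx_def, Sxy = Sxy_def, Syy = Syy_def, b0 = b0, b1 = b1))
```

The estimates agree with `coef(lm(y ~ x))`, which is how the fitted model quoted in the examples was obtained.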


Summary (II): Estimation of $\sigma^2$ and Inference

• The estimator of $\sigma^2$ is $\hat\sigma^2 = \frac{RSS}{n-2}$,

where $RSS = \sum_{i=1}^{n} e_i^2$ and the residuals are $e_i = Y_i - (\hat\beta_0 + \hat\beta_1 x_i)$. In practice, it is better to use

$$RSS = \sum_{i=1}^{n} e_i^2 = S_{yy} - \hat\beta_1 S_{xy} = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$$

• $\frac{\hat\beta_1 - \beta_1}{se(\hat\beta_1)} \sim T_{n-2}$; $se(\hat\beta_1) = \frac{\hat\sigma}{\sqrt{S_{xx}}}$

• $\frac{\hat\beta_0 - \beta_0}{se(\hat\beta_0)} \sim T_{n-2}$; $se(\hat\beta_0) = \hat\sigma\sqrt{\frac{1}{n} + \frac{\bar x^2}{S_{xx}}}$
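The three expressions for RSS and the two standard errors can be checked against the `summary(fm1)` output shown later in this lecture. The sketch assumes the data from the "Linear Regression in R" slide; names such as `RSS_resid` and `RSS_short1` are introduced here for illustration.

```r
# Data copied from the "Linear Regression in R" slide of this lecture.
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)

n    <- length(x)
xbar <- mean(x)
Sxx  <- sum((x - xbar)^2)
Sxy  <- sum((x - xbar) * (y - mean(y)))
Syy  <- sum((y - mean(y))^2)
b1   <- Sxy / Sxx
b0   <- mean(y) - b1 * xbar

# RSS three ways: from residuals, and from the two shortcut formulas
e          <- y - (b0 + b1 * x)
RSS_resid  <- sum(e^2)
RSS_short1 <- Syy - b1 * Sxy
RSS_short2 <- Syy - Sxy^2 / Sxx

sigma_hat <- sqrt(RSS_resid / (n - 2))   # residual standard error
se_b1 <- sigma_hat / sqrt(Sxx)
se_b0 <- sigma_hat * sqrt(1 / n + xbar^2 / Sxx)

print(c(RSS = RSS_resid, sigma_hat = sigma_hat, se_b0 = se_b0, se_b1 = se_b1))
```

The shortcut forms avoid computing the individual residuals, which is convenient for hand calculation once the three sums are known.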


Summary III: Inference

At a given $x_{new}$:

• The point estimator of $Y$ is $\hat Y = \hat\beta_0 + \hat\beta_1 x_{new}$
• A $1-\alpha$ confidence interval on the mean response $E(Y)$ is

$$\hat Y \pm t_{\alpha/2,\,n-2}\,\hat\sigma\sqrt{\frac{1}{n} + \frac{(x_{new} - \bar x)^2}{S_{xx}}}$$

• A $1-\alpha$ prediction interval on the future observation is

$$\hat Y \pm t_{\alpha/2,\,n-2}\,\hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar x)^2}{S_{xx}}}$$

(appropriate for testing data)


Part C

• Introduction to R


What is R

• R is a system for statistical computation and graphics

• It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files

• Free software
• OS: Windows, Unix, Linux
• Homepage: http://www.r-project.org


Installing R Under Windows

• Need Windows OS (32/64 bits)
• Go to any CRAN site (see http://cran.r-project.org/mirrors.html for a list), and follow the instructions

• Download R 3.1.0 for Windows, "R-3.1.0-win.exe" (size: 54 MB), then double-click the icon and follow the instructions to install


Data With R

• Objects: vector, factor, array, matrix, data.frame, ts, list

• Mode (numerical, character, complex, and logical); Length

• Read data stored in text (ASCII) files: read.table(), scan(), and read.fwf()

• Saving data: write(x, file="data.txt"); write.table() writes a data.frame to a file

• Generating data
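A short session touching each bullet above might look like the following. It is a sketch only: the three-row data frame reuses the first rows of the lecture's x and y data, the temporary file path comes from `tempfile()` so nothing is written to the working directory, and the object names (`dat`, `grid`, `labels`, `noise`) are illustrative.

```r
# Objects, mode, and length
dat <- data.frame(x = c(77, 69, 32), y = c(118, 65, 184))
mode(dat$x)      # "numeric"
length(dat$x)    # 3

# Saving a data.frame to a text file and reading it back
path <- tempfile(fileext = ".txt")
write.table(dat, file = path, row.names = FALSE)
dat2 <- read.table(path, header = TRUE)
print(dat2)

# Generating data: regular sequences, repeated patterns, random draws
grid   <- seq(0, 100, by = 10)              # 0, 10, ..., 100
labels <- rep(c("low", "high"), times = 3)  # pattern repeated three times
set.seed(1)                                 # arbitrary seed for reproducibility
noise  <- rnorm(5, mean = 0, sd = 1)        # five standard normal draws
```

Note that `read.table()` infers column types from the file, so whole-number columns may come back as integers rather than doubles.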


Linear Regression in R

x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90);

y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9);

fm1 <- lm(y ~ x)
fm1

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
    224.316       -2.136


summary(fm1)

> summary(fm1)

Call:
lm(formula = y ~ x)

Residuals:
      Min        1Q    Median        3Q       Max
-99.97934 -16.57854   0.06684  20.84946  89.77608

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 224.3163    31.4403   7.135 1.20e-06 ***
x            -2.1359     0.3893  -5.486 3.28e-05 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 40.14 on 18 degrees of freedom
Multiple R-Squared: 0.6258,     Adjusted R-squared: 0.605
F-statistic: 30.1 on 1 and 18 DF,  p-value: 3.281e-05
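The numbers printed by `summary()` can also be pulled out of the summary object for further calculation. A minimal sketch, using the same data; the variable names on the left-hand side are illustrative, while `sigma`, `r.squared`, `adj.r.squared`, and `fstatistic` are components of the object returned by `summary.lm`.

```r
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)

fm1 <- lm(y ~ x)
s   <- summary(fm1)

coef_table    <- coef(s)          # Estimate, Std. Error, t value, Pr(>|t|)
sigma_hat     <- s$sigma          # residual standard error
r_squared     <- s$r.squared
adj_r_squared <- s$adj.r.squared
f_stat        <- unname(s$fstatistic["value"])

# In simple linear regression the F statistic equals the squared slope t statistic
t_slope <- coef_table["x", "t value"]
print(c(F = f_stat, t_squared = t_slope^2))
```

This also explains why the p-value on the F-statistic line matches the p-value for the slope in the coefficient table.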


Confidence Interval on coefficients

> confint(fm1)
                 2.5 %     97.5 %
(Intercept) 158.262579 290.369998
x            -2.953763  -1.317976

> confint(fm1, level = 0.99)
                 0.5 %     99.5 %
(Intercept) 133.817133 314.815444
x            -3.256453  -1.015286


Intervals for xnew

> xnew <- data.frame(x = c(10, 90))

## Confidence intervals on the mean response
> predict(fm1, xnew, interval="confidence", level=0.95)
        fit       lwr       upr
1 202.95759 144.69566 261.21953
2  32.08805  10.59907  53.57702

## Prediction intervals for future observations
> predict(fm1, xnew, interval="prediction", level=0.95)
        fit       lwr      upr
1 202.95759 100.45917 305.4560
2  32.08805 -54.93637 119.1125
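To see how the two kinds of interval behave across the range of x, one can compare their widths on a grid of new values. This is a sketch with the same data; the grid `seq(10, 100, by = 10)` and the names `ci_band`, `pi_band`, and `widths` are choices made here for illustration.

```r
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)

fm1  <- lm(y ~ x)
grid <- data.frame(x = seq(10, 100, by = 10))

ci_band <- predict(fm1, grid, interval = "confidence", level = 0.95)
pi_band <- predict(fm1, grid, interval = "prediction", level = 0.95)

widths <- data.frame(
  x        = grid$x,
  ci_width = ci_band[, "upr"] - ci_band[, "lwr"],
  pi_width = pi_band[, "upr"] - pi_band[, "lwr"]
)
print(widths)

# Grid point where the confidence interval is narrowest
narrowest_x <- widths$x[which.min(widths$ci_width)]
```

The prediction interval is wider than the confidence interval at every grid point, and both are narrowest at the grid value closest to the sample mean of x (77.4), as the $(x_{new} - \bar x)^2$ term in the formulas suggests.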