Upload
nszakir
View
235
Download
0
Embed Size (px)
DESCRIPTION
Statistics
Citation preview
INTRODUCTION TO STATISTICS & PROBABILITY Chapter 2: Looking at Data–Relationships (Part 3)
1
Dr. Nahid Sultana
Chapter 2: Looking at Data–Relationships
2
2.1: Scatterplots
2.2: Correlation
2.3: Least-Squares Regression
2.5: Data Analysis for Two-Way Tables
3
Objectives
Regression lines
Prediction and Extrapolation
Correlation and r2
2.3 Least-Squares Regression
Regression line 4
Correlation tells us about strength and direction of the linear relationship between two quantitative variables.
In Regression we study the association between two variables in order to explain the values of one from the values of the other (i.e., make predictions).
When there is a linear association between two variables, then a straight line equation can be used to model the relationship.
In regression the distinction between Response and Explanatory is important.
Regression line (Cont…) 5
A regression line is a line that best describes the linear
relationship between the two variables, and it is expressed by
means of an equation of the form:
Where is the slope and is the intercept.
Once the equation of the regression line is established, we can
use it to predict the response y for a specific value of the
explanatory variable x .
The least-squares regression line 6
The least-squares regression line is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
The least-squares regression line (Cont.) 7
is the predicted y value (y hat) b1 is the slope
b0 is the y-intercept
ˆ y
xbby 10ˆ +=
The equation of the least-squares regression line of y on x is
b1 = rsy
sx
First we calculate the slope of the line,
Where r is the correlation, sy is the standard deviation of the response variable y, sx is the standard deviation of the explanatory variable x.
Once we know b1, the slope, we can calculate b0, the y-intercept:
b0 = y − b1x Where and are the sample means of the x and y variables
How to plot the least-squares regression line
8
Typically, we use stats software.
x
y
How to plot the least-squares regression line (Cont…)
9
To plot the regression line you only need to plug the x values into the equation, get y, and draw the line that goes through those points. Hint: The regression line always passes through the mean of x and y.
9
The points you use for drawing the regression line are derived from the equation. They are NOT points from your sample data (except by pure coincidence).
9
Two different regression lines can be drawn if we interchange the roles of x and y. Example:
10
Correlation coefficient of NEA and Fat, r = -0.779 stay same in both cases
Nonexercise activity (calories)
Fat
ga
in (
Kilo
gra
ms)
7006005004003002001000-100
4
3
2
1
0
Fitted Line PlotFat = 3.505 - 0.003441 NEA
Fat gain (Kilograms)
No
ne
xe
rcis
e a
cti
vit
y (
ca
lori
es)
43210
700
600
500
400
300
200
100
0
-100
Fitted Line PlotNEA = 745.3 - 176.1 Fat
BEWARE!!!
Not all calculators and software use the same convention. Some use: And some use:
bxay +=ˆ
ˆ y = ax + bMake sure you know what YOUR calculator gives you for a and b before you answer homework or exam questions.
11
Making predictions
The equation of the least-squares regression allows you to predict y for any x within the range studied.
y
ˆ y = 0.0144x + 0.0008Nobody in the study drank 6.5
beers, but by finding the value
from the regression line for x = 6.5
we would expect a blood alcohol
content of 0.094 mg/ml.
mg/ml 0944.00008.0936.0ˆ0008.05.6*0144.0ˆ=+=
+=yy
Year Powerboat s Dead Manate es1977 447 131978 460 211979 481 241980 498 161981 513 241982 512 201983 526 151984 559 341985 585 331986 614 331987 645 391988 675 431989 711 501990 719 47
There is a positive linear relationship between the number of powerboats registered and the number of manatee deaths.
(in 1000s)
1.214.415.62ˆ 4.41)500(125.0ˆ =−=⇒−= yy Roughly 21 manatees.
Thus if we were to limit the number of powerboat registrations to 500,000, what could we expect for the number of manatee deaths?
The least squares regression line has the equation:
ˆ y = 0 . 125 x − 41 . 4
ˆ y = 0 . 125 x − 41 . 4
13 ----Could we use this regression line to predict the number of manatee
deaths for a year with 200,000 powerboat registrations?
Extrapolation is the use of a regression line for prediction far outside the range of values of x used to obtain the line. Such predictions are often not accurate.
!!!
!!!
Extrapolation 14
Sometimes the y-intercept is not biologically possible. Here we have negative blood alcohol content, which makes no sense…
y-intercept shows negative blood alcohol But the negative value is
appropriate for the equation of the regression line. There is a lot of scatter in the data, and the line is just an estimate.
The y intercept 15
Coefficient of determination, r2
16
Least-squares regression looks at the distances of the data points from the line only in the y direction.
The variables x and y play different roles in regression.
Even though correlation r ignores the distinction between x and y, there is a close connection between correlation and regression.
r2 is called the coefficient of determination.
r2 represents the percentage of the variance in y (vertical scatter from the regression line) that can be explained by changes in x.
r = -1 r2 = 1
Changes in x explain 100% of the variations in y. Y can be entirely predicted for any given value of x.
r = 0 r2 = 0
Changes in x explain 0% of the variations in y. The values y takes are entirely independent of what value x takes.
Here the change in x only explains 76% of the change in y. The rest of the change in y (the vertical scatter, shown as red arrows) must be explained
by something other than x.
r = 0.87 r2 = 0.76
17
17
r = –0.3, r 2 = 0.09, or 9% The regression model explains not even 10% of the variations in y.
r = –0.7, r 2 = 0.49, or 49% The regression model explains nearly half of the variations in y.
r = –0.99, r 2 = 0.9801, or ~98% The regression model explains almost all of the variations in y.
r = –0.3, r 2 = 0.09, or 9% The regression model explains not even 10% of the variations in y.
r = –0.7, r 2 = 0.49, or 49% The regression model explains nearly half of the variations in y.
r = –0.99, r 2 = 0.9801, or ~98% The regression model explains almost all of the variations in y.
18
Observed y
Predicted ŷ residual )ˆ( dist. =− yy
Residuals 19
Points above the line have a positive residual.
Points below the line have a negative residual.
A residual is the difference between an observed value of the response variable and the value predicted by the regression line:
residual = observed y – predicted y
= y − ˆ y
The sum of these residuals is always 0.
A residual plot is a scatterplot of the regression residuals against the explanatory variable.
Residual plots help us assess the fit of a regression line.
If residuals are scattered randomly around 0, chances are your data fit a linear model, was normally distributed, and you didn’t have outliers.
Residual plots 20
The x-axis in a residual plot is the same as on the scatterplot.
Only the y-axis is different.
21
Residuals are randomly scattered—good!
22
Curved pattern—means the relationship you are looking at is not linear.
A change in variability across a plot is a warning sign. You need to find out why it is, and remember that predictions made in areas of larger variability will not be as good.
2.5 Data Analysis for Two-Way Tables
23
Objectives
The Two-Way Table
Marginal Distribution
Conditional Distributions
23
24
Two-way tables
Two-way tables summarize data about two categorical variables (or factors) collected on the same set of individuals.
Example (Smoking Survey in Arizona): High school students were asked whether they smoke and whether their parents smoke.
Does parental smoking influence the smoking habits of their high school children?
Explanatory Variable: Smoking habit of student’s parents
(both smoke/ one smoke/ neither smoke) Response variable: Smoking habit of student
(smokes/does not smoke) To analyze the relationship we can summarize the result in a Two-way table:
25
Two-way tables (Cont …)
Explanatory (Row) Variable: Smoking habit of student’s parents
Response (Column) variable: Smoking habit of student
This 3X2 two-way table has 3 rows and 2 columns. Numbers are counts or frequency
400 1380 416 1823 188 1168
First factor: Parent smoking status
Second factor: Student smoking status
High school students were asked whether they smoke, and whether their parents smoke:
26
Margins
Margins show the total for each column and each row.
For each cell, we can compute a proportion by dividing the cell entry by the total sample size. The collection of these proportions is the joint distribution of the two categorical variables.
400 1380 416 1823 188 1168
Margin for parental smoking
Margin for student smoking
27
Marginal distributions (When examine the distribution of a single variable in a two-way table)
Marginal distributions: Distribution of column variable separately (or row variable separately) expressed in counts or percent.
%1.3353751780
≈
%7.1853751004
=
400 1380 33.1% 416 1823 41.7% 188 1168 25.2% 18.7% 81.3% 100%
400 1380 416 1823 188 1168
28
Marginal distribution (Cont..)
The marginal distributions can be displayed on separate bar graphs, typically expressed as percents instead of raw counts.
Each graph represents only one of the two variables, ignoring the second one. Each marginal distribution can also be shown in a pie chart.
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
Smoker Nonsmoker
Perc
ent o
f stu
dent
s in
terv
iew
ed
Sum of Counts
0%
5%
10%
15%
20%
25%
30%
35%
40%
45%
Both One Neither
Perc
ent o
f stu
dent
s in
terv
iew
ed
Sum of Counts Parental smoking
Student smoking
29
Conditional Distribution
A conditional distribution is the distribution of one factor for each level of the other factor.
A conditional percent is computed using the counts within a single row or a single column. The denominator is the corresponding row or column total (rather than the table grand total).
Percent of students who smoke when both parents smoke = 400/1780 = 22.5%
400 1380 416 1823 188 1168
Percent of students who smoke when both parents smoke = 400/1780 = 22.5%
400 1380 416 1823 188 1168
30
Conditional distributions (Cont…)
Conditional distribution of student smokers for different parental smoking statuses: Percent of students who smoke when both parents smoke = 400/1780 = 22.5% Percent of students who smoke when one parent smokes = 416/2239 = 18.6% Percent of students who smoke when neither parent smokes = 188/1356 = 13.9%
400 1380 416 1823 188 1168
Comparing conditional distributions helps us describe the “relationship" between the two categorical variables.
We can compare the percent of individuals in one level of factor 1 for each level of factor 2.
31
Conditional distributions (Cont…)
Conditional distribution of student smoking status for different levels of parental smoking status:
Percent who
smoke Percent who do not smoke Row total
Both parents smoke 22% 78% 100% One parent smokes 19% 81% 100% Neither parent smokes 14% 86% 100%
The conditional distributions can be compared graphically by displaying the percents
making up one level of one factor, for each level of the other factor.
32
Conditional Distribution
In the table below, the 25 to 34 age group occupies the first column.
33
Conditional distributions (Cont…)
Here the percents are calculated by age range (columns).
29.30% = 11071 37785 = cell total . column total
34
The conditional distributions can be graphically compared using side by side bar graphs of one variable for each value of the other variable.
Here, the percents are calculated by age range
(columns).
34
35
Music and wine purchase decision
We want to compare the conditional distributions of the response variable (wine purchased) for each value of the explanatory variable (music played). Therefore, we calculate column percents.
What is the relationship between type of music played in supermarkets and type of wine purchased?
We calculate the column conditional percents similarly for
each of the nine cells in the table:
Calculations: When no music was played, there were 84 bottles of wine sold. Of these, 30 were French wine. 30/84 = 0.357 35.7% of the wine sold was French when no music was played.
30 = 35.7% 84 = cell total . column total
For every two-way table, there are two sets of possible conditional distributions.
Wine purchased for each kind of music played (column percents)
Music played for each kind of wine purchased
(row percents)
Does background music in supermarkets
influence customer purchasing decisions?
36