damian-harper
Chapter 12-2: Transforming Relationships, Day 2
AP Statistics

The Ladder of Power
Let’s examine the ln rung of the ladder a little more closely:

Power            Comment
(x, ln(y))       Very useful when the values of y increase by a percentage.
(ln(x), y)       Useful when x has a wide range, or when the scatterplot descends rapidly to the left and levels off to the right.
(ln(x), ln(y))   Often useful when one of the ladder powers is too big and the other one is too small. Also very useful when the scatterplot is thickening.
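The first comment can be checked numerically. The sketch below uses made-up data (a y that starts at 100 and grows 10% per step, an assumption for illustration only) to show that percentage growth gives ever larger steps in y but equal steps in ln(y).

```python
import math

# Hypothetical data: y starts at 100 and grows by 10% at each step of x.
xs = list(range(6))
ys = [100 * 1.10 ** x for x in xs]

# Successive differences of y keep growing, so (x, y) is curved.
raw_steps = [ys[i + 1] - ys[i] for i in range(len(ys) - 1)]

# Successive differences of ln(y) are constant, so (x, ln(y)) is a straight line.
log_steps = [math.log(ys[i + 1]) - math.log(ys[i]) for i in range(len(ys) - 1)]

print([round(s, 2) for s in raw_steps])
print([round(s, 4) for s in log_steps])
```

Every entry of `log_steps` equals ln(1.10), which is why (x, ln(y)) straightens this kind of data.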

Non-Linear Regression
Let’s examine the relationship between shutter speed and f/stops of a particular camera lens:

Shutter speed   f/stop
1/1000          2.8
1/500           4
1/250           5.6
1/125           8
1/60            11
1/30            16
1/15            22
1/8             32

Non-Linear Regression
Store the ln(f/stop) and ln(shutter speed) into L5 and L8, respectively:

Power
(x, y) = (speed, f/stop)
(x, ln(y)) = (speed, ln(f/stop))
(ln(x), y) = (ln(speed), f/stop)
(ln(x), ln(y)) = (ln(speed), ln(f/stop))
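As a rough numeric companion to the scatterplots, the four re-expressions can be compared by their correlation. This is a minimal sketch, not the calculator procedure: the helper `correlation` and the dictionary names are made up for illustration, and a high r alone does not prove linearity, so the residual plot is still needed.

```python
import math

# Shutter speed and f/stop pairs from the table above.
speed = [1/1000, 1/500, 1/250, 1/125, 1/60, 1/30, 1/15, 1/8]
fstop = [2.8, 4, 5.6, 8, 11, 16, 22, 32]

def correlation(xs, ys):
    """Pearson correlation r of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

ln_speed = [math.log(x) for x in speed]
ln_fstop = [math.log(y) for y in fstop]

# The four candidate re-expressions from the ln rung of the ladder.
candidates = {
    "(x, y)": (speed, fstop),
    "(x, ln(y))": (speed, ln_fstop),
    "(ln(x), y)": (ln_speed, fstop),
    "(ln(x), ln(y))": (ln_speed, ln_fstop),
}
r_values = {name: correlation(xs, ys) for name, (xs, ys) in candidates.items()}
for name, r in r_values.items():
    print(f"{name:16s} r = {r:.4f}")
```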

Non-Linear Regression
[Scatterplots of the four re-expressions of shutter speed and f/stop: three are marked “No! Curved” and one is marked “YES! Linear”.]

Non-Linear Regression
Although the data looks linear, it’s still possible that it is actually curved. We need to check if this data is actually linear or just appears to be linear.
• Let’s make a residual plot for this data.

Non-Linear Regression
The points appear to have a random spread about the modified LSRL, so this seems to be a good model for the data, although the spread may be increasing.
Be careful when determining the actual LSRL.
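The residuals behind such a plot can be computed by hand. The sketch below assumes the (ln(speed), ln(f/stop)) re-expression and uses a made-up helper `least_squares`; it prints each residual so the pattern (or lack of one) can be inspected.

```python
import math

speed = [1/1000, 1/500, 1/250, 1/125, 1/60, 1/30, 1/15, 1/8]
fstop = [2.8, 4, 5.6, 8, 11, 16, 22, 32]

def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

ln_speed = [math.log(x) for x in speed]
ln_fstop = [math.log(y) for y in fstop]
a, b = least_squares(ln_speed, ln_fstop)

# Residual = observed ln(f/stop) minus the value the line predicts.
residuals = [y - (a + b * x) for x, y in zip(ln_speed, ln_fstop)]
for x, r in zip(ln_speed, residuals):
    print(f"ln(speed) = {x:7.3f}   residual = {r:+.4f}")
```

The residuals are small and change sign several times, which is consistent with the random spread described above.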

Non-Linear Regression
The following is the modified equation:
$\ln(\widehat{\text{f/stop}}) = 4.464 + 0.497\,\ln(\text{shutter speed})$
However, in the calculator we have the following:
$\hat{y} = 4.464 + 0.497x$
Take the first equation and solve for y-hat; this is the true modified equation of the LSRL:
$\widehat{\text{f/stop}} = e^{(4.464 + 0.497\,\ln(\text{shutter speed}))}$
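The solving step can be written out line by line. In the sketch below, x stands for shutter speed and $\hat{y}$ for the predicted f/stop; the last line is only an equivalent rewriting that uses $e^{b\ln(x)} = x^{b}$, and the constant 86.8 is $e^{4.464}$ rounded.

```latex
\begin{align*}
\ln(\hat{y}) &= 4.464 + 0.497\,\ln(x) \\
\hat{y} &= e^{4.464 + 0.497\,\ln(x)} \\
        &= e^{4.464}\cdot e^{0.497\,\ln(x)} \\
        &= e^{4.464}\cdot x^{0.497} \approx 86.8\,x^{0.497}
\end{align*}
```

Written this way, it is easier to see why (ln(x), ln(y)) is called the power model.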

Non-Linear Regression
Let’s make sure that our new equation fits the original data:
Graph it: It looks pretty good.
See if you can determine the f/stop for a shutter speed of 1/4.
Y1(1/4) = 43.612
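The calculator steps can be mirrored in a few lines. This is a minimal sketch under the assumption that the model is the least-squares line on (ln(speed), ln(f/stop)); the names `least_squares` and `predict_fstop` are made up for illustration.

```python
import math

speed = [1/1000, 1/500, 1/250, 1/125, 1/60, 1/30, 1/15, 1/8]
fstop = [2.8, 4, 5.6, 8, 11, 16, 22, 32]

def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# Fit the line on the re-expressed data.
a, b = least_squares([math.log(x) for x in speed],
                     [math.log(y) for y in fstop])

def predict_fstop(shutter_speed):
    """Back-transform the log-log line: f/stop = e^(a + b*ln(speed))."""
    return math.exp(a + b * math.log(shutter_speed))

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"predicted f/stop at 1/4: {predict_fstop(1/4):.3f}")
```

The fitted coefficients round to the 4.464 and 0.497 used in the modified equation.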

Non-Linear Regression
Let’s attempt to solve the following:
a) What is the equation of the curve of “best-fit”?
b) Predict the salary for a superstar for 2005.

Player             Year   Salary (Millions)
Nolan Ryan         1980   1
George Foster      1982   2.04
Kirby Puckett      1990   3
Jose Canseco       1990   4.7
Roger Clemens      1991   5.3
Ken Griffey, Jr.   1996   8.5
Albert Belle       1997   11
Pedro Martinez     1998   12.5
Mike Piazza        1999   12.5
Mo Vaughn          1999   13.3
Kevin Brown        1999   15
Carlos Delgado     2001   17
Alex Rodriguez     2001   25.2

Non-Linear Regression
If we examine the scatterplot for this data, it is obvious that the data is curved, so a transformation is needed to answer these questions.
[Scatterplot of salary against year: this data is definitely curved.]

Non-Linear Regression
Let’s check the ln transformations. Store the ln(x) and ln(y) into new lists:

Power
(x, y) = (year, salary)   (we already know that this is curved)
(x, ln(y)) = (year, ln(salary))
(ln(x), y) = (ln(year), salary)
(ln(x), ln(y)) = (ln(year), ln(salary))

Non-Linear Regression
Check all three models…
[Scatter plot and residual plot for each of the following:]
Original (x, y)
Exponential (x, ln(y)): Best one!!
Logarithmic (ln(x), y)
Power (ln(x), ln(y))

Non-Linear Regression
After checking the scatter plots and the residual plots, we see that (x, ln(y)) is the best transformation.
[Scatter plots and residual plots for Logarithmic (ln(x), y) and Power (ln(x), ln(y)): these are NO GOOD!!!]
However, we should check the ladder of powers to make sure that there is not a better transformation.

Non-Linear Regression
The ladder of powers.
• Let’s try to transform the data according to the ladder:

Power:           2       ½            ln         −½             −1
Re-expression:   $y^2$   $\sqrt{y}$   $\ln(y)$   $1/\sqrt{y}$   $1/y$
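The rungs of the ladder can be applied with one small function. This is a sketch only: the function name `reexpress` and the convention of passing the string "ln" for the logarithm rung are assumptions made for illustration, and the four salaries are a sample from the table.

```python
import math

def reexpress(y, power):
    """Re-express a positive value y at one rung of the ladder of powers.

    power is a number such as 2, 0.5, -0.5 or -1; the string "ln" selects
    the logarithm rung.
    """
    if y <= 0:
        raise ValueError("these rungs need positive values")
    if power == "ln":
        return math.log(y)
    return y ** power

# The five rungs tried on the slide, applied to a few salaries (millions).
rungs = [2, 0.5, "ln", -0.5, -1]
salaries = [1, 4.7, 11, 25.2]
for power in rungs:
    print(power, [round(reexpress(s, power), 3) for s in salaries])
```

Each re-expressed list would then be paired with year and checked with a scatterplot and a residual plot.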

Non-Linear Regression
To get a better picture of the somewhat linear models, let’s look at the residual plots.

Power:           2       ½            ln         −½             −1
Re-expression:   $y^2$   $\sqrt{y}$   $\ln(y)$   $1/\sqrt{y}$   $1/y$

[Residual plots for the re-expressions, annotated:]
• We’ve already done this one.
• This residual plot still looks curved.
• This looks pretty good too; although, it may be slightly curved.
• This is definitely curved.
• This is the best one!!!
• This is definitely curved.

Non-Linear Regression
Let’s examine the relationship between years and baseball superstar salaries: best transformation (x, ln(y))
The following is the modified equation:
$\ln(\widehat{\text{salary}}) = -261.282 + 0.132(\text{year})$
However, in the calculator we have the following:
$\widehat{\text{salary}} = -261.282 + 0.132\,\text{year}$
Change the equation to account for the transformation. Solve for y. This is the equation of the function of “best-fit.”
$\widehat{\text{salary}} = e^{(-261.282 + 0.132\,\text{year})}$
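The coefficients can be reproduced from the salary table. The sketch below assumes an ordinary least-squares fit of ln(salary) on year; the helper name `least_squares` is made up for illustration.

```python
import math

# Year and salary (millions) from the superstar salary table.
years = [1980, 1982, 1990, 1990, 1991, 1996, 1997,
         1998, 1999, 1999, 1999, 2001, 2001]
salaries = [1, 2.04, 3, 4.7, 5.3, 8.5, 11,
            12.5, 12.5, 13.3, 15, 17, 25.2]

def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# Exponential model: regress ln(salary) on year.
ln_salary = [math.log(s) for s in salaries]
intercept, slope = least_squares(years, ln_salary)
print(f"ln(salary) = {intercept:.3f} + {slope:.3f} * year")
```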

Non-Linear Regression
Now let’s predict the salary of a superstar in 2005:
Plug in 2005:
$\widehat{\text{salary}} = e^{(-261.282 + 0.132\,\text{year})}$
$\widehat{\text{salary}} = e^{(-261.282 + 0.132(2005))}$
$\widehat{\text{salary}} = e^{(3.378)}$
$\widehat{\text{salary}} = 29.312$
The average baseball superstar in 2005 will make about 29.312 million.
If we look at the graph, it is evident that the curve fits the data.
However, we should be wary of extrapolation.
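The plug-in step is easy to check, and it is worth seeing how sensitive it is to rounding. The sketch below first uses the rounded coefficients from the slide, then refits the model from the salary table to keep full precision; `predict_salary` and the variable names are made up for illustration.

```python
import math

# Rounded coefficients exactly as shown on the slide.
A_ROUNDED, B_ROUNDED = -261.282, 0.132

def predict_salary(year, intercept, slope):
    """Back-transform the exponential model: salary = e^(intercept + slope*year)."""
    return math.exp(intercept + slope * year)

rounded = predict_salary(2005, A_ROUNDED, B_ROUNDED)
print(f"with rounded coefficients:   {rounded:.3f} million")

# Refit from the table so the coefficients keep full precision.
years = [1980, 1982, 1990, 1990, 1991, 1996, 1997,
         1998, 1999, 1999, 1999, 2001, 2001]
salaries = [1, 2.04, 3, 4.7, 5.3, 8.5, 11,
            12.5, 12.5, 13.3, 15, 17, 25.2]
n = len(years)
mx = sum(years) / n
my = sum(math.log(s) for s in salaries) / n
slope = (sum((x - mx) * (math.log(s) - my) for x, s in zip(years, salaries))
         / sum((x - mx) ** 2 for x in years))
intercept = my - slope * mx

full = predict_salary(2005, intercept, slope)
print(f"with unrounded coefficients: {full:.3f} million")
```

The two answers differ because the slope is multiplied by 2005 inside the exponent, so even the fourth decimal place of the slope matters. Keeping the full regression equation stored rather than retyping rounded values avoids this.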

Common Errors
Do we have to re-scale our numbers?
• We need to be very careful about which numbers we use when we transform our data. Data values that are far from 1 are often not affected much by our transformations unless the range is very large. It is often useful to get the numbers between 1 and 100, since a re-expression has a greater effect on values from 1 to 100 than on values from 100,001 to 100,100.
Why can’t I find the “perfect” model?
• Don’t expect to ever find the perfect model; more likely than not, it doesn’t exist. Just remember that “real-world” data is messy and it is difficult to find a model that will fit the data perfectly.
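The re-scaling point can be made concrete with the two ranges mentioned. This is a sketch: the helper `log_spread` is a made-up name, and subtracting 100,000 is just one hypothetical way to re-scale.

```python
import math

def log_spread(values):
    """Difference between the largest and smallest ln(value)."""
    logs = [math.log(v) for v in values]
    return max(logs) - min(logs)

small = range(1, 101)            # values from 1 to 100
large = range(100001, 100101)    # values from 100,001 to 100,100

print(f"ln spread for 1..100:          {log_spread(small):.4f}")
print(f"ln spread for 100001..100100:  {log_spread(large):.6f}")

# Hypothetical re-scaling: subtract a baseline so the values sit in 1..100.
rescaled = [v - 100000 for v in large]
print(f"ln spread after subtracting 100000: {log_spread(rescaled):.4f}")
```

On the large range ln barely changes the shape of the data, while after re-scaling it has its full effect.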

Common Errors
Is it me or does that graph look weird?
• Re-expression can straighten many relationships, but not those that go up and down and up again (or something to that effect). You should refuse to analyze such data with methods that require a linear form.
Why is my re-expression missing data?
• It is impossible to re-express negative values for several rungs on our ladder of powers. Such values are omitted by the calculator and the effect on your re-expression can be significant. Try not to lose good data values while transforming your data. Sometimes adding a small value such as 1/2 or 1/6 is useful.
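The missing-data problem is easy to reproduce. In the sketch below, the function `ln_with_shift` and the list of counts are hypothetical; `None` stands in for a point that a calculator would silently drop.

```python
import math

def ln_with_shift(values, shift=0.0):
    """Return ln(v + shift) for each value, or None where it is undefined.

    None marks a point that a calculator would silently drop.
    """
    out = []
    for v in values:
        shifted = v + shift
        out.append(math.log(shifted) if shifted > 0 else None)
    return out

# Hypothetical counts that include a zero.
counts = [0, 1, 3, 8, 20]

plain = ln_with_shift(counts)
shifted = ln_with_shift(counts, shift=0.5)

print("dropped without shift: ", plain.count(None))
print("dropped with shift 1/2:", shifted.count(None))
```

A shift of 1/2 rescues a zero; data with negative values would need a shift larger than the most negative value.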

Transformations

Type of Model   Transformation             New Model                     Re-expressed Model
Exponential     (x, y) → (x, ln(y))        $\ln\hat{y} = a + bx$         $\hat{y} = e^{a + bx}$
Logarithmic     (x, y) → (ln(x), y)        $\hat{y} = a + b\ln(x)$       $\hat{y} = a + b\ln(x)$
Power           (x, y) → (ln(x), ln(y))    $\ln\hat{y} = a + b\ln(x)$    $\hat{y} = e^{a + b\ln(x)}$
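The table translates directly into code: each model is a way to transform x, a way to transform y, and a way to undo the y transformation. This is a minimal sketch with made-up names (`MODELS`, `fit`, `least_squares`) and synthetic data generated from y = 2e^(0.5x), so the exponential fit should recover a = ln(2) and b = 0.5.

```python
import math

def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# For each model: transform for x, transform for y, and the inverse of the
# y transform so predictions come back in the original units.
MODELS = {
    "exponential": (lambda x: x, math.log, math.exp),
    "logarithmic": (math.log, lambda y: y, lambda y: y),
    "power": (math.log, math.log, math.exp),
}

def fit(model, xs, ys):
    """Fit the named model and return (a, b, predict)."""
    fx, fy, undo = MODELS[model]
    a, b = least_squares([fx(x) for x in xs], [fy(y) for y in ys])
    return a, b, lambda x: undo(a + b * fx(x))

# Synthetic data from y = 2 * e^(0.5 x), for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2 * math.exp(0.5 * x) for x in xs]
a, b, predict = fit("exponential", xs, ys)
print(f"a = {a:.4f}, b = {b:.4f}, prediction at x = 6: {predict(6):.3f}")
```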

Can’t We Just Use the Curve?
Although your calculator will do other types of regression (quadratic, exponential, etc.), using the curve has drawbacks.
• Lines are easy to understand. Using the curve throws out all of our understanding of linear regression. We understand how to interpret the slope and the y-intercept, and linear models are more useful in advanced statistical practices. In order to use the curve, we would have to come up with a whole new system of understanding.
• It’s best to use the linear model.
Outliers
In regression analysis, a data point that diverges greatly from the overall pattern of data is called an outlier.
There are basically four ways that a point can be considered an outlier:
1. It could have an extreme X value compared to other data points.
2. It could have an extreme Y value compared to other data points.
3. It could have extreme X and Y values compared to other data points.
4. It might be distant from the rest of the data, even without extreme X or Y values.
Examples
[Four scatterplots: Extreme X, Extreme Y, Extreme X and Y, Distant Data Point.]
Influential Points
An influential point is an outlier that greatly affects the slope of the regression line. One way to test the influence of an outlier is to compute the regression equation with and without the outlier.
To put it simply, influential points are data points that have disproportionate effects on the slope of the regression equation.
Influential Points
This type of analysis is illustrated below. The slope is larger (−3.32 versus −4.10) when the outlier is present, so this outlier would be considered an influential point. Sometimes, an influential point will cause the coefficient of determination to be bigger; sometimes, smaller. In this example, the coefficient of determination is smaller when the outlier is present.

Without Outlier
Regression equation: ŷ = 104.78 - 4.10x
Coefficient of determination: R² = 0.94

With Outlier
Regression equation: ŷ = 97.51 - 3.32x
Coefficient of determination: R² = 0.55
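The with-and-without comparison can be sketched in code. The data below are hypothetical (they are not the data behind the slide, which are not given): six points near a line plus one point far out on the x axis. The helper `fit_line` is a made-up name.

```python
def fit_line(points):
    """Return (intercept, slope, r_squared) of the least-squares line."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return my - slope * mx, slope, sxy ** 2 / (sxx * syy)

# Hypothetical data: six points near a line, plus one far out on the x axis.
base = [(2, 96), (4, 90), (6, 80), (8, 72), (10, 66), (12, 56)]
outlier = (24, 60)

for label, pts in [("without outlier", base), ("with outlier", base + [outlier])]:
    a, b, r2 = fit_line(pts)
    print(f"{label}: y-hat = {a:.2f} + ({b:.2f})x, R^2 = {r2:.2f}")
```

In this made-up example, the single extra point flattens the slope and lowers the coefficient of determination, the same pattern as in the slide's example.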
Example
Which statement about influential points is true?
I. Removal of an influential point changes the regression line.
II. Data points that are outliers in the horizontal direction are more likely to be influential than points that are outliers in the vertical direction.
III. Influential points have large residuals.
I and II are true statements. III is not necessarily true: an influential point can pull the regression line toward itself and end up with a small residual. A linear transformation neither increases nor decreases the linear relationship between variables; it preserves the relationship. A nonlinear transformation is used to increase the relationship between variables. The most effective transformation method depends on the data being transformed.