Chapter 12-2 Transforming Relationships Day 2 AP Statistics


The Ladder of Power
Let’s examine the ln rung of the ladder a little more closely:

Power: (x, ln(y))
Comment: This is very useful if the values of y increase by a percentage.

Power: (ln(x), y)
Comment: This is useful when x has a wide range or when the scatterplot descends rapidly to the left and levels off to the right.

Power: (ln(x), ln(y))
Comment: When one of the ladder powers is too big and the other one is too small, this is often a useful transformation. Also, if the scatterplot is thickening, this transformation can be very useful.
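To see why (x, ln(y)) suits data in which y grows by a percentage, here is a minimal Python sketch. The starting value of 100 and the 8% growth rate are made-up illustration values, not data from this lesson.

```python
import math

# Hypothetical data: y starts at 100 and grows 8% per step (illustration only).
xs = list(range(6))
ys = [100 * 1.08 ** x for x in xs]

# Re-express y as ln(y).
ln_ys = [math.log(y) for y in ys]

# Successive differences of ln(y) are constant (each equals ln(1.08)),
# so the points (x, ln(y)) fall on a straight line.
steps = [ln_ys[i + 1] - ln_ys[i] for i in range(len(ln_ys) - 1)]
print([round(s, 4) for s in steps])
```

Constant steps in ln(y) are exactly what a straight line in (x, ln(y)) looks like, which is why this rung works for percentage growth.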


Non-Linear Regression
Let’s examine the relationship between shutter speed and f/stops of a particular camera lens:

Shutter speed    f/stop
1/1000           2.8
1/500            4
1/250            5.6
1/125            8
1/60             11
1/30             16
1/15             22
1/8              32


Non-Linear Regression
Store the ln(F-stop) and ln(shutter speed) into L5 and L8, respectively:

Power
(x, y)           = (speed, f/stop)
(x, ln(y))       = (speed, ln(f/stop))
(ln(x), y)       = (ln(speed), f/stop)
(ln(x), ln(y))   = (ln(speed), ln(f/stop))


Non-Linear Regression
Let’s examine the relationship between shutter speed and f/stops of a particular camera lens:

[Four scatterplots, one per transformation. Three are labeled “No! Curved”; one is labeled “YES! Linear”.]




Although the data looks linear, it’s still possible that it is actually curved.

We need to check if this data is actually linear or just appears to be linear.
• Let’s perform a residual plot on this data.



Let’s examine the relationship between shutter speed and f/stops of a particular camera lens:

The points appear to have a random spread about the modified LSRL. So, this seems to be a good model for the data, although it may have increasing spread.

Be careful when determining the actual LSRL.
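The residual check can also be sketched outside the calculator. The Python below is only an illustration: it refits the ln-ln line from the shutter speed table with a hand-written least-squares helper (`fit_line` is a name chosen here, not a calculator function) and lists the residuals that a residual plot would display.

```python
import math

# Data from the shutter speed / f/stop table.
speeds = [1/1000, 1/500, 1/250, 1/125, 1/60, 1/30, 1/15, 1/8]
f_stops = [2.8, 4, 5.6, 8, 11, 16, 22, 32]

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

ln_x = [math.log(s) for s in speeds]
ln_y = [math.log(f) for f in f_stops]
a, b = fit_line(ln_x, ln_y)

# Residual = observed ln(f/stop) minus predicted ln(f/stop).
residuals = [y - (a + b * x) for x, y in zip(ln_x, ln_y)]
for x, r in zip(ln_x, residuals):
    print(f"ln(speed) = {x:7.3f}   residual = {r:+.4f}")
```

Plotting these residuals against ln(speed) is the check described above: look for random scatter around zero rather than a curve.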


Non-Linear Regression
Let’s examine the relationship between shutter speed and f/stops of a particular camera lens:

The following is the modified equation:

$\ln(\text{F-stop}) = 4.464 + 0.497\,\ln(\text{shutter speed})$

However, in the calculator we have the following:

$\hat{y} = 4.464 + 0.497x$

Take the first equation and solve for y-hat; this is the true modified equation of the LSRL:

$\text{F-stop} = e^{(4.464 + 0.497\,\ln(\text{shutter speed}))}$
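As a cross-check on these coefficients, here is a minimal Python sketch that fits the least-squares line to (ln(speed), ln(f/stop)) using the table values and then undoes the transformation. The function name `predict_f_stop` is introduced here for illustration only.

```python
import math

# Data from the shutter speed / f/stop table.
speeds = [1/1000, 1/500, 1/250, 1/125, 1/60, 1/30, 1/15, 1/8]
f_stops = [2.8, 4, 5.6, 8, 11, 16, 22, 32]

# Re-express both variables with ln.
ln_x = [math.log(s) for s in speeds]
ln_y = [math.log(f) for f in f_stops]

# Least-squares slope and intercept on the transformed data.
n = len(ln_x)
x_bar = sum(ln_x) / n
y_bar = sum(ln_y) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ln_x, ln_y))
sxx = sum((x - x_bar) ** 2 for x in ln_x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar
print(round(intercept, 3), round(slope, 3))

def predict_f_stop(speed):
    """Undo the ln on y: f/stop = e^(intercept + slope * ln(speed))."""
    return math.exp(intercept + slope * math.log(speed))
```

The printed values should agree with 4.464 and 0.497 to three decimal places.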




Let’s make sure that our new equation fits the original data:

Graph it: It looks pretty good.

See if you can determine the f/stop for a shutter speed of ¼:

Y1(1/4) = 43.612
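The same prediction can be reproduced by hand. This short Python sketch plugs ¼ into the back-transformed equation using the rounded coefficients 4.464 and 0.497; `f_stop_hat` is a name chosen here for illustration.

```python
import math

# Rounded coefficients from the ln-ln regression.
a, b = 4.464, 0.497

def f_stop_hat(speed):
    """Back-transformed model: f/stop = e^(a + b * ln(speed))."""
    return math.exp(a + b * math.log(speed))

print(round(f_stop_hat(1/4), 1))
```

With the rounded coefficients the result is about 43.6. The calculator keeps more decimal places in the coefficients, which is presumably why it reports 43.612.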



Let’s attempt to solve the following:
a) What is the equation of the curve of “best-fit”?
b) Predict the salary for a superstar for 2005.

Player             Year    Salary (Millions)
Nolan Ryan         1980    1
George Foster      1982    2.04
Kirby Puckett      1990    3
Jose Canseco       1990    4.7
Roger Clemens      1991    5.3
Ken Griffey, Jr.   1996    8.5
Albert Belle       1997    11
Pedro Martinez     1998    12.5
Mike Piazza        1999    12.5
Mo Vaughn          1999    13.3
Kevin Brown        1999    15
Carlos Delgado     2001    17
Alex Rodriguez     2001    25.2



If we examine the scatterplot for this data, it is obvious that the data is curved, so a transformation is needed to answer these questions.

[Scatterplot of (year, salary): This data is definitely curved.]


Non-Linear Regression
Let’s check the ln transformations. Store the ln(x) and ln(y) into new lists:

Power
(x, y)           = (year, salary)           We already know that this is curved
(x, ln(y))       = (year, ln(salary))
(ln(x), y)       = (ln(year), salary)
(ln(x), ln(y))   = (ln(year), ln(salary))


Non-Linear Regression
Check all three models…

[Scatter plot and residual plot for each of the following:]
Original (x, y)
Exponential (x, ln(y)): Best One!!
Logarithmic (ln(x), y)
Power (ln(x), ln(y))


Non-Linear Regression
After checking the scatter plots and the residual plots, we see that (x, ln(y)) is the best transformation.

Logarithmic (ln(x), y) and Power (ln(x), ln(y)): These are NO GOOD!!!

However, we should check the ladder of powers to make sure that there is not a better transformation.


Non-Linear Regression
The ladder of powers.

• Let’s try to transform the data according to the ladder:

Power 2: $y^2$
Power ½: $\sqrt{y}$
Power ln: $\ln(y)$
Power -½: $\frac{1}{\sqrt{y}}$
Power -1: $\frac{1}{y}$


Non-Linear Regression
To get a better picture of the somewhat linear models, let’s look at the residual plots.

Power 2: $y^2$
Power ½: $\sqrt{y}$
Power ln: $\ln(y)$
Power -½: $\frac{1}{\sqrt{y}}$
Power -1: $\frac{1}{y}$

[Residual plots for the re-expressions, with these captions:]
• We’ve already done this one.
• This residual plot still looks curved.
• This looks pretty good too, although it may be slightly curved.
• This is definitely curved.
• This is the best one!!!
• This is definitely curved.



Let’s examine the relationship between years and baseball superstar salaries: best transformation (x, ln(y))

The following is the modified equation:

$\ln(\widehat{\text{salary}}) = -261.282 + 0.132(\text{year})$

However, in the calculator we have the following:

$\widehat{\text{salary}} = -261.282 + 0.132\,\text{year}$

Change the equation to account for the transformation. Solve for y. This is the equation of the function of “best-fit.”

$\widehat{\text{salary}} = e^{(-261.282 + 0.132\,\text{year})}$
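These coefficients can be checked with a minimal Python sketch that fits the least-squares line to (year, ln(salary)) using the salary table. The variable names are chosen here for illustration; only the data comes from the lesson.

```python
import math

# Data from the superstar salary table (salary in millions).
years = [1980, 1982, 1990, 1990, 1991, 1996, 1997,
         1998, 1999, 1999, 1999, 2001, 2001]
salaries = [1, 2.04, 3, 4.7, 5.3, 8.5, 11,
            12.5, 12.5, 13.3, 15, 17, 25.2]

# Re-express y only: this is the (x, ln(y)) transformation.
ln_salaries = [math.log(s) for s in salaries]

# Least-squares slope and intercept on (year, ln(salary)).
n = len(years)
x_bar = sum(years) / n
y_bar = sum(ln_salaries) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, ln_salaries))
sxx = sum((x - x_bar) ** 2 for x in years)
slope = sxy / sxx
intercept = y_bar - slope * x_bar
print(round(intercept, 3), round(slope, 3))
```

Because the year values are near 2000, the intercept is very sensitive to small changes in the slope, so rounding the coefficients early can shift predictions noticeably.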


Non-Linear Regression
Now let’s predict the salary of a superstar in 2005:

Plug in 2005:

$\widehat{\text{salary}} = e^{(-261.282 + 0.132\,\text{year})}$

$\widehat{\text{salary}} = e^{(-261.282 + 0.132(2005))}$

$\widehat{\text{salary}} = e^{(3.378)}$

$\widehat{\text{salary}} = 29.312$

The average baseball superstar in 2005 will make about 29.312 million.

If we look at the graph, it is evident that the curve fits the data.

However, we should be wary of extrapolation.
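The arithmetic for 2005 can be traced in a few lines of Python. This sketch uses the rounded coefficients -261.282 and 0.132; `salary_hat` is a name introduced here for illustration.

```python
import math

# Rounded coefficients from the (year, ln(salary)) regression.
a, b = -261.282, 0.132

def salary_hat(year):
    """Back-transformed model: salary = e^(a + b * year), in millions."""
    return math.exp(a + b * year)

exponent = a + b * 2005
print(round(exponent, 3), round(salary_hat(2005), 3))
```

The latest year in the data is 2001, so 2005 lies outside the observed range; that is the extrapolation the lesson warns about.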


Common Errors
Do we have to re-scale our numbers?
• We need to be very careful about which numbers we use when we try to transform our data. Data values that are far from 1 are often not affected much by our transformations unless the range is very large. It is often useful to try to get numbers between 1 and 100, since our re-expression will have a greater effect on 1 to 100 than on 100,001 to 100,100.

Why can’t I find the “perfect” model?
• Don’t expect to ever find the perfect model; more likely than not, it doesn’t exist. Just remember that “real-world” data is messy and it is difficult to find a model that will fit the data perfectly.
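The point about re-scaling can be illustrated numerically. The Python sketch below measures how much ln() bends over a range by comparing ln at the midpoint with the straight line through the endpoints. The helper `bend` is a made-up name for this illustration.

```python
import math

def bend(lo, hi):
    """How far ln() at the midpoint sits above the straight line
    joining (lo, ln(lo)) and (hi, ln(hi)). Zero would mean no bending."""
    mid = (lo + hi) / 2
    return math.log(mid) - (math.log(lo) + math.log(hi)) / 2

print(bend(1, 100))            # noticeable bending on 1 to 100
print(bend(100_001, 100_100))  # almost none on 100,001 to 100,100
```

On the second range ln() is nearly a straight line, so taking logs would barely change the shape of the scatterplot.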


Common Errors
Is it me or does that graph look weird?
• Re-expression can straighten many relationships, but not those that go up and down and up again (or something to that effect). You should refuse to analyze such data with methods that require a linear form.

Why is my re-expression missing data?
• It is impossible to re-express negative values for several rungs on our ladder of powers. Such values are omitted by the calculator and the effect on your re-expression can be significant. Try not to lose good data values while transforming your data. Sometimes adding small values such as 1/2 or 1/6 is useful.
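As a small illustration of lost data, the Python sketch below applies ln to a made-up list that contains a zero, first as-is and then after adding 1/2. The values and the helper `safe_ln` are hypothetical, and returning `None` merely stands in for a value the calculator would omit.

```python
import math

# Hypothetical values (not from the lesson); ln is undefined at 0.
values = [0, 0.5, 2, 8]

def safe_ln(v):
    """Return ln(v), or None when v is not positive (ln is undefined there)."""
    return math.log(v) if v > 0 else None

plain = [safe_ln(v) for v in values]          # the 0 is lost
shifted = [safe_ln(v + 0.5) for v in values]  # adding 1/2 keeps every value
print(plain.count(None), shifted.count(None))
```

A shift this small only rescues values greater than -1/2; data with more negative values would need a larger shift or a different rung.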


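Transformations

Type of Model: Exponential
Transformation: $(x, y) \to (x, \ln(y))$
New Model: $\ln \hat{y} = a + bx$
Re-expressed Model: $\hat{y} = e^{a + bx}$

Type of Model: Logarithmic
Transformation: $(x, y) \to (\ln(x), y)$
New Model: $\hat{y} = a + b\ln(x)$
Re-expressed Model: $\hat{y} = a + b\ln(x)$

Type of Model: Power
Transformation: $(x, y) \to (\ln(x), \ln(y))$
New Model: $\ln \hat{y} = a + b\ln(x)$
Re-expressed Model: $\hat{y} = e^{a + b\ln(x)}$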
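The names "power" and "exponential" follow from rewriting the re-expressed models: e^(a + b ln(x)) equals e^a · x^b, and e^(a + bx) equals e^a · (e^b)^x. The Python sketch below checks both identities numerically, using the rounded coefficients from this lesson's two examples; the function names are chosen here for illustration.

```python
import math

# Power model, with the camera-lens coefficients (a = 4.464, b = 0.497).
a_p, b_p = 4.464, 0.497

def re_expressed_power(x):
    return math.exp(a_p + b_p * math.log(x))

def power_form(x):
    return math.exp(a_p) * x ** b_p   # e^a times x to the power b

# Exponential model, with the salary coefficients (a = -261.282, b = 0.132).
a_e, b_e = -261.282, 0.132
growth_factor = math.exp(b_e)         # multiplier applied for each extra year

def re_expressed_exponential(x):
    return math.exp(a_e + b_e * x)

def exponential_form(x):
    return math.exp(a_e) * growth_factor ** x

print(round(growth_factor, 3))
```

A growth factor of about 1.141 means the fitted salaries rise by roughly 14% per year, which matches the earlier comment that (x, ln(y)) suits data that grow by a percentage.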


Can’t We Just Use the Curve?
Although your calculator will do other types of regression (quadratic, exponential, etc.), using the curve has drawbacks.

• First, lines are easy to understand. Using the curve throws out all of our understanding of linear regression. We understand how to interpret the slope and the y-intercept, and linear models are more useful in advanced statistical practices. In order to use the curve, we would have to come up with a whole new system of understanding.

• It’s best to use the linear model.

Outliers
In regression analysis, a data point that diverges greatly from the overall pattern of data is called an outlier.

There are basically four ways that a point can be considered an outlier:

1. It could have an extreme X value compared to other data points.

2. It could have an extreme Y value compared to other data points.

3. It could have extreme X and Y values compared to other data points.

4. It might be distant from the rest of the data, even without extreme X or Y values.

Examples

[Four scatterplots: Extreme X, Extreme Y, Extreme X and Y, Distant Data Point.]

Influential Points
An influential point is an outlier that greatly affects the slope of the regression line. One way to test the influence of an outlier is to compute the regression equation with and without the outlier.

To put it simply, influential points are data points that have disproportionate effects on the slope of the regression equation.
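The with-and-without comparison can be sketched in Python. The data below are hypothetical (five points on the line y = 2x plus one point far to the right), chosen only to make the effect easy to see; `slope_and_intercept` is a helper written for this illustration.

```python
def slope_and_intercept(points):
    """Least-squares slope and intercept for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical data (not from the lesson): five points on y = 2x, plus one outlier.
base = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10)]
outlier = (10, 0)

slope_without, _ = slope_and_intercept(base)
slope_with, _ = slope_and_intercept(base + [outlier])
print(round(slope_without, 3), round(slope_with, 3))
```

In this made-up case the single extra point changes the slope from positive to negative, so it would count as influential.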

Influential Points
This type of analysis is illustrated below. The slope is larger (less steep: -3.32 versus -4.10) when the outlier is present, so this outlier would be considered an influential point. Sometimes, an influential point will cause the coefficient of determination to be bigger; sometimes, smaller. In this example, the coefficient of determination is smaller when the outlier is present.

Without Outlier
Regression equation: ŷ = 104.78 - 4.10x
Coefficient of determination: R² = 0.94

With Outlier
Regression equation: ŷ = 97.51 - 3.32x
Coefficient of determination: R² = 0.55

Example
Which statement about influential points is true?

I. Removal of an influential point changes the regression line.

II. Data points that are outliers in the horizontal direction are more likely to be influential than points that are outliers in the vertical direction.

III. Influential points have large residuals.

I and II are true statements. Statement III is not necessarily true: an influential point pulls the regression line toward itself, so its residual can be small.

A linear transformation neither increases nor decreases the linear relationship between variables; it preserves the relationship. A nonlinear transformation is used to increase the relationship between variables. The most effective transformation method depends on the data being transformed.
