
Linear Modeling-Trendlines

The Problem - Last time we discussed linear equations (models) where the data is perfectly linear. By using the slope-intercept formula, we derived linear equations/models. In the “real world” most data is not perfectly linear. How do we handle this type of data?

The Solution - We use trendlines (also known as line of best fit and least squares line).

Why - If we find a trendline that is a good fit, we can use the equation to make predictions. Generally we predict into the future (and occasionally into the past) which is called extrapolation. Constructing points between existing points is referred to as interpolation.
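As a rough illustration of the difference between the two kinds of prediction, here is a minimal sketch. The data values are made up for the example (they are not from the slides), and the helper names `fit_line` and `kind_of_prediction` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def kind_of_prediction(xs, x_new):
    """Label a prediction by where x_new falls relative to the observed x values."""
    return "interpolation" if min(xs) <= x_new <= max(xs) else "extrapolation"


# Hypothetical, made-up data: roughly linear, but not perfectly so.
years = [2001, 2002, 2003, 2005, 2006, 2007, 2008]
sales = [10.2, 11.9, 14.1, 18.3, 19.8, 22.1, 23.9]

slope, intercept = fit_line(years, sales)
for year in (2004, 2010):
    estimate = slope * year + intercept
    print(year, kind_of_prediction(years, year), round(estimate, 1))
```

Here 2004 lies between existing points, so the estimate is an interpolation; 2010 lies beyond the last point, so it is an extrapolation.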

Is the trendline a good fit for the data? To answer this question, you need to address the following five guidelines:

Guideline 1: Do you have at least 7 data points? For the datasets that we use in this class, you should use at least 7 of the most recent data points available.

Guideline 2: Does the R-squared value indicate a relationship? R² is a standard measure of how well the line fits the data. If R² is very low, it tells us the model is not very good and probably shouldn't be used. The R-squared value is also called the coefficient of determination and can be written as r² or R².

If R² = 1, then there is a perfect match between the line and the data points. If R² = 0, then there is no linear relationship between the x and y values.

If the R² value is between 0.7 and 1.0, there is a strong linear relationship.

If the R² value is between 0.4 and 0.7, there is a moderate linear relationship.

If the R² value is below 0.4, the relationship is weak and you should not use this data to make predictions.
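The three ranges above can be written as a small lookup. This is only a sketch: the function name `describe_r_squared` is hypothetical, and assigning the boundary value 0.7 to "strong" is an assumption, since the ranges as stated overlap at that point.

```python
def describe_r_squared(r2):
    """Map an R-squared value to the strength labels used in Guideline 2."""
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("R-squared must lie between 0 and 1")
    if r2 >= 0.7:   # assumption: exactly 0.7 counts as strong
        return "strong"
    if r2 >= 0.4:
        return "moderate"
    return "weak"   # below 0.4: do not use for predictions


for value in (0.95, 0.55, 0.20):
    print(value, describe_r_squared(value))
```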

More facts: R² is a measure that allows us to determine how certain one can be in making predictions from a certain model/graph.

The coefficient of determination is such that 0 ≤ r² ≤ 1, and denotes the strength of the linear association between x and y.

The coefficient of determination represents the proportion of the variation in y that is accounted for by the line of best fit. For example, if r = 0.922, then r² = 0.850, which means that 85% of the total variation in y can be explained by the linear relationship between x and y (as described by the regression equation). The other 15% of the total variation in y remains unexplained.

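Calculating the coefficient of determination

The mathematical formula for computing r is:

$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$

where n is the number of pairs of data.

To compute R², just square the result from the above formula.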
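A minimal sketch of this calculation, assuming the formula intended here is the usual sum-based form of the correlation coefficient (sums of x, y, xy, x² and y² over the n data pairs). The function name `correlation_r` is hypothetical; the data are the Leaning Tower of Pisa values from the table in these slides.

```python
import math


def correlation_r(xs, ys):
    """Correlation coefficient r from the sums of x, y, xy, x^2 and y^2."""
    n = len(xs)  # number of (x, y) pairs
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Leaning Tower of Pisa: year vs. lean (tenths of mm in excess of 2.9 meters)
years = list(range(1975, 1988))
lean = [642, 644, 656, 667, 673, 688, 696, 698, 713, 717, 725, 742, 757]

r = correlation_r(years, lean)
print("r   =", round(r, 3))
print("R^2 =", round(r ** 2, 3))  # 0.988, the value shown on the Pisa chart
```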

Having a high R-squared value is not enough. When does a model fit well?

A relatively high R-squared value does not guarantee that the model is a good one. There are some other factors you should look for.

A good model will have a fairly random distribution of data points above and below the line. For example:

The lean of the Leaning Tower of Pisa (tenths of mm in excess of 2.9 meters):

Year    Lean
1975    642
1976    644
1977    656
1978    667
1979    673
1980    688
1981    696
1982    698
1983    713
1984    717
1985    725
1986    742
1987    757

[Chart: "Lean" of Leaning Tower of Pisa. Tenths of mm in excess of 2.9 m vs. Year, with trendline. R² = 0.988]

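In contrast, consider the following data on the North Sea plaice, length vs. weight. R² is still high, but the points lie above the line at both ends and below it in the middle, so they are not randomly distributed around the line.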

North Sea Plaice

Length (cm)    Weight (g)
28.5           213
30.5           259
32.5           308
34.5           363
36.5           419
38.5           500
40.5           574
42.5           674
44.5           808
46.5           909
48.5           1124

[Chart: Weight vs. length of the North Sea plaice. Weight (g) vs. Length (cm), with trendline. R² = 0.9499]
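One way to check whether points are randomly distributed above and below the line is to look at the sign of each residual (actual value minus trendline value). The sketch below does this for the Pisa and plaice tables; the helper names are hypothetical, and counting sign changes is just one simple check, not a method prescribed by these slides.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual_signs(xs, ys):
    """'+' for each point on or above the trendline, '-' for each point below it."""
    slope, intercept = fit_line(xs, ys)
    return "".join(
        "+" if y - (slope * x + intercept) >= 0 else "-" for x, y in zip(xs, ys)
    )


def sign_changes(signs):
    """Number of times consecutive points switch sides of the line."""
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


pisa_years = list(range(1975, 1988))
pisa_lean = [642, 644, 656, 667, 673, 688, 696, 698, 713, 717, 725, 742, 757]

plaice_length = [28.5 + 2 * i for i in range(11)]  # 28.5, 30.5, ..., 48.5 cm
plaice_weight = [213, 259, 308, 363, 419, 500, 574, 674, 808, 909, 1124]

for name, xs, ys in (
    ("Pisa", pisa_years, pisa_lean),
    ("Plaice", plaice_length, plaice_weight),
):
    signs = residual_signs(xs, ys)
    print(name, signs, "sign changes:", sign_changes(signs))
```

The Pisa points switch sides of the line many times, while the plaice points form one long run below the line with runs above it at each end. That pattern suggests a curve rather than a straight line, even though R² is high.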

Guidelines continued

Guideline 3: Verify that your trendline fits the shape of your graph. For example, if your trendline continues upward, but the data makes a downward turn during the last few years, verify that the “higher” prediction makes sense (use practical knowledge). In some cases it is obvious that you have a localized trend. Localized trends will be discussed at a later date.

A related situation occurs when there is a consistent long-term uptrend with an abrupt change toward the end:

US Murder Rate (per 100,000 population)

Year    US Murder Rate
1963    4.6
1964    4.9
1965    5.1
1966    5.6
1967    6.2
1968    6.9
1969    7.3
1970    7.9
1971    8.6
1972    9.0
1973    9.4
1974    9.8
1975    9.6
1976    8.7

[Chart: US Homicide Rate. Rate (per 100,000 population) vs. Year, with trendline. R² = 0.9181]
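A small sketch of the Guideline 3 check on the murder-rate table: fit the trendline, extrapolate one year ahead, and compare the result with what the most recent points are doing. The helper name `fit_line`, the choice of 1977 as the prediction year, and the "last two changes" test are illustrative assumptions, not part of the slides.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


years = list(range(1963, 1977))
rate = [4.6, 4.9, 5.1, 5.6, 6.2, 6.9, 7.3, 7.9, 8.6, 9.0, 9.4, 9.8, 9.6, 8.7]

slope, intercept = fit_line(years, rate)
next_year = years[-1] + 1                      # 1977, one year past the data
predicted = slope * next_year + intercept

# Year-over-year changes for the last two steps (1974->1975, 1975->1976).
recent_changes = [b - a for a, b in zip(rate[-3:], rate[-2:])]

# Flag the case described in Guideline 3: line goes up, recent data goes down.
trend_disagrees = slope > 0 and all(change < 0 for change in recent_changes)

print("Trendline prediction for", next_year, "=", round(predicted, 1))
print("Last observed rate =", rate[-1])
print("Trendline and recent data disagree:", trend_disagrees)
```

The trendline keeps climbing and predicts a rate well above the last observed value of 8.7, even though the rate fell in both of the last two years. This is the situation in which practical knowledge has to decide whether the higher prediction makes sense.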

Guideline 4: Look for outliers. Outliers should be investigated carefully. Often they contain valuable information about the process under investigation or the data gathering and recording process. Before considering the possible elimination of these points from the data, try to understand why they appeared and whether it is likely similar values will continue to appear. Of course, outliers are often bad data points. If the data was entered incorrectly, it is important to find the right information and update it. In some cases, the data is correct and an anomaly occurred in that year. The outlier can be removed if it is justified. It must also be documented.

Individual points can have very strong effects on models. Watch out for them. Consider the following data:

Length of animal vs. running speed.

Animal           Length (cm)    Running Speed (cm/sec)
Clover mite      0.08           0.85
Anyestid mite    0.13           4.3
Argentine ant    0.24           4.4
Ant              0.42           6.5
Deer mouse       9              250
Lizard           15             720
Chipmunk         16             480
Iguana           24             730
Squirrel         25             760
Fox              60             2000
Cheetah          120            2900
Ostrich          210            2300

[Chart: Running Speed vs. Length of Animal, all points. Running speed (cm/s) vs. Length (cm), with trendline. R² = 0.736]

[Chart: the same plot with the Ostrich point labeled. R² = 0.736]

[Chart: Running Speed vs. Length of Animal, length axis ending at 150 cm (the Ostrich point is not shown). R² = 0.9647]
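The effect of a single point can be checked by computing R² with and without it. The sketch below does this for the Ostrich using the table above; the function name `r_squared` is hypothetical. It only measures the point's influence. Whether removing the point is justified is still a judgment call under Guideline 4.

```python
def r_squared(xs, ys):
    """R-squared of the least-squares line through the (x, y) pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


# (animal, length in cm, running speed in cm/sec), from the table above
animals = [
    ("Clover mite", 0.08, 0.85),
    ("Anyestid mite", 0.13, 4.3),
    ("Argentine ant", 0.24, 4.4),
    ("Ant", 0.42, 6.5),
    ("Deer mouse", 9, 250),
    ("Lizard", 15, 720),
    ("Chipmunk", 16, 480),
    ("Iguana", 24, 730),
    ("Squirrel", 25, 760),
    ("Fox", 60, 2000),
    ("Cheetah", 120, 2900),
    ("Ostrich", 210, 2300),
]

without_ostrich = [row for row in animals if row[0] != "Ostrich"]

r2_all = r_squared([a[1] for a in animals], [a[2] for a in animals])
r2_without = r_squared(
    [a[1] for a in without_ostrich], [a[2] for a in without_ostrich]
)

print("R^2 with all 12 animals:", round(r2_all, 3))
print("R^2 without the Ostrich:", round(r2_without, 3))
```

Dropping one point out of twelve moves the fit from the low end of the "strong" range to nearly perfect, which is why individual points deserve a close look before the model is trusted.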

Guideline 5: Practical knowledge. How many years out can we predict?

Based on what you know about the topic, does it make sense to go ahead with the prediction? Use your subject knowledge, not your mathematical knowledge, to address this guideline.