Linear Modeling-Trendlines
The Problem - Last time we discussed linear equations (models) where the data is perfectly linear. By using the slope-intercept formula, we derived linear equations (models). In the “real world” most data is not perfectly linear. How do we handle this type of data?
The Solution - We use trendlines (also known as the line of best fit or the least-squares line).
Why - If we find a trendline that is a good fit, we can use the equation to make predictions. Generally we predict into the future (and occasionally into the past), which is called extrapolation. Constructing points between existing points is referred to as interpolation.
Is the trendline a good fit for the data? To answer this question, you need to address the following five guidelines:
Guideline 1: Do you have at least 7 data points? For the datasets that we use in this class, you should use at least 7 of the most recent data points available.
Guideline 2: Does the R-squared value indicate a relationship? R² is a standard measure of how well the line fits the data. If R² is very low, it tells us the model is not very good and probably shouldn't be used. The R-squared value is also called the Coefficient of Determination and can be written as r² or R².
If R² = 1, then there is a perfect match between the line and the data points. If R² = 0, then there is no linear relationship between the x and y values.
If the R² value is between .7 and 1.0, there is a strong linear relationship.
If the R² value is between .4 and .7, there is a moderate linear relationship.
If the R² value is below .4, the relationship is weak and you should not use this data to make predictions.
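The thresholds above can be turned into a small lookup. This is only a sketch: the function name `describe_r_squared` is made up for illustration, and it assumes that a value of exactly .7 counts as strong and exactly .4 counts as moderate, which the guideline does not spell out.

```python
def describe_r_squared(r2: float) -> str:
    """Label an R-squared value using the class thresholds (.4 and .7)."""
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("R-squared must lie between 0 and 1")
    if r2 >= 0.7:      # assumption: exactly .7 is treated as strong
        return "strong"
    if r2 >= 0.4:      # assumption: exactly .4 is treated as moderate
        return "moderate"
    return "weak"      # below .4: do not use the line for predictions


# The first value is the Pisa R-squared quoted in these notes;
# the other two are arbitrary illustrative values.
for value in (0.988, 0.55, 0.25):
    print(value, describe_r_squared(value))
```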
More facts…
R² is a measure that allows us to determine how certain one can be in making predictions from a certain model/graph.
The coefficient of determination is such that 0 ≤ r² ≤ 1, and denotes the strength of the linear association between x and y.
The coefficient of determination represents the proportion of the total variation in y that is explained by the line of best fit. For example, if r = 0.922, then r² = 0.850, which means that 85% of the total variation in y can be explained by the linear relationship between x and y (as described by the regression equation). The other 15% of the total variation in y remains unexplained.
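Calculating the coefficient of determination
The mathematical formula for computing r is:
$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} $$
where n is the number of pairs of data.
To compute R², just square the result from the above formula.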
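As a rough illustration of the calculation, the sketch below computes r from the standard sum-based form of the correlation formula (sums of x, y, xy, x² and y², with n pairs) and then squares it. The function name `pearson_r` is an illustrative choice, and the data is the Leaning Tower of Pisa table that appears later in these notes.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient r from the sum-based (computational) formula."""
    n = len(xs)  # number of (x, y) pairs
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Lean of the Leaning Tower of Pisa (tenths of mm in excess of 2.9 m)
years = list(range(1975, 1988))
lean = [642, 644, 656, 667, 673, 688, 696, 698, 713, 717, 725, 742, 757]

r = pearson_r(years, lean)
r_squared = r ** 2  # R-squared is simply r squared
print(f"r = {r:.3f}, R^2 = {r_squared:.3f}")
```

Rounded to three decimals, the R² from this calculation agrees with the 0.988 shown on the Pisa chart.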
Having a high R-squared value is not enough. When does a model fit well?
A relatively high R-squared value does not guarantee that the model is a good one. There are some other factors you should look for.
A good model will have a fairly random distribution of data points above and below the line. For example:
The lean of the Leaning Tower of Pisa…
Year  Lean of Leaning Tower of Pisa
1975 642
1976 644
1977 656
1978 667
1979 673
1980 688
1981 696
1982 698
1983 713
1984 717
1985 725
1986 742
1987 757
(tenths of mm in excess of 2.9 meters)
[Figure: "Lean" of Leaning Tower of Pisa, with linear trendline. R² = 0.988. x-axis: Year; y-axis: Tenths of mm in excess of 2.9 m.]
North Sea Plaice
Length (cm) Weight (g)
28.5 213
30.5 259
32.5 308
34.5 363
36.5 419
38.5 500
40.5 574
42.5 674
44.5 808
46.5 909
48.5 1124
[Figure: Weight vs. length of the North Sea plaice, with linear trendline. R² = 0.9499. x-axis: Length (cm); y-axis: Weight (g).]
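One way to check for a "fairly random distribution of data points above and below the line" is to fit the least-squares line and note whether each point lies above (+) or below (-) it. The sketch below does this for the Pisa and plaice tables. The helper names (`fit_line`, `residual_signs`, `sign_changes`) are illustrative, and counting sign changes is just one simple way to make "random" concrete, not a formal test.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual_signs(xs, ys):
    """'+' where the point is above the trendline, '-' where it is below."""
    slope, intercept = fit_line(xs, ys)
    return "".join("+" if y > slope * x + intercept else "-"
                   for x, y in zip(xs, ys))


def sign_changes(signs):
    """How often consecutive points switch sides of the line."""
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


pisa_years = list(range(1975, 1988))
pisa_lean = [642, 644, 656, 667, 673, 688, 696, 698, 713, 717, 725, 742, 757]

plaice_length = [28.5 + 2 * i for i in range(11)]  # 28.5, 30.5, ..., 48.5 cm
plaice_weight = [213, 259, 308, 363, 419, 500, 574, 674, 808, 909, 1124]

pisa_signs = residual_signs(pisa_years, pisa_lean)
plaice_signs = residual_signs(plaice_length, plaice_weight)
print("Pisa:  ", pisa_signs, sign_changes(pisa_signs))
print("Plaice:", plaice_signs, sign_changes(plaice_signs))
```

The plaice points sit above the line at both ends and below it in the middle, which suggests a curve rather than a line even though R² is about 0.95. The Pisa points switch sides of the line far more often.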
Guidelines continued…
Guideline 3: Verify that your trendline fits the shape of your graph. For example, if your trendline continues upward, but the data makes a downward turn during the last few years, verify that the “higher” prediction makes sense (use practical knowledge). In some cases it is obvious that you have a localized trend. Localized trends will be discussed at a later date.
A related situation occurs when there is a consistent long term uptrend with an abrupt change toward the end:
US Murder Rate (per 100,000 population)
Year US Murder Rate
1963 4.6
1964 4.9
1965 5.1
1966 5.6
1967 6.2
1968 6.9
1969 7.3
1970 7.9
1971 8.6
1972 9
1973 9.4
1974 9.8
1975 9.6
1976 8.7
(per 100,000 population)
[Figure: US Homicide Rate, with linear trendline. R² = 0.9181. x-axis: Year; y-axis: Rate (per 100,000 population).]
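To see why Guideline 3 matters here, the sketch below fits a least-squares line to the murder rate table and extends it one year past the data. The forecast year 1977 and the helper name `fit_line` are assumptions made for illustration; the notes themselves do not make this particular prediction.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# US murder rate per 100,000 population, 1963 through 1976
years = list(range(1963, 1977))
rate = [4.6, 4.9, 5.1, 5.6, 6.2, 6.9, 7.3, 7.9, 8.6, 9.0, 9.4, 9.8, 9.6, 8.7]

slope, intercept = fit_line(years, rate)
forecast_year = 1977  # hypothetical year, one step beyond the table
predicted = slope * forecast_year + intercept

print(f"trendline prediction for {forecast_year}: {predicted:.1f}")
print(f"highest observed rate: {max(rate)}")
print(f"last three observed rates: {rate[-3:]}")
```

The trendline's prediction is higher than any rate in the table, while the last three observations are falling. This is the kind of mismatch between the line and the shape of the data that calls for practical judgment before the prediction is trusted.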
Guideline 4: Look for outliers: Outliers should be investigated carefully. Often they contain valuable information about the process under investigation or the data gathering and recording process. Before considering the possible elimination of these points from the data, try to understand why they appeared and whether it is likely similar values will continue to appear. Of course, outliers are often bad data points. If the data was entered incorrectly, it is important to find the right information and update it. In some cases, the data is correct and an anomaly occurred that particular year. The outlier can be removed if it is justified. It must also be documented.
Individual points can have very strong effects on models. Watch out for them. Consider the following data:
Length of animal vs. running speed.
Animal  Length (cm)  Running Speed (cm/sec)
Clover mite 0.08 0.85
Anyestid mite 0.13 4.3
Argentine ant 0.24 4.4
Ant 0.42 6.5
Deer mouse 9 250
Lizard 15 720
Chipmunk 16 480
Iguana 24 730
Squirrel 25 760
Fox 60 2000
Cheetah 120 2900
Ostrich 210 2300
[Figure: Running Speed vs. Length of Animal, all 12 animals, with linear trendline. R² = 0.736. x-axis: Length (cm); y-axis: Running speed (cm/s).]
[Figure: The same plot with the Ostrich point labeled.]
[Figure: Running Speed vs. Length of Animal with the Ostrich removed, with linear trendline. R² = 0.9647. x-axis: Length (cm); y-axis: Running speed (cm/s).]
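The effect of a single point can be checked numerically. The sketch below computes R² for the animal table twice, once with every animal and once with the Ostrich left out. The helper names `r_squared` and `r2_for` are illustrative, and dropping the point here is only a demonstration; whether removal is justified still has to be argued and documented as Guideline 4 requires.

```python
def r_squared(xs, ys):
    """R-squared of the least-squares line through the (x, y) pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


# (animal, length in cm, running speed in cm/sec) from the table above
animals = [
    ("Clover mite", 0.08, 0.85),
    ("Anyestid mite", 0.13, 4.3),
    ("Argentine ant", 0.24, 4.4),
    ("Ant", 0.42, 6.5),
    ("Deer mouse", 9, 250),
    ("Lizard", 15, 720),
    ("Chipmunk", 16, 480),
    ("Iguana", 24, 730),
    ("Squirrel", 25, 760),
    ("Fox", 60, 2000),
    ("Cheetah", 120, 2900),
    ("Ostrich", 210, 2300),
]


def r2_for(rows):
    """R-squared of speed against length for the given table rows."""
    lengths = [length for _, length, _ in rows]
    speeds = [speed for _, _, speed in rows]
    return r_squared(lengths, speeds)


r2_all = r2_for(animals)
r2_without_ostrich = r2_for([row for row in animals if row[0] != "Ostrich"])
print(f"all animals:     R^2 = {r2_all:.4f}")
print(f"without Ostrich: R^2 = {r2_without_ostrich:.4f}")
```

The two results agree with the R² values shown on the charts (0.736 with the Ostrich, 0.9647 without it), so one animal out of twelve accounts for most of the difference between a borderline-strong fit and a very strong one.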