7/29/2019 Soccer Experiment
http://slidepdf.com/reader/full/soccer-experiment 1/11
Soccer
Predicting the points made at the end of a Season
Carle Peterson-Perry and Tara Hart
Advanced Statistics
2/4/2011
Introduction
Why soccer?
It is a great sport that many people have a passion for. For players it is the thrill,
the fun, and the physicality. Those who watch it simply enjoy it. Some girls
just love the hot players, but for the guys it’s the controversy and the skill
involved. There’s something about a great goal or an awesome save that makes
you say ‘wow’ and leaves you in awe. Also, when it comes to the World Cup, the
European Cup, the Champions League, the Premier League, or any major competition,
you feel almost patriotic supporting your country or team. There are definitely people
who cheered all day when Spain won the World Cup final, to the point where they
wore their jerseys three days straight and paraded all over city centers. It is
something to put your heart behind: the intensity and passion, the expectation and
unpredictability.
Is it possible to use past game statistics to predict the points that teams will
earn in the Barclays Premier League? In our experiment, we challenge
the very unpredictable nature of soccer that comes with this league by claiming
that past league statistics can adequately predict how many points any given team
will earn in a season.
The Barclays Premier League is the top competition of the English soccer
system, sponsored by Barclays, hence the name. The Premier League is the
country’s primary contest, where twenty soccer clubs compete to be crowned
champions of the English soccer system. Every team plays thirty-eight games,
facing every other squad twice. Teams receive three points for every win,
one point for a draw, and zero points for a loss.
As the season progresses, the teams are ranked by total points, then goal
difference, and then goals scored. For example, if two teams have equal points, the
goal difference and then goals scored determine which is placed higher, and if they
are still equal, a game between the two identifies which team is placed above the
other. The games are spread roughly over the course of 38 weeks, from August to
May, mainly on weekends with a few games occurring during the week.
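The points rule and the ranking order described above can be sketched in a few lines of Python. This is only an illustration: the team names and records are made up, the TeamRecord class and rank function are hypothetical helpers, and the final tie-break by a game between the two teams is not modeled.

```python
from dataclasses import dataclass


@dataclass
class TeamRecord:
    """One team's season totals (hypothetical example data)."""
    name: str
    wins: int
    draws: int
    losses: int
    goals_for: int
    goals_against: int

    @property
    def points(self) -> int:
        # Three points per win, one per draw, none for a loss.
        return 3 * self.wins + self.draws

    @property
    def goal_difference(self) -> int:
        return self.goals_for - self.goals_against


def rank(teams):
    """Sort by points, then goal difference, then goals scored (all descending)."""
    return sorted(
        teams,
        key=lambda t: (t.points, t.goal_difference, t.goals_for),
        reverse=True,
    )


# Made-up records; each team has played 38 games.
teams = [
    TeamRecord("Team A", wins=20, draws=10, losses=8, goals_for=60, goals_against=40),
    TeamRecord("Team B", wins=21, draws=7, losses=10, goals_for=65, goals_against=40),
    TeamRecord("Team C", wins=10, draws=8, losses=20, goals_for=35, goals_against=60),
]

table = rank(teams)
for position, team in enumerate(table, start=1):
    print(position, team.name, team.points, team.goal_difference, team.goals_for)
```

Team A and Team B both finish on 70 points here, so the goal difference decides their order.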
There’s a certain aura about the Premier League. The big game on a Sunday
pulls people from all corners of the globe into one stadium full of united, harmonic
chants, either through the airwaves or live. Each person rides their luck in the hope
of their team securing those three all-important points, which collectively determine
rank at the end of the season. However, ‘lady luck’ isn’t always on everyone’s
side. Developing a model to help predict points from certain criteria will help
ease the nerves of those religious soccer fans and give the gamblers out there a
way of edging the odds in their favor.
Method:
To perform this observational study, we first had to collect an adequate
amount of data for the Barclays Premier League. To gather our data, we
searched the Internet for a reliable website that could provide the scores and
other statistics of each soccer team. The website that provided us with this data was
http://www.premierleague.com/page/Statistics/0,,12306,00.html. This site
contained all sorts of statistics such as goals scored, wins, losses, draws, and more,
which made excellent variables for our study. These statistics were taken
from last year and the four previous years, giving us a starting sample size of
eighty (n=80).
After finding our website we placed the data into Minitab and began our
search for a suitable model that would be useful to predict the number of points a
team would make at the end of a season. We thought variables such as goals
scored, goals scored against the team, number of wins, draws, losses, and the
number of World Cup players on each team would all be useful in our model. A
total of six variables were used initially.
Next, to get an idea of how the data looked, we used Minitab to create a
scatterplot. The graph appeared to have a linear shape but could also be
hiding some curvature. We also noticed that the graph contained a couple of
outliers, so we decided to limit our data to the teams that scored
more than 20 points. This reduced the sample size to seventy-eight. By
doing this we obtained a better scatterplot of the data.
The scatterplot suggested a linear form, which led us to
hypothesize our first model, a complete first-order model. After we ran the
regression in Minitab, the statistics supported the idea that the model was useful,
but some coefficients proved not to be. Determined to find a better model, we used
stepwise regression, hypothesized further models, and ran the tests again.
After comparing each model’s strengths, only one stood out as better than the
rest.
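For reference, a complete first-order model in the six initial variables would have the general form sketched below. The numbering of the predictors is an assumption made here for illustration (x_1 goals scored, x_2 goals scored against, x_3 wins, x_4 draws, x_5 losses, x_6 World Cup players); y is the points a team earns in a season.

```latex
\[
  y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3
      + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \varepsilon
\]
```

Stepwise regression then keeps only the terms whose coefficients contribute to the prediction of y.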
Results
1. Hypothesize the form of the model. We hypothesize a model to relate fire
damage y to the distance x from the nearest fire station. We will
hypothesize a straight-line probabilistic model: \(y = \beta_0 + \beta_1 x + \varepsilon\)
2. Collect sample data. Collect the (x,y) values for each of the n=15
experimental units (residential fires) in the sample.
x, miles    y, thousands of dollars
3.4         26.2
1.8         17.8
4.6         31.3
2.3         23.1
3.1         27.5
5.5         36.0
0.7         14.1
3.0         22.3
2.6         19.6
4.3         31.3
2.1         24.0
1.1         17.3
6.1         43.2
4.8         36.4
3.8         26.1
[Figure: Regression Plot of y (thousands of dollars) against x (miles), with fitted line Y = 10.2779 + 4.91933X and R-Sq = 92.3%]
3. Use the sample to estimate unknown model parameters. Use Minitab. See
Plot above.
The regression equation is
y = 10.3 + 4.92 x

Predictor   Coef     StDev    T       P
Constant    10.278   1.420    7.24    0.000
x           4.9193   0.3927   12.53   0.000

S = 2.316    R-Sq = 92.3%    R-Sq(adj) = 91.8%

Analysis of Variance
Source           DF   SS       MS       F        P
Regression        1   841.77   841.77   156.89   0.000
Residual Error   13    69.75     5.37
Total            14   911.52
The least squares estimate of the slope, \(\hat{\beta}_1 = 4.92\), implies that the
estimated mean damage increases by $4,920 for each additional mile from the fire
station. This interpretation is valid over the range of x, from 0.7 to 6.1 miles
from the station. The intercept \(\hat{\beta}_0 = 10.3\) has no practical interpretation,
since x = 0 is outside the range (it would seem to apply to the fire station itself).
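As a cross-check on the Minitab output, the least squares estimates can be recomputed by hand from the fifteen (x, y) pairs listed in step 2. The sketch below uses only standard Python; the function name fit_line is a hypothetical helper, not part of Minitab or of the original analysis.

```python
# Data from step 2: x = distance from the fire station (miles),
# y = fire damage (thousands of dollars).
x = [3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0, 2.6, 4.3, 2.1, 1.1, 6.1, 4.8, 3.8]
y = [26.2, 17.8, 31.3, 23.1, 27.5, 36.0, 14.1, 22.3, 19.6, 31.3, 24.0, 17.3,
     43.2, 36.4, 26.1]


def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products about the means.
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


b0, b1 = fit_line(x, y)
print(f"y = {b0:.3f} + {b1:.4f} x")
```

Up to rounding, the result should agree with the coefficients Minitab reports (10.278 and 4.9193).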
4. Specify the probability distribution of the random error term ε. Assumptions about the random error component:
a. \(E(\varepsilon) = 0\). That is, the mean of the probability distribution of ε is zero.
b. The variance of ε is equal to a constant, say \(\sigma^2\), for all values of x.
c. The distribution of ε is normal.
d. The ε’s are independent.
\(s^2 = 5.37\) = MSE (mean square for error) estimates \(\sigma^2\).
Estimated standard deviation of ε is s = 2.316. Most of the observed fire
damage values (y) will fall within 2s ≈ 4.63 thousand dollars of their respective predicted values when using the least squares line.
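The estimates in step 4 follow directly from the Residual Error row of the Analysis of Variance table. A minimal sketch of that arithmetic, using the reported SS = 69.75 and DF = 13 (the variable names are illustrative only):

```python
import math

sse = 69.75      # Residual Error sum of squares from the ANOVA table
df_error = 13    # Residual Error degrees of freedom (n - 2 = 15 - 2)

mse = sse / df_error   # s^2, the estimate of sigma^2
s = math.sqrt(mse)     # estimated standard deviation of the error term
two_s = 2 * s          # rough half-width for most prediction errors

print(f"s^2 = {mse:.2f}, s = {s:.3f}, 2s = {two_s:.2f}")
```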
5. Statistically check the usefulness of the model. That is, does x contribute information for the prediction of y using the straight-line model?
a. Test of model utility. For this example, consider the research hypothesis that x and y are positively related. The hypotheses are:
\(H_0: \beta_1 = 0\)
\(H_a: \beta_1 > 0\)
The test statistic (Minitab) is t = 12.53, with a two-tailed p-value of 0.000. Thus, for a one-tailed test, p = 0.000/2 = 0.000. Since p < α, there is sufficient evidence to reject
the null hypothesis and conclude that the distance between the fire and the fire station contributes information for the prediction of fire damage, and that the mean fire damage increases as the distance increases.
b. Confidence interval for slope. A 95% CI for \(\beta_1\) is \(\hat{\beta}_1 \pm t_{\alpha/2}\, s_{\hat{\beta}_1}\), or
\(4.92 \pm (2.160)(.3927) = 4.92 \pm .85\), or (4.07, 5.77). We are 95% confident
that the interval from $4,070 to $5,770 encloses the mean increase \(\beta_1\) in fire damage per additional mile of distance from the fire station.
c. Numerical descriptive measures of model adequacy. The coefficient of determination, R-Sq = 92.3%, implies that about 92% of the sample variation in fire damage is explained by the distance between the fire and the fire station. The coefficient of correlation
\(r = \sqrt{.923} = .96\) indicates a high correlation, and confirms our
conclusion that \(\beta_1\) differs from 0.
Thus, the results of the test for \(\beta_1\), the high value of \(r^2\), and the relatively small 2s
value (step 4) all point to a strong linear relationship between x and y.
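The three checks in step 5 can be reproduced from numbers already printed in the Minitab output. The sketch below takes the slope, its standard deviation, and the sums of squares as given; the critical value 2.160 is the tabled t value for 13 degrees of freedom used in part b, hard-coded here rather than computed.

```python
import math

b1 = 4.9193      # slope estimate (Coef for x)
se_b1 = 0.3927   # standard deviation of the slope estimate (StDev for x)
t_crit = 2.160   # t table value, alpha/2 = .025, 13 df (taken from part b)
ssr = 841.77     # Regression sum of squares
sst = 911.52     # Total sum of squares

# a. Test statistic for H0: beta_1 = 0
t_stat = b1 / se_b1

# b. 95% confidence interval for the slope
margin = t_crit * se_b1
ci = (b1 - margin, b1 + margin)

# c. Coefficient of determination and correlation
r_sq = ssr / sst
r = math.sqrt(r_sq)  # positive root, because the slope is positive

print(f"t = {t_stat:.2f}")
print(f"95% CI for slope: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"R-Sq = {r_sq:.3f}, r = {r:.2f}")
```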
6. When satisfied with the model, use it for prediction, estimation, and so on.
Make a prediction with Minitab… Predicted Values
Fit      StDev Fit   95.0% CI              95.0% PI
27.496   0.604       (26.190, 28.801)      (22.324, 32.667)
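The predicted values can be checked with the standard formulas for a confidence interval on the mean response and a prediction interval for a new observation. The output does not state the x value used; x0 = 3.5 miles below is an assumption, chosen because it reproduces the reported fit of 27.496. The coefficients, s, and the critical value 2.160 are taken from the earlier output.

```python
import math

# Distances from step 2 (miles); needed for the mean of x and SSxx.
x = [3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0, 2.6, 4.3, 2.1, 1.1, 6.1, 4.8, 3.8]

b0, b1 = 10.2779, 4.91933   # fitted line from the regression plot
s = 2.316                   # estimated standard deviation of the error term
t_crit = 2.160              # t table value, alpha/2 = .025, 13 df
x0 = 3.5                    # ASSUMED prediction point (not stated in the output)

n = len(x)
x_bar = sum(x) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)

fit = b0 + b1 * x0
leverage = 1 / n + (x0 - x_bar) ** 2 / ss_xx
se_fit = s * math.sqrt(leverage)        # standard deviation of the fitted mean
se_pred = s * math.sqrt(1 + leverage)   # standard deviation for a new observation

ci = (fit - t_crit * se_fit, fit + t_crit * se_fit)
pi = (fit - t_crit * se_pred, fit + t_crit * se_pred)

print(f"Fit = {fit:.3f}, StDev Fit = {se_fit:.3f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% PI = ({pi[0]:.2f}, {pi[1]:.2f})")
```

The prediction interval is wider than the confidence interval because it also accounts for the variation of a single fire about the mean damage at that distance.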