Soccer Experiment


Soccer

Predicting the Points Made at the End of a Season

Carle Peterson-Perry and Tara Hart

Advanced Statistics

2/4/2011



Abstract:



Introduction

Why soccer?

It is a great sport that many people have a passion for. For players it is the
thrill, the fun, and the physicality. Those who watch it simply enjoy it. Some
girls just love the hot players, but for the guys it's the controversy and the
skill involved. There's something about a great goal or an awesome save that
makes you say 'wow' and leaves you in awe. Also, when it comes to the World Cup,
the European Cup, the Champions League, the Premier League, or any other major
competition, you feel almost patriotic supporting your country or team. There
are definitely people who cheered all day when Spain won the World Cup final, to
the point where they wore their jerseys three days straight and paraded all over
city centers. It is something to put your heart behind: the intensity and
passion, the expectation and unpredictability.

Is it possible to use past game statistics to predict the points teams will earn
in the Barclays Premier League? In our experiment, we attempt to challenge the
very unpredictable nature of soccer in this league by claiming that past league
statistics can adequately predict how many points any given team will earn in a
season.



The Barclays Premier League is the top competition of the English soccer system;
it is sponsored by Barclays, hence the name. It is the country's primary
contest, in which twenty soccer clubs compete to be crowned champions of the
English soccer system. Every team plays thirty-eight games, facing every other
squad twice. Teams receive three points for a win, one point for a draw, and
zero points for a loss. As the season progresses, teams are ranked by total
points, then goal difference, and then goals scored. For example, if two teams
have equal points, the goal difference and then goals scored determine which is
placed higher, and if they are still equal, a game between the two decides which
team is placed above the other. The games are spread over roughly 38 weeks, from
August to May, mainly on weekends, with a few games played during the week.
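
The scoring and ranking rules described above can be sketched in a few lines of
Python. This is only an illustration: the class name, the function name, and
the three club records are hypothetical and are not taken from the study data.

```python
from dataclasses import dataclass


@dataclass
class TeamRecord:
    """Season totals for one club (all example values below are hypothetical)."""
    name: str
    wins: int
    draws: int
    losses: int
    goals_for: int
    goals_against: int

    @property
    def points(self) -> int:
        # Three points per win, one per draw, none for a loss.
        return 3 * self.wins + self.draws

    @property
    def goal_difference(self) -> int:
        return self.goals_for - self.goals_against


def rank_table(teams):
    """Sort by points, then goal difference, then goals scored (highest first)."""
    return sorted(
        teams,
        key=lambda t: (t.points, t.goal_difference, t.goals_for),
        reverse=True,
    )


# Hypothetical records, each summing to 38 games.
table = rank_table([
    TeamRecord("Club A", 20, 10, 8, 60, 40),
    TeamRecord("Club B", 21, 7, 10, 65, 40),
    TeamRecord("Club C", 10, 8, 20, 35, 60),
])
for position, team in enumerate(table, start=1):
    print(position, team.name, team.points, team.goal_difference)
```

Clubs A and B both finish on 70 points here, so the goal difference places
Club B first. The play-off game mentioned above is not modeled.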

There's a certain aura about the Premier League. The big game on a Sunday pulls
people from all corners of the globe to one stadium full of united, harmonic
chants, whether they follow over the airwaves or live. Each person rides their
luck in the hope that their team secures those three all-important points, which
collectively determine rank at the end of the season. However, 'lady luck' isn't
always on everyone's side. Developing a model that helps predict points from
certain criteria will ease the nerves of those religious soccer fans and give
the gamblers out there a way of edging the odds in their favor.



Method:

To perform this observational study, we first had to collect an adequate amount
of data for the Barclays Premier League. To gather our data, we searched the
Internet for a reliable website that could provide the scores and other
statistics of each soccer team. The website that provided this data was
http://www.premierleague.com/page/Statistics/0,,12306,00.html. This site
contained all sorts of statistics, such as goals scored, wins, losses, draws,
and more, which made excellent variables for our study. These statistics were
taken from last year and the four previous years, giving us a starting sample
size of eighty (n=80).

After finding our website, we entered the data into Minitab and began our search
for a suitable model to predict the number of points a team would earn by the
end of a season. We thought variables such as goals scored, goals scored against
the team, number of wins, draws, losses, and the number of World Cup players on
each team would all be useful in our model. A total of six variables were used
initially.

Next, to get an idea of how the data looked, we used Minitab to create a
scatterplot. The graph appeared to have a linear shape, but a hidden curve was
also possible. We also noticed that the graph contained a couple of outliers, so
we decided to limit our data to the teams that scored more than 20 points. This
reduced the sample size to seventy-eight. By doing this we obtained a better
scatterplot of the data.

The scatterplot suggested a linear form, which led us to hypothesize our first
model, a complete first-order model. After we ran the regression in Minitab, the
statistics supported the idea that the model was useful, but some coefficients
proved not to be. So, determined to find a better model, we used stepwise
regression, hypothesized further models, and ran the tests again. After
comparing each model's strengths, only one stood out as better than the rest.
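
For reference, a complete first-order model in the six candidate variables named
in this section would take the general form below. The numbering of the
variables is an assumption made here for illustration; the study does not state
an order, and no fitted coefficients are implied.

```latex
E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3
     + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6
% y   = points earned in a season
% x_1 = goals scored            x_2 = goals scored against the team
% x_3 = wins                    x_4 = draws
% x_5 = losses                  x_6 = number of World Cup players
```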



Results

1. Hypothesize the form of the model. We hypothesize a model to relate fire
damage y to the distance x from the nearest fire station. We will hypothesize a
straight-line probabilistic model: \( y = \beta_0 + \beta_1 x + \varepsilon \)

2. Collect sample data. Collect the (x,y) values for each of the n=15

experimental units (residential fires) in the sample.

x, miles    y, thousands of dollars
3.4         26.2
1.8         17.8
4.6         31.3
2.3         23.1
3.1         27.5
5.5         36.0
0.7         14.1
3.0         22.3
2.6         19.6
4.3         31.3
2.1         24.0
1.1         17.3
6.1         43.2
4.8         36.4
3.8         26.1

[Figure: Regression Plot of y against x, with fitted line
Y = 10.2779 + 4.91933X and R-Sq = 92.3 %]

3. Use the sample to estimate unknown model parameters. Use Minitab. See

Plot above.

The regression equation is
y = 10.3 + 4.92 x

Predictor       Coef     StDev       T       P
Constant      10.278     1.420    7.24   0.000
x             4.9193    0.3927   12.53   0.000

S = 2.316   R-Sq = 92.3%   R-Sq(adj) = 91.8%

Analysis of Variance

Source           DF       SS       MS        F       P
Regression        1   841.77   841.77   156.89   0.000
Residual Error   13    69.75     5.37
Total            14   911.52
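
As a cross-check on the Minitab output, the same quantities can be recomputed by
hand from the fifteen (x, y) pairs in step 2. The sketch below uses only the
standard least squares formulas; the variable names are our own and are not
part of the Minitab output.

```python
import math

# (x, y) pairs copied from the data table in step 2:
# x = distance in miles, y = fire damage.
data = [
    (3.4, 26.2), (1.8, 17.8), (4.6, 31.3), (2.3, 23.1), (3.1, 27.5),
    (5.5, 36.0), (0.7, 14.1), (3.0, 22.3), (2.6, 19.6), (4.3, 31.3),
    (2.1, 24.0), (1.1, 17.3), (6.1, 43.2), (4.8, 36.4), (3.8, 26.1),
]
n = len(data)
x_bar = sum(x for x, _ in data) / n
y_bar = sum(y for _, y in data) / n

# Corrected sums of squares and cross-products.
ss_xx = sum((x - x_bar) ** 2 for x, _ in data)
ss_yy = sum((y - y_bar) ** 2 for _, y in data)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in data)

# Least squares estimates of the slope and intercept.
slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar

# Analysis-of-variance quantities.
ss_regression = slope * ss_xy
ss_error = ss_yy - ss_regression
mse = ss_error / (n - 2)          # mean square for error, n - 2 = 13 df
s = math.sqrt(mse)                # estimated standard deviation of the error
r_squared = ss_regression / ss_yy
f_stat = ss_regression / mse

print(f"y = {intercept:.3f} + {slope:.4f} x")
print(f"SSR = {ss_regression:.2f}, SSE = {ss_error:.2f}, Total = {ss_yy:.2f}")
print(f"S = {s:.3f}, R-Sq = {100 * r_squared:.1f}%, F = {f_stat:.2f}")
```

The printed values should agree with the Minitab table to the precision shown
there.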

The least squares estimate of the slope, \( \hat{\beta}_1 = 4.92 \), implies
that the estimated mean damage increases by $4,920 for each additional mile from
the fire station. This interpretation is valid over the range of x, from .7 to
6.1 miles from the station. The estimate \( \hat{\beta}_0 = 10.3 \) has no
practical interpretation, since x=0 is outside the range (it would seem to apply
to the fire station itself).

4. Specify the probability distribution of the random error term ε.
Assumptions about the random error component:

a. \( E(\varepsilon) = 0 \). That is, the mean of the probability distribution of ε is zero.
b. The variance of ε is equal to a constant, say \( \sigma^2 \), for all values of x.
c. The distribution of ε is normal.
d. The ε's are independent.

\( s^2 = 5.37 \) = MSE (mean square for error) estimates \( \sigma^2 \).

The estimated standard deviation of ε is s = 2.316. Most of the observed fire
damage values (y) will fall within 2s = 4.63 thousand dollars of their
respective predicted values when using the least squares line.

5. Statistically check the usefulness of the model. That is, does x contribute
information for the prediction of y using the straight-line model?

a. Test of model utility. For this example, consider the research hypothesis
that x and y are positively related. The hypotheses are:

\( H_0: \beta_1 = 0 \)
\( H_a: \beta_1 > 0 \)

The test statistic (Minitab) is t = 12.53, with a 2-tailed probability of 0.000
(both shown in the Minitab output above). Thus, for a one-tailed test,
p = 0.000/2 = 0.000. Since p < α, there is sufficient evidence to reject the
null and conclude that the distance between the fire and the fire station
contributes information for the prediction of fire damage, and that the mean
fire damage increases as the distance increases.

b. Confidence interval for slope. A 95% CI for \( \beta_1 \) is
\( \hat{\beta}_1 \pm t_{\alpha/2}\, s_{\hat{\beta}_1} \), or
\( 4.92 \pm (2.160)(.3927) = 4.92 \pm .85 \), or (4.07, 5.77). We are 95%
confident that the interval from $4,070 to $5,770 encloses the mean increase
\( \beta_1 \) in fire damages per additional mile distance from the fire
station.

c. Numerical descriptive measures of model adequacy. The coefficient of
determination, R-Sq = 92.3%, implies that about 92% of the sample variation in
fire damage is explained by the distance between the fire and the fire station.
The coefficient of correlation \( r = \sqrt{.923} = .96 \) indicates a high
correlation, and confirms our conclusion that \( \beta_1 \) differs from 0.

Thus, the results of the test for \( \beta_1 \), the high value of \( r^2 \),
and the relatively small 2s value (step 4) all point to a strong linear
relationship between x and y.
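
The three checks in step 5 can be reproduced from the raw data as follows. This
is a sketch rather than the Minitab procedure itself: the critical value 2.160
is simply taken from the confidence interval calculation in part b, and the
helper name `fit_line` is our own.

```python
import math

# (x, y) pairs from step 2: distance in miles, fire damage.
data = [
    (3.4, 26.2), (1.8, 17.8), (4.6, 31.3), (2.3, 23.1), (3.1, 27.5),
    (5.5, 36.0), (0.7, 14.1), (3.0, 22.3), (2.6, 19.6), (4.3, 31.3),
    (2.1, 24.0), (1.1, 17.3), (6.1, 43.2), (4.8, 36.4), (3.8, 26.1),
]


def fit_line(pairs):
    """Return slope, its standard error, and r-squared for a straight-line fit."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    ss_xx = sum((x - x_bar) ** 2 for x, _ in pairs)
    ss_yy = sum((y - y_bar) ** 2 for _, y in pairs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    slope = ss_xy / ss_xx
    ss_error = ss_yy - slope * ss_xy
    s = math.sqrt(ss_error / (n - 2))
    se_slope = s / math.sqrt(ss_xx)
    r_squared = 1 - ss_error / ss_yy
    return slope, se_slope, r_squared


slope, se_slope, r_squared = fit_line(data)

# a. Test of model utility: t statistic for H0: beta_1 = 0.
t_stat = slope / se_slope

# b. 95% confidence interval for the slope.
T_CRIT = 2.160  # t value with 13 df, copied from the text (not computed here)
half_width = T_CRIT * se_slope
ci = (slope - half_width, slope + half_width)

# c. Coefficient of correlation, carrying the sign of the slope.
r = math.copysign(math.sqrt(r_squared), slope)

print(f"t = {t_stat:.2f}")
print(f"95% CI for slope: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"r = {r:.2f}")
```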

6. When satisfied with the model, use it for prediction, estimation, and so on.

Make a prediction with Minitab…

Predicted Values

    Fit   StDev Fit          95.0% CI               95.0% PI
 27.496       0.604   ( 26.190,  28.801)   ( 22.324,  32.667)
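
The predicted-values line can be reproduced with the usual formulas for the
standard error of a fitted mean and of a new observation. The text does not
state the x value used; the sketch below assumes x = 3.5 miles, which is
inferred from the reported fit of 27.496, and takes the critical value 2.160
from step 5.

```python
import math

# (x, y) pairs from step 2: distance in miles, fire damage.
data = [
    (3.4, 26.2), (1.8, 17.8), (4.6, 31.3), (2.3, 23.1), (3.1, 27.5),
    (5.5, 36.0), (0.7, 14.1), (3.0, 22.3), (2.6, 19.6), (4.3, 31.3),
    (2.1, 24.0), (1.1, 17.3), (6.1, 43.2), (4.8, 36.4), (3.8, 26.1),
]
n = len(data)
x_bar = sum(x for x, _ in data) / n
y_bar = sum(y for _, y in data) / n
ss_xx = sum((x - x_bar) ** 2 for x, _ in data)
ss_yy = sum((y - y_bar) ** 2 for _, y in data)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in data)

slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar
s = math.sqrt((ss_yy - slope * ss_xy) / (n - 2))

X_NEW = 3.5     # assumed distance in miles (inferred, not stated in the text)
T_CRIT = 2.160  # t value with 13 df, copied from step 5

fit = intercept + slope * X_NEW

# Leverage term shared by both standard errors.
leverage = 1 / n + (X_NEW - x_bar) ** 2 / ss_xx
se_fit = s * math.sqrt(leverage)        # for the mean of y at X_NEW
se_pred = s * math.sqrt(1 + leverage)   # for a single new y at X_NEW

ci = (fit - T_CRIT * se_fit, fit + T_CRIT * se_fit)
pi = (fit - T_CRIT * se_pred, fit + T_CRIT * se_pred)

print(f"Fit = {fit:.3f}, StDev Fit = {se_fit:.3f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"95% PI: ({pi[0]:.3f}, {pi[1]:.3f})")
```

The prediction interval is wider than the confidence interval because it also
accounts for the variability of an individual fire around the mean.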


