8/13/2019 [Report] Final Project of Statistic
INTERNATIONAL UNIVERSITY
VNU HCMC
REPORT
FINAL PROJECT OF STATISTIC
BUSINESS
Lecturer: Nguyen Bac Huy
Team Members:
1. Lng Tho Nhi_BAFNIU11075
2. Nguyn Tn Pht_BAFNIU11140
3. T Phm Duy Tin_BABAIU11144
4. Nguyn an Vy_BAFNIU11057
5. Trng Th Ngc Tuyt_BAFNIU11080
6. Hunh Ngc Tho Uyn_BAFNIU11127
7. L Ngc Anh Phng_BABAIU11269
CONTENT:
1. Question 1
2. Question 2
3. Question 3
4. Question 4
5. Question 5
Question 1:
A US National Public Transportation survey taken a few years ago in the USA indicated that
less than 5% of US citizens use public transportation. Collect secondary data from
several websites of the U.S. Department of Transportation, the US Census Bureau, etc. to test
this hypothesis from the survey. Write a short essay to explain the data.
Solution:
We consider figures from the Bureau of Labor Statistics of the United States
Department of Labor for 2008 and 2009:
In December 2008, a sample of 143,338,000 (the number of working people) was chosen.
In December 2009, a sample of 137,792,000 (the number of working people) was chosen.
The report Public Transportation Usage Among U.S. Workers: 2008 and 2009, based on the
American Community Survey (ACS), estimates the number of workers who commuted by
public transportation in the 50 largest metro areas:
In 2008, 7,186,530 people were found to use public transportation to get to work.
In 2009, 6,992,424 people were found to use public transportation to get to work.
Let p denote the proportion of people using public transportation. The null and
alternative hypotheses are:
H0: p < 5%
H1: p ≥ 5%
We test this hypothesis for two years: 2008 and 2009.
We begin with the year 2008:
The hypothesized value of the proportion: p0 = 0.05
The sample size: n = 143,338,000
The sample proportion: $\hat{p} = x/n = 7{,}186{,}530 / 143{,}338{,}000 = 0.05014$
Because we have:
np0 = 0.05 x 143,338,000 = 7,166,900 > 5
n(1-p0) = 0.95 x 143,338,000 = 136,171,100 >5
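So we use the z-test with the formula:
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}} = \frac{0.05014 - 0.05}{\sqrt{0.05 \times 0.95 / 143{,}338{,}000}} = 7.69$$
The test statistic is z = 7.69, so the upper-tail p-value is about 7.4 x 10^-15. Because the p-value < 5%, we reject H0. Therefore
we do not accept that in 2008 less than 5% of US citizens used public transportation.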
We continue with the data for 2009:
The hypothesized value of the proportion: p0 = 0.05
The sample size: n = 137,792,000
The sample proportion: $\hat{p} = x/n = 6{,}992{,}424 / 137{,}792{,}000 = 0.050746$
Because we have:
np0 = 0.05 x 137,792,000 = 6,889,600 > 5
n(1-p0) = 0.95 x 137,792,000 = 130,902,400 > 5
So we use the z-test with the formula:
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}} = \frac{0.050746 - 0.05}{\sqrt{0.05 \times 0.95 / 137{,}792{,}000}} = 40.2$$
The test statistic is z = 40.2, so the upper-tail p-value is effectively 0. Because the p-value < 5%, we reject H0. Therefore
we do not accept that in 2009 less than 5% of US citizens used public transportation.
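The two z-tests above can be rechecked with a short script. This is only a sketch: the function name `one_proportion_z_test` is our own, the upper-tail p-value is an assumption that matches the direction of H1, and the script works from the unrounded counts, so its z values need not match figures obtained from a rounded sample proportion.

```python
import math


def one_proportion_z_test(successes, n, p0):
    """Return (p_hat, z, upper_tail_p) for a one-sample proportion z-test."""
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Upper-tail probability P(Z >= z) for a standard normal variable.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return p_hat, z, p_value


# Counts quoted in the solution: (public transportation users, working people).
samples = {
    2008: (7_186_530, 143_338_000),
    2009: (6_992_424, 137_792_000),
}

for year, (users, workers) in samples.items():
    p_hat, z, p_value = one_proportion_z_test(users, workers, 0.05)
    print(f"{year}: p_hat = {p_hat:.6f}, z = {z:.2f}, p-value = {p_value:.3g}")
```

For very large z the p-value underflows to 0.0 in floating point, which is expected.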
Reference:
Public Transportation Usage Among U.S. Workers: 2008 and 2009, Table 2: Public Transportation Usage for the 50 Largest Metropolitan Statistical Areas: 2008 and 2009.
The Employment Situation: December 2008, Bureau of Labor Statistics, United States Department of Labor, Table A: Major indicators of labor market activity, seasonally adjusted.
The Employment Situation: December 2009, Bureau of Labor Statistics, United States Department of Labor, Table A: Major indicators of labor market activity, seasonally adjusted.
Question 2:
Discuss among your group, select one company, state one dependent variable, and
more than two independent variables.
a. Collect data. Testing the independence among independent variables.
b. Establish regression relationship, write down the regression equation.
c. Use the regression equation to estimate new value of dependent variable.
Solution:
a/ Collect data. Testing the independence among independent variables.
Kinh Do Food Joint Stock Company (Saigon) is a business operating in the field of food production and processing. How can it meet all of its customers' needs? To answer this question, a survey of customer satisfaction was conducted for the following reasons:
trends: about price, quality, forms,
Through this survey, the company can devise new strategies that invest in its strengths, overcome its shortcomings, attract more customers and earn their respect.
Data:
Y: Dependent variable, Level of satisfaction
X1: Independent variable, Price
X2: Independent variable, Quality
X3: Independent variable, Evaluation compared to other milk brands
X4: Independent variable, How often consumers use
X5: Independent variable, Repeated use
Level of satisfaction
Y X1 X2 X3 X4 X5
3 3 3 3 5 4
4 3 4 5 5 5
1 1 2 3 3 2
3 3 3 5 5 4
3 3 3 3 2 3
3 3 2 1 5 1
3 3 3 4 2 4
3 3 3 4 3 4
1 1 1 5 5 1
3 3 4 4 2 4
3 3 3 4 3 4
3 3 3 5 5 4
4 3 4 5 2 5
3 3 3 3 5 4
5 3 5 5 3 5
4 3 4 4 2 4
4 3 5 4 5 5
3 3 4 2 5 4
4 3 4 5 2 5
3 4 3 5 4 2
3 3 3 4 5 4
4 3 5 4 4 5
3 3 3 3 2 4
3 3 3 3 3 3
3 3 3 4 2 4
5 4 4 4 3 4
3 3 3 4 2 3
2 2 3 3 4 3
3 1 3 2 3 3
3 3 4 3 4 4
3 2 4 4 2 4
4 3 4 4 3 5
3 4 5 2 5 5
4 3 3 4 2 4
3 4 3 1 2 3
3 3 3 3 5 2
3 3 3 1 1 3
1 2 1 3 3 2
4 3 4 5 2 5
4 4 4 4 3 5
4 3 4 5 3 4
4 4 5 5 3 5
2 3 2 4 4 3
4 3 3 4 2 4
3 3 4 4 3 4
3 3 4 5 2 4
4 3 4 5 2 5
2 3 3 5 5 4
3 3 5 3 2 4
4 4 3 4 5 5
2 3 4 4 3 3
3 3 3 4 5 5
4 3 4 2 5 4
3 2 3 4 2 4
4 1 5 5 2 3
1 3 4 3 4 3
3 3 4 4 5 5
4 3 3 4 2 3
3 2 4 4 2 3
3 3 3 4 5 5
Hypothesis testing:
H0: β1 = β2 = β3 = β4 = β5 = 0
H1: Not all the βi (i=1,2,3,4,5) are zero
ANOVA
             df   SS         MS         F          Significance F
Regression   5    23.01243   4.602486   11.65684   1.16E-07
Residual     54   21.3209    0.394832
Total        59   44.33333
According to the ANOVA table, the test statistic value FT = F-ratio = 11.65684 > F critical = 2.3538, and Significance F = 1.16E-07 is far below the usual levels of significance, so we can reject H0.
In conclusion, based on the ANOVA table for the regression model and the hypothesis test, we
have enough evidence to conclude that there is a regression relationship between the dependent
variable Y and the independent variables Xi (i=1,2,3,4,5).
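As a quick check of the ANOVA arithmetic, the mean squares and the F-ratio can be recomputed from the sums of squares and degrees of freedom. The sketch below only redoes the division; the helper name `f_ratio` is our own, and the critical value is taken from the text rather than computed.

```python
def f_ratio(ss_regression, df_regression, ss_residual, df_residual):
    """Return (MS regression, MS residual, F) from an ANOVA table."""
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return ms_regression, ms_residual, ms_regression / ms_residual


# Values copied from the ANOVA table of the five-variable model.
ms_reg, ms_res, f_value = f_ratio(23.01243, 5, 21.3209, 54)
f_critical = 2.3538  # critical value quoted in the report

print(f"MS regression = {ms_reg:.6f}")
print(f"MS residual   = {ms_res:.6f}")
print(f"F             = {f_value:.5f}")
print("Reject H0" if f_value > f_critical else "Do not reject H0")
```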
Coefficient table:
            Coefficients   Standard Error   t Stat     P-value
Intercept   0.503736       0.542849         0.927948   0.357564
X1          0.298951       0.137257         2.178037   0.03379
X2          0.293209       0.121558         2.412093   0.019289
X3          0.068627       0.08471          0.810139   0.421416
X4          -0.12698       0.065631         -1.93473   0.058269
X5          0.246856       0.119542         2.065025   0.043736
Regression equation: from the coefficient table, we can set up the regression equation as
follows:
Y = 0.503736 + 0.298951 X1 + 0.293209 X2 + 0.068627 X3 - 0.12698 X4 + 0.246856 X5
To test whether the variables of the regression model are significant, we use the p-values. If the p-value of Xi > α (level of significance) = 0.05, the test statistic value falls into the non-rejection region, Xi is non-significant and we should remove Xi.
Based on the coefficient table:
P-value of X3 = 0.421416 > 0.05
P-value of X4 = 0.058269 > 0.05
Thus, X3 and X4 are non-significant and should be removed from the regression equation.
In addition:
P-value of X1 = 0.03379 < 0.05
P-value of X2 = 0.019289 < 0.05
P-value of X5 = 0.043736 < 0.05
Thus, X1, X2 and X5 are significant and are kept. The regression is run again on these three variables, with X5 relabelled as X3:
Y X1 X2 X3
3 3 3 4
4 3 4 5
1 1 2 2
3 3 3 4
3 3 3 3
3 3 2 1
3 3 3 4
3 3 3 4
1 1 1 1
3 3 4 4
3 3 3 4
3 3 3 4
4 3 4 5
3 3 3 4
5 3 5 5
4 3 4 4
4 3 5 5
3 3 4 4
4 3 4 5
3 4 3 2
3 3 3 4
4 3 5 5
3 3 3 4
3 3 3 3
3 3 3 4
5 4 4 4
3 3 3 3
2 2 3 3
3 1 3 3
3 3 4 4
3 2 4 4
4 3 4 5
3 4 5 5
4 3 3 4
3 4 3 3
3 3 3 2
3 3 3 3
1 2 1 2
4 3 4 5
4 4 4 5
4 3 4 4
4 4 5 5
2 3 2 3
4 3 3 4
3 3 4 4
3 3 4 4
4 3 4 5
2 3 3 4
3 3 5 4
4 4 3 5
2 3 4 3
3 3 3 5
4 3 4 4
3 2 3 4
4 1 5 3
1 3 4 3
3 3 4 5
4 3 3 3
3 2 4 3
3 3 3 5
Hypothesis testing:
H0: β1 = β2 = β3 = 0
H1: Not all the βi (i=1,2,3) are zero
ANOVA
             df   SS         MS         F          Significance F
Regression   3    21.18356   7.061186   17.08122   5.35E-08
Residual     56   23.14977   0.413389
Total        59   44.33333
According to the ANOVA table, the test statistic value FT = F-ratio = 17.08122 > F critical = 2.7395, and Significance F = 5.35E-08 is far below the usual levels of significance, so we can reject H0.
In conclusion, based on the ANOVA table for the regression model and the hypothesis test, we
have enough evidence to conclude that there is a regression relationship between the dependent
variable Y and the independent variables Xi (i=1,2,3).
Coefficient table: Multiple Regression
            Coefficients   Standard Error   t Stat     P-value
Intercept   0.295018       0.434343         0.679228   0.499791
X1          0.243586       0.136693         1.781989   0.080173
X2          0.335106       0.121941         2.748094   0.008051
X3          0.262938       0.113466         2.317325   0.024166
Regression equation: from the coefficient table, we can set up the regression equation as follows:
Y = 0.295018 + 0.243586 X1 + 0.335106 X2 + 0.262938 X3
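Part c of the question asks for an estimate of a new value of Y. A minimal sketch of that step, using the three-variable equation above, is shown here. The function name `predict_satisfaction` and the example respondent (X1 = 3, X2 = 3, X3 = 4) are hypothetical and chosen only for illustration.

```python
# Coefficients copied from the three-variable regression equation.
INTERCEPT = 0.295018
SLOPES = (0.243586, 0.335106, 0.262938)  # for X1, X2, X3


def predict_satisfaction(x1, x2, x3):
    """Estimate the level of satisfaction Y from the fitted equation."""
    return INTERCEPT + sum(b * x for b, x in zip(SLOPES, (x1, x2, x3)))


# Hypothetical new respondent: price = 3, quality = 3, repeated use = 4.
estimate = predict_satisfaction(3, 3, 4)
print(f"Estimated level of satisfaction: {estimate:.4f}")
```

Because the survey answers are on a 1 to 5 scale, an estimate is only meaningful for inputs inside that range.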
To test whether the variables of the regression model are significant, we use the p-values. If the p-value of Xi > α (level of significance) = 0.05, then the test statistic value falls into the non-rejection region. So we cannot reject H0 at the 0.05 level of significance, Xi is non-significant and we should remove Xi.
Based on the coefficient table:
P-value of X1 = 0.080173 > 0.05
Source of Variation   Sum of Squares (SS)   Df   Mean Squares (MS)   F ratio (FT)
Treatment (TR)        381126.6667           2    190563.3333         20.70840377
Error (E)             248460                27   9202.222222
Total (T)             629586.6667           29
Test statistic value:
F-ratio = 20.71
At α = 0.05, the critical value:
F(2, 27; 0.05) = 3.35
Because F-ratio > F critical, we reject H0.
This means that, based on the ANOVA table and the hypothesis test, we have sufficient
evidence to conclude that not all three prototypes have the same average range.
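For reference, the entries of the ANOVA table fit together as in the worked equations below. The symbols k (number of prototypes) and n (total number of observations) are our own notation; k = 3 and n = 30 are inferred from the degrees of freedom in the table (2 and 29).

```latex
\begin{align*}
MSTR &= \frac{SSTR}{k - 1} = \frac{381126.6667}{2} = 190563.3333 \\
MSE  &= \frac{SSE}{n - k} = \frac{248460}{27} = 9202.2222 \\
F    &= \frac{MSTR}{MSE} = \frac{190563.3333}{9202.2222} \approx 20.71 > F(2, 27; 0.05) = 3.35
\end{align*}
```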
Question 4:
We surveyed students of National University HCMC and collected data on their income
(money they earn from part-time jobs or receive from their parents) and on the cellphone
they bought with that income, from among three big companies. Suppose that a random
sample of students is available for the various companies. We will test the independence
between these two factors, using a level of significance of 5%. The results are below:
                          Companies
Students         Nokia   Sony   Samsung   Total
< 1 million      42      16     38        96
1-3 millions     57      37     55        149
> 3 millions     22      28     37        87
Total            121     81     130       332
Solution:
H0: Student income and the number of users of the three companies are
independent of each other.
H1: Student income and the number of users of the three companies are not
independent.
Expected counts of data points in different cells:
                          Companies
Student income   Nokia   Sony    Samsung   Total
< 1 million      34.99   23.42   37.59     96
1-3 millions     54.3    36.35   58.35     149
> 3 millions     31.71   21.23   34.06     87
Total            121     81      130       332
The chi-square test statistic value for independence is:
$$\chi^2_t = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
Degrees of freedom = (r - 1)(c - 1) = (3 - 1)(3 - 1) = 4.
Critical value: $\chi^2_c = \chi^2(4, 0.05) = 9.4877$.
At the 0.05 level of significance, we cannot reject H0 since $\chi^2_t < \chi^2_c$. This means that
student income and the number of users of the three companies are
independent of each other.
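The expected counts and the chi-square statistic for this test can be reproduced from the observed table alone. The following is a sketch under the usual formula E = (row total x column total) / grand total; the function name `chi_square_independence` is our own, and the critical value 9.4877 is taken from the text rather than computed.

```python
def chi_square_independence(table):
    """Return (expected counts, chi-square statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count of each cell under independence.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return expected, statistic, df


# Observed counts: rows are income groups, columns are Nokia, Sony, Samsung.
observed = [
    [42, 16, 38],  # < 1 million
    [57, 37, 55],  # 1-3 millions
    [22, 28, 37],  # > 3 millions
]

expected, statistic, df = chi_square_independence(observed)
critical_value = 9.4877  # chi-square(4, 0.05), as quoted in the solution

for row in expected:
    print("  ".join(f"{e:6.2f}" for e in row))
print(f"chi-square = {statistic:.4f}, df = {df}")
print("Reject H0" if statistic > critical_value else "Cannot reject H0")
```

The statistic lies only slightly below the critical value, so the conclusion is sensitive to small changes in the counts.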
Question 5:
For a random sample of 200 U.S. motorists, the mileages driven last year are in data
presented below.
10221 718 8802 2102 4221 3257 2697 4760 5717 4193
2209 8521 6972 6873 6115 5998 2781 3833 6632 2829
2796 2031 7783 45 2692 7912 4447 3018 4895 3511
3571 2202 502 1748 5524 4185 8404 7077 5891 1378
3806 5559 9889 521 7284 7146 4482 7734 1286 2686
4110 1816 6972 3818 3510 4500 6229 167 5889 5349
4402 5973 5174 6198 3330 8836 7500 5466 5942 1654
4500 2079 5281 3668 5246 567 4527 5354 7474 551
4669 5572 402 6182 7250 2859 7124 7924 3625 5734
4720 5492 7941 5966 4801 7289 980 2963 6674 6741
4993 6026 6271 3514 5011 5245 5653 2910 5672 8103
5090 8050 6069 2960 4173 8943 6699 1514 2307 5497
5327 2293 9555 7712 5679 8840 3420 6197 2846 4943
5640 6825 6817 6744 702 6494 5954 3811 5794 2855
5801 3237 5816 4784 5014 7530 4308 3689 6981 1904
6208 6593 4104 5751 5244 1860 5224 655 5401 10304
6723 2683 7990 7645 2336 7869 6657 4223 5857 4336
6829 6274 10703 6669 3469 5682 5144 7044 4059 4673
7326 8198 5731 10962 5667 3615 6465 9577 4047 4694
9167 4435 1879 5912 7440 5259 4132 2617 3026 5967
a. Guess which theoretical distribution could be the best fit for the data.
b. Use the 0.01 level of significance to determine whether the given data follow the
distribution from question a.
Solution:
a/ The chi-square test for goodness of fit can be used to test how well our data
support an assumption about the distribution of a population or random variable of interest.
Sometimes the mean μ and the standard deviation σ of the population or variable are known. But in
some cases the values of μ and σ are not given, so we need to estimate them from
the data. When this happens, we lose a degree of freedom for each parameter
estimated from the data. The degrees of freedom of the chi-square statistic are then df = k - 2 - 1 = k - 3
(instead of k - 1 as before).
b/
Based on the answer to the question above, our guess for this population is the bell-shaped
(normal) distribution. In question b, at the 0.01 level of significance, we determine
whether the given data follow this distribution or not.
The null and alternative hypotheses:
H0: the population has a normal distribution
H1: the population is not normally distributed
The chi-square goodness-of-fit test may be applied to test any hypothesis about the
distribution of a population or a random variable. In particular, it may be applied to
test how well an assumption of a normal distribution is supported by a given data set.
We have:
n = 200
We divided the range into 6 classes: k = 6
The class boundaries are:
2797.85, 4078.609, 5084.92, 6091.231, 7371.99
The expected count: E = np
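The procedure described above can be sketched in code. This is an illustration under stated assumptions, not the report's own calculation: it assumes the k classes are chosen to have equal probability under the fitted normal distribution (so p = 1/k for every class), it estimates μ and σ from the data, and the function name `normal_gof` is our own. To keep the example short it uses only the first 30 mileages from the table, not all 200.

```python
from statistics import NormalDist, mean, stdev


def normal_gof(data, k=6):
    """Chi-square goodness-of-fit statistic for a normal distribution.

    Uses k equal-probability classes under the fitted normal (an assumption).
    Returns (boundaries, observed counts, expected count, statistic, df).
    """
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    # k - 1 interior boundaries that split the fitted normal into k classes.
    boundaries = [fitted.inv_cdf(i / k) for i in range(1, k)]
    observed = [0] * k
    for x in data:
        # Class index = number of boundaries lying strictly below x.
        observed[sum(x > b for b in boundaries)] += 1
    expected = n / k  # E = n * p with p = 1 / k
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    df = k - 3  # k - 1, minus 2 for the estimated mean and standard deviation
    return boundaries, observed, expected, statistic, df


# First three rows of the mileage table (30 of the 200 values).
sample = [
    10221, 718, 8802, 2102, 4221, 3257, 2697, 4760, 5717, 4193,
    2209, 8521, 6972, 6873, 6115, 5998, 2781, 3833, 6632, 2829,
    2796, 2031, 7783, 45, 2692, 7912, 4447, 3018, 4895, 3511,
]

boundaries, observed, expected, statistic, df = normal_gof(sample, k=6)
print("Boundaries:", [round(b, 2) for b in boundaries])
print("Observed counts:", observed)
print(f"Expected count per class: {expected}")
print(f"chi-square = {statistic:.3f}, df = {df}")
```

The resulting statistic would then be compared with the chi-square critical value for df = k - 3 at the 0.01 level of significance.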