[Report] Final Project of Statistic

8/13/2019 [Report] Final Project of Statistic


INTERNATIONAL UNIVERSITY

VNU HCMC

REPORT

FINAL PROJECT OF STATISTIC

BUSINESS

Lecturer: Nguyen Bac Huy

Team Members:

1. Lng Tho Nhi_BAFNIU11075
2. Nguyn Tn Pht_BAFNIU11140
3. T Phm Duy Tin_BABAIU11144
4. Nguyn an Vy_BAFNIU11057
5. Trng Th Ngc Tuyt_BAFNIU11080
6. Hunh Ngc Tho Uyn_BAFNIU11127
7. L Ngc Anh Phng_BABAIU11269



CONTENT:

1. Question 1
2. Question 2
3. Question 3
4. Question 4
5. Question 5



Question 1:

A US National Public Transportation survey taken a few years ago in the USA indicated that less than 5% of US citizens use public transportation. Collect secondary data from several websites of the U.S. Department of Transportation, the US Census Bureau, etc. to test this hypothesis from the survey. Write a short essay to explain the data.

Solution:

We consider a sample from the Bureau of Labor Statistics of the United States Department of Labor for 2008 and 2009:

In Dec. 2008, a sample of 143,338,000 (the number of working people) was chosen.

In Dec. 2009, a sample of 137,792,000 (the number of working people) was chosen.

The Public Transportation Usage Among U.S. Workers: 2008 and 2009 report of the American Community Survey (ACS) estimates the number of workers who commuted by public transportation in the 50 largest metro areas:

In 2008, 7,186,530 people were found to use public transportation to get to work.

In 2009, 6,992,424 people were found to use public transportation to get to work.

Let p denote the proportion of people using public transportation. Thus, the null and alternative hypotheses are:

H0: p < 5%

H1: p ≥ 5%

We test this hypothesis for two years: 2008 and 2009.

Let us begin with the year 2008:

The hypothesized value of the proportion: p0 = 0.05

The sample size: n = 143,338,000

The sample proportion: $\hat{p}$ = (number of people using public transportation) / n = 0.05014

Because we have:

np0 = 0.05 × 143,338,000 = 7,166,900 > 5

n(1-p0) = 0.95 × 143,338,000 = 136,171,100 > 5

we use the z-test with the formula:

$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}} = \frac{0.05014 - 0.05}{\sqrt{0.05 \times 0.95 / 143{,}338{,}000}} = 7.69$$

The z statistic is 7.69, so the p-value is 7.365 × 10^-13 %. Because the p-value < 5%, we reject H0. Therefore, we do not accept that in 2008 less than 5% of US citizens used public transportation.

We continue with the data of 2009:

The hypothesized value of the proportion: p0 = 0.05

The sample size: n = 137,792,000

The sample proportion: $\hat{p}$ = (number of people using public transportation) / n = 0.05007

Because we have:

np0 = 0.05 × 137,792,000 = 6,889,600 > 5

n(1-p0) = 0.95 × 137,792,000 = 130,902,400 > 5

we use the z-test with the formula:

$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}} = \frac{0.05007 - 0.05}{\sqrt{0.05 \times 0.95 / 137{,}792{,}000}} = 3.77$$

The z statistic is 3.77, so the p-value is 8.159 × 10^-3 %. Because the p-value < 5%, we reject H0. Therefore, we do not accept that in 2009 less than 5% of US citizens used public transportation.
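The two z-tests above can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the function name `one_proportion_z` is illustrative, and the inputs are the rounded sample proportions quoted in the report (0.05014 and 0.05007), not values recomputed from the raw counts.

```python
import math

def one_proportion_z(p_hat, p0, n):
    """Return (z, upper-tail p-value) for a one-sample proportion z-test."""
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Upper-tail probability P(Z > z) of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Rounded sample proportions and sample sizes as quoted in the report.
z_2008, p_2008 = one_proportion_z(0.05014, 0.05, 143_338_000)
z_2009, p_2009 = one_proportion_z(0.05007, 0.05, 137_792_000)

print(round(z_2008, 2), round(z_2009, 2))  # 7.69 3.77
print(p_2008 < 0.05, p_2009 < 0.05)        # True True
```

The p-values are returned as plain probabilities, so they are 100 times smaller than the same figures expressed in percent.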

Reference:

Public Transportation Usage Among U.S. Workers: 2008 and 2009, Table 2: Public Transportation Usage for the 50 Largest Metropolitan Statistical Areas: 2008 and 2009.

The Employment Situation: December 2008, Bureau of Labor Statistics, United States Department of Labor, Table A: Major indicators of labor market activity, seasonally adjusted.

The Employment Situation: December 2009, Bureau of Labor Statistics, United States Department of Labor, Table A: Major indicators of labor market activity, seasonally adjusted.



Question 2:

Discuss among your group, select one company, state one dependent variable, and

more than two independent variables.

a. Collect data. Testing the independence among independent variables.

b. Establish regression relationship, write down the regression equation.

c. Use the regression equation to estimate new value of dependent variable.

Solution:

a/ Collect data. Testing the independence among independent variables.

Kinh Do Food Joint Stock Company is a Saigon business operating in the field of food production and processing. How can it meet all of its customers' needs? To address this problem, a survey of customer satisfaction was conducted for the following reasons:

trends: About price, quality, forms,

Through this survey, the company can devise new strategies that invest in its strengths and overcome its shortcomings, in order to attract more customers and earn their respect for the company.

Data:

Y: Dependent variable, Level of satisfaction
X1: Independent variable, Price
X2: Independent variable, Quality
X3: Independent variable, Evaluation compared to other milk brands
X4: Independent variable, How often consumers use
X5: Independent variable, Repeated use

Level of satisfaction

Y X1 X2 X3 X4 X5

3 3 3 3 5 4

4 3 4 5 5 5

1 1 2 3 3 2

3 3 3 5 5 4



3 3 3 3 2 3

3 3 2 1 5 1

3 3 3 4 2 4

3 3 3 4 3 4

1 1 1 5 5 1

3 3 4 4 2 4

3 3 3 4 3 4

3 3 3 5 5 4

4 3 4 5 2 5

3 3 3 3 5 4

5 3 5 5 3 5

4 3 4 4 2 4

4 3 5 4 5 5

3 3 4 2 5 4

4 3 4 5 2 5

3 4 3 5 4 2

3 3 3 4 5 4

4 3 5 4 4 5

3 3 3 3 2 4

3 3 3 3 3 3

3 3 3 4 2 4

5 4 4 4 3 4

3 3 3 4 2 3

2 2 3 3 4 3

3 1 3 2 3 3

3 3 4 3 4 4

3 2 4 4 2 4

4 3 4 4 3 5

3 4 5 2 5 5

4 3 3 4 2 4

3 4 3 1 2 3

3 3 3 3 5 2

3 3 3 1 1 3

1 2 1 3 3 2

4 3 4 5 2 5

4 4 4 4 3 5

4 3 4 5 3 4

4 4 5 5 3 5

2 3 2 4 4 3

4 3 3 4 2 4

3 3 4 4 3 4



3 3 4 5 2 4

4 3 4 5 2 5

2 3 3 5 5 4

3 3 5 3 2 4

4 4 3 4 5 5

2 3 4 4 3 3

3 3 3 4 5 5

4 3 4 2 5 4

3 2 3 4 2 4

4 1 5 5 2 3

1 3 4 3 4 3

3 3 4 4 5 5

4 3 3 4 2 3

3 2 4 4 2 3

3 3 3 4 5 5

Hypothesis testing:

H0: β1 = β2 = β3 = β4 = β5 = 0

H1: Not all the βi (i=1,2,3,4,5) are zero

ANOVA

             df    SS          MS          F           Significance F
Regression   5     23.01243    4.602486    11.65684    1.16E-07
Residual     54    21.3209     0.394832
Total        59    44.33333

According to the ANOVA table, the test statistic value FT = F-ratio = 11.65684 > F critical = 2.3538, and Significance F = 1.16E-07 is far below any conventional level of significance, so we can reject H0.

In conclusion, based on the ANOVA table for the regression model and the hypothesis test, we have enough evidence to conclude that there is a regression relationship between the dependent variable Y and the independent variables Xi (i=1,2,3,4,5).
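As a check on the ANOVA table, the mean squares and the F ratio follow directly from the sums of squares and degrees of freedom. The sketch below assumes nothing beyond the figures quoted above; the function name `anova_f` is illustrative, and the critical value is the one stated in the report rather than one computed here.

```python
def anova_f(ss_regression, df_regression, ss_residual, df_residual):
    """Return (MS regression, MS residual, F ratio) from an ANOVA table."""
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return ms_regression, ms_residual, ms_regression / ms_residual

# Figures from the ANOVA table of the five-variable regression.
ms_r, ms_e, f_ratio = anova_f(23.01243, 5, 21.3209, 54)

F_CRITICAL = 2.3538  # critical value as quoted in the report
reject_h0 = f_ratio > F_CRITICAL

print(round(ms_r, 6), round(ms_e, 6), round(f_ratio, 3), reject_h0)
```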

Coefficient table:

             Coefficients   Standard Error   t Stat      P-value
Intercept    0.503736       0.542849         0.927948    0.357564
x1           0.298951       0.137257         2.178037    0.03379
x2           0.293209       0.121558         2.412093    0.019289
x3           0.068627       0.08471          0.810139    0.421416
x4           -0.12698       0.065631         -1.93473    0.058269
x5           0.246856       0.119542         2.065025    0.043736

Regression equation: from the coefficient table, we can set up the regression equation as follows:

Y = 0.503736 + 0.298951 X1 + 0.293209 X2 + 0.068627 X3 - 0.12698 X4 + 0.246856 X5

To test whether the variables of the regression model are significant, we rely on the p-value. If the p-value of Xi > α (level of significance) = 0.05, the test statistic value falls into the non-rejection region, Xi is non-significant and we should remove Xi.

Based on the coefficient table:

P-value of X3 = 0.421416 > 0.05
P-value of X4 = 0.058269 > 0.05

Thus, X3 and X4 are non-significant and should be removed from the regression equation.

In addition:

P-value of X1 = 0.03379 < 0.05
P-value of X2 = 0.019289 < 0.05
P-value of X5 = 0.043736 < 0.05

Thus, X1, X2 and X5 are significant and are kept in the model.
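The screening rule can be written out as a few lines of code. This is only a sketch of the p-value comparison described above, with the coefficient table typed in by hand; the dictionary layout and variable names are illustrative. It also recomputes each t statistic as coefficient divided by standard error.

```python
ALPHA = 0.05  # level of significance used in the report

# (coefficient, standard error, p-value) copied from the coefficient table.
coefficient_table = {
    "X1": (0.298951, 0.137257, 0.03379),
    "X2": (0.293209, 0.121558, 0.019289),
    "X3": (0.068627, 0.08471, 0.421416),
    "X4": (-0.12698, 0.065631, 0.058269),
    "X5": (0.246856, 0.119542, 0.043736),
}

# t statistic of each variable: coefficient / standard error.
t_stats = {name: coef / se for name, (coef, se, _) in coefficient_table.items()}

# Keep a variable only when its p-value is below the significance level.
significant = [name for name, (_, _, p) in coefficient_table.items() if p < ALPHA]
removed = [name for name, (_, _, p) in coefficient_table.items() if p >= ALPHA]

print(significant)  # ['X1', 'X2', 'X5']
print(removed)      # ['X3', 'X4']
```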



3 3 3 4

3 3 3 3

3 3 2 1

3 3 3 4

3 3 3 4

1 1 1 1

3 3 4 4

3 3 3 4

3 3 3 4

4 3 4 5

3 3 3 4

5 3 5 5

4 3 4 4

4 3 5 5

3 3 4 4

4 3 4 5

3 4 3 2

3 3 3 4

4 3 5 5

3 3 3 4

3 3 3 3

3 3 3 4

5 4 4 4

3 3 3 3

2 2 3 3

3 1 3 3

3 3 4 4

3 2 4 4

4 3 4 5

3 4 5 5

4 3 3 4

3 4 3 3

3 3 3 2

3 3 3 3

1 2 1 2

4 3 4 5

4 4 4 5

4 3 4 4

4 4 5 5

2 3 2 3

4 3 3 4



3 3 4 4

3 3 4 4

4 3 4 5

2 3 3 4

3 3 5 4

4 4 3 5

2 3 4 3

3 3 3 5

4 3 4 4

3 2 3 4

4 1 5 3

1 3 4 3

3 3 4 5

4 3 3 3

3 2 4 3

3 3 3 5

Hypothesis testing:

H0: β1 = β2 = β3 = 0

H1: Not all the βi (i=1,2,3) are zero

ANOVA

             df    SS          MS          F           Significance F
Regression   3     21.18356    7.061186    17.08122    5.35E-08
Residual     56    23.14977    0.413389
Total        59    44.33333

According to the ANOVA table, the test statistic value FT = F-ratio = 17.08122 > F critical = 2.7395, and Significance F = 5.35E-08 is far below any conventional level of significance, so we can reject H0.

In conclusion, based on the ANOVA table for the regression model and the hypothesis test, we have enough evidence to conclude that there is a regression relationship between the dependent variable Y and the independent variables Xi (i=1,2,3).

Coefficient table: Multiple Regression

             Coefficients   Standard Error   t Stat      P-value
Intercept    0.295018       0.434343         0.679228    0.499791
X1           0.243586       0.136693         1.781989    0.080173
X2           0.335106       0.121941         2.748094    0.008051
X3           0.262938       0.113466         2.317325    0.024166

Regression equation: from the coefficient table, we can set up the regression equation as follows.

Y = 0.295018 + 0.243586X1 + 0.335106X2 + 0.262938X3
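Part c of the question asks for an estimate of a new value of the dependent variable. A minimal sketch of that step with the three-variable equation is given here. The function name `predict_satisfaction` and the example respondent (scores 3, 3 and 4) are illustrative choices. Judging by the reduced data table, X3 in this equation appears to be the original X5 (repeated use); that reading is an inference, not something the report states.

```python
def predict_satisfaction(x1, x2, x3):
    """Estimate the satisfaction level Y from the three-variable equation."""
    return 0.295018 + 0.243586 * x1 + 0.335106 * x2 + 0.262938 * x3

# Hypothetical respondent: X1 = 3, X2 = 3, X3 = 4.
y_hat = predict_satisfaction(3, 3, 4)
print(round(y_hat, 4))  # 3.0828
```

For this respondent the estimated satisfaction level is about 3.08 on the survey's 1 to 5 scale.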

To test whether the variables of the regression model are significant, we rely on the p-value. If the p-value of Xi > α (level of significance) = 0.05, then the test statistic value falls into the non-rejection region. So we cannot reject H0 at the 0.05 level of significance, Xi is non-significant and we should remove Xi.

Based on the coefficient table: P-value of X1 = 0.080173 > 0.05



Source of Variation   Sum of Squares (SS)   Df    Mean Squares (MS)   F ratio (FT)
Treatment (TR)        381126.6667           2     190563.3333         20.70840377
Error (E)             248460                27    9202.222222
Total (T)             629586.6667           29

Test statistic value:

F-ratio = 20.71

At α = 0.05, the critical value is:

F(2, 27; 0.05) = 3.35

Because F-ratio > F critical, we reject H0.

This means that, based on the ANOVA table and the hypothesis test, we have sufficient evidence to conclude that not all three prototypes have the same average range.

Question 4:

We surveyed students of National University HCMC and collected data about their income (money that students earn from part-time jobs or receive from their parents) and their cellphones (bought with that income) from three big companies. Suppose that a random sample of students is available for the various companies. We will test the independence between these two factors, using a level of significance of 5%. The results are below:

                      Companies
Students' income      Nokia    Sony    Samsung    Total
< 1 million           42       16      38         96
1-3 millions          57       37      55         149
> 3 millions          22       28      37         87
Total                 121      81      130        332

Solution:

H0: Student income and the number of users of the three companies are independent of each other.

H1: Student income and the number of users of the three companies are not independent.

Expected counts of data points in different cells:

                      Companies
Students' income      Nokia    Sony     Samsung    Total
< 1 million           34.99    23.42    37.59      96
1-3 millions          54.3     36.35    58.35      149
> 3 millions          31.71    21.23    34.06      87
Total                 121      81       130        332

The chi-square test statistic value for independence is:

$$\chi^2_t = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Degrees of freedom = (r-1)(c-1) = (3-1)(3-1) = 4.

Critical value: $\chi^2_c = \chi^2(4, 0.05) = 9.4877$.

At the 0.05 level of significance, we cannot reject H0 since $\chi^2_t < \chi^2_c$. We conclude that student income and the number of users of the three companies are independent of each other.
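The expected counts and the chi-square statistic can be recomputed from the observed table. The sketch below uses only the observed counts given in the question; the function name `chi_square_independence` is illustrative, and the critical value 9.4877 is taken from the text above rather than computed.

```python
def chi_square_independence(table):
    """Return (expected counts, chi-square statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count of a cell: (row total * column total) / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return expected, chi2, df

# Rows: income < 1 million, 1-3 millions, > 3 millions.
# Columns: Nokia, Sony, Samsung.
observed = [[42, 16, 38], [57, 37, 55], [22, 28, 37]]
expected, chi2, df = chi_square_independence(observed)

CHI2_CRITICAL = 9.4877  # chi-square(4, 0.05) as quoted in the report
print([[round(e, 2) for e in row] for row in expected])
print(round(chi2, 3), df, chi2 > CHI2_CRITICAL)
```

With these counts the statistic comes out just under the critical value, so the decision not to reject H0 is a narrow one.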



Question 5:

For a random sample of 200 U.S. motorists, the mileages driven last year are presented in the data below.

10221 718 8802 2102 4221 3257 2697 4760 5717 4193

2209 8521 6972 6873 6115 5998 2781 3833 6632 2829

2796 2031 7783 45 2692 7912 4447 3018 4895 3511

3571 2202 502 1748 5524 4185 8404 7077 5891 1378

3806 5559 9889 521 7284 7146 4482 7734 1286 2686

4110 1816 6972 3818 3510 4500 6229 167 5889 5349

4402 5973 5174 6198 3330 8836 7500 5466 5942 1654

4500 2079 5281 3668 5246 567 4527 5354 7474 551

4669 5572 402 6182 7250 2859 7124 7924 3625 5734

4720 5492 7941 5966 4801 7289 980 2963 6674 6741

4993 6026 6271 3514 5011 5245 5653 2910 5672 8103

5090 8050 6069 2960 4173 8943 6699 1514 2307 5497

5327 2293 9555 7712 5679 8840 3420 6197 2846 4943

5640 6825 6817 6744 702 6494 5954 3811 5794 2855

5801 3237 5816 4784 5014 7530 4308 3689 6981 1904

6208 6593 4104 5751 5244 1860 5224 655 5401 10304

6723 2683 7990 7645 2336 7869 6657 4223 5857 4336

6829 6274 10703 6669 3469 5682 5144 7044 4059 4673

7326 8198 5731 10962 5667 3615 6465 9577 4047 4694



9167 4435 1879 5912 7440 5259 4132 2617 3026 5967

a. Guess which theoretical distribution could be the best fit for the data.

b. Use the 0.01 level of significance in determining whether the given data follow the distribution in question a.

Solution:

a/ The chi-square test for goodness of fit can be used to test how well our data support an assumption about the distribution of a population or random variable of interest. Usually we know the mean and the standard deviation of the population or variable. But in some cases the values of μ and σ are not given, so we need to estimate them from the data. When this happens, we lose a degree of freedom for each parameter estimated from the data. The degrees of freedom of the chi-square statistic are df = k-2-1 = k-3 (instead of k-1 as before).

b/

Based on the answer to the question above, our guess is that this population follows the bell-shaped (normal) distribution. In question b, at the 0.01 level of significance, we determine whether the given data follow this distribution or not.

The null and alternative hypotheses:

H0: the population has a normal distribution

H1: the population is not normally distributed

The chi-square goodness-of-fit test may be applied to testing any hypothesis about the

distribution of a population or a random variable. The test may be applied in particular to

testing how well an assumption of a normal distribution is supported by a given data set.

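We have:

n = 200

We divide the range into 6 classes: k = 6

The sample mean is $\bar{x} = 5084.92$, and the class boundaries are:

2797.85, 4078.609, 5084.92, 6091.231, 7371.99

The expected count of each class is E = np, where p is the probability of the class under the hypothesized normal distribution.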
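The rest of the goodness-of-fit computation can be sketched as follows. The data and the class boundaries are those given above. The standard deviation is an assumption: it is inferred from the spacing of the boundaries, read as mean - s, mean - 0.44s, mean, mean + 0.44s and mean + s. The critical value 11.345 is the chi-square table value for df = 3 at the 0.01 level.

```python
import math
from bisect import bisect_right

mileages = [
    10221, 718, 8802, 2102, 4221, 3257, 2697, 4760, 5717, 4193,
    2209, 8521, 6972, 6873, 6115, 5998, 2781, 3833, 6632, 2829,
    2796, 2031, 7783, 45, 2692, 7912, 4447, 3018, 4895, 3511,
    3571, 2202, 502, 1748, 5524, 4185, 8404, 7077, 5891, 1378,
    3806, 5559, 9889, 521, 7284, 7146, 4482, 7734, 1286, 2686,
    4110, 1816, 6972, 3818, 3510, 4500, 6229, 167, 5889, 5349,
    4402, 5973, 5174, 6198, 3330, 8836, 7500, 5466, 5942, 1654,
    4500, 2079, 5281, 3668, 5246, 567, 4527, 5354, 7474, 551,
    4669, 5572, 402, 6182, 7250, 2859, 7124, 7924, 3625, 5734,
    4720, 5492, 7941, 5966, 4801, 7289, 980, 2963, 6674, 6741,
    4993, 6026, 6271, 3514, 5011, 5245, 5653, 2910, 5672, 8103,
    5090, 8050, 6069, 2960, 4173, 8943, 6699, 1514, 2307, 5497,
    5327, 2293, 9555, 7712, 5679, 8840, 3420, 6197, 2846, 4943,
    5640, 6825, 6817, 6744, 702, 6494, 5954, 3811, 5794, 2855,
    5801, 3237, 5816, 4784, 5014, 7530, 4308, 3689, 6981, 1904,
    6208, 6593, 4104, 5751, 5244, 1860, 5224, 655, 5401, 10304,
    6723, 2683, 7990, 7645, 2336, 7869, 6657, 4223, 5857, 4336,
    6829, 6274, 10703, 6669, 3469, 5682, 5144, 7044, 4059, 4673,
    7326, 8198, 5731, 10962, 5667, 3615, 6465, 9577, 4047, 4694,
    9167, 4435, 1879, 5912, 7440, 5259, 4132, 2617, 3026, 5967,
]

# Class boundaries as given in the report (k = 6 classes).
boundaries = [2797.85, 4078.609, 5084.92, 6091.231, 7371.99]
MEAN = 5084.92
STD = 7371.99 - 5084.92  # assumption: the outer boundary is mean + s

def normal_cdf(x, mean, std):
    """Cumulative distribution function of the normal distribution."""
    return 0.5 * (1 + math.erf((x - mean) / (std * math.sqrt(2))))

n = len(mileages)
sample_mean = sum(mileages) / n

# Expected counts E = n * p, with p the normal probability of each class.
cdf = [0.0] + [normal_cdf(b, MEAN, STD) for b in boundaries] + [1.0]
expected = [n * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]

# Observed counts: bisect_right gives the index of the class of each value.
observed = [0] * (len(boundaries) + 1)
for value in mileages:
    observed[bisect_right(boundaries, value)] += 1

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 3   # k - 3, since two parameters are estimated
CHI2_CRITICAL = 11.345   # chi-square table value for df = 3, alpha = 0.01

print(observed, [round(e, 2) for e in expected])
print(round(chi2, 3), "reject H0" if chi2 > CHI2_CRITICAL else "do not reject H0")
```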
