[Report] Final Project of Statistic

  • 8/13/2019 [Report] Final Project of Statistic

    INTERNATIONAL UNIVERSITY

    VNU HCMC

    REPORT

    FINAL PROJECT OF STATISTIC

    BUSINESS

    Lecturer: Nguyen Bac Huy

    Team Members:

    1. Lng Tho Nhi_BAFNIU11075
    2. Nguyn Tn Pht_BAFNIU11140
    3. T Phm Duy Tin_BABAIU11144
    4. Nguyn an Vy_BAFNIU11057
    5. Trng Th Ngc Tuyt_BAFNIU11080
    6. Hunh Ngc Tho Uyn_BAFNIU11127
    7. L Ngc Anh Phng_BABAIU11269

    CONTENT:

    1. Question 1
    2. Question 2
    3. Question 3
    4. Question 4
    5. Question 5

    Question 1:

    A US National Public Transportation survey taken a few years ago in the USA indicated that less than 5% of US citizens use public transportation. Collect secondary data from several websites of the U.S. Department of Transportation, the US Census Bureau, etc. to test this hypothesis from the survey. Write a short essay to explain the data.

    Solution:

    We consider employment data from the Bureau of Labor Statistics of the United States Department of Labor for 2008 and 2009, and use the number of working people as the sample size:

    In December 2008, the sample size was 143,338,000 (the number of working people).

    In December 2009, the sample size was 137,792,000 (the number of working people).

    The American Community Survey (ACS) report "Public Transportation Usage Among U.S. Workers: 2008 and 2009" estimates the number of workers who commuted by public transportation in the 50 largest metro areas:

    In 2008, 7,186,530 people were found to use public transportation to get to work.

    In 2009, 6,992,424 people were found to use public transportation to get to work.

    Let p denote the proportion of people who use public transportation. The null and alternative hypotheses are:

    H0: p < 5%

    H1: p ≥ 5%

    We test this hypothesis for two years: 2008 and 2009.

    Let us begin with the year 2008:

    The hypothesized value of the proportion: p0 = 0.05

    The sample size: n = 143,338,000

    The sample proportion: $\hat{p} = \frac{7{,}186{,}530}{143{,}338{,}000} \approx 0.05014$

    Because we have:

    np0 = 0.05 x 143,338,000 = 7,166,900 > 5

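    n(1-p0) = 0.95 x 143,338,000 = 136,171,100 > 5

    we use the z-test with the formula:

    $$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} = \frac{0.05014 - 0.05}{\sqrt{0.05 \times 0.95 / 143{,}338{,}000}} = 7.69$$

    The test statistic is z = 7.69, so the upper-tail p-value is about 7.365 x 10^-15 (that is, 7.365 x 10^-13 %). Because the p-value is below 5%, we reject H0. Therefore we do not accept that, in 2008, less than 5% of US citizens used public transportation.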

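    We continue with the data for 2009:

    The hypothesized value of the proportion: p0 = 0.05

    The sample size: n = 137,792,000

    The sample proportion: $\hat{p} = \frac{6{,}992{,}424}{137{,}792{,}000} \approx 0.05075$

    Because we have:

    np0 = 0.05 x 137,792,000 = 6,889,600 > 5

    n(1-p0) = 0.95 x 137,792,000 = 130,902,400 > 5

    we use the z-test with the formula:

    $$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} = \frac{0.05075 - 0.05}{\sqrt{0.05 \times 0.95 / 137{,}792{,}000}} \approx 40.2$$

    (the value 40.2 uses the unrounded sample proportion). With z of about 40.2, the upper-tail p-value is effectively zero. Because the p-value is below 5%, we reject H0. Therefore we do not accept that, in 2009, less than 5% of US citizens used public transportation.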
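The two tests above can be re-run from the quoted counts. The following is a minimal sketch, assuming an upper-tail one-sample z-test for a proportion; the function name `one_sample_proportion_z` and the `counts` dictionary are illustrative and not part of the report. Because the code keeps the unrounded sample proportions, its z values can differ somewhat from figures worked by hand with rounded proportions.

```python
import math


def one_sample_proportion_z(successes, n, p0):
    """Return (p_hat, z, upper-tail p-value) for a one-sample proportion z-test."""
    p_hat = successes / n
    std_error = math.sqrt(p0 * (1 - p0) / n)      # standard error under the hypothesized p0
    z = (p_hat - p0) / std_error
    p_value = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z > z) for a standard normal Z
    return p_hat, z, p_value


# Counts quoted in the solution: (public-transport commuters, working people)
counts = {
    2008: (7_186_530, 143_338_000),
    2009: (6_992_424, 137_792_000),
}

results = {year: one_sample_proportion_z(x, n, 0.05) for year, (x, n) in counts.items()}

for year, (p_hat, z, p_value) in results.items():
    print(f"{year}: p_hat = {p_hat:.5f}, z = {z:.2f}, p-value = {p_value:.3e}")
```

With samples this large the standard error is tiny, so even a sample proportion only slightly above 0.05 produces a very large z.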

    Reference:

    Public Transportation Usage Among U.S. Workers: 2008 and 2009, Table 2: Public Transportation Usage for the 50 Largest Metropolitan Statistical Areas: 2008 and 2009 (continued).

    The Employment Situation: December 2008, Bureau of Labor Statistics, United States Department of Labor, Table A: Major indicators of labor market activity, seasonally adjusted.

    The Employment Situation: December 2009, Bureau of Labor Statistics, United States Department of Labor, Table A: Major indicators of labor market activity, seasonally adjusted.

    Question 2:

    Discuss among your group, select one company, state one dependent variable, and

    more than two independent variables.

    a. Collect data. Testing the independence among independent variables.

    b. Establish regression relationship, write down the regression equation.

    c. Use the regression equation to estimate new value of dependent variable.

    Solution:

    a/ Collect data. Testing the independence among independent variables.

    Kinh Do Food Joint Stock Company is one of the Saigon business units operating in the field of food production and processing. How can it meet all of its customers' needs? To answer this question, a survey of customers' satisfaction was conducted for the following reasons:

    trends: price, quality, forms,

    Through this survey, the company can devise new strategies that invest in its strengths and overcome its shortcomings, in order to attract more customers and earn their respect.

    Data:
    Y: Dependent variable: Level of satisfaction
    X1: Independent variable: Price
    X2: Independent variable: Quality
    X3: Independent variable: Evaluation compared to other milk brands
    X4: Independent variable: How often consumers use
    X5: Independent variable: Repeated use

    Level of satisfaction

    Y X1 X2 X3 X4 X5

    3 3 3 3 5 4

    4 3 4 5 5 5

    1 1 2 3 3 2

    3 3 3 5 5 4

    3 3 3 3 2 3

    3 3 2 1 5 1

    3 3 3 4 2 4

    3 3 3 4 3 4

    1 1 1 5 5 1

    3 3 4 4 2 4

    3 3 3 4 3 4

    3 3 3 5 5 4

    4 3 4 5 2 5

    3 3 3 3 5 4

    5 3 5 5 3 5

    4 3 4 4 2 4

    4 3 5 4 5 5

    3 3 4 2 5 4

    4 3 4 5 2 5

    3 4 3 5 4 2

    3 3 3 4 5 4

    4 3 5 4 4 5

    3 3 3 3 2 4

    3 3 3 3 3 3

    3 3 3 4 2 4

    5 4 4 4 3 4

    3 3 3 4 2 3

    2 2 3 3 4 3

    3 1 3 2 3 3

    3 3 4 3 4 4

    3 2 4 4 2 4

    4 3 4 4 3 5

    3 4 5 2 5 5

    4 3 3 4 2 4

    3 4 3 1 2 3

    3 3 3 3 5 2

    3 3 3 1 1 3

    1 2 1 3 3 2

    4 3 4 5 2 5

    4 4 4 4 3 5

    4 3 4 5 3 4

    4 4 5 5 3 5

    2 3 2 4 4 3

    4 3 3 4 2 4

    3 3 4 4 3 4

    3 3 4 5 2 4

    4 3 4 5 2 5

    2 3 3 5 5 4

    3 3 5 3 2 4

    4 4 3 4 5 5

    2 3 4 4 3 3

    3 3 3 4 5 5

    4 3 4 2 5 4

    3 2 3 4 2 4

    4 1 5 5 2 3

    1 3 4 3 4 3

    3 3 4 4 5 5

    4 3 3 4 2 3

    3 2 4 4 2 3

    3 3 3 4 5 5

    Hypothesis testing:

    H0: β1 = β2 = β3 = β4 = β5 = 0

    H1: Not all the βi (i = 1, 2, 3, 4, 5) are zero

    ANOVA

                  df    SS          MS          F           Significance F
    Regression    5     23.01243    4.602486    11.65684    1.16E-07
    Residual      54    21.3209     0.394832
    Total         59    44.33333

    According to the ANOVA table, at the 0.05 level of significance the test statistic value FT = F-ratio = 11.65684 > F critical = 2.3538, so we can reject H0. The Significance F of 1.16E-07 shows that H0 would also be rejected at any commonly used level of significance.

    In conclusion, based on the ANOVA table for the regression model and the hypothesis test, we have enough evidence to conclude that there is a regression relationship between the dependent variable Y and the independent variables Xi (i = 1, 2, 3, 4, 5).

    Coefficient table:

                 Coefficients    Standard Error    t Stat      P-value
    Intercept    0.503736        0.542849          0.927948    0.357564
    x1           0.298951        0.137257          2.178037    0.03379
    x2           0.293209        0.121558          2.412093    0.019289
    x3           0.068627        0.08471           0.810139    0.421416
    x4           -0.12698        0.065631          -1.93473    0.058269
    x5           0.246856        0.119542          2.065025    0.043736

    Regression equation: from the coefficient table, we can set up the regression equation as follows:

    Y = 0.503736 + 0.298951 X1 + 0.293209 X2 + 0.068627 X3 - 0.12698 X4 + 0.246856 X5
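The fit can be re-run from the 60 survey rows listed earlier. The following is a minimal sketch, assuming ordinary least squares solved through the normal equations with plain Gaussian elimination. The compact `TOKENS` encoding and the helper `solve` are illustrative choices, not the authors' tool. If the transcribed rows match the data the authors used, the printed values should come out close to the ANOVA and coefficient tables.

```python
# Survey responses transcribed from the data table: one 6-digit token per
# respondent, digits in the order Y, X1, X2, X3, X4, X5 (every answer is 1-5).
TOKENS = """
333354 434555 112332 333554 333323 332151 333424 333434 111551 334424
333434 333554 434525 333354 535535 434424 435455 334254 434525 343542
333454 435445 333324 333333 333424 544434 333423 223343 313233 334344
324424 434435 345255 433424 343123 333352 333113 121332 434525 444435
434534 445535 232443 433424 334434 334524 434525 233554 335324 443455
234433 333455 434254 323424 415523 134343 334455 433423 324423 333455
""".split()

rows = [[int(digit) for digit in token] for token in TOKENS]
y = [r[0] for r in rows]
X = [[1.0] + [float(v) for v in r[1:]] for r in rows]   # leading 1.0 = intercept column


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]     # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                         # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


k = len(X[0])                                            # 6 = intercept + 5 regressors
xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
beta = solve(xtx, xty)                                   # least-squares coefficients

y_bar = sum(y) / len(y)
fitted = [sum(b * v for b, v in zip(beta, r)) for r in X]
sst = sum((yi - y_bar) ** 2 for yi in y)                 # total sum of squares
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual sum of squares
ssr = sst - sse                                          # regression sum of squares
df_reg, df_res = k - 1, len(y) - k
f_ratio = (ssr / df_reg) / (sse / df_res)

print("coefficients:", [round(b, 6) for b in beta])
print(f"SSR = {ssr:.5f}, SSE = {sse:.5f}, SST = {sst:.5f}, F = {f_ratio:.5f}")
```

Dropping columns from `X` before forming `xtx` and `xty` would give the corresponding reduced model.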

    To test whether the variables of the regression model are significant, we use the p-value. If the p-value of Xi > α (level of significance) = 0.05, the test statistic value falls into the non-rejection region, Xi is non-significant, and we should remove Xi.

    Based on the coefficient table:

    P-value of X3 = 0.421416 > 0.05

    P-value of X4 = 0.058269 > 0.05

    Thus, X3 and X4 are non-significant and should be removed from the regression equation.

    In addition:

    P-value of X1 = 0.03379 < 0.05

    P-value of X2 = 0.019289 < 0.05

    P-value of X5 = 0.043736 < 0.05

    so X1, X2 and X5 are kept. The data for the reduced model are listed below; in the reduced model the original X5 is relabelled X3.

    Y X1 X2 X5

    3 3 3 4

    4 3 4 5

    1 1 2 2

    3 3 3 4

    3 3 3 3

    3 3 2 1

    3 3 3 4

    3 3 3 4

    1 1 1 1

    3 3 4 4

    3 3 3 4

    3 3 3 4

    4 3 4 5

    3 3 3 4

    5 3 5 5

    4 3 4 4

    4 3 5 5

    3 3 4 4

    4 3 4 5

    3 4 3 2

    3 3 3 4

    4 3 5 5

    3 3 3 4

    3 3 3 3

    3 3 3 4

    5 4 4 4

    3 3 3 3

    2 2 3 3

    3 1 3 3

    3 3 4 4

    3 2 4 4

    4 3 4 5

    3 4 5 5

    4 3 3 4

    3 4 3 3

    3 3 3 2

    3 3 3 3

    1 2 1 2

    4 3 4 5

    4 4 4 5

    4 3 4 4

    4 4 5 5

    2 3 2 3

    4 3 3 4

    3 3 4 4

    3 3 4 4

    4 3 4 5

    2 3 3 4

    3 3 5 4

    4 4 3 5

    2 3 4 3

    3 3 3 5

    4 3 4 4

    3 2 3 4

    4 1 5 3

    1 3 4 3

    3 3 4 5

    4 3 3 3

    3 2 4 3

    3 3 3 5

    Hypothesis testing:

    H0: β1 = β2 = β3 = 0

    H1: Not all the βi (i = 1, 2, 3) are zero

    ANOVA

                  df    SS          MS          F           Significance F
    Regression    3     21.18356    7.061186    17.08122    5.35E-08
    Residual      56    23.14977    0.413389
    Total         59    44.33333

    According to the ANOVA table, at the 0.05 level of significance the test statistic value FT = F-ratio = 17.08122 > F critical = 2.7395, so we can reject H0. The Significance F of 5.35E-08 shows that H0 would also be rejected at any commonly used level of significance.

    In conclusion, based on the ANOVA table for the regression model and the hypothesis test, we have enough evidence to conclude that there is a regression relationship between the dependent variable Y and the independent variables Xi (i = 1, 2, 3).

    Coefficient table: Multiple Regression

                 Coefficients    Standard Error    t Stat      P-value
    Intercept    0.295018        0.434343          0.679228    0.499791
    X1           0.243586        0.136693          1.781989    0.080173
    X2           0.335106        0.121941          2.748094    0.008051
    X3           0.262938        0.113466          2.317325    0.024166

    Regression equation: from the coefficient table, we can set up the regression equation as follows.

    Y = 0.295018 + 0.243586 X1 + 0.335106 X2 + 0.262938 X3
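Part c of the question asks for an estimate of a new value of the dependent variable. The following is a minimal sketch that evaluates the equation above. The function name `predict_satisfaction` and the example respondent (scores 3, 4 and 4) are hypothetical and do not come from the survey.

```python
def predict_satisfaction(x1, x2, x3):
    """Evaluate the fitted three-variable regression equation quoted in the report."""
    return 0.295018 + 0.243586 * x1 + 0.335106 * x2 + 0.262938 * x3


# Hypothetical respondent: X1 = 3, X2 = 4, X3 = 4 (each on the survey's 1-5 scale)
estimate = predict_satisfaction(3, 4, 4)
print(f"estimated level of satisfaction = {estimate:.3f}")
```

Inputs should stay within the 1 to 5 range of the survey scale, since the equation was fitted only on answers in that range.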

    To test whether the variables of the regression model are significant, we use the p-value. If the p-value of Xi > α (level of significance) = 0.05, then the test statistic value falls into the non-rejection region. So we cannot reject H0 at the 0.05 level of significance, Xi is non-significant, and we should remove Xi.

    Based on the coefficient table:

    P-value of X1 = 0.080173 > 0.05

    Source of Variation    Sum of Squares (SS)    Df    Mean Squares (MS)    F ratio (FT)
    Treatment (TR)         381126.6667            2     190563.3333          20.70840377
    Error (E)              248460                 27    9202.222222
    Total (T)              629586.6667            29

    Test statistic value:

    F-ratio = 20.70

    At α = 0.05, the critical value is:

    F(2, 27; 0.05) = 3.35

    Because F-ratio > F critical, we reject H0.

    This means that, based on the ANOVA table and the hypothesis test, we have sufficient evidence to conclude that not all three prototypes have the same average range.
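The F-ratio can be rebuilt from the sums of squares and degrees of freedom in the ANOVA table. The following is a minimal sketch; the variable names are illustrative, and the critical value 3.35 is simply the table value quoted in the report rather than something the code computes.

```python
# Sums of squares and degrees of freedom taken from the ANOVA table
ss_treatment, df_treatment = 381126.6667, 2
ss_error, df_error = 248460.0, 27

ms_treatment = ss_treatment / df_treatment   # mean square for treatments
ms_error = ss_error / df_error               # mean square for error
f_ratio = ms_treatment / ms_error

F_CRITICAL = 3.35                            # F(2, 27; 0.05) as quoted in the report
reject_h0 = f_ratio > F_CRITICAL

print(f"MSTR = {ms_treatment:.4f}, MSE = {ms_error:.4f}, F = {f_ratio:.4f}")
print("reject H0" if reject_h0 else "cannot reject H0")
```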

    Question 4:

    We surveyed students of National University HCMC and collected data on their income (money earned from part-time jobs or received from their parents) and on the brand of the cellphone they bought with that income, for three big companies. Suppose that the students form a random sample. We will test the independence of these two factors, using a level of significance of 5%. The results are below:

                                 Companies
                                 Nokia    Sony    Samsung    Total
    Students    < 1 million      42       16      38         96
                1-3 millions     57       37      55         149
                > 3 millions     22       28      37         87
    Total                        121      81      130        332

    Solution:

    H0: The income level of the students and the number of users of the three companies are independent of each other.

    H1: The income level of the students and the number of users of the three companies are not independent.

    Expected counts of data points in different cells:

                                          Companies
                                          Nokia    Sony     Samsung    Total
    Students of       < 1 million         34.99    23.42    37.59      96
    each income       1-3 millions        54.3     36.35    58.35      149
                      > 3 millions        31.71    21.23    34.06      87
    Total                                 121      81       130        332

    The chi-square test statistic value for independence is:

    $$\chi^2_t = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

    Degrees of freedom = (r - 1)(c - 1) = (3 - 1)(3 - 1) = 4.

    Critical value: $\chi^2_c = \chi^2(4, 0.05) = 9.4877$.

    At the 0.05 level of significance, we cannot reject H0 since $\chi^2_t < \chi^2_c$. The data are therefore consistent with the income level of the students and the number of users of the three companies being independent of each other.
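The expected counts and the test statistic can be recomputed from the observed table. The following is a minimal sketch of the chi-square test of independence; the variable names are illustrative, and the critical value 9.4877 is the table value quoted above, not computed here.

```python
# Observed counts: rows = income group, columns = Nokia, Sony, Samsung
observed = [
    [42, 16, 38],   # income < 1 million
    [57, 37, 55],   # income 1-3 millions
    [22, 28, 37],   # income > 3 millions
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: (row total x column total) / grand total
expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

CRITICAL = 9.4877                      # chi-square(4; 0.05) as quoted in the report
reject_h0 = chi_square > CRITICAL

print(f"chi-square = {chi_square:.4f}, df = {dof}")
print("reject H0" if reject_h0 else "cannot reject H0")
```

For these counts the statistic falls only slightly below the critical value, so the conclusion is sensitive to small changes in the data or in the chosen level of significance.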

    Question 5:

    For a random sample of 200 U.S. motorists, the mileages driven last year are in data

    presented below.

    10221 718 8802 2102 4221 3257 2697 4760 5717 4193

    2209 8521 6972 6873 6115 5998 2781 3833 6632 2829

    2796 2031 7783 45 2692 7912 4447 3018 4895 3511

    3571 2202 502 1748 5524 4185 8404 7077 5891 1378

    3806 5559 9889 521 7284 7146 4482 7734 1286 2686

    4110 1816 6972 3818 3510 4500 6229 167 5889 5349

    4402 5973 5174 6198 3330 8836 7500 5466 5942 1654

    4500 2079 5281 3668 5246 567 4527 5354 7474 551

    4669 5572 402 6182 7250 2859 7124 7924 3625 5734

    4720 5492 7941 5966 4801 7289 980 2963 6674 6741

    4993 6026 6271 3514 5011 5245 5653 2910 5672 8103

    5090 8050 6069 2960 4173 8943 6699 1514 2307 5497

    5327 2293 9555 7712 5679 8840 3420 6197 2846 4943

    5640 6825 6817 6744 702 6494 5954 3811 5794 2855

    5801 3237 5816 4784 5014 7530 4308 3689 6981 1904

    6208 6593 4104 5751 5244 1860 5224 655 5401 10304

    6723 2683 7990 7645 2336 7869 6657 4223 5857 4336

    6829 6274 10703 6669 3469 5682 5144 7044 4059 4673

    7326 8198 5731 10962 5667 3615 6465 9577 4047 4694

    9167 4435 1879 5912 7440 5259 4132 2617 3026 5967

    a. Guess which theoretical distribution could be the best fit for the data.

    b. Use the 0.01 level of significance to determine whether the given data follow the distribution from question a.

    Solution:

    a/ The chi-square test for goodness of fit can be used to test how well our data support an assumption about the distribution of a population or random variable of interest. Sometimes the mean μ and the standard deviation σ of the population or variable are known. In other cases the values of μ and σ are not given, so we need to estimate them from the data. When this happens, we lose one degree of freedom for each parameter estimated from the data. The degrees of freedom of the chi-square statistic are then df = k - 2 - 1 = k - 3 (instead of k - 1 as before).

    b/

    Based on the answer to the question above, our guess is that this population follows a bell-shaped (normal) distribution. In question b, at the 0.01 level of significance, we determine whether the given data follow this distribution or not.

    The null and alternative hypotheses are:

    H0: the population has a normal distribution

    H1: the population is not normally distributed

    The chi-square goodness-of-fit test may be applied to test any hypothesis about the distribution of a population or a random variable. In particular, it may be applied to test how well an assumption of a normal distribution is supported by a given data set.

    We have:

    n = 200

    We divide the range into 6 classes: k = 6

    The class boundaries are:

    $\bar{x} - s = 2797.85$, $\bar{x} - 0.44s = 4078.609$, $\bar{x} = 5084.92$, $\bar{x} + 0.44s = 6091.231$, $\bar{x} + s = 7371.99$

    The expected count in each class is E = np
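The rest of the procedure can be sketched in code. The following is a minimal sketch, assuming the five boundaries listed in the solution sit at the sample mean, at the mean plus or minus 0.44 standard deviations, and at the mean plus or minus 1 standard deviation, and assuming the sample standard deviation is the one used. The critical value 11.345 is the standard table value for chi-square with 3 degrees of freedom at the 0.01 level and is not taken from the report.

```python
from statistics import NormalDist, mean, stdev

# The 200 mileages from the question, row by row
MILEAGES = [
    10221, 718, 8802, 2102, 4221, 3257, 2697, 4760, 5717, 4193,
    2209, 8521, 6972, 6873, 6115, 5998, 2781, 3833, 6632, 2829,
    2796, 2031, 7783, 45, 2692, 7912, 4447, 3018, 4895, 3511,
    3571, 2202, 502, 1748, 5524, 4185, 8404, 7077, 5891, 1378,
    3806, 5559, 9889, 521, 7284, 7146, 4482, 7734, 1286, 2686,
    4110, 1816, 6972, 3818, 3510, 4500, 6229, 167, 5889, 5349,
    4402, 5973, 5174, 6198, 3330, 8836, 7500, 5466, 5942, 1654,
    4500, 2079, 5281, 3668, 5246, 567, 4527, 5354, 7474, 551,
    4669, 5572, 402, 6182, 7250, 2859, 7124, 7924, 3625, 5734,
    4720, 5492, 7941, 5966, 4801, 7289, 980, 2963, 6674, 6741,
    4993, 6026, 6271, 3514, 5011, 5245, 5653, 2910, 5672, 8103,
    5090, 8050, 6069, 2960, 4173, 8943, 6699, 1514, 2307, 5497,
    5327, 2293, 9555, 7712, 5679, 8840, 3420, 6197, 2846, 4943,
    5640, 6825, 6817, 6744, 702, 6494, 5954, 3811, 5794, 2855,
    5801, 3237, 5816, 4784, 5014, 7530, 4308, 3689, 6981, 1904,
    6208, 6593, 4104, 5751, 5244, 1860, 5224, 655, 5401, 10304,
    6723, 2683, 7990, 7645, 2336, 7869, 6657, 4223, 5857, 4336,
    6829, 6274, 10703, 6669, 3469, 5682, 5144, 7044, 4059, 4673,
    7326, 8198, 5731, 10962, 5667, 3615, 6465, 9577, 4047, 4694,
    9167, 4435, 1879, 5912, 7440, 5259, 4132, 2617, 3026, 5967,
]

n = len(MILEAGES)
x_bar = mean(MILEAGES)
s = stdev(MILEAGES)                     # sample standard deviation (an assumption)

# Five boundaries in standard-deviation units give k = 6 classes
z_cuts = [-1.0, -0.44, 0.0, 0.44, 1.0]
cuts = [x_bar + z * s for z in z_cuts]

std_normal = NormalDist()
cdf = [0.0] + [std_normal.cdf(z) for z in z_cuts] + [1.0]
probs = [hi - lo for lo, hi in zip(cdf, cdf[1:])]   # normal probability of each class
expected = [n * p for p in probs]                   # E = n p for each class

observed = [0] * len(probs)
for x in MILEAGES:
    observed[sum(x > c for c in cuts)] += 1         # class index = boundaries below x

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
dof = len(probs) - 3                    # k - 1, minus 2 estimated parameters

CRITICAL = 11.345                       # chi-square(3; 0.01), standard table value
print(f"mean = {x_bar:.3f}, s = {s:.2f}")
print("observed:", observed)
print("expected:", [round(e, 2) for e in expected])
print(f"chi-square = {chi_square:.3f}, df = {dof}")
print("reject H0" if chi_square > CRITICAL else "cannot reject H0")
```

Using boundaries at fixed z values keeps every expected count well above 5, which the chi-square approximation requires.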