Basic stats information


  • 7/30/2019 Basic stats information


    Main topic Topic What it says

    Histogram

    Summarizes data and presents it in graphical form. Shows how many data points fall into each range of values (the frequency).

    Outliers: understand what an outlier implies, then decide whether to leave it in or remove it from the dataset.

    Mean

    Shows the average value of the data set. It weighs the value of every data point, so outliers can easily distort the mean and pull it towards them.

    When plotting a histogram, if the mean sits towards the right of the graph, away from the bulk of the data, outliers are pulling the mean in that direction.

    Median

    Middle value of the data set when arranged in ascending order.

    The answer to the problem of data that is skewed by outliers.

    Mode

    Most frequently occurring value in the data set. Use this when the numerical value itself is not important.

    When data has two modes, it is called a bimodal distribution.
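As a rough illustration of the three measures above, the following sketch uses Python's `statistics` module on a small made-up data set (the numbers are hypothetical and not taken from these notes). It shows one large outlier pulling the mean upward while the median and mode stay put.

```python
from statistics import mean, median, multimode

# Hypothetical data: nine typical values plus one large outlier (60).
values = [2, 3, 3, 4, 4, 4, 5, 5, 6, 60]

print(mean(values))       # the outlier drags the average upward
print(median(values))     # the middle value is unaffected by the outlier
print(multimode(values))  # list of the most frequent value(s)

# Two equally frequent values: a bimodal data set.
bimodal = [1, 1, 1, 5, 5, 5, 9]
print(multimode(bimodal))
```

Here the mean is more than double the median, which is the pattern the notes describe for skewed data.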

    Standard deviation

    Measures the level of dispersion (variability) in the given set of data.

    Std dev tells us how far the data points typically lie from the mean. A higher std dev means the data is widely spread around the mean. A lower std dev means the data sits closer to the mean.

    Coefficient of variation: measured as std dev divided by mean.
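A minimal sketch of both quantities, assuming the sample standard deviation (the n - 1 version, which is what Excel's STDEV computes). The two data sets and the helper name `coefficient_of_variation` are made up for illustration.

```python
from statistics import mean, stdev

def coefficient_of_variation(data):
    """Sample standard deviation divided by the mean."""
    return stdev(data) / mean(data)

# Hypothetical data sets with the same mean (50) but different spread.
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

print(stdev(tight), stdev(wide))
print(coefficient_of_variation(tight), coefficient_of_variation(wide))
```

Because the coefficient of variation is a ratio, it can be compared across data sets measured in different units or on different scales.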

    Two variables

    Scatter plot: a visual summary of the relationship between two variables.

    When one variable is time, the relationship is called a time series.

    A false relationship could be purely due to coincidence. Look out for hidden variables.

    A scatter plot does not prove causality; it never proves that one variable causes the other.


    Factors that affect the interval width

    - The sample mean (SM) should be at the center of the range

    - A higher std dev means greater uncertainty about the population, so a wider range is needed to reach the same confidence

    - A small sample size demands a wider range to be confident that the population mean (PM) lies within the range around the SM

    - The more confident we want to be that the SM represents the PM, the wider the range must be

    Normal distribution

    - Shape of a bell curve with the mean at the center

    - The X axis is the variable we are studying and the Y axis is the likelihood of each value occurring

    What's special

    The mean and median are the same. The probability of a value less than the mean is 50% and of a value more than the mean is 50%.

    - The location and the width or narrowness of the curve depend on the mean and the std deviation

    Importance of std dev

    A large std dev makes the curve flat and wide; a small std dev makes the curve narrow and tall (with values closer to the mean).

    Rule of thumb

    About 68% of the time, a value lies within 1 std dev of the mean.

    About 95% of the time, a value lies within 2 std dev of the mean.
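The rule of thumb can be checked numerically. This sketch uses `statistics.NormalDist` from the Python standard library; the helper name `prob_within` is just an illustrative choice.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def prob_within(k):
    """Probability that a normal value lies within k std dev of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

The results are roughly 68%, 95% and 99.7% for 1, 2 and 3 standard deviations.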

    Z value

    Translates any value into the corresponding Z value by subtracting the mean and dividing by the std dev.

    Multiplying Z by the std dev, then adding the result to and subtracting it from the mean, gives the range that contains a value with the corresponding probability (68%, 95%, 99%).

    If we start from the far left of the curve, the area measures cumulative probability.

    These probabilities apply only to the normal distribution curve (not to every curve).
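The two Z-value operations described above can be sketched as follows. The function names and the example figures (mean 100, std dev 15) are hypothetical.

```python
def z_value(x, mean, std_dev):
    """Standardize x: subtract the mean, then divide by the std dev."""
    return (x - mean) / std_dev

def range_around_mean(z, mean, std_dev):
    """Range reached by going z std devs below and above the mean."""
    return mean - z * std_dev, mean + z * std_dev

# Hypothetical variable with mean 100 and std dev 15.
print(z_value(130, 100, 15))          # 130 is two std devs above the mean
print(range_around_mean(2, 100, 15))  # range covering about 95% of values
```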


    How to find cumulative probability

    First, standardize the value of the variable using Excel's STANDARDIZE function (this gives the value of Z). Second, use the NORMSDIST function to find the cumulative probability.

    The other option is to use NORMDIST with the cumulative argument set to TRUE. This returns the cumulative probability directly.
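For readers working outside Excel, here is a rough Python equivalent of both routes, using `statistics.NormalDist`. The mean, std dev and value are made-up numbers; the comments map each step to the Excel function named above.

```python
from statistics import NormalDist

# Hypothetical variable: mean 100, std dev 15, value of interest 115.
mean, std_dev, x = 100, 15, 115

# Route 1: standardize first (STANDARDIZE), then use the standard
# normal curve (NORMSDIST).
z = (x - mean) / std_dev
p_two_step = NormalDist().cdf(z)

# Route 2: ask the original curve directly (NORMDIST with TRUE).
p_direct = NormalDist(mu=mean, sigma=std_dev).cdf(x)

print(z, p_two_step, p_direct)
```

Both routes give the same cumulative probability, about 0.84 for a value one std dev above the mean.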

    How to find the Z value if you have the cumulative probability

    How to find the value of the variable if you have the cumulative probability, sample mean and std dev
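Both inverse lookups can be sketched with `NormalDist.inv_cdf`, which plays the role of Excel's NORMSINV (standard normal) and NORMINV (any mean and std dev). The cumulative probability, mean and std dev below are hypothetical.

```python
from statistics import NormalDist

cumulative_prob = 0.975  # hypothetical cumulative probability

# Z value for a given cumulative probability (like NORMSINV).
z = NormalDist().inv_cdf(cumulative_prob)

# Value of the variable for the same probability, given a mean and
# std dev (like NORMINV). Mean 100 and std dev 15 are made up.
x = NormalDist(mu=100, sigma=15).inv_cdf(cumulative_prob)

print(round(z, 2), round(x, 2))
```

The second result is simply the mean plus Z times the std dev, which is the Z-value relationship run in reverse.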

    Central Limit Theorem

    The sample mean is distributed approximately normally regardless of the distribution of the population.

    The larger the sample, the better the approximation to a normal distribution.

    Mean of the distribution of sample means = population mean.

    The properties of the normal distribution can be used to extract information from a sample.
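A small simulation can make the theorem concrete. This sketch draws repeated samples from a deliberately skewed, made-up population (exponential, mean about 10) and looks at the sample means. The population, sample sizes and repeat count are arbitrary choices for illustration.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical, strongly skewed population (not normal at all).
population = [random.expovariate(1 / 10) for _ in range(20_000)]

def sample_means(sample_size, repeats=2_000):
    """Means of many random samples of the given size."""
    return [mean(random.sample(population, sample_size))
            for _ in range(repeats)]

means_small = sample_means(5)
means_large = sample_means(50)

# The sample means center on the population mean ...
print(mean(population), mean(means_large))
# ... and spread less as the sample size grows.
print(stdev(means_small), stdev(means_large))
```

Plotting a histogram of `means_large` would show a roughly bell-shaped curve even though the population itself is skewed.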

    Confidence intervals: estimating the population mean

    It's important to emphasize: We are not saying that 95% of the time our sample mean is the population mean, but we are saying that 95% of the time a range that is two standard deviations wide centered around the sample mean contains the population mean.

    Increase confidence level

    Accept a wider range or increase the sample size.

    How wide the interval

    How do we know if an interval is too

    wide? Typically, if we would make a

    different decision for different values

    within an interval, that interval is too

    wide.

    Std dev of the sample mean

    How to find a confidence interval: this works only if the sample size is > 30. We need to know the level of confidence.


    Obtaining Z value

    Converting the desired confidence level into the corresponding cumulative probability on the standard normal curve is essential because Excel's NORMSINV function and the z-table work with cumulative probabilities.

    For smaller sample sizes (less than 30) we have to use the t value. Degrees of freedom = sample size - 1.

    Choosing sample size

    Based on an initial estimate, find the std dev, and decide the maximum deviation allowed. Apply the following formula to get the desired sample size.

    Summary of how to build the range that contains the population mean

    Working with proportions

    Often used to indicate the frequency of some phenomenon in the population.

    p bar is the proportion of "yes" responses out of the total sample.

    Sample size selection

    The selected sample size should satisfy the condition mentioned.


    Method Calculations/Formulas

    Use the Excel function (under the Analysis ToolPak)

    Greek letter mu represents the mean of a data set. Use the AVERAGE formula in Excel

    Use the MEDIAN formula in Excel

    Use the MODE formula in Excel

    Greek letter sigma. Use the Excel formula STDEV

    Can be used to compare different sets of data


    Use Excel's CORREL function to find the correlation
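As a sketch of what CORREL computes, the function below implements the usual Pearson correlation coefficient by hand. The function name `correl` is borrowed from Excel for readability; the data are the twelve Variable 1 / Variable 2 pairs listed in the data table later in these notes, which move in exactly opposite directions.

```python
from math import sqrt

def correl(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

variable_1 = [-1, -1, 1, 1, -1, 1, -1, -1, -1, 1, 1, 1]
variable_2 = [1, 1, -1, -1, 1, -1, 1, 1, 1, -1, -1, -1]

print(correl(variable_1, variable_2))
```

A result of -1 means a perfect negative linear relationship, which appears to be the -1.000000 figure shown alongside that table. A value near 0 would indicate little or no linear relationship.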

    - Select elements from the population at random

    - Analyze the sample

    - Draw inferences about the total population we are interested in

    Need to know x bar (sample mean), the std dev of the sample s, and the sample size n. Z represents the confidence level. The higher the value of Z, the higher the confidence level.


    STANDARDIZE, NORMSDIST

    NORMDIST

    NORMSINV 2.807033768

    NORMINV

    Std dev of the population divided by the square root of n


    To convert the desired confidence level, take 1 - desired confidence level and divide by 2. Then add the result to the desired confidence level.

    Input 1 - confidence level and the degrees of freedom. Use TINV.
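The conversion from confidence level to Z value can be sketched in Python as below. The function name `z_for_confidence` is illustrative, and `NormalDist.inv_cdf` stands in for NORMSINV. The t value used for small samples is not covered here, since the t distribution is not part of Python's standard library.

```python
from statistics import NormalDist

def z_for_confidence(confidence_level):
    """Z value whose central area under the standard normal curve
    equals the confidence level."""
    tail = (1 - confidence_level) / 2       # area in one tail
    cumulative = confidence_level + tail    # area to the left of Z
    return NormalDist().inv_cdf(cumulative)

print(round(z_for_confidence(0.95), 2))
print(round(z_for_confidence(0.99), 3))
```

For a 95% confidence level the cumulative probability is 0.975, which gives the familiar Z of about 1.96.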

    Solve the equation or use the Excel utility

    Use the Excel utility - confidence interval

    Use the Excel utility

    n x p bar >= 5, n x (1 - p bar) >= 5

    Number of rooms available divided by the upper limit of the confidence interval


    Confidence Interval Utility

    Type of Estimate:                        Mean      Mean      Proportions
    Sample Size:                             n >= 30   n < 30    n >= 30

    Input Area
      n                                      70        20        100
      x-bar                                  4.5       5
      p-bar                                                      0.1
      s                                      1.2       10
      confidence level                       0.95      0.95      0.95

    Output Area
      Center of Interval                     4.50      5.00      0.10
      z*s/sqrt(n) (t*s/sqrt(n) for n < 30)   0.28      4.68      0.06
      Lower end of int'l                     4.22      0.32      0.041
      Upper end of int'l                     4.78      9.68      0.159

    Other Calculations
      Interval width                         0.56      9.36      0.12
      1-confidence level                     0.05      0.05      0.05
      (1-confidence level)/2                 0.025               0.025
      z                                      1.96                1.96
      t                                                2.09
      (p)(1-p)                                                   0.09
      s = sqrt[(p)(1-p)]                                         0.30
      sqrt(n)                                8.37      4.47      10.00
      s/sqrt(n)                              0.14      2.24      0.03

    Check assumptions (proportions): np > 5 OK, n(1-p) > 5 OK
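The n >= 30 mean column and the proportion column of the utility can be reproduced with a short sketch. The function names are illustrative, and the inputs are the ones shown in the utility's input area. The n < 30 column needs a t value, which the Python standard library does not provide, so it is left out.

```python
from math import sqrt
from statistics import NormalDist

def z_for(confidence_level):
    """Z value for a two-sided confidence level."""
    return NormalDist().inv_cdf(confidence_level + (1 - confidence_level) / 2)

def mean_interval(n, x_bar, s, confidence_level):
    """Confidence interval for a mean (intended for n >= 30)."""
    margin = z_for(confidence_level) * s / sqrt(n)
    return x_bar - margin, x_bar + margin

def proportion_interval(n, p_bar, confidence_level):
    """Confidence interval for a proportion, with the n*p checks."""
    if n * p_bar < 5 or n * (1 - p_bar) < 5:
        raise ValueError("need n*p >= 5 and n*(1-p) >= 5")
    s = sqrt(p_bar * (1 - p_bar))
    margin = z_for(confidence_level) * s / sqrt(n)
    return p_bar - margin, p_bar + margin

low, high = mean_interval(n=70, x_bar=4.5, s=1.2, confidence_level=0.95)
print(round(low, 2), round(high, 2))

p_low, p_high = proportion_interval(n=100, p_bar=0.1, confidence_level=0.95)
print(round(p_low, 3), round(p_high, 3))
```

The printed bounds match the utility's lower and upper ends for those two columns.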


    Sample Size Utility

    Type of Estimate:                                  Mean      Proportion

    Input Area
      Sample Standard Deviation, s                     50
      Estimate of p                                              0.1
      Desired Accuracy: Half Width of Interval, d      5         0.02
      Confidence level                                 0.95      0.95

    Output Area
      Required Sample Size                             385       865

    Other Calculations
      1-confidence level                               0.05      0.05
      (1-confidence level)/2                           0.025     0.025
      z                                                1.96      1.96
      (p)(1-p)                                                   0.09
      s = sqrt[(p)(1-p)]                                         0.30
      z*s                                              98.00     0.59
      z*s/d                                            19.60     29.40
      Minimal n                                        384.15
      Minimal n to ensure np > 5                                 50.0
      Minimal n to ensure n(1-p) > 5                             5.6
      Minimal n to ensure z*s/sqrt(n) <= d                       864.3
      Minimal n to satisfy all constraints                       864.3

    Assumptions:

    Sample size will be above 30. If not, raise the sample size to 30 to make the assumptions valid.

    Proportion estimate is the maximum you expect p to be. If you don't have a good estimate of the proportion, use p = .5, which gives maximal standard deviation.
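The utility's figures are consistent with the minimal sample size being (z*s/d) squared, rounded up. The sketch below works on that assumption. The function names are illustrative, and the inputs are the ones shown in the utility's input area.

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_for(confidence_level):
    """Z value for a two-sided confidence level."""
    return NormalDist().inv_cdf(confidence_level + (1 - confidence_level) / 2)

def min_n_for_mean(s, d, confidence_level):
    """Smallest n (not yet rounded) with half width z*s/sqrt(n) <= d."""
    return (z_for(confidence_level) * s / d) ** 2

def min_n_for_proportion(p, d, confidence_level):
    """Smallest n (not yet rounded) meeting the half-width target
    and the n*p >= 5, n*(1-p) >= 5 conditions."""
    s = sqrt(p * (1 - p))
    n_half_width = (z_for(confidence_level) * s / d) ** 2
    return max(n_half_width, 5 / p, 5 / (1 - p))

n_mean = min_n_for_mean(s=50, d=5, confidence_level=0.95)
n_prop = min_n_for_proportion(p=0.1, d=0.02, confidence_level=0.95)

print(round(n_mean, 2), ceil(n_mean))
print(round(n_prop, 1), ceil(n_prop))
```

Rounding up gives required sample sizes of 385 and 865, matching the utility's output area.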


    Cereal, Protein (grams per serving), Carbohydrates (grams per serving)

    100% Bran 1 12

    All-Bran 1 12

    Almond Delight 1 12

    Apple Cinnamon Cheerios 1 13

    Apple Jacks 1 13

    Bran Chex 1 13

    Bran Flakes 1 13

    Cap'n'Crunch 1 13

    Cheerios 1 14

    Cinnamon Toast Crunch 1 14

    Cocoa Puffs 1 15

    Corn Chex 1 23

    Corn Flakes 2 9

    Corn Pops 2 10

    Count Chocula 2 11

    Cracklin' Oat Bran 2 11

    Cream of Wheat (Quick) 2 11

    Crispix 2 12

    Double Chex 2 14

    Froot Loops 2 15

    Frosted Flakes 2 15

    Frosted Mini-Wheats 2 15

    Fruit & Fibre Dates, Walnuts, and Oats 2 16

    Fruity Pebbles 2 18

    Golden Grahams 2 21

    Grape Nuts Flakes 2 21

    Grape-Nuts 2 21

    Great Grains Pecan 2 21

    Honey Nut Cheerios 2 21

    Honey-comb 2 22

    Kix 2 22

    Life 3 10

    Lucky Charms 3 11

    Maypo 3 11

    Muesli Raisins, Dates, & Almonds 3 12

    Muesli Raisins, Peaches, & Pecans 3 12

    Mueslix Crispy Blend 3 13

    Nut&Honey Crunch 3 13

    Nutri-grain Wheat 3 14

    Post Nat. Raisin Bran 3 14

    Product 19 3 15


    Puffed Rice 3 15

    Puffed Wheat 3 16

    Quaker Oat Squares 3 17

    Raisin Bran 3 17

    Raisin Nut Bran 3 17

    Raisin Squares 3 17

    Rice Chex 3 18

    Rice Krispies 3 20

    Shredded Wheat 3 21

    Smacks 4 5

    Special K 4 7

    Total Corn Flakes 4 12

    Total Raisin Bran 4 14

    Total Whole Grain 4 16

    Triples 4 16

    Trix 4 16

    Wheat Chex 6 16

    Wheaties 6 17

    mean 2.49 14.81

    median 2.00 14.00


    Variable 1 Variable 2 Age Salary ($ thousands)

    -1.0 1.0 53 145

    -1.0 1.0 43 621

    1.0 -1.0 33 262

    1.0 -1.0 45 208

    -1.0 1.0 46 362

    1.0 -1.0 55 424

    -1.0 1.0 41 339

    -1.0 1.0 55 736

    -1.0 1.0 36 291

    1.0 -1.0 45 58

    1.0 -1.0 55 498

    1.0 -1.0 50 643

    49 390

    -1.000000 47 332

    69 750

    51 368

    48 659

    62 234

    45 396

    37 300

    50 343

    50 536

    50 543

    58 217

    53 298

    57 1103

    53 406

    61 254

    47 862

    56 204

    44 206

    46 250

    58 21

    48 298

    38 350

    74 800

    60 726

    32 370

    51 536

    50 291

    40 808

    61 543

    63 149

    56 350


    45 242

    61 198

    70 213

    59 296

    57 317

    69 482

    44 155

    56 802

    50 200

    56 282

    43 573

    48 388

    52 250

    62 396

    48 572

    0.13