7/30/2019 Basic stats information
Main topic Topic What it says
Histogram
Summarizes data and presents it in graphical
form. Shows how many data points fall
within each range of values (the frequency).
Outliers: understand what they imply, then decide whether to leave them in or remove them from the data set.
Mean
Shows the average value of the data set.
Weighs the value of every data point, so
outliers can easily distort the mean and
pull it towards them.
When plotting a histogram, if the mean
sits to the right of the bulk of the data,
it means that outliers on the right are
pulling the mean towards them.
Median
Middle value of the data set when
arranged in ascending order.
A better measure of the center when the data is
skewed by outliers.
Mode
Most frequently occurring value in the data
set. Use this when the numeric value itself is not important.
When data has two modes, it is called a
bimodal distribution.
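
A minimal sketch of how an outlier moves the mean but barely moves the median, using Python's standard `statistics` module. The data values are made up for illustration and are not from these notes.

```python
import statistics

# Hypothetical data set; the last value (40) is an outlier.
data = [2, 3, 3, 4, 5, 5, 5, 6, 40]

mean_value = statistics.mean(data)      # pulled upward by the outlier
median_value = statistics.median(data)  # middle value of the sorted data
mode_value = statistics.mode(data)      # most frequently occurring value

# Dropping the outlier changes the mean a lot and the median only slightly.
trimmed = data[:-1]
mean_without_outlier = statistics.mean(trimmed)
median_without_outlier = statistics.median(trimmed)

print("with outlier:   ", mean_value, median_value, mode_value)
print("without outlier:", mean_without_outlier, median_without_outlier)
```

Comparing the two printed lines shows why the median is preferred when outliers skew the data.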
Standard deviation
Measures the level of
dispersion (variability) in a given set of data.
Std dev tells us how far the data typically
lie from the mean. Higher std dev
means the data is widely spread around the mean.
Lower std dev means the data is closer to the
mean.
Coefficient of variation: measured as std dev divided by mean.
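
A short sketch of standard deviation and coefficient of variation. It assumes the sample standard deviation (the same n - 1 convention as Excel's STDEV); the two data sets and the helper name `coefficient_of_variation` are made up for illustration.

```python
import statistics

# Two hypothetical data sets with the same mean (50) but different spread.
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


sd_tight = statistics.stdev(tight)  # small: values sit close to the mean
sd_wide = statistics.stdev(wide)    # large: values are widely spread

cv_tight = coefficient_of_variation(tight)
cv_wide = coefficient_of_variation(wide)

print(sd_tight, sd_wide)
print(cv_tight, cv_wide)
```

Because the coefficient of variation is unit-free, it can be compared across data sets measured on different scales.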
Two variables
Scatter plot: visual summary of the
relationship between two variables.
When one variable is time, the relationship
is called a time series.
An apparent relationship could be purely due to
coincidence. Look out for hidden
variables.
A scatter plot does not prove causality;
it never proves that one variable causes the other.
Factors that affect the width of the interval
- The sample mean (SM) should be at the center of the
range.
- Higher std dev means greater uncertainty
about the population, so a wider range is needed for the same
confidence.
- A small sample size demands a wider range
to be confident that the population mean (PM) is within
the range around the SM.
- The more confident we want to be that the SM
represents the PM, the wider the range must be.
Normal
Distribution
- Shape of a bell curve with the mean at the
center
- X axis is the variable we are studying
and Y axis is the likelihood of the different
values occurring
What's special
Mean and median are the same.
Probability of a value less than the mean is
50% and more than the mean is 50%.
- The location and the width of
the curve depend on the mean and the std deviation:
the mean sets the location, the std dev sets the width.
Importance of std dev
Large std dev makes the curve flat and wide; small
std dev makes the curve narrow and tall
(with values closer to the mean).
Rule of thumb
About 68% of values lie within 1 std dev of the mean.
About 95% of values lie within 2 std
dev of the mean.
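
The rule of thumb can be checked numerically. This is a small sketch using `statistics.NormalDist` from the Python standard library; the helper name `probability_within` is made up for illustration.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, std dev 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)


def probability_within(k):
    """Probability that a normal value lies within k std devs of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


within_1 = probability_within(1)  # roughly 0.68
within_2 = probability_within(2)  # roughly 0.95

print(round(within_1, 4), round(within_2, 4))
```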
Z value
Translates any value into the corresponding
Z value by subtracting the mean and
dividing by the std dev.
Z multiplied by the std dev, added to and subtracted
from the mean, gives the range that contains the
corresponding probability (Z = 1, 2, 3 gives about
68%, 95%, 99.7%).
If we start from the far left of the curve,
the area up to a value is its cumulative probability.
These probabilities apply only to the normal
distribution curve (not to all curves).
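
A minimal sketch of the two Z value operations described above: converting a value to Z, and turning a Z back into a range around the mean. The mean of 100 and std dev of 15 are hypothetical, as are the function names.

```python
def z_value(x, mean, std_dev):
    """Subtract the mean and divide by the std dev."""
    return (x - mean) / std_dev


def range_for_z(z, mean, std_dev):
    """Range from mean - z * std_dev to mean + z * std_dev."""
    return mean - z * std_dev, mean + z * std_dev


# Hypothetical variable with mean 100 and std dev 15.
z = z_value(130, mean=100, std_dev=15)
low, high = range_for_z(2, mean=100, std_dev=15)

print(z)          # how many std devs 130 lies above the mean
print(low, high)  # range covering about 95% of values
```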
How to find cumulative probability
First standardize the value of the variable
using the Excel STANDARDIZE function (this
gives the value of Z). Second, use the
NORMSDIST function to find the
cumulative probability.
The other option is to use NORMDIST with the cumulative argument set to TRUE. This returns the
cumulative probability directly.
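
Both routes give the same answer, which can be sketched in Python with `statistics.NormalDist` standing in for the Excel functions. The mean, std dev and value below are hypothetical.

```python
from statistics import NormalDist

# Hypothetical variable: mean 100, std dev 15, value of interest 115.
mean, std_dev, x = 100.0, 15.0, 115.0

# Route 1: standardize first (like STANDARDIZE), then use the
# standard normal curve (like NORMSDIST).
z = (x - mean) / std_dev
p_via_z = NormalDist().cdf(z)

# Route 2: work on the original scale (like NORMDIST with TRUE).
p_direct = NormalDist(mean, std_dev).cdf(x)

print(round(p_via_z, 4), round(p_direct, 4))
```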
How to find the Z value if you have the
cumulative prob value
How to find the value of the variable if you
have the cumulative prob, sample mean
and std dev
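
The two inverse lookups can be sketched with `NormalDist.inv_cdf`, which plays the role that NORMSINV and NORMINV play in Excel. The cumulative probability, mean and std dev below are hypothetical.

```python
from statistics import NormalDist

cumulative_prob = 0.975          # hypothetical cumulative probability
mean, std_dev = 100.0, 15.0      # hypothetical mean and std dev

# Z value for a cumulative probability (like NORMSINV).
z = NormalDist().inv_cdf(cumulative_prob)

# Value of the variable for the same cumulative probability (like NORMINV).
x = NormalDist(mean, std_dev).inv_cdf(cumulative_prob)

print(round(z, 2), round(x, 2))
```

The second result is the same as taking the mean and adding Z times the std dev.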
Central Limit Theorem
Sample means are distributed approximately
normally regardless of the distribution of the
population.
The larger the sample size, the better the approximation to the
normal distribution.
Mean of the distribution of sample means =
population mean
Properties of the normal distribution can be used to
extract info from a sample
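
A small simulation sketch of the Central Limit Theorem. It assumes a strongly skewed population (exponential with mean 1), a sample size of 30 and 2000 repeated samples; all of these choices are illustrative and not from the notes.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30
NUM_SAMPLES = 2000


def sample_mean(n):
    """Mean of n draws from a skewed (exponential, mean 1) population."""
    return statistics.mean([rng.expovariate(1.0) for _ in range(n)])


sample_means = [sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.mean(sample_means)  # close to population mean 1
sd_of_means = statistics.stdev(sample_means)   # close to 1 / sqrt(30)

# Share of sample means within 2 std devs of their mean; for a
# normal distribution this is about 95%.
within_two_sd = sum(
    abs(m - mean_of_means) <= 2 * sd_of_means for m in sample_means
) / len(sample_means)

print(mean_of_means, sd_of_means, within_two_sd)
```

Even though the population is far from normal, the sample means cluster around the population mean in a roughly bell-shaped way.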
Confidence
intervals Estimating population mean
It's important to emphasize: We are not
saying that 95% of the time our sample
mean is the population mean, but we are saying that 95% of the time a range that
is two standard deviations wide centered
around the sample mean contains the
population mean.
Increase confidence level
Accept a wider range or increase the sample
size
How wide the interval
How do we know if an interval is too
wide? Typically, if we would make a
different decision for different values
within an interval, that interval is too
wide.
Std dev of the sample mean
How to find the confidence interval (this works only if the sample size is 30 or more)
Need to know the level of confidence
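
A sketch of the interval calculation for a mean, assuming the usual large-sample form: sample mean plus or minus z times s divided by the square root of n. The function name is made up; the inputs (n = 70, x-bar = 4.5, s = 1.2, z = 1.96) are the figures from the Confidence Intervals sheet later in this document.

```python
from math import sqrt


def mean_confidence_interval(x_bar, s, n, z):
    """Return (lower, upper) as x_bar -/+ z * s / sqrt(n)."""
    margin = z * s / sqrt(n)  # s / sqrt(n) is the std dev of the sample mean
    return x_bar - margin, x_bar + margin


# z = 1.96 corresponds to a 95% confidence level.
low, high = mean_confidence_interval(x_bar=4.5, s=1.2, n=70, z=1.96)

print(round(low, 2), round(high, 2))
```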
Obtaining Z value
Converting the desired confidence level
into the corresponding cumulative
probability on the standard normal curve
is essential because Excel's NORMSINV
function and the z-table work with
cumulative probabilities.
For smaller sample sizes (less than 30) we
have to use the T value. Degrees of freedom = sample size - 1
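
A sketch of the conversion from confidence level to Z value, with `NormalDist().inv_cdf` standing in for NORMSINV. The function name is made up. The T value for small samples is not covered here because the Python standard library has no t-distribution.

```python
from statistics import NormalDist


def z_for_confidence(confidence):
    """Z value for a two-sided confidence level such as 0.95."""
    # Add half of the leftover probability to get the cumulative probability.
    cumulative = confidence + (1 - confidence) / 2
    return NormalDist().inv_cdf(cumulative)


z_95 = z_for_confidence(0.95)    # cumulative probability 0.975
z_995 = z_for_confidence(0.995)  # cumulative probability 0.9975

print(round(z_95, 2), round(z_995, 3))
```

A 99.5% confidence level gives the 2.807033768 listed against NORMSINV in the method notes of this document.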
Choosing sample size
Based on an initial estimate, find the std dev (s);
also decide the maximum deviation
allowed (d, the half width of the interval). Apply
the formula n = (z * s / d)^2, rounded up, to get the required
sample size
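
A sketch of the sample size calculation n = (z * s / d)^2, the form used in the Sample Size sheet later in this document (z*s/d squared gives the minimal n). The function name is made up; the inputs s = 50, d = 5 and z = 1.96 are taken from that sheet.

```python
from math import ceil


def sample_size_for_mean(s, d, z):
    """Smallest whole n for which z * s / sqrt(n) is no larger than d."""
    return ceil((z * s / d) ** 2)


required_n = sample_size_for_mean(s=50, d=5, z=1.96)

print(required_n)
```

Rounding up matters: a sample one smaller would leave the interval slightly wider than the allowed deviation.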
Summary of how to build the range that
contains the population mean
Working with proportions
Often used to indicate the frequency of some
phenomenon in the population
p bar is the proportion of yes answers in the total
sample
Sample size selection
The selected sample size should satisfy the
conditions n x p bar >= 5 and n x (1 - p bar) >= 5
Method Calculations/Formulas
Use the Excel function (under the Analysis
ToolPak)
Greek letter mu represents the mean of the data
set. Use the AVERAGE formula in Excel
Use the MEDIAN formula in Excel
Use the MODE formula in Excel
Greek letter sigma. Use the Excel formula STDEV
Can be used to compare variability among different
sets of data
Use the Excel CORREL function to find the
correlation
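
A sketch of the correlation calculation, written out by hand as a Pearson coefficient rather than calling Excel's CORREL. The function name is made up; the inputs are the Variable 1 and Variable 2 columns listed in the data table at the end of this document.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


variable_1 = [-1, -1, 1, 1, -1, 1, -1, -1, -1, 1, 1, 1]
variable_2 = [1, 1, -1, -1, 1, -1, 1, 1, 1, -1, -1, -1]

r = pearson_correlation(variable_1, variable_2)

print(r)
```

Variable 2 is always the opposite of Variable 1, so the result is a perfect negative correlation, matching the -1.000000 shown in the table.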
- select elements from population at
random
- Analyze the sample
- Draw inference about total population
we are interested in
Need to know x bar (sample mean), std
dev of the sample s, and sample size n. Z
represents the confidence level. The higher the value
of Z, the higher the confidence level
STANDARDIZE, NORMSDIST
NORMDIST
NORMSINV 2.807033768
NORMINV
Std dev of the population divided by the square
root of n
To convert the desired conf level, take 1 -
desired conf level and divide by 2. Then
add the result to the desired conf level.
Input 1 - confidence level and degrees
of freedom. Use TINV
Solve the equation or use the Excel utility
Use the Excel utility - confidence interval
Use the Excel utility
n x p bar >= 5, n x (1 - p bar) >= 5
Number of rooms available divided by
upper limit of the confidence interval
Confidence Interval Utility

Type of Estimate:             Mean        Mean        Proportions
Sample Size:                  n >= 30     n < 30      n >= 30

Input Area
  n                           70          20          100
  x-bar / p-bar               4.5         5           0.1
  s                           1.2         10          -
  confidence level            0.95        0.95        0.95

Output Area
  Center of Interval          4.50        5.00        0.10
  z*s/sqrt(n)                 0.28        -           0.06
  t*s/sqrt(n)                 -           4.68        -
  Lower end of int'l          4.22        0.32        0.041
  Upper end of int'l          4.78        9.68        0.159
  Interval width              0.56        9.36        0.12

Other Calculations
  1-confidence level          0.05        0.05        0.05
  (1-confidence level)/2      0.025       -           0.025
  z                           1.96        -           1.96
  t                           -           2.09        -
  (p)(1-p)                    -           -           0.09
  s = sqrt[(p)(1-p)]          -           -           0.30
  sqrt(n)                     8.37        4.47        10.00
  s/sqrt(n)                   0.14        2.24        0.03

Check assumptions (Proportions): np > 5 OK, n(1-p) > 5 OK
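
A sketch that reproduces the Proportions column of the utility, assuming the interval p-bar plus or minus z * sqrt(p-bar * (1 - p-bar)) / sqrt(n) and the assumption check n x p bar >= 5, n x (1 - p bar) >= 5 from the method notes. The function name is made up for illustration.

```python
from math import sqrt


def proportion_confidence_interval(p_bar, n, z):
    """Return (lower, upper) for a sample proportion p_bar."""
    # The normal approximation needs enough yes and no answers.
    if n * p_bar < 5 or n * (1 - p_bar) < 5:
        raise ValueError("sample too small: need n*p >= 5 and n*(1-p) >= 5")
    s = sqrt(p_bar * (1 - p_bar))  # std dev of a yes/no variable
    margin = z * s / sqrt(n)
    return p_bar - margin, p_bar + margin


# Inputs from the Proportions column: n = 100, p-bar = 0.1, z = 1.96.
low, high = proportion_confidence_interval(p_bar=0.1, n=100, z=1.96)

print(round(low, 3), round(high, 3))
```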
Sample Size Utility

Type of Estimate:                                 Mean        Proportion

Input Area
  Sample Standard Deviation, s                    50          -
  Estimate of p                                   -           0.1
  Desired Accuracy: Half Width of Interval, d     5           0.02
  Confidence level                                0.95        0.95

Output Area
  Required Sample Size                            385         865

Other Calculations
  1-confidence level                              0.05        0.05
  (1-confidence level)/2                          0.025       0.025
  z                                               1.96        1.96
  (p)(1-p)                                        -           0.09
  s = sqrt[(p)(1-p)]                              -           0.30
  z*s                                             98.00       0.59
  z*s/d                                           19.60       29.40
  Minimal n                                       384.15      -
  Minimal n to ensure np>5                        -           50.0
  Minimal n to ensure n(1-p)>5                    -           5.6
  Minimal n to ensure d < (zs/sqrt(n))            -           864.3
  Minimal n to satisfy all constraints            -           864.3

Assumptions:
Sample Size will be above 30. If not, raise sample size to 30 to make assumptions valid.
Proportion Estimate is the maximum you expect p to be. If you don't have a good estimate of the proportion, use p = .5, which gives maximal standard deviation.
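
A sketch that reproduces the Proportion column of the utility: the required n is the largest of three minimums (accuracy, np > 5, n(1-p) > 5), rounded up. The function name is made up, and `NormalDist().inv_cdf` stands in for the sheet's z lookup.

```python
from math import ceil, sqrt
from statistics import NormalDist


def proportion_sample_size(p, d, confidence):
    """Smallest whole n meeting the accuracy and normality constraints."""
    z = NormalDist().inv_cdf(confidence + (1 - confidence) / 2)
    n_accuracy = (z * sqrt(p * (1 - p)) / d) ** 2  # half width no larger than d
    n_yes = 5 / p                                  # ensures n * p reaches 5
    n_no = 5 / (1 - p)                             # ensures n * (1 - p) reaches 5
    return ceil(max(n_accuracy, n_yes, n_no))


# Inputs from the sheet: p = 0.1, d = 0.02, confidence level 0.95.
required_n = proportion_sample_size(p=0.1, d=0.02, confidence=0.95)

print(required_n)
```

Here the accuracy constraint dominates; with a much looser d, the np > 5 minimum of 50 would set the sample size instead.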
Cereal / Protein (grams per serving) / Carbohydrates (grams per serving)
100% Bran 1 12
All-Bran 1 12
Almond Delight 1 12
Apple Cinnamon Cheerios 1 13
Apple Jacks 1 13
Bran Chex 1 13
Bran Flakes 1 13
Cap'n'Crunch 1 13
Cheerios 1 14
Cinnamon Toast Crunch 1 14
Cocoa Puffs 1 15
Corn Chex 1 23
Corn Flakes 2 9
Corn Pops 2 10
Count Chocula 2 11
Cracklin' Oat Bran 2 11
Cream of Wheat (Quick) 2 11
Crispix 2 12
Double Chex 2 14
Froot Loops 2 15
Frosted Flakes 2 15
Frosted Mini-Wheats 2 15
Fruit & Fibre Dates, Walnuts, and Oats 2 16
Fruity Pebbles 2 18
Golden Grahams 2 21
Grape Nuts Flakes 2 21
Grape-Nuts 2 21
Great Grains Pecan 2 21
Honey Nut Cheerios 2 21
Honey-comb 2 22
Kix 2 22
Life 3 10
Lucky Charms 3 11
Maypo 3 11
Muesli Raisins, Dates, & Almonds 3 12
Muesli Raisins, Peaches, & Pecans 3 12
Mueslix Crispy Blend 3 13
Nut&Honey Crunch 3 13
Nutri-grain Wheat 3 14
Post Nat. Raisin Bran 3 14
Product 19 3 15
Puffed Rice 3 15
Puffed Wheat 3 16
Quaker Oat Squares 3 17
Raisin Bran 3 17
Raisin Nut Bran 3 17
Raisin Squares 3 17
Rice Chex 3 18
Rice Krispies 3 20
Shredded Wheat 3 21
Smacks 4 5
Special K 4 7
Total Corn Flakes 4 12
Total Raisin Bran 4 14
Total Whole Grain 4 16
Triples 4 16
Trix 4 16
Wheat Chex 6 16
Wheaties 6 17
mean 2.49 14.81
median 2.00 14.00
Variable 1 / Variable 2 / Age / Salary ($thousands)
-1.0 1.0 53 145
-1.0 1.0 43 621
1.0 -1.0 33 262
1.0 -1.0 45 208
-1.0 1.0 46 362
1.0 -1.0 55 424
-1.0 1.0 41 339
-1.0 1.0 55 736
-1.0 1.0 36 291
1.0 -1.0 45 58
1.0 -1.0 55 498
1.0 -1.0 50 643
49 390
-1.000000 47 332
69 750
51 368
48 659
62 234
45 396
37 300
50 343
50 536
50 543
58 217
53 298
57 1103
53 406
61 254
47 862
56 204
44 206
46 250
58 21
48 298
38 350
74 800
60 726
32 370
51 536
50 291
40 808
61 543
63 149
56 350
45 242
61 198
70 213
59 296
57 317
69 482
44 155
56 802
50 200
56 282
43 573
48 388
52 250
62 396
48 572
0.13