Basic Descriptive Statistics for INCM doctoral students

Center for Statistics and Analytical Services at Kennesaw State University

2

INCM Boot Camp Notes

These notes are intended to prepare students for the quantitative portion of the INCM curriculum.

The Course will be a combination of Statistical and Mathematical Theory and Practical Application, with most exercises being executed in Excel and SPSS.

Should you have any questions regarding the materials included in this document, please contact Dr. Jennifer Priestley at [email protected].



3

INCM BOOT CAMP Topics in Basic Statistics

SECTION 1: Descriptive Statistics

SECTION 2: The Normal Distribution

SECTION 3: Samples versus Populations

SECTION 4: The Central Limit Theorem

SECTION 5: Binomial Distribution


4

SECTION 1 – DESCRIPTIVE STATISTICS

Whenever you receive a dataset, before you execute any analysis, you need to “KNOW YOUR DATA”. This is true for many reasons. The primary one is that when things go wrong (which on occasion they will), you will have a better understanding of why your analysis results are not what you thought they should have been. It's also important because the descriptive statistics are the first thing we report to people when executing any kind of analysis. We will go through a few examples of this. “Knowing” your data really means running the descriptive statistics.

The term “descriptive statistics” almost always means that the following information will be reported:

Mean

Median

Histogram

Standard Deviation/Variance

It also sometimes means –

Skew

Kurtosis

Confidence Interval (typically 95%)

Each of these will be explained.


5


Within descriptive statistics, the mean and the median are considered the measurements of central tendency of the variable in question. In other words, the central tendency is considered to be the most representative value for the data.

You probably recall that the formulas for the mean are:

SAMPLE MEAN: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$

POPULATION MEAN: $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$

And that the formulas for the median are:

Odd # of obs: $\text{Median} = X_{(n+1)/2}$

Even # of obs: $\text{Median} = \frac{X_{n/2} + X_{n/2+1}}{2}$


6

That is all straightforward. But which one is “right”? Which one should be reported as the best measurement of “central tendency”? Mathematically they are both correct. Most people will expect to see both. But what if the values are very different?

If there are no extreme values, then we would typically report the mean rather than the median, although in this instance the two should be fairly similar. If there are extreme values, then the two can be very different, and we would be better served with the median as the best measurement of central tendency.

For example –

Let's say there is a small law firm with 5 employees. Their positions and salaries are provided below:


Position                   Salary
Paralegal                  $70,000
Paralegal                  $80,000
Secretary                  $40,000
Junior Lawyer              $85,000
Senior Lawyer (owner)      $150,000


7

What is the best measurement of central tendency of the salaries?

The mean is $85,000 (verify this number for yourself).

The median is $80,000 (verify this number for yourself).

These numbers are not that far apart.

But…

What if the firm has a REALLY good year, and the salaries become:


Position                   Salary
Paralegal                  $70,000
Paralegal                  $80,000
Secretary                  $40,000
Junior Lawyer              $85,000
Senior Lawyer (owner)      $1,500,000


8


Hey…it's his firm. So…back to the measurements of central tendency -

The mean is now $355,000 (verify this number for yourself).

The median is still $80,000 (verify this number for yourself).

Note that the mean was (and always is) HIGHLY sensitive to extreme values, whereas the median is not affected by them. For this reason, if extreme values are present, the median is the best measurement of central tendency.
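The comparison can also be checked in a few lines of code. This is a minimal sketch in Python (the course itself uses Excel and SPSS, so the language and the variable names are only an illustration), using the two salary tables above:

```python
from statistics import mean, median

# Salaries (in dollars) from the two law firm tables above.
normal_year = [70_000, 80_000, 40_000, 85_000, 150_000]
good_year = [70_000, 80_000, 40_000, 85_000, 1_500_000]

for label, salaries in [("normal year", normal_year), ("good year", good_year)]:
    # The mean moves with the owner's salary; the median does not.
    print(f"{label}: mean = {mean(salaries):,.0f}, median = {median(salaries):,.0f}")
```

Only the owner's salary changes between the two lists, which is why the median stays put while the mean more than quadruples.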

Ok…how can we easily determine if extreme values are present?

The primary method for evaluating the distribution of a variable and assessing the presence of extreme values is through the use of a histogram.


9

[Figure: Histogram of Total Average Kilowatt Hours Used Per Day. Horizontal axis: Total Average Kilowatt Hours Used Per Day, in bins from 100 to 2000 plus “More”. Vertical axis: Frequency, from 0 to 7000.]


This visualization is the best way to represent the distribution of a quantitative variable. It allows the reader to immediately recognize any extreme values. Variables with this distribution should be represented with the median rather than with the mean. This histogram has extreme values to the right and is considered to be “right tailed skewed”.


10

The statistic used to communicate the degree to which the data are dispersed is the standard deviation and/or the variance. The two carry the same information: the variance is just the square of the standard deviation.


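Variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} X_i^2 - \frac{\left(\sum_{i=1}^{N} X_i\right)^2}{N}}{N}$

Standard Deviation: $\sigma = \sqrt{\sigma^2}$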

Practitioners (like us) tend to prefer the standard deviation over the variance because the standard deviation is expressed in the original units of the data. In other words, the standard deviation, the mean and the median are all expressed in the same units, like dollars (for the law firm salaries) or kilowatt hours. The variance is expressed in the original units squared, which is more difficult to interpret.

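Verify that the standard deviation is $40,311.29 for the first set of law firm salaries and $640,312.42 for the second set. Note that these figures are sample standard deviations, which divide by n - 1 rather than by N (this is what Excel's STDEV function computes).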
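One way to run this check outside of Excel is sketched below in Python; the variable names are illustrative. It computes both the sample standard deviation (divide by n - 1) and the population standard deviation (divide by N) so the two can be compared side by side.

```python
from statistics import pstdev, stdev

# Salaries (in dollars) for the two law firm scenarios.
first_set = [70_000, 80_000, 40_000, 85_000, 150_000]
second_set = [70_000, 80_000, 40_000, 85_000, 1_500_000]

for label, salaries in [("first set", first_set), ("second set", second_set)]:
    sample_sd = stdev(salaries)       # divides by n - 1
    population_sd = pstdev(salaries)  # divides by N
    print(f"{label}: sample SD = {sample_sd:,.2f}, population SD = {population_sd:,.2f}")
```

The sample version reproduces the two dollar figures quoted above; the population version is somewhat smaller because it divides by 5 rather than 4.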


11

The concepts of skew and kurtosis are less commonly requested as descriptive statistics than are mean, median and standard deviation, but they are reported. Both are measurements of the shape of the distribution of the variable:


$\text{skewness} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^3}{(N-1)\,S^3}$

$\text{kurtosis} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^4}{(N-1)\,S^4}$

A variable with a symmetric distribution – such as the normal distribution - will have a skew of 0. Variables with a “right tail” distribution – such as shown in the previous histogram – will have a positive skew, while variables with a “left tail” distribution will have a negative skew. The larger the skew, the more the median should be used over the mean for the measurement of central tendency.

A normal distribution has a kurtosis of 3. A distribution that has a higher peak than a normal curve will have a kurtosis greater than 3 and is said to be “leptokurtic”, and a distribution that has a lower peak than a normal curve will have a kurtosis less than 3 and is said to be “platykurtic”.
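As a rough illustration, the sketch below computes skewness and kurtosis in Python using the moment formulas with (N - 1)·S³ and (N - 1)·S⁴ in the denominators, where S is the sample standard deviation. The function names are illustrative, and other software (including Excel and SPSS) may apply different small-sample adjustments, so their output will not necessarily match these values.

```python
from statistics import mean, stdev

def skewness(values):
    """Sum of cubed deviations from the mean, divided by (N - 1) * S**3."""
    n, m, s = len(values), mean(values), stdev(values)
    return sum((x - m) ** 3 for x in values) / ((n - 1) * s ** 3)

def kurtosis(values):
    """Sum of fourth-power deviations, divided by (N - 1) * S**4."""
    n, m, s = len(values), mean(values), stdev(values)
    return sum((x - m) ** 4 for x in values) / ((n - 1) * s ** 4)

# Law firm salaries (in dollars) from the earlier example.
first_set = [70_000, 80_000, 40_000, 85_000, 150_000]
second_set = [70_000, 80_000, 40_000, 85_000, 1_500_000]

for label, salaries in [("first set", first_set), ("second set", second_set)]:
    print(f"{label}: skewness = {skewness(salaries):.3f}, kurtosis = {kurtosis(salaries):.3f}")
```

Both salary sets have a positive skew because the owner's salary sits in the right tail, and the skew is larger in the "good year" scenario.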


12

SECTION 2 – NORMAL DISTRIBUTION

While there are a lot of distributions that can be studied in statistics, the normal distribution – the one with the “bell shape” is the most important. There are several reasons for this –

While not all data has a “normal” shape, the normal distribution is the most common.

From the Central Limit Theorem (SECTION 4) we know that the distribution of sample means will approach a normal distribution as the sample size grows. This is effectively why statistics “works”.

The normal curve is symmetrical and has several easy-to-understand properties. Once these concepts are understood and applied to the normal curve, other distributions are easier to understand.

The mathematical expression for the normal curve is:

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$

where μ and σ are the parameters which define the distribution, e = the base of the natural logarithm = 2.7182..., and π = 3.14159...


13

The Normal Distribution is symmetrical around the mean. We know that when data is normally distributed:

About 68% of all observations will fall within one standard deviation of the mean.

About 95% of all observations will fall within two standard deviations of the mean. Because the distribution is symmetrical, this also means that about 2.5% of all observations will fall more than 2 standard deviations above the mean, and about 2.5% will fall more than 2 standard deviations below the mean.

About 99.7% of all observations will fall within three standard deviations of the mean.

Extreme values can be identified visually using the histogram approach, but can also be defined mathematically as falling more than 2 (or sometimes 3) standard deviations from the mean.
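These percentages can be recovered directly from the normal curve. The sketch below (in Python; the function name is illustrative) uses the standard identity that, for a normal variable, the probability of falling within k standard deviations of the mean is erf(k/√2).

```python
import math

def prob_within(k):
    """Probability that a normal observation falls within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    inside = prob_within(k)
    upper_tail = (1 - inside) / 2  # by symmetry, half of what is left lies above
    print(f"within {k} SD: {inside:.2%}   above +{k} SD: {upper_tail:.2%}")
```

The result does not depend on the particular mean or standard deviation, which is why the rule applies to any normally distributed variable.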


14

SECTION 3 – SAMPLES VERSUS POPULATIONS

What is the difference between a Sample and a Population? And why do we sample?

A sample is a carefully generated subset of a population. In Statistics, we use different symbols to represent information referring to a population versus a sample.

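For example, these symbols represent the standard deviation, mean and size for each:

Population: σ (standard deviation), μ (mean), N (size)

Sample: s (standard deviation), X̄ (mean), n (size)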


15

Samples are typically taken for one of three reasons:

1. It would be too expensive and/or too time consuming to analyze the entire population.

Consider the Gallup poll. It would be too expensive and too time consuming (some might say impossible) to ask EVERY person in the country who they are supporting for president. As a result, Gallup must take a sample of voters (typically about 1,000) and then make an inference back onto the full population regarding the percent of the voters supporting Candidate A. You will notice that these sample results ALWAYS include the error information regarding the sample, such as “95% confidence” and “3% Margin of Error”. These concepts of confidence and margin of error will be covered during the course.

2. The analysis process destroys the units in the population. This is called destructive testing.

Consider a manufacturer like Michelin Tires. When they report the performance of their tires, they have to test a sample of the tires and then report the sample results, the confidence level and the margin of error. It would, of course, be more precise to report the performance of ALL of the tires based upon a test that included ALL of the tires. But in testing the tire, it is destroyed.

3. Analysis of the entire population creates too much “statistical power”.

“Power”, or the probability of finding an effect if one actually exists, is a function of sample size, effect size and alpha. As the sample size increases, smaller and smaller effects are found to be significant, but possibly not meaningful.
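The third point can be sketched numerically. The example below is in Python and uses entirely hypothetical numbers (a difference of 0.5 units against a standard deviation of 10); it shows how the test statistic for the same small effect grows with the square root of the sample size. The 1.96 cutoff assumes a two-sided test at alpha = .05.

```python
import math

def z_statistic(effect, sd, n):
    """Z-score for an observed mean difference `effect` with a sample of size n."""
    return effect / (sd / math.sqrt(n))

effect = 0.5   # hypothetical small difference between sample mean and benchmark
sd = 10.0      # hypothetical standard deviation
cutoff = 1.96  # two-sided critical value, assuming alpha = .05

for n in (100, 1_000, 10_000, 100_000):
    z = z_statistic(effect, sd, n)
    verdict = "significant" if abs(z) > cutoff else "not significant"
    print(f"n = {n:>7,}: z = {z:6.2f} ({verdict})")
```

The effect itself never changes; only the sample size does. Whether a difference of 0.5 units is meaningful is a separate, practical question.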



16


A few things to consider about sampling:

1. What makes a sample valid?

2. What are the implications for a non-valid sample?

3. What if the only sample I have is not valid?

4. What if I have lots of different groups, like commercial and residential units? Do I need to sample these groups differently?

5. What if I have no idea how the data I was given was collected? How can I tell if the sample is valid?

Think about these questions…think of examples where you might have been faced with these issues. We will use these as a basis to form our discussion around sampling.


17

SECTION 4 – CENTRAL LIMIT THEOREM

There are A LOT of really fascinating statistical theorems. No really. While we could go on for hours about them, there is really just one that you REALLY need to understand – The Central Limit Theorem (CLT).

Because the CLT works, we can take a sample and make an inference onto a population. In effect, the CLT is the backbone of statistics – it is why statistics works.

Here are a few principles regarding the CLT that you need to understand:

The distribution of all possible sample means will approach a normal distribution as the sample size increases (regardless of how the population is distributed).

The mean of all sample means will equal the population mean.

The standard deviation of all sample means is the standard deviation of the population divided by the SQRT of the sample size.

If the population is NOT normally distributed (or its distribution is unknown), a common rule of thumb is that sample sizes should be greater than 30 to assume that the CLT will work.

If the population IS normally distributed, samples can be of any (reasonable) size (although greater than 30 is always preferred).

Here is a great website to understand the concept:

http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html
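The same idea can be tried in a small simulation. The sketch below is in Python and rests on assumptions chosen only for illustration: a made-up, strongly right-skewed population (exponential), samples of size 30, and 2,000 repeated samples. It compares the mean and standard deviation of the sample means against the population mean and the population standard deviation divided by SQRT(n).

```python
import math
import random
from statistics import mean, pstdev, stdev

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical right-skewed population (exponential), clearly not normal.
population = [random.expovariate(1.0) for _ in range(50_000)]
pop_mean = mean(population)
pop_sd = pstdev(population)

n = 30               # size of each sample
num_samples = 2_000  # number of samples drawn

# Draw many samples and record the mean of each one.
sample_means = [mean(random.sample(population, n)) for _ in range(num_samples)]

print(f"population mean         = {pop_mean:.3f}")
print(f"mean of sample means    = {mean(sample_means):.3f}")
print(f"population SD / SQRT(n) = {pop_sd / math.sqrt(n):.3f}")
print(f"SD of sample means      = {stdev(sample_means):.3f}")
```

Plotting a histogram of `sample_means` (in Excel, for instance) should show a roughly bell-shaped curve even though the population itself is heavily skewed.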




18

Here is an example that might be useful to understand the CLT:

There is a particular small business that has just started its electrical service. Over a 1 month period (30 days), its average 24 hour KW usage was 2500.

Consider this to be a sample for this company – with a mean of 2500, and a standard deviation of 390 KW (remember that standard deviations are in the same units as the original data).

Let's say that we know that companies of the same size and in the same industry average 2200 KW every 24 hours. Is this company unusual? Are they an “extreme” user? Let's see…

If we convert their usage into units of standard deviations through the application of the “Z-score”, Z = (sample mean - population mean)/(standard deviation/SQRT(n)) -

We get – (2500 – 2200)/(390/SQRT(30)) = 4.21 standard deviations. From slide 13, we know that only about 2.5% of all expected sample means would be more than 2 standard deviations above the mean, and less than .5% of all expected sample means would be more than 3 standard deviations above it. So, the probability of this type of outcome occurring by “chance” is less than .5%, since it is more than 3 standard deviations above the mean. This is an “extreme” value, and warrants further investigation.
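The arithmetic can be reproduced as follows. This is a minimal Python sketch with illustrative variable names; the upper-tail probability is computed from the normal curve via the complementary error function, which assumes the sample means are normally distributed as the CLT suggests.

```python
import math

sample_mean = 2500     # this company's average 24 hour usage (KW)
benchmark_mean = 2200  # average for comparable companies (KW)
sample_sd = 390        # standard deviation of the 30 daily readings (KW)
n = 30                 # number of days in the sample

standard_error = sample_sd / math.sqrt(n)
z = (sample_mean - benchmark_mean) / standard_error

# Probability of a sample mean at least this far above the benchmark by chance.
upper_tail = 0.5 * math.erfc(z / math.sqrt(2))

print(f"standard error = {standard_error:.2f} KW")
print(f"z = {z:.2f} standard deviations")
print(f"upper-tail probability = {upper_tail:.6f}")
```

The tail probability comes out far below the .5% threshold used above, which is consistent with treating this company as an extreme user.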


19

SECTION 5 – BINOMIAL DISTRIBUTION

Another interesting distribution to understand is the binomial distribution (the Bernoulli distribution is its special case of a single trial). The binomial distribution comes in handy when dealing with binary outcomes, like yes/no, pass/fail, high/low, response/no response, etc.

The mathematical expression for the binomial distribution is:

$b(x; n, p) = \frac{n!}{(n-x)!\,x!} \, p^x q^{n-x}$

Where this is the calculated probability of exactly x successes in n independent trials, where the probability of success on each trial is p and q = 1 - p.

Let's take a look at an example…

Across all airlines, the on time arrival percentage is 74%. If you commute to Atlanta for work by air, and fly 10 times a month, what is the chance that exactly 8 of those 10 flights will be on time?


20

In this example, each flight is a “trial”. So, n=10. Each on time arrival is a “success”. So x = 8. The probability of any individual success is .74. So, p = .74 and q = .26.

The probability of 8 out of 10 flights being on time (jokes about Atlanta Hartsfield Airport aside) is:

[10!/((10-8)!*8!)]*.74^8*.26^2

= .2735 or 27.35%

Consider another example dealing with bill payment…

What is the probability that, out of a heterogeneous sample of 50 customers, exactly 47 will pay their bills on time, where the overall percentage of on time payment is 85%?

In this example, each customer represents a “trial”. So n=50. Each on time payment is a “success”. So x = 47. The probability of any individual on time payment (ceteris paribus) is .85. So p = .85 and q = .15.

The probability of 47 out of 50 customers paying on time is:

[50!/((50-47)!*47!)]*.85^47*.15^3

= .032 or 3.2%
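Both examples can be checked with a short function. The sketch below is in Python; the function name `binomial_pmf` is illustrative, and it simply evaluates the binomial expression n!/((n-x)!*x!) * p^x * q^(n-x) with q = 1 - p.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

# Example 1: exactly 8 of 10 flights on time, with p = .74.
flights = binomial_pmf(8, 10, 0.74)

# Example 2: exactly 47 of 50 customers paying on time, with p = .85.
payments = binomial_pmf(47, 50, 0.85)

print(f"8 of 10 flights on time:       {flights:.4f}")
print(f"47 of 50 customers pay on time: {payments:.4f}")
```

Summing `binomial_pmf(x, n, p)` over a range of x gives "at least" or "at most" probabilities, for example the chance that 8 or more flights arrive on time.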
