Rsch STATISTICS.docx


  • 8/14/2019 Rsch STATISTICS.docx


    Measures of Variability

    Author(s): David M. Lane

    Prerequisites

    Percentiles, Distributions, Measures of Central Tendency

    Learning Objectives

    1. Determine the relative variability of two distributions
    2. Compute the range
    3. Compute the inter-quartile range
    4. Compute the variance in the population
    5. Estimate the variance from a sample
    6. Compute the standard deviation from the variance

    WHAT IS VARIABILITY?

    Variability refers to how "spread out" a group of scores is.

    To see what we mean by spread out, consider graphs in

    Figure 1. These graphs represent the scores on two quizzes.

    The mean score for each quiz is 7.0. Despite the equality of

    means, you can see that the distributions are quite

    different. Specifically, the scores on Quiz 1 are more

    densely packed and those on Quiz 2 are more spread out.

    The differences among students were much greater on Quiz

    2 than on Quiz 1.

    Quiz 1

    Quiz 2


    Figure 1. Bar charts of two quizzes.

    The terms variability, spread, and dispersion are synonyms,

    and refer to how spread out a distribution is. Just as in the

    section on central tendency where we discussed measures

    of the center of a distribution of scores, in this chapter we

    will discuss measures of the variability of a distribution.

    There are four frequently used measures of variability: the

    range, interquartile range, variance, and standard

    deviation. In the next few paragraphs, we will look at each

    of these four measures of variability in more detail.

    RANGE

    The range is the simplest measure of variability to calculate, and one you have probably encountered many times in your life. The range is simply the highest score minus the lowest score. Let's take a few examples. What is the range of the following group of numbers: 10, 2, 5, 6, 7, 3, 4? Well, the highest number is 10, and the lowest number is 2, so 10 - 2 = 8. The range is 8. Let's take another example. Here's a dataset with 10 numbers: 99, 45, 23, 67, 45, 91, 82, 78, 62, 51. What is the range? The highest number is 99 and the lowest number is 23, so 99 - 23 equals 76; the range is 76. Now consider the two quizzes shown in Figure 1. On Quiz 1, the lowest score is 5 and the highest score is 9. Therefore, the range is 4. The range on Quiz 2 was larger: the lowest score was 4 and the highest score was 10. Therefore the range is 6.
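    The two numeric examples above can be checked with a few lines of Python. This is only a minimal sketch; the function name score_range is chosen here for illustration and does not come from the text.

```python
def score_range(scores):
    """Return the range: the highest score minus the lowest score."""
    return max(scores) - min(scores)


# The two example datasets from the text.
first = [10, 2, 5, 6, 7, 3, 4]
second = [99, 45, 23, 67, 45, 91, 82, 78, 62, 51]

print(score_range(first))   # 8
print(score_range(second))  # 76
```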

    INTERQUARTILE RANGE

    The interquartile range (IQR) is the range of the middle 50% of the scores in a distribution. It is computed as follows:

    IQR = 75th percentile - 25th percentile

    For Quiz 1, the 75th percentile is 8 and the 25th percentile is 6. The interquartile range is therefore 2. For Quiz 2, which has greater spread, the 75th percentile is 9, the 25th percentile is 5, and the interquartile range is 4. Recall that in the discussion of box plots, the 75th percentile was called the upper hinge and the 25th percentile was called the lower hinge. Using this terminology, the interquartile range is referred to as the H-spread.

    A related measure of variability is called the semi-interquartile range. The semi-interquartile range is defined simply as the interquartile range divided by 2. If a distribution is symmetric, the median plus or minus the semi-interquartile range contains half the scores in the distribution.
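    As a rough check of the Quiz 1 figures, the sketch below computes the quartiles with Python's standard statistics.quantiles (Python 3.8 or later). The scores are the twenty Quiz 1 values listed in Table 1. Percentile conventions differ between textbooks and software, so the default method used here is an assumption and other methods may give slightly different cut points on other data.

```python
from statistics import quantiles

# Quiz 1 scores as listed in Table 1.
quiz1 = [9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 5, 5]

# quantiles(..., n=4) returns the three quartile cut points
# (25th, 50th and 75th percentiles) using the default method.
q1, q2, q3 = quantiles(quiz1, n=4)

iqr = q3 - q1        # interquartile range (H-spread)
semi_iqr = iqr / 2   # semi-interquartile range

print(q1, q3, iqr, semi_iqr)  # 6.0 8.0 2.0 1.0
```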

    VARIANCE

    Variability can also be defined in terms of how close the scores in the distribution are to the middle of the distribution. Using the mean as the measure of the middle of the distribution, the variance is defined as the average squared difference of the scores from the mean. The data from Quiz 1 are shown in Table 1. The mean score is 7.0. Therefore, the column "Deviation from Mean" contains the


    score minus 7. The column "Squared Deviation" is simply

    the previous column squared.

    Table 1. Calculation of Variance for Quiz 1 scores.

    Scores    Deviation from Mean    Squared Deviation
    9          2                     4
    9          2                     4
    9          2                     4
    8          1                     1
    8          1                     1
    8          1                     1
    8          1                     1
    7          0                     0
    7          0                     0
    7          0                     0
    7          0                     0
    7          0                     0
    6         -1                     1
    6         -1                     1
    6         -1                     1
    6         -1                     1
    6         -1                     1
    6         -1                     1
    5         -2                     4
    5         -2                     4

    Means
    7          0                     1.5

    One thing that is important to notice is that the mean deviation from the mean is 0. This will always be the case. The mean of the squared deviations is 1.5. Therefore, the variance is 1.5. Analogous calculations with Quiz 2 show that its variance is 6.7. The formula for the variance is

    $$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$

    where $\sigma^2$ is the variance, $\mu$ is the mean, and N is the number of numbers. For Quiz 1, $\mu$ = 7 and N = 20.
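    The Table 1 calculation can be reproduced step by step. The sketch below rebuilds the "Deviation from Mean" and "Squared Deviation" columns for the Quiz 1 scores and compares the result with the standard library's statistics.pvariance; the variable names are illustrative only.

```python
from statistics import mean, pvariance

# Quiz 1 scores as listed in Table 1.
quiz1 = [9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 5, 5]

mu = mean(quiz1)                         # population mean
deviations = [x - mu for x in quiz1]     # "Deviation from Mean" column
squared = [d ** 2 for d in deviations]   # "Squared Deviation" column

# Population variance: the mean of the squared deviations (N in the denominator).
variance = sum(squared) / len(quiz1)

print(mu)               # 7
print(sum(deviations))  # 0, so the mean deviation is also 0
print(variance)         # 1.5
print(pvariance(quiz1)) # 1.5, same value from the standard library
```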

    If the variance in a sample is used to estimate the variance in a population, then the previous formula underestimates the variance and the following formula should be used:

    $$s^2 = \frac{\sum (X - M)^2}{N - 1}$$

    where $s^2$ is the estimate of the variance and M is the sample mean. Note that M is the mean of a sample taken from a population with a mean of $\mu$. Since, in practice, the


    variance is usually computed in a sample, this formula is

    most often used. The simulation "estimating variance"

    illustrates the bias in the formula with N in the

    denominator.

    Let's take a concrete example. Assume the scores 1, 2,

    4, and 5 were sampled from a larger population. To

    estimate the variance in the population you would compute $s^2$ as follows:

    M = (1 + 2 + 4 + 5)/4 = 12/4 = 3.

    $s^2 = [(1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2]/(4-1) = (4 + 1 + 1 + 4)/3 = 10/3 = 3.333$
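    The same arithmetic can be sketched in Python to show the difference between dividing by N and dividing by N - 1. The names biased and s2 are labels chosen for this illustration; statistics.variance is the standard library's sample (N - 1) estimator.

```python
from statistics import mean, variance

sample = [1, 2, 4, 5]   # the four scores sampled from a larger population

m = mean(sample)                           # sample mean M
ss = sum((x - m) ** 2 for x in sample)     # sum of squared deviations

biased = ss / len(sample)       # N in the denominator: tends to underestimate
s2 = ss / (len(sample) - 1)     # N - 1 in the denominator: the estimate s^2

print(m, ss)          # 3 10
print(biased)         # 2.5
print(round(s2, 3))   # 3.333
print(round(variance(sample), 3))  # 3.333
```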

    There are alternate formulas that can be easier to use if you

    are doing your calculations with a hand calculator. You

    should note that these formulas are subject to rounding

    error if your values are very large and/or you have an

    extremely large number of observations.

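    $$\sigma^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}$$

    and

    $$s^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}$$

    For this example, $\sum X^2 = 1 + 4 + 16 + 25 = 46$ and $(\sum X)^2/N = 144/4 = 36$, so $\sigma^2 = (46 - 36)/4 = 2.5$ and $s^2 = (46 - 36)/3 = 3.333$.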

    STANDARD DEVIATION

    The standard deviation is simply the square root of the variance. This makes the standard deviations of the two quiz distributions 1.225 and 2.588. The standard deviation is an especially useful measure of variability when the distribution is normal or approximately normal (see Chapter on Normal Distributions) because the proportion of the distribution within a given number of standard deviations from the mean can be calculated. For example, 68% of the distribution is within one standard deviation of the mean and approximately 95% of the distribution is within two standard deviations of the mean. Therefore, if you had a normal distribution with a mean of 50 and a standard deviation of 10, then 68% of the distribution would be between 50 - 10 = 40 and 50 + 10 = 60. Similarly, about 95% of the distribution would be between 50 - 2 x 10 = 30 and 50 + 2 x 10 = 70. The symbol for the population standard deviation is $\sigma$; the symbol for an estimate computed in a sample is s. Figure 2 shows two normal distributions. The red distribution has a mean of 40 and a standard deviation of 5; the blue distribution has a mean of


    60 and a standard deviation of 10. For the red distribution,

    68% of the distribution is between 35 and 45; for the blue

    distribution, 68% is between 50 and 70.

    Figure 2. Normal distributions with standard deviations of 5 and 10.
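    The numbers in this section follow from two small calculations: taking the square root of each variance, and stepping k standard deviations either side of the mean. A minimal sketch, with sd_interval as a helper name invented for this example:

```python
from math import sqrt


def sd_interval(mean, sd, k):
    """Return (low, high) for the interval mean +/- k standard deviations."""
    return mean - k * sd, mean + k * sd


# Standard deviations of the two quizzes from their variances (1.5 and 6.7).
sd_quiz1 = sqrt(1.5)
sd_quiz2 = sqrt(6.7)
print(round(sd_quiz1, 3), round(sd_quiz2, 3))  # 1.225 2.588

# Normal distribution with mean 50 and standard deviation 10.
print(sd_interval(50, 10, 1))  # (40, 60), about 68% of the distribution
print(sd_interval(50, 10, 2))  # (30, 70), about 95% of the distribution

# The red and blue distributions of Figure 2.
print(sd_interval(40, 5, 1))   # (35, 45)
print(sd_interval(60, 10, 1))  # (50, 70)
```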

    Levels of Measurement

    There are four levels of measurement. They are:

    Nominal, Ordinal, Interval, and Ratio

    Associated with each level are acceptable statistical methods. The nominal

    level is simplest, while Ratio measures are the most sophisticated.

    Nominal Level (Grouping)

    Nominal level data is generally referred to as the "lowest" level of

    measure. Data is limited to groups and categories. No numerical data is

    ever provided.

    Example:

    Male Female

    Catholic Protestant Jewish Muslim Other

    Ordinal Level (Grouping and Ranking)

    Ordinal level data can be grouped and ranked. With Ordinal data, you can say that a measure is higher or lower than another measure. But you may not say how much higher or lower.

    Example:

    Preferred Flavors of Ice Cream

    1. Vanilla

    2. Chocolate

    3. Strawberry

    4. Cherry

    Interval Level (Grouping, ranking, and includes the exact distance between measures)

    Interval level data can be grouped, ranked, and include the exact distance between measures. Note: Interval measures never contain an absolute zero (0) starting point.

    Example:

    Sam is 2" taller than Bill and 3" taller than Steve.


    Bill is 2" shorter than Sam and 1" taller than Steve.

    Steve is 3" shorter than Sam and 1" shorter than Bill.

    Note: We do not know how tall Sam, Bill or Steve are, we only know

    exactly the difference in their heights when compared to one another.


    Ratio Level (Grouping, ranking, exact distance between measurement,

    and contains an absolute "0")

    Ratio level data are said to be at the highest level and can be grouped, ranked, and the exact distance between measures determined. Also, Ratio level measures contain an absolute "0". By having an absolute "0" in your measurement "scale", you are able to describe data in terms of ratios.

    Example:

    You could say that Jack, who weighs 200 lbs., is twice as heavy as Mary, who weighs 100 lbs. ("twice as heavy" is a ratio statement).

    It should be noted that with social science data, there are rarely any outside

    standard scales, such as a yardstick to measure height. Therefore, social research rarely generates data that goes beyond Interval level measures.

    Sample Size

    The question "How large should my sample be?" is a common one, and one with no simple answer. While there are a number of elegant approaches to answering this question, for our purposes, several "rules of thumb" will serve us better.

    Rule of Thumb #1

    Use sample groups larger than 30 for interval level measures.

    Rule of Thumb #2

    If the total population that you are examining is less than 30, use all of them.

    Rule of Thumb #3

    You should have a sample size of 30 for every relationship you measure.

    30+ people  ---> compared against "X"   OK
    15 Women    ---> compared against "X"   Not OK
    15 Men      ---> compared against "X"   Not OK
    30+ Women   ---> compared against "X"   OK
    30+ Men     ---> compared against "X"   OK

    Rule of Thumb #4

    Consult this table: http://www.internetraining.com/Statkit/SampleTable.htm

    Selecting Samples

    To select a random sample, use a table of Random Numbers or use a

    computerized random number generator.
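    As one illustration of a computerized random number generator, the sketch below draws a simple random sample without replacement using Python's standard random module. The population of 200 numbered members, the sample size of 30, and the seed are all hypothetical values chosen for the example.

```python
import random


def draw_sample(population, k, seed=None):
    """Draw k distinct members at random (simple random sample, no replacement)."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(population, k)


# Hypothetical population: 200 members identified by the numbers 1..200.
population_ids = list(range(1, 201))

sample_ids = draw_sample(population_ids, 30, seed=42)
print(sorted(sample_ids))
```

    Numbering the members first and sampling the numbers mirrors the way a printed table of random numbers is used.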

    Note: For a more detailed discussion of sample size, see pages 385-3[...] of your textbook.

    Measures of Central Tendency


    If you want to find a Yak-Yak bird, the first question you might ask yourself is, "Where do most Yak-Yak birds live?" In other words, "Where would I have to go to have the greatest chance of finding a Yak-Yak bird?" Measures of Central Tendency tell you where most of whatever you are measuring can be found.

    Mean = Average

    All scores are added up and divided by the number of scores.

    Median = Middle score

    Arrange the scores in order and count the total number of scores. The one in the middle is the median.

    Note: If there are an even number of scores, select the middle two and

    average them. This will give you the median.

    Mode = Most common score

    The mode is the score that occurs most often.

    Range

    Range is the difference between the highest and lowest scores. You should only use the range to describe interval or ratio level data. To calculate the range, subtract the lowest score from the highest score.

    Example:

    Note: Some statistics books define range as the High Score minus the Low Score, plus one (1). This is an inclusive measure of range rather than a measure of the difference between two scores. For example, the inclusive range for data ranging from 6 to 10 would be 5.


    For our purposes, we will define the range as the difference between the highest and lowest scores.
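    The two definitions differ by exactly one. A small sketch, with function names chosen here for illustration and a hypothetical set of scores running from 6 to 10:

```python
def exclusive_range(scores):
    """Range as used in this text: highest score minus lowest score."""
    return max(scores) - min(scores)


def inclusive_range(scores):
    """Inclusive range used by some books: highest minus lowest, plus one."""
    return max(scores) - min(scores) + 1


scores = [6, 7, 8, 9, 10]  # hypothetical data ranging from 6 to 10

print(exclusive_range(scores))  # 4
print(inclusive_range(scores))  # 5
```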

    Skewness

    In a normal or bell shaped distribution of scores, the mean, median and mode are equal.

    Example:

    Scores: 7, 8, 9, 9, 10, 10, 10, 11, 11, 12, 13

    Mean: 10

    Median: 10

    Mode: 10

    Range: 6

    We should note at this point that a normal distribution (Bell Curve) is an important concept for statisticians because it gives them a "theoretical standard" by which to compare data that may not form a perfect bell curve.

    If you have data where the mean, median and mode are quite different, the scores are said to be skewed.

    Example:

    Scores: 7, 8, 9, 10, 11, 11, 12, 12, 12, 13, 13

    Mean: 10.7

    Median: 11

    Mode: 12

    Range: 6

    Scores that are "bunched" at the right or high end of the scale are said to have a negative skew.
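    The summary figures for both example score sets can be reproduced with Python's standard statistics module. In the second set the mean falls below the median, which falls below the mode; that ordering is the usual signature of a negative skew. The helper name summarize is an assumption made for this sketch.

```python
from statistics import mean, median, mode

# The two example score sets from the text.
symmetric = [7, 8, 9, 9, 10, 10, 10, 11, 11, 12, 13]
skewed = [7, 8, 9, 10, 11, 11, 12, 12, 12, 13, 13]


def summarize(scores):
    """Return the mean (1 decimal), median, mode and range of a list of scores."""
    return {
        "mean": round(mean(scores), 1),
        "median": median(scores),
        "mode": mode(scores),
        "range": max(scores) - min(scores),
    }


print(summarize(symmetric))  # mean 10, median 10, mode 10, range 6
print(summarize(skewed))     # mean 10.7, median 11, mode 12, range 6
```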


    On the other hand, if my standard deviation is small, it indicates that my scores are close to the Mean, and therefore, the Mean is a good indicator of

    the "average" score.

    To calculate the standard deviation of a set of scores, use the following

    formula.

    The following should help. Let's say the data below represents the test

    scores for 10 trainees. (The top score anyone could make was 50).

    Scores (n) = 10

    Mean = 400/10 = 40

    Median = 40.5

    Mode = 41

    Range = 16

    Let me try it on the Internet: http://www.physics.csbsju.edu/stats/descriptive2.html

    So what does the standard deviation tell you? It tells you that most trainees made 40, give or take 5 points. It also tells you that, in this case, you can say with confidence that the Mean is a very good indicator of the "average" score.


    By establishing the standard deviation for a set of scores, you will be able

    to describe accurately how various scores differ from one another and

    from the Mean (average).

    Note: In a Normal Distribution:

    approximately 68% of all scores will fall within (+/-) 1 standard deviation of the Mean;

    approximately 95% of all scores will fall within (+/-) 2 standard deviations of the Mean;

    approximately 99.7% of all scores will fall within (+/-) 3 standard deviations of the Mean.
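    These percentages can be checked against the normal curve itself. The sketch below uses statistics.NormalDist from the Python standard library (Python 3.8 or later); share_within is a helper name invented for this example.

```python
from statistics import NormalDist


def share_within(k, mean=0.0, sd=1.0):
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    dist = NormalDist(mean, sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)


for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 1))
# 1 68.3
# 2 95.4
# 3 99.7
```

    The result does not depend on the particular mean and standard deviation, which is why the rule can be stated once for every normal distribution.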

    When we know the standard deviation for a set of scores, it is possible to compare our data at a glance with a Normal Distribution to determine the degree of dispersion.

    In the above example, we can see that all of our scores fell within two standard deviations of the Mean, which again reaffirms that our Mean is a very good indicator of the "average" score.

    A Quick and Dirty Way for Determining the Standard Deviation of a Set of Scores

    If your Mean, Median and Mode are very close to one another, you can estimate the Standard Deviation as follows:

    1. Determine the Range

    2. Divide the Range by four

    3. The resulting number will approximately equal the true standard

    deviation

    Example:


    Scores: 7, 8, 9, 10, 10, 10, 11, 11, 12, 13

    Mean: 10

    Median: 10

    Mode: 10

    Range: 6

    True Standard Deviation: 1.67

    Q&D Standard Deviation: 6/4 = 1.5

    CAUTION: Only use this technique IF your Mean, Median and Mode

    are very similar.
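    The three steps can be sketched as follows, using the ten scores listed in the example. For comparison the sketch also computes a standard deviation with statistics.pstdev (N in the denominator); that choice is an assumption, and the exact figure obtained depends on whether N or N - 1 is used and on rounding.

```python
from statistics import pstdev

scores = [7, 8, 9, 10, 10, 10, 11, 11, 12, 13]

# Step 1: determine the range. Step 2: divide the range by four.
score_range = max(scores) - min(scores)
quick_sd = score_range / 4

# Comparison value computed from the squared deviations
# (population form; statistics.stdev would use N - 1 instead).
exact_sd = pstdev(scores)

print(score_range)         # 6
print(quick_sd)            # 1.5
print(round(exact_sd, 2))
```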

    Measures of Association

    Pearson's r

    Interval Level Measure

    There will be many times when you will want to know if two variables are "related," and to what extent. For example, you may want to find out the relationship between Yearly Income and Education. To do this, you would randomly select 10 individuals. To help you "sort out" your data, you would construct the following table.

    With this data, you can now use the Pearson's r or product moment

    correlation coefficient formula.
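    A minimal sketch of the product moment calculation is shown below, using the deviation-score form r = sum((x - mean_x)(y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2)). The data table from the original is not reproduced here, so the education and income values are hypothetical and serve only to exercise the function.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product moment correlation between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares of the deviation scores.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical data for 10 individuals (NOT the values from the original table).
years_of_education = [8, 10, 12, 12, 14, 16, 16, 18, 20, 21]
yearly_income = [18, 22, 30, 28, 35, 46, 50, 55, 70, 75]  # in $1000s

r = pearson_r(years_of_education, yearly_income)
print(round(r, 3))
```

    Values of r near +1 or -1 indicate a strong linear relationship; values near 0 indicate little or none.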


    Let me try it on the Internet: http://faculty.vassar.edu/lowry/corr_stats.html

    Using the following table, you can see that there is a very high correlation

    between Income and Years of Education.

    Measures of Association

    Spearman's Rank Order

    Ordinal Level Measure

    Once in a while, it is useful to compare rankings between two people or groups. For example, let's say a group of employees and a group of managers want to find out if there is a difference in the workplace values held by each group. In this example, each group ranks 10 workplace values.


    In order to calculate a Spearman Rank Order, you must first construct the

    following table.

    Let me try it on the Internet: http://faculty.vassar.edu/lowry/corr_rank.html

    Again, using the following table, you can see that this time there is n[...] correlation between the two rankings.
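    A common way to compute the Spearman rank order coefficient when there are no tied ranks is rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference between the two ranks given to each item. The sketch below assumes that formula; the two sets of rankings are hypothetical, since the original table is not reproduced here.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank order correlation for two rankings without ties."""
    n = len(ranks_a)
    if n != len(ranks_b) or n < 2:
        raise ValueError("need two equal-length rankings with at least 2 items")
    # d is the difference between the two ranks given to the same item.
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))


# Hypothetical rankings of 10 workplace values (NOT the original data).
employee_ranks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
manager_ranks = [3, 1, 6, 2, 9, 4, 10, 5, 8, 7]

rho = spearman_rho(employee_ranks, manager_ranks)
print(round(rho, 2))  # 0.6
```

    Identical rankings give +1, exactly reversed rankings give -1, and unrelated rankings give values near 0.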


    Chi-Square

    Test of Significance

    Nominal Level Measure

    Let's say you wanted to know if there was any significant difference

    between the production rates for departments which had trained supervisors

    versus those departments whose supervisors were not trained.

    Step 1

    Let's begin by collecting some data.

    9 departments produced above standard
    11 departments produced at standard
    10 departments produced below standard
    17 departments are supervised by trained supervisors
    13 departments are supervised by untrained supervisors

    Step 2

    From this data, you can construct the following cross-tabs table.

    Note: We have included the totals for rows and columns in our table.

    Step 3

    We must now construct a table which describes what would have happened if training did not impact production. (We call this table an Expectancy Table).

    A = 9 departments with above standard production

    B = 11 departments with standard production


    C = 10 departments with below standard production

    M = Number of departments with trained supervisors (17)

    N = Number of departments with untrained supervisors (13)

    T = Total number of supervisors (17 + 13 = 30)
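    Under the assumption that training has no effect, each expected cell is usually computed as (row total x column total) / grand total. The sketch below applies that standard rule to the totals A, B, C, M and N given above; the dictionary layout and key names are choices made for this example.

```python
# Column totals (A, B, C) and row totals (M, N) from the text.
column_totals = {"above": 9, "at": 11, "below": 10}
row_totals = {"trained": 17, "untrained": 13}
total = sum(row_totals.values())  # T = 30

# Expected count for each cell: row total * column total / grand total.
expected = {
    row: {
        col: row_total * col_total / total
        for col, col_total in column_totals.items()
    }
    for row, row_total in row_totals.items()
}

for row, cells in expected.items():
    print(row, {col: round(value, 2) for col, value in cells.items()})
# trained {'above': 5.1, 'at': 6.23, 'below': 5.67}
# untrained {'above': 3.9, 'at': 4.77, 'below': 4.33}
```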

    Step 4

    We are now ready to calculate the Chi-square using the following formula.

    $$\chi^2 = \sum \frac{(O - E)^2}{E}$$

    where O = the actual production records for each cell in the table

    where E = the expected production record for each cell in the table
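    The calculation sums (O - E)^2 / E over every cell of the table. The sketch below assumes that standard form and derives each E from the row and column totals. The observed counts are hypothetical: they match the row and column totals given in Step 1, but they are not the original cross-tabs table and are not meant to reproduce the Chi-square value quoted in Step 5.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # expected count for this cell
            statistic += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(col_totals) - 1)
    return statistic, df


# Hypothetical observed counts (rows: trained, untrained;
# columns: above, at, below standard). Row totals 17 and 13,
# column totals 9, 11 and 10, as in Step 1.
observed = [
    [8, 6, 3],
    [1, 5, 7],
]

statistic, df = chi_square(observed)
print(round(statistic, 3), df)
```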

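    Let me try it on the Internet v.1 - Chi Square: http://graphpad.com/quickcalcs/chisquared1.cfm

    Let me try it on the Internet v.2 - Contingency Table: http://www.physics.csbsju.edu/stats/contingency_NROW_NCOLUMN_form.html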

    Step 5

    A Chi-square by itself is not of much value. You have to use a Chi-square table that you can find in most statistics books. However, before you use the table, you must first determine the degrees of freedom of your table. To do this, blot out one row and one column. The remaining number of cells will be your degrees of freedom.

    In our example table, we can see that we have 2 cells that are not filled in, which means we have 2 degrees of freedom (df).

    Now, using a Chi-square table, find where the 8.416 falls on the line of numbers for df 2. (Online Chi Square Table: http://www.richland.cc.il.us/james/lecture/m170/tbl-chi.html)
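    For the special case of 2 degrees of freedom, the table lookup can be replaced by a one-line calculation, because the upper-tail probability of a Chi-square value x with df = 2 is exp(-x / 2). The sketch below relies on that closed form, so it applies only when df is 2; the function name is chosen for this example.

```python
from math import exp


def chi_square_p_value_df2(statistic):
    """Upper-tail probability of a Chi-square statistic with exactly 2 degrees of freedom."""
    if statistic < 0:
        raise ValueError("a Chi-square statistic cannot be negative")
    return exp(-statistic / 2)


rows, columns = 2, 3                 # trained/untrained by above/at/below
df = (rows - 1) * (columns - 1)      # blot out one row and one column
print(df)                            # 2

p = chi_square_p_value_df2(8.416)
print(round(p, 4))                   # 0.0149, which is below .02
```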


    What this tells you is that you can say with 98%+* confidence (probability) that supervisory training is responsible for increased production.

    Note: You can use Chi-square with cross-tab tables that have up to 30

    degrees of freedom. (The maximum for most Chi-square tables).

    __________________

    * .02 = 2%

    100% - 2% = 98%