Statistics Tutorial for First unit

Embed Size (px)

Citation preview

  • 8/7/2019 Statistics Tutorial for First unit

    1/29

    Statistics Tutorial: Measures of Central

    Tendency

    Researchers are often interested in defining a value that best describes some attribute of thepopulation. Often this attribute is a measure of central tendency or a proportion.

    Measures of Central Tendency

    Several different measures of central tendency are defined below.

    y The mode is the most frequently appearing value in the population or sample. Supposewe draw a sample of five women and measure their weights. They weigh 100 pounds,100 pounds, 130 pounds, 140 pounds, and 150 pounds. Since more women weigh 100

    pounds than any other weight, the mode would equal 100 pounds.

    y To find the median, we arrange the observations in order from smallest to largest value.If there is an odd number of observations, the median is the middle value. If there is aneven number of observations, the median is the average of the two middle values. Thus,

    in the sample of five women, the median value would be 130 pounds; since 130 pounds isthe middle weight.

    y The mean of a sample or a population is computed by adding all of the observations anddividing by the number of observations. Returning to the example of the five women, themean weight would equal (100 + 100 + 130 + 140 + 150)/5 = 620/5 = 124 pounds.

    Proportions and Percentages

    When the focus is on the degree to which a population possesses a particular attribute, themeasure of interest is a percentage or a proportion.

    y Aproportion refers to the fraction of the total that possesses a certain attribute. Forexample, we might ask what proportion of women in our sample weigh less than 135pounds. Since 3 women weigh less than 135 pounds, the proportion would be 3/5 or 0.60.

    y Apercentage is another way of expressing a proportion. A percentage is equal to theproportion times 100. In our example of the five women, the percent of the total who

    weigh less than 135 pounds would be 100 * (3/5) or 60 percent.

    Notation

    Of the various measures, the mean and the proportion are most important. The notation used to

    describe these measures appears below:

  • 8/7/2019 Statistics Tutorial for First unit

    2/29

    y X: Refers to a population mean.y x: Refers to a sample mean.y P: The proportion of elements in the population that has a particular attribute.y p: The proportion of elements in the sample that has a particular attribute.y Q: The proportion of elements in the population that does not have a specified attribute.

    Note that Q = 1 - P.

    y q: The proportion of elements in the sample that does not have a specified attribute. Notethat q = 1 - p.

    Arithmetic Mean

    The arithmetic mean is what is commonly called the average: When the word "mean" is used without a

    modifier, it can be assumed that it refers to the arithmetic mean. The mean is the sum of all the scores

    divided by the number of scores. The formula in summation notation is:

    = X/N

    where is the population mean and N is the number of scores.

    If the scores are from a sample, then the symbol M refers to the mean and N refers to the sample size.

    The formula for M is the same as the formula for

    .

    M = X/N

    The mean is a good measure ofcentral tendency for roughly symmetric distributions but can be

    misleading in skewed distributions since it can be greatly influenced by scores in the tail. Therefore,

    other statistics such as the median may be more informative for distributions such as reaction time or

    family income that are frequently very skewed

    Click here for an interactive demonstration of properties of the mean and median.

    The sum of squared deviations of scores from their mean is lower than their squared deviations from

    any other number.

    For normal distributions, the mean is the most efficient and therefore the least subject to sample

    fluctuations of all measures of central tendency.

  • 8/7/2019 Statistics Tutorial for First unit

    3/29

    The formal definition of the arithmetic mean is = E[X] where is the population mean of the variable X

    and E[X] is the expected value of X

    The median is the middle of a distribution: half the scores are above the median and half are below the

    median. The median is less sensitive to extreme scores than the mean and this makes it a better

    measure than the mean for highly skewed distributions. The median income is usually more informative

    than the mean income, for example.

    The sum of the absolute deviations of each number from the median is lower than is the sum of

    absolute deviations from any other number. Click here for an example.

    The mean, median, and mode are equal in symmetric distributions. The mean is typically higher than the

    median in positively skewed distributions and lower than the median in negatively skewed distributions,

    although this may not be the case in bimodal distributions. Click here for examples.

    Computation of Median

    When there is an odd number of numbers, the median is simply the middle number. For example, the

    median of 2, 4, and 7 is 4.

    When there is an even number of numbers, the median is the mean of the two middle numbers. Thus,

    the median of the numbers 2, 4, 7, 12 is (4+7)/2 = 5.5.

    Mode

    Next section: Trimean

    The mode is the most frequently occurring score in a distribution and is used as a measure of

    central tendency.The advantage of the mode as a measure of central tendency is that its meaning

    is obvious. Further, it is the only measure of central tendency that can be used with nominal data.

    The mode is greatly subject to sample fluctuations and is therefore not recommended to be used

    as the only measure of central tendency. A further disadvantage of the mode is that many

    distributions have more than one mode. These distributions are called "multi modal."

  • 8/7/2019 Statistics Tutorial for First unit

    4/29

    Descriptive Statistics

    Descriptive statistics are used to describe the basic features of the data in a study. They provide

    simple summaries about the sample and the measures. Together with simple graphics analysis,

    they form the basis of virtually every quantitative analysis of data.

    Descriptive statistics are typically distinguished from inferential statistics. With descriptive

    statistics you are simply describing what is or what the data shows. With inferential statistics,you are trying to reach conclusions that extend beyond the immediate data alone. For instance,

    we use inferential statistics to try to infer from the sample data what the population might think.Or, we use inferential statistics to make judgments of the probability that an observed difference

    between groups is a dependable one or one that might have happened by chance in this study.Thus, we use inferential statistics to make inferences from our data to more general conditions;

    we use descriptive statistics simply to describe what's going on in our data.

    Descriptive Statistics are used to present quantitative descriptions in a manageable form. In aresearch study we may have lots of measures. Or we may measure a large number of people onany measure. Descriptive statistics help us to simply large amounts of data in a sensible way.

    Each descriptive statistic reduces lots of data into a simpler summary. For instance, consider asimple number used to summarize how well a batter is performing in baseball, the batting

    average. This single number is simply the number of hits divided by the number of times at bat(reported to three significant digits). A batter who is hitting .333 is getting a hit one time in every

    three at bats. One batting .250 is hitting one time in four. The single number describes a largenumber of discrete events. Or, consider the scourge of many students, the Grade Point Average

    (GPA). This single number describes the general performance of a student across a potentiallywide range of course experiences.

    Every time you try to describe a large set of observations with a single indicator you run the risk

    of distorting the original data or losing important detail. The batting average doesn't tell youwhether the batter is hitting home runs or singles. It doesn't tell whether she's been in a slump or

    on a streak. The GPA doesn't tell you whether the student was in difficult courses or easy ones,or whether they were courses in their major field or in other disciplines. Even given these

    limitations, descriptive statistics provide a powerful summary that may enable comparisonsacross people or other units.

    Univariate Analysis

    Univariate analysis involves the examination across cases of one variable at a time. There arethree major characteristics of a single variable that we tend to look at:

    y the distributiony the central tendencyy the dispersion

  • 8/7/2019 Statistics Tutorial for First unit

    5/29

    In most situations, we would describe all three of these characteristics for each of the variables inour study.

    The Distribution.The distribution is a summary of the frequency of individual values or ranges

    of values for a variable. The simplest distribution would list every value of a variable and the

    number of persons who had each value. For instance, a typical way to describe the distribution ofcollege students is by year in college, listing the number or percent of students at each of the fouryears. Or, we describe gender by listing the number or percent of males and females. In these

    cases, the variable has few enough values that we can list each one and summarize how manysample cases had the value. But what do we do for a variable like income or GPA?With these

    variables there can be a large number of possible values, with relatively few people having eachone. In this case, we group the raw scores into categories according to ranges of values. For

    instance, we might look at GPA according to the letter grade ranges. Or, we might group incomeinto four or five ranges of income values.

    Table 1. Frequency distribution table.

    One of the most common ways to describe a single variable is with a frequency distribution.

    Depending on the particular variable, all of the data values may be represented, or you maygroup the values into categories first (e.g., with age, price, or temperature variables, it would

    usually not be sensible to determine the frequencies for each value. Rather, the value are groupedinto ranges and the frequencies determined.). Frequency distributions can be depicted in two

    ways, as a table or as a graph. Table 1 shows an age frequency distribution with five categoriesof age ranges defined. The same frequency distribution can be depicted in a graph as shown in

    Figure 2. This type of graph is often referred to as a histogram orbar chart.

  • 8/7/2019 Statistics Tutorial for First unit

    6/29

    Table 2. Frequency distribution bar chart.

    Distributions may also be displayed using percentages. For example, you could use percentages

    to describe the:

    y percentage of people in different income levelsy percentage of people in different age rangesy percentage of people in different ranges of standardized test scores

    Central Tendency.The central tendency of a distribution is an estimate of the "center" of a

    distribution of values. There are three major types of estimates of central tendency:

    y Meany Mediany Mode

    The Mean or average is probably the most commonly used method of describing central

    tendency. To compute the mean all you do is add up all the values and divide by the number ofvalues. For example, the mean or average quiz score is determined by summing all the scores

    and dividing by the number of students taking the exam. For example, consider the test scorevalues:

    15, 20, 21, 20, 36, 15, 25, 15

    The sum of these 8 values is 167, so the mean is 167/8 = 20.875.

    The Median is the score found at the exact middle of the set of values. One way to compute themedian is to list all scores in numerical order, and then locate the score in the center of the

    sample. For example, if there are 500 scores in the list, score #250 would be the median. If weorder the 8 scores shown above, we would get:

    15,15,15,20,20,21,25,36

  • 8/7/2019 Statistics Tutorial for First unit

    7/29

  • 8/7/2019 Statistics Tutorial for First unit

    8/29

    -5.875 * -5.875 = 34.515625+4.125 * +4.125 = 17.015625

    -5.875 * -5.875 = 34.515625

    Now, we take these "squares" and sum them to get the Sum of Squares (SS) value. Here, the sum

    is 350.875. Next, we divide this sum by the number of scores minus 1. Here, the result is 350.875/ 7 = 50.125. This value is known as the variance. To get the standard deviation, we take thesquare root of the variance (remember that we squared the deviations earlier). This would be

    SQRT(50.125) = 7.079901129253.

    Although this computation may seem convoluted, it's actually quite simple. To see this, considerthe formula for the standard deviation:

    In the top part of the ratio, the numerator, we see that each score has the the mean subtractedfrom it, the difference is squared, and the squares are summed. In the bottom part, we take the

    number of scores minus 1. The ratio is the variance and the square root is the standard deviation.In English, we can describe the standard deviation as:

    the square root of the sum of the squared deviations from the mean divided by the number

    of scores minus one

    Although we can calculate these univariate statistics by hand, it gets quite tedious when you havemore than a few values and variables. Every statistics program is capable of calculating themeasily for you. For instance, I put the eight scores into SPSS and got the following table as a

    result:

    N 8

    Mean 20.8750

  • 8/7/2019 Statistics Tutorial for First unit

    9/29

    Median 20.0000

    Mode 15.00

    Std. Deviation 7.0799

    Variance 50.1250

    Range 21.00

    which confirms the calculations I did by hand above.

    The standard deviation allows us to reach some conclusions about specific scores in ourdistribution. Assuming that the distribution of scores is normal or bell-shaped (or close to it!), the

    following conclusions can be reached:

    y approximately 68% of the scores in the sample fall within one standard deviation of themean

    y approximately 95% of the scores in the sample fall within two standard deviations of themean

    y approximately 99% of the scores in the sample fall within three standard deviations of themean

    For instance, since the mean in our example is 20.875 and the standard deviation is 7.0799, wecan from the above statement estimate that approximately 95% of the scores will fall in the range

    of 20.875-(2*7.0799) to 20.875+(2*7.0799) or between 6.7152 and 35.0348. This kind ofinformation is a critical stepping stone to enabling us to compare the performance of an

    individual on one variable with their performance on another, even when the variables aremeasured on entirely different scales.

    Standard Deviation

    Simple Example of Calculating Standard Deviation

    Let's say we wanted to calculate the standard deviation for the amounts of gold coins pirates on a

    pirate ship have.

    There are 100 pirates on the ship. In statistical terms this means we have a population of 100. If

    we know the amount of gold coins each of the 100 pirates have, we use the standard deviation

    equation for an entire population:

  • 8/7/2019 Statistics Tutorial for First unit

    10/29

  • 8/7/2019 Statistics Tutorial for First unit

    11/29

    2. Calculate for each value in the sample:

    3. Calculate :

    4. Calculate the standard deviation:

    The standard deviation for the amounts of gold coins the pirates have is 2.24 gold coins.

    Mean or average

    The simplest statistic is the mean or average. Years ago, when laboratories were beginning to

    assay controls, it was easy to calculate a mean and use that value as the "target" to be achieved.For example, given the following ten analyses of a control material - 90, 91, 89, 84, 88, 93, 80,

    90, 85, 87 - the mean orXbar is 877/10 or 87.7. [The term Xbarrefers to a symbol having a lineor bar over the X, , however, we will use the term instead of the symbol in the text of these

    lessons because it is easier to present.]

  • 8/7/2019 Statistics Tutorial for First unit

    12/29

    The mean value characterizes the "central tendency" or "location" of the data. Although the meanis the value most likely to be observed, many of the actual values are different than the mean.

    When assaying control materials, it is obvious that technologists will not achieve the mean valueeach and every time a control is analyzed. The values observed will show a dispersion or

    distribution about the mean, and this distribution needs to be characterized to set a range of

    acceptable control values.

    Standard deviation

    The dispersion of values about the mean is predictable and can be characterized mathematically

    through a series of manipulations, as illustrated below, where the individual x-values are shownin column A.

    Column A Column B Column C

    X value X value-Xbar (X-Xbar)2

    90 90 - 87.7 = 2.30 (2.30)2 = 5.29

    91 91 - 87.7 = 3.30 (3.30)2 = 10.89

    89 89 - 87.7 = 1.30 (1.30)2 = 1.69

    84 84 - 87.7 = -3.70 (-3.70)2 = 13.69

    88 88 - 87.7 = 0.30 (0.30)2 = 0.09

    93 93 - 87.7 = 5.30 (5.30)2 = 28.09

    80 80 - 87.7 = -7.70 (-7.70)2 = 59.29

    90 90 - 87.7 = 2.30 (2.30)2 = 5.29

    85 85 - 87.7 = -2.70 (-2.70)2 = 7.29

    87 87 - 87.7 = -0.70 (-0.70)2 = 0.49

    X = 877 (X-Xbar) = 0 (X-Xbar) = 132.10

    y The first mathematical manipulation is to sum ( ) the individual points and calculate the meanor average, which is 877 divided by 10, or 87.7 in this example.

    y The second manipulation is to subtract the mean value from each control value, as shown incolumn B. This term, shown as X value - Xbar, is called the difference score. As can be seen here,

  • 8/7/2019 Statistics Tutorial for First unit

    13/29

    individual difference scores can be positive or negative and the sum of the difference scores is

    always zero.

    y The third manipulation is to square the difference score to make all the terms positive, as shownin Column C.

    y Next the squared difference scores are summed.y Finally, the predictable dispersion or standard deviation (SD or s) can be calculated as follows:

    = [132.10/(10-1)]1/2 = 3.83

    Degrees of freedom

    The "n-1" term in the above expression represents the degrees of freedom (df). Loosely

    interpreted, the term "degrees of freedom" indicates how much freedom or independence there is

    within a group of numbers. For example, if you were to sum four numbers to get a total, youhave the freedom to select any numbers you like. However, if the sum of the four numbers isstipulated to be 92, the choice of the first 3 numbers is fairly free (as long as they are low

    numbers), but the last choice is restricted by the condition that the sum must equal 92. Forexample, if the first three numbers chosen at random are 28, 18, and 36, these numbers add up to

    82, which is 10 short of the goal. For the last number there is no freedom of choice. The number10 must be selected to make the sum come out to 92. Therefore, the degrees of freedom have

    been limited by 1 and only n-1 degrees of freedom remain. In the SD formula, the degrees offreedom are n minus 1 because the mean of the data has already been calculated (which imposes

    one condition or restriction on the data set).

    Variance

    Another statistical term that is related to the distribution is the variance, which is the standarddeviation squared (variance = SD ). The SD may be either positive or negative in value because

    it is calculated as a square root, which can be either positive or negative. By squaring the SD, theproblem of signs is eliminated. One common application of the variance is its use in the F-test to

    compare the variance of two methods and determine whether there is a statistically significantdifference in the imprecision between the methods.

    In many applications, however, the SD is often preferred because it is expressed in the same

    concentration units as the data. Using the SD, it is possible to predict the range of control values

    that should be observed if the method remains stable. As discussed in an earlier lesson,laboratorians often use the SD to impose "gates" on the expected normal distribution of controlvalues.

  • 8/7/2019 Statistics Tutorial for First unit

    14/29

    Normal or Gaussian distribution

    Traditionally,after the discussion of the mean, standard deviation, degrees of freedom, and variance, the next

    step was to describe the normal distribution (a frequency polygon) in terms of the standarddeviation "gates." The figure here is a representation of the frequency distribution of a large set

    of laboratory values obtained by measuring a single control material. This distribution shows theshape of a normal curve. Note that a "gate" consisting of 1SD accounts for 68% of the

    distribution or 68% of the area under the curve, 2SD accounts for 95% and 3SD accounts for>99%. At 2SD, 95% of the distribution is inside the "gates," 2.5% of the distribution is in the

    lower or left tail, and the same amount (2.5%) is present in the upper tail. Some authors call thispolygon an error curve to illustrate that small errors from the mean occur more frequently than

    large ones. Other authors refer to this curve as a probability distribution.

    Coefficientof variation

    Another way to describe the variation of a test is calculate the coefficient of variation, or CV.

    The CV expresses the variation as a percentage of the mean, and is calculated as follows:

    CV% = (SD/Xbar)100

    In the laboratory, the CV is preferred when the SD increases in proportion to concentration. Forexample, the data from a replication experiment may show an SD of 4 units at a concentration of

    100 units and an SD of 8 units at a concentration of 200 units. The CVs are 4.0% at both levels

    and the CV is more useful than the SD for describing method performance at concentrations inbetween. However, not all tests will demonstrate imprecision that is constant in terms of CV. Forsome tests, the SD may be constant over the analytical range.

    The CV also provides a general "feeling" about the performance of a method. CVs of 5% or less

    generally give us a feeling of good method performance, whereas CVs of 10% and higher soundbad. However, you should look carefully at the mean value before judging a CV. At very low

    concentrations, the CV may be high and at high concentrations the CV may be low. For example,

  • 8/7/2019 Statistics Tutorial for First unit

    15/29

    a bilirubin test with an SD of 0.1 mg/dL at a mean value of 0.5 mg/dL has a CV of 20%, whereasan SD of 1.0 mg/dL at a concentration of 20 mg/dL corresponds to a CV of 5.0%.

    Alternate formulae

    The lessons on Basic QC Practices cover these same terms (see QC - The data calculations), butuse a different form of the equation for calculating cumulative or lot-to-date means and SDs.Guidelines in the literature recommend that cumulative means and SDs be used in calculating

    control limits [2-4], therefore it is important to be able to perform these calculations.

    The cumulative mean can be expressed as Xbar = ( xi)t /nt, which appears similar to the priormean term except for the "t" subscripts, which refer to data from different time periods. The idea

    is to add the xi and n terms from groups of data in order to calculate the mean of the combined

    groups.The cumulative or lot-to-date standard deviation can be expressed as follows:

    This equation looks quite different from the prior equation in this lesson, but in reality, it isequivalent. The cumulative standard deviation formula is derived from an SD formula called the

    Raw Score Formula. Instead of first calculating the mean orXbar, the Raw Score Formulacalculates Xbar inside the square root sign.

    Oftentimes in reading about statistics, an unfamiliar formula may be presented. You shouldrealize that the mathematics in statistics is often redundant. Each procedure builds upon theprevious procedure. Formulae that seem to be different are derived from mathematical

    manipulations of standard expressions with which you are often already acquainted.

    AP* Statistics Tutorial: A Simple Regression

    Example

    In this lesson, we apply regression analysis to some fictitious data, and we show how to interpret

    the results of our analysis.

    Note: Regression computations are usually handled by a software package or a graphing

    calculator. For this example, however, we will do the computations "manually", since the gorydetails have educational value.

    Problem Statement

  • 8/7/2019 Statistics Tutorial for First unit

    16/29

    Last year, five randomly selected students took a math aptitude test before they began theirstatistics course. The Statistics Department has three questions.

    y What linear regression equation best predicts statistics performance, based on mathaptitude scores?

    yIf a student made an 80 on the aptitude test, what grade would we expect her to make instatistics?

    y How well does the regression equation fit the data?How to Find the Regression Equation

    In the table below, the xi column shows scores on the aptitude test. Similarly, the yi columnshows statistics grades. The last two rows show sums and mean scores that we will use to

    conduct the regression analysis.

    Student xi yi (xi - x) (yi - y) (xi - x)2

    (yi - y)2

    (xi - x)(yi - y)

    1 95 85 17 8 289 64 136

    2 85 95 7 18 49 324 126

    3 80 70 2 -7 4 49 -14

    4 70 65 -8 -12 64 144 96

    5 60 70 -18 -7 324 49 126

    Sum 390 385 730 630 470

    Mean 78 77

    The regression equation is a linear equation of the form: = b0 + b1x . To conduct a regression

    analysis, we need to solve for b0 and b1. Computations are shown below.

    b1 = [ (xi - x)(yi - y) ] / [ (xi - x)2]

    b1 = 470/730 = 0.644b0 = y - b1 * x

    b0 = 77 - (0.644)(78) = 26.768

    Therefore, the regression equation is: = 26.768 + 0.644x .

    How to Use the Regression Equation

    Once you have the regression equation, using it is a snap. Choose a value for the independentvariable (x), perform the computation, and you have an estimated value () for the dependentvariable.

    In our example, the independent variable is the student's score on the aptitude test. The

    dependent variable is the student's statistics grade. If a student made an 80 on the aptitude test,the estimated statistics grade would be:

  • 8/7/2019 Statistics Tutorial for First unit

    17/29

    = 26.768 + 0.644x = 26.768 + 0.644 * 80 = 26.768 + 51.52 = 78.288

    Warning: When you use a regression equation, do not use values for the independent variablethat are outside the range of values used to create the equation. That is called extrapolation, and

    it can produce unreasonable estimates.

    In this example, the aptitude test scores used to create the regression equation ranged from 60 to95. Therefore, only use values inside that range to estimate statistics grades. Using values outside

    that range (less than 60 or greater than 95) is problematic.

    How to Find the Coefficient ofDetermination

    Whenever you use a regression equation, you should ask how well the equation fits the data. One

    way to assess fit is to check the coefficient of determination, which can be computed from thefollowing formula.

    R2 = { ( 1 / N ) * [ (x i - x) * (yi - y) ] / (x * y ) }2

    where N is the number of observations used to fit the model, is the summation symbol, xi is thex value for observation i, x is the mean x value, yi is the y value for observation i, y is the mean y

    value, x is the standard deviation of x, and y is the standard deviation of y. Computations forthe sample problem of this lesson are shown below.

    x = sqrt [ ( xi - x )2

    / N ]x = sqrt( 730/5 ) = sqrt(146) = 12.083

    y = sqrt [ ( yi - y )2

    / N ]y = sqrt( 630/5 ) = sqrt(126) = 11.225

    R2 = { ( 1 / N ) * [ (x i - x) * (yi - y) ] / (x * y ) }2R

    2= [ ( 1/5 ) * 470 / ( 12.083 * 11.225 ) ]

    2= ( 94 / 135.632 )

    2= ( 0.693 )

    2= 0.48

    A coefficient of determination equal to 0.48 indicates that about 48% of the variation in statistics

    grades (the dependent variable) can be explained by the relationship to math aptitude scores (the

    independent variable). This would be considered a good fit to the data, in the sense that it wouldsubstantially improve an educator's ability to predict student performance in statistics class.

    Regression Definition:A regression is a statistical analysis assessing the association between two variables. It is used

    to find the relationship between two variables.

    Regression Formula:Regression Equation(y) = a + bxSlope(b) = (NXY - (X)(Y)) / (NX

    2- (X)

    2)

    Intercept(a) = (Y - b(X)) / N

    where

  • 8/7/2019 Statistics Tutorial for First unit

    18/29

    x and y are the variables.

    b = The slope of the regression linea = The intercept point of the regression line and the y axis.

    N = Number of values or elementsX = First Score

    Y = Second ScoreXY = Sum of the product of first and Second Scores

    X = Sum of First ScoresY = Sum of Second Scores

    X2 = Sum of square First Scores

    Regression Example: To find the Simple/Linear Regression of

    X Values Y Values

    60 3.161 3.6

    62 3.8

    63 4

    65 4.1

    To find regression equation, we will first find slope, intercept and use it to form regression

    equation..

    Step 1: Count the number of values.

    N = 5

    Step 2: Find XY, X2

    See the below table

    X Value Y Value X*Y X*X

    60 3.1 60 * 3.1 = 186 60 * 60 = 3600

    61 3.6 61 * 3.6 = 219.6 61 * 61 = 3721

    62 3.8 62 * 3.8 = 235.6 62 * 62 = 3844

    63 4 63 * 4 = 252 63 * 63 = 396965 4.1 65 * 4.1 = 266.5 65 * 65 = 4225

    Step 3: Find X, Y, XY, X2.

    X = 311Y = 18.6

    XY = 1159.7

  • 8/7/2019 Statistics Tutorial for First unit

    19/29

    X2

    = 19359

    Step 4: Substitute in the above slope formula given.

    Slope(b) = (NXY - (X)(Y)) / (NX2

    - (X)2)

    = ((5)*(1159.7)-(311)*(18.6))/((5)*(19359)-(311)2)

    = (5798.5 - 5784.6)/(96795 - 96721)= 13.9/74

    = 0.19

    Step 5: Now, again substitute in the above intercept formula given.Intercept(a) = (Y - b(X)) / N

    = (18.6 - 0.19(311))/5= (18.6 - 59.09)/5

    = -40.49/5= -8.098

    Step 6: Then substitute these values in regression equation formulaRegression Equation(y) = a + bx= -8.098 + 0.19x.

    Suppose if we want to know the approximate y value for the variable x = 64. Then we can

    substitute the value in the above equation.

    Regression Equation(y) = a + bx= -8.098 + 0.19(64).

    = -8.098 + 12.16= 4.06

    This example will guide you to find the relationship between two variables by calculating the

    Regression from the above steps.

    Correlation

    The correlation is one of the most common and most useful statistics. A correlation is a single

    number that describes the degree of relationship between two variables. Let's work through anexample to show you how this statistic is computed.

    Correlation Example

    Let's assume that we want to look at the relationship between two variables, height (in inches)and self esteem. Perhaps we have a hypothesis that how tall you are effects your self esteem

    (incidentally, I don't think we have to worry about the direction of causality here -- it's not likelythat self esteem causes your height!). Let's say we collect some information on twenty

    individuals (all male -- we know that the average height differs for males and females so, to keep

  • 8/7/2019 Statistics Tutorial for First unit

    20/29

    this example simple we'll just use males). Height is measured in inches. Self esteem is measuredbased on the average of 10 1-to-5 rating items (where higher scores mean higher self esteem).

    Here's the data for the 20 cases (don't take this too seriously -- I made this data up to illustratewhat a correlation is):

    Person Height Self Esteem

    1 68 4.1

    2 71 4.6

    3 62 3.8

    4 75 4.4

    5 58 3.2

    6 60 3.1

    7 67 3.8

    8 68 4.1

    9 71 4.3

    10 69 3.7

    11 68 3.5

    12 67 3.2

    13 63 3.7

    14 62 3.3

    15 60 3.4

    16

    63

    4.0

    17 65 4.1

    18 67 3.8

    19 63 3.4

  • 8/7/2019 Statistics Tutorial for First unit

    21/29

    20 61 3.6

    Now, let's take a quick look at the histogram for each variable:

    And, here are the descriptive statistics:

    Variable Mean StDev Variance Sum Minimum Maximum Range

    Height 65.4 4.40574 19.4105 1308 58 75 17

    Self 3.755 0.426090 0.181553 75.1 3.1 4.6 1.5

  • 8/7/2019 Statistics Tutorial for First unit

    22/29

    Esteem

    Finally, we'll look at the simple bivariate (i.e., two-variable) plot:

    You should immediately see in the bivariate plot that the relationship between the variables is a

    positive one (if you can't see that, review the section on types of relationships) because if youwere to fit a single straight line through the dots it would have a positive slope or move up fromleft to right. Since the correlation is nothing more than a quantitative estimate of the relationship,

    we would expect a positive correlation.

    What does a "positive relationship" mean in this context? It means that, in general, higher scoreson one variable tend to be paired with higher scores on the other and that lower scores on one

    variable tend to be paired with lower scores on the other. You should confirm visually that this isgenerally true in the plot above.

    Calculating the Correlation

    Now we're ready to compute the correlation value. The formula for the correlation is:

  • 8/7/2019 Statistics Tutorial for First unit

    23/29

    We use the symbol r to stand for the correlation. Through the magic of mathematics it turns out

    that r will always be between -1.0 and +1.0. if the correlation is negative, we have a negativerelationship; if it's positive, the relationship is positive. You don't need to know how we came up

    with this formula unless you want to be a statistician. But you probably will need to know howthe formula relates to real data -- how you can use the formula to compute the correlation. Let's

    look at the data we need for the formula. Here's the original data with the other necessarycolumns:

    Person Height (x) Self Esteem (y) x*y x*x y*y

    1 68 4.1 278.8 4624 16.81

    2 71 4.6 326.6 5041 21.16

    3 62 3.8 235.6 3844 14.44

    4 75 4.4 330 5625 19.36

    5 58 3.2 185.6 3364 10.24

    6 60 3.1 186 3600 9.61

    7 67 3.8 254.6 4489 14.44

    8 68 4.1 278.8 4624 16.81

    9 71 4.3 305.3 5041 18.49

    10 69 3.7 255.3 4761 13.69

  • 8/7/2019 Statistics Tutorial for First unit

    24/29

    11 68 3.5 238 4624 12.25

    12 67 3.2 214.4 4489 10.24

    13 63 3.7 233.1 3969 13.69

    14 62 3.3 204.6 3844 10.89

    15 60 3.4 204 3600 11.56

    16 63 4 252 3969 16

    17 65 4.1 266.5 4225 16.81

    18 67 3.8 254.6 4489 14.44

    19 63 3.4 214.2 3969 11.56

    20 61 3.6 219.6 3721 12.96

    Sum = 1308 75.1 4937.6 85912 285.45

    The first three columns are the same as in the table above. The next three columns are simple

    computations based on the height and self esteem data. The bottom row consists of the sum of

    each column.T

    his is all the information we need to compute the correlation. Here are the valuesfrom the bottom row of the table (where N is 20 people) as they are related to the symbols in theformula:

    Now, when we plug these values into the formula given above, we get the following (I show ithere tediously, one step at a time):

  • 8/7/2019 Statistics Tutorial for First unit

    25/29

    So, the correlation for our twenty cases is .73, which is a fairly strong positive relationship. I

    guess there is a relationship between height and self esteem, at least in this made up data!

    Testing the Significance of a Correlation

    Once you've computed a correlation, you can determine the probability that the observedcorrelation occurred by chance. That is, you can conduct a significance test. Most often you are

    interested in determining the probability that the correlation is a real one and not a chanceoccurrence. In this case, you are testing the mutually exclusive hypotheses:

    Null Hypothesis: r = 0

    Alternative Hypothesis: r 0

    The easiest way to test this hypothesis is to find a statistics book that has a table of critical values

    of r. Most introductory statistics texts would have a table like this. As in all hypothesis testing,

    you need to first determine the significance level. Here, I'll use the common significance level ofalpha = .05. This means that I am conducting a test where the odds that the correlation is a

  • 8/7/2019 Statistics Tutorial for First unit

    26/29

    chance occurrence is no more than 5 out of 100. Before I look up the critical value in a table Ialso have to compute the degrees of freedom or df. The df is simply equal to N-2 or, in this

    example, is 20-2 = 18. Finally, I have to decide whether I am doing a one-tailed ortwo-tailedtest. In this example, since I have no strong prior theory to suggest whether the relationship

    between height and self esteem would be positive or negative, I'll opt for the two-tailed test.

    With these three pieces of information -- the significance level (alpha = .05)), degrees of freedom(df = 18), and type of test (two-tailed) -- I can now test the significance of the correlation Ifound. When I look up this value in the handy little table at the back of my statistics book I find

    that the critical value is .4438. This means that if my correlation is greater than .4438 or less than-.4438 (remember, this is a two-tailed test) I can conclude that the odds are less than 5 out of 100

    that this is a chance occurrence. Since my correlation 0f .73 is actually quite a bit higher, Iconclude that it is not a chance finding and that the correlation is "statistically significant" (given

    the parameters of the test). I can reject the null hypothesis and accept the alternative.

    The Correlation Matrix

    All I've shown you so far is how to compute a correlation between two variables. In most studieswe have considerably more than two variables. Let's say we have a study with 10 interval-levelvariables and we want to estimate the relationships among all of them (i.e., between all possible

    pairs of variables). In this instance, we have 45 unique correlations to estimate (more later onhow I knew that!). We could do the above computations 45 times to obtain the correlations. Or

    we could use just about any statistics program to automatically compute all 45 with a simpleclick of the mouse.

    I used a simple statistics program to generate random data for 10 variables with 20 cases (i.e.,

    persons) for each variable. Then, I told the program to compute the correlations among thesevariables. Here's the result:

    C1 C2 C3 C4 C5 C6 C7 C8

    C9 C10

    C1 1.000

    C2 0.274 1.000

    C3 -0.134 -0.269 1.000

    C4 0.201 -0.153 0.075 1.000

    C5 -0.129 -0.166 0.278 -0.011 1.000

    C6 -0.095 0.280 -0.348 -0.378 -0.009 1.000

    C7 0.171 -0.122 0.288 0.086 0.193 0.002 1.000

    C8 0.219 0.242 -0.380 -0.227 -0.551 0.324 -0.082 1.000

    C9 0.518 0.238 0.002 0.082 -0.015 0.304 0.347 -0.013

    1.000

    C10 0.299 0.568 0.165 -0.122 -0.106 -0.169 0.243 0.014

    0.352 1.000This type of table is called a correlation matrix. It lists the variable names (C1-C10) down the

    first column and across the first row. The diagonal of a correlation matrix (i.e., the numbers thatgo from the upper left corner to the lower right) always consists of ones. That's because these are

    the correlations between each variable and itself (and a variable is always perfectly correlatedwith itself). This statistical program only shows the lower triangle of the correlation matrix. In

  • 8/7/2019 Statistics Tutorial for First unit

    27/29

    every correlation matrix there are two triangles that are the values below and to the left of thediagonal (lower triangle) and above and to the right of the diagonal (upper triangle). There is no

    reason to print both triangles because the two triangles of a correlation matrix are always mirrorimages of each other (the correlation of variable x with variable y is always equal to the

    correlation of variable y with variable x). When a matrix has this mirror-image quality above and

    below the diagonal we refer to it as a symmetr

    ic

    matr

    ix

    . A correlation matrix is always asymmetric matrix.

    To locate the correlation for any pair of variables, find the value in the table for the row andcolumn intersection for those two variables. For instance, to find the correlation between

    variables C5 and C2, I look for where row C2 and column C5 is (in this case it's blank because itfalls in the upper triangle area) and where row C5 and column C2 is and, in the second case, I

    find that the correlation is -.166.

    OK, so how did I know that there are 45 unique correlations when we have 10 variables?There'sa handy simple little formula that tells how many pairs (e.g., correlations) there are for any

    number of variables:

    where N is the number of variables. In the example, I had 10 variables, so I know I have (10 *9)/2 = 90/2 = 45 pairs.

    Other Correlations

    The specific type of correlation I've illustrated here is known as the Pearson Product MomentCorrelation. It is appropriate when both variables are measured at an interval level. Howeverthere are a wide variety of other types of correlations for other circumstances. for instance, if you

    have two ordinal variables, you could use the Spearman rank Order Correlation (rho) or theKendall rank order Correlation (tau). When one measure is a continuous interval level one and

    the other is dichotomous (i.e., two-category) you can use the Point-Biserial Correlation. Forother situations, consulting the web-based statistics selection program, SelectingStatistics at

    http://trochim.human.cornell.edu/selstat/ssstart.htm.

  • 8/7/2019 Statistics Tutorial for First unit

    28/29

    SECTION-A

    all questions carry 8 marks

    Answer any FIVE Questions

    1. What are the methods of classification of data?

    2. Calculate arithmetic mean of the weight of ten students in a class.

    Weight (in kgs.): 42 56 49 50 49 53 52 48 47 54

    3. Calculate median marks from the following data.

    35 46 29 57 60 73 20 59 50 70

    4. Calculate index number by Kelly's method.

    2001 2002

    Commodity Price Quantity Price Quantity

    (in Rs.) (in Rs.)

    A 10 20 12 16

    B 15 10 18 12

    C 12 15 15 9

    D 25 5 20 11

    E 30 3 28 5

    F 20 10 15 8

    5. Calculate quartile deviation from the following data.

    25 15 30 45 40 20 50

    6. From the following data calculate Karl Pearson's coefficient of skewness.

    40 36 42 53 20 65 67 20 36

    7. Calculate Karl Pearson's coefficient of correlation.

    X : 25 10 20 32 50 28 26 14 12 13

    Y : 15 12 19 26 23 22 13 10 15 5

    8. Fit trend line through semi average method.

    Year : 1985 86 87 88 89 90 91 92

    Income: 45 49 54 60 80 96 100 120

    SECTION-B

    all qestions carry 20 marks

    Answer any THREE Questions

  • 8/7/2019 Statistics Tutorial for First unit

    29/29

    9. Find out the literates by graphic method for the year 1982.

    Year : 1965 70 75 80 85 90

    No. of literates : 47 59 63 78 92 112

    10. A coin is tossed 7 times. find the probability of obtaining (a) 6 heads (b) 7 heads and (c) 5 or more

    heads

    11. Solve the following transportation problem.

    From To

    A B C Available

    I 50 30 220 1

    II 90 45 170 3

    III 250 200 50 4

    Requirement 4 2 2

    12. Calculate geometric mean and harmonic mean fron the data given below.

    42 57 115 127 39 55

    13. State the various methods of ascertaining the regression equation?