Introduction to Data & Statistics Module 10 Sept. 3, 2014


Agenda
Stats Lecture
1) Univariate analysis (looking at one variable): central tendencies and variability (dispersion)
2) Bivariate analysis (comparing two variables): correlation, t-test, chi-square association
3) Additional context for stats assignment in group project
Applications in SPSS (handout)
Discussion

Where do we start? Univariate Analyses
- Need to make sure all our variables (e.g. scores on a scale, income figures, gender, ethnicity) are behaving appropriately for statistical testing.
- Each must have some variability (e.g. if all participants are women, there is no variability, so we cannot analyze outcomes based on gender).
- Need to check how much variability each variable has and what its typical values are.
- For example, a typical value may be its average or mean value.
- These analyses are called univariate analyses.
- Univariate analysis involves the examination across cases of one variable at a time.

Summarizing Univariate Distributions
Any set of measurements that summarizes a variable should have two important properties:
1. The Central Tendency (or typical value): mode, median, mean
2. The Spread (variability or dispersion) about that value: range, variance, standard deviation
(That is, how does each data value differ from the mean or median value?)

Example of central tendency and variation
(Presenter note: use depression scale as example.)
- Assume mean = 5.0
- Each point varies around the mean.
- This variation contributes to the overall standard deviation (SD).
- More on standard deviations later.

Measures of Central Tendency
- An estimate of the center of a distribution of values; how similar our data are.
- The means to determine what is most typical, common, and routine.
- Central tendency is usually summarized with one of three statistics:
1) Mode
2) Median
3) Mean

Measures of Central Tendency 1: The Mode
- The mode is the most frequent value in a distribution. It is the least often used of the three measures because it can easily give a misleading impression. Mnemonic: mode = most.
- If two values tie for most frequent, the distribution is called bimodal.
- Can be used for all four levels of measurement (for nominal data, it is just the most common response, e.g. female or male in a study).
- May not be effective in describing what is typical in the distribution of a variable.

Measures of Central Tendency 1: The Mode Example
What is the most frequent value?

28, 31, 38, 39, 42, 42, 42, 42, 43, 47, 51, 51, 54, 55, 56, 56, 58, 59, 59, 59

(this listing of the data set is called an array)
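The mode of the array above can be checked with a few lines of Python. This is only a sketch using the standard library; the variable names are made up for illustration.

```python
from collections import Counter
from statistics import multimode

# The array from the slide
scores = [28, 31, 38, 39, 42, 42, 42, 42, 43, 47,
          51, 51, 54, 55, 56, 56, 58, 59, 59, 59]

# Count how often each value occurs
counts = Counter(scores)

# multimode returns every value tied for most frequent,
# so a bimodal array would give two values here
modes = multimode(scores)

print("Mode(s):", modes)
print("Three most frequent values:", counts.most_common(3))
```

The value 42 occurs four times and 59 occurs three times, so this array has a single mode.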

[Figure: Where is the mode in each of these distributions?]

Measures of Central Tendency 2: The Median
- The median is the point that divides the distribution in half; the midpoint of a set of numbers.
- To find the median value of a data set, arrange the data in order from smallest to largest.
- Requires at least an ordinal level of measurement. Why?
- Unlike the mode, the median does not always coincide with an actual value in the set (unless the set has an odd number of values).

Measures of Central Tendency 2: The Median Example
2, 2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20
19 points; the 10th one is the median.
Median = 9

If the number of points is even then average the two values around the middle (n = 18):

2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20

(9 + 10) / 2 = 9.5 = Median

Measures of Central Tendency 3: The Mean
- The mean, or statistical average, takes into account the values of each case in the distribution.
- It is the sum of all of the values divided by the total number of values.
- Requires interval or ratio level measurements (e.g., weight, age, miles driven).
- Should not be computed for ordinal level data. Why?
- The mean can promote accuracy or distortion depending on whether the distribution is symmetrical or skewed.

Measures of Central Tendency 3: The Mean Example
2, 2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20

ANSWER:

Mean = SUM of all values / N
2+2+3+3+4+5+5+7+8+9+10+11+11+14+14+15+16+18+20 = 177
Total N = 19
177 / 19 = 9.32
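The median and mean examples above can be reproduced with Python's standard `statistics` module. This is a minimal sketch; the names `odd_set` and `even_set` are illustrative only.

```python
from statistics import mean, median

# 19 values (odd count): the median is the 10th value in sorted order
odd_set = [2, 2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20]

# 18 values (even count): the median is the average of the two middle values
even_set = [2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20]

median_odd = median(odd_set)
median_even = median(even_set)

# Mean = sum of all values / N
total = sum(odd_set)
n = len(odd_set)
mean_odd = mean(odd_set)

print("Median (n = 19):", median_odd)
print("Median (n = 18):", median_even)
print("Sum:", total, "N:", n, "Mean:", round(mean_odd, 2))
```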

What is the Normal Distribution?

It looks like a bell with one hump in the middle, centered around the population mean, with the number of cases (data) tapering off to both sides of the mean; the symmetrical distribution of scores around the mean.

Variables behaving well
Normal Distribution (aka Bell Curve): where are the mean, median, and mode?
In a perfect normal distribution, the mean, median, and mode are equal!
[Figure: bell curve labeled Mode, Median, Mean]

Means and variances are the best measures for symmetric or normal distributions
Describe by using:
- Arithmetic MEAN
- VARIANCE (standard deviation)
Secondarily:
- Range
- Mode (most common value)
- Skew (left or right)
- Kurtosis (thickness of tails)

Normal Distribution - Skewness
- Skewness is used in describing abnormal distributions.
- In a normal curve, the right and left halves of the curve are mirror images of each other. If this is not the case, the curve is said to be skewed, either positively (to the right) or negatively (to the left).
- If the scores tend to be concentrated toward the high end of the score scale, the curve is negatively skewed.
- If they are concentrated toward the low end of the score scale, the curve is positively skewed.

Skewness is measured from -3.0 to + 3.0

A skew score of 0 indicates a symmetrical distribution.
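One way to put a number on skew is sketched below. It assumes the score meant here is Pearson's second skewness coefficient, 3 × (mean − median) / SD, which always falls between −3.0 and +3.0; the slides do not say which skewness statistic they use, and software such as SPSS may report a different one. The function name is made up for illustration.

```python
from statistics import mean, median, pstdev

def pearson_median_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / SD.

    Assumption: this is the skew score the slide refers to.
    Uses the population SD (dividing by N).
    """
    return 3 * (mean(values) - median(values)) / pstdev(values)

symmetric = [1, 2, 3, 4, 5]          # mirror image around 3
high_outlier = [1, 1, 1, 1, 50]      # scores bunched at the low end
low_outlier = [1, 50, 50, 50, 50]    # scores bunched at the high end

skew_symmetric = pearson_median_skewness(symmetric)
skew_positive = pearson_median_skewness(high_outlier)
skew_negative = pearson_median_skewness(low_outlier)

print("Symmetric:", skew_symmetric)
print("Concentrated at low end (positive skew):", skew_positive)
print("Concentrated at high end (negative skew):", skew_negative)
```

The sign follows the rule on the slide: scores concentrated toward the low end give a positive score, and scores concentrated toward the high end give a negative one.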


Example. Means and standard deviations for all study variables

Variable                            Mean    Std. Deviation   N
SF-36 Scale                         80.47   20.37            257
Number of people in household       2.81    1.35             257
Number of hours housework (sqrt)    29.68   10.31            257
Financial stress scale              5.07    1.95             257

[Figure labels: Mean = 50, Mean = 80]
(Presenter note: use CES-D here for mean and positive skewness.)

The Outlier Effect
- Outlier: a result that is far different from most of the results for the group; extreme value(s) that can skew the overall results.
- The median and mode are not sensitive to outliers. That is, they tend not to change with outliers.
- The mean is sensitive to outliers. The mean can change greatly with outliers.

Array              Mean   Median   Mode
1, 1, 1, 1, 50     10.8   1        1
1, 1, 1, 1, 100    20.8   1        1

(Presenter note: use test as example.)

To Address Outliers in Mean Calculations
- Trimmed mean: do not use the top and bottom five percent of scores.
- In this example, we have 20 values. The lowest and highest values make up the lowest 5% and highest 5% of the list.
2 40 45 46 52 52 55 59 60 61 61 63 64 66 66 66 67 69 70 259
Mean for n = 20 is 66.2; trimmed mean for n = 18 is 59.0.

Which measure of central tendency should we use?
Both the median and mean are used to summarize the central tendency of quantitative variables. To decide which to use, consider these issues:
1. Level of measurement:
- the median can be used with ordinal level data (often used in scales);
- the mean requires interval or ratio level data;
- the mode should be used for nominal level data. (Think Yes=1 and No=0 data. What would 0.36 mean? And 0.72?)

Which measure of central tendency should we use? (continued)
2. The shape of the distribution:
- the median should be used when the data are skewed or have many outliers;
- the mean should be used when the data are fairly bell shaped or normal.
Tip: Use the mean when the mean and median are very similar.

Mean or Median?
- Shape of the variable's distribution: The mean and median will be the same when the distribution is perfectly symmetric. When the distribution is not symmetric, the mean is pulled in the direction of extreme values, but the median is not affected in any way by extreme values.
- Purpose of the statistical summary: If the purpose is to report the middle position, then the median is the appropriate statistic. If the purpose is to report a mathematical average, the mean is the appropriate statistic.

Normal distributions: means and medians are very close
The arithmetic MEAN (average value) is nearly the same as the MEDIAN (50th percentile, or the value where half of the ranked data points lie above and half below).
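The outlier effect and the trimmed mean discussed above can be checked numerically. The sketch below uses the two small arrays and the 20 scores from the earlier slides; `summarize` and `trimmed_mean` are hypothetical helper names, and the trimming rule (drop the lowest and highest 5% of values) follows the slide's description rather than any particular software package.

```python
from statistics import mean, median, mode

def summarize(values):
    """Return (mean, median, mode) for a list of numbers."""
    return mean(values), median(values), mode(values)

def trimmed_mean(values, percent=5):
    """Mean after dropping the lowest and highest `percent` of the values."""
    ordered = sorted(values)
    k = len(ordered) * percent // 100  # how many values to drop from each end
    return mean(ordered[k:len(ordered) - k])

# Changing the outlier from 50 to 100 moves the mean but not the median or mode
small_outlier = summarize([1, 1, 1, 1, 50])
large_outlier = summarize([1, 1, 1, 1, 100])

# The 20 scores from the trimmed-mean slide
scores = [2, 40, 45, 46, 52, 52, 55, 59, 60, 61,
          61, 63, 64, 66, 66, 66, 67, 69, 70, 259]
full_mean = mean(scores)
trimmed = trimmed_mean(scores)  # drops the 2 and the 259, leaving n = 18

print("With outlier 50  (mean, median, mode):", small_outlier)
print("With outlier 100 (mean, median, mode):", large_outlier)
print("Mean of all 20 scores:", full_mean)
print("Trimmed mean of the middle 18 scores:", trimmed)
```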

Measures of Variability (Variation/Dispersion)
- How different the data are from each other; reported by how the scores fall around the mean.
- For nominal data, simply look at how many cases are in each category.
- For the rest, variability captures how widely and densely spread a variable's distribution is.

(Presenter note: start CES-D scale example.)

Measures of Variability
Variability is usually summarized with one of four statistics:
1) The Percent of responses in each category (nominal data)
2) The Range (ordinal and higher)
3) The Variance (interval and ratio)
4) The Standard Deviation (interval and ratio)

Measures of Variability 1: Percentage & Range
- For nominal data, simply report the percentage in each category (51% female, 22% social workers).
- For ordinal, interval & ratio data, the range is based on the difference between the highest value in a distribution and the lowest value. It can be drastically altered by an extreme value (an outlier).
- Here it is calculated as the maximum value minus the minimum value + 1.
Example:
2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20
Range is 20 - 2 + 1 = 19
2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 100
Range is 100 - 2 + 1 = 99 (outlier effect)
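The range calculation above can be sketched in a few lines. The slide adds 1 to the difference (an inclusive range); many textbooks and software packages report the plain difference instead, so both are shown. The function names are hypothetical.

```python
def simple_range(values):
    """Highest value minus lowest value."""
    return max(values) - min(values)

def inclusive_range(values):
    """Highest value minus lowest value, plus 1 (the convention on the slide)."""
    return max(values) - min(values) + 1

typical = [2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20]
with_outlier = [2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 100]

print("Inclusive range:", inclusive_range(typical))
print("Inclusive range with outlier:", inclusive_range(with_outlier))
print("Simple range:", simple_range(typical))
```

A single extreme value is enough to change the range from 19 to 99, which is why the range is best reported alongside other measures of spread.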


Measures of Variability 2: Variance
- The variance is the average of the squared differences from the mean. It takes into account all the scores to determine the spread.
- To calculate the variance, follow these steps:
1) Work out the mean (the simple average of the numbers).
2) For each number, subtract the mean and then square the result (the squared difference).
3) Work out the average of those squared differences.
(Presenter note: in a normal distribution, about 68% of values fall within one SD of the mean and about 95% within two. The SD is in the same units as the original measurement.)

[Figure: example of central tendency and variation, with calculations in an Excel table]

Variance Example

You and your friends have just measured the heights of your dogs.
The heights are: 600mm, 470mm, 170mm, 430mm and 300mm.

1. Find the Mean: Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394

2. Calculate each dog's difference from the Mean: 600 - 394 = 206, 470 - 394 = 76, 170 - 394 = -224, 430 - 394 = 36, 300 - 394 = -94.

3. To calculate the Variance, take each difference, square it, and then average the result:
Variance: σ² = (206² + 76² + (-224)² + 36² + (-94)²) / 5 = 108,520 / 5 = 21,704

Measures of Variability 3: Standard Deviation
- Standard deviation is the square root of the variance: σ = √(variance)
- SD tells us to what degree the values cluster around the mean.

Standard Deviation: σ = √21,704 = 147.32... ≈ 147
Now we can show which heights are within one Standard Deviation (147mm) of the Mean:
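The dog-height calculation can be reproduced with Python's standard `statistics` module, including the check of which heights lie within one SD of the mean. This is a minimal sketch. It uses `pvariance` and `pstdev`, which divide by N as the slide does; `variance` and `stdev` in the same module divide by N - 1 and would give larger values.

```python
from statistics import mean, pstdev, pvariance

heights_mm = [600, 470, 170, 430, 300]

# Step 1: the mean
avg = mean(heights_mm)

# Step 2: each dog's difference from the mean
differences = [h - avg for h in heights_mm]

# Step 3: the variance is the average of the squared differences
squared = [d ** 2 for d in differences]
variance_by_hand = sum(squared) / len(squared)

# The same results from the library (population versions, dividing by N)
variance = pvariance(heights_mm)
sd = pstdev(heights_mm)

# Heights within one standard deviation of the mean
within_one_sd = [h for h in heights_mm if abs(h - avg) <= sd]

print("Mean:", avg)
print("Differences:", differences)
print("Variance:", variance, "SD:", round(sd, 2))
print("Within one SD of the mean:", within_one_sd)
```

With these five heights, the 600mm and 170mm dogs fall outside one SD of the mean and the other three fall inside it.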

So, using the Standard Deviation we have a "standard" way of knowing what is normal, and what is extra large or extra small. Rottweilers are tall dogs. And Dachshunds are a bit short.

The variance and standard deviation are calculated by software programs like SPSS, Excel, SAS and others, and even on hand calculators. Thank goodness for modern technology!

Overview

                                               Nominal                          Ordinal   Interval or Ratio
Central Tendency (best represents all cases)   Mode                             Median    Mean; Median; Mode
Variability (spread; dispersion)               Percent of cases in categories   Range     Variance; Standard deviation; Range

Bivariate Statistics
- Now that we know a bit about each of our variables, we can start comparing them to each other.
- We can also look at differences among groups.
- When comparing two variables or groups, use bivariate statistics.
- Multivariate statistics look at the relationships among many variables or groups at one time; they are beyond the scope of our class.

Comparing variables and groups: Parametric Statistics
Parametric statistics require certain assumptions/qualities in data/variables:
- Normal distributions
- Dependent variable is interval/ratio
- Good sample size (at least 30)
Examples of parametric statistics:
1. Correlation: Is there a relationship between variables?
2. T-Tests: Are there mean differences in outcomes between two groups?
3. Analysis of Variance (ANOVA): Are there mean differences in outcomes among groups? (two or more groups; will not do in this class)

Probability Value
- A report of how likely it is that the relationship found may have happened by chance rather than being statistically significant.
- In other words, how sure are we that what we found was not just a fluke?
- Most researchers set the level for statistical significance at 0.05 or smaller (or 0.01, 0.001).
- Indicated by the P value, e.g. P < .05 means there is less than a 1 in 20 chance that the results are due to sampling error.
P