Quantitative Data Analysis *updated Oct. 5, 2011 – requires more slides Chapter 8 Part 2 Copyright © 2009 Elsevier Canada, a division of Reed Elsevier Canada, Ltd.




The following slides (Part 2) cover inferential statistics.

Descriptive statistics describe the actual sample you study. To extend your conclusions to a broader population (e.g., all such classes, all workers, all women), you must use inferential statistics, which requires that the sample you study be representative of the group you want to generalize to.


Normal Distribution

A theoretical concept: interval or ratio data often group themselves about a midpoint in a distribution closely approximating the normal curve. Much data are "normally distributed," meaning there will be few extreme low and high values.

Online stats text: http://www.statsoft.com/textbook/elementary-concepts-in-statistics/

Figure 8.2 Normal Distribution Curve (© 2007 Pearson Education Canada)


Characteristics of a normal curve

Symmetrical, bell-shaped curve
Mean, median, and mode will be similar
68.27% of cases fall within ±1 SD of the mean
95.45% of cases fall within ±2 SDs of the mean

Used because many variables naturally take the shape of the normal curve (tests of significance assume a normal distribution, aka the bell curve).


Algorithm for standard deviation

1. Get the mean (average, symbol X̄) of the given observations (e.g., mean of grades 50, 60, 70, 80, 90 = 70).

2. Calculate the difference between each value and the mean: (50 − 70 = −20), (60 − 70 = −10), (70 − 70 = 0), (80 − 70 = 10), (90 − 70 = 20).

3. Square the differences (e.g., −20 squared = 400), total all 5 (400 + 100 + 0 + 100 + 400 = 1000), and divide by the number in the sample (there are 5 grades, so 1000/5 = 200).

4. Extract the square root (e.g., the square root of 200 = 14.1421...).

n = total number in the sample. Steps 1–3 give you the variance; the SD is the square root of the variance.

Note: If n is a sample of a population (this is what SPSS assumes), then divide by (n − 1); this is the degrees of freedom. See formula: http://www.sixsigmaspc.com/dictionary/sigma-standarddeviation.html
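The four steps above can be sketched in a few lines (a minimal illustration using only Python's standard library; the function name is mine, not from the slides):

```python
import math

def population_sd(values):
    """Standard deviation by the slides' algorithm: mean, deviations,
    squared deviations, average of those (variance), then square root."""
    n = len(values)
    mean = sum(values) / n                              # step 1
    squared_diffs = [(v - mean) ** 2 for v in values]   # steps 2-3
    variance = sum(squared_diffs) / n                   # divide by n (population)
    return math.sqrt(variance)                          # step 4

grades = [50, 60, 70, 80, 90]
print(population_sd(grades))  # ~14.1421
```

For a sample rather than a population, divide by n − 1 instead, as SPSS does; Python's statistics.stdev uses n − 1, while statistics.pstdev matches the population formula above.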


Z scores

A Z score represents the distance of any value in a distribution from the mean, in standard deviation units. It converts the mean to 0 and the standard deviation to 1 (it is used to standardize raw scores for comparison).

Z score formula

Algorithm:

1. Find the mean and standard dev of the distribution.

2. Subtract the mean from the score. (notice that if the difference is positive it is above the mean and if it is negative it is below the mean.)

3. Divide the difference found in step 2 by the standard deviation.

Z = (X − X̄) / sd
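The formula and the three algorithm steps can be sketched directly (using the earlier grades as the example; the function name is mine):

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, in standard-deviation units."""
    return (x - mean) / sd   # positive: above the mean; negative: below

# Earlier grades had mean 70 and sd ~14.14, so a grade of 90 is
# about 1.41 SDs above the mean.
print(z_score(90, 70, 14.142135623730951))
```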


Z test and probability

The p-value associated with a 95% confidence level is 0.05. The critical Z score values at this level are −1.96 and +1.96 standard deviations. If your Z score is between −1.96 and +1.96, your p-value will be larger than 0.05, and you cannot reject your null hypothesis; the pattern exhibited could be one version of a random pattern.

Source: http://resources.esri.com/help/9.3/arcgisdesktop/com/gp_toolref/spatial_statistics_toolbox/what_is_a_z_score_what_is_a_p_value.htm
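To see how a Z score maps onto a two-tailed p-value, one can use the standard normal CDF via math.erf (a sketch I am adding; it is not from the slides or the source above):

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_tailed_p(z):
    """Probability of a |Z| at least this extreme under the null hypothesis."""
    return 2.0 * (1.0 - normal_cdf(abs(z)))

print(two_tailed_p(1.96))  # ~0.05, the conventional cutoff
```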


Skewness

Most but not all data follow a normal curve.
Positive skew: most values fall in the low range, with a long right tail pulling the mean up (example: world income).
Negative skew: most values fall in the high range, with a long left tail pulling the mean down (example: age at death).


Inferential Statistics

Definition: statistical methods used to estimate population values from the observations and analyses of a sample. Hypothesis testing and developing confidence intervals are two contexts in which inferential statistics are used.


Inferential statistics

1. Allow us to judge the accuracy of generalizing from a limited sample to the larger population

2. Allow us to conduct hypothesis testing to examine the relationship between two or more variables, i.e., to see if the independent variable(s) influence the dependent variable (e.g., to see if there is a relationship between age and one's attitudes toward male nurses)


What does statistically significant mean?

1. A test of significance reports the probability that an observed difference or association is a result of sampling fluctuations and not reflective of a "true" difference in the population from which the sample was selected. For example, a p value of .14 (p = .14) indicates a 14% chance that the observed difference or association is a result of sampling fluctuations. If our confidence level is 95% (in other words, our alpha level is p < .05), we would have to accept the null hypothesis and say there is no relationship: the risk is too high, as we only allow a 5% risk.

2. The three main tests of statistical significance are the chi-square, the t-test, and the F-test.

Additional information: http://www.surveysystem.com/signif.htm
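The decision rule in point 1 can be sketched as a tiny function (alpha = .05 assumed, matching the slides; the function name is mine):

```python
def significance_decision(p_value, alpha=0.05):
    """Reject the null hypothesis only when p falls below alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

print(significance_decision(0.14))  # the p = .14 example above
print(significance_decision(0.01))
```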


Hypothesis Testing

Answers questions such as:
− How much of this effect is a result of chance? (probability level)
− How strongly are these two variables associated with each other?


Hypothesis Testing (cont’d)

Scientific or alternative hypothesis (H1): what the researcher believes the outcome will be, i.e., that the variables will interact in some way.

Null hypothesis (H0): the hypothesis that can actually be tested by statistical methods; it states that no difference exists between the groups under study. It is the null hypothesis that is tested with statistical procedures; hence we reject or accept the null hypothesis.


Probability

The use of statistics to analyze past predictable patterns and to reduce risk in future plans.
Support for the scientific hypothesis comes from rejecting the null hypothesis.
This is done by applying probability theory.


Type I and Type II

Type I: Rejecting the null hypothesis when it is actually true (a false positive). Alpha (α) is usually set to 0.05, although this is somewhat arbitrary. This is the probability of a Type I error, that is, the probability of rejecting the null hypothesis given that the null hypothesis is true. To use the search analogy, it is the probability of thinking we have found something (a relationship between two variables) when it is not really there.

Type II: Accepting the null hypothesis when it is false (a false negative): saying there is no relationship between variables when there is one.


Level of Significance (Alpha Level)

Probability of making a Type I error = .05. The researcher is willing to accept that if the study were done 100 times, the decision to reject the null hypothesis would be wrong 5 times out of those 100 trials.


Level of Significance (Alpha Level) (cont'd)

Can set it at .01 if one wants a smaller risk of rejecting a true null hypothesis (the decision to reject the null hypothesis would be wrong 1 time out of 100 trials).

The selected alpha level depends on how important it is not to make an error (we can never be certain a sample is representative, but we aim to be 95% certain that our sample is representative of the population from which it was drawn).


Statistical Significance vs. Practical Significance

A statistically significant finding is one unlikely to have occurred by chance. But the magnitude of the effect matters to the outcome of data analysis: it is important to examine the data, because you might have a relationship that is statistically significant but not practically significant. For example, suppose 81% of the people who received a vaccine and 80% of those who did NOT receive the vaccine did NOT get ill. If your sampling error was small (due to a very large sample), this 1% difference could still be statistically significant (a p value of less than .05, meaning there is less than a 5% chance this difference would occur if you drew other samples from the same population). Your common sense should tell you, however, that the difference is not likely to be practically significant: you would not rush out to get the vaccine even though the statistical analysis is telling you it would make a difference.


Crosstabulation for the previous example (p = 0.49). **This is an imaginary data set with an imaginary p value.

                 Received Vaccine   No Vaccine
Did not get ill        81%             80%
Did get ill            19%             20%


Parametric versus Nonparametric

Parametric: More powerful and more flexible than nonparametric; used with interval and ratio variables.

Nonparametric: Not based on the estimation of population parameters; used with nominal or ordinal variables.


Three Tests of Significant Differences

Chi-square: Uses nominal data to determine whether frequencies in each group are different from what would be expected by chance (run via Crosstabs in SPSS).

The t statistic: Tests whether the means of two groups are different (run via Compare Means in SPSS).

ANOVA: Tests variations between and within multiple groups (the F statistic).


Nonparametric

Chi-square: used primarily with nominal-level variables (and also ordinal, especially if they have only a few categories).


Tests of Significant Relationships

Exploring the relationship between two or more variables reflecting interval data.

Determining the correlation, the degree of association (ranges from −1.0 to +1.0).

Most common (three names for the same test):
− Pearson product moment correlation coefficient
− Pearson r
− Pearson correlation coefficient


Correlation Coefficients

Range from −1.0 to +1.0
Negative correlation: r = −.38
Positive correlation: r = .65
Perfect correlation: r = +1.0 (positive) or −1.0 (negative)


Correlational Analysis

Correlational analysis is a procedure for measuring how closely two ratio-level variables co-vary. It tells you about the strength and the direction of the relationship.

Basis for more advanced procedures: partial correlations, multiple correlations, regression, factor analysis, path analysis, and canonical analysis (not discussed in class).

Advantage: can analyze many variables simultaneously (multivariate analysis). Relies on having interval- or ratio-level measures.


Two Basic Concerns

1. What is the equation that describes the relation between two variables?

2. What is the strength of the relation between the two?

Two visual estimation procedures

A. The linear equation: Y = a + bX

B. Correlation coefficient: r


The Linear Equation

The linear equation, Y = a + bX, describes the relation between the two variables (b is the slope of the line in this formula; it expresses the change in Y for every unit increase in X).

Components:
Y – dependent variable (e.g., starting salary)
X – independent variable (e.g., years of post-secondary education)
a – the constant, which indicates where the regression line intersects the Y-axis
b – the slope of the regression line


A. The Linear Equation: A Visual Estimation Procedure

Step 1: Plot the relation on a graph

Sample data set:
X  Y
2  3
3  4
5  4
7  6
8  8


A. The Linear Equation (cont’d)

Step 2: Insert a straight regression line

From the regression line one can estimate how much one has to change the independent variable in order to produce a unit change in the dependent variable


A. The Linear Equation (cont'd)

Step 3: Observe where the regression line crosses the Y axis; this represents the constant in the equation (a = 1.33 on Figure 8.6).

Step 4: Draw a line parallel to the X axis and one parallel to the Y axis to form a right-angled triangle. Measure the lines; divide the horizontal distance into the vertical distance to compute the b value (72/91 = 0.79).


A. The Linear Equation (cont'd)

Step 5: If the slope of the regression line is lower on the right-hand side, the b coefficient is negative, meaning the more X, the less Y. If the slope is negative, use a minus sign in your equation: Y = a − bX.

Step 6: Write the equation: Y = 1.33 + 0.79(X)

The above formula is our visually estimated equation relating the two variables. The equation is used to predict the value of the Y variable given a value of the X variable; this is done in regression analysis.
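The prediction step can be sketched with the slides' visually estimated values (a = 1.33, b = 0.79; the function name is mine):

```python
def predict_y(x, a=1.33, b=0.79):
    """Predicted Y from the visually estimated line Y = a + bX."""
    return a + b * x

# e.g., the predicted Y when X = 5
print(predict_y(5))  # ~5.28
```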


B. Correlation Coefficient: A Visual Estimation Procedure *** This was covered in class

Goal: to develop a sense of what correlations of different magnitudes look like.

The correlation coefficient (r) is a measure of the strength of the association between two interval-ratio variables (sometimes ordinal if there are several categories). It also tells you the direction of that relationship (positive or negative).

For hands-on practice seeing weak and strong correlations, see the following website (you will see a few graphs; go to the bottom graph on the page and move the r value to see what a scatter plot would look like with different r values; it only takes a minute!):

http://staff.argyll.epsb.ca/jreed/math9/strand4/scatterPlot.htm


Eight Linear Correlations (4 on next slide)


Eight Linear Correlations (cont’d)


B. Correlation Coefficient (cont'd)

Graphing allows you to visually estimate the strength of the association.

The closer the plotted points are to the regression line (e.g., Plots 1 and 2), the higher the correlation (.99 and .85).

Greater spread (e.g., Plots 3 and 4) means lower correlation (.53 and .36).

It would be difficult to draw a regression line if r < .36.


B. Correlation Coefficient (cont'd)

Plots 5 and 6 are curvilinear: the relation is not linear, hence r = 0. The procedure is not appropriate for curvilinear relations.

Plots 7 and 8 are problem plots with deviant cases. This is one of the reasons it is important to plot relationships: extreme values indicate a non-linear relationship, and therefore linear regression procedures are not appropriate for studying these relationships.


Try predicting the correlation coefficient by eye: http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/


Calculating the Coefficient of Determination (r²)

r² is a very powerful number. It tells us a lot about how well we can use the independent variable to predict variation in the dependent variable. It is one of a family of measures called proportional reduction in error (PRE) measures, because they express the proportion or percentage by which we reduce error in our prediction of the DV when we use information about the distribution of the IV instead of just using the DV mean as the best predictor (or, for categoric – nominal and ordinal – data, the mode of the DV). r² tells us exactly how much our knowledge of the IV improves our estimates of the DV. Both r and r² can be tested for significance; even a weak correlation can be significant.

The estimation of the correlation coefficient takes two kinds of variability into account:
1. Variation around the regression line
2. Variation around the mean of Y

r² = 1 − (variation around the regression line / variation around the mean of Y)

This can be calculated by hand, but computer programs are used by most researchers today, so we do not have to know the formula in this class.
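As an illustration of the r² formula above, a sketch using the earlier sample data set (X = 2, 3, 5, 7, 8; Y = 3, 4, 4, 6, 8), with a least-squares line fitted in place of the slides' visual estimate:

```python
def r_squared(xs, ys):
    """r^2 = 1 - (variation around the regression line
                  / variation around the mean of Y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Least-squares slope and intercept
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    ss_line = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_mean = sum((y - my) ** 2 for y in ys)
    return 1 - ss_line / ss_mean

print(r_squared([2, 3, 5, 7, 8], [3, 4, 4, 6, 8]))  # ~0.87
```

An r² near 0.87 means roughly 87% of the variation in Y is accounted for by knowing X.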


Other Correlation Procedures

Spearman Correlation
An appropriate measure of association for ordinal-level variables.

Partial Correlation
Measures the strength of association between two variables while simultaneously controlling for the effects of one or more additional variables. Also varies from +1 to −1. Commonly used by social researchers.


The following slides have formulas that are used in the calculation of statistical significance (you do not need to know these calculations for the exam, but they can help you see how we test for significance). The question is: how does knowing the values of one variable help us make predictions about the other? We discussed PRE in class.


Proportionate Reduction of Error (PRE)

Lambda = (errors not knowing − errors knowing) / errors not knowing   (nominal; range 0–1)

             Men     Women    Total
Employed     900      200     1,100
Unemployed   100      800       900
Total      1,000    1,000     2,000

Lambda = (900 − 300) / 900 = .67

(In this case it is errors not knowing gender versus errors knowing gender: without gender we predict the mode, "employed," and make 900 errors; knowing gender we make 100 + 200 = 300 errors.)
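The lambda calculation above can be sketched from the table's cell counts (the function and data names are mine):

```python
def lambda_pre(table):
    """Goodman-Kruskal lambda for {iv_category: {dv_category: count}}.
    Errors without the IV: N minus the largest DV total (predict the mode).
    Errors with the IV: per IV category, its total minus its largest cell."""
    dv_totals = {}
    for col in table.values():
        for dv, count in col.items():
            dv_totals[dv] = dv_totals.get(dv, 0) + count
    n = sum(dv_totals.values())
    errors_not_knowing = n - max(dv_totals.values())
    errors_knowing = sum(sum(col.values()) - max(col.values())
                         for col in table.values())
    return (errors_not_knowing - errors_knowing) / errors_not_knowing

employment = {
    "men":   {"employed": 900, "unemployed": 100},
    "women": {"employed": 200, "unemployed": 800},
}
print(round(lambda_pre(employment), 2))  # 0.67
```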


Proportionate Reduction of Error (PRE): Ordinal Level (using gamma; range −1 to +1)

Gamma = (same − opposite) / (same + opposite) = (830,000 − 3,430,000) / (830,000 + 3,430,000) = −.61

Level of prejudice
SES      Low    Med    High
Low      200    400    700
Med      500    900    400
High     800    300    100

The number of pairs with the same ranking would be: 200(900 + 300 + 400 + 100) + 500(300 + 100) + 400(400 + 100) + 900(100), or 340,000 + 200,000 + 200,000 + 90,000 = 830,000. The number of pairs with the opposite ranking would be: 700(500 + 800 + 900 + 300) + 400(800 + 300) + 400(500 + 800) + 900(800), or 1,750,000 + 440,000 + 520,000 + 720,000 = 3,430,000. (For another example see text chapter 8.)
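The same-ranking and opposite-ranking pair counts above generalize to any table ordered low-to-high on both axes; a sketch (function and variable names are mine):

```python
def gamma(table):
    """Goodman-Kruskal gamma for a 2-D list ordered low-to-high on both axes.
    Same-ranked pairs lie below-and-right of a cell; opposite below-and-left."""
    rows = len(table)
    same = opposite = 0
    for i in range(rows):
        for j in range(len(table[i])):
            for k in range(i + 1, rows):
                same += table[i][j] * sum(table[k][j + 1:])
                opposite += table[i][j] * sum(table[k][:j])
    return (same - opposite) / (same + opposite)

prejudice_by_ses = [
    [200, 400, 700],   # low SES
    [500, 900, 400],   # medium SES
    [800, 300, 100],   # high SES
]
print(round(gamma(prejudice_by_ses), 2))  # -0.61
```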


Confidence Intervals (added Mar. 29/11; also next 6 slides added)

An estimated range of values that provides a measure of certainty about the sample findings. Most commonly reported in research is a 95% degree of certainty, meaning that 95% of the time the findings will fall within the range of values given as the confidence interval (CI). The sampling error (SE, also called standard error) is used to determine the confidence interval.

http://www.socialresearchmethods.net/kb/sampstat.php


Sampling Error = the standard deviation divided by the square root of the sample size (added Mar. 29/11)

This term is somewhat deceptive, as it is not referring to an error that was made; rather, it refers to the problem that comes when you sample a population. We talk about sampling error when we use probability sampling to make inferences about a population. As you know, when we use probability sampling we can never be certain that our sample is representative of the total population. We rely on probability theory (the Central Limit Theorem) to say how probable (likely) it is that our sample is representative.


SE (cont'd)

Even when we are 95% confident our sample is representative, this does not mean that our sample is exactly the same as the total population; we allow for some leeway. We know our sample is unlikely to be exact, so we want to know how much "error" we will allow. We calculate that leeway (error) when we calculate the standard error. The number we get is an estimation. Why is it an estimation? Because we don't know the true population mean.


Using SE

Based on the Central Limit Theorem, we know that 95% of sample outcomes will fall within about ±2 sampling errors/standard errors (1.96 to be exact) of the population parameter. To calculate this, take the SE and multiply it by 1.96 (when doing it in our heads we can use the approximate value 2 to make it easier). If we have an SE of 2, we can be 95% confident that our sample outcomes will be approximately ±4 from the mean (2 × 1.96 = 3.92 to be exact); this is our confidence interval. It tells us how accurate we can be. Remember: if we want to be more accurate, we need to increase our sample size. What would happen if we quadrupled our sample size? We would double our accuracy. Based on the above example, if we wanted to be 95% confident that our sample outcomes would be ±2, we would have to quadruple the sample size.

Sampling error calculator: http://www.rogerwimmer.com/mmr/mmrsampling_error.htm
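The quadrupling rule follows directly from SE = sd / √n; a quick sketch with a hypothetical sd of 3:

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: sd divided by the square root of n."""
    return sd / math.sqrt(n)

sd = 3.0                          # hypothetical standard deviation
print(standard_error(sd, 100))    # 0.3
print(standard_error(sd, 400))    # 0.15 -- quadrupling n halves the SE
```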


Sampling Error = sample standard deviation (sd) divided by the square root of the sample size (n)

Look at the formula and notice how it supports two intuitive conclusions:

1. A large n (sample size) decreases the interval width for a given confidence level.

2. Low variability (a small standard deviation) decreases the interval width for a given confidence level.


Example

We find that for a random sample of 100 workers, the mean time for widget assembly is 10 minutes. To make it easier, pretend we know that the population sd is 3 minutes (although typically we would have to estimate the sd from the sample data and then use the t distribution rather than the z distribution).

Mean = 10 mins; sd = 3; SE = 3/10 (sd / square root of sample size) = .3
Z = 1.96 (here we specify the number of SEs in the sampling distribution that includes 95% of sample outcomes)
Confidence interval = .3 × 1.96 = .588, so the mean (10) plus or minus .588, giving us the range 9.412 mins to 10.588 mins

We are saying that it is very likely (95% probable) that the true mean time for assembling a widget is somewhere between 9.412 and 10.588. Can we be sure the true mean time (the population parameter) is in this interval? No, but with this method, only 5 of every 100 intervals constructed from data based on random samples will fail to include the population parameter. (Joy of Statistics, 2010:133)
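The worked example above can be reproduced in a few lines (the function name is mine; population sd assumed known, as in the example):

```python
import math

def confidence_interval(mean, sd, n, z=1.96):
    """95% CI for a sample mean when the population sd is known."""
    se = sd / math.sqrt(n)   # sampling/standard error
    margin = z * se          # leeway on either side of the mean
    return mean - margin, mean + margin

low, high = confidence_interval(mean=10, sd=3, n=100)
print(round(low, 3), round(high, 3))  # 9.412 10.588
```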


Standard Errors and Z Scores

A standard deviation is the spread of the scores around the average in a single sample. The standard error is the spread of the sample averages around the average of averages in a sampling distribution.

Both the sd and the SE can be standardized by converting to z scores.


Critical Thinking Decision Path: Descriptive Statistics


Critical Thinking Decision Path: Inferential Statistics—Difference Questions


Critical Thinking Decision Path: Inferential Statistics—Relationship Questions


Critiquing Descriptive Statistics

Were appropriate descriptive statistics used?
What level of measurement is used?
Is the sample size large enough?
What descriptive statistics are reported?
Were these appropriate to the level of measurement used?
Are appropriate summary statistics provided for each major variable?


Critiquing Inferential Statistics

Does the hypothesis reflect whether differences or relationships are being tested?
Is the level of significance indicated?
Does the measurement level permit parametric testing?
Is the sample size large enough for parametric testing?
Is there enough information given to assess the appropriateness of parametric use?


Critiquing Inferential Statistics (cont'd)

Do the statistics used match the problem, hypothesis, method, sample, and level of measurement?
Are hypothesis results clearly presented?
Do tables and graphs enhance the text?
Are the results understandable?
Are practical and statistical significance distinguishable?


Take-Home Message?

Science and research prove nothing in isolation—research evidence only provides support for a theory

One study’s findings are rarely sufficient to support a major practice change