Scatterplots & Correlations Chapter 4

Page 1: Scatterplots & Correlations

Scatterplots & Correlations

Chapter 4

Page 2: Scatterplots & Correlations

What we are going to cover

• Explanatory (Independent) and Response (Dependent) variables

• Displaying relationships with scatterplots

• Interpreting scatterplots

• Adding categorical variables to scatterplots

• Measuring linear associations with correlation

• Important facts and issues with correlations

Page 3: Scatterplots & Correlations

Starting with some terminology

• Response variables (Dependent) = Ys

• Explanatory variables (Independent) = Xs

• When stating relationships we generally state the dependent first.

• When graphically depicting a relationship we generally place the dependent on the y axis.

• In most stats software the dialogue boxes follow this convention and ask you to enter the dependent or response variable first.

Page 4: Scatterplots & Correlations
Page 5: Scatterplots & Correlations

Here is the scatterplot dialogue box for Excel with the publisher’s plugin

Page 6: Scatterplots & Correlations

• A scatterplot displays the relationship between two quantitative variables measured on the same individual or event, etc.

Page 7: Scatterplots & Correlations

• Just as we began our discussion of the distribution of individual variables by graphically depicting them, so we do when we are interested in relationships between variables

• Scatterplots are a great way to do this depiction.

Page 8: Scatterplots & Correlations

Adjusting your graph (art and science)

This is the original Excel scatterplot

This is my adjusted Excel scatterplot

Page 9: Scatterplots & Correlations

• Once again let’s look for patterns of regularity and outliers

• Using the four step method
– State the problem (in this case, does the percent of students taking the SAT influence math scores?)
– Plan: we can try to observe this with a scatterplot
– Solve (interpret the plot): notice there is something of a downward sloping line from left to right, and some clustering
– Conclude: there does appear to be a negative association between the variables. As the percent of students taking the SAT in a state increases, the average math score of the state declines.

Page 10: Scatterplots & Correlations

We can also group data in a scatterplot

• As can be seen, the data in the previous chart has been grouped by region (a nominal variable) in this example

• In the last class I did the same thing when I divided my data on income into two separate sets for men and women and made side by side box plots

Page 11: Scatterplots & Correlations

Measuring Linear Correlations

• Just as in the past lesson, where we moved from depicting data in graphs to summarizing them with numbers, so we can do the same with associations.

• A statistic which is commonly used to measure the strength of an association when data is measured at the interval or ratio level is “r” (Pearson’s r).

Page 12: Scatterplots & Correlations

• Pearson’s r really just builds on what we did with descriptive statistics. For each point we take its distance from the mean of x divided by the standard deviation of x, and multiply it by its distance from the mean of y divided by the standard deviation of y. In other words, it is based on standardized values

$$ r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$
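Pearson's r can be computed by hand as the sum of the products of each point's standardized x and y values, divided by n - 1. The following is a minimal sketch of that calculation using only Python's standard library. The function name `pearson_r` and the `hours` and `score` lists are assumptions made up for illustration; they are not data from the course.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r: sum of products of standardized values, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical data, invented for this example only.
hours = [1, 2, 3, 4, 5]
score = [52, 60, 61, 70, 78]

r = pearson_r(hours, score)
print(round(r, 3))  # a value close to 1 signals a strong positive linear association
```

In practice most statistics packages report r directly; writing it out this way is only meant to show that nothing more than means and standard deviations is involved.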

Page 13: Scatterplots & Correlations

Some important points about “r”

• Correlations are symmetrical statistics: they produce the same result whichever variable you tag as explanatory and whichever as response

• Because “r” uses standardized values, it does not change if you change the units in which the variables are measured

• A negative sign on “r” indicates a negative association; a positive sign indicates a positive association.

• r varies between -1 and 1.
– Values approaching 0 indicate little or no linear association.
– Values approaching -1 indicate a near perfect negative linear relationship.
– Values approaching 1 indicate a near perfect positive linear relationship.
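The symmetry and rescaling points can be checked numerically. The sketch below is an illustration under assumptions: the `pearson_r` helper and the temperature and sales figures are hypothetical, invented for this example. It swaps the two variables, converts one from Celsius to Fahrenheit, and flips the sign of one variable.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r from standardized values (hypothetical helper)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data, invented for this example only.
temps_c = [10.0, 15.0, 20.0, 25.0, 30.0]
sales = [12, 18, 21, 30, 33]

r_xy = pearson_r(temps_c, sales)   # temperature as explanatory variable
r_yx = pearson_r(sales, temps_c)   # roles swapped: same r

temps_f = [c * 9 / 5 + 32 for c in temps_c]
r_f = pearson_r(temps_f, sales)    # change of units: same r

r_neg = pearson_r([-c for c in temps_c], sales)  # reversed direction: sign flips

print(round(r_xy, 4), round(r_yx, 4), round(r_f, 4), round(r_neg, 4))
```

The first three values agree up to floating-point rounding, and the last has the same size with the opposite sign.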

Page 14: Scatterplots & Correlations

Some warnings

• As noted, Pearson’s r only works if both variables are measured at least at the interval level

• Do a scatterplot first.
– r only works with linear (or nearly linear) relationships. As curvature enters the picture, r’s usefulness declines
– Outliers (extreme high and low values) will distort r

• Correlations do not provide a complete summary of a relationship; you should usually also provide the means and standard deviations of x and y so people can evaluate the usefulness of the correlation
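Both warnings are easy to see with small made-up data sets. The following sketch is illustrative only: the `pearson_r` helper and all of the numbers are hypothetical. The first case is a perfect but curved relationship, and the second adds one extreme point to otherwise weakly related data.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r from standardized values (hypothetical helper)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Case 1: y is completely determined by x, but the relationship is curved.
xs_curve = [-3, -2, -1, 0, 1, 2, 3]
ys_curve = [x ** 2 for x in xs_curve]
r_curve = pearson_r(xs_curve, ys_curve)  # essentially zero despite a perfect pattern

# Case 2: weakly related points, then the same points plus one outlier.
xs_base = [1, 2, 3, 4, 5]
ys_base = [3, 1, 4, 2, 3]
r_without = pearson_r(xs_base, ys_base)
r_with = pearson_r(xs_base + [20], ys_base + [20])  # one extreme point added

print(round(r_curve, 3), round(r_without, 3), round(r_with, 3))
```

A scatterplot would reveal both problems at a glance, which is why it comes before the number.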

Page 15: Scatterplots & Correlations

Spearman’s rho (a correlation for ordinal data)

• Spearman’s rho (or rank order correlation) is a correlation you can use with ordinal data. As with “r” it varies between -1 and 1 and a value approaching 0 indicates no meaningful relationship between the variables.

• It is very handy and is used in a number of situations. For example, in sports very elaborate computer programs are used to rank players and/or teams. We could use rho to analyze whether the rankings reliably predict who wins (for example in tennis).

• Another common use is when you are looking for associations among opinion data which is collected at the ordinal level.

• We won’t calculate this. Enough to say that most programs that do “r” will have a nearby function for rho.
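For readers curious about what such a function does, one common way to obtain Spearman's rho is to replace each value with its rank and then compute Pearson's r on the ranks. The sketch below follows that approach, with tied values sharing their average rank. The helper names and the seeding and finishing positions are hypothetical, invented for illustration.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r from standardized values (hypothetical helper)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

def average_ranks(values):
    """Rank from 1 (smallest) upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Hypothetical tournament: pre-tournament seeding vs. finishing position.
seed = [1, 2, 3, 4, 5, 6, 7, 8]
finish = [2, 1, 3, 5, 4, 8, 6, 7]
rho = spearman_rho(seed, finish)
print(round(rho, 3))
```

Because only the ordering matters, any relationship that always moves in the same direction gives a rho of 1 or -1 even when it is not a straight line.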

Page 16: Scatterplots & Correlations

The following table is from Cohn, CJPS 38:2 (2005), 415-434.

Page 17: Scatterplots & Correlations

Some things you will note,

• In the previous table beside “rho” there was a number titled “significance”.

• As with most statistics, “r” and “rho” have known distributions for a given data set size (degrees of freedom; for a correlation this is N-2).

• Significance answers the question: given the degrees of freedom, how likely are we to see a value of the statistic at least this large if there were no real association?

Page 18: Scatterplots & Correlations

• A score of 0.05 or less means that, if there were really no association between the variables, a correlation at least this strong would turn up by random chance 5% of the time or less. In other words, results like these are unlikely to be a product of chance alone.

• The score in the table was 0.000 (that is, less than 0.0005 before rounding). This means there is almost no chance a rho of this strength could occur with this many cases by simple random chance.

• Therefore, we have strong evidence that the association reported between the variables is genuine.

• You will hear more about significance as the course proceeds.
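One way to build intuition for significance is a permutation test: shuffle one variable many times, which breaks any real link between the two, and count how often chance alone produces a correlation at least as strong as the one observed. This is a swapped-in illustration, not the distribution-based calculation that statistics software typically reports, and the function names, data, number of trials, and random seed below are all assumptions made up for the example.

```python
import random
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r from standardized values (hypothetical helper)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

def permutation_p_value(xs, ys, trials=5000, seed=0):
    """Share of random shuffles whose |r| is at least the observed |r|."""
    rng = random.Random(seed)      # fixed seed so the run is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)      # pair each x with a randomly chosen y
        if abs(pearson_r(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    return hits / trials

# Hypothetical data, invented for this example only.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.3, 8.9, 10.2, 10.8]

p = permutation_p_value(xs, ys)
print(p)  # a small value means chance alone rarely matches the observed r
```

A result below 0.05 here plays the same role as the "significance" column in the table: random pairings almost never reproduce a correlation this strong.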