

Learning R while exploring statistics



Introduction to simulating datasets in R. No prior knowledge of R required. Illustrates the idea of spurious correlation, and accompanies this blog post: http://tinyurl.com/7h5j8eo



Version 1.1, 9th June 2012

This exercise is designed to help you learn R while at the same time gaining insights into the phenomenon of illusory correlation. We will go through the following steps:

1. Downloading R and R Studio, an interface to the R programming language that is rather easier to work with than the basic interface.
2. Familiarisation with basic operations in R.
3. Generating simulated data: two correlated variables, X and Y.
4. Generating simulated data from two groups with different means on uncorrelated variables U and V, to demonstrate spurious correlation between U and V.
5. Demonstrating how incorporating group identity in a linear model unmasks the spurious nature of the correlation between U and V.
6. Demonstrating how removing the effect of group will be misleading if group identity is highly dependent on one of the variables.

These instructions apply to those working on a PC; I don't know whether the equivalent steps apply on a Mac. For steps 5 and 6 it is assumed you have a basic understanding of simple regression.

1. Downloading R and R Studio

Downloading R
R is a powerful language for statistical computing, but much of the documentation is written for experts, and so it can be daunting for beginners. If you go to the website http://www.r-project.org/ you will see instructions for how to download R. Do not be put off by the instruction to "choose your preferred CRAN mirror": this just means you should select a download site from the list provided that is geographically close to where you are. You may then be offered further options that you may not fully understand. Just persevere by selecting the 'windows' option from the "Download and install R" section, and then select 'base', which at last takes you to a page with straightforward download instructions. Installation of R will create a Start Menu item and an icon for R on your desktop.

Downloading R Studio
To download R Studio, go to this website and follow the instructions: http://rstudio.org/ If for any reason you prefer not to use R Studio, the examples should all work from the original R interface, but your screen may look different, and it may be difficult to arrange items such as figures in a sensible way.

2. Familiarisation with basic operations in R

After opening R Studio your screen will be divided into several windows. Move your cursor to the window called R Console, in which you can type commands. You will see a > cursor. This cursor will not be shown in the examples below, but it indicates that the console is awaiting input from you. At the > cursor, type:

help.start()

As with other programming languages, you hit Enter at the end of each command. This will open a window showing links to various manuals. You may want to briefly explore this before going further. Just to familiarise yourself with the console, type:

1+2

R evaluates the expression and you see the output: [1] 3



The [1] at the beginning of the output line indicates that the answer is the first row of the variable. This looks confusing if you just have a single number, as in this case, but, as we will see, output can consist of an array of numbers.

Now type:

x = 1+2

Nothing happens. But the variable x has been assigned, and if you now type x on the console, you will again see the output [1] 3. In R, the results of variable assignments are not shown automatically, but you can see them at any time by just typing the name of the variable. You can also see all current variables in the Workspace screen on the right. The value assigned to variable x will remain assigned unless you explicitly remove it using the 'rm' command. Type:

rm(x)

You now see that x has disappeared from the Workspace. If you type x again, the console gives the message: Error: object 'x' not found

You can repeat an earlier command by pressing the up arrow until it reappears. Use this method to redo the assignment x=1+2, and then type X. Again you get the error message, because R is case-sensitive, and so X and x are different variables.

Now type:

y = c(1, 3, 6, 7)

The Workspace tells you y is a numeric variable with four values, i.e. a vector. To see the values, type y on the console. You will see the vector of numbers [1 3 6 7]. The 'c' in the previous command is not a variable name, but rather denotes the operation of concatenation. It just instructs R to create a variable consisting of the sequence of material that follows in brackets.

Now type x= and hit Enter. The cursor changes to +. This is R telling you that the command is incomplete. If you now type 1+2 followed by Enter, your regular cursor returns, because the command is completed. It can happen that you start typing a command and think better of it. To escape from an incomplete command, and restore the > cursor, just hit Escape.
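Before moving on, it is worth playing a little more with the vector y. Here is a minimal sketch of things to try at the console (the particular values are just examples):

y[2]   # square brackets select elements: the second value, 3
y[c(1,4)]   # the first and fourth values: 1 and 7
y + 10   # arithmetic applies to every element: 11 13 16 17
length(y)   # the number of elements: 4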



The Console is useful for doing quick computations and checking out commands, but in general, when you do computations, you will want to use a script, i.e. a set of commands that you can save, so you can repeat the sequence of operations at any time. The script is written in the Source window (also known as the Editor window). From the menu at the top of the screen select File|New|R script. You will see a new tab in the Source window, labelled Untitled1. You want to save it with a name. Select a name such as Demo1 and type this in the Source window, preceded by the symbol #. It is important that the name contains no blank spaces. If you make a script name with blank spaces, this can create havoc later on, because when you try to execute it, R will interpret all but the first word as commands, and you will get misleading error messages that will have you scratching your head as to what they mean.

The hash symbol that you typed before the script name is used to create a comment in a script, i.e. a line that is used to remind the user of important information, but which is not executed when the script runs. It is customary to put the title of the script, plus information about its function, author and date at the head of the script. Select the menu command File|Save As to save the script with that name.

Currently, your script doesn't do anything. Let's give it some content. In the Source window type:

x=2+3
y=4+5
z=x+y

Now select the top menu item Edit|Run Code|Run All. As the script executes, you will see the commands in the script repeated in the Console window, and the values of the variables x, y and z in the Workspace window. These variables will remain assigned to these values until explicitly cleared. You can test this by typing a command at the console such as x-y, which will give the answer -4.

Important: Traditionally, R scripts use <- instead of =. So you will see instances of scripts which have commands such as a <- 1+3. This is equivalent to a = 1+3. It is also possible to have the arrow going the other way, i.e. 1+3 -> a, which means the same thing. My view of life is that you should never make two keystrokes when one will do, and so I persist with the use of the equals sign, but R purists disapprove of this. One reason for avoiding = in assigning values to variables is that it can be confusing, because the equals symbol is also used in other contexts, such as judging whether two things are the same (R uses == for that). For the present, I'm not going to worry you further about this, but you may want to squirrel that fact away. Confusion between different uses of the = operator causes much grief, not just in R but in most programming languages.

Loops: A loop is a way of repeatedly executing the same code. Suppose we wanted to print out the ten times table: we could type 1*10, 2*10, 3*10, and so on. But a simpler method is to use a loop, where we multiply 10 by a variable, myx, and specify the range of values that myx will take at the start of the loop. Thus we can type in the commands:

for (myx in c(1:10)) {
print(10*myx)
}

The first line specifies the values that myx can take, i.e. c(1:10), which is the values 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. The program executes all the commands between the curly brackets repeatedly, incrementing the value of myx each time it does so, until it gets to the final value, whereupon it exits the loop.

Stopping a program: Sometimes a program has been written in a way that means it keeps running and never stops. If you need to abort, just type Ctrl+C (in R Studio, pressing Escape also interrupts the console).

Commenting: A good script will contain many lines preceded by #. This indicates that the line is a comment: it does not contain commands to be executed, but provides explanation of how the script works.
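As an aside, loops like the ten times table example above are often unnecessary in R, because most arithmetic is vectorised, i.e. applied to every element of a vector at once. A minimal sketch of the same table without a loop:

10*c(1:10)   # multiplies each of the values 1 to 10 by ten in a single command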



Before you go any further, create a new directory that will contain all of your scripts, data, and workspace for a project. Then go to the menu and select Tools|Set Working Directory|Choose Directory and navigate to your new directory. This means that all your work will be saved in one place. Whenever you start up R from a file in that directory, it will continue as your working directory.

A note on quotes: If you paste a script into your R console or browser, quotes may get reformatted, causing an error. Always check: for R, single quotes should be straight quotes, not 'smart quotes' (i.e. quotes that slant or curl in a different direction at the start and end of a quoted section). You may need to retype them if your system has reformatted them.

Further reading
The best way to learn R is to play with it. You should try typing in commands to see what happens. Use the R Manuals from the Help screen to get started. In addition, these texts are recommended:

Braun, W. J., & Murdoch, D. J. (2007). A first course in statistical programming with R. Cambridge: Cambridge University Press.
Crawley, M. J. (2007). The R book. Chichester, UK: Wiley.
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). New York: Springer. (Do not be put off by the title: it really should be entitled 'with S and R'.)

3. Generating simulated data: two correlated variables, X and Y

An important aspect of R is the ease with which you can generate simulated data. Playing with simulated data is one of the best ways of gaining an intuitive grasp of statistics. You can create a dataset with certain characteristics, and then see what happens when you analyse it in different ways. Most introductions to statistics ignore the potential of simulated data, and simulation is often seen as an advanced topic. My view is that it should be one of the first things you learn to do.

As a first exercise in running a script in R, we shall generate a simulated set of data for two variables, look at some basic statistics for the variables, plot them, and save the data. We will be using the data for more interesting purposes later on, but for the time being the aim is to familiarise you with some key R commands. In addition, it is very useful to know how to simulate datasets with specific characteristics, as these can be used to check how various analyses work. Unfortunately, most people do find R commands quite daunting, and the command needed to create simulated data will probably look horrific if you are a newbie. Also, the help commands in R are often not all that helpful, as they are written for statisticians. Don't lose your nerve. I shall walk you through it and all will become clear.

One of the first things you need to understand about R is that there is a huge number of functions that you can use to carry out various statistical, mathematical and graphic operations, but they aren't all available when you start up R. Many of them live in 'packages', which you have to specify if you want to use them. There's a nice explanation of how you can find and use packages here: http://ww2.coastal.edu/kingw/statistics/R-tutorials/package.html. We're going to use commands from a package called MASS, which contains functions and datasets from 'Modern Applied Statistics with S' by Venables and Ripley (see Further reading above). All we need to do is to include the following line in our script:

require(MASS)

Once that command is executed, all the functions from MASS will be available for us to use. When learning R, it's a good idea to run each new command and see what, if anything, happens, and whether the workspace changes. If you just highlight one or more commands in the Editor window and then hit the Run button with a green arrow at the top of the window, this runs just those commands. If you run the 'require' command as above, the Console just reassuringly tells you it is loading MASS.
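MASS normally comes bundled with R, so require(MASS) should just work. For other packages, you may first need to download and install the package before you can load it. A minimal sketch of the standard two-step procedure:

install.packages("MASS")   # one-off: download and install the package
require(MASS)   # load it into the current session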



Now we're going to generate two columns of correlated numbers, X and Y. We'll start by creating a variable to hold their names. The next line of your script should be:

mylabels=c('X','Y')   # Put labels for the two variables in a vector

Remember: you could just omit the bit after the hash, which is a comment. It's there to remind you what you are doing. It may be obvious now, but, trust me, it won't be if you come back a week later. You should add your own comments, using language that will be helpful to you. If you run this command you will see that the Workspace shows mylabels as a character variable with two values. R knows to treat mylabels as a character variable, rather than a number variable, because you have enclosed the labels in quotes. Does it matter if you use single or double quotes? I couldn't remember, so I just tried making a different variable by typing a command on the console with double quotes; you should do the same. It's always good practice to just play around with commands and see what happens.

We are going to use a fancy command from MASS called mvrnorm. It's not uncommon to forget the precise format that a command needs, but help is at hand. On the console type help(mvrnorm), and you will find that the Help screen shows you the way the command is used. It first tells you what the arguments are for the command, i.e. the things you need to specify to make it work; it then terrifies you with a more technical explanation, and finally gives a worked example. The worked example may be helpful or may just baffle you completely. Let's look at mvrnorm. The help screen starts as follows:

mvrnorm(n = 1, mu, Sigma, tol = 1e-6, empirical = FALSE)

and then gives an account of what each argument is:

n: the number of samples required.
mu: a vector giving the means of the variables.
Sigma: a positive-definite symmetric matrix specifying the covariance matrix of the variables.
tol: tolerance (relative to largest variance) for numerical lack of positive-definiteness in Sigma.
empirical: logical. If true, mu and Sigma specify the empirical not population mean and covariance matrix.

Thus the first things you need to specify are the number of cases to simulate (n), the means of the variables (mu), and the covariance matrix (Sigma). We are going to be working with z-scores, to make life easier. Remember that for z-scores, a correlation is equivalent to a covariance, and the SD and variance are both equal to 1. We first specify the correlation that we want:

myr = .5

Add that to your script, and run it, so that we have a value in myr. For Sigma, we need to specify the following 2 x 2 matrix:

1    myr
myr  1

In R, you can create a matrix using the c (concatenate) command, but if you just typed c(1, myr, myr, 1), this wouldn't work. Why not? Try typing it at the console and see. You'll find you have the right numbers, but they aren't in a 2 x 2 matrix. To get them properly arranged, you need to explicitly specify that you want a matrix with two rows and two columns. So the full command is:

mysigma = matrix(c(1,myr,myr,1),2,2)

The last two numbers in the command indicate we want 2 rows and 2 columns. Look at mysigma. You could then try making another matrix, but with 1, 4 rather than 2, 2 at the end.
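If you want to see that comparison side by side, here is a minimal sketch to try at the console:

myr = .5
c(1, myr, myr, 1)   # just four numbers in a row
matrix(c(1, myr, myr, 1), 2, 2)   # the same numbers arranged as 2 rows and 2 columns
matrix(c(1, myr, myr, 1), 1, 4)   # or forced into 1 row and 4 columns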



I can't stress enough that to understand commands, you just have to try them out. If you aren't sure how something works, tweak a command and see what happens.

Note: there's nothing to stop you typing .5 rather than myr in the command above. It will give the same answer. But we want a flexible script that will allow us to play around and look at different values of the correlation, and if we use the variable myr in the code, rather than a specific value, this allows us to do that easily.

So all you now need is to specify the number of cases and the mean values for X and Y. We do this with the commands:

myn=50   # we're going to create 50 rows of data
mymean=c(0,0)   # means are zero for both X and Y

We are now ready to go! What about the other arguments, tol and empirical? They are optional and we'll leave them alone for the moment, though we will look at empirical later on. We need a variable name for our simulated data. Let's call it myarray. So we type:

myarray=mvrnorm(n=myn, mymean,mysigma)

Now run the whole script. Each command is reflected in the console as it executes. But where are the results? The Workspace now confirms that you have created myarray, which is a matrix with 50 rows and 2 columns. To look at the results, just type myarray on the console. There are your 50 paired z-scores!

Before going further, I'll just explain why I've created variables that all start with 'my'. This is not essential, but it's a fairly common method. It has the advantage that you are unlikely to inadvertently use a variable name that corresponds to an existing R command, and when reading a script it makes it generally easier to distinguish your variables from other parts of the R language.

We have created paired variables, but they aren't yet labelled. Assigning column names to a matrix in R is easy. Remember, we created mylabels earlier. We can assign these as our column names as follows:

colnames(myarray)=mylabels

So now you have built up a whole script to generate paired numbers, which looks like this:

# simulate_XY
# Script to simulate z-scores X and Y, with specified correlation
require(MASS)   # Load functions from Modern Applied Statistics with S
mylabels=c('X','Y')   # Labels to be used later for our variables
myr=.5   # Correlation (can be changed)
mysigma = matrix(c(1,myr,myr,1),2,2)   # 2 x 2 covariance matrix
# (with z-scores, equivalent to correlation matrix)
myn=50   # N rows of data to simulate
mymean=c(0,0)   # Means for each variable (zero for z-scores)
myarray=mvrnorm(n=myn, mymean,mysigma)   # Create array of simulated data
colnames(myarray)=mylabels   # Assign labels to columns of simulated data

But you may be suspicious. How do you know that the numbers you have generated have means of zero and are actually correlated at .5? You can use R commands to find out. This command gives you a range of descriptive statistics, including the means:

summary(myarray)



and this one gives the correlation matrix:

cor(myarray)

At this point, you may start to think (depending on your locus of control) either that you have done something wrong, or that R is not very good. It's highly likely that your means will differ from zero, and the correlation will be smaller or bigger than .5. The reason is that we did not specify empirical = TRUE. R has faithfully generated a sample of observations from a population of values where the true correlation is .5, but because of sampling error, the observed value in this sample is likely to deviate from .5. If you re-run the program, but this time alter the mvrnorm command to:

myarray=mvrnorm(n=myn, mymean,mysigma,empirical=TRUE)

then you will find the means are zero (or, more likely, a real number that is infinitesimally small) and the correlation is .5. Alternatively, you could remove the empirical argument (or specify empirical=FALSE, which has the same effect), but specify n = 50000, or another very large number. The larger the sample you take from the population, the closer the sample correlation will be to the population correlation.

It's always a good idea to plot data as well as looking at summary statistics. To see a scatterplot of your data, add this command to your script:

plot(myarray)

A graph will now pop up in the Plots tab of the right-hand lower window.

Finally, you might want to save your simulated data so you can use them at a later time. This command will write a data file to your current directory:

write.table(myarray,"mysimdata")

If you want to get your data back on another occasion, this command will read the saved data into a matrix called newdata:

newdata=read.table("mysimdata")

The mvrnorm command uses a random number generator, which means that each time you run the script, different numbers will be generated. If you want to always get the same numbers, you can do so by specifying a 'seed' for the random number generator. This can be any number, but provided it is the same number each time, you'll get the same result. Just put this command somewhere before the mvrnorm command:

set.seed(2)
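If you want to see sampling error in action, you could loop over a range of sample sizes and watch the observed correlation home in on the population value. A minimal sketch, assuming mymean and mysigma are still defined as above:

for (thisn in c(20, 200, 2000, 20000)) {
  myarray = mvrnorm(n = thisn, mymean, mysigma)
  print(c(thisn, cor(myarray)[1,2]))   # sample size, then the observed correlation
}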

If you have started from scratch and got this far, then you should take a break and reward yourself with a cup of coffee or whatever other substances hit the spot for you.

4. Generating simulated data from two groups with different means on uncorrelated variables U and V

We're now going to apply what we've learned to generate data from two separate groups on two variables, U and V, that are uncorrelated. The only difference is that the means differ on both variables for the two groups. Let's set the means for U and V at -1 for group A and 1 for group B. We'll generate 60 cases for each group. We'll call these datasets myarrayA and myarrayB. If you've followed what we've done so far, you should be able to work out how to do this. It will be a good exercise to try, as you learn R by thinking it through, rather than by just copying. But I'll give you a script to do it anyway, in case you get stuck:

#demo_spurious_corr_script
require(MASS)   #Load functions from Modern Applied Statistics with S
mylabels=c('U','V')
myr=0   #U and V are uncorrelated, so r is set to zero
mysigma = matrix(c(1,myr,myr,1),2,2)



myn=60
set.seed(3)
#Array for group A
mymean=c(-1,-1)   #mean z-scores for group A
myarrayA=mvrnorm(n=myn, mymean,mysigma)   #Generate uncorrelated U and V for group A
colnames(myarrayA)=mylabels
summary(myarrayA)
cor(myarrayA)
plot(myarrayA)
#Array for group B
mymean=c(1,1)   #mean z-scores for group B
myarrayB=mvrnorm(n=myn, mymean,mysigma)   #Generate uncorrelated U and V for group B
colnames(myarrayB)=mylabels
summary(myarrayB)
cor(myarrayB)
plot(myarrayB)

We now want to combine the two arrays into one long array, and give this combined array a new name, myarrayAB. This can be achieved with a single command for concatenating rows, as follows:

myarrayAB=rbind(myarrayA,myarrayB)

We can then look at the correlation for the combined groups:

cor(myarrayAB)

Even though the correlation within either group was set to zero, the correlation for the combined groups is around .5 and highly significant. This is the phenomenon of spurious correlation. To make it more concrete, suppose U and V were height and chest hairiness, and groups A and B were males and females. Since men tend to be taller and hairier than women, you could find a spurious correlation between height and hairiness in a combined group, even though they are uncorrelated within either sex.

One reason I like simulations is that they can give you new insights into such phenomena. Note that we specified massive mean differences between our groups: one group with a mean z-score of +1 and the other with a mean z-score of -1. When I first attempted this simulation, I used much smaller group differences, and was surprised at how hard it was to generate a spurious correlation. With a simulation like this, you can play around and get a good feel for the phenomenon by repeatedly generating datasets with different values. The phenomenon of spurious correlation is a source of major concern, especially for those interested in correlational data, but my impression is that its importance may have been overemphasised, because in practice it doesn't become a problem except in quite extreme situations where you have two groups with very different mean values.
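A good way to see what is going on is to colour-code the points by group in the scatterplot. This is a minimal sketch, relying on the fact that the first 60 rows of myarrayAB come from group A and the last 60 from group B:

plot(myarrayAB, col=rep(c('red','blue'), each=myn))   # group A red, group B blue

The overall upward drift across the two colour clusters is what produces the spurious correlation.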



5. Demonstrating how incorporating group identity in a linear model unmasks the spurious nature of the correlation between U and V

Let us stick with the interpretation of our simulated data as representing height and hairiness in males and females (ignoring the fact that the group mean differences are vastly greater than would be realistic). We now need to add to our combined dataset another column that specifies gender. The R command rep will create a vector of repeated numbers. We make a set of 60 values equal to .5 for males, and 60 values equal to -.5 for females. The reason for picking these specific values is that it helps interpretation of regression output if we set the average for the two groups to zero and make the mean difference between them equal to one. However, it's not essential to do this, and you could have picked other numbers, such as 0 and 1, to indicate group identity.

males=rep(.5,myn)   #Create vector with myn repetitions of value .5
females=rep(-.5,myn)   #Create vector with myn repetitions of value -.5

Having made our two sets of numbers, we then join them together in a variable called gender, as follows:

gender=c(males, females)

Run these commands and then type gender at the console to check the result. All that is now needed is to bolt this column on to our existing myarrayAB, which we can do with a single command for concatenating columns, cbind:

myarrayAB=cbind(gender,myarrayAB)

Note that I have created a lot of intermediate variables in the course of generating myarrayAB. This is unnecessary and uses up memory. It would be possible to combine several steps in one command and so avoid creating the intermediate variables. However, when learning R, I think it is helpful to break commands down into small steps and create new variables, as this allows you to see the logic of what is being done, and to check the values of each variable. It also makes your scripts easier to understand when you come back to them later. Very experienced programmers may write much more compact code than this, but with modern computers, memory is seldom a problem unless you are working with very large data arrays, and so, apart from demonstrating how clever you are, compact code doesn't serve much function.

We now want to do a regression analysis. We will start with simple regression of V on U for the combined group data. R has many powerful commands for doing regression, but it requires that the data are formatted in what is called a data frame. Fortunately, this transformation is trivially easy: we just add the command:

mydata=data.frame(myarrayAB)
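At this point it is worth checking what the data frame looks like. A minimal sketch of two standard inspection commands:

head(mydata)   # shows the first six rows: columns gender, U and V
str(mydata)    # shows the structure: 120 observations of 3 variables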



Commands for regression in R are formulated in terms of the general linear model. This is a very general and flexible approach to statistical analysis that readily incorporates the more traditional methods beloved of psychologists, such as analysis of variance. However, I suspect that many psychologists reading this won't find it a very intuitive way to think about data, and it takes a while to map the R commands onto pre-existing statistical knowledge. The other thing that can be puzzling is that with programs such as SPSS, we are used to running a command and then looking at the output screen. Although R can be used in an analogous way, it is more usual to write the results to another variable. The variable that holds the results is likely to be a fairly complex structure, as we shall see. But the basic idea is that you don't just use a command to do the analysis: you actually specify a name for the output of the analysis.

The simplest form of regression is pretty easy. The command lm just stands for linear model, and requires two obligatory arguments: a formula that indicates the relationship between predicted and predictor variables, and the dataset used to estimate the regression coefficients. So let's illustrate this with our U and V variables. Add this command to the script:

myreg1=lm(V~U,mydata)

and then inspect the myreg1 variable that is created. This contains two coefficients: an intercept, which is close to zero, and a slope, which is close to 0.5. Note that when you type myreg1 you also get information about the formula used to generate the coefficients, labelled call. The output of lm contains a complex set of varied information in a structure. If you want to look at just part of the structure, you have to use the $ sign to indicate which bit. Try this, by typing at the console:

myreg1$call

and

myreg1$coef

You will see that the portion after the $ indicates which bit of the myreg1 structure is referred to. The term V~U tells the program to fit a straight line according to the formula:

V = b1 + b2.U

where b1 is the intercept and b2 is the slope. It is these intercepts and slopes that are generated when the lm command is executed. We can use these outputs to plot the regression line. First plot the raw data. This command will achieve that:

plot(V~U, mydata)

The command abline plots a straight line with a given intercept and slope. You could add a straight line with intercept zero and slope 1, as follows:

abline(0,1)

The regression line is simply the straight line with intercept and slope corresponding to the computed regression coefficients, and so can be plotted just by typing:

abline(myreg1$coef)

The lty argument allows you to specify the type of line you want. This command will draw the regression line as a dashed line:

abline(myreg1$coef,lty=5)

As an aside here, I haven't used R very much, and when I first saw a command with lty I was confused and thought it was some kind of variable. This is, in my experience, a common difficulty with R. Various letter sequences that look like variables or functions aren't. What did I do? I Googled "R lty" and immediately all became clear. Perhaps the single most important piece of advice if you want to learn R is to just use Google if you get stuck.

We now want to look at the regression with gender included. A simple modification to the syntax achieves this. We have taken care to code gender so that the sum of the two gender codes is zero, and we can include it in the linear model, even though it is a categorical variable. Here is the command:

myreg2=lm(V~U+gender,mydata)

This corresponds to the regression equation:

V = b1 + b2.U + b3.gender

If we type myreg2, we see that the output now has one intercept and two regression coefficients, like this:

(Intercept)        U   gender
    0.03357  0.04529  1.68913

Your values may differ from this because the simulated data will be different, but the overall pattern will be similar. Note that the regression coefficient associated with U is now close to zero, whereas that associated with gender is much bigger. Once we have run the model we can get much more detailed statistical output by requesting a summary, as follows:

summary(myreg2)

Now we have not only the coefficients, but their standard errors, associated t-values and significance levels. This confirms that gender is a substantial predictor of V, and U is not.
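If you want to get at the numbers in that summary programmatically, you can store it in a variable and extract the coefficient table with the $ sign, just as with myreg1$coef above. A minimal sketch (mysum is simply a name I've chosen):

mysum=summary(myreg2)   # store the whole summary object
mysum$coefficients      # the table of estimates, standard errors, t-values and p-values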



Finally, you can use the anova command to produce an anova table comparing the fit of the two models:

anova(myreg1,myreg2)

I've learned a lot about using R for regression analysis from this site. It also has information on how to do diagnostic plots, for instance. However, for the present, I won't get diverted into that, but will rather press on to look at what happens if you have groups defined on a variable that is highly correlated with one of the dependent variables.

6. Demonstrating how removing the effect of group will be misleading if group identity is highly dependent on one of the variables

You should by now be able to follow this script, which is heavily commented to explain each step. This time we are going to generate a multivariate normal distribution with 3 variables. Two of them, L1 and L2, are language measures, and A is an auditory measure. The language measures show moderate correlation with the auditory measure and are highly intercorrelated with one another. Group identity (control or language impaired, coded 1 or -1) is defined in terms of whether or not the score on L1 is above a z-score of -1. This, then, is analogous to the case of dyslexia or language impairment, where we define whether or not the child has the diagnosis on the basis of a low test score. In a case like this, removing the effect of group can abolish the relationship between L2 and A, simply because L1 and L2 are highly intercorrelated. It would be quite wrong to conclude from this that L2 and A are not related.

#demo_spurious_corr_script3
# Using a group variable that is highly correlated with one variable
# With these settings, including the SLI category
# removes the influence of L2
require(MASS)   #Load functions from Modern Applied Statistics with S
mylabels=c('L1','L2','A')   #3 variables, two language and one auditory
myr=.8   #correlation between the language measures
myr2=.3   #correlation of both language measures with auditory
mysigma = matrix(c(1,myr,myr2, myr,1,myr2, myr2,myr2,1),3,3)
myn=60
set.seed(6)   #change or comment out this line to get a different set of estimates
mymean=c(0,0,0)   #Means for L1, L2, and A are zero
myarray3=mvrnorm(n=myn, mymean,mysigma,empirical=TRUE)
colnames(myarray3)=mylabels
summary(myarray3)
cor(myarray3)
myL1=myarray3[,1]   #first column
#Now determine which cases are control or SLI and put in mygroup variable
mygroup=rep.int(-1,myn)   #default is SLI, coded -1
mycon=which(myL1 > -1)   #row index of those with L1 in control range
mygroup[mycon]=1   #These rows are assigned group code of 1 (control)
myarrayAB=cbind(mygroup,myarray3)   #add mygroup to the data array
mydata=data.frame(myarrayAB)
#Regression with only group included
myreg1=lm(A~mygroup,mydata)



summary(myreg1)
#Regression with both group and L2 included
myreg2=lm(A~L2+mygroup,mydata)
summary(myreg2)
anova(myreg1,myreg2)
#Regression if we exclude group ID
myreg3=lm(A~L2,mydata)
summary(myreg3)

The point I want to make with this simulation is that if we want to 'take out' the effect of group identity from a correlation, then we need to think carefully about the logic of what we are doing. In the previous example of spurious correlation, we defined gender quite independently of our two measures, height and hairiness. Although males and females differed substantially on both measures, their gender was not determined by those measures. In any logical causal route, we can confidently treat gender as a primary cause, and so it makes sense to 'take out' its effect.

For certain developmental disorders (and indeed other conditions), the causal route is much less certain, because the disorder is diagnosed on the basis of measured variables. So, for instance, dyslexia is defined in terms of low scores on reading measures. In the simulation above, we looked at the correlation between L2 and A, and defined our disorder in terms of L1, which was highly correlated with L2. We could have defined dyslexia in terms of L2 instead; you might like to try that: it will achieve a similar effect.

The results we got from our simulation are actually sensible, but there is a danger they will be misinterpreted. What they are actually telling us is that language measures and auditory measures are significantly correlated, and this is evident regardless of whether we use a categorical language measure, where group identity is determined by a cutoff on a test, or a quantitative measure. What this analysis is definitely not saying is that the correlation between language and auditory measures is spurious.

It's possible to imagine a situation where you could have a spurious association with these kinds of variables. For instance, poor social environment may affect both language measures and auditory measures. To show that, we'd need to incorporate a measure of social environment in our regression analysis. But the bottom line is that if we want to argue that an association between variables X and Y is spurious, we must have a third variable, Z, that is (a) measurable and (b) not dependent on X or Y. Z may be highly correlated with X and Y: that's not a problem. The problem is when Z is determined by X or Y.
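To make that last point concrete, here is a minimal sketch of how a measure of social environment might be added to the model. The variable socenv is entirely hypothetical: in a real study it would be a measured score, but here it is just simulated random numbers to show the syntax.

socenv=rnorm(myn)   #Hypothetical placeholder: NOT real data, just random z-scores
mydata$socenv=socenv   #Add the hypothetical measure as a column of the data frame
myreg4=lm(A~L2+socenv,mydata)   #Does L2 still predict A once socenv is included?
summary(myreg4)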