Source: Boston University, people.bu.edu/aimcinto/720/lec3.pdf (2016-02-08)


Lecture 3: Basics of R Programming

This lecture introduces how to do things with R beyond simple commands. We will explore programming in R. What is programming? It is the act of instructing a computer to execute a task. That sounds pretty general, but today we’ll see some specific examples of how to get R to do a task more complicated than rendering a graph or computing a mean.

Outline:
1. R as a programming language
2. Grouping, loops and conditional execution
3. Creating your own functions

Objectives
By the end of this session students will be able to:
1. Perform basic data manipulations on vectors (numeric, logical, character); deal with missing values; index vectors; many-to-one, one-to-many merging
2. Use grouped expressions and if-else statements
3. Write their own functions

Trivia: In the U.S., what is the record snowfall in a 24-hour period? http://docs.lib.noaa.gov/rescue/mwr/081/mwr-081-02-0038.pdf

A couple of odds and ends that may be of use to some of you:

a) Reading in huge datasets http://simplystatistics.org/2011/10/07/r-workshop-reading-in-large-data-frames/

b) Web Scraping http://thebiobucket.blogspot.com/2011/10/little-webscraping-exercise.html

c) Sorting: use the command sort(), e.g.
> sort( c(1,55,-2,11) )
[1] -2 1 11 55

d) Generating a random permutation of a set of data
> sample(c("First","Second","Third","Fourth"), replace=F)
[1] "Fourth" "First" "Third" "Second"


e) Reading in individual xlsx sheets
As mentioned, this can be a pain. The easiest method is to export as .csv, but another fine option is to use the XLConnect package: http://cran.r-project.org/web/packages/XLConnect/vignettes/XLConnect.pdf
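Items c) and d) above can be tried together in a short sketch. The call to set.seed(), the seed value, and the names `x`, `items` and `shuffled` are additions made here for illustration so that the shuffle is reproducible; they are not part of the lecture's code.

```r
# Sorting: ascending by default, descending on request
x <- c(1, 55, -2, 11)
sorted_up <- sort(x)                       # -2 1 11 55
sorted_down <- sort(x, decreasing = TRUE)  # 55 11 1 -2

# order() returns the positions that would sort x
positions <- order(x)                      # 3 1 4 2

# Random permutation; the seed value 720 is arbitrary
set.seed(720)
items <- c("First", "Second", "Third", "Fourth")
shuffled <- sample(items)                  # replace = FALSE is the default
print(shuffled)
```

The order() function comes back later in these notes, where it is used to sort the rows of a data frame.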

3.1 A Quick Review of Matrices and Data Frames

We discussed vectors, matrices and data structures in Lectures 1 and 2. Let us recall how to create a matrix of data from some given measurements, say heights and weights of 15 students. Suppose we have the data below:

height  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
weight 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164

We can read this into R using the following commands:
> height = c(58,59,60,61,62,63,64,65,66,67,68,69,70,71,72)
> weight = c(115,117,120,123,126,129,132,135,139,142,146,150,154,159,164)

The above gives us two vectors with the height and weight. But it may be useful to have this as a matrix, so that each person’s height and weight appear together. To do this, we can use the command:
> htwtmatrix = matrix(c(height,weight),15,2) # what do 15 and 2 refer to?
> htwtmatrix
      [,1] [,2]
 [1,]   58  115
 [2,]   59  117
 [3,]   60  120
 [4,]   61  123
 [5,]   62  126
 [6,]   63  129
 [7,]   64  132
 [8,]   65  135
 [9,]   66  139
[10,]   67  142
[11,]   68  146
[12,]   69  150
[13,]   70  154
[14,]   71  159
[15,]   72  164


What do you notice about how R creates a matrix from a vector? It constructs matrices column-wise by default, so if you want to create a matrix row-by-row, you need to give it an additional argument “byrow=T”.

Exercise 1. How would you create a matrix that has height and weight as the two rows instead of columns? Look up the help on the “matrix” function if necessary.

Now we have each person’s height and weight together. However, for future reference, instead of storing the data as a matrix, it might be helpful to have column names with the data. Recall from lecture 1 that a convenient way to attach column names is to convert htwtmatrix to a data frame, in which every row and every column has a name. We use the command:
> htwtdata = data.frame(htwtmatrix)
> htwtdata
   X1  X2
1  58 115
2  59 117
3  60 120
4  61 123
5  62 126
6  63 129
7  64 132
8  65 135
9  66 139
10 67 142
11 68 146
12 69 150
13 70 154
14 71 159
15 72 164

(as.data.frame() works as well.) Notice that now the columns are named “X1” and “X2”. We can now assign names to the columns by means of the “names()” command:
> names(htwtdata) = c("height","weight")
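As an aside, the same named data frame can also be built in one step from the two vectors, skipping the matrix. This is a minimal sketch; the name `htwt_direct` is chosen here for illustration and does not appear elsewhere in the lecture.

```r
height <- c(58,59,60,61,62,63,64,65,66,67,68,69,70,71,72)
weight <- c(115,117,120,123,126,129,132,135,139,142,146,150,154,159,164)

# Each argument becomes a column, named after the argument
htwt_direct <- data.frame(height = height, weight = weight)

names(htwt_direct)   # column names
dim(htwt_direct)     # 15 rows, 2 columns
head(htwt_direct, 3) # first three students
```

Columns of a data frame can then be pulled out by name with the $ operator, e.g. htwt_direct$weight.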


We can find the column names of a data frame, without opening up the whole data set, by typing in
> names(htwtdata)
[1] "height" "weight"

Quick aside: Let’s say we have a very large dataset and we don’t want to search through the column names to find out the column number of a particular variable we’re interested in. We can do this:
> which(names(dataset.of.interest)=="column name")

For example, with our two-column data frame htwtdata, to find the column number of weight, we can type
> which(names(htwtdata)=="weight")
[1] 2

This is telling us that weight is the second column in the data frame.

Let us recall how R operates on matrices, and how that compares to data frames. Recall that R evaluates functions over entire vectors (and matrices), avoiding the need for loops (more on this later). For example, what do the following commands do?
> htwtmatrix*2
> htwtmatrix[,1]/12 # convert height in inches to feet
> mean(htwtmatrix[,2])

To get the dimensions or number of rows or columns of a data frame, it is often useful to use one of the following commands:
> dim(htwtdata)
> nrow(htwtdata)
> ncol(htwtdata)

Exercise 2. What does the following R command do? (See footnote 1.)
> htwtdata[,2]*703/htwtdata[,1]^2

1 See http://www.nature.com/ijo/journal/vaop/naam/abs/ijo201617a.html for a recent discussion on this health metric.


Exercise 3. How would you get R to give you the height and weight of the 8th student in the data set? The 8th and 10th student?

That was all a quick review. Now onto the new stuff:

3.2 Programming: loops, if-then/for/while statements

So far we have mainly used R for performing one-line commands on vectors or matrices of data. One of the most powerful features of R is the ability to program, that is, to automate a task. Today we will look at some simple yet powerful programming tools in R, such as loops, if-then and while statements.

If/else statements

In R, one can write a conditional statement with the following syntax:

ifelse(condition on data, value returned if true, value returned if false)

The above expression reads: if the condition on the data is true, return the “true” value; otherwise return the “false” value. For example,
> ifelse(3 > 4, x <- 5, x <- 6)
> x
[1] 6
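Because ifelse() works element by element, it is most useful on whole vectors. A minimal sketch with made-up temperatures (the vector `temps`, the cutoff of 70 and the labels are assumptions for illustration):

```r
temps <- c(55, 72, 68, 90, NA)   # made-up data, one missing value

# One label per element; a missing input gives a missing output
label <- ifelse(temps > 70, "warm", "cool")
print(label)
# [1] "cool" "warm" "cool" "warm" NA
```

The result always has the same length as the condition, which is what makes ifelse() convenient for creating new variables.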

The operators && and || are often used to combine multiple conditions in an if statement. Whereas & (and) and | (or) apply element-wise to vectors, && and || apply to vectors of length one, and only evaluate their second argument if necessary. Thus it is important to remember which logical operator to use in which situation.

In general, element-wise operations across vectors (as in ifelse() or conditional indexing) use the single-symbol convention, while single conditions (as in an if statement, including one inside a loop) use the double-symbol convention. This will become clearer with some examples.
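Before the height and weight examples, here is a small sketch of the difference on toy logical vectors. The names `a`, `b`, `msg` and `safe` are made up for this illustration.

```r
a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)

# Single symbol: one result per position
elementwise <- a & b          # TRUE FALSE FALSE
either <- a | b               # TRUE TRUE FALSE

# Double symbol: a single TRUE/FALSE, as needed by if()
if (length(a) > 0 && a[1]) {
  msg <- "first element is TRUE"
} else {
  msg <- "first element is not TRUE"
}

# Short-circuit: the right-hand side is never evaluated here,
# so stop() does not raise an error
safe <- FALSE && stop("never reached")

print(elementwise); print(either); print(msg); print(safe)
```

The short-circuit behaviour is why && is handy for guarding a test, for example checking that a vector is non-empty before looking at its first element.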

> hmean = mean(htwtdata$height)
> wmean = mean(htwtdata$weight)
> ifelse( hmean > 61 && wmean > 120, x <- 5, x <- 6)


> x
[1] 5

Are hmean and wmean vectors of length 1?

> htwt_cat<-ifelse (height>67 | weight>150, "high", "low")
> htwt_cat
 [1] "low" "low" "low" "low" "low" "low" "low" "low" "low" "low" "high"
[12] "high" "high" "high" "high"
> htwt_cat<-ifelse (height>67 || weight>150, "high", "low")
> htwt_cat
[1] "low"

(Notice that in the second ifelse statement only the first element in the series was computed. This is the behavior of older versions of R; as of R 4.3.0, using || or && on a vector of length greater than one is an error.)

If/else statements can be extended to include multiple conditions. Suppose we have the following data:

final_score<- c(39, 51, 60, 65, 72, 78, 79, 83, 85, 85, 87, 89, 91, 95, 96, 97, 100, 100)
passfail<-ifelse(final_score>=60, "pass", "fail")

Suppose we want to create a variable called grade that is assigned as follows:

“F” if final_score < 60
“D” if 60 ≤ final_score < 70
“C” if 70 ≤ final_score < 80
“B” if 80 ≤ final_score < 90
“A” if 90 ≤ final_score

We can use a “nested” ifelse command as follows:

grade <- ifelse(final_score<60,"F",
           ifelse(final_score<70,"D",
             ifelse(final_score<80,"C",
               ifelse(final_score<90,"B", "A"))))
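When the nesting gets deep, the base R function cut() is one possible alternative. Note that this is a different technique from the lecture's nested ifelse(), shown here only as a sketch; the names `grade_nested` and `grade_cut` are chosen for illustration.

```r
final_score <- c(39, 51, 60, 65, 72, 78, 79, 83, 85, 85, 87, 89,
                 91, 95, 96, 97, 100, 100)

# Nested ifelse(), as in the notes
grade_nested <- ifelse(final_score < 60, "F",
                  ifelse(final_score < 70, "D",
                    ifelse(final_score < 80, "C",
                      ifelse(final_score < 90, "B", "A"))))

# cut() with right = FALSE uses intervals of the form [lower, upper)
grade_cut <- cut(final_score,
                 breaks = c(-Inf, 60, 70, 80, 90, Inf),
                 labels = c("F", "D", "C", "B", "A"),
                 right = FALSE)

# Both approaches agree on every score
all(as.character(grade_cut) == grade_nested)
table(grade_cut)   # counts per grade: F 2, D 2, C 3, B 5, A 6
```

One design difference: cut() returns a factor with ordered levels F to A, which is often convenient for tables and plots, whereas ifelse() returns a plain character vector.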


If you have missing values in your vector (NA), it’s not a problem. However (!), if you have some odd coding for missing values (-99 is common), what happens to grade for that value?
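To see the issue concretely, here is a small sketch with made-up scores in which -99 stands for a missing value. The helper `to_grade` is hypothetical (it simply wraps the nested ifelse() above), and recoding -99 to NA before grading is one possible fix, not necessarily the one used in class.

```r
# Hypothetical helper wrapping the nested ifelse() from the notes
to_grade <- function(score) {
  ifelse(score < 60, "F",
    ifelse(score < 70, "D",
      ifelse(score < 80, "C",
        ifelse(score < 90, "B", "A"))))
}

scores <- c(72, -99, 88, NA, 59)     # made-up data; -99 means "missing"

grade_raw <- to_grade(scores)        # the -99 is silently graded "F"

# Recode the odd missing-value code to NA before grading
scores_clean <- scores
scores_clean[which(scores_clean == -99)] <- NA
grade_clean <- to_grade(scores_clean)

print(rbind(grade_raw, grade_clean))
```

After recoding, the student with the -99 code gets NA instead of an undeserved "F".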

This nested ifelse() approach is really useful for assigning different colors in graphs under different conditions. Recall the Beijing air quality plot.


The code for the color section of the Beijing air graph in the plot() command reads:

col = ifelse(pm25<=50,"green",
        ifelse(pm25<101,"yellow",
          ifelse(pm25<150,"orange",
            ifelse(pm25<201,"red",
              ifelse(pm25<301,"purple",
                "firebrick")))))

Repetitive execution: for loops, repeat and while

All of these examples are analogous to MACROS in SAS. We want to do something many times, without doing it by hand each time; we are automating the process. These methods can be greatly expanded to do dynamic, efficient, extremely complicated operations. The idea of this section is really just to get you familiar with how these programs do complicated operations.

There is a for loop construction in R which has the form

> for (name in expr_1) expr_2

That means: for each element (name) of some set (expr_1), perform the operation expr_2.

Here is the simplest loop there is:

> for(i in 1:12){print(i)}
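A loop can step through the values of a vector directly, or through its positions. The sketch below shows both on a few made-up weights; the names `weights`, `total` and `running` are assumptions for illustration.

```r
weights <- c(115, 117, 120)   # made-up values

# Loop over the values themselves
total <- 0
for (w in weights) {
  total <- total + w
}

# Loop over positions; seq_along() is safe even for an empty vector,
# where 1:length(x) would wrongly give c(1, 0)
running <- numeric(length(weights))
for (i in seq_along(weights)) {
  running[i] <- sum(weights[1:i])   # running total up to position i
}

print(total)     # 352
print(running)   # 115 232 352
```

Both results could also be obtained without a loop, via sum(weights) and cumsum(weights); the loop is shown only to illustrate the syntax.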


Suppose, based on the ozone measurements in the airquality data set, we want to figure out which days were good air quality days (1) or bad air quality (0), based on a cutoff of ozone levels above 60ppb. Let us create a new vector called “goodair”, which stores the information on good and bad air-quality days. We can do this using a for loop.

> numdays = nrow(airquality)
> numdays
[1] 153
> goodair = numeric(numdays) # creates an object which will store the vector
> for(i in 1:numdays) if (airquality$Ozone[i] > 60) goodair[i] = 0 else goodair[i] = 1

(Notice that we have an if statement here within a for loop.)

Does the command above work? Why/why not? Let us check the Ozone variable. What do you notice below?

> airquality$Ozone
[1] 41 36 12 18 NA 28 23 19 8 NA 7 16 11 14 18 14 34 6
[19] 30 11 1 11 4 32 NA NA NA 23 45 115 37 NA NA NA NA NA
[37] NA 29 NA 71 39 NA NA 23 NA NA 21 37 20 12 13 NA NA NA
[55] NA NA NA NA NA NA NA 135 49 32 NA 64 40 77 97 97 85 NA
[73] 10 27 NA 7 48 35 61 79 63 16 NA NA 80 108 20 52 82 50
[91] 64 59 39 9 16 78 35 66 122 89 110 NA NA 44 28 65 NA 22
[109] 59 23 31 44 21 9 NA 45 168 73 NA 76 118 84 85 96 78 73
[127] 91 47 32 20 23 21 24 44 21 28 9 13 46 18 13 24 16 13
[145] 23 36 7 14 30 NA 14 18 20

When there are missing values, many operations in R fail. Here the if statement cannot decide whether NA > 60 is true or false, so the loop stops with an error at the first missing value. One way to get around this is to create a new data frame that drops every row containing a missing value. This can be done by means of the command “na.omit”:

> airqualfull = na.omit(airquality)
> dim(airqualfull)
[1] 111 6
> dim(airquality)
[1] 153 6
# How many cases were deleted because of missing data?

Sometimes deleting all cases with missing values is useful, and sometimes it is a horrible idea… We could get around this without deleting missing cases by adding a check for NA values within the for loop. (See R code for this method.)
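The R code referred to above is not reproduced in these notes. One way such a loop could look, as a sketch only and not necessarily the version used in class, is to test for NA with is.na() before making the comparison:

```r
# airquality ships with R (datasets package)
numdays <- nrow(airquality)
goodair <- numeric(numdays)          # initialize the result vector

for (i in seq_len(numdays)) {
  oz <- airquality$Ozone[i]
  if (is.na(oz)) {
    goodair[i] <- NA                 # keep missing days as missing
  } else if (oz > 60) {
    goodair[i] <- 0                  # bad air quality
  } else {
    goodair[i] <- 1                  # good air quality
  }
}

# Counts of bad (0), good (1) and missing (NA) days
table(goodair, useNA = "ifany")
```

With this version all 153 days are kept, and days without an ozone reading are marked NA rather than dropped.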


Now let us try doing this again with the data with the complete cases.

> numdays = nrow(airqualfull)
> numdays
[1] 111
> goodair = numeric(numdays) # initialize the vector
> for(i in 1:numdays) if (airqualfull$Ozone[i] >60) goodair[i] = 0 else goodair[i] = 1
> goodair
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 0 1 1 0
[38] 1 0 0 0 0 1 1 1 1 1 0 0 0 1 0 0 1 1 0 1 0 1 1 1 1 0 0 0 1 1 0 1 1 1 1 1 1
[75] 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

At this point we might be interested in which days were the ones with good air quality. The “which” command returns a set of indices corresponding to the condition specified. We can then use the indices to find the day of the month this corresponds to.

> which(goodair == 1) ## notice the double “=” signs!
> goodindices <- which(goodair == 1)
> airqualfull[goodindices,]

If we had wanted to keep the entire dataset airquality intact and not remove the observations with missing values, we could put a condition in the loop that deals with NA values. (Again: see R code.)

Did we really need a loop?? I could have used an ifelse() statement instead of a loop to get the same coding (1 = good, 0 = bad), with NA kept for days with a missing ozone value:

> goodairIFELSE <- ifelse(is.na(airquality$Ozone), NA,
                     ifelse(airquality$Ozone>60,0,1))


We can also index the rows as above in an easier fashion. We’ll get to this in the section Conditional Indexing.

Exercise 4. Suppose we want to define a day with good quality air (1) as one with ozone levels below 60ppb, and temperatures less than 80 degrees F. Use an ifelse() statement to do this with the airqualfull dataset, and output the resulting subset (only values that meet these criteria) of the data to a file called goodquality.txt.

Other looping options: WHILE & REPEAT

Similar to a for loop, the while statement can be used to perform an operation while a given condition is true. For example:

z <-0
while (z<5){
  z<-z+2
  print(z)
}
[1] 2
[1] 4
[1] 6

In the above while statement we initialize z to have a value of 0. We then state that as long as z is less than 5 we will continue to perform the following loop operation z<-z+2. Thus we have

z <- 0+2 ## Initially z is 0, but after the first iteration of the loop the value of z is 2
z <- 2+2 ## After the second iteration the value of z is 4
z <- 4+2 ## After the third iteration the value of z is 6

The while statement stops here because z is now bigger than 5.

Another option for looping is the repeat statement. An example follows:

> i<-1
> repeat{
+ print(i)
+ if( i == 15) break
+ i<-i+1
+ }
[1] 1
[1] 2
[1] 3
[1] 4


[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15

(I won’t test you on any of this. I just want to give you a flavor for the options available.)

Example of a useful LOOP (and a not-so-useful one):

I’m not sure if I can share this dataset with you or not, so I’m just going to show you this code.

> expression.data <- read.table("/Users/Avery/Documents/classes/Graduate/830 Microarray Data/data/data.cn.adh.dcis.clean.raw.txt", header=T, sep="\t")
> dim(expression.data) #big data
[1] 22283 110

There are 36 participants who have had genetic measurements (a microarray luminosity score) on 22k RNA expression levels. Note that rows and columns here are reversed from what we’re used to: subjects are on the columns, not rows. The data of interest are at every third column, so I subset the data as follows:

> col.signal <- seq(2,107,by=3)
> data.signal <- data.frame(expression.data[, col.signal])
> dim(data.signal)
[1] 22283 36
> data.signal[10:15, 1:5]

      Person 1 Person 2 Person 3 Person 4 Person 5 …
Gene1 188.9910 139.5375 120.3423 166.5347 295.7254 …
Gene2 322.2060 273.0769 232.9695 259.2409 460.1872 …
Gene3 222.3313 163.3700 128.9506 177.7558 297.9407 …
Gene4 478.7550 418.3608 366.6336 448.8449 780.4992 …
Gene5 415.2705 329.6245 341.5033 338.3168 557.4103 …
…

This is a made up example, but say we want to see if any of our participants were in fact just a replicate of Person 1. If any column was a replicate of column (person) 1, the expression levels for each gene would be very, very close (about a 45 degree line on a plot). So we want to make a plot of every person’s RNA expression level compared to person 1. (Again: this is a made up example, but serves the purposes of this lecture.) We can run the following code to automate this:

setwd("/Users/Avery/Desktop/plots")
for( i in 2:ncol(data.signal)){
  png( paste("myplot_", i, ".png", sep="") )
  plot(data.signal[,1],data.signal[,i], xlab="Person 1 score", ylab=paste("Person", i, sep=""),col="blue")
  dev.off()
  print(paste("plotting graph ", i))
}

Don’t pay attention to the specifics of the data. This is not a genetics class. The aim is just to have you see an automated script for generating multiple plots.

(Non-useful LOOP): Simulating # of ties in card game War (script not included in lecture)

#when a loop is a big waste of effort
> x<-seq(1:20)
> x.sq<-numeric(20)
> for(i in 1:20){ x.sq[i] <- x[i]^2 }
> x2 <- x^2

Conditional Indexing

Data frames and matrices allow for conditional indexing in R, which is often very useful. Instead of creating the goodair vector using a loop or ifelse() statement, we could directly extract the good air quality days using conditional indexing, using the single command below:

> airqualfull[airqualfull$Ozone < 60,]


It is worthwhile to keep in mind that many things in R can be done avoiding loops, and using conditional indexing can save a lot of time and effort! However, there are other times when using loops may be the best (or only) way to do things.

Conditional indexing is also useful for extracting groups of data from data frames, or splitting data into groups according to some criterion. For example, to get sets of ozone measurements for days with temperatures higher and lower than 80 degrees F, we can use:

> split(airqualfull$Ozone, airqualfull$Temp < 80)
$`FALSE`
[1] 45 29 71 39 23 135 49 32 64 40 77 97 97 85 27 7 48 35
[19] 61 79 63 80 108 20 52 82 50 64 59 39 9 16 122 89 110 44
[37] 28 65 168 73 76 118 84 85 96 78 73 91 47 32 20 44 16 36
$`TRUE`
[1] 41 36 12 18 23 19 8 16 11 14 18 14 34 6 30 11 1 11
[19] 4 32 23 115 37 21 37 20 12 13 10 16 22 59 23 31 44 21
[37] 9 45 23 21 24 21 28 9 13 46 18 13 24 13 23 7 14 30
[55] 14 18 20
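Two further points about conditional indexing can be sketched on the full airquality data: combining conditions with a single &, and what happens when the condition involves a column with missing values. The variables used (Month, Wind) and the cutoffs are chosen only for illustration, as are the names `windy_may`, `with_na_rows` and `high_ozone`.

```r
# airquality ships with R (datasets package)

# Two conditions combined element-wise with a single &
windy_may <- airquality[airquality$Month == 5 & airquality$Wind > 15, ]

# A condition on a column containing NAs: each NA in the condition
# produces an all-NA row in the result
with_na_rows <- airquality[airquality$Ozone > 100, ]

# Wrapping the condition in which() keeps only rows where it is TRUE
high_ozone <- airquality[which(airquality$Ozone > 100), ]

nrow(with_na_rows)   # includes one row per missing Ozone value
nrow(high_ozone)     # only the days that really exceed 100
```

This is why the lecture's examples index airqualfull, which has no missing values; on data with NAs, which() (or an explicit !is.na() condition) avoids the spurious NA rows.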

Exercise 5. Using conditional indexing, write an R command to replicate the results of exercise 4. (Hint: use conditional indexing with a single &, not double &&.)

3.3 Merging and Sorting Dataframes

This topic has a lot to it, so I only cover the basics. There are plenty of online resources to do this on your own. There won’t be any homework questions on this section.

Say you have two datasets and you want to merge them based on an ID number. A simple example of merging these by variable ID follows:

> dataset.1<-matrix(c(1,13,12,1, 2,12,10,2, 3,13,9,1, 4,9,8,2, 5,3,7,3, 6,5,6,1, 7,6,5,2, 8,5,5,3), ncol=4, byrow=T)
> dataset.1<-data.frame(dataset.1)
> names(dataset.1)<-c("ID","read 1","read 2","read 3")
> dataset.1
  ID read 1 read 2 read 3
1  1     13     12      1
2  2     12     10      2
3  3     13      9      1
4  4      9      8      2
5  5      3      7      3
6  6      5      6      1
7  7      6      5      2
8  8      5      5      3
> dataset.2<-matrix(c(1,12, 2,13, 3,3, 4,15, 5,31, 6,15, 7,4, 8,6, 9,22), ncol=2, byrow=T)
> dataset.2<-data.frame(dataset.2)
> names(dataset.2)<-c("ID","read 4")
> dataset.2
  ID read 4
1  1     12
2  2     13
3  3      3
4  4     15
5  5     31
6  6     15
7  7      4
8  8      6
9  9     22
> dataset.merged <- merge(dataset.1,dataset.2,by="ID", all.y=TRUE)
> dataset.merged
  ID read 1 read 2 read 3 read 4
1  1     13     12      1     12
2  2     12     10      2     13
3  3     13      9      1      3
4  4      9      8      2     15
5  5      3      7      3     31
6  6      5      6      1     15
7  7      6      5      2      4
8  8      5      5      3      6
9  9     NA     NA     NA     22

Now, a nice simple example of many-to-one merging. (The reshape() function used below is part of base R’s stats package, so loading the separate reshape package with library(reshape) is not required.)

#create sample data
> my.test.2<-matrix(c(1,13,12,1, 1,12,10,2, 2,13,9,1, 2,9,8,2, 2,3,7,3, 3,5,6,1, 3,6,5,2, 3,5,5,3), ncol=4, byrow=T)
#convert to a dataframe, a more robust format
> my.test.2<-as.data.frame(my.test.2)
#add column names
> names(my.test.2)<-c("ID","read A","read B","visit")
#print it to get a look. This is in "long" format
> my.test.2
  ID read A read B visit
1  1     13     12     1
2  1     12     10     2
3  2     13      9     1
4  2      9      8     2
5  2      3      7     3
6  3      5      6     1
7  3      6      5     2
8  3      5      5     3

#above is a data frame of repeated measures on the same individuals at different visits. The function below converts the data frame into a 'wide' format, with one row per individual, variables renamed by visit number, and NA filled in where a visit is missing.

> wide_mytest <- reshape(my.test.2, direction="wide",idvar="ID",timevar="visit")
> wide_mytest
ID read A.1 read B.1 read A.2 read B.2 read A.3 read B.3
 1       13       12       12       10       NA       NA
 2       13        9        9        8        3        7
 3        5        6        6        5        5        5

Finally, imagine you have a matrix or data frame you want to sort by the values of a particular column, smallest to largest, while keeping each row intact. Do the following:

> setwd("/Users/Avery/Desktop/classes/720 spring 2015/classes/wk3/sorting matrices")
> ZZ<-read.table("ZZ")
> ZZ
          p result x1         x2
1 0.8417549      1  2  0.4213440
2 0.9136235      1  3 -0.6412975
3 0.8361460      0  2  0.3798271
4 0.9850423      1  4 -0.5625415
5 0.3114491      1  0  1.4566465
#note each row here starts with an assigned number; this is NOT an actual column of (1,2,3,…), it’s just a row designation.
#now sort it via column named "p"
> ZZ[ order(ZZ[,1]) , ]


#or:
> attach(ZZ)
> ZZ[order(p),]
          p result x1         x2
5 0.3114491      1  0  1.4566465
3 0.8361460      0  2  0.3798271
1 0.8417549      1  2  0.4213440
2 0.9136235      1  3 -0.6412975
4 0.9850423      1  4 -0.5625415

Now the data frame is sorted by column “p” while keeping each row intact.

There won’t be any homework questions on this section! I just wanted to show you all some of these techniques as a reference for later on.

3.4 Writing simple functions in R: Why and How

The R language allows the user to create objects of mode function. These are true R functions that are stored in a special internal form and may be used in future expressions. By this tool, the language gains enormously in power, convenience and elegance, and learning to write useful functions is one of the main ways to make your use of R comfortable and productive.

It should be emphasized that many of the functions supplied as part of the R system, such as mean(), var(), sd() and so on, are themselves written in R and thus do not differ materially from user written functions. (A few low-level ones, such as dim(), are built-in primitives.)

A function is defined by an assignment of the form

> f.name <- function(arg_1, arg_2, ...) expression

The expression is an R expression (usually a grouped expression) that uses the arguments, arg_i, to calculate some value. The value of the expression is the value returned by the function.

A call to the function then usually takes the form f.name(expr_1, expr_2, ...) and may occur anywhere a function call is legitimate.
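As a minimal sketch of this form, here is a small user-written function that centers and scales a vector. The function name `standardize`, its `na.rm` argument and the input values are assumptions made for this illustration.

```r
# f.name <- function(arg_1, arg_2, ...) expression
standardize <- function(x, na.rm = TRUE) {
  # The value of the last expression is what the function returns
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

# A call supplies values for the arguments; na.rm keeps its default
z <- standardize(c(58, 65, 72))
print(z)
# [1] -1  0  1
```

The second argument has a default value, so it can be omitted in the call; arguments without defaults must always be supplied.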


Simple functions

As a first example, consider a function to calculate a one-sample t-statistic to test the null hypothesis that in the height and weight data set, the mean population weight is “x” lb, where “x” can be specified by the user. This is an artificial example, of course, since there are other, simpler ways of achieving the same end (we’ll do this next class).

> onesam <- function(y1, x) {
    n1 <- length(y1) ##sample size
    yb1 <- mean(y1) ##mean of y1
    s1 <- var(y1) ##variance of y1
    tstat <- (yb1 - x)/sqrt(s1/n1) ##computing t-statistic = (mean-x)/SE
    tstat
  }

With this function defined, you could perform one-sample t-tests using a call such as

> t.statistic <- onesam(htwtdata$weight, 130); t.statistic

To check whether this function works, compare it to running R’s built-in t-test function:

> t.test(htwtdata$weight-130)

(Equivalently: t.test(htwtdata$weight, mu = 130).)

A function can be called within a loop, or can be applied to elements of a vector or matrix at once, making R very powerful. We will continue looking at similar examples throughout the course.

Another example:

min.max.range <- function(x){
  minimum <- min(x)
  r <- max(x) - min(x)
  maximum <- max(x)
  print(minimum)
  print(maximum)
  print(r)
}
vec.1<- c(10, 20, 50)
min.max.range(vec.1)
[1] 10
[1] 50
[1] 40

Exercise 6. Write a function called summarystat, which returns the mean, median, and standard deviation of a set of numbers.

Recap:

• Review of matrix/dataframe operations
• if statements, ifelse(), loops
• Conditional indexing (the most useful topic of this class)
• Merging & sorting dataframes (not on homework, but very useful in real applications)
• Creating functions

Reading:

• VS. Chapter 8.1, 8.2, 9 and 10

Assignment:

• Homework 2 due, Homework 3 assigned. (With extra credit for those interested in simulation.)