
An Introduction to R for Epidemiologists using RStudio: the basics

Steve Mooney (much borrowed from C. DiMaggio)

Department of Epidemiology, Columbia University, New York, NY 10032

sjm2186@columbia.edu

Introduction to R Concepts and Object Types

SER Summer 2014

Outline

1 getting our hands dirty
    calculating, assigning, combining
    from calculations to programming

2 how R thinks (vs. SAS and SPSS)

3 data

4 packages

5 help

6 objects
    about objects
    vector
    matrix & array
    list
    dataframe

S. Mooney (Columbia University) R intro 2014 2 / 56

getting our hands dirty calculating, assigning, combining

First steps: Use R as a calculator

math operators and functions

arithmetic + , - , * , /

power ^

convert 68 degrees Fahrenheit to Celsius ($C^{\circ} = \frac{5}{9}(F^{\circ} - 32)$)

5/9*(68-32)

First type it directly in the console. Then type it into the editor and send it to the console. (Remember how to do that?)


assignment operator: the ‘memory’ key

<-

y <- 5/9*(68-32) #assignment (no display)

y

(y <- 5/9*(68-32)) #assignment (display)


functions

FunctionName(parameter1, parameter2, ...)

math operators and functions

mathematical functions - sqrt, log, exp, sin, cos, tan

simple functions - max, min, length, sum, mean, var, sort

abs(-23) #absolute value

exp(8) # exponentiation

log(exp(8)) # natural logarithm

sqrt(64) # square root


concatenation function: combine or "vectorize"

c()

x <- c(100,90,80,70,60)

x

y <- c("a", "b", "c", "d")

y


Put it together: Vectorized computations

The calculation you tried with 68 (a scalar) also works with the vector you just created:

5/9*(x-32)

z<-5/9*(x-32)

z


getting our hands dirty from calculations to programming

write your own function: R is a programming language

my.function<-function(x){

5/9*(x-32)

}

my.function(68)

[1] 20

a<-c(134,156,222)

my.function(a)

[1] 56.66667 68.88889 105.55556

We’ll revisit creating functions if we have time...
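A function can also take more than one argument, and arguments can have default values. The sketch below is illustrative only: the name to.celsius and its digits argument are made up for this example and do not come from the slides.

```r
# Hypothetical helper: convert Fahrenheit to Celsius and round the result.
# 'digits' has a default value, so callers may leave it out.
to.celsius <- function(f, digits = 1) {
  round(5/9 * (f - 32), digits)
}

to.celsius(68)                             # a single value
to.celsius(c(134, 156, 222))               # a whole vector at once
to.celsius(c(134, 156, 222), digits = 3)   # override the default
```

Because the body uses ordinary vectorized arithmetic, the same function handles a scalar or a vector without any extra code.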


how R thinks (vs. SAS and SPSS)


A Quick Warning...

We’re going to shift gears into abstract territory for a little while.

I hope the material that follows will orient you as we learn more concrete pieces of R.

But I want to acknowledge that it gets away from the concrete; please don’t worry if you feel you’re not grasping all the details fully right now.


Programming and Analyzing: How to work with R

In my experience, data analysis is usually an iterative process:

1 Massage data (merge datasets, select the items you want to analyze, ensure measures are created properly, etc.)

2 Call some procedure to do analytic step (e.g. look at 2x2 table)

3 Interpret output & generate new questions (back to step 1 or 2)

In SAS (and perhaps SPSS), data massage mostly happens in DATA steps and analysis mostly happens in PROC steps.

In R, there’s no formal separation between massage and analysis: we use similar functions for both.
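To make this concrete, here is a minimal sketch in which both the "massage" and the "analysis" are ordinary function calls. The data frame dat and its values are made up for illustration.

```r
# Made-up data, for illustration only
dat <- data.frame(
  age     = c(34, 56, 45, 23, 61, 38),
  exposed = c("yes", "no", "yes", "no", "yes", "no"),
  case    = c("yes", "no", "no", "no", "yes", "yes")
)

# "Massage": create a new variable and select rows (no separate DATA step)
dat$older  <- dat$age >= 40
older.only <- subset(dat, older)

# "Analysis": cross-tabulate exposure by case status
tab <- table(dat$exposed, dat$case)
tab
```

Each step is a function applied to an object, and each result is simply another object that the next step can use.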


Programming and Analyzing: Getting abstract for a moment...

Most data massage and analysis procedures actually have similar components:

1 Three classes of thing you tell the statistical program:
    1 What type of operation to do
    2 How specifically to do it
    3 A dataset to do it on

2 Two types of thing happen when you run the code:
    1 Changes get made to the data
    2 Output or results are returned

Let’s look at how this plays out in SAS, SPSS, and R...


Programming and Analyzing: Function input (SAS example)

In SAS:

1 The type of thing to do is the DATA or PROC XYZ statement.

2 The dataset to use is specified with data=XYZ for a PROC step and set XYZ; for a data step.

3 And everything else is how specifically to do it.

For example, consider the following SAS statement:

PROC FREQ DATA=XYZ; table X*Y/missing; RUN;

FREQ is the type of thing.
XYZ is the dataset (and X*Y specifies the columns).
/missing is how specifically to do the FREQ.


Programming and Analyzing: Function input (SPSS example)

In SPSS:

1 The type of thing to do is the statement type.

2 The dataset is implicit based on a previous DATA statement.

3 And everything else is how specifically to do it.

For example, consider the following SPSS statement:

crosstabs

/tables X by Y

/missing=report

crosstabs is the type of thing.
The current dataset is the dataset (and X by Y specifies the columns).
/missing=report indicates how specifically to do the crosstab.


Programming and Analyzing: Function input in R

In R:

1 The type of thing to do is the function name.

2 The dataset and how specifically to do it are both parameters to the function.

For example, consider the following R statement:

table(XYZ$X, XYZ$Y, useNA="no")

table is the type of thing.
XYZ$X and XYZ$Y are the data.
useNA="no" is how specifically to handle missing data (here, leave missing values out, which is also the default).
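As a small sketch of how a parameter changes what table() does with missing values, the example below uses its useNA argument on two made-up vectors (x and y are invented for illustration).

```r
# Made-up vectors with one missing value each
x <- c("a", "b", NA, "a", "b")
y <- c("u", "v", "u", NA, "v")

table(x, y)                    # default: pairs containing an NA are left out
table(x, y, useNA = "ifany")   # NA shown as its own row and column
```

The function name stays the same; only the parameter that says "how specifically to do it" changes.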


Programming and Analyzing: Function output in R

I claimed that analytic steps have up to two kinds of effects:

1 Changes made to the data
2 Output or results returned

In SAS and SPSS:

1 There is an output window that displays the results of an analytic procedure.

2 Some procedures change data and others do not. The programmer knows which procedures modify the data.

In R:

1 An analytic function typically returns an object whose default display is the result of interest.

2 If the programmer wants data modified by the procedure, she or he usually works with the return value of the function in the next programming step.


Programming and Analyzing: Function output in R

Consider the R statement

table(XYZ$X, XYZ$Y, useNA="no")

This is a function that returns an object (a 2x2 matrix, in this case) whose default visualization looks like a 2x2 table.

If you want a chi-square test on that 2x2 table, you can use the output from the table function as the input to the chisq.test function as follows:

chisq.test(table(XYZ$X, XYZ$Y, useNA="no"))

Using return values rather than side effects is characteristic of a functional programming model of language design.

Don’t worry

This may seem complicated or abstract, but it will become clearer after using R more.
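One way to see the "return value" idea is to store a result instead of printing it, then pull pieces out of it. The counts below are made up for illustration.

```r
# Made-up 2x2 counts, for illustration only
tab <- matrix(c(20, 10, 15, 25), nrow = 2,
              dimnames = list(exposure = c("yes", "no"),
                              outcome  = c("case", "control")))

result <- chisq.test(tab)   # nothing is printed; the result is stored
class(result)               # "htest"
names(result)               # components such as "statistic" and "p.value"
result$p.value              # one piece, ready for the next step
```

Typing result on its own shows the familiar printed test output; that display is just the default view of the stored object.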


data


the cbind() function: Combining vectors into matrices

weight <- c(134, 156, 222)

height <- c(60, 63, 72)

bmi <- (weight*703)/height^2

cbind(weight, height, bmi)

weight height bmi

[1,] 134 60 26.16722

[2,] 156 63 27.63114

[3,] 222 72 30.10532


getting your own data into R: "there’s a function for that"

read.table() (/read.csv/read.fwf) is how you get data into base R

but RStudio’s Import Dataset can generate the code for you...

cars<-read.table(

"http://www.columbia.edu/~sjm2186/SER2014/cars.txt",

header=TRUE, stringsAsFactors=FALSE)

str(cars)

We will revisit this...
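If the course URL is not reachable, the same function can be tried on text typed inline. This is a sketch with made-up values; it relies on the text argument of read.table() in place of a file name.

```r
# A small made-up data set typed inline instead of read from a file
txt <- "id age sex
1 34 male
2 56 female
3 45 female"

dat <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)
str(dat)    # check what was read: column names and types
```

str() is a quick check that the header row was picked up and that each column has the type you expect.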


packages


Packages contain code that enables extra functionality in R
Analogous to a SAS file containing several macros

install.packages("epitools")

library(epitools)

epitab(c(10, 20, 30, 40))

We will revisit these as well...


help


getting help

R has a lot of built-in help mechanisms...

help() opens help page

apropos() displays all objects matching topic

library(help=packageName) help on a specific package

vignette(package="packageName") lists the vignettes in a package

help(sample) ; ?sample ; ??sample

apropos("sam")

library(help=epitools)

vignette(package="utils")

vignette("Sweave")


...but web resources can be even more helpful:

tutorial: http://www.ats.ucla.edu/stat/r/

search: http://www.r-project.org/search.html

books: Venables, Aragon, etc.

Two major online communities:

R mailing list archive: http://r.789695.n4.nabble.com/

Stack Overflow: http://stackoverflow.com/questions/tagged/r


objects


5 important objects

objects are "specialized data structures"

1 vector - collection of like elements (numbers, characters...)

2 matrix - 2-dimensional vector

3 array - >2-dimensional vector

4 list - collection of elements of any kind, including groups of like elements

5 dataframe - tabular data set, each row a record, each column a (like) element or variable


objects for epidemiologists

matrix for contingency, e.g. 2x2, tables

arrays for stratified tables

dataframe for observations and variables

factors for categorical variables

numeric representation of characters
read.table converts characters to factors (the default before R 4.0.0)


examples of R objects

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)

y <- matrix(x, nrow = 2)

z <- array(x, dim = c(2, 3, 2))

mylist <- list(x, y, z)

names <- c("alice", "bob", "charlie")

gender <- c("girl", "boy", "boy")

age <- c(28, 22, 34)

race <- factor(c("Asian", "Asian", "Black"),

levels=c("Asian", "Black", "White"))

data<- data.frame(names, gender, age, race)


objects about objects


modes: atomic vs recursive objects

mode - type of data: numeric, character, logical (a factor has mode numeric and class factor)

age <- c(34, 20); mode(age)

lt25 <- age<25

lt25

mode(lt25)

atomic - only one mode
    character, numeric, or logical, i.e. vectors, matrices, arrays
    logical (1 for TRUE, 0 for FALSE)
    categorical - stored as factors: integer codes with character labels

recursive - more than one mode
    lists, data frames, functions
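A quick way to check these ideas is to ask R directly. The sketch below uses small made-up vectors; note that a factor reports its mode as "numeric" and shows up as "factor" only through class().

```r
age  <- c(34, 20)
lt25 <- age < 25
race <- factor(c("Asian", "Asian", "Black"))

mode(age);  class(age)          # "numeric"  "numeric"
mode(lt25); class(lt25)         # "logical"  "logical"
mode(race); class(race)         # "numeric"  "factor"

is.atomic(age)                  # TRUE: one mode throughout
is.recursive(list(age, lt25))   # TRUE: a list can hold different modes
```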


atomic objects: all elements the same

vector: one dimension

y <- c("Tom", "Dick", "Harry") ; y #character

x <- c(1, 2, 3, 4, 5) ; x #numeric

z <- x<3 ; z #logical

matrix: two-dimensional vector

x <- c("a", "b", "c", "d")

y <- matrix(x, 2, 2) ; y

array: n-dimensional vector

x <- 1:8

y <- array(x, dim=c(2, 2, 2)) ; y


recursive objects: elements can differ

list: collections of data

x <- c(1, 2, 3)

y <- c("Male", "Female", "Male")

z <- matrix(1:4, 2, 2)

xyz <- list(x, y, z)

dataframe: tabular (2-dimensional) list

subjno <- c(1, 2, 3, 4) ; age <- c(34, 56, 45, 23)

sex <- c("Male", "Male", "Female", "Male")

case <- c("Yes", "No", "No", "Yes")

mydat <- data.frame(subjno, age, sex, case) ; mydat


coercion: changing an object’s mode

this is important
R will automatically coerce all the elements in an atomic object to a single mode (character > numeric > logical)

c("hello", 4.56, FALSE)

c(4.56, FALSE)
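For reference, here is what those two calls return, plus one more line (an addition for illustration) showing logicals turning into numbers in arithmetic.

```r
c("hello", 4.56, FALSE)   # "hello" "4.56" "FALSE": everything becomes character
c(4.56, FALSE)            # 4.56 0.00: FALSE becomes the number 0
c(TRUE, FALSE) + 0        # 1 0: logicals are treated as 1 and 0
```

The "winning" mode is always the most general one present, following the order character > numeric > logical.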


coercing objects: do it yourself

is.xxx / as.xxx - to assess / coerce objects
xxx = vector, matrix, array, list, data.frame, function, character, numeric, factor, na etc...

is.matrix(1:3) # false

as.matrix(1:3)

is.matrix(as.matrix(1:3)) # true

# coercing factor to character

sex <- factor(c("M", "M", "M", "M", "F", "F", "F", "F"))

sex

unclass(sex) #does not coerce into character

as.character(sex) #works
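A related pitfall, sketched here with a made-up dose variable: when a factor's labels look like numbers, as.numeric() returns the internal codes, not the labels. Converting through character first recovers the values.

```r
dose <- factor(c("10", "20", "10", "40"))

as.numeric(dose)                  # 1 2 1 3: the internal level codes
as.numeric(as.character(dose))    # 10 20 10 40: the labels as numbers
```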


Review: basic characteristics of R objects

Objects - vector, matrix, array, list, dataframe

mode() - "type" of object: numeric, character, logical (factors are stored with mode numeric)

vectors and matrices - atomic, one mode only
lists and data frames - recursive, can be of >1 mode

class() - for simple vectors, usually the same as mode

more complex objects, such as arrays and data frames, have their own class
    affects how they are printed, plotted and otherwise handled


objects vector


vectors are 1-dimensional strings of like elements

the basic building block of data in R

use them for quick data entry


fun with vectors

y<-1:5 #create a vector of consecutive integers

y+2 #scalar addition

2*y #scalar multiplication

x<-c(1,3,2,10,5)

cumsum(x)


more fun with vectors: vectorized arithmetic

c(1,2,3,4)/2

c(1,2,3,4)/c(4,3,2,1)

log(c(0.1,1,10,100), 10)

c(1,2,3,4) + c(4,3)

c(1,2,3,4) + c(4,3,2)
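The last two lines illustrate recycling of the shorter vector. Annotated with what R returns:

```r
c(1, 2, 3, 4) + c(4, 3)      # 5 5 7 7: c(4, 3) is reused twice
c(1, 2, 3, 4) + c(4, 3, 2)   # 5 5 5 8, plus a warning that the lengths do not divide evenly
```

In the second case R still returns a result, so it is worth reading warnings rather than assuming the arithmetic did what you meant.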


creating numerical vectors: sequences

the sequence operator :

-9:8

seq() greater flexibility

seq(1, 5, by = 0.5) # specify interval

seq(1, 5, length.out = 8) # specify length
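A short sketch of how the two forms of seq() differ, with rep() added as a related helper for repeated values (the labels "case" and "control" are just example values).

```r
s1 <- seq(1, 5, by = 0.5)
length(s1)                 # 9 values: 1.0, 1.5, ..., 5.0

s2 <- seq(1, 5, length.out = 8)
diff(s2)                   # 7 equal steps of 4/7 each

rep(c("case", "control"), times = 3)   # repeat a pattern
```

With by you fix the step and R works out the length; with length.out you fix the length and R works out the step.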


operations on vectors

x <- rnorm(100)

sum(x)

x <- rep(2, 10)

cumsum(x)

mean(x)

sum(x)/length(x)

var(x) #sample variance

sd(x)

sqrt(var(x)) #sample standard deviation

x <- rnorm(100)

y <- rnorm(100)

var(x, y) # covariance


logical vectors: the special vector...

series of TRUEs and FALSEs (Ts and Fs)

created with relational operators:

<, >, <=, >=, ==, !=

used to index, select and subset data


about logical vectors

logical operators are the key to indexing, and indexing is the key to manipulating data

< <= > >= == !=

x<-1:26

temp<- x > 13 #logical vector temp

#same length as vector x

#TRUE= 1, when condition met

#FALSE = 0, when not met

sum(temp)

We will revisit logical vectors when we discuss indexing (coming soon...)
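As a brief preview of that idea, a logical vector placed inside square brackets keeps only the elements where it is TRUE. This sketch reuses x and temp from the example and adds R's built-in letters vector for illustration.

```r
x <- 1:26
temp <- x > 13

sum(temp)        # 13: how many values exceed 13
mean(temp)       # 0.5: the proportion meeting the condition
x[temp]          # the values themselves, 14 through 26
letters[temp]    # the same logical vector can index another vector
```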


objects matrix & array


a matrix is a 2-dimensional vector: 2x2 and contingency tables

Option 1: define the matrix from raw data:

myMatrix<-matrix(c("a","b","c","d"),2,2)

myMatrix

myMatrix2<-matrix(c("a","b","c","d"),2,2, byrow=TRUE)

myMatrix2

colnames(myMatrix2)<-c("case", "control")

rownames(myMatrix2)<-c("exposed", "unexposed")

myMatrix2
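The same layout with numbers gives a working 2x2 table. The counts below are made up, and the cross-product odds ratio is included only as a simple illustration of using row and column names.

```r
# Made-up counts, filled row by row
tab <- matrix(c(30, 70, 10, 90), 2, 2, byrow = TRUE)
colnames(tab) <- c("case", "control")
rownames(tab) <- c("exposed", "unexposed")
tab

# Cross-product odds ratio: (a*d)/(b*c)
or <- (tab["exposed", "case"]    * tab["unexposed", "control"]) /
      (tab["exposed", "control"] * tab["unexposed", "case"])
or
```

Indexing by name rather than by position makes it harder to mix up cells when the table is laid out differently.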


cbind and rbind

Option 2: define the data by binding vectors together:

names<-c("Alice", "Bob", "Charlie")

ages<-c(6,7,8)

names

ages

cbind(names, ages)

rbind(names, ages)


caution: recycling

cbind and rbind - will recycle data

when performing vector or mixed vector and array arithmetic, short vectors are extended by recycling until they match the size of the other operands

R may issue a warning message, but will still complete the operation
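A small sketch of recycling in cbind(), using made-up vectors. suppressWarnings() is used in the second call only so the example runs quietly; without it R prints a warning and still builds the matrix.

```r
cbind(id = 1:4, group = c(1, 2))   # group is recycled to 1 2 1 2, with no message

# Lengths 4 and 3 do not divide evenly: R warns, then recycles anyway
m <- suppressWarnings(cbind(1:4, 1:3))
m                                  # second column is 1 2 3 1
```

The silent case is the more dangerous one, so it helps to check lengths before binding.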


creating a matrix from data: The more common scenario...

table() - from characters

titanic<-read.csv(

"http://www.columbia.edu/~sjm2186/SER2014/titanic.csv",

stringsAsFactors=FALSE) #load titanic data

str(titanic) # Check the structure

table(titanic$sex,titanic$survived) # Make the matrix


an array is an n-dimensional vector: stratified epi tables

stratified titanic survival table:

sex vs. survival vs. passenger class

table(titanic$sex,titanic$survived, titanic$pclass)
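To see the structure of a three-way table without downloading the file, here is a sketch on a tiny made-up data frame d that stands in for the titanic data (the column names mirror those used above; the values are invented).

```r
# Made-up data standing in for the titanic file
d <- data.frame(
  sex      = c("female", "male", "male", "female", "male", "female"),
  survived = c(1, 0, 0, 1, 1, 0),
  pclass   = c(1, 1, 2, 2, 3, 3)
)

strat <- table(d$sex, d$survived, d$pclass)
dim(strat)        # 2 2 3: sex by survival by class
strat[, , "1"]    # the 2x2 table for first class only
```

The third index selects a stratum, so each slice is an ordinary 2x2 table.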


objects list


a list is a recursive collection of unlike elements
like epi "variables" and "observations"

often used to ”store” function results

str() is your friend

also, see stackoverflow discussion

x <- 1:5 ; y <- matrix(c("a","c","b","d"), 2,2)

z <- c("Peter", "Paul", "Mary")

mm <- list(x, y, z)

mm

str(mm)
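List elements can also be named, which makes them easier to pull out again. This sketch rebuilds the same list with made-up names (ids, tab, who) that are not in the original example.

```r
mm <- list(ids = 1:5,
           tab = matrix(c("a", "c", "b", "d"), 2, 2),
           who = c("Peter", "Paul", "Mary"))

names(mm)
mm$who             # extract one element by name
mm[["ids"]]        # the same idea with double brackets
mm[[2]][1, 2]      # row 1, column 2 of the matrix: "b"
class(mm["who"])   # single brackets return a (shorter) list
```

This is the same pattern you use on function results: str() to see what is inside, then $ or [[ ]] to extract a piece.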


objects dataframe


dataframes: tabular epi data sets

2-dimensional tabular lists with equal-length fields
each row is a record or observation
each column is a field or variable (usually numeric vectors or factors)

data(infert)

str(infert)

head(infert)

”a list that behaves like a matrix”
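A short sketch of what that phrase means in practice, using the built-in infert data: the same column can be reached list-style or matrix-style.

```r
data(infert)

infert$age[1:5]         # list-style: pick a column by name
infert[["age"]][1:5]    # also list-style
infert[1:5, "age"]      # matrix-style: rows, then columns

dim(infert)             # rows and columns, as for a matrix
length(infert)          # number of columns, as for a list
```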


creating data frames

1 data.frame()

x <- data.frame(id=1:2, sex=c("M","F"))

2 read.table(), read.csv(), read.delim(), read.fwf()

titanic<-read.csv(

"http://www.columbia.edu/~sjm2186/SER2014/titanic.csv",

stringsAsFactors=FALSE) #load titanic data

str(titanic)

(caution: default char → factor, numeric → integer)


Exercises

You should now be able to complete exercises 1 and 2 in http://www.columbia.edu/~sjm2186/SER2014/Exercises.pdf
