R Basics - staff.pubhealth.ku.dkstaff.pubhealth.ku.dk/~pd/mixed-jan.2006/Basics.pdf · Using R R Language Basics Dealing with the workspace Reading Data Data Manipulation Concatenation

Using R R Language Basics Dealing with the workspace Reading Data Data Manipulation

R Basics

Peter Dalgaard

Department of Biostatistics, University of Copenhagen

Mixed Models in R, Copenhagen, January 2006

R Basics Department of Biostatistics University of Copenhagen


Outline

Using R

R Language Basics

Dealing with the workspace

Reading Data

Data Manipulation


Basics of R

- What is R?
- Interacting with R
- Extended user interfaces
- Later: Dealing with R’s workspace


Key Points about R

- Environment built around the programming language R (an Open Source dialect of the S language).
- R is Free Software, and runs on a variety of platforms (I’ll be using Linux. Computer labs run on Windows.)
- Command-line execution based on function calls
- Extensible with user functions
- Workspace containing data and functions
- Graphics devices


R packages

- Collections of R functions, data, and compiled code
- Well-defined format that ensures easy installation and a basic standard of documentation, and enhances portability and reliability


Interacting with R

- Command line interface (CLI)
- The basic mode of interaction is “read – evaluate – print”
  - User types an expression at the command line,
  - R evaluates it
  - . . . and prints the result
- Batch variation: read commands from a file


Extended Interfaces

- Windows, Macintosh GUI: Fairly simple extensions of the CLI; they mostly offload some tasks to a menu interface and add command recall.
- Script editing: The ability to work with multiple lines of R code, save them to a file for later use, etc. A simple script editor is built into the R GUI in recent versions.
- External editor interfaces: TINN-R and R-WinEdt add syntax highlighting. Highly recommended.
- R embedded in a text editor (ESS – Emacs Speaks Statistics). Popular on Unix/Linux systems.


Demo 1

    2+2
    log(10)
    help(log)
    summary(airquality)
    demo(graphics)   # pretty pictures...


Basic Vector Types

- R is a vector based language, data types include
  - Numeric (integer/double) vectors
  - Character (strings) vectors
  - Logical vectors
- These types are combined and extended to form more complex objects
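
As a minimal sketch of the basic vector types (the variable names and values are invented for illustration), `typeof()` reports the storage type of each:

```r
num <- c(1.5, 2, 3)          # numeric, stored as double
int <- 1:3                   # numeric, stored as integer
chr <- c("a", "b", "c")      # character (strings)
lgl <- c(TRUE, FALSE, NA)    # logical; NA marks a missing value

typeof(num)
typeof(int)
typeof(chr)
typeof(lgl)
```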


Basic operations

- Standard arithmetic is vectorized: x + y adds each element of x to the corresponding element of y
- c (concatenation)
- seq or from:to (sequences)
- rep, gl (replication)
- sum, mean, range, math functions, . . .
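
A small sketch of vectorized arithmetic and a couple of the summary functions listed above; the vectors are made up for illustration:

```r
# Two numeric vectors of equal length (made-up values)
x <- c(1, 2, 3)
y <- c(10, 20, 30)

x + y            # element-wise sum: 11 22 33
x * 2            # the single value 2 is recycled across x: 2 4 6

s <- sum(x)      # 6
r <- range(y)    # smallest and largest value: 10 30
```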


Demo 2

    x <- round(rnorm(10,mean=20,sd=5))   # simulate data
    x
    mean(x)
    m <- mean(x)
    m
    x - m                                # notice recycling
    (x - m)^2
    sum((x - m)^2)
    sqrt(sum((x - m)^2)/9)
    sd(x)


Concatenation

- The c() function joins a number of vectors end to end
- An important special case is where the vectors are just numbers

    > c(7,9,13)
    [1] 7 9 13

- This is used all over the place where you need to pass a short vector to a function call (e.g., ylim=c(0,100) in a plot).


Sequences

- The seq() function generates regular sequences
- The arguments are from, to, by
- If by is 1 (default), you can also use from:to

    > seq(1,9,2)
    [1] 1 3 5 7 9
    > 1:5
    [1] 1 2 3 4 5


Replication

- The rep() function generates a vector by replicating elements from another vector a given number of times
- The arguments are x, times, but there are alternate forms, notably the each argument

    > x <- c(7,9,13)
    > rep(x,2)
    [1] 7 9 13 7 9 13
    > rep(x,each=2)
    [1] 7 7 9 9 13 13
    > rep(x, 1:3)
    [1] 7 9 9 13 13 13


Generating regular designs

- The gl() function “generates levels” for a regular design
- The arguments are
  - Number of levels
  - Block size
  - Total length
- Notice that this generates a factor (more about this later)

    > gl(2,3,12) # 2 levels, blocks of 3, total length 12
    [1] 1 1 1 2 2 2 1 1 1 2 2 2
    Levels: 1 2


Classed Objects

- In R objects can have classes
- These are used as the basis for function dispatch
- I.e. the same (generic) function can have different methods for different classes
- Print methods are a prototypical example
- There are two object systems, based (roughly) on S version 3 and version 4. I will not go into details.
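
To make the dispatch idea concrete, here is a minimal sketch using the S3-style system. The class name "temperature" and its fields are hypothetical, invented only for this illustration:

```r
# Hypothetical class "temperature", invented for this illustration
temp <- structure(list(value = 21.5, unit = "C"), class = "temperature")

# A print method for that class; print() itself is the generic function
print.temperature <- function(x, ...) {
  cat("Temperature:", x$value, x$unit, "\n")
  invisible(x)
}

print(temp)    # dispatches to print.temperature
class(temp)    # "temperature"
```

Typing just `temp` at the command line uses the same method, since auto-printing goes through `print()`.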


Factors

- Factors are used to describe groupings (the term originates from factorial designs)
- Basically, these are just integer codes plus a set of names for the levels
- They have class "factor" making them (a) print nicely and (b) maintain consistency
- A factor can also be ordered (class "ordered"), signifying that there is a natural sort order on the levels
- In model specifications, factors play a fundamental role by indicating that a variable should be treated as a classification rather than as a quantitative variable (similar to a CLASS statement in SAS)
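
The "integer codes plus level names" description can be inspected directly. A short sketch with made-up grouping values:

```r
# Made-up grouping variable
grp <- factor(c("low", "high", "medium", "low"))

levels(grp)     # level names, sorted by default: "high" "low" "medium"
unclass(grp)    # underlying integer codes 2 1 3 2, plus a levels attribute
class(grp)      # "factor"

# An ordered factor with an explicit, natural level order
dose <- factor(c("low", "high", "medium", "low"),
               levels = c("low", "medium", "high"), ordered = TRUE)
class(dose)     # "ordered" "factor"
dose < "high"   # comparisons respect the level order
```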


Creating factors

- Factors can be created during read (but not always correctly)
- The factor function is used when, e.g., groups have been read as numeric codes

    > sexnr <- c(0,0,1,1,0,1)
    > (sex <- factor(sexnr, levels=c(1,0),
    +                labels=c("male", "female")))
    [1] female female male   male   female male
    Levels: male female

- Notice the slightly confusing use of levels and labels arguments.
  - levels are the value codes on input
  - labels are the value codes on output (and become the levels of the resulting factor)


Indexing

- R has several useful indexing mechanisms:
  - a[5]       single element
  - a[5:7]     several elements
  - a[-6]      all except the 6th
  - a[b>200]   logical index
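
The four indexing forms above, tried on two made-up vectors a and b of equal length:

```r
# Made-up vectors; b is only used to build a logical index into a
a <- c(10, 20, 30, 40, 50, 60, 70, 80)
b <- c(150, 250, 100, 300, 50, 220, 90, 400)

a[5]          # 50
a[5:7]        # 50 60 70
a[-6]         # everything except the 6th element
a[b > 200]    # elements of a where b exceeds 200: 20 40 60 80
```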


Lists

- A vector where the elements can have different types
- Functions often return lists
- lst <- list(A=rnorm(5), B="hello")
- Special indexing:
  - lst$A
  - lst[[1]] first element
  - (lst[1] list containing the first element)
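
The difference between double and single brackets is easiest to see by checking what each returns. A brief sketch using the list from the slide:

```r
lst <- list(A = rnorm(5), B = "hello")

lst$A             # the numeric vector stored under the name A
lst[["B"]]        # double brackets also accept a name: "hello"
lst[[1]]          # the first element itself (a numeric vector)
lst[1]            # a list of length 1 containing the first element

class(lst[[1]])   # "numeric"
class(lst[1])     # "list"
```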


Matrices/Tables/Arrays

- Used in matrix calculus and as input to, e.g., chisq.test(). Results of tabulation.
- Matrices: Generate with matrix
- Indexing methods are like [i,j], [i,], [,j]
- Leaving out the row index gives all rows, etc.
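
A minimal sketch of creating and indexing a matrix, followed by a 2 x 2 table passed to chisq.test(). The counts are made up purely for illustration:

```r
# 2 x 3 matrix, filled column by column (the default)
m <- matrix(1:6, nrow = 2)

m[2, 3]    # single element: 6
m[1, ]     # first row, all columns: 1 3 5
m[, 2]     # second column, all rows: 3 4

# A 2 x 2 table of made-up counts as input to chisq.test()
tab <- matrix(c(12, 5, 7, 9), nrow = 2)
res <- chisq.test(tab)
res$p.value
```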


Data frames

- Like a data set in other packages
- Technically: Lists of vectors/factors of same length
- Indexed like matrices (Beware, though: Data frames are not matrices) or as lists
- Generate from read operation or with data.frame
- Many sample data frames are available using data()
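
A small sketch of building a data frame by hand and indexing it both ways. The column names and values are invented for illustration:

```r
# Made-up data; the column names are purely illustrative
d <- data.frame(id = 1:4,
                sex = factor(c("F", "M", "F", "M")),
                height = c(165, 180, 158, 175))

d[2, "height"]       # matrix-style indexing: 180
d[d$sex == "F", ]    # rows for one group, all columns
d$height             # list-style access to a column
d[["height"]]        # equivalent to d$height

is.list(d)           # TRUE
is.matrix(d)         # FALSE: data frames are not matrices
```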


Demo 3

    data(airquality)
    airquality[1:10,]
    airquality$Month
    airquality[airquality$Month==5,]
    oz <- airquality[airquality$Month==5,]$Ozone
    mean(oz)
    mean(oz, na.rm=TRUE)


The workspace

- The global environment contains R objects created on the command line.
- There is an additional search path of loaded packages and attached data frames.
- When you request an object by name, R looks first in the global environment, and if it doesn’t find it there, it continues along the search path.
- The search path is maintained by library(), attach(), and detach()
- Notice that objects in the global environment may mask objects in packages and attached data frames
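
Masking can be demonstrated without any extra packages. In this sketch, assigning to the name pi is an arbitrary choice made only to show the lookup order:

```r
search()     # ".GlobalEnv" comes first, then attached packages

# Defining an object called pi in the workspace masks the one in base
pi <- 3
pi           # 3: found in the workspace before the base package
base::pi     # the original value is still available in base

rm(pi)       # remove the masking object
pi           # the base value is found again
```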


Demo 4

    attach(airquality)
    mean(Ozone, na.rm=TRUE)
    tapply(Ozone, Month, mean, na.rm=TRUE)
    detach()
    search()
    library(ISwR)
    data(intake)    # From ISwR
    ls()
    attach(intake)
    search()
    ls("intake")    # show variables in data frame
    post - pre
    rm(intake)      # remove data frame
    detach()        # remove from search path


A Common Mistake

    attach(mydata)
    sex <- factor(sex)
    tapply(height, sex, mean)
    detach()
    attach(subset(mydata, age > 25))
    sex <- factor(sex)
    tapply(height, sex, mean)

You get an error saying that height and sex are of different length. What went wrong?
The second time around, sex was found in the global environment before the attached data frame.
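
One way to sidestep the problem is to avoid attach() and evaluate expressions inside the data frame with with(). This is a sketch under assumptions: the data frame below is a made-up stand-in for mydata, and this is not necessarily the remedy given in the course.

```r
# Hypothetical data standing in for mydata on the slide
mydata <- data.frame(sex = c(1, 2, 1, 2, 1, 2),
                     age = c(20, 30, 40, 22, 35, 28),
                     height = c(170, 165, 180, 160, 175, 168))

# with() evaluates the expression inside the data frame, so no global
# copy of sex is created and nothing needs to be attached or detached
all.means <- with(mydata, tapply(height, factor(sex), mean))

older <- subset(mydata, age > 25)
older.means <- with(older, tapply(height, factor(sex), mean))
older.means
```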


Getting Organized

Several possibilities:
- Save/restore entire workspace (objects only)
- Save selected objects and load them
- source() script files
- Batch processing (R CMD BATCH file.R)
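
A minimal sketch of saving selected objects and sourcing a script. The file names are hypothetical and placed in a temporary directory so that nothing in the working directory is touched:

```r
x <- rnorm(10)
x.summary <- summary(x)

# Save selected objects to a file and load them back later
f <- file.path(tempdir(), "results.RData")     # hypothetical file name
save(x, x.summary, file = f)
rm(x, x.summary)
load(f)                                        # restores x and x.summary

# source() runs the commands in a script file
script <- file.path(tempdir(), "analysis.R")   # hypothetical script
writeLines(c("y <- 1:3", "print(mean(y))"), script)
source(script)
```

save.image() is the counterpart that saves the entire workspace.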


Reading Data, Overview

- Simple data vectors can be read using scan()
- Data frames can be read from most reasonably structured text file formats (space separated columns, tab- and comma-delimited files) using read.table() or read.delim().
- The foreign package can read files from Stata, SAS export libraries, SPSS, Epi-Info, Minitab, and some S-PLUS versions.
- For spreadsheets and databases, the quick and easy way is to export to a delimited file, but you can work via ODBC connections and database access packages
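
A brief sketch of the first two routes. The values and the temporary file are made up; read.csv() is not named on the slide but is a standard wrapper around read.table() for comma-delimited files:

```r
# scan() reads a simple vector; here from a string rather than a file
v <- scan(text = "3.1 4.1 5.9 2.6")

# A comma-delimited file written to a temporary location (made-up contents)
f <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,4.5", "2,3.8"), f)
d <- read.csv(f)
d
```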


The Simplest Way to Read Data

- This is what you’d normally want to do:
  - Have data in a plain text file
  - Columns separated by whitespace
  - Missing values coded as the string "NA"
  - Preferably have a row of variable names at the top
- Use d <- read.table("myfile", header=TRUE)
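
To try this without an existing data file, the sketch below first writes a small file in exactly that layout. The file contents and column names are invented for illustration:

```r
# Write a small whitespace-separated file to a temporary location
myfile <- tempfile(fileext = ".txt")
writeLines(c("id weight height",
             "1 70.5 180",
             "2 NA 165",
             "3 82.0 NA"), myfile)

d <- read.table(myfile, header = TRUE)
d
str(d)    # the "NA" strings have become missing values
```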


Demo 5

    dir <- system.file("data", package="ISwR")
    fname <- file.path(dir, "thuesen.txt")
    fname
    file.show(fname)
    read.table(fname, header=TRUE)

(Notice the use of portable constructs to find the data directory inside a package and the construction of the full pathname.)


Options and Details

- read.table has quite a few options and details
  - Different codings of missing values (na.strings)
  - Different decimal separators (dec argument)
  - Text strings can be quoted if they contain embedded blanks
  - You may skip lines, read a limited number of lines, and more. Please consult the manual page for details.
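
A sketch that combines several of these options on a made-up file: one leading line to skip, semicolons as separators, decimal commas, "." as the missing-value code, and quoted strings containing blanks:

```r
# Made-up file contents, written to a temporary location
f <- tempfile(fileext = ".txt")
writeLines(c("Exported from a spreadsheet",
             "name;dose;response",
             "\"drug A\";1,5;10,2",
             "\"drug B\";2,5;.",
             "\"drug C\";3,5;12,9"), f)

d <- read.table(f, header = TRUE, sep = ";", dec = ",",
                na.strings = ".",   # "." marks a missing value
                skip = 1,           # ignore the first line
                nrows = 2)          # read only two data rows
d
```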


Data Manipulation Functions

- Single-column modifications
- Modifying and subsetting data frames


Constructors

- R deals with many kinds of objects besides data sets
- Need to have ways of constructing them from the command line
- We have seen the c and list functions
- Extracting and setting names with names(x)
- For matrices and arrays, use the (surprise) matrix and array functions. data.frame for data frames.
- It is also fairly common to construct a matrix from its columns using cbind
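
The constructors mentioned above in one short sketch; all values and variable names are made up:

```r
x <- c(a = 1, b = 2, c = 3)
names(x)                          # extract names: "a" "b" "c"
names(x) <- c("u", "v", "w")      # set names

m <- matrix(1:6, nrow = 2, ncol = 3)
arr <- array(1:24, dim = c(2, 3, 4))
dim(arr)                          # 2 3 4

# Build a matrix from its columns; variable names become column names
height <- c(1.70, 1.82, 1.65)
weight <- c(65, 90, 58)
hw <- cbind(height, weight)
dim(hw)                           # 3 2

d <- data.frame(height, weight)   # the same columns as a data frame
```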


The cut Function

- The cut function converts a numerical variable into groups according to a set of break points
- Notice that the number of breaks is one more than the number of intervals
- Notice also that the intervals are left-open, right-closed by default (right=FALSE changes that)
- . . . and that the lowest endpoint is not included by default (set include.lowest=TRUE if it bothers you)
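
The endpoint rules are easiest to see on a small example. The ages and break points below are made up for illustration:

```r
age <- c(0, 5, 10, 17, 18, 40, 65)    # made-up ages
brk <- c(0, 18, 65)                   # three break points, two intervals

g1 <- cut(age, breaks = brk)
g1    # intervals (0,18] and (18,65]; the value 0 falls outside and becomes NA

g2 <- cut(age, breaks = brk, include.lowest = TRUE)
g2    # first interval is now [0,18], so 0 is included

g3 <- cut(age, breaks = brk, right = FALSE)
g3    # intervals [0,18) and [18,65); now 65 becomes NA
```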


Modifying and Subsetting Data Frames

- The syntax for indexing data frames gets awkward:

    airquality[airquality$Month == 5 &
               airquality$Ozone > 50,]

- The subset function allows you to say subset(airquality, Month == 5 & Ozone > 50). I.e., it evaluates the second argument within the data frame.
- The transform function is similar. It allows you to define new variables or modify old ones using code like

    juulnew <- transform(juul,
        sex=factor(sex, labels=c("M","F")),
        tanner=factor(tanner))
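
A runnable sketch on the built-in airquality data, so that no extra package is needed. The new column name TempC is invented here. Besides being shorter, subset() also drops rows where the condition evaluates to NA, whereas explicit indexing returns all-NA rows for them:

```r
# Explicit indexing: rows where the condition is NA come back as NA rows
a1 <- airquality[airquality$Month == 5 & airquality$Ozone > 50, ]

# subset() evaluates the condition inside the data frame and keeps only
# rows where it is TRUE
a2 <- subset(airquality, Month == 5 & Ozone > 50)

nrow(a1)
nrow(a2)

# transform() modifies Month and adds a new column in one call
aq <- transform(airquality,
                Month = factor(Month),
                TempC = (Temp - 32) / 1.8)   # TempC: hypothetical new variable
head(aq)
```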
