An Introduction to R: A Language and Environment for Statistical Computing¹

Prof. Kevin E. Thorpe
Dept. of Public Health Sciences
University of Toronto

February 6, 2007

¹ With thanks to Sophia Lee, PhD Candidate, for preparing the examples used in the Summer Workshop on Modern Applied Methods in Biostatistics (Aug. 2006) that formed the starting point for this paper.



Contents

1 Introduction
    1.1 Starting R
    1.2 Getting Help
    1.3 Quitting R

2 Data
    2.1 Introduction
    2.2 Vectors
        2.2.1 Creating Vectors
        2.2.2 Some Vector Arithmetic
        2.2.3 Vector Indexing
    2.3 Some Special Data Types
        2.3.1 Logical
        2.3.2 Character
        2.3.3 Factors
    2.4 Matrices and Linear Algebra
        2.4.1 Creating Matrices
        2.4.2 Matrix Indexing
        2.4.3 Matrix Arithmetic
        2.4.4 Some Linear Algebra
    2.5 Data Frames
        2.5.1 Creating Data Frames
        2.5.2 Extracting Data From Data Frames
        2.5.3 Merging Data Frames
    2.6 Lists
        2.6.1 Creating Lists
        2.6.2 Extracting List Elements
    2.7 Missing Data
        2.7.1 Representing Missing Data
        2.7.2 Testing for Missing Values
    2.8 Reading External Data Files
        2.8.1 Reading Rectangular Text Files
        2.8.2 Reading Fixed Width Files
        2.8.3 Other File Formats
    2.9 Random Numbers

3 Descriptive Statistics
    3.1 Introduction
    3.2 Common Descriptive Statistics for Numeric Data
    3.3 Simple Summaries of Categorical Data
    3.4 The Summary Function
    3.5 Row and Column Summaries
    3.6 Grouped Summaries
    3.7 The apply Family
        3.7.1 The apply Function
        3.7.2 The lapply and sapply Functions
        3.7.3 The tapply Function

4 The Formula

5 Graphics
    5.1 Introduction
    5.2 Traditional Graphics
        5.2.1 The Scatterplot
        5.2.2 Box-Whisker Plots
        5.2.3 Histograms and Density Plots
        5.2.4 Quantile-Quantile Plots
        5.2.5 Dot Plots
        5.2.6 Customizing Plots
    5.3 Lattice Graphics
        5.3.1 Loading the Lattice Graphics Package
        5.3.2 The Scatterplot
        5.3.3 Box-Whisker Plots
        5.3.4 Histograms and Density Plots
        5.3.5 Quantile-Quantile Plots
        5.3.6 Dot Plots
        5.3.7 Customizing Plots
    5.4 Graphics Devices


List of Tables

2.1 Other Matrix and Linear Algebra Operators and Functions
2.2 Examples of Path Specifications
2.3 Examples of Common Delimiters and how to specify them
2.4 A selection of foreign functions
2.5 A selection of random number functions
3.1 Functions for Common Descriptive Statistics
3.2 Functions in the apply family


Chapter 1

Introduction

This paper describes and introduces new users to R: A Language and Environment for Statistical Computing, or R for short. R is an open source project and runs on most current operating systems. R is an interactive language; that is, when the system is started, the user is presented with a prompt¹ awaiting keyboard input. The user then issues commands in the R language, which the system processes, presenting the results, if any, to the user. Note that R is case sensitive. This means that Abc is not the same as abc.

1.1 Starting R

The method for starting R depends on the operating system being used. For UNIX and Linux, you simply type R at a shell prompt. On Windows and Mac, R is started in the same way as other programs.

1.2 Getting Help

R has a built-in help system. The simplest way to request help on a function is with the ? operator. For example, you could obtain help on the lm function by typing ?lm at the R prompt. Alternatively, you could type help(lm) at the R prompt. Both of these will display the help file for lm, which is the function for linear regression.

The obvious difficulty with this approach is that you need to know the name of the function you wish help on. This is where help.search() is useful. Typing help.search("regression") or help.search("linear model") at the R prompt will give slightly different results, but lm will be identified in both cases.

Another useful help tool is provided by the help.start() function. This starts an HTML help system. It includes some informative reading, especially An Introduction to R. To browse for help on a particular function, you would click Packages² and then click the package that provides the function. Again, this can be a problem if you don’t know which package provides the function. This is solved by clicking on Search Engine & Keywords.

Throughout this paper, numerous functions will be introduced, but only at an elementary level. Many of them have richer functionality than will be described. It is highly recommended to read the help pages of the functions introduced. Besides learning more about their capabilities, there are often related functions described or referred to that you may wish to know about.

¹ There are actually two prompts that the user is likely to encounter. The first is “>”, which means R is waiting for the user to start a new command. The other is “+”, which appears when the return key is pressed before an R command is complete.

² More about packages later.

1.3 Quitting R

You quit an R session with the q() function. You will be asked whether or not you wish to save your workspace. More detail will be given about workspaces later; for now you may answer “yes.”


Chapter 2

Data

2.1 Introduction

One thing that distinguishes R from other statistical systems is how data are handled. In R, everything is an object that can be manipulated. Some objects are data, some are results from analysis (some of these also serve as data for other analyses), and some are functions, which are the workhorses of the system. Data manipulation in R is both a strength and an obstacle. In particular, those who have cut their teeth on other systems often find the transition to R challenging, as it requires a shift in thinking.

In this chapter I will describe some of the common data structures and data processing and computation features of R.

2.2 Vectors

The simplest data structure in R is the vector (a scalar is a vector of length 1). A vector is essentially a one-dimensional array whose elements are all of the same type (e.g. numeric, character). Arithmetic in R is also “vectorized.” The easiest way to grasp these ideas is with an example.

2.2.1 Creating Vectors

> x <- 3

> x

[1] 3

In R, the assignment operator is “<-”, produced by typing the < key followed by the - key. Thus, the preceding code assigns the value 3 to the object x. It also shows that to view the contents of an object, you simply type its name.

> x <- c(10.4, 5.6, 3.1, 6.4, 21.7, 53.5, 3.6, 2.6, 6.1, 1.7)

> x

[1] 10.4 5.6 3.1 6.4 21.7 53.5 3.6 2.6 6.1 1.7


Here we see the use of the c() function. This is the most commonly used function for creating vectors. The c() function takes an arbitrary number of arguments and, provided they are all of the same data type, combines them into a vector.
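The description of c() assumes arguments of a single type. As a minimal sketch of what happens otherwise (the object names num_chr and num_lgl are made up for illustration), R coerces mixed arguments to a common type rather than raising an error:

```r
# Mixing types in c(): the result is coerced to the most general type present.
num_chr <- c(1, "a")      # numeric + character -> character vector
num_lgl <- c(1.5, TRUE)   # numeric + logical   -> numeric (TRUE becomes 1)

class(num_chr)            # "character"
class(num_lgl)            # "numeric"
num_lgl                   # 1.5 1.0
```

This is worth keeping in mind when a vector that was expected to be numeric turns out to be character.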

> y <- 0:9

> y

[1] 0 1 2 3 4 5 6 7 8 9

This code segment introduces the : operator. It generates a sequence of numbers starting at the number to the left of the operator and proceeding to the number to the right of the operator in increments of 1.

Question: What do you think 0.5:9 will produce?

> z <- seq(0, 10, by = 2)

> z

[1] 0 2 4 6 8 10

The seq() function shown above allows for the creation of more complicated sequences. We also see an = sign for the first time. In R, the = sign is used to assign values to parameters in a function call. The seq() function takes other arguments as well; see ?seq for details.
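As a brief sketch of some of those other arguments (the object names a, b and k are chosen only for this illustration), seq() can also be asked for a fixed number of points or for a decreasing sequence:

```r
a <- seq(0, 1, length.out = 5)   # five equally spaced values from 0 to 1
b <- seq(10, 0, by = -2.5)       # a decreasing sequence with a negative step
k <- seq_len(4)                  # the integers 1, 2, 3, 4

a   # 0.00 0.25 0.50 0.75 1.00
b   # 10.0  7.5  5.0  2.5  0.0
k   # 1 2 3 4
```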

2.2.2 Some Vector Arithmetic

Arithmetic expressions can be entered in R as you would expect. R respects the usual order of operations. Here are some examples involving scalars only.

> 2 + 3

[1] 5

> 2 * 3

[1] 6

> 2^3

[1] 8

> 2 + (2 + 3) * 2 - 5

[1] 7

These types of expressions extend in a rather natural way to vectors. Consider this example.

> x

[1] 10.4 5.6 3.1 6.4 21.7 53.5 3.6 2.6 6.1 1.7

> y


[1] 0 1 2 3 4 5 6 7 8 9

> x + y

[1] 10.4 6.6 5.1 9.4 25.7 58.5 9.6 9.6 14.1 10.7

> 3 * x - y

[1] 31.2 15.8 7.3 16.2 61.1 155.5 4.8 0.8 10.3 -3.9

> x/y

 [1]        Inf  5.6000000  1.5500000  2.1333333  5.4250000 10.7000000
 [7]  0.6000000  0.3714286  0.7625000  0.1888889

Notice first of all that x and y are the same length; specifically, there are 10 elements in each. Next notice that the expressions above were computed element by element. In the case of 3*x, each element of x is multiplied by 3 first (i.e. scalar multiplication in linear algebra terminology). Finally, note that the first element of x/y is Inf. This is how an infinite value is represented in R.
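A small sketch of the related special values may help here; the names pos, neg and und are illustrative only. Division by zero gives a signed infinity, while 0/0 gives NaN ("not a number"), and is.finite() can be used to screen for either:

```r
pos <- 1 / 0     # Inf
neg <- -1 / 0    # -Inf
und <- 0 / 0     # NaN: the result is undefined rather than infinite

is.finite(c(pos, neg, und, 1))   # FALSE FALSE FALSE TRUE
is.nan(und)                      # TRUE
```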

Now suppose that the vectors are not of the same length as is the case with y and z.

> z

[1] 0 2 4 6 8 10

> y + z

[1] 0 3 6 9 12 15 6 9 12 15

The operation is still performed element-wise; however, since z is shorter than y, it runs out of elements before the elements of y are exhausted. Therefore, R recycles from the beginning of z until the operation is complete. In this case, R will also display the following warning.

Warning message:
longer object length
        is not a multiple of shorter object length in: y + z

The meaning should be clear: y contains 10 elements, which is not a multiple of 6, the number of elements in z. The warning is issued so that you can confirm the expression is what you intended.
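The following sketch contrasts the two cases; the vector w and the helper names silent and outcome are hypothetical and not part of the original examples. When the shorter length divides the longer one, recycling happens silently, which is exactly why scalar arithmetic such as 3 * x works:

```r
y <- 0:9
w <- c(1, 2)               # length 2 divides length 10 evenly
silent <- y * w            # w is recycled five times, no warning
silent                     # 0 2 2 6 4 10 6 14 8 18

z <- seq(0, 10, by = 2)    # length 6 does not divide 10
outcome <- tryCatch(
  { y + z; "no warning" },
  warning = function(cond) "warning raised"
)
outcome                    # "warning raised"
```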

A useful function for determining the number of elements in a vector is the length() function. Here is what it returns for the vectors y and z.

> length(y)

[1] 10

> length(z)

[1] 6


2.2.3 Vector Indexing

It may be the case that you wish to extract one or more elements of a vector. This example illustrates the basics.

> x

[1] 10.4 5.6 3.1 6.4 21.7 53.5 3.6 2.6 6.1 1.7

> x[2]

[1] 5.6

> x[1:3]

[1] 10.4 5.6 3.1

> x[-1]

[1] 5.6 3.1 6.4 21.7 53.5 3.6 2.6 6.1 1.7

> x[-(1:3)]

[1] 6.4 21.7 53.5 3.6 2.6 6.1 1.7

It should come as no surprise that x[2] extracts the second element of x. The next example requires a bit more explanation. Recall that 1:3 creates a vector with the elements 1, 2 and 3. Vectors themselves can be used in the subscripting operator so that x[1:3] returns the first three elements of x. The next example shows that preceding a subscript with a minus sign negates the effect; that is, it returns all elements except the indicated ones. In this case, all but the first element are returned. Finally, the negation effect can be applied to a vector of subscripts, as in the final example, which requests all but the first three elements of x.
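Two further uses of the subscripting operator, sketched here with illustrative names (first_last, x2), are selecting non-adjacent elements with c() and replacing an element by assigning to a subscript:

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7, 53.5, 3.6, 2.6, 6.1, 1.7)

first_last <- x[c(1, length(x))]   # first and last elements: 10.4 1.7

x2 <- x          # work on a copy so that x itself is unchanged
x2[2] <- 0       # assignment to a subscript replaces that element
x2[1:3]          # 10.4  0.0  3.1
```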

2.3 Some Special Data Types

Now that we have some experience with vectors of numeric data, we will look at some special non-numeric or semi-numeric data types that R can work with.

2.3.1 Logical

R has a boolean data type, called logical in R. A logical takes the values TRUE or FALSE. Logical values are typically the result of a logical comparison. They are also often used as switches within a function call.

We’ll start with a silly example.

> a <- 20

> a

[1] 20

> a > 13


[1] TRUE

> a == 13

[1] FALSE

We start by assigning 20 to the object a. Next, we compare a with the value 13, which returns TRUE since 20 (the value of a) is greater than 13. Next we ask if a is equal to 13, which it is not, and so FALSE is returned. Note that the operator to test if two numbers are equal is the “==” operator¹. A common novice mistake is to use a single equals sign instead.
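One caution about “==” with non-integer numbers can be sketched as follows (the names a_sum, exact and close are illustrative). Because of floating point rounding, an exact test can fail where a tolerance-based comparison with all.equal() succeeds:

```r
a_sum <- 0.1 + 0.2

exact <- a_sum == 0.3                    # FALSE: the stored doubles differ slightly
close <- isTRUE(all.equal(a_sum, 0.3))   # TRUE: equal within a small tolerance

exact
close
```

Wrapping all.equal() in isTRUE() is the usual idiom, since all.equal() returns a description of the difference rather than FALSE when the values differ.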

Logical operations work with vectors as well. Consider this example.

> y

[1] 0 1 2 3 4 5 6 7 8 9

> y > 5

[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE

> y == 5

[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE

Note that in these examples the comparison returns a vector of logicals.
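A logical vector like this can itself be used as a subscript. As a minimal sketch (the name big is made up for the example), it can select, count, or locate the elements that satisfy a condition:

```r
y <- 0:9
big <- y > 5    # logical vector, TRUE where the condition holds

y[big]          # elements of y that are greater than 5: 6 7 8 9
sum(big)        # TRUE counts as 1, so this is the number of matches: 4
which(big)      # positions of the matches: 7 8 9 10
```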

2.3.2 Character

R can handle character data as well. There are a number of tools available for manipulation of character data. Character data is most useful for customizing R output.

> yy <- c("Cat", "Dog", "Pig")

> yy

[1] "Cat" "Dog" "Pig"

Often, it is desirable to combine numeric and character data, especially for labeling in plots. The paste() function is used for this.

> zz <- paste(yy, 1:3, sep = "")

> zz

[1] "Cat1" "Dog2" "Pig3"

In this example, yy has three elements and 1:3 has three elements. So, the result is to paste the first element of yy to the first element of 1:3 and so on. The sep argument says what character to place between the pasted elements. The default is a space.
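For comparison, here is a short sketch of paste() with the default separator, a custom separator, and the collapse argument, which joins the pieces into a single string (the result names are illustrative):

```r
yy <- c("Cat", "Dog", "Pig")

spaced     <- paste(yy, 1:3)              # default sep = " "
dashed     <- paste(yy, 1:3, sep = "-")   # custom separator
one_string <- paste(yy, collapse = ", ")  # a single string of length 1

spaced       # "Cat 1" "Dog 2" "Pig 3"
dashed       # "Cat-1" "Dog-2" "Pig-3"
one_string   # "Cat, Dog, Pig"
```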

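¹ Actually, for numeric comparisons the all.equal() function (usually wrapped as isTRUE(all.equal(x, y))) is preferred, since it allows for floating point rounding error; the related identical() function tests for exact equality.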


2.3.3 Factors

Categorical variables are common in statistical analysis. When used as an explanatory variable in a statistical model they require special treatment; specifically, dummy variables need to be created. The factor data type in R is explicitly for this purpose.

> d <- rep(c(1, 2), 3)

> d

[1] 1 2 1 2 1 2

> f <- factor(d, labels = c("Placebo", "Active"))

> f

[1] Placebo Active  Placebo Active  Placebo Active
Levels: Placebo Active

With the creation of d above we introduce another vector creation function, namely rep(). In its most common usage, the first argument is replicated the number of times indicated by the second argument. So, d is a vector of alternating 1’s and 2’s with three of each.

The factor() function creates a factor vector from a numeric² vector. The labels argument is used to give meaningful labels to the factor levels. The smallest number in the vector being converted gets the first label in the labels argument and so on.
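To see how labels and underlying codes relate, the following sketch inspects the factor with levels(), as.integer() and table(). The character factor g is a made-up example showing the sorted default ordering for character input:

```r
d <- rep(c(1, 2), 3)
f <- factor(d, labels = c("Placebo", "Active"))

levels(f)        # "Placebo" "Active"
as.integer(f)    # underlying integer codes: 1 2 1 2 1 2
table(f)         # three observations at each level

g <- factor(c("b", "a", "c", "a"))   # character input: levels are sorted
levels(g)        # "a" "b" "c"
```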

> contrasts(f)

        Active
Placebo      0
Active       1

R will automatically create dummy variables for factors when they are used in statistical models. The contrasts() function shows how the dummy variables will be coded. So-called treatment contrasts are the default, but others such as Helmert contrasts are possible. See the help file for contrasts and the related pages for details.

The next example shows a three level factor and the default contrasts.

> d <- rep(1:3, 2)

> d

[1] 1 2 3 1 2 3

> f <- factor(d, labels = c("Single", "Married", "Widowed"))

> f

[1] Single  Married Widowed Single  Married Widowed
Levels: Single Married Widowed

> contrasts(f)

        Married Widowed
Single        0       0
Married       1       0
Widowed       0       1

² Character vectors can also be converted to factors. In this case the sorted character strings form the default coding order.


Ordered Factors

The factor() function creates unordered categorical variables. A categorical vector that is actually ordinal is created using the ordered() function. The default contrasts for ordered factors are orthogonal polynomials.
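A minimal sketch of an ordered factor follows; the variable sev and its levels are invented for illustration. The levels argument fixes the ordering, after which comparisons between levels are meaningful and contrasts() shows the polynomial coding:

```r
sev <- ordered(c("Low", "High", "Medium", "Low"),
               levels = c("Low", "Medium", "High"))

sev              # Levels: Low < Medium < High
sev < "High"     # TRUE FALSE TRUE TRUE
contrasts(sev)   # columns .L (linear) and .Q (quadratic)
```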

2.4 Matrices and Linear Algebra

2.4.1 Creating Matrices

Matrices in R are two-dimensional arrays. Like vectors, all elements must be of the same data type. Matrices may be created in a variety of ways. First, vectors of the same length and data type may be combined either row-wise or column-wise to create a matrix.

> x

[1] 10.4 5.6 3.1 6.4 21.7 53.5 3.6 2.6 6.1 1.7

> y

[1] 0 1 2 3 4 5 6 7 8 9

> rbind(x, y)

  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x 10.4  5.6  3.1  6.4 21.7 53.5  3.6  2.6  6.1   1.7
y  0.0  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0   9.0

> cbind(x, y)

         x y
 [1,] 10.4 0
 [2,]  5.6 1
 [3,]  3.1 2
 [4,]  6.4 3
 [5,] 21.7 4
 [6,] 53.5 5
 [7,]  3.6 6
 [8,]  2.6 7
 [9,]  6.1 8
[10,]  1.7 9

It should be clear that the rbind() function combines vectors row-wise and cbind() does this column-wise.

Just as the length() function tells how many elements are in a vector, the dim() function tells you the dimensions of the matrix.

> dim(cbind(x, y))

[1] 10 2


This shows that the row dimension is given first, followed by the number of columns.

The matrix() function is another way to create a matrix. It takes a vector as its first argument, and other arguments specify the dimensions of the matrix and in what order the cells are populated. Notice the different behaviour of these two uses of matrix().

> matrix(1:16, nrow = 4)

     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

> matrix(1:16, nrow = 4, byrow = TRUE)

     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16

This takes some explanation. Since the vector to be transformed to a matrix has 16 elements and the matrix is to have 4 rows, indicated by nrow = 4, the matrix will be 4 × 4. In the first call, the matrix is filled in column by column. This is the default behaviour. The byrow = TRUE argument causes the matrix to be built row by row instead.
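One way to check this understanding, sketched with illustrative names (by_col, by_row, wide), is to note that filling by row gives the transpose of filling by column, and that matrix() infers the missing dimension when only one is supplied:

```r
by_col <- matrix(1:16, nrow = 4)                 # default: filled column by column
by_row <- matrix(1:16, nrow = 4, byrow = TRUE)   # filled row by row

identical(by_row, t(by_col))   # TRUE: the two fills are transposes of each other

wide <- matrix(1:6, ncol = 3)  # only ncol given, so nrow is inferred as 2
dim(wide)                      # 2 3
```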

2.4.2 Matrix Indexing

Matrix elements and sections are extracted in a similar manner to vectors, except instead of one index, two are required. The first index selects a row and the second a column.

> X <- cbind(x, y, -x)

> X

         x y
 [1,] 10.4 0 -10.4
 [2,]  5.6 1  -5.6
 [3,]  3.1 2  -3.1
 [4,]  6.4 3  -6.4
 [5,] 21.7 4 -21.7
 [6,] 53.5 5 -53.5
 [7,]  3.6 6  -3.6
 [8,]  2.6 7  -2.6
 [9,]  6.1 8  -6.1
[10,]  1.7 9  -1.7

> X[2, 1]

[1] 5.6


An entire row can be extracted by omitting the second index. Similarly, an entire column can be extracted by omitting the first index.

> X[3, ]

   x    y
 3.1  2.0 -3.1

> X[, 1]

[1] 10.4 5.6 3.1 6.4 21.7 53.5 3.6 2.6 6.1 1.7

A submatrix can be extracted by specifying vectors for each of the indices.

> X[c(1, 4), 2:3]

     y
[1,] 0 -10.4
[2,] 3  -6.4
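Note that extracting a single row or column returns a plain vector, not a matrix. As a sketch (row_vec and row_mat are illustrative names), the drop = FALSE argument of the subscripting operator keeps the result as a matrix, which matters when it feeds into matrix arithmetic:

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7, 53.5, 3.6, 2.6, 6.1, 1.7)
y <- 0:9
X <- cbind(x, y, -x)

row_vec <- X[3, ]                 # dimensions are dropped: an ordinary vector
row_mat <- X[3, , drop = FALSE]   # dimensions are kept: a 1 x 3 matrix

dim(row_vec)   # NULL
dim(row_mat)   # 1 3
```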

There is also a function diag() that does a number of different things. If its argument is a matrix, the main diagonal is extracted. If its argument is a vector, a diagonal matrix is constructed with the argument vector as the main diagonal. If the argument is a number k, the k × k identity matrix is constructed.

> matrix(1:16, nrow = 4)

     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

> diag(matrix(1:16, nrow = 4))

[1] 1 6 11 16

> diag(c(1, -3, 7))

     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0   -3    0
[3,]    0    0    7

> diag(3)

     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1


2.4.3 Matrix Arithmetic

Consider the following R code.

> X <- matrix(1:16, nrow = 4, byrow = TRUE)

> X

     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16

> Y <- matrix(seq(1, 32, by = 2), nrow = 4, byrow = TRUE)

> Y

     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    9   11   13   15
[3,]   17   19   21   23
[4,]   25   27   29   31

> Y + X

     [,1] [,2] [,3] [,4]
[1,]    2    5    8   11
[2,]   14   17   20   23
[3,]   26   29   32   35
[4,]   38   41   44   47

> Y - X

     [,1] [,2] [,3] [,4]
[1,]    0    1    2    3
[2,]    4    5    6    7
[3,]    8    9   10   11
[4,]   12   13   14   15

This is matrix addition and subtraction. The matrices must be of the same dimensions.

> 3 * X

     [,1] [,2] [,3] [,4]
[1,]    3    6    9   12
[2,]   15   18   21   24
[3,]   27   30   33   36
[4,]   39   42   45   48

Scalar multiplication is done in the obvious manner above.


> X * Y

     [,1] [,2] [,3] [,4]
[1,]    1    6   15   28
[2,]   45   66   91  120
[3,]  153  190  231  276
[4,]  325  378  435  496

This is not matrix multiplication. This is element by element multiplication. The matrices must have the same dimension for this to work.

> X %*% Y

     [,1] [,2] [,3] [,4]
[1,]  170  190  210  230
[2,]  378  430  482  534
[3,]  586  670  754  838
[4,]  794  910 1026 1142

Now, this is matrix multiplication. The dimensions of the matrices must be such that the matrix multiplication is defined; that is, the number of columns of the first matrix must equal the number of rows of the second.
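The dimension requirement can be sketched with two small non-square matrices; A, B and res are hypothetical names used only here. The product is defined in both orders for this pair, but multiplying A by itself is not, and R reports an error:

```r
A <- matrix(1:6, nrow = 2)   # 2 x 3
B <- matrix(1:6, nrow = 3)   # 3 x 2

dim(A %*% B)   # 2 2
dim(B %*% A)   # 3 3

# A %*% A is undefined (3 columns versus 2 rows), so R signals an error.
res <- tryCatch(A %*% A, error = function(e) "non-conformable")
res            # "non-conformable"
```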

2.4.4 Some Linear Algebra

R also has a number of functions for computations from linear algebra. Consider this random 3 × 3 matrix.

> set.seed(6371)

> W <- matrix(sample(-9:9, 9, replace = TRUE), ncol = 3)

> W

     [,1] [,2] [,3]
[1,]    5    2    6
[2,]   -1    4   -9
[3,]    3    7    3

The set.seed() function sets the seed of the random number generator. This is common practice to permit a simulation to be replicated by you or someone else. The sample() function used in constructing W is the function for taking random samples from a vector. The first argument is the vector to sample from and the second argument is the sample size. The replace = TRUE argument forces sampling with replacement. So, W is constructed by taking a random sample of size 9 from the integers from −9 to 9 and arranging them in a 3 × 3 matrix.

The transpose of a matrix is obtained with the t() function.

> t(W)

     [,1] [,2] [,3]
[1,]    5   -1    3
[2,]    2    4    7
[3,]    6   -9    3


One thing to note is that t() is not needed in one place where you might expect it, calculating quadratic forms, provided there is no ambiguity about what is required. Suppose you were computing the quadratic form x′Ax where x is an n × 1 column vector and A is an n × n matrix. In R, x would be represented as a vector x of length n and A would be the matrix A. The quadratic form could then be computed as x %*% A %*% x with no transposes at all.
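A small numerical sketch, with a made-up 2 × 2 matrix A and vector x, shows the two forms agreeing. The result is a 1 × 1 matrix, and drop() reduces it to a plain number:

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # a symmetric 2 x 2 matrix
x <- c(1, 2)

q1 <- x %*% A %*% x        # x is treated as a row or column vector as needed
q2 <- t(x) %*% A %*% x     # the explicit transpose gives the same result

dim(q1)     # 1 1
drop(q1)    # 18
```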

Matrix inverses and solutions to systems of equations are both obtained with the solve() function.

> solve(W)

            [,1]        [,2]       [,3]
[1,]  0.35211268  0.16901408 -0.1971831
[2,] -0.11267606 -0.01408451  0.1830986
[3,] -0.08920188 -0.13615023  0.1032864

So, solve() with a single matrix argument computes the inverse, if it exists. The solution to a system of equations, such as

\[
\begin{aligned}
5x + 2y + 6z &= 1 \\
-1x + 4y - 9z &= -5 \\
3x + 7y + 3z &= 2
\end{aligned}
\]

is obtained by giving the right hand side as a vector to the solve() function as the second argument.

> solve(W, c(1, -5, 2))

[1] -0.8873239 0.3239437 0.7981221
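As a check on these results, the sketch below enters W directly (rather than relying on the random seed) and verifies that the inverse and the solution behave as expected. The names b, sol and W_inv are illustrative:

```r
# W entered column by column, matching the matrix printed above.
W <- matrix(c(5, -1, 3, 2, 4, 7, 6, -9, 3), ncol = 3)
b <- c(1, -5, 2)

W_inv <- solve(W)      # matrix inverse
sol   <- solve(W, b)   # solution of W %*% sol = b

all.equal(W %*% W_inv, diag(3))   # TRUE: W times its inverse is the identity
all.equal(drop(W %*% sol), b)     # TRUE: the solution reproduces the right hand side
```

Using all.equal() rather than == allows for the small rounding errors that arise in floating point arithmetic.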

Other Linear Algebra Capabilities

R has many other matrix and linear algebra capabilities. Some of them are summarised in Table 2.1. See the help files for additional details.

Table 2.1: Other Matrix and Linear Algebra Operators and Functions

Operator/Function      Description
%o% or outer()         Outer product.
%x% or kronecker()     Kronecker products.
crossprod()            Computes X′Y efficiently.
tcrossprod()           Computes XY′ efficiently.
eigen()                Computes eigenvalues and eigenvectors.
det()                  Computes determinants.
svd()                  Singular value decomposition of a rectangular matrix.
qr()                   QR decomposition of a matrix.
chol()                 Choleski factorization.
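A few of the functions in Table 2.1 can be tried on the matrix W from earlier, entered directly here so the sketch does not depend on the random seed:

```r
W <- matrix(c(5, -1, 3, 2, 4, 7, 6, -9, 3), ncol = 3)

det(W)                                # 213, up to floating point error
all.equal(crossprod(W), t(W) %*% W)   # TRUE: crossprod(W) computes W'W
outer(1:3, 1:2)                       # 3 x 2 matrix whose (i, j) entry is i * j
```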


2.5 Data Frames

The data frame is, perhaps, the most frequently used data structure for statistical modeling and analysis. It is like a matrix except that the columns need not all be of the same type. Specifically, some columns may be numeric while others may be factors. The data frame is analogous to the data set in SAS. So, it is helpful to think of the columns of a data frame as variables and the rows as observations.

2.5.1 Creating Data Frames

Many of the functions that read external data files (see Section 2.8) return data frames. The data.frame() function is the usual way to combine a number of R objects to form a data frame.

> df1 <- data.frame(id = 1:10, gender = factor(rep(1:2, 5), labels = c("Male",

+ "Female")), age = trunc(rnorm(10, 45, 5)))

> df1

   id gender age
1   1   Male  48
2   2 Female  43
3   3   Male  44
4   4 Female  37
5   5   Male  43
6   6 Female  49
7   7   Male  39
8   8 Female  43
9   9   Male  38
10 10 Female  43

This example creates a data frame called df1 which has three variables: id, gender and age. The variable names are created in the call to data.frame() by the variable = value syntax shown in the example.

This example introduces the function rnorm(). This generates random values from a normal distribution. In this case, the mean of the distribution is 45 and the standard deviation is 5. See Section 2.9 for additional details concerning random numbers in R. The trunc() function used here simply takes the integer part without rounding.

The names() function returns the variable names for a data frame.

> names(df1)

[1] "id" "gender" "age"

2.5.2 Extracting Data From Data Frames

Data frames can be indexed much like matrices; however, given the mixed data types common to data frames, other techniques are often preferable.

A variable (column) can be extracted by position or, more readably, by the $ operator.

> df1[, 3]

[1] 48 43 44 37 43 49 39 43 38 43


> df1$age

[1] 48 43 44 37 43 49 39 43 38 43

These commands have the same effect; that is, they extract the age variable from df1.

> df1[df1$age > 45, ]

  id gender age
1  1   Male  48
6  6 Female  49

This is an example of logical subscripting. It returns all rows where the age variable is greater than 45.
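Logical subscripts can combine several conditions and can be paired with a column selection. The following is a minimal sketch using a small made-up data frame (dat and its values are illustrative, not the df1 shown above); subset() is a base R convenience function that expresses the same selection.

```r
# Illustrative data frame; the values are invented for this sketch
dat <- data.frame(id = 1:6,
                  gender = factor(c("Male", "Female", "Male",
                                    "Female", "Male", "Female")),
                  age = c(48, 43, 44, 37, 50, 49))

# Rows meeting two conditions, keeping only the id and age columns
older_women <- dat[dat$age > 40 & dat$gender == "Female", c("id", "age")]
older_women

# subset() makes the same selection without repeating the data frame name
same <- subset(dat, age > 40 & gender == "Female", select = c(id, age))
all.equal(older_women, same)
```

The bracket form is generally the safer choice inside functions, while subset() tends to be convenient for interactive work.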

2.5.3 Merging Data Frames

When faced with the task of data analysis, especially of clinical trials, it is not uncommon to have multiple data frames that need to be combined. For example, one data frame may contain baseline data, another the randomly assigned treatment groups, yet another, the outcome data. The merge() function joins two data frames together. If the only arguments to merge() are the two data frames to merge, they will be joined on the variable names that are common to both data frames. The by arguments allow you to specify the joining variables.

> df2 <- data.frame(id = 1:10, rx = factor(sample(c(0, 1), 10,

+ replace = TRUE), labels = c("Active", "Placebo")))

> df2

   id      rx
1   1  Active
2   2  Active
3   3  Active
4   4  Active
5   5 Placebo
6   6  Active
7   7 Placebo
8   8 Placebo
9   9  Active
10 10 Placebo

> merge(df2, df1)

   id      rx gender age
1   1  Active   Male  48
2   2  Active Female  43
3   3  Active   Male  44
4   4  Active Female  37
5   5 Placebo   Male  43
6   6  Active Female  49
7   7 Placebo   Male  39
8   8 Placebo Female  43
9   9  Active   Male  38
10 10 Placebo Female  43

> merge(df2, df1, by = 1)

   id      rx gender age
1   1  Active   Male  48
2   2  Active Female  43
3   3  Active   Male  44
4   4  Active Female  37
5   5 Placebo   Male  43
6   6  Active Female  49
7   7 Placebo   Male  39
8   8 Placebo Female  43
9   9  Active   Male  38
10 10 Placebo Female  43

> merge(df2, df1, by = "id")

   id      rx gender age
1   1  Active   Male  48
2   2  Active Female  43
3   3  Active   Male  44
4   4  Active Female  37
5   5 Placebo   Male  43
6   6  Active Female  49
7   7 Placebo   Male  39
8   8 Placebo Female  43
9   9  Active   Male  38
10 10 Placebo Female  43

The three calls to merge() produce identical results. There are arguments by.x and by.y that are useful when the merging variable is in different columns or has different names in the two data frames. There are also arguments that control what to do if there are rows that fail to match.
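As a minimal sketch of these arguments, consider two small made-up data frames (baseline and outcome are illustrative names and values) whose key columns are named differently and do not match completely. The all argument is one of the arguments that controls the handling of unmatched rows.

```r
# Illustrative data; the key is called id in one frame and patient in the other
baseline <- data.frame(id = 1:5, age = c(48, 43, 44, 37, 43))
outcome  <- data.frame(patient = c(2, 3, 5, 7),
                       died = c(FALSE, TRUE, FALSE, FALSE))

# Name the joining variable on each side; only matching rows are kept
both <- merge(baseline, outcome, by.x = "id", by.y = "patient")
both

# all = TRUE keeps unmatched rows from both sides, filling the gaps with NA
everyone <- merge(baseline, outcome, by.x = "id", by.y = "patient", all = TRUE)
everyone
```

The arguments all.x and all.y keep the unmatched rows of only the first or only the second data frame.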

2.6 Lists

Lists are the most general data object in R. They store an arbitrary collection of other R objects. In one sense, lists are a generalized vector where each element can be a different type of object, even another list. The results of statistical analyses are themselves lists. The “magic” of R is that attributes called classes can be associated with lists. This then causes R to treat lists of particular classes in particular ways. This is known as Object Oriented Programming, or OOP for short³.

³ This paper will not discuss OOP in great detail, but will indicate where it is being used.


2.6.1 Creating Lists

The list() function creates lists. It takes an arbitrary number of arguments. The arguments can be named as with data.frame(), in which case the elements can be accessed by name as well as by index.

> lst1 <- list(x, X, df2)

> lst1

[[1]]
 [1] 10.4  5.6  3.1  6.4 21.7 53.5  3.6  2.6  6.1  1.7

[[2]]
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16

[[3]]
   id      rx
1   1  Active
2   2  Active
3   3  Active
4   4  Active
5   5 Placebo
6   6  Active
7   7 Placebo
8   8 Placebo
9   9  Active
10 10 Placebo

> lst2 <- list(x = x, X = X, df2 = df2)

> lst2

$x
 [1] 10.4  5.6  3.1  6.4 21.7 53.5  3.6  2.6  6.1  1.7

$X
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16

$df2
   id      rx
1   1  Active
2   2  Active
3   3  Active
4   4  Active
5   5 Placebo
6   6  Active
7   7 Placebo
8   8 Placebo
9   9  Active
10 10 Placebo

Notice that in lst1 the elements are “named” [[i]] where i is the element index. In lst2 each element has a name as given in the call to list().

2.6.2 Extracting List Elements

If the list elements are named, as in lst2 above, an element can be extracted by name in the same way that a column (variable) in a data frame can be extracted. Specifically, listname$elementname is the syntax of the command.

> lst2$x

[1] 10.4 5.6 3.1 6.4 21.7 53.5 3.6 2.6 6.1 1.7

> lst2$X

     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16

If the elements are not named, then they must be extracted by element number. Recall that for vectors, the subscripting operator had the form [i]. For lists, the subscripting operator uses two square brackets, so it takes [[i]] as its form.

> lst2[[1]]

[1] 10.4 5.6 3.1 6.4 21.7 53.5 3.6 2.6 6.1 1.7

> lst2[[2]]

     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16

Note that, as the example shows, elements can be extracted by number even when they have names.
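Single brackets also work on lists, but they return a sub-list rather than the element itself. A short sketch with a made-up list (lst is an illustrative name) shows the difference, along with extraction by name using double brackets.

```r
# Illustrative list with two named elements
lst <- list(x = c(10.4, 5.6, 3.1), label = "demo")

lst[["x"]]   # double brackets with a name: the numeric vector itself
lst[[1]]     # double brackets with an index: the same vector
lst[1]       # single brackets: a list of length one containing x

class(lst[[1]])
class(lst[1])
```

The distinction matters when the extracted element is passed to a function that expects, say, a numeric vector rather than a list.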


2.7 Missing Data

An unfortunate reality of statistical analysis, especially in biostatistics, is the presence of missing data. R has some functionality for dealing with missing data⁴.

2.7.1 Representing Missing Data

The presence of missing data is easy to recognize when R prints an object containing missing values. Missing values are displayed as NA. Missing values usually find their way into R objects when read from external data files, but they can be created explicitly in R.

> xx <- c(1, 2, NA, 5, 7, NA)

> xx

[1] 1 2 NA 5 7 NA

One of the obvious difficulties with missing values is that computations involving them result in missing values too. However, because of the vectorized nature of R, arithmetic computations result in missing values for only the elements affected.

> xx + 6:1

[1] 7 7 NA 8 9 NA

Observe that NA is only present in the places in xx that were missing. Note that here the : operator can also produce a sequence of numbers in decreasing order.

2.7.2 Testing for Missing Values

It is not always practical to display the data objects to visually inspect them for NA (e.g. a data frame with 50 columns and 1000 rows). Yet, we would like to know about the presence (and extent) of missing data. Although NA is used to display missing values, simple logical comparisons involving the characters “NA” do not work. Consider two possibilities.

> xx == NA

[1] NA NA NA NA NA NA

> xx == "NA"

[1] FALSE FALSE NA FALSE FALSE NA

Neither of these gives the desired result. In the first case, xx is being compared to “missing”, which always returns “missing” (NA) as its result. The second case compares xx to the string “NA”, which results in FALSE (as it should) for the non-missing elements of xx but results in “missing” for the missing elements of xx. There is a function, is.na(), that does what we want.

> is.na(xx)

[1] FALSE FALSE TRUE FALSE FALSE TRUE

⁴ There are packages for multiple imputation, but these will not be discussed.

This may not look immediately helpful since you would still have to visually inspect the results for TRUE. A useful “trick” is to use the fact that logicals treat FALSE as zero and TRUE as one in arithmetic calculations. We’ll use the sum() function here.

> sum(is.na(xx))

[1] 2

This tells us that xx has 2 missing values.

One other function worth mentioning at this time is na.omit(). When a data frame is given as the argument, this function returns a data frame with any cases (rows) containing missing values removed.
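The same counting trick extends to data frames. The sketch below uses a small made-up data frame (dat and its values are illustrative) to count the missing values in each variable with colSums() and then drop the incomplete cases with na.omit().

```r
# Illustrative data frame with missing values in two variables
dat <- data.frame(id  = 1:5,
                  age = c(48, NA, 44, 37, NA),
                  wt  = c(70, 82, NA, 65, 90))

# is.na() on a data frame gives a logical matrix; column sums count the NAs
colSums(is.na(dat))

# Keep only the rows with no missing values
complete <- na.omit(dat)
complete
nrow(complete)
```

Here only rows 1 and 4 are complete, so na.omit() returns a data frame with two rows.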

2.8 Reading External Data Files

Before long, it will be necessary to bring some data into R for analysis. It is somewhat ironic that reading data from external files is one of the more tedious tasks with statistical software. R has a number of functions to help with this.

2.8.1 Reading Rectangular Text Files

The rectangular text file is a common format for data, especially in biostatistics. In this format, cases form the rows and variables form the columns. This format requires that all variables on all rows are separated by a common delimiter (i.e. fixed width files need to be read differently). Each row must also have the same number of variables. Usually, this type of file will become a data frame in R. The read.table() function (and its friends) is the workhorse for this purpose. The read.table() function is quite flexible. It is instructive to look at its argument list.

> args(read.table)

function (file, header = FALSE, sep = "", quote = "\"'", dec = ".",
    row.names, col.names, as.is = !stringsAsFactors, na.strings = "NA",
    colClasses = NA, nrows = -1, skip = 0, check.names = TRUE,
    fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE,
    comment.char = "#", allowEscapes = FALSE, flush = FALSE,
    stringsAsFactors = default.stringsAsFactors())
NULL

It is highly recommended to read the help page for read.table(). Among other things it also describes some of its friends. One particularly helpful friend is read.csv(). It is simply a wrapper for read.table() with arguments set for the reading of a CSV file as produced by programs such as Excel. Another reason to read the help file is that read.table() will attempt to adaptively set some of its arguments depending on certain characteristics of the file.

The file argument is required. It is a character string giving the name of the file to read. If the file is not in your current directory, a path may be specified. A relative path is assumed unless a full path is given. Table 2.2 shows some examples.


Table 2.2: Examples of Path Specifications

Filename only     datafile.txt
Relative path     data/datafile.txt
Absolute path     /data/datafile.txt
                  C:/Data/datafile.txt

The final example shows an absolute path on Windows. Note the use of the Unix path separator / instead of the usual Windows \ separator. This is the recommended way to specify paths on Windows.

If the first line of the data file contains the variable names that should be used in the data frame, specify header=TRUE.

The sep argument specifies the delimiter between variables. If sep is not specified, white space is used. Table 2.3 gives some examples.

Table 2.3: Examples of Common Delimiters and how to specify them

Delimiter         Specification
A single space    sep = " "
A comma           sep = ","
A colon           sep = ":"
A tab             sep = "\t"

When the data file does not contain the variable names in the first row, the col.names argument can be used to specify the variable names. A vector of character strings should be given; for example, col.names = c("id","rx","age") would assign the variable names id, rx and age to the variables being read. The length of the vector of variable names should be the same as the number of variables being read in the data file.

On occasion, it may be desirable to label the rows of a data frame with one of the variables in the data file. For example, you may wish to use the patient id as the row labels⁵. The row.names argument can be used for this purpose.
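To tie these arguments together, here is a minimal sketch that writes a small comma-delimited file to a temporary location and reads it back. The file contents and the object names (tmp, trial, trial2) are invented for illustration.

```r
# Write a small illustrative CSV file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,rx,age",
             "1,Active,48",
             "2,Placebo,NA",
             "3,Active,44"), tmp)

# The first line holds the variable names and fields are comma-delimited
trial <- read.table(tmp, header = TRUE, sep = ",")
trial

# read.csv() sets header and sep for this kind of file
trial2 <- read.csv(tmp)
all.equal(trial, trial2)

unlink(tmp)  # remove the temporary file
```

The string NA in the file matches the default na.strings = "NA", so the second value of age is read as missing.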

2.8.2 Reading Fixed Width Files

The read.fwf() function can be used for reading fixed width format files, that is, files where the variables are located by character column on each line (e.g. id starts in the first column and is 8 columns wide, rx starts in column 9 and is one column wide, etc.).

> args(read.fwf)

function (file, widths, header = FALSE, sep = "\t", as.is = FALSE,
    skip = 0, row.names, col.names, n = -1, buffersize = 2000,
    ...)
NULL

⁵ Using patient id as row labels may not be as useful as it first appears, since you lose that as an actual variable in the data frame which will make merges more difficult.


The file argument is used as in read.table(). The widths argument is an integer vector that gives the widths of the fields (each field will become a variable in the data frame). The header and sep arguments are used together. If there is a header line (header = TRUE) then sep must be given to specify the character that delimits the variables in the header line. The delimiter should not be used anywhere else in the data file.
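The following sketch reads a small fixed width file laid out as in the description above: an 8-column id, a one-column rx code and a two-column age. The file contents and names (tmp, fw) are made up for illustration.

```r
# Write an illustrative fixed width file: id (8 columns), rx (1), age (2)
tmp <- tempfile(fileext = ".txt")
writeLines(c("00000001A45",
             "00000002P52",
             "00000003A38"), tmp)

# widths gives the field widths; col.names supplies the variable names
fw <- read.fwf(tmp, widths = c(8, 1, 2), col.names = c("id", "rx", "age"))
fw

unlink(tmp)  # remove the temporary file
```

Note that the zero-padded id field is converted to a number on input, so the leading zeros are not retained.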

2.8.3 Other File Formats

The package foreign provides some functions for reading a variety of data formats⁶. To use the foreign package you must first issue the command library(foreign), then the functions shown in Table 2.4 will be available. You should review the help pages of these functions before using them.

Table 2.4: A selection of foreign functions

Function          Purpose
read.xport()      Read SAS XPORT file (SAS not required)
read.ssd()        Read SAS data set via read.xport() (SAS required)
read.spss()       Read SPSS save and export files
read.dta()        Read Stata file
read.epiinfo()    Read Epi Info file

2.9 Random Numbers

R has functions for generating random numbers from a wide range of probability distributions. In Section 2.4.4 the set.seed() function was introduced. This function sets the seed of the random number generator. Using it makes the results of a simulation repeatable and reproducible by others. In most instances, it is sufficient to specify an integer as its only argument.

The random number functions in R follow consistent naming and usage conventions. The names are of the form rdistn() where distn is replaced by the name, possibly abbreviated, of the desired distribution. The first argument is the number of random numbers to generate and additional arguments specify the parameters for the particular distribution. Table 2.5 gives a few of the available functions.

Table 2.5: A selection of random number functions

Function    Distribution    Parameters and Defaults
runif()     Uniform         min = 0, max = 1
rnorm()     Normal          mean = 0, sd = 1
rbinom()    Binomial        size, prob
rpois()     Poisson         lambda

We close this chapter with a few examples.

⁶ The foreign package is a recommended package and should be present in most, if not all, R installations.

> set.seed(134)

> runif(10)

 [1] 0.2003636 0.6275925 0.8814988 0.2512561 0.6099499 0.1697329 0.3629432
 [8] 0.7853104 0.1006075 0.8266404

> rnorm(10)

 [1]  0.61644019  1.93239850 -0.05272626  0.25326254  1.38111291  1.18090741
 [7] -2.00703218  0.56596405  0.36471560  1.28983744

> rnorm(10, mean = 100, sd = 15)

 [1] 115.31691  84.26098  97.90947  89.16325  67.95200 109.65504  77.60333
 [8] 121.97977  90.23091  97.72967

> rbinom(10, size = 25, prob = 0.1)

[1] 1 4 3 1 3 3 5 2 4 3

> rpois(10, lambda = 5)

[1] 5 8 3 10 3 3 3 6 5 4
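The role of set.seed() in making results repeatable can be checked directly. In this short sketch (the seed value and the object names a, b and d are arbitrary), resetting the seed reproduces the same draws, while continuing without resetting it does not.

```r
set.seed(134)
a <- runif(5)

set.seed(134)      # reset to the same seed
b <- runif(5)      # the same five numbers are generated again

d <- runif(5)      # the generator has moved on, so these differ

identical(a, b)
identical(a, d)
```

The exact numbers produced for a given seed can differ between versions of R, so reproducibility is only assured within the same version and generator settings.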


Chapter 3

Descriptive Statistics

3.1 Introduction

Now that you can create data objects in R, it is time to see some of the ways data can be summarised. A number of data sets are available in a typical R installation. Some of these will be used to illustrate the techniques described in this chapter.

3.2 Common Descriptive Statistics for Numeric Data

As should be expected from any statistical software, the usual descriptive statistics (e.g. mean, median, standard deviation, etc.) are easy to calculate in R. Table 3.1 lists the commonly used statistics.

Table 3.1: Functions for Common Descriptive Statistics

Function      Descriptive Statistic
mean()        Arithmetic (including trimmed) mean
median()      Median value
min()         Minimum value
max()         Maximum value
quantile()    Quantiles
range()       Minimum and maximum values
var()         Variance
sd()          Standard deviation

All of these functions have an argument na.rm. If na.rm = TRUE is given, missing values are removed before proceeding with the calculations. The default value of this argument is FALSE, so when missing values are present most of these functions return NA, while quantile() stops with an error.
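A small sketch of how na.rm behaves in current versions of R, using a made-up vector v: without na.rm = TRUE, mean() returns NA and quantile() signals an error, which is caught here with try() so that the example runs to completion.

```r
# Illustrative vector with one missing value
v <- c(4, 8, NA, 15)

mean(v)                    # NA, because of the missing value
mean(v, na.rm = TRUE)      # mean of 4, 8 and 15
quantile(v, na.rm = TRUE)  # quantiles of the non-missing values

# With the default na.rm = FALSE, quantile() stops with an error
res <- try(quantile(v), silent = TRUE)
inherits(res, "try-error")
```

The try() wrapper is used only so the error can be shown without halting a script.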

> set.seed(717)

> x <- rnorm(50, 100, 15)

> X <- matrix(rnorm(16 * 3, 100, 15), ncol = 4)

> x


 [1] 120.97446 109.63771  98.30667  96.81319 103.09569 104.12727 110.32034
 [8] 118.36837 111.97740  81.90974  94.52280 100.77938  88.03206  99.92222
[15] 101.25841 123.81331  85.89612  97.62856 101.22789  98.40265  83.13586
[22]  75.64820  96.82666  98.06411  89.89779 116.77935 102.61170  92.50438
[29] 120.72821  78.91202 105.09910 104.31119 111.50415 129.13614 100.80756
[36] 120.75280  98.16600  92.29241 103.60759  96.78826  86.47870  67.67744
[43]  96.18677 109.24916  85.43288  70.04961  74.13466 124.31621  88.58632
[50] 105.70585

> mean(x)

[1] 99.4481

> median(x)

[1] 99.16243

> quantile(x)

       0%       25%       50%       75%      100%
 67.67744  90.49644  99.16243 108.36333 129.13614

> var(x)

[1] 204.8660

> sd(x)

[1] 14.31314

> X

           [,1]      [,2]      [,3]      [,4]
 [1,] 121.83039  93.48751  98.34321  91.95100
 [2,]  96.35655 105.01972 105.80085  91.17786
 [3,] 133.15817 113.02917 112.48569  96.40013
 [4,]  86.98315  74.57969  64.08115 103.00269
 [5,]  93.59827  97.10161  87.47215 109.47768
 [6,] 102.87646 103.29169 104.37294 105.77635
 [7,] 110.85849  84.83963 112.96592  75.72850
 [8,]  95.81701 110.74031  80.44580  99.47122
 [9,] 114.79001  89.05545  94.24806  81.17177
[10,] 120.65722 124.51856 112.48416  95.04181
[11,] 118.51383  76.22488 106.19662  84.92285
[12,] 112.44962  66.95244  99.44166 117.59759

> mean(X)

[1] 99.5164


> var(X)

          [,1]      [,2]      [,3]      [,4]
[1,] 192.41981  42.16547 146.81438 -55.84402
[2,]  42.16547 301.32944  80.86767 -13.26668
[3,] 146.81438  80.86767 217.95895 -66.58955
[4,] -55.84402 -13.26668 -66.58955 145.36362

The behaviour of mean() and var() when given a matrix argument may seem unexpected. The mean() function computes the mean of all elements of the matrix. The var() function computes the variance-covariance matrix, treating the matrix columns as variables. Note also that if the first two arguments to var() are vectors, the covariance is computed. The cov() function can also be used for this purpose and the related function cor() computes correlations.
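These relationships are easy to verify numerically. The sketch below uses two simulated vectors (a, b and the matrix AB are illustrative names) and compares the results with all.equal(), which allows for floating point rounding.

```r
set.seed(1)
a <- rnorm(20)
b <- a + rnorm(20)

# With two vector arguments, var() returns their covariance
all.equal(var(a, b), cov(a, b))

# The correlation is the covariance scaled by the two standard deviations
all.equal(cor(a, b), cov(a, b) / (sd(a) * sd(b)))

# With a matrix, var() and cov() both give the variance-covariance matrix
AB <- cbind(a, b)
all.equal(var(AB), cov(AB))
```

Each comparison should return TRUE.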

3.3 Simple Summaries of Categorical Data

Categorical data are typically summarized by means of tables and cross-tabulations. R provides some functions that are easy to use for this purpose. The example that follows demonstrates table(), xtabs() and ftable().

> x <- factor(sample(c(1, 2), 50, replace = TRUE), labels = c("A",

+ "B"))

> y <- factor(sample(c(1, 2), 50, replace = TRUE), labels = c("Yes",

+ "No"))

> z <- factor(sample(c(1, 2), 50, replace = TRUE), labels = c("Male",

+ "Female"))

> xyz <- data.frame(x = x, y = y, z = z)

> x

 [1] B A A A B A A B B B A A A B A B A A B A A A B A A B A A A B B A A B A A B B
[39] B A B B A B A B B B B B
Levels: A B

> y

 [1] No  No  No  No  Yes No  Yes Yes No  No  No  No  Yes Yes No  Yes No  No  Yes
[20] No  Yes No  No  No  No  No  No  Yes Yes Yes No  Yes Yes No  Yes No  Yes No
[39] No  Yes No  No  No  No  Yes Yes Yes Yes Yes No
Levels: Yes No

> z

 [1] Female Female Female Male   Female Female Female Male   Male   Female
[11] Female Male   Female Female Male   Male   Male   Female Male   Female
[21] Female Male   Male   Male   Male   Female Female Female Female Male
[31] Male   Female Male   Female Female Female Female Male   Male   Female
[41] Male   Male   Male   Male   Male   Male   Male   Female Female Male
Levels: Male Female


First, we consider table().

> table(x)

x
 A  B
26 24

> table(x, y)

   y
x   Yes No
  A  10 16
  B  11 13

> table(y, x)

     x
y      A  B
  Yes 10 11
  No  16 13

> table(x, y, z)

, , z = Male

   y
x   Yes No
  A   2  8
  B   6  9

, , z = Female

   y
x   Yes No
  A   8  8
  B   5  4

This example shows that table() provides a count of the unique values of a single vector or the cross-tabulation of two or more vectors. For two vectors, the first argument forms the rows and the second, the columns. With three or more vectors, the result is a multidimensional array with each additional vector adding a dimension.

The xtabs() function provides similar functionality to table() with a different interface. Specifically, the contingency table is specified as a formula¹ and xtabs() has a data argument which allows the variables given in the formula to be in a data frame.

> xtabs(~x)

x
 A  B
26 24

¹ More will be said about formulas later.

> xtabs(~x + y)

   y
x   Yes No
  A  10 16
  B  11 13

> xtabs(~x + y, data = xyz)

   y
x   Yes No
  A  10 16
  B  11 13

> xtabs(~x + y + z, data = xyz)

, , z = Male

   y
x   Yes No
  A   2  8
  B   6  9

, , z = Female

   y
x   Yes No
  A   8  8
  B   5  4

Observe that these calls to xtabs() give the same results as the previous calls to table().

When table() or xtabs() is used for the cross-tabulation of three or more variables, the result is a multidimensional array. The display of such a result is not particularly compact and becomes more difficult to interpret as variables are added. The ftable() function produces these multiway tables, but displays them in a flat form.

> ftable(z ~ x + y)

      z Male Female
x y
A Yes      2      8
  No       8      8
B Yes      6      5
  No       9      4


> ftable(x + y ~ z)

       x   A      B
       y Yes No Yes No
z
Male       2  8   6  9
Female     8  8   5  4

As the example shows, the variable(s) on the left hand side of the ~ form the column(s) and the variable(s) on the right hand side form the row(s). There is also a non-formula interface to ftable(), but the formula interface is easier to use.
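The agreement between the tabulation functions can also be checked programmatically. This is a small sketch with made-up factors (g, r and the data frame d are illustrative names), comparing the counts from table() and xtabs() and then flattening the result with ftable().

```r
# Illustrative factors; the values are invented for this sketch
g <- factor(c("A", "B", "A", "B", "A"))
r <- factor(c("Yes", "No", "No", "No", "Yes"))
d <- data.frame(g, r)

t1 <- table(g, r)               # vectors given directly
t2 <- xtabs(~ g + r, data = d)  # formula interface with a data frame

all(t1 == t2)                   # the cell counts agree
ftable(t2)                      # ftable() also accepts an existing table
```

Passing an existing table to ftable() is an instance of the non-formula interface.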

3.4 The Summary Function

One often used function in R is summary(). Its behaviour depends on the type of object it is asked to summarise. Its use is also not limited to data objects. It is frequently used to summarise the result of a fitted statistical model as will be seen later.

To begin with, the examples that follow show how summary() behaves on a variety of data objects and types.

> set.seed(121)

> summary(rnorm(100))

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-3.21600 -0.72400 -0.01899 -0.09655  0.63890  1.61500

> summary(factor(sample(c(1, 2), 100, TRUE), levels = c(1, 2),

+ labels = c("A", "B")))

 A  B
45 55

> M <- matrix(rnorm(16, 100, 15), ncol = 4)

> M

          [,1]      [,2]      [,3]      [,4]
[1,]  96.21108  69.92806 104.50945  99.46015
[2,]  99.86763 100.50151 112.18431  95.24623
[3,] 107.81925 106.88135  91.40923  72.12874
[4,] 108.80539 116.10529  75.13593 105.57522

> summary(M)

       V1               V2               V3               V4
 Min.   : 96.21   Min.   : 69.93   Min.   : 75.14   Min.   : 72.13
 1st Qu.: 98.95   1st Qu.: 92.86   1st Qu.: 87.34   1st Qu.: 89.47
 Median :103.84   Median :103.69   Median : 97.96   Median : 97.35
 Mean   :103.18   Mean   : 98.35   Mean   : 95.81   Mean   : 93.10
 3rd Qu.:108.07   3rd Qu.:109.19   3rd Qu.:106.43   3rd Qu.:100.99
 Max.   :108.81   Max.   :116.11   Max.   :112.18   Max.   :105.58


> names(CO2)

[1] "Plant" "Type" "Treatment" "conc" "uptake"

> summary(CO2)

     Plant             Type         Treatment       conc          uptake
 Qn1    : 7   Quebec     :42   nonchilled:42   Min.   :  95   Min.   : 7.70
 Qn2    : 7   Mississippi:42   chilled   :42   1st Qu.: 175   1st Qu.:17.90
 Qn3    : 7                                    Median : 350   Median :28.30
 Qc1    : 7                                    Mean   : 435   Mean   :27.21
 Qc3    : 7                                    3rd Qu.: 675   3rd Qu.:37.12
 Qc2    : 7                                    Max.   :1000   Max.   :45.50
 (Other):42

The behaviour of summary() should be clear for these common data types and structures. A numeric vector is summarised by six numbers: mean, median, quartiles, minimum and maximum. With a factor, summary() returns the number of times each level occurs. For a matrix, each column is treated as a variable and summarised accordingly. With a data frame, each column (variable) is summarised as appropriate for its data type.

3.5 Row and Column Summaries

We saw in Section 3.2 that when the mean() function is called on a matrix, the result is the mean of all elements of the matrix. This may not always be what is desired. Specifically, you may be interested in the means of the columns or the means of the rows. The function colMeans() computes the means of the columns of a matrix and rowMeans() computes the row means. The related functions colSums() and rowSums() compute the sums of the columns and rows.

> colMeans(X)

[1] 108.99076 94.90339 98.19485 95.97662

> rowMeans(X)

 [1] 101.40302  99.58874 113.76829  82.16167  96.91243 104.07936  96.09814
 [8]  96.61859  94.81632 113.17544  96.46455  99.11033
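The four functions are easiest to follow on a matrix small enough to check by hand. The matrix m below is made up for this sketch; matrix() fills it column by column, so the columns are (1, 2), (3, 4) and (5, 6).

```r
# A 2 x 3 matrix filled column-wise with the integers 1 to 6
m <- matrix(1:6, nrow = 2)
m

colSums(m)    # 3 7 11
rowSums(m)    # 9 12
colMeans(m)   # 1.5 3.5 5.5
rowMeans(m)   # 3 4
```

All four functions also accept na.rm = TRUE to drop missing values before summing or averaging.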

3.6 Grouped Summaries

A common need in biostatistical analysis is to summarise variables in a data frame grouped according to one or more other variables in the data frame. We will consider the functions by() and aggregate() in this section.

The usual use of by() is to compute descriptive statistics for each variable in a data frame, grouped by one or more variables.

> by(CO2, CO2$Treatment, summary)


CO2$Treatment: nonchilled
     Plant            Type         Treatment       conc          uptake
 Qn1    :7   Quebec     :21   nonchilled:42   Min.   :  95   Min.   :10.60
 Qn2    :7   Mississippi:21   chilled   : 0   1st Qu.: 175   1st Qu.:26.48
 Qn3    :7                                    Median : 350   Median :31.30
 Mn3    :7                                    Mean   : 435   Mean   :30.64
 Mn2    :7                                    3rd Qu.: 675   3rd Qu.:38.70
 Mn1    :7                                    Max.   :1000   Max.   :45.50
 (Other):0
------------------------------------------------------------
CO2$Treatment: chilled
     Plant            Type         Treatment       conc          uptake
 Qc1    :7   Quebec     :21   nonchilled: 0   Min.   :  95   Min.   : 7.70
 Qc3    :7   Mississippi:21   chilled   :42   1st Qu.: 175   1st Qu.:14.53
 Qc2    :7                                    Median : 350   Median :19.70
 Mc2    :7                                    Mean   : 435   Mean   :23.78
 Mc3    :7                                    3rd Qu.: 675   3rd Qu.:34.90
 Mc1    :7                                    Max.   :1000   Max.   :42.40
 (Other):0

> by(CO2, list(CO2$Treatment, CO2$Type), summary)

: nonchilled
: Quebec
     Plant            Type         Treatment       conc          uptake
 Qn1    :7   Quebec     :21   nonchilled:21   Min.   :  95   Min.   :13.60
 Qn2    :7   Mississippi: 0   chilled   : 0   1st Qu.: 175   1st Qu.:32.40
 Qn3    :7                                    Median : 350   Median :39.20
 Qc1    :0                                    Mean   : 435   Mean   :35.33
 Qc3    :0                                    3rd Qu.: 675   3rd Qu.:41.80
 Qc2    :0                                    Max.   :1000   Max.   :45.50
 (Other):0
------------------------------------------------------------
: chilled
: Quebec
     Plant            Type         Treatment       conc          uptake
 Qc1    :7   Quebec     :21   nonchilled: 0   Min.   :  95   Min.   : 9.30
 Qc3    :7   Mississippi: 0   chilled   :21   1st Qu.: 175   1st Qu.:27.30
 Qc2    :7                                    Median : 350   Median :35.00
 Qn1    :0                                    Mean   : 435   Mean   :31.75
 Qn2    :0                                    3rd Qu.: 675   3rd Qu.:38.70
 Qn3    :0                                    Max.   :1000   Max.   :42.40
 (Other):0
------------------------------------------------------------
: nonchilled
: Mississippi
     Plant            Type         Treatment       conc          uptake
 Mn3    :7   Quebec     : 0   nonchilled:21   Min.   :  95   Min.   :10.60
 Mn2    :7   Mississippi:21   chilled   : 0   1st Qu.: 175   1st Qu.:22.00
 Mn1    :7                                    Median : 350   Median :28.10
 Qn1    :0                                    Mean   : 435   Mean   :25.95
 Qn2    :0                                    3rd Qu.: 675   3rd Qu.:31.10
 Qn3    :0                                    Max.   :1000   Max.   :35.50
 (Other):0
------------------------------------------------------------
: chilled
: Mississippi
     Plant            Type         Treatment       conc          uptake
 Mc2    :7   Quebec     : 0   nonchilled: 0   Min.   :  95   Min.   : 7.70
 Mc3    :7   Mississippi:21   chilled   :21   1st Qu.: 175   1st Qu.:12.50
 Mc1    :7                                    Median : 350   Median :17.90
 Qn1    :0                                    Mean   : 435   Mean   :15.81
 Qn2    :0                                    3rd Qu.: 675   3rd Qu.:18.90
 Qn3    :0                                    Max.   :1000   Max.   :22.20
 (Other):0

The summary() function is one of the more useful functions to use with by(). The help file shows an example of fitting a linear regression model to data frame subsets using by().

The by() function is well suited to obtaining “quick-and-dirty” summaries. However, it is not immediately obvious (especially for the new user) how to extract results from the output of by() for use in additional analyses. The aggregate() function may be of use. The result of aggregate() is (usually) a data frame containing the grouping variables and a variable containing the summary. The data frame only contains the unique combinations of the grouping variables.

> aggregate(CO2$uptake, list(CO2$Treatment), mean)

     Group.1        x
1 nonchilled 30.64286
2    chilled 23.78333

> aggregate(CO2$uptake, list(CO2$Treatment, CO2$Type), mean)

     Group.1     Group.2        x
1 nonchilled      Quebec 35.33333
2    chilled      Quebec 31.75238
3 nonchilled Mississippi 25.95238
4    chilled Mississippi 15.81429

Note the much more compact result. Since the result is a data frame, it can be saved to an object and have its variables re-named appropriately.

> df.aggr <- aggregate(CO2$uptake, list(CO2$Treatment, CO2$Type),

+ mean)

> names(df.aggr) <- c("Treatment", "Type", "meanuptake")

> df.aggr

   Treatment        Type meanuptake
1 nonchilled      Quebec   35.33333
2    chilled      Quebec   31.75238
3 nonchilled Mississippi   25.95238
4    chilled Mississippi   15.81429

This can then be merged with the original data.

> co2.1 <- merge(CO2, df.aggr)

> set.seed(471)

> co2.1[sample(1:nrow(CO2), 10), ]

          Type  Treatment Plant conc uptake meanuptake
1  Mississippi    chilled   Mc1   95   10.5   15.81429
11 Mississippi    chilled   Mc2  350   13.0   15.81429
60      Quebec    chilled   Qc3  350   34.0   31.75238
67      Quebec nonchilled   Qn1  350   37.2   35.33333
46      Quebec    chilled   Qc1  350   34.6   31.75238
27 Mississippi nonchilled   Mn1  675   32.4   25.95238
16 Mississippi    chilled   Mc3  175   18.0   15.81429
6  Mississippi    chilled   Mc1  675   22.2   15.81429
59      Quebec    chilled   Qc3  250   38.1   31.75238
19 Mississippi    chilled   Mc3  500   17.9   15.81429
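In more recent versions of R, aggregate() also has a formula interface, which is not used in the examples above. As a sketch, the same grouped means can be obtained with the grouping variables named automatically, so only the summary column needs renaming (agg and co2.m are illustrative object names).

```r
# Formula interface: summarised variable on the left, grouping variables on the right
agg <- aggregate(uptake ~ Treatment + Type, data = CO2, FUN = mean)

# The grouping columns keep their names; rename only the summary column
names(agg)[3] <- "meanuptake"
agg

# Merge the group means back onto the original data
co2.m <- merge(CO2, agg)
nrow(co2.m)
```

Because Treatment and Type are common to both data frames, merge() joins on them without a by argument.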

3.7 The apply Family

If you have looked at the help page for by(), you will have noticed that the description reads:

Function by is an object-oriented wrapper for tapply applied to data frames.

This description, although accurate, may seem unhelpful. However, it points to but one of a number of functions designed to process indexable structures without the use of explicit (for) loops.

Historically, loops were known to be very inefficient in the similar languages S and S-Plus. The apply family of functions was provided to do looping things without explicitly doing looping things. Although the looping inefficiencies have been reduced substantially (they were less of a problem in R anyway), these functions are still the preferred way to do computations along an index of an object. Table 3.2 lists a number of the apply functions available in R.

3.7.1 The apply Function

To understand how these functions are used, look at the arguments to apply().

> args(apply)

function (X, MARGIN, FUN, ...)
NULL


Table 3.2: Functions in the apply family

Function   Description
apply()    Apply a function to the margins of an array.
lapply()   Apply a function to each element of a vector or list. Always returns a list.
sapply()   Like lapply() only it returns a vector or matrix where possible.
tapply()   Apply a function to a vector grouped by the values of other variables.
mapply()   A multivariate version of sapply(). This function will not be discussed further; see its help page for details.
rapply()   A recursive version of lapply(). This function will also not be discussed further, so you are referred to the help page.

The first argument, X, is an array, typically a matrix. The second argument, MARGIN, gives the margin of the matrix to operate on. The third argument, FUN, is the name of a function to use. The final argument, ..., requires some explanation. The ... argument occurs in many functions and is a mechanism to permit the passing of arbitrary arguments that are not known until the function is called. In the case of apply(), the ... allows the user to pass arguments to the function that is to be used. For example, if FUN = mean was used, the argument na.rm = TRUE could also be included so that mean() would drop missing values.
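As a small sketch of how ... is forwarded, the following uses a made-up matrix M with one missing value. M, m1 and m2 are illustrative names and are not the objects used elsewhere in this paper.

```r
# A small made-up matrix with one missing value (hypothetical data,
# not the matrix X used elsewhere in this paper).
M <- matrix(c(1, 2, NA, 4, 5, 6), nrow = 3)

# Without na.rm, the mean of the column containing NA is itself NA.
m1 <- apply(M, 2, mean)

# na.rm = TRUE is not an argument of apply(); it is passed through ...
# to mean(), which then drops the missing value.
m2 <- apply(M, 2, mean, na.rm = TRUE)

m1
m2
```

Any other argument of the function named in FUN can be supplied in the same way.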

To see how apply() works, examine the following example.

> apply(X, 2, mean)

[1] 108.99076 94.90339 98.19485 95.97662

> colMeans(X)

[1] 108.99076 94.90339 98.19485 95.97662

> apply(X, 1, mean)

[1] 101.40302  99.58874 113.76829  82.16167  96.91243 104.07936  96.09814
[8]  96.61859  94.81632 113.17544  96.46455  99.11033

> rowMeans(X)

[1] 101.40302  99.58874 113.76829  82.16167  96.91243 104.07936  96.09814
[8]  96.61859  94.81632 113.17544  96.46455  99.11033

So, MARGIN = 2 gives column summaries and MARGIN = 1 gives row summaries. Note that if you want row or column means, the aptly named functions already introduced for this purpose are preferred. However, apply() provides a way to compute other summaries by row or column. Notice that we did not explicitly name the arguments. If arguments are supplied in the order the function expects, they do not need to be named. Some more examples follow.

> apply(X, 2, median)


[1] 111.65406 95.29456 101.90730 95.72097

> apply(X, 2, var)

[1] 192.4198 301.3294 217.9589 145.3636

> apply(X, 2, sd)

[1] 13.87155 17.35884 14.76343 12.05668

> apply(X, 2, quantile)

          [,1]      [,2]      [,3]      [,4]
0%    86.98315  66.95244  64.08115  75.72850
25%   96.22166  82.68595  92.55408  89.61411
50%  111.65406  95.29456 101.90730  95.72097
75%  119.04968 106.44987 107.76851 103.69611
100% 133.15817 124.51856 112.96592 117.59759

> apply(X, 2, quantile, probs = seq(0, 1, 0.1))

          [,1]      [,2]      [,3]      [,4]
0%    86.98315  66.95244  64.08115  75.72850
10%   93.82014  74.74421  81.14844  81.54688
20%   95.92492  77.94783  88.82733  86.17385
30%   98.31252  86.10438  95.47661  91.40980
40%  106.06927  90.82827  98.78259  93.18732
50%  111.65406  95.29456 101.90730  95.72097
60%  113.85386 100.81566 105.22969  98.24279
70%  117.39669 104.50131 106.07789 101.94325
80%  120.22854 109.59619 111.22665 105.22162
90%  121.71307 112.80028 112.48553 109.10755
100% 133.15817 124.51856 112.96592 117.59759

The final example shows the use of the ... argument. Here the probs argument is passed through to the quantile function to request deciles.

3.7.2 The lapply and sapply Functions

Since these functions work with lists, make a list from the columns of X.

> Xlst <- list(x1 = X[, 1], x2 = X[, 2], x3 = X[, 3], x4 = X[,

+ 4])

> Xlst

$x1
[1] 121.83039  96.35655 133.15817  86.98315  93.59827 102.87646 110.85849
[8]  95.81701 114.79001 120.65722 118.51383 112.44962

$x2
[1]  93.48751 105.01972 113.02917  74.57969  97.10161 103.29169  84.83963
[8] 110.74031  89.05545 124.51856  76.22488  66.95244

$x3
[1]  98.34321 105.80085 112.48569  64.08115  87.47215 104.37294 112.96592
[8]  80.44580  94.24806 112.48416 106.19662  99.44166

$x4
[1]  91.95100  91.17786  96.40013 103.00269 109.47768 105.77635  75.72850
[8]  99.47122  81.17177  95.04181  84.92285 117.59759

The following example illustrates the difference between these two functions.

> lapply(Xlst, mean)

$x1
[1] 108.9908

$x2
[1] 94.90339

$x3
[1] 98.19485

$x4
[1] 95.97662

> sapply(Xlst, mean)

       x1        x2        x3        x4
108.99076  94.90339  98.19485  95.97662

The two functions performed the same operation (the mean of each list element); however, the returned results differ. Clearly, lapply() gives a list, while sapply() simplified the result to a vector. We look at a more complicated result in the next example.

> lapply(Xlst, quantile)

$x1
       0%       25%       50%       75%      100%
 86.98315  96.22166 111.65406 119.04968 133.15817

$x2
       0%       25%       50%       75%      100%
 66.95244  82.68595  95.29456 106.44987 124.51856

$x3
       0%       25%       50%       75%      100%
 64.08115  92.55408 101.90730 107.76851 112.96592

$x4
       0%       25%       50%       75%      100%
 75.72850  89.61411  95.72097 103.69611 117.59759

> sapply(Xlst, quantile)

            x1        x2        x3        x4
0%    86.98315  66.95244  64.08115  75.72850
25%   96.22166  82.68595  92.55408  89.61411
50%  111.65406  95.29456 101.90730  95.72097
75%  119.04968 106.44987 107.76851 103.69611
100% 133.15817 124.51856 112.96592 117.59759
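The difference in return types can also be checked directly. The following is a minimal sketch using a made-up list lst (an illustrative name, not an object from this paper). It also shows that sapply() falls back to a list when the individual results cannot be combined into a vector or matrix.

```r
# Hypothetical list of three short numeric vectors.
lst <- list(a = 1:5, b = c(2, 4, 6, 8), c = c(10, 20, 30))

res.l <- lapply(lst, range)  # always a list, one element per input
res.s <- sapply(lst, range)  # results of equal length: simplified to a matrix

is.list(res.l)
is.matrix(res.s)
dim(res.s)

# When the results have different lengths, sapply() cannot simplify
# and returns a list, just like lapply().
res.u <- sapply(lst, function(v) v[v > 3])
is.list(res.u)
```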

3.7.3 The tapply Function

It was noted at the beginning of this section that the tapply() function is the workhorse behind by(). As previously mentioned, by() is well suited for data frames, especially with the summary() function, although other things are possible. The tapply() function is more of a “low-level” function to compute grouped statistics. We’ll start by looking at the arguments.

> args(tapply)

function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
NULL

There are similarities to the apply() function. In this case, X is, typically, a vector; INDEX is a list of factors, each the same length as X, that define the unique groupings for X; and FUN is the function to apply to the groupings of X.

> tapply(CO2$uptake, CO2$Treatment, mean)

nonchilled    chilled
  30.64286   23.78333

> with(CO2, tapply(uptake, list(Treatment, Type), mean))

             Quebec Mississippi
nonchilled 35.33333    25.95238
chilled    31.75238    15.81429

When only a single factor is used for grouping, it does not need to be enclosed in list(). This example also introduces a very nice function in R called with(). Code can get quite tedious (and difficult to read) with a lot of data frame variable references (i.e. the $ symbol). The first argument of with() is a data frame and the second is an R expression that is evaluated using the variables of that data frame. Note the absence of $ symbols in the second command.
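To confirm that with() changes only how the variables are referenced and not the result, the two styles can be compared on the CO2 data. This is a small sketch; t1 and t2 are illustrative names.

```r
# CO2 is a data set shipped with R (package datasets).
t1 <- tapply(CO2$uptake, list(CO2$Treatment, CO2$Type), mean)
t2 <- with(CO2, tapply(uptake, list(Treatment, Type), mean))

# Both forms give the same 2 x 2 table of group means.
all.equal(t1, t2)
```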


Chapter 4

The Formula

Many functions in R, especially those related to statistical models, accept a formula as input. A formula is a concise way to describe a model¹. We already encountered these in Section 3.3 with the xtabs() and ftable() functions. This chapter gives a brief overview of formulae in R since they will be used more frequently in subsequent chapters. The general form of a formula in R is

response ~ model specification

The key feature that identifies a formula is the ~ character. The left hand side is usually the response or outcome variable, sometimes called the dependent variable. In some cases, as seen with xtabs(), this may be omitted. The right hand side contains one or more “variables” separated by operators. The most efficient way to describe the common operators is to quote from ?formula.

The models fit by, e.g., the lm and glm functions are specified in a compact symbolic form. The ~ operator is basic in the formation of such models. An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model. Such a model consists of a series of terms separated by + operators. The terms themselves consist of variable and factor names separated by : operators. Such a term is interpreted as the interaction of all the variables and factors appearing in the term.

In addition to + and :, a number of other operators are useful in model formulae. The * operator denotes factor crossing: a*b interpreted as a+b+a:b. The ^ operator indicates crossing to the specified degree. For example (a+b+c)^2 is identical to (a+b+c)*(a+b+c) which in turn expands to a formula containing the main effects for a, b and c together with their second-order interactions. The %in% operator indicates that the terms on its left are nested within those on the right. For example a + b %in% a expands to the formula a + a:b. The - operator removes the specified terms, so that (a+b+c)^2 - a:b is identical to a + b + c + b:c + a:c. It can also [be] used to remove the intercept term: y ~ x - 1 is a line through the origin. A model with no intercept can be also specified as y ~ x + 0 or y ~ 0 + x.
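The expansions described in the quotation can be inspected without fitting a model. The sketch below assumes only the base function terms(); the helper expand() and the names y, a, b, c and x are made up for illustration.

```r
# terms() expands a formula; its "term.labels" attribute lists the
# resulting terms. The variables named in the formula are placeholders
# and do not need to exist for the expansion to be computed.
# expand() is a hypothetical helper defined here for convenience.
expand <- function(f) attr(terms(f), "term.labels")

expand(y ~ a * b)                 # crossing of a and b
expand(y ~ (a + b + c)^2)         # crossing to the second degree
expand(y ~ (a + b + c)^2 - a:b)   # the same with a:b removed

# The "intercept" attribute is 0 when the intercept has been removed.
attr(terms(y ~ x - 1), "intercept")
```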

There is also a | operator that is used to denote a grouping or conditioning variable.

Since operators such as + and * don’t have their usual meaning in a formula, there needs to be a way to use them to, say, rescale a covariate in a formula. For example, in the study of dialysis, there are various measures that describe the adequacy of the dialysis. One such measure is called Kt/V. It is unitless and clinically meaningful differences on this measure are of the order of 0.1. Consequently, in a regression model, you would not be interested in the effect on the outcome of a Kt/V difference of 1 but rather of 0.1. There are two options: rescale the fitted coefficient or rescale at the time of fitting (i.e. fit Kt/V × 10). If Kt/V is in the variable ktv, specifying ktv * 10 in the model formula would not work because * is not treated as multiplication. There is a function I() that is used in formulae to do this kind of thing. So, specifying I(ktv * 10) in the formula would have the desired effect. In general, any arithmetic in a formula that uses these operators should be wrapped in I(). An alternative would be to create the rescaled variable in the data frame, but then you have to remember what you did when interpreting or working with the fitted model.

¹ We allow a rather wide definition of model here.
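The two rescaling options can be compared on simulated data. This is only a sketch: the data frame d, the variable outcome and the simulated values are invented for illustration and do not come from a dialysis study.

```r
set.seed(1)
# Simulated data: ktv and outcome are hypothetical variables.
d <- data.frame(ktv = runif(50, 0.8, 2.0))
d$outcome <- 3 + 2 * d$ktv + rnorm(50)

fit.raw    <- lm(outcome ~ ktv, data = d)          # slope per unit of ktv
fit.scaled <- lm(outcome ~ I(ktv * 10), data = d)  # slope per 0.1 of ktv

# Rescaling the fitted coefficient afterwards gives the same number
# as rescaling inside the formula with I().
coef(fit.raw)[2] / 10
coef(fit.scaled)[2]
```

Only the scale of the slope changes; the fitted values of the two models are the same.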


Chapter 5

Graphics

5.1 Introduction

One of the strengths of R (and its cousin S-Plus) is graphics. With R, it is relatively easy to produce publication-quality statistical graphics. There are two main approaches to graphics in R: Traditional Graphics and Lattice Graphics¹. Traditional graphics provide the tools for creating “simple” graphs easily as well as the ability to customize nearly every aspect of a plot. Lattice graphics produce rather complicated plots (e.g. multiple levels of conditioning) in a rather straightforward manner; however, customization is more complicated. The ability to use both types of graphics is essential for any serious R user.

5.2 Traditional Graphics

It is common to require univariate and bivariate displays of data. Somewhat less frequently, multivariate displays are required, although the theory behind these is somewhat less developed. This section will illustrate how to produce common graphical displays for these purposes. The plot() function will be featured throughout. It is what is called a generic function, which means that it will do different things depending on the type of “data” it is given. In some cases there are other specially named functions that are alternatives to plot(). These will be pointed out as they occur.

The remainder of this section will be grouped by plot type. We will begin with scatterplots since they are familiar and they introduce the plot() function that is central to traditional graphics.

5.2.1 The Scatterplot

The basic function for creating scatterplots is the plot() function. As will be seen, the plot() function has many uses; for example, calling plot() on the result of a linear regression will produce a series of diagnostic plots. To demonstrate a couple of uses of plot(), we will use a data set installed with R. The data set is called cars and contains speeds (speed) and stopping distances (dist) of cars in the 1920s.

> cars[1:5, ]

  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16

¹ Trellis Graphics in S-Plus.

The original version of plot() took the x-axis variable as its first argument and the y-axis variable as its second argument.

> plot(cars$speed, cars$dist)

[Figure: scatterplot of cars$dist (y-axis) against cars$speed (x-axis).]

Current implementations of plot() in R accept a formula argument². The previous plot can be created as follows.

> plot(dist ~ speed, data = cars)

[Figure: scatterplot of dist (y-axis) against speed (x-axis) for the cars data.]

² To my knowledge, this is not available in S-Plus.

In this case, plot(cars) will have the same effect; however, the effect of plotting a data frame will vary depending on the contents of the data frame.

The pairs() function produces a scatterplot matrix, that is, a matrix of all n(n − 1) possible scatterplots for n variables.

Although technically not a scatterplot, plot() can also create plots of mathematical functions.

> plot(sin, -pi, pi)

[Figure: curve of sin(x) (y-axis) for x from −π to π (x-axis).]

As you can see, if the first argument is an R function, the second argument is the starting x and the third argument is the ending x.

5.2.2 Box-Whisker Plots

We will simulate some data to illustrate the box-whisker plot as well as some other plots available in R for assessing distributions. The vector x will contain 1000 normal variates with mean 100 and standard deviation 15, y will contain 1000 exponential variates with mean 5 and z will contain a mixture of normals with means 100 and 130 and standard deviation 15 (grp will be used later).

> set.seed(823)

> x <- rnorm(1000, 100, 15)

> y <- rexp(1000, 1/5)

> grp <- sample(c(0, 1), 1000, replace = TRUE)


> z <- rnorm(1000, 100 + 30 * grp, 15)

> grp <- factor(grp, levels = c(0, 1), labels = c("A", "B"))

Observe that the exponential distribution in R is parameterized by its rate parameter λ so that its mean is 1/λ.
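A quick simulation makes the parameterization concrete. This is a sketch with made-up object names (y.rate and y.wrong) and a larger sample than above so that the sample means are stable.

```r
set.seed(823)
# rexp() takes the rate (lambda), so a mean of 5 needs rate = 1/5.
y.rate <- rexp(10000, rate = 1/5)
mean(y.rate)    # should be close to 5

# Passing 5 as the rate instead gives a mean close to 1/5.
y.wrong <- rexp(10000, rate = 5)
mean(y.wrong)
```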

The boxplot() function produces box-whisker plots.

> boxplot(x)

[Figure: box-whisker plot of x.]

> boxplot(y)

[Figure: box-whisker plot of y.]

> boxplot(z)

[Figure: box-whisker plot of z.]

One task that box-whisker plots are very useful for is comparing some distributional features of a variable according to the values of one or more grouping variables (e.g. blood pressure according to drug treatment). This is quite easy to do for one grouping variable using the boxplot() function. For example, the following code produces box-whisker plots for z according to the value of grp.

> boxplot(z ~ grp)

[Figure: side-by-side box-whisker plots of z for groups A and B.]

As this example shows, by using a formula, boxplot() provides a convenient way to display side-by-side boxplots of one variable grouped by another. This is one example that can also be produced with the plot() function.

> plot(z ~ grp)

[Figure: side-by-side box-whisker plots of z (y-axis) by grp (x-axis), groups A and B.]

Lattice graphics offer even greater flexibility in this regard, in particular the ability to condition on more than one grouping variable or even on continuous variables.

5.2.3 Histograms and Density Plots

Although the box-whisker plot is a useful display, it can miss some distributional features such as multimodality. The histogram is another graphic for visualizing distributions that can capture such features.

> hist(x)

[Figure: "Histogram of x"; x-axis x, y-axis Frequency.]

> hist(y)

[Figure: "Histogram of y"; x-axis y, y-axis Frequency.]

> hist(z)

[Figure: "Histogram of z"; x-axis z, y-axis Frequency.]

Notice that the y-axis in each of these histograms displays frequency. The hist() function has a number of arguments, one of which is freq. The help page describes the default behaviour. There may be times when you want to plot the density rather than the frequency. Use freq = FALSE for this. A related function that is often useful in conjunction with hist() is density(), which computes a kernel density estimate.

> hist(z, freq = FALSE)

> lines(density(z))

[Figure: "Histogram of z" with the kernel density estimate overlaid; x-axis z, y-axis Density.]

This example introduces the function lines(), which adds lines to an existing plot. Notice that the density curve suggests some bimodality. We can change the number of bins in the histogram and the bandwidth of the density estimate.

> hist(z, freq = FALSE, breaks = 20)

> lines(density(z, adjust = 0.5))

[Figure: "Histogram of z" with more bins and the adjust = 0.5 density estimate overlaid; x-axis z, y-axis Density.]

There are a variety of bandwidth selector algorithms available. The adjust = 0.5 above simply says to use one-half of whatever the currently selected bandwidth is (see the help file for density() for more detail).
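The effect of adjust can be checked from the objects that density() returns, and the density scale of the histogram can be verified in the same way. The sketch below regenerates a bimodal sample under the made-up names g and zz so that it runs on its own.

```r
set.seed(823)
# Hypothetical bimodal sample, similar in spirit to z above.
g  <- sample(c(0, 1), 1000, replace = TRUE)
zz <- rnorm(1000, 100 + 30 * g, 15)

d1 <- density(zz)                 # bandwidth chosen by the default selector
d2 <- density(zz, adjust = 0.5)   # half of that bandwidth

d1$bw
d2$bw

# The density component is what hist() draws when freq = FALSE;
# the areas of its bars sum to 1, matching the scale of density().
h <- hist(zz, plot = FALSE)
sum(h$density * diff(h$breaks))
```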

5.2.4 Quantile-Quantile Plots

Quantile-quantile (QQ) plots are perhaps the best way to compare data to a distribution. The qqnorm() and qqline() functions are used to produce normal QQ plots and qqplot() is for arbitrary QQ plots.

> qqnorm(x)

> qqline(x)

[Figure: "Normal Q-Q Plot" of x; x-axis Theoretical Quantiles, y-axis Sample Quantiles.]

> qqnorm(y)

> qqline(y)

[Figure: "Normal Q-Q Plot" of y; x-axis Theoretical Quantiles, y-axis Sample Quantiles.]

> qqnorm(z)

> qqline(z)

[Figure: "Normal Q-Q Plot" of z; x-axis Theoretical Quantiles, y-axis Sample Quantiles.]

Naturally, y does not look very normal. In this case we know that the distribution is exponential with rate 1/5, so we can compare our data to that distribution as follows.

> qqplot(y, rexp(500, 1/5))

[Figure: QQ plot of y (x-axis) against rexp(500, 1/5) (y-axis).]

The Lattice package has additional tools for QQ plots.
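The exponential comparison above used a second random sample as the reference. One possible variation, sketched here under the assumption that the rate is known, is to use theoretical quantiles from qexp() instead, so that the reference carries no sampling noise. The names yy, theo and qq are illustrative.

```r
set.seed(823)
yy <- rexp(1000, 1/5)   # hypothetical sample, generated like y above

# Theoretical exponential quantiles at evenly spread probabilities,
# used in place of a second random sample.
theo <- qexp(ppoints(length(yy)), rate = 1/5)

qq <- qqplot(theo, yy, xlab = "Theoretical Quantiles",
             ylab = "Sample Quantiles")
abline(0, 1, lty = 2)   # reference line y = x
```

Points lying close to the dashed line indicate agreement with the assumed distribution.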

5.2.5 Dot Plots

Plotting categorical data is a challenge. Usually, the aim is to compare frequencies or relative frequencies of responses. The dot plot, credited to William Cleveland, is ideal for this purpose. In the first example, we will use grp created above. The dotchart() function is the traditional graphics function for this purpose. It requires a numeric vector or matrix to plot. This vector or matrix, therefore, should contain the summary desired. The table() function, introduced in Section 3.3, produces such a summary.

> table(grp)

grp
  A   B
492 508


> table(grp)/length(grp)

grp
    A     B
0.492 0.508
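Base R also provides prop.table() for turning a table of counts into relative frequencies. The following sketch uses a small made-up factor g rather than grp so that the values are easy to check by hand.

```r
# A small hypothetical factor, standing in for grp.
g <- factor(c("A", "B", "B", "A", "B"))

p1 <- table(g) / length(g)    # relative frequencies as computed above
p2 <- prop.table(table(g))    # the same values via prop.table()

p1
p2
```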

Then, dot plots of absolute and relative frequencies can be plotted as follows.

> dotchart(table(grp))

[Figure: dot chart of the counts in groups A and B.]

> dotchart(table(grp)/length(grp))

[Figure: dot chart of the relative frequencies of groups A and B.]

Naturally, the plots look the same except for the labelling of the horizontal axis.

The dot plot is one method that is effective at displaying multivariate data. To illustrate, consider the following data that are included with R.

> VADeaths

      Rural Male Rural Female Urban Male Urban Female
50-54       11.7          8.7       15.4          8.4
55-59       18.1         11.7       24.3         13.6
60-64       26.9         20.3       37.0         19.3
65-69       41.0         30.9       54.6         35.1
70-74       66.0         54.3       71.1         50.0

These are death rates per 1000 people in Virginia in 1940. The data are grouped by age category and a gender/location classification. Since this is a matrix, it can be passed to dotchart() with the following result.


> dotchart(VADeaths)

[Figure: dot chart of VADeaths, with the age groups 50-54 to 70-74 plotted within each of Rural Male, Rural Female, Urban Male and Urban Female; horizontal axis from 10 to 70.]

5.2.6 Customizing Plots

Before long, you will need to create a plot with slightly different characteristics than the default settings (e.g. plot titles and axis labels). R provides extensive customization facilities so that you can create any plot you desire.

A number of functions will be introduced in this section, some of them without an example where, it is hoped, the expected result is clear.

Titles and Axis Labels

Many of the high level plotting functions provide the arguments main, xlab and ylab, which correspond to the main title, the x-axis label and the y-axis label. Compare the following two calls to plot.


> plot(dist ~ speed, data = cars)

[Figure: scatterplot of dist against speed with the default axis labels speed and dist.]

> plot(dist ~ speed, data = cars, xlab = "Speed", ylab = "Stopping Distance",

+ main = "Speed versus Stopping Distance for Cars in the 1920s")

[Figure: the same scatterplot titled "Speed versus Stopping Distance for Cars in the 1920s", with x-axis label Speed and y-axis label Stopping Distance.]

Alternatively, one could give xlab = "" and ylab = "" to plot() and use the title() function afterward to label the axes.

Other Text

There is a text() function for placing text anywhere within a plot and an mtext() function for placing text in the margins of a plot. The usual call to text() is of the form text(x, y, labels) where x is a vector of x-coordinates, y is a vector of y-coordinates and labels is a vector of text strings to be plotted at the (x, y) pairs. There are other arguments to further customize how the text is plotted. See the help file for details.

The basic mtext() usage is mtext(text, side) where text is a string to plot and side is a number from 1–4 specifying the margin to plot in (1 is bottom, 2 is left, 3 is top and 4 is right).
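As a brief sketch of both functions, the following labels one point of the cars scatterplot and writes in two margins. The label strings are made up for illustration.

```r
plot(dist ~ speed, data = cars)

# Label the point with the longest stopping distance; pos = 2 places
# the text to the left of the given coordinates.
i <- which.max(cars$dist)
text(cars$speed[i], cars$dist[i], labels = "longest stop", pos = 2)

# Margin text: side 3 is the top margin, side 4 is the right margin.
mtext("Data: cars", side = 3)
mtext("right margin", side = 4)
```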


Adding Points or Lines to a Plot

Frequently, you may wish to add points (e.g. to plot points of males and females in two different colours) or lines (e.g. a regression line) to a plot.

The points() function is like plot() only it adds points to an existing plot. The lines() function is like points() only instead of individual points, the points are connected by lines. The abline() function adds a single straight line to a plot. Here is an example.

> plot(sin, -pi, pi)

> points(seq(-pi, pi, by = pi/4), sin(seq(-pi, pi, by = pi/4)),

+ col = "red")

> lines(seq(-pi, pi, by = pi/4), sin(seq(-pi, pi, by = pi/4)),

+ col = "blue")

> abline(h = 0, lty = 2)

> abline(v = 0, lty = 2)

> abline(0, 1/3, col = "green")

[Figure: curve of sin(x) with the added points, connecting lines and reference lines; x-axis x, y-axis sin(x).]

This example introduces a number of additional arguments available to many of the plotting functions to customize the result. The col option chooses a colour to plot with. The lty argument changes the style of line in a plot. The default line type (lty = 1) is a solid line. Other numeric values produce various styles of dotted or dashed lines. Two other commonly used arguments are pch to choose a different plotting symbol and lwd to choose a different line width. In the abline() function, the h argument says to draw a horizontal line and the v option is for a vertical line. If only two numbers are passed to abline() they are interpreted as the y-intercept and slope of a line.

The segments() function adds line segments and arrows() is similar except that arrow heads are added. The usage of these two commands is similar. The first four arguments are x0, y0, x1 and y1, where x0 and y0 are the coordinates to draw from and x1, y1 are the coordinates to draw to. Multiple, possibly unconnected, line segments (arrows) can be drawn by specifying vectors for the first four arguments. The ability to draw unconnected line segments is what distinguishes segments() from lines(). The arrows() function has additional arguments for customizing the arrow.
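A minimal sketch of both functions follows. The coordinates are arbitrary values chosen for illustration.

```r
# An empty plotting region to draw into (type = "n" plots no points).
plot(c(0, 4), c(0, 4), type = "n", xlab = "x", ylab = "y")

# Two unconnected segments, given as vectors of start and end points.
x0 <- c(0.5, 2.5); y0 <- c(0.5, 0.5)
x1 <- c(1.5, 3.5); y1 <- c(1.5, 1.5)
segments(x0, y0, x1, y1, col = "blue")

# One arrow from (1, 3) to (3, 3); length sets the size of the
# arrow head in inches.
arrows(1, 3, 3, 3, length = 0.1, col = "red")
```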


Customizing Axes

It is possible to customize the axes that R plots. When axes = FALSE is given to the plot function, no axes are drawn. The axis() function can be used to construct custom axes. The first argument is side, which is specified as in mtext(), the next is at, which specifies the location of the tick marks, and labels indicates how to label the tick marks. Here is an example.

> plot(dnorm, -3, 3, axes = FALSE, xlab = "x", ylab = "f(x)")

> axis(1, at = c(-3, -2, -1, 0, 1, 2, 3), labels = expression(mu -

+ 3 * sigma, mu - 2 * sigma, mu - sigma, mu, mu + sigma, mu +

+ 2 * sigma, mu + 3 * sigma))

> axis(2, at = c(0, 0.1, 0.2, 0.3, 0.4), labels = expression(0,

+ 0.1/sigma, 0.2/sigma, 0.3/sigma, 0.4/sigma))

> title(main = expression(bolditalic(N)(mu, sigma^2)))

[Figure: normal density curve titled N(μ, σ²); x-axis x with tick labels μ − 3σ to μ + 3σ, y-axis f(x) with tick labels 0 to 0.4/σ.]

This example also demonstrates the ability to label plots with Greek letters (LaTeX-style constructs are also possible). See ?plotmath for details and examples. The basic idea is that when certain text-plotting functions receive their text wrapped in expression(), it is interpreted in a special way. In the example, mu and sigma are converted into their Greek counterparts for the plot.

Notice that in most of the plots produced, the plotting region is surrounded by a box. When the axes = FALSE argument is used, this box, if desired, needs to be added manually with the box() function.
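As a small sketch of these pieces used together, the following draws a sine curve with hand-made axes, plotmath labels and a box added afterwards. The variable names x and labs are chosen here only for illustration.

```r
x <- seq(0, 2 * pi, length.out = 100)

# Suppress the default axes; axis labels use plotmath via expression()
plot(x, sin(x), type = "l", axes = FALSE,
     xlab = expression(theta), ylab = expression(sin(theta)))

# Custom x-axis: ticks at 0, pi and 2*pi, labelled with the Greek letter
labs <- expression(0, pi, 2 * pi)
axis(1, at = c(0, pi, 2 * pi), labels = labs)

# Custom y-axis with the default numeric labels at chosen locations
axis(2, at = c(-1, 0, 1))

# With axes = FALSE the surrounding box must be added by hand
box()
```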

Adding a Legend

The legend() function adds a legend to the current plot. The first two arguments are x and y, which position the legend. The third argument is legend (abbreviated to leg in the call below), which is a vector of text strings that make up the text of the legend. The majority of the remaining arguments that can be specified are the arguments used in the plotting functions to customize the plot (e.g. col, lty, etc.). Suppose a plot had two lines on it, one for each of two treatment groups, and that group A was a red solid line and group B was a blue dashed line (lty = 2). A call to legend() might look like this.

legend(1, 10, leg = c("Group A", "Group B"), col = c("red", "blue"), lty = 1:2)
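For a complete, runnable version of that scenario, here is a sketch with invented data for the two groups (time, grpA and grpB are hypothetical). It positions the legend with the keyword "topright" rather than with coordinates, which legend() also accepts.

```r
# Hypothetical responses over time for two treatment groups
time <- 0:10
grpA <- 10 - 0.5 * time
grpB <- 10 - 0.8 * time

plot(time, grpA, type = "l", col = "red", lty = 1,
     ylim = c(0, 10), xlab = "Time", ylab = "Response")
lines(time, grpB, col = "blue", lty = 2)

# col and lty are matched, in order, to the legend text
lg <- legend("topright", legend = c("Group A", "Group B"),
             col = c("red", "blue"), lty = 1:2)
```

The value returned by legend() (stored in lg above) describes where the legend was drawn, which can be handy when placing further annotations around it.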

Other Customizations

The par() function provides a mechanism to adjust a huge range of parameters that affect how plots are produced. You should read the help file many times to fully appreciate all that you can control. A commonly used parameter is mfrow, which takes a vector of the form c(nr, nc). This parameter splits the plotting region into a matrix of plotting regions with nr rows and nc columns. Multiple plots are then drawn row by row. An alternative parameter, mfcol, causes the plots to be drawn column by column.

> par(mfrow = c(2, 2))

> hist(x)

> hist(y)

> hist(z)

> hist(z, freq = FALSE)

> lines(density(z))

> par(mfrow = c(1, 1))

[Figure: a 2 × 2 array of plots: frequency histograms of x, y and z, and a density-scale histogram of z with the density estimate superimposed.]

The final command in this example resets the plotting region for a single plot.

The par() function also provides a way to determine the current graphical settings. This is done by enclosing a parameter name in quotes (e.g. par("mfrow") will return the current setting of that parameter). One particularly useful parameter to query is usr, which returns the plotting limits of the current plot. This can be useful for positioning annotations automatically.

> plot(sin, -pi, pi)

> par("usr")

[1] -3.39292 3.39292 -1.08000 1.08000

[Figure: plot of sin(x) against x for x from −π to π.]

The vector returned gives xmin, xmax, ymin, ymax.
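One possible way to use these limits is sketched below: a label is placed in the top-left corner of the plot without hard-coding any coordinates. The sketch also relies on the fact that par() returns the previous values of any settings it changes, so they can be restored afterwards; the names op and usr are illustrative.

```r
# par() returns the old settings, so they can be restored later
op <- par(mfrow = c(1, 2))

plot(sin, -pi, pi)
usr <- par("usr")                 # xmin, xmax, ymin, ymax of this plot

# Anchor the text at the top-left corner of the plotting region
text(usr[1], usr[4], " top left", adj = c(0, 1))

plot(cos, -pi, pi)

par(op)                           # back to a single plotting region
```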

5.3 Lattice Graphics

Lattice graphics extend the standard graphics capability³ by providing the ability to produce various kinds of plots, conditioned on one or more other variables. Each unique value or combination of the conditioning variables produces a plot, with the collection of plots being arranged in a matrix or lattice. Conditioning on continuous variables is also possible and will be described later.

The remainder of this section will largely parallel the traditional graphics description, with some additions for lattice-specific material.

³ Actually it is a complete re-implementation of the graphics capabilities, but that distinction is unimportant for this discussion.


5.3.1 Loading the Lattice Graphics Package

Unlike traditional graphics, which are available to the user by default, lattice graphics reside in a package that must be loaded before they can be used⁴. This is accomplished with the library() function as follows.

> library(lattice)

This command need be issued only once per session.

5.3.2 The Scatterplot

In lattice graphics, the function xyplot() is used to create scatterplots. To see some of its capabilities with simple examples, we will use some simulated data.

> set.seed(121)

> simdf <- data.frame(x1 = rnorm(500, 50, 5), x2 = rnorm(500, 80,

+ 5), x3 = sample(c(-1, 0, 1), 500, replace = TRUE), x4 = sample(c(0,

+ 1), 500, replace = TRUE))

> simdf$x3 <- factor(simdf$x3, levels = c(-1, 0, 1), labels = c("Low",

+ "Med", "High"))

> simdf$x4 <- factor(simdf$x4, levels = c(0, 1), labels = c("Group A",

+ "Group B"))

> err <- rnorm(500, 0, 20)

> simdf$y <- model.matrix(~x1 * x3, data = simdf) %*% c(100, -1,

+ -25, -50, 1, 2) + err

Now, let us use xyplot() to explore these data.

Simple scatterplots can be produced in a similar manner to plot() with a formula argument.

> print(xyplot(y ~ x1, data = simdf))

⁴ In S-Plus, the package is called trellis and is loaded by default. There are some differences between the two implementations.

[Figure: scatterplot of y against x1.]

> print(xyplot(y ~ x2, data = simdf))

[Figure: scatterplot of y against x2.]

NOTE: The use of print() is not necessary for interactive use; however, it is required for batch use to create saved graphics, as is done in producing this document.

In the simulated data, the variable x3 is categorical (i.e. a factor) and y has been created with a relationship to x1 that changes according to x3 (an interaction). Let us look at a couple of ways to visualise this.

> print(xyplot(y ~ x1 | x3, data = simdf))

[Figure: scatterplots of y against x1 in three panels, one for each level of x3 (Low, Med, High).]

In this call to xyplot(), the vertical bar indicates that the variables which follow are to be conditioned on, with each conditioning value resulting in a distinct panel. Note that the same axis scaling is used in all panels, which enables some visual comparisons. The values of the conditioning variable are shown in the strips above each panel.

Alternatively, you may want to plot these data in one panel but have the data corresponding to the values of x3 identified in some way. This is accomplished as follows.

> print(xyplot(y ~ x1, data = simdf, groups = x3))

[Figure: scatterplot of y against x1 in a single panel, with a different plotting symbol for each level of x3.]

The groups argument combines the “plots” for the different levels of the grouping variable and displays the level information using different plotting symbols.
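A grouped plot is easier to read with a key that maps symbols to levels. The sketch below contrasts conditioning with grouping on a small invented data frame (d, with columns x, y and g, is hypothetical) and uses the auto.key argument of the lattice high-level functions to request a key.

```r
library(lattice)

# Hypothetical data: a response that shifts with the level of g
set.seed(5)
d <- data.frame(x = rnorm(90, 50, 5),
                g = factor(rep(c("Low", "Med", "High"), 30),
                           levels = c("Low", "Med", "High")))
d$y <- 80 + 10 * as.integer(d$g) + rnorm(90, 0, 5)

# Conditioning: one panel per level of g
p_cond <- xyplot(y ~ x | g, data = d)

# Grouping: one panel, levels distinguished by symbol, with a key
p_grp <- xyplot(y ~ x, data = d, groups = g, auto.key = TRUE)

print(p_cond)
print(p_grp)
```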

It is also possible to condition on continuous variables in lattice plots; however, they must first be converted to a shingle. A shingle is a generalisation of the factor class for continuous data. That is, it is a categorization of a continuous variable into a number of, possibly overlapping, groups. The equal.count() function is a quick way to create a shingle. See ?shingle for more information.
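Before plotting, it may help to look at a shingle on its own. The sketch below builds one from invented data (x and sh are illustrative names) and inspects its intervals; the number and overlap arguments are passed on by equal.count() to control how many intervals are made and how much neighbouring intervals share.

```r
library(lattice)

set.seed(1)
x <- rnorm(200, 80, 5)            # hypothetical continuous variable

# Four intervals with roughly equal counts, neighbours sharing about 25%
sh <- equal.count(x, number = 4, overlap = 0.25)

# Each level is an interval (lower and upper endpoint)
print(levels(sh))
```

With the defaults, equal.count() produces six intervals with a larger overlap, which is what the next plot uses.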

> print(xyplot(y ~ x1 | equal.count(x2), data = simdf))

[Figure: scatterplots of y against x1 in six panels, one for each level of the shingle equal.count(x2).]

The darkened portion of each strip shows the proportion of the data range covered by that particular level of the shingle. You can condition on more than one variable by specifying the conditioning variables as an additive formula, as the next example shows.

> print(xyplot(y ~ x1 | x3 + equal.count(x2), data = simdf))

[Figure: scatterplots of y against x1, with one panel for each combination of x3 (Low, Med, High) and the levels of the shingle equal.count(x2).]

The overlapping of the shingles is very obvious in this plot. Conditioning can also be mixed with superposition, as shown by the next example.

> print(xyplot(y ~ x1 | equal.count(x2), data = simdf, groups = x3))

[Figure: scatterplots of y against x1 in six panels conditioned on equal.count(x2), with plotting symbols distinguishing the levels of x3.]

The plots shown so far suggest a potential group effect of x3. Let’s see how to enhance these lattice plots to “discover” other relationships. With lattice graphics, it is easy to superimpose a regression line, or even a smooth, with the type argument.

> print(xyplot(y ~ x1, data = simdf, type = c("p", "r", "smooth")))

[Figure: scatterplot of y against x1 with a regression line and a smooth superimposed.]

Observe that type takes a vector of character strings. In this example, p asks for the points, r asks for the regression line and smooth asks for a loess smooth. There appears to be no appreciable relationship between x1 and y. Now, suppose we also take x3 into account, as the following two plots do.

> print(xyplot(y ~ x1 | x3, data = simdf, type = c("p", "r")))

[Figure: scatterplots of y against x1 with regression lines, in three panels for x3 = Low, Med and High.]

> print(xyplot(y ~ x1, data = simdf, groups = x3, type = c("p",

+ "r")))

[Figure: scatterplot of y against x1 in a single panel, with a separate regression line for each level of x3.]

These plots suggest that the relationship between x1 and y depends on the value of x3 (i.e. an interaction)⁵.

We can do the same type of thing for x2.

> print(xyplot(y ~ x2, data = simdf, type = c("p", "r")))

⁵ Given how the data were generated, this is not too surprising.

[Figure: scatterplot of y against x2 with a regression line.]

> print(xyplot(y ~ x2, data = simdf, groups = x3, type = c("p",

+ "r")))

[Figure: scatterplot of y against x2 in a single panel, with a separate regression line for each level of x3.]

These plots confirm the lack of relationship⁶ between x2 and y; however, the group effect of x3 is clearly displayed in the second plot.

5.3.3 Box-Whisker Plots

The lattice function for box-whisker plots is bwplot(). To plot a single variable, the function is used as follows.

> print(bwplot(~y, data = simdf))

⁶ Again, expected due to how the data were generated.

[Figure: horizontal box-whisker plot of y.]

Note the slightly unusual form of the formula, in that the “outcome” is on the right-hand side. Boxplots with a grouping variable can be obtained in various ways.

> print(bwplot(~y | x3, data = simdf))

[Figure: box-whisker plots of y in three panels, one for each level of x3 (Low, Med, High).]

This gives a boxplot in each panel corresponding to the values of x3. It is usually more helpful if the boxplots for each level of a categorical variable are displayed side by side in one plot. This is achieved by the following commands.

> print(bwplot(x3 ~ y, data = simdf))

[Figure: horizontal box-whisker plots of y for each level of x3 (Low, Med, High) in a single panel.]

> print(bwplot(y ~ x3, data = simdf))

[Figure: vertical box-whisker plots of y for each level of x3 (Low, Med, High) in a single panel.]

Notice that swapping the sides of the formula for the continuous and categorical variables changes whether the boxplots are drawn vertically or horizontally. We could now condition on a third (or even more) variable.

> print(bwplot(x3 ~ y | equal.count(x1), data = simdf))

[Figure: horizontal box-whisker plots of y by x3, in six panels conditioned on equal.count(x1).]

> print(bwplot(y ~ x3 | equal.count(x1), data = simdf))

[Figure: vertical box-whisker plots of y by x3, in six panels conditioned on equal.count(x1).]

5.3.4 Histograms and Density Plots

The lattice function for histograms is histogram() and for density plots, it is densityplot(). Their use is illustrated with some examples.

> print(histogram(~y, data = simdf))

[Figure: histogram of y, with the vertical axis showing percent of total.]

> print(histogram(~y | x3, data = simdf))

[Figure: histograms of y (percent of total) in three panels, one for each level of x3 (Low, Med, High).]

There is a type argument which takes one of "percent", "density", or "count" with the obvious meanings.
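As a brief sketch of the type argument, the following draws the same invented data (the data frame d is hypothetical) on the count scale and on the density scale. Only the vertical axis changes; the bars have the same shape.

```r
library(lattice)

set.seed(2)
d <- data.frame(y = rnorm(300, 75, 25))   # hypothetical data

h_count <- histogram(~y, data = d, type = "count")    # bar height = frequency
h_dens  <- histogram(~y, data = d, type = "density")  # bar areas sum to 1

print(h_count)
print(h_dens)
```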

> print(densityplot(~y, data = simdf))

[Figure: density plot of y, with the individual data points marked along the horizontal axis.]

> print(densityplot(~y | x3, data = simdf))

[Figure: density plots of y in three panels, one for each level of x3 (Low, Med, High).]

A density can be superimposed on a histogram, but it is a bit more complicated.

> print(histogram(~y | x3, data = simdf, type = "density", panel = function(...) {

+ panel.histogram(...)

+ panel.densityplot(...)

+ }))

[Figure: density-scale histograms of y with density estimates superimposed, in three panels for x3 = Low, Med and High.]

This example introduces the panel argument present in all lattice high-level plotting routines. By providing a customized panel function, tremendous flexibility is possible. Notice that some of the density plots did not “fit” in the panel. There are ways to “fix” this.
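The same mechanism applies to the other high-level functions. As a sketch, here is a custom panel function for xyplot() on invented data (d and its columns are hypothetical). Within each panel it draws the points, then adds a least-squares line with panel.lmline() and a horizontal line at the panel's mean with panel.abline().

```r
library(lattice)

set.seed(4)
d <- data.frame(x = rnorm(150, 50, 5),
                g = factor(rep(c("Low", "Med", "High"), each = 50),
                           levels = c("Low", "Med", "High")))
d$y <- 100 - d$x + rnorm(150, 0, 10)

p <- xyplot(y ~ x | g, data = d, panel = function(x, y, ...) {
    panel.xyplot(x, y, ...)              # the default scatterplot
    panel.lmline(x, y, col = "red")      # regression line for this panel
    panel.abline(h = mean(y), lty = 2)   # mean of y within this panel
})
print(p)
```

The panel function is called once per panel and receives only that panel's data, which is why mean(y) differs from panel to panel.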

5.3.5 Quantile-Quantile Plots

There are two lattice functions for creating QQ plots. The qq() function is used to (visually) compare the quantiles of two vectors and qqmath() displays the quantiles of a vector against theoretical quantiles from a distribution. We will consider qqmath() first.

The default distribution for qqmath() is the normal distribution.

> print(qqmath(~x1, data = simdf))

[Figure: normal quantile-quantile plot of x1 (x1 against qnorm).]

To add a line (like qqline() in traditional graphics) we need the panel argument again.

> print(qqmath(~x1, data = simdf, panel = function(x, ...) {

+ panel.qqmathline(x, ...)

+ panel.qqmath(x, ...)

+ }))

[Figure: normal quantile-quantile plot of x1 (x1 against qnorm) with a reference line.]

Let’s examine the y variable now.

> print(qqmath(~y, data = simdf, panel = function(x, ...) {

+ panel.qqmathline(x, ...)

+ panel.qqmath(x, ...)

+ }))

[Figure: normal quantile-quantile plot of y (y against qnorm) with a reference line.]

Note some departure from normality. This is not surprising given how y was constructed. However, you would expect y given x1 and x3 to be normal. This can be examined as follows.

> print(qqmath(~y | x3 + equal.count(x1), data = simdf, panel = function(x,

+ ...) {

+ panel.qqmathline(x, ...)

+ panel.qqmath(x, ...)

+ }))

[Figure: normal quantile-quantile plots of y with reference lines, one panel for each combination of x3 (Low, Med, High) and the levels of the shingle equal.count(x1).]

We saw previously how to use qqplot() to plot against non-normal distributions, but we did not see how to add the reference line. With qqmath() you use the distribution argument.

> expeg <- rexp(1000, 1/5)

> print(qqmath(~expeg, distribution = qexp, panel = function(x,

+ ...) {

+ panel.qqmathline(x, ...)

+ panel.qqmath(x, ...)

+ }))

[Figure: quantile-quantile plot of expeg against exponential quantiles (qexp) with a reference line.]

In order to use qq(), you need a binary variable that defines the groups and a numeric variable having the data to be plotted.

> print(qq(x3 ~ y, data = simdf, subset = x3 != "High"))

[Figure: quantile-quantile plot comparing the quantiles of y in the Low and Med groups of x3.]

This example illustrates another useful argument, subset. This allows you to select a portion of a data frame to plot. In this case, all cases in which x3 is equal to “High” are excluded. If there were more than three levels to a categorical variable, the levels to plot would need to be selected differently. The next example shows how.

> print(qq(x3 ~ y, data = simdf, subset = x3 == "Med" | x3 == "High"))

[Figure: quantile-quantile plot comparing the quantiles of y in the Med and High groups of x3.]

This plots the quantiles of the “Med” and “High” groups.
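When several levels must be kept, a chain of == comparisons joined by | becomes unwieldy. The following is a minimal sketch of an equivalent selection using %in%. The data frame built here is only a hypothetical stand-in for simdf (its values are simulated for illustration), so the resulting plot will not match the one shown in this paper.

```r
library(lattice)

# Hypothetical stand-in for the simdf data frame used in the text.
set.seed(1)
simdf <- data.frame(
    x3 = factor(sample(c("Low", "Med", "High"), 150, replace = TRUE)),
    y  = rexp(150, rate = 1/50)
)

# Keep only the rows whose x3 level appears in the listed set.
p <- qq(x3 ~ y, data = simdf, subset = x3 %in% c("Med", "High"))
print(p)
```

For qq() the selected set must still contain exactly two levels. The same idiom is more useful with the other lattice functions, where any number of levels may be retained.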

5.3.6 Dot Plots

The dotplot() function is used in lattice to create dot plots. As with its counterpart in traditional graphics, some data aggregation is generally required.

> simdf1 <- with(simdf, aggregate(y, list(x3), mean))

> names(simdf1) <- c("x3", "ybar")

> print(dotplot(x3 ~ ybar, data = simdf1))

[Figure: dot plot of ybar (horizontal axis, 50 to 100) for each level of x3 (High, Low, Med).]
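The aggregation step can also be written with the formula interface of aggregate(), which some readers may find easier to read. This is only a sketch of an alternative, not the approach used in this paper, and the data frame below is a hypothetical stand-in for simdf with simulated values.

```r
library(lattice)

# Hypothetical stand-in for the simdf data frame used in the text.
set.seed(1)
simdf <- data.frame(
    x3 = factor(sample(c("Low", "Med", "High"), 150, replace = TRUE)),
    x4 = factor(sample(c("Group A", "Group B"), 150, replace = TRUE)),
    y  = rnorm(150, mean = 75, sd = 20)
)

# Formula interface: mean of y within each level of x3.
simdf1 <- aggregate(y ~ x3, data = simdf, FUN = mean)
names(simdf1) <- c("x3", "ybar")   # rename the aggregated column

p <- dotplot(x3 ~ ybar, data = simdf1)
print(p)
```

Adding a second factor to the right-hand side, as in aggregate(y ~ x3 + x4, data = simdf, FUN = mean), gives one row per combination of levels.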

> simdf2 <- with(simdf, aggregate(y, list(x3, x4), mean))

> names(simdf2) <- c("x3", "x4", "ybar")

> print(dotplot(x3 ~ ybar | x4, data = simdf2))

[Figure: dot plots of ybar (50 to 100) for each level of x3 (High, Low, Med), in separate panels for Group A and Group B.]

> print(dotplot(x3 ~ ybar, data = simdf2, groups = x4))

[Figure: dot plot of ybar (50 to 100) for each level of x3 (High, Low, Med) in a single panel, with the levels of x4 as groups.]

5.3.7 Customizing Plots

Clearly, the plots produced by the lattice package are inherently more complex than traditional graphics. This does not mean that complicated graphs cannot be produced by traditional graphics, but that substantially more programming is required to do what a single lattice function can do. As a result of the added complexity in lattice graphs, customizing, particularly “after-the-fact,” is a bit more complicated. Generally speaking, most customization can be done within the function call that produces the plot.

Titles and Axis Labels

The arguments main, xlab and ylab can be given to the high-level plotting functions with the expected results.
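A minimal sketch of these three arguments in use follows. The data frame is a hypothetical stand-in for simdf, and the title and label text are invented for illustration.

```r
library(lattice)

# Hypothetical stand-in for the simdf data frame used in the text.
set.seed(1)
simdf <- data.frame(x1 = rnorm(100, mean = 55, sd = 8))
simdf$y <- 2 * simdf$x1 - 40 + rnorm(100, sd = 15)

p <- xyplot(y ~ x1, data = simdf,
            main = "Response versus first covariate",  # title above the plot
            xlab = "x1 (hypothetical units)",          # horizontal axis label
            ylab = "y (hypothetical units)")           # vertical axis label
print(p)
```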


Adding Points, Lines or Other Text to a Plot

None of these can be readily added to a completed plot. Advanced use of panel functions is generally required for these operations.
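To give a flavour of what a panel function looks like, here is a small sketch, assuming a hypothetical stand-in for simdf. The custom panel function draws the usual points and then adds a least-squares line, a horizontal reference line and a text label to every panel. The particular additions are chosen only for illustration.

```r
library(lattice)

# Hypothetical stand-in for the simdf data frame used in the text.
set.seed(1)
simdf <- data.frame(
    x1 = rnorm(120, mean = 55, sd = 8),
    x3 = factor(sample(c("Low", "Med", "High"), 120, replace = TRUE))
)
simdf$y <- 2 * simdf$x1 - 40 + rnorm(120, sd = 15)

p <- xyplot(y ~ x1 | x3, data = simdf,
            panel = function(x, y, ...) {
                panel.xyplot(x, y, ...)                    # the usual points
                panel.lmline(x, y, lty = 2)                # least-squares line
                panel.abline(h = mean(y), col = "grey50")  # line at the panel mean of y
                panel.text(min(x), max(y),                 # label in the top-left corner
                           labels = paste("n =", length(x)),
                           adj = c(0, 1))
            })
print(p)
```

The panel function is called once per panel with only that panel's x and y values, which is why mean(y) and length(x) differ from panel to panel.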

Customizing Axes

The high-level plotting functions accept a scales argument. The argument should be a list with each element given in name=value form. It may also contain lists named x and y which give independent control over the axes. See the help file for xyplot() for the details.
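The sketch below shows one possible scales list, again using a hypothetical stand-in for simdf. The top-level element applies to both axes, while the x and y sub-lists apply to one axis each. The specific settings are arbitrary choices for illustration.

```r
library(lattice)

# Hypothetical stand-in for the simdf data frame used in the text.
set.seed(1)
simdf <- data.frame(
    x1 = rnorm(120, mean = 55, sd = 8),
    x3 = factor(sample(c("Low", "Med", "High"), 120, replace = TRUE))
)
simdf$y <- 2 * simdf$x1 - 40 + rnorm(120, sd = 15)

p <- xyplot(y ~ x1 | x3, data = simdf,
            scales = list(
                cex = 0.8,                            # tick label size, both axes
                x = list(tick.number = 4, rot = 45),  # fewer, rotated x tick labels
                y = list(relation = "free")           # separate y range per panel
            ))
print(p)
```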

Adding a Legend

In a number of the example plots where the groups argument was used, it would have been helpful to have a legend describing the symbols used. To accomplish this, it is often sufficient to use the auto.key argument.

> print(xyplot(y ~ x1 | equal.count(x2), data = simdf, groups = x3,

+ auto.key = TRUE))

[Figure: xyplot of y (0 to 150) against x1 (40 to 70) in six panels conditioned on equal.count(x2), with a legend identifying the Low, Med and High groups.]

More precise control is possible with the key argument. See the help file for xyplot() for usage details.
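As one hedged illustration of the key argument, the sketch below builds the legend with the lattice helper simpleKey() and places it to the right of the plot. The data frame is a hypothetical stand-in for simdf, and the placement and title are arbitrary choices.

```r
library(lattice)

# Hypothetical stand-in for the simdf data frame used in the text.
set.seed(1)
simdf <- data.frame(
    x1 = rnorm(120, mean = 55, sd = 8),
    x3 = factor(sample(c("Low", "Med", "High"), 120, replace = TRUE))
)
simdf$y <- 2 * simdf$x1 - 40 + rnorm(120, sd = 15)

# simpleKey() returns a list suitable for the key argument; extra
# named elements (here space and title) are passed along in that list.
my.key <- simpleKey(text = levels(simdf$x3), space = "right", title = "x3")

p <- xyplot(y ~ x1, data = simdf, groups = x3, key = my.key)
print(p)
```

Because simpleKey() takes its symbols from the current lattice settings, the legend matches the default symbols used for the groups. A key list written entirely by hand would have to repeat those settings explicitly.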

Other Customizations

We saw that the par() function was used with traditional graphics for customizing the graphics system. For lattice graphics, par() in general is not useful. Many useful customizations are handled by arguments to the high-level plotting functions. See ?xyplot for detailed descriptions. Two useful arguments not encountered so far are aspect and layout.

The aspect argument adjusts the aspect ratio of the panels. When the argument aspect = "xy" is given to xyplot(), a method called banking to 45 degrees is applied. This often aids in viewing trends.

The layout argument controls how panels are laid out on the page. The general form of this argument is layout = c(col, row, page) where col is the number of columns, row is the number of rows and page is the number of pages⁷.

⁷ Multipage plots are possible, but are most useful in hardcopy via postscript or pdf files, which are covered in the next chapter.
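The aspect and layout arguments can be combined in a single call. The following is a sketch only, with a hypothetical stand-in for simdf. It places three panels side by side in one row and banks each to 45 degrees.

```r
library(lattice)

# Hypothetical stand-in for the simdf data frame used in the text.
set.seed(1)
simdf <- data.frame(
    x1 = rnorm(120, mean = 55, sd = 8),
    x3 = factor(sample(c("Low", "Med", "High"), 120, replace = TRUE))
)
simdf$y <- 2 * simdf$x1 - 40 + rnorm(120, sd = 15)

p <- xyplot(y ~ x1 | x3, data = simdf,
            aspect = "xy",      # bank to 45 degrees
            layout = c(3, 1))   # 3 columns, 1 row; the page count may be omitted
print(p)
```

Changing the layout to c(1, 1, 3) would instead put one panel on each of three pages.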


The functions trellis.par.set() and trellis.par.get() are a pair of functions for setting and retrieving settings that give fine control over the lattice graphics.
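A brief sketch of how the pair might be used follows. It retrieves the settings for the symbols used with groups, changes them, draws a plot and restores the original values. The choice of the superpose.symbol setting, the temporary pdf file and the stand-in data frame are all assumptions made for this illustration.

```r
library(lattice)

# Hypothetical stand-in for the simdf data frame used in the text.
set.seed(1)
simdf <- data.frame(
    x1 = rnorm(120, mean = 55, sd = 8),
    x3 = factor(sample(c("Low", "Med", "High"), 120, replace = TRUE))
)
simdf$y <- 2 * simdf$x1 - 40 + rnorm(120, sd = 15)

# Settings are kept per device, so open one explicitly (a temporary file here).
trellis.device("pdf", file = tempfile(fileext = ".pdf"))

old <- trellis.par.get("superpose.symbol")   # current settings, a list
trellis.par.set("superpose.symbol",
                list(pch = c(1, 2, 3), col = "black"))
new <- trellis.par.get("superpose.symbol")   # modified settings

print(xyplot(y ~ x1, data = simdf, groups = x3))  # drawn with the new symbols

trellis.par.set("superpose.symbol", old)     # restore the original settings
dev.off()
```

Only the elements named in the list are changed. Other elements of superpose.symbol, such as cex, keep their previous values.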

5.4 Graphics Devices

When using R interactively, a call to a graphics function, whether traditional or lattice, opens an appropriate graphics display device. This is fine for interactive work, but eventually you will need to print or save a graph to share with colleagues or include in a manuscript of some kind.

R is able to produce graphic output in a variety of formats⁸. See ?Devices to determine which graphics devices are available. In order to save graphics output, one needs to start the desired graphics device before creating the plot. When the plot is completed, the device must be closed to save the results. Two commonly used devices are postscript() and pdf(), which produce postscript and pdf files, respectively. These two devices also accept multiple pages of graphical output such as might be generated from multiple calls to plot() or use of the layout argument in a lattice graph. For lattice graphics, the function trellis.device() is used to open a device. The function dev.off() closes the currently active device.

The basic approach is as follows.

> postscript(file="mygraph.ps")

> [graphics commands]

> dev.off()

For lattice, the call to postscript would look like trellis.device("postscript", file = "mygraph.ps"). When using lattice to create graphic files, remember to enclose the high-level plot command inside a call to print(), otherwise empty files result when running in batch (non-interactive) mode.
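Putting these pieces together, a complete run might look like the sketch below. It uses pdf() rather than postscript(), writes to a temporary directory and uses a hypothetical stand-in for simdf. The file name and location are assumptions for illustration only.

```r
library(lattice)

# Hypothetical stand-in for the simdf data frame used in the text.
set.seed(1)
simdf <- data.frame(
    x1 = rnorm(120, mean = 55, sd = 8),
    x3 = factor(sample(c("Low", "Med", "High"), 120, replace = TRUE))
)
simdf$y <- 2 * simdf$x1 - 40 + rnorm(120, sd = 15)

outfile <- file.path(tempdir(), "mygraph.pdf")  # hypothetical output file

pdf(file = outfile)                        # 1. open the device
print(xyplot(y ~ x1 | x3, data = simdf))   # 2. draw; print() is needed in batch mode
dev.off()                                  # 3. close the device to write the file

file.exists(outfile)                       # confirm that the file was written
```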

⁸ The particular formats available could vary from system to system.
