
ILAQH R Tutorial

Sam Clifford

December 21 2015

Introduction

This document outlines some of the more common uses of R for the visualisation and analysis of time series data at the International Laboratory for Air Quality and Health (http://ilaqh.qut.edu.au). There are already a great many tutorials online for learning the basics of R, e.g. how to use it as a calculator, how to make vectors, etc., so we’re going to jump straight in to working with realistic data.

The code is provided so that you can copy and paste it into R and run it yourself. While all output is provided, it’s only there to show you what you should be getting, and is not intended as a substitute for actually running the code. Data analysis isn’t learned by passively watching someone else do it, so make sure you attempt to run the code.

As this is a work in progress, additional sections will be written progressively. This is not meant to be a replacement for a statistics textbook; good resources include Diggle and Chetwynd’s “Statistics and Scientific Method”, for a general overview of modern data analysis, MacGillivray, Utts and Heckard’s “Mind on Statistics”, for a first year statistics course reference, and Dobson and Barnett’s “An Introduction to Generalized Linear Models”, for a more thorough grounding in statistical theory and the toolkit of the modern scientist.

Any questions, suggestions or requests can be directed to Dr Samuel Clifford ([email protected]).

Preparation

Installing R

Download and install the latest version of R (https://cran.r-project.org/) and RStudio (https://www.rstudio.com/products/rstudio/download/) from the internet. The reason we’re using RStudio rather than default R is that the default GUI for R dates to the 1990s and the script editor has almost no features.

Installing and loading packages within R

Once RStudio is installed and open, we will need to install some extra packages that extend the functionality of R.

install.packages(c("tidyr", "ggplot2",

"GGally", "scales", "openair",

"lubridate", "broom", "readr"))

library(tidyr)

library(ggplot2)


library(GGally)

library(scales)

library(openair)

library(lubridate)

library(broom)

library(readr)

The above code blocks will install and load the following libraries (and their dependencies):

• tidyr, which provides some extra functionality for dealing with data

• ggplot2, a powerful plotting library that uses a grammar of graphics (extensive help on ggplot2 can be found at http://docs.ggplot2.org/current/)

• GGally, a library that makes use of the functions in ggplot2 to provide extra visualisation tools

• scales, a library which allows us to change options for ggplot2’s axes

• openair, a collection of useful tools for analysing air quality data

• lubridate, functions to make manipulating time/date information much easier

• broom, a library for turning diagnostic information into data frames

• readr, a library that makes reading and writing data files much easier

If you see any error messages about missing packages, you’ll likely need to install them before proceeding with the remainder of the tutorial.
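If you’re not sure which packages are missing, a short check can tell you. The following is a minimal sketch, not part of the original tutorial: missing_packages, tutorial_pkgs and to_install are hypothetical names chosen for illustration, and the install step is left commented out so that nothing is downloaded until you decide to run it.

```r
# Hypothetical helper: return the packages in pkgs that are not installed yet
missing_packages <- function(pkgs) {
  pkgs[!pkgs %in% rownames(installed.packages())]
}

# The packages used in this tutorial
tutorial_pkgs <- c("tidyr", "ggplot2", "GGally", "scales",
                   "openair", "lubridate", "broom", "readr")

to_install <- missing_packages(tutorial_pkgs)
to_install

# Uncomment the next line to install whatever is missing
# install.packages(to_install)
```

An empty result (character(0)) means that everything is already installed.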

Loading the data

The openair library contains a dataset called mydata, which we will use to demonstrate the use of R. This data was collected from 1 January 1998 to 23 June 2005 at the Marylebone (London) air quality monitoring station.

We’ll use the data command to tell R to load the dataset.

data(mydata)

You should be able to click on the “Environment” tab in RStudio and see an object called mydata. We will now look at a few of the properties of mydata.

By running the class function on mydata, we can see what type of object it is.

class(mydata)

## [1] "data.frame"


The type data.frame is the most common object type for dealing with data in R. A data frame has one row per observation, and one column for each variable measured as part of that observation.
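To make the idea of rows and columns concrete, here is a minimal sketch that builds a tiny data frame by hand. The object toy and its values are made up for illustration and are not part of the openair data.

```r
# A hypothetical data frame: three hourly observations of three variables
toy <- data.frame(
  date = as.POSIXct(c("2015-12-21 00:00:00",
                      "2015-12-21 01:00:00",
                      "2015-12-21 02:00:00"), tz = "GMT"),
  ws   = c(1.2, 0.8, NA),    # wind speed, with one missing value
  pm10 = c(21, 25, 19)
)

nrow(toy)  # one row per observation
ncol(toy)  # one column per variable
str(toy)   # the type of each column
```

Each column holds a single type of value (a time stamp or a number here), while each row collects everything measured at one time.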

Each column in a data frame has a name

names(mydata)

## [1] "date" "ws" "wd" "nox" "no2"

## [6] "o3" "pm10" "so2" "co" "pm25"

and we can see how many rows and columns the data frame has

nrow(mydata)

## [1] 65533

ncol(mydata)

## [1] 10

We can use the head function to see the top (by default, 6) rows and tail to see the bottom rows

head(mydata)

## date ws wd nox no2 o3

## 1 1998-01-01 00:00:00 0.60 280 285 39 1

## 2 1998-01-01 01:00:00 2.16 230 NA NA NA

## 3 1998-01-01 02:00:00 2.76 190 NA NA 3

## 4 1998-01-01 03:00:00 2.16 170 493 52 3

## 5 1998-01-01 04:00:00 2.40 180 468 78 2

## 6 1998-01-01 05:00:00 3.00 190 264 42 0

## pm10 so2 co pm25

## 1 29 4.7225 3.3725 NA

## 2 37 NA NA NA

## 3 34 6.8300 9.6025 NA

## 4 35 7.6625 10.2175 NA

## 5 34 8.0700 8.9125 NA

## 6 16 5.5050 3.0525 NA

tail(mydata)

## date ws wd nox no2 o3

## 65528 2005-06-23 07:00:00 1.5 250 404 156 4

## 65529 2005-06-23 08:00:00 1.5 260 388 145 6

## 65530 2005-06-23 09:00:00 1.5 210 404 168 7

## 65531 2005-06-23 10:00:00 2.6 240 387 175 10

## 65532 2005-06-23 11:00:00 3.1 220 312 125 15

## 65533 2005-06-23 12:00:00 3.1 220 287 119 17

## pm10 so2 co pm25

## 65528 49 NA 1.810345 28

## 65529 48 NA 1.637931 26

## 65530 58 NA 1.293103 34

## 65531 55 NA 1.293103 34

## 65532 52 NA 1.293103 33

## 65533 55 NA 1.293103 35


As mydata is a dataset that has come from within R, we can use the help function to find out more about it

help(mydata)

We can ask to see a specific column either by referring to it by name or by its position in the data frame

head(mydata$date)

## [1] "1998-01-01 00:00:00 GMT"

## [2] "1998-01-01 01:00:00 GMT"

## [3] "1998-01-01 02:00:00 GMT"

## [4] "1998-01-01 03:00:00 GMT"

## [5] "1998-01-01 04:00:00 GMT"

## [6] "1998-01-01 05:00:00 GMT"

mydata[1:6, 1] # first six rows, first column

## [1] "1998-01-01 00:00:00 GMT"

## [2] "1998-01-01 01:00:00 GMT"

## [3] "1998-01-01 02:00:00 GMT"

## [4] "1998-01-01 03:00:00 GMT"

## [5] "1998-01-01 04:00:00 GMT"

## [6] "1998-01-01 05:00:00 GMT"

We can also ask for multiple columns in this way

mydata[1:6, c(1, 7)]

## date pm10

## 1 1998-01-01 00:00:00 29

## 2 1998-01-01 01:00:00 37

## 3 1998-01-01 02:00:00 34

## 4 1998-01-01 03:00:00 35

## 5 1998-01-01 04:00:00 34

## 6 1998-01-01 05:00:00 16

R is case sensitive, so asking for mydata$Date will give a NULL answer as there is no column called Date in our mydata data frame.

The first column looks like it contains date information. We can check whether it’s actually a date stamp (and R will treat it as such) or whether it’s just stored as plain text

class(mydata$date)

## [1] "POSIXt" "POSIXct"

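POSIXct is the standard class that R uses to store a date and time (internally, as the number of seconds since the start of 1970), and POSIXt is the parent class that it shares with the related POSIXlt class. So when we go to run functions on mydata$date, R knows that there is time information there.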
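The difference between a real time stamp and plain text matters as soon as we do arithmetic. The following is a small sketch with made-up values; stamp_text, stamp_time and elapsed_hours are names chosen for illustration only.

```r
# Two time stamps written as plain text (hypothetical values)
stamp_text <- c("1998-01-01 00:00:00", "1998-01-01 05:00:00")
class(stamp_text)    # plain text

# Convert the text to date-times, stating the time zone explicitly
stamp_time <- as.POSIXct(stamp_text, tz = "GMT")
class(stamp_time)    # now a date-time class

# Time arithmetic only works once R knows these are times
elapsed_hours <- as.numeric(difftime(stamp_time[2], stamp_time[1],
                                     units = "hours"))
elapsed_hours
```

If your own data arrives with dates stored as text, a conversion like this is usually the first step before plotting or averaging.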


Visualisation of time series data

While we can look at data as text, it is incredibly boring to do so. Let’s use the ggplot2 library to visualise our data. We had a look at the date and pm10 variables above, so let’s plot them

ggplot(data = mydata, aes(x = date,

y = pm10)) + geom_line()

Figure 1: Time series of PM10 data

You should get a plot in the “Plots” pane in RStudio; you can click the “Zoom” button in there to pop open a new window that you can resize if the plot is difficult to read.

The code for the plot does the following

• ggplot(data=mydata), make a ggplot object with the variables stored in the data frame called mydata

• aes(x=date, y=pm10), our aesthetic options will be that date is on the x axis and pm10 on the y axis. This doesn’t tell us how to draw the data though, just where it goes

• geom_line(), draw our pairs of x and y values by connecting them with lines in order from smallest x to largest x
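Because each piece is added with +, we can store a partly built plot and reuse it. The sketch below illustrates the three steps above on a small made-up data frame; toy, p, p_line and p_both are hypothetical names and the numbers are not from mydata.

```r
library(ggplot2)

# Hypothetical toy data, not the tutorial's mydata
toy <- data.frame(x = 1:10,
                  y = c(3, 5, 4, 8, 7, 9, 12, 10, 14, 13))

# Data and aesthetics only: there is no geometry yet, so nothing is drawn
# inside the axes
p <- ggplot(data = toy, aes(x = x, y = y))

# The same base object can be drawn in different ways by adding layers
p_line <- p + geom_line()
p_both <- p + geom_line() + geom_point()

p_both
```

Typing the name of a plot object, as in the last line, is what sends it to the “Plots” pane.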

There are lots of small values of pm10 and a few big values, so let’s use a log scale on the y axis to help visualise them a little more neatly.

ggplot(data = mydata, aes(x = date, y = pm10)) + geom_line() +
    scale_y_log10()

Figure 2: Time series of PM10 data, with a logarithmic y axis

This plot looks a little better if we’re trying to understand what sort of distribution of values we see over time, as opposed to the actual values on the linear scale from before. We can add some tick marks on the left hand side y axis to help the viewer understand the logarithmic scale.

ggplot(data = mydata, aes(x = date, y = pm10)) + geom_line() +
    scale_y_log10() + annotation_logticks(sides = "l")

Figure 3: Time series of PM10 data, with a logarithmic y axis and tick marks

Perhaps our pm10 values are log-normally distributed. We can use a histogram to look at how the pm10 values are distributed. A histogram is a great way of visualising one variable, and the count will be on the y axis so we only need to specify what goes on the x axis

ggplot(data = mydata, aes(x = pm10)) +

geom_histogram()

Figure 4: Default histogram of PM10 values

The default behaviour of geom_histogram is to use 30 bins to display the data. This isn’t always a sensible choice, so let’s manually choose the binwidth of the histograms.

ggplot(data = mydata, aes(x = pm10)) +

geom_histogram(binwidth = 5)

Figure 5: Histogram of PM10 values with adjusted binwidths

Now we can see more of the features in our data, but a lot of the data is over on the left with a bunch of blank space on the right hand side above a pm10 value of about 100. We can adjust the x limits of the plot so that nearly all of the data is visible.

What’s the largest value of pm10 that we observed after omitting all the missing values with na.rm=T?

max(mydata$pm10, na.rm = T)

## [1] 801

That looks quite large. Let’s have a look at some other summary statistics such as the mean and median to see how far away that maximum value is from the bulk of the data

summary(mydata$pm10)

## Min. 1st Qu. Median Mean 3rd Qu.

## 1.00 22.00 31.00 34.38 44.00

## Max. NA's

## 801.00 2162

So we see that 75% of the data are below the quantity marked 3rd Qu.; what value are 99.9% of the data below? Let’s calculate some quantiles.

quantile(mydata$pm10, probs = 0.999,

na.rm = T)

## 99.9%

## 211.63
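To see what quantile is doing, it can help to try it on a handful of numbers where the answer is easy to check. This is a sketch on made-up readings; x and q are hypothetical names.

```r
# Hypothetical readings: ten values, one missing value and one extreme value
x <- c(12, 15, 18, 22, 25, 31, 38, 44, 60, NA, 250)

# Several quantiles at once; na.rm = TRUE drops the missing value first
q <- quantile(x, probs = c(0.25, 0.5, 0.75, 0.9), na.rm = TRUE)
q

# The 50% quantile is the median
median(x, na.rm = TRUE)

# The proportion of non-missing values at or below the 90% quantile
mean(x <= q["90%"], na.rm = TRUE)
```

The single extreme value drags the maximum far from the rest of the data but has little effect on the lower quantiles, which is why a high quantile can be a more useful plot limit than the maximum.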

This may be a reasonable maximum x limit for our plot, seeing as 99.9% of the data are below it. Let’s make a new variable called xmax and use it to change the limits of our plot

xmax <- quantile(mydata$pm10, probs = 0.999,

na.rm = T)

ggplot(data = mydata, aes(x = pm10)) +

geom_histogram(binwidth = 5) +

xlim(c(0, xmax))

Figure 6: Histogram of PM10 values with adjusted x limits

We can see the distribution of values a little more clearly now, as we’ve effectively zoomed in on most of the data by omitting the extremely large values from our plot.
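Setting limits with xlim really does omit the values outside the limits (ggplot2 warns about removed rows). If we would rather keep every value and only change the visible window, coord_cartesian is an alternative. The sketch below compares the two on simulated data; toy, p_drop and p_zoom are hypothetical names and the simulated values only stand in for pm10.

```r
library(ggplot2)

# Hypothetical skewed data standing in for pm10, with one extreme value
set.seed(1)
toy <- data.frame(pm10 = c(rlnorm(500, meanlog = 3.4, sdlog = 0.5), 800))
xmax <- quantile(toy$pm10, probs = 0.999)

# xlim() discards values outside the limits before the bins are counted
p_drop <- ggplot(data = toy, aes(x = pm10)) +
  geom_histogram(binwidth = 5) +
  xlim(c(0, xmax))

# coord_cartesian() keeps every value and only zooms the visible window
p_zoom <- ggplot(data = toy, aes(x = pm10)) +
  geom_histogram(binwidth = 5) +
  coord_cartesian(xlim = c(0, xmax))
```

For a histogram the two usually look almost identical; the distinction matters more once a plot includes summaries, such as smooth lines, that are calculated from the data.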

Now let’s change the colours of the bars in the histogram to make them a bit more visible

ggplot(data = mydata, aes(x = pm10)) +

geom_histogram(binwidth = 5,

fill = "white", color = "black") +

xlim(c(0, xmax))

Figure 7: Histogram of PM10 values with adjusted x limits and high contrast colours

The other main visualisation technique is the scatter plot, which plots each pair of x and y values as a point. Let’s look at how pm10 varies with the wind direction. Because we believe that pm10 is affected by wind direction (and not the other way around), we will put wind direction on the x axis.

ggplot(data = mydata, aes(x = wd, y = pm10)) + geom_point() +
    scale_y_log10()

Figure 8: Scatter plot of PM10 values against wind direction

Dealing with time information

It’s a bit difficult to tell what’s going on in the above graph, so let’s average the data to a daily frequency.

We can use the timeAverage function from the openair library to calculate daily means.

mydata.daily <- timeAverage(mydata,

avg.time = "day", vector.ws = T)

We also told timeAverage to do vector averaging of the wind data (ws, wd) because it’s in a polar coordinate system and taking averages of directions without wind speeds (or vice versa) is nonsense.
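To see why a plain average of directions is nonsense, consider two winds of equal speed blowing from 350 and 10 degrees. Their arithmetic mean direction is 180 degrees (due south), when the sensible answer is north. The sketch below shows one common way to do vector averaging: split each wind into two components, average the components and convert back. The function vector_mean_wind is a hypothetical helper written for this illustration, and it is not necessarily how timeAverage is implemented internally.

```r
# Hypothetical helper: vector average of wind speed (ws) and direction
# (wd, in degrees)
vector_mean_wind <- function(ws, wd) {
  ok  <- !is.na(ws) & !is.na(wd)      # use only complete pairs
  rad <- wd[ok] * pi / 180            # degrees to radians
  u   <- mean(ws[ok] * sin(rad))      # mean east-west component
  v   <- mean(ws[ok] * cos(rad))      # mean north-south component
  c(ws = sqrt(u^2 + v^2),
    wd = (atan2(u, v) * 180 / pi) %% 360)
}

# Two equally strong winds either side of north
northerly <- vector_mean_wind(ws = c(2, 2), wd = c(350, 10))
northerly

# The naive average of the two directions points the wrong way
mean(c(350, 10))
```

The vector mean direction comes out at north (possibly shown as a value a hair below 360 rather than exactly 0, because of rounding), and the mean speed is slightly less than 2 because the two winds are not perfectly aligned.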

Let’s take a look at what’s in our old data frame and in our new data frame.

head(mydata)

## date ws wd nox no2 o3

## 1 1998-01-01 00:00:00 0.60 280 285 39 1

## 2 1998-01-01 01:00:00 2.16 230 NA NA NA

## 3 1998-01-01 02:00:00 2.76 190 NA NA 3

## 4 1998-01-01 03:00:00 2.16 170 493 52 3

## 5 1998-01-01 04:00:00 2.40 180 468 78 2

## 6 1998-01-01 05:00:00 3.00 190 264 42 0

## pm10 so2 co pm25

## 1 29 4.7225 3.3725 NA

## 2 37 NA NA NA

## 3 34 6.8300 9.6025 NA

## 4 35 7.6625 10.2175 NA

## 5 34 8.0700 8.9125 NA

## 6 16 5.5050 3.0525 NA

head(mydata.daily)

## Source: local data frame [6 x 10]

##

## date ws wd nox

## (time) (dbl) (dbl) (dbl)

## 1 1998-01-01 6.506649 187.9224 153.9545


## 2 1998-01-02 5.611924 222.9729 131.7391

## 3 1998-01-03 10.103868 225.9726 120.1739

## 4 1998-01-04 10.202548 223.0735 105.3478

## 5 1998-01-05 4.973349 236.5812 174.9565

## 6 1998-01-06 3.522448 196.7795 214.1739

## Variables not shown: no2 (dbl), o3 (dbl),

## pm10 (dbl), so2 (dbl), co (dbl), pm25

## (dbl)

We can see that the date column in the mydata.daily data frame only has one row for each day of observation, and no hour information, whereas mydata will have 24 for each day, with the hour of the day included in the time stamp.

We’ll make the points semi-transparent by changing their alpha property and add a little bit of random noise to the wind direction values (but keep pm10 the same), as the wind directions are only recorded to the nearest 10 degrees.

ggplot(data = mydata.daily, aes(x = wd, y = pm10)) +
    geom_point(alpha = 0.25,
               position = position_jitter(width = 2.5, height = 0)) +
    scale_y_log10()

Figure 9: Second scatter plot of PM10 values against wind direction

Let’s make the x axis label a little more human friendly and add a smooth line of best fit that shows the overall trend. The geom_smooth line of best fit is a very useful way of picking out a trend when there’s an overwhelming amount of data.

ggplot(data = mydata.daily, aes(x = wd, y = pm10)) +
    geom_point(alpha = 0.25,
               position = position_jitter(width = 2.5, height = 0)) +
    scale_y_log10() + annotation_logticks(sides = "lr") +
    geom_smooth() + xlab("Wind direction (degrees)")

Figure 10: Jittered scatter plot with line of best fit

We may be interested in whether or not there’s any annual variation in pm10, but the time series plot we looked at before didn’t really allow us to see that. If we could make a subplot for each year, we may be able to see more detail.

To do that, we will use the year function from the lubridate library to extract the year information from the date column.

mydata.daily$year <- year(mydata.daily$date)
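lubridate has a whole family of functions like year for pulling pieces out of a time stamp. The following sketch tries a few of them on a single made-up time stamp; stamp is a hypothetical name.

```r
library(lubridate)

# A single hypothetical time stamp, parsed from text
stamp <- ymd_hms("2005-06-23 12:00:00", tz = "GMT")

year(stamp)                  # calendar year as a number
month(stamp)                 # month as a number from 1 to 12
hour(stamp)                  # hour of the day
floor_date(stamp, "month")   # rounded down to the start of the month
```

Each of these also works on a whole column at once, which is how year is applied to mydata.daily$date.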

We’ll now make our time series plot as before, but use Tufte’s idea of small multiples to make a plot for each year. This is achieved with facet_wrap from ggplot2.

ggplot(data = mydata.daily, aes(x = date, y = pm10)) + geom_line() +
    scale_y_log10() + facet_wrap(~year) +
    annotation_logticks(sides = "l")

Figure 11: PM10 values for each year

We can make this graph neater by putting our eight subplots in a grid of two rows and four columns, and by letting each subplot have its own x axis. We’ll keep a common y axis so we can make year to year comparisons. We’ll also tidy up the date stamps on the x axis to ensure that the dates are readable.

ggplot(data = mydata.daily, aes(x = date, y = pm10)) + geom_line() +
    scale_y_log10() + annotation_logticks(sides = "l") +
    facet_wrap(~year, scales = "free_x", ncol = 4) +
    scale_x_datetime(breaks = date_breaks("3 months"),
                     labels = date_format("%b")) +
    theme(axis.text.x = element_text(angle = 90))

Figure 12: PM10 values for each year with their own x axis

The date_breaks function tells R that on our x axis we want a tick mark every three months and date_format with the format %b forces the labels to be the three letter abbreviations of month names. Finally, we set a theme option to rotate the text on the x axis to ensure it’s legible.
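Codes such as %b are the same ones that base R’s format function understands, so we can experiment with them outside of a plot. This is a small sketch on one made-up time stamp (stamp is a hypothetical name); month names depend on your computer’s language settings, so the text you see for %b may differ.

```r
# A hypothetical time stamp
stamp <- as.POSIXct("2005-06-23 12:00:00", tz = "GMT")

format(stamp, "%b")               # abbreviated month name
format(stamp, "%Y-%m")            # four digit year and two digit month
format(stamp, "%d/%m/%Y %H:%M")   # day/month/year and 24 hour time
```

The full list of codes is in the help page for strptime, and any of them can be passed to date_format in the same way as %b.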

It’s difficult to tell if there’s much of an annual trend here. Rather than using timeAverage to get monthly averages, let’s generate month of the year as a new column in our data frame and make box plots of our daily averages.

mydata.daily$month <- factor(month(mydata.daily$date))

ggplot(data = mydata.daily, aes(x = month,

y = pm10)) + geom_boxplot() +

annotation_logticks(sides = "l") +

scale_y_log10() + facet_wrap(~year,

ncol = 2)

We need to convert month into a factor (categorical variable) because month of the year is a label rather than an amount, e.g. April is the 4th month, not 4 amounts of month.
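The difference between a number and a factor is easy to see on a few made-up values. In this sketch month_number and month_factor are hypothetical names, and month.abb is R’s built-in vector of English month abbreviations.

```r
# Hypothetical month numbers
month_number <- c(4, 1, 12, 4)

# As numbers, R will happily do arithmetic that has no real meaning
mean(month_number)

# As a factor, each value is a label drawn from a fixed set of categories
month_factor <- factor(month_number, levels = 1:12, labels = month.abb)
month_factor

# Counting observations per category is the natural summary of a factor
table(month_factor)
```

When ggplot2 is given a factor on the x axis it draws one box per category, which is what we want for monthly box plots.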

Without delving deeply into time series modelling we can probably say here that it doesn’t look like there’s a regular seasonal trend in this pm10 data.

Correlation

One way to look at the relationship between multiple time series is to calculate the correlation matrix. We can do this for our air quality data to see how correlated each of our pollutants is with the others.

Figure 13: Monthly boxplots of daily averages of PM10

Let’s take another look at what’s in our data frame of daily averages.

names(mydata.daily)

## [1] "date" "ws" "wd" "nox" "no2"

## [6] "o3" "pm10" "so2" "co" "pm25"

## [11] "year" "month"

We don’t want to calculate correlations for any wind information or time stamp data, so let’s just take columns 4 to 10 and store them in a new data frame.

indices <- 4:10

names(mydata.daily)[indices]

## [1] "nox" "no2" "o3" "pm10" "so2" "co"

## [7] "pm25"

mydata.aq <- mydata.daily[, indices]

head(mydata.aq)

## Source: local data frame [6 x 7]

##

## nox no2 o3 pm10

## (dbl) (dbl) (dbl) (dbl)

## 1 153.9545 39.36364 6.869565 18.16667

## 2 131.7391 39.47826 6.478261 27.75000

## 3 120.1739 37.95652 8.409091 20.16667

## 4 105.3478 35.26087 9.608696 20.95833

## 5 174.9565 46.04348 4.956522 24.20833

## 6 214.1739 45.30435 1.347826 34.62500

## Variables not shown: so2 (dbl), co (dbl),

## pm25 (dbl)

We can now use the cor function to calculate the correlation between all pairs of variables.

cor(mydata.aq)

## nox no2 o3 pm10 so2 co pm25

## nox 1 NA NA NA NA NA NA

## no2 NA 1 NA NA NA NA NA

## o3 NA NA 1 NA NA NA NA

## pm10 NA NA NA 1 NA NA NA

## so2 NA NA NA NA 1 NA NA

## co NA NA NA NA NA 1 NA

## pm25 NA NA NA NA NA NA 1

Due to some observations being missing, cor has returned NA values for the correlations. We will tell R to use all pairwise complete values and display the results rounded to 2 decimal places.

mycor <- cor(mydata.aq, use = "pairwise.complete.obs")

round(mycor, 2)

## nox no2 o3 pm10 so2 co

## nox 1.00 0.74 -0.67 0.66 0.67 0.84


## no2 0.74 1.00 -0.47 0.65 0.53 0.43

## o3 -0.67 -0.47 1.00 -0.41 -0.46 -0.55

## pm10 0.66 0.65 -0.41 1.00 0.49 0.48

## so2 0.67 0.53 -0.46 0.49 1.00 0.65

## co 0.84 0.43 -0.55 0.48 0.65 1.00

## pm25 0.66 0.51 -0.41 0.84 0.57 0.57

## pm25

## nox 0.66

## no2 0.51

## o3 -0.41

## pm10 0.84

## so2 0.57

## co 0.57

## pm25 1.00
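The effect of the use argument is easiest to see on a tiny example. In this sketch toy, cor_default and cor_pairwise are hypothetical names and the numbers are made up; each column has one missing value, in a different row.

```r
# Hypothetical data with one missing value in each column
toy <- data.frame(a = c(1, 2, 3, 4, NA),
                  b = c(2, 4, 6, NA, 10),
                  c = c(5, 3, NA, 1, 0))

# Default behaviour: any pair of columns involving a missing value gives NA
cor_default <- cor(toy)
cor_default

# Pairwise complete: each pair uses every row where both values are present
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")
round(cor_pairwise, 2)
```

Note that with pairwise complete observations each correlation may be based on a different set of rows, so the entries of the matrix are not always directly comparable.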

That only tells us part of the story, though. We can use ggpairs from GGally to produce more meaningful graphical summaries too, including density plots (a smoothed histogram) of each variable and scatter plots for each pair of variables.

ggpairs(mydata.aq)

Figure 14: Pairs plot for air quality variables

We can now see, visually, the distribution of each variable (density plots on main diagonal), the variation of each pair of variables (scatterplots below the main diagonal) and the correlation for each of those pairs of variables (numbers above the main diagonal).


Another way to visualise the correlation between the variables is to make a heatmap of correlation. This can be achieved in ggplot2 using geom_tile but requires a little bit of data preparation. We’re going to use gather from tidyr to stack our correlation matrix as columns in a new data frame.

mycor <- data.frame(mycor)

mycor$Var1 <- row.names(mycor)

mycor.gather <- gather(mycor, Var2,

value, nox:pm25)

mycor.gather$Var1 <- factor(mycor.gather$Var1,

levels = levels(mycor.gather$Var2))

head(mycor.gather)

## Var1 Var2 value

## 1 nox nox 1.0000000

## 2 no2 nox 0.7395267

## 3 o3 nox -0.6676903

## 4 pm10 nox 0.6557798

## 5 so2 nox 0.6679451

## 6 co nox 0.8431821

ggplot(data = mycor.gather, aes(x = Var1,

y = Var2)) + geom_tile(aes(fill = value),

color = "black") + scale_fill_gradient2(low = "red",

high = "blue", mid = "white",

midpoint = 0, limits = c(-1,

1), name = "Correlation") +

theme(legend.position = "bottom",

axis.text.x = element_text(angle = 90,

hjust = 1, vjust = 0.5)) +

xlab("") + ylab("") + coord_equal()

This is quite an advanced plot to build and an explanation of all of the various elements that make up the plot can be found at the ggplot2 docs website. In essence, we have stacked the data with gather, then used factor to make sure that Var1 and Var2 have the same ordering. We’ve used the tile geometry and set the tiles to be filled according to the value column in mycor.gather, with a black color on the edges of each tile. The scale for the fill goes from red at -1 to white at 0 and then blue at 1, and the colorbar has been moved to the bottom of the plot. The text on the x axis has been rotated and repositioned and the x and y axis labels have been suppressed as they’re not needed here. Finally, the tiles are forced to be squares by setting the x and y coordinate scales to be equal.
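The stacking step relies on Var2 being a factor. In more recent versions of tidyr, gather may return that column as plain text instead, in which case asking for its levels gives NULL and Var1 would end up full of missing values. The sketch below is one possible alternative for newer versions, assuming that pivot_longer is available in your copy of tidyr. It uses R’s built-in airquality data so that it runs on its own; cor_mat, cor_df, cor_long and var_order are hypothetical names.

```r
library(tidyr)

# A small correlation matrix from R's built-in airquality data
cor_mat <- cor(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")],
               use = "pairwise.complete.obs")

# Keep the row names as a column before stacking
cor_df <- data.frame(cor_mat)
cor_df$Var1 <- rownames(cor_mat)

# Stack every column except Var1 into name/value pairs
cor_long <- pivot_longer(cor_df, cols = -Var1,
                         names_to = "Var2", values_to = "value")

# Set the same ordering for both variables explicitly
var_order <- colnames(cor_mat)
cor_long$Var1 <- factor(cor_long$Var1, levels = var_order)
cor_long$Var2 <- factor(cor_long$Var2, levels = var_order)

head(cor_long)
```

The resulting data frame has the same three columns as mycor.gather, so it can be passed to the same geom_tile plot.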

Figure 15: Correlation heatmap

Advanced topics

Summarising data

For the entire “Advanced topics” section, we’re going to make use of the power of the dplyr library and its pipe operator %>%. The pipe feeds the output of one function in as the input of the next function (usually the data argument). This avoids endlessly nesting the results of functions inside each other or making temporary variables. At first the use of the pipe operator will seem scary because you’ll be thinking about data as objects with properties rather than as cells in a spreadsheet. If R reports that it cannot find group_by, summarise or mutate, load dplyr with library(dplyr) before continuing.
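As a small illustration of what the pipe saves us, the sketch below writes the same calculation twice on a made-up data frame, once with nested function calls and once with the pipe. The names toy, nested and piped are hypothetical.

```r
library(dplyr)

# Hypothetical measurements from two sites
toy <- data.frame(site = c("a", "a", "b", "b"),
                  pm10 = c(10, 20, 30, NA))

# Nested: read from the inside out
nested <- summarise(group_by(toy, site),
                    m = mean(pm10, na.rm = TRUE))

# Piped: read from top to bottom, one step per line
piped <- toy %>%
  group_by(site) %>%
  summarise(m = mean(pm10, na.rm = TRUE))

piped
```

Both versions give the same answer; the piped version just lists the steps in the order in which they happen.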

More detail is available on Sean Anderson’s blog.

Say we wanted to find the monthly mean pm10 for each month. We can use floor_date from lubridate to round each time stamp down to the first day of the month.

mydata$floor_month <- floor_date(mydata$date,

"month")

We can then make a new data frame that contains the monthly mean by grouping the pm10 data by month and summarising it with the mean function.

summarise(group_by(mydata, floor_month),

mean(pm10, na.rm = T))


## Source: local data frame [90 x 2]

##

## floor_month mean(pm10, na.rm = T)

## (time) (dbl)

## 1 1998-01-01 29.18378

## 2 1998-02-01 40.21184

## 3 1998-03-01 32.65223

## 4 1998-04-01 28.86792

## 5 1998-05-01 32.46289

## 6 1998-06-01 32.97893

## 7 1998-07-01 30.91553

## 8 1998-08-01 31.38995

## 9 1998-09-01 35.72574

## 10 1998-10-01 28.41281

## .. ... ...

We could repeat this operation for the standard deviation and sample size (by counting all the non-NA values) and then calculate the standard error from these values, but this is time consuming. Instead, we’ll make use of the pipe to send the results of the grouping to the summarise function and then use dplyr’s mutate function to make our new column for standard error, and we’ll show the month, mean and standard error columns by selecting them.

pm10.monthly.summary <- group_by(mydata, floor_month) %>%
    summarise(m = mean(pm10, na.rm = T),
              s = sd(pm10, na.rm = T),
              n = sum(!is.na(pm10))) %>%
    mutate(se = s/sqrt(n))

select(pm10.monthly.summary, floor_month,

m, se)

## Source: local data frame [90 x 3]

##

## floor_month m se

## (time) (dbl) (dbl)

## 1 1998-01-01 29.18378 0.4780923

## 2 1998-02-01 40.21184 0.6746498

## 3 1998-03-01 32.65223 0.5703393

## 4 1998-04-01 28.86792 0.4923748

## 5 1998-05-01 32.46289 0.5247012

## 6 1998-06-01 32.97893 0.4550000

## 7 1998-07-01 30.91553 0.4245768

## 8 1998-08-01 31.38995 0.5093352

## 9 1998-09-01 35.72574 0.5670221

## 10 1998-10-01 28.41281 0.4817502

## .. ... ... ...

This is the kind of data that we may want to plot. While we could plot a long time series plot that showed all 90 of these means and their standard errors, it may be a better idea to come up with a way to break the summaries up by month and year.

We can derive month and year variables on the fly to feed to ggplot with mutate and the pipe. Rather than plotting ±1 standard errors, we’ll plot 95% confidence intervals derived correctly from the tails of a t distribution with df=n-1. Lastly, we’ll use the text labels for the months, rather than the number.

pm10.monthly.summary %>%
    mutate(year = year(floor_month),
           month = month(floor_month, label = TRUE, abbr = FALSE),
           d = se * qt(df = n - 1, p = 0.975)) %>%
    ggplot(data = ., aes(x = year, y = m)) +
    geom_pointrange(aes(ymin = m - d, ymax = m + d)) +
    facet_wrap(~month) +
    theme_bw() +
    ylab("Monthly mean PM10") +
    xlab("Year")

Figure 16: Mean and 95% confidence interval for monthly average PM10 from year to year

The pointrange geometry requires a y value to draw the point, a minimum y value for the bottom of the line and a maximum y value for the top of said line. This plot lets us look at how similar the monthly mean pm10 values are from year to year, rather than looking at the annual trend. We can see that Septembers are quite variable whereas July is fairly stable from one year to another.
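The half-width d of each interval in the plot is the standard error multiplied by the 97.5% point of a t distribution with n-1 degrees of freedom. As a check on that recipe, the sketch below applies it to a handful of made-up values and compares the result with the interval reported by R’s own t.test function. The names x, n, m, se, d and ci are chosen to mirror the summary columns above and the data are hypothetical.

```r
# Hypothetical readings, one of them missing
x <- c(28, 31, 35, 29, 40, 33, NA, 30)

n  <- sum(!is.na(x))               # sample size, ignoring missing values
m  <- mean(x, na.rm = TRUE)        # mean
se <- sd(x, na.rm = TRUE)/sqrt(n)  # standard error of the mean

# Half-width of the 95% confidence interval
d  <- se * qt(p = 0.975, df = n - 1)
ci <- c(lower = m - d, upper = m + d)
ci

# The same interval as calculated by t.test
t.test(x)$conf.int
```

For large n the multiplier from qt approaches 1.96, so with hundreds of hourly readings per month the intervals are only slightly wider than those from a normal approximation.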


Advanced visualisation with ggplot2

The wind direction and speed data is in a polar coordinate system, which presents unique challenges for effective visualisation. We do not wish to distort the relationship between wind directions at 0 and 350 degrees by leaving them at opposite ends of a plot. Fortunately, ggplot2 provides the coord_polar function to allow us to generate polar coordinate plots, such as a comb plot representing the prevailing winds for each month with the daily mean wind speed and direction.

mydata.daily$month <- month(mydata.daily$date)
mydata.daily$year <- year(mydata.daily$date)

ggplot(data = mydata.daily, aes(x = wd, y = ws)) +
    geom_segment(aes(xend = wd), yend = 0, size = 0.25, alpha = 0.25) +
    coord_polar() +
    scale_x_continuous(limits = c(0, 360), breaks = seq(0, 360, by = 45)) +
    scale_y_continuous(limits = c(-2, NA)) +
    theme_bw() +
    facet_grid(year ~ month, margins = TRUE) +
    theme(legend.position = "bottom",
          axis.text.x = element_text(size = 2),
          axis.text.y = element_text(size = 4)) +
    xlab("Wind direction (degrees)") +
    ylab("Wind speed (m/s)")

Visual summaries with openair

File input/output

Using read.csv and readr::read_csv

Regression modelling

Another air quality data set provided by R is the airquality data set, which contains daily measurements from New York City for 1 May to 30 September 1973.

head(airquality)

## Ozone Solar.R Wind Temp Month Day

## 1 41 190 7.4 67 5 1

## 2 36 118 8.0 72 5 2

## 3 12 149 12.6 74 5 3

## 4 18 313 11.5 62 5 4

## 5 NA NA 14.3 56 5 5

## 6 28 NA 14.9 66 5 6

Say we wanted to know the relationship between Ozone and Wind, the wind speed. First, we can set up a scatterplot with a smooth line of best fit.


Figure 17: Daily average winds (polar panels in a grid with months 1 to 12 and (all) as columns and years 1998 to 2005 and (all) as rows; angular axis: Wind direction (degrees), labelled every 45 degrees from 0/360; radial axis: Wind speed (m/s))


ggplot(data = airquality, aes(x = Wind, y = Ozone)) +
    geom_point() +
    geom_smooth()

(Plot output: Ozone on the y axis against Wind on the x axis, with a smooth line of best fit.)

Fitting and summarising a model

It looks like there’s a non-linear relationship between the two variables, where the faster the wind is the less ozone is present. We shouldn’t make inferences based only on plots, though, so we will fit some linear models to provide quantitative evidence.

The bulk of the regression modelling in R is some variation or extension of ordinary least squares linear regression. In R, linear modelling is performed with the lm function.

To fit a model for a linear effect of wind speed on ozone, we run

lm.WO3 <- lm(data = airquality, Ozone ~ Wind)

lm.WO3

##

## Call:

## lm(formula = Ozone ~ Wind, data = airquality)

##

## Coefficients:

## (Intercept) Wind

## 96.873 -5.551

The standard output for a fitted linear model is a reminder of what the call was when making the object, and the maximum likelihood estimates of the parameters in the model. This isn’t particularly informative, though, so we use the summary function to extract more information.

summary(lm.WO3)


##

## Call:

## lm(formula = Ozone ~ Wind, data = airquality)

##

## Residuals:

## Min 1Q Median 3Q Max

## -51.572 -18.854 -4.868 15.234 90.000

##

## Coefficients:

## Estimate Std. Error t value

## (Intercept) 96.8729 7.2387 13.38

## Wind -5.5509 0.6904 -8.04

## Pr(>|t|)

## (Intercept) < 2e-16 ***

## Wind 9.27e-13 ***

## ---

## Signif. codes:

## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## Residual standard error: 26.47 on 114 degrees of freedom

## (37 observations deleted due to missingness)

## Multiple R-squared: 0.3619, Adjusted R-squared: 0.3563

## F-statistic: 64.64 on 1 and 114 DF, p-value: 9.272e-13

Here we see: information about the quantiles of the residuals; a table of coefficients with their estimates, standard error of estimates, and information about the hypothesis tests about their values being equal to zero; the standard error of the residuals; the coefficient of determination, $R^2$, and an F test for whether or not the model performs better than the null model (i.e. a model with only a mean).
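Each piece of the printed summary can also be pulled out of the summary object by name, which is handy when a number is needed in later code rather than on screen. The sketch below refits the straight-line model under the hypothetical name fit so that it runs on its own; the other object names are also just for illustration.

```r
# Refit the straight-line model so this block is self-contained
fit <- lm(Ozone ~ Wind, data = airquality)
s   <- summary(fit)

coef.table <- coef(s)       # estimates, standard errors, t values, p-values
r2         <- s$r.squared   # coefficient of determination
sigma.hat  <- s$sigma       # residual standard error
f.stat     <- s$fstatistic  # F statistic and its two degrees of freedom

# p-value of the overall F test, recomputed from the stored F statistic
f.p <- pf(f.stat[["value"]], f.stat[["numdf"]], f.stat[["dendf"]],
          lower.tail = FALSE)

round(coef.table, 4)
round(c(R2 = r2, sigma = sigma.hat, F = f.stat[["value"]]), 4)
f.p
```

The values should agree with the printed summary above, up to rounding.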

The three asterisks in the final column indicate that at a 0.001 level of significance the parameters are statistically significant [5].

[5] The author would like to remind the reader at this point that Neyman-Pearson frequentist statistics requires a p-value cutoff be chosen prior to performing any data analysis, and that the p-value is not a measure of the strength of evidence.

The summary of a fitted model is a bit of a mess, but we can use the tidy and glance functions from the broom library to return these as data frames that provide, respectively, the coefficient matrix for the model and the model diagnostics.

tidy(lm.WO3)

## term estimate std.error statistic

## 1 (Intercept) 96.872895 7.2386887 13.38266

## 2 Wind -5.550923 0.6904021 -8.04013

## p.value

## 1 3.992068e-25

## 2 9.271974e-13

glance(lm.WO3)

## r.squared adj.r.squared sigma statistic

## 1 0.3618582 0.3562605 26.46729 64.64369


## p.value df logLik AIC

## 1 9.271974e-13 2 -543.5937 1093.187

## BIC deviance df.residual

## 1 1101.448 79859.01 114

In addition to the stars that represent the significance in the summary output, we can construct confidence intervals with the confint function. By default, this uses a confidence level of 0.95, but we can construct intervals of whatever level we want.

confint(lm.WO3)

## 2.5 % 97.5 %

## (Intercept) 82.533108 111.212681

## Wind -6.918604 -4.183242

confint(lm.WO3, level = 0.9)

## 5 % 95 %

## (Intercept) 84.868760 108.877029

## Wind -6.695837 -4.406008
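For a linear model these intervals are the estimate plus or minus a t quantile times the standard error, with the residual degrees of freedom. The following sketch rebuilds the 95% intervals by hand and prints them next to the confint result; fit and the other names are hypothetical and exist only for this check.

```r
fit <- lm(Ozone ~ Wind, data = airquality)

est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))               # standard errors of the estimates
tq  <- qt(0.975, df = df.residual(fit))    # 114 residual df for this model

ci.manual <- cbind(lower = est - tq * se, upper = est + tq * se)

ci.manual
confint(fit)
```

Changing 0.975 to 0.95 in the call to qt gives the 90% intervals in the same way.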

If we want to combine the parameter estimates and their confidence intervals together in a single data frame (so we can plot them or write them to a table) we can ensure that our confint output and our tidy output are data frames with a common column called term that identifies the parameters by name. We can then merge these two data frames together. Using the pipe, %>%, we can eliminate a lot of the intermediate steps that would require creation of temporary variables.

pars.WO3 <- confint(lm.WO3) %>%
    data.frame %>%
    mutate(term = row.names(.)) %>%
    merge(., tidy(lm.WO3)) %>%
    select(., term:estimate)

The long way to do this would be

CI.WO3 <- data.frame(confint(lm.WO3))

CI.WO3$term <- row.names(CI.WO3)

coef.WO3 <- tidy(lm.WO3)

pars.WO3 <- merge(CI.WO3, coef.WO3[, c("term", "estimate")])

which leaves us with two objects that we aren’t really going to use, CI.WO3 and coef.WO3. We can delete these temporary objects,

rm(list = c("CI.WO3", "coef.WO3"))

Whichever way you choose to do it, we can permute the order of the columns so that our estimate is between the columns containing the lower and upper bounds of our CI, rename the columns to be something more human friendly and then write the result to a CSV file in the current working directory.


pars.WO3 <- pars.WO3[, c(1, 2, 4, 3)]

pars.WO3

## term X2.5.. estimate X97.5..

## 1 (Intercept) 82.533108 96.872895 111.212681

## 2 Wind -6.918604 -5.550923 -4.183242

names(pars.WO3)[c(2:4)] <- c("2.5th", "Estimate", "97.5th")

# what's the current working directory?
getwd()

## [1] "C:/git/Thesis/R tutorial"

# what CSV files are stored in the current working directory?
# (pattern is a regular expression, not a shell wildcard)
dir(pattern = "\\.csv$")

## [1] "pars.WO3.csv"

write_csv(x = pars.WO3, "pars.WO3.csv")

dir(pattern = "\\.csv$")

## [1] "pars.WO3.csv"

The write_csv function from readr is different to the write.csv function that comes with R, but it behaves almost identically. For more details, see the section in this tutorial on file input/output. Once your parameter estimates and confidence intervals are in a CSV file you can open it in Excel and do whatever you want with it, like put it in a Word document.
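For readers without readr installed, the same round trip can be sketched with base R alone. The data frame below is a stand-in for pars.WO3 with values copied from the output above, and the file goes to a temporary location (the name csv.file is hypothetical) so that nothing in the working directory is touched.

```r
# A small table shaped like pars.WO3 (values copied from the output above)
pars <- data.frame(term     = c("(Intercept)", "Wind"),
                   lower    = c(82.533108, -6.918604),
                   Estimate = c(96.872895, -5.550923),
                   upper    = c(111.212681, -4.183242),
                   stringsAsFactors = FALSE)
names(pars)[c(2, 4)] <- c("2.5th", "97.5th")

# Write to a temporary file; row.names = FALSE avoids an extra index column
csv.file <- tempfile(fileext = ".csv")
write.csv(pars, file = csv.file, row.names = FALSE)

# check.names = FALSE stops read.csv from renaming "2.5th" to "X2.5th"
pars.back <- read.csv(csv.file, check.names = FALSE,
                      stringsAsFactors = FALSE)
pars.back
```

Column names that start with a digit are the reason for check.names = FALSE; without it, base R rewrites them into syntactically valid names when reading the file back in.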

Assessing a fitted model

There are assumptions inherent in linear regression modelling that we would want to check, such as the Normality of the residuals, whether the residuals are independent and identically distributed white noise, and whether the partial residuals indicate that we have chosen an appropriate functional form for our regression model.

The augment function from broom operates on a fitted linear model to extract the data used to fit the model and augments it with diagnostic information such as the fitted values, $\hat{y}_i$, residuals, $y_i - \hat{y}_i$, and standardised residuals, $\frac{y_i - \hat{y}_i}{\sigma}$.

aug.WO3 <- augment(lm.WO3)

head(aug.WO3)

## .rownames Ozone Wind .fitted .se.fit

## 1 1 41 7.4 55.79607 2.988030

## 2 2 36 8.0 52.46551 2.773383


## 3 3 12 12.6 26.93127 3.100335

## 4 4 18 11.5 33.03728 2.705129

## 5 6 28 14.9 14.16414 4.258733

## 6 7 23 8.6 49.13496 2.607330

## .resid .hat .sigma .cooksd

## 1 -14.79607 0.012745325 26.54721 0.002043318

## 2 -16.46551 0.010979954 26.53848 0.002172160

## 3 -14.93127 0.013721394 26.54650 0.002244622

## 4 -15.03728 0.010446167 26.54609 0.001721740

## 5 13.83586 0.025890581 26.55142 0.003728115

## 6 -26.13496 0.009704495 26.46910 0.004824339

## .std.resid

## 1 -0.5626290

## 2 -0.6255516

## 3 -0.5680509

## 4 -0.5711367

## 5 0.5296544

## 6 -0.9922700
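The numbers in the .std.resid column appear to include a leverage term as well, i.e. each residual is divided by the residual standard error times the square root of one minus its leverage (the .hat column), which is what base R's rstandard computes. The sketch below checks this by hand on a refit of the straight-line model; fit and std.manual are hypothetical names used only here.

```r
fit <- lm(Ozone ~ Wind, data = airquality)

res   <- residuals(fit)        # observed minus fitted values
h     <- hatvalues(fit)        # leverage of each observation
sigma <- summary(fit)$sigma    # residual standard error

# Standardised residuals with the leverage correction
std.manual <- res / (sigma * sqrt(1 - h))

head(cbind(resid = res, hat = h,
           manual = std.manual, rstandard = rstandard(fit)))
```

Because the leverages here are small, dividing by the residual standard error alone gives nearly the same values, but the two are not identical.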

Let’s check that the standardised residuals, which should have a mean of approximately 0 and a standard deviation of approximately 1, follow a N(0, 1) distribution.

ggplot(data = aug.WO3, aes(x = .std.resid)) +
    geom_histogram(aes(y = ..density..), fill = "blue", alpha = 0.5,
                   color = "black", binwidth = 0.25) +
    stat_function(fun = dnorm, geom = "density", fill = "grey", alpha = 0.5) +
    xlim(c(-3, 3)) +
    xlab("Standardised residuals")

Figure 18: Standardised residuals and a N(0,1) distribution (x axis: Standardised residuals; y axis: density)

It’s difficult to tell from this graph whether or not the residuals are Normally distributed. Quantile-quantile plots are typically used to compare the quantiles of the standardised residuals to the quantiles of a N(0, 1) distribution.

ggplot(data = aug.WO3, aes(sample = .std.resid)) +
    stat_qq(geom = "point") +
    coord_equal() +
    geom_abline(intercept = 0, slope = 1)

Figure 19: QQ plot for standardised residuals (x axis: theoretical; y axis: sample)

There’s a bit of departure from Normality in the tails and it looks like there’s some skewness after all.
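To see what a QQ plot is drawing, the pairs of quantiles can be built by hand: sort the standardised residuals and match them with N(0, 1) quantiles at evenly spread probabilities. This sketch uses base R's rstandard and the plotting positions from ppoints, which is what qqnorm uses; the object names are hypothetical.

```r
fit <- lm(Ozone ~ Wind, data = airquality)

# Sample quantiles: the standardised residuals in increasing order
z <- sort(unname(rstandard(fit)))
n <- length(z)

# Theoretical N(0, 1) quantiles at the plotting positions used by qqnorm
theoretical <- qnorm(ppoints(n))

qq <- data.frame(theoretical = theoretical, sample = z)
head(qq, 3)
tail(qq, 3)
```

Points in the upper tail whose sample value is well above the theoretical value indicate a longer right tail than the Normal distribution would give.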

Plotting the residuals against the fitted values allows us to identify any heteroskedasticity in the residuals and to see if there’s a non-zero trend that points us towards fitting a different model.

ggplot(data = aug.WO3, aes(y = .resid, x = .fitted)) +
    geom_point() +
    stat_smooth(se = FALSE, col = "red") +
    xlab("Fitted values") +
    ylab("Residuals")

Figure 20: Residuals vs fitted values for linear model (x axis: Fitted values; y axis: Residuals)

It looks like there’s some unexplained variation in this model, which is also supported by the $R^2$ value of 36%. A partial residuals plot, of the residuals against the Wind variable, will tell us whether there’s variation in the data due to Wind that has not been explained by the model.

ggplot(data = aug.WO3, aes(y = .resid, x = Wind)) +
    geom_point() +
    stat_smooth(se = FALSE, col = "red") +
    xlab("Wind") +
    ylab("Residuals")

Figure 21: Partial residual plot for Wind variable (x axis: Wind; y axis: Residuals)

So perhaps we should fit a model which contains a quadratic term in Wind to model the curvature in the residuals.

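lm.W2O3 <- lm(data = airquality, Ozone ~ poly(Wind, 2))

# supply the complete Ozone/Wind rows so that Wind is kept alongside the
# diagnostics (merging on Ozone alone would duplicate rows with tied values)
aug.W2O3 <- augment(lm.W2O3,
                    data = na.omit(airquality[, c("Ozone", "Wind")]))

ggplot(data = aug.W2O3, aes(y = .resid, x = Wind)) +
    geom_point() +
    stat_smooth(se = FALSE, col = "red") +
    xlab("Wind") +
    ylab("Residuals")

Figure 22: Partial residual plot for Wind variable from polynomial model (x axis: Wind; y axis: Residuals)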

glance(lm.W2O3)

## r.squared adj.r.squared sigma statistic

## 1 0.485711 0.4766085 23.86535 53.3604

## p.value df logLik AIC

## 1 4.821811e-17 3 -531.0787 1070.157

## BIC deviance df.residual

## 1 1081.172 64359.7 113

This new model appears to do a better job of fitting the data, as the $R^2$ value is now 49% and the residuals are a little less heteroskedastic.

To test whether or not this is much of an improvement we can use the anova function to calculate an F-test, as the straight line model is nested within the polynomial model.

anova(lm.WO3, lm.W2O3)

## Analysis of Variance Table

##

## Model 1: Ozone ~ Wind

## Model 2: Ozone ~ poly(Wind, 2)

## Res.Df RSS Df Sum of Sq F Pr(>F)

## 1 114 79859

## 2 113 64360 1 15499 27.213 8.343e-07

##

## 1

## 2 ***


## ---

## Signif. codes:

## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output here is showing us that the increase in model fit is statistically significant at a level of 0.05, so we would reject our F-test’s null hypothesis that the two models fit the data equally well.
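The F statistic in the anova table can be reproduced from the residual sums of squares and residual degrees of freedom of the two models: the drop in RSS per extra parameter, divided by the residual variance of the larger model. The sketch below does this by hand; fit1, fit2 and the other names are hypothetical stand-ins for the models above.

```r
fit1 <- lm(Ozone ~ Wind, data = airquality)           # straight line
fit2 <- lm(Ozone ~ poly(Wind, 2), data = airquality)  # quadratic

rss1 <- deviance(fit1); df1 <- df.residual(fit1)
rss2 <- deviance(fit2); df2 <- df.residual(fit2)

# Reduction in RSS per extra parameter, relative to the residual variance
f.manual <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p.manual <- pf(f.manual, df1 - df2, df2, lower.tail = FALSE)

c(F = f.manual, p = p.manual)
```

This only makes sense when the smaller model is a special case of the larger one, as it is here.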

Non-linear regression

The relationship between Ozone and Wind doesn’t really look like a polynomial, as it seems to flatten out after a value of 12 miles/hour. Perhaps an exponential curve is most appropriate for this data. The nls function is a non-linear least squares method that allows us to fit models where there’s not a linear relationship between the outcome and the coefficients of the model, e.g. $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$ is linear in the coefficients but $y_i = A e^{-k x_i} + \varepsilon_i$ is not.

To ensure that the rate parameter is strictly positive, we will parameterise the rate, $k$, as $e^{\log k}$ and fit the parameter logk in its place. The model is therefore $y_i = A e^{-e^{\log k} x_i} + \varepsilon_i$. The nls function requires starting values of the parameters A and logk (the function makes use of a numerical optimisation routine and this requires an initial guess). Here we’ll use A=200 because it looks reasonable from the graph of the data. The value of logk=-5 was chosen because it gave convergence.

nls.WO3 <- nls(data = airquality,
               formula = Ozone ~ A * exp(-exp(logk) * Wind),
               start = list(A = 200, logk = -5))

summary(nls.WO3)

##

## Formula: Ozone ~ A * exp(-exp(logk) * Wind)

##

## Parameters:

## Estimate Std. Error t value Pr(>|t|)

## A 192.51560 23.22550 8.289 2.53e-13

## logk -1.75331 0.09884 -17.739 < 2e-16

##

## A ***

## logk ***

## ---

## Signif. codes:

## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## Residual standard error: 23.87 on 114 degrees of freedom

##

## Number of iterations to convergence: 8

## Achieved convergence tolerance: 6.53e-06


## (37 observations deleted due to missingness)

tidy(nls.WO3)

## term estimate std.error statistic

## 1 A 192.51560 23.22550482 8.288974

## 2 logk -1.75331 0.09883729 -17.739360

## p.value

## 1 2.525946e-13

## 2 1.412603e-34

confint(nls.WO3)

## 2.5% 97.5%

## A 149.149737 244.818847

## logk -1.975829 -1.563533

So an estimate and 95% confidence interval for Ozone when Wind=0 is 192.52 (149.15, 244.82). For the rate parameter, k, we must exponentiate to undo our log transform, so we obtain an estimate of $e^{-1.753} = 0.173$ and a 95% confidence interval of (0.139, 0.209).
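The back-transformation can be done in code rather than on a calculator. In this sketch the model is refitted under the hypothetical name fit, the point estimate of logk is exponentiated, and the interval endpoints for logk are copied from the confint output above rather than recomputed.

```r
fit <- nls(Ozone ~ A * exp(-exp(logk) * Wind), data = airquality,
           start = list(A = 200, logk = -5))

# Point estimate of the rate on its natural scale
k.hat <- exp(coef(fit)[["logk"]])

# Interval endpoints for logk, copied from the confint output above
logk.ci <- c(lower = -1.975829, upper = -1.563533)
k.ci    <- exp(logk.ci)

round(k.hat, 3)
round(k.ci, 3)
```

Exponentiating the endpoints is valid because exp is monotone increasing, although the resulting interval is no longer symmetric about the estimate.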

We can’t use the anova function to compare models, as the models can’t be nested one inside the other. That is, we can’t recover one model by setting some terms in the other equal to zero. We can, however, look at the residuals and compare them to the fitted values.

aug.nls <- augment(nls.WO3)

ggplot(data = aug.nls, aes(x = .fitted, y = .resid)) +
    geom_point() +
    xlab("Fitted") +
    ylab("Residuals") +
    geom_smooth(se = FALSE, color = "red")

Figure 23: Residuals vs fitted values for non-linear model (x axis: Fitted; y axis: Residuals)

These residuals look a little more homoskedastic, but what’s probably of most interest is our predictive relationships between Wind and Ozone.

Predicting from a model

We will make use of the linear and non-linear regression models above and look at our line of best fit.

First, make a data frame that contains a variable named Wind, just like in our original airquality data.

airquality.pred <- data.frame(Wind = seq(from = 0, to = 25, by = 0.1))

head(airquality.pred)

## Wind

## 1 0.0

## 2 0.1

## 3 0.2


## 4 0.3

## 5 0.4

## 6 0.5

Now we will use the predict function, which has methods for both "lm" and "nls" model objects. We need to tell predict which model we’re predicting from and what the new data frame is.

airquality.pred$lm <- predict(object = lm.W2O3,

newdata = airquality.pred)

airquality.pred$nls <- predict(object = nls.WO3,

newdata = airquality.pred)

head(airquality.pred)

## Wind lm nls

## 1 0.0 162.5722 192.5156

## 2 0.1 160.6343 189.2099

## 3 0.2 158.7094 185.9610

## 4 0.3 156.7974 182.7679

## 5 0.4 154.8985 179.6297

## 6 0.5 153.0125 176.5453
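For the non-linear model, predict simply evaluates the fitted curve at the new Wind values, which can be confirmed by plugging the estimated coefficients into the formula directly. The wind speeds in new.wind are hypothetical and the model is refitted as fit so the block runs on its own.

```r
fit <- nls(Ozone ~ A * exp(-exp(logk) * Wind), data = airquality,
           start = list(A = 200, logk = -5))

new.wind <- data.frame(Wind = c(0, 5, 10, 20))  # hypothetical wind speeds

pred.predict <- predict(fit, newdata = new.wind)

# The same curve evaluated directly from the fitted coefficients
A    <- coef(fit)[["A"]]
logk <- coef(fit)[["logk"]]
pred.manual <- A * exp(-exp(logk) * new.wind$Wind)

cbind(new.wind, predict = pred.predict, manual = pred.manual)
```

At Wind = 0 the prediction equals the estimate of A, which matches the first row of the nls column above.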

We see that even for the first few predictions the models are quite different. We can plot these two models to see how they differ.

airquality.pred %>%
    gather(., key = model, value = value, -Wind) %>%
    ggplot(data = ., aes(x = Wind, y = value)) +
    geom_line(aes(color = model)) +
    geom_point(data = airquality, aes(y = Ozone)) +
    theme(legend.position = "bottom") +
    ylab("Ozone")

Figure 24: Lines of best fit from linear (polynomial) and non-linear (exponential) regression models (x axis: Wind; y axis: Ozone; colour legend: model, with levels lm and nls)

Both models perform similarly for the bulk of the data, but the behaviour for strong winds is very different. The exponential model asymptotes towards zero, but the polynomial linear regression model shows an increase in Ozone. It may well be that both of these models are wrong and that a better model would have an exponential decay to some base level of Ozone as wind speed increases to infinity. This is all part of the iterative process of model building.

Models with multiple explanatory variables

Data summaries

Contingency tables

Contingency tables are a great way to do cross-tabulation of a large data frame, counting the number of times each combination occurs. The use of the table function on a single categorical variable is equivalent to the use of summary on a numeric variable, in the way that a histogram is the continuous version of a bar plot (geom_bar in ggplot2).

With the airquality data frame we might wish to know what proportion (to 3 decimal places) of observations of ozone in each month exceed a value of 80 ppb.

Derive a new column, HighOzone, and use table to determine how many days Ozone exceeds 80. This will be a logical statement, TRUE or FALSE. Logical statements are the basis of a lot of data analysis, such as subsetting data frames, and contingency tables.

airquality$HighOzone <- airquality$Ozone > 80

table(airquality$Month, airquality$HighOzone)

##

## FALSE TRUE

## 5 25 1

## 6 9 0

## 7 20 6

## 8 19 7

## 9 27 2

# or

mutate(airquality, HighOzone = Ozone > 80) %>%
    select(., Month, HighOzone) %>%
    table

## HighOzone

## Month FALSE TRUE

## 5 25 1

## 6 9 0

## 7 20 6

## 8 19 7

## 9 27 2
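Logical vectors can be counted and used for subsetting directly, and it helps to see how missing values behave when doing so. The sketch below uses only base R on the built-in airquality data; high, n.high and high.days are hypothetical names.

```r
# TRUE, FALSE, or NA wherever Ozone itself is missing
high <- airquality$Ozone > 80

n.high    <- sum(high, na.rm = TRUE)  # TRUE counts as 1, FALSE as 0
n.missing <- sum(is.na(high))

# which() drops the NA entries, so only genuine exceedances are kept
high.days <- airquality[which(high), c("Month", "Day", "Ozone")]

c(exceedances = n.high, missing = n.missing)
head(high.days)
```

Subsetting with the logical vector itself, rather than which(high), would return a row of NA values for every day on which Ozone is missing.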

This only tells us the number of days for which Ozone exceeds 80. To convert to a proportion, we will divide the number of TRUE values per month by the sum of the TRUE and FALSE column for each month.

tf <- table(airquality$Month, airquality$HighOzone)

tf.df <- data.frame(Exceedence = round(tf[, 2]/rowSums(tf), 3))

tf.df$Month <- row.names(tf.df)

tf.df

## Exceedence Month

## 5 0.038 5

## 6 0.000 6

## 7 0.231 7

## 8 0.269 8

## 9 0.069 9


# or

mutate(airquality, HighOzone = Ozone > 80) %>%
    select(., Month, HighOzone) %>%
    group_by(Month) %>%
    summarise(Exceedence = mean(HighOzone, na.rm = TRUE) %>% round(., 3))

## Source: local data frame [5 x 2]

##

## Month Exceedence

## (int) (dbl)

## 1 5 0.038

## 2 6 0.000

## 3 7 0.231

## 4 8 0.269

## 5 9 0.069
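A third route, sketched here with base R only, is prop.table, which converts a table of counts into proportions along a chosen margin. The names high, tf and exceed are hypothetical; the result should reproduce the monthly proportions shown above.

```r
high <- airquality$Ozone > 80
tf   <- table(Month = airquality$Month, HighOzone = high)

# margin = 1 divides each row (month) by its own total of non-missing days
exceed <- round(prop.table(tf, margin = 1)[, "TRUE"], 3)
exceed
```

Using margin = 2 instead would give the share of all exceedances that fall in each month, which answers a different question.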

Graphically, we can represent this information with a stacked barplot. We make the variable HighOzone, mutate the resulting data frame so that Month is a factor and HighOzone has TRUE as its first level (so that all the exceedences are on the bottom of the plot), and plot the data such that exceedences are red, non-exceedences are blue, and missing data are grey.

mutate(airquality, HighOzone = Ozone > 80) %>%
    select(., Month, HighOzone) %>%
    mutate(HighOzone = factor(HighOzone, levels = c(TRUE, FALSE)),
           Month = factor(Month)) %>%
    ggplot(., aes(x = Month)) +
    # geom_bar counts the rows in each Month by default
    geom_bar(aes(fill = HighOzone)) +
    scale_fill_manual(values = c("salmon", "skyblue"),
                      na.value = "grey50",
                      name = "Ozone > 80 ppb")

Figure 25: Number of days for which Ozone exceeds 80 ppb in airquality data set (x axis: Month; y axis: count; fill legend: Ozone > 80 ppb, with levels TRUE and FALSE)

Additional resources