
Chapter 1

R as a calculator

R is designed to perform powerful and complex series of computations. However, it is also very useful for doing simple calculations and you can, in fact, use it as a calculator (several of the authors have installed it on their smart phones for this very purpose). The more complex things we will learn later in the course depend on being able to do simple manipulations of data, and so this is the first topic we will cover in the workshop.

The objectives of this session are for you to learn:

• How to do basic calculations,

• How to assign values to objects,

• How to use functions,

• How to write your own functions,

• How to store several things inside one object,

• How to use logical statements,

• How to get help (apart from just asking us!),

• How to do the same thing over and over again using a loop, and

• How to plot data.

For all the items on the to-do list, we will show you exactly how to do it by giving you the computer code. However, we will end this session with a challenge that requires you to use several of the ideas these notes show you.

We assume you have RStudio open. You can either copy the commands below into the script window and then run them one line at a time, or type them into the console. The bundle of files on the webpage includes these commands as an .R file, which you can read into R by opening it in your file system or browsing to it from within RStudio.

These notes are derived from R training courses that have been conducted by the Saw Swee Hock School of Public Health by permutations of: Alex Cook, Sourav Das, Koh Wee Ming, Ong Suan Ee, Kiesha Prem, Tan Ken Wei, Teo Yik Ying, and Zhao Xiahong.


1.1 Basic calculations

The following commands do basic arithmetic—addition, subtraction, multiplication, division and exponentiation:

1+1
2-1
6*8
1/2
3^2 # Note if you put a '#' on a line,
# everything after it is a comment

If you have copied these into your script window, then to run any of these, you can do one of the following:

• Select the text, ctrl-c, then put the cursor in the console and hit ctrl-v.

• Select the text and hit ctrl-enter or ctrl-r, but note on some computers only one of these works.

• Put the cursor on the line (without selecting any text) and hit ctrl-enter or ctrl-r. This runs the whole line.

• Select text or put the cursor on the line and click the Run button above.

• If you want to run every command in the file, hit the Source button above.


1.2 Storing values

You can store values in objects using <- or =. If you use =, the thing on the left becomes the same as the thing on the right. If you use an arrow, the object the arrow points at takes the value on the other side.

x = 1
y <- 2
3 -> z
x # You can type the name of an object to see its value
y
z
x = y+z
x

Note that x changed values from 1 to 5 because we reassigned its value. Generally, the stored value in an object lasts only until a new value is stored in it.

Question: If you type the following sequence, what will happen?

a = 3
b <- 6
b -> a
a # what is this equal to?
b # what is this equal to?

If you ever want to, you can run two or more commands on the same line by separating them with semi-colons, such as this: a = 3; b = 4


1.3 Functions

Many of the basic mathematical functions are built into R, for instance:

sqrt(9)
sin(0)
cos(0)
x=1.5*pi
y=sin(x)
x=log(10)
x=log10(10)
y=exp(x)
abs(-2.5)
round(2.1)

Each of these functions takes one (or more) arguments in round brackets.

Question: why is something printed to the screen when you type abs(-2.5), but nothing is displayed when you type x=log(10)?

You can make your own functions like this:

square=function(x){
  return(x^2)
}
square(5)

adder=function(x,y){
  return(x+y)
}
adder(5,7)

Note that there are easier ways to square a number or add two numbers together than to write a function, but these examples illustrate how a function is designed.
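For instance, these one-liners give the same answers without defining a function:

5^2 # the same result as square(5)
5+7 # the same result as adder(5,7)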


1.4 Vectors

You can also store multiple values in the same object, as a vector (like one column with many rows in Excel), matrix (like several columns and many rows in Excel), array (like several columns, several rows and several sheets in Excel) or other more complicated object types like lists. For now, we just look at vectors. Try the following to see what they do:

x=seq(0,1,0.1)
y=0:10
z=c(3,1,4,1,5,9,3)

The ‘c’ stands for concatenate, and is a function to put its arguments together as a vector.
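In case the three constructors above are unfamiliar, here is what each produces (run them to check):

seq(0,1,0.1) # a sequence from 0 to 1 in steps of 0.1
0:10 # the integers from 0 to 10
c(3,1,4) # its arguments combined into a single vector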

You can then do stuff to the vectors as if they were just normal numbers:

x+y # This works because x and y are the same length
x+1 # This works because R knows to repeat 1 for each entry in x
x+z
square(y)
sqrt(x)

Question: Why does x+z give a warning message?

If you want to extract out one or more entries from a vector, you use [] brackets:

x[3]
x[c(3,4)]
x[3:5]

You can stick vectors together using the c() function:

a=c(1,2,3)
b=c(4,5)
c(a,b)


1.5 Logic

We often want to know if a statement about a variable is true, and to do something only if the statement is true. For example, when we do a chi-square test (in Wednesday's class), there is a requirement that each cell in the table has an expected count of 5 or more. If that requirement is not met, we might have to do a different test (perhaps Fisher's exact test) instead. So if you were doing a lot of tests, you might want R to work out what the expected cell counts were, work out whether the statement 'the expected number in this cell is 5 or more' is true, and only if it is true to do a chi-square test.

The following commands show some basic logical statements. Try them and see what happens.

a=c(0,3,5,2,4)
a>2
a<=3
a==4
which(a>2)
i=which(a>2)
a[i]
a[a>2]

Question: what is the difference between the following two lines?

which(a>2)
i=which(a>2); a[i]

R treats TRUE and FALSE as 1 and 0 when you do maths on them:

(a>2)*3

You can get R to do something only if a condition is true:

x=3
y=1
if(x>0)y=x
y
if(x>0)y=x else y=-x

Question: what does the following code do?

alldata <- c(-1.4, 5.3, 7.8, -0.3, -4.5, 9.9, 6.6, -2.5)
i=which(alldata>0)
somedata <- alldata[i]


1.6 Loops

If you are doing something for each entry in a vector it can sometimes be useful to use a loop to go through each one. The examples below are very simple, small loops, but they really become valuable when you use a loop to do a lot of calculations for which it would be too tedious to go through each one, or when each part of the loop is itself complicated. Here are some simple examples:

a=c(2,4,6,8)
b=c() # creates an empty vector
for(i in 1:4){
  b[i]=square(a[i])
}
b

Question: can you think of any other way to create b with exactly the same numbers in it?

Loops can be more complicated. E.g.:

x=c(-1,4,-9,16,-25,36)
y=c()
for(i in 1:6){
  if(x[i]>=0)y[i]=sqrt(x[i])
  if(x[i]<0)y[i]=sqrt(-x[i])
}
y

If you don’t know how long the vector is, use length():

for(i in 1:length(x)){
  if(x[i]>=0)y[i]=sqrt(x[i])
  if(x[i]<0)y[i]=sqrt(-x[i])
}


1.7 Getting help

There are several ways to get help in R:

• Type ? then the function name (if you know the function) e.g.: ?sin.

• Or use the help() function, e.g.: help(cos).

• If you don't know the function name: help.search('chi square'). In this case, it suggests chisq.test.

• Use autocomplete. If you hit tab after you have input the name of the function, it gives hints on what the arguments could be, e.g. chisq.[hit TAB here please!]

• Use Google! As R is a common letter in the alphabet as well as our favourite stats package, you may need to add CRAN into your search term. Note that some of the people who help on these sites can be a bit terse in their answers...


1.8 Graphs

In this part of the notes, we will look at doing exploratory graphs of data, using the lattice package, which is designed to show how relationships between variables change as another variable changes. Packages are collections of functions that are not loaded by default (to save memory) and are instead loaded as and when needed. Many packages are installed by default, while others need to be installed prior to their first use. To install a package called 'foobar', if you don't have it already, type install.packages('foobar'). (Note the inverted commas.) To load an already installed package into R, type library(foobar). (Note the lack of inverted commas.) So, assuming lattice is already installed, to use any of the plotting functions in it, type

library(lattice)

prior to using the function.

One particularly nice thing about lattice is that it is possible to tailor the overall graphical parameters so that all the graphs within a report have a consistent style. In these notes, we've set them to have a red colour scheme, by creating a file that we can read in with the characteristics we want. To see how we've done that, check out the supporting functions document to see the graphical parameters that we've set (there are more that we kept at default). To read this file, you need to help R find it. The simplest way to load this and other files is to arrange any files you need for a project into one folder, set that folder to be your 'working directory' (i.e. default location) and then just give the name of the file. This can be done in RStudio by going to Session, then Set Working Directory and browsing, or by calling:

setwd('/home/me/Desktop/r course')

where you replace the string with the location on your own hard disk.

You can then source the graphic parameters as follows:

source('supporting_functions.r')

We'll plot data from a cohort of patients with dengue. (These are actually synthetic data based on a cohort of patients; their details have been mixed up a bit to protect their privacy.) To read their data into R, assuming the following file is in your working directory, you can type:

denguedata=read.csv('data_Dengue_Singapore.csv')
attach(denguedata)

The attach function means that any columns in the data can be used as if they are vectors. This is the simplest way to use data in a dataset but we'll show you better ways later in the course.
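As a quick illustration of what attach() buys you (Platelet is one of the columns of this dataset), the following two commands give the same answer:

mean(denguedata$Platelet) # without attach()
mean(Platelet) # works because of attach(denguedata)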

1.8.1 Histograms

The dengue data are from patients who had dengue haemorrhagic fever (DHF) at presentation to the hospital, or who did not (but had dengue fever, DF). Let's look to see whether the distribution of platelet counts differed within these two groups. One way to do so is via histograms: we will have two because there are two groups of patients.

The lattice package has an unusual syntax. A vertical bar | character means the variable to the right should be used to stratify the graphs. So in the code below, the variable DHFpresentation (which is either yes or no) is used to make two histograms side by side, with patients on the left not having DHF and patients on the right having it. A tilde character ~ is used to indicate what goes on the x and what on the y axes: whatever appears to the right goes on the x-axis, and whatever on the left goes on the y (this notation will come up again when we cover regression models). So for this code, because we want the (log) platelet counts to go on the x-axis, and no variable on the y (just the frequencies of the histogram), we have the platelet variable to the right of the tilde and nothing to the left. We're plotting the log of the platelet count as these measurements are very right skewed.

histogram(~log10(Platelet)|DHFpresentation,xlab='log Platelet',
          strip=strip.custom(factor.levels=c('DF','DHF')))

The code also adds a label to the x-axis, and changes the text in the grey strip along the top to read DF and DHF (rather than the default, Yes and No).

[Figure: two side-by-side histograms of log10 platelet counts, panels DF and DHF; x-axis 'log Platelet', y-axis 'Percent of Total'.]

1.8.2 Density plots

An alternative to a histogram is to plot a 'density plot'. This is a non-parametric estimate of the population distribution that works by fitting very small normal distributions around each point and adding them up. Unlike a histogram it is continuous. One issue is that it slightly extrapolates beyond the range of the data, which can be a problem when some measurements are not possible (e.g. negative counts). The arguments are otherwise the same as the histogram. The function plots the density estimate along with the (jittered) points themselves (along the x-axis).

densityplot(~log10(Platelet)|DHFpresentation,xlab='log Platelet',
            strip=strip.custom(factor.levels=c('DF','DHF')))


[Figure: density plots of log10 platelet counts, panels DF and DHF; x-axis 'log Platelet', y-axis 'Density'.]

1.8.3 Box plots

Both of the previous plots are quite greedy, in that they take up quite a bit of space. Two more compact styles of graph follow. Both present a single variable's distribution in a single dimension, with another (categorical) variable in the other dimension. A boxplot displays:

• The median of the distribution, represented by a dot here;

• The lower and upper quartiles (i.e. the 25%ile and 75%ile), represented by the box;

• The extremes, excluding 'outliers', represented by the 'umbrellas' sticking out of the box;

• Any automatically determined outliers, represented by dots.

This loses some of the details of the distribution but is highly compact, for it requires almost no space in the y dimension. This allows multiple distributions to be compared easily.

The following code tells R to make the presence of DHF at presentation be the y variable (as it is to the left of the tilde) and the log platelet count be the x variable (to the right).

bwplot(DHFpresentation~log10(Platelet),ylab='DHF',xlab='log Platelet')


[Figure: box plot of log10 platelet counts by DHF status; x-axis 'log Platelet', y-axis 'DHF' (No/Yes).]

1.8.4 Strip plots

An alternative is a strip plot. This shows the values directly. For a large number of points, as here, this can be hard to visualise without making the points semi-transparent (as we did in the supporting functions script). It can be easier to visualise by jittering the categorical variable, i.e. adding a small amount of random noise to each value to make the points more visually distinct.

stripplot(DHFpresentation ~ log10(Platelet),jitter=TRUE,ylab='DHF',xlab='log Platelet')

[Figure: strip plot of log10 platelet counts by DHF status; x-axis 'log Platelet', y-axis 'DHF' (No/Yes).]

1.8.5 Bar plots

To plot a bar chart or bar plot in lattice, you need the x and y co-ordinates to be calculated already (i.e. the function does not calculate them for you). For instance, to plot bars for the median platelet count for those with and without DHF at presentation, we need to create those medians, store them somewhere, and associate the levels (e.g. DHF and DF) with those medians. The following code creates a new dataset with these numbers:

newdata=data.frame(medianplatelets=c(median(Platelet[DHFpresentation=='Yes']),
                                     median(Platelet[DHFpresentation=='No'])),
                   disease=c('DHF','DF'))

This code then creates the plot. We specify the origin for the bars to be 0 (so they start at 0 rather than at some arbitrary point).

barchart(disease~medianplatelets,data=newdata,origin=0,xlab='Median platelet count')

[Figure: bar chart of median platelet count for DF and DHF; x-axis 'Median platelet count'.]

1.8.6 Scatter plots

If you have two continuous variables and want to visualise their joint distribution, one of the simplest ways is to create a scatter plot (xyplot in lattice). The following code plots log platelet and albumin measurements for patients with and without DHF during their illness. Again, the tilde tells the computer what variables go on the x and y axes, while the bar tells R to plot separate panels for the variable on the right.

xyplot(log10(Platelet)~log10(Albumin)|DHFever,xlab='log Albumin',ylab='log Platelets',
       strip=strip.custom(factor.levels=c('DF','DHF')))


[Figure: scatter plots of log10 platelets against log10 albumin, panels DF and DHF; x-axis 'log Albumin', y-axis 'log Platelets'.]


1.9 Challenge: SARI epidemic curve

Let us analyse fictitious data on a severe acute respiratory infection (SARI) outbreak in Bangkok. This dataset contains a unique identification code, age, and sex of each notified case during the SARI outbreak. Also included are their latitude and longitude, the coordinates where the case occurred, and the dates of symptom onset and of when the case was reported.

1. Read the data into R. The data can be found in the bangkok_SARI_outbreak.csv file. Upon reading in new data from .csv files or .txt files, R will store the data in an object called a data frame.

sari = read.csv('bangkok_SARI_outbreak.csv', as.is = TRUE)

2. Take a quick look at the data. Use the head() function to get the first 6 rows of the data and the variable names.
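A minimal sketch of this step:

head(sari) # the first 6 rows of the data frame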

3. Plot the epidemic curve of the outbreak using the dates of symptom onset.

You may find the code below useful to get the day of the outbreak for each case:

sari$onsetdate = as.Date(sari$onset, format = '%Y-%m-%d')
sari$day = as.numeric(sari$onsetdate - min(sari$onsetdate))+1

(Hint: upon getting the day of outbreak for each case, plot the histogram of the day of outbreak to get a quick epidemic curve.)


Chapter 2

Basic tests in R

This session will cover various simple statistics and tests, from summary statistics, to confidence intervals, and hypothesis tests.

2.1 Measures of central tendency

When most people think of statistics, they think first of averages (and some never progress beyond averages). There are three main definitions of average:

• The mean: if $x_i$ is the $i$th measurement of a variable $x$, then the sample mean of $x$ is $\bar{x} = \sum_{i=1}^{n} x_i / n$. The mean is sensitive to outliers (as the old joke goes: three men are in a bar when in walks Bill Gates; the men cheer, for their average income has just shot up!) and may not be a good summary of a skewed distribution.

• The median: if you sorted the $x_i$s from smallest to largest, and selected the value in the middle (with half to the left and half to the right), that would be the median. If there is an even number of $x$s, the median is the mean of the two most central values. The median is robust to outliers, represents skewed distributions well, and is close to the mean for distributions that are not very skewed.

• The mode: this is the most common value of $x$ in the dataset. It is robust to outliers and skewed distributions but can be erratic for small sample sizes. For continuous measurements, the mode strictly should not exist, for there should not be any repeated data, although measurements may be rounded which may give some ties. An alternative to the plain mode of the sample is to estimate the distribution of the data and take the mode of the estimated distribution.

The R code to generate the mean of a sample is simple, but that for a median is a little non-intuitive and that for a mode is rather non-intuitive. We'll be working with clinical data from dengue patients from a Singapore hospital for this session. The following code calculates the mean and median of their platelet counts (in $10^9$/l).

denguedata=read.csv('data_Dengue_Singapore.csv')
attach(denguedata)

## The following objects are masked from denguedata (pos = 3):
##
##     Age, Albumin, ALT, Arthralgia, AST, Cr, DHFever, DHFlater,
##     DHFpresentation, Diarrhea, Headache, Hematocrit, K, Lethargy,
##     Lymphocytes, Myalgia, Na, Nausea, Neutrophils, Platelet, Rash,
##     Sex, Urea, Vomiting, WCC

mean(Platelet)
median(Platelet)

The attach() function makes all of the variables that appear in the data frame denguedata now appear in the workspace. If you don't use attach(), you would need to type mean(denguedata$Platelet), etc. There is no built-in mode function. We can find the most common value in the dataset using the following commands:

uP=unique(Platelet)
nuP=0*uP
for(i in 1:length(uP))nuP[i]=sum(Platelet==uP[i])
uP[which.max(nuP)]

This is a bit tricky to remember, so we've put the code in a function which you can load and use with the following commands:

source('supporting_functions.r')
mode(Platelet)

If you wish to obtain a more robust, estimated mode, you can use the emode() function, also in the supporting functions script.

source('supporting_functions.r')
emode(Platelet)

You will observe that the four measures of central tendency are all very different for this variable, because the platelet count is very right skewed. In contrast, the mean, median and estimated mode are much more similar on a logarithmic scale (though the sample mode is still rather far off):

mean(log10(Platelet))
median(log10(Platelet))
mode(log10(Platelet))
emode(log10(Platelet))

If the variable is coded up as 0s and 1s, the mean of the sample is just the proportion that are 1s. You can thus use the mean() function to estimate proportions. For instance:

mean(Platelet<100)

returns the proportion of patients with a platelet count less than 100.

2.2 Measures of spread

Statistics is really not about what is typical, but about deviation from what is typical. There are several ways to measure deviation:


• Variance: this is the mean of the square difference between each measurement and the overall mean, i.e.

$$\mathrm{var}(x) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}.$$

We usually divide by $n-1$ as this results in an unbiased estimate (rather than dividing by $n$, though for large $n$ the difference between the two is negligible). One problem with the variance is that the units are the units of $x$, squared (e.g. m² for height), which makes it hard to interpret.

• Standard deviation: this is just the square root of the variance. We usually denote it as $s$ (or $s_x$ or $\mathrm{SD}(x)$ if there might be ambiguity about what the variable is). The units are the original units that $x$ is measured in.

• Range: this is the maximum of the sample less the minimum, i.e. $\max(x) - \min(x)$.

• Inter quartile range: this is the third quartile of the sample less the first quartile.

The range is very non-robust and just needs a single measurement to change it substantially. Both the standard deviation and the variance depend on the sample mean being a good estimate of central tendency, which may not be the case if the data are quite skewed. The inter-quartile range is relatively more robust, but is harder to interpret.

var(Platelet)
sd(Platelet)
max(Platelet)-min(Platelet)
quantile(Platelet,.75)-quantile(Platelet,.25)

Again, we've created simpler implementations of the range and inter-quartile range for you. Here are examples of their use:

source('supporting_functions.r')
ranger(Platelet)
iqr(Platelet)

2.3 Measures of correlation

There are two common measures of how closely related two variables are: their covariance and their correlation. The covariance is analogous to the variance of a single variable:

$$\mathrm{var}(x) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})}{n-1}, \quad \text{and} \quad \mathrm{cov}(x,y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}.$$

As with the variance, it has an unfortunate scale, i.e. the units of x times the units of y, which makes it hard to interpret.

The correlation gets around this by being scale-free: it is the covariance divided by the standard deviation of both variables ($s_x$ and $s_y$, respectively):

$$\mathrm{cor}(x,y) = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})}{s_x}\,\frac{(y_i - \bar{y})}{s_y}.$$


This falls on the range $[-1, 1]$, with a correlation of 1 meaning a perfect positive linear relationship between $x$ and $y$ and $-1$ being a perfect negative linear relationship.

Warning! A correlation of 0 could mean no relationship, but it could also result from a strong non-linear relationship, so if you calculate the correlation coefficient, you should also always plot the two variables (e.g. using xyplot in lattice).

Here is how to calculate these variables in R:

cov(Platelet,Na)
cor(Platelet,Na)

Note that we often call the correlation $r$, and $r^2$ (or $R^2$) is a measure of the strength of linear relationship between $x$ and $y$, and between $y$ and $x$. In particular, it measures the proportion of the variance in $y$ (or $x$) that can be explained by $x$ (or $y$).
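To see this numerically for the platelet and sodium measurements used above:

r = cor(Platelet,Na)
r^2 # the proportion of variance in one variable explained by the other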

2.4 Confidence intervals

Suppose that we claim to be psychic, and to prove it we get a French suited deck of cards, ask you to draw cards from the deck and then we "psychically see" the suit of the card (there are four suits: hearts, diamonds, spades and clubs). We do this 20 times, and correctly "see" 6 of the suits (6/20 = 30% > 25% you would expect due to chance). Does this experiment convince you we are psychic? Probably not: you should recognise that if we did the experiment again, we might get 4, or 7, or 2 cards correct, and that 6/20 is consistent with a range of different success probabilities (including 25%, i.e. pure guessing). We try to represent plausible values of the population parameter (e.g. the actual probability of guessing correctly at cards) that are consistent with the data by providing a range. If the range of plausible values is too narrow, there is a high risk that we will embarrass ourselves if further data are collected (e.g. if, after 6/20 correct guesses, we say we think the probability is between 29.5% and 30.5%, we are likely to be found to be wrong if we play the game longer); if the range is too wide, the range is useless (if we say the probability is between 0.1% and 99.9%, no one would take it seriously). To compromise between these extremes, we usually set the width of the intervals so that in a high percentage of intervals, the population level parameter is inside. By tradition, we set the percentage to be 95%, and call the interval thus obtained a 95% confidence interval.

The simplified view of how to construct a confidence interval is to take the best guess based on the sample, and add and subtract about 2 standard errors (the standard error will have a formula based on the statistic you are doing it for). We will see these in confidence intervals for a proportion, mean or correlation coefficient. However, the formula can be more complex, as for instance in a confidence interval for a median.

2.4.1 For a proportion

If you have observed x successes out of n attempts, the simplest 95% confidence interval has the following form:

$$p = x/n, \qquad \mathrm{CI} = p \pm 1.96\sqrt{\frac{p(1-p)}{n}}.$$

The 1.96 value comes from a normal distribution, because if you repeated the study many times in the same population and did a histogram of the $p$s, it would look like a normal distribution.

We've made a function to calculate the confidence interval simply for you, which you can use if you call the supporting functions file. Here is an example, based on the recent Sanofi dengue vaccine trial conducted across ASEAN. The d variables are the numbers of children with diagnosed infections in the vaccine and placebo arms.


n_vaccine = 6710
n_placebo = 3350
d_vaccine = 117
d_placebo = 133
propCI(d_vaccine,n_vaccine)
propCI(d_placebo,n_placebo)
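If you are curious what happens inside, here is a minimal sketch that follows the formula above; the propCI() supplied in supporting_functions.r may differ in its details (propCI2 is our hypothetical version):

propCI2 = function(x,n){
  p = x/n # point estimate of the proportion
  se = sqrt(p*(1-p)/n) # standard error of a proportion
  c(estimate=p, lower=p-1.96*se, upper=p+1.96*se)
}
propCI2(d_vaccine,n_vaccine) # should be close to propCI()'s output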

2.4.2 For a mean

For a mean, the confidence interval formula depends on the sample size. If the sample mean is $\bar{x}$, standard deviation $s$, and sample size $n$, the confidence interval is

$$\bar{x} \pm t_{n-1}\frac{s}{\sqrt{n}}.$$

The $t_{n-1}$ is an appropriate value from a $t$ distribution with $n-1$ degrees of freedom. The $t$ distribution looks a lot like a normal distribution but with a fatter tail, and so the value $t_{n-1}$ takes will be a bit bigger than the 1.96 used for the proportion confidence interval equation. However, again, you can use the function we have written to calculate this without worrying about the formula:

p1=Platelet[DHFever=='Yes']
p2=Platelet[DHFever=='No']
meanCI(p1)
meanCI(p2)
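Again, as a sketch of what such a function might do under the hood (the course's meanCI() may differ; meanCI2 is our hypothetical version):

meanCI2 = function(x){
  n = length(x)
  tval = qt(0.975, df=n-1) # the t value that replaces 1.96
  mean(x) + c(-1,1)*tval*sd(x)/sqrt(n)
}
meanCI2(p1)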

2.4.3 For a median

The way the confidence interval for the median is derived is interesting because it is very different from that of a proportion or a mean. It involves sorting the data points and selecting from the list two that are 'quite' near the median. Exactly how near is derived from properties of a binomial distribution. Again, you can use our pre-built function:

p1=Platelet[DHFever=='Yes']
p2=Platelet[DHFever=='No']
medianCI(p1)
medianCI(p2)
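For the curious, one common construction based on order statistics is sketched below; the course's medianCI() may refine this (medianCI2 is our hypothetical version):

medianCI2 = function(x){
  n = length(x)
  s = sort(x)
  # choose two order statistics 'quite' near the median, via the binomial(n, 1/2) distribution
  c(lower=s[qbinom(0.025,n,0.5)], upper=s[qbinom(0.975,n,0.5)+1])
}
medianCI2(p1)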

2.4.4 For a correlation

The confidence interval for a correlation coefficient is derived using a transformation due to Fisher that is approximately normally distributed. Again, you can escape the equation itself and just call our code:

correlationCI(Platelet,Na)
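As a sketch of the calculation, assuming the usual Fisher z approach (correlationCI2 is our hypothetical version; the course's correlationCI() may differ):

correlationCI2 = function(x,y){
  z = atanh(cor(x,y)) # Fisher's z transformation of r
  se = 1/sqrt(length(x)-3) # approximate standard error of z
  tanh(c(lower=z-1.96*se, upper=z+1.96*se)) # transform back to the correlation scale
}
correlationCI2(Platelet,Na)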


2.5 Basic hypothesis tests

In addition to estimating unobservables and quantifying our uncertainty in them, we often wish to quantify how much evidence we have that one or more parameters fall in a particular subset of the parameter space. For instance, in a clinical trial in which participants are randomised to one of two arms (control $C$ vs. intervention $I$) and develop disease or not, we might use a model $X_C \sim \mathrm{Bin}(n_C, p_C)$, $X_I \sim \mathrm{Bin}(n_I, p_I)$ for the number of diseased participants on the two arms. While it is interesting to estimate $p_C$ and $p_I$ and their associated uncertainties, we also want to know if $p_C < p_I$ (i.e. the intervention increased the risk of disease), $p_C = p_I$ (i.e. the intervention has no effect relative to the control) or $p_C > p_I$ (the intervention reduced the risk of disease), and how much evidence there is to support these scenarios. We do this using a basket of techniques called null hypothesis significance testing.

We do hypothesis testing by setting up two hypotheses (not the three in the opening paragraph) relating to two partitions of the domain that the true parameters inhabit, $\theta \in \Theta_0$ and $\theta \notin \Theta_0$, assuming that the parameters fall within one of these hypothesised sets, and trying to find fault with the assumption, by way of evidence that is sufficiently inconsistent with that set to make it implausible that the true parameters reside there. If we find evidence enough, we decide to reject the hypothesis we had assumed true and accept the other.

We call the hypothesised parameter set that we assume is true in conducting the test the null hypothesis, $H_0$, and its converse the alternative hypothesis, $H_1$. For instance, for the aforementioned trial, the hypotheses are:

$$H_0: p_C = p_I \in [0,1] \quad \text{vs.} \quad H_1: p_C \neq p_I,\ (p_C, p_I) \in [0,1]^2.$$

For an example, consider the phase III Sanofi trial of their tetravalent dengue vaccine in various ASEAN states. In this trial, 6710 children were vaccinated (i.e. on the intervention arm, $I$), of whom 117 were infected with laboratory-confirmed dengue during the course of the study, while 3350 children were given a placebo (i.e. control $C$), of whom 133 were infected.

A null hypothesis significance test proceeds by taking a single, scalar statistic (a single number that is a function of the data) that we will call the test statistic. This is selected so that its value is suggestive of which region of the parameter space the truth falls in. For instance, for the two-armed dengue vaccine trial, the test statistic might be the ratio of infection rates in the two arms (this isn't actually the best test statistic and we normally would use a more complicated one), and we would reject the null hypothesis if the ratio fell too far from one.
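For illustration, the observed risks and their ratio for the trial numbers above can be computed directly:

risk_I = 117/6710 # risk on the vaccine (intervention) arm
risk_C = 133/3350 # risk on the placebo (control) arm
risk_I/risk_C # observed risk ratio, about 0.44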

To decide whether the test statistic is 'too far' from the sampling distribution if the null hypothesis were true, we work out the probability of it being that far or further, which we will call the p-value. We don't work out the probability just of observing that test statistic, because with large datasets (either with a large number of individuals or a complex structure), the chance of any particular combination is very unlikely. For instance, for the dengue vaccine trial, if the risk were really 3% in both the vaccine and the placebo arms, the chance that the risk ratio would equal one exactly is close to $10^{-90}$. Instead, we aggregate all the more extreme scenarios we might have observed, so that a dataset that is as consistent as is possible with the null hypothesis would get a p-value of 1, and an extreme dataset would get a p-value close to 0.

There are different tests that are commonly taught in statistics 101 classes and all can be implemented in R, as follows.

2.5.1 Student’s t-tests

If you have two groups of individuals, each with a continuous measurement that is not too far from a normal distribution, you can do a two-sample t-test to assess whether the means of the two groups could be the same. Thinking back to the log platelet counts presented in session 1, the distribution was approximately normal, so we can set up a hypothesis test of whether the log platelets for DF patients differ from those of DHF patients using a t-test. There are two approaches to using this function:


Approach 1: two separate vectors. In this version, we have two vectors, which could very well be of different lengths, which we feed into the t.test() function:

p1=log10(Platelet[DHFever=='Yes'])
p2=log10(Platelet[DHFever=='No'])
t.test(p1,p2)

##
##  Welch Two Sample t-test
##
## data:  p1 and p2
## t = -7.4175, df = 559.01, p-value = 4.458e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2225354 -0.1293523
## sample estimates:
## mean of x mean of y
##  1.588267  1.764211

This conducts a variant of Student's test named after Welch which allows the variances of the two distributions not to be equal. Output from the function includes the test statistic, the degrees of freedom (in this case, non-integer due to the use of Welch's method), the p-value, a 95% confidence interval for the difference, and the means of the two distributions.

Approach 2: one vector of data, one vector of labels. In this version, we have a vector containing all the log platelet counts and another indicating the disease status, separated by a tilde:

t.test(log10(Platelet)~DHFever)

##
##  Welch Two Sample t-test
##
## data:  log10(Platelet) by DHFever
## t = 7.4175, df = 559.01, p-value = 4.458e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1293523 0.2225354
## sample estimates:
##  mean in group No mean in group Yes
##          1.764211          1.588267

One can see that the results are the same.

If you have two measurements per individual and wish to test whether the means are the same, you can use a paired t-test. The same function is used but with an additional paired=TRUE argument. Although there are no paired measurements in the dengue data that make sense to test, if there were platelet counts before and after treatment, the code would look like this:

t.test(plateletbefore,plateletafter,paired=TRUE)


2.5.2 Non-parametric tests

If your data are numerical, without too many ties, and not approximately normally distributed, you may prefer to do a non-parametric test instead of a t-test. Tests based on ranks have a variety of names, including the Mann–Whitney U test, the Mann–Whitney–Wilcoxon test, the Wilcoxon rank-sum test, the Wilcoxon–Mann–Whitney test (all the same test!), and the Wilcoxon signed rank test. These are called in the same way as the t-test above, but substituting wilcox.test for t.test. So, for example:

p1=log10(Platelet[DHFever=='Yes'])
p2=log10(Platelet[DHFever=='No'])
wilcox.test(p1,p2)

##
##  Wilcoxon rank sum test with continuity correction
##
## data:  p1 and p2
## W = 49492, p-value = 1.15e-12
## alternative hypothesis: true location shift is not equal to 0

Or

wilcox.test(log10(Platelet)~DHFever)

##
##  Wilcoxon rank sum test with continuity correction
##
## data:  log10(Platelet) by DHFever
## W = 93287, p-value = 1.15e-12
## alternative hypothesis: true location shift is not equal to 0

Or (if the data existed)

wilcox.test(plateletbefore,plateletafter,paired=TRUE)

Note that the Wilcoxon tests yield the same results regardless of (monotonic) transformations, so not using the log transformation gives the following:

wilcox.test(Platelet~DHFever)

##
##  Wilcoxon rank sum test with continuity correction
##
## data:  Platelet by DHFever
## W = 93287, p-value = 1.15e-12
## alternative hypothesis: true location shift is not equal to 0

Note that the commonly understood interpretations of the null and alternative hypotheses for these tests are incorrect: it is often thought that this is a test of whether the medians are equal versus them not being equal, but in fact the null is that the two distributions are the same while the alternative is that one stochastically dominates the other.


2.5.3 Correlation test

To test whether two variables are correlated or not, one can use the cor.test() function:

cor.test(Platelet,Albumin)

##
##  Pearson's product-moment correlation
##
## data:  Platelet and Albumin
## t = -0.20271, df = 794, p-value = 0.8394
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.07664364  0.06232559
## sample estimates:
##          cor
## -0.007193762

This function by default tests Pearson's correlation, but Spearman's correlation can be used as follows:

cor.test(Platelet,Albumin,method='spearman')

##
##  Spearman's rank correlation rho
##
## data:  Platelet and Albumin
## S = 67459000, p-value = 1.926e-08
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho
## 0.1974894

Note the very different results of the two tests! The distribution of the two variables can be seen in a scatter plot:

xyplot(Platelet~Albumin,xlab='Albumin',ylab='Platelets')

[Figure: scatter plot of Platelets against Albumin.]


The standard (Pearson's) correlation measures the strength of linear association, but the scatter plot indicates a clearly non-linear relationship. In contrast, Spearman's variant uses the ranked values, visualised as follows:

xyplot(rank(Platelet)~rank(Albumin),xlab='Ranked albumin',ylab='Ranked platelets')

[Figure: scatter plot of ranked platelets against ranked albumin.]

2.5.4 Chi-squared test

A chi-squared test allows you to take a contingency table and assess whether the two variables that make up the table are related or not. As with the other tests above, the syntax is quite simple. The first step is to set up the contingency table. Let's use DF/DHF status and Sex as the two variables of interest:

contable = table(DHFever,Sex)
print(contable)

##        Sex
## DHFever Female Male
##     No     149  374
##     Yes    109  164

This then is fed into a chisq.test() call:

chisq.test(contable)

##
##  Pearson's Chi-squared test with Yates' continuity correction
##
## data:  contable
## X-squared = 10.195, df = 1, p-value = 0.001408

The results indicate a significant difference, but it is not immediately clear what this difference is, and so to follow up you would want to get confidence intervals for the risk of DHF in the two groups or of the risk ratio.
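For example, assuming the propCI() helper from the supporting functions script is loaded, the risk of ever having DHF in each sex can be estimated from the table counts:

propCI(109, 109+149) # risk of DHF among females
propCI(164, 164+374) # risk of DHF among males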

It is common statistical folklore that when a chi-squared test is inappropriate, Fisher’s exact test should be used instead:


fisher.test(contable)

##
##  Fisher's Exact Test for Count Data
##
## data:  contable
## p-value = 0.001391
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.4357820 0.8259849
## sample estimates:
## odds ratio
##  0.5998013

An alternative that is actually better is Barnard's exact test, but this is a bit slow for large datasets like the dengue one (note you may need to install.packages('Barnard') prior to using this):

# install.packages("Barnard")
library(Barnard)
barnard.test(contable[1,1],contable[2,1],contable[1,2],contable[2,2])

2.6 Challenge: dengue in Singapore

In this challenge, we are going to explore the dengue data in Singapore.

1. Earlier we found that there is a significant difference in the log10 platelet counts of dengue cases who have ever had DHF or not. Is there also a significant difference in the log10 platelet counts of those who had DHF at presentation and those who did not?

2. How is this result reflected visually? (Hint: boxplot from Chapter 1)

3. Nausea, myalgia, and rash are among the several symptoms associated with dengue fever. Perform the appropriate hypothesis tests to study if there is a significant relationship between the log10 platelet counts of dengue cases and the presence of these symptoms.


Chapter 3

Linear and Logistic Regression

3.1 Simple Linear Regression

When presented with some response and explanatory variables, we may wonder whether there is a linear relationship between them. Linear regression analysis is the art and science of fitting straight lines to patterns of data. It is the statistical modelling that describes the relationship between the response and explanatory variables with a straight line.

The simple linear regression assumes the existence of a linear relationship between the response and an explanatory variable, which is disturbed by some random error. Mathematically, $x$ and $y$ represent the explanatory and response variables, respectively. $\varepsilon$ is the error, assumed to be 0 on average. Hence, for each value of $x$ the corresponding $y$ is a random variable of the form:

$$y = \beta_0 + \beta_1 x + \varepsilon,$$

where $\beta_0$ is the intercept parameter, the mean response when $x = 0$, and $\beta_1$ is the slope parameter, the change in $y$ when $x$ changes by one unit. The aim is to fit the best possible straight line to the data, thus finding the 'line of best fit'. This is achieved by finding the line and its corresponding $\beta_0$ and $\beta_1$ that minimise the sum of squared deviations from each data point to the line.

In multiple linear regression, we just have more than one explanatory variable and we are interested to know how they affect the response variable. It is conceptually identical to simple linear regression. This statistical modelling of predicting one variable from a group of others rests on the following assumptions:


Linear Regression Assumptions

1. Linearity and additivity of the relationship between dependent and independent variables,

2. Statistical independence of the errors (in the case of time series data, no correlation between consecutive errors, i.e. no autocorrelated errors),

3. Homoscedasticity or constant variance of the errors, and

4. Normality of the error distribution.

Regression diagnostic plots are simple ways to see if any of the assumptions are violated. They include (1) residuals against explanatory variable, (2) residual plot against index of dataset, (3) residuals against fitted values, (4) leverage plots, and (5) QQ plots.

3.2 Linear Regression in R

We'll be using data from a cohort of patients, some of whom have acute myocardial infarction (AMI). (These are actually synthetic data based on a cohort of patients; their details have been mixed up a bit to protect their privacy.) To read their data into R, assuming the following file is in your working directory, you can type:

ami = read.csv('ami.csv',as.is=TRUE)

Let's take a look at the data and some basic descriptive statistics. The head() function allows you to take a look at the first six rows of the data. ami is a data frame and we can access its variables by using '$'.

head(ami)
ami$Status

'Status' is a variable in ami denoting the AMI status of the subject: 0 means the subject did not have AMI when the information was collected and 1 means the subject had AMI when the information was collected. The use of [] square brackets allows for sub-setting of the data frame and we can also use it to access the variable. For example, 'Status' is the first column of the dataset:

ami[,1] # dataset[rows, columns]
ami[,'Status']

The table() function builds a contingency table of the counts at each combination of factor levels of the discrete variable.

table(ami$Status, useNA = 'ifany')
table(ami$sex, useNA = 'ifany')
table(ami$Diabetes, useNA = 'ifany')

3.2.1 Date variables in R

R is able to analyse calendar dates. The date variable needs to be converted to a form that R can understand: as.Date() and as.POSIXlt() are some examples of date-time conversion functions in R. The format argument needs to be carefully specified. After successfully converting the dates, we can find the difference in the number of days (or hours, weeks and so on) between a pair of dates.


head(ami$DateofSpec)
as.Date(ami$DateofSpec,format = '%Y-%m-%d')
ami$DateofSpec = as.Date(ami$DateofSpec,format = '%Y-%m-%d')
head(ami$DateofEvent)
as.Date(ami$DateofEvent,format = '%d-%b-%y')
ami$DateofEvent = as.Date(ami$DateofEvent,format = '%d-%b-%y')

Just subtracting the variables returns the time differences in days. The difftime() function allows you to specify the time interval for the difference in time:

ami$DateofEvent - ami$DateofSpec
difftime(ami$DateofEvent, ami$DateofSpec,units = 'days')


For this linear regression exercise, let's try to understand the association of BMI and the other variables. First, examine the univariate linear regression. The lm() function is used to fit linear models in R and pairs() produces a matrix of scatterplots.

pairs(~ BMI + HbA1c + TG + dha + age_spec, data=ami)

[Figure: matrix of pairwise scatter plots of BMI, HbA1c, TG, dha and age_spec.]

# pairs(~ BMI + HbA1c + TotalChol + LDLc + HDLc + TG, data=ami)


lm() fits the simple linear regression and the estimates are stored in an object called fit. The summary() command gives us a detailed output (estimates, standard errors and p-values) of the model and if we run plot(fit) we get some diagnostic plots for linear regression. In the example below, we build a simple linear model of BMI (response variable) against the age of the subject when measurements were taken (explanatory variable):

# to facilitate with plotting the diagnostic plots later essentially
layout(matrix(c(1,2,3,4),2,2))
fit = lm(ami$BMI ~ ami$age_spec)
summary(fit)

##
## Call:
## lm(formula = ami$BMI ~ ami$age_spec)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.9586 -1.8927 -0.0479  1.5146 18.4209
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)   25.37864    0.67536  37.578  < 2e-16 ***
## ami$age_spec  -0.03541    0.01011  -3.504 0.000472 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.099 on 1462 degrees of freedom
## Multiple R-squared: 0.008329, Adjusted R-squared: 0.007651
## F-statistic: 12.28 on 1 and 1462 DF, p-value: 0.0004717

Some diagnostic plots for linear regression:

plot(fit)


Figure 3.1: Residuals against fitted values (top left) and standardised residuals against fitted values (top right) should be random scatter in space and they highlight potential outliers. The normal Q-Q plot (bottom left) tells you if the errors deviate from normality. Residuals against leverage (bottom right) identify points which may have large influence that may not be outliers.


3.2.2 Coefficient of determination

$R^2$ is the percentage of total response variation explained by the explanatory variable. A low $R^2$ indicates that not much of the variation in the data can be explained by the regression model.
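The $R^2$ is reported near the bottom of the summary() output; it can also be extracted directly:

summary(fit)$r.squared # coefficient of determination of the fitted model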

3.2.3 Regression with categorical variables

The function factor() is used to inform R that sex is a categorical variable.

# when regressing factor variables
fit = lm(ami$BMI ~ factor(ami$sex))
summary(fit)

##
## Call:
## lm(formula = ami$BMI ~ factor(ami$sex))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.7744 -1.8696 -0.0473  1.5194 18.8029
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)          23.05664    0.13583 169.749   <2e-16 ***
## factor(ami$sex)Male  -0.04297    0.16960  -0.253      0.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.112 on 1462 degrees of freedom
## Multiple R-squared: 4.391e-05, Adjusted R-squared: -0.00064
## F-statistic: 0.06421 on 1 and 1462 DF, p-value: 0.8

Practice: Take some time before the next section to examine the relationship between BMI and other variables such as physical activity, HbA1c, high blood pressure and so on...


3.3 Multivariate Linear Regression

Building a multivariate linear regression model is easy: just add the variables that you want to the model as an extension of the simple linear regression. Which variables should one add into the model? It is essential to consider clinical importance vs. statistical significance.

fit1 = lm(ami$BMI ~ ami$age_spec + factor(ami$sex)
          + ami$HbA1c + ami$TG + ami$HDLc
          + factor(ami$HBP) + factor(ami$Diabetes)
          + factor(ami$phy_act) + factor(ami$smoking))

summary(fit1)
#plot(fit1)

fit2 = lm(ami$BMI ~ ami$age_spec + factor(ami$sex)
          + ami$HbA1c + ami$TG + ami$HDLc
          + factor(ami$HBP) + factor(ami$Diabetes))

summary(fit2)
# plot(fit2)
# layout(matrix(c(1),1,1))

3.3.1 Analysis of Variance (ANOVA)

The anova() function in R computes analysis of variance (or deviance) tables for one or more fitted models. For categorical variables with more than two outcomes, we need to interpret the ANOVA p-value, which assesses how much of the variance in the response has been explained by the variable. Some prefer to rely on the ANOVA table to obtain the p-values for any variables.

anova(fit1)

## Analysis of Variance Table
##
## Response: ami$BMI
##                        Df  Sum Sq Mean Sq F value    Pr(>F)
## ami$age_spec            1    93.9   93.88 10.9729 0.0009483 ***
## factor(ami$sex)         1     9.4    9.36  1.0935 0.2958732
## ami$HbA1c               1   341.9  341.85 39.9563 3.490e-10 ***
## ami$TG                  1   116.4  116.36 13.6001 0.0002349 ***
## ami$HDLc                1   341.7  341.68 39.9364 3.525e-10 ***
## factor(ami$HBP)         1   357.5  357.53 41.7887 1.404e-10 ***
## factor(ami$Diabetes)    1     7.0    7.00  0.8179 0.3659450
## factor(ami$phy_act)     2    50.0   24.98  2.9200 0.0542647 .
## factor(ami$smoking)     3   248.4   82.79  9.6764 2.539e-06 ***
## Residuals            1391 11901.0    8.56
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Furthermore, ANOVA allows us to compare models too.


anova(fit2,fit1)

## Analysis of Variance Table
##
## Model 1: ami$BMI ~ ami$age_spec + factor(ami$sex) + ami$HbA1c + ami$TG +
##     ami$HDLc + factor(ami$HBP) + factor(ami$Diabetes)
## Model 2: ami$BMI ~ ami$age_spec + factor(ami$sex) + ami$HbA1c + ami$TG +
##     ami$HDLc + factor(ami$HBP) + factor(ami$Diabetes) + factor(ami$phy_act) +
##     factor(ami$smoking)
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)
## 1   1396 12199
## 2   1391 11901  5    298.33 6.9738 1.914e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.3.2 Model Selection

Stepwise regression can be done in R using stepAIC(). It performs stepwise model selection by exact AIC:

library(MASS)
fitstep = lm(BMI ~ age_spec + factor(sex) + HbA1c + TotalChol
             + LDLc + HDLc + factor(phy_act) + elaidic_trans
             + linoleic + alpha_linolenic + arachidonic + epa
             + dha + factor(HBP) + factor(Diabetes)
             + factor(smoking), data=ami)
step = stepAIC(fitstep, direction="both")
anova(step)


It is possible that during the 'one-at-a-time' procedure of dropping or adding variables, we may miss the optimal model, and we run into multiple testing issues. Model selection methods should only be used as a guide. Knowledge of the study area is a good way to validate the model. Other methods such as the best subsets regression procedure allow us to search through all possible models (given the variable list) and find the best model(s) according to a certain criterion such as AIC, the adjusted coefficient of multiple determination and more.
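As a sketch, best subsets selection is available in, for example, the leaps package (not otherwise used in this course, so you may need to install it first); the variable list here is only an illustration:

# install.packages('leaps')
library(leaps)
best = regsubsets(BMI ~ age_spec + HbA1c + TG + HDLc + factor(HBP), data=ami)
summary(best) # which variables enter the best model of each size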

Practice: Explore other multivariate linear models.


3.4 Logistic Regression

For the logistic regression (also logit regression) exercise, let's try to understand the association of Type 2 Diabetes and the other variables. A type of generalised linear model, logistic regression is used to model dichotomous outcome variables. In this model the log odds of the outcome is modelled as a linear combination of the predictor variables.

The glm() function in R is used to fit generalised linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution. The variable 'Diabetes' is a dichotomous variable, with values 'Yes' and 'No'. We can create a new variable called t2d which takes values 1, for yes, and 0, for no:

ami$t2d = ifelse(ami$Diabetes == "Yes",1,0)
# ifelse() function: if "Yes" assign 1, else assign 0

The function below will read in the logistic regression model and create meaningful outputs like the odds ratio (OR) and 95% confidence interval for the OR. It is a user-defined function which will come in handy later.

getLogisiticORCI=function(FIT,sf=2){
  ms = summary(FIT)
  oddsratio = exp(ms$coefficients[-1,1])
  lowerci = exp(ms$coefficients[-1,1]-1.96*ms$coefficients[-1,2])
  upperci = exp(ms$coefficients[-1,1]+1.96*ms$coefficients[-1,2])
  pvalue = ms$coefficients[-1,4]
  results = data.frame(OddsRatio = round(oddsratio,sf),
                       CI = paste0("(",round(lowerci,sf),"-",round(upperci,sf),")"),
                       pvalue = round(pvalue,sf))
  results$pvalue[results$pvalue<0.01] = ' <0.01'
  return(results)
}


3.4.1 Fitting univariate logistic regression models

Suppose we want to understand the association of Type 2 Diabetes (response variable) and the age of the subject when measurements were taken (explanatory variable); we can fit a logistic regression model by specifying the family of the distribution to be 'binomial':

fitglm = glm(ami$t2d ~ ami$age_spec, family = 'binomial')
summary(fitglm)

We have to be more cautious when we perform logistic regression with a categorical explanatory variable. First, explore the two-way contingency table of the response and the categorical explanatory variable to make sure there are no cells with zero counts. Then proceed with the regression, using the factor() function for the categorical explanatory variable.

table(ami$t2d, ami$sex)
fitglm = glm(ami$t2d ~ factor(ami$sex), family = 'binomial')
summary(fitglm)

3.4.2 Multivariate Logistic Regression

Similar to multivariate linear regression, to build a multivariate logistic regression model, just add the variables that you want to the model. The only difference from linear regression is that the family is specified as ‘binomial’.

fitglm1 = glm(ami$t2d ~ ami$age_spec + factor(ami$sex) + ami$HbA1c
              + ami$LDLc + factor(ami$HBP), family = 'binomial')

# summary(fitglm1)
getLogisiticORCI(fitglm1)

##                     OddsRatio          CI pvalue
## ami$age_spec             1.01 (0.99-1.04)   0.35
## factor(ami$sex)Male      0.78 (0.54-1.14)    0.2
## ami$HbA1c                1.99   (1.8-2.2)  <0.01
## ami$LDLc                 0.69 (0.55-0.87)  <0.01
## factor(ami$HBP)2         2.41 (1.65-3.51)  <0.01

fitglm2 = glm(ami$t2d ~ ami$age_spec + factor(ami$sex) + ami$HbA1c
              + ami$LDLc + factor(ami$HBP) + factor(ami$smoking), family = 'binomial')

# summary(fitglm2)
getLogisiticORCI(fitglm2)

3.4.3 Model Comparison

A model fits the data better if it demonstrates an improvement over a model with fewer explanatory variables. The likelihood ratio test compares the likelihood of the data under the full model against the likelihood of the data under a model with fewer predictors. Having fewer explanatory variables in a model will almost always make the model fit worse, and thus give a lower log likelihood, so it is necessary to test whether the observed difference in model fit is statistically significant. The likelihood ratio test is one way we can perform model comparison in R, using the lrtest() function in the lmtest library.


# install.packages("lmtest")
library(lmtest)

## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
##     as.Date, as.Date.numeric

lrtest(fitglm1,fitglm2)

## Likelihood ratio test
##
## Model 1: ami$t2d ~ ami$age_spec + factor(ami$sex) + ami$HbA1c + ami$LDLc +
##     factor(ami$HBP)
## Model 2: ami$t2d ~ ami$age_spec + factor(ami$sex) + ami$HbA1c + ami$LDLc +
##     factor(ami$HBP) + factor(ami$smoking)
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1   6 -406.95
## 2   9 -399.35  3 15.215   0.001642 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


3.5 Challenge: Dengue in Singapore, again

In this challenge, we are going to continue exploring the dengue data in Singapore.

1. Let us fit a single linear regression line to log10 platelet counts and age. This assumes there is no difference in platelet counts between males and females, and no difference in the relationship between platelet counts and age for males and females. (A question to think about: why regress log10 platelet counts and not the raw platelet counts?)

2. Let us now fit two parallel lines, one for each of the two sexes, i.e. include sex in the model. This assumes there is a systematic difference in log10 platelet counts between males and females, but no difference in the relationship between age and log10 platelet counts for males and females (represented by the two lines sharing the same gradient). (N.B.: To build a model that can tease out differences in the relationship between age and log10 platelet counts for males and females, you need to include an interaction term in the model, which allows the gradients to differ; a sketch of the two model formulae appears after this list.)

3. In the earlier challenge, you identified whether some symptoms were significantly associated with platelet counts. Include these symptoms in the model of log10 platelet counts and demographics (age and sex).

4. Let us now consider fitting a series of logistic regressions to identify the factors that are associated with ever having had DHF. You may want to start off with the following commands, which build the basic model including demographics (age and sex).

fit = glm(denguedata$DHFever == 'Yes' ~ denguedata$Age + denguedata$Sex, family = binomial)
summary(fit)

5. Are other variables, like platelet counts and symptoms, also associated with DHF after adjusting for demographics? Build a more complex model that includes these variables.
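For item 2 above, the parallel-lines and interaction models differ only in the formula. A sketch, assuming the platelet count column is called Platelet (check the name against your data):

lpc = log10(denguedata$Platelet) # hypothetical column name
fit_parallel = lm(lpc ~ denguedata$Age + denguedata$Sex) # same gradient, different intercepts
fit_interact = lm(lpc ~ denguedata$Age * denguedata$Sex) # '*' adds the interaction: different gradients
summary(fit_interact)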


Chapter 4

Reproducible Research with R

This chapter introduces some ideas about how to make your research analyses more robust through reproducibility. Because it is important to be able to stand by one’s results, some common practices that do not leave paper trails may be hard to justify. A few examples follow:

• For instance, if data are in an excel sheet, datapoints should not be manually ‘corrected’ by overwriting the contents of cells: if you have kept the original version, then later you can track back to find the changes, but identifying all the changes may be time consuming. A better approach is to document changes by making the ‘corrections’ within an R script itself (see the sketch after this list).

• Results (e.g. tables of estimates and p-values) should not be copied and pasted individually from the analysis software to the final report/manuscript. Such an approach leads to substantial risk of miscopying that can lead to bad decisions or embarrassment.

• If an analysis needs to be redone (on a different or updated dataset, or in response to suggestions from e.g. peer reviewers), there should not be any ambiguity about the steps to be taken or their ordering. Scripts allow even complex analyses to be rerun with minimal user input, saving time and reducing the risk of errors.
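As a hypothetical example of the first point (all names here are made up), a correction documented in the script might look like this:

# toy data standing in for the real dataset
dataset = data.frame(ID = c(103, 104, 105), Height = c(168, 1.75, 172))
# subject 104's height was entered in metres; correct it to centimetres,
# leaving a paper trail in the script instead of editing the spreadsheet
dataset$Height[dataset$ID == 104] = 100 * dataset$Height[dataset$ID == 104]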

This chapter will briefly describe some of these issues and offer advice on how to structure your workflow to make it more reproducible.

4.1 Datafiles

Quite often, one receives data from someone else in excel format, e.g. xlsx. In such a situation, you have two choices for getting the data into R:

• You can read it directly from within R using the xlsx package. After loading this package, you may read the data using the read.xlsx() function. This is often quite slow in my experience.

• You can open the dataset in excel and save the worksheets as csv format. This is my preferred option.
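In code, the two options look something like this (the file names are hypothetical):

# install.packages("xlsx") # if not already installed
library(xlsx)
dataset = read.xlsx('data/measurements.xlsx', sheetIndex = 1) # read directly from excel
# or, after saving the worksheet as csv from within excel:
dataset = read.csv('data/measurements.csv')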

Some things to watch out for include the following:

• Blank cells forcing additional rows or columns in the data. This can be observed if you open the csv file in a text editor, for instance through multiple commas at the ‘end’ of each line. One fix is to reopen the csv in excel, and delete the columns or rows with non-empty but apparently empty cells. Alternatively, the data could be read into R anyway, and the excess rows or columns removed through, for instance, the following:


dataset = dataset[,-c(9,10)] # if columns 9 and 10 are actually just empty

• Cells that are merged in the original xlsx file, for instance column headers, will not survive properly into the csv file. You may need to unmerge them prior to saving, to ensure the format is as intended.

• You may want to rename some column headings: when you read the data into R, duplicate column names will be differentiated by appending ‘.1’, ‘.2’ and so on to them. This can be a problem if you have two rows of column names. If you do rename columns, it is a good idea to note the transformations you are performing in a readme file (see later). You may also find the column names are too long to be convenient to work with in the analysis. If you prefer to rename columns within R, you can do this using the names() function. To extract the column names, you can type

names(dataset)

which can of course be copied into a vector. You can also rename the columns as follows:

names(dataset) = c('ID','Age','Sex','Race','SC','BMI','Weight','Height')
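You can also rename a single column by indexing into the names vector, e.g.:

names(dataset)[3] = 'Sex' # renames just the third column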

A common issue is that the definitions of variables are not clear from the entries in the cells (for instance, if sex is coded as 1 and 2, it won’t be clear in future which sex is 1 and which is 2). Thus, at an early stage, such definitions should be formally described in a readme file that goes in the same directory as the dataset. This might be a text file, either specific to each datafile, or to the collection of data in a folder. The readme should describe not just the definitions of the variables but also the provenance of the data, for instance, if they come from a cohort, the date at which the data were exported.

4.2 Functions

If you are running the same sets of commands two or more times in your code, you probably should write a function for it. Having a function means (a) that your code is shorter and clearer, because you avoid repetition, and (b) that if you need to change something, you change it once, reducing the risk of accidentally having inconsistencies in different parts of the code.

When you write a function, I find it useful to first identify the arguments the function will use, and then use some of your data to make variables with those names. Then you can run the commands inside the function on some demo data to check that they work (in at least one instance, anyway). It is better not to have the function use global variables, i.e. variables not defined in scope or passed as arguments to the function, because you cannot guarantee that these will be present when the function is later used.
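As a toy illustration of this workflow (all the names here are made up):

# 1. make variables with the same names as the intended arguments
x = c(1.2, 5.3, 2.8)
sf = 2
# 2. check the commands that will form the body work on this demo data
round(mean(x) / sd(x), sf)
# 3. wrap them in a function that uses only its arguments, not globals
summarise_ratio = function(x, sf = 2) {
  return(round(mean(x) / sd(x), sf))
}
summarise_ratio(x)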

Don’t forget that when writing a function, you output one object, so if you need to output multiple things, you should store them inside one object, for example a list. For instance, if you want to output both a vector of estimates and a residual standard deviation, you could write:

output = list(estimates = somevectoryoumade,
              rsd = somenumberyoucalculated)

return(output)
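A self-contained toy example of a function returning such a list, and of how the caller extracts the components by name (the function name is made up):

fit_and_summarise = function(x, y) {
  fit = lm(y ~ x)
  output = list(estimates = coef(fit), rsd = summary(fit)$sigma)
  return(output)
}
res = fit_and_summarise(x = 1:10, y = rnorm(10, mean = 1:10))
res$estimates # the vector of estimates
res$rsd       # the residual standard deviation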

Although you do not need to call return() to exit the function, I find it better practice to do so, because it makes what you are doing more explicit. (The alternative is just to put the object to be returned on the last line of the function, which is not very clear.)


To keep your script clean, you might like to move functions to another file that contains all the functions you have written for that analysis. This file can then be sourced to allow its contents to be used. This is explained further below.

4.3 Scripts and sourcing scripts

With complex, multipart analyses, it may be better to break the analysis down into chunks. For instance, you might have code to read in raw data, clean the data, run some analysis, create a graph, run some more analysis, create some text, and create some tables. Rather than have all these pieces of code in one single script, you might break it down into multiple scripts, one for each major task. For instance, you might have the following scripts:

supporting_functions.r  Functions that will be used in the other scripts
step1_read_data.r       Reads in the raw data, cleans them, and outputs the processed data
step2_analysis1.r       Reads in the processed data and does the first part of the analysis
step3_figure1.r         Takes the output of the analysis and creates an attractive graph
step4_analysis2.r       Reads in the processed data and does some more analysis
etc.

Then, you can have a master file that calls these in sequence. For instance, master.r might do the following:

source('code/supporting_functions.r')
source('code/step1_read_data.r')
source('code/step2_analysis1.r')
source('code/step3_figure1.r')
source('code/step4_analysis2.r')
# etc

You’ll note that the example above suggests that the scripts will not be sitting in the working directory, but rather in a subdirectory called code. It can be useful to have separate folders for the code, the data, and the output that will be created, to keep the project organised and to make it easier for people in future to make sense of what the commands do. In this case, I would set the working directory to be the folder containing the code, data and output folders, defining this in the master script, so that if the work is moved to a new computer, only one line in one file needs to be changed to accommodate the move.
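For instance, the top of master.r might begin with a line like the following (the path here is made up):

setwd('C:/Users/yourname/projects/myanalysis') # the one line to change on a new computer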

This also means that when you read datafiles in, you preface the name of the file with its subdirectory, e.g.

dataset1 = read.csv('data/dataset1.csv')

and similarly, output files such as graphs are output like this:

png('output/figure_1.png')
# etc
dev.off()

4.4 Output units (graphs, tables and text)

Usually the main reason for writing some code is to create some tangible output that can be included in a report or presentation. These could take the form of graphs, tables or text.


We covered earlier in the course how to create graphs. As indicated in the previous section, you can output these to files directly from a command (e.g. png()). This is much better than manually outputting a graph from the graph panel in RStudio to a file, because you have control over the size, aspect ratio, and other graphics parameters. A common problem is for people to make graphs with extremely tiny fonts that cannot be read by the target audience. To overcome this, you can fix the size of the graph to be the final size at which you intend to use it, and set the font size to something sensible, comparable with the text it will appear alongside. I will often output graphs like this:

png("output/figure_1.png",height=8,width=8,units='cm',res=300,pointsize=10)# etcdev.off()

This creates a graph at the size of a single column in a two-column document (i.e. 8cm wide), with a font size of 10 (comparable to a standard document in MS Word) and a reasonable resolution of 300 DPI.

For tables containing many numbers, it is preferable that the table as a whole be exported, e.g. via the creation of a csv file. Having some functions would be very useful here. For instance, you could create a function to round p-values to a specified accuracy, or to ‘<0.001’ if necessary, that can then be tweaked in case you later need to change the accuracy. You could also have a function that creates a text string containing, for instance, the variable name, odds ratio, confidence interval, and p-value for each variable in a logistic regression, with each entry separated by a comma, so that you can output the whole table to a csv file and copy it directly into word or excel for your report.
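A p-value formatter along the following lines could be written; this is just a sketch (the function name and cut-off are illustrative), and the arguments can be tweaked to suit the accuracy you need:

formatP = function(p, digits = 2, floor = 0.001) {
  # report p-values below 'floor' as '<floor', else to 'digits' significant figures
  out = ifelse(p < floor, paste0('<', floor), as.character(signif(p, digits)))
  return(out)
}
formatP(c(0.23456, 0.04567, 0.00002)) # "0.23" "0.046" "<0.001"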

Similarly, if you need to output particular numbers for the text of a report, it is better to output them to a text file (ideally with context, e.g. what each number is) than to copy them from the results R prints to screen, because the resulting paper trail makes mistakes easier to identify and correct.
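For example (or_bmi here stands in for a number your analysis actually produced):

or_bmi = 1.23 # hypothetical: a quantity calculated earlier in the script
cat('Odds ratio for BMI (per unit increase):', or_bmi, '\n',
    file = 'output/key_numbers.txt')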

4.5 Cleaning your workspace

Unexpected things can happen if the workspace (the global environment containing all your objects) is not kept clean. As you write your code, you probably experiment with different solutions and create temporary variables that may not all be represented in the code you save. If the workspace retains these objects, you may then write code expecting them to be present, when in fact nothing in the code actually creates them, causing errors when you later run it from scratch.

To clean the global environment in RStudio, there is a little brush symbol in the environment panel, which will remove all objects therein. You can also, of course, do it in code: typing rm('a','b'), for instance, removes the objects a and b from the environment. It can be good practice for your script to remove objects once they are no longer needed, either immediately after they are last used, or at the end of the script before the next part of the analysis begins.
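To clear everything from code, rather than via the brush, the standard idiom is:

rm(list = ls()) # removes every object in the global environment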


4.6 Final challenge

We finish this workshop with one final challenge, which you should feel free to discuss with the other participants or with the instructors. Load the dataset final_challenge.csv. This contains measurements of blood pressure (e.g. 120/80 means the systolic blood pressure is 120mmHg and the diastolic blood pressure is 80mmHg) and the body mass index (BMI) from a cross-sectional survey in the community. Individuals are defined to be hypertensive if any of the following conditions is met: their systolic blood pressure is at least 140, their diastolic blood pressure is at least 90, or they have previously been diagnosed with hypertension and are now taking interventions to reduce their blood pressure.

The dataset and analysis are quite simple, but treat them as if they were more complex. Do the following steps:

1. Make a directory to be your working directory, with subdirectories called code, data and output. Put the data into the data folder.

2. Make a script to read the data in and ‘clean’ it, by converting the blood pressure measurements to a variable that indicates whether the individual meets the hypertensive definition, and output the processed data.

3. Make a script that loads the cleaned data and makes a plot of BMI for hypertensives and non-hypertensives.

4. Make a script that runs a logistic regression with hypertension as the outcome and BMI as the predictor. Write the results to a table. Use a function to convert the p-value(s) from the logistic regression to display only 2 significant figures, or 1 if p ∈ (0.001, 0.01), or ‘<0.001’ otherwise.

(Hint: you might like to use the strsplit() function to clean the blood pressure measurements.)
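For instance, strsplit() splits each string at a separator and returns a list:

strsplit('120/80', '/')                  # a list whose first element is c("120", "80")
as.numeric(strsplit('120/80', '/')[[1]]) # c(120, 80)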
