STAT 260: Lecture 7

Mik Black

Last lectures

• We’ve looked at plotting
  - Introduction to base R graphics
  - Summary of plotting with ggplot

• You should be starting to develop a good working knowledge of how to visualize data


Excursion: COPSS Award

• The COPSS Presidential Award is one of the most prestigious in statistics
  - Outstanding contributions to the field of statistics
  - Recipient under 40
  - Considered one of the top two awards internationally
  - Previous winners are a “who’s who” of statisticians

• In 2019 it was awarded to Hadley Wickham
  - New Zealander
  - Chief scientist at RStudio

• He leads the development of many R packages
  - tidyverse: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats, ...

• We will be making use of a lot of his work in this course


Data frames: constructing

• We can make a data frame using the function data.frame
• Suppose we have two vectors x and y

x = c("Katrina", "Mik", "Tilman")
y = c(21, 89, 55)

• We can combine those into a data frame

z = data.frame(x,y)
z

##         x  y
## 1 Katrina 21
## 2     Mik 89
## 3  Tilman 55


Data frames: constructing

• The column name is taken from the variable name
• We can also change the name

z2 = data.frame(name = x, value = y)
z2

##      name value
## 1 Katrina    21
## 2     Mik    89
## 3  Tilman    55


Data frames: constructing

• We can also change the variable names separately, after the data frame has been created

colnames(z) = c("name","value")
z

##      name value
## 1 Katrina    21
## 2     Mik    89
## 3  Tilman    55


Data frames: adding a variable

• We can extend data frames
• Suppose we have another variable w

w = c(62, 38, 17)

• We can specify an extended data frame

zext = data.frame(z, ID = w)
zext

##      name value ID
## 1 Katrina    21 62
## 2     Mik    89 38
## 3  Tilman    55 17
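As a minimal sketch of two other common ways to add a variable, assuming the same z and w as on the slides (the names zext2 and zext3 are illustrative and not part of the lecture):

```r
# Rebuild the small example data frame from the slides
z = data.frame(name = c("Katrina", "Mik", "Tilman"), value = c(21, 89, 55))
w = c(62, 38, 17)

# The approach from the slide
zext = data.frame(z, ID = w)

# Alternative 1 (illustrative): assign a new column by name with $
zext2 = z
zext2$ID = w

# Alternative 2 (illustrative): bind the new column on with cbind()
zext3 = cbind(z, ID = w)

# Check that both alternatives match the slide's result
all.equal(zext, zext2)
all.equal(zext, zext3)
```

The $ form modifies a copy of an existing data frame, which is convenient when a column is computed from columns already present.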


Data handling: subsetting

• Subset with []
  - Already seen this for finding columns (data frame)

• Look more closely: starting with vectors
• sunspots is a built-in data vector (see footnote 1)

• We can find the Xth element with sunspots[X]

sunspots[1]

## [1] 58

sunspots[1001]

## [1] 41.3

Footnote 1: It is actually a time-series vector.


• We can also subset multiple values
• If we want the first 5 elements

sunspots[1:5]

## [1] 58.0 62.6 70.0 55.7 85.0

• If we want the 1st, 52nd, 79th and 110th

sunspots[c(1,52,79,110)]

## [1] 58.0 38.0 8.6 52.0



• While useful, it is limiting having to specify the indices we want to subset
• We can automate this using logical values in R


Excursion: logical values

• Logical values help us to check a condition

a = 5
a > 3

## [1] TRUE

a == 4

## [1] FALSE

• A logical value is either TRUE or FALSE
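Comparisons also work element by element on a whole vector, which is what makes logical subsetting possible. A small sketch, using a made-up vector v that is not part of the lecture:

```r
# Hypothetical example vector (not from the slides)
v = c(2, 7, 4, 9)

# Each element is compared with 3, giving one logical value per element
big = v > 3
big          # FALSE TRUE TRUE TRUE

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(big)     # 3

# mean() gives the proportion of elements satisfying the condition
mean(big)    # 0.75

# Equality is tested with ==, not =
v == 4       # FALSE FALSE TRUE FALSE
```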


Data handling with logical values

• Suppose we want all values of the vector where the value is over 50

tmp = sunspots > 50
head(tmp)

## [1] TRUE TRUE TRUE TRUE TRUE TRUE

sun50 = sunspots[tmp]
head(sun50)

## [1] 58.0 62.6 70.0 55.7 85.0 83.5


Data handling

• The original vector has length:

length(sunspots)

## [1] 2820

• The reduced vector has length:

length(sun50)

## [1] 1205

• Typically I would combine the two commands from the previous slide into one

sun50 = sunspots[sunspots > 50]
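The same logical vector can be reused to count matches or to select the complement. The sketch below uses only base R and the built-in sunspots data; the name sun_low is illustrative:

```r
# Values over 50, as on the slide
sun50 = sunspots[sunspots > 50]

# Counting TRUE values gives the same number as the length of the subset
n_over = sum(sunspots > 50)
n_over == length(sun50)

# ! negates a logical vector, selecting the values that are NOT over 50
sun_low = sunspots[!(sunspots > 50)]

# The two subsets together account for every element of the original vector
length(sun50) + length(sun_low) == length(sunspots)
```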


Quick data visualisation

• We can use what we’ve learned about plotting to quickly visualise the data.
• Base R plots are great for taking a quick look at the data.

hist(sunspots, breaks = 25)
hist(sun50, breaks = 25, xlim=c(0,max(sun50)))

[Figure: two base R histograms, "Histogram of sunspots" and "Histogram of sun50"; x-axes: sunspots and sun50 (0 to 250); y-axes: Frequency]


Not quite as easy in ggplot...

• ggplot needs a data frame to plot

str(sunspots)

## Time-Series [1:2820] from 1749 to 1984: 58 62.6 70 55.7 85 83.5 94.8 66.3 75.9 75.5 ...

names(sunspots)

## NULL

• Wrong data type, and no aesthetics to specify...


... but not too hard to overcome

sun50_df = data.frame(sun50 = sun50)
str(sun50_df)

## 'data.frame': 1205 obs. of 1 variable:
##  $ sun50: num 58 62.6 70 55.7 85 83.5 94.8 66.3 75.9 75.5 ...

library(ggplot2)
ggplot(data=sun50_df, aes(x=sun50)) + geom_histogram(bins=25) + theme_bw()

[Figure: ggplot histogram of sun50 with 25 bins; x-axis: sun50 (50 to 250); y-axis: count]
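A related sketch: the whole time series can be converted to a data frame first and then subset by row, which keeps the observation time alongside each value. The names sun_df, year, spots and sun50_df2 are illustrative, and the plotting step only runs if ggplot2 is installed:

```r
# Convert the time series to a data frame with one row per observation
sun_df = data.frame(
  year  = as.numeric(time(sunspots)),  # observation time as a decimal year
  spots = as.numeric(sunspots)         # the sunspot values themselves
)
str(sun_df)

# Subset the rows of the data frame with a logical condition
sun50_df2 = sun_df[sun_df$spots > 50, ]

# Plot only when ggplot2 is available (assumption: it may not be installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p = ggplot(data = sun50_df2, aes(x = spots)) +
    geom_histogram(bins = 25) + theme_bw()
  print(p)
}
```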


Data handling

• An alternate way of subsetting data is to first find which elements satisfy the condition

idx50 = which(sunspots > 50)
head(idx50)

## [1] 1 2 3 4 5 6

• Then use these indices to subset

sun50v2 = sunspots[idx50]
all.equal(sun50,sun50v2)

## [1] TRUE
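The two approaches agree here because sunspots has no missing values. With missing values they can differ, as this sketch with a made-up vector v (not from the lecture) suggests:

```r
# Hypothetical vector containing a missing value
v = c(60, NA, 40, 75)

# The comparison returns NA where the value is missing
v > 50               # TRUE NA FALSE TRUE

# Logical subsetting keeps an NA for the missing comparison
by_logical = v[v > 50]
by_logical           # 60 NA 75

# which() returns only the positions that are TRUE, so the NA is dropped
by_which = v[which(v > 50)]
by_which             # 60 75
```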


Data handling: data frames

• The exact same principles apply for data frames
• Consider the chickwts data set

head(chickwts)

##   weight      feed
## 1    179 horsebean
## 2    160 horsebean
## 3    136 horsebean
## 4    227 horsebean
## 5    217 horsebean
## 6    168 horsebean



• Suppose we want only those chickens that had been fed linseed
• Two approaches; the first uses which

idxlin = which(chickwts$feed=="linseed")
chicklin = chickwts[idxlin,] # choose the rows that satisfy condition

head(chicklin)

##    weight    feed
## 11    309 linseed
## 12    229 linseed
## 13    181 linseed
## 14    141 linseed
## 15    260 linseed
## 16    203 linseed



• We could also specify only some of the columns
• Only present the weights for those chickens fed linseed

chickwts[idxlin,1]

## [1] 309 229 181 141 260 203 148 169 213 257 244 271

chickwts$weight[idxlin]

## [1] 309 229 181 141 260 203 148 169 213 257 244 271
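Both commands above return a plain numeric vector. A sketch of two related options, selecting the column by name and keeping the result as a data frame with drop = FALSE (the names w_vec and w_df are illustrative):

```r
# Rows for chickens fed linseed, as on the slide
idxlin = which(chickwts$feed == "linseed")

# Select the column by name instead of by position; result is a vector
w_vec = chickwts[idxlin, "weight"]
class(w_vec)   # "numeric"

# drop = FALSE keeps a one-column data frame instead of simplifying to a vector
w_df = chickwts[idxlin, "weight", drop = FALSE]
class(w_df)    # "data.frame"
dim(w_df)      # 12 rows, 1 column
```

Keeping a data frame can be useful when the result is passed on to functions, such as ggplot, that expect one.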



• The former is more general than the latter
  - It lets us choose several variables at once

• E.g. for zext

zext[1,c(2,3)]

##   value ID
## 1    21 62

zext[c(1,3),c(1,3)]

##      name ID
## 1 Katrina 62
## 3  Tilman 17



• Alternatively, we can do it at once

chicklinv2 = chickwts[chickwts$feed=="linseed",]
all.equal(chicklin,chicklinv2)

## [1] TRUE


Data handling: and

• We can use multiple criteria
• We may wish to find those chickens fed soybean AND that weigh over 300 g
• We can include AND with &

idxsoy300 = which(chickwts$feed == "soybean" & chickwts$weight > 300)
chicksoy300 = chickwts[idxsoy300,]
chicksoy300

##    weight    feed
## 26    327 soybean
## 27    329 soybean
## 31    316 soybean


Data handling: or

• We can also have conditions that include OR statements
• We may wish to find those chickens fed sunflower OR that weigh under 200 g
• We can include OR with |

idxsun200 = which(chickwts$feed == "sunflower" | chickwts$weight < 200)
chicksun200 = chickwts[idxsun200,]
str(chicksun200)

## 'data.frame': 29 obs. of 2 variables:
##  $ weight: num 179 160 136 168 108 124 143 140 181 141 ...
##  $ feed  : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 3 3 ...
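When the OR is between several values of the same variable, the %in% operator is a common shorthand. A minimal sketch, with the feed types chosen only for illustration:

```r
# OR written out in full: fed linseed or fed soybean
idx_or = which(chickwts$feed == "linseed" | chickwts$feed == "soybean")

# The same condition with %in%, which tests membership in a set of values
idx_in = which(chickwts$feed %in% c("linseed", "soybean"))

# Both give the same row indices
identical(idx_or, idx_in)

# Use the indices to subset the rows, as before
chick_ls = chickwts[idx_in, ]
nrow(chick_ls)
```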


Data handling: subset

• There is also a built-in function subset
• From help file:

    This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [

• Repeat the final three subsets

chicklin_v3 = subset(chickwts, feed == "linseed")
chicksoy300_v2 = subset(chickwts, feed == "soybean" & weight > 300)
chicksun200_v2 = subset(chickwts, feed == "sunflower" | weight < 200)
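A quick sketch for checking that subset gives the same rows as the bracket versions, and for its select argument, which chooses columns. The name soy_w is illustrative:

```r
# Bracket version and subset() version of the linseed subset
chicklin = chickwts[chickwts$feed == "linseed", ]
chicklin_v3 = subset(chickwts, feed == "linseed")
all.equal(chicklin, chicklin_v3)

# Bracket version and subset() version of the soybean AND over 300 g subset
chicksoy300 = chickwts[which(chickwts$feed == "soybean" & chickwts$weight > 300), ]
chicksoy300_v2 = subset(chickwts, feed == "soybean" & weight > 300)
all.equal(chicksoy300, chicksoy300_v2)

# select picks columns; the result is still a data frame
soy_w = subset(chickwts, feed == "soybean" & weight > 300, select = weight)
soy_w
```

Inside subset the column names can be written without the chickwts$ prefix, which is the convenience the help file refers to.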


In practice

• The real power in data handling is when we combine it with visualization

ggplot(data = chickwts, aes(x=feed, y=weight)) +
  geom_boxplot(outlier.shape = NA, colour='grey') + geom_jitter(width=0.1)

ggplot(data = chicksun200_v2, aes(x=feed, y=weight)) +
  geom_boxplot(outlier.shape = NA, colour='grey') + geom_jitter(width=0.1)

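[Figure: two ggplot boxplots with jittered points of weight (100 to 400) against feed; the first shows all six feeds in chickwts (casein, horsebean, linseed, meatmeal, soybean, sunflower), the second shows chicksun200_v2 (horsebean, linseed, meatmeal, soybean, sunflower)]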
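The earlier str output for the sunflower-or-under-200 subset still reports six factor levels, even though no casein-fed chickens remain. A hedged sketch of how to see and remove the unused level with base R (the name chicksun200d is illustrative):

```r
# Subset as on the earlier slide
chicksun200 = subset(chickwts, feed == "sunflower" | weight < 200)

# The factor keeps all original levels; casein appears with a count of 0
table(chicksun200$feed)
nlevels(chicksun200$feed)    # 6

# droplevels() removes levels that no longer occur in the data
chicksun200d = droplevels(chicksun200)
nlevels(chicksun200d$feed)   # 5
levels(chicksun200d$feed)
```

By default ggplot's discrete axes omit unused levels, so the plot is unaffected, but summaries such as table will list the empty level unless it is dropped.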

