STAT 260: Lecture 7

Mik Black

STAT 260: Lecture 7 Slide 1

Last lectures

• We’ve looked at plotting
    - Introduction to base R graphics
    - Summary of plotting with ggplot

• You should be starting to develop a good working knowledge of how to visualize data

Excursion: COPSS Award

• The COPSS Presidents’ Award is one of the most prestigious in statistics
    - Outstanding contributions to the field of statistics
    - Recipient under 40
    - Considered one of the top two awards internationally
    - Previous winners are a “who’s who” of statisticians

• In 2019 it was awarded to Hadley Wickham
    - New Zealander
    - Chief scientist at RStudio

• He leads the development of many R packages
    - tidyverse: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats, ...

• We will be making use of a lot of his work in this course

Data frames: constructing

• We can make a data frame using the function data.frame

• Suppose we have two vectors x and y

x = c("Katrina", "Mik", "Tilman")
y = c(21, 89, 55)

• We can combine those into a data frame

z = data.frame(x, y)
z

##         x  y
## 1 Katrina 21
## 2     Mik 89
## 3  Tilman 55

Data frames: constructing

• The column names are taken from the variable names

• We can also choose the names when we construct the data frame

z2 = data.frame(name = x, value = y)
z2

##      name value
## 1 Katrina    21
## 2     Mik    89
## 3  Tilman    55

Data frame: constructing

• We can also change the variable names separately

colnames(z) = c("name", "value")
z

##      name value
## 1 Katrina    21
## 2     Mik    89
## 3  Tilman    55
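The two routes to column names can be checked against each other. This is a minimal sketch using the same x and y as above; names() is base R's general accessor and is shown here only as an alternative to colnames().

```r
# Vectors from the slides
x = c("Katrina", "Mik", "Tilman")
y = c(21, 89, 55)

# Route 1: names taken from the variable names, then changed afterwards
z = data.frame(x, y)
colnames(z) = c("name", "value")

# Route 2: names chosen when the data frame is constructed
z2 = data.frame(name = x, value = y)

# Both routes should give the same data frame
all.equal(z, z2)

# names() on a data frame also returns the column names
names(z)
```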

Data frame: adding a variable

• We can extend data frames

• Suppose we have another variable w

w = c(62, 38, 17)

• We can specify an extended data frame

zext = data.frame(z, ID = w)
zext

##      name value ID
## 1 Katrina    21 62
## 2     Mik    89 38
## 3  Tilman    55 17
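A column can also be added with $ instead of rebuilding the data frame. The sketch below is an alternative that the slides do not show; the name zext2 is chosen only for illustration.

```r
x = c("Katrina", "Mik", "Tilman")
y = c(21, 89, 55)
w = c(62, 38, 17)
z = data.frame(name = x, value = y)

# Extend by building a new data frame (as on the slide)
zext = data.frame(z, ID = w)

# Alternative: add the column to a copy with $
zext2 = z
zext2$ID = w

all.equal(zext, zext2)
dim(zext)   # number of rows, then number of columns
```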

Data handling: subsetting

• Subset with []
    - Already seen this for finding columns (data frame)

• Look more closely: starting with vectors

• sunspots is a built-in data vector (it is actually a time-series vector)

• We can find the Xth element with sunspots[X]

sunspots[1]

## [1] 58

sunspots[1001]

## [1] 41.3

Data handling: subsetting

• We can also subset multiple values

• If we want the first 5 elements

sunspots[1:5]

## [1] 58.0 62.6 70.0 55.7 85.0

• If we want the 1st, 52nd, 79th and 110th elements

sunspots[c(1, 52, 79, 110)]

## [1] 58.0 38.0 8.6 52.0
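The same index rules are easier to check by eye on a short vector. The vector v below is made up for illustration and is not part of the lecture data.

```r
# A small made-up vector so the results are easy to verify
v = c(10, 20, 30, 40, 50)

v[2]            # single element: 20
v[2:4]          # a run of elements: 20 30 40
v[c(1, 5)]      # arbitrary positions: 10 50
v[c(5, 1, 1)]   # order and repeats are respected: 50 10 10
```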

Data handling: subsetting

• While useful, having to specify the indices we want is limiting

• We can automate this using logical values in R

Excursion: logical values

• Logical values help us to check a condition

a = 5
a > 3

## [1] TRUE

a == 4

## [1] FALSE

• Logical values take the value TRUE or FALSE
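Comparisons also work element by element on a whole vector, which is what makes them useful for subsetting. A small sketch with a made-up vector (here a is redefined as a vector purely for illustration):

```r
# A made-up vector for illustration
a = c(1, 5, 8)

a > 3        # FALSE TRUE TRUE
a == 5       # FALSE TRUE FALSE

# TRUE counts as 1 and FALSE as 0 in arithmetic
sum(a > 3)   # 2 values are greater than 3
```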

Data handling with logical values

• Suppose we want all values of the vector where the value is over 50

tmp = sunspots > 50
head(tmp)

## [1] TRUE TRUE TRUE TRUE TRUE TRUE

sun50 = sunspots[tmp]
head(sun50)

## [1] 58.0 62.6 70.0 55.7 85.0 83.5
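A few quick checks can confirm that the logical subset did what we intended. This is only a sketch of the kind of sanity check one might run, not a step from the lecture.

```r
tmp = sunspots > 50      # logical vector, one TRUE/FALSE per element
sun50 = sunspots[tmp]    # keep only the positions where tmp is TRUE

length(tmp) == length(sunspots)  # same length as the original vector
sum(tmp) == length(sun50)        # number of TRUEs = number of values kept
all(sun50 > 50)                  # every retained value satisfies the condition
```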

Data handling

• The original vector has length:

length(sunspots)

## [1] 2820

• The reduced vector has length:

length(sun50)

## [1] 1205

• Typically I would combine the two commands from the previous slide into one

sun50 = sunspots[sunspots > 50]
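From the two lengths above, roughly 43% of the observations (1205 of 2820) exceed 50. A minimal sketch of two ways to compute that proportion, relying on TRUE counting as 1:

```r
# Proportion of observations over 50, from the two lengths
length(sunspots[sunspots > 50]) / length(sunspots)

# Same proportion: the mean of a logical vector is the fraction of TRUEs
mean(sunspots > 50)
```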

Quick data visualisation

• We can use what we’ve learned about plotting to quickly visualise the data.

• Base R plots are great for taking a quick look at the data.

hist(sunspots, breaks = 25)
hist(sun50, breaks = 25, xlim = c(0, max(sun50)))

[Figure: two histograms, "Histogram of sunspots" (x-axis: sunspots, y-axis: Frequency) and "Histogram of sun50" (x-axis: sun50, y-axis: Frequency), both with the x-axis running from 0 to 250]
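When breaks is a single number, hist treats it as a suggestion and picks "pretty" bin edges, so the number of bins drawn need not be exactly 25. The sketch below inspects the bins without drawing anything; the object name h is illustrative.

```r
# Compute the histogram without drawing it
h = hist(sunspots, breaks = 25, plot = FALSE)

h$breaks               # the bin edges R actually used
length(h$breaks) - 1   # number of bins, which need not be exactly 25
sum(h$counts)          # the counts add up to length(sunspots)
```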

Not quite as easy in ggplot...

• ggplot needs a data frame to plot

str(sunspots)

## Time-Series [1:2820] from 1749 to 1984: 58 62.6 70 55.7 85 83.5 94.8 66.3 75.9 75.5 ...

names(sunspots)

## NULL

• Wrong data type (a time series, not a data frame), and no named columns to map to aesthetics...

... but not too hard to overcome

sun50_df = data.frame(sun50 = sun50)
str(sun50_df)

## 'data.frame': 1205 obs. of 1 variable:
##  $ sun50: num 58 62.6 70 55.7 85 83.5 94.8 66.3 75.9 75.5 ...

library(ggplot2)
ggplot(data = sun50_df, aes(x = sun50)) +
  geom_histogram(bins = 25) +
  theme_bw()

[Figure: ggplot histogram of sun50 (x-axis: sun50, y-axis: count)]
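The same idea extends to the full series: once the values and their times sit in a data frame, ggplot has columns to map to aesthetics. This is a sketch only; the names sun_df, time and spots are illustrative choices, and the plot is drawn only if ggplot2 is installed.

```r
# Convert the time series to a data frame, keeping the observation times
sun_df = data.frame(
  time  = as.numeric(time(sunspots)),  # decimal year of each observation
  spots = as.numeric(sunspots)         # values with the ts attributes dropped
)
str(sun_df)

# Plot only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p = ggplot(data = sun_df, aes(x = time, y = spots)) +
    geom_line() +
    theme_bw()
  print(p)
}
```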

Data handling

• An alternate way of subsetting data is to first find which elements satisfy the condition

idx50 = which(sunspots > 50)
head(idx50)

## [1] 1 2 3 4 5 6

• Then use these indices to subset

sun50v2 = sunspots[idx50]
all.equal(sun50, sun50v2)

## [1] TRUE
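The two approaches agree here because sunspots has no missing values. With an NA present they differ, as this sketch with a made-up vector shows:

```r
# A made-up vector with a missing value
v = c(60, NA, 40, 75)

v > 50              # TRUE NA FALSE TRUE
v[v > 50]           # logical subsetting keeps the NA: 60 NA 75
which(v > 50)       # which() returns only the positions that are TRUE: 1 4
v[which(v > 50)]    # so the NA is dropped: 60 75
```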

Data handling: data frames

• The exact same principles apply for data frames

• Consider the chickwts data set

head(chickwts)

##   weight      feed
## 1    179 horsebean
## 2    160 horsebean
## 3    136 horsebean
## 4    227 horsebean
## 5    217 horsebean
## 6    168 horsebean
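Before subsetting, it helps to know what values the feed column can take. A short sketch using standard base R inspection functions:

```r
str(chickwts)            # weight is numeric, feed is a factor
levels(chickwts$feed)    # the feed types
table(chickwts$feed)     # number of chickens on each feed
```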

Data handling: data frames

• Suppose we want only those chickens that were fed linseed

• There are two approaches; the first finds the matching row indices with which

idxlin = which(chickwts$feed == "linseed")
chicklin = chickwts[idxlin, ] # choose the rows that satisfy condition

head(chicklin)

##    weight    feed
## 11    309 linseed
## 12    229 linseed
## 13    181 linseed
## 14    141 linseed
## 15    260 linseed
## 16    203 linseed
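Note that the subset keeps the original row labels (11, 12, ...). A sketch of how to check the subset and, if wanted, renumber the rows; chicklin_renum is an illustrative name.

```r
idxlin = which(chickwts$feed == "linseed")
chicklin = chickwts[idxlin, ]

nrow(chicklin)                   # number of chickens fed linseed
all(chicklin$feed == "linseed")  # every retained row satisfies the condition
rownames(chicklin)               # original row labels are kept

# Optional: renumber the rows from 1 in a copy
chicklin_renum = chicklin
rownames(chicklin_renum) = NULL
head(chicklin_renum)
```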

Data handling: data frames

• We could also specify only some of the columns

• Only present the weights for those chickens fed linseed

chickwts[idxlin, 1]

## [1] 309 229 181 141 260 203 148 169 213 257 244 271

chickwts$weight[idxlin]

## [1] 309 229 181 141 260 203 148 169 213 257 244 271
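Columns can also be chosen by name, which is less fragile than a position if the column order ever changes. The sketch below also shows drop = FALSE, which keeps a single-column result as a data frame; neither appears on the slides.

```r
idxlin = which(chickwts$feed == "linseed")

# Select the column by name rather than by position
w_by_name = chickwts[idxlin, "weight"]
identical(w_by_name, chickwts[idxlin, 1])

# A single column is simplified to a vector by default;
# drop = FALSE keeps the result as a data frame
w_df = chickwts[idxlin, "weight", drop = FALSE]
str(w_df)
```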

Data handling: data frames

• The former (indexing with [rows, columns]) is more general than the latter (indexing with $)
    - We can choose to have several variables

• E.g. for zext

zext[1, c(2, 3)]

##   value ID
## 1    21 62

zext[c(1, 3), c(1, 3)]

##      name ID
## 1 Katrina 62
## 3  Tilman 17
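Several columns can be selected by name as well as by position. A minimal sketch that rebuilds zext so it runs on its own; by_name and by_pos are illustrative names.

```r
x = c("Katrina", "Mik", "Tilman")
y = c(21, 89, 55)
w = c(62, 38, 17)
zext = data.frame(name = x, value = y, ID = w)

# Rows 1 and 3, columns chosen by name instead of position
by_name = zext[c(1, 3), c("name", "ID")]
by_pos  = zext[c(1, 3), c(1, 3)]
all.equal(by_name, by_pos)

# Leaving the row index empty keeps all rows
zext[, c("name", "ID")]
```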

Data handling: data frames

• Alternatively, we can do it in one step, using the logical condition directly as the row index

chicklinv2 = chickwts[chickwts$feed == "linseed", ]
all.equal(chicklin, chicklinv2)

## [1] TRUE

Data handling: and

• We can use multiple criteria

• We may wish to find those chickens fed soybean AND that weigh over 300 g

• We can include AND with &

idxsoy300 = which(chickwts$feed == "soybean" & chickwts$weight > 300)
chicksoy300 = chickwts[idxsoy300, ]
chicksoy300

##    weight    feed
## 26    327 soybean
## 27    329 soybean
## 31    316 soybean
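The & operator works element by element, giving one TRUE/FALSE per row (the double form && is meant for single values, for example inside if). A sketch that stores each condition separately and then chains a third; is_soy and is_heavy are illustrative names.

```r
# One TRUE/FALSE per row for each condition
is_soy   = chickwts$feed == "soybean"
is_heavy = chickwts$weight > 300
sum(is_soy & is_heavy)      # 3 rows satisfy both conditions

# More than two conditions can be chained
chickwts[is_soy & is_heavy & chickwts$weight < 320, ]
```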

Data handling: or

• We can also have conditions that include OR statements

• We may wish to find those chickens that were fed sunflower OR that weigh under 200 g

• We can include OR with |

idxsun200 = which(chickwts$feed == "sunflower" | chickwts$weight < 200)
chicksun200 = chickwts[idxsun200, ]
str(chicksun200)

## 'data.frame': 29 obs. of 2 variables:
##  $ weight: num 179 160 136 168 108 124 143 140 181 141 ...
##  $ feed  : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 3 3 ...
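When the OR is between several values of the same variable, %in% is a compact base R alternative (not used in the lecture). The sketch also shows parentheses making the grouping explicit when & and | are mixed; or_rows, in_rows and mixed are illustrative names.

```r
# Two feeds combined with OR
or_rows = chickwts[chickwts$feed == "sunflower" | chickwts$feed == "soybean", ]

# The same selection with %in%
in_rows = chickwts[chickwts$feed %in% c("sunflower", "soybean"), ]
all.equal(or_rows, in_rows)

# Parentheses make the grouping explicit when mixing & and |
mixed = chickwts[(chickwts$feed == "sunflower" | chickwts$feed == "soybean") &
                   chickwts$weight < 200, ]
mixed
```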

Data handling: subset

• There is also a built-in function subset

• From the help file:

    This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [

• Repeat the last three subsets

chicklin_v3 = subset(chickwts, feed == "linseed")
chicksoy300_v2 = subset(chickwts, feed == "soybean" & weight > 300)
chicksun200_v2 = subset(chickwts, feed == "sunflower" | weight < 200)
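The subset versions can be compared with the bracket versions to confirm they select the same rows. A minimal sketch; the select argument of subset, shown last, chooses columns and is not used on the slides.

```r
# Bracket versions
chicklin    = chickwts[chickwts$feed == "linseed", ]
chicksoy300 = chickwts[chickwts$feed == "soybean" & chickwts$weight > 300, ]

# subset() versions
chicklin_v3    = subset(chickwts, feed == "linseed")
chicksoy300_v2 = subset(chickwts, feed == "soybean" & weight > 300)

all.equal(chicklin, chicklin_v3)
all.equal(chicksoy300, chicksoy300_v2)

# select chooses columns; the result is still a data frame
subset(chickwts, feed == "linseed", select = weight)
```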

In practice

• The real power in data handling is when we combine it with visualization

ggplot(data = chickwts, aes(x = feed, y = weight)) +
  geom_boxplot(outlier.shape = NA, colour = 'grey') +
  geom_jitter(width = 0.1)

ggplot(data = chicksun200_v2, aes(x = feed, y = weight)) +
  geom_boxplot(outlier.shape = NA, colour = 'grey') +
  geom_jitter(width = 0.1)

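[Figure: two plots of weight (y-axis) against feed (x-axis), each showing grey boxplots with jittered points. The first shows all six feeds in chickwts (casein, horsebean, linseed, meatmeal, soybean, sunflower); the second shows the five feeds present in chicksun200_v2 (horsebean, linseed, meatmeal, soybean, sunflower)]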
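The second plot shows only five feeds because no casein-fed chicken meets the condition, and ggplot2's discrete axes leave out unused levels by default. The factor itself still carries all six levels, though, which can matter in tables and models. A sketch of how to see and remove the unused level; chicksun200_clean is an illustrative name.

```r
chicksun200_v2 = subset(chickwts, feed == "sunflower" | weight < 200)

table(chicksun200_v2$feed)     # casein is still a level, with a count of 0
nlevels(chicksun200_v2$feed)   # 6

# droplevels() removes levels that no longer occur
chicksun200_clean = droplevels(chicksun200_v2)
nlevels(chicksun200_clean$feed)  # 5
```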
