STAT 260: Lecture 9
Mik Black
STAT 260: Lecture 9 Slide 1
More ggplot2...
• Today: faceting and lines
• As always, don’t forget to load the ggplot2 package before we start:
library(ggplot2)
• Later I also use dplyr:
library(dplyr)
• We might not get through all these slides today...
Faceting
• Faceting refers to the technique of making the same plot for each level of a discrete variable (i.e., a factor in R).
• ggplot gives us the ability to do this in a single plot call via the facet_wrap function.
• We’ll look at this functionality using one of the data sets that are part of the ggplot2 package - the “mpg” data.
• This data set records the gas mileage of automobiles relative to their other characteristics.
MPG data - variables
• manufacturer: name of manufacturer
• model: model name
• displ: engine displacement, in liters
• year: year of manufacture
• cyl: number of cylinders
• trans: type of transmission
• drv: f = front-wheel drive, r = rear-wheel drive, 4 = 4wd
• cty: city miles per gallon
• hwy: highway miles per gallon
• fl: fuel type
• class: “type” of car
MPG data - structure
str(mpg)
## tibble[,11] [234 x 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...
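Since faceting works across the levels of a discrete variable, it can help to check which values a candidate column takes before faceting on it. A minimal sketch using base R functions on the mpg data (the object names class_levels and drv_counts are illustrative and not part of the lecture):

```r
library(ggplot2)  # provides the mpg data set

# Distinct values of the class column (one facet panel per value)
class_levels <- sort(unique(mpg$class))
class_levels

# Number of cars in each drive-type group
drv_counts <- table(mpg$drv)
drv_counts
```

Columns with only a handful of distinct values, such as class and drv, are the natural candidates for faceting.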
MPG data - first rows
head(mpg)
## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~
MPG data - scatterplot of highway versus city mileage
ggplot(data=mpg, aes(x=hwy, y=cty)) + geom_point()
[Figure: scatterplot of cty (y-axis) against hwy (x-axis)]
Aside - adding jitter (reminder from last lecture)
• there is a lot of overplotting going on - sometimes adding a little noise improves the plot by making the relationship more obvious (i.e., revealing the overplotted data points):
ggplot(data=mpg, aes(x=hwy, y=cty)) + geom_point(position="jitter")
[Figure: jittered scatterplot of cty (y-axis) against hwy (x-axis)]
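The amount of noise can also be set explicitly rather than left at the default. The sketch below assumes the position_jitter() function with its width, height and seed arguments, as provided by recent versions of ggplot2; the values 0.3 and 260 are arbitrary choices for illustration and do not come from the lecture:

```r
library(ggplot2)

# Jitter each point by up to 0.3 units in either direction on both axes;
# fixing the seed makes the random noise reproducible between runs
p_jit <- ggplot(data=mpg, aes(x=hwy, y=cty)) +
  geom_point(position=position_jitter(width=0.3, height=0.3, seed=260))
p_jit
```

Keeping the jitter small relative to the spacing of the data (here, whole miles per gallon) reveals overplotted points without distorting the overall relationship.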
Colour by vehicle class
• let’s use colour to add vehicle class information to the plot:
ggplot(data=mpg, aes(x=hwy, y=cty, colour=class)) + geom_point(position="jitter")
[Figure: jittered scatterplot of cty against hwy, points coloured by class (2seater, compact, midsize, minivan, pickup, subcompact, suv)]
Hard to see what is going on...
• using colour to denote vehicle class does work, but it is hard to see exactly what the relationship is between city and highway mileage for each class.
• this is where “faceting” comes in - we can ask ggplot to make the scatterplot for each type of vehicle.
• to do this we use the facet_wrap function, along with the ~ operator (you’ll learn more about this later in the course):
ggplot(data=mpg, aes(x=hwy, y=cty)) + geom_point(position='jitter') +
  facet_wrap(~class)
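For readers not yet comfortable with the ~ operator, the faceting variable can also be named with vars(). This is a sketch of an alternative spelling available in recent ggplot2 versions, not the form used in the lecture; the object names p_formula and p_vars are illustrative:

```r
library(ggplot2)

# Formula interface, as used in the lecture
p_formula <- ggplot(data=mpg, aes(x=hwy, y=cty)) +
  geom_point(position='jitter') + facet_wrap(~class)

# The same faceting, naming the variable with vars() instead of a formula
p_vars <- ggplot(data=mpg, aes(x=hwy, y=cty)) +
  geom_point(position='jitter') + facet_wrap(vars(class))
p_vars
```

Both calls should produce one panel per vehicle class.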
Facet by vehicle class
ggplot(data=mpg, aes(x=hwy, y=cty)) + geom_point(position='jitter') +
  facet_wrap(~class)
[Figure: jittered scatterplots of cty against hwy, one panel per class (2seater, compact, midsize, minivan, pickup, subcompact, suv), arranged in three rows]
Facet by vehicle class
• we can also specify the number of rows to use for faceting:
ggplot(data=mpg, aes(x=hwy, y=cty)) + geom_point(position='jitter') +
  facet_wrap(~class, nrow=2)
[Figure: the same faceted scatterplots of cty against hwy arranged in two rows (2seater, compact, midsize, minivan; pickup, subcompact, suv)]
Facet mileage histograms by drive type
• we can use faceting for (almost) any sort of plot:
ggplot(data=mpg, aes(x=hwy)) + geom_histogram(bins=15) + facet_wrap(~drv)
[Figure: histograms of hwy (y-axis: count), one panel per drive type (4, f, r)]
More information: engine displacement
• engine displacement, displ, is a continuous variable:
ggplot(data=mpg, aes(x=displ)) + geom_histogram(bins=15, colour='black', fill='white')
[Figure: histogram of displ (y-axis: count)]
Engine displacement
• definitely varies by vehicle class:
ggplot(data=mpg, aes(x=class, y=displ)) + geom_boxplot() +
  geom_jitter(width=0.15, alpha=0.3)
[Figure: boxplots of displ by class, with jittered semi-transparent points overlaid]
Colour by engine displacement
• can also colour by a continuous variable (mentioned this at the end of the last lecture):
ggplot(data=mpg, aes(x=hwy, y=cty, colour=displ)) + geom_point(position="jitter")
[Figure: jittered scatterplot of cty against hwy, points coloured on a continuous scale by displ]
Facet by vehicle class & colour by displacement
• and now let’s facet by class!
ggplot(data=mpg, aes(x=hwy, y=cty, colour=displ)) + geom_point(position='jitter') +
  facet_wrap(~class)
[Figure: jittered scatterplots of cty against hwy, one panel per class, points coloured on a continuous scale by displ]
Linking point size to a variable
• instead of colour we could use point size to include information about a variable:
ggplot(data=mpg, aes(x=hwy, y=cty, size=displ)) + geom_point()
[Figure: scatterplot of cty against hwy, point size proportional to displ]
Linking point size to a variable (alpha)
• add transparency via alpha levels:
ggplot(data=mpg, aes(x=hwy, y=cty, size=displ)) +
  geom_point(alpha=0.2)
[Figure: scatterplot of cty against hwy, point size proportional to displ, semi-transparent points]
Linking point size to a variable (with alpha and jitter)
• now add some jitter...
ggplot(data=mpg, aes(x=hwy, y=cty, size=displ)) + geom_point(alpha=0.2, position='jitter')
[Figure: jittered scatterplot of cty against hwy, point size proportional to displ, semi-transparent points]
Local aesthetics
• ggplot allows us to specify aesthetics locally (i.e., specific to a geom).
• if the local aesthetics differ from the aes values specified in the main ggplot call, then the local aesthetics will be used for that particular geometric object.
• this becomes useful when customising multiple layers in a single plot - we’ll see an example of this later in the lecture.
• here is an example of specifying the point size within geom_point (it gives the same result as above):
ggplot(data=mpg, aes(x=hwy, y=cty)) +
  geom_point(aes(size=displ), alpha=0.2, position='jitter')
Local aesthetics
ggplot(data=mpg, aes(x=hwy, y=cty)) +
  geom_point(aes(size=displ), alpha=0.2, position='jitter')
[Figure: jittered scatterplot of cty against hwy, point size proportional to displ, semi-transparent points]
Adding lines
• another very powerful feature of ggplot is the ability to add lines to a plot.
• in particular, lines that are generated by the application of a statistical procedure to the data in the plot. For example:
   • linear regression
   • local smoothing techniques such as “loess”
• here we are using the geom_smooth geometric object.
• if no method is specified, geom_smooth will choose a method based on sample size: “loess” for n<1000, otherwise a generalised additive model is used (don’t worry about this for now...)
• the syntax is:
ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() + geom_smooth()
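Because the mpg data has only 234 rows, the automatic choice described above is loess. The sketch below states the method and formula explicitly, which should give the same smooth as the default call; the object names p_auto and p_loess are illustrative and not from the lecture:

```r
library(ggplot2)

# Default call: geom_smooth() chooses the method itself (loess here, n = 234)
p_auto <- ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() + geom_smooth()

# The same smoother with the method and formula written out
p_loess <- ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() +
  geom_smooth(method="loess", formula=y ~ x)
p_loess
```

Writing the method out makes the plot's behaviour independent of the sample size, which matters if the same code is later reused on a larger data set.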
Adding lines
ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() + geom_smooth()
[Figure: scatterplot of hwy against displ with a smoothed line and confidence band]
Adding lines: straight line
• use geom_smooth(method=lm) to fit a linear model (i.e., simple linear regression) to the data:
ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() + geom_smooth(method=lm)
[Figure: scatterplot of hwy against displ with a fitted straight line and confidence band]
Linear regression
• Here the geom_smooth() function is fitting a linear regression, and then adding that line (and confidence interval, if se=TRUE) to the plot. Let’s check manually:
linreg = lm(hwy ~ displ, data=mpg)
summary(linreg)$coefficients
##              Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 35.697651  0.7203676  49.55477 2.123519e-125
## displ       -3.530589  0.1945137 -18.15085  2.038974e-46
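The coefficients define the fitted line hwy = 35.70 - 3.53 x displ. As a small worked check, the sketch below evaluates that line by hand at displ = 1.6 (the smallest engine in the data) and compares it with predict(); the names b, by_hand and by_predict are illustrative:

```r
library(ggplot2)  # provides the mpg data set

linreg <- lm(hwy ~ displ, data=mpg)
b <- coef(linreg)

# Fitted line: hwy = intercept + slope * displ, evaluated at displ = 1.6
by_hand <- unname(b["(Intercept)"] + b["displ"] * 1.6)
by_predict <- unname(predict(linreg, newdata=data.frame(displ=1.6)))

c(by_hand=by_hand, by_predict=by_predict)  # both approximately 30.05
```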
Add regression line to plot (base R)
plot(mpg$displ, mpg$hwy)
abline(linreg)
[Figure: base R scatterplot of mpg$hwy against mpg$displ with the fitted regression line]
Calculating and adding confidence intervals
newx = seq(min(mpg$displ), max(mpg$displ), by = 0.05)
conf_interval = predict(linreg, newdata=data.frame(displ=newx),
                        interval="confidence", level = 0.95)
ci = data.frame(newx, conf_interval)
head(ci)
##   newx      fit      lwr      upr
## 1 1.60 30.04871 29.17768 30.91974
## 2 1.65 29.87218 29.01686 30.72750
## 3 1.70 29.69565 28.85590 30.53540
## 4 1.75 29.51912 28.69479 30.34345
## 5 1.80 29.34259 28.53352 30.15166
## 6 1.85 29.16606 28.37208 29.96005
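The interval returned by predict() is the fitted value plus or minus a t quantile times the standard error of the fitted value. The sketch below rebuilds the same limits from those pieces, using the se.fit=TRUE option of predict() and qt(); the names pr, tcrit and ci_manual are illustrative and not from the lecture:

```r
library(ggplot2)  # provides the mpg data set

linreg <- lm(hwy ~ displ, data=mpg)
newx <- seq(min(mpg$displ), max(mpg$displ), by=0.05)

# Fitted values and their standard errors on the grid of displ values
pr <- predict(linreg, newdata=data.frame(displ=newx), se.fit=TRUE)

# 95% limits: fit +/- t quantile (residual degrees of freedom) * se of fit
tcrit <- qt(0.975, df=pr$df)
ci_manual <- data.frame(newx,
                        fit=pr$fit,
                        lwr=pr$fit - tcrit * pr$se.fit,
                        upr=pr$fit + tcrit * pr$se.fit)
head(ci_manual)
```

The standard error of the fit is smallest near the mean of displ, which is why the band is narrowest in the middle of the plot and widens towards the ends.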
Calculating and adding confidence intervals
plot(mpg$displ, mpg$hwy)
abline(linreg, col="lightblue")
lines(ci$newx, ci$lwr, col="blue", lty=2)
lines(ci$newx, ci$upr, col="blue", lty=2)
[Figure: base R scatterplot of mpg$hwy against mpg$displ with the regression line (light blue) and dashed confidence limits (blue)]
Check against ggplot
ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() +
  geom_smooth(method='lm', se=TRUE) +
  geom_abline(intercept=linreg$coef[1], slope=linreg$coef[2], colour='red') +
  geom_line(data=ci, aes(x=newx, y=lwr)) + geom_line(data=ci, aes(x=newx, y=upr))
[Figure: scatterplot of hwy against displ with the geom_smooth regression line and confidence band, overlaid with the manually computed line (red) and confidence limits (black)]
Adding lines: remove confidence interval
ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() + geom_smooth(se=FALSE)
[Figure: scatterplot of hwy against displ with a smoothed line and no confidence band]
Colour points by class
• It would be useful to colour the points on the plot by vehicle class (2seater, compact, etc.)
• Intuitively we can do this by setting colour=class.
• Works when we only have geom_point - what happens when we also have the geom_smooth layer in the plot?
Colour points by class: oops...
ggplot(data=mpg, aes(x=displ, y=hwy, colour=class)) + geom_point() + geom_smooth()
[Figure: scatterplot of hwy against displ coloured by class (2seater, compact, midsize, minivan, pickup, subcompact, suv), with a separate smoothed line for each class]
What happened?
• The colour=class specification in the main ggplot aesthetics was used for all geometric objects in the plot.
• What if we only want it to apply to geom_point but not geom_smooth?
• Remember the example with point size from above...?
• We can specify the colour=class aesthetic within geom_point so that it is only used for that layer:
ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point(aes(colour=class)) +
  geom_smooth()
Local aesthetics to the rescue!
ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point(aes(colour=class)) +
  geom_smooth()
[Figure: scatterplot of hwy against displ with points coloured by class and a single smoothed line for all of the data]
Lines and facets
• we can also add lines to our faceted plots:
ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() +
  geom_smooth(method=lm, se=FALSE) + facet_wrap(~drv, nrow=1)
[Figure: scatterplots of hwy against displ with fitted straight lines, one panel per drive type (4, f, r) in a single row]
Caution! Faceting and confidence intervals
• When the geom_smooth function is used to add lines (and confidence intervals), the calculations are performed per facet group.
• This can lead to differences in the confidence intervals that are calculated, compared to a regression model fit to the full data set (with an interaction term for the faceting variable):
   • the regression lines will be the same
   • the confidence intervals will be different
• This occurs because in the full regression model, all of the data points are used to estimate the standard error, whereas in the per-facet model, only the data points from that group are used.
Faceting and confidence intervals
ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() +
  geom_smooth(method='lm', se=TRUE) + facet_wrap(~drv)
[Figure: scatterplots of hwy against displ with fitted straight lines and confidence bands, one panel per drive type (4, f, r)]
Close up for the rear wheel drive group
rwd = filter(mpg, drv=="r")
ggplot(data=rwd, aes(x=displ, y=hwy)) + geom_point() +
  geom_smooth(method='lm', se=TRUE) + xlim(3,7) + ylim(10,30)
[Figure: scatterplot of hwy against displ for the rear wheel drive cars only, with the fitted straight line and confidence band]
Regression model, with drv interaction term
linreg2 = lm(hwy ~ displ*drv, data=mpg)
summary(linreg2)$coef
##               Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 30.6831131  1.0960630  27.993933 1.018637e-75
## displ       -2.8784863  0.2637577 -10.913372 1.392287e-22
## drvf         6.6949631  1.5670461   4.272346 2.841696e-05
## drvr        -4.9033952  4.1821302  -1.172464 2.422346e-01
## displ:drvf  -0.7243016  0.4979149  -1.454669 1.471361e-01
## displ:drvr   1.9550477  0.8147555   2.399552 1.721899e-02
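The interaction coefficients can be combined to recover the line for each drive type. For the rear wheel drive group the intercept is 30.683 - 4.903 = 25.780 and the slope is -2.878 + 1.955 = -0.923. The sketch below checks that these match a regression fitted to the rear wheel drive cars alone, which is what geom_smooth does within that facet; the names full_r and sep_r are illustrative, and base R subset() is used here in place of dplyr's filter:

```r
library(ggplot2)  # provides the mpg data set

linreg2 <- lm(hwy ~ displ*drv, data=mpg)
b <- coef(linreg2)

# Line for the rear wheel drive group implied by the interaction model
full_r <- c(intercept=unname(b["(Intercept)"] + b["drvr"]),
            slope=unname(b["displ"] + b["displ:drvr"]))

# Separate regression using only the rear wheel drive cars
sep_r <- coef(lm(hwy ~ displ, data=subset(mpg, drv == "r")))

full_r  # approximately 25.78 and -0.92
sep_r   # the same two values
```

The fitted lines agree, so any difference between the two approaches shows up only in the confidence intervals.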
Confidence intervals on ggplot
• Confidence intervals from full regression model (using all data with drv interaction term: black lines) are narrower than the “per-facet” interval calculated by geom_smooth.
[Figure: hwy against displ (same axis ranges as the rear wheel drive close-up), showing the per-facet confidence band from geom_smooth with the full-model confidence limits as black lines]
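The slides do not show the code for this last plot. One way it might be produced, following the same predict() approach used earlier for the simple regression, is sketched below; the names ci_full and p_ci are illustrative, base R subset() stands in for dplyr's filter, and the author's actual code may differ:

```r
library(ggplot2)

linreg2 <- lm(hwy ~ displ*drv, data=mpg)
rwd <- subset(mpg, drv == "r")

# Confidence limits from the full model, evaluated for the "r" group only
newx <- seq(min(rwd$displ), max(rwd$displ), by=0.05)
ci_full <- data.frame(displ=newx,
                      predict(linreg2, newdata=data.frame(displ=newx, drv="r"),
                              interval="confidence", level=0.95))

# Per-facet band from geom_smooth, with the full-model limits as black lines
p_ci <- ggplot(data=rwd, aes(x=displ, y=hwy)) + geom_point() +
  geom_smooth(method='lm', formula=y ~ x, se=TRUE) +
  geom_line(data=ci_full, aes(y=lwr), colour='black') +
  geom_line(data=ci_full, aes(y=upr), colour='black') +
  xlim(3,7) + ylim(10,30)
p_ci
```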