Data Wrangling and Visualisation in R

STAT3022 Applied Linear Models, Lecture 2

2020/02/14

Today

1. Using ggplot2 for data visualisation.

2. Using dplyr and tidyr for data wrangling.

Data Visualisation

Why make your graphs in R?

The graphs are easily reproducible.

You can make publication quality graphs.

How to make your graphs in R?

R has many contributed packages that extend the standard base installation.

Today we will learn about the ggplot2 R package.

What is base?

Can you name some functions in base that generate a graph?
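As a starting point for the question above, here is a minimal sketch of a few plotting functions that work in a fresh R session. Strictly speaking they live in the graphics package, which ships with R and is attached by default; the object name h is only illustrative.

```r
# Plotting functions available without installing or loading anything extra.
# (They come from the graphics package, attached by default in every session.)
plot(iris$Sepal.Length, iris$Sepal.Width)    # scatter plot
boxplot(Sepal.Width ~ Species, data = iris)  # box plot by group

# hist() draws the histogram and invisibly returns the binning it used.
h <- hist(iris$Petal.Width)
sum(h$counts)  # total number of observations that were binned
```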

ggplot2 R package

ggplot2 is a powerful data visualisation R package with a large community following. It is built on the layered grammar of graphics by Wickham (2008).

One reason it is so powerful is that it is easy to extend, which has resulted in many extension packages.

ggplot2 uses qplot or ggplot to make graphics.

qplot is useful for making quick graphs (especially when the data is not in a data.frame), but ggplot is advisable for most occasions.

We will only cover ggplot.

To get started, load the package:

library(ggplot2) # or library(tidyverse)

Wickham (2008) Practical tools for exploring data and models. PhD Thesis.

Layered Grammar of Graphics

Every ggplot2 object has three key components:

1. Data,

2. A set of aesthetic mappings between variables in the data and visual properties (e.g. colour, size),

3. At least one layer describing how to render each observation; usually created with a geom function.

str(iris)

'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

ggplot(data=iris) + aes(x=Sepal.Length, y=Sepal.Width) + geom_point()

[Figure: scatter plot of Sepal.Width (y-axis) against Sepal.Length (x-axis)]
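The three components can be checked by building the same plot in two ways and looking at what the layer actually receives. This is a minimal sketch: layer_data() is a ggplot2 helper used here only for inspection, and the names p1, p2, d1 and d2 are illustrative.

```r
library(ggplot2)

# Components added one at a time: data, then aesthetic mapping, then a layer.
p1 <- ggplot(data = iris) +
  aes(x = Sepal.Length, y = Sepal.Width) +
  geom_point()

# The same plot with data and mapping supplied directly to ggplot().
p2 <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point()

# layer_data() returns the data frame that the first layer draws:
# one row per observation, with the mapped variables renamed to x and y.
d1 <- layer_data(p1)
d2 <- layer_data(p2)
head(d1[, c("x", "y")])
```

Both versions hand the same x and y values to geom_point(), so the choice between them is a matter of style.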

Every layer has:

1. geom - the geometric object used to display the data, and stat - the statistical transformation to apply to the data for this layer.

2. data and mapping (aesthetics), which are usually inherited from the ggplot() object.

3. position - position in the coordinate system.

p <- ggplot(iris, aes(Sepal.Length, Sepal.Width))
p + geom_point() # blank + geom layer

which is a short-hand for:

p + layer(geom="point", stat="identity", position="identity")
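A quick way to convince yourself that the short-hand and the explicit layer() call are the same thing is to compare the data each layer ends up drawing. A minimal sketch, assuming only ggplot2 is installed; the names short and explicit are illustrative.

```r
library(ggplot2)

p <- ggplot(iris, aes(Sepal.Length, Sepal.Width))

# Short-hand: geom_point() fills in stat and position for us.
short <- layer_data(p + geom_point())

# Explicit: every part of the layer is spelled out.
explicit <- layer_data(
  p + layer(geom = "point", stat = "identity", position = "identity")
)

# stat = "identity" leaves the data untouched, so both layers
# plot exactly the raw x and y values.
all.equal(short[, c("x", "y")], explicit[, c("x", "y")])
```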

Every ggplot object has:

1. Data

2. Aesthetic mapping

3. Layer(s)

The purpose of a layer is to display:

the raw data,

a statistical summary, or

additional metadata such as context, annotations, and references.

Some geom objects

p <- ggplot(iris, aes(Species, Sepal.Width))
class(p)

[1] "gg" "ggplot"

Image source: http://suruchifialoke.com/2016-10-13-machine-learning-tutorial-iris-classification/

p + geom_blank()
p + geom_point()
p + geom_boxplot()
p + geom_violin()

Drawing lines

p <- ggplot(iris, aes(Petal.Length, Petal.Width)) +
  geom_point(colour="gray")

p + geom_abline(intercept=-0.4, slope=0.4)
p + geom_smooth(method="lm")

p + geom_hline(yintercept=0)
p + geom_vline(xintercept=0)

Distribution by group

p <- ggplot(iris, aes(Petal.Width, fill=Species))

p + geom_dotplot()
p + geom_histogram()

p + geom_density()
p + geom_freqpoly(aes(color=Species))

geom              Description
geom_abline       Reference lines: horizontal, vertical, and diagonal
geom_bar          Bar charts
geom_bin2d        Heatmap of 2d bin counts
geom_blank        Draw nothing
geom_boxplot      A box and whiskers plot (in the style of Tukey)
geom_contour      2d contours of a 3d surface
geom_count        Count overlapping points
geom_density      Smoothed density estimates
geom_density_2d   Contours of a 2d density estimate
geom_dotplot      Dot plot
geom_errorbarh    Horizontal error bars
geom_hex          Hexagonal heatmap of 2d bin counts
geom_freqpoly     Histograms and frequency polygons
geom_jitter       Jittered points
geom_crossbar     Vertical intervals: lines, crossbars & errorbars
geom_map          Polygons from a reference map
geom_path         Connect observations
geom_point        Points

Statistical Transformation

head(iris[, c("Petal.Width", "Species")]) # raw data

  Petal.Width Species
1         0.2  setosa
2         0.2  setosa
3         0.2  setosa
4         0.2  setosa
5         0.2  setosa
6         0.4  setosa

stat_bin(bins=7, mapping=aes(Petal.Width, fill=Species))

Under the hood, the raw data is transformed into statistics, which are then passed to the geom; here the default is geom="bar".

     fill  y count   x xmin xmax density    ncount
1 #619CFF  0     0 0.0 -0.2  0.2     0.0 0.0000000
2 #00BA38  0     0 0.0 -0.2  0.2     0.0 0.0000000
3 #F8766D 34    34 0.0 -0.2  0.2     1.7 1.0000000
4 #619CFF  0     0 0.4  0.2  0.6     0.0 0.0000000
5 #00BA38  0     0 0.4  0.2  0.6     0.0 0.0000000
6 #F8766D 16    16 0.4  0.2  0.6     0.8 0.4705882
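One way to look at the computed statistics yourself is layer_data(), which returns the data frame a layer passes to its geom. This is a sketch rather than the exact code behind the table above; the row order and the set of extra columns may differ between ggplot2 versions, and the name binned is illustrative.

```r
library(ggplot2)

p <- ggplot(iris, aes(Petal.Width, fill = Species)) +
  stat_bin(bins = 7)

# Statistics computed by stat_bin(): one row per bin and group.
binned <- layer_data(p)
head(binned[, c("fill", "count", "x", "xmin", "xmax", "density")])

# Every observation falls into exactly one bin,
# so the counts add up to the number of rows in iris.
sum(binned$count)
```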

Using stat with different geom objects

p <- ggplot(iris, aes(Petal.Width, fill=Species))

p + stat_bin()
p + stat_bin(geom="bar")

p + stat_bin(geom="point")
p + stat_bin(geom="line")

stat              Description
stat_count        Bar charts
stat_bin_2d       Heatmap of 2d bin counts
stat_boxplot      A box and whiskers plot (in the style of Tukey)
stat_contour      2d contours of a 3d surface
stat_sum          Count overlapping points
stat_density      Smoothed density estimates
stat_density_2d   Contours of a 2d density estimate
stat_bin_hex      Hexagonal heatmap of 2d bin counts
stat_bin          Histograms and frequency polygons
stat_qq_line      A quantile-quantile plot
stat_quantile     Quantile regression
stat_smooth       Smoothed conditional means
stat_spoke        Line segments parameterised by location, direction and distance
stat_ydensity     Violin plot
stat_sf           Visualise sf objects
stat_ecdf         Compute empirical cumulative distribution
stat_ellipse      Compute normal confidence ellipses
stat_function     Compute function for each x value

Customisation

There are so many ways to customise a ggplot.

Changing Color

There are many color palettes available, e.g.

library(RColorBrewer)
ggplot(iris, aes(Petal.Width, fill=Species)) +
  geom_dotplot() +
  scale_fill_brewer(palette="Set3")

Grey-scale

ggplot(iris, aes(Petal.Width, fill=Species)) +
  geom_dotplot() +
  scale_fill_grey()

Manual scale

ggplot(iris, aes(Petal.Width, fill=Species)) +
  geom_dotplot() +
  scale_fill_manual(
    values=c("red", "blue", "green"),
    labels=c("setosa", "versicolor", "virginica"))

Color variable is a factor

ggplot(iris, aes(Petal.Width, Petal.Length, color=Species)) +
  geom_point(size=2) +
  scale_color_brewer(palette="Set1")

Color variable is continuous

ggplot(iris, aes(Petal.Width, Petal.Length, color=Sepal.Length)) +
  geom_point(size=2) +
  scale_color_distiller(palette="YlGnBu")

Data Wrangling

You may need to wrangle the data to get it in the right form for ggplot (or other purposes).

library(agridat) # data is inside here
library(dplyr) # for data wrangling; loaded together with library(tidyverse)
str(pearl.kernels) # or glimpse(pearl.kernels)

'data.frame': 59 obs. of 6 variables:
 $ ear: Factor w/ 4 levels "Ear08","Ear09",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ obs: Factor w/ 15 levels "Obs01","Obs02",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ ys : int 352 322 298 332 305 313 308 311 327 308 ...
 $ yt : int 102 49 75 101 101 100 86 101 101 92 ...
 $ ws : int 52 82 108 71 86 90 95 92 78 95 ...
 $ wt : int 26 79 51 28 40 29 43 28 26 37 ...

We are using the data pearl.kernels loaded from library(agridat).

The data contains the counts of yellow/white and sweet/starchy kernels on each of 4 maize ears by 15 observers.

I want to get the counts for the 8th maize ear by observer 1 (plant pathologist).

(ear8obs1 <- pearl.kernels %>% filter(ear=="Ear08" & obs=="Obs01"))

    ear   obs  ys  yt ws wt
1 Ear08 Obs01 352 102 52 26

Image source: http://corncommentary.com/2012/05/22/using-the-kfc-kernel-for-cellulosic/

Pearl, Raymond (1911) The Personal Equation in Breeding Experiments Involving Certain Characters of Maize. Biological Bulletin 21, 339-366.

Help!

The data:

ear8obs1

    ear   obs  ys  yt ws wt
1 Ear08 Obs01 352 102 52 26

How do I make the graph below in ggplot?

ggplot(ear8obs1, aes(x=..., y=...)) + geom_bar()

What if the data was shaped as below?

  Type Count  Color  Kernel
1   ys   352 Yellow Starchy
2   yt   102 Yellow   Sweet
3   ws    52  White Starchy
4   wt    26  White   Sweet

How do I get the data in this shape easily?

ear8obs1 %>% tidyr::gather("Type", "Count", ys:wt)

    ear   obs Type Count
1 Ear08 Obs01   ys   352
2 Ear08 Obs01   yt   102
3 Ear08 Obs01   ws    52
4 Ear08 Obs01   wt    26
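The tidyr documentation now describes gather() as superseded by pivot_longer(). gather() still works, but the sketch below shows the same wide-to-long reshape written both ways. The single row of ear8obs1 is typed in by hand from the output above so that the example runs without agridat; the names long_gather and long_pivot are illustrative.

```r
library(tidyr)

# The one-row data frame shown above, entered by hand.
ear8obs1 <- data.frame(ear = "Ear08", obs = "Obs01",
                       ys = 352, yt = 102, ws = 52, wt = 26)

# Wide to long with gather(), as in the lecture.
long_gather <- gather(ear8obs1, "Type", "Count", ys:wt)

# The same reshape with pivot_longer(): the column names go to Type,
# the cell values go to Count.
long_pivot <- pivot_longer(ear8obs1, cols = ys:wt,
                           names_to = "Type", values_to = "Count")

long_pivot
```

Note that pivot_longer() returns a tibble, so the printed result looks slightly different from the data.frame that gather() returns here.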

Data Wrangling

Get the counts for the 8th maize ear by observer 1 (plant pathologist):

maize <- pearl.kernels %>%
  filter(ear=="Ear08" & obs=="Obs01") %>%
  select(ys, yt, ws, wt) %>%
  tidyr::gather("Type", "Count", ys:wt) %>%
  mutate(Color=case_when(
           Type %in% c("ys", "yt") ~ "Yellow",
           Type %in% c("ws", "wt") ~ "White"),
         Kernel=case_when(
           Type %in% c("ys", "ws") ~ "Starchy",
           Type %in% c("yt", "wt") ~ "Sweet"))
maize

  Type Count  Color  Kernel
1   ys   352 Yellow Starchy
2   yt   102 Yellow   Sweet
3   ws    52  White Starchy
4   wt    26  White   Sweet

Example: Observer 1 for Maize Ear 8

ggplot(maize, aes(Kernel, Count, fill=Color)) +
  geom_bar(stat="identity", color="black") +
  scale_fill_manual(values=c("white", "yellow"),
                    labels=c("White", "Yellow")) +
  guides(fill="none") + # hide the fill legend ("none" replaces the deprecated FALSE)
  theme_minimal(base_size = 20)

Image Source:https://agrifarmingtips.com/maize-cultivation-process/

Position for geom_bar (every call below also includes stat="identity")

p2 <- ggplot(maize, aes("", Count, fill=Type))

p2 + geom_bar()
p2 + geom_bar(position="stack")

p2 + geom_bar(position="dodge")
p2 + geom_bar(position="fill")
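What each position adjustment does can be read off the bar coordinates that ggplot2 computes. A minimal sketch, with the four maize counts typed in by hand so that it runs without agridat, and with stat="identity" written out in full; the names stacked, filled and dodged are illustrative.

```r
library(ggplot2)

# Counts for ear 8, observer 1, entered by hand from the table above.
maize <- data.frame(Type = c("ys", "yt", "ws", "wt"),
                    Count = c(352, 102, 52, 26))

p2 <- ggplot(maize, aes("", Count, fill = Type))

stacked <- layer_data(p2 + geom_bar(stat = "identity", position = "stack"))
filled  <- layer_data(p2 + geom_bar(stat = "identity", position = "fill"))
dodged  <- layer_data(p2 + geom_bar(stat = "identity", position = "dodge"))

# "stack": bars sit on top of each other, so the top of the stack is the total.
max(stacked$ymax) == sum(maize$Count)

# "fill": the stack is rescaled to proportions, so it tops out at 1.
max(filled$ymax)

# "dodge": bars are placed side by side, each starting from zero.
range(dodged$ymin)
```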

Coordinate system

p + geom_bar()
p + geom_bar() + coord_polar(theta="y")

p + geom_bar() + coord_flip()
p + geom_bar() + coord_polar(theta="y", direction=-1)

All geom_bar calls include the arguments stat="identity" and color="black".

Overplotting

g <- ggplot(pearl.kernels, aes(ear, ys, color=ear, size=1)) +
  xlab(NULL) +
  guides(color="none", size="none") + # "none" replaces the deprecated FALSE
  ylab("No. of Yellow\n Starchy Kernel")

g + geom_point()
g + geom_point(position="jitter")

g + geom_point(alpha=1 / 3)
g + geom_point(alpha=1 / 6)

Massaging data to tidy form

maize_all <- pearl.kernels %>%
  tidyr::gather("Type", "Count", ys:wt) %>%
  mutate(Color=ifelse(substr(Type, 1, 1)=="y", "Yellow", "White"),
         Kernel=ifelse(substr(Type, 2, 2)=="s", "Starchy", "Sweet"),
         obs=factor(as.integer(substring(obs, 4, 5))))

head(pearl.kernels)

    ear   obs  ys  yt  ws wt
1 Ear08 Obs01 352 102  52 26
2 Ear08 Obs02 322  49  82 79
3 Ear08 Obs03 298  75 108 51
4 Ear08 Obs04 332 101  71 28
5 Ear08 Obs05 305 101  86 40
6 Ear08 Obs06 313 100  90 29

head(maize_all)

    ear obs Type Count  Color  Kernel
1 Ear08   1   ys   352 Yellow Starchy
2 Ear08   2   ys   322 Yellow Starchy
3 Ear08   3   ys   298 Yellow Starchy
4 Ear08   4   ys   332 Yellow Starchy
5 Ear08   5   ys   305 Yellow Starchy
6 Ear08   6   ys   313 Yellow Starchy
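The mutate() step relies on how the column names are coded: the first letter of Type gives the colour (y or w) and the second gives the kernel type (s or t), while characters 4 to 5 of obs hold the observer number. The sketch below applies the same steps to the first two rows of pearl.kernels, typed in by hand from the output above so that it runs without agridat; the names kernels and tidy are illustrative.

```r
library(dplyr)
library(tidyr)

# First two rows of pearl.kernels as printed above, entered by hand.
kernels <- data.frame(ear = c("Ear08", "Ear08"),
                      obs = c("Obs01", "Obs02"),
                      ys = c(352, 322), yt = c(102, 49),
                      ws = c(52, 82),   wt = c(26, 79))

tidy <- kernels %>%
  gather("Type", "Count", ys:wt) %>%
  mutate(
    # First letter of Type: "y" = yellow, otherwise white.
    Color  = ifelse(substr(Type, 1, 1) == "y", "Yellow", "White"),
    # Second letter of Type: "s" = starchy, otherwise sweet.
    Kernel = ifelse(substr(Type, 2, 2) == "s", "Starchy", "Sweet"),
    # "Obs01" -> "01" -> 1, stored as a factor for plotting.
    obs    = factor(as.integer(substring(obs, 4, 5)))
  )

tidy
```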

Faceting

ggplot(maize_all, aes(obs, Count, fill=Type)) +
  geom_bar(stat="identity") +
  xlab("Observer") +
  facet_wrap(~ear)

ear8 <- maize_all %>%
  filter(ear=="Ear08") %>%
  ggplot(aes(obs, Count, fill=Type)) +
  geom_bar(stat="identity", show.legend=FALSE) +
  labs(tag="(A)", title="Ear 8", x="Observer") +
  facet_grid(Color ~ Kernel)

Patching Plots Together

library(patchwork)
ear8 + ear9 + ear10 + ear11 + plot_layout(ncol = 2)
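The slides define only ear8 explicitly. One possible way to build all four panels without repeating the code is sketched below. The helper plot_ear() is hypothetical (it is not part of the lecture or of any package), the panels are simplified versions of ear8 without the facet_grid() step, and the sketch assumes agridat and patchwork are installed.

```r
library(agridat)
library(dplyr)
library(tidyr)
library(ggplot2)
library(patchwork)

# Long form of the data, as in the wrangling step.
maize_all <- pearl.kernels %>%
  gather("Type", "Count", ys:wt)

# Hypothetical helper: one stacked bar chart for a single ear.
plot_ear <- function(ear_id) {
  maize_all %>%
    filter(ear == ear_id) %>%
    ggplot(aes(obs, Count, fill = Type)) +
    geom_bar(stat = "identity", show.legend = FALSE) +
    labs(title = ear_id, x = "Observer")
}

# One plot per ear, then arranged in a grid with two columns.
plots <- lapply(levels(pearl.kernels$ear), plot_ear)
combined <- wrap_plots(plots, ncol = 2)
combined
```

wrap_plots() takes a list of plots, which is convenient when the number of panels is not known in advance; with a handful of named plots, the + syntax above reads more naturally.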

Changing Labels

g <- ggplot(vargas.wheat1.traits, aes(NGS, yield)) +
  geom_point(size=3) +
  geom_point(aes(colour=gen)) +
  geom_smooth(se=FALSE, method="lm") +
  facet_wrap(~year) +
  labs(colour="Genotype") + # changes the label name for color legend
  labs(x="Number of grains per spikelet") + # same as xlab(..)
  labs(y="Yield (kg/ha)") + # same as ylab(..)
  labs(title="Durum Wheat at Ciudad Obregon, Mexico 1990-1995") + # same as ggtitle(..)
  labs(subtitle="Source: Vargas et al. (1998) Interpreting Genotyp # same as ggtitle(subtitle=..)

Theme - customise the look

g + theme(legend.position="bottom",
          plot.title=element_text(face="bold", size=15),
          plot.subtitle=element_text(face="italic", size=8),
          panel.background=element_rect(fill="white"),
          panel.border=element_rect(colour="grey20", fill=NA),
          panel.grid=element_line(colour="grey92"),
          panel.grid.minor=element_line(size=rel(0.5)),
          strip.background=element_rect(fill="grey85", colour="grey20"),
          legend.key=element_rect(fill="white"))

Or use a pre-defined theme:

g + theme_bw() +
  theme(legend.position="bottom",
        plot.title=element_text(face="bold", size=14),
        plot.subtitle=element_text(face="italic", size=8))
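If the same look is needed on several plots, the pre-defined theme and the overrides can be stored once and reused. A minimal sketch: the name theme_lecture is made up for illustration, and the plot uses iris as a stand-in for vargas.wheat1.traits so that it runs without agridat.

```r
library(ggplot2)

# A pre-defined theme plus a few overrides, saved as one reusable object.
theme_lecture <- theme_bw() +
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 8))

# Stand-in plot on iris (the slides use the wheat data instead).
g_iris <- ggplot(iris, aes(Petal.Width, Petal.Length, colour = Species)) +
  geom_point() +
  labs(title = "Iris petals", subtitle = "Stand-in data for illustration")

# Apply the stored theme like any other theme.
styled <- g_iris + theme_lecture
styled
```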

More Pre-Defined Themes

g + theme_gray()

g + theme_classic()

g + theme_minimal()

g + theme_dark()

Summary

Using functions such as filter and mutate from dplyr to wrangle data.

Using function gather from tidyr to change the data from wide to long form.

Using ggplot from ggplot2 to make many sorts of plots.

Next lesson

Revisiting simple linear regression.

Maximum likelihood estimation.
