Upload
others
View
27
Download
0
Embed Size (px)
Citation preview
Coding Lab: Visualizing data with ggplot2
Ari Anisfeld
Summer 2020
1 / 36
How to use ggplot
I How to map data to aesthetics with aes() (and what thatmeans)
I How to visualize the mappings with geomsI How to get more out of your data by using multiple aestheticsI How to use facets to add dimensionality
There are whole books on how to use ggplot. This is a quickintroduction!
2 / 36
Understanding ggplot()By itself, ggplot() tells R to prepare to make a plot.texas_annual_sales <-texas_housing_data %>%group_by(year) %>%summarize(total_volume = sum(volume, na.rm = TRUE))
ggplot(data = texas_annual_sales)
3 / 36
Adding a mappingAdding mapping = aes() says how the data will map to“aesthetics”.
I e.g. tell R to make x-axis year and y-axis total_volume.I Each row of the data has (year, total_volume).
I R will map that to the coordinate pair (x,y) .I Look at the data before moving on!
ggplot(data = texas_annual_sales,mapping = aes(x = year, y = total_volume))
4e+10
5e+10
6e+10
7e+10
8e+10
2000 2005 2010 2015year
tota
l_vo
lum
e
4 / 36
Visualizing the mapping with a geomgeom_<name> tells R what type of visualization to produce.
Here we see points.
I Each row of the data has (year, total_volume).I R will map that to the coordinate pair (x,y).
ggplot(data = texas_annual_sales,mapping = aes(x = year, y = total_volume)) +
geom_point()
4e+10
5e+10
6e+10
7e+10
8e+10
2000 2005 2010 2015year
tota
l_vo
lum
e
5 / 36
Visualizing the mapping with a geomHere we see bars.
I Each row of the data has (year, total_volume).I R will map that to the coordinate pair (x,y)
ggplot(data = texas_annual_sales,mapping = aes(x = year, y = total_volume)) +
geom_col()
0e+00
2e+10
4e+10
6e+10
8e+10
2000 2005 2010 2015year
tota
l_vo
lum
e
6 / 36
Visualizing the mapping with a geom
Here we see a line connecting each (x,y) pair.ggplot(data = texas_annual_sales,
mapping = aes(x = year, y = total_volume)) +geom_line()
4e+10
5e+10
6e+10
7e+10
8e+10
2000 2005 2010 2015year
tota
l_vo
lum
e
7 / 36
Visualizing the mapping with a geomHere we see a smooth line. R does a statistical transformation!
I Now R doesn’t visualize the mapping (year, total_volume) to each (x,y)pair
I Instead it fits a model to the (x,y) and then plots the “smooth” line
ggplot(data = texas_annual_sales,mapping = aes(x = year, y = total_volume)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
2.5e+10
5.0e+10
7.5e+10
2000 2005 2010 2015year
tota
l_vo
lum
e
8 / 36
Visualizing the mapping with a geom
We can overlay several geom.ggplot(data = texas_annual_sales,
mapping = aes(x = year, y = total_volume)) +geom_smooth() +geom_point()
2.5e+10
5.0e+10
7.5e+10
2000 2005 2010 2015year
tota
l_vo
lum
e
9 / 36
Visualizing the mapping with a geom
I We saw that we can visualize a relationship between twovariables mapping data to x and y
I The data can be visualized with different geoms that can becomposed (+) together.
I We can even calculate new variables with statistics and plotthose on the fly.
Next: Now we’ll look at aesthetics that go beyond x and y axes.
10 / 36
Using aesthetics to explore data.We’ll use midwest data and start with only mapping to x and y
midwest %>%ggplot(aes(x = percollege,
y = percbelowpoverty)) +geom_point()
0
10
20
30
40
50
10 20 30 40 50percollege
perc
belo
wpo
vert
y
11 / 36
Using aesthetics to explore data.I color maps data to the color of points or lines.
I Each state is assigned a color.I This works with discrete data and continuous data.
midwest %>%ggplot(aes(x = percollege,
y = percbelowpoverty,color = state)) +
geom_point()
0
10
20
30
40
50
10 20 30 40 50percollege
perc
belo
wpo
vert
y state
IL
IN
MI
OH
WI
12 / 36
Using aesthetics to explore data.I shape maps data to the shape of points.
I Each state is assigned a shape.I This works with discrete data only.
midwest %>%ggplot(aes(x = percollege,
y = percbelowpoverty,shape = state)) +
geom_point()
0
10
20
30
40
50
10 20 30 40 50percollege
perc
belo
wpo
vert
y state
IL
IN
MI
OH
WI
13 / 36
Using aesthetics to explore data.I alpha maps data to the transparency of points.
I Here we map the percentage of people within a known povertystatus to alpha
midwest %>%ggplot(aes(x = percollege,
y = percbelowpoverty,alpha = poptotal)) +
geom_point()
0
10
20
30
40
50
10 20 30 40 50percollege
perc
belo
wpo
vert
y poptotal
1e+06
2e+06
3e+06
4e+06
5e+06
14 / 36
Using aesthetics to explore data.I size maps data to the size of points and width of lines.
I Here we map the percentage of people within a known povertystatus to size
midwest %>%ggplot(aes(x = percollege,
y = percbelowpoverty,size = poptotal)) +
geom_point()
0
10
20
30
40
50
10 20 30 40 50percollege
perc
belo
wpo
vert
y poptotal
1e+06
2e+06
3e+06
4e+06
5e+06
15 / 36
Using aesthetics to explore data.
We can combine any and all aesthetics, and even map the same variable tomultiple aestheticsmidwest %>%
ggplot(aes(x = percollege,y = percbelowpoverty,alpha = percpovertyknown,size = poptotal,color = state))+
geom_point()
16 / 36
Using aesthetics to explore data.
0
10
20
30
40
50
10 20 30 40 50percollege
perc
belo
wpo
vert
y
percpovertyknown
85
90
95
state
IL
17 / 36
Using aesthetics to explore data
Different geoms have specific aesthetics that go with them.
I use ? to see which aesthetics a geom accepts (e.g?geom_point)
I the bold aesthetics are required.I the ggplot cheatsheet shows all the geoms with their associated
aesthetics
18 / 36
FacetsFacets provide an additional tool to explore multidimensional datamidwest %>%
ggplot(aes(x = log(poptotal),y = percbelowpoverty)) +
geom_point() +facet_wrap(vars(state))
OH WI
IL IN MI
8 10 12 14 8 10 12 14
8 10 12 140
1020304050
01020304050
log(poptotal)
perc
belo
wpo
vert
y
19 / 36
discrete vs continuous dataaes discrete continuous
limited number of classes unlimited number of classesusually chr or lgl numeric
x, y yes yescolor, fill yes yesshape yes (6 or fewer categories) nosize, alpha not advised yesfacet yes not advised
Here, discrete and continuous have different meaning than in math
I For ggplot meaning is more fluid.I If you do group_by with the var and there are fewer than 6 to
10 groups, discrete visualizations can workI If your “discrete” data is numeric, as.character() or
as_factor() to enforce the decision.20 / 36
color can be continuousmidwest %>%
ggplot(aes(x = percollege,y = percbelowpoverty,color = percpovertyknown)) +
geom_point()
0
10
20
30
40
50
10 20 30 40 50percollege
perc
belo
wpo
vert
y
85
90
95
percpovertyknown
21 / 36
shape does not play well with many categoriesI Will only map to 6 categories, the rest become NA.I We can override this behavior and get up to 25 distinct shapes
midwest %>%ggplot(aes(x = percollege,
y = percbelowpoverty,shape = county)) +
geom_point() +# legend off, otherwise it overwhelmstheme(legend.position = "none")
0
10
20
30
40
50
10 20 30 40 50percollege
perc
belo
wpo
vert
y
22 / 36
alpha and size can be misleading with discrete datamidwest %>%
ggplot(aes(x = percollege,y = percbelowpoverty,alpha = state)) +
geom_point()
## Warning: Using alpha for a discrete variable is not advised.
0
10
20
30
40
50
10 20 30 40 50percollege
perc
belo
wpo
vert
y state
IL
IN
MI
OH
WI
23 / 36
Adding vertical linestexas_annual_sales %>%
ggplot(aes(x = year, y = total_volume)) +geom_point() +geom_vline(aes(xintercept = 2007),
linetype = "dotted")
4e+10
5e+10
6e+10
7e+10
8e+10
2000 2005 2010 2015year
tota
l_vo
lum
e
I add horizontal lines with geom_hline()I add any linear fit using geom_abline() by providing a slope
and intercept.24 / 36
Key take aways
I ggplot starts by mapping data to “aesthetics”.I e.g. What data shows up on x and y axes and how color,
size and shape appear on the plot.I We need to be aware of ‘continuous’ vs. ‘discrete’ variables.
I Then, we use geoms to create a visualization based on themapping.
I Again we need to be aware of ‘continuous’ vs. ‘discrete’variables.
I Making quick plots helps us understand data and makes usaware of data issues
Resources: R for Data Science chap. 3 (r4ds.had.co.nz); RStudio’sggplot cheatsheet.
25 / 36
Appendix: Some graphs you made along the way
26 / 36
lab 0: a map
geom_path is like geom_line, but connects (x, y) pairs in theorder they appear in the data set.storms %>%
group_by(name, year) %>%filter(max(category) == 5) %>%
ggplot(aes(x = long, y = lat, color = name)) +geom_path() +borders("world") +coord_quickmap(xlim = c(-130, -60), ylim = c(20, 50))
27 / 36
lab 0: a map
20
30
40
50
−120 −100 −80 −60long
lat
Dean
Emily
Felix
Gilbert
Hugo
Isabel
Ivan
Katrina
Mitch
28 / 36
lab 1: a line plot
french_data <-wid_data %>%filter(type == "Net personal wealth",
country == "France") %>%mutate(perc_national_wealth = value * 100)
french_data %>%ggplot(aes(y = perc_national_wealth,
x = year,color = percentile)) +
geom_line()
29 / 36
lab 1: a line plot
0
25
50
75
1900 1925 1950 1975 2000year
perc
_nat
iona
l_w
ealth
percentile
p0p50
p50p90
p90p100
p99p100
30 / 36
lab 2: distributions
I geom_density() only requires an x asthetic and it calculatesthe distribution to plot.
I We can set the aesthetics manually, independent of data fornicer graphs.chi_sq_samples <-tibble(x = c(rchisq(100000, 2),
rchisq(100000, 3),rchisq(100000, 4)),
df = rep(c("2", "3", "4"), each = 1e5))
chi_sq_samples %>%ggplot(aes(x = x, fill = df)) +geom_density( alpha = .5) +labs(fill = "df", x = "sample")
31 / 36
lab 2: distributions
0.0
0.1
0.2
0.3
0.4
0 10 20 30sample
dens
ity
df
2
3
4
32 / 36
lab 4: grouped bar graphs
I position = "dodge2" tells R to put bars next to each other,rather than stacked on top of each other.
I Notice we use fill and not color because we’re “filling” anarea.mean_share_per_country %>%
ggplot(aes(y = country,x = mean_share,fill = percentile)) +
geom_col(position = "dodge2") +labs(x = "Mean share of national wealth",
y = "",fill = "Wealth\npercentile")
33 / 36
lab 4: grouped bar graphs
ChinaFrance
IndiaKorea
RussiaS Africa
UKUSA
0.00 0.25 0.50 0.75Mean share of national wealth
Wealthpercentile
p90p100
p99p100
34 / 36
lab 4: faceted bar graph
I Notice that we manipulate our data to the right specificationbefore making this graph
I Using facet_wrap we get a distinct graph for each timeperiod.
mean_share_per_country_with_time %>%ggplot(aes(x = country,
y = mean_share,fill = percentile)) +
geom_col(position = "dodge2") +facet_wrap(vars(time_period))
35 / 36
lab 4: faceted bar graph
1980 to 1999 2000 to present
1959 and earlier 1960 to 1979
ChinaIndiaUSA ChinaIndiaUSA
0.00.20.40.60.8
0.00.20.40.60.8
country
mea
n_sh
are
percentile
p90p100
p99p100
36 / 36