36
Coding Lab: Visualizing data with ggplot2 Ari Anisfeld Summer 2020 1 / 36

Coding Lab: Visualizing data with ggplot2

  • Upload
    others

  • View
    27

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Coding Lab: Visualizing data with ggplot2

Coding Lab: Visualizing data with ggplot2

Ari Anisfeld

Summer 2020

1 / 36

Page 2: Coding Lab: Visualizing data with ggplot2

How to use ggplot

I How to map data to aesthetics with aes() (and what thatmeans)

I How to visualize the mappings with geomsI How to get more out of your data by using multiple aestheticsI How to use facets to add dimensionality

There are whole books on how to use ggplot. This is a quickintroduction!

2 / 36

Page 3: Coding Lab: Visualizing data with ggplot2

Understanding ggplot()By itself, ggplot() tells R to prepare to make a plot.texas_annual_sales <-texas_housing_data %>%group_by(year) %>%summarize(total_volume = sum(volume, na.rm = TRUE))

ggplot(data = texas_annual_sales)

3 / 36

Page 4: Coding Lab: Visualizing data with ggplot2

Adding a mappingAdding mapping = aes() says how the data will map to“aesthetics”.

I e.g. tell R to make x-axis year and y-axis total_volume.I Each row of the data has (year, total_volume).

I R will map that to the coordinate pair (x,y) .I Look at the data before moving on!

ggplot(data = texas_annual_sales,mapping = aes(x = year, y = total_volume))

4e+10

5e+10

6e+10

7e+10

8e+10

2000 2005 2010 2015year

tota

l_vo

lum

e

4 / 36

Page 5: Coding Lab: Visualizing data with ggplot2

Visualizing the mapping with a geomgeom_<name> tells R what type of visualization to produce.

Here we see points.

I Each row of the data has (year, total_volume).I R will map that to the coordinate pair (x,y).

ggplot(data = texas_annual_sales,mapping = aes(x = year, y = total_volume)) +

geom_point()

4e+10

5e+10

6e+10

7e+10

8e+10

2000 2005 2010 2015year

tota

l_vo

lum

e

5 / 36

Page 6: Coding Lab: Visualizing data with ggplot2

Visualizing the mapping with a geomHere we see bars.

I Each row of the data has (year, total_volume).I R will map that to the coordinate pair (x,y)

ggplot(data = texas_annual_sales,mapping = aes(x = year, y = total_volume)) +

geom_col()

0e+00

2e+10

4e+10

6e+10

8e+10

2000 2005 2010 2015year

tota

l_vo

lum

e

6 / 36

Page 7: Coding Lab: Visualizing data with ggplot2

Visualizing the mapping with a geom

Here we see a line connecting each (x,y) pair.ggplot(data = texas_annual_sales,

mapping = aes(x = year, y = total_volume)) +geom_line()

4e+10

5e+10

6e+10

7e+10

8e+10

2000 2005 2010 2015year

tota

l_vo

lum

e

7 / 36

Page 8: Coding Lab: Visualizing data with ggplot2

Visualizing the mapping with a geomHere we see a smooth line. R does a statistical transformation!

I Now R doesn’t visualize the mapping (year, total_volume) to each (x,y)pair

I Instead it fits a model to the (x,y) and then plots the “smooth” line

ggplot(data = texas_annual_sales,mapping = aes(x = year, y = total_volume)) +

geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

2.5e+10

5.0e+10

7.5e+10

2000 2005 2010 2015year

tota

l_vo

lum

e

8 / 36

Page 9: Coding Lab: Visualizing data with ggplot2

Visualizing the mapping with a geom

We can overlay several geom.ggplot(data = texas_annual_sales,

mapping = aes(x = year, y = total_volume)) +geom_smooth() +geom_point()

2.5e+10

5.0e+10

7.5e+10

2000 2005 2010 2015year

tota

l_vo

lum

e

9 / 36

Page 10: Coding Lab: Visualizing data with ggplot2

Visualizing the mapping with a geom

I We saw that we can visualize a relationship between twovariables mapping data to x and y

I The data can be visualized with different geoms that can becomposed (+) together.

I We can even calculate new variables with statistics and plotthose on the fly.

Next: Now we’ll look at aesthetics that go beyond x and y axes.

10 / 36

Page 11: Coding Lab: Visualizing data with ggplot2

Using aesthetics to explore data.We’ll use midwest data and start with only mapping to x and y

midwest %>%ggplot(aes(x = percollege,

y = percbelowpoverty)) +geom_point()

0

10

20

30

40

50

10 20 30 40 50percollege

perc

belo

wpo

vert

y

11 / 36

Page 12: Coding Lab: Visualizing data with ggplot2

Using aesthetics to explore data.I color maps data to the color of points or lines.

I Each state is assigned a color.I This works with discrete data and continuous data.

midwest %>%ggplot(aes(x = percollege,

y = percbelowpoverty,color = state)) +

geom_point()

0

10

20

30

40

50

10 20 30 40 50percollege

perc

belo

wpo

vert

y state

IL

IN

MI

OH

WI

12 / 36

Page 13: Coding Lab: Visualizing data with ggplot2

Using aesthetics to explore data.I shape maps data to the shape of points.

I Each state is assigned a shape.I This works with discrete data only.

midwest %>%ggplot(aes(x = percollege,

y = percbelowpoverty,shape = state)) +

geom_point()

0

10

20

30

40

50

10 20 30 40 50percollege

perc

belo

wpo

vert

y state

IL

IN

MI

OH

WI

13 / 36

Page 14: Coding Lab: Visualizing data with ggplot2

Using aesthetics to explore data.I alpha maps data to the transparency of points.

I Here we map the percentage of people within a known povertystatus to alpha

midwest %>%ggplot(aes(x = percollege,

y = percbelowpoverty,alpha = poptotal)) +

geom_point()

0

10

20

30

40

50

10 20 30 40 50percollege

perc

belo

wpo

vert

y poptotal

1e+06

2e+06

3e+06

4e+06

5e+06

14 / 36

Page 15: Coding Lab: Visualizing data with ggplot2

Using aesthetics to explore data.I size maps data to the size of points and width of lines.

I Here we map the percentage of people within a known povertystatus to size

midwest %>%ggplot(aes(x = percollege,

y = percbelowpoverty,size = poptotal)) +

geom_point()

0

10

20

30

40

50

10 20 30 40 50percollege

perc

belo

wpo

vert

y poptotal

1e+06

2e+06

3e+06

4e+06

5e+06

15 / 36

Page 16: Coding Lab: Visualizing data with ggplot2

Using aesthetics to explore data.

We can combine any and all aesthetics, and even map the same variable tomultiple aestheticsmidwest %>%

ggplot(aes(x = percollege,y = percbelowpoverty,alpha = percpovertyknown,size = poptotal,color = state))+

geom_point()

16 / 36

Page 17: Coding Lab: Visualizing data with ggplot2

Using aesthetics to explore data.

0

10

20

30

40

50

10 20 30 40 50percollege

perc

belo

wpo

vert

y

percpovertyknown

85

90

95

state

IL

17 / 36

Page 18: Coding Lab: Visualizing data with ggplot2

Using aesthetics to explore data

Different geoms have specific aesthetics that go with them.

I use ? to see which aesthetics a geom accepts (e.g?geom_point)

I the bold aesthetics are required.I the ggplot cheatsheet shows all the geoms with their associated

aesthetics

18 / 36

Page 19: Coding Lab: Visualizing data with ggplot2

FacetsFacets provide an additional tool to explore multidimensional datamidwest %>%

ggplot(aes(x = log(poptotal),y = percbelowpoverty)) +

geom_point() +facet_wrap(vars(state))

OH WI

IL IN MI

8 10 12 14 8 10 12 14

8 10 12 140

1020304050

01020304050

log(poptotal)

perc

belo

wpo

vert

y

19 / 36

Page 20: Coding Lab: Visualizing data with ggplot2

discrete vs continuous dataaes discrete continuous

limited number of classes unlimited number of classesusually chr or lgl numeric

x, y yes yescolor, fill yes yesshape yes (6 or fewer categories) nosize, alpha not advised yesfacet yes not advised

Here, discrete and continuous have different meaning than in math

I For ggplot meaning is more fluid.I If you do group_by with the var and there are fewer than 6 to

10 groups, discrete visualizations can workI If your “discrete” data is numeric, as.character() or

as_factor() to enforce the decision.20 / 36

Page 21: Coding Lab: Visualizing data with ggplot2

color can be continuousmidwest %>%

ggplot(aes(x = percollege,y = percbelowpoverty,color = percpovertyknown)) +

geom_point()

0

10

20

30

40

50

10 20 30 40 50percollege

perc

belo

wpo

vert

y

85

90

95

percpovertyknown

21 / 36

Page 22: Coding Lab: Visualizing data with ggplot2

shape does not play well with many categoriesI Will only map to 6 categories, the rest become NA.I We can override this behavior and get up to 25 distinct shapes

midwest %>%ggplot(aes(x = percollege,

y = percbelowpoverty,shape = county)) +

geom_point() +# legend off, otherwise it overwhelmstheme(legend.position = "none")

0

10

20

30

40

50

10 20 30 40 50percollege

perc

belo

wpo

vert

y

22 / 36

Page 23: Coding Lab: Visualizing data with ggplot2

alpha and size can be misleading with discrete datamidwest %>%

ggplot(aes(x = percollege,y = percbelowpoverty,alpha = state)) +

geom_point()

## Warning: Using alpha for a discrete variable is not advised.

0

10

20

30

40

50

10 20 30 40 50percollege

perc

belo

wpo

vert

y state

IL

IN

MI

OH

WI

23 / 36

Page 24: Coding Lab: Visualizing data with ggplot2

Adding vertical linestexas_annual_sales %>%

ggplot(aes(x = year, y = total_volume)) +geom_point() +geom_vline(aes(xintercept = 2007),

linetype = "dotted")

4e+10

5e+10

6e+10

7e+10

8e+10

2000 2005 2010 2015year

tota

l_vo

lum

e

I add horizontal lines with geom_hline()I add any linear fit using geom_abline() by providing a slope

and intercept.24 / 36

Page 25: Coding Lab: Visualizing data with ggplot2

Key take aways

I ggplot starts by mapping data to “aesthetics”.I e.g. What data shows up on x and y axes and how color,

size and shape appear on the plot.I We need to be aware of ‘continuous’ vs. ‘discrete’ variables.

I Then, we use geoms to create a visualization based on themapping.

I Again we need to be aware of ‘continuous’ vs. ‘discrete’variables.

I Making quick plots helps us understand data and makes usaware of data issues

Resources: R for Data Science chap. 3 (r4ds.had.co.nz); RStudio’sggplot cheatsheet.

25 / 36

Page 26: Coding Lab: Visualizing data with ggplot2

Appendix: Some graphs you made along the way

26 / 36

Page 27: Coding Lab: Visualizing data with ggplot2

lab 0: a map

geom_path is like geom_line, but connects (x, y) pairs in theorder they appear in the data set.storms %>%

group_by(name, year) %>%filter(max(category) == 5) %>%

ggplot(aes(x = long, y = lat, color = name)) +geom_path() +borders("world") +coord_quickmap(xlim = c(-130, -60), ylim = c(20, 50))

27 / 36

Page 28: Coding Lab: Visualizing data with ggplot2

lab 0: a map

20

30

40

50

−120 −100 −80 −60long

lat

Dean

Emily

Felix

Gilbert

Hugo

Isabel

Ivan

Katrina

Mitch

28 / 36

Page 29: Coding Lab: Visualizing data with ggplot2

lab 1: a line plot

french_data <-wid_data %>%filter(type == "Net personal wealth",

country == "France") %>%mutate(perc_national_wealth = value * 100)

french_data %>%ggplot(aes(y = perc_national_wealth,

x = year,color = percentile)) +

geom_line()

29 / 36

Page 30: Coding Lab: Visualizing data with ggplot2

lab 1: a line plot

0

25

50

75

1900 1925 1950 1975 2000year

perc

_nat

iona

l_w

ealth

percentile

p0p50

p50p90

p90p100

p99p100

30 / 36

Page 31: Coding Lab: Visualizing data with ggplot2

lab 2: distributions

I geom_density() only requires an x asthetic and it calculatesthe distribution to plot.

I We can set the aesthetics manually, independent of data fornicer graphs.chi_sq_samples <-tibble(x = c(rchisq(100000, 2),

rchisq(100000, 3),rchisq(100000, 4)),

df = rep(c("2", "3", "4"), each = 1e5))

chi_sq_samples %>%ggplot(aes(x = x, fill = df)) +geom_density( alpha = .5) +labs(fill = "df", x = "sample")

31 / 36

Page 32: Coding Lab: Visualizing data with ggplot2

lab 2: distributions

0.0

0.1

0.2

0.3

0.4

0 10 20 30sample

dens

ity

df

2

3

4

32 / 36

Page 33: Coding Lab: Visualizing data with ggplot2

lab 4: grouped bar graphs

I position = "dodge2" tells R to put bars next to each other,rather than stacked on top of each other.

I Notice we use fill and not color because we’re “filling” anarea.mean_share_per_country %>%

ggplot(aes(y = country,x = mean_share,fill = percentile)) +

geom_col(position = "dodge2") +labs(x = "Mean share of national wealth",

y = "",fill = "Wealth\npercentile")

33 / 36

Page 34: Coding Lab: Visualizing data with ggplot2

lab 4: grouped bar graphs

ChinaFrance

IndiaKorea

RussiaS Africa

UKUSA

0.00 0.25 0.50 0.75Mean share of national wealth

Wealthpercentile

p90p100

p99p100

34 / 36

Page 35: Coding Lab: Visualizing data with ggplot2

lab 4: faceted bar graph

I Notice that we manipulate our data to the right specificationbefore making this graph

I Using facet_wrap we get a distinct graph for each timeperiod.

mean_share_per_country_with_time %>%ggplot(aes(x = country,

y = mean_share,fill = percentile)) +

geom_col(position = "dodge2") +facet_wrap(vars(time_period))

35 / 36

Page 36: Coding Lab: Visualizing data with ggplot2

lab 4: faceted bar graph

1980 to 1999 2000 to present

1959 and earlier 1960 to 1979

ChinaIndiaUSA ChinaIndiaUSA

0.00.20.40.60.8

0.00.20.40.60.8

country

mea

n_sh

are

percentile

p90p100

p99p100

36 / 36