R Lab2: Data Management
Introduction to Econometrics, Fall 2019
Zhaopeng Qu
Nanjing University
9/23/2020


1 Last Lecture

2 Data Management in R

3 “Fancy” Data Management using Tidyverse


Section 1

Last Lecture


Subsection 1

Where are packages installed


Adding a directory with .libPaths()

Normally, packages are installed in a library directory inside the R installation. To see the exact location, type

.libPaths()

## [1] "/Library/Frameworks/R.framework/Versions/4.0/Resources/library"

You can also add a directory that you created yourself as a "store" for packages, like this:

.libPaths(c(.libPaths(),"~/Dropbox/R/Rlib"))

.libPaths()

## [1] "/Library/Frameworks/R.framework/Versions/4.0/Resources/library"
## [2] "/Users/byelenin/Dropbox/R/Rlib"


But this is only a temporary setting. If you restart R and type .libPaths(), you will find that the new directory has disappeared.

.rs.restartR()   # restarts the R session (available in RStudio only)

.libPaths()


.Renviron files

To make the setting permanent, set it in the .Renviron file. Open it with

file.edit("~/.Renviron")

Add a line "R_LIBS_USER=your directory" to the file.

After you restart the R session, the added directory stays in the library paths permanently.
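The mechanism can be tried out without touching the real ~/.Renviron. The sketch below is only an illustration: the directory and the file are temporary stand-ins (hypothetical names), and readRenviron() is used to load the variable into the running session. The library search path itself is set at startup, so .libPaths() only picks up the change after a restart with a real .Renviron.

```r
# Hypothetical library directory, created in a temporary location for illustration.
lib_dir <- file.path(tempdir(), "Rlib")
dir.create(lib_dir, showWarnings = FALSE)

# A throwaway file standing in for ~/.Renviron, holding one NAME=value line.
env_file <- tempfile(fileext = ".Renviron")
writeLines(paste0("R_LIBS_USER=", lib_dir), env_file)

# readRenviron() reads the NAME=value lines and sets them as environment variables.
readRenviron(env_file)

# The variable is now visible to the current session.
Sys.getenv("R_LIBS_USER")
```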


Section 2

Data Management in R


Subsection 1

Introduction


Common Types of Variables in Statistics

Quantitative: income, test score, weight, height
Qualitative: gender, nationality, identity, profession, ...
Date: day, month, year, lunar calendar


Types of Qualitative Variables

Categorical (nominal) variables: gender, ethnicity, industry, ...
Ordered (ordinal) variables: education level, self-rated health, happiness


Data Types in R

Numeric data
  integer: stores only whole numbers, e.g. 2L
  double: floating-point numbers (i.e. with a decimal part)
Character data: stores strings of characters
  character: plain text strings
  factor: categorical values drawn from a fixed set of levels
Dates
  Date: the number of days since 1970-01-01
  POSIXct: the number of seconds since 1970-01-01 00:00:00 UTC
Logical: stores TRUE or FALSE
  TRUE == 1 and FALSE == 0
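A short base-R sketch of these types follows. The values are arbitrary illustrations, and typeof() and class() are used only to inspect how R stores each one.

```r
# Numeric data: integer versus double.
int_type <- typeof(2L)   # "integer"
dbl_type <- typeof(2)    # "double"

# Character data: plain strings versus factors.
chr_class <- class("R")                         # "character"
fct_class <- class(factor(c("a", "b", "a")))    # "factor"

# Dates are stored as numbers counted from 1970-01-01.
days_since_epoch <- as.numeric(as.Date("1970-01-02"))                       # 1
secs_since_epoch <- as.numeric(as.POSIXct("1970-01-01 00:01:00", tz = "UTC"))  # 60

# Logical values behave like 1 and 0 in arithmetic.
n_true <- sum(c(TRUE, FALSE, TRUE))   # 2
```

Because TRUE counts as 1, sum() applied to a logical vector gives the number of TRUE values, and mean() gives their share.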


Data Structures in R

Scalars
Vectors
Matrices
Data frames
Lists
Arrays


Vectors

All elements in a vector are of the same type.
  c(1,2,3,4)
  c("R","Stata","Python")

Factor Vectors

q <- c("soccer","badminton","soccer","tennis","soccer","tennis")

q

## [1] "soccer"    "badminton" "soccer"    "tennis"    "soccer"    "tennis"

Transform it into a factor vector:

q.factor <- as.factor(q)
q.factor

## [1] soccer    badminton soccer    tennis    soccer    tennis
## Levels: badminton soccer tennis
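Two consequences of the points above, sketched in base R. Mixing types in c() silently coerces everything to one common type. A factor stores integer codes that point into its sorted levels. The variable names `mixed` and `codes` are illustrative only.

```r
# Mixing a number, a string and a logical: everything becomes character.
mixed <- c(1, "R", TRUE)
class(mixed)   # "character"
mixed          # "1" "R" "TRUE"

# A factor keeps the distinct values as levels (sorted alphabetically by default)
# and stores each element as an integer code into those levels.
q <- c("soccer", "badminton", "soccer", "tennis", "soccer", "tennis")
q.factor <- as.factor(q)
levels(q.factor)                 # "badminton" "soccer" "tennis"
codes <- as.integer(q.factor)    # 2 1 2 3 2 3
```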


Data frames

A data.frame is just like an Excel spreadsheet in that it has columns and rows.
Each column is a variable and each row is an observation.
A data.frame can be seen as a collection of vectors.


Generate a data.frame:

x <- 6:1
y <- -4:1
df <- data.frame(x,y,q)
df

##   x  y         q
## 1 6 -4    soccer
## 2 5 -3 badminton
## 3 4 -2    soccer
## 4 3 -1    tennis
## 5 2  0    soccer
## 6 1  1    tennis


Naming variables:

df <- data.frame(Frist=x,Second=y,Sport=q)
df

##   Frist Second     Sport
## 1     6     -4    soccer
## 2     5     -3 badminton
## 3     4     -2    soccer
## 4     3     -1    tennis
## 5     2      0    soccer
## 6     1      1    tennis
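Since a data.frame is a collection of equal-length vectors, each column can be pulled out and used like any other vector. A minimal sketch, rebuilding the same df so that it runs on its own (`soccer_first` is an illustrative name):

```r
q <- c("soccer", "badminton", "soccer", "tennis", "soccer", "tennis")
df <- data.frame(Frist = 6:1, Second = -4:1, Sport = q)

# A column extracted with $ is an ordinary vector.
df$Second          # -4 -3 -2 -1  0  1
class(df$Frist)    # "integer"

# One row is one observation.
df[2, ]

# Rows can be filtered with a logical vector built from a column.
soccer_first <- df[df$Sport == "soccer", "Frist"]   # 6 4 2
```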


Lists

A list is a container that can hold numeric elements, character elements, a mix of the two, data.frames, or other lists.

list(1,2,3)

## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3

list(c(1,2,3))

## [[1]]
## [1] 1 2 3


Combination of a data.frame and a vector:

list(df,1:6)

## [[1]]
##   Frist Second     Sport
## 1     6     -4    soccer
## 2     5     -3 badminton
## 3     4     -2    soccer
## 4     3     -1    tennis
## 5     2      0    soccer
## 6     1      1    tennis
##
## [[2]]
## [1] 1 2 3 4 5 6
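Getting elements back out of a list can be sketched as follows. The list and its element names (`numbers`, `tool`, `flags`) are made up for illustration. Double brackets and $ return the element itself, while single brackets return a smaller list.

```r
# A list mixing a numeric vector, a string and a logical vector.
lst <- list(numbers = c(1, 2, 3), tool = "R", flags = c(TRUE, FALSE))

first_element <- lst[[1]]   # the numeric vector itself: 1 2 3
tool_name <- lst$tool       # "R"

sub_list <- lst[1]          # still a list, with one element
class(sub_list)             # "list"
class(first_element)        # "numeric"
length(lst)                 # 3
```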


Conversion functions

is.datatype() returns TRUE or FALSE
as.datatype() converts the argument to that type

Here datatype stands for a type name such as numeric, character, or factor.
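A few concrete instances of the is.*/as.* pattern, as a base-R sketch with arbitrary example values. The last lines show a common pitfall: converting a factor straight to numeric returns its internal codes, so it should go through as.character() first.

```r
x <- c("1", "2", "3")
is.numeric(x)      # FALSE
is.character(x)    # TRUE

# Convert the strings to numbers.
y <- as.numeric(x)
is.numeric(y)      # TRUE
sum(y)             # 6

as.character(10)       # "10"
as.logical(c(1, 0))    # TRUE FALSE

# Pitfall: a factor converts to its level codes, not to the labels.
f <- factor(c("10", "20"))
codes <- as.numeric(f)                  # 1 2
values <- as.numeric(as.character(f))   # 10 20
```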


Subsection 2

Preparation for Data Analysis


Working directory

Work in an appropriate directory.

getwd()
setwd("~/Dropbox/R/R_Class/Intro_Metrics/Labs/Lab2")

Clean out your workspace:
  Session -> Clear

rm(list=ls())

Restart R and open a new R script file.
If you use the Projects feature in RStudio, you no longer need to set the working directory by hand.


Projects function in RStudio


Subsection 3

Data Management in Practice


Install and load the gapminder data

install.packages("gapminder")
library(gapminder)


Look at the gapminder data frame

class(gapminder)

## [1] "tbl_df" "tbl" "data.frame"


Meet the gapminder data frame or “tibble”

str(gapminder)

## tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...


Information about the gapminder data

The columns are:

variable    meaning
country
continent
year
lifeExp     life expectancy at birth
pop         total population
gdpPercap   per-capita GDP


Basic R

head(gapminder)

## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.


tail(gapminder)

## # A tibble: 6 x 6
##   country  continent  year lifeExp      pop gdpPercap
##   <fct>    <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Zimbabwe Africa     1982    60.4  7636524      789.
## 2 Zimbabwe Africa     1987    62.4  9216418      706.
## 3 Zimbabwe Africa     1992    60.4 10704340      693.
## 4 Zimbabwe Africa     1997    46.8 11404948      792.
## 5 Zimbabwe Africa     2002    40.0 11926563      672.
## 6 Zimbabwe Africa     2007    43.5 12311143      470.


names(gapminder)

## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"

ncol(gapminder)

## [1] 6

length(gapminder)

## [1] 6

dim(gapminder)

## [1] 1704 6

nrow(gapminder)

## [1] 1704


Basic R: summary

A statistical overview can be obtained with summary().

summary(gapminder)

##         country        continent        year         lifeExp
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85
##  Australia  :  12                  Max.   :2007   Max.   :82.60
##  (Other)    :1632
##       pop              gdpPercap
##  Min.   :6.001e+04   Min.   :   241.2
##  1st Qu.:2.794e+06   1st Qu.:  1202.1
##  Median :7.024e+06   Median :  3531.8
##  Mean   :2.960e+07   Mean   :  7215.3
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5
##  Max.   :1.319e+09   Max.   :113523.1


Basic R: Plot

plot(lifeExp ~ year, gapminder)

[Figure: scatter plot of lifeExp (vertical axis) against year (horizontal axis)]


plot(lifeExp ~ gdpPercap, gapminder)

[Figure: scatter plot of lifeExp (vertical axis) against gdpPercap (horizontal axis)]


plot(lifeExp ~ log(gdpPercap), gapminder)

[Figure: scatter plot of lifeExp (vertical axis) against log(gdpPercap) (horizontal axis)]


Basic R: Look at a variable inside a data frame

head(gapminder$lifeExp)

## [1] 28.801 30.332 31.997 34.020 36.088 38.438

summary(gapminder$lifeExp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   23.60   48.20   60.71   59.47   70.85   82.60


hist(gapminder$lifeExp)

[Figure: Histogram of gapminder$lifeExp (horizontal axis: gapminder$lifeExp, vertical axis: Frequency)]


table(gapminder$continent)

##
##   Africa Americas     Asia   Europe  Oceania
##      624      300      396      360       24
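Counts from table() are often easier to read as shares. The sketch below does not need the gapminder package: it rebuilds a continent vector from the counts printed above (an assumption made only to keep the example self-contained) and passes the table to prop.table().

```r
# Rebuild a continent column from the counts shown in the table() output.
continent <- factor(rep(
  c("Africa", "Americas", "Asia", "Europe", "Oceania"),
  times = c(624, 300, 396, 360, 24)
))

counts <- table(continent)     # number of observations per continent
shares <- prop.table(counts)   # the same table as proportions that sum to 1

round(shares, 3)
```

With the real data, the equivalent call would be prop.table(table(gapminder$continent)).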


barplot(table(gapminder$continent))

[Figure: bar plot of the number of observations per continent (Africa, Americas, Asia, Europe, Oceania)]


Section 3

“Fancy” Data Management using Tidyverse


Subsection 1

Introduction to tidydata


Introduction

The tidyverse is a set of packages for data manipulation, exploration and visualization that share a common philosophy: "tidy" data.
It was developed by a leading figure in the R community: Hadley Wickham.


The workflow of data analysis is as follows


Install and load tidyverse

We can install this set of R packages with install.packages("tidyverse") and then load it with

library(tidyverse)


When we load the tidyverse with library(tidyverse), its core packages are attached, including:

readr, for data import (haven, for SPSS, Stata and SAS files, is installed with the tidyverse but has to be loaded separately with library(haven)).
tidyr, for data tidying.
dplyr, for data wrangling.
ggplot2, for data visualization.
purrr, for functional programming.
tibble, for tibbles, a modern re-imagining of data frames.


Subsection 2

Import Data


Import data by readr and haven

readr:
  read_csv(): comma-separated (CSV) files
  read_tsv(): tab-separated files
  read_delim(): files with any delimiter
  read_table(): tabular files where columns are separated by white-space
  read_log(): web log files
haven:
  read_sas(): reads .sas7bdat + .sas7bcat files
  read_spss(): reads .sav files from SPSS
  read_stata(): reads .dta files (up to version 15)
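As a minimal sketch of read_csv(), assuming the readr package is installed: the example writes a tiny CSV file to a temporary path (the file and its three rows are made up for illustration) and reads it back with the column types stated explicitly.

```r
library(readr)

# Write a small CSV file to a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("country,year,cases",
    "Afghanistan,1999,745",
    "Brazil,1999,37737"),
  csv_path
)

# Read it back as a tibble; col_types pins down the type of each column
# instead of letting readr guess.
dat <- read_csv(
  csv_path,
  col_types = cols(
    country = col_character(),
    year = col_integer(),
    cases = col_integer()
  )
)
dat
```

The other read_*() functions in readr follow the same pattern: a file path first, then optional arguments such as col_types.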


Import data by readxl


Subsection 3

“Tidy” Data


What does tidy data mean?

Structure
  A clear form that is easy to work with in data analysis
Observation quality
  Missing values
  Duplicates
  Outliers
Variables
  Selecting and generating variables


Which data is tidy?

table1

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

table2

## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583


table3

## # A tibble: 6 x 3
##   country      year rate
## * <chr>       <int> <chr>
## 1 Afghanistan  1999 745/19987071
## 2 Afghanistan  2000 2666/20595360
## 3 Brazil       1999 37737/172006362
## 4 Brazil       2000 80488/174504898
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583


table4a

## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

table4b

## # A tibble: 3 x 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583


A tidy data structure

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

Reconstruct data into tidy data


tidyr package

tidyr is a package built for the sole purpose of simplifying the process of creating tidy data.
It provides four fundamental functions for data tidying:

gather() makes "wide" data longer
spread() makes "long" data wider
separate() splits a character column into multiple columns
unite() combines multiple columns into a single column
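The last two functions are inverses of each other, which a small round trip can illustrate. This is a sketch assuming tidyr is installed. The data frame `rates` is a hand-typed copy of three table3 rows, and the names `split_up` and `rejoined` are illustrative.

```r
library(tidyr)

# Three rows in the style of table3: two values packed into one "rate" column.
rates <- data.frame(
  country = c("Afghanistan", "Brazil", "China"),
  year = c(1999L, 1999L, 1999L),
  rate = c("745/19987071", "37737/172006362", "212258/1272915272"),
  stringsAsFactors = FALSE
)

# separate(): split "rate" at the slash into two columns;
# convert = TRUE turns the resulting strings into integers.
split_up <- separate(rates, rate, into = c("cases", "population"),
                     sep = "/", convert = TRUE)

# unite(): paste the two columns back together into a single "rate" column.
rejoined <- unite(split_up, "rate", cases, population, sep = "/")
```

After the round trip, rejoined holds the same rate strings as the starting data, while split_up has one variable per column as tidy data requires.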


tidy data using spread()

table1

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

table2

## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583


library(tidyr)
spread(table2,type,count)

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
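In tidyr 1.0.0 and later, spread() still works but is superseded by pivot_wider(). The sketch below shows the newer call on a hand-typed four-row excerpt of table2 and assumes tidyr 1.0.0 or newer is installed. The names `long` and `wide` are illustrative.

```r
library(tidyr)

# Four rows in the style of table2: one row per country, year and type.
long <- data.frame(
  country = rep(c("Afghanistan", "Brazil"), each = 2),
  year = 1999L,
  type = rep(c("cases", "population"), times = 2),
  count = c(745L, 19987071L, 37737L, 172006362L),
  stringsAsFactors = FALSE
)

# pivot_wider(): the values of "type" become column names,
# and the values of "count" fill those columns.
wide <- pivot_wider(long, names_from = type, values_from = count)
wide
```

On the full table2, pivot_wider(table2, names_from = type, values_from = count) plays the same role as spread(table2, type, count).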


tidy data using gather()

table1

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

table4a

## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

gather(table4a, "year", "cases", 2:3)

## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766
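Likewise, gather() is superseded by pivot_longer() in tidyr 1.0.0 and later. The sketch below, which assumes a recent tidyr, shows one way to express the same reshaping of table4a. The object name long is hypothetical.

```r
library(tidyr)

# table4a ships with tidyr: its `1999` and `2000` columns are values of a
# variable (year), not variables themselves
long <- pivot_longer(
  table4a,
  cols = c(`1999`, `2000`),  # the columns to stack
  names_to = "year",         # old column names go here
  values_to = "cases"        # old cell values go here
)
long
```

The result holds the same six observations as the gather() call, but pivot_longer() orders the rows by country rather than by year.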


tidy data using separate()

table1

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

table3

## # A tibble: 6 x 3
##   country      year rate
## * <chr>       <int> <chr>
## 1 Afghanistan  1999 745/19987071
## 2 Afghanistan  2000 2666/20595360
## 3 Brazil       1999 37737/172006362
## 4 Brazil       2000 80488/174504898
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583


tidy data using separate()

separate(table3, rate, into = c("cases", "population"), sep = "/")

## # A tibble: 6 x 4
##   country      year cases  population
##   <chr>       <int> <chr>  <chr>
## 1 Afghanistan  1999 745    19987071
## 2 Afghanistan  2000 2666   20595360
## 3 Brazil       1999 37737  172006362
## 4 Brazil       2000 80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
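Note that separate() leaves the new cases and population columns as character (<chr>). If numeric columns are wanted, one option is the convert argument of separate(). The following is a small sketch; the object name table3_num is only illustrative.

```r
library(tidyr)

# convert = TRUE runs type conversion on the new columns,
# so "745" becomes the integer 745 instead of a character string
table3_num <- separate(
  table3, rate,
  into = c("cases", "population"),
  sep = "/",
  convert = TRUE
)
table3_num
```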


tidy data using unite()

table5 <- separate(table3,rate,into = c("cases","population"),sep="/")

unite(table5,"rate",cases,population,sep="/")

## # A tibble: 6 x 3
##   country      year rate
##   <chr>       <int> <chr>
## 1 Afghanistan  1999 745/19987071
## 2 Afghanistan  2000 2666/20595360
## 3 Brazil       1999 37737/172006362
## 4 Brazil       2000 80488/174504898
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583


“Fancy” Data Management using Tidyverse Data Cleaning

Subsection 4

Data Cleaning


Review: tidy data

Data Structure Transforming
Data Cleaning:
    Dealing with missing values and duplicates
Data Relating:
    Merge or join multiple data
Data Wrangling:
    Selecting and Creating Variables
    Selecting observations


“Fancy” Data Management using Tidyverse Data Wrangling:

Subsection 5

Data Wrangling:


Data Wrangling: using dplyr

Data wrangling is usually the most time-consuming and dirtiest part of the work; the dplyr package is designed to handle it.

The important dplyr verbs to remember are:

dplyr verbs    Description
select()       select columns
filter()       filter rows
arrange()      re-order or arrange rows
mutate()       create new columns
rename()       rename columns
summarize()    summarize values
group_by()     allows for group operations in the “split-apply-combine” concept


Pipe operator: %>%

This operator allows you to pipe the output from one function to the input of another function.
Instead of nesting functions (reading from the inside to the outside), the idea of piping is to read the functions from left to right.
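To make the difference concrete, here is a minimal sketch that computes the same quantity twice, once with nested calls and once with the pipe. The vector x and the object names nested and piped are made up for illustration.

```r
library(magrittr)  # provides %>%; dplyr also re-exports it

x <- c(1, 4, 9)

# Nested calls: read from the inside out (sqrt, then mean, then round)
nested <- round(mean(sqrt(x)), 1)

# Piped calls: read from left to right, in the order the steps happen
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```

The left-hand side of %>% becomes the first argument of the function on the right, so x %>% round(1) is the same call as round(x, 1).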


Selecting columns using select()

To select all the columns except a specific column, use the “-” (subtraction) operator, i.e. drop a variable.

gapminder %>%
  select(-lifeExp)

## # A tibble: 1,704 x 5
##    country     continent  year      pop gdpPercap
##    <fct>       <fct>     <int>    <int>     <dbl>
##  1 Afghanistan Asia       1952  8425333      779.
##  2 Afghanistan Asia       1957  9240934      821.
##  3 Afghanistan Asia       1962 10267083      853.
##  4 Afghanistan Asia       1967 11537966      836.
##  5 Afghanistan Asia       1972 13079460      740.
##  6 Afghanistan Asia       1977 14880372      786.
##  7 Afghanistan Asia       1982 12881816      978.
##  8 Afghanistan Asia       1987 13867957      852.
##  9 Afghanistan Asia       1992 16317921      649.
## 10 Afghanistan Asia       1997 22227415      635.
## # ... with 1,694 more rows


Selecting columns using select()

To select a range of columns by name, use the “:” (colon) operator.

gapminder %>%
  select(lifeExp:gdpPercap)

## # A tibble: 1,704 x 3
##    lifeExp      pop gdpPercap
##      <dbl>    <int>     <dbl>
##  1    28.8  8425333      779.
##  2    30.3  9240934      821.
##  3    32.0 10267083      853.
##  4    34.0 11537966      836.
##  5    36.1 13079460      740.
##  6    38.4 14880372      786.
##  7    39.9 12881816      978.
##  8    40.8 13867957      852.
##  9    41.7 16317921      649.
## 10    41.8 22227415      635.
## # ... with 1,694 more rows


Additional options to select()

selection function    meaning
starts_with("c")      Select columns that start with the character string “c”
ends_with()           Select columns that end with a character string
contains()            Select columns that contain a character string
matches()             Select columns that match a regular expression
one_of()              Select columns whose names are in a given group of names
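The helpers in this table can be tried on the gapminder data. The sketch below assumes the gapminder and dplyr packages are installed; the object names (by_prefix and so on) and the particular patterns are chosen only for illustration.

```r
library(dplyr)
library(gapminder)

# Column names start with "c": country, continent
by_prefix <- gapminder %>% select(starts_with("c"))

# Column names end with "p": lifeExp, pop, gdpPercap
by_suffix <- gapminder %>% select(ends_with("p"))

# Column names contain "Per": gdpPercap
by_substring <- gapminder %>% select(contains("Per"))

# Column names match a regular expression: exactly "year" or "pop"
by_regex <- gapminder %>% select(matches("^(year|pop)$"))

# Column names taken from a character vector of names
wanted <- c("year", "pop")
by_list <- gapminder %>% select(one_of(wanted))
```

By default these helpers ignore case when matching. In recent dplyr versions, any_of() and all_of() are the recommended replacements for one_of().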


Selecting rows using filter()

Suppose we only want the data from 2007.

gapminder %>%
  filter(year == 2007)

## # A tibble: 142 x 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan Asia       2007    43.8  31889923      975.
##  2 Albania     Europe     2007    76.4   3600523     5937.
##  3 Algeria     Africa     2007    72.3  33333216     6223.
##  4 Angola      Africa     2007    42.7  12420476     4797.
##  5 Argentina   Americas   2007    75.3  40301927    12779.
##  6 Australia   Oceania    2007    81.2  20434176    34435.
##  7 Austria     Europe     2007    79.8   8199783    36126.
##  8 Bahrain     Asia       2007    75.6    708573    29796.
##  9 Bangladesh  Asia       2007    64.1 150448339     1391.
## 10 Belgium     Europe     2007    79.4  10392226    33693.
## # ... with 132 more rows


Selecting rows using filter()

Suppose we only want the data from China.

gapminder %>%
  filter(country == "China") ## strings can be used too; wrap them in quotes

## # A tibble: 12 x 6
##    country continent  year lifeExp        pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>      <int>     <dbl>
##  1 China   Asia       1952    44    556263527      400.
##  2 China   Asia       1957    50.5  637408000      576.
##  3 China   Asia       1962    44.5  665770000      488.
##  4 China   Asia       1967    58.4  754550000      613.
##  5 China   Asia       1972    63.1  862030000      677.
##  6 China   Asia       1977    64.0  943455000      741.
##  7 China   Asia       1982    65.5 1000281000      962.
##  8 China   Asia       1987    67.3 1084035000     1379.
##  9 China   Asia       1992    68.7 1164970000     1656.
## 10 China   Asia       1997    70.4 1230075000     2289.
## 11 China   Asia       2002    72.0 1280400000     3119.
## 12 China   Asia       2007    73.0 1318683096     4959.


Selecting rows using filter()

Use the comparison operators (e.g. >, <, >=, <=, !=, %in%) to create logical tests.
For example, if we want only the years after 1977 for China:

gapminder %>%
  filter(country == "China", year > 1977)

## # A tibble: 6 x 6
##   country continent  year lifeExp        pop gdpPercap
##   <fct>   <fct>     <int>   <dbl>      <int>     <dbl>
## 1 China   Asia       1982    65.5 1000281000      962.
## 2 China   Asia       1987    67.3 1084035000     1379.
## 3 China   Asia       1992    68.7 1164970000     1656.
## 4 China   Asia       1997    70.4 1230075000     2289.
## 5 China   Asia       2002    72.0 1280400000     3119.
## 6 China   Asia       2007    73.0 1318683096     4959.
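The operators %in% and != from the list above can be used in the same way. A short sketch follows, assuming gapminder and dplyr are loaded; the country list and the object names are chosen only for illustration.

```r
library(dplyr)
library(gapminder)

# %in% keeps rows whose value matches any element of the vector;
# conditions separated by commas are combined with AND
two_countries <- gapminder %>%
  filter(country %in% c("China", "India"), year >= 2002)

# != keeps the rows that do NOT match
not_asia_2007 <- gapminder %>%
  filter(continent != "Asia", year == 2007)
```

Since gapminder records one row per country every five years, the first result holds the 2002 and 2007 rows for each of the two countries.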


Use mutate() to add new variables

mutate() is a function that defines and inserts new variables into a tibble. You can refer to existing variables by name.

gapminder %>%
  mutate(gdp = pop * gdpPercap)

## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap          gdp
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
##  2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
##  3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
##  4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
##  5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
##  6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.
##  7 Afghanistan Asia       1982    39.9 12881816      978. 12598563401.
##  8 Afghanistan Asia       1987    40.8 13867957      852. 11820990309.
##  9 Afghanistan Asia       1992    41.7 16317921      649. 10595901589.
## 10 Afghanistan Asia       1997    41.8 22227415      635. 14121995875.
## # ... with 1,694 more rows


Use mutate() to add new variables

gapminder %>%
  filter(year==2007) %>%          ## only data in 2007
  select(-continent) %>%          ## drop continent variable
  mutate(gdp = pop * gdpPercap)   ## generate a new variable "gdp"

## # A tibble: 142 x 6
##    country      year lifeExp       pop gdpPercap           gdp
##    <fct>       <int>   <dbl>     <int>     <dbl>         <dbl>
##  1 Afghanistan  2007    43.8  31889923      975.  31079291949.
##  2 Albania      2007    76.4   3600523     5937.  21376411360.
##  3 Algeria      2007    72.3  33333216     6223. 207444851958.
##  4 Angola       2007    42.7  12420476     4797.  59583895818.
##  5 Argentina    2007    75.3  40301927    12779. 515033625357.
##  6 Australia    2007    81.2  20434176    34435. 703658358894.
##  7 Austria      2007    79.8   8199783    36126. 296229400691.
##  8 Bahrain      2007    75.6    708573    29796.  21112675360.
##  9 Bangladesh   2007    64.1 150448339     1391. 209311822134.
## 10 Belgium      2007    79.4  10392226    33693. 350141166520.
## # ... with 132 more rows


Use rename() to change the name of variables

gapminder %>%
  rename(life_exp = lifeExp,
         gdp_percap = gdpPercap)

## # A tibble: 1,704 x 6
##    country     continent  year life_exp      pop gdp_percap
##    <fct>       <fct>     <int>    <dbl>    <int>      <dbl>
##  1 Afghanistan Asia       1952     28.8  8425333       779.
##  2 Afghanistan Asia       1957     30.3  9240934       821.
##  3 Afghanistan Asia       1962     32.0 10267083       853.
##  4 Afghanistan Asia       1967     34.0 11537966       836.
##  5 Afghanistan Asia       1972     36.1 13079460       740.
##  6 Afghanistan Asia       1977     38.4 14880372       786.
##  7 Afghanistan Asia       1982     39.9 12881816       978.
##  8 Afghanistan Asia       1987     40.8 13867957       852.
##  9 Afghanistan Asia       1992     41.7 16317921       649.
## 10 Afghanistan Asia       1997     41.8 22227415       635.
## # ... with 1,694 more rows


Sort or re-order rows using arrange()

gapminder %>%
  filter(year==2007,continent=="Asia") %>%
  select(-c(continent,pop,gdpPercap)) %>%
  arrange(lifeExp)

## # A tibble: 33 x 3
##    country      year lifeExp
##    <fct>       <int>   <dbl>
##  1 Afghanistan  2007    43.8
##  2 Iraq         2007    59.5
##  3 Cambodia     2007    59.7
##  4 Myanmar      2007    62.1
##  5 Yemen, Rep.  2007    62.7
##  6 Nepal        2007    63.8
##  7 Bangladesh   2007    64.1
##  8 India        2007    64.7
##  9 Pakistan     2007    65.5
## 10 Mongolia     2007    66.8
## # ... with 23 more rows


Sort or re-order rows using arrange()

gapminder %>%
  filter(year==2007,continent=="Asia") %>%
  select(-c(continent,gdpPercap)) %>%
  arrange(desc(lifeExp)) ## descending order

## # A tibble: 33 x 4
##    country           year lifeExp       pop
##    <fct>            <int>   <dbl>     <int>
##  1 Japan             2007    82.6 127467972
##  2 Hong Kong, China  2007    82.2   6980412
##  3 Israel            2007    80.7   6426679
##  4 Singapore         2007    80.0   4553009
##  5 Korea, Rep.       2007    78.6  49044790
##  6 Taiwan            2007    78.4  23174294
##  7 Kuwait            2007    77.6   2505559
##  8 Oman              2007    75.6   3204897
##  9 Bahrain           2007    75.6    708573
## 10 Vietnam           2007    74.2  85262356
## # ... with 23 more rows


Summary Statistics by summarise() and group_by()

gapminder %>%
  filter(year==2007) %>%
  group_by(continent,year) %>%
  summarize(gdpPercap_con = mean(gdpPercap)) %>%
  select(c(continent,gdpPercap_con,year)) %>%
  arrange(desc(gdpPercap_con)) ## descending order

## `summarise()` regrouping output by 'continent' (override with `.groups` argument)

## # A tibble: 5 x 3
## # Groups:   continent [5]
##   continent gdpPercap_con  year
##   <fct>             <dbl> <int>
## 1 Oceania          29810.  2007
## 2 Europe           25054.  2007
## 3 Asia             12473.  2007
## 4 Americas         11003.  2007
## 5 Africa            3089.  2007
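The message above comes from summarise() and points to its .groups argument. As a sketch, assuming dplyr 1.0.0 or later (the version that prints this message), passing .groups = "drop" returns an ungrouped tibble and silences the message. The object name by_continent is hypothetical.

```r
library(dplyr)
library(gapminder)

by_continent <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent, year) %>%
  # .groups = "drop" removes all grouping from the result
  summarize(gdpPercap_con = mean(gdpPercap), .groups = "drop") %>%
  arrange(desc(gdpPercap_con))

by_continent
```

Without the argument, summarise() drops only the last grouping variable (here year), which is why the output above is still grouped by continent.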


Summary Statistics by summarise() and group_by()

gapminder %>%
  filter(continent=="Asia") %>%
  group_by(country) %>%
  summarize(gdpPercap_coun = mean(gdpPercap)) %>%
  arrange(desc(gdpPercap_coun)) ## descending order

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 33 x 2
##    country          gdpPercap_coun
##    <fct>                     <dbl>
##  1 Kuwait                   65333.
##  2 Saudi Arabia             20262.
##  3 Bahrain                  18078.
##  4 Japan                    17751.
##  5 Singapore                17425.
##  6 Hong Kong, China         16229.
##  7 Israel                   14161.
##  8 Oman                     12139.
##  9 Taiwan                   10225.
## 10 Korea, Rep.               8217.
## # ... with 23 more rows


Summary Statistics by count()

gapminder %>%
  filter(year == 2002) %>%
  count(continent, sort = TRUE)

## # A tibble: 5 x 2
##   continent     n
##   <fct>     <int>
## 1 Africa       52
## 2 Asia         33
## 3 Europe       30
## 4 Americas     25
## 5 Oceania       2


Let’s do an exercise

Question: Which country in which continent experienced the sharpest 5-year drop in life expectancy?


gapminder %>%
  select(country, year, continent, lifeExp) %>%
  group_by(continent, country) %>%
  ## within country, take (lifeExp in year i) - (lifeExp in year i - 1)
  ## positive means lifeExp went up, negative means it went down
  mutate(le_delta = lifeExp - lag(lifeExp)) %>%
  ## within country, retain the worst lifeExp change = smallest or most negative
  summarize(worst_le_delta = min(le_delta, na.rm = TRUE)) %>%
  ## within continent, retain the row with the lowest worst_le_delta
  top_n(-1, wt = worst_le_delta) %>%
  arrange(worst_le_delta)


Answer

## `summarise()` regrouping output by 'continent' (override with `.groups` argument)

## # A tibble: 5 x 3
## # Groups:   continent [5]
##   continent country     worst_le_delta
##   <fct>     <fct>                <dbl>
## 1 Africa    Rwanda             -20.4
## 2 Asia      Cambodia            -9.10
## 3 Americas  El Salvador         -1.51
## 4 Europe    Montenegro          -1.46
## 5 Oceania   Australia            0.170
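In dplyr 1.0.0 and later, top_n() is superseded by slice_min() and slice_max(). The following is a sketch of the same exercise written with slice_min(), assuming a recent dplyr. It is an alternative formulation rather than the code used in the lecture, and the object name worst_drop is hypothetical.

```r
library(dplyr)
library(gapminder)

worst_drop <- gapminder %>%
  group_by(continent, country) %>%
  # change in lifeExp relative to the previous observation of the same country
  mutate(le_delta = lifeExp - lag(lifeExp)) %>%
  # one row per country; keep the result grouped by continent
  summarize(worst_le_delta = min(le_delta, na.rm = TRUE),
            .groups = "drop_last") %>%
  # within each continent, keep the smallest (most negative) change
  slice_min(worst_le_delta, n = 1) %>%
  arrange(worst_le_delta)

worst_drop
```

Stating .groups = "drop_last" makes explicit that the continent grouping survives summarize(), which is what lets slice_min() pick one country per continent.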


“Fancy” Data Management using Tidyverse Relational data

Subsection 6

Relational data


Introduction

Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.
E.g. a household survey often consists of several sub-datasets: some are reported at the household level, such as consumption and assets; others are reported at the individual level, such as employment information.


Types of join

Mutating joins first match observations by their keys, then copy across variables from one table to the other.
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables.
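The distinction can be seen on two toy tables. The sketch below uses made-up data (the tables people and wages and all object names are hypothetical, not from the lecture): left_join() is a mutating join, while semi_join() and anti_join() are filtering joins.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables for illustration only
people <- tibble(id = c(1, 2, 3), name = c("Ann", "Bo", "Cy"))
wages  <- tibble(id = c(1, 2, 4), wage = c(10, 12, 9))

# Mutating join: keeps every row of `people` and adds the `wage` column
# (NA where no match exists, here for id 3)
mutating <- left_join(people, wages, by = "id")

# Filtering joins: columns of `people` are unchanged, only rows are affected
matched   <- semi_join(people, wages, by = "id")  # rows with a match: id 1, 2
unmatched <- anti_join(people, wages, by = "id")  # rows without a match: id 3
```

anti_join() is a convenient way to find observations in one table that have no partner in the other before merging.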


nycflights13 data

A dataset on flights departing New York City in 2013.

install.packages("nycflights13")  ## one-time install; the package name must be quoted

library(nycflights13)
airlines

## # A tibble: 16 x 2
##    carrier name
##    <chr>   <chr>
##  1 9E      Endeavor Air Inc.
##  2 AA      American Airlines Inc.
##  3 AS      Alaska Airlines Inc.
##  4 B6      JetBlue Airways
##  5 DL      Delta Air Lines Inc.
##  6 EV      ExpressJet Airlines Inc.
##  7 F9      Frontier Airlines Inc.
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.
## 10 MQ      Envoy Air
## 11 OO      SkyWest Airlines Inc.
## 12 UA      United Air Lines Inc.
## 13 US      US Airways Inc.
## 14 VX      Virgin America
## 15 WN      Southwest Airlines Co.
## 16 YV      Mesa Airlines Inc.


nycflights13:airports

airports

## # A tibble: 1,458 x 8
##    faa   name                       lat    lon   alt    tz dst   tzone
##    <chr> <chr>                    <dbl>  <dbl> <dbl> <dbl> <chr> <chr>
##  1 04G   Lansdowne Airport         41.1  -80.6  1044    -5 A     America/New_Yo~
##  2 06A   Moton Field Municipal A~  32.5  -85.7   264    -6 A     America/Chicago
##  3 06C   Schaumburg Regional       42.0  -88.1   801    -6 A     America/Chicago
##  4 06N   Randall Airport           41.4  -74.4   523    -5 A     America/New_Yo~
##  5 09J   Jekyll Island Airport     31.1  -81.4    11    -5 A     America/New_Yo~
##  6 0A9   Elizabethton Municipal ~  36.4  -82.2  1593    -5 A     America/New_Yo~
##  7 0G6   Williams County Airport   41.5  -84.5   730    -5 A     America/New_Yo~
##  8 0G7   Finger Lakes Regional A~  42.9  -76.8   492    -5 A     America/New_Yo~
##  9 0P2   Shoestring Aviation Air~  39.8  -76.6  1000    -5 U     America/New_Yo~
## 10 0S9   Jefferson County Intl     48.1 -123.    108    -8 A     America/Los_An~
## # ... with 1,448 more rows


nycflights13:planes

planes

## # A tibble: 3,322 x 9
##    tailnum  year type          manufacturer   model  engines seats speed engine
##    <chr>   <int> <chr>         <chr>          <chr>    <int> <int> <int> <chr>
##  1 N10156   2004 Fixed wing m~ EMBRAER        EMB-1~       2    55    NA Turbo-~
##  2 N102UW   1998 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
##  3 N103US   1999 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
##  4 N104UW   1999 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
##  5 N10575   2002 Fixed wing m~ EMBRAER        EMB-1~       2    55    NA Turbo-~
##  6 N105UW   1999 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
##  7 N107US   1999 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
##  8 N108UW   1999 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
##  9 N109UW   1999 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
## 10 N110UW   1999 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
## # ... with 3,312 more rows


nycflights13:weather

weather

## # A tibble: 26,115 x 15
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8
## # ... with 26,105 more rows, and 5 more variables: wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>


Relations in nycflights13 data


Key variables

The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation.

Primary Key: uniquely identifies an observation in its own table
Foreign Key: uniquely identifies an observation in another table

head(planes$tailnum) # primary key in `planes` table

## [1] "N10156" "N102UW" "N103US" "N104UW" "N10575" "N105UW"

head(flights$tailnum) # foreign key in `flights` table

## [1] "N14228" "N24211" "N619AA" "N804JB" "N668DN" "N39463"


Key variables

Verify that key variables do indeed uniquely identify each observation.
count() the primary keys and look for entries where n is greater than one:

planes %>%
  count(tailnum) %>%
  filter(n>1)

## # A tibble: 0 x 2
## # ... with 2 variables: tailnum <chr>, n <int>


Key variables

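Sometimes a candidate key turns out not to be unique. In weather, the combination of year, month, day, hour and origin appears twice in three cases:

weather %>%
  count(year,month,day,hour,origin) %>%
  filter(n>1)

## # A tibble: 3 x 6
##    year month   day  hour origin     n
##   <int> <int> <int> <int> <chr>  <int>
## 1  2013    11     3     1 EWR        2
## 2  2013    11     3     1 JFK        2
## 3  2013    11     3     1 LGA        2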


Join(merge) tables

A primary key and the corresponding foreign key in another table form a relation.

1 : M(any)
1 : 1
M : M


Join(merge) tables

Mutating join: it first matches observations by their keys, then copies across variables from one table to the other.
Aim: add the full airline name to flights2.

flights2 <- flights %>%
  select(year:day, hour, tailnum, carrier, origin)


Join(merge) tables

Aim: add the full airline name to flights2.

flights2 %>%
  left_join(airlines, by = "carrier")

## # A tibble: 336,776 x 8
##     year month   day  hour tailnum carrier origin name
##    <int> <int> <int> <dbl> <chr>   <chr>   <chr>  <chr>
##  1  2013     1     1     5 N14228  UA      EWR    United Air Lines Inc.
##  2  2013     1     1     5 N24211  UA      LGA    United Air Lines Inc.
##  3  2013     1     1     5 N619AA  AA      JFK    American Airlines Inc.
##  4  2013     1     1     5 N804JB  B6      JFK    JetBlue Airways
##  5  2013     1     1     6 N668DN  DL      LGA    Delta Air Lines Inc.
##  6  2013     1     1     5 N39463  UA      EWR    United Air Lines Inc.
##  7  2013     1     1     6 N516JB  B6      EWR    JetBlue Airways
##  8  2013     1     1     6 N829AS  EV      LGA    ExpressJet Airlines Inc.
##  9  2013     1     1     6 N593JB  B6      JFK    JetBlue Airways
## 10  2013     1     1     6 N3ALAA  AA      LGA    American Airlines Inc.
## # ... with 336,766 more rows


Join(merge) tables

planes %>%
  select(tailnum,type) %>%
  right_join(flights,by = "tailnum")

## # A tibble: 336,776 x 20
##    tailnum type   year month   day dep_time sched_dep_time dep_delay arr_time
##    <chr>   <chr> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1 N10156  Fixe~  2013     1    10      626            630        -4      802
##  2 N10156  Fixe~  2013     1    10     1120           1032        48     1320
##  3 N10156  Fixe~  2013     1    10     1619           1540        39     1831
##  4 N10156  Fixe~  2013     1    11      632            634        -2      810
##  5 N10156  Fixe~  2013     1    11     1116           1120        -4     1328
##  6 N10156  Fixe~  2013     1    11     1845           1819        26     1959
##  7 N10156  Fixe~  2013     1    12      830            838        -8     1024
##  8 N10156  Fixe~  2013     1    12     1410           1359        11     1555
##  9 N10156  Fixe~  2013     1    13     1551           1536        15     1804
## 10 N10156  Fixe~  2013     1    13     2221           2039       102     2346
## # ... with 336,766 more rows, and 11 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>


Understanding joins

x <- tribble(
  ~key, ~val_x,
  1, "x1",
  2, "x2",
  3, "x3"
)
y <- tribble(
  ~key, ~val_y,
  1, "y1",
  2, "y2",
  4, "y3"
)


Inner joins

An inner join matches pairs of observations whenever their keys are equal.

## # A tibble: 2 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1
## 2     2 x2    y2
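The call that produces this output is not shown on the slide. A plausible reconstruction, assuming the x and y tables defined under "Understanding joins", is:

```r
library(dplyr)

# The two example tables from the "Understanding joins" slide.
x <- tribble(
  ~key, ~val_x,
  1, "x1",
  2, "x2",
  3, "x3"
)
y <- tribble(
  ~key, ~val_y,
  1, "y1",
  2, "y2",
  4, "y3"
)

# Keep only the keys present in both tables (1 and 2).
inner <- x %>%
  inner_join(y, by = "key")
```

Key 3 (only in x) and key 4 (only in y) are dropped, which is why unmatched rows can be lost silently with an inner join.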


Outer joins: left

A left join keeps all observations in x.


Outer joins: right

A right join keeps all observations in y.


Outer joins: full

A full join keeps all observations in x and y.
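The three outer joins can be compared side by side on the same toy tables. This is a minimal sketch, assuming the x and y tables from the "Understanding joins" slide. Unmatched rows are filled with NA.

```r
library(dplyr)

x <- tribble(
  ~key, ~val_x,
  1, "x1",
  2, "x2",
  3, "x3"
)
y <- tribble(
  ~key, ~val_y,
  1, "y1",
  2, "y2",
  4, "y3"
)

# Left join: every row of x survives; key 3 gets val_y = NA.
left <- left_join(x, y, by = "key")

# Right join: every row of y survives; key 4 gets val_x = NA.
right <- right_join(x, y, by = "key")

# Full join: every row of x and y survives (keys 1, 2, 3 and 4).
full <- full_join(x, y, by = "key")
```

In practice the left join is the usual default, because it preserves the original observations even when the lookup table has no match.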


Filtering joins

semi_join(x,y) keeps all observations in x that have a match in y.
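As a small illustration, assuming the x and y tables from the "Understanding joins" slide, a semi-join filters x without adding any column from y:

```r
library(dplyr)

x <- tribble(
  ~key, ~val_x,
  1, "x1",
  2, "x2",
  3, "x3"
)
y <- tribble(
  ~key, ~val_y,
  1, "y1",
  2, "y2",
  4, "y3"
)

# Rows of x whose key also appears in y; val_y is not copied over.
semi <- x %>%
  semi_join(y, by = "key")
```

The result has the same columns as x (key and val_x) and only the rows with keys 1 and 2.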


Only the existence of a match is important; it does not matter which observation is matched. This means that filtering joins never duplicate rows the way mutating joins can.
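The difference shows up as soon as a key is repeated in the second table. The sketch below uses made-up tables (x_small and y_dup are hypothetical names) in which key 1 appears twice in y_dup.

```r
library(dplyr)

x_small <- tribble(
  ~key, ~val_x,
  1, "x1",
  2, "x2"
)
# Key 1 appears twice in this table.
y_dup <- tribble(
  ~key, ~val_y,
  1, "y1",
  1, "y2",
  2, "y3"
)

# Mutating join: the x row with key 1 is repeated once per match.
n_inner <- nrow(inner_join(x_small, y_dup, by = "key"))

# Filtering join: each x row appears at most once, however many matches it has.
n_semi <- nrow(semi_join(x_small, y_dup, by = "key"))
```

Here n_inner is 3 while n_semi is 2, so the semi-join leaves the row count of x_small unchanged.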


anti_join(x,y) drops all observations in x that have a match in y.
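On the same toy tables (again assuming x and y from the "Understanding joins" slide), an anti-join returns exactly the rows that a semi-join would discard:

```r
library(dplyr)

x <- tribble(
  ~key, ~val_x,
  1, "x1",
  2, "x2",
  3, "x3"
)
y <- tribble(
  ~key, ~val_y,
  1, "y1",
  2, "y2",
  4, "y3"
)

# Rows of x with no matching key in y: only key 3 remains.
anti <- x %>%
  anti_join(y, by = "key")
```

This makes anti_join() a convenient diagnostic for finding mismatches before running a mutating join.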


Top 10 most popular destinations

## # A tibble: 10 x 2
##    dest      n
##    <chr> <int>
##  1 ORD   17283
##  2 ATL   17215
##  3 LAX   16174
##  4 BOS   15508
##  5 MCO   14082
##  6 CLT   14064
##  7 SFO   13331
##  8 FLL   12055
##  9 MIA   11728
## 10 DCA    9705
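The code behind this table is not shown on the slide. One way to obtain such a ranking is to count rows per destination and keep the first few. The helper top_destinations() and the toy data below are hypothetical, used so that the sketch runs without nycflights13.

```r
library(dplyr)

# Hypothetical helper: the n most frequent values of dest
# in any table that has a dest column.
top_destinations <- function(flights, n = 10) {
  flights %>%
    count(dest, sort = TRUE) %>%  # one row per dest, largest n first
    head(n)
}

# Toy data: ORD appears 3 times, ATL twice, LAX once.
flights_toy <- tibble(dest = c("ORD", "ATL", "ORD", "LAX", "ORD", "ATL"))

top2 <- top_destinations(flights_toy, n = 2)
```

With nycflights13 loaded, calling top_destinations(flights) should give a ten-row table of the same shape as the output above.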


Find each flight that went to one of those destinations.

## # A tibble: 141,145 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      542            540         2      923            850
##  2  2013     1     1      554            600        -6      812            837
##  3  2013     1     1      554            558        -4      740            728
##  4  2013     1     1      555            600        -5      913            854
##  5  2013     1     1      557            600        -3      838            846
##  6  2013     1     1      558            600        -2      753            745
##  7  2013     1     1      558            600        -2      924            917
##  8  2013     1     1      558            600        -2      923            937
##  9  2013     1     1      559            559         0      702            706
## 10  2013     1     1      600            600         0      851            858
## # ... with 141,135 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>


That filtering approach is difficult to extend to multiple variables. A semi-join, which matches on dest here, returns the same flights:

## Joining, by = "dest"

## # A tibble: 141,145 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      542            540         2      923            850
##  2  2013     1     1      554            600        -6      812            837
##  3  2013     1     1      554            558        -4      740            728
##  4  2013     1     1      555            600        -5      913            854
##  5  2013     1     1      557            600        -3      838            846
##  6  2013     1     1      558            600        -2      753            745
##  7  2013     1     1      558            600        -2      924            917
##  8  2013     1     1      558            600        -2      923            937
##  9  2013     1     1      559            559         0      702            706
## 10  2013     1     1      600            600         0      851            858
## # ... with 141,135 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
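Neither call appears on the slides, so the following is a sketch on hypothetical toy data (flights_toy, with a made-up flight id) of the two routes to the same subset: a filter() with %in%, and a semi_join() against the table of top destinations.

```r
library(dplyr)

# Toy flights table: ORD appears 3 times, ATL twice, LAX once.
flights_toy <- tibble(
  flight = 1:6,
  dest = c("ORD", "ATL", "ORD", "LAX", "ORD", "ATL")
)

# The two most popular destinations in the toy data (ORD and ATL).
top_dest <- flights_toy %>%
  count(dest, sort = TRUE) %>%
  head(2)

# Route 1: build the filter by hand from one column.
via_filter <- flights_toy %>%
  filter(dest %in% top_dest$dest)

# Route 2: let the semi-join do the matching.
via_semi <- flights_toy %>%
  semi_join(top_dest, by = "dest")
```

Both routes keep the five flights to ORD or ATL. The semi-join version generalises to several key columns by passing a longer vector to by, for example by = c("year", "month", "day"), where the %in% filter would need one condition per column and would not match the columns jointly.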


anti_join

Are there flights that do not have a match in planes?

## # A tibble: 722 x 2
##    tailnum     n
##    <chr>   <int>
##  1 <NA>     2512
##  2 N725MQ    575
##  3 N722MQ    513
##  4 N723MQ    507
##  5 N713MQ    483
##  6 N735MQ    396
##  7 N0EGMQ    371
##  8 N534MQ    364
##  9 N542MQ    363
## 10 N531MQ    349
## # ... with 712 more rows
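A table like this can be produced by an anti-join followed by a count. The slide does not show the call, so the sketch below is an assumption about its shape, run on hypothetical toy tables (flights_toy and planes_toy) rather than the real data.

```r
library(dplyr)

# Toy flights: one tail number is missing (NA), and N2 and N3 are not in planes_toy.
flights_toy <- tibble(tailnum = c("N1", "N2", "N2", NA, "N3", "N3", "N3"))
planes_toy <- tibble(tailnum = "N1", type = "Fixed wing multi engine")

# Flights whose tailnum has no match in planes_toy, counted per tail number.
unmatched <- flights_toy %>%
  anti_join(planes_toy, by = "tailnum") %>%
  count(tailnum, sort = TRUE)
```

Note that the flight with a missing tailnum is also returned, which mirrors the <NA> row at the top of the output above.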


Some notes on joining data

1. Start by identifying the variables that form the primary key in each table.
2. Check that none of the variables in the primary key are missing, and whether any key values are duplicated (for inner joins).
3. Check that your foreign keys match primary keys in another table.
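These checks can be sketched in a few lines of dplyr. The tables below are hypothetical toy data (planes_toy with tailnum as its assumed primary key, flights_toy referring to it as a foreign key), not the real nycflights13 tables.

```r
library(dplyr)

planes_toy <- tibble(
  tailnum = c("N1", "N2", "N3"),
  year = c(2004, 1998, 2012)
)
flights_toy <- tibble(
  flight = 1:4,
  tailnum = c("N1", "N2", "N9", NA)
)

# Steps 1 and 2: a primary key should identify each row uniquely
# and should not contain missing values.
dup_keys <- planes_toy %>%
  count(tailnum) %>%
  filter(n > 1)
missing_keys <- planes_toy %>%
  filter(is.na(tailnum))

# Step 3: foreign-key values in flights_toy with no match in planes_toy.
orphans <- flights_toy %>%
  anti_join(planes_toy, by = "tailnum")
```

Empty dup_keys and missing_keys tables indicate a usable primary key. Here orphans contains the two flights (tailnum "N9" and NA) that a join on tailnum would leave unmatched.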


Topics in the Next Lecture

Missing values and duplicates
ggplot2
R Markdown


