
IS5 in R: Scatterplots, Association, and Correlation (Chapter 6)

Margaret Chien and Nicholas Horton ([email protected])

July 17, 2018

Introduction and background

This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of Intro Stats (2018) by De Veaux, Velleman, and Bock. More information about the book can be found at http://wps.aw.com/aw_deveaux_stats_series. This file as well as the associated R Markdown reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the mosaic package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (http://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the R Journal: https://journal.r-project.org/archive/2017/RJ-2017-024.

Chapter 6: Scatterplots, Association, and Correlation

library(mosaic)
library(readr)
library(janitor)
Hurricanes <- read_csv("http://nhorton.people.amherst.edu/is5/data/Tracking_hurricanes_2015.csv")

## Parsed with column specification:
## cols(
##   Year = col_integer(),
##   Error_24h = col_double(),
##   Error_48h = col_double(),
##   Error_72h = col_double()
## )

By default, read_csv() prints the column specification (the variable names and the type parsed for each). These messages can be suppressed using the message=FALSE code chunk option to save space and improve readability.

# Figure 6.1, page 164
gf_point(Error_72h ~ Year, data = Hurricanes, ylab = "Prediction Error")

[Figure 6.1: scatterplot of Prediction Error (y-axis) against Year (x-axis)]

Section 6.1: Scatterplots

See the displays on pages 164-165.

Example 6.1: Comparing Prices Worldwide

Prices <- read_csv("http://nhorton.people.amherst.edu/is5/data/Prices_and_Earnings.csv") %>%
  clean_names()

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   City = col_character(),
##   `Food Costs($)` = col_integer(),
##   `Womens Clothing($)` = col_integer(),
##   `Mens Clothing($)` = col_integer(),
##   `Hours Worked` = col_integer(),
##   `Vacation Days` = col_integer(),
##   `Big Mac(min)` = col_integer(),
##   `Bread(kg in min)` = col_integer(),
##   `Rice(kg in min)` = col_integer(),
##   `Goods and Services($)` = col_integer()
## )

## See spec(...) for full column specifications.

names(Prices)

##  [1] "city"                    "food_costs"
##  [3] "womens_clothing"         "mens_clothing"
##  [5] "i_phone_4s_hr"           "clothing_index"
##  [7] "hours_worked"            "wage_gross"
##  [9] "wage_net"                "vacation_days"
## [11] "col_excl_rent"           "col_incl_rent"
## [13] "pur_power_gross"         "pur_power_net"
## [15] "pur_power_annual"        "big_mac_min"
## [17] "bread_kg_in_min"         "rice_kg_in_min"
## [19] "goods_and_services"      "good_and_services_index"
## [21] "food_index"

Here we use the clean_names() function from the janitor package to sanitize the names of the columns (which would otherwise contain special characters or whitespace).

gf_point(food_costs ~ womens_clothing, data = Prices) %>%
  gf_labs(x = "Cost of Women's Clothes", y = "Food Costs ($)")

[Figure: scatterplot of Food Costs ($) against Cost of Women's Clothes]

gf_point(i_phone_4s_hr ~ wage_gross, data = Prices) %>%
  gf_labs(x = "Average Hourly Wage", y = "Hours to Earn an iPhone 4S")

[Figure: scatterplot of Hours to Earn an iPhone 4S against Average Hourly Wage]

gf_point(clothing_index ~ hours_worked, data = Prices) %>%
  gf_labs(x = "Working Hours", y = "Clothes Index")

[Figure: scatterplot of Clothes Index against Working Hours]

gf_point(food_costs ~ vacation_days, data = Prices) %>%
  gf_labs(x = "Vacation Days (per year)", y = "Food Costs ($)")

[Figure: scatterplot of Food Costs ($) against Vacation Days (per year)]

Roles for Variables

Smoothing Scatterplots

HopkinsForest <- read_csv("http://nhorton.people.amherst.edu/is5/data/Hopkins_Forest.csv") %>%
  clean_names()

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Date = col_character(),
##   Year = col_integer(),
##   Month = col_integer(),
##   Day = col_integer(),
##   `Day of Year` = col_integer(),
##   `Max Sol Rad (w/m^2)` = col_integer(),
##   `Min Sol Rad (w/m^2)` = col_integer(),
##   `Total Sol Rad (w/m^2)` = col_integer(),
##   `Min Wind (mph)` = col_integer(),
##   `Max Barom (mb)` = col_integer(),
##   `Min Barom (mb)` = col_integer()
## )

## See spec(...) for full column specifications.

# Figure 6.2, page 168
gf_point(avg_wind_mph ~ day_of_year, data = HopkinsForest) %>%
  gf_smooth(se = FALSE) %>%
  gf_labs(x = "Day of Year", y = "Average Wind Speed (mph)")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

[Figure 6.2: scatterplot of Average Wind Speed (mph) against Day of Year, with a loess smooth]
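The message above reports that the smooth curve is a loess fit. As a rough illustration of what that smoother does, here is a minimal base-R sketch. It uses made-up seasonal data rather than the Hopkins Forest file, and the variable names `day`, `wind`, `fit`, and `smoothed` are hypothetical. The `span` value shown is the default of base R's `loess()`.

```r
# Made-up data standing in for a noisy seasonal pattern
# (an assumption for illustration, not the Hopkins Forest data)
set.seed(1)
day <- 1:365
wind <- 3 + 1.5 * cos(2 * pi * day / 365) + rnorm(365, sd = 0.8)

# Fit a local regression smoother; a larger span gives a smoother curve
fit <- loess(wind ~ day, span = 0.75)

# Fitted values of the smooth curve at each observed day
smoothed <- predict(fit)

# Plot the points and overlay the smooth curve
plot(day, wind, xlab = "Day of Year", ylab = "Simulated Wind Speed (mph)")
lines(day, smoothed, lwd = 2)
```

Trying a smaller value such as `span = 0.2` shows how the curve starts to follow the noise instead of the overall trend.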

Example 6.2: Smoothing Timeplots

Fitness <- read_csv("http://nhorton.people.amherst.edu/is5/data/Fitness_data.csv") %>%
  clean_names()
gf_histogram(~ weight, data = Fitness, binwidth = 1, center = .5) %>%
  gf_labs(x = "Weight (lb)", y = "# of Days")

## Warning: Removed 70 rows containing non-finite values (stat_bin).

[Figure: histogram of Weight (lb), with # of Days on the y-axis]

gf_point(weight ~ days_since_july_19_2014, data = Fitness) %>%
  gf_smooth(se = FALSE) %>%
  gf_labs(x = "Days Since July 19, 2014", y = "Weight (lb)")

## Warning: Removed 70 rows containing non-finite values (stat_smooth).

## Warning: Removed 70 rows containing missing values (geom_point).

[Figure: timeplot of Weight (lb) against Days Since July 19, 2014, with a smooth curve]

gf_boxplot(weight ~ as.factor(month), data = Fitness) %>%
  gf_labs(x = "Month", y = "Weight (lb)")

## Warning: Removed 70 rows containing non-finite values (stat_boxplot).

[Figure: side-by-side boxplots of Weight (lb) by Month (1 through 12, plus NA)]

Warnings can be suppressed with the warning=FALSE chunk option.
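The "Removed 70 rows" warnings above report rows with missing values. Rather than only hiding the warning, it can help to count the missing values and drop them explicitly before plotting. The following is a minimal base-R sketch using a short made-up vector (the values and the names `n_missing`, `observed`, and `mean_weight` are hypothetical, not taken from the Fitness data).

```r
# Made-up weights with two missing days (hypothetical values)
weight <- c(172.4, NA, 171.8, 170.9, NA, 169.5)

# Count the missing values that a plot would otherwise drop with a warning
n_missing <- sum(is.na(weight))

# Keep only the observed values before plotting or summarizing
observed <- weight[!is.na(weight)]
mean_weight <- mean(observed)
```

Checking `n_missing` first makes it clear how much data a plot is leaving out.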

Section 6.2: Correlation

HeightsWeights <- read_csv("http://nhorton.people.amherst.edu/is5/data/Heights_and_weights.csv")

## Parsed with column specification:
## cols(
##   Weight = col_integer(),
##   Height = col_double()
## )

# Figure 6.3, page 170
gf_point(Weight ~ Height, data = HeightsWeights) %>%
  gf_labs(x = "Height (in.)", y = "Weight (lb)")

[Figure 6.3: scatterplot of Weight (lb) against Height (in.)]

cor(Weight ~ Height, data = HeightsWeights)

## [1] 0.6440311
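To see what cor() computes, the correlation can be built by hand from standardized values: convert each variable to z-scores, multiply them pairwise, sum, and divide by n - 1. The sketch below uses base R and a small made-up data set (the numbers and the names `z_height`, `z_weight`, `r_by_hand`, and `r_builtin` are hypothetical). It uses base R's `cor(x, y)` form instead of the mosaic formula interface shown above.

```r
# Made-up heights (in.) and weights (lb); not the HeightsWeights data
height <- c(64, 66, 67, 69, 70, 72, 74)
weight <- c(120, 140, 135, 160, 175, 170, 200)

# Standardize each variable (scale() uses the n - 1 standard deviation)
z_height <- as.numeric(scale(height))
z_weight <- as.numeric(scale(weight))

# Correlation: sum of the z-score products divided by n - 1
r_by_hand <- sum(z_height * z_weight) / (length(height) - 1)

# The built-in function should give the same value
r_builtin <- cor(height, weight)
```

Because the calculation uses only z-scores, changing the units of either variable (inches to centimeters, for example) leaves the correlation unchanged.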

See displays on pages 170-171.

Step-by-Step Example: Looking at Association

Framingham <- read_csv("http://nhorton.people.amherst.edu/is5/data/Framingham.csv")

## Parsed with column specification:
## cols(
##   Cholesterol = col_integer(),
##   Age = col_integer(),
##   Sex = col_character(),
##   SBP = col_integer(),
##   DBP = col_integer(),
##   CIG = col_integer()
## )

## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 2)

## Warning: 1 parsing failure.
##  row col   expected actual file
## 1090 CIG an integer      . 'http://nhorton.people.amherst.edu/is5/da~

gf_point(SBP ~ DBP, data = Framingham) %>%
  gf_labs(x = "Diastolic BP (mm Hg)", y = "Systolic BP (mm Hg)")

[Figure: scatterplot of Systolic BP (mm Hg) against Diastolic BP (mm Hg)]

cor(SBP ~ DBP, data = Framingham)

## [1] 0.7924792

Random Matters: Correlations Vary

LiveBirths <- read_csv("http://nhorton.people.amherst.edu/is5/data/Babysamp_98.csv") %>%
  clean_names()

## Parsed with column specification:
## cols(
##   MomAge = col_integer(),
##   DadAge = col_integer(),
##   MomEduc = col_integer(),
##   MomMarital = col_integer(),
##   numlive = col_integer(),
##   dobmm = col_integer(),
##   gestation = col_integer(),
##   sex = col_character(),
##   weight = col_integer(),
##   prenatalstart = col_integer(),
##   orig.id = col_integer(),
##   preemie = col_logical()
## )

LiveBirths <- LiveBirths %>%
  filter(!is.na(dad_age)) # Drop births with a missing father's age
set.seed(14513) # To ensure we get the same values when we run it multiple times
numsim <- 10000 # Number of samples
gf_point(mom_age ~ dad_age, data = sample(LiveBirths, size = 50))

[Figure: scatterplot of mom_age against dad_age for one random sample of 50 births]

# Graph will look different for different samples
cor(mom_age ~ dad_age, data = LiveBirths)

## [1] 0.7516507

# What does do() do?
cor(mom_age ~ dad_age, data = sample(LiveBirths, size = 50)) # Correlation of one random sample

## [1] 0.7619002

cor(mom_age ~ dad_age, data = sample(LiveBirths, size = 50)) # Correlation of another random sample

## [1] 0.7767026

do(2) * cor(mom_age ~ dad_age, data = sample(LiveBirths, size = 50)) # Finds the correlation twice

##         cor
## 1 0.7067583
## 2 0.7401397

# For the visualization, we need 10,000 correlations
LiveCorr <- do(numsim) * cor(mom_age ~ dad_age, data = sample(LiveBirths, size = 50))

The do() function repeats the expression 10,000 times: each repetition draws a new random sample of 50 births and computes the correlation within that sample. The results are collected in a data frame with one row per sample.

(We can use the chunk option cache=TRUE to cache the results of this chunk.)

# Figure 6.8, page 176
gf_histogram(~ cor, data = LiveCorr, binwidth = .05, center = .025) %>%
  gf_labs(x = "Correlation of Mother's Age and Father's Age in Samples of Size 50",
          y = "# of Samples")

[Figure 6.8: histogram of the Correlation of Mother's Age and Father's Age in Samples of Size 50, with # of Samples on the y-axis]
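The same resampling idea can be sketched in base R, where replicate() plays a role similar to do(). This is only an illustration under stated assumptions: the population is simulated (not the LiveBirths data), the helper `one_cor()` and the names `births` and `sim_cors` are hypothetical, and only 1,000 repetitions are used to keep it quick.

```r
# Simulated population of correlated ages (an assumption for illustration)
set.seed(2018)
n_pop <- 2000
dad_age <- round(rnorm(n_pop, mean = 30, sd = 6))
mom_age <- round(0.75 * dad_age + rnorm(n_pop, mean = 5.5, sd = 3.5))
births <- data.frame(mom_age, dad_age)

# One repetition: sample 50 rows without replacement, then compute the correlation
one_cor <- function(data, size = 50) {
  rows <- sample(nrow(data), size = size)
  cor(data$mom_age[rows], data$dad_age[rows])
}

# Repeat the sampling many times and collect the correlations
numsim <- 1000
sim_cors <- replicate(numsim, one_cor(births))

# The sample correlations scatter around the population correlation
hist(sim_cors, xlab = "Correlation in Samples of Size 50", main = "")
```

Increasing `size` in `one_cor()` narrows the histogram, since larger samples give correlations closer to the population value.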

Section 6.3: Warning: Correlation ≠ Causation

Storks <- read_csv("http://nhorton.people.amherst.edu/is5/data/Storks.csv")

## Parsed with column specification:
## cols(
##   Storks = col_integer(),
##   Population = col_integer()
## )

# Figure 6.9
gf_point(Population ~ Storks, data = Storks) %>%
  gf_labs(x = "# of Storks", y = "Human Population")

[Figure 6.9: scatterplot of Human Population against # of Storks]

Correlation Tables

Companies <- read_csv("http://nhorton.people.amherst.edu/is5/data/Companies.csv") %>%
  clean_names()

## Parsed with column specification:
## cols(
##   Company = col_character(),
##   Assets = col_integer(),
##   Sales = col_integer(),
##   `Market Value` = col_integer(),
##   Profits = col_double(),
##   `Cash Flow` = col_double(),
##   Employees = col_double(),
##   sector = col_character(),
##   Banks = col_integer()
## )

# Table 6.1, page 178
Companies %>%
  select(assets, sales, market_value, profits, cash_flow, employees) %>%
  cor()

##                 assets     sales market_value   profits cash_flow
## assets       1.0000000 0.7464649    0.6822122 0.6016986 0.6409018
## sales        0.7464649 1.0000000    0.8788920 0.8137758 0.8549172
## market_value 0.6822122 0.8788920    1.0000000 0.9681987 0.9702851
## profits      0.6016986 0.8137758    0.9681987 1.0000000 0.9887795
## cash_flow    0.6409018 0.8549172    0.9702851 0.9887795 1.0000000
## employees    0.5943581 0.9240429    0.8182161 0.7621057 0.7866148
##              employees
## assets       0.5943581
## sales        0.9240429
## market_value 0.8182161
## profits      0.7621057
## cash_flow    0.7866148
## employees    1.0000000
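A correlation table like the one above is symmetric with ones on the diagonal, because each variable correlates perfectly with itself and cor(x, y) equals cor(y, x). The base-R sketch below shows this on a small made-up data frame (the values and the names `firms` and `cor_table` are hypothetical, not the Companies data) and rounds the result for easier reading.

```r
# Made-up company figures (hypothetical values for illustration)
set.seed(6)
sales <- runif(40, min = 100, max = 5000)
firms <- data.frame(
  sales = sales,
  profits = 0.1 * sales + rnorm(40, sd = 60),
  employees = 0.02 * sales + rnorm(40, sd = 15)
)

# cor() on a data frame of numeric columns returns the full correlation matrix
cor_table <- cor(firms)

# Rounding to two decimals makes the table easier to scan
round(cor_table, 2)
```

Since the matrix is symmetric, only the entries below (or above) the diagonal need to be read.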


Section 6.4: Straightening Scatterplots

FStops <- read_csv("http://nhorton.people.amherst.edu/is5/data/F-stops.csv") %>%
  clean_names()

## Parsed with column specification:
## cols(
##   `F-stop` = col_double(),
##   ShutterSpeed = col_double()
## )

# Figure 6.10, page 179
gf_point(f_stop ~ shutter_speed, data = FStops) %>%
  gf_labs(x = "Shutter Speed (sec)", y = "f/stop")

[Figure 6.10: scatterplot of f/stop against Shutter Speed (sec)]

cor(f_stop ~ shutter_speed, data = FStops)

## [1] 0.9786716

The Ladder of Powers

f/Stops Again

# Figure 6.11, page 181
gf_point(log(f_stop) ~ shutter_speed, data = FStops) %>%
  gf_labs(x = "Shutter Speed (sec)", y = "Log (f/stop)")

[Figure 6.11: scatterplot of Log (f/stop) against Shutter Speed (sec)]

# Figure 6.12
gf_point((f_stop)^2 ~ shutter_speed, data = FStops) %>%
  gf_labs(x = "Shutter Speed (sec)", y = "f/stop squared")

[Figure 6.12: scatterplot of f/stop squared against Shutter Speed (sec)]
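One way to compare re-expressions is to compute the correlation on each scale and see which is closest to 1. The base-R sketch below does this for idealized made-up data in which f/stop squared is exactly proportional to shutter speed. That proportionality, the constant 8000, and the names `r_raw`, `r_log`, and `r_squared_fstop` are assumptions for illustration, not values from the FStops file.

```r
# Idealized made-up exposure data: f/stop squared proportional to shutter speed
shutter_speed <- c(1/1000, 1/500, 1/250, 1/125, 1/60, 1/30, 1/15, 1/8)
f_stop <- sqrt(8000 * shutter_speed)

# Correlation with shutter speed on three scales of f/stop
r_raw <- cor(shutter_speed, f_stop)             # original scale
r_log <- cor(shutter_speed, log(f_stop))        # log re-expression
r_squared_fstop <- cor(shutter_speed, f_stop^2) # squared re-expression

# Compare the three side by side
round(c(raw = r_raw, log = r_log, squared = r_squared_fstop), 3)
```

A high correlation alone does not show that a relationship is straight, so the scatterplot on each scale should still be checked.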

See displays in “What Can Go Wrong?” on pages 181-183.
