31
Hadley Wickham Stat405 Graphical extensions & missing values Tuesday, 31 August 2010

03 extensions

Embed Size (px)

Citation preview

Page 1: 03 extensions

Hadley Wickham

Stat405Graphical extensions & missing values

Tuesday, 31 August 2010

Page 2: 03 extensions

1. Graphical extensions (1d & 2d)

2. Subsetting

Tuesday, 31 August 2010

Page 3: 03 extensions

1d extensions

Tuesday, 31 August 2010

Page 4: 03 extensions

price

coun

t 0

1000

2000

3000

4000

5000

6000

0

1000

2000

3000

4000

5000

6000

Fair

Premium

0 5000 10000 15000

Good

Ideal

0 5000 10000 15000

Very Good

0 5000 10000 15000

qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)Tuesday, 31 August 2010

Page 5: 03 extensions

price

coun

t 0

1000

2000

3000

4000

5000

6000

0

1000

2000

3000

4000

5000

6000

Fair

Premium

0 5000 10000 15000

Good

Ideal

0 5000 10000 15000

Very Good

0 5000 10000 15000

qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)

What makes it difficult to compare the distributions?

Brainstorm for 1 minute.

Tuesday, 31 August 2010

Page 6: 03 extensions

Problems

Each histogram far away from the others, but we know stacking is hard to read → use another way of displaying densities

Varying relative abundance makes comparisons difficult → rescale to ensure constant area

Tuesday, 31 August 2010

Page 7: 03 extensions

# Large distances make comparisons hardqplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)

# Stacked heights hard to compareqplot(price, data = diamonds, binwidth = 500, fill = cut)

# Much better - but still have differing relative abundanceqplot(price, data = diamonds, binwidth = 500, geom = "freqpoly", colour = cut)

# Instead of displaying count on y-axis, display density# .. indicates that variable isn't in original dataqplot(price, ..density.., data = diamonds, binwidth = 500, geom = "freqpoly", colour = cut)

# To use with histogram, you need to be explicitqplot(price, ..density.., data = diamonds, binwidth = 500, geom = "histogram") + facet_wrap(~ cut)

Tuesday, 31 August 2010

Page 8: 03 extensions

Use this technique to explore the relationship between price and clarity, and carat and clarity.

Your turn

Tuesday, 31 August 2010

Page 9: 03 extensions

2d extensions

Tuesday, 31 August 2010

Page 10: 03 extensions

Idea ggplot

Small points shape = I(".")

Transparency alpha = I(1/50)

Jittering geom = "jitter"

Smooth curve geom = "smooth"

2d bins geom = "bin2d" or geom = "hex"

Density contours geom = "density2d"

Tuesday, 31 August 2010

Page 11: 03 extensions

# There are two ways to add additional geoms# 1) A vector of geom names:qplot(price, carat, data = diamonds, geom = c("point", "smooth"))

# 2) Add on extra geomsqplot(price, carat, data = diamonds) + geom_smooth()

# This how you get help about a specific geom:?geom_smooth# or go to http://had.co.nz/ggplot2/geom_smooth.html

Tuesday, 31 August 2010

Page 12: 03 extensions

# To set aesthetics to a particular value, you need# to wrap that value in I()

qplot(price, carat, data = diamonds, colour = "blue")qplot(price, carat, data = diamonds, colour = I("blue"))

# Practical application: varying alphaqplot(price, carat, data = diamonds, alpha = I(1/10))qplot(price, carat, data = diamonds, alpha = I(1/50))qplot(price, carat, data = diamonds, alpha = I(1/100))qplot(price, carat, data = diamonds, alpha = I(1/250))

Tuesday, 31 August 2010

Page 13: 03 extensions

Your turn

Explore the relationship between carat, price and clarity, using these techniques.

Which did you find most useful?

Tuesday, 31 August 2010

Page 14: 03 extensions

Subsetting

Tuesday, 31 August 2010

Page 15: 03 extensions

Motivation

Look at histograms and scatterplots of x, y, z from the diamonds dataset

Which values are clearly incorrect? Which values might we be able to correct? (Remember measurements are in millimetres, 1 inch = 25 mm)

Tuesday, 31 August 2010

Page 16: 03 extensions

Plots

qplot(x, data = diamonds, binwidth = 0.1)qplot(y, data = diamonds, binwidth = 0.1)qplot(z, data = diamonds, binwidth = 0.1)qplot(x, y, data = diamonds)qplot(x, z, data = diamonds)qplot(y, z, data = diamonds)

Tuesday, 31 August 2010

Page 17: 03 extensions

Modifying data

To modify, must first know how to extract, or subset. Many different methods available in R. We’ll start with most explicit then learn some shortcuts next time.

Basic structure: df$varnamedf[row index, column index]

Tuesday, 31 August 2010

Page 18: 03 extensions

$

Remember str(diamonds) ?

That hints at how to extract individual variables:

diamonds$carat

diamonds$price

Tuesday, 31 August 2010

Page 19: 03 extensions

logical

blank

integer

character

+ve: include-ve: exclude

lookup by name

include TRUEs

include all

Tuesday, 31 August 2010

Page 20: 03 extensions

Integer subsetting

Tuesday, 31 August 2010

Page 21: 03 extensions

# Nothingstr(diamonds[, ])

# Positive integers & nothingdiamonds[1:6, ] # same as head(diamonds)diamonds[, 1:4] # watch out!

# Two positive integers in rows & columnsdiamonds[1:10, 1:4]

# Repeating input repeats outputdiamonds[c(1,1,1,2,2), 1:4]

# Negative integers drop valuesdiamonds[-(1:53900), -1]

Tuesday, 31 August 2010

Page 22: 03 extensions

# Useful technique: Order by one or more columnsdiamonds <- diamonds[order(diamonds$price), ]

# Useful technique: Combine two tablescarats <- data.frame(table(carat = diamonds$carat))mtch <- match(diamonds$carat, carats$carat)diamonds$carat_count <- carats$Freq[mtch]

Tuesday, 31 August 2010

Page 23: 03 extensions

Logical subsetting

Tuesday, 31 August 2010

Page 24: 03 extensions

# The most complicated to understand, but # the most powerful. Lets you extract a # subset defined by some characteristic of # the datax_big <- diamonds$x > 10

head(x_big)sum(x_big)mean(x_big)table(x_big)

diamonds$x[x_big]diamonds[x_big, ]

Tuesday, 31 August 2010

Page 25: 03 extensions

small <- diamonds[diamonds$carat < 1, ]lowqual <- diamonds[diamonds$clarity %in% c("I1", "SI2", "SI1"), ]

# Comparison functions:# < > <= >= != == %in%

# Boolean operators: & | !small <- diamonds$carat < 1 & diamonds$price > 500lowqual <- diamonds$colour == "D" | diamonds$cut == "Fair"

a

b

a | b

a & b

a & !b

xor(a, b)

Tuesday, 31 August 2010

Page 26: 03 extensions

Useful functions for

logical vectors

table(zeros)sum(zeros)mean(zeros)

TRUE = 1; FALSE = 0

Tuesday, 31 August 2010

Page 27: 03 extensions

Select the diamonds that have:

Equal x and y dimensions.

Depth between 55 and 70.

Carat smaller than the mean.

Cost more than $10,000 per carat.

Are of good quality or better.

Your turn

Tuesday, 31 August 2010

Page 28: 03 extensions

Saving results

# Prints to screen

diamonds[diamonds$x > 10, ]

# Saves to new data frame

big <- diamonds[diamonds$x > 10, ]

# Overwrites existing data frame. Dangerous!

diamonds <- diamonds[diamonds$x < 10,]

Tuesday, 31 August 2010

Page 29: 03 extensions

diamonds <- diamonds[1, 1]diamonds

# Uh oh!

rm(diamonds)str(diamonds)

# Phew!

Tuesday, 31 August 2010

Page 30: 03 extensions

Your turn

Create a logical vector that selects diamonds with equal x & y. Create a new dataset that only contains these values.

Create a logical vector that selects diamonds with incorrect/unusual x, y, or z values. Create a new dataset that omits these values. (Hint: do this one variable at a time)

Tuesday, 31 August 2010

Page 31: 03 extensions

equal_dim <- diamonds$x == diamonds$yequal <- diamonds[equal_dim, ]

y_big <- diamonds$y > 10z_big <- diamonds$z > 6

x_zero <- diamonds$x == 0 y_zero <- diamonds$y == 0z_zero <- diamonds$z == 0zeros <- x_zero | y_zero | z_zero

bad <- y_big | z_big | zerosgood <- diamonds[!bad, ]

Tuesday, 31 August 2010