03 extensions

Preview:

Citation preview

Hadley Wickham

Stat405Graphical extensions & missing values

Tuesday, 31 August 2010

1. Graphical extensions (1d & 2d)

2. Subsetting

Tuesday, 31 August 2010

1d extensions

Tuesday, 31 August 2010

price

coun

t 0

1000

2000

3000

4000

5000

6000

0

1000

2000

3000

4000

5000

6000

Fair

Premium

0 5000 10000 15000

Good

Ideal

0 5000 10000 15000

Very Good

0 5000 10000 15000

qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)Tuesday, 31 August 2010

price

coun

t 0

1000

2000

3000

4000

5000

6000

0

1000

2000

3000

4000

5000

6000

Fair

Premium

0 5000 10000 15000

Good

Ideal

0 5000 10000 15000

Very Good

0 5000 10000 15000

qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)

What makes it difficult to compare the distributions?

Brainstorm for 1 minute.

Tuesday, 31 August 2010

Problems

Each histogram far away from the others, but we know stacking is hard to read → use another way of displaying densities

Varying relative abundance makes comparisons difficult → rescale to ensure constant area

Tuesday, 31 August 2010

# Large distances make comparisons hardqplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)

# Stacked heights hard to compareqplot(price, data = diamonds, binwidth = 500, fill = cut)

# Much better - but still have differing relative abundanceqplot(price, data = diamonds, binwidth = 500, geom = "freqpoly", colour = cut)

# Instead of displaying count on y-axis, display density# .. indicates that variable isn't in original dataqplot(price, ..density.., data = diamonds, binwidth = 500, geom = "freqpoly", colour = cut)

# To use with histogram, you need to be explicitqplot(price, ..density.., data = diamonds, binwidth = 500, geom = "histogram") + facet_wrap(~ cut)

Tuesday, 31 August 2010

Use this technique to explore the relationship between price and clarity, and carat and clarity.

Your turn

Tuesday, 31 August 2010

2d extensions

Tuesday, 31 August 2010

Idea ggplot

Small points shape = I(".")

Transparency alpha = I(1/50)

Jittering geom = "jitter"

Smooth curve geom = "smooth"

2d bins geom = "bin2d" or geom = "hex"

Density contours geom = "density2d"

Tuesday, 31 August 2010

# There are two ways to add additional geoms# 1) A vector of geom names:qplot(price, carat, data = diamonds, geom = c("point", "smooth"))

# 2) Add on extra geomsqplot(price, carat, data = diamonds) + geom_smooth()

# This how you get help about a specific geom:?geom_smooth# or go to http://had.co.nz/ggplot2/geom_smooth.html

Tuesday, 31 August 2010

# To set aesthetics to a particular value, you need# to wrap that value in I()

qplot(price, carat, data = diamonds, colour = "blue")qplot(price, carat, data = diamonds, colour = I("blue"))

# Practical application: varying alphaqplot(price, carat, data = diamonds, alpha = I(1/10))qplot(price, carat, data = diamonds, alpha = I(1/50))qplot(price, carat, data = diamonds, alpha = I(1/100))qplot(price, carat, data = diamonds, alpha = I(1/250))

Tuesday, 31 August 2010

Your turn

Explore the relationship between carat, price and clarity, using these techniques.

Which did you find most useful?

Tuesday, 31 August 2010

Subsetting

Tuesday, 31 August 2010

Motivation

Look at histograms and scatterplots of x, y, z from the diamonds dataset

Which values are clearly incorrect? Which values might we be able to correct? (Remember measurements are in millimetres, 1 inch = 25 mm)

Tuesday, 31 August 2010

Plots

qplot(x, data = diamonds, binwidth = 0.1)qplot(y, data = diamonds, binwidth = 0.1)qplot(z, data = diamonds, binwidth = 0.1)qplot(x, y, data = diamonds)qplot(x, z, data = diamonds)qplot(y, z, data = diamonds)

Tuesday, 31 August 2010

Modifying data

To modify, must first know how to extract, or subset. Many different methods available in R. We’ll start with most explicit then learn some shortcuts next time.

Basic structure: df$varnamedf[row index, column index]

Tuesday, 31 August 2010

$

Remember str(diamonds) ?

That hints at how to extract individual variables:

diamonds$carat

diamonds$price

Tuesday, 31 August 2010

logical

blank

integer

character

+ve: include-ve: exclude

lookup by name

include TRUEs

include all

Tuesday, 31 August 2010

Integer subsetting

Tuesday, 31 August 2010

# Nothingstr(diamonds[, ])

# Positive integers & nothingdiamonds[1:6, ] # same as head(diamonds)diamonds[, 1:4] # watch out!

# Two positive integers in rows & columnsdiamonds[1:10, 1:4]

# Repeating input repeats outputdiamonds[c(1,1,1,2,2), 1:4]

# Negative integers drop valuesdiamonds[-(1:53900), -1]

Tuesday, 31 August 2010

# Useful technique: Order by one or more columnsdiamonds <- diamonds[order(diamonds$price), ]

# Useful technique: Combine two tablescarats <- data.frame(table(carat = diamonds$carat))mtch <- match(diamonds$carat, carats$carat)diamonds$carat_count <- carats$Freq[mtch]

Tuesday, 31 August 2010

Logical subsetting

Tuesday, 31 August 2010

# The most complicated to understand, but # the most powerful. Lets you extract a # subset defined by some characteristic of # the datax_big <- diamonds$x > 10

head(x_big)sum(x_big)mean(x_big)table(x_big)

diamonds$x[x_big]diamonds[x_big, ]

Tuesday, 31 August 2010

small <- diamonds[diamonds$carat < 1, ]lowqual <- diamonds[diamonds$clarity %in% c("I1", "SI2", "SI1"), ]

# Comparison functions:# < > <= >= != == %in%

# Boolean operators: & | !small <- diamonds$carat < 1 & diamonds$price > 500lowqual <- diamonds$colour == "D" | diamonds$cut == "Fair"

a

b

a | b

a & b

a & !b

xor(a, b)

Tuesday, 31 August 2010

Useful functions for

logical vectors

table(zeros)sum(zeros)mean(zeros)

TRUE = 1; FALSE = 0

Tuesday, 31 August 2010

Select the diamonds that have:

Equal x and y dimensions.

Depth between 55 and 70.

Carat smaller than the mean.

Cost more than $10,000 per carat.

Are of good quality or better.

Your turn

Tuesday, 31 August 2010

Saving results

# Prints to screen

diamonds[diamonds$x > 10, ]

# Saves to new data frame

big <- diamonds[diamonds$x > 10, ]

# Overwrites existing data frame. Dangerous!

diamonds <- diamonds[diamonds$x < 10,]

Tuesday, 31 August 2010

diamonds <- diamonds[1, 1]diamonds

# Uh oh!

rm(diamonds)str(diamonds)

# Phew!

Tuesday, 31 August 2010

Your turn

Create a logical vector that selects diamonds with equal x & y. Create a new dataset that only contains these values.

Create a logical vector that selects diamonds with incorrect/unusual x, y, or z values. Create a new dataset that omits these values. (Hint: do this one variable at a time)

Tuesday, 31 August 2010

equal_dim <- diamonds$x == diamonds$yequal <- diamonds[equal_dim, ]

y_big <- diamonds$y > 10z_big <- diamonds$z > 6

x_zero <- diamonds$x == 0 y_zero <- diamonds$y == 0z_zero <- diamonds$z == 0zeros <- x_zero | y_zero | z_zero

bad <- y_big | z_big | zerosgood <- diamonds[!bad, ]

Tuesday, 31 August 2010

Recommended