03 extensions

Hadley Wickham

Stat405Graphical extensions & missing values

Tuesday, 31 August 2010

1. Graphical extensions (1d & 2d)

2. Subsetting


1d extensions


price

coun

t 0

1000

2000

3000

4000

5000

6000

0

1000

2000

3000

4000

5000

6000

Fair

Premium

0 5000 10000 15000

Good

Ideal

0 5000 10000 15000

Very Good

0 5000 10000 15000

qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)Tuesday, 31 August 2010

price

coun

t 0

1000

2000

3000

4000

5000

6000

0

1000

2000

3000

4000

5000

6000

Fair

Premium

0 5000 10000 15000

Good

Ideal

0 5000 10000 15000

Very Good

0 5000 10000 15000

qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)

What makes it difficult to compare the distributions?

Brainstorm for 1 minute.


Problems

Each histogram far away from the others, but we know stacking is hard to read → use another way of displaying densities

Varying relative abundance makes comparisons difficult → rescale to ensure constant area


# Large distances make comparisons hardqplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)

# Stacked heights hard to compareqplot(price, data = diamonds, binwidth = 500, fill = cut)

# Much better - but still have differing relative abundanceqplot(price, data = diamonds, binwidth = 500, geom = "freqpoly", colour = cut)

# Instead of displaying count on y-axis, display density# .. indicates that variable isn't in original dataqplot(price, ..density.., data = diamonds, binwidth = 500, geom = "freqpoly", colour = cut)

# To use with histogram, you need to be explicitqplot(price, ..density.., data = diamonds, binwidth = 500, geom = "histogram") + facet_wrap(~ cut)


Use this technique to explore the relationship between price and clarity, and carat and clarity.

Your turn


2d extensions


Idea ggplot

Small points shape = I(".")

Transparency alpha = I(1/50)

Jittering geom = "jitter"

Smooth curve geom = "smooth"

2d bins geom = "bin2d" or geom = "hex"

Density contours geom = "density2d"


# There are two ways to add additional geoms# 1) A vector of geom names:qplot(price, carat, data = diamonds, geom = c("point", "smooth"))

# 2) Add on extra geomsqplot(price, carat, data = diamonds) + geom_smooth()

# This how you get help about a specific geom:?geom_smooth# or go to http://had.co.nz/ggplot2/geom_smooth.html


# To set aesthetics to a particular value, you need# to wrap that value in I()

qplot(price, carat, data = diamonds, colour = "blue")qplot(price, carat, data = diamonds, colour = I("blue"))

# Practical application: varying alphaqplot(price, carat, data = diamonds, alpha = I(1/10))qplot(price, carat, data = diamonds, alpha = I(1/50))qplot(price, carat, data = diamonds, alpha = I(1/100))qplot(price, carat, data = diamonds, alpha = I(1/250))


Your turn

Explore the relationship between carat, price and clarity, using these techniques.

Which did you find most useful?


Subsetting


Motivation

Look at histograms and scatterplots of x, y, z from the diamonds dataset

Which values are clearly incorrect? Which values might we be able to correct? (Remember measurements are in millimetres, 1 inch = 25 mm)


Plots

qplot(x, data = diamonds, binwidth = 0.1)qplot(y, data = diamonds, binwidth = 0.1)qplot(z, data = diamonds, binwidth = 0.1)qplot(x, y, data = diamonds)qplot(x, z, data = diamonds)qplot(y, z, data = diamonds)


Modifying data

To modify, must first know how to extract, or subset. Many different methods available in R. We’ll start with most explicit then learn some shortcuts next time.

Basic structure: df$varnamedf[row index, column index]


$

Remember str(diamonds) ?

That hints at how to extract individual variables:

diamonds$carat

diamonds$price


logical

blank

integer

character

+ve: include-ve: exclude

lookup by name

include TRUEs

include all


Integer subsetting


# Nothingstr(diamonds[, ])

# Positive integers & nothingdiamonds[1:6, ] # same as head(diamonds)diamonds[, 1:4] # watch out!

# Two positive integers in rows & columnsdiamonds[1:10, 1:4]

# Repeating input repeats outputdiamonds[c(1,1,1,2,2), 1:4]

# Negative integers drop valuesdiamonds[-(1:53900), -1]


# Useful technique: Order by one or more columnsdiamonds <- diamonds[order(diamonds$price), ]

# Useful technique: Combine two tablescarats <- data.frame(table(carat = diamonds$carat))mtch <- match(diamonds$carat, carats$carat)diamonds$carat_count <- carats$Freq[mtch]


Logical subsetting


# The most complicated to understand, but # the most powerful. Lets you extract a # subset defined by some characteristic of # the datax_big <- diamonds$x > 10

head(x_big)sum(x_big)mean(x_big)table(x_big)

diamonds$x[x_big]diamonds[x_big, ]


small <- diamonds[diamonds$carat < 1, ]lowqual <- diamonds[diamonds$clarity %in% c("I1", "SI2", "SI1"), ]

# Comparison functions:# < > <= >= != == %in%

# Boolean operators: & | !small <- diamonds$carat < 1 & diamonds$price > 500lowqual <- diamonds$colour == "D" | diamonds$cut == "Fair"

a

b

a | b

a & b

a & !b

xor(a, b)


Useful functions for

logical vectors

table(zeros)sum(zeros)mean(zeros)

TRUE = 1; FALSE = 0


Select the diamonds that have:

Equal x and y dimensions.

Depth between 55 and 70.

Carat smaller than the mean.

Cost more than $10,000 per carat.

Are of good quality or better.

Your turn


Saving results

# Prints to screen

diamonds[diamonds$x > 10, ]

# Saves to new data frame

big <- diamonds[diamonds$x > 10, ]

# Overwrites existing data frame. Dangerous!

diamonds <- diamonds[diamonds$x < 10,]


diamonds <- diamonds[1, 1]diamonds

# Uh oh!

rm(diamonds)str(diamonds)

# Phew!


Your turn

Create a logical vector that selects diamonds with equal x & y. Create a new dataset that only contains these values.

Create a logical vector that selects diamonds with incorrect/unusual x, y, or z values. Create a new dataset that omits these values. (Hint: do this one variable at a time)


equal_dim <- diamonds$x == diamonds$yequal <- diamonds[equal_dim, ]

y_big <- diamonds$y > 10z_big <- diamonds$z > 6

x_zero <- diamonds$x == 0 y_zero <- diamonds$y == 0z_zero <- diamonds$z == 0zeros <- x_zero | y_zero | z_zero

bad <- y_big | z_big | zerosgood <- diamonds[!bad, ]


Documents

03 extensions