View
1.123
Download
0
Category
Preview:
Citation preview
Hadley Wickham
Stat405Graphical extensions & missing values
Tuesday, 31 August 2010
1. Graphical extensions (1d & 2d)
2. Subsetting
Tuesday, 31 August 2010
1d extensions
Tuesday, 31 August 2010
price
coun
t 0
1000
2000
3000
4000
5000
6000
0
1000
2000
3000
4000
5000
6000
Fair
Premium
0 5000 10000 15000
Good
Ideal
0 5000 10000 15000
Very Good
0 5000 10000 15000
qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)Tuesday, 31 August 2010
price
coun
t 0
1000
2000
3000
4000
5000
6000
0
1000
2000
3000
4000
5000
6000
Fair
Premium
0 5000 10000 15000
Good
Ideal
0 5000 10000 15000
Very Good
0 5000 10000 15000
qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)
What makes it difficult to compare the distributions?
Brainstorm for 1 minute.
Tuesday, 31 August 2010
Problems
Each histogram far away from the others, but we know stacking is hard to read → use another way of displaying densities
Varying relative abundance makes comparisons difficult → rescale to ensure constant area
Tuesday, 31 August 2010
# Large distances make comparisons hardqplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)
# Stacked heights hard to compareqplot(price, data = diamonds, binwidth = 500, fill = cut)
# Much better - but still have differing relative abundanceqplot(price, data = diamonds, binwidth = 500, geom = "freqpoly", colour = cut)
# Instead of displaying count on y-axis, display density# .. indicates that variable isn't in original dataqplot(price, ..density.., data = diamonds, binwidth = 500, geom = "freqpoly", colour = cut)
# To use with histogram, you need to be explicitqplot(price, ..density.., data = diamonds, binwidth = 500, geom = "histogram") + facet_wrap(~ cut)
Tuesday, 31 August 2010
Use this technique to explore the relationship between price and clarity, and carat and clarity.
Your turn
Tuesday, 31 August 2010
2d extensions
Tuesday, 31 August 2010
Idea ggplot
Small points shape = I(".")
Transparency alpha = I(1/50)
Jittering geom = "jitter"
Smooth curve geom = "smooth"
2d bins geom = "bin2d" or geom = "hex"
Density contours geom = "density2d"
Tuesday, 31 August 2010
# There are two ways to add additional geoms# 1) A vector of geom names:qplot(price, carat, data = diamonds, geom = c("point", "smooth"))
# 2) Add on extra geomsqplot(price, carat, data = diamonds) + geom_smooth()
# This how you get help about a specific geom:?geom_smooth# or go to http://had.co.nz/ggplot2/geom_smooth.html
Tuesday, 31 August 2010
# To set aesthetics to a particular value, you need# to wrap that value in I()
qplot(price, carat, data = diamonds, colour = "blue")qplot(price, carat, data = diamonds, colour = I("blue"))
# Practical application: varying alphaqplot(price, carat, data = diamonds, alpha = I(1/10))qplot(price, carat, data = diamonds, alpha = I(1/50))qplot(price, carat, data = diamonds, alpha = I(1/100))qplot(price, carat, data = diamonds, alpha = I(1/250))
Tuesday, 31 August 2010
Your turn
Explore the relationship between carat, price and clarity, using these techniques.
Which did you find most useful?
Tuesday, 31 August 2010
Subsetting
Tuesday, 31 August 2010
Motivation
Look at histograms and scatterplots of x, y, z from the diamonds dataset
Which values are clearly incorrect? Which values might we be able to correct? (Remember measurements are in millimetres, 1 inch = 25 mm)
Tuesday, 31 August 2010
Plots
qplot(x, data = diamonds, binwidth = 0.1)qplot(y, data = diamonds, binwidth = 0.1)qplot(z, data = diamonds, binwidth = 0.1)qplot(x, y, data = diamonds)qplot(x, z, data = diamonds)qplot(y, z, data = diamonds)
Tuesday, 31 August 2010
Modifying data
To modify, must first know how to extract, or subset. Many different methods available in R. We’ll start with most explicit then learn some shortcuts next time.
Basic structure: df$varnamedf[row index, column index]
Tuesday, 31 August 2010
$
Remember str(diamonds) ?
That hints at how to extract individual variables:
diamonds$carat
diamonds$price
Tuesday, 31 August 2010
logical
blank
integer
character
+ve: include-ve: exclude
lookup by name
include TRUEs
include all
Tuesday, 31 August 2010
Integer subsetting
Tuesday, 31 August 2010
# Nothingstr(diamonds[, ])
# Positive integers & nothingdiamonds[1:6, ] # same as head(diamonds)diamonds[, 1:4] # watch out!
# Two positive integers in rows & columnsdiamonds[1:10, 1:4]
# Repeating input repeats outputdiamonds[c(1,1,1,2,2), 1:4]
# Negative integers drop valuesdiamonds[-(1:53900), -1]
Tuesday, 31 August 2010
# Useful technique: Order by one or more columnsdiamonds <- diamonds[order(diamonds$price), ]
# Useful technique: Combine two tablescarats <- data.frame(table(carat = diamonds$carat))mtch <- match(diamonds$carat, carats$carat)diamonds$carat_count <- carats$Freq[mtch]
Tuesday, 31 August 2010
Logical subsetting
Tuesday, 31 August 2010
# The most complicated to understand, but # the most powerful. Lets you extract a # subset defined by some characteristic of # the datax_big <- diamonds$x > 10
head(x_big)sum(x_big)mean(x_big)table(x_big)
diamonds$x[x_big]diamonds[x_big, ]
Tuesday, 31 August 2010
small <- diamonds[diamonds$carat < 1, ]lowqual <- diamonds[diamonds$clarity %in% c("I1", "SI2", "SI1"), ]
# Comparison functions:# < > <= >= != == %in%
# Boolean operators: & | !small <- diamonds$carat < 1 & diamonds$price > 500lowqual <- diamonds$colour == "D" | diamonds$cut == "Fair"
a
b
a | b
a & b
a & !b
xor(a, b)
Tuesday, 31 August 2010
Useful functions for
logical vectors
table(zeros)sum(zeros)mean(zeros)
TRUE = 1; FALSE = 0
Tuesday, 31 August 2010
Select the diamonds that have:
Equal x and y dimensions.
Depth between 55 and 70.
Carat smaller than the mean.
Cost more than $10,000 per carat.
Are of good quality or better.
Your turn
Tuesday, 31 August 2010
Saving results
# Prints to screen
diamonds[diamonds$x > 10, ]
# Saves to new data frame
big <- diamonds[diamonds$x > 10, ]
# Overwrites existing data frame. Dangerous!
diamonds <- diamonds[diamonds$x < 10,]
Tuesday, 31 August 2010
diamonds <- diamonds[1, 1]diamonds
# Uh oh!
rm(diamonds)str(diamonds)
# Phew!
Tuesday, 31 August 2010
Your turn
Create a logical vector that selects diamonds with equal x & y. Create a new dataset that only contains these values.
Create a logical vector that selects diamonds with incorrect/unusual x, y, or z values. Create a new dataset that omits these values. (Hint: do this one variable at a time)
Tuesday, 31 August 2010
equal_dim <- diamonds$x == diamonds$yequal <- diamonds[equal_dim, ]
y_big <- diamonds$y > 10z_big <- diamonds$z > 6
x_zero <- diamonds$x == 0 y_zero <- diamonds$y == 0z_zero <- diamonds$z == 0zeros <- x_zero | y_zero | z_zero
bad <- y_big | z_big | zerosgood <- diamonds[!bad, ]
Tuesday, 31 August 2010
Recommended