Hadley Wickham
Stat405
Data
Monday, 14 September 2009
1. Group work
2. Motivating problem
3. Loading & saving data
4. Factors & characters
Group project
I want to help your groups become effective teams.
We’ll spend 15 minutes getting you into teams and establishing expectations. See handouts.
The final project includes a weighting for team citizenship.
Firing & Quitting
You may fire a non-participating team member, but you need to meet with me and issue a written warning.
If you feel that you are doing all the work in your team, you may quit. You’ll also need to meet with me and give a written warning to the rest of your team.
State regulated payoffs: how can we be sure they’re honest?
Image: CC by-nc-nd, http://www.flickr.com/photos/amoleji/2979221622/
Where are we going?
In the next few weeks we will be focussing our attention on some slot machine data. We want to figure out if the slot machine is paying out at the rate the manufacturer claims.
To do this, we’ll need to learn more about data formats and how to write functions.
Loading data
read.table(): white space separated
read.table(sep = "\t"): tab separated
read.csv(): comma separated
read.fwf(): fixed width
load(): R binary format
All take a file argument.
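As a minimal sketch of how the delimited readers relate, the following writes one small data frame in two formats and reads each back. The data frame toy and the temporary file names are made up for illustration and are not part of the course data.

```r
# Hypothetical toy data, not from the course files
toy <- data.frame(w1 = c("B", "BB", "C"), prize = c(0, 5, 2))

tab_file <- tempfile(fileext = ".txt")
csv_file <- tempfile(fileext = ".csv")

# Write the same data tab separated and comma separated
write.table(toy, tab_file, sep = "\t", row.names = FALSE)
write.table(toy, csv_file, sep = ",", row.names = FALSE)

# read.table() needs header = TRUE; read.csv() assumes a header row
from_tab <- read.table(tab_file, sep = "\t", header = TRUE)
from_csv <- read.csv(csv_file)

# Both routes should give the same data frame
identical(from_tab, from_csv)
```

Only the sep and header arguments differ between the two calls, since read.csv() is read.table() with different defaults.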
Why csv?
Simple.
Compatible with all statistics software.
Human readable (in 20 years’ time you will still be able to extract data from it).
Your turn
Download the baseball and slots csv files from the website. Practice using read.csv() to load them into R.
Guess the name of the function you might use to write an R object back to a csv file on disk. Practice using it.
What happens if you read in a file you wrote with this method?
batting <- read.csv("batting.csv")
players <- read.csv("players.csv")
slots <- read.csv("slots.csv")

write.csv(slots, "slots-2.csv")
slots2 <- read.csv("slots-2.csv")
str(slots)
str(slots2)

# Better: leave out the row names
write.table(slots, file = "slots-3.csv", sep = ",", row.names = FALSE)
slots3 <- read.csv("slots-3.csv")
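The difference between the two round trips can be seen without the course files. This sketch uses a made-up data frame, toy, and temporary files (both are assumptions for illustration) to show what the default row names do to the column names on re-reading.

```r
# Hypothetical toy data standing in for slots
toy <- data.frame(w1 = c("B", "BB", "C"), prize = c(0, 5, 2))

with_rows <- tempfile(fileext = ".csv")
without_rows <- tempfile(fileext = ".csv")

write.csv(toy, with_rows)                        # row names become an unnamed first column
write.csv(toy, without_rows, row.names = FALSE)  # no row names column

# read.csv() names the unnamed column "X"
names(read.csv(with_rows))
names(read.csv(without_rows))
```

Writing without row names keeps the file identical in shape to the data frame, so repeated read and write cycles do not accumulate extra columns.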
Working directory
Remember to set your working directory.
From the terminal (Linux or Mac): the working directory is the directory you’re in when you start R.
On Windows: setwd(choose.dir())
On the Mac: ⌘-D
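The working directory can also be inspected and changed from code on any platform. A minimal sketch, using tempdir() as a stand-in for your own project folder (an assumption for illustration):

```r
old <- getwd()      # remember the current working directory
setwd(tempdir())    # tempdir() stands in for your project folder
getwd()             # relative paths such as "slots.csv" now resolve here
list.files()        # files visible from the new working directory
setwd(old)          # restore the original working directory
```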
Saving data
# For long-term storage
write.table(slots, file = "slots-3.csv", sep = ",", row.names = FALSE)
# For short-term caching
save(slots, file = "slots.rdata")
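One detail worth seeing in action is that load() restores objects under the names they were saved with, rather than returning the data. A sketch with a made-up data frame, toy, and a temporary file (both assumptions for illustration):

```r
# Hypothetical toy data standing in for slots
toy <- data.frame(w1 = c("B", "BB", "C"), prize = c(0, 5, 2))

cache <- tempfile(fileext = ".rdata")
save(toy, file = cache)

rm(toy)                 # pretend this is a fresh session
loaded <- load(cache)   # recreates the object and returns its name
loaded
str(toy)
```

Because load() overwrites any existing object with the same name, it is worth checking the returned names before relying on the result.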
.csv                                          | .rdata
read.csv()                                    | load()
write.table(sep = ",", row.names = FALSE)     | save()
Only data frames                              | Any R object
Can be read by any program                    | Only by R
Long term                                     | Short term caching of expensive computations
Cleaning
I cleaned up slots.csv for you to practice with. The original data was slots.txt. Your next task is to perform the cleaning yourself.
This should always be the first step in an analysis: ensure that your data is available as a clean csv file. Do this once, in a file called clean.r.
Your turn
Take two minutes to find as many differences as possible between slots.txt and slots.csv.
What did I do to clean up the file?
Cleaning
• Convert from space delimited to csv
• Add variable names
• Convert uninformative numbers to informative labels
Variable names
names(slots)
names(slots) <- c("w1", "w2", "w3", "prize", "night")
dput(names(slots))
This is a general pattern we’ll see a lot of
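The pattern here appears to be the pairing of a getter, names(x), with a replacement form, names(x) <- value. A sketch under that reading, using a made-up data frame and factor (both assumptions for illustration):

```r
# Hypothetical data frame with R's default column names
toy <- data.frame(V1 = 1:2, V2 = 3:4)

names(toy)                      # get
names(toy) <- c("w1", "prize")  # set
dput(names(toy))                # prints R code that recreates the vector

# The same get/set pairing works for levels()
f <- factor(c("a", "b"))
levels(f)
levels(f) <- c("A", "B")
```

dput() is handy when you want to paste the current names into a script and edit them by hand.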
Factors
• R’s way of storing categorical data
• Have ordered levels() which:
• Control order on plots and in table()
• Are preserved across subsets
• Affect contrasts in linear models
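# Creating a factor
x <- sample(5, 20, replace = TRUE)
a <- factor(x)
b <- factor(x, levels = 1:10)
# Give the levels explicitly so the labels always match, even if a value is never drawn
c <- factor(x, levels = 1:5, labels = letters[1:5])
levels(a); levels(b); levels(c)
table(a); table(b); table(c)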
# Subsets
b2 <- b[1:5]
levels(b2)
table(b2)
# Remove extra levels
b2[, drop = TRUE]
factor(b2)
# Convert to character
b3 <- as.character(b)
table(b3)
table(b3[1:5])
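More recent versions of R also provide droplevels() for removing unused levels. A sketch with fixed values in place of sample(), so the result is reproducible (the values are made up for illustration):

```r
# Fixed values instead of a random sample
b <- factor(c(1, 3, 3, 5), levels = 1:10)
b2 <- b[1:2]

nlevels(b2)               # subsetting keeps all 10 levels
nlevels(droplevels(b2))   # only the levels that actually occur
levels(factor(b2))        # re-creating the factor has the same effect
```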
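as.numeric(a)
as.numeric(b)
as.numeric(c)
# Give the levels explicitly so the five labels always match
d <- factor(x, levels = 1:5, labels = 2^(1:5))
as.numeric(d)                 # integer codes, not the labels
as.character(d)
as.numeric(as.character(d))   # the labels as numbers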
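The point of this slide is that as.numeric() on a factor returns the internal integer codes, not the labels. A reproducible sketch with fixed values in place of sample() (the values are made up for illustration):

```r
x <- c(1, 3, 3, 5)   # fixed values instead of a random sample
d <- factor(x, levels = 1:5, labels = 2^(1:5))

as.numeric(d)                 # codes: 1 3 3 5
as.numeric(as.character(d))   # labels as numbers: 2 8 8 32
as.numeric(levels(d))[d]      # same result, converting each level only once
```

The last form converts the levels rather than every element, which can matter for long vectors.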
Character vs. factor
Characters don’t remember all levels. Tables of characters are always ordered alphabetically.
By default, strings are converted to factors when loading data frames (this default changed to FALSE in R 4.0.0).
Use stringsAsFactors = FALSE to turn this off for one data frame, or options(stringsAsFactors = FALSE) to turn it off globally.
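To see the effect of stringsAsFactors regardless of which default your version of R uses, it can be set explicitly in both directions. A sketch with a small made-up csv file written to a temporary location (file contents are an assumption for illustration):

```r
# Hypothetical csv contents, written to a temporary file
f <- tempfile(fileext = ".csv")
writeLines(c("w1,prize", "B,0", "BB,5", "C,2"), f)

as_factor <- read.csv(f, stringsAsFactors = TRUE)
as_char <- read.csv(f, stringsAsFactors = FALSE)

class(as_factor$w1)    # factor
class(as_char$w1)      # character
levels(as_factor$w1)   # levels are sorted, not kept in file order
```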
Character vs. factor
Use a factor when there is a well-defined set of all possible values.
Use a character vector when there are potentially infinite possibilities.
Quiz
Take one minute to decide which data type is most appropriate for each of the following variables collected in a medical experiment:
Subject id, name, treatment, sex, address, race, eye colour, birth city, birth state.
Your turn
Convert w1, w2 and w3 to factors with labels from the table below.
Rearrange levels in terms of value: DD, 7, BBB, BB, B, C, 0
Save as a csv file.
Read in and look at levels. Compare to input with stringsAsFactors = F.
0 Blank (0)
1 Single Bar (B)
2 Double Bar (BB)
3 Triple Bar (BBB)
5 Double Diamond (DD)
6 Cherries (C)
7 Seven (7)
slots <- read.table("slots.txt")
names(slots) <- c("w1", "w2", "w3", "prize", "night")

levels <- c(0, 1, 2, 3, 5, 6, 7)
labels <- c("0", "B", "BB", "BBB", "DD", "C", "7")

slots$w1 <- factor(slots$w1, levels = levels, labels = labels)
slots$w2 <- factor(slots$w2, levels = levels, labels = labels)
slots$w3 <- factor(slots$w3, levels = levels, labels = labels)

write.table(slots, "slots.csv", sep = ",", row.names = FALSE)
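The code above labels the symbols but leaves the levels in numeric-code order. One way to rearrange them by value, as the exercise asks, is to pass the factor back through factor() with the desired level order. This is a sketch on made-up reel values, not the real slots data, and the names w, codes and by_value are assumptions for illustration.

```r
# Hypothetical reel values, not the real slots data
w <- c(0, 7, 5, 1, 6, 2, 3, 0)

codes <- c(0, 1, 2, 3, 5, 6, 7)
labels <- c("0", "B", "BB", "BBB", "DD", "C", "7")
by_value <- c("DD", "7", "BBB", "BB", "B", "C", "0")

w_f <- factor(w, levels = codes, labels = labels)  # attach informative labels
w_f <- factor(w_f, levels = by_value)              # same data, new level order

levels(w_f)
table(w_f)   # counts now appear in order of value
```

Note that the level order is not stored in a csv file, so it has to be reapplied after reading the data back in.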