Hadley Wickham Stat405 Data Monday, 14 September 2009

06 Data


Page 1: 06 Data

Hadley Wickham

Stat405 Data

Monday, 14 September 2009

Page 2: 06 Data

1. Group work

2. Motivating problem

3. Loading & saving data

4. Factors & characters


Page 3: 06 Data

Group project

We want to help your groups become effective teams.

We’ll spend 15 minutes getting you into teams and establishing expectations. See handouts.

The final project includes a weighting for team citizenship.


Page 4: 06 Data

Firing & Quitting

You may fire a non-participating team member, but you need to meet with me and issue a written warning.

If you feel that you are doing all the work in your team, you may quit. You’ll also need to meet with me and give a written warning to the rest of your team.


Page 5: 06 Data

State regulated payoffs: how can we be sure they’re honest?

CC by-nc-nd: http://www.flickr.com/photos/amoleji/2979221622/


Page 6: 06 Data

Where are we going?

In the next few weeks we will be focussing our attention on some slot machine data. We want to figure out if the slot machine is paying out at the rate the manufacturer claims.

To do this, we’ll need to learn more about data formats and how to write functions.


Page 7: 06 Data

Loading data

read.table(): white space separated

read.table(sep="\t"): tab separated

read.csv(): comma separated

read.fwf(): fixed width

load(): R binary format

All take a file argument.
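A minimal sketch of how these readers differ, using tiny made-up files written to a temporary directory. The file names and values are illustrative assumptions, not the course data.

```r
# Illustrative data only: three made-up rows, not the course files.
tmp <- tempdir()
ws_file  <- file.path(tmp, "demo-space.txt")
tab_file <- file.path(tmp, "demo-tab.txt")
csv_file <- file.path(tmp, "demo.csv")
fwf_file <- file.path(tmp, "demo-fwf.txt")

writeLines(c("id score", "1 10", "2 20", "3 30"), ws_file)
writeLines(c("id\tscore", "1\t10", "2\t20", "3\t30"), tab_file)
writeLines(c("id,score", "1,10", "2,20", "3,30"), csv_file)
writeLines(c("110", "220", "330"), fwf_file)  # 1-character id, 2-character score

ws  <- read.table(ws_file, header = TRUE)               # white space separated
tab <- read.table(tab_file, sep = "\t", header = TRUE)  # tab separated
csv <- read.csv(csv_file)                               # comma separated
fwf <- read.fwf(fwf_file, widths = c(1, 2),
                col.names = c("id", "score"))           # fixed width
```

Note that read.table() assumes no header row unless told otherwise, while read.csv() assumes one.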


Page 8: 06 Data

Why csv?

Simple.

Compatible with all statistics software.

Human readable (in 20 years’ time you will still be able to extract data from it).


Page 9: 06 Data

Your turn

Download the baseball and slots csv files from the website. Practice using read.csv() to load them into R.

Guess the name of the function you might use to write the R object back to a csv file on disk. Practice using it.

What happens if you read in a file you wrote with this method?


Page 10: 06 Data

batting <- read.csv("batting.csv")
players <- read.csv("players.csv")
slots <- read.csv("slots.csv")

write.csv(slots, "slots-2.csv")
slots2 <- read.csv("slots-2.csv")
str(slots)
str(slots2)

# Better
write.table(slots, file = "slots-3.csv", sep = ",", row.names = FALSE)
slots3 <- read.csv("slots-3.csv")
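A sketch of what the round trip does, assuming a small made-up data frame in place of the slots data. write.csv() stores the row names as an extra first column, which read.csv() then reads back as a variable called X.

```r
# Illustrative stand-in for the slots data (values are made up).
demo <- data.frame(w1 = c("B", "BB"), prize = c(0, 5))

path1 <- tempfile(fileext = ".csv")
write.csv(demo, path1)                     # row names become a first column
demo2 <- read.csv(path1)
names(demo2)                               # "X" "w1" "prize"

path2 <- tempfile(fileext = ".csv")
write.csv(demo, path2, row.names = FALSE)  # suppress the row names
demo3 <- read.csv(path2)
names(demo3)                               # "w1" "prize"
```

Suppressing the row names, as the "Better" version does, is what keeps the file identical in shape after a write and read.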


Page 11: 06 Data

Working directory

Remember to set your working directory.

From the terminal (Linux or Mac): the working directory is the directory you’re in when you start R.

On Windows: setwd(choose.dir())

On the Mac: ⌘-D
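The working directory can also be checked and changed from code on any platform. A minimal sketch, using a temporary directory and a made-up file name as stand-ins:

```r
old <- getwd()               # where R currently resolves relative paths
setwd(tempdir())             # switch to another directory (a temporary one here)

writeLines(c("a,b", "1,2"), "wd-demo.csv")   # hypothetical file name
found <- file.exists("wd-demo.csv")          # relative path, resolved against getwd()

setwd(old)                   # restore the original working directory
```

Restoring the old directory at the end avoids surprising later code that relies on relative paths.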


Page 12: 06 Data

Saving data

# For long-term storage
write.table(slots, file = "slots-3.csv", sep = ",", row.names = FALSE)

# For short-term caching
save(slots, file = "slots.rdata")


Page 13: 06 Data

.csv                                            .rdata
read.csv()                                      load()
write.table(sep = ",", row.names = FALSE)       save()
Only data frames                                Any R object
Can be read by any program                      Only by R
Long term                                       Short term caching of expensive computations
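A minimal sketch of the .rdata side of this comparison, with made-up objects. It shows that save() can store more than one object of any kind, and that load() restores them under their original names.

```r
# Illustrative objects only (values are made up).
demo <- data.frame(w1 = factor(c("B", "DD"), levels = c("DD", "B")),
                   prize = c(0, 800))
fit <- list(note = "any R object can be saved", n = nrow(demo))

path <- tempfile(fileext = ".rdata")
save(demo, fit, file = path)

rm(demo, fit)
loaded <- load(path)    # returns the names of the restored objects
loaded                  # "demo" "fit"
levels(demo$w1)         # "DD" "B": the level order is preserved
```

A csv file would keep only the values of a data frame, so details such as a custom factor level order would need to be rebuilt after reading.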


Page 14: 06 Data

Cleaning

I cleaned up slots.csv for you to practice with. The original data was slots.txt. Your next task is to perform the cleaning yourself.

This should always be the first step in an analysis: ensure that your data is available as a clean csv file. Do this once, in a file called clean.r.


Page 15: 06 Data

Your turn

Take two minutes to find as many differences as possible between slots.txt and slots.csv.

What did I do to clean up the file?


Page 16: 06 Data

Cleaning

• Convert from space delimited to csv

• Add variable names

• Convert uninformative numbers to informative labels


Page 17: 06 Data

Variable names

names(slots)

names(slots) <- c("w1", "w2", "w3", "prize", "night")

dput(names(slots))

This is a general pattern we’ll see a lot of
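The pattern is presumably the replacement form, f(x) <- value, where the same function both reads and sets a property of an object. A short sketch with made-up data:

```r
# Illustrative data frame with default column names (values are made up).
demo <- data.frame(V1 = 1:2, V2 = c(0, 5))

names(demo)                        # read:  "V1" "V2"
names(demo) <- c("w1", "prize")    # write: replace all names
names(demo)[2] <- "payout"         # write: replace a single name

f <- factor(c("a", "b", "a"))
levels(f) <- c("first", "second")  # the same form works for levels()

dput(names(demo))                  # prints c("w1", "payout"), ready to paste into code
```

dput() is handy here because its output is valid R code, so existing names can be copied into a script and edited.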


Page 18: 06 Data

Factors

• R’s way of storing categorical data

• Have ordered levels() which:

    • Control order on plots and in table()

    • Are preserved across subsets

    • Affect contrasts in linear models


Page 19: 06 Data

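# Creating a factor
x <- sample(5, 20, replace = TRUE)

a <- factor(x)
b <- factor(x, levels = 1:10)
# Give the levels explicitly so the five labels always match,
# even if one of the values 1 to 5 is never sampled
c <- factor(x, levels = 1:5, labels = letters[1:5])

levels(a); levels(b); levels(c)
table(a); table(b); table(c)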


Page 20: 06 Data

# Subsets

b2 <- b[1:5]

levels(b2)

table(b2)

# Remove extra levels

b2[, drop = TRUE]

factor(b2)

# Convert to character

b3 <- as.character(b)

table(b3)

table(b3[1:5])


Page 21: 06 Data

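# as.numeric() on a factor returns the integer codes, not the labels
as.numeric(a)
as.numeric(b)
as.numeric(c)

# Give the levels explicitly so the five labels always match
d <- factor(x, levels = 1:5, labels = 2^(1:5))
as.numeric(d)
as.character(d)

# To recover the numbers stored in the labels, go via character
as.numeric(as.character(d))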
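The same conversions on a fixed input make the difference easier to see. This is a sketch with a made-up vector in place of the random sample:

```r
x_fixed <- c(1, 3, 3, 5)   # illustrative input, chosen so the output is predictable
d_fixed <- factor(x_fixed, levels = 1:5, labels = 2^(1:5))

levels(d_fixed)                       # "2" "4" "8" "16" "32"
codes  <- as.numeric(d_fixed)         # 1 3 3 5: positions in levels()
labs   <- as.character(d_fixed)       # "2" "8" "8" "32"
values <- as.numeric(as.character(d_fixed))         # 2 8 8 32
values2 <- as.numeric(levels(d_fixed))[d_fixed]     # 2 8 8 32, an equivalent form
```

as.numeric() alone returns the position of each value within levels(), which is rarely what is wanted when the labels themselves are numbers.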


Page 22: 06 Data

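Character vs. factor

Characters don’t remember all levels. Tables of characters are always ordered alphabetically.

By default, strings are converted to factors when loading data frames. (True for R versions before 4.0.0; from 4.0.0 on, the default is stringsAsFactors = FALSE.)

Use stringsAsFactors = FALSE to turn this off for one data frame, or options(stringsAsFactors = FALSE) to turn it off globally.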
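A minimal sketch of the difference when loading, using a tiny made-up csv file. The argument is passed explicitly in both calls so the result does not depend on the default in the installed R version.

```r
# Illustrative file only (values are made up).
path <- tempfile(fileext = ".csv")
writeLines(c("w1,prize", "DD,800", "B,0", "B,5"), path)

as_factor <- read.csv(path, stringsAsFactors = TRUE)
as_char   <- read.csv(path, stringsAsFactors = FALSE)

class(as_factor$w1)    # "factor"
levels(as_factor$w1)   # "B" "DD": sorted, not in order of appearance
class(as_char$w1)      # "character"
table(as_char$w1)      # counts, also in sorted order
```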


Page 23: 06 Data

Character vs. factor

Use a factor when there is a well-defined set of all possible values.

Use a character vector when there are potentially infinite possibilities.


Page 24: 06 Data

Quiz

Take one minute to decide which data type is most appropriate for each of the following variables collected in a medical experiment:

Subject id, name, treatment, sex, address, race, eye colour, birth city, birth state.


Page 25: 06 Data

Your turn

Convert w1, w2 and w3 to factors with labels from the table below.

Rearrange levels in terms of value: DD, 7, BBB, BB, B, C, 0

Save as a csv file.

Read in and look at levels. Compare to input with stringsAsFactors = FALSE.

Code  Symbol          Label
0     Blank           0
1     Single Bar      B
2     Double Bar      BB
3     Triple Bar      BBB
5     Double Diamond  DD
6     Cherries        C
7     Seven           7


Page 26: 06 Data

slots <- read.table("slots.txt")
names(slots) <- c("w1", "w2", "w3", "prize", "night")

levels <- c(0, 1, 2, 3, 5, 6, 7)
labels <- c("0", "B", "BB", "BBB", "DD", "C", "7")

slots$w1 <- factor(slots$w1, levels = levels, labels = labels)
slots$w2 <- factor(slots$w2, levels = levels, labels = labels)
slots$w3 <- factor(slots$w3, levels = levels, labels = labels)

write.table(slots, "slots.csv", sep = ",", row.names = FALSE)
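The code above keeps the levels in numeric code order. One possible way to put them in value order (DD, 7, BBB, BB, B, C, 0) is to list the codes in that order when creating the factor. This is a sketch on a made-up vector standing in for slots$w1; the names codes and symbols are illustrative.

```r
# Made-up window codes standing in for slots$w1.
w1 <- c(0, 7, 5, 1, 6, 2, 3, 0)

codes   <- c(5, 7, 3, 2, 1, 6, 0)                     # from DD down to blank
symbols <- c("DD", "7", "BBB", "BB", "B", "C", "0")   # matching labels

w1f <- factor(w1, levels = codes, labels = symbols)
levels(w1f)    # "DD" "7" "BBB" "BB" "B" "C" "0"

# A csv file stores only the labels, so the custom order is lost on reading.
path <- tempfile(fileext = ".csv")
write.table(data.frame(w1 = w1f), path, sep = ",", row.names = FALSE)
back <- read.csv(path, stringsAsFactors = TRUE)
levels(back$w1)   # same labels, but sorted as text
```

Because the order does not survive the csv round trip, it would need to be set again with factor(..., levels = ...) after loading.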
