Hadley Wickham Stat405 ddply case study (2) Thursday, 7 October 2010


Page 2: 14 case-study

1. Recap
   1. Focus on smaller subset
2. More ddply
   1. Develop summary statistic
   2. Classify names
   3. Apply to full data


Page 3: 14 case-study

Questions

For names that are used for both boys and girls, how has usage changed?

Can we use names that clearly have the incorrect sex to estimate error rates over time?


Page 4: 14 case-study

Getting started

options(stringsAsFactors = FALSE)
library(plyr)
library(ggplot2)

both <- read.csv("both.csv")


Page 5: 14 case-study

Interesting subset

both_sum <- ddply(both, "name", summarise,
  years = length(name),
  avg_usage = mean(boy + girl) / 2)

both_sum <- subset(both_sum, years > 1)
qplot(years, avg_usage, data = both_sum)

selected_names <- subset(both_sum, years > 50 & avg_usage > 0.0005)$name
selected <- subset(both, name %in% selected_names)
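The per-name summary on this slide can be mimicked on a tiny data frame in base R, which may help when checking what `ddply(..., summarise, ...)` returns. This is only a sketch: the `toy` data is made up, and `split()` plus `lapply()` is a stand-in for plyr, not what the slides use.

```r
# Toy data (made up for illustration): one row per name and year
toy <- data.frame(
  name = c("Lee", "Lee", "Lee", "Ora"),
  year = c(1900, 1901, 1902, 1900),
  boy  = c(0.004, 0.003, 0.002, 0.0001),
  girl = c(0.001, 0.001, 0.002, 0.0010)
)

# Base-R stand-in for ddply(toy, "name", summarise, ...):
# split by name, summarise each piece, bind the pieces back together
pieces <- split(toy, toy$name)
toy_sum <- do.call(rbind, lapply(pieces, function(df) {
  data.frame(
    name = df$name[1],
    years = nrow(df),
    avg_usage = mean(df$boy + df$girl) / 2
  )
}))

# Keep names seen in more than one year, as on the slide
toy_sum <- subset(toy_sum, years > 1)
print(toy_sum)
```

Only "Lee" survives the filter here, because "Ora" appears in a single year of the toy data.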


Page 6: 14 case-study

Patterns

selected$lratio <- with(selected, log10(boy / girl))
qplot(lratio, name, data = selected)
qplot(lratio, reorder(name, lratio), data = selected)
qplot(abs(lratio), reorder(name, lratio), data = selected)
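To read the `lratio` scale, it may help to try a few hypothetical values. The numbers below are invented counts rather than the proportions in `both.csv`; they are chosen only to show that the log ratio is symmetric around zero.

```r
# Hypothetical usage figures, chosen only to show the scale
boy  <- c(100, 10, 1)
girl <- c(1, 10, 100)

lratio <- log10(boy / girl)
lratio       # 2, 0, -2: 100 to 1 boys, even, 100 to 1 girls
abs(lratio)  # 2, 0, 2: strength of the imbalance, ignoring direction
```

On this scale a value of 2 means one sex outnumbers the other by a factor of 100, which is why `abs(lratio)` is a natural measure of how one-sided a name is.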


Page 7: 14 case-study

[Figure: abs(lratio) on the x-axis (0.5 to 2.5) against reorder(name, lratio) on the y-axis, one row of points per selected name.]

Page 8: 14 case-study

[Figure: abs(lratio) on the x-axis (0.5 to 2.5) against reorder(name, lratio) on the y-axis, repeated from the previous slide.]

What characteristics separate sex-errors from dual-sex names?


Page 9: 14 case-study

Your turn

Compute the mean and range of lratio for each name.

Plot and come up with cutoffs that you think separate the two groups.


Page 10: 14 case-study

rng <- ddply(selected, "name", summarise,
  diff = diff(range(lratio, na.rm = TRUE)),
  mean = mean(lratio, na.rm = TRUE))

qplot(diff, abs(mean), data = rng)
qplot(diff, abs(mean), data = rng, geom = "text", label = name)

rng$dual <- abs(rng$mean) < 2
arrange(rng, mean, dual)

selected <- join(selected, rng[c("name", "dual")])
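The classify-then-join step can be traced on a small example. The sketch below uses made-up log ratios, with base R's `tapply()` and `merge()` standing in for plyr's `ddply()` and `join()`; it is an illustration of the idea, not the code from the lecture.

```r
# Made-up log ratios for two names over three years
toy <- data.frame(
  name   = rep(c("Lee", "Mary"), each = 3),
  year   = rep(1900:1902, times = 2),
  lratio = c(0.9, 0.7, 0.5, -2.6, -2.4, -2.5)
)

# Per-name spread and mean of the log ratio (tapply stands in for ddply)
rng <- data.frame(
  name = sort(unique(toy$name)),
  diff = as.vector(tapply(toy$lratio, toy$name, function(x) diff(range(x)))),
  mean = as.vector(tapply(toy$lratio, toy$name, mean))
)

# Same cutoff as the slide: an average imbalance below 100 to 1 counts as dual
rng$dual <- abs(rng$mean) < 2

# merge() stands in for plyr's join(): copy the flag onto every yearly row
toy <- merge(toy, rng[c("name", "dual")], by = "name")
print(toy)
```

After the merge every row carries its name's `dual` flag, so later plots can subset or facet on it directly.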


Page 11: 14 case-study

[Figure: text-label scatterplot of diff on the x-axis (0.5 to 2.5) against abs(mean) on the y-axis (0.5 to 2.0), one label per selected name.]


Page 12: 14 case-study

[Figure: text-label scatterplot of diff against abs(mean), repeated from the previous slide.]

Why does this pattern give us confidence that those dual-sex names are errors?


Page 13: 14 case-study

qplot(year, lratio, data = selected, geom = "line", group = name) + facet_wrap(~ dual)

qplot(year, lratio, data = subset(selected, dual), geom = "line") + facet_wrap(~ name)

qplot(year, boy / (boy + girl), data = subset(selected, dual), geom = "line") + facet_wrap(~ name)


Page 14: 14 case-study

Your turn

Apply this threshold to all names, not just the few we focussed in on. Does it still seem like a good classification?

What can you say about trends in errors over time?


Page 15: 14 case-study

both$lratio <- with(both, log10(boy / girl))
rng <- ddply(both, "name", summarise,
  diff = diff(range(lratio, na.rm = TRUE)),
  mean = mean(lratio, na.rm = TRUE))
rng$dual <- abs(rng$mean) < 2
arrange(rng, mean, dual)
both <- join(both, rng[c("name", "dual")])

qplot(year, lratio, data = subset(both, !dual))
qplot(year, abs(lratio), data = subset(both, !dual), colour = factor(boy > girl)) + geom_smooth(size = 3)


Page 16: 14 case-study

Math on the computer


Page 17: 14 case-study

Your turn

Perform the following calculations in R. Are the answers what you expect?

seq(0.1, 0.9, by = 0.1) - 1:9 / 10

sqrt(2)^2 - 2

What is the property of these numbers that might cause the problem?


Page 18: 14 case-study

# Each number must be stored in a finite amount of space

# => each number can only have a finite number of digits

# => floating point math does not work like normal math

(1e-16 + 1) == 1

(1e-16 + 1) * 10 == 1e-16 * 10 + 1 * 10

1e9 + 2 - 0.1 - 1e9

1e10 + 2 - 0.1 - 1e10

1e11 + 2 - 0.1 - 1e11

1e12 + 2 - 0.1 - 1e12

1e13 + 2 - 0.1 - 1e13

1e14 + 2 - 0.1 - 1e14
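One way to see why the expressions above misbehave is to look at the gap between adjacent doubles, which R exposes as `.Machine$double.eps`. The sketch below is an illustration added here, not part of the original slides.

```r
eps <- .Machine$double.eps   # gap between 1 and the next larger double
eps                          # about 2.2e-16

(1 + eps) == 1       # FALSE: eps is just large enough to register
(1 + eps / 2) == 1   # TRUE: the sum is rounded back to 1

# The gap grows with magnitude, so near 1e14 the 0.1 is stored only coarsely
big <- 1e14 + 2 - 0.1 - 1e14
big          # 1.90625 rather than 1.9
big == 1.9   # FALSE
```

Near 1e14 adjacent doubles are 1/64 apart, so the intermediate result is rounded to the nearest multiple of 1/64 before the large term is subtracted again.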


Page 19: 14 case-study

a⋅(b + c) = a⋅b + a⋅c

a + (b + c) = (a + b) + c

a + b - b = a
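Each of these identities holds for real numbers but can fail for doubles. The values below are one possible set of counterexamples, picked for illustration.

```r
# Associativity: the grouping changes where rounding happens
assoc <- ((0.1 + 0.2) + 0.3) == (0.1 + (0.2 + 0.3))
assoc     # FALSE

# a + b - b = a: the small term is lost as soon as it is added to 1
a <- 1e-16
b <- 1
cancel <- (a + b - b) == a
cancel    # FALSE, the left-hand side evaluates to 0

# Distributivity, using the expression from the earlier slide
distrib <- (1e-16 + 1) * 10 == 1e-16 * 10 + 1 * 10
distrib   # FALSE
```

In all three cases the two sides differ only in the order of operations, which is enough to change which intermediate results get rounded.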


Page 20: 14 case-study

# By default R only shows 7 significant digits

# If the trailing digits are zero, the number will be rounded

(1 / 237)

(1 / 237) * 237

(1 / 237) * 237 - 1

seq(0.1, 0.9, by = 0.1)

seq(0.1, 0.9, by = 0.1) - 1:9 / 10

# Tricky to get to print exactly:

formatC((1 / 237) * 237, digits = 20)

formatC(seq(0.1, 0.9, by = 0.1), digits = 20)
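As an alternative to `formatC()`, base R's `print(digits = )` and `sprintf()` can also reveal the stored value. Seventeen significant digits is a common choice because it is enough to distinguish any two different doubles; the snippet is a sketch added for illustration.

```r
x <- 0.1
print(x)               # default display, 7 significant digits
print(x, digits = 17)  # enough digits to tell neighbouring doubles apart
s <- sprintf("%.17g", x)
s                      # "0.10000000000000001"

# The stored values differ even though the default print looks identical
a <- seq(0.1, 0.9, by = 0.1)
b <- 1:9 / 10
mismatch <- which(a != b)
mismatch               # positions where the two vectors are not identical
```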


Page 21: 14 case-study

# When working with floating point numbers (numeric)

# (but not integers, this is the one place where the

# difference is important) never test for equality with ==

a <- seq(0.1, 0.9, by = 0.1)

b <- 1:9 / 10

all(a == b)

all.equal(a, b)

all(abs(a - b) < 1e-6)

# Similarly, need to be careful with < and > etc
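When the comparison feeds an `if()`, note that `all.equal()` returns either `TRUE` or a character description of the difference, so it is usually wrapped in `isTRUE()`. The helper `nearly_equal()` below is hypothetical, written here only to show an explicit tolerance; it is not part of base R.

```r
a <- seq(0.1, 0.9, by = 0.1)
b <- 1:9 / 10

exact <- all(a == b)
exact                      # FALSE

# all.equal() compares with a small relative tolerance
close <- isTRUE(all.equal(a, b))
close                      # TRUE

# nearly_equal() is a hypothetical helper with an explicit absolute tolerance
nearly_equal <- function(x, y, tol = 1e-8) abs(x - y) < tol
all(nearly_equal(a, b))    # TRUE
```

The tolerance is a judgment call: it should be much larger than the rounding error but much smaller than any difference that matters for the analysis.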


Page 22: 14 case-study

# Places where this matters:
#
# * sums
# * calculating the standard deviation
# * inverting a matrix (condition)
# * linear models!
# * maximum likelihood estimation
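The standard deviation item can be illustrated with a small experiment. The data below are made up: a small spread sitting on a large offset, which is the situation where the one-pass textbook variance formula loses its significant digits. This is a sketch of the general issue, not an example from the lecture.

```r
# Made-up data: small spread sitting on a very large offset
x <- 1e9 + c(4, 7, 13, 16)
n <- length(x)

# One-pass "textbook" formula: subtracts two nearly equal huge numbers
naive_var <- (sum(x^2) - n * mean(x)^2) / (n - 1)

# Two-pass formula: centre first, then square the small deviations
two_pass_var <- sum((x - mean(x))^2) / (n - 1)

naive_var      # not 30: the digits that mattered were rounded away
two_pass_var   # 30
var(x)         # 30
```

The squares are around 1e18, where adjacent doubles are 128 apart, so the true difference of 90 cannot survive the subtraction in the one-pass version.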


Page 23: 14 case-study

Homework

Reflection on project teams.

Plyr drills.

Plyr practice with basketball data (what we’ll be using for the next project).
