Hadley Wickham
Stat405: ddply case study (2)
Thursday, 7 October 2010
1. Recap
   1. Focus on smaller subset
2. More ddply
   1. Develop summary statistic
   2. Classify names
   3. Apply to full data
Questions
For names that are used for both boys and girls, how has usage changed?
Can we use names that clearly have the incorrect sex to estimate error rates over time?
Getting started
options(stringsAsFactors = FALSE)
library(plyr)
library(ggplot2)
both <- read.csv("both.csv")
Interesting subset
both_sum <- ddply(both, "name", summarise,
  years = length(name),
  avg_usage = mean(boy + girl) / 2)
both_sum <- subset(both_sum, years > 1)
qplot(years, avg_usage, data = both_sum)
selected_names <- subset(both_sum, years > 50 & avg_usage > 0.0005)$name
selected <- subset(both, name %in% selected_names)
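To see what the ddply() summary step produces, here is a minimal sketch on invented data. The data frame toy and its values are hypothetical, and the column layout (name, year, boy, girl, with boy and girl as proportions) is an assumption about both.csv based on how the slides use it.

```r
library(plyr)

# Hypothetical toy data, assumed to have the same shape as both.csv.
toy <- data.frame(
  name = c("Ann", "Ann", "Lee", "Lee", "Lee"),
  year = c(1900, 1901, 1900, 1901, 1902),
  boy  = c(0.001, 0.002, 0.010, 0.020, 0.030),
  girl = c(0.003, 0.004, 0.010, 0.020, 0.030),
  stringsAsFactors = FALSE
)

# One output row per name: number of years seen, and average usage
# across both sexes.
toy_sum <- ddply(toy, "name", summarise,
  years = length(name),
  avg_usage = mean(boy + girl) / 2)
toy_sum
```

ddply() splits toy by name, evaluates the summarise expressions within each piece, and stacks the results, so toy_sum has one row for Ann and one for Lee.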
Patterns
selected$lratio <- with(selected, log10(boy / girl))
qplot(lratio, name, data = selected)
qplot(lratio, reorder(name, lratio), data = selected)
qplot(abs(lratio), reorder(name, lratio), data = selected)
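A small sketch of how to read the lratio scale, using invented proportions rather than values from the data. On the log10 scale, equal usage gives 0, and each unit is a factor of ten in favour of boys (positive) or girls (negative).

```r
# Hypothetical usage proportions, chosen only to show the scale.
boy  <- c(0.0100, 0.0010, 0.0001)
girl <- c(0.0001, 0.0010, 0.0100)

# Roughly 2, 0 and -2: 100 times more boys, equal, 100 times more girls.
lratio <- log10(boy / girl)
lratio

# abs() folds both directions together, so it measures how lopsided
# the usage is without regard to which sex dominates.
abs(lratio)
```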
[Figure: qplot(abs(lratio), reorder(name, lratio), data = selected). x-axis: abs(lratio), 0.5 to 2.5; y-axis: reorder(name, lratio), one row of points per selected name.]
What characteristics separate sex-errors from dual-sex names?
Your turn
Compute the mean and range of lratio for each name.
Plot and come up with cutoffs that you think separate the two groups.
rng <- ddply(selected, "name", summarise,
  diff = diff(range(lratio, na.rm = TRUE)),
  mean = mean(lratio, na.rm = TRUE))
qplot(diff, abs(mean), data = rng)
qplot(diff, abs(mean), data = rng, geom = "text", label = name)
rng$dual <- abs(rng$mean) < 2
arrange(rng, mean, dual)
selected <- join(selected, rng[c("name", "dual")])
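To illustrate what the two summaries measure, here is a sketch on two invented lratio series. The names "Steady" and "Drifting" and all values are hypothetical, and base R's tapply() stands in for ddply() so the example needs no packages.

```r
# Invented lratio series for two hypothetical names, five years each.
toy <- data.frame(
  name   = rep(c("Steady", "Drifting"), each = 5),
  lratio = c(2.4, 2.3, 2.5, 2.4, 2.4,     # always strongly one sex
             1.0, 0.5, 0.0, -0.5, -1.0),  # moves from one sex to the other
  stringsAsFactors = FALSE
)

# The same two summaries as in rng, computed per name.
spread <- tapply(toy$lratio, toy$name,
                 function(x) diff(range(x, na.rm = TRUE)))
centre <- tapply(toy$lratio, toy$name,
                 function(x) mean(x, na.rm = TRUE))

toy_rng <- data.frame(name = names(spread),
                      diff = as.vector(spread),
                      mean = as.vector(centre))

# Apply the same cutoff as the slide.
toy_rng$dual <- abs(toy_rng$mean) < 2
toy_rng
```

With these made-up numbers, the name that stays far from zero has a small range and a large absolute mean, so it falls outside the dual group; the name that crosses zero has a large range and a mean near zero, so it is classified as dual.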
[Figure: qplot(diff, abs(mean), data = rng, geom = "text", label = name). x-axis: diff, 0.5 to 2.5; y-axis: abs(mean), 0.5 to 2.0; each selected name drawn as a text label.]
Why does this pattern give us confidence that those dual-sex names are errors?
qplot(year, lratio, data = selected, geom = "line", group = name) + facet_wrap(~ dual)
qplot(year, lratio, data = subset(selected, dual), geom = "line") + facet_wrap(~ name)
qplot(year, boy / (boy + girl), data = subset(selected, dual), geom = "line") + facet_wrap(~ name)
Your turn
Apply this threshold to all names, not just the few we focused on. Does it still seem like a good classification?
What can you say about trends in errors over time?
both$lratio <- with(both, log10(boy / girl))
rng <- ddply(both, "name", summarise,
  diff = diff(range(lratio, na.rm = TRUE)),
  mean = mean(lratio, na.rm = TRUE))
rng$dual <- abs(rng$mean) < 2
arrange(rng, mean, dual)
both <- join(both, rng[c("name", "dual")])
qplot(year, lratio, data = subset(both, !dual))
qplot(year, abs(lratio), data = subset(both, !dual), colour = factor(boy > girl)) + geom_smooth(size = 3)
Math on the computer
Your turn
Perform the following calculations in R. Are the answers what you expect?
seq(0.1, 0.9, by = 0.1) - 1:9 / 10
sqrt(2)^2 - 2
What is the property of these numbers that might cause the problem?
# Each number must be stored in a finite amount of space
# => each number can only have a finite number of digits
# => floating point math does not work like normal math
(1e-16 + 1) == 1
(1e-16 + 1) * 10 == 1e-16 * 10 + 1 * 10
1e9 + 2 - 0.1 - 1e9
1e10 + 2 - 0.1 - 1e10
1e11 + 2 - 0.1 - 1e11
1e12 + 2 - 0.1 - 1e12
1e13 + 2 - 0.1 - 1e13
1e14 + 2 - 0.1 - 1e14
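The finite-digits point can be made concrete with R's built-in .Machine$double.eps, the gap between 1 and the next larger double. This is a sketch assuming the usual IEEE 754 double precision that R uses on common platforms; the variable names eps and big are only for illustration.

```r
eps <- .Machine$double.eps   # about 2.2e-16 for IEEE 754 doubles

(1 + eps) == 1       # FALSE: eps is large enough to register
(1 + eps / 2) == 1   # TRUE: the sum rounds back to 1

# The spacing between neighbouring doubles grows with magnitude,
# so adding 1 to a large enough number changes nothing.
big <- 1e16
(big + 1) == big     # TRUE
```

This is why the 1e9 to 1e14 sequence above loses more and more of the 0.1: the larger the leading term, the fewer digits are left for the small one.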
a⋅(b + c) = a⋅b + a⋅c
a + (b + c) = (a + b) + c
a + b - b = a
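These identities hold for real numbers but can fail for floating point numbers. A minimal sketch with two of them, assuming IEEE 754 doubles; the variable names are only for illustration.

```r
# Associativity of addition: grouping changes the rounding.
left  <- (0.1 + 0.2) + 0.3
right <- 0.1 + (0.2 + 0.3)
left == right              # FALSE

# a + b - b = a: the small term is lost when it is added to 1.
a <- 1e-16
b <- 1
(a + b) - b == a           # FALSE: the left side is exactly 0
```

The differences are tiny, but they are enough to make == report FALSE.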
# By default R only shows 7 significant digits
# If the trailing digits are zero, the number will be rounded
(1 / 237)
(1 / 237) * 237
(1 / 237) * 237 - 1
seq(0.1, 0.9, by = 0.1)
seq(0.1, 0.9, by = 0.1) - 1:9 / 10
# Tricky to get to print exactly:
formatC((1 / 237) * 237, digits = 20)
formatC(seq(0.1, 0.9, by = 0.1), digits = 20)
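Besides formatC(), a couple of other base R calls can reveal the hidden digits. This is a sketch of alternatives, not something taken from the slides; 17 significant digits is the number generally needed to distinguish any two doubles.

```r
x <- (1 / 237) * 237

print(x)                 # default: 7 significant digits
print(x, digits = 17)    # ask print() for more digits
sprintf("%.17g", x)      # the same idea with C-style formatting

# 0.1 has no exact binary representation, so the stored value is
# slightly off from one tenth.
sprintf("%.17g", 0.1)
```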
# When working with floating point numbers (numeric)
# (but not integers, this is the one place where the
# difference is important) never test for equality with ==
a <- seq(0.1, 0.9, by = 0.1)
b <- 1:9 / 10
all(a == b)
all.equal(a, b)
all(abs(a - b) < 1e-6)
# Similarly, need to be careful with < and > etc
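One practical detail: all.equal() returns either TRUE or a character description of the difference, so it is usually wrapped in isTRUE() when used as a condition. The helper near() below is hypothetical (it is not from the slides), and its default tolerance of 1e-8 is an arbitrary choice for illustration.

```r
a <- seq(0.1, 0.9, by = 0.1)
b <- 1:9 / 10

# Safe to use inside if(): always a single TRUE or FALSE.
same <- isTRUE(all.equal(a, b))
same

# Hypothetical element-wise helper with an explicit tolerance.
near <- function(x, y, tol = 1e-8) abs(x - y) < tol
all(near(a, b))

# The same idea applies to < and >: allow for the tolerance.
all(a < b + 1e-8)
```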
# Places where this matters:
#
# * sums
# * calculating the standard deviation
# * inverting a matrix (condition)
# * linear models!
# * maximum likelihood estimation
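As one illustration of the standard deviation case, the sketch below compares the one-pass "textbook" variance formula with the two-pass version on invented data that has a small spread around a very large mean. The data and variable names are assumptions for illustration; the true variance of these five values is 2.5.

```r
# Invented data: small spread around a very large mean.
x <- 1e9 + c(1, 2, 3, 4, 5)
n <- length(x)

# One-pass formula: subtracts two huge, nearly equal numbers,
# so most of the significant digits cancel.
naive_var <- (sum(x^2) - sum(x)^2 / n) / (n - 1)

# Two-pass formula: centre first, then square the small deviations.
two_pass_var <- sum((x - mean(x))^2) / (n - 1)

c(naive = naive_var, two_pass = two_pass_var, builtin = var(x))
```

The two-pass result agrees with var(), while the one-pass result can be far from 2.5 because the squared values are too large to keep the digits that matter.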
Homework
Reflection on project teams.
Plyr drills.
Plyr practice with basketball data (what we’ll be using for the next project).