There are three types of lies: lies, damned lies and statistics
Benjamin Disraeli
- British prime minister (Tory).
William Gladstone
- Defeated Disraeli in the general election of 1868.
- President of the Royal Statistical Society 1867-1869.
Another Disraeli quote
. . . That question is this: Is man an ape or an angel? I, my lord, I am on the side of the angels. I repudiate with indignation and abhorrence those new fangled theories.
(Oxford Diocesan Conference 25/11/1864)
A rational approach to uncertainty?
[Figure: two time series, 1850 to 2000. Global temperature: temperature anomaly (C) against year. Atmospheric CO2: CO2 (PPM) against year.]
Absorption spectra
Is abstraction the problem?
Baker & Bellis, 1993, Animal Behaviour
[Figure: scatterplot matrix of count, prop.partner and time.ipc.]
The Baker and Bellis Analysis
[Figure: sequence of plots of count and residuals (rsd) against prop.partner and time.ipc.]
Baker and Bellis Conclusions
- At the end of the process they asked whether the apparent straight line relationships were stronger than could plausibly have arisen by chance.
- On this basis they concluded that there is evidence for count declining with proportion of time spent together.
- Time since last copulation seemed not to play a detectable role.
- But they also collected another dataset . . .
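One hedged way to make "stronger than could plausibly have arisen by chance" concrete is a permutation test on the least squares slope. This is a sketch of that general idea, not necessarily the test Baker and Bellis used. The data values are made up for illustration, and `slope` and `permutation_p_value` are hypothetical helper names.

```python
import random

def slope(x, y):
    """Least squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def permutation_p_value(x, y, n_perm=2000, seed=1):
    """Two-sided p-value: how often does shuffling y give a slope at least as steep?"""
    rng = random.Random(seed)
    observed = abs(slope(x, y))
    y_shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_shuffled)
        if abs(slope(x, y_shuffled)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Made-up illustration data (NOT the Baker and Bellis measurements).
prop_partner = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
count = [480, 450, 470, 390, 360, 330, 300, 310, 220, 200]

p = permutation_p_value(prop_partner, count)
print(f"observed slope = {slope(prop_partner, count):.1f}, permutation p = {p:.4f}")
```

Shuffling breaks any real link between the two variables, so the shuffled slopes show what chance alone produces. A small p-value says the observed line is steeper than chance readily explains; it does not say the straight line is the right model.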
[Figure: scatterplot matrix of count, f.age, f.height, f.weight, m.age, m.height, m.weight and m.vol.]
More conclusions. . .
- Going through the same process as with the first data set leads to the conclusion that only female weight is linearly related to count.
- But a careful look at the residuals shows that this conclusion is completely dependent on a single data point with very low sperm count.
- Re-do the analysis without this datum, and only volume matters.
- Actually it’s the same subjects in both datasets, and we can match up the volumes with the first dataset.
- Repeating the first analysis with volume added leads to the dull conclusion that there is only any evidence for a linear relationship between count and volume.
- This result has limited marketing potential.
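The dependence of a fitted line on a single datum can be checked by refitting with each point left out in turn. The sketch below uses made-up numbers, not the study data, and the helper names `fit_line` and `leave_one_out_slopes` are hypothetical.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b * mx, b

def leave_one_out_slopes(x, y):
    """Slope obtained when each observation in turn is dropped."""
    return [fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])[1] for i in range(len(x))]

# Made-up data: no real trend, except the last point has a very low response.
x = [52.0, 54.0, 56.0, 58.0, 60.0, 62.0, 70.0]
y = [300.0, 310.0, 295.0, 305.0, 298.0, 302.0, 50.0]

full_slope = fit_line(x, y)[1]
loo = leave_one_out_slopes(x, y)
# The point whose removal changes the slope the most.
most_influential = max(range(len(x)), key=lambda i: abs(loo[i] - full_slope))
print(f"slope with all data: {full_slope:.2f}")
print(f"slope without point {most_influential}: {loo[most_influential]:.2f}")
```

Here the steep slope fitted to all the data almost vanishes once the final point is removed, which is the pattern the residual check above is meant to expose.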
But why straight lines anyway?
[Figure: scatterplot matrix of count, prop.partner and time.ipc.]
Smoothing
1. What if the relationship between the residuals and a variable does not look like a straight line?
2. Why not let it be a smooth curve, instead?
[Figure: estimated smooths s(prop.partner,1.07) against prop.partner and s(time.ipc,1.77) against time.ipc.]
How to choose the best fit curve?
- Take a bendy strip of wood.
- Hook it up to the data points with springs.
- The result is a spline.
[Figure: wear against size, with fitted spline.]
Splines are controllable
- Changing the flexibility of the spline changes the curve.
[Figure: four panels of wear against size, showing splines of differing flexibility.]
- Splines can be described mathematically, in a way that is easy to work with.
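One standard way to write this down, offered as a sketch rather than as the exact formulation behind these slides, is as a penalized least squares problem. The symbols are assumptions introduced for illustration: $(x_i, y_i)$ are the data, $f$ is the curve, and $\lambda$ sets the stiffness of the strip.

```latex
\hat{f} = \arg\min_{f} \; \sum_{i=1}^{n} \bigl( y_i - f(x_i) \bigr)^2 + \lambda \int f''(x)^2 \, dx
```

The first term plays the role of the springs, pulling the curve towards the data; the second measures how much the strip is bent. A large $\lambda$ gives a stiff strip and a nearly straight line, while $\lambda$ near zero lets the curve follow every point.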
Smooth surfaces: thin plate splines
- For smooth surfaces there are several options.
- We can replace the bendy strip with a bendy sheet . . .
[Figure: four panels showing the linear predictor as a surface over x and z.]
More smooth surfaces: tensor product splines
- Or we can make a surface from a lattice of bendy strips.
- The strips should usually have different degrees of flexibility in the two directions.
[Figure: surface f(x,z) over x and z.]
Yet more smooth surfaces: soap films
- For smoothing within oddly shaped areas, it can help to replace bendy sheets/strips with a soap film.
- This avoids smoothing across the area boundary.
[Figure: six map panels, latitude (44.0 to 46.5) against longitude (58.0 to 60.5).]
How flexible should the spline be?
- Mathematically, all these ways of describing a surface have the degree of smoothness controlled by just one or two numbers . . .
- . . . which must be chosen. How?
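One common answer is to pick the value that predicts left-out data best, for example by minimizing a generalized cross validation (GCV) score. The sketch below assumes NumPy and uses a simple discrete smoother that penalizes squared second differences. That smoother is a stand-in for a spline, not any particular package's method, and the simulated data, the grid of λ values and the function names are all made up for illustration.

```python
import numpy as np

def smoother_matrix(n, lam):
    """Matrix A with fitted = A @ y, for the penalty lam * sum(second differences^2)."""
    D = np.diff(np.eye(n), n=2, axis=0)          # (n-2) x n second-difference operator
    return np.linalg.inv(np.eye(n) + lam * D.T @ D)

def gcv_score(y, lam):
    """Return (GCV score, edf), with GCV = n * RSS / (n - edf)^2 and edf = trace(A)."""
    n = len(y)
    A = smoother_matrix(n, lam)
    rss = np.sum((y - A @ y) ** 2)
    edf = np.trace(A)
    return n * rss / (n - edf) ** 2, edf

# Simulated data (illustrative only): a smooth curve plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
y = 3.0 * np.sin(2.0 * np.pi * x) + rng.normal(scale=1.0, size=x.size)

lambdas = 10.0 ** np.arange(-3, 7)               # candidate smoothing parameters
scores, edfs = zip(*(gcv_score(y, lam) for lam in lambdas))
best = lambdas[int(np.argmin(scores))]
print(f"GCV chooses lambda = {best:g}")
```

The effective degrees of freedom `edf` fall from nearly one per data point at the smallest λ towards 2 (a straight line) at the largest, which matches the "too low" and "too high" extremes; GCV trades the residual sum of squares against that flexibility.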
[Figure: three panels of y against x, titled "λ too high", "λ about right" and "λ too low".]
Cleaning up a brain scan
[Figure: medFPQ brain image, Y against X.]
- Model log FPQ as a smooth surface, represented using a thin plate spline.
- Springs attaching the plate to the data have strength dependent on the height of the plate.
Smoothed version
[Figure: smoothed brain image (linear predictor), Y against X.]
Is Cairo getting hotter?
[Figure: temperature (F) against time (days).]
- A model . . .
- The temperature varies smoothly with day of year.
- There might be an additional smooth long term trend in temperature.
- The small scale day to day fluctuations are probably correlated between one day and the next.
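Written out, a model of this kind might look as follows. This is a sketch: the symbols $f_1$, $f_2$, $e_i$ and $\phi$ are introduced here for illustration, and the first-order autoregressive form for the day to day correlation is an assumption, since the text only says that neighbouring days are probably correlated.

```latex
\text{temp}_i = f_1(\text{day.of.year}_i) + f_2(\text{time}_i) + e_i,
\qquad
e_i = \phi \, e_{i-1} + \epsilon_i
```

Here $f_1$ is the smooth seasonal cycle, $f_2$ is the smooth long term trend, and the $\epsilon_i$ are independent errors. Asking whether Cairo is getting hotter then amounts to asking whether $f_2$ is increasing.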
Yes it is.
[Figure: estimated smooths s(day.of.year,8.52) against day.of.year and s(time,1.35) against time.]
Predicting octane rating
[Figure: spectrum for a sample with octane = 85.3, log(1/R) against wavelength (nm).]
- How can we predict the octane rating from the spectrum?
Octane prediction model
[Figure: spectrum for a sample with octane = 85.3, log(1/R) against wavelength (nm), with the coefficient curve (red) and the spectrum (blue).]
- Model: octane rating is a constant plus the average value of the red curve multiplied by the spectrum (blue).
- Need to estimate the red curve.
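On a discrete wavelength grid, the prediction rule just described can be sketched in a few lines. The function name `predict_octane` and all the numbers are hypothetical; in practice the red curve is unknown and is estimated as a smooth function.

```python
def predict_octane(alpha, coef_curve, spectrum):
    """Constant plus the average over wavelengths of coefficient curve times spectrum."""
    if len(coef_curve) != len(spectrum):
        raise ValueError("coefficient curve and spectrum must share a wavelength grid")
    products = [f * k for f, k in zip(coef_curve, spectrum)]
    return alpha + sum(products) / len(products)

# Made-up numbers on a 4-point wavelength grid (illustrative only).
alpha = 80.0
coef_curve = [2.0, -1.0, 4.0, 3.0]    # the "red curve", to be estimated in practice
spectrum = [0.5, 1.0, 0.25, 1.0]      # measured log(1/R) values, the "blue" curve
print(predict_octane(alpha, coef_curve, spectrum))  # prints 81.0
```

Wavelengths where the red curve is near zero contribute almost nothing to the prediction, so the estimated curve shows which parts of the spectrum carry information about octane.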
Octane prediction fit
[Figure: estimated function s(nm,7.9):NIR against nm; octane, measured against fitted.]
Diabetic Retinopathy Study
[Figure: scatterplot matrix of ret, bmi, gly and dur.]
- The model is that the probability of retinopathy is related to a sum of smooth curves depending on bmi, gly and dur plus smooth surfaces depending on bmi & gly, gly & dur . . .
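In symbols, such a model might be sketched as below. The logit link is an assumption, since the text says only that the probability is "related to" the sum; the function names $f_1, \dots, f_5$ are introduced for illustration, and the trailing dots stand for the terms elided in the text.

```latex
\operatorname{logit} P(\text{ret}_i = 1) =
  f_1(\text{bmi}_i) + f_2(\text{gly}_i) + f_3(\text{dur}_i)
  + f_4(\text{bmi}_i, \text{gly}_i) + f_5(\text{gly}_i, \text{dur}_i) + \dots
```

Each $f_j$ is a smooth curve or surface whose flexibility is estimated from the data, so a term that is not needed can shrink towards a straight line or towards zero.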
Diabetic Retinopathy Results
[Figure: estimated smooths s(dur,3.26), s(gly,1) and s(bmi,2.67), and estimated surfaces te(dur,gly,0), te(dur,bmi,0) and te(gly,bmi,2.5).]
Diabetic Retinopathy Results II
[Figure: views of the linear predictor as a surface over bmi and gly; red/green are +/− TRUE s.e.]
cran.r-project.org
Picture Credits
- Gladstone and Disraeli are from the House of Commons web site.
- The 1921 Eugenics conference logo is from en.wikipedia.org/wiki/File:Eugenics congress logo.png
- The Gates of Auschwitz are from oncampus.richmond.edu/academics/education/projects/webquests/holocaust/images/arbeit macht frei.jpg
- Hogarth’s South Sea Bubble can be found at www.library.hbs.edu/hc/ssb/images/using-top.jpg, but I’ve lost where I found the one shown.
- The absorption spectrum figure is from www.te-software.co.nz/blog/augie auer.htm
- Reproductions of Picasso’s Les Demoiselles d’Avignon are available from many sites. The one shown is possibly from www.enjoyart.com/library/featured artists/pablopicasso/large/Bmcgaw-P591.jpg
- The cover of Sperm Wars was taken from www.amazon.co.uk.
Data Credits
- The Global CO2 and temperature data are from www.cru.uea.ac.uk/cru/data/temperature/ and the Scripps Institute CO2 research group.
- The Aral Sea CO2 data are from the SeaWiFS satellite.
- For full credits for the Cairo and Brain Scan data, see R package gamair.
- The octane data are available in R package pls.
- The Retinopathy data are available in R package gss.