View
214
Download
1
Category
Preview:
Citation preview
04/19/23 330 Lecture 2 1
STATS 330: Lecture 2
04/19/23 330 Lecture 2 2
Housekeeping matters
• STATS 762– An extra test (details to be provided). An extra
assignment• Class rep – ????
• Office hours– Alan: 10:30 – 12:00 Tuesday and Thursday in Rm
265, Building 303S– Tutors: TBA
• Assignment 1: Due 2 August
04/19/23 330 Lecture 2 3
Today’s lecture: Exploratory graphics
Aim of the lecture:
To give you a quick overview of the kinds
of graphs that can be helpful in exploring data. Some of this material has been covered in 201/8. (We will discuss the R code used to make the plots in Lecture 4)
04/19/23 330 Lecture 2 4
Exploratory Graphics: topics
• Exploratory Graphics for a single variable: aim is to show distribution of values– Histograms– Kernel density estimators– Qq plots
• For 2 variables: aim is to show relationships– Both continuous: scatter plots– One of each: side by side boxplots– Both categorical: mosaic plots – see Ch 5
• 3 variables– Pairs plot– Rotating plot– coplots– 3D plots, contour plots
04/19/23 330 Lecture 2 5
Single variable:Exchange rate data
• The data of interest: daily changes in log(exchange rate) for the US$/Kiwi.– Monthly date from June 1986 to may 2012– Source: Reserve Bank
• Questions:– What is the distribution of the daily changes
in the logged exchange rate?– Is it normal? If not, how is it different?
04/19/23 330 Lecture 2 6
yt = exchange rate at time t
Difference in logs is log(yt) – log(yt-1) = log(yt/yt-1)
04/19/23 330 Lecture 2 7
Data AnalysisSuppose we have the data (3374 differences in
the logs), in an R vector, Diff.in.Logs
hist(Diff.in.Logs,nclass=100,freq=FALSE)# add density estimatelines(density(Diff.in.Logs),col="blue", lwd=2)# add fitted normal densityxvec = seq(-0.1, 0.1, length=100)lines(xvec,dnorm(xvec,mean=mean(Diff.in.Logs), sd=sd(Diff.in.Logs)),col="red", lwd=2)
04/19/23 330 Lecture 2 8
04/19/23 330 Lecture 2 9
04/19/23 330 Lecture 2 10
Normal plot> qqnorm(Diff.in.Logs)
Normal data?
No –
QQ plot indicates that the differences have longer tails than normal, since the plotted points are lower than the line for small values and higher for big ones
04/19/23 330 Lecture 2 11
Two Variables: Rats!Of interest: growth rates of 16 rats
i.e. relationship between weight and time
• Want to explore the relationship graphically.
• Each rat was measured (roughly) every week for 11 weeks
• For weeks 1-6, all rats were on a fixed diet. Diet was changed after week 6.
04/19/23 330 Lecture 2 12
Two Variables: Rats!Data set rats.df has variables
– rat (1-16)– growth (weight in grams)– day (day since start of study, 11 values, at
approximately weekly intervals– group (litter, one of 3)– change (has values 1 or 2 - diet was changed
after 6 weeks, diet 1 for weeks 1-6, diet 2 for weeks 7-11
04/19/23 330 Lecture 2 13
Rats: the data> rats.df growth group rat change day1 240 1 1 1 12 250 1 1 1 83 255 1 1 1 154 260 1 1 1 225 262 1 1 1 296 258 1 1 1 367 266 1 1 2 438 266 1 1 2 449 265 1 1 2 5010 272 1 1 2 5711 278 1 1 2 6412 225 1 2 1 112 230 1 2 1 8
... More data
04/19/23 330 Lecture 2 14
Rats (cont)
• Could plot weight (i.e. the variable growth) versus the variable day:
plot(day,growth)
BUT….
04/19/23 330 Lecture 2 15
0 10 20 30 40 50 60
30
04
00
50
06
00
day
gro
wth
04/19/23 330 Lecture 2 16
Criticisms• Can’t tell which points belong to which rat
• Seem to be 2 groups of points
• In actual fact, the rats came from 3 different litters, is this relevant?
• Could do better
04/19/23 330 Lecture 2 17
More rats: improvements
• Join points representing the same rat with a line
• Use different colours (or different line types e.g. dashed or dotted) for the different litters
• Use a legend
04/19/23 330 Lecture 2 18
0 10 20 30 40 50 60
30
04
00
50
06
00
Growth rate of rats
day
gro
wth
Litter 1 Litter 2 Litter 3
04/19/23 330 Lecture 2 19
More improvements• Plot is too cluttered
• Could plot each rat on a different graph – important to use same scales (axes) for each graph
• This leads to the idea of “Trellis graphics”
04/19/23 330 Lecture 2 20
day
grow
th
200
300
400
500
600
0 20 40 60
rat rat
0 20 40 60
rat rat
rat rat rat
200
300
400
500
600
rat200
300
400
500
600
rat rat rat rat
rat rat
0 20 40 60
rat
200
300
400
500
600
rat
0 20 40 60
04/19/23 330 Lecture 2 21
day
grow
th
200
300
400
500
600
0 20 40 60
rat.within.groupgroup
rat.within.groupgroup
0 2040 60
rat.within.groupgroup
rat.within.groupgroup
0 20 40 60
rat.within.groupgroup
rat.within.groupgroup
0 20 40 60
rat.within.groupgroup
rat.within.groupgroup
rat.within.groupgroup
rat.within.groupgroup
rat.within.groupgroup
rat.within.groupgroup
rat.within.groupgroup
rat.within.groupgroup
rat.within.groupgroup
200
300
400
500
600
rat.within.groupgroup
200
300
400
500
600
rat.within.groupgroup
rat.within.groupgroup
0 20 4060
rat.within.groupgroup
rat.within.groupgroup
0 2040 60
rat.within.groupgroup
rat.within.groupgroup
0 20 40 60
rat.within.groupgroup
rat.within.groupgroup
0 20 40 60
04/19/23 330 Lecture 2 22
Two variables: one continuous, one
categorical• Insurance data: data on 14,000 insurance
claims. Want to explore relationship between the amount of the claim (a continuous variable) and the type of car
(a categorical variable).
• Use side-by side boxplots.
04/19/23 330 Lecture 2 23
11 22 33 44 55 66 77 88 99 1111 1313 1515 1717
46
8
CARGROUP
Log(
AD
INC
UR
)Car Group
Loesssmooth
04/19/23 330 Lecture 2 24
More than 2 variables:
• If all variables are continuous, we can explore the relationships between them using a pairs plot
• If we have 3 variables, a rotating plot is a very useful tool
04/19/23 330 Lecture 2 25
Example: Cherry trees
> cherry.df diameter height volume1 8.3 70 10.32 8.6 65 10.33 8.8 63 10.24 10.5 72 16.45 10.7 81 18.86 10.8 83 19.77 11.0 66 15.68 11.0 75 18.29 11.1 80 22.610 11.2 75 19.9
... more data – 31 trees in all
04/19/23 330 Lecture 2 26
Cherry trees: pairs plots
> pairs(cherry.df)
diameter
65 70 75 80 85
810
1214
1618
20
6570
7580
85
height
8 10 12 14 16 18 20 10 20 30 40 50 60 70
1020
3040
5060
70
volume
3-d Rotating plots• The challenge: to represent a 3-
dimensional object on a 2-dimensional surface (a TV screen, computer screen etc)
• Traditional method uses projection, perspective
• A powerful idea is to use motion, looking at the 3-d scene from different angles
04/19/23 330 Lecture 2 27
04/19/23 330 Lecture 2 28
Perspective
04/19/23 330 Lecture 2 29
Diameter -volume view
Diameter -height view
Volume -height view
Arbitrary view
Projection
04/19/23 330 Lecture 2 30
Cherry trees: rotating plot
04/19/23 330 Lecture 2 31
Dynamic motion• By dynamically changing the angle of
view, we get a better impression of the 3-dimensional structure of the data
• “Dynamic graphics” is a very powerful tool
04/19/23 330 Lecture 2 32
A powerful idea: Coplots
• Coplot shows relationship between x and y for selected values of z (usually a narrow range of z’s)
• By showing separate plots for different z ranges, we can see how the relationship between x and y changes as z changes
• Coplot: conditioning plot, shows relationship between x and y conditional on z (ie for fixed z)
04/19/23 330 Lecture 2 33
Cherry trees: coplots
• To show the relationship between height and volume for different values of diameter:
• Divide the range of diameter (8.3 to 20.6) up into 6 subranges 8-11, 10.5 -11.5 etc
• Draw 6 plots, the first using all data whose diameter is between 8 and 11, the second using all data whose diameter is between 10.5 and 11.5, and so on
04/19/23 330 Lecture 2 34
1020
3040
5060
70
65 70 75 80 85
65 70 75 80 85 65 70 75 80 85
1020
3040
5060
70
height
volu
me
10 12 14 16 18
Given : diameter
04/19/23 330 Lecture 2 35
Interpretation• Note that the lines are not of the same slope• This implies that the point configuration is not
“planar”
rowcolumn
z
-60
-40
-20
0
20
40
Other 3-d graphs
04/19/23 330 Lecture 2 36
plot of surface3-d scatter plot
Both can be rotated
Recommended