8/14/2019 Statistics Notes 1 Data_Plots and Summaries
About this Course
Below is a link to the course website. Please visit and bookmark this site NOW.
faculty.chicagobooth.edu/alan.bester/teaching/
You can also find the course website on Chalk, or Google "business statistics bester".
Everything you need to know is in the lecture
notes. Everything you need for the class is on
the course website.
About These Notes
You will find links to data sets, examples, and other things we talk about throughout the notes.
Due to the name change I've had to change all the links from chicagogsb.edu to chicagobooth.edu. If you find one (in the notes or on the website) that doesn't work, try changing "gsb" to "booth" in the URL.
Yes, there are a lot of slides. I like to restate things and limit the number of concepts per slide. This course is actually about a small number of big ideas that we will develop throughout the quarter.
Notes 1: Data: Plots and Summaries

1. Data
2. Looking at a Single Variable
   2.1 Tables
   2.2 Histograms
   2.3 Dotplots
   2.4 Time Series Plots
3. Summarizing a Single Numeric Variable
   3.1 The Mean and Median
   3.2 The Variance and Standard Deviation
   3.3 The Empirical Rule
   3.4 Percentiles, quartiles, and the IQR
4. Looking at Two Variables
   4.1 Categorical variables: the Two-way table
   4.2 Numeric variables: Scatter Plots
   4.3 Relating Numeric and Categorical variables
5. Summarizing Bivariate Relations
   5.1 In Tables
   5.2 Covariance and Correlation
6. Linearly related variables
   6.1 Linear functions
   6.2 Mean and variance of a linear function
   6.3 Linear combinations
   6.4 Mean and variance of a linear combination
7. Linear Regression
8. Pivot Tables (Optional)
1. Data

Here is some data (our sample):

age  sex  soc  edu  Reg  inc  cola  restE  juice  cigs  antiq  news  ender  friend  simp  foot
67   2    3    1    3    12   1     0      1      0     1      0     0      0       0     0
51   2    3    8    3    10   1     1      0      1     1      0     1      1       0     0
63   2    3    1    2    13   1     1      0      1     1      0     1      0       0     0
45   2    4    3    1    18   1     1      1      0     1      0     0      0       0     0
...
(many more rows!)

The data is from a large survey carried out by a marketing research company in Britain. (Marketing data)

Each row corresponds to a household.
Each column corresponds to a different feature of the household.
The features are called variables.
The rows are called observations.
http://faculty.chicagobooth.edu/alan.bester/teaching/data/bmrbxl.xls
Most data sets come in this form.
A rectangular array.
Rows are observations.
Columns are variables.

Variables are the fundamental object in statistics.
They come in several types.
The variable labeled "age" is simply the age (in years) of the responder.

This is a numeric variable.
This variable has units, and averages are interpretable.

In contrast, the variable "Reg" is the geographical region of the household. Each "number" is really just a code for a region:

1 "Scotland"
2 "North West"
3 "North"
4 "Yorkshire & Humberside"
5 "East Midlands"
6 "East Anglia"
7 "South East"
8 "Greater London"
9 "South West"
10 "Wales"
11 "West Midlands"

A variable like Reg is called categorical.

Think of:
numeric vs. categorical
quantitative vs. qualitative
Instead of using numbers we could have used text strings in the data file. That is, instead of

Reg: 3, 3, 2, 1, ...

we could have

Reg: North, North, North_West, Scotland, ...

But it is extremely common to use numeric codes.

Another example: Which Democratic candidate do you support?
1 = Hillary Clinton, 2 = John Edwards, 3 = Barack Obama, 4 = Bill Richardson
The variable soc is categorical.
It takes on codes 1-6, with meanings:
1 "A"
2 "B"
3 "C1"
4 "C2"
5 "D"
6 "E"
This is an ordered categorical variable.
You can't think of it as a numerical measure but A < B < ... < E. (A is actually the lowest social grade)

Soc is ordered like age, but does not have units.
It does not really make sense to compute the difference or to average two soc measurements.
It does make sense to difference two ages.
That pretty much covers it.
Variables are either numeric, categorical, or ordered categorical.

Of course a numeric variable is always ordered.

For numeric variables we also have:

A variable is discrete if you can list its possible values.
Otherwise it is called continuous.
For example, the amount of rainfall in the City of Chicago this month is usually thought of as being continuous.

As a practical matter, any variable is discrete since we put it in the computer. What it comes down to is, if there are a lot of possible values, we think of it as continuous. (This is not really that important now; it will be later when we get to probability.)

For example, you might think of age as continuous even though we measure it in years and can easily list its possible values.

Number of children is more likely to be thought of as discrete.
Again, a good rule when working with a numeric variable is to keep in mind the units in which it is measured.

For example age has units years.

Percentages, which are numeric, don't have units.

But there are always units somewhere. For example, if we look at the percentage of income a household spends on entertainment, we are looking at one quantity measured in units of currency divided by another.
Here are the definitions of all the variables in the survey data set:

age: age in years
sex: 1 means male, 2 means female
soc: we saw this
edu: education, terminal age of education
1 "14 Or Under"
2 "15"
3 "16"
4 "17"
5 "18"
6 "19"
7 "20"
8 "21 - 23"
9 "24 Or Over"
Reg: we saw this.
inc: income

VARIABLE LABELS V_842 "Total Family Income Before Tax".
VALUE LABELS V_842
1 "1,999 Or Less"
2 "2,000 - 2,999"
3 "3,000 - 3,999"
4 "4,000 - 4,999"
5 "5,000 - 5,999"
6 "6,000 - 6,999"
7 "7,000 - 7,999"
8 "8,000 - 8,999"
9 "9,000 - 9,999"
10 "10,000 - 10,999"
11 "11,000 - 11,999"
12 "12,000 - 14,999"
13 "15,000 - 19,999"
14 "20,000 - 24,999"
15 "25,000 - 29,999"
16 "30,000 - 34,999"
17 "35,000 - 39,999"
18 "40,000 - 49,999"
19 "50,000 Or Over"
20 "Not Stated"

Note:
Both edu and inc could have been numeric, but are broken down into ranges. They are thus ordered categorical.
This is extremely common; with income there are actually good reasons for doing this!
cola, restE, juice, cigs indicate use of a product category.

1 if you use it, 0 if you don't.

This is called a dummy variable.
1 indicates something "happened", 0 if not.

So, cigs=1 means you purchase cigarettes.
restE means "restaurants in the evening".

This is extremely common. Often in statistics we are interested in "does something happen?".

Another example is approval ratings (1=approve).
We will work with a lot of dummy variables this quarter.
The rest of the variables in the marketing data represent tv shows.
They are dummies: 1 if you watch, 0 if you don't.

antiq: antiques roadshow
news: bbc news
enders: east enders
friend: friends
simp: simpsons
foot: "football" (soccer)

A dummy variable can take on two values, 0 or 1.
We use dummy variables to indicate something: 1 if that something happened, 0 if it did not.
Now we can see that there are three types of variables in the data set.

(i) Demographics: age through income
(ii) Product category usage
(iii) Media exposure (tv shows)

What is the point? Why collect this data?

We want to see how product usage relates to demographics. What kind of people drink colas?

We want to see how the media relates to product usage so that we can select the appropriate media to advertise in. If "friends" viewers tend to drink colas, that might be a good place to advertise your cola.
Important Note:

You can always take a numeric variable and make it an ordered categorical variable by using bins.

For example, instead of treating age as a numeric variable it is common to break it into ranges such as:

0-20: a1
21-30: a2
31-40: a3
41-50: a4
51-60: a5
61-70: a6
>70: a7
The simplest case is a dummy variable:

$$d = \begin{cases} 1 & \text{if } x > a \\ 0 & \text{if } x \le a \end{cases}$$

where x is numeric.

For example, you could define someone to be "old" if older than 40 and "young" otherwise.
d=1 then means "old" and d=0 means "young".
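The binning idea and the threshold dummy can be sketched in a few lines of Python. This is only an illustration: the function names `age_bin` and `dummy` are made up here, and the four ages are the first four rows of the survey sample shown earlier.

```python
def age_bin(age):
    """Map a numeric age to one of the ordered labels a1..a7."""
    upper_edges = [20, 30, 40, 50, 60, 70]   # right edges of a1..a6
    for k, edge in enumerate(upper_edges, start=1):
        if age <= edge:
            return f"a{k}"
    return "a7"                               # older than 70


def dummy(x, a):
    """Return 1 if x is strictly greater than the cutoff a, else 0."""
    return 1 if x > a else 0


ages = [67, 51, 63, 45]                       # first four survey rows
bins = [age_bin(x) for x in ages]
old = [dummy(x, 40) for x in ages]            # "old" means older than 40
print(bins)
print(old)
```

Note that binning throws information away: once age is stored as a6, we can no longer tell a 61-year-old from a 70-year-old.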
2. Looking at a Single Variable
The most interesting thing in statistics is understanding how variables relate to each other.

"Friends watchers tend to drink colas".

"Smokers tend to get cancer".

But it is still very important to get a sense of what variables are like on their own.

Note: We'll use the term "distribution" informally to talk about what a variable looks like (what does a typical value look like, how spread out are its values, etc.). We will use the term more formally when we study probability.
2.1 Tables
To look at a categorical variable we use a table:

soc  count
1    28
2    151
3    310
4    235
5    156
6    120

We simply count how many of each category we have.

Note: We have 1000 observations total, so the numbers in this table must add to 1000.
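The course builds this table in Excel (see the link that follows). Purely as a sketch of the same counting idea, here it is in Python with a small made-up list of soc codes standing in for the real 1000-row column:

```python
from collections import Counter

# Hypothetical handful of soc codes (the real column has 1000 entries).
soc = [3, 3, 3, 4, 2, 3, 5, 4, 6, 2]

counts = Counter(soc)                 # category -> number of observations
for code in sorted(counts):
    print(code, counts[code])

# The counts must add up to the number of observations.
total = sum(counts.values())
print(total == len(soc))
```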
How to make this table
http://faculty.chicagobooth.edu/alan.bester/teaching/notes/n1_counttable.htm
2.2 Histograms
We take a numeric variable, break it down into categories and then plot the table as on the previous slide.
Remember, the height of each bar = # of observations or frequency in that category.

[Figure: Histogram for age. Horizontal axis: Category (age bins); vertical axis: frequency, 0 to 120.]

The category label 35-40 means the interval (35,40].
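To make the (35,40] convention concrete, here is a minimal Python sketch that counts observations per bin, with each bin open on the left and closed on the right. The helper `bin_counts`, the bin width of 5, and the last three ages are assumptions made for illustration; only the first four ages come from the survey sample.

```python
import math


def bin_counts(values, width, start):
    """Count values in bins (start, start+width], (start+width, start+2*width], ..."""
    counts = {}
    for v in values:
        # k is the bin index with start + k*width < v <= start + (k+1)*width
        k = math.ceil((v - start) / width) - 1
        left = start + k * width
        label = f"{left}-{left + width}"
        counts[label] = counts.get(label, 0) + 1
    return counts


ages = [67, 51, 63, 45, 40, 35, 36]   # first four from the survey; rest hypothetical
hist = bin_counts(ages, width=5, start=0)
for label, count in sorted(hist.items()):
    print(label, count)
```

An age of exactly 40 lands in 35-40, while an age of exactly 35 lands in 30-35, which is what the half-open interval notation says.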
[Figure: Histogram for InterarrivalTime.]
Here's a histogram of monthly hedge fund returns from 1994 to 2005. Notice anything interesting?

[Figure: Histogram of monthly hedge fund returns.]

Source: Nicolas P. B. Bollen and Veronika K. Pool, "Do Hedge Fund Managers Misreport Returns? Evidence from the Pooled Distributions"; original data from Center for International Securities and Derivatives Markets, University of Massachusetts.
Aside: Histograms can be displayed in different ways

The observations here are starting players in the NFL (on offense). The numbers on the vertical axis correspond to rounds of the NFL draft, while the length of each blue bar is the percentage of starting players drafted at that position (forget the red bars). The plots on the right show only quarterbacks and fullbacks. (Source)

"Aside" or "Optional" on a slide means you are not responsible for the material on that slide on an exam!

Don't worry, all of our histograms will be like the previous two slides.
http://www.footballoutsiders.com/2006/04/24/ramblings/nfl-draft/3828/
2.3 Dotplots
It can be a hassle choosing the bins for a numeric variable.

For discrete variables and/or small data sets, we can just put a dot on the number line for each value.

(Beer data)
nbeerm: the number of beers male MBA students claim they can drink without getting drunk
nbeerf: same for females

Note (1): Unfortunately StatPro doesn't do dotplots. The dotplots in these slides were done in Minitab.
Note (2): The beer data is text, not Excel format. Use "Text to Columns".
http://faculty.chicagobooth.edu/alan.bester/teaching/data/beer.dat
.
: :: :
. . : : : :
. . : . : : :.: : : : . .
+---------+---------+---------+---------+---------+-------
nbeerm
. .. . : : .
+---------+---------+---------+---------+---------+-------
nbeerf
0.0 4.0 8.0 12.0 16.0 20.0
Generally the males claim they can drink more; their numbers are centered or located at larger values.

Note: The dot plot is giving you the same kind of information as the histogram.

We call a point like this an outlier.
2.4 Time Series Plots
The survey data is what we call cross-sectional.
The households in our survey are a (hopefully representative) cross section of all British households at a particular point in time.

In cross-sectional data, order doesn't matter. We can sort our households by age, social grade, etc. and none of our results change as long as we keep each row intact.

Other examples would be samples where every row corresponded to a firm, a plant, a machine...

With a time series, each observation corresponds to a point in time.
Daily data on the Dow Jones index: (Dow data)

Date      Open     High     Low      Close    Volume
1-May-00  10749.4  11001.3  10622.2  10811.8  9663000
2-May-00  10805.6  10932.5  10580.7  10731.1  10115000
3-May-00  10732.2  10754.4  10345.2  10480.1  9916000
4-May-00  10478.9  10631.5  10293.1  10412.5  9258000
...

For time series data, the order of observations matters.
(1-May-00 comes before 2-May-00, etc.)

The easiest way to visualize time series data is often simply to plot the series in time order.
http://faculty.chicagobooth.edu/alan.bester/teaching/data/DJI.xls
[Figure: Time series plot of Close. Horizontal axis: Date, monthly ticks from 5/1/2000 to 4/1/2002; vertical axis: Close, 7800 to 11400.]

Time series plot of the close series.
How to make this plot
http://faculty.chicagobooth.edu/alan.bester/teaching/notes/n1_tsplot.htm
We could have data at various frequencies: daily, monthly, quarterly, annual.

The kinds of patterns you will uncover can be very different depending on the frequency of the data.

A current hot topic of research at Booth is "high frequency data".
Monthly US beer production.

[Figure: Time series plot of b_prod against Index (ticks 10 to 70); vertical axis from 12 to 20.]

Do you see a pattern?

Would we see this pattern if we looked at annual data?
Time series plot of monthly returns on a portfolio of Canadian assets: (Country Portfolio returns)

[Figure: Time series plot of canada against Index (ticks 20 to 100); vertical axis from -0.1 to 0.1.]

On the vertical axis we have returns.
On the horizontal axis we have time.

Do you see a pattern?
http://faculty.chicagobooth.edu/alan.bester/teaching/data/conret.xls
Here is the histogram of the Canadian returns.

[Figure: Two histograms of canada (vertical axis: Frequency, 0 to 30), drawn with different numbers of bins.]

Notes:
(i) The histogram does not depend on the time order.
(ii) The appearance of the histogram depends on the number of bins. Too many bins makes the histogram appear spiky.
Taken from David Greenlaw, Jan Hatzius, Anil Kashyap, and Hyun Shin, US Monetary Policy Forum Report No. 2, 2008
Be careful. What pattern do you see in this series?
How about now?
http://faculty.chicagogsb.edu/anil.kashyap/research/MPFReport-final.pdf
Time series plots are also used to compare patterns across different variables over time, and sometimes to see the impact of past events (be very careful there, too).
From same paper as the previous slide.
3. Summarizing a Single Numeric Variable
We have looked at graphs. Suppose we are now interested in having numerical summaries of the data rather than graphical representations.
Two important features of any numeric variable are:
1) What is a typical or average value?
2) How spread out or variable are the values?
The mean and median capture a typical value.
The variance/standard deviation capture the spread.

For example we saw that the men tend to claim they can drink more.
How can we summarize this?
.
: :
: :
. . : : : :
. . : . : : :.: : : : . .
+---------+---------+---------+---------+---------+-------nbeerm
. .. . : : .
+---------+---------+---------+---------+---------+-------
nbeerf
0.0 4.0 8.0 12.0 16.0 20.0
Monthly returns on Canadian portfolio and Japanese portfolio.

They seem to be centered roughly at the same place but Japan has more spread.

How can we summarize this?
3.1 The Mean and Median

We will need some notation.

Suppose we have n observations on a numeric variable which we call "x":

$$x_1, x_2, x_3, \ldots, x_n$$

Here $x_1$ is the first number and $x_n$ is the last number. n is the number of numbers, or the number of observations. You may also hear it referred to as the sample size.

$x_i$ is the value of x associated with the i-th observation (row).
Here, x is just a name for the set of numbers; we could just as easily use y.
In a real data set we would use a meaningful name like "age".

x
5   (this is $x_1$)
2
8   (this is $x_3$)
6
2

n=5

Sometimes the order of the observations means something.

In our return data the first observation corresponds to the first time period.
In the survey data, the order did not matter.
The sample mean is just the average of the numbers x:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\text{sum of } x}{n}$$

We often use the symbol $\bar{x}$ to denote the mean of the numbers x.

We call it "x bar".
Here is a more compact way to write the same thing.

Consider

$$x_1 + x_2 + \cdots + x_n$$

We use a shorthand for it (it is just notation):

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$$

This is summation notation.
Using summation notation we have:

The sample mean:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
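As a quick check of the notation, here is a small Python sketch that spells out the sum term by term. The function name `sample_mean` is just for illustration, and the data are the five numbers from the earlier n=5 example (read here as 5, 2, 8, 6, 2).

```python
def sample_mean(x):
    """x bar: add up x_1, ..., x_n and divide by n."""
    n = len(x)
    total = 0.0
    for x_i in x:          # one term of the sum for each observation
        total += x_i
    return total / n


x = [5, 2, 8, 6, 2]
x_bar = sample_mean(x)
print(x_bar)               # 23 divided by 5
```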
Graphical interpretation of the sample mean

Here are the dot plots of the beer data for women and men.

Which group claims to be able to drink more?

Character Dotplot
. . . . : : .
+---------+---------+---------+---------+---------+-------nbeerf
.
: :
: :
. . : : : :
. . : . : : : . : : : : .
+---------+---------+---------+---------+---------+-------nbeerm
0.0 2.5 5.0 7.5 10.0 12.5

In some sense, the men claim to drink more.
To summarize this we can compute the average value for each group (men and women).

Note: I deleted the outlier, I do not believe him!
Mean of nbeerf = 4.2222
Mean of nbeerm = 7.8625
Character Dotplot
. . . . : : .
+---------+---------+---------+---------+---------+-------nbeerf
.
: :: :
. . : : : :
. . : . : : : . : : : : .
+---------+---------+---------+---------+---------+-------nbeerm
0.0 2.5 5.0 7.5 10.0 12.5
On average women claim they can drink 4.2 beers. Men claim they can drink 7.9 beers.

In the picture, I think of the mean as the center of the data.
How to calculate these means
http://faculty.chicagobooth.edu/alan.bester/teaching/notes/n1_beerexample.htm
More on summation notation (take this as an aside)

Let us look at summation in more detail.

$$\sum_{i=1}^{n} x_i$$

means that for each value of i, from 1 to n, we add to the sum the value indicated, in this case $x_i$ (add in this value for each i).
To understand how it works let us consider some examples.

x     y     year
0.07  0.11  1
0.06  0.05  2
0.04  0.09  3
0.03  0.03  4

Think of each row as an observation on both x and y.
To make things concrete, think of each row as corresponding to a year and let x and y be annual returns on two different assets.

In year 1 asset x had return 7%.
In year 4 asset y had return 3%.
(Here, we do not sum over all observations: we sum only over the second and the third observation.)

Compute x bar.

Compute y bar.
For each value of i, we can add in anything we want:

$$\sum_{i=1}^{4} (x_i - \bar{x})(y_i - \bar{y}) = (.02)(.04) + (.01)(-.02) + (-.01)(.02) + (-.02)(-.04)$$
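The course does these calculations in Excel (see the link that follows). As a sketch only, the same sums can be written in Python using the x and y returns from the four-year table. One assumption is flagged: the summand in the last sum is taken to be the product of deviations from the means, which is inferred from the numbers on the slide (.07 - .05 = .02, .11 - .07 = .04, and so on).

```python
x = [0.07, 0.06, 0.04, 0.03]   # annual returns on asset x, years 1..4
y = [0.11, 0.05, 0.09, 0.03]   # annual returns on asset y, years 1..4
n = len(x)

# Sum over only the second and third observations (i = 2, 3).
# Python lists are 0-based, so x_i is stored at x[i - 1].
partial = sum(x[i - 1] for i in range(2, 4))

x_bar = sum(x) / n
y_bar = sum(y) / n

# For each i we can add in any expression; here, a product of deviations
# (assumed summand, inferred from the numbers on the slide).
cross = sum((x[i] - x_bar) * (y[i] - y_bar) for i in range(n))

print(round(partial, 4), round(x_bar, 4), round(y_bar, 4), round(cross, 6))
```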
How to do these calculations using Excel
http://faculty.chicagobooth.edu/alan.bester/teaching/notes/n1_ssfunc.htm
The median
After ordering the data, the median is the middle value of the data. If there is an even number of data points, the median is the average of the two middle values.

Example

1,2,3,4,5      Median = 3
1,1,2,3,4,5    Median = (2+3)/2 = 2.5
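The rule above translates almost word for word into code. A minimal Python sketch (the function name `median` is chosen here for illustration):

```python
def median(values):
    """Middle value of the ordered data; average the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


m_odd = median([1, 2, 3, 4, 5])
m_even = median([1, 1, 2, 3, 4, 5])
print(m_odd, m_even)
```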
Mean versus median
Although both the mean and the median are good measures of the center of a distribution of measurements, the median is less sensitive to extreme values.

The median is not affected by extreme values since only the middle value(s) of the ordered data enter its computation; the numerical values of the extreme measurements are not used.

Example

1,2,3,4,5      Mean: 3     Median: 3
1,2,3,4,100    Mean: 22    Median: 3
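The example can be reproduced with Python's standard `statistics` module. This is just a check of the arithmetic, not part of the course's Excel workflow.

```python
import statistics

original = [1, 2, 3, 4, 5]
with_extreme = [1, 2, 3, 4, 100]      # same data with the largest value replaced

for data in (original, with_extreme):
    print(data, statistics.mean(data), statistics.median(data))
```

Replacing 5 by 100 moves the mean from 3 to 22, while the median stays at 3.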
If data is right skewed the mean will be bigger than the median. You can think of this as the extreme right tail observations pulling the mean upward.

For the bank interarrival data:

Summary measures for selected variables
         InterarrivalTime
Mean     4.163
Median   2.779

[Figure: Histogram for InterarrivalTime.]
Median or Mean?
At Booth professors are rated by students from 1-5 in several categories. In the past only the mean rating was reported.

Some faculty members believe the median should be reported instead. This was actually a major debate at a faculty meeting a few years ago.

What difference would this make?

In fact, Booth now reports the mean and median, along with a histogram of all the ratings!
The Mean of a Dummy Variable
Consider the "simpsons" variable in the survey data set.
Does it make sense to take the mean?

Summary measures for selected variables
         simpsons
Count    1000.000
Mean     0.181

The sum of the 1's and 0's will equal the number of respondents who watch the simpsons.

So the mean is the fraction of respondents who watch.
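A tiny Python sketch of why this works, using a made-up list of ten 0/1 answers rather than the real survey column:

```python
simp = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]   # hypothetical: 10 respondents

watchers = sum(simp)                     # adding 0's and 1's counts the 1's
fraction = watchers / len(simp)          # the mean of the dummy
print(watchers, fraction)

# In the survey, a mean of 0.181 over 1000 respondents corresponds to:
survey_watchers = round(0.181 * 1000)
print(survey_watchers)
```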
So, in general, the average of a dummy gives the percentage of times that whatever dummy=1 signals happens.

Another example: if a poll is conducted about a particular candidate where 1=approval, 0=disapproval, then the sample mean is the candidate's approval rating.

This may seem obvious, but we will get a lot of use out of this idea throughout the quarter.
3.2 The Variance and Standard Deviation
The mean and the median give us information about the central tendency of a set of observations, but they shed no light on the dispersion, or spread of the data.

Example: Which data set is more variable?

5,5,5,5,5    Mean: 5
1,3,5,8,8    Mean: 5

If these were portfolio returns (in percent), means are average returns. What else might we want to measure?
The Sample Variance
. . . .
-+---------+---------+---------+---------+---------+-----x
. . . .
-+---------+---------+---------+---------+---------+-----y
0.030 0.045 0.060 0.075 0.090 0.105
The y numbers are more spread out than the x numbers.
We want a numerical measure of variation or spread.

The basic idea is to view variability in terms of distance between each measurement and the mean:

$$x_i - \bar{x}$$
. . . .
-+---------+---------+---------+---------+---------+-----x
. . . .
-+---------+---------+---------+---------+---------+-----y
0.030 0.045 0.060 0.075 0.090 0.105
Overall, the distances from the mean for the x data are smaller than those for the y data.
We cannot just look at the distance between each measurement and the mean. We need an overall measure of how big the differences are (i.e., just one number like in the case of the mean).

Also, we cannot just sum the individual distances because the negative distances cancel out with the positive ones giving zero always (Why?).
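One way to answer the "Why?" is a short derivation. It uses only the definition of the sample mean, which says that the sum of the $x_i$ equals $n\bar{x}$:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0$$

So the deviations from the mean always sum to zero, whatever the data are, which is why we need to do something (like squaring) before adding them up.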
The average squared distance would be

$$\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
So, the sample variance of the x data is defined to be:

Sample variance:

$$s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

We use n-1 instead of n for technical reasons that will be discussed later (and because Excel does it this way).

Think of it as the average squared distance of the observations from the mean.
Questions

1) What is the smallest value a variance can be?

2) What are the units of the variance?

It is helpful to have a measure of spread which is in the original units. The sample variance is not in the original units. We now introduce a measure of dispersion that solves this problem: the sample standard deviation.
The sample standard deviation

It is defined as the square root of the sample variance (easy).

The sample standard deviation:

$$s_x = \sqrt{s_x^2}$$

The units of the standard deviation are the same as those of the original data.
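Here is a minimal Python sketch of both definitions, using the n-1 divisor, applied to the two small data sets from the start of this section (5,5,5,5,5 and 1,3,5,8,8). The function names are chosen here for illustration.

```python
import math


def sample_variance(x):
    """Sum of squared distances from the mean, divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    return sum((x_i - x_bar) ** 2 for x_i in x) / (n - 1)


def sample_sd(x):
    """Square root of the sample variance; same units as the data."""
    return math.sqrt(sample_variance(x))


constant = [5, 5, 5, 5, 5]
varied = [1, 3, 5, 8, 8]

print(sample_variance(constant), sample_sd(constant))
print(sample_variance(varied), round(sample_sd(varied), 4))
```

For 1,3,5,8,8 the squared distances from the mean of 5 are 16, 4, 0, 9, 9, which sum to 38; dividing by n-1 = 4 gives a variance of 9.5. The constant data set has variance 0, the smallest value a variance can take.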
Example 2 (graphical)

The sample standard deviation for the y data is bigger than that for the x data.

This numerically captures the fact that y has more variation about its mean than x.
Character Dotplot
.
:
: :
:: :
.::: :.:
: : :::: ::::
::: :::: :::: :::
. : :::: :::: :::: :::.-----+---------+---------+---------+---------+---------+-canada
. .
::. . : .
. ::: .:: :.: .
: ::: .::: :::: : :.
. .. .. :.:: :::: :::: :::: : :: : : . : .
-----+---------+---------+---------+---------+---------+-japan
-0.160 -0.080 0.000 0.080 0.160 0.240
Variable  N    Mean     StDev
canada    107  0.00907  0.03833
japan     107  0.00234  0.07368

Example 2 (graphical)
The standard deviations measure the fact that there is more spread in the Japanese returns.
3.3 The Empirical Rule
We now have two numerical summaries for the data:

$\bar{x}$: where the data is
$s_x$: how spread out, how variable the data is

The mean is pretty easy to interpret (some sort of center of the data).

We know that the bigger $s_x$ is, the more variable the data is, but how do we really interpret this number?

What is a big $s_x$, what is a small one?
The empirical rule will help us understand $s_x$ and relate the numerical summaries back to our plots.

Empirical Rule

For mound shaped data:

Approximately 68% of the data is in the interval

$$(\bar{x} - s_x,\ \bar{x} + s_x) = \bar{x} \pm s_x$$

Approximately 95% of the data is in the interval

$$(\bar{x} - 2s_x,\ \bar{x} + 2s_x) = \bar{x} \pm 2s_x$$
We can see this on a histogram of the Canadian returns, where $\bar{x} = .00907$ and $s_x = .03833$.

[Figure: Histogram for canada, with dotted lines at $\bar{x} - s_x$ and $\bar{x} + s_x$ and dashed lines at $\bar{x} - 2s_x$ and $\bar{x} + 2s_x$.]

The empirical rule says that roughly 95% of the observations are between the dashed lines and roughly 68% between the dotted lines.

Looks reasonable.
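To see the rule in numbers, the sketch below computes the two intervals from the reported Canadian mean and standard deviation, and then checks the 68% and 95% shares on simulated mound-shaped data. The simulated values are an assumption made purely for illustration (they are not the actual returns), and `share_within` is a helper name invented here.

```python
import random
import statistics


def share_within(data, k):
    """Fraction of data strictly inside (mean - k*s, mean + k*s), s = sample sd."""
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)          # uses the n - 1 divisor
    inside = [v for v in data if x_bar - k * s < v < x_bar + k * s]
    return len(inside) / len(data)


# Intervals implied by the reported Canadian summaries.
x_bar, s = 0.00907, 0.03833
one_sd = (x_bar - s, x_bar + s)
two_sd = (x_bar - 2 * s, x_bar + 2 * s)
print(one_sd, two_sd)

# Hypothetical mound-shaped data (simulated, NOT the actual returns).
rng = random.Random(0)
fake = [rng.gauss(x_bar, s) for _ in range(10_000)]
share_1 = share_within(fake, 1)
share_2 = share_within(fake, 2)
print(share_1, share_2)
```

With the reported summaries, the one standard deviation interval runs from about -0.029 to 0.047 and the two standard deviation interval from about -0.068 to 0.086.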
Same thing viewed from the perspective of the time series plot.

[Figure: Time series plot of canada against Index, with horizontal lines at $\bar{x}$, $\bar{x} + 2s_x$, and $\bar{x} - 2s_x$.]

n=108, so 5% outside would be about 5 points.

There are 4 points outside, which is pretty close.
A little finance: comparing mutual funds
Let us use the means and standard deviations to compare mutual funds.
For 9 different assets we compute the means and standard deviations.
Later, we plot the means versus the standard deviations.

The assets are:

#c1 - R22 Drefus (growth)
#c2 - R30 Fidelity Trend fund (growth)
#c3 - R55 Keystone Speculative fund (max capital gain)
#c4 - R92 Putnam Income Fund (income)
#c5 - R99 Scudder Income
#c6 - R129 Windsor Fund (growth)
#c7 - equally weighted market
#c8 - value weighted market
#c9 - tbill rate

# sample period monthly returns 1:68 - 12-82
Variable  N    Mean     StDev
drefus    180  0.00677  0.04724
fidel     180  0.00470  0.05659
keystne   180  0.00654  0.08424
Putnminc  180  0.00552  0.03008
scudinc   180  0.00443  0.03597
windsor   180  0.01002  0.04864
eqmrkt    180  0.01082  0.06856
valmrkt   180  0.00681  0.04800
tbill     180  0.00598  0.00252

The speculative fund (keystne) has a higher mean and standard deviation than the income fund (Putnminc).

Later we'll see how to look at this information graphically.
3.4 Percentiles, quartiles, and the IQR
Again, this just applies to numeric variables.
The 10th percentile is the number such that 10% ofthe values are less than it and 90% are bigger.
The median is the 50th percentile.
Percentiles are also known as quantiles.
95th percentile, .95 quantile, and 95% quantile all mean the same thing.
For the age variable in the survey data:

Summary measures for selected variables
                 age
Count            1000.000
5th percentile   25.000
10th percentile  28.000
90th percentile  71.000
95th percentile  75.000

5% of the 1000 age values are less than 25.

90% of people in the sample are less than 71 years old.

5% of the people in the sample are over 75 years of age.

For now don't worry about "strictly less than" vs. "less than or equal to".
The first, second, and third quartiles are the 25th, 50th, and 75th percentiles.
The interquartile range is the difference between the third and first quartiles.

Summary measures for selected variables

                        age
Count                1000.000
Mean                   48.312
Median                 48.000
Standard deviation     15.718
Variance              247.062
First quartile         35.000
Third quartile         60.000
Interquartile range    25.000

The interquartile range is used as a measure of spread (IQR is to variance as median is to mean).
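Here is a minimal sketch of how you might compute quartiles and the IQR yourself. The ages below are made up for illustration (they are not the survey data), and the "inclusive" interpolation rule is just one convention; a spreadsheet or StatPro may use a slightly different rule and give slightly different numbers.

```python
import statistics

# Hypothetical ages, made up for illustration (NOT the survey data).
ages = [22, 25, 31, 35, 38, 41, 44, 48, 52, 55, 60, 63, 67, 71, 78]

# quantiles(n=4) returns the three cut points: first quartile, median, third quartile.
# method="inclusive" is one of several interpolation conventions.
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

# quantiles(n=10) returns the 10th, 20th, ..., 90th percentiles.
deciles = statistics.quantiles(ages, n=10, method="inclusive")
p10, p90 = deciles[0], deciles[-1]

print("Q1:", q1, "median:", q2, "Q3:", q3, "IQR:", iqr)
print("10th percentile:", p10, "90th percentile:", p90)
```

The second quartile is the median, which is the point of the "median is the 50th percentile" remark.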
[Figure: Histogram for age. Vertical axis: count (0 to 120). The first quartile (35 years) is marked.]

We can interpret quantiles graphically on the histogram.
25% of the area of the colored bars is to the left of the first quartile.
The empirical rule is actually a statement about quantiles.

What does it say? For a variable with a mound-shaped histogram:

What quantile is two standard deviations below the mean? 2.5%
(About 95% of the values are within two standard deviations of the mean, so about 5% are outside, half of them in each tail.)

What quantile is one standard deviation above the mean? 84%
(About 68% of the values are within one standard deviation, so about 16% are above it and 84% are below.)

To see this yourself, draw the picture! We'll learn later that the empirical rule is based on a very important probability model.
[Figure 3. Indexed Real Wages for Men by Percentile, 1967-1997. Horizontal axis: Year (70 to 95); vertical axis: index (0.90 to 1.30). Legend: 10th Percentile (o), 50th Percentile (+), 90th Percentile.]
Aside: We won't use percentiles much in this class, but above is an interesting time series plot of the 90th (top line), 50th (median, middle line), and 10th percentiles of real wages in the U.S. from the late 1960s to the late 1990s. This widening income gap is a major concern for economists... or is it?
Source: Murphy, Kevin and Finis Welch, "Wage Differentials in the 1990s: Is the Glass Half-full or Half-empty?"
Links:
http://freakonomics.blogs.nytimes.com/2008/05/19/shattering-the-conventional-wisdom-on-growing-inequality/
http://www.footballoutsiders.com/2006/04/24/ramblings/nfl-draft/3828/
4. Looking at Two Variables

While it is important to look at variables one at a time, many interesting business problems concern how two (or more) variables are related to each other.
4.1 Categorical variables: the Two-way Table
Let's look at the relationship between two categorical variables, x and y.

If x has two categories and y has two as well, then there are four categories using both x and y.

We can then just count the number of observations in each category.

If x has r1 categories and y has r2, then we have r1*r2 possibilities. We can arrange these possibilities in a two-way table.
This is the two-way table relating viewership of The Simpsons with cola use.
146 of the 1000 view The Simpsons and consume colas.

Raw counts:
               simpsons
colas          0        1        Grand Total
0              387      35       422
1              432      146      578
Grand Total    819      181      1000

Percent of total:
               simpsons
colas          0        1        Grand Total
0              38.70%   3.50%    42.20%
1              43.20%   14.60%   57.80%
Grand Total    81.90%   18.10%   100.00%

Percent of column:
               simpsons
colas          0        1        Grand Total
0              47%      19%      42%
1              53%      81%      58%
Grand Total    100%     100%     100%

Percent of row:
               simpsons
colas          0        1        Grand Total
0              92%      8%       100%
1              75%      25%      100%
Grand Total    82%      18%      100%
How to make these tables: http://gsbwww.uchicago.edu/fac/alan.bester/teaching/notes/n1_2waytable.htm

A picture of the table:

[Figure: bar chart of the two-way table. Horizontal axis: simpsons (0, 1); vertical axis: count (0 to 900); bars split by colas (0, 1).]
A much higher fraction of the Simpsons viewers consume colas (81% of viewers vs. 53% of non-viewers).
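If you want to see where the three percentage tables come from, here is a small sketch that rebuilds them from the raw counts. The counts are the ones in the table; the variable names are just my own choices for this illustration.

```python
# Raw counts from the two-way table, indexed as counts[colas][simpsons].
counts = {0: {0: 387, 1: 35},
          1: {0: 432, 1: 146}}

total = sum(n for row in counts.values() for n in row.values())
row_totals = {c: sum(row.values()) for c, row in counts.items()}
col_totals = {s: sum(counts[c][s] for c in counts) for s in (0, 1)}

# Each cell divided by the grand total, by its row total, or by its column total.
pct_of_total = {c: {s: 100 * n / total for s, n in row.items()}
                for c, row in counts.items()}
pct_of_row = {c: {s: 100 * n / row_totals[c] for s, n in row.items()}
              for c, row in counts.items()}
pct_of_col = {c: {s: 100 * n / col_totals[s] for s, n in row.items()}
              for c, row in counts.items()}

for name, table in [("Percent of total", pct_of_total),
                    ("Percent of row", pct_of_row),
                    ("Percent of column", pct_of_col)]:
    print(name)
    for c in (0, 1):
        cells = "  ".join(f"{table[c][s]:6.2f}%" for s in (0, 1))
        print(f"  colas={c}: {cells}")
```

The "percent of column" table is the one that answers "what fraction of Simpsons viewers drink cola?".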
4.2 Numeric variables: Scatter Plots
For two numeric variables we have the scatter plot.
nbeer weight
12.0 192
12.0 160
5.0 155
5.0 120
7.0 150
13.0 175
4.0 100
12.0 165
12.0 165
12.0 150
. .
. .
. .
How are they related?

Each row is an observation corresponding to a person.
Each person has two numbers associated with him/her: # beers and weight.

Is the number of beers you can drink related to your weight?
[Figure: scatterplot of nbeer (vertical axis, 0 to 20) against weight (horizontal axis, 100 to 200).]
You can think of a scatterplot as a 2D dotplot. Each point corresponds to an observation: weight determines the position on the horizontal axis, nbeer the position on the vertical axis.

Notice our outlier is back (circled)... and is he really an outlier?!

In addition to relating two variables, a scatterplot also gives you all the information you'd get from a dotplot of either variable.

[Figure: the same scatterplot of nbeer against weight.]
Sample Exam Question

The sample mean of weight is
(i) 105 (ii) 130 (iii) 155 (iv) 180

Imagine the dots on the scatterplot being pulled downward by gravity: you'd get a dotplot of weight!

Same idea for nbeer, though the vertical axis can be a little harder to picture. (Hint: rotate the paper.)

The sample SD of weight is around 28, so roughly 68% of observations should lie between 127 and 183 pounds.
Example
Are returns on a mutual fund related to market returns?
[Figure: scatterplot of windsor (vertical axis, -0.1 to 0.2) against valmrkt (horizontal axis, -0.1 to 0.2).]

Each point corresponds to a month.

Like the histogram, scatterplots can also be used with time series data, and the resulting plot does not depend on the time ordering.
Example
Here's another example of an outlier. This data is from a poker website that went through a major cheating scandal.

[Figure: scatterplot with axes WINRATE and VPIP.]

A similar scandal surfaced recently. Is the evidence as compelling?
http://www.msnbc.msn.com/id/26563848/

In finance we often use a different type of 2-D plot to compare asset returns. Here each point is a mutual fund. The horizontal and vertical location of each point reflects the sample standard deviation and sample mean of its returns within the same sample period.

[Figure: scatterplot of Mean (vertical axis, 0.004 to 0.011) against StDev (horizontal axis, 0.00 to 0.09), with one labeled point each for tbill, valmrkt, eqmrkt, windsor, scudinc, Putnminc, keystne, fidel, and drefus.]

If you're a fund manager, where do you want to be on this plot?
Let us compare some countries (Country returns data: http://faculty.chicagobooth.edu/alan.bester/teaching/data/conret.xls)

Based on monthly returns from 88 to 96.

[Figure: scatterplot of Mean (vertical axis, 0.00 to 0.02) against StDev (horizontal axis, 0.03 to 0.08), with one labeled point per country: singapor, usa, japan, italy, honkong, germany, france, finalnd, canada, belgium, australi.]
4.3 Relating a Numeric to a Categorical variable
How do you plot a numeric variable vs a categorical variable?

This is not so obvious.

An easy thing to do is make the numeric variable categorical by binning it, like we did when making a histogram.
Cigarette usage and age (percent of row):

               cigs
age            0          1          Grand Total
16-25          50.98%     49.02%     100.00%
26-35          63.64%     36.36%     100.00%
36-45          67.69%     32.31%     100.00%
46-55          64.76%     35.24%     100.00%
56-65          79.76%     20.24%     100.00%
66-75          91.13%     8.87%      100.00%
76-85          88.10%     11.90%     100.00%
86-95          100.00%    0.00%      100.00%
Grand Total    71.20%     28.80%     100.00%

[Figure: bar chart of the table above. Horizontal axis: age group (16-25 through 86-95); vertical axis: percent (0% to 120%); bars split by cigs (0, 1).]

Quick: what is the relationship between age and cigarette usage?

Plots are a great way to identify patterns, but careful... How strong is the evidence?
5.1 In Tables
There does not seem to be a standard way to summarize the strength of the relationship in a table.

Sometimes I use the difference between a marginal proportion and a conditional proportion.

Percent of total:
               simpsons
colas          0          1          Grand Total
0              38.70%     3.50%      42.20%
1              43.20%     14.60%     57.80%
Grand Total    81.90%     18.10%     100.00%

Percent of column:
               simpsons
colas          0          1          Grand Total
0              47.25%     19.34%     42.20%
1              52.75%     80.66%     57.80%
Grand Total    100.00%    100.00%    100.00%

In this case it would be: |.578 - .8066| = .2286

This is the difference between the percent of everyone in the sample who are cola drinkers (57.8%) and the percent of Simpsons viewers who are cola drinkers (80.66%).
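A quick sketch of that calculation from the raw counts in the Simpsons/cola table. The variable names are mine; the counts come from the table.

```python
# Counts from the two-way table of colas vs. simpsons.
n_total = 1000      # everyone in the sample
n_cola = 578        # cola drinkers (row total for colas = 1)
n_simpsons = 181    # Simpsons viewers (column total for simpsons = 1)
n_both = 146        # cola drinkers who also watch The Simpsons

marginal = n_cola / n_total          # proportion of cola drinkers overall
conditional = n_both / n_simpsons    # proportion of cola drinkers among viewers
gap = abs(marginal - conditional)

print(round(marginal, 4), round(conditional, 4), round(gap, 4))
```

If the two variables were unrelated, the conditional proportion would be close to the marginal one and the gap would be near zero.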
5.2 Covariance and Correlation
In the beer data (beers vs weight) and mutual fund data (windsor vs valmrkt), it looks like there is a relationship.

Even more, the relationship looks linear in that it looks like we could draw a line through the plot to capture the pattern.

Covariance and correlation summarize how strong a linear relationship there is between two variables.

In our first example, weight and # beers were the two variables. In our second example, our two variables were two kinds of returns.

In general, we think of the two variables as x and y.
The sample covariance between x and y:

$$ s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) $$

The sample correlation between x and y:

$$ r_{xy} = \frac{s_{xy}}{s_x s_y} $$

So, the correlation is just the covariance divided by the two standard deviations. What are the units?

We will get some intuition about these formulae, but first let us see them in action. How do they summarize data for us? Let us start with the correlation.
Correlation, the facts of life:

$$ -1 \le r_{xy} \le 1 $$

The closer r is to 1, the stronger the linear relationship is with a positive slope. When one goes up, the other tends to go up.

The closer r is to -1, the stronger the linear relationship is with a negative slope. When one goes up, the other tends to go down.
The correlations corresponding to the two scatter plots we looked at are:

Correlation of valmrkt and windsor = 0.923
Correlation of nbeer and weight = 0.692

The larger correlation between valmrkt and windsor indicates that the linear relationship is stronger.

[Figure: the two scatterplots from before: windsor against valmrkt, and nbeer against weight.]

Let us look at some more examples.
[Figure: scatterplot of y1 against x1.]
Correlation of y1 and x1 = 0.019

[Figure: scatterplot of y2 against x2.]
Correlation of y2 and x2 = 0.995

[Figure: scatterplot of y3 against x3.]
Correlation of y3 and x3 = 0.586

[Figure: scatterplot of y4 against x4.]
Correlation of y4 and x4 = -0.982

[Figure: scatterplot of y5 against x5.]
Correlation of y5 and x5 = 0.210

IMPORTANT: Correlation only measures linear relationships (here the value is small but there is a strong nonlinear relationship between y5 and x5).
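To convince yourself of this, here is a small sketch with made-up data (not the x5, y5 data from the plot). In the second case y is completely determined by x, yet the sample correlation is zero because the relationship is a curve rather than a line. The helper function is my own and simply follows the covariance and correlation formulas.

```python
import math

def sample_correlation(x, y):
    """Sample correlation: covariance divided by the two standard deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - x_bar) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - y_bar) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)

# Hypothetical data, chosen only for illustration.
x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * v + 1 for v in x]   # exact linear relationship
y_curved = [v ** 2 for v in x]      # exact, but nonlinear, relationship

r_linear = sample_correlation(x, y_linear)
r_curved = sample_correlation(x, y_curved)
print("linear:", r_linear, "curved:", r_curved)
```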
Example: The country data
Which countries go up and down together? I have data on 23 countries. That would be a lot of plots!

[Figure: scatterplot of usa and canada monthly returns (both axes from -0.1 to 0.1).]

The correlation matrix is a table of all sample correlations between each possible pair of a set of variables.
           australi  belgium  canada  finalnd  france  germany  honkong  italy
belgium    0.189
canada     0.507     0.357
finalnd    0.387     0.183    0.386
france     0.275     0.734    0.342   0.176
germany    0.226     0.691    0.302   0.304    0.709
honkong    0.334     0.301    0.558   0.355    0.359   0.339
italy      0.159     0.367    0.334   0.389    0.352   0.465    0.261
japan      0.251     0.418    0.271   0.307    0.421   0.318    0.219    0.426
usa        0.360     0.429    0.651   0.264    0.501   0.372    0.429    0.240
singapor   0.409     0.355    0.478   0.391    0.408   0.467    0.647    0.416

           japan     usa
usa        0.246
singapor   0.407     0.473

Why is this blank? (The question points at the empty upper part of the table.)

StatPro will also make the covariance matrix, which displays covariances with variances on the diagonal.

Make this table in StatPro: http://faculty.chicagobooth.edu/alan.bester/teaching/notes/n1_statprosumstats.htm

Understanding the covariance and correlation formulae
How do these weird-looking formulae for covariance and correlation capture the relationship?

To get a feeling for this, let us go back to the simple example and compute covariance and correlation.

x      y
0.07   0.11
0.06   0.05
0.04   0.09
0.03   0.03

First, let us compute the covariance (which is a necessary ingredient to compute the correlation). Here the sample means are $\bar{x} = .05$ and $\bar{y} = .07$:

$$
\begin{aligned}
s_{xy} &= \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \\
&= \frac{1}{3} \big( (.07-.05)(.11-.07) + (.06-.05)(.05-.07) + (.04-.05)(.09-.07) + (.03-.05)(.03-.07) \big) \\
&= \frac{1}{3} \big( (.02)(.04) + (.01)(-.02) + (-.01)(.02) + (-.02)(-.04) \big) \\
&= \frac{1}{3} (.0008 - .0002 - .0002 + .0008) = \frac{1}{3}(.0012) = .0004
\end{aligned}
$$
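The same arithmetic can be checked in a few lines of code. This is only a sketch that follows the formulas above using the four (x, y) points from the table; the function name is my own.

```python
import math

x = [0.07, 0.06, 0.04, 0.03]
y = [0.11, 0.05, 0.09, 0.03]

def sample_covariance(a, b):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    return sum((u - a_bar) * (v - b_bar) for u, v in zip(a, b)) / (n - 1)

s_xy = sample_covariance(x, y)
s_x = math.sqrt(sample_covariance(x, x))   # the variance is the covariance of x with itself
s_y = math.sqrt(sample_covariance(y, y))
r_xy = s_xy / (s_x * s_y)

print("covariance:", round(s_xy, 6))
print("correlation:", round(r_xy, 4))
```

Notice that the covariance (.0004) is hard to interpret on its own, while the correlation is a unit-less number between -1 and 1.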
Each of the 4 points makes a contribution to the sum. Let us see which point does what.

[Figure: scatterplot of the four (x, y) points, with lines drawn at the means of x and y dividing the plot into four regions labeled (I), (II), (III), and (IV).]

$$ (x_1 - \bar{x})(y_1 - \bar{y}) = (.02)(.04) = .0008 $$
$$ (x_2 - \bar{x})(y_2 - \bar{y}) = (.01)(-.02) = -.0002 $$
$$ (x_3 - \bar{x})(y_3 - \bar{y}) = (-.01)(.02) = -.0002 $$
$$ (x_4 - \bar{x})(y_4 - \bar{y}) = (-.02)(-.04) = .0008 $$

Points in (I) have both x and y bigger than their means, so we get a positive contribution to the covariance.
Points in (III) have both x and y less than their means, so we get a positive contribution to the covariance.
In (II) and (IV) one of x and y is less than its mean and the other is greater, so we get a negative contribution.

The further out the point is, the bigger the contribution.
[Figure: scatterplot of windsor against valmrkt, divided into four regions at the two sample means. Two opposite regions are annotated "Lots of positive contributions"; the other two are annotated "just a few relatively small contributions".]

We saw before that this mutual fund's returns are positively correlated with the market.
The sign of the correlation contains the same information as the sign of the covariance (in fact, they have the same sign because the standard deviations are always positive).

Positive sign: positive relationship
Negative sign: negative relationship

The correlation can be more informative, though, because it is unit-less and always between -1 and 1, by construction. Hence, it is a more easily interpretable measure of the strength of the relationship.

Close to 1: strong positive relationship
Close to -1: strong negative relationship
6. Linearly Related Variables

We have studied data sets that display some kind of relation between variables (the mutual fund returns and the market returns, for instance).

Sometimes there is an exact linear relation between variables:
y = c0 + c1 x
In this linear relationship, c0 is called the intercept.
c1 is called the slope.
Suppose we had started with x and we already knew its sample mean and variance.

Can we figure out the sample mean and variance of the new variable, y?
6.1 Linear functions
Example
Suppose we have a sample of temperatures in Celsius and we convert them to Fahrenheit.

fahr = 32 + (9/5) * cel

cel   fahr
10    50
15    59
20    68
25    77
40    104
30    86
50    122
70    158

(Data: http://faculty.chicagobooth.edu/alan.bester/teaching/data/celfahr.xls)

How are the cel values related to the fahr values?

Note that the sample mean of cel is 32.5 and its sample standard deviation s_cel is 20.

We could find the sample mean and standard deviation of fahr using a spreadsheet.

Note: if we make a scatter plot of fahr versus cel, what do we see?

[Figure: scatterplot of fahr (vertical axis, 50 to 150) against cel (horizontal axis, 10 to 70).]

Correlation of cel and fahr = 1.000
The variable y is a linear function of the variable x if:

$$ y = c_0 + c_1 x $$

In general, we like to use the symbols y and x for the two variables.

c_0: the intercept
c_1: the slope

We think of the c's as constants (fixed numbers) while x and y vary.
Example
Suppose your client is a movie star. She has a deal which pays her a $10 million fee per movie + 10% of the gross ticket revenues.

How is our star's income related to the gross?

Let I denote income.
Let G denote Gross.

$$ I = 10 + .1\,G $$

Note: Don't forget units! When we write it this way we need to make sure all our numbers are in millions of dollars.
6.2 Mean and variance of a linear function
Suppose y (i.e., each value of the variable y) is a linear function of x.

How are the mean and variance (standard deviation) of y related to those of x?

Let us look at our temperature example.

Suppose we first multiply by (9/5) and then add 32.

mul = (9/5) * cel
fahr = 32 + mul = 32 + (9/5) * cel

[Figure: dotplots of cel, mul, and fahr on a common axis from 0 to 150.]

Variable   Mean    StDev
cel        32.50   20.00
mul        58.5    36.0
fahr       90.5    36.0
Interpret
When we multiply cel by 9/5 we affect (increase) both the mean and the standard deviation proportionally.

If we add a constant (32 in our case) we simply increase the mean (by the value of the constant) but leave the overall dispersion unaffected.
Sample mean and variance of a linear function
Suppose

$$ y = c_0 + c_1 x $$

Then,

$$ \bar{y} = c_0 + c_1 \bar{x} $$

$$ s_y = |c_1| \, s_x $$

$$ s_y^2 = c_1^2 \, s_x^2 $$
Example
So, instead of using a spreadsheet, we could have used our linear formulas.

We knew that fahr = 32 + (9/5) * cel, so fahr plays the role of y, cel plays the role of x, c0 = 32, and c1 = 9/5.

Our handy linear formulas tell us:

$$ \overline{fahr} = c_0 + c_1 \, \overline{cel} = 32 + (9/5)(32.5) = 90.5 $$

$$ s_{fahr} = |c_1| \, s_{cel} = |9/5| \times 20 = 36 $$

Of course, these are the same answers we got before!!
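If you would rather check this with code than with a spreadsheet, here is a minimal sketch. It computes the mean and standard deviation of fahr two ways: directly from the converted data, and from the linear formulas using only the summary statistics of cel.

```python
import statistics

cel = [10, 15, 20, 25, 40, 30, 50, 70]
c0, c1 = 32, 9 / 5
fahr = [c0 + c1 * c for c in cel]

# The "spreadsheet" way: summarize the converted data directly.
mean_direct = statistics.mean(fahr)
sd_direct = statistics.stdev(fahr)      # sample SD, divides by n - 1

# The formula way: only the mean and SD of cel are needed.
mean_formula = c0 + c1 * statistics.mean(cel)
sd_formula = abs(c1) * statistics.stdev(cel)

print("direct: ", round(mean_direct, 4), round(sd_direct, 4))
print("formula:", round(mean_formula, 4), round(sd_formula, 4))
```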
Aside: Why? (The hard way)

For each observation i we have $y_i = c_0 + c_1 x_i$. Recall that $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$.

$$
\begin{aligned}
\bar{y} &= \frac{1}{n} \sum_{i=1}^{n} (c_0 + c_1 x_i) \\
&= \frac{1}{n} \sum_{i=1}^{n} c_0 + c_1 \frac{1}{n} \sum_{i=1}^{n} x_i \\
&= c_0 + c_1 \bar{x}
\end{aligned}
$$

$$
\begin{aligned}
s_y^2 &= \frac{1}{n-1} \sum_{i=1}^{n} (c_0 + c_1 x_i - c_0 - c_1 \bar{x})^2 \\
&= \frac{1}{n-1} \sum_{i=1}^{n} (c_1 x_i - c_1 \bar{x})^2 \\
&= c_1^2 \, \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = c_1^2 s_x^2
\end{aligned}
$$

NOTE: This is way more math than we will typically need in this course.

BUT you should know these formulas are properties of our summary statistics, not just some coincidence. AND they come up again when we do probability!

Example
Suppose our movie star made 10 pictures last year and the sample mean and sample variance of the gross on the films are 100 and 900, respectively.

What are the sample mean and variance of the star's income?

Gross   Income
115.8   21.58
128.9   22.89
109.5   20.95
127.1   22.71
87.2    18.72
111.2   21.12
62.5    16.25
129.4   22.94
87.2    18.72
41.2    14.12

Each Income number is 10 + .1 * the corresponding Gross number.

See the file "moviestar1.xls" (http://faculty.chicagobooth.edu/alan.bester/teaching/data/moviestar1.xls).

Remember,

$$ I = 10 + .1\,G $$

Here I plays the role of y, G plays the role of x, c0 = 10, and c1 = .1. So,

$$ \bar{I} = c_0 + c_1 \bar{G} = 10 + .1 \times 100 = 20 $$

$$ s_I^2 = c_1^2 s_G^2 = (.1)^2 \times 900 = 9 $$
Gross   Income
115.8   21.6
128.9   22.9
109.5   21.0
127.1   22.7
87.2    18.7
111.2   21.1
62.5    16.2
129.4   22.9
87.2    18.7
41.2    14.1

The average of the Gross numbers = 100
The sample variance of the Gross numbers = 900
The standard deviation of the Gross numbers = 30

The average of the Income numbers = 20 (10 + .1*100 = 20)
The sample variance of the Income numbers = 9 ((.1)^2 * 900 = 9)
The standard deviation of the Income numbers = 3 (.1*30 = 3)

[Figure: scatterplot of Income (vertical axis, 14 to 24) against Gross (horizontal axis, 40 to 140).]
Why are these formulas useful?
We could always just type everything into a spreadsheet and use spreadsheet functions to get the answers.

Really, though, the reason for these formulas will become apparent when we study probability, statistical inference, and regression. You cannot understand statistics or regression without a solid understanding of linear relationships.

In other words, yes, I recognize these formulas are probably the least fun part of the course (and considering this is basic stats, that's saying something). But you absolutely must know them.
Example
Suppose x has mean 100 and standard deviation 10.
What are the mean, standard deviation and variance of:
(i) y = 2x? (c0 = 0, c1 = 2)
(ii) y = 5 + x? (c0 = 5, c1 = 1)
(iii) y = 5 - 2x? (c0 = 5, c1 = -2)

Answers:
         Mean    SD    Variance
(i)      200     20    400
(ii)     105     10    100
(iii)    -195    20    400
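You can check answers like these mechanically. The sketch below just applies the linear formulas (mean: c0 + c1 * mean of x; SD: |c1| * SD of x); the function name linear_summary is my own.

```python
def linear_summary(c0, c1, mean_x, sd_x):
    """Mean, SD, and variance of y = c0 + c1 * x, given the mean and SD of x."""
    mean_y = c0 + c1 * mean_x
    sd_y = abs(c1) * sd_x          # the SD ignores the sign of the slope
    return mean_y, sd_y, sd_y ** 2

mean_x, sd_x = 100, 10
cases = {"y = 2x": (0, 2), "y = 5 + x": (5, 1), "y = 5 - 2x": (5, -2)}
results = {label: linear_summary(c0, c1, mean_x, sd_x)
           for label, (c0, c1) in cases.items()}

for label, (m, s, v) in results.items():
    print(f"{label}: mean={m}, SD={s}, variance={v}")
```

Case (iii) is the one to watch: the negative slope flips the sign of the mean's slope term, but the SD and variance stay positive.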
6.3 Linear combinations
We may want a variable to be related to several others instead of just one. We will assume that Y is a function of X, Z, ... rather than just a function of X.

When a variable y is linearly related to several others, we call it a linear combination.

$$ y = c_0 + c_1 x_1 + c_2 x_2 + \cdots + c_k x_k $$

We say y is a linear combination of the x's.
c_0 is called the intercept or just the constant.
c_i is called the coefficient of x_i.
Example
Suppose in addition to the flat $10 million fee and 10 percent of ticket revenues, our movie star also gets 5 percent of all sales of the soundtrack (on CD) released with the movie.

How is the star's income related to the film's gross and CD sales (in millions of dollars)?

Let I, G, and C denote income, gross, and CD sales:

$$ I = 10 + .1\,G + .05\,C $$

Here y = I, x_1 = G, x_2 = C, c_0 = 10, c_1 = .1, and c_2 = .05.
Important example: Portfolios
Suppose you have $100 to invest.

Let x1 be the return on asset 1.
If x1 = .1, and you put all your money into asset 1, then you will have $100*(1+.1) = $110 at the end of the period.

Let x2 be the return on asset 2.
If x2 = .15, and you put all your money into asset 2, then you will have $100*(1+.15) = $115 at the end of the period.

Suppose you put 1/2 of your money into asset 1 and the other 1/2 of your money into asset 2. What will happen?

At the end of the period you will have

.5*(100)*(1+.1) + .5*(100)*(1+.15) = 100*[ 1 + (.5*.1) + (.5*.15) ]
55 + 57.50 = $112.50

So the return is (.5*.1) + (.5*.15) = .125

In other words, when we put 1/2 of our money into asset 1 and the other 1/2 into asset 2, the return on the resulting portfolio is

Rp = (1/2)*x1 + (1/2)*x2

where Rp is the return on the portfolio, and the two coefficients are the fractions invested in asset 1 and asset 2.

The return on a portfolio is a linear combination of the returns on the individual assets.
It turns out this is true in general. Suppose you have $M to invest in two assets with returns x1 and x2. Let w1 be the fraction of your wealth you choose to invest in asset 1, and w2 the fraction you invest in asset 2. At the end of the period you will have

$$
\begin{aligned}
w_1 M (1 + x_1) + w_2 M (1 + x_2) &= M (w_1 + w_2 + w_1 x_1 + w_2 x_2) \\
&= M (1 + w_1 x_1 + w_2 x_2)
\end{aligned}
$$

The portfolio return is:

$$ R_p = w_1 x_1 + w_2 x_2 $$

The portfolio return is a linear combination of the individual asset returns. The coefficients are the portfolio weights (fraction of wealth invested in each asset).

Note: For this to work, we need w1 + w2 = 1.
Notice that the portfolio weights always sum up to one. (If I invest 30% of my wealth in asset 1, then I have to invest 70% of my wealth in asset 2.)

When we're talking about portfolios, we use w1, w2, ... instead of c1, c2, ... to remind us that weights have to sum to one. Our linear formulas work the same way in either case. Most of the time when we do portfolios, we don't worry about the constant (c0 = 0).

Question for those with some finance experience: Can portfolio weights be negative?
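Here is a small sketch of the portfolio return as a linear combination, using the $100 example with returns .1 and .15. The function portfolio_return and its check that the weights sum to one are my own additions for illustration.

```python
def portfolio_return(weights, returns):
    """Return on a portfolio: the weighted sum of the individual asset returns."""
    if abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("portfolio weights must sum to one")
    return sum(w * x for w, x in zip(weights, returns))

wealth = 100
r_p = portfolio_return([0.5, 0.5], [0.10, 0.15])
end_wealth = wealth * (1 + r_p)

print("portfolio return:", round(r_p, 4))
print("wealth at end of period:", round(end_wealth, 2))
```

The same function works for any number of assets, as long as the weights sum to one.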
Suppose we have m assets.
The return on the ith asset is x_i.

Put fraction w_i of your wealth into asset i.

Your portfolio is determined by the portfolio weights w_i.

Then, the return on the portfolio is:

$$ R_p = w_1 x_1 + w_2 x_2 + \cdots + w_m x_m = \sum_{i=1}^{m} w_i x_i $$

Your portfolio return is always a linear combination of individual asset returns, with coefficients equal to the fraction of wealth invested.
6.4 Mean and variance of a linear combination
First, we consider the case where we have only two x's.

2 inputs:

Suppose

$$ y = c_0 + c_1 x_1 + c_2 x_2 $$

Then,

$$ \bar{y} = c_0 + c_1 \bar{x}_1 + c_2 \bar{x}_2 $$

$$ s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + 2 c_1 c_2 s_{x_1 x_2} $$

For linear combinations of 2 or more variables, the variance also depends on the covariance between the x's!!

Example
For each film she does, our movie star makes $10 million plus 10% of gross ticket revenues and 5% of CD sales.

Here is the data for ten movies she made last year, along with her income for each film:

Gross        Cd         Income
115.763100   5.412503   21.8
128.904400   6.539900   23.2
109.524600   5.878809   21.2
127.133700   4.984490   23.0
87.234720    3.544932   18.9
111.248000   5.602628   21.4
62.455030    3.954600   16.4
129.397300   5.387244   23.2
87.171460    5.092816   19.0
41.167710    3.602078   14.3

Remember,

$$ I = 10 + .1\,G + .05\,C $$

So each number in the Income column equals 10 plus .1 times the Gross value plus .05 times the Cd value.

Note: All numbers are in millions of $.
Like before, we could type everything in and get the sample mean and variance of income using a spreadsheet.

But let's suppose, as her agent, we already knew that:

$$ \bar{G} = 100, \quad \bar{C} = 5, \quad s_G = 30, \quad s_C = 1, \quad r_{CG} = 0.8 $$

Like before, we know that

$$ I = 10 + .1\,G + .05\,C $$

with c_0 = 10, c_1 = .1, and c_2 = .05. So:

$$ \bar{I} = c_0 + c_1 \bar{G} + c_2 \bar{C} = 10 + .1(100) + .05(5) = 20.25 $$

$$
\begin{aligned}
s_I^2 &= c_1^2 s_G^2 + c_2^2 s_C^2 + 2 c_1 c_2 s_{CG} \\
&= (.1)^2 (30)^2 + (.05)^2 (1)^2 + 2 (.1)(.05)(30)(1)(.8) = 9.24
\end{aligned}
$$

In the last term, the covariance s_CG is computed as (30)(1)(.8) = 24; see the next slide for why.
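The calculation above uses only summary statistics, so it is easy to reproduce. A sketch, with variable names of my own choosing:

```python
# Coefficients of the income formula I = c0 + c1*G + c2*C (millions of $).
c0, c1, c2 = 10, 0.1, 0.05

# Summary statistics we are assumed to know already.
mean_G, mean_C = 100, 5
sd_G, sd_C = 30, 1
r_CG = 0.8

# Covariance recovered from the correlation and the two standard deviations.
cov_CG = sd_C * sd_G * r_CG

mean_I = c0 + c1 * mean_G + c2 * mean_C
var_I = c1**2 * sd_G**2 + c2**2 * sd_C**2 + 2 * c1 * c2 * cov_CG

print("covariance of C and G:", round(cov_CG, 4))
print("mean income:", round(mean_I, 4))
print("variance of income:", round(var_I, 4))
```

Unrounded, the variance is 9.2425, which the slide reports as 9.24.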
Reminder:
Remember, we defined sample correlation as the covariance divided by the standard deviations:

$$ r_{xy} = \frac{s_{xy}}{s_x s_y} $$

So, if we know the correlation and both standard deviations, we can get back the sample covariance:

$$ s_{xy} = s_x s_y r_{xy} $$

So, if we know the sample standard deviations and either of correlation or covariance, we can figure out the other. We used this trick to calculate s_CG on the previous slide.
Example (the country data again)
Let us use our country data and suppose that we had put .5 into USA and .5 into Hong Kong. What would our returns have been?

port = .5*honkong + .5*usa

honkong   usa     port
 0.02     0.04    0.030
 0.06    -0.03    0.015
 0.02     0.01    0.015
-0.03     0.01   -0.010
 0.08     0.05    0.065
...

For each month, we get the portfolio return as (1/2)*honkong + (1/2)*usa.
port = .5*honkong + .5*usa, with w1 (= c1) = .5 and w2 (= c2) = .5.

The sample means are:
honkong: 0.02103
usa: 0.01346

The sample mean of our portfolio returns is:

$$ \overline{port} = w_1 \, \overline{honkong} + w_2 \, \overline{usa} = .5(.02103) + .5(.01346) = .01724 $$
What if we had put 25% into USA and 75% into Hong Kong?

port2 = .75*honkong + .25*usa

Covariances

           honkong      usa          port2
honkong    0.00521497
usa        0.00103037   0.00110774
port2      0.00416882   0.00104972   0.00338905

To get s_port2^2, just use the SAME variance formula as before, except now with w1 = .75 and w2 = .25:

$$ s_{port2}^2 = (.75)^2 (.00521) + (.25)^2 (.00111) + 2 (.25)(.75)(.00103) = .00339 $$
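Here is a sketch of the two-asset variance formula applied to the Hong Kong / USA numbers from the covariance table. The function name two_asset_variance is mine; the variances and covariance are copied from the table.

```python
import math

# Sample variances and covariance of monthly returns, from the covariance table.
var_hk = 0.00521497
var_usa = 0.00110774
cov_hk_usa = 0.00103037

def two_asset_variance(w1, w2, var1, var2, cov12):
    """Variance of w1*x1 + w2*x2: two variance terms plus one covariance term."""
    return w1**2 * var1 + w2**2 * var2 + 2 * w1 * w2 * cov12

# 75% Hong Kong, 25% USA.
var_port2 = two_asset_variance(0.75, 0.25, var_hk, var_usa, cov_hk_usa)

# 50% Hong Kong, 50% USA.
var_port = two_asset_variance(0.5, 0.5, var_hk, var_usa, cov_hk_usa)
sd_port = math.sqrt(var_port)

print("variance of port2:", round(var_port2, 8))
print("SD of the 50/50 portfolio:", round(sd_port, 4))
```

The first result matches the port2 entry on the diagonal of the covariance table.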
How do the returns on the w1 = w2 = .5 portfolio compare with those of Hong Kong and USA?

[Figure: scatterplot of Mean (vertical axis, 0.013 to 0.021) against StDev (horizontal axis, 0.03 to 0.07), with labeled points for usa, port, and honkong.]

For the portfolio: sample mean of port = .0172, s_port = .046

It looks like the mean for my portfolio is right in between the means of USA and Hong Kong.

What about the standard deviation?

The sample standard deviation is less than halfway between s_usa and s_honkong... what happened?
Why is covariance important?

It is often useful to rewrite the variance formula as

$$ s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + 2 c_1 c_2 s_{x_1} s_{x_2} r_{x_1 x_2} $$

(We just used the formula from the reminder slide: $s_{x_1 x_2} = s_{x_1} s_{x_2} r_{x_1 x_2}$.)

Remember, correlations are between -1 and 1! IF x1 and x2 are perfectly correlated (r = 1), then

$$ s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + 2 c_1 c_2 s_{x_1} s_{x_2} = (c_1 s_{x_1} + c_2 s_{x_2})^2 $$

So in this case,

$$ s_y = c_1 s_{x_1} + c_2 s_{x_2} $$

BUT in general (whenever r < 1), when c1 and c2 are positive,

$$ s_y < c_1 s_{x_1} + c_2 s_{x_2} $$
The basic idea here is: when we take averages, variance gets smaller.

The smaller the correlation, the faster this happens.

This is actually one of the most important ideas in statistics. We'll see it again!!

It is also one of the most important ideas in finance, because it leads to diversification.

Example (Optional)
y = .5x1 + .5x2

[Figure: scatterplot of x1 and x2. At each point we plot the value of y. The dashed lines are drawn at the mean of x1 and x2.]

The variances and covariance are:

      x1           x2
x1    1.334636
x2    -1.208679    1.106238

Then, the variance of y is

0.0058105 = .5*.5*1.3346 + .5*.5*1.106 + 2*.5*.5*(-1.208679)

Why is the variance of y so much smaller than those of the x's?
8/14/2019 Statistics Notes 1 Data_Plots and Summaries
154/178
1.5
2.0
1.290.75
0.93
-0.27
-0.02
Example (Optional)
y = 5x + 5 x
8/14/2019 Statistics Notes 1 Data_Plots and Summaries
155/178
155
x1
x2
-2 -1 0 1
-1.0
-0.5
0.0
0.
5
1.0
-1.07 -0.76
0.43
-0.09
-1.2
0.39
0.17
-0.69
-1.11
-0.43
0.13
-0.35
1.03
0.23
-1.67
y = .5x1 + .5x2
At each point we plot the value of y.
The dashed lines are drawn at the mean of x1 and x2.
The variances and covariance are:
        x1           x2
x1      1.3870537
x2      0.1976187    0.8247886
Then, the variance of y is
0.65175 = .5*.5*1.387 + .5*.5*.8248 + 2*.5*.5*.1976
Why is the variance of y less than those of x1 and x2?
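The two optional examples can be put side by side with a short Python sketch. It takes the variances and covariances as printed in the two tables and computes the implied correlation along with the variance of y = .5x1 + .5x2. The function names are chosen here for illustration. Because the sketch uses the full-precision table entries, the first result can differ in the last digits from the slide's 0.0058105, which was computed from rounded inputs.

```python
import math

def variance_of_combination(c1, c2, var1, var2, cov12):
    """Sample variance of y = c1*x1 + c2*x2 given the two variances and the covariance."""
    return c1**2 * var1 + c2**2 * var2 + 2 * c1 * c2 * cov12

def correlation(var1, var2, cov12):
    """Correlation implied by a covariance and the two variances."""
    return cov12 / math.sqrt(var1 * var2)

# (variance of x1, variance of x2, covariance) as printed on the two example slides.
examples = {
    "first example": (1.334636, 1.106238, -1.208679),
    "second example": (1.3870537, 0.8247886, 0.1976187),
}

results = {}
for name, (var1, var2, cov12) in examples.items():
    results[name] = (
        correlation(var1, var2, cov12),
        variance_of_combination(0.5, 0.5, var1, var2, cov12),
    )
    r, var_y = results[name]
    print(f"{name}: r = {r:.3f}, variance of y = {var_y:.4f}")
```

In the first example the correlation is close to -1, so the two x's nearly cancel when averaged. In the second the correlation is small and positive, so the variance drops, but by much less.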
3 inputs:
$$y = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3$$
$$\bar{y} = c_0 + c_1 \bar{x}_1 + c_2 \bar{x}_2 + c_3 \bar{x}_3$$
$$s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + c_3^2 s_{x_3}^2 + 2 c_1 c_2 s_{x_1 x_2} + 2 c_1 c_3 s_{x_1 x_3} + 2 c_2 c_3 s_{x_2 x_3}$$
Note that there are now THREE covariance terms, one for each PAIR of x's.
The formula for the sample mean is basically the same, just one more term because there's one more x.
Example: Portfolio with 3 inputs
port = .1*fidel+.4*eqmrkt+.5*windsor
Covariances
           port         fidel        eqmrkt       windsor
port       0.00306760
fidel      0.00280224   0.00320210
eqmrkt     0.00369384   0.00319150   0.00470021
windsor    0.00261967   0.00241087   0.00298922   0.00236580
$$s_{port}^2 = w_1^2 s_{fidel}^2 + w_2^2 s_{eqmarket}^2 + w_3^2 s_{windsor}^2 + 2 w_1 w_2 s_{fidel,eqmarket} + 2 w_1 w_3 s_{fidel,windsor} + 2 w_2 w_3 s_{eqmarket,windsor}$$
.0030676 = (.1)*(.1)*.00320 + (.4)*(.4)*.00470 + (.5)*(.5)*.00236 + 2*[ (.1)*(.4)*.00319 + (.1)*(.5)*.00241 + (.4)*(.5)*.00299 ]
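The same calculation can be checked in a few lines of Python. The weights, variances, and covariances below are copied from the covariance table; the variable names are chosen here for illustration.

```python
# Weights from the slide: port = .1*fidel + .4*eqmrkt + .5*windsor
weights = {"fidel": 0.1, "eqmrkt": 0.4, "windsor": 0.5}

# Variances (diagonal) and covariances (off-diagonal) copied from the table.
variances = {"fidel": 0.00320210, "eqmrkt": 0.00470021, "windsor": 0.00236580}
covariances = {
    ("fidel", "eqmrkt"): 0.00319150,
    ("fidel", "windsor"): 0.00241087,
    ("eqmrkt", "windsor"): 0.00298922,
}

# Three variance terms, then one covariance term (times 2) for each pair.
variance_terms = sum(weights[a] ** 2 * variances[a] for a in weights)
covariance_terms = sum(
    2 * weights[a] * weights[b] * cov for (a, b), cov in covariances.items()
)
port_variance = variance_terms + covariance_terms
print(f"{port_variance:.7f}")
```

The result should agree with the port entry on the diagonal of the covariance table.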
Let us try a portfolio with three stocks. Let us go short on Canada (i.e., we borrow Canada to invest in the other stocks):
port = -.5*canada + usa + .5*honkong
[Figure: Mean versus StDev for port, usa, honkong, and canada.]
Clearly, forming portfolios is an interesting thing to do!
Aside: We can show (using our linear formulas) that all portfolios that can be formed with a given set of assets lie on a hyperbola in mean-s.d. space. Your investments class will call this the "portfolio possibilities curve" or just the "efficient frontier."
Aside: Why would we form portfolios?
Maybe the portfolio has a nice mean and variance (i.e., nice average return and nice risk).
Because portfolio returns are linear combinations of returns on individual assets, we can apply our linear formulas to find the average return and risk of any possible portfolio as long as we know the means, variances, and covariances of the individual asset returns. These formulas are fundamental tools for those who really understand finance.
And remember our "when we take averages, variance gets smaller" idea? In finance, that's known as diversification.
Example (Optional)
Cut from a Finance Textbook:
K inputs (Optional): Suppose
$$y = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3 + \cdots + c_k x_k$$
then,
$$\bar{y} = c_0 + c_1 \bar{x}_1 + c_2 \bar{x}_2 + c_3 \bar{x}_3 + \cdots + c_k \bar{x}_k$$
$$s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + \cdots + c_k^2 s_{x_k}^2 + 2 \left[ \text{the sum of all the different covariance terms times the products of the c's} \right]$$
I won't ask you to do calculations by hand for more than 3 inputs; this is just to give you an idea of what the formulas look like.
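For more than 3 inputs a computer does the bookkeeping. The sketch below is one way to write the K-input variance formula in Python. The function name, the 4 weights, and the covariance matrix are all made up for illustration.

```python
def variance_of_linear_combination(c, cov):
    """Variance of y = c0 + c[0]*x1 + ... + c[k-1]*xk.

    cov is a full, symmetric k-by-k covariance matrix (variances on the diagonal).
    The constant c0 does not affect the variance, so it is not an argument.
    """
    k = len(c)
    variance_part = sum(c[i] ** 2 * cov[i][i] for i in range(k))
    # One covariance term for each different pair (i < j), each multiplied by 2.
    covariance_part = sum(
        2 * c[i] * c[j] * cov[i][j] for i in range(k) for j in range(i + 1, k)
    )
    return variance_part + covariance_part

# Hypothetical 4-input example (all numbers made up).
c = [0.25, 0.25, 0.25, 0.25]
cov = [
    [0.040, 0.010, 0.000, 0.005],
    [0.010, 0.090, 0.020, 0.000],
    [0.000, 0.020, 0.030, 0.002],
    [0.005, 0.000, 0.002, 0.060],
]
var_y = variance_of_linear_combination(c, cov)
num_pairs = len(c) * (len(c) - 1) // 2  # number of different covariance terms
print(var_y, num_pairs)
```

With 4 inputs there are already 6 different pairs, which is why the hand calculation gets tedious quickly.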
7. Linear Regression
This is data on 128 homes (Housing data):
http://faculty.chicagobooth.edu/alan.bester/teaching/data/MidCity.xls
x = size (square feet), y = price (dollars)
[Figure: scatter plot of Price against SqFt.]
Clearly, the data are correlated:
Table of correlations
         SqFt     Price
SqFt     1.000
Price    0.553    1.000
But what is the equation of the line you would draw
through the data?
Linear regression fits a line to the plot.
When I "run a regression" I get values for the intercept and the slope.
y = (intercept) + (slope) * x
Regression coefficients
            Coefficient
Constant    -10091.1299    (intercept)
SqFt            70.2263    (slope)
Here is the scatter plot with the line drawn through it.
[Figure: scatter plot of Price against SqFt with the fitted regression line.]
Looks reasonable!
It turns out the formulas for the slope and the intercept are
$$\text{slope} = \frac{s_{xy}}{s_x^2}$$
$$\text{intercept} = \bar{y} - \text{slope} \cdot \bar{x}$$
We'll see these later when we study regression. But it isn't that hard to see what they do!
The slope formula takes covariance and standardizes it so that its units are (units of y)/(units of x).
The intercept formula makes our line pass through the point (x̄, ȳ).
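The two formulas are short enough to code directly. The Python sketch below uses a handful of made-up sizes and prices (not the MidCity data) and helper names chosen here. It uses the n - 1 divisor for the sample covariance, although the slope comes out the same with either divisor because it cancels in the ratio.

```python
def mean(values):
    return sum(values) / len(values)

def sample_covariance(x, y):
    """Sample covariance with the n - 1 divisor."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)

def fit_line(x, y):
    """Return (intercept, slope) using slope = s_xy / s_x^2."""
    # The covariance of x with itself is the sample variance of x.
    slope = sample_covariance(x, y) / sample_covariance(x, x)
    intercept = mean(y) - slope * mean(x)
    return intercept, slope

# Hypothetical sizes (square feet) and prices (dollars).
sqft = [1500, 1800, 2000, 2300, 2500]
price = [95000, 120000, 128000, 155000, 160000]
intercept, slope = fit_line(sqft, price)

# The fitted line passes through the point of means.
assert abs((intercept + slope * mean(sqft)) - mean(price)) < 1e-6
print(intercept, slope)
```

The slope here is in dollars per square foot, which is exactly the (units of y)/(units of x) described above.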
Regression and Prediction
You have a house on the market with size = 2200 sqft.
Can we predict at what price the house will sell?
[Figure: histogram of Price (in $1,000s).]
mean of Price = $130.4k
s_Price = $26.9k
We might use the sample mean or median as our prediction. But this doesn't take size into account.
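One natural alternative is to plug the size into the fitted line. The sketch below, in Python, uses the intercept and slope from the regression output shown earlier and compares the result with the sample mean. The function and variable names are chosen here for illustration.

```python
# Coefficients as printed in the regression output for the housing data.
INTERCEPT = -10091.1299
SLOPE = 70.2263              # dollars per square foot
MEAN_PRICE = 130.4 * 1000    # sample mean of Price, in dollars

def predict_price(sqft):
    """Predicted price from the line y = (intercept) + (slope) * x."""
    return INTERCEPT + SLOPE * sqft

size = 2200
line_prediction = predict_price(size)
print(f"line: ${line_prediction:,.0f}   mean: ${MEAN_PRICE:,.0f}")
```

The line's prediction moves with the size of the house, while the sample mean gives the same number for every house.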
Summary of Regression
Because they are using other information, the predictions we make are (hopefully!) better in some sense. One of the homework problems asks you to explore this.
Most importantly, though, regression is based on the same concepts (sample means, standard deviations, and covariance) that we've studied in these notes. It's simply a new way to display (and use!) this information.
There's nothing magical or mysterious about linear regression! If you understand the basics well, regression is both intuitive and incredibly useful.
Limitations of Regression
One thing to notice about regression is that it is not symmetric. As we've seen, the sample correlation (or covariance) between x and y is the same as between y and x.
In regression, it matters which variable is on the left-hand side of the "=" (the dependent variable). A regression with y = Size and x = Price gives a different answer.
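A small numerical check makes the asymmetry concrete. The Python sketch below uses made-up sizes and prices (not the MidCity data) and helper names chosen here. It computes the slope both ways and shows that the second slope is not simply 1 over the first. Instead, the two slopes multiply to r squared, which follows from slope = s_xy / s_x^2.

```python
import math

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance with the n - 1 divisor; cov(a, a) is the variance."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (len(a) - 1)

def slope(dependent, explanatory):
    """Slope from regressing `dependent` on `explanatory`."""
    return cov(explanatory, dependent) / cov(explanatory, explanatory)

# Hypothetical data (made up for illustration).
size = [1500, 1800, 2000, 2300, 2500]
price = [95000, 120000, 128000, 155000, 160000]

# The correlation is the same whichever variable comes first.
r = cov(size, price) / math.sqrt(cov(size, size) * cov(price, price))
slope_price_on_size = slope(price, size)   # dollars per square foot
slope_size_on_price = slope(size, price)   # square feet per dollar
print(r, slope_price_on_size, slope_size_on_price, 1 / slope_price_on_size)
```

Only when r is exactly 1 or -1 do the two regressions describe the same line.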
Remember:
Correlation is not causation!
Just because we regress y on x doesn't mean changes in x cause changes in y.
8. Pivot Tables (Optional)
Up till now, we have tried to look at pairs of variables.
Of course, it would be interesting to look at more than two at a time.
The Pivot table utility in Excel uses tables to do this. But the tables can be "more than two way" and you can put a summary for another variable in each cell.
The simple two way tables we looked at earlier were also created using pivot tables.
In each cell is printed the average of the cigs dummy. This gives the proportion of smokers.
The cells are determined by a binned version of age and sex.
In the age group 16-25, 53% of female respondents are smokers.
This table attempts to look at 3 variables at the same time!!
Average of cigs   age
sex               16-25   26-35   36-45   46-55   >56    Grand Total
1                 0.42    0.42    0.37    0.35    0.16   0.28
2                 0.53    0.33    0.28    0.39    0.23   0.29
Grand Total       0.49    0.36    0.32    0.37    0.19   0.29
What do you think is going on here?
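To see what the pivot table is doing under the hood, here is a minimal Python sketch that bins age, groups by (sex, age bin), and averages the cigs dummy in each cell. The respondents are made up for illustration, the bin edges are an assumption read off the column labels, and the function names are chosen here.

```python
from collections import defaultdict

def age_bin(age):
    """Bin ages using the table's column labels (edges are an assumption)."""
    if age <= 25:
        return "16-25"
    if age <= 35:
        return "26-35"
    if age <= 45:
        return "36-45"
    if age <= 55:
        return "46-55"
    return ">56"

def pivot_mean(rows):
    """Average of the cigs dummy in each (sex, age bin) cell."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [sum of cigs, count]
    for sex, age, cigs in rows:
        cell = totals[(sex, age_bin(age))]
        cell[0] += cigs
        cell[1] += 1
    return {key: s / n for key, (s, n) in totals.items()}

# Hypothetical respondents: (sex, age, cigs), where cigs = 1 for a smoker.
respondents = [
    (1, 19, 1), (1, 23, 0), (2, 22, 1), (2, 24, 1),
    (2, 30, 0), (1, 41, 1), (2, 60, 0), (1, 58, 0),
]

table = pivot_mean(respondents)
for key in sorted(table):
    print(key, round(table[key], 2))
```

Because cigs is a 0/1 dummy, each cell average is the share of smokers in that cell.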
Here is the pivot chart.
[Figure: pivot chart of average cigs by sex (1, 2), with one series for each age group (16-25, 26-35, 36-45, 46-55, >56).]
The Hockey Data
We have data on every penalty called in the NHL from 95-96 to 2001-2002. Data below is a subsample of size 5000.
oppcall   timespan   laghome   goaldiff   inrow2   laghomeT   inrowT
0         14.75      0         -1         0        v          one
0         6.90       1         2          0        h          one
1         8.45       1         2          1        h          two
0         11.75      0         0          0        v          one
1         6.30       1         1          0        h          one
1         3.33       1         -1         1        h          two
1         5.93       0         -1         1        v          two
Each row corresponds to a penalty. (Can't have first penalty in game.)
oppcall = 1 if penalty switches, that is, if A is playing B and the last penalty was on B, then oppcall = 1 if this penalty is on A.
timespan = time between penalties (mins)
laghome = 1 if last pen on home team
goaldiff = lead of last penalized team
inrow2 = 1 if last two pens on same team
laghomeT: h if laghome = 1
inrowT: two if inrow2 = 1
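The two text columns are just relabeled dummies. A minimal Python sketch of the recoding is below. The "v" and "one" labels for the 0 case are inferred from the rows in the table, and the function name is chosen here for illustration.

```python
def add_labels(row):
    """Return a copy of a penalty record with text versions of two dummies."""
    labeled = dict(row)
    # "h"/"v" and "two"/"one" follow the labels seen in the data rows.
    labeled["laghomeT"] = "h" if row["laghome"] == 1 else "v"
    labeled["inrowT"] = "two" if row["inrow2"] == 1 else "one"
    return labeled

# First three rows of the subsample shown in the table.
rows = [
    {"oppcall": 0, "timespan": 14.75, "laghome": 0, "goaldiff": -1, "inrow2": 0},
    {"oppcall": 0, "timespan": 6.90, "laghome": 1, "goaldiff": 2, "inrow2": 0},
    {"oppcall": 1, "timespan": 8.45, "laghome": 1, "goaldiff": 2, "inrow2": 1},
]
labeled_rows = [add_labels(r) for r in rows]
for r in labeled_rows:
    print(r["laghomeT"], r["inrowT"])
```

Text labels like these are convenient as row and column headings in a pivot table.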
...
The table attempts to look at 4 variables at one time!!!!
Average of oppcall goaldiff
inrowT l