Statistics Notes 1 Data_Plots and Summaries

8/14/2019 Statistics Notes 1 Data_Plots and Summaries


About this Course

Below is a link to the course website. Please visit and bookmark this site NOW.

faculty.chicagobooth.edu/alan.bester/teaching/

You can also find the course website on Chalk or by Googling "business statistics bester".

Everything you need to know is in the lecture notes. Everything you need for the class is on the course website.

About These Notes

You will find links to data sets, examples, and other things we talk about throughout the notes.

Due to the name change I've had to change all the links from chicagogsb.edu to chicagobooth.edu. If you find one (in the notes or on the website) that doesn't work, try changing "gsb" to "booth" in the URL.

Yes, there are a lot of slides. I like to restate things and limit the number of concepts per slide. This course is actually about a small number of big ideas that we will develop throughout the quarter.

Notes 1: Data: Plots and Summaries

1. Data

2. Looking at a Single Variable
   2.1 Tables
   2.2 Histograms
   2.3 Dotplots
   2.4 Time Series Plots

3. Summarizing a Single Numeric Variable
   3.1 The Mean and Median
   3.2 The Variance and Standard Deviation
   3.3 The Empirical Rule
   3.4 Percentiles, quartiles, and the IQR

4. Looking at Two Variables
   4.1 Categorical variables: the Two-way table
   4.2 Numeric variables: Scatter Plots
   4.3 Relating Numeric and Categorical variables

5. Summarizing Bivariate Relations
   5.1 In Tables
   5.2 Covariance and Correlation

6. Linearly related variables
   6.1 Linear functions
   6.2 Mean and variance of a linear function
   6.3 Linear combinations
   6.4 Mean and variance of a linear combination

7. Linear Regression

8. Pivot Tables (Optional)

1. Data

Here is some data (our sample):

age  sex  soc  edu  Reg  inc  cola  restE  juice  cigs  antiq  news  ender  friend  simp  foot
 67    2    3    1    3   12     1      0      1     0      1     0      0       0     0     0
 51    2    3    8    3   10     1      1      0     1      1     0      1       1     0     0
 63    2    3    1    2   13     1      1      0     1      1     0      1       0     0     0
 45    2    4    3    1   18     1      1      1     0      1     0      0       0     0     0
  .
  .
  .    (many more rows!)

The data is from a large survey carried out by a marketing research company in Britain. (Marketing data: http://faculty.chicagobooth.edu/alan.bester/teaching/data/bmrbxl.xls)

Each row corresponds to a household.
Each column corresponds to a different feature of the household.
The features are called variables.
The rows are called observations.

Most data sets come in this form.

A rectangular array.

Rows are observations. Columns are variables.

Variables are the fundamental object in statistics. They come in several types.

The variable labeled "age" is simply the age (in years) of the responder.

This is a numeric variable. This variable has units, and averages are interpretable.

In contrast, the variable "Reg" is the geographical region of the household. Each "number" is really just a code for a region:

1 "Scotland"
2 "North West"
3 "North"
4 "Yorkshire & Humberside"
5 "East Midlands"
6 "East Anglia"
7 "South East"
8 "Greater London"
9 "South West"
10 "Wales"
11 "West Midlands"

A variable like Reg is called categorical.

Think of:
numeric vs. categorical
quantitative vs. qualitative

Instead of using numbers we could have used text strings in the data file. That is, instead of

Reg:
3
3
2
1
.
.

we could have

Reg:
North
North
North_West
Scotland
.
.

But it is extremely common to use numeric codes.

Another example: Which Democratic candidate do you support?

1 = Hillary Clinton, 2 = John Edwards, 3 = Barack Obama, 4 = Bill Richardson
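As a small illustration of how numeric codes and text labels relate, here is a sketch that translates the Reg codes back into region names. The name REGION_LABELS is an assumption made for this example; it is not part of the course data files.

```python
# Hypothetical lookup table built from the Reg codes listed in these notes.
REGION_LABELS = {
    1: "Scotland", 2: "North West", 3: "North", 4: "Yorkshire & Humberside",
    5: "East Midlands", 6: "East Anglia", 7: "South East", 8: "Greater London",
    9: "South West", 10: "Wales", 11: "West Midlands",
}

# Reg values of the first four households in the sample.
reg_codes = [3, 3, 2, 1]

# Replace each numeric code by its text label.
reg_text = [REGION_LABELS[code] for code in reg_codes]
print(reg_text)
```

Either representation carries the same information. The codes are just more compact to store, which is one reason they are so common.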



The variable soc is categorical. It takes on codes 1-6, with meanings:

1 "A"

2 "B"

3 "C1"

4 "C2"

5 "D"

6 "E"

This is an ordered categorical variable. You can't think of it as a numerical measure but A < B < ... < E. (A is actually the lowest social grade)

Soc is ordered like age, but does not have units. It does not really make sense to compute the difference or to average two soc measurements. It does make sense to difference two ages.

That pretty much covers it. Variables are either numeric, categorical, or ordered categorical.

Of course a numeric variable is always ordered.

For numeric variables we also have:

A variable is discrete if you can list its possible values.

Otherwise it is called continuous.

For example, the amount of rainfall in the City of Chicago this month is usually thought of as being continuous.

As a practical matter, any variable is discrete since we put it in the computer. What it comes down to is, if there are a lot of possible values, we think of it as continuous. (This is not really that important now; it will be later when we get to probability.)

For example, you might think of age as continuous even though we measure it in years and can easily list its possible values.

Number of children is more likely to be thought of as discrete.

Again, a good rule when working with a numeric variable is to keep in mind the units in which it is measured.

For example, age has units of years.

Percentages, which are numeric, don't have units.

But there are always units somewhere. For example, if we look at the percentage of income a household spends on entertainment, we are looking at one quantity measured in units of currency divided by another.

Here are the definitions of all the variables in the survey data set:

age: age in years
sex: 1 means male, 2 means female
soc: we saw this
edu: education, terminal age of education

1 "14 Or Under"

2 "15"

3 "16"

4 "17"

5 "18"

6 "19"

7 "20"

8 "21 - 23"

9 "24 Or Over"

Reg: we saw this.



inc: income

VARIABLE LABELS V_842 "Total Family Income Before Tax".
VALUE LABELS V_842

1 "1,999 Or Less"
2 "2,000 - 2,999"
3 "3,000 - 3,999"
4 "4,000 - 4,999"
5 "5,000 - 5,999"
6 "6,000 - 6,999"
7 "7,000 - 7,999"
8 "8,000 - 8,999"
9 "9,000 - 9,999"
10 "10,000 - 10,999"
11 "11,000 - 11,999"
12 "12,000 - 14,999"
13 "15,000 - 19,999"
14 "20,000 - 24,999"
15 "25,000 - 29,999"
16 "30,000 - 34,999"
17 "35,000 - 39,999"
18 "40,000 - 49,999"
19 "50,000 Or Over"
20 "Not Stated"

Note:

Both edu and inc could have been numeric, but are broken down into ranges. They are thus ordered categorical.

This is extremely common; with income there are actually good reasons for doing this!

cola, restE, juice, cigs indicate use of a product category.

1 if you use it, 0 if you don't.

This is called a dummy variable. 1 indicates something "happened", 0 if not.

So, cigs=1 means you purchase cigarettes. restE means "restaurants in the evening".

This is extremely common. Often in statistics we are interested in "does something happen?".

Another example is approval ratings (1=approve). We will work with a lot of dummy variables this quarter.

The rest of the variables in the marketing data represent tv shows. They are dummies: 1 if you watch, 0 if you don't.

antiq: antiques roadshow
news: bbc news
enders: east enders
friend: friends
simp: simpsons
foot: "football" (soccer)

A dummy variable can take on two values, 0 or 1. We use dummy variables to indicate something: 1 if that something happened, 0 if it did not.

Now we can see that there are three types of variables in the data set.

(i) Demographics: age through income
(ii) Product category usage
(iii) Media exposure (tv shows)

What is the point? Why collect this data?

We want to see how product usage relates to demographics. What kind of people drink colas?

We want to see how the media relates to product usage so that we can select the appropriate media to advertise in. If friends viewers tend to drink colas, that might be a good place to advertise your cola.

Important Note:

You can always take a numeric variable and make it an ordered categorical variable by using bins.

For example, instead of treating age as a numeric variable it is common to break it into ranges, for example:

0-20: a1
21-30: a2
31-40: a3
41-50: a4
51-60: a5
61-70: a6
>70: a7

The simplest case is a dummy variable:

$$d = \begin{cases} 1 & \text{if } x > a \\ 0 & \text{if } x \le a \end{cases}$$

where x is numeric.

For example, you could define someone to be "old" if older than 40 and "young" otherwise.

d=1 then means "old" and d=0 means "young".
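To make the binning idea concrete, here is a minimal sketch that turns a numeric age into the "old"/"young" dummy and into the ordered categories a1 through a7. The function names age_dummy and age_bin are assumptions made for this illustration.

```python
def age_dummy(age, cutoff=40):
    """Return 1 ("old") if age is above the cutoff, otherwise 0 ("young")."""
    return 1 if age > cutoff else 0


def age_bin(age):
    """Map an age to the ordered categories a1..a7 from the bin example."""
    upper_bounds = [20, 30, 40, 50, 60, 70]  # upper ends of a1..a6
    for k, upper in enumerate(upper_bounds, start=1):
        if age <= upper:
            return f"a{k}"
    return "a7"  # anything over 70


ages = [67, 51, 63, 45]  # ages of the first four households in the sample
dummies = [age_dummy(a) for a in ages]
bins = [age_bin(a) for a in ages]
print(dummies, bins)
```

Note that binning throws information away: once age is stored as a5 we can no longer tell a 51 year old from a 60 year old.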



2. Looking at a Single Variable

The most interesting thing in statistics is understanding how variables relate to each other.

"Friends watchers tend to drink colas".

"Smokers tend to get cancer".

But it is still very important to get a sense of what variables are like on their own.

Note: We'll use the term distribution informally to talk about what a variable looks like (what does a typical value look like, how spread out are its values, etc.). We will use the term more formally when we study probability.

2.1 Tables

To look at a categorical variable we use a table:

soc   count
1     28
2     151
3     310
4     235
5     156
6     120

We simply count how many of each category we have.

Note: We have 1000 observations total, so the numbers in this table must add to 1000.

How to make this table: http://faculty.chicagobooth.edu/alan.bester/teaching/notes/n1_counttable.htm
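The course does this in Excel. As an alternative sketch, the same kind of count table can be built in Python. The ten soc values below are made up for illustration and are not the survey data.

```python
from collections import Counter

# Hypothetical soc codes for ten households (not the real survey column).
soc = [3, 3, 3, 4, 2, 5, 3, 4, 6, 1]

# Count how many observations fall in each category.
counts = Counter(soc)

print("soc count")
for code in sorted(counts):
    print(code, counts[code])

# The counts must add up to the number of observations.
total = sum(counts.values())
```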



2.2 Histograms

We take a numeric variable, break it down into categories, and then plot the table as on the previous slide. Remember, the height of each bar = # of observations or frequency in that category.

[Figure: Histogram for age; horizontal axis: Category; vertical axis: frequency]

Category 35-40 means (35,40], that is, greater than 35 and at most 40.
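Here is a short sketch of the counting step behind a histogram, using the same right-closed convention, so that 40 falls in (35,40] and not in (40,45]. The helper bin_label and the bin width of 5 are assumptions for this example, and the last two ages are made up.

```python
import math
from collections import Counter


def bin_label(value, width=5):
    """Return the right-closed bin (lo, hi] of the given width containing value."""
    hi = math.ceil(value / width) * width
    lo = hi - width
    return (lo, hi)


# First four ages from the survey sample plus two made-up values.
ages = [67, 51, 63, 45, 40, 36]

# Frequency of each bin; these are the bar heights of the histogram.
freq = Counter(bin_label(a) for a in ages)
for (lo, hi), count in sorted(freq.items()):
    print(f"({lo},{hi}]: {count}")
```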



[Figure: Histogram for InterarrivalTime]

Here's a histogram of monthly hedge fund returns from 1994 to 2005. Notice anything interesting?

[Figure: histogram of monthly hedge fund returns]

Source: Nicolas P. B. Bollen and Veronika K. Pool, "Do Hedge Fund Managers Misreport Returns? Evidence from the Pooled Distributions"; original data from Center for International Securities and Derivatives Markets, University of Massachusetts

Aside: Histograms can be displayed in different ways

The observations here are starting players in the NFL (on offense). The numbers on the vertical axis correspond to rounds of the NFL draft, while the length of each blue bar is the percentage of starting players drafted at that position (forget the red bars). The plots on the right show only quarterbacks and fullbacks. (Source: http://www.footballoutsiders.com/2006/04/24/ramblings/nfl-draft/3828/)

"Aside" or "Optional" on a slide means you are not responsible for the material on that slide on an exam!

Don't worry, all of our histograms will be like the previous two slides.

2.3 Dotplots

It can be a hassle choosing the bins for a numeric variable.

For discrete variables and/or small data sets, we can just put a dot on the number line for each value.

(Beer data: http://faculty.chicagobooth.edu/alan.bester/teaching/data/beer.dat)

nbeerm: the number of beers male MBA students claim they can drink without getting drunk
nbeerf: same for females

Note (1): Unfortunately StatPro doesn't do dotplots. The dotplots in these slides were done in Minitab.

Note (2): The beer data is text, not Excel format. Use Text to Columns.

.

: :: :

. . : : : :

. . : . : : :.: : : : . .

+---------+---------+---------+---------+---------+-------

nbeerm

. .. . : : .

+---------+---------+---------+---------+---------+-------

nbeerf

0.0 4.0 8.0 12.0 16.0 20.0

Generally the males claim they can drink more: their numbers are centered or located at larger values.

Note: The dot plot is giving you the same kind of information as the histogram.

We call a point like the one far to the right in nbeerm an outlier.

2.4 Time Series Plots

The survey data is what we call cross-sectional. The households in our survey are a (hopefully representative) cross section of all British households at a particular point in time.

In cross-sectional data, order doesn't matter. We can sort our households by age, social grade, etc. and none of our results change as long as we keep each row intact.

Other examples would be samples where every row corresponded to a firm, a plant, a machine...

With a time series, each observation corresponds to a point in time.

Daily data on the Dow Jones index: (Dow data: http://faculty.chicagobooth.edu/alan.bester/teaching/data/DJI.xls)

Date       Open      High      Low       Close     Volume
1-May-00   10749.4   11001.3   10622.2   10811.8   9663000
2-May-00   10805.6   10932.5   10580.7   10731.1   10115000
3-May-00   10732.2   10754.4   10345.2   10480.1   9916000
4-May-00   10478.9   10631.5   10293.1   10412.5   9258000
.
.
.

For time series data, the order of observations matters. (1-May-00 comes before 2-May-00, etc.)

The easiest way to visualize time series data is often simply to plot the series in time order.

Time series plot of the close series.

[Figure: Time series plot of Close; horizontal axis: Date, 5/1/2000 to 4/1/2002; vertical axis: Close, 7800 to 11400]

How to make this plot: http://faculty.chicagobooth.edu/alan.bester/teaching/notes/n1_tsplot.htm

We could have data at various frequencies: daily, monthly, quarterly, annual.

The kinds of patterns you will uncover can be very different depending on the frequency of the data.

A current hot topic of research at Booth is "high frequency data".

Monthly US beer production.

[Figure: time series plot of b_prod against Index]

Do you see a pattern?

Would we see this pattern if we looked at annual data?

Time series plot of monthly returns on a portfolio of Canadian assets: (Country Portfolio returns: http://faculty.chicagobooth.edu/alan.bester/teaching/data/conret.xls)

[Figure: time series plot of canada against Index]

On the vertical axis we have returns. On the horizontal axis we have time.

Do you see a pattern?

Here is the histogram of the Canadian returns.

[Figure: two histograms of canada drawn with different numbers of bins; horizontal axis: canada; vertical axis: Frequency]

Notes:

(i) The histogram does not depend on the time order.

(ii) The appearance of the histogram depends on the number of bins. Too many bins make the histogram appear spiky.

Be careful. What pattern do you see in this series?

How about now?

Taken from David Greenlaw, Jan Hatzius, Anil Kashyap, and Hyun Shin, US Monetary Policy Forum Report No. 2, 2008 (http://faculty.chicagogsb.edu/anil.kashyap/research/MPFReport-final.pdf)

Time series plots are also used to compare patterns across different variables over time, and sometimes to see the impact of past events (be very careful there, too).

From same paper as the previous slide.

3. Summarizing a Single Numeric Variable

We have looked at graphs. Suppose we are now interested in having numerical summaries of the data rather than graphical representations.

Two important features of any numeric variable are:

1) What is a typical or average value?

2) How spread out or variable are the values?

The mean and median capture a typical value. The variance/standard deviation capture the spread.

For example, we saw that the men tend to claim they can drink more.

How can we summarize this?

.

: :

: :

. . : : : :

. . : . : : :.: : : : . .

+---------+---------+---------+---------+---------+-------nbeerm

. .. . : : .

+---------+---------+---------+---------+---------+-------

nbeerf

0.0 4.0 8.0 12.0 16.0 20.0



Monthly returns on Canadian portfolio and Japanese portfolio.

They seem to be centered roughly at the same place but Japan has more spread.

How can we summarize this?

3.1 The Mean and Median

We will need some notation.

Suppose we have n observations on a numeric variable which we call "x".

$$x_1, x_2, x_3, \ldots, x_n$$

Here $x_1$ is the first number and $x_n$ is the last number. n is the number of numbers, or the number of observations. You may also hear it referred to as the sample size.

$x_i$ is the value of x associated with the ith observation (row).

Here, x is just a name for the set of numbers; we could just as easily use y. In a real data set we would use a meaningful name like "age".

x
5    (this is x1)
2
8    (this is x3)
6
2

n=5

Sometimes the order of the observations means something.

In our return data the first observation corresponds to the first time period. In the survey data, the order did not matter.

The sample mean is just the average of the numbers x:

$$\bar{x} = \frac{\text{sum}}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

We often use the symbol $\bar{x}$ to denote the mean of the numbers x.

We call it "x bar".

Here is a more compact way to write the same thing.

Consider

$$x_1 + x_2 + \cdots + x_n$$

We use a shorthand for it (it is just notation):

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$$

This is summation notation.

Using summation notation we have:

The sample mean:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
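The summation sign is just a loop. As a sketch, here is the sample mean computed step by step for the five numbers 5, 2, 8, 6, 2 used to illustrate the notation. The variable names are choices made for this example.

```python
x = [5, 2, 8, 6, 2]  # the five example numbers, so n = 5
n = len(x)

# The summation: for each i, add x_i to a running total.
# Python counts positions from 0, so x[0] plays the role of x_1.
total = 0
for i in range(n):
    total += x[i]

# The sample mean is the sum divided by the number of observations.
x_bar = total / n
print(total, x_bar)
```

In practice one would write sum(x) / len(x); the explicit loop is only there to mirror the notation.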



Graphical interpretation of the sample mean

Here are the dot plots of the beer data for women and men.

Which group claims to be able to drink more?

Character Dotplot

. . . . : : .
+---------+---------+---------+---------+---------+-------nbeerf

.
: :
: :
. . : : : :
. . : . : : : . : : : : .
+---------+---------+---------+---------+---------+-------nbeerm

0.0 2.5 5.0 7.5 10.0 12.5

In some sense, the men claim to drink more. To summarize this we can compute the average value for each group (men and women).

Note: I deleted the outlier, I do not believe him!

Mean of nbeerf = 4.2222

Mean of nbeerm = 7.8625

Character Dotplot

. . . . : : .

+---------+---------+---------+---------+---------+-------nbeerf

.

: :: :

. . : : : :

. . : . : : : . : : : : .

+---------+---------+---------+---------+---------+-------nbeerm

0.0 2.5 5.0 7.5 10.0 12.5

On average women claim they can drink 4.2 beers. Men claim they can drink 7.9 beers.

In the picture, I think of the mean as the center of the data.

How to calculate these means: http://faculty.chicagobooth.edu/alan.bester/teaching/notes/n1_beerexample.htm

More on summation notation (take this as an aside)

Let us look at summation in more detail.

$\sum_{i=1}^{n} x_i$ means that for each value of i, from 1 to n, we add to the sum the value indicated, in this case $x_i$.

To understand how it works let us consider some examples.

x      y      year
0.07   0.11   1
0.06   0.05   2
0.04   0.09   3
0.03   0.03   4

Think of each row as an observation on both x and y. To make things concrete, think of each row as corresponding to a year and let x and y be annual returns on two different assets.

In year 1 asset x had return 7%. In year 4 asset y had return 3%.

(Here, we do not sum over all observations: we sum only over the second and the third observation.)

Compute x bar.

Compute y bar.

For each value of i, we can add in anything we want:

$$\sum_{i=1}^{4} (x_i - \bar{x})(y_i - \bar{y}) = (.02)(.04) + (.01)(-.02) + (-.01)(.02) + (-.02)(-.04)$$

How to do these calculations using Excel: http://faculty.chicagobooth.edu/alan.bester/teaching/notes/n1_ssfunc.htm
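The notes do these calculations in Excel. As an alternative sketch, here are the same kinds of sums in Python for the four years of returns on assets x and y. The variable names partial and cross are assumptions for this example.

```python
x = [0.07, 0.06, 0.04, 0.03]  # annual returns on asset x, years 1 to 4
y = [0.11, 0.05, 0.09, 0.03]  # annual returns on asset y, years 1 to 4
n = len(x)

# Sample means ("x bar" and "y bar").
x_bar = sum(x) / n
y_bar = sum(y) / n

# A sum over only the second and third observations.
# Python slices count from 0, so x[1:3] picks out x_2 and x_3.
partial = sum(x[1:3])

# For each i we can add in anything we want, for example the product
# of the two deviations from the means.
cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

print(x_bar, y_bar, partial, cross)
```

The values come out as 0.05, 0.07, 0.10 and 0.0012, up to floating point rounding.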



The median

After ordering the data, the median is the middle value of the data. If there is an even number of data points, the median is the average of the two middle values.

Example

1,2,3,4,5      Median = 3
1,1,2,3,4,5    Median = (2+3)/2 = 2.5

Mean versus median

Although both the mean and the median are good measures of the center of a distribution of measurements, the median is less sensitive to extreme values.

The median is not affected by extreme values since the numerical values of the extreme measurements are not used in its computation, only their position in the ordering.

Example

1,2,3,4,5      Mean: 3     Median: 3
1,2,3,4,100    Mean: 22    Median: 3
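As a quick check of these examples, here is a sketch using Python's standard statistics module (the course itself uses Excel).

```python
import statistics

a = [1, 2, 3, 4, 5]
b = [1, 2, 3, 4, 100]   # same as a, but with one extreme value
c = [1, 1, 2, 3, 4, 5]  # an even number of data points

# Replacing 5 by 100 moves the mean a lot but leaves the median alone.
mean_a, median_a = statistics.mean(a), statistics.median(a)
mean_b, median_b = statistics.mean(b), statistics.median(b)

# With an even number of points the median averages the two middle values.
median_c = statistics.median(c)

print(mean_a, median_a)
print(mean_b, median_b)
print(median_c)
```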



If data is right skewed the mean will be bigger than the median. You can think of this as the extreme right tail observations pulling the mean upward.

For the bank interarrival data:

Summary measures for selected variables

          InterarrivalTime
Mean      4.163
Median    2.779

[Figure: Histogram for InterarrivalTime]

Median or Mean?

At Booth professors are rated by students from 1-5 in several categories. In the past only the mean rating was reported.

Some faculty members believe the median should be reported instead. This was actually a major debate at a faculty meeting a few years ago.

What difference would this make?

In fact, Booth now reports the mean and median, along with a histogram of all the ratings!

The Mean of a Dummy Variable

Consider the "simpson" variable in the survey data set. Does it make sense to take the mean?

Summary measures for selected variables

         simpsons
Count    1000.000
Mean     0.181

The sum of the 1's and 0's will equal the number of respondents who watch the simpsons.

So the mean is the fraction of respondents who watch.

So, in general, the average of a dummy gives the fraction (percentage) of times that whatever dummy=1 signals happens.

Another example: if a poll is conducted about a particular candidate where 1=approval, 0=disapproval, then the sample mean is the candidate's approval rating.

This may seem obvious, but we will get a lot of use out of this idea throughout the quarter.
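A tiny sketch of this idea, using ten made-up 0/1 responses (not the survey data):

```python
# Hypothetical dummy variable: 1 if the respondent watches, 0 if not.
simp = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# The sum of a dummy counts how many times the event happened.
watchers = sum(simp)

# The mean of a dummy is the fraction of times the event happened.
fraction = watchers / len(simp)

print(watchers, fraction)
```

In the survey data the same calculation gives 0.181, that is, 181 of the 1000 respondents watch.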


3.2 The Variance and Standard Deviation

The mean and the median give us information about the central tendency of a set of observations, but they shed no light on the dispersion, or spread, of the data.

Example: Which data set is more variable?

5,5,5,5,5    Mean: 5
1,3,5,8,8    Mean: 5

If these were portfolio returns (in percent), means are average returns. What else might we want to measure?

The Sample Variance

. . . .

-+---------+---------+---------+---------+---------+-----x

. . . .

-+---------+---------+---------+---------+---------+-----y

0.030 0.045 0.060 0.075 0.090 0.105

The y numbers are more spread out than the x numbers. We want a numerical measure of variation or spread.

The basic idea is to view variability in terms of distance between each measurement and the mean:

$$x_i - \bar{x}$$

. . . .

-+---------+---------+---------+---------+---------+-----x

. . . .

-+---------+---------+---------+---------+---------+-----y

0.030 0.045 0.060 0.075 0.090 0.105

Overall, the distances from the mean for x are smaller than those for y.

We cannot just look at the distance between each measurement and the mean. We need an overall measure of how big the differences are (i.e., just one number like in the case of the mean).

Also, we cannot just sum the individual distances because the negative distances cancel out with the positive ones giving zero always (Why?).

The average squared distance would be

$$\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

So, the sample variance of the x data is defined to be:

Sample variance:

$$s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

We use n-1 instead of n for technical reasons that will be discussed later (and because Excel does it this way).

Think of it as the average squared distance of the observations from the mean.

Questions

1) What is the smallest value a variance can be?

2) What are the units of the variance?

It is helpful to have a measure of spread which is in the original units. The sample variance is not in the original units. We now introduce a measure of dispersion that solves this problem: the sample standard deviation.

The sample standard deviation

It is defined as the square root of the sample variance (easy).

The sample standard deviation:

$$s_x = \sqrt{s_x^2}$$

The units of the standard deviation are the same as those of the original data.
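To see the definitions at work, here is a sketch that computes the sample variance and standard deviation (with the n-1 divisor) for the two small data sets from the "which data set is more variable?" example. The function names sample_variance and sample_sd are choices made for this illustration.

```python
import math
import statistics


def sample_variance(x):
    """Sum of squared distances from the mean, divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    return sum((xi - x_bar) ** 2 for xi in x) / (n - 1)


def sample_sd(x):
    """Square root of the sample variance, in the original units."""
    return math.sqrt(sample_variance(x))


flat = [5, 5, 5, 5, 5]    # no spread at all
spread = [1, 3, 5, 8, 8]  # same mean, more spread

print(sample_variance(flat), sample_sd(flat))
print(sample_variance(spread), sample_sd(spread))

# Python's statistics.variance also uses the n - 1 divisor.
check = statistics.variance(spread)
```

Both data sets have mean 5, but the first has variance 0 while the second has variance 38/4 = 9.5.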



The sample standard deviation for the y data is bigger than that for the x data.

This numerically captures the fact that y has more variation about its mean than x.

Example 2 (graphical)

Character Dotplot

.
:
: :
:: :
.::: :.:
: : :::: ::::
::: :::: :::: :::
. : :::: :::: :::: :::.
-----+---------+---------+---------+---------+---------+-canada

. .
::. . : .
. ::: .:: :.: .
: ::: .::: :::: : :.
. .. .. :.:: :::: :::: :::: : :: : : . : .
-----+---------+---------+---------+---------+---------+-japan

-0.160 -0.080 0.000 0.080 0.160 0.240

Variable   N     Mean      StDev
canada     107   0.00907   0.03833
japan      107   0.00234   0.07368

The standard deviations measure the fact that there is more spread in the Japanese returns.

3.3 The Empirical Rule

We now have two numerical summaries for the data:

$\bar{x}$: where the data is
$s_x$: how spread out, how variable the data is

The mean is pretty easy to interpret (some sort of center of the data).

We know that the bigger $s_x$ is, the more variable the data is, but how do we really interpret this number?

What is a big $s_x$, what is a small one?

The empirical rule will help us understand $s_x$ and relate the numerical summaries back to our plots.

Empirical Rule

For mound shaped data:

Approximately 68% of the data is in the interval

$$(\bar{x} - s_x,\ \bar{x} + s_x) = \bar{x} \pm s_x$$

Approximately 95% of the data is in the interval

$$(\bar{x} - 2s_x,\ \bar{x} + 2s_x) = \bar{x} \pm 2s_x$$

We can see this on a histogram of the Canadian returns.

$\bar{x} = .00907$
$s_x = .03833$

[Figure: Histogram for canada, with dotted lines at $\bar{x} - s_x$ and $\bar{x} + s_x$ and dashed lines at $\bar{x} - 2s_x$ and $\bar{x} + 2s_x$]

The empirical rule says that roughly 95% of the observations are between the dashed lines and roughly 68% between the dotted lines.

Looks reasonable.

[Figure: time series plot of canada against Index, with horizontal lines at $\bar{x}$, $\bar{x} + 2s_x$, and $\bar{x} - 2s_x$]

Same thing viewed from the perspective of the time series plot.

n=108, so 5% outside would be about 5 points.

There are 4 points outside, which is pretty close.
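The intervals are easy to compute from the two summaries. Below is a sketch using the reported mean and standard deviation of the Canadian returns. The helper fraction_within is an assumption for this example: it shows how one could check the rule against the raw data, which is not reproduced here.

```python
x_bar = 0.00907  # sample mean of the Canadian returns
s_x = 0.03833    # sample standard deviation of the Canadian returns

# Roughly 68% of mound shaped data should fall in the first interval,
# roughly 95% in the second.
one_sd = (x_bar - s_x, x_bar + s_x)
two_sd = (x_bar - 2 * s_x, x_bar + 2 * s_x)
print(one_sd, two_sd)


def fraction_within(data, center, spread, k):
    """Fraction of data strictly inside center +/- k * spread."""
    lo, hi = center - k * spread, center + k * spread
    return sum(lo < v < hi for v in data) / len(data)


# With n = 108 observations, 5% outside the two-SD band is about 5 points.
expected_outside = 0.05 * 108
print(expected_outside)
```

With the full return series loaded into a list, fraction_within(returns, x_bar, s_x, 2) would give the observed share inside the dashed lines.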



A little finance: comparing mutual funds

Let us use the means and standard deviations to compare mutual funds. For 9 different assets we compute the means and standard deviations. Later, we plot the means versus the standard deviations.

The assets are:

#C1 - R22 Drefus (growth)
#C2- R30 Fidelity Trend fund (growth)
#c3- R55 Keystone Speculative fund (max capital gain)
#c4- R92 Putnam Income Fund (income)
#c5- R99 Scudder Income
#c6- R129 Windsor Fund (growth)
#c7- equally weighted market
#c8- value weighted market
#c9- tbill rate

# sample period monthly returns 1:68 - 12-82

Variable   N     Mean      StDev
drefus     180   0.00677   0.04724
fidel      180   0.00470   0.05659
keystne    180   0.00654   0.08424
Putnminc   180   0.00552   0.03008
scudinc    180   0.00443   0.03597
windsor    180   0.01002   0.04864
eqmrkt     180   0.01082   0.06856
valmrkt    180   0.00681   0.04800
tbill      180   0.00598   0.00252

The speculative fund (keystne) has a higher mean and standard deviation than the income fund (Putnminc).

Later we'll see how to look at this information graphically.

3.4 Percentiles, quartiles, and the IQR

Again, this just applies to numeric variables.

The 10th percentile is the number such that 10% of the values are less than it and 90% are bigger.

The median is the 50th percentile.

Percentiles are also known as quantiles.

95th percentile, .95 quantile, and 95% quantile all mean the same thing.


For the age variable in the survey data:

                 age
Count            1000.000
5th percentile   25.000

5% of the 1000 age values are less than 25.

90% of people in the sample are less than 71 years old.

5% of the people in the sample are over 75 years of age.

For now don't worry about "strictly less than" vs. "less than or equal to".

The first, second, and third quartiles are the 25th, 50th, and 75th percentiles.

The interquartile range is the difference between the third and first quartile.

                      age
Count                 1000.000
Mean                  48.312
Median                48.000
Standard deviation    15.718
Variance              247.062
First quartile        35.000
Third quartile        60.000
Interquartile range   25.000

The interquartile range is used as a measure of spread (IQR is to variance as median is to mean).
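Here is a sketch of the quartile and IQR calculation on a small made-up data set, using Python's statistics.quantiles. Different software packages use slightly different interpolation conventions for percentiles, so results on real data may differ a little from Excel's; the "inclusive" method is just the choice made for this example.

```python
import statistics

# Hypothetical data: nine ordered values.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 cuts the data into four parts and returns the three cut points:
# first quartile, median, third quartile.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# The interquartile range is the distance between the outer quartiles.
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

For the age variable in the survey summary, the same subtraction gives 60 - 35 = 25.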



[Figure: Histogram for age; horizontal axis: Category; first quartile = 35 years marked]

We can interpret quantiles graphically on the histogram. 25% of the area of the colored bars is to the left of the first quartile.

The empirical rule is actually a statement about quantiles.

What does it say? For a variable with a mound shaped histogram:

What quantile is two standard deviations below the mean? 2.5%

What quantile is one standard deviation above? 84%

To see this yourself, draw the picture! We'll learn later that the empirical rule is based on a very important probability model.

[Figure 3. Indexed Real Wages for Men by Percentile, 1967-1997; horizontal axis: Year; series: 10th Percentile (o), 50th Percentile (+), 90th Percentile]

Aside: We won't use percentiles much in this class, but above is an interesting time series plot of the 90th (top line), median (middle line), and 10th percentiles of real wages in the U.S. from the late 1960s to late 1990s. This widening income gap is a major concern for economists... or is it? (http://freakonomics.blogs.nytimes.com/2008/05/19/shattering-the-conventional-wisdom-on-growing-inequality/)

Source: Murphy, Kevin and Finis Welch, "Wage Differentials in the 1990s: Is the Glass Half-full or Half-empty?"

4. Looking at Two Variables

While it is important to look at variables one at a time, many interesting business problems concern how two (or more) variables are related to each other.

4.1 Categorical variables: the Two-way Table

Let's look at the relationship between two categorical variables, x and y.

If x has two categories and y has two as well, then there are four categories using both x and y.

We can then just count the number of observations in each category.

If x has r1 categories and y has r2, then we have r1*r2 possibilities. We can arrange these possibilities in a two-way table.

This is the two-way table relating viewership of the simpsons with cola use. 146 of the 1000 view simpsons and consume colas.

Raw counts:

              simpsons
colas         0        1        Grand Total
0             387      35       422
1             432      146      578
Grand Total   819      181      1000

Percent of total:

              simpsons
colas         0        1        Grand Total
0             38.70%   3.50%    42.20%
1             43.20%   14.60%   57.80%
Grand Total   81.90%   18.10%   100.00%

Percent of column:

              simpsons
colas         0        1        Grand Total
0             47%      19%      42%
1             53%      81%      58%
Grand Total   100%     100%     100%

Percent of row:

              simpsons
colas         0        1        Grand Total
0             92%      8%       100%
1             75%      25%      100%
Grand Total   82%      18%      100%

How to make these tables: http://gsbwww.uchicago.edu/fac/alan.bester/teaching/notes/n1_2waytable.htm

A picture of the table:
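The three percentage tables all come from the raw counts; they differ only in what each count is divided by. A sketch in Python follows, where the dictionary layout with keys (colas, simpsons) is an assumption made for this example.

```python
# Raw counts from the two-way table; key = (colas, simpsons).
counts = {(0, 0): 387, (0, 1): 35, (1, 0): 432, (1, 1): 146}

total = sum(counts.values())

# Row totals (by colas) and column totals (by simpsons).
row_tot = {c: counts[(c, 0)] + counts[(c, 1)] for c in (0, 1)}
col_tot = {s: counts[(0, s)] + counts[(1, s)] for s in (0, 1)}

# Divide each cell by the grand total, its column total, or its row total.
pct_total = {k: 100 * v / total for k, v in counts.items()}
pct_col = {k: 100 * v / col_tot[k[1]] for k, v in counts.items()}
pct_row = {k: 100 * v / row_tot[k[0]] for k, v in counts.items()}

# Among simpsons viewers, what percent consume colas?
print(round(pct_col[(1, 1)]))
# Among cola consumers, what percent view the simpsons?
print(round(pct_row[(1, 1)]))
```

The column percentages answer "what do simpsons viewers drink?", while the row percentages answer "what do cola drinkers watch?". These are different questions with different answers.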


87/178

87

[Figure: bar chart of counts for simpsons = 0 and simpsons = 1, split by colas (0 or 1); vertical axis from 0 to 900.]

A much higher fraction of the Simpsons viewers consumes colas.
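The notes build these tables with a spreadsheet. As a rough sketch only (the function name `two_way_percentages` and the nested-dictionary layout are illustrative assumptions, not something from the course materials), the three percentage tables can be reproduced from the raw counts like this:

```python
# Raw counts from the two-way table: outer key is colas (0, 1),
# inner key is simpsons (0, 1).
counts = {0: {0: 387, 1: 35}, 1: {0: 432, 1: 146}}


def two_way_percentages(counts):
    """Return (percent of total, percent of column, percent of row) as fractions."""
    rows = sorted(counts)
    cols = sorted(counts[rows[0]])
    row_tot = {r: sum(counts[r].values()) for r in rows}
    col_tot = {c: sum(counts[r][c] for r in rows) for c in cols}
    grand = sum(row_tot.values())
    pct_total = {r: {c: counts[r][c] / grand for c in cols} for r in rows}
    pct_col = {r: {c: counts[r][c] / col_tot[c] for c in cols} for r in rows}
    pct_row = {r: {c: counts[r][c] / row_tot[r] for c in cols} for r in rows}
    return pct_total, pct_col, pct_row


pct_total, pct_col, pct_row = two_way_percentages(counts)

# Share of all respondents who both drink cola and watch the Simpsons.
print(f"percent of total (colas=1, simpsons=1): {pct_total[1][1]:.2%}")
# Share of Simpsons viewers who drink cola (a column percentage).
print(f"percent of column (colas=1, simpsons=1): {pct_col[1][1]:.2%}")
# Share of cola drinkers who watch the Simpsons (a row percentage).
print(f"percent of row (colas=1, simpsons=1): {pct_row[1][1]:.2%}")
```

The column percentages answer "among Simpsons viewers, what fraction drinks cola?", while the row percentages answer the reverse question, which is why the two tables look so different.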

4.2 Numeric variables: Scatter Plots

For two numeric variables we have the scatter plot.

nbeer   weight
12.0    192
12.0    160
5.0     155
5.0     120
7.0     150
13.0    175
4.0     100
12.0    165
12.0    165
12.0    150
.       .
.       .
.       .

How are they related?

Each row is an observation corresponding to a person.

Each person has two numbers associated with him/her: # beers and weight.

Is the number of beers you can drink related to your weight?

[Figure: scatter plot of nbeer (vertical axis) versus weight (horizontal axis), with one point circled.]

You can think of a scatterplot as a 2D dotplot. Each point corresponds to an observation: weight determines the position on the horizontal axis, nbeer the position on the vertical.

Notice our outlier is back (circled)... and is he really an outlier?!

In addition to relating two variables, a scatterplot also gives you all the information you'd get from a dotplot of either variable.

[Figure: the same scatter plot of nbeer (vertical axis) versus weight (horizontal axis).]

Imagine the dots on the scatterplot being pulled downward by gravity: you'd get a dotplot of weight!

Same idea for nbeer, though the vertical axis can be a little harder to picture. (Hint: rotate the paper.)

Sample Exam Question

The sample mean of weight is

(i) 105 (ii) 130 (iii) 155 (iv) 180

The sample SD of weight is around 28, so roughly 68% of observations are between 127 and 183 pounds.

Example

Are returns on a mutual fund related to market returns?

[Figure: scatter plot of windsor (vertical axis) versus valmrkt (horizontal axis).]

Each point corresponds to a month.

Like the histogram, scatterplots can also be used with time series data, and the resulting plot does not depend on the time ordering.

Example

Here's another example of an outlier. This data is from a poker website that went through a major cheating scandal.

[Figure: scatter plot with axes WINRATE and VPIP.]

A similar scandal surfaced recently. Is the evidence as compelling?
http://www.msnbc.msn.com/id/26563848/

In finance we often use a different type of 2-D plot to compare asset returns. Here each point is a mutual fund. The horizontal and vertical location of each point reflects the sample standard deviation and sample mean of its returns within the same sample period.

[Figure: plot of Mean (vertical axis) versus StDev (horizontal axis) of returns for tbill, valmrkt, eqmrkt, windsor, scudinc, Putnminc, keystne, fidel, and drefus.]

If you're a fund manager, where do you want to be on this plot?

Let us compare some countries (Country returns data):
http://faculty.chicagobooth.edu/alan.bester/teaching/data/conret.xls

Based on monthly returns from '88 to '96.

[Figure: plot of Mean (vertical axis) versus StDev (horizontal axis) of returns for singapor, usa, japan, italy, honkong, germany, france, finalnd, canada, belgium, and australi.]

4.3 Relating a Numeric to a Categorical variable

How do you plot a numeric variable vs a categorical variable?

This is not so obvious.

An easy thing to do is make the numeric variable categorical by binning it, like we did when making a histogram.

Cigarette usage and age:

              cigs
age           0         1         Grand Total
16-25         50.98%    49.02%    100.00%
26-35         63.64%    36.36%    100.00%
36-45         67.69%    32.31%    100.00%
46-55         64.76%    35.24%    100.00%
56-65         79.76%    20.24%    100.00%
66-75         91.13%    8.87%     100.00%
76-85         88.10%    11.90%    100.00%
86-95         100.00%   0.00%     100.00%
Grand Total   71.20%    28.80%    100.00%

[Figure: bar chart of the share of cigs = 0 and cigs = 1 in each age group, 16-25 through 86-95.]

Quick: what is the relationship between age and cigarette usage?

Plots are a great way to identify patterns, but careful... How strong is the evidence?

5.1 In Tables

There does not seem to be a standard way to summarize the strength of the relationship in a table.

Sometimes I use the difference between a marginal proportion and a conditional proportion.

Percent of total:

              simpsons
colas         0        1        Grand Total
0             38.70%   3.50%    42.20%
1             43.20%   14.60%   57.80%
Grand Total   81.90%   18.10%   100.00%

Percent of column:

              simpsons
colas         0         1         Grand Total
0             47.25%    19.34%    42.20%
1             52.75%    80.66%    57.80%
Grand Total   100.00%   100.00%   100.00%

In this case it would be: |.578 - .8066| = .2286

The difference between the percent of cola drinkers and the percent of Simpsons viewers that are cola drinkers.
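To make the arithmetic concrete, here is a small sketch of that marginal-versus-conditional comparison, using the raw counts from the cola/Simpsons table. The variable names are illustrative assumptions, and this is only one informal measure, as the notes themselves say.

```python
# Counts from the two-way table: 578 of the 1000 respondents drink cola,
# and 146 of the 181 Simpsons viewers drink cola.
marginal = 578 / 1000      # proportion of cola drinkers overall
conditional = 146 / 181    # proportion of cola drinkers among Simpsons viewers

# Absolute gap between the marginal and the conditional proportion.
strength = abs(marginal - conditional)
print(round(strength, 4))
```

If the two variables were unrelated, the conditional proportion would be close to the marginal one and this gap would be near zero.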

5.2 Covariance and Correlation



In the beer data (beers vs weight) and mutual fund data (windsor vs valmrkt), it looks like there is a relationship.

Even more, the relationship looks linear in that it looks like we could draw a line through the plot to capture the pattern.

Covariance and correlation summarize how strong a linear relationship there is between two variables.

In our first example weight and # beers were two variables. In our second example our two variables were two kinds of returns.

In general, we think of the two variables as x and y.

The sample covariance between x and y:

$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

The sample correlation between x and y:

$r_{xy} = \frac{s_{xy}}{s_x s_y}$

So, the correlation is just the covariance divided by the two standard deviations. What are the units?

We will get some intuition about these formulae, but first let us see them in action. How do they summarize data for us? Let us start with the correlation.

Correlation, the facts of life:

$-1 \le r_{xy} \le 1$

The closer r is to 1, the stronger the linear relationship is with a positive slope. When one goes up, the other tends to go up.

The closer r is to -1, the stronger the linear relationship is with a negative slope. When one goes up, the other tends to go down.

The correlations corresponding to the two scatter plots we looked at are:

Correlation of valmrkt and windsor = 0.923

Correlation of nbeer and weight = 0.692

[Figure: the two scatter plots, windsor versus valmrkt and nbeer versus weight.]

The larger correlation between valmrkt and windsor indicates that the linear relationship is stronger.

Let us look at some more examples.

[Figure: scatter plot of y1 versus x1.]
Correlation of y1 and x1 = 0.019

[Figure: scatter plot of y2 versus x2.]
Correlation of y2 and x2 = 0.995

[Figure: scatter plot of y3 versus x3.]
Correlation of y3 and x3 = 0.586

[Figure: scatter plot of y4 versus x4.]
Correlation of y4 and x4 = -0.982

[Figure: scatter plot of y5 versus x5.]
Correlation of y5 and x5 = 0.210

IMPORTANT: Correlation only measures linear relationships (here the value is small but there is a strong nonlinear relationship between y5 and x5).

Example: The country data

Which countries go up and down together? I have data on 23 countries. That would be a lot of plots!

[Figure: scatter plot of canada (vertical axis) versus usa (horizontal axis) monthly returns.]

The correlation matrix is a table of all sample correlations between each possible pair of a set of variables.

           australi  belgium  canada  finalnd  france  germany  honkong  italy
belgium    0.189
canada     0.507     0.357
finalnd    0.387     0.183    0.386
france     0.275     0.734    0.342   0.176
germany    0.226     0.691    0.302   0.304    0.709
honkong    0.334     0.301    0.558   0.355    0.359   0.339
italy      0.159     0.367    0.334   0.389    0.352   0.465    0.261
japan      0.251     0.418    0.271   0.307    0.421   0.318    0.219    0.426
usa        0.360     0.429    0.651   0.264    0.501   0.372    0.429    0.240
singapor   0.409     0.355    0.478   0.391    0.408   0.467    0.647    0.416

           japan     usa
usa        0.246
singapor   0.407     0.473

Why is this blank?

StatPro will also make the covariance matrix, which displays covariances with variances on the diagonal.

Make this table in StatPro:
http://faculty.chicagobooth.edu/alan.bester/teaching/notes/n1_statprosumstats.htm

Understanding the covariance and correlation formulae

How do these weird-looking formulae for covariance and correlation capture the relationship?

To get a feeling for this, let us go back to the simple example and compute covariance and correlation.

x      y
0.07   0.11
0.06   0.05
0.04   0.09
0.03   0.03

First, let us compute the covariance (which is a necessary ingredient to compute the correlation):

$$
\begin{aligned}
\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})
&= \frac{1}{3}\big((.07-.05)(.11-.07) + (.06-.05)(.05-.07) + (.04-.05)(.09-.07) + (.03-.05)(.03-.07)\big) \\
&= \frac{1}{3}\big(.02 \cdot .04 + .01 \cdot (-.02) + (-.01) \cdot .02 + (-.02) \cdot (-.04)\big) \\
&= \frac{1}{3}(.0008 - .0002 - .0002 + .0008) = \frac{1}{3}(.0012) = .0004
\end{aligned}
$$

Each of the 4 points makes a contribution to the sum. Let us see which point does what.

[Figure: scatter plot of the four (x, y) points, divided at $\bar{x}$ and $\bar{y}$ into regions (I), (II), (III), and (IV), with each point labeled by its contribution to the sum.]

$(x_1-\bar{x})(y_1-\bar{y}) = .02 \cdot .04 = .0008$
$(x_2-\bar{x})(y_2-\bar{y}) = .01 \cdot (-.02) = -.0002$
$(x_3-\bar{x})(y_3-\bar{y}) = (-.01) \cdot .02 = -.0002$
$(x_4-\bar{x})(y_4-\bar{y}) = (-.02) \cdot (-.04) = .0008$

Points in (I) have both x and y bigger than their means so we get a positive contribution to the covariance.
Points in (III) have both x and y less than their means so we get a positive contribution to the covariance.
In (II) and (IV) one of x and y is less than its mean and the other is greater so we get a negative contribution.

The further out the point is, the bigger the contribution.
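The hand calculation for the four-point example can be checked with a short script. This is a sketch that follows the sample covariance and correlation formulas from these notes (dividing by n - 1); the helper names `sample_cov` and `sample_corr` are illustrative assumptions.

```python
from math import sqrt


def sample_cov(x, y):
    """Sample covariance: sum of (x_i - xbar)(y_i - ybar), divided by n - 1."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)


def sample_corr(x, y):
    """Sample correlation: covariance divided by both standard deviations."""
    s_x = sqrt(sample_cov(x, x))  # the covariance of x with itself is its variance
    s_y = sqrt(sample_cov(y, y))
    return sample_cov(x, y) / (s_x * s_y)


x = [0.07, 0.06, 0.04, 0.03]
y = [0.11, 0.05, 0.09, 0.03]

# Contribution of each point to the sum, as in the quadrant picture.
xbar, ybar = sum(x) / len(x), sum(y) / len(y)
for a, b in zip(x, y):
    print(f"({a}, {b}): contribution {(a - xbar) * (b - ybar):+.4f}")

print("covariance:", round(sample_cov(x, y), 6))
print("correlation:", round(sample_corr(x, y), 3))
```

The two points in regions (I) and (III) contribute +.0008 each and outweigh the two negative contributions of -.0002, so both the covariance and the correlation come out positive.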

[Figure: scatter plot of windsor versus valmrkt, annotated "Lots of positive contributions" and "just a few relatively small contributions".]

We saw before that this mutual fund's returns are positively correlated with the market.

The sign of the correlation contains the same information as the sign of the covariance (in fact, they have the same sign because the standard deviations are always positive).

Positive sign: positive relationship
Negative sign: negative relationship

The correlation can be more informative, though, because it is unit-less (always between -1 and 1), by construction. Hence, it is a more easily interpretable measure of the strength of the relationship.

Close to 1: strong positive relationship

Close to -1: strong negative relationship

6. Linearly Related Variables

We have studied data sets that display some kind of relation between variables (the mutual fund returns and the market returns, for instance).

Sometimes there is an exact linear relation between variables:

y = c0 + c1 x

In this linear relationship, c0 is called the intercept.

c1 is called the slope.

Suppose we had started with x and we already knew its sample mean and variance.

Can we figure out the sample mean and variance of the new variable, y?

6.1 Linear functions

Example

Suppose we have a sample of temperatures in Celsius and we convert them to Fahrenheit.

fahr = 32 + (9/5) * cel

cel   fahr
10    50
15    59
20    68
25    77
40    104
30    86
50    122
70    158

Data: http://faculty.chicagobooth.edu/alan.bester/teaching/data/celfahr.xls

How are the cel values related to the fahr values?

Note that $\overline{cel} = 32.5$ and $s_{cel} = 20$.

We could find $\overline{fahr}$ and $s_{fahr}$ using a spreadsheet.

Note: if we make a scatter plot of fahr versus cel, what do we see?

[Figure: scatter plot of fahr (vertical axis) versus cel (horizontal axis).]

Correlation of cel and fahr = 1.000

The variable y is a linear function of the variable x if:

$y = c_0 + c_1 x$

In general, we like to use the symbols y and x for the two variables.

$c_0$: the intercept
$c_1$: the slope

We think of the c's as constants (fixed numbers) while x and y vary.

Example

Suppose your client is a movie star. She has a deal which pays her a $10 million fee per movie + 10% of the gross ticket revenues.

How is our star's income related to the gross?

Let I denote income.
Let G denote Gross.

I = 10 + .1 G

Note: Don't forget units! When we write it this way we need to make sure all our numbers are in millions of dollars.

6.2 Mean and variance of a linear function

Suppose y (i.e., each value of the variable y) is a linear function of x.

How are the mean and variance (standard deviation) of y related to those of x?

Let us look at our temperature example.

Suppose we first multiply by (9/5) and then add 32.

mul = (9/5) * cel
fahr = 32 + mul = 32 + (9/5)*cel

[Figure: dotplots of cel, mul, and fahr on a common axis from 0 to 150.]

Variable   Mean    StDev
cel        32.50   20.00
mul        58.5    36.0
fahr       90.5    36.0

Interpret

When we multiply cel by 9/5 we affect (increase) both the mean and the standard deviation proportionally.

If we add a constant (32 in our case) we simply increase the mean (by the value of the constant) but leave the overall dispersion unaffected.

Sample mean and variance of a linear function

Suppose

$y = c_0 + c_1 x$

Then,

$\bar{y} = c_0 + c_1 \bar{x}$

$s_y^2 = c_1^2 s_x^2$

$s_y = |c_1| s_x$

Example

So, instead of using a spreadsheet, we could have used our linear formulas.

We knew that fahr = 32 + (9/5) * cel, so y = fahr, x = cel, c0 = 32, and c1 = 9/5.

Our handy linear formulas tell us:

$\overline{fahr} = c_0 + c_1 \overline{cel} = 32 + (9/5)(32.5) = 90.5$

$s_{fahr} = |c_1| s_{cel} = |9/5| \cdot 20 = 36$

Of course, these are the same answers we got before!!
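As a quick check of the claim that the formulas and the spreadsheet agree, the sketch below computes the mean and standard deviation of fahr both ways for the eight temperatures in the table. It uses Python's standard `statistics` module in place of a spreadsheet; the variable names are illustrative assumptions.

```python
from statistics import mean, stdev

# The eight Celsius readings from the example table.
cel = [10, 15, 20, 25, 40, 30, 50, 70]
c0, c1 = 32, 9 / 5

# "Spreadsheet" route: convert every value, then summarize.
fahr = [c0 + c1 * c for c in cel]
mean_direct = mean(fahr)
sd_direct = stdev(fahr)          # sample SD (divides by n - 1)

# Formula route: only the summaries of cel are needed.
mean_formula = c0 + c1 * mean(cel)
sd_formula = abs(c1) * stdev(cel)

print("direct :", round(mean_direct, 4), round(sd_direct, 4))
print("formula:", round(mean_formula, 4), round(sd_formula, 4))
```

Both routes give a mean of 90.5 and a standard deviation of 36: adding 32 shifts the mean only, while multiplying by 9/5 scales both.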

Aside: Why? (The hard way)

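For the mean, with $y_i = c_0 + c_1 x_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$:

$$
\begin{aligned}
\bar{y} &= \frac{1}{n}\sum_{i=1}^{n} (c_0 + c_1 x_i) \\
&= \frac{1}{n}\sum_{i=1}^{n} c_0 + c_1 \frac{1}{n}\sum_{i=1}^{n} x_i \\
&= c_0 + c_1 \bar{x}
\end{aligned}
$$

For the variance, with $s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$:

$$
\begin{aligned}
s_y^2 &= \frac{1}{n-1}\sum_{i=1}^{n} \big(c_0 + c_1 x_i - (c_0 + c_1 \bar{x})\big)^2 \\
&= \frac{1}{n-1}\sum_{i=1}^{n} (c_0 + c_1 x_i - c_0 - c_1 \bar{x})^2 \\
&= \frac{1}{n-1}\sum_{i=1}^{n} c_1^2 (x_i - \bar{x})^2 = c_1^2 s_x^2
\end{aligned}
$$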

NOTE: This is way more math than we will typically need in this course.

BUT you should know these formulas are properties of our summary statistics, not just some coincidence. AND they come up again when we do probability!

Example

Suppose our movie star made 10 pictures last year and the sample mean and sample variance of the gross on the films are 100 and 900, respectively.

What are the sample mean and variance of the star's income?

Gross   Income
115.8   21.58
128.9   22.89
109.5   20.95
127.1   22.71
87.2    18.72
111.2   21.12
62.5    16.25
129.4   22.94
87.2    18.72
41.2    14.12

Each Income number is 10 + .1 * the corresponding Gross number.

See the file "moviestar1.xls":
http://faculty.chicagobooth.edu/alan.bester/teaching/data/moviestar1.xls

Remember,

$I = 10 + .1 G$

with y = I, x = G, c0 = 10, and c1 = .1. So,

$\bar{I} = c_0 + c_1 \bar{G} = 10 + .1 \cdot 100 = 20$

$s_I^2 = c_1^2 s_G^2 = (.1)^2 \cdot s_G^2 = (.1)^2 \cdot 900 = 9$

The spreadsheet gives the same answers:

The average of the Gross numbers = 100
The sample variance of the Gross numbers = 900
The standard deviation of the Gross numbers = 30

The average of the Income numbers = 20 (10 + .1*100 = 20)
The sample variance of the Income numbers = 9 ((.1)^2 * 900 = 9)
The standard deviation of the Income numbers = 3 (.1*30 = 3)

[Figure: scatter plot of Income (vertical axis) versus Gross (horizontal axis).]

Why are these formulas useful?


We could always just type everything into a spreadsheet and use spreadsheet functions to get the answers.

Really, though, the reason for these formulas will become apparent when we study probability, statistical inference, and regression. You cannot understand statistics or regression without a solid understanding of linear relationships.

In other words, yes, I recognize these formulas are probably the least fun part of the course (and considering this is basic stats, that's saying something). But you absolutely must know them.

Example

Suppose x has mean 100 and standard deviation 10.

What are the mean, standard deviation and variance of:

(i) y = 2x?      (c0 = 0, c1 = 2)

(ii) y = 5+x?    (c0 = 5, c1 = 1)

(iii) y = 5-2x?  (c0 = 5, c1 = -2)

Answers:

        Mean   SD   Variance
(i)     200    20   400
(ii)    105    10   100
(iii)   -195   20   400
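The three answers follow mechanically from the linear formulas, so they are easy to check. The helper below is a sketch (the name `linear_summary` is an illustrative assumption) that takes only c0, c1, and the mean and standard deviation of x.

```python
def linear_summary(c0, c1, mean_x, sd_x):
    """Mean, SD, and variance of y = c0 + c1 * x from the summaries of x."""
    mean_y = c0 + c1 * mean_x      # the constant and the slope both move the mean
    sd_y = abs(c1) * sd_x          # only the size of the slope affects the spread
    return mean_y, sd_y, sd_y ** 2


# x has mean 100 and standard deviation 10.
for label, c0, c1 in [("(i)", 0, 2), ("(ii)", 5, 1), ("(iii)", 5, -2)]:
    print(label, linear_summary(c0, c1, 100, 10))
```

Note that case (iii) has a negative slope but the same standard deviation as case (i), because the formula uses |c1|.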

6.3 Linear combinations


We may want a variable to be related to several others instead of just one. We will assume that Y is a function of X, Z, ... rather than just a function of X.

When a variable y is linearly related to several others, we call it a linear combination.

$y = c_0 + c_1 x_1 + c_2 x_2 + \cdots + c_k x_k$

We say y is a linear combination of the x's.
$c_0$ is called the intercept or just the constant.
$c_i$ is called the coefficient of $x_i$.

Example

Suppose in addition to the flat $10 million fee and 10 percent of ticket revenues, our movie star also gets 5 percent of all sales of the soundtrack (on CD) released with the movie.

How is the star's income related to the film's gross and CD sales (in millions of dollars)?

Let I, G, C denote income, gross, and CD sales:

I = 10 + .1 G + .05 C

Here y = I, x1 = G, x2 = C, c0 = 10, c1 = .1, and c2 = .05.

Important example: Portfolios

Suppose you have $100 to invest.

Let x1 be the return on asset 1.

If x1 = .1, and you put all your money into asset 1, then

you will have $100*(1+.1) = $110 at the end of the period.

Let x2 be the return on asset 2.

If x2 = .15, and you put all your money into asset 2, then

you will have $100*(1+.15) = $115 at the end of the period.

Suppose you put 1/2 of your money into asset 1 and the other 1/2 of your money into asset 2. What will happen?

At the end of the period you will have,

.5*(100)*(1+.1) + .5*(100)*(1+.15) = 100*[ 1+(.5*.1)+(.5*.15) ]

55 + 57.50 = $112.50

So the return is (.5*.1) + (.5*.15) = .125

In other words, when we put 1/2 of our money into asset 1 and the other 1/2 into asset 2, the return on the resulting portfolio is

Rp = (1/2)*x1 + (1/2)*x2

Here Rp is the return on the portfolio, and each 1/2 is the fraction invested in asset 1 and in asset 2.

The return on a portfolio is a linear combination of the returns on the individual assets.
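Here is a small sketch of the $100 example as code. The function name `portfolio_return` and the check that the weights sum to one are illustrative assumptions; the numbers are the ones used above.

```python
def portfolio_return(weights, returns):
    """Portfolio return as a linear combination of the asset returns."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("portfolio weights must sum to one")
    return sum(w * x for w, x in zip(weights, returns))


money = 100
x1, x2 = 0.10, 0.15                 # returns on asset 1 and asset 2

rp = portfolio_return([0.5, 0.5], [x1, x2])
end_wealth = money * (1 + rp)

print("portfolio return:", round(rp, 4))
print("wealth at end of period:", round(end_wealth, 2))
```

Changing the weights, for example to `[0.3, 0.7]`, moves the portfolio return along the line between the two asset returns.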

It turns out this is true in general. Suppose you have $M to invest in two assets with returns x1 and x2. Let w1 be the fraction of your wealth you choose to invest in asset 1, and w2 the fraction in asset 2. At the end of the period you will have:

$w_1 M (1 + x_1) + w_2 M (1 + x_2) = M (w_1 + w_2 + w_1 x_1 + w_2 x_2) = M (1 + w_1 x_1 + w_2 x_2)$

The portfolio return is:

$R_p = w_1 x_1 + w_2 x_2$

The portfolio return is a linear combination of the individual asset returns. The coefficients are the portfolio weights (fraction of wealth invested in each asset).

Note: For this to work, we need w1 + w2 = 1

Notice that the portfolio weights always sum up to one. (If I invest 30% of my wealth in asset 1, then I have to invest 70% of my wealth in asset 2.)

When we're talking about portfolios, we use w1, w2, ... instead of c1, c2, ... to remind us that weights have to sum to one. Our linear formulas work the same way in either case. Most of the time when we do portfolios, we don't worry about the constant (c0 = 0).

Question for those with some finance experience: Can portfolio weights be negative?

Suppose we have m assets.

The return on the ith asset is xi.

Put wi fraction of your wealth into asset i.

Your portfolio is determined by the portfolio weights wi.

Then, the return on the portfolio is:

$R_p = w_1 x_1 + w_2 x_2 + \cdots + w_m x_m = \sum_{i=1}^{m} w_i x_i$

Your portfolio return is always a linear combination of individual asset returns, with coefficients equal to the fraction of wealth invested.

6.4 Mean and variance of a linear combination


2 inputs:

First, we consider the case where we have only two x's.

Suppose

$y = c_0 + c_1 x_1 + c_2 x_2$

Then,

$\bar{y} = c_0 + c_1 \bar{x}_1 + c_2 \bar{x}_2$

$s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + 2 c_1 c_2 s_{x_1 x_2}$

For linear combinations of 2 or more variables, variance also depends on the covariance between the x's!!

More on this later...

Example

For each film she does, our movie star makes $10 million plus 10% of gross ticket revenues and 5% of CD sales.

Here is the data for ten movies she made last year, along with her income for each film:

Gross        Cd         Income
115.763100   5.412503   21.8
128.904400   6.539900   23.2
109.524600   5.878809   21.2
127.133700   4.984490   23.0
87.234720    3.544932   18.9
111.248000   5.602628   21.4
62.455030    3.954600   16.4
129.397300   5.387244   23.2
87.171460    5.092816   19.0
41.167710    3.602078   14.3

Remember,

I = 10 + .1 G + .05 C

So each number in the Income column equals 10 plus .1 times the Gross value plus .05 times the Cd value.

Note: All numbers are in millions of $.

Like before, we could type everything in and get the sample mean and variance of income using a spreadsheet.

But let's suppose, as her agent, we already knew that:

$\bar{G} = 100$, $\bar{C} = 5$

$s_G = 30$, $s_C = 1$

$r_{CG} = 0.8$

Like before, we know that:

I = 10 + .1 G + .05 C    (c0 = 10, c1 = .1, c2 = .05)

So:

$\bar{I} = c_0 + c_1 \bar{G} + c_2 \bar{C} = 10 + .1(100) + .05(5) = 20.25$

$s_I^2 = c_1^2 s_G^2 + c_2^2 s_C^2 + 2 c_1 c_2 s_{CG}$

$= (.1)^2 (30)^2 + (.05)^2 (1)^2 + 2 (.1)(.05)(30)(1)(.8) = 9.24$

(For $s_{CG} = (30)(1)(.8)$, see next slide.)

Reminder:

Remember, we defined sample correlation as the covariance divided by the standard deviations:

$r_{xy} = \frac{s_{xy}}{s_x s_y}$

So, if we know the correlation and both standard deviations, we can get back the sample covariance:

$s_{xy} = s_x s_y r_{xy}$

So, if we know the sample standard deviations and either of correlation or covariance, we can figure out the other. We used this trick to calculate $s_{CG}$ on the previous slide.
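Putting the two-input formulas and the covariance trick together, the movie star calculation can be sketched as follows. The function name `two_input_summary` is an illustrative assumption; the inputs are the summary statistics quoted in the example.

```python
def two_input_summary(c0, c1, c2, mean1, mean2, sd1, sd2, r12):
    """Mean and variance of y = c0 + c1*x1 + c2*x2 from summary statistics."""
    cov12 = sd1 * sd2 * r12                      # covariance recovered from the correlation
    mean_y = c0 + c1 * mean1 + c2 * mean2
    var_y = c1**2 * sd1**2 + c2**2 * sd2**2 + 2 * c1 * c2 * cov12
    return mean_y, var_y


# I = 10 + .1 G + .05 C, with G-bar = 100, C-bar = 5, s_G = 30, s_C = 1, r_CG = .8
mean_I, var_I = two_input_summary(10, 0.1, 0.05, 100, 5, 30, 1, 0.8)

print("mean income:", round(mean_I, 2))
print("variance of income:", round(var_I, 2))
```

Setting `r12` to 0 drops the covariance term entirely, which shows how much of the variance comes from Gross and CD sales moving together.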



Example (the country data again)


Let us use our country data and suppose that we had put .5 into USA and .5 into Hong Kong. What would our returns have been?

port = .5*honkong + .5*usa    (w1 = c1 = .5, w2 = c2 = .5)

honkong   usa     port
0.02      0.04    0.030
0.06      -0.03   0.015
0.02      0.01    0.015
-0.03     0.01    -0.010
0.08      0.05    0.065
...

For each month, we get the portfolio return as (1/2)*honkong + (1/2)*usa.

The sample means are:

$\overline{honkong} = 0.02103$

$\overline{usa} = 0.01346$

The sample mean of our portfolio returns is:

$\overline{port} = w_1 \overline{honkong} + w_2 \overline{usa} = .5 \cdot .02103 + .5 \cdot .01346 = .01724$

What if we had put 25% into USA and 75% into Hong Kong?

port2 = .75*honkong + .25*usa

Covariances

          honkong      usa          port2
honkong   0.00521497
usa       0.00103037   0.00110774
port2     0.00416882   0.00104972   0.00338905

To get $s_{port2}^2$ just use the SAME two-input variance formula as before, except now with w1 = .75 and w2 = .25:

(.75)^2 (.00521) + (.25)^2 (.00111) + (2)*(.25)*(.75)*(.00103) = .00339
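The same calculation can be run for any pair of weights. The sketch below plugs the Hong Kong and USA sample means, variances, and covariance quoted in these notes into the two-input formulas; the function name `portfolio_stats` is an illustrative assumption.

```python
from math import sqrt

# Sample statistics quoted in the notes (monthly returns).
mean_hk, mean_usa = 0.02103, 0.01346
var_hk, var_usa = 0.00521497, 0.00110774
cov_hk_usa = 0.00103037


def portfolio_stats(w_hk, w_usa):
    """Mean, variance, and SD of port = w_hk*honkong + w_usa*usa."""
    mean_p = w_hk * mean_hk + w_usa * mean_usa
    var_p = w_hk**2 * var_hk + w_usa**2 * var_usa + 2 * w_hk * w_usa * cov_hk_usa
    return mean_p, var_p, sqrt(var_p)


for w_hk, w_usa in [(0.75, 0.25), (0.5, 0.5)]:
    mean_p, var_p, sd_p = portfolio_stats(w_hk, w_usa)
    print(f"w_hk={w_hk}, w_usa={w_usa}: mean={mean_p:.5f}, var={var_p:.8f}, sd={sd_p:.4f}")
```

With weights .75 and .25 the variance agrees with the port2 entry of the covariance table (0.00338905) up to rounding.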

How do the returns on the w1 = w2 = .5 portfolio compare with those of Hong Kong and USA?

[Figure: plot of Mean (vertical axis) versus StDev (horizontal axis) for usa, honkong, and port.]

$\overline{port} = .0172$

$s_{port} = .046$

It looks like the mean for my portfolio is right in between the means of USA and Hong Kong.

What about the standard deviation?

The sample standard deviation is less than halfway between $s_{usa}$ and $s_{honkong}$... what happened?

Why is covariance important?

Often useful to rewrite the variance formula as

$s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + 2 c_1 c_2 s_{x_1} s_{x_2} r_{x_1 x_2}$

(We just used the formula from the reminder slide: $s_{x_1 x_2} = s_{x_1} s_{x_2} r_{x_1 x_2}$.)

Remember, correlations are between -1 and 1! IF x1 and x2 are perfectly correlated (r = 1), then

$s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + 2 c_1 c_2 s_{x_1} s_{x_2} = (c_1 s_{x_1} + c_2 s_{x_2})^2$

So in this case (with c1 and c2 positive),

$s_y = c_1 s_{x_1} + c_2 s_{x_2}$

BUT in general, when c1 and c2 are positive and r < 1,

$s_y < c_1 s_{x_1} + c_2 s_{x_2}$

The basic idea here is:

When we take averages, variance gets smaller.

The smaller the correlation, the faster this happens.

This is actually one of the most important ideas in statistics... we'll see it again!!

It is also one of the most important ideas in finance, because it leads to diversification.

Example (Optional)

y = .5 x1 + .5 x2

[Figure: scatter plot of x1 and x2. At each point we plot the value of y. The dashed lines are drawn at the mean of x1 and x2.]

The variances and covariance are:

      x1          x2
x1    1.334636
x2    -1.208679   1.106238

Then, the variance of y is

0.0058105 = .5*.5*1.3346 + .5*.5*1.106 + 2*.5*.5*(-1.208679)

Why is the variance of y so much smaller than those of the x's?

Example (Optional)

y = .5 x1 + .5 x2

[Figure: scatter plot of x1 and x2. At each point we plot the value of y. The dashed lines are drawn at the mean of x1 and x2.]

The variances and covariance are:

      x1          x2
x1    1.3870537
x2    0.1976187   0.8247886

Then, the variance of y is

0.65175 = .5*.5*1.387 + .5*.5*.8248 + 2*.5*.5*.1976

Why is the variance of y less than those of x1 and x2?

3 inputs:

$y = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3$

$\bar{y} = c_0 + c_1 \bar{x}_1 + c_2 \bar{x}_2 + c_3 \bar{x}_3$

The formula for the sample mean is basically the same, just one more term because there's one more x.

$s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + c_3^2 s_{x_3}^2 + 2 c_1 c_2 s_{x_1 x_2} + 2 c_1 c_3 s_{x_1 x_3} + 2 c_2 c_3 s_{x_2 x_3}$

Note that there are now THREE covariance terms, one for each PAIR of x's.

Example: Portfolio with 3 inputs

port = .1*fidel + .4*eqmrkt + .5*windsor

Covariances

          port         fidel        eqmrkt       windsor
port      0.00306760
fidel     0.00280224   0.00320210
eqmrkt    0.00369384   0.00319150   0.00470021
windsor   0.00261967   0.00241087   0.00298922   0.00236580

$s_{port}^2 = w_1^2 s_{fidel}^2 + w_2^2 s_{eqmrkt}^2 + w_3^2 s_{windsor}^2 + 2 w_1 w_2 s_{fidel,eqmrkt} + 2 w_1 w_3 s_{fidel,windsor} + 2 w_2 w_3 s_{eqmrkt,windsor}$

.0030676 = (.1)*(.1)*.00320 + (.4)*(.4)*.00470 + (.5)*(.5)*.00236 + 2*[ (.1)*(.4)*.00319 + (.1)*(.5)*.00241 + (.4)*(.5)*.00299 ]
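With three or more inputs the bookkeeping gets tedious, so it can help to let a loop handle the pairs. The sketch below sums c_i * c_j * s_ij over every ordered pair (i, j), which counts each variance once and each covariance twice, matching the formula above. The function name `combo_variance` is an illustrative assumption; the covariance matrix is copied from the table.

```python
def combo_variance(coefs, cov):
    """Variance of a linear combination: sum of c_i * c_j * cov[i][j] over all i, j."""
    k = len(coefs)
    return sum(coefs[i] * coefs[j] * cov[i][j] for i in range(k) for j in range(k))


# Full symmetric covariance matrix for (fidel, eqmrkt, windsor), from the table.
cov = [
    [0.00320210, 0.00319150, 0.00241087],
    [0.00319150, 0.00470021, 0.00298922],
    [0.00241087, 0.00298922, 0.00236580],
]
weights = [0.1, 0.4, 0.5]   # port = .1*fidel + .4*eqmrkt + .5*windsor

var_port = combo_variance(weights, cov)
print("variance of port:", round(var_port, 7))
```

The result agrees with the port entry on the diagonal of the covariance table, and the same function works unchanged for any number of inputs.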

Let us try a portfolio with three stocks. Let us go short on Canada (i.e., we borrow Canada to invest in the other stocks).

port = -.5*canada + usa + .5*honkong

[Figure: plot of Mean (vertical axis) versus StDev (horizontal axis) for port, usa, honkong, and canada.]

Clearly, forming portfolios is an interesting thing to do!

Aside: We can show (using our linear formulas) that all portfolios that can be formed with a given set of assets lie on a hyperbola in mean-s.d. space. Your investments class will call this the portfolio possibilities curve or just the efficient frontier.

Aside: Why would we form portfolios?

Maybe the portfolio has a nice mean and variance (i.e., nice average return and nice risk).

Because portfolio returns are linear combinations of returns on individual assets, we can apply our linear formulas to find the average return and risk of any possible portfolio as long as we know the means, variances, and covariances of the individual asset returns. These formulae are fundamental tools for those who really understand finance.

And remember our "when we take averages, variance gets smaller" idea? In finance, that's known as diversification.

Example (Optional)

Cut from a Finance Textbook:

[Figure: excerpt from a finance textbook.]

K inputs (Optional): Suppose

$y = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3 + \cdots + c_k x_k$

then,

$\bar{y} = c_0 + c_1 \bar{x}_1 + c_2 \bar{x}_2 + c_3 \bar{x}_3 + \cdots + c_k \bar{x}_k$

$s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + \cdots + c_k^2 s_{x_k}^2 + 2 \left[ \text{the sum of all the different covariance terms times the products of the c's} \right]$

I won't ask you to do calculations by hand for more than 3 inputs; this is just to give you an idea of what the formulas look like.

7. Linear Regression

This is data on 128 homes (Housing data):
http://faculty.chicagobooth.edu/alan.bester/teaching/data/MidCity.xls

x = size (square feet), y = price (dollars)

[Figure: scatter plot of Price (vertical axis, 50000 to 225000) versus SqFt (horizontal axis, 1400 to 2600).]

Clearly, the data are correlated:

Table of correlations

        SqFt    Price
SqFt    1.000
Price   0.553   1.000

But what is the equation of the line you would draw through the data?

Linear regression fits a line to the plot.

When I "run a regression" I get values for the intercept and the slope.

y = (intercept) + (slope) * x

Regression coefficients

           Coefficient
Constant   -10091.1299   (intercept)
SqFt       70.2263       (slope)

Here is the scatter plot with the line drawn through it.

[Figure: scatter plot of Price versus SqFt with the fitted regression line.]

Looks reasonable!

It turns out the formulas for the slope and the intercept are

$\text{slope} = \frac{s_{xy}}{s_x^2}$

$\text{intercept} = \bar{y} - \text{slope} \cdot \bar{x}$

We'll see these later when we study regression. But it isn't that hard to see what they do!

The slope formula takes covariance and standardizes it so that its units are (units of y)/(units of x).

The intercept formula makes our line pass through the point $(\bar{x}, \bar{y})$.
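To see the two formulas at work on data where we already know the answer, the sketch below applies them to the Celsius/Fahrenheit table from earlier in these notes, where the exact relationship is fahr = 32 + (9/5)*cel. The function name `fit_line` is an illustrative assumption, and this is only the slope and intercept calculation, not a full regression routine.

```python
def fit_line(x, y):
    """Return (intercept, slope) using slope = s_xy / s_x^2 and intercept = ybar - slope*xbar."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
    s_xx = sum((a - xbar) ** 2 for a in x) / (n - 1)
    slope = s_xy / s_xx
    intercept = ybar - slope * xbar   # forces the line through (xbar, ybar)
    return intercept, slope


cel = [10, 15, 20, 25, 40, 30, 50, 70]
fahr = [50, 59, 68, 77, 104, 86, 122, 158]

intercept, slope = fit_line(cel, fahr)
print("intercept:", round(intercept, 4), "slope:", round(slope, 4))

# Swapping the roles of x and y gives a different line (regression is not symmetric).
intercept_rev, slope_rev = fit_line(fahr, cel)
print("reverse intercept:", round(intercept_rev, 4), "reverse slope:", round(slope_rev, 4))
```

Because the relationship here is exactly linear, the formulas recover the intercept 32 and slope 9/5; with noisy data such as the housing data they give the fitted line instead.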

Regression and Prediction


You have a house on the market with size = 2200 sqft.

Can we predict at what price the house will sell?

[Figure: histogram of Price (in $1,000s).]

$\overline{Price}$ = $130.4 k

$s_{Price}$ = $26.9 k

We might use the sample mean or median as our prediction. But this doesn't take size into account.
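One way to take size into account, sketched here as an assumption about how the fitted line would be used (the slide that shows the regression-based prediction is not part of this text), is to plug the size into the line using the regression coefficients reported earlier. The names `INTERCEPT`, `SLOPE`, and `predict_price` are illustrative.

```python
# Regression coefficients reported in the notes for the housing data.
INTERCEPT = -10091.1299   # Constant
SLOPE = 70.2263           # dollars per square foot


def predict_price(sqft):
    """Predicted price in dollars from the fitted line: intercept + slope * size."""
    return INTERCEPT + SLOPE * sqft


predicted = predict_price(2200)
sample_mean_price = 130_400   # the $130.4k sample mean quoted above

print(f"prediction from the line: ${predicted:,.0f}")
print(f"difference from the sample mean: ${predicted - sample_mean_price:,.0f}")
```

For a 2200 sqft house the line gives a prediction above the overall sample mean, which makes sense if this house is larger than the typical house in the data.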



Summary of Regression


Because they are using other information, the predictions we make are (hopefully!) better in some sense. One of the homework problems asks you to explore this.

Most importantly, though, regression is based on the same concepts (sample means, standard deviations, and covariance) that we've studied in these notes. It's simply a new way to display (and use!) this information.

There's nothing magical or mysterious about linear regression! If you understand the basics well, regression is both intuitive and incredibly useful.

Limitations of Regression

One thing to notice about regression is that it is not symmetric. As we've seen, the sample correlation (or covariance) between x and y is the same as between y and x.

In regression, it matters which variable is on the left-hand side of the "=" (the dependent variable). A regression with y = Size and x = Price gives a different answer.

Remember:

Correlation is not causation!

Just because we regress y on x doesn't mean changes in x cause changes in y.

8. Pivot Tables (Optional)

Up till now, we have tried to look at pairs of variables.

Of course, it would be interesting to look at more than two at a time.

The Pivot table utility in Excel uses tables to do this. But the tables can be "more than two way" and you can put a summary for another variable in each cell.

The simple two-way tables we looked at earlier were also created using pivot tables.

This table attempts to look at 3 variables at the same time!!

Average of cigs   age
sex               16-25   26-35   36-45   46-55   >56    Grand Total
1                 0.42    0.42    0.37    0.35    0.16   0.28
2                 0.53    0.33    0.28    0.39    0.23   0.29
Grand Total       0.49    0.36    0.32    0.37    0.19   0.29

In each cell is printed the average of the cigs dummy. This gives the percentage of smokers.

The cells are determined by a binned version of age and sex.

In the age group 16-25, 53% of female respondents are smokers.

What do you think is going on here?

Here is the pivot chart.

[Figure: pivot chart of the average of cigs for sex = 1 and sex = 2, with one series per age group (16-25, 26-35, 36-45, 46-55, >56).]

The Hockey Data

We have data on every penalty called in the NHL from 95-96 to 2001-2002. Data below is a subsample of size 5000.

oppcall   timespan   laghome   goaldiff   inrow2   laghomeT   inrowT
0         14.75      0         -1         0        v          one
0         6.90       1         2          0        h          one
1         8.45       1         2          1        h          two
0         11.75      0         0          0        v          one
1         6.30       1         1          0        h          one
1         3.33       1         -1         1        h          two
1         5.93       0         -1         1        v          two
...

Each row corresponds to a penalty. (Can't have first penalty in game.)

oppcall = 1 if penalty switches; that is, if A is playing B and the last penalty was on B, then oppcall = 1 if this penalty is on A.

timespan = time between penalties (mins)

laghome = 1 if last pen on home team

goaldiff = lead of last penalized team

inrow2 = 1 if last two pens on same team

laghomeT: h if laghome = 1

inrowT: two if inrow2 = 1

The table attempts to look at 4 variables at one time!!!!



Average of oppcall goaldiff

inrowT l
