2: Descriptive Statistics 1/54 Statistics and Data Analysis Professor William Greene Stern School of Business IOMS Department Department of Economics

Statistics and Data Analysis


Page 1: Statistics and Data Analysis

2: Descriptive Statistics1/54

Statistics and Data Analysis

Professor William Greene
Stern School of Business
IOMS Department
Department of Economics

Page 2: Statistics and Data Analysis

2: Descriptive Statistics2/54

Statistics and Data Analysis

Part 2 – Descriptive Statistics

Summarizing data with useful statistics

Page 3: Statistics and Data Analysis

2: Descriptive Statistics3/54

Use random samples and basic descriptive statistics.

What is the ‘breach rate’ in a pool of tens of thousands of mortgages? (‘Breach’ = improperly underwritten or serviced or otherwise faulty mortgage.)

Page 4: Statistics and Data Analysis

2: Descriptive Statistics4/54

The forensic analysis was an examination of statistics from a random sample of 1,500 loans.

Page 5: Statistics and Data Analysis

2: Descriptive Statistics5/54

Descriptive Statistics: Agenda

Populations and Random Samples
Descriptive Statistics for a Variable
    Measures of location: mean, median, mode
    Measure of dispersion: standard deviation
Measuring Correlation of Two Variables
    Understanding correlation
    Measuring correlation
    Scatter plots and regression

Page 6: Statistics and Data Analysis

2: Descriptive Statistics6/54

Populations and Samples

Population: Collection of all possible observations (data points) on a variable.
Sample: A subset of the data points in the population.
Random sample: Defined by the way the sample data are obtained. All points in the population are equally likely to be drawn in any particular sample.

What is the purpose of obtaining a sample? To describe or learn about the population.
    The sample is observed.
    The population is assumed.

In order to learn confidently about the population from a sample, the sample must be ‘random.’

Page 7: Statistics and Data Analysis

2: Descriptive Statistics7/54

Random Sampling

A production process produces circuit boards. Boards are produced in each hour with an average of 2 defects per board when the process is in control. Each hour, the engineer examines a random sample of 100 circuit boards. The average numbers of defects per board in a particular 30-hour week are:

Hour 1: mean of 100 boards = 1.95
Hour 2: mean of 100 boards = 2.65
Hour 3: mean of 100 boards = 1.80
…
Hour 30: mean of 100 boards = 2.35
(These are estimates of the defect rate per board.)

The objective of drawing the sample is to determine whether the process is in control or not. (The process is under control if the defect rate is < 2.)

Method: Assuming the process is in control, would we expect to see this rate of defects?

Page 8: Statistics and Data Analysis

2: Descriptive Statistics8/54

Random samples of behavior are difficult to obtain, especially by telephone.

Page 9: Statistics and Data Analysis

2: Descriptive Statistics9/54

Nonrandom Samples

Nonrandom samples produce tainted, sometimes unbelievable results.
    They are biased with respect to the population.
    They may describe a specific subset of the population that is not useful.

Page 10: Statistics and Data Analysis

2: Descriptive Statistics10/54

(Non)Randomness of Samples

Sources of bias in samples (generally related)

Bad sample design – e.g., home phone surveys conducted during working hours

Survey (non)response bias – e.g., opinion surveys about service quality

Participation bias – e.g., voluntary participation in a survey

Self selection – volunteering for a trial or an opinion sample. (Shere Hite’s cultural revolution)

Attrition bias from clinical trials - e.g., if the drug works, the subject does not come back.

Page 11: Statistics and Data Analysis

2: Descriptive Statistics11/54

Nonrandom results in incubator funds.

The “NYU No Action Letter”

Page 12: Statistics and Data Analysis

2: Descriptive Statistics12/54

Nonscientific, Nonrandom “(non)Sampling”

A Cultural Revolution …“3000 women, ages 14 to 78 describe in their own words …”

Page 13: Statistics and Data Analysis

2: Descriptive Statistics13/54

A Cultural Revolution …“3000 women, ages 14 to 78 describe in their own words …”

http://www.amazon.com/The-Hite-Report-National-Sexuality/dp/1583225692

Page 14: Statistics and Data Analysis

2: Descriptive Statistics14/54

http://en.wikipedia.org/wiki/Shere_Hite

Page 15: Statistics and Data Analysis

2: Descriptive Statistics15/54

The Lesson…

Having a really big sample does not assure you of an accurate result. It may assure you of a really solid, really bad (inaccurate) result.

Page 16: Statistics and Data Analysis

2: Descriptive Statistics16/54 http://old.cni.org/docs/ima.ip-workshop/Massarsky.html

How do ASCAP, BMI and SESAC allocate the royalty pool to specific authors and publishers? The following relates to terrestrial radio, which, as a group, pays a lump sum into the pool, which is then allocated by the PRSs.

Page 17: Statistics and Data Analysis

2: Descriptive Statistics17/54

A Descriptive Statistic Is … ?

Describes what?
    The sample data
    The population that the data came from

Page 18: Statistics and Data Analysis

2: Descriptive Statistics18/54

Measures of Location

Location and central tendency
    There exists a distribution of values.
    We are interested in the “center” of the distribution.

Two measures are the sample mean and the sample median.
    They look similar, and measure the same thing.
    They differ systematically (and predictably) when the data are not ‘symmetric.’

These are 30 hours of average defect data on sets of circuit boards. Roughly what is the typical value?

1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35

Page 19: Statistics and Data Analysis

2: Descriptive Statistics19/54

The Sample Mean

There are N observations (data points) in the sample.

Sample data: \( [y_1, y_2, y_3, y_4, \ldots, y_N] \)

In this sample, N = 30. The sample mean is

\[ \bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i = \frac{1}{N}\,[\,y_1 + y_2 + y_3 + y_4 + \cdots + y_N\,] = \frac{1}{30}(1.45 + \cdots + 2.35) = \frac{56.30}{30} = 1.8767 \]

These are 30 hours of average defect data on sets of circuit boards. Roughly what is the typical value?

1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35
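The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch; the helper name `sample_mean` is chosen here for illustration and does not come from the slides.

```python
# Hourly average defects per board (the 30 observations listed on the slide).
defects = [
    1.45, 1.65, 1.50, 2.25, 1.65, 1.60, 2.30, 2.20, 2.70, 1.70,
    2.35, 1.70, 1.90, 1.45, 1.40, 2.60, 2.05, 1.70, 1.05, 2.35,
    1.90, 1.55, 1.95, 1.60, 2.05, 2.05, 1.70, 2.30, 1.30, 2.35,
]


def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by N."""
    return sum(values) / len(values)


n = len(defects)
total = sum(defects)
y_bar = sample_mean(defects)
print(f"N = {n}, sum = {total:.2f}, mean = {y_bar:.4f}")
```

The printed sum and mean should agree with the 56.30 and 1.8767 shown in the formula.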

Page 20: Statistics and Data Analysis

2: Descriptive Statistics20/54

It may be necessary to ‘weight’ aggregate data.

Average Home Listings

\[ \overline{\text{Listing}} = \frac{1}{51}(896{,}800 + 713{,}864 + \cdots + 164{,}326) = 369{,}687 \]

Page 21: Statistics and Data Analysis

2: Descriptive Statistics21/54

Averaging Averages?

Hawaii’s average listing = $896,800
Hawaii’s population = 1,275,194
Illinois’ average listing = $377,683
Illinois’ population = 12,763,371

Illinois and Hawaii each get weight 1/51 = .019607 when the mean is computed. Looks like Hawaii is getting too much influence.

Page 22: Statistics and Data Analysis

2: Descriptive Statistics22/54

A Properly Weighted Average

\[ \text{Simple average} = \overline{\text{Listing}} = \sum_{\text{States}} \text{Weight}_{\text{State}} \times \text{Listing}_{\text{State}}, \qquad \text{Weight}_{\text{State}} = \frac{1}{51} = .019607 \]

Illinois is 10 times as big as Hawaii. Suppose we use weights that are in proportion to the state's population. (The weights sum to 1.0.) Weight varies from .001717 for Wyoming to .121899 for California.

New average is 409,234 compared to 369,687 without weights, an error of 11%. Sometimes an unequal weighting of the observations is necessary.

State populations from http://www.factmonster.com/ipka/A0004986.html
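A small sketch of the weighting idea, using only the two states whose figures are quoted on the slides (Hawaii and Illinois). The full 51-row table is not reproduced here, so this will not return the 409,234 figure; it only shows how population weights change an average of averages. The function name `weighted_mean` is an illustrative choice.

```python
# Average listings and populations quoted on the slides (two states only).
listings = {"Hawaii": 896_800, "Illinois": 377_683}
populations = {"Hawaii": 1_275_194, "Illinois": 12_763_371}


def weighted_mean(values, weights):
    """Weighted average; the weights are normalized so that they sum to 1."""
    total_weight = sum(weights[key] for key in values)
    return sum(values[key] * weights[key] / total_weight for key in values)


# Equal weights reproduce the simple average of the two state averages.
equal_weights = {state: 1 for state in listings}
simple = weighted_mean(listings, equal_weights)
by_population = weighted_mean(listings, populations)

print(f"simple average:              {simple:,.0f}")
print(f"population-weighted average: {by_population:,.0f}")
```

With equal weights Hawaii counts as much as Illinois; with population weights the result moves toward the Illinois figure, because Illinois has about ten times as many people.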

Page 23: Statistics and Data Analysis

2: Descriptive Statistics23/54

Averaging Trending Time Series Observations Is Usually Not Informative

Does the mean over the entire observation period mean anything? (Does it estimate anything meaningful?)

Note how the mean changes completely depending on what time interval is used to compute it.

Page 24: Statistics and Data Analysis

2: Descriptive Statistics24/54

The Sample Median

Median = the middle observation after data are sorted.
    Odd number: Central observation: Med[1,2,4,6,8,9,17] = 6
    Even number: Midpoint between the two central observations: Med[1,2,4,6,8,9,14,17] = (6+8)/2 = 7
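The odd and even cases can be written as one short function. This is a minimal sketch; the name `sample_median` is an illustrative choice rather than anything from the slides.

```python
def sample_median(values):
    """Middle value of the sorted data; midpoint of the two middle values if N is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd N: the single central observation.
        return ordered[mid]
    # Even N: average the two central observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_result = sample_median([1, 2, 4, 6, 8, 9, 17])
even_result = sample_median([1, 2, 4, 6, 8, 9, 14, 17])
print(odd_result, even_result)
```

The two calls use the same data as the slide, so they should give 6 and 7.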

Page 25: Statistics and Data Analysis

2: Descriptive Statistics25/54

Sample Median of (Sorted) Defects Data

1.05 1.30 1.40 1.45 1.45 1.50 1.55 1.60 1.60 1.65 1.65 1.70 1.70 1.70 1.70 1.90 1.90 1.95 2.05 2.05 2.05 2.20 2.25 2.30 2.30 2.35 2.35 2.35 2.60 2.70

Median = 1.8000
Mean = 1.8767

[Figure: histogram of DEFECTS; horizontal axis: DEFECTS (1.000 to 3.000), vertical axis: Frequency (0 to 12)]

Page 26: Statistics and Data Analysis

2: Descriptive Statistics26/54

Tomorrow I will compute the average number of defectives for a 61st day. What is a good guess of the number I will find?

(Let’s deduce estimates of the mean and median from the histogram.)

Page 27: Statistics and Data Analysis

2: Descriptive Statistics27/54

Mean vs. Median in Skewed Data

[Figure: Skewed Earnings Distribution]

Monthly Earnings: N = 595, Median = 800, Mean = 883

The mean will exceed the median when the distribution is skewed to the right. (The skewness is in the direction of the long tail.)

These data are skewed to the right.

Page 28: Statistics and Data Analysis

2: Descriptive Statistics28/54

Extreme Observations Distort Means but Not Medians

Outlying observations distort the mean.
    Med [1,2,4,6,8,9,17] = 6
    Mean[1,2,4,6,8,9,17] = 6.714
    Med [1,2,4,6,8,9,17000] = 6 (still)
    Mean[1,2,4,6,8,9,17000] = 2432.9 (!)

This typically occurs when there are some outlying observations, such as in cross sections of income or wealth, and/or when the sample is not very large.
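The comparison is easy to reproduce with Python's standard `statistics` module. The sketch below uses the slide's two data sets; the variable names are chosen for illustration.

```python
import statistics

clean = [1, 2, 4, 6, 8, 9, 17]
with_outlier = [1, 2, 4, 6, 8, 9, 17000]

# Replacing the largest value with an extreme one moves the mean, not the median.
for label, data in [("clean", clean), ("with outlier", with_outlier)]:
    med = statistics.median(data)
    avg = statistics.mean(data)
    print(f"{label:>12}: median = {med}, mean = {avg:.3f}")
```

Only one observation changed, yet the mean moves by more than two orders of magnitude while the median stays at 6.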

Page 29: Statistics and Data Analysis

2: Descriptive Statistics29/54

Page 30: Statistics and Data Analysis

2: Descriptive Statistics30/54

The mean does not give information about the shape of the distribution.

Two problems with the computations:
(1) The data are ratings, not quantitative.
(2) The mean does not suggest the extreme nature of the data.

Page 31: Statistics and Data Analysis

2: Descriptive Statistics31/54

The problem with the mean or median as a description of a sample – more information is usually needed.

Both data sets have a mean of about 100.

Page 32: Statistics and Data Analysis

2: Descriptive Statistics32/54

Dispersion of the Observations

[Figure: Histogram of Defects; horizontal axis: Defects (1.2 to 2.8), vertical axis: Frequency (0 to 6)]

We quantify the variation of the values around the mean. Note the range is from 1.05 to 2.70. This gives an idea where the data lie. The mean plus a measure of the variation do the same job.

These are 30 hours of average defect data on sets of circuit boards.

1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35

Page 33: Statistics and Data Analysis

2: Descriptive Statistics33/54

The Problem with the Range as a Measure of Dispersion

These two data sets both have 1,000 observations that range from about 10 to about 180

Page 34: Statistics and Data Analysis

2: Descriptive Statistics34/54

A Measure of Dispersion

The standard deviation is the interesting value. You need to compute the variance to get the standard deviation.

Variance:
\[ s_y^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2 \]

Standard deviation:
\[ s_y = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2} \]

Note the units of measurement. The standard deviation has the same units as the mean. The standard deviation is the standard measure for the dispersion (spread) of a set of values (sample of observations).

Page 35: Statistics and Data Analysis

2: Descriptive Statistics35/54

The variance is the average squared deviation of the sample values from the mean. Why is N-1 in the denominator of s²?

    Everyone else does it.
    Minitab does it.
    I have totally no idea.
    Tendency of the variance to be too small when computed using 1/N when the sample size, N, is itself small. (When N is large, it won’t matter.)

See HOG, p. 37
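The last answer can be illustrated with a small simulation. Everything specific here is an assumption made for the sketch and not taken from the slides: the population is normal with true variance 1, each sample has N = 5 observations, and the seed and the number of replications are arbitrary.

```python
import random


def mean_variance_estimates(sample_size, replications, seed=12345):
    """Average the 1/N and 1/(N-1) variance estimates over many simulated samples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total_div_n = 0.0
    total_div_n_minus_1 = 0.0
    for _ in range(replications):
        # Assumed population: normal with mean 0 and variance 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        y_bar = sum(sample) / sample_size
        sum_sq = sum((y - y_bar) ** 2 for y in sample)
        total_div_n += sum_sq / sample_size
        total_div_n_minus_1 += sum_sq / (sample_size - 1)
    return total_div_n / replications, total_div_n_minus_1 / replications


avg_n, avg_n_minus_1 = mean_variance_estimates(sample_size=5, replications=20000)
print(f"average of 1/N estimates:     {avg_n:.3f}")
print(f"average of 1/(N-1) estimates: {avg_n_minus_1:.3f}")
```

Under these assumptions the 1/N version should average close to (N-1)/N = 0.8, which is too small, while the 1/(N-1) version should average close to the true value of 1.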

Page 36: Statistics and Data Analysis

2: Descriptive Statistics36/54

Computing a Standard Deviation

  Y     Deviation     Squared
        From Mean     Deviation
  1       -2.1          4.41
  4        0.9          0.81
  6        2.9          8.41
  0       -3.1          9.61
  3       -0.1          0.01
  2       -1.1          1.21
  6        2.9          8.41
  4        0.9          0.81
  4        0.9          0.81
  1       -2.1          4.41
 SUM       0.0         38.90

Sum = 31
Mean = 31/10 = 3.1
Sum of squared deviations = 38.90
Variance = 38.90/(10-1) = 4.322
Standard Deviation = 2.079
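The table can be rebuilt step by step in Python. This is a sketch of the same hand computation, with variable names chosen for illustration.

```python
import math

y = [1, 4, 6, 0, 3, 2, 6, 4, 4, 1]
n = len(y)
mean = sum(y) / n

print(f"{'Y':>3} {'Deviation':>10} {'Squared':>9}")
sum_sq = 0.0
for value in y:
    dev = value - mean  # deviation from the mean
    sum_sq += dev ** 2  # accumulate the squared deviations
    print(f"{value:>3} {dev:>10.1f} {dev ** 2:>9.2f}")

variance = sum_sq / (n - 1)  # N - 1 in the denominator
std_dev = math.sqrt(variance)
print(f"mean = {mean:.1f}, sum of squared deviations = {sum_sq:.2f}")
print(f"variance = {variance:.3f}, standard deviation = {std_dev:.3f}")
```

The deviations always sum to zero, which is why they are squared before being averaged.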

Page 37: Statistics and Data Analysis

2: Descriptive Statistics37/54

Standard Deviation

\[ \text{Variance} = \frac{1}{30-1}\sum_{i=1}^{30}\left(Y_i - 1.8767\right)^2 = \frac{1}{30-1}(4.808667) = 0.165816 \]

\[ \text{Standard Deviation} = \sqrt{\frac{1}{30-1}\sum_{i=1}^{30}\left(Y_i - 1.8767\right)^2} = 0.407205 \]

These are 30 hours of average defect data on sets of circuit boards.

1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35
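For the defects data, Python's standard `statistics` module gives the same figures. Its `variance` and `stdev` functions are the sample versions, which divide by N - 1.

```python
import statistics

# The 30 hourly defect averages from the slide.
defects = [
    1.45, 1.65, 1.50, 2.25, 1.65, 1.60, 2.30, 2.20, 2.70, 1.70,
    2.35, 1.70, 1.90, 1.45, 1.40, 2.60, 2.05, 1.70, 1.05, 2.35,
    1.90, 1.55, 1.95, 1.60, 2.05, 2.05, 1.70, 2.30, 1.30, 2.35,
]

variance = statistics.variance(defects)  # sample variance, N - 1 denominator
std_dev = statistics.stdev(defects)      # square root of the sample variance
print(f"variance = {variance:.6f}, standard deviation = {std_dev:.6f}")
```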

Page 38: Statistics and Data Analysis

2: Descriptive Statistics38/54

Distribution of Values

[Figure: Histogram of Defects; horizontal axis: Defects (1.2 to 2.8), vertical axis: Frequency (0 to 6)]

Page 39: Statistics and Data Analysis

2: Descriptive Statistics39/54

Reliable Rules of Thumb

Almost always, 66% of the observations in a sample will lie in the range [mean - 1 s.d. to mean + 1 s.d.].
Almost always, 95% of the observations in a sample will lie in the range [mean - 2 s.d. to mean + 2 s.d.].
Almost always, 99.5% of the observations in a sample will lie in the range [mean - 3 s.d. to mean + 3 s.d.].

When these rules are not met, they will almost be met. Data nearly always act this way. (For data that follow a bell-shaped normal distribution, the corresponding figures are about 68%, 95%, and 99.7%.)

Page 40: Statistics and Data Analysis

2: Descriptive Statistics40/54

A Reliable Empirical Rule

[Figure: Dotplot of Defects; horizontal axis: Defects (1.00 to 2.75)]

Mean ± 1 s = 1.8767 ± .4072 = (1.47 to 2.28) includes 18/30 = 60%

Mean ± 2 s = 1.8767 ± 2(.4072) = (1.06 to 2.69) includes 28/30 = 93%

Minitab: Graph Dotplot …
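The counts on this slide can be reproduced by counting how many observations fall inside each interval. This is a sketch; the helper `count_within` is a name introduced here for illustration.

```python
import statistics

# The 30 hourly defect averages used throughout these slides.
defects = [
    1.45, 1.65, 1.50, 2.25, 1.65, 1.60, 2.30, 2.20, 2.70, 1.70,
    2.35, 1.70, 1.90, 1.45, 1.40, 2.60, 2.05, 1.70, 1.05, 2.35,
    1.90, 1.55, 1.95, 1.60, 2.05, 2.05, 1.70, 2.30, 1.30, 2.35,
]

mean = statistics.mean(defects)
sd = statistics.stdev(defects)


def count_within(values, center, spread, k):
    """Count the values lying in [center - k*spread, center + k*spread]."""
    lower, upper = center - k * spread, center + k * spread
    return sum(1 for v in values if lower <= v <= upper)


for k in (1, 2, 3):
    inside = count_within(defects, mean, sd, k)
    share = 100 * inside / len(defects)
    print(f"mean ± {k} s: {inside}/{len(defects)} = {share:.0f}%")
```

With only 30 observations the shares are lumpy, so a match to the rule of thumb within a few percentage points is about as close as one can expect.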

Page 41: Statistics and Data Analysis

2: Descriptive Statistics41/54

Rules For Transformations

Mean of a + bY = \( a + b\,\bar{y} \)

Standard deviation of a + bY = \( |b|\, s_y \)

Page 42: Statistics and Data Analysis

2: Descriptive Statistics42/54

Which city is warmer, New York (USA) or Old York (England)? Which is more variable?

Average Temperatures (high + low)/2

Month   NY (°F)   OY (°C)      Month   NY (°F)   OY (°C)
Jan      29.5       2.0        Jul      75.5      15.5
Feb      32.0       2.0        Aug      73.5      15.0
Mar      35.0       4.5        Sep      66.0      13.0
Apr      50.0       8.5        Oct      55.0       9.5
May      60.5       9.5        Nov      45.0       6.0
Jun      70.0      13.0        Dec      35.0       3.5

City        Mean     Std.Dev.   Min      Max
Old York    8.500    4.913      2.000    15.50
New York    52.25    16.93      29.50    75.50
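The two cities are reported in different units, so the rules for a + bY are what make them comparable. The sketch below assumes the (f) and (c) column labels mean degrees Fahrenheit and degrees Celsius, and uses the standard conversion C = (F - 32) × 5/9, that is, a = -160/9 and b = 5/9.

```python
import statistics

# Monthly averages, January through December, from the slide.
new_york_f = [29.5, 32.0, 35.0, 50.0, 60.5, 70.0, 75.5, 73.5, 66.0, 55.0, 45.0, 35.0]
old_york_c = [2.0, 2.0, 4.5, 8.5, 9.5, 13.0, 15.5, 15.0, 13.0, 9.5, 6.0, 3.5]

# Celsius = a + b * Fahrenheit
a, b = -160 / 9, 5 / 9

# Apply the transformation rules to the summary statistics.
ny_mean_c = a + b * statistics.mean(new_york_f)  # mean of a + bY = a + b * mean
ny_sd_c = abs(b) * statistics.stdev(new_york_f)  # s.d. of a + bY = |b| * s.d.

# Cross-check: convert every month first, then summarize.
ny_c = [a + b * f for f in new_york_f]

print(f"New York: mean = {ny_mean_c:.2f} C, s.d. = {ny_sd_c:.2f} C")
print(f"Old York: mean = {statistics.mean(old_york_c):.2f} C, "
      f"s.d. = {statistics.stdev(old_york_c):.2f} C")
```

Once both cities are on the Celsius scale, New York comes out both warmer on average and more variable than Old York.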

Page 43: Statistics and Data Analysis

2: Descriptive Statistics43/54

Application – Cost of Defects

Suppose the cost to repair defects is $25 + 10*Defects, i.e., a $25 setup cost plus $10 per defect.

Mean defects = 1.8767
Standard deviation of defects = 0.407205
Mean cost = $25 + $10(1.8767) = $43.767
Standard deviation of cost = $10(.407205) = $4.07205

These are 30 observations of average defect data on sets of manufactured circuit boards.

1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35
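As a cross-check, the cost can be computed for every observation and then summarized directly, without using the transformation rules. This is a sketch with illustrative variable names.

```python
import statistics

# The 30 defect averages from the slide.
defects = [
    1.45, 1.65, 1.50, 2.25, 1.65, 1.60, 2.30, 2.20, 2.70, 1.70,
    2.35, 1.70, 1.90, 1.45, 1.40, 2.60, 2.05, 1.70, 1.05, 2.35,
    1.90, 1.55, 1.95, 1.60, 2.05, 2.05, 1.70, 2.30, 1.30, 2.35,
]

# Cost = $25 setup plus $10 per defect, applied to each observation.
costs = [25 + 10 * d for d in defects]

mean_cost = statistics.mean(costs)
sd_cost = statistics.stdev(costs)
print(f"mean cost = ${mean_cost:.3f}, s.d. of cost = ${sd_cost:.5f}")
```

The setup cost shifts the mean but drops out of the standard deviation, which depends only on the $10 multiplier.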

Page 44: Statistics and Data Analysis

2: Descriptive Statistics44/54

Correlation

Variables Y and X vary together.

Causality vs. correlation: Does movement in X “cause” movement in Y in some metaphysical sense?

Correlation
    Simultaneous movement through a statistical relationship
    Simultaneous variation “induced” by the variation of a common third effect

Page 45: Statistics and Data Analysis

2: Descriptive Statistics45/54

Samples of House Listings and Per Capita Incomes at a Particular Time

Page 46: Statistics and Data Analysis

2: Descriptive Statistics46/54

Scatter Plot Suggests Positive Correlation

[Figure: Scatterplot of Listing vs IncomePC; horizontal axis: IncomePC (15,000 to 32,500), vertical axis: Listing (100,000 to 900,000)]

Page 47: Statistics and Data Analysis

2: Descriptive Statistics47/54

Regression Measures Correlation

[Figure: Scatterplot of Listing vs IncomePC with fitted regression line; horizontal axis: IncomePC (15,000 to 32,500), vertical axis: Listing (100,000 to 900,000)]

Regression Line: Listing = a + b IncomePC

Page 48: Statistics and Data Analysis

2: Descriptive Statistics48/54

Correlation Is Not Causation

[Figure: Scatterplot of Income vs GasPrice; horizontal axis: GasPrice (20 to 120), vertical axis: Income (10,000 to 27,500)]

Price and Income seem to be “positively” related.

The U.S. Gasoline Market. Data are yearly from 1953 to 2004. Plot of per capita income vs. gasoline price index.

Page 49: Statistics and Data Analysis

2: Descriptive Statistics49/54

The Hidden (Spurious) Relationship

[Figure: Scatterplot of Income vs Year; horizontal axis: Year (1950 to 2010), vertical axis: Income (10,000 to 27,500)]

[Figure: Scatterplot of GasPrice vs Year; horizontal axis: Year (1950 to 2010), vertical axis: GasPrice (20 to 120)]

Not positively “related” to each other; both positively related to “time.”

Page 50: Statistics and Data Analysis

2: Descriptive Statistics50/54

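Correlation is the interesting number. We must compute covariance and the two standard deviations first.

Standard deviations:
\[ s_X = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2}, \qquad s_Y = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2} \]

Covariance:
\[ s_{XY} = \frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right) \]

Correlation:
\[ r_{XY} = \frac{s_{XY}}{s_X\, s_Y} \]

\( -1 \le r_{XY} \le +1 \). Units free. A pure number.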
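The three steps (two standard deviations, the covariance, then their ratio) translate directly into code. This is a sketch: the function name `correlation` and the small data sets are hypothetical and chosen only so the results are easy to verify by hand.

```python
import math


def correlation(x, y):
    """Sample correlation r = s_xy / (s_x * s_y), using N - 1 denominators."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return s_xy / (s_x * s_y)


# Hypothetical data, not taken from the slides.
x = [1, 2, 3, 4, 5]
y_up = [25 + 10 * v for v in x]  # exact increasing line
y_down = [-v for v in x]         # exact decreasing line
y_mixed = [2, 1, 4, 3, 5]        # increasing on the whole, but not exactly linear

print(f"{correlation(x, y_up):+.2f}")
print(f"{correlation(x, y_down):+.2f}")
print(f"{correlation(x, y_mixed):+.2f}")
```

An exact linear relationship gives r of +1 or -1 whatever the intercept and slope, which is the sense in which r is units free.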

Page 51: Statistics and Data Analysis

2: Descriptive Statistics51/54

Correlation

[Figure: Scatterplot of Listing vs IncomePC; horizontal axis: IncomePC (15,000 to 32,500), vertical axis: Listing (100,000 to 900,000)]

\( r_{\text{Income},\,\text{Listing}} = +0.591 \)

Page 52: Statistics and Data Analysis

2: Descriptive Statistics52/54

Correlations

[Figure: three scatterplots]
    Scatterplot of cost vs Defects: r = +1.0
    Scatterplot of Noise vs Defects: r = 0.0
    Scatterplot of Noise vs MoreNoise: r = +0.5

Page 53: Statistics and Data Analysis

2: Descriptive Statistics53/54

Sample Statistics and Population Parameters

Sample has a sample mean, \( \bar{Y} \), and standard deviation, \( s_Y \).

Population has a mean, μ, and standard deviation, σ.

The sample “looks like” the population.
    The sample statistics resemble the population features.
    The bigger the RANDOM sample, the closer the resemblance will be.
    We will study this later in the course.

Page 54: Statistics and Data Analysis

2: Descriptive Statistics54/54

Summary

Statistics to describe location (mean) and spread (standard deviation) of a sample of values
    Interpretations
    Computations
    Complications

Statistics and graphical tools to describe bivariate (two variable) relationships
    Scatter plots
    Correlation