Statistics and Data Analysis. Professor William Greene, Stern School of Business, IOMS Department, Department of Economics. So, what is the story?
2: Descriptive Statistics1/54
Statistics and Data Analysis
Professor William Greene
Stern School of Business
IOMS Department
Department of Economics
2: Descriptive Statistics2/54
Statistics and Data Analysis
Part 2 – Descriptive Statistics
Summarizing data with useful statistics
2: Descriptive Statistics3/54
Use random samples and basic descriptive statistics.
What is the ‘breach rate’ in a pool of tens of thousands of mortgages? (‘Breach’ = improperly underwritten or serviced or otherwise faulty mortgage.)
2: Descriptive Statistics4/54
The forensic analysis was an examination of statistics from a random sample of 1,500 loans.
2: Descriptive Statistics5/54
Descriptive Statistics Agenda
Populations and Random Samples
Descriptive Statistics for a Variable
  Measures of location: Mean, median, mode
  Measure of dispersion: Standard deviation
Measuring Correlation of Two Variables
  Understanding correlation
  Measuring correlation
  Scatter plots and regression
2: Descriptive Statistics6/54
Populations and Samples
Population: Collection of all possible observations (data points) on a variable
Sample: A subset of the data points in the population
Random sample: Defined by the way the sample data are obtained. All points in the population are equally likely to be drawn in any particular sample.
What is the purpose of obtaining a sample? To describe or learn about the population.
  The sample is observed.
  The population is assumed.
In order to learn confidently about the population from a sample, the sample must be ‘random.’
2: Descriptive Statistics7/54
Random Sampling
A production process produces circuit boards. Boards are produced in each hour with an average of 2 defects per board when the process is in control. Each hour, the engineer examines a random sample of 100 circuit boards. The average numbers of defects per board in a particular 30-hour week are
Hour 1: Mean of 100 boards = 1.95, Hour 2: mean = 2.65, Hour 3: mean = 1.80, …, Hour 30: mean = 2.35. (These are estimates of the defect rate per board.)
The objective of drawing the sample is to determine whether the process is in control or not. (The process is under control if the defect rate is no more than 2.)
Method: Assuming the process is in control, would we expect to see this rate of defects?
2: Descriptive Statistics8/54
Random samples of behavior are difficult to obtain, especially by telephone.
2: Descriptive Statistics9/54
Nonrandom Samples
Nonrandom samples produce tainted, sometimes unbelievable results
  Biased with respect to the population
  May describe a specific subset of the population that is not useful.
2: Descriptive Statistics10/54
(Non)Randomness of Samples
Sources of bias in samples (generally related)
Bad sample design – e.g., home phone surveys conducted during working hours
Survey (non)response bias – e.g., opinion surveys about service quality
Participation bias – e.g., voluntary participation in a survey
Self selection – volunteering for a trial or an opinion sample. (Shere Hite’s cultural revolution)
Attrition bias from clinical trials - e.g., if the drug works, the subject does not come back.
2: Descriptive Statistics11/54
Nonrandom results in incubator funds.
The “NYU No Action Letter”
2: Descriptive Statistics12/54
Nonscientific, Nonrandom “(non)Sampling”
A Cultural Revolution … “3000 women, ages 14 to 78 describe in their own words …”
2: Descriptive Statistics13/54
A Cultural Revolution … “3000 women, ages 14 to 78 describe in their own words …”
http://www.amazon.com/The-Hite-Report-National-Sexuality/dp/1583225692
2: Descriptive Statistics14/54
http://en.wikipedia.org/wiki/Shere_Hite
2: Descriptive Statistics15/54
The Lesson…
Having a really big sample does not assure you of an accurate result. It may assure you of a really solid, really bad (inaccurate) result.
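The lesson can be illustrated with a small sketch. Everything here is hypothetical: the population is just the integers 0 to 9,999, and "self-selection" is modeled crudely by letting only the 3,000 highest values respond. The point is only that a large nonrandom sample can miss the population mean by far more than a small random one.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 people with scores 0..9999.
population = list(range(10_000))
true_mean = mean(population)

# Nonrandom "sample": only the 3,000 highest scorers choose to respond.
self_selected = population[-3000:]

# Random sample: 100 people, each equally likely to be drawn.
random_sample = random.sample(population, 100)

biased_error = abs(mean(self_selected) - true_mean)
random_error = abs(mean(random_sample) - true_mean)

print(f"true mean               = {true_mean}")
print(f"big nonrandom (n=3000)  error = {biased_error:.1f}")
print(f"small random  (n=100)   error = {random_error:.1f}")
```

The nonrandom sample is 30 times larger, yet its error is fixed by the selection rule and does not shrink as more self-selected responses arrive.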
2: Descriptive Statistics16/54
http://old.cni.org/docs/ima.ip-workshop/Massarsky.html
How do ASCAP, BMI and SESAC allocate the royalty pool to specific authors and publishers? The following relates to terrestrial radio, which, as a group, pays a lump sum into the pool, which is then allocated by the PRSs.
2: Descriptive Statistics17/54
A Descriptive Statistic Is … ?
Describes what?
  The sample data
  The population that the data came from
2: Descriptive Statistics18/54
Measures of Location
Location and central tendency
  There exists a distribution of values
  We are interested in the “center” of the distribution
Two measures are the sample mean and the sample median
  They look similar, and measure the same thing.
  They differ systematically (and predictably) when the data are not ‘symmetric.’
These are 30 hours of average defect data on sets of circuit boards. Roughly what is the typical value?
1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35
2: Descriptive Statistics19/54
The Sample Mean
There are N observations (data points) in the sample. In this sample, N = 30.
Sample data: $[y_1, y_2, y_3, y_4, \ldots, y_N]$
The sample mean is
$$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i = \frac{1}{N}\,[y_1 + y_2 + y_3 + y_4 + \cdots + y_N] = \frac{1}{30}(1.45 + \cdots + 2.35) = \frac{56.30}{30} = 1.8767$$
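As a quick check of the arithmetic, the sample mean can be computed directly from the 30 hourly values listed on the slide. This is a minimal sketch; the variable names are illustrative only.

```python
# Hourly average defects per board (30 hours), copied from the slide.
defects = [
    1.45, 1.65, 1.50, 2.25, 1.65, 1.60, 2.30, 2.20, 2.70, 1.70,
    2.35, 1.70, 1.90, 1.45, 1.40, 2.60, 2.05, 1.70, 1.05, 2.35,
    1.90, 1.55, 1.95, 1.60, 2.05, 2.05, 1.70, 2.30, 1.30, 2.35,
]

n = len(defects)          # number of observations, N
total = sum(defects)      # y_1 + y_2 + ... + y_N
sample_mean = total / n   # (1/N) * sum

print(f"N = {n}, sum = {total:.2f}, mean = {sample_mean:.4f}")
```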
2: Descriptive Statistics20/54
It may be necessary to ‘weight’ aggregate data.
Average Home Listings
$$\overline{\text{Listing}} = \frac{1}{51}(896{,}800 + 713{,}864 + \cdots + 164{,}326) = 369{,}687$$
2: Descriptive Statistics21/54
Averaging Averages?
Hawaii’s average listing = $896,800
Hawaii’s population = 1,275,194
Illinois’ average listing = $377,683
Illinois’ population = 12,763,371
Illinois and Hawaii each get weight 1/51 = .019607 when the mean is computed.
Looks like Hawaii is getting too much influence.
2: Descriptive Statistics22/54
A Properly Weighted Average
Simple average: $\overline{\text{Listing}} = \sum_{\text{States}} \text{Weight}_{\text{State}} \times \text{Listing}_{\text{State}}$, with $\text{Weight}_{\text{State}} = \frac{1}{51} = .019607$ for every state.
Illinois is 10 times as big as Hawaii. Suppose we use weights that are in proportion to the state's population. (The weights sum to 1.0.)
Weight varies from .001717 for Wyoming to .121899 for California.
New average is 409,234 compared to 369,687 without weights, an error of 11%. Sometimes an unequal weighting of the observations is necessary.
State populations from http://www.factmonster.com/ipka/A0004986.html
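The weighting idea can be sketched with just the two states whose figures appear on the slides. This is an illustration only: the full 51-row data set is not reproduced here, so the numbers below will not match the 369,687 and 409,234 quoted above, and the direction of the difference depends on which states are included.

```python
# Two states only, using the listing and population figures quoted on the slides.
listing = {"Hawaii": 896_800, "Illinois": 377_683}
population = {"Hawaii": 1_275_194, "Illinois": 12_763_371}

# Simple average: every state gets the same weight, 1/(number of states).
simple_avg = sum(listing.values()) / len(listing)

# Population-weighted average: weights proportional to population, summing to 1.
total_pop = sum(population.values())
weights = {state: population[state] / total_pop for state in listing}
weighted_avg = sum(weights[state] * listing[state] for state in listing)

print(f"weights      = { {s: round(w, 4) for s, w in weights.items()} }")
print(f"simple avg   = {simple_avg:,.0f}")
print(f"weighted avg = {weighted_avg:,.0f}")
```

With equal weights Hawaii counts as much as Illinois; with population weights the average moves toward the Illinois figure.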
2: Descriptive Statistics23/54
Averaging Trending Time Series Observations Is Usually Not Informative
Does the mean over the entire observation period mean anything? (Does it estimate anything meaningful?)
Note how the mean changes completely depending on what time interval is used to compute it.
2: Descriptive Statistics24/54
The Sample Median
Median = the middle observation after data are sorted.
Odd number: Central observation: Med[1,2,4,6,8,9,17] = 6
Even number: Midpoint between the two central observations: Med[1,2,4,6,8,9,14,17] = (6+8)/2 = 7
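The odd/even rule above can be written as a short function. This is a sketch of the definition on the slide; the function name is illustrative, and Python's standard `statistics.median` does the same job.

```python
def median(values):
    """Middle observation after sorting; midpoint of the two central ones if N is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                                # odd N: the central observation
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even N: midpoint of the central pair

print(median([1, 2, 4, 6, 8, 9, 17]))       # odd-sized example from the slide
print(median([1, 2, 4, 6, 8, 9, 14, 17]))   # even-sized example from the slide
```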
2: Descriptive Statistics25/54
Sample Median of (Sorted) Defects Data
1.05 1.30 1.40 1.45 1.45 1.50 1.55 1.60 1.60 1.65 1.65 1.70 1.70 1.70 1.70 1.90 1.90 1.95 2.05 2.05 2.05 2.20 2.25 2.30 2.30 2.35 2.35 2.35 2.60 2.70
Median = 1.8000, Mean = 1.8767
[Figure: histogram of DEFECTS; vertical axis Frequency, horizontal axis DEFECTS from 1.000 to 3.000]
2: Descriptive Statistics26/54
Tomorrow I will compute the average number of defectives for a 61st day. What is a good guess of the number I will find?
(Let’s deduce estimates of the mean and median from the histogram.)
2: Descriptive Statistics27/54
Skewed Earnings Distribution: Mean vs. Median in Skewed Data
Monthly Earnings: N = 595, Median = 800, Mean = 883
The mean will exceed the median when the distribution is skewed to the right. (The skewness is in the direction of the long tail.)
These data are skewed to the right.
2: Descriptive Statistics28/54
Extreme Observations Distort Means but Not Medians
Outlying observations distort the mean
Med [1,2,4,6,8,9,17] = 6
Mean[1,2,4,6,8,9,17] = 6.714
Med [1,2,4,6,8,9,17000] = 6 (still)
Mean[1,2,4,6,8,9,17000] = 2432.9 (!)
This distortion typically occurs when there are some outlying observations, such as in cross sections of income or wealth, and/or when the sample is not very large.
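The comparison above is easy to reproduce. The sketch below uses Python's standard `statistics` module on the two small data sets from the slide.

```python
from statistics import mean, median

clean = [1, 2, 4, 6, 8, 9, 17]
contaminated = [1, 2, 4, 6, 8, 9, 17000]   # same data, one extreme observation

for label, data in (("clean", clean), ("contaminated", contaminated)):
    print(f"{label:>12}: median = {median(data)}, mean = {mean(data):.3f}")
```

Replacing 17 by 17000 leaves the median at the same middle value but drags the mean up by (17000 - 17)/7.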
2: Descriptive Statistics29/54
2: Descriptive Statistics30/54
The mean does not give information about the shape of the distribution.
Two problems with the computations
(1) The data are ratings, not quantitative
(2) The mean does not suggest the extreme nature of the data
2: Descriptive Statistics31/54
The problem with the mean or median as a description of a sample – more information is usually needed.
Both data sets have a mean of about 100.
2: Descriptive Statistics32/54
Dispersion of the Observations
[Figure: Histogram of Defects; vertical axis Frequency, horizontal axis Defects]
We quantify the variation of the values around the mean. Note the range is from 1.05 to 2.70. This gives an idea where the data lie. The mean plus a measure of the variation do the same job.
These are 30 hours of average defect data on sets of circuit boards.
1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35
2: Descriptive Statistics33/54
The Problem with the Range as a Measure of Dispersion
These two data sets both have 1,000 observations that range from about 10 to about 180
2: Descriptive Statistics34/54
A Measure of Dispersion
The standard deviation is the interesting value. You need to compute the variance to get the standard deviation.
Variance: $$s_y^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2$$
Standard deviation: $$s_y = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2}$$
Note the units of measurement. The standard deviation has the same units as the mean. The standard deviation is the standard measure for the dispersion (spread) of a set of values (sample of observations).
2: Descriptive Statistics35/54
The variance is the average squared deviation of the sample values from the mean. Why is N-1 in the denominator of $s^2$?
  Everyone else does it
  Minitab does it
  I have totally no idea.
  Tendency of the variance to be too small when computed using 1/N when the sample size, N, is itself small. (When N is large, it won’t matter.)
See HOG, p. 37
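The last reason can be made concrete without any simulation. The sketch below assumes a tiny hypothetical population (the faces of a die) and enumerates every equally likely sample of size 3 drawn with replacement. Averaged over all samples, the 1/N version falls short of the population variance, while the 1/(N-1) version matches it.

```python
from itertools import product

# Hypothetical population: the six faces of a die.
population = [1, 2, 3, 4, 5, 6]
mu = sum(population) / len(population)
pop_var = sum((v - mu) ** 2 for v in population) / len(population)

def variance(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    m = sum(sample) / len(sample)
    return sum((v - m) ** 2 for v in sample) / divisor

n = 3  # sample size
samples = list(product(population, repeat=n))  # all 6**3 samples, with replacement

avg_div_n = sum(variance(s, n) for s in samples) / len(samples)
avg_div_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print(f"population variance          = {pop_var:.4f}")
print(f"average of 1/N version       = {avg_div_n:.4f}")
print(f"average of 1/(N-1) version   = {avg_div_n_minus_1:.4f}")
```

The shortfall of the 1/N version is the factor (N-1)/N, which is why it stops mattering when N is large.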
2: Descriptive Statistics36/54
Computing a Standard Deviation
Y      Deviation from Mean   Squared Deviation
1      -2.1                  4.41
4       0.9                  0.81
6       2.9                  8.41
0      -3.1                  9.61
3      -0.1                  0.01
2      -1.1                  1.21
6       2.9                  8.41
4       0.9                  0.81
4       0.9                  0.81
1      -2.1                  4.41
SUM     0.0                  38.90
Sum of Y = 31
Mean = 31/10 = 3.1
Sum of squared deviations = 38.90
Variance = 38.90/(10-1) = 4.322
Standard Deviation = 2.079
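The table can be reproduced step by step. This is a minimal sketch that follows the same columns: deviation from the mean, squared deviation, then variance and standard deviation with N-1 in the denominator.

```python
from math import sqrt

y = [1, 4, 6, 0, 3, 2, 6, 4, 4, 1]   # the ten Y values from the table

n = len(y)
y_bar = sum(y) / n                         # mean
deviations = [v - y_bar for v in y]        # Y - mean (these sum to zero)
squared = [d ** 2 for d in deviations]     # squared deviations

variance = sum(squared) / (n - 1)
std_dev = sqrt(variance)

print(f"sum = {sum(y)}, mean = {y_bar}")
print(f"sum of squared deviations = {sum(squared):.2f}")
print(f"variance = {variance:.3f}, standard deviation = {std_dev:.3f}")
```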
2: Descriptive Statistics37/54
Standard Deviation
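$$\text{Variance} = \frac{1}{30-1}\sum_{i=1}^{30}\left(Y_i - 1.8767\right)^2 = \frac{1}{30-1}\,(4.808667) = 0.165816$$
$$\text{Standard Deviation} = \sqrt{\frac{1}{30-1}\sum_{i=1}^{30}\left(Y_i - 1.8767\right)^2} = 0.407205$$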
2: Descriptive Statistics38/54
Distribution of Values
[Figure: Histogram of Defects; vertical axis Frequency, horizontal axis Defects]
2: Descriptive Statistics39/54
Reliable Rules of Thumb
For roughly bell-shaped data, about two-thirds (roughly 68%) of the observations in a sample will lie in the range [mean - 1 s.d. to mean + 1 s.d.]
About 95% of the observations in a sample will lie in the range [mean - 2 s.d. to mean + 2 s.d.]
About 99.7% of the observations in a sample will lie in the range [mean - 3 s.d. to mean + 3 s.d.]
When these rules are not met exactly, they will usually be close to being met. Data very often act this way.
2: Descriptive Statistics40/54
A Reliable Empirical Rule
[Figure: Dotplot of Defects; horizontal axis Defects from 1.00 to 2.75]
Mean ± 1 s =(1.47 to 2.28) includes 18/30 = 60%
Mean ± 2 s = 1.8767 ± 2(.4072) = 1.06 to 2.69 includes 28/30 = 93%
Minitab: Graph Dotplot …
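The counts quoted for the defects data can be checked by counting how many of the 30 values fall inside each band. This is a sketch using Python's standard `statistics` module rather than Minitab; `stdev` uses the N-1 denominator, as in the slides.

```python
from statistics import mean, stdev

defects = [
    1.45, 1.65, 1.50, 2.25, 1.65, 1.60, 2.30, 2.20, 2.70, 1.70,
    2.35, 1.70, 1.90, 1.45, 1.40, 2.60, 2.05, 1.70, 1.05, 2.35,
    1.90, 1.55, 1.95, 1.60, 2.05, 2.05, 1.70, 2.30, 1.30, 2.35,
]

m = mean(defects)
s = stdev(defects)

counts = {}
for k in (1, 2, 3):
    low, high = m - k * s, m + k * s
    counts[k] = sum(low <= v <= high for v in defects)
    share = counts[k] / len(defects)
    print(f"mean ± {k} s = ({low:.2f} to {high:.2f}) includes "
          f"{counts[k]}/{len(defects)} = {share:.0%}")
```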
2: Descriptive Statistics41/54
Rules For Transformations
Mean of $a + bY$ = $a + b\,\bar{y}$
Standard deviation of $a + bY$ = $|b|\, s_y$
2: Descriptive Statistics42/54
Which city is warmer, New York (USA) or Old York (England)? Which is more variable?
Average Temperatures (high + low)/2
Month   NY (F)   OY (C)
Jan     29.5     2.0
Feb     32.0     2.0
Mar     35.0     4.5
Apr     50.0     8.5
May     60.5     9.5
Jun     70.0     13.0
Jul     75.5     15.5
Aug     73.5     15.0
Sep     66.0     13.0
Oct     55.0     9.5
Nov     45.0     6.0
Dec     35.0     3.5

City       Mean    Std.Dev.   Min     Max
Old York   8.500   4.913      2.000   15.50
New York   52.25   16.93      29.50   75.50
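The two cities are measured in different units, so the summary statistics cannot be compared directly. One way to answer the question, sketched below, is to apply the transformation rules with Celsius = a + b × Fahrenheit, where a = -160/9 and b = 5/9 (the standard conversion). The code then cross-checks the rules by converting each month and recomputing.

```python
from statistics import mean, stdev

ny_f = [29.5, 32.0, 35.0, 50.0, 60.5, 70.0, 75.5, 73.5, 66.0, 55.0, 45.0, 35.0]
oy_c = [2.0, 2.0, 4.5, 8.5, 9.5, 13.0, 15.5, 15.0, 13.0, 9.5, 6.0, 3.5]

# Celsius = a + b * Fahrenheit
a, b = -160 / 9, 5 / 9

# Transformation rules: mean(a + bY) = a + b*mean(Y), sd(a + bY) = |b|*sd(Y)
ny_c_mean = a + b * mean(ny_f)
ny_c_sd = abs(b) * stdev(ny_f)

# Cross-check: convert every month, then compute the statistics directly.
ny_c = [(f - 32) * 5 / 9 for f in ny_f]

print(f"New York: mean = {ny_c_mean:.2f} C, s.d. = {ny_c_sd:.2f} C")
print(f"  (direct: mean = {mean(ny_c):.2f} C, s.d. = {stdev(ny_c):.2f} C)")
print(f"Old York: mean = {mean(oy_c):.2f} C, s.d. = {stdev(oy_c):.2f} C")
```

On a common Celsius scale, New York's mean is 11.25 against 8.5 for Old York, and its standard deviation is also larger, so in these data New York is both warmer and more variable.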
2: Descriptive Statistics43/54
Application – Cost of Defects
Suppose the cost to repair defects is $25 + $10*Defects, i.e., a $25 setup cost plus $10 per defect.
Mean defects = 1.8767, Standard Deviation = 0.407205
Mean Cost = $25 + $10(1.8767) = $43.767
Standard Deviation Cost = $10(.407205) = $4.07205
These are 30 observations of average defect data on sets of manufactured circuit boards.
1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35
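The shortcut can be verified by brute force: compute the cost for each of the 30 observations, then take the mean and standard deviation of the costs directly. This is a minimal sketch using the cost rule stated on the slide.

```python
from statistics import mean, stdev

defects = [
    1.45, 1.65, 1.50, 2.25, 1.65, 1.60, 2.30, 2.20, 2.70, 1.70,
    2.35, 1.70, 1.90, 1.45, 1.40, 2.60, 2.05, 1.70, 1.05, 2.35,
    1.90, 1.55, 1.95, 1.60, 2.05, 2.05, 1.70, 2.30, 1.30, 2.35,
]

# Cost = 25 + 10 * Defects, applied observation by observation.
costs = [25 + 10 * d for d in defects]

print(f"mean cost: direct = {mean(costs):.3f}, rule = {25 + 10 * mean(defects):.3f}")
print(f"s.d. cost: direct = {stdev(costs):.5f}, rule = {10 * stdev(defects):.5f}")
```

The setup cost shifts the mean but has no effect on the standard deviation; only the per-defect multiplier scales it.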
2: Descriptive Statistics44/54
Correlation
Variables Y and X vary together
Causality vs. correlation: Does movement in X “cause” movement in Y in some metaphysical sense?
Correlation
  Simultaneous movement through a statistical relationship
  Simultaneous variation “induced” by the variation of a common third effect
2: Descriptive Statistics45/54
Samples of House Listings and Per Capita Incomes at a Particular Time
2: Descriptive Statistics46/54
Scatter Plot Suggests Positive Correlation
[Figure: Scatterplot of Listing vs IncomePC; vertical axis Listing, horizontal axis IncomePC]
2: Descriptive Statistics47/54
Regression Measures Correlation
[Figure: Scatterplot of Listing vs IncomePC with a fitted line; vertical axis Listing, horizontal axis IncomePC]
Regression Line: Listing = a + b IncomePC
2: Descriptive Statistics48/54
Correlation Is Not Causation
[Figure: Scatterplot of Income vs GasPrice; vertical axis Income, horizontal axis GasPrice]
Price and Income seem to be “positively” related.
The U.S. Gasoline Market. Data are yearly from 1953 to 2004. Plot of per capita income vs. gasoline price index.
2: Descriptive Statistics49/54
The Hidden (Spurious) Relationship
[Figure: Scatterplot of Income vs Year; vertical axis Income, horizontal axis Year]
[Figure: Scatterplot of GasPrice vs Year; vertical axis GasPrice, horizontal axis Year]
Not positively “related” to each other; both positively related to “time.”
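A small made-up example shows how a shared time trend can manufacture correlation. The series below are NOT the gasoline data: each is a hypothetical straight-line trend in time plus a wiggle, and the two wiggles are constructed to be exactly uncorrelated with each other. The helper `correlation` is defined here for the illustration.

```python
from math import sqrt

def correlation(x, y):
    """Sample correlation: covariance divided by the two standard deviations."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    s_x = sqrt(sum((a - x_bar) ** 2 for a in x) / (n - 1))
    s_y = sqrt(sum((b - y_bar) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)

years = list(range(12))
wiggle_a = [1, -1] * 6          # hypothetical variation in "income"
wiggle_b = [1, 1, -1, -1] * 3   # hypothetical variation in "price", unrelated to wiggle_a

income = [2 * t + a for t, a in zip(years, wiggle_a)]   # trend + wiggle
price = [3 * t + b for t, b in zip(years, wiggle_b)]    # trend + wiggle

r_levels = correlation(income, price)          # dominated by the common trend
r_detrended = correlation(wiggle_a, wiggle_b)  # what is left once time is removed

print(f"correlation of the trending series = {r_levels:.3f}")
print(f"correlation after removing trends  = {r_detrended:.3f}")
```

The trending series look strongly related only because both move with time; once the trend is taken out, nothing is left.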
2: Descriptive Statistics50/54
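Correlation is the interesting number. We must compute covariance and the two standard deviations first.
Standard deviations: $$s_X = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2}, \qquad s_Y = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2}$$
Covariance: $$s_{XY} = \frac{\sum_{i=1}^{N}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{N-1}$$
Correlation: $$r_{XY} = \frac{s_{XY}}{s_X\, s_Y}$$
$-1 \le r_{XY} \le +1$. Units free. A pure number.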
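The three steps translate directly into code. The sketch below computes the two standard deviations, the covariance, and their ratio. It is tried on a few defect values paired with the repair cost 25 + 10 × Defects, an exact linear relationship, so the correlation should come out at +1; the function name and the choice of five values are illustrative.

```python
from math import sqrt

def correlation(x, y):
    """r_XY = s_XY / (s_X * s_Y), all with N-1 in the denominator."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    return s_xy / (s_x * s_y)

defects = [1.45, 1.65, 1.50, 2.25, 1.65]    # first five hourly values
cost = [25 + 10 * d for d in defects]       # exact linear function of defects
savings = [100 - 10 * d for d in defects]   # hypothetical: falls as defects rise

print(f"r(defects, cost)    = {correlation(defects, cost):+.3f}")
print(f"r(defects, savings) = {correlation(defects, savings):+.3f}")
```

Because the result is a ratio of quantities in matching units, rescaling either variable (dollars to cents, say) leaves r unchanged.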
2: Descriptive Statistics51/54
Correlation
[Figure: Scatterplot of Listing vs IncomePC; vertical axis Listing, horizontal axis IncomePC]
$r_{\text{Income},\text{Listing}} = +0.591$
2: Descriptive Statistics52/54
Correlations
[Figure: three scatterplots: cost vs Defects, Noise vs Defects, and Noise vs MoreNoise. Panel labels, in the order extracted: r = +1.0, r = 0.0, r = +0.5]
2: Descriptive Statistics53/54
Sample Statistics and Population Parameters
Sample has a sample mean, $\bar{Y}$, and standard deviation, $s_Y$.
Population has a mean, μ, and standard deviation, σ.
The sample “looks like” the population.
The sample statistics resemble the population features.
The bigger is the RANDOM sample, the closer will be the resemblance. We will study this later in the course.
2: Descriptive Statistics54/54
Summary
Statistics to describe location (mean) and spread (standard deviation) of a sample of values.
  Interpretations
  Computations
  Complications
Statistics and graphical tools to describe bivariate (two variable) relationships
  Scatter plots
  Correlation