Statistical Inference and Regression Analysis: GB.3302.30
Professor William Greene
Stern School of Business
IOMS Department
Department of Economics


Statistics and Data Analysis
Part 6 – Regression Model-1 Conditional Mean

U.S. Gasoline Price

[Figure: U.S. gasoline price charts, 6 months and 5 years]

Impact of Change in Gasoline Price on Consumer Demand?
  • Demand for gasoline
  • Long term vs. short term
  • Income
  • Elasticity concepts
  • Demand for food


Movie Success vs. Movie Online Buzz Before Release (2009)

Internet Buzz and Movie Success

Box office sales vs. "Can't wait" votes 3 weeks before release

Is There Really a Relationship?

BoxOffice is obviously not equal to f(Buzz) for some function f. But the two do appear to be related, perhaps statistically, that is, stochastically. There is a covariance, and the linear regression summarizes it.

A predictor would be Box Office = a + b Buzz. Is b really > 0? What would be implied by b > 0?

Covariation: Education and Life Expectancy

Causality? Covariation? Does more education make people live longer? Is there a hidden driver of both? (Per capita GDP?)

Using Regression to Predict

Predictor: Overseas box office = a + b Domestic box office. The prediction will not be perfect. We construct a range of uncertainty.

The equation would not predict Titanic.

Conditional Variation and Regression

Conditional distribution of a pair of random variables

f(y|x) or P(y|x)

Mean function, E[y|x] = Regression of y on x.

y|x ~ Normal[20 + 3x, 4²], x = 1, 2, 3, 4; Poisson

[Figure: conditional distributions of y for x = 1, 2, 3, 4]

Expected Income Depends on Household Size
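The idea that the regression is the conditional mean can be illustrated by simulation. The sketch below assumes the example on the slide means that y given x is Normal with mean 20 + 3x and standard deviation 4 (reading the second argument as a variance of 4²). The function name, number of draws and seed are illustrative choices, not part of the slides.

```python
import random
from statistics import mean


def simulate_conditional_means(n_per_x=20000, seed=0):
    """Average simulated draws of y at each x to approximate E[y|x]."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = {}
    for x in (1, 2, 3, 4):
        # Assumed model: y | x ~ Normal(mean = 20 + 3x, std. dev. = 4)
        draws = [rng.gauss(20 + 3 * x, 4) for _ in range(n_per_x)]
        means[x] = mean(draws)
    return means


cond_means = simulate_conditional_means()
for x, m in cond_means.items():
    print(f"x = {x}: simulated mean = {m:.2f}, 20 + 3x = {20 + 3 * x}")
```

Each simulated average should land close to 20 + 3x, which is what "regression of y on x" refers to here.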

Average Box Office by Internet Buzz Index = Average Box Office for Buzz in Interval

Linear Regression?

Fuel Bills vs. Number of Rooms

Independent vs. Dependent Variables
  • Y in the model
      Dependent variable
      Response variable
  • X in the model
      Independent variable: meaning of "independent"
      Regressor
      Covariate
  • Conditional vs. joint distribution

Linearity and Functional Form
  • y = g(x)
  • h(y) = α + βf(x)
  • y = α + βx
  • y = exp(α + βx); log y = α + βx
  • y = α + β(1/x) = α + βf(x)
  • y = e^α x^β, log y = α + β log x. Etc.
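As an illustration of the last form, the sketch below generates hypothetical noise-free data from y = e^α x^β and recovers α and β by fitting a straight line to log y against log x. The data values and the helper `fit_line` are assumptions made for the example. The helper uses the usual least squares slope and intercept formulas.

```python
import math


def fit_line(x, y):
    """Least squares intercept and slope for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical noise-free data generated from y = e^alpha * x^beta
alpha, beta = 1.0, 0.5
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [math.exp(alpha) * xi ** beta for xi in x]

# Taking logs of both sides gives log y = alpha + beta * log x,
# which is linear in log x and can be fitted as a straight line.
a, b = fit_line([math.log(xi) for xi in x], [math.log(yi) for yi in y])
print(round(a, 6), round(b, 6))
```

The point is that "linear" refers to the form after transformation: the relationship between y and x is curved, but the one between log y and log x is a line.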

Inference and Regression
Least Squares

Fitting a Line to a Set of Points

Choose α and β to minimize the sum of squared residuals: Gauss's method of least squares.

[Figure: scatter of points (X_i, Y_i) with fitted line; labels: residuals, predictions a + bx_i]

Least Squares Regression

Least Squares Algebra

Least Squares

Normal Equations

Computing the Least Squares Parameters a and b

(We will use s_y² later.)
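A minimal sketch of the computation, assuming the standard closed-form solution b = S_xy / S_xx and a = ȳ - b x̄, followed by a check that the residuals satisfy the two normal equations (they sum to zero and are uncorrelated with x). The five data points and the function name `least_squares` are hypothetical.

```python
def least_squares(x, y):
    """Return (a, b) minimizing the sum of squared residuals of y - a - b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept: the line passes through the means
    return a, b


# Hypothetical data, chosen only to keep the arithmetic easy to follow
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Normal equations: sum of e_i = 0 and sum of x_i * e_i = 0
sum_e = sum(residuals)
sum_xe = sum(xi * ei for xi, ei in zip(x, residuals))
print(a, b, sum_e, sum_xe)
```

For these points the fitted line has intercept 2.2 and slope 0.6, and both sums are zero up to floating-point rounding.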

Least Absolute Deviations

Least Squares vs. LAD

Inference and Regression
Regression Model

b Measures Covariation

Predictor: Box Office = a + b Buzz.

Interpreting the Function

a = the life expectancy associated with 0 years of education. No country has 0 average years of education. The regression only applies in the range of experience.

b = the increase in life expectancy associated with each additional year of average education.

[Figure labels: a, b, and the range of experience (education)]

Covariation and Causality

Does more education make you live longer (on average)?

Causality?

Height (inches) and Income ($/mo.) in first post-MBA job (men). WSJ, 12/30/86.

Ht.  Inc.     Ht.  Inc.     Ht.  Inc.
70   2990     68   2910     75   3150
67   2870     66   2840     68   2860
69   2950     71   3180     69   2930
70   3140     68   3020     76   3210
65   2790     73   3220     71   3180
73   3230     73   3370     66   2670
64   2880     70   3180     69   3050
70   3140     71   3340     65   2750
69   3000     69   2970     67   2960
73   3170     73   3240     70   3050

Estimated Income = -451 + 50.2 Height
Correlation = 0.84 (!)
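The estimated equation and the correlation can be checked directly from the 30 height and income pairs listed on the slide. The sketch below is one way to do it with plain Python. The variable names are illustrative, and the formulas are the usual least squares and sample correlation formulas.

```python
import math

# The 30 (height, income) pairs from the slide, read in order
heights = [70, 68, 75, 67, 66, 68, 69, 71, 69, 70, 68, 76, 65, 73, 71,
           73, 73, 66, 64, 70, 69, 70, 71, 65, 69, 69, 67, 73, 73, 70]
incomes = [2990, 2910, 3150, 2870, 2840, 2860, 2950, 3180, 2930, 3140,
           3020, 3210, 2790, 3220, 3180, 3230, 3370, 2670, 2880, 3180,
           3050, 3140, 3340, 2750, 3000, 2970, 2960, 3170, 3240, 3050]

n = len(heights)
h_bar = sum(heights) / n
i_bar = sum(incomes) / n

# Sums of squares and cross products in deviations from the means
s_hh = sum((h - h_bar) ** 2 for h in heights)
s_ii = sum((i - i_bar) ** 2 for i in incomes)
s_hi = sum((h - h_bar) * (i - i_bar) for h, i in zip(heights, incomes))

b = s_hi / s_hh                    # slope: income per inch of height
a = i_bar - b * h_bar              # intercept
r = s_hi / math.sqrt(s_hh * s_ii)  # sample correlation

print(f"Estimated Income = {a:.0f} + {b:.1f} Height, correlation = {r:.2f}")
```

With the pairs as transcribed, the slope, intercept and correlation round to the values reported on the slide. The calculation says nothing about whether height causes income, which is the question the slide raises.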

Inference and Regression
Analysis of Variance

Regression Fits

[Figures: regression of salary vs. years of experience; regression of fuel bill vs. number of rooms for a sample of homes]

Regression Arithmetic

Variance Decomposition

Fit of the Equation to the Data

Regression vs. Residual SS

Analysis of Variance Table

Source       Degrees of Freedom   Sum of Squares   Mean Square   F Ratio   P Value
Regression   1                                                             P[z>F]*
Residual     N-2
Total        N-1

Explained Variation
  • The proportion of variation explained by the regression is called R-squared (R²).
  • It is also called the Coefficient of Determination.
  • (It is the square of something to be shown later.)

ANOVA Table

Source       Degrees of Freedom   Sum of Squares   Mean Square   F Ratio   P Value
Regression   1                                                             P[z>F]*
Residual     N-2
Total        N-1
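The entries of such a table can be sketched numerically. The example below assumes the standard decomposition for a regression with one regressor: total sum of squares = regression sum of squares + residual sum of squares, with 1 and N-2 degrees of freedom, F equal to the ratio of the two mean squares, and R² equal to the regression share of the total. The data and the function name are hypothetical.

```python
def anova_simple_regression(x, y):
    """Sums of squares, mean squares, F ratio and R-squared for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    fitted = [a + b * xi for xi in x]

    total_ss = sum((yi - y_bar) ** 2 for yi in y)
    regression_ss = sum((fi - y_bar) ** 2 for fi in fitted)
    residual_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

    ms_regression = regression_ss / 1      # 1 degree of freedom
    ms_residual = residual_ss / (n - 2)    # N-2 degrees of freedom
    return {
        "total_ss": total_ss,
        "regression_ss": regression_ss,
        "residual_ss": residual_ss,
        "f_ratio": ms_regression / ms_residual,
        "r_squared": regression_ss / total_ss,
    }


# Hypothetical data
table = anova_simple_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in table.items():
    print(f"{name}: {value:.3f}")
```

For these five points the total variation of 6 splits into 3.6 explained by the regression and 2.4 left in the residuals, so R² is 0.6.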

Movie Madness Fit

Regression Fits

[Figure: four fitted regressions with R² = 0.522, R² = 0.360, R² = 0.880, R² = 0.424]

R Squared Benchmarks
  • Aggregate time series: expect .9+
  • Cross sections: .5 is good. Sometimes we do much better.
  • Large survey data sets: .2 is not bad.

R² = 0.924 in this cross section.

Correlation Coefficient

Correlations

[Figure: four scatter plots with r_xy = 0.6, r_xy = 0.723, r_xy = -.402, r_xy = +1.000]

R-Squared is r_xy²
  • R-squared is the square of the correlation between y_i and the predicted y_i, which is a + bx_i.
  • The correlation between y_i and (a + bx_i) is the same as the correlation between y_i and x_i.
  • Therefore, R² = r_xy².
  • A regression with a high R² predicts y_i well.
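The equality can be checked numerically. The sketch below uses hypothetical data with a positive slope. When b is negative, the correlation between y_i and the fitted values equals the absolute value of r_xy, so the squares still agree. The helper name `correlation` is illustrative.

```python
import math


def correlation(u, v):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    s_uv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    s_vv = sum((vi - v_bar) ** 2 for vi in v)
    return s_uv / math.sqrt(s_uu * s_vv)


# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Least squares fit of y on x
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar
fitted = [a + b * xi for xi in x]

r_xy = correlation(x, y)
r_y_fitted = correlation(y, fitted)

# R-squared computed from the sums of squares, not from the correlation
residual_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
total_ss = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - residual_ss / total_ss

print(r_xy, r_y_fitted, r_xy ** 2, r_squared)
```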

Squared Correlations

[Figure: four scatter plots with r_xy² = 0.36, r_xy² = 0.522, r_xy² = .161, r_xy² = .924]

Movie Madness
  • Estimated equation
  • Estimated coefficients a and b
  • S = s_e = estimated std. deviation of ε
  • Square of the sample correlation between x and y
  • Sum of squared residuals, Σ_i e_i²
  • N-2 = degrees of freedom
  • S² = s_e²

Software

http://apps.stern.nyu.edu
http://estore.onthehub.com

https://apps.stern.nyu.edu

MONET.MPJ

Use File:Open Worksheet to open an Excel .xls or .xlsx file

Stat > Basic Statistics > Display Descriptive Statistics

Stat > Regression > Regression

Results to Report

Linear Regression
Sample Regression Line

http://people.stern.nyu.edu/wgreene/MathStat/IRAnlogit5setup.exe


http://people.stern.nyu.edu/wgreene


Project > Import > Variables imports .csv

Command Typed in Editing Window

Cursor in desired line of text (or highlight more than one line)
Press GO button

Typing Commands in the Editor

Important Commands:

  SAMPLE ; first - last $
  Sample ; 1 - 1000 $
  Sample ; All $

  CREATE ; Variable = transformation $
  Create ; LogMilk = Log(Milk) $
  Create ; LMC = .5*Log(Milk)*Log(Cows) $
  Create ; any algebraic transformation $

Name Conventions

  CREATE ; name = any function desired $

  • Name is the name of a new variable
  • No more than 8 characters in a name
  • The first character must be a letter
  • May not contain -, +, *, /. May contain _.

Model Command

  Model ; Lhs = dependent variable ; Rhs = list of independent variables $
  Regress ; Lhs = Milk ; Rhs = ONE,Feed,Labor,Land $

ONE requests the constant term.

The Go Button

Submitting Commands
  • One command
      Place cursor on that line
      Press Go button
  • More than one command
      Highlight all lines (like any text editor)
      Press Go button

Compute a Regression

  Sample ; All $
  Regress ; Lhs = YIT ; Rhs = One,X1,X2,X3,X4 $

One is the constant term in the model.

Standard Three Window Operation
  • Project window shows variables
  • Results appear in output window
  • Commands typed in editing window

Inference and Regression
Regression Model

The Linear Regression Statistical Model
  • The linear regression model
  • Sample statistics and population quantities
  • Specifying the regression model

A Linear Regression

Predictor: Box Office = -14.36 + 72.72 Buzz

Data and Relationship

We suggested the relationship between box office and internet buzz is
Box Office = -14.36 + 72.72 Buzz

Note the obvious inconsistency in the figure. This is not the relationship. How do we reconcile the equation with the data?

Modeling the Underlying Process

A model that explains the process that produces the data that we observe:
  • Observed outcome = the sum of two parts
      (1) Explained: the regression line
      (2) Unexplained (noise): the remainder
  • Regression model
      The model is the statement that part (1) is the same process from one observation to the next.

The Population Regression

THE model: a specific statement about the parts of the model
  (1) Explained: Explained Box Office = α + β Buzz
  (2) Unexplained: The rest is noise, ε. Random ε has certain characteristics.

Model statement:
Box Office = α + β Buzz + ε

The Data Include the Noise

What Explains the Noise?

Assumptions

(Regression) The equation linking Box Office and Buzz is stable:

E[Box Office | Buzz] = α + β Buzz

Another sample of movies, say 2012, would obey the same fundamental relationship.

Model Assumptions

y_i = α + βx_i + ε_i
  • α + βx_i is the regression function
      Contains the information about y_i in x_i
      Unobserved because α and β are not known for certain
  • ε_i is the disturbance. It is the unobserved random component.
  • Observed y_i is the sum of two unobserved parts.

Model Assumptions About ε_i
  • Random variable
  • Mean zero. The regression is the mean of y_i. ε_i is the deviation from the regression.
  • Variance σ².
  • Noise: ε_i is unrelated to any values of x_i (no covariance); it's random noise.
  • ε_i is unrelated to any other observations on ε_j (not autocorrelated).

Sample Estimate vs. Population

Application: Health Care Data

German Health Care Usage Data. There are altogether 27,326 observations on German households, 1984-1994.

DOCTOR   = 1(Number of doctor visits > 0)
HOSPITAL = 1(Number of hospital visits > 0)
HSAT     = health satisfaction, coded 0 (low) - 10 (high)
DOCVIS   = number of doctor visits in last three months
HOSPVIS  = number of hospital visits in last calendar year
PUBLIC   = insured in public health insurance = 1; otherwise = 0
ADDON    = insured by add-on insurance = 1; otherwise = 0
INCOME   = household nominal monthly net income in German marks / 10000
HHKIDS   = children under age 16 in the household = 1; otherwise = 0
EDUC     = years of schooling
AGE      = age in years
MARRIED  = marital status

Sample vs. Population

For the full population of 27,326:
Income = .12609 + .01996 * Educ + ε

For a random sample of 52 households, least squares regression produces:
Income = .06856 + .02079 * Educ + e

Sample vs. Population

Disturbances vs. Residuals

ε = y - α - β Buzz
e = y - a - b Buzz

Standard Deviation of Residuals
  • Standard deviation of ε_i = y_i - α - βx_i is σ = √E[ε_i²] (the mean of ε_i is zero).
  • Sample a and b estimate α and β.
  • Residual e_i = y_i - a - bx_i estimates ε_i.
  • Use √((1/N) Σ_i e_i²) to estimate σ? Close, not quite. The estimator is s_e = √(Σ_i e_i² / (N-2)).

Why N-2? Relates to the fact that two parameters (α, β) were estimated. Proof to come later.
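Ahead of the proof, a small simulation can make the N-2 divisor plausible. The sketch below assumes a hypothetical population regression y = 1 + 0.5x + ε with σ = 2 and N = 5 observations per sample. It averages the sum of squared residuals over many samples and compares dividing by N with dividing by N-2. All parameter values and function names are illustrative.

```python
import random


def residual_sum_of_squares(x, y):
    """Sum of squared least squares residuals from fitting y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))


def compare_divisors(n=5, sigma=2.0, reps=20000, seed=1):
    """Average estimate of sigma^2 using divisor N versus divisor N-2."""
    rng = random.Random(seed)
    x = list(range(1, n + 1))
    total = 0.0
    for _ in range(reps):
        # Assumed population model: alpha = 1, beta = 0.5, Normal disturbances
        y = [1.0 + 0.5 * xi + rng.gauss(0, sigma) for xi in x]
        total += residual_sum_of_squares(x, y)
    mean_sse = total / reps
    return mean_sse / n, mean_sse / (n - 2)


est_div_n, est_div_n_minus_2 = compare_divisors()
print(f"divide by N:   {est_div_n:.2f}")
print(f"divide by N-2: {est_div_n_minus_2:.2f}  (true sigma^2 = 4)")
```

Dividing by N understates σ² on average, while dividing by N-2 centers the estimate on the true value. The gap is largest when N is small.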

Residuals

Samples and Populations

Population (Theory)
  • y_i = α + βx_i + ε_i
  • Parameters α, β
  • Regression α + βx_i: mean of y_i | x_i
  • Disturbance, ε_i
      Mean 0
      Standard deviation σ
      No correlation with x_i

Sample (Observed)
  • y_i = a + bx_i + e_i
  • Estimates, a, b
  • Fitted regression a + bx_i: predicted y_i | x_i
  • Residuals, e_i
      Sample mean 0, sample std. dev. s_e
      Sample Cov[x,e] = 0

Linear Regression
Sample Regression Line

A Cost Model

Electricity.mpj
  • Total cost in $Million
  • Output in Million KWH
  • N = 123 American electric utilities
  • Model: Cost = α + β KWH + ε

Cost Relationship

Sample Regression

Interpreting the Model

Cost = 2.44 + 0.00529 Output + e
Cost is $Million, Output is Million KWH.

Fixed Cost = Cost when output = 0
Fixed Cost = $2.44 Million

Marginal cost = Change in cost / change in output
  = .00529 * $Million / Million KWH
  = .00529 $/KWH = 0.529 cents/KWH.
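The unit arithmetic can be written out as a short sketch using the fitted coefficients from the slide. The function name and the output levels used to take the difference are illustrative.

```python
def predicted_cost(output_million_kwh, a=2.44, b=0.00529):
    """Fitted cost in $ million for output measured in million KWH."""
    return a + b * output_million_kwh


# Fixed cost: fitted cost when output is zero, in $ million
fixed_cost = predicted_cost(0)

# Marginal cost: change in fitted cost for one more million KWH, in $ million
marginal_cost = predicted_cost(1001) - predicted_cost(1000)

# $ million per million KWH is the same as $ per KWH; times 100 gives cents
marginal_cost_cents_per_kwh = marginal_cost * 100

print(f"fixed cost = ${fixed_cost:.2f} million")
print(f"marginal cost = {marginal_cost_cents_per_kwh:.3f} cents/KWH")
```

Because the fitted line is straight, the marginal cost is the same at every output level, so the choice of 1000 and 1001 does not matter.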
