Some econometrics, result presentation and review of tests

Tron Anders Moger, 14.11.2007

Econometrics

• ”Econometrics is the field of economics that concerns itself with the application of mathematical statistics and the tools of statistical inference to the empirical measurement of relationships postulated by economic theory”

• Is the unification of
– economic statistics
– quantitative economic theory
– mathematical economics


About econometrics

• Variations and extensions of the regression model
– Time series models
– Correct for heteroscedasticity
– Correct for autocorrelation
– Panel data
– Non-linear regression models
– Multivariate regression
• Matrix computation (linear algebra) is an almost indispensable tool
• Time series data
• Introductory book in econometrics: Econometric Analysis, William H. Greene

Time series models

• A time series is a set of measurements, ordered over time
• Time series issues:
– Identifying trends, seasonality, cycles, and irregularity
– Predicting future values: Forecasting
• Autoregressive models:
– Explicit models for time dependencies (Box-Jenkins, ARMA models)

AR(1): $Y_t = \beta_0 + \gamma_1 Y_{t-1} + \varepsilon_t$

AR(2): $Y_t = \beta_0 + \gamma_1 Y_{t-1} + \gamma_2 Y_{t-2} + \varepsilon_t$

In the AR(1) model: $\mathrm{Corr}(Y_t, Y_{t-j}) = \gamma_1^{j}$
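The AR(1) correlation formula can be checked by simulation. The following is a minimal sketch, assuming standard normal errors and illustrative parameter values (beta0 = 1.0, gamma1 = 0.6) that are not taken from the slides; the function names are hypothetical.

```python
import numpy as np

def simulate_ar1(beta0, gamma1, n, rng):
    """Simulate Y_t = beta0 + gamma1 * Y_{t-1} + eps_t with standard normal eps_t."""
    y = np.empty(n)
    y[0] = beta0 / (1.0 - gamma1)  # start at the stationary mean
    eps = rng.standard_normal(n)
    for t in range(1, n):
        y[t] = beta0 + gamma1 * y[t - 1] + eps[t]
    return y

def sample_autocorr(y, lag):
    """Sample correlation between Y_t and Y_{t-lag}."""
    return np.corrcoef(y[lag:], y[:-lag])[0, 1]

rng = np.random.default_rng(0)
y = simulate_ar1(beta0=1.0, gamma1=0.6, n=20000, rng=rng)

# Compare the sample autocorrelation with gamma1**j for the first few lags.
for j in (1, 2, 3):
    print(j, round(sample_autocorr(y, j), 3), round(0.6 ** j, 3))
```

With a long simulated series, the two printed columns should be close to each other for each lag.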


The runs test (for random samples)

• First step for time-series data is to do a runs test
• In a random sample, the probability that an observation is above or below the median is independent of whether the previous observation is.
• A run is a (maximal) sequence of observations such that all are above the median, or all are below.
• For n observations, the number of runs has a known distribution under the assumption of no patterns in the data. With too few runs, the null hypothesis of no patterns can be rejected. (Table 14 in Newbold).
• H0: No pattern
• For large samples, a formula based on a normal approximation can be used.

Example: Index for volume of shares traded at New York Stock Exchange

• Data: Day 1: 98, D2: 93, D3: 82, D4: 103, D5: 113, D6: 111, D7: 104, D8: 103, D9: 114, D10: 107, D11: 111, D12: 109, D13: 109, D14: 108, D15: 128, D16: 92
• Median: 107.5
• Runs: ----++--+-+++++-
• n=16 and R=7, p-value=0.214 from table 14
• Test statistic (large samples): Reject H0 if

$\frac{R - n/2 - 1}{\sqrt{\frac{n^2 - 2n}{4(n-1)}}} < -z_{\alpha/2}$ or $> z_{\alpha/2}$
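The run count in the example above can be reproduced with a few lines of code. This is a minimal sketch using the share-volume data from the slide; the function name `runs_test` is hypothetical, and dropping observations equal to the median is an assumption (none occur in this data set). With n=16 the normal approximation is shown for illustration only.

```python
import math
import statistics

def runs_test(values):
    """Count runs above/below the median and compute the large-sample z statistic."""
    med = statistics.median(values)
    # Assumption: observations equal to the median are dropped (none occur here).
    signs = ["+" if v > med else "-" for v in values if v != med]
    # A new run starts each time the sign changes.
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    n = len(signs)
    z = (runs - n / 2 - 1) / math.sqrt((n * n - 2 * n) / (4 * (n - 1)))
    return med, "".join(signs), runs, z

volume = [98, 93, 82, 103, 113, 111, 104, 103, 114, 107, 111, 109, 109, 108, 128, 92]
med, signs, runs, z = runs_test(volume)
print(med, signs, runs, round(z, 3))
```

The median, sign pattern and number of runs agree with the values on the slide (107.5, ----++--+-+++++-, R=7).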


Dependence over time – Time series data

• Sometimes, y1, y2, …, yn are not completely independent observations (given the independent variables).
– Lagged values: yi may depend on yi-1 in addition to its independent variables
– Autocorrelated errors: The residuals εi are correlated
• Often relevant for time-series data

Time series data - lagged values

• In this case, we may run a multiple regression just as before, but including the previous dependent variable Yt-1 as a predictor variable for Yt.
• Use the model Yt=β0+β1X1+γYt-1+εt
• A 1-unit increase in x1 in the first time period yields an expected increase in y of β1, an increase of β1γ in the second period, β1γ² in the third period and so on
• Total expected increase over all future periods is β1(1+γ+γ²+…)=β1/(1-γ), provided |γ|<1


Example: Pension funds from textbook CD

• Want to use the market return for stocks (say, in million $) as a predictor for the percentage of pension fund portfolios at market value (y) at the end of the year
• Have data for 25 years, giving 24 observations (the first year has no lagged value)

Model Summary (b)

Model   R           R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       0.980 (a)   0.961      0.957               2.288                        1.008

a. Predictors: (Constant), lag, return
b. Dependent Variable: stocks

Coefficients (a)

Model 1      B       Std. Error   Beta    t        Sig.    95% CI for B, Lower Bound   Upper Bound
(Constant)   1.397   2.359                0.592    0.560   -3.509                      6.303
return       0.235   0.030        0.359   7.836    0.000   0.172                       0.297
lag          0.954   0.042        1.041   22.690   0.000   0.867                       1.042

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.
a. Dependent Variable: stocks

Get the model:

• Yt=1.397+0.235*stock return+0.954*Yt-1
• A one million $ increase in stock return one year yields a 0.24% increase in pension fund portfolios at market value
• For the next year: 0.235*0.954=0.22%
• And the third year: 0.235*0.954²=0.21%
• For all future: 0.235/(1-0.954)=5.1%
• What if you have a 2 million $ increase?
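The arithmetic above follows a geometric series, and can be sketched in code. This is a minimal illustration using the estimated coefficients from the slide; the function name `lagged_effects` and its `increase` parameter are hypothetical, and the second call shows one way to approach the 2 million $ question, since in this linear model the effects scale with the size of the increase.

```python
def lagged_effects(beta1, gamma, periods, increase=1.0):
    """Expected effect in each period of a one-off increase in x, plus the long-run total."""
    per_period = [increase * beta1 * gamma ** k for k in range(periods)]
    total = increase * beta1 / (1.0 - gamma)  # geometric series, requires |gamma| < 1
    return per_period, total

# One million $ increase: first three years and the total over all future years.
effects, total = lagged_effects(beta1=0.235, gamma=0.954, periods=3)
print([round(e, 3) for e in effects], round(total, 1))

# Two million $ increase: every effect is doubled.
effects2, total2 = lagged_effects(beta1=0.235, gamma=0.954, periods=3, increase=2.0)
print([round(e, 3) for e in effects2], round(total2, 1))
```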


Autocorrelations

• Recall: When for example the data is from a time series, the random errors for adjacent time steps might be correlated!
• Improvements in the model might reduce the problem
• Standard regression methods are not optimal
• Modelling and estimating the autoregression gives improved results

Autocorrelation – how to detect?

• Plotting residuals against time!
• The Durbin-Watson test compares the possibility of independent errors with a first-order autoregressive model: $\varepsilon_t = \rho \varepsilon_{t-1} + u_t$

Test statistic (option in SPSS):

$d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$

• Test depends on K (no. of independent variables), n (no. of observations) and sig. level α
• Test H0: ρ=0 vs H1: ρ>0
– Reject H0 if d<dL
– Accept H0 if d>dU
– Inconclusive if dL<d<dU


Example: Pension funds

Model Summary (b)

Model   R           R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       0.980 (a)   0.961      0.957               2.288                        1.008

a. Predictors: (Constant), lag, return
b. Dependent Variable: stocks

• Want to test ρ=0 on 5%-level
• Test statistic d=1.008
• Have one independent variable (K=1 in table 12 on p. 876) and n=24
• Find critical values of dL=1.27 and dU=1.45
• Reject H0, since d=1.008<dL=1.27
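The Durbin-Watson statistic and the decision rule can be sketched as follows. The function names are hypothetical, and the short residual vector is made-up data used only to show the calculation; the decision call uses d, dL and dU from the pension fund example.

```python
import numpy as np

def durbin_watson(residuals):
    """d = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

def dw_decision(d, d_lower, d_upper):
    """Decision rule for H0: rho = 0 against H1: rho > 0."""
    if d < d_lower:
        return "reject H0"
    if d > d_upper:
        return "accept H0"
    return "inconclusive"

# Hypothetical residuals that alternate in sign give a large d.
e = [1.0, -1.0, 1.0, -1.0]
print(durbin_watson(e))

# Pension fund example: d = 1.008, dL = 1.27, dU = 1.45.
print(dw_decision(1.008, 1.27, 1.45))
```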

Autocorrelation – what to do?

• It is possible to use a two-stage regression procedure:
– If a first-order auto-regressive model with parameter ρ is appropriate, the model

$y_t - \rho y_{t-1} = \beta_0 (1-\rho) + \beta_1 (x_{1t} - \rho x_{1,t-1}) + \dots + \beta_K (x_{Kt} - \rho x_{K,t-1}) + \varepsilon_t - \rho \varepsilon_{t-1}$

will have uncorrelated errors $\varepsilon_t - \rho \varepsilon_{t-1}$
• Estimate ρ from the Durbin-Watson statistic, and estimate the β coefficients from the model above
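The two-stage procedure can be sketched in code. The slides do not say how ρ is obtained from d; the sketch below assumes the common approximation d ≈ 2(1 − ρ), so that ρ is estimated by 1 − d/2. The function names and the small noise-free data set are hypothetical and serve only to show that the transformed regression recovers the original coefficients.

```python
import numpy as np

def rho_from_dw(d):
    """Assumed approximation: d is roughly 2 * (1 - rho), so rho is roughly 1 - d / 2."""
    return 1.0 - d / 2.0

def quasi_difference_fit(y, X, rho):
    """Regress y_t - rho*y_{t-1} on x_t - rho*x_{t-1}; return beta_0, ..., beta_K."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)
    y_star = y[1:] - rho * y[:-1]
    X_star = X[1:] - rho * X[:-1]
    design = np.column_stack([np.ones(len(y_star)), X_star])
    coef, *_ = np.linalg.lstsq(design, y_star, rcond=None)
    coef[0] /= (1.0 - rho)  # the transformed intercept equals beta_0 * (1 - rho)
    return coef

rho = rho_from_dw(1.008)  # d from the pension fund example
x = np.array([1.0, 2.0, 4.0, 3.0, 5.0, 7.0, 6.0, 8.0])  # hypothetical data
y = 2.0 + 3.0 * x  # noise-free, so the fit should return beta_0 = 2 and beta_1 = 3
beta = quasi_difference_fit(y, x, rho)
print(round(rho, 3), beta)
```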


Heteroscedasticity

• Recall: When the variances of independent errors in the model vary, the model is heteroscedastic.
• Example: In a regression model of house size against income, the variance of house sizes might increase with income
• In case of heteroscedasticity, ordinary regression models are not optimal.
• Previously, we mentioned variable transformation as a possible solution
• Much more advanced solutions exist, when the heteroscedasticity is known or can be estimated: Generalized least squares, …

Panel data

• Data collected for the same sample, at repeated time points

• Corresponds to longitudinal epidemiological studies

• A combination of cross-sectional data and time series data; collect the same cross-sectional data at repeated time points

• Increasingly popular study type


Analyzing panel data

• Model looks like linear regression, with a touch of ANOVA
• Fixed effects: Standard regression, but using a constant term differing for each group
– We get a parameter for each group!
– Model $y_{it} = \alpha_i + \beta^T x_{it} + \varepsilon_{it}$
• Random effects: A stochastic variable models variation connected to each individual
– The individual variation is assumed drawn from a distribution with fixed variance
– Model $y_{it} = \alpha + \beta^T x_{it} + u_i + \varepsilon_{it}$
– A generalization of least squares is needed for computations
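The fixed effects model can be sketched as an ordinary least squares fit with one indicator column per group. This is a minimal illustration, not a full panel data implementation; the function name and the tiny noise-free data set (two groups, one x variable) are hypothetical.

```python
import numpy as np

def fixed_effects_fit(y, x, group):
    """Least squares fit of y_it = alpha_i + beta^T x_it using one dummy column per group."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float).reshape(len(y), -1)
    labels, idx = np.unique(group, return_inverse=True)
    dummies = np.eye(len(labels))[idx]  # one indicator column per group
    design = np.column_stack([dummies, x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    alphas = dict(zip(labels.tolist(), coef[:len(labels)]))
    return alphas, coef[len(labels):]

# Hypothetical data generated with alpha_a = 1, alpha_b = 10 and beta = 2.
group = ["a", "a", "a", "b", "b", "b"]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 4.0]
y = [3.0, 5.0, 7.0, 12.0, 14.0, 18.0]
alphas, beta = fixed_effects_fit(y, x, group)
print(alphas, beta)
```

Each group gets its own constant term, while the slope is shared across groups.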

Non-linear regression models

• Ordinary regression is very useful, but it is limited by the linear form of the equations

• Sometimes, variable transformations can bring the connection between variables to a linear form

• Other times, this is not possible: The relationship describes the dependent variable as some function of independent variables and some random error.

• The model may still be estimated by minimizing the errors. This is non-linear regression.


Multivariate regression

• Instead of one dependent variable, one can have a vector of dependent variables

• A theory of multivariate multiple regression can be developed (with the help of matrix algebra): Many results are similar to those for ordinary multiple regression

• Captures the dependencies between dependent variables

Presentation of results

• Written presentation (paper):
• 1. Title
• 2. Summary
• 3. Introduction (Why did you do the study?)
• 4. Methods (What have you done?)
• 5. Results (What did you find?)
• 6. Discussion (What does this mean? Weaknesses of your study?)
• Oral presentation: Summary moved to the last slide, Goal of study mentioned after the title slide


Methods

• Goals, main hypotheses
• Describe what you have done in detail. Others should be able to repeat your study
• Design
– Observational study (control group, matching, retrospective, cross-sectional or prospective, time series: data collected at which time points)
– Experimental studies (Def. intervention regimes, blinding, matching)

Methods, cont’d.

• Inclusion and exclusion criteria: Number of observations, which ones are included, which ones are excluded, and why
• Which population do you generalize to?
• How did you collect your data? Questionnaire? Interview?
• Representativity?
• Validity?


Statistical analysis

• What methods did you use?
• In which situations did you use your different methods?
• Are some continuous variables categorized, or categories in categorical variables collapsed?
• Significance level, one-sided, two-sided?
• Statistical software

Results

• Descriptive table of your variables (both ”background” variables like gender, age etc. and main variables)
• Describe discrepancies from your original design, drop-outs, non-responders etc.
• Are assumptions of methods checked and are they fulfilled?
• Results from the analyses


Discussion

• Summary
• Interpretation of results
• Comparison to previous studies?
• Strengths/weaknesses
– Of your design?
– Of the study?
– Have you done many tests?
– Power
• Further work?

Results

• In the presentation there are three important quantities:
– Effect measure (mean, median, proportion, regression coefficient)
– Confidence interval/standard error of the effect measure!
– P-value


Numerical precision

• Data: One or no decimals is usually enough (46% women, mean weight 65.5 kg)

• P-values: 2 decimals common. Write the p-value, not p>0.05, p<0.05 or p=NS
• P-values less than 0.01: p<0.01
• P-values less than 0.001: p<0.001
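The p-value reporting rules above are easy to encode. The helper below is a minimal sketch; the function name `format_p` and the exact output strings are illustrative choices, not part of the slides.

```python
def format_p(p):
    """Format a p-value: two decimals, with p<0.01 and p<0.001 for small values."""
    if p < 0.001:
        return "p<0.001"
    if p < 0.01:
        return "p<0.01"
    return f"p={p:.2f}"

for value in (0.214, 0.004, 0.0002):
    print(format_p(value))
```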

Descriptive presentation of data:

• Normal data:

             Infant’s weight (g)   Mother’s age (years)   Mother’s weight (pounds)
N            189                   189                    189
Mean (SD*)   2944.7 (729.0)        23.2 (5.3)             129.8 (30.6)
Range        709-4990              14-45                  80-250

*SD=Standard deviation

• Skewed data:

             Infant’s weight (g)   Mother’s age (years)   Mother’s weight (pounds)
N            189                   189                    189
Median       2977.0                23.0                   121.0
Q1-Q3*       2412.0-3481.0         19.0-26.0              110.0-140.5

*Q1=1st quartile and Q3=3rd quartile


Presentation of analysis:

• Two sample t-test, smokers vs non-smokers:

              n     Birth weight (g)*         P-value
Non-smokers   115   3055.0 (2916.0, 3193.9)
Smokers       74    2773.1 (2620.3, 2926.2)   0.01

*Mean and 95% confidence interval

• Multiple linear regression:

Variable          Univariate effects   95% CI            P-value   Multivariate effects   95% CI            P-value
Smoking*          -281.7               (-492.7, -70.7)   0.01      -270.0                 (-478.3, -61.7)   0.01
Mother’s weight   4.43                 (1.05, 7.81)      0.01      4.24                   (0.91, 7.57)      0.01

*Smokers vs non-smokers

Presentation of figures

• Can show e.g. scatter plots for the relationship between dependent variable and independent variables.
• Can include fitted regression lines
• Error bar plot: Analyzed birth weight and ethnicity with one-way ANOVA (p=0.01). An illustration follows on the next slide (In SPSS: Graph-error bar-simple, Variable: BWT, Category axis: RAC)


[Figure: error bar plot. Y-axis: 95% CI birthweight (2400 to 3200). X-axis: race (white, black, other).]

Review of tests

• Below is a listing of most of the statistical tests encountered in Newbold.

• It gives a grouping of the tests by application area

• For details, consult the book or previous notes!


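One group of normally distributed observations

• Goal of test: Testing mean of normal distribution, variance known
– Test statistic: $\frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$
– Distribution: standard normal, $N(0,1)$
• Goal of test: Testing mean of normal distribution, variance unknown
– Test statistic: $\frac{\bar{X} - \mu_0}{s_x / \sqrt{n}}$
– Distribution: t-distribution, n-1 d.f., $t_{n-1}$
• Goal of test: Testing variance of normal population
– Test statistic: $\frac{(n-1) s_x^2}{\sigma_0^2}$
– Distribution: Chi-square, n-1 d.f., $\chi^2_{n-1}$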

Comparing two groups of observations: matched pairs

• Goal of test: Assuming normal distributions, unknown variance: Compare means
– Test statistic: $\frac{\bar{D} - D_0}{s_D / \sqrt{n}}$ (D1, …, Dn differences)
– Distribution: $t_{n-1}$
• Goal of test: Sign test: Compare only which observations are largest
– Test statistic: S = the number of pairs with positive difference
– Distribution: $\mathrm{Bin}(n, 0.5)$
– Large samples (n>20): $\frac{S^* - 0.5n}{0.5\sqrt{n}}$, with distribution $N(0,1)$
• Goal of test: Wilcoxon signed rank test: Compare ranks and signs of differences
– Test statistic: T=min(T+,T-); T+ / T- are sums of positive/negative ranks
– Distribution: Wilcoxon signed rank statistic


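Comparing two groups of observations: unmatched data

• Goal of test: Diff. between pop. means: Known variances
– Test statistic: $\frac{\bar{X} - \bar{Y} - D_0}{\sqrt{\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}}}$
– Distribution: standard normal, $N(0,1)$
• Goal of test: Diff. between pop. means: Unknown but equal variances
– Test statistic: $\frac{\bar{X} - \bar{Y} - D_0}{\sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}}$
– Distribution: $t_{n_x + n_y - 2}$
• Goal of test: Diff. between pop. means: Unknown and unequal variances
– Test statistic: $\frac{\bar{X} - \bar{Y} - D_0}{\sqrt{\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}}}$
– Distribution: $t_{\nu}$ (see book for d.f.)
• Goal of test: Testing equality of variances for two normal populations
– Test statistic: $s_x^2 / s_y^2$
– Distribution: $F_{n_x - 1,\, n_y - 1}$
• Goal of test: Assuming identical translated distributions: test equal means: Mann-Whitney U test
– Test statistic: Based on sum of ranks of obs. from one group; all obs. ranked together
– Distribution: standard normal (n>10), $N(0,1)$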

Comparing more than two groups of data

• Goal of test: One-way ANOVA: Testing if all groups are equal (norm.)
– Test statistic: $\frac{SSG/(K-1)}{SSW/(n-K)}$
– Distribution: $F_{K-1,\, n-K}$
• Goal of test: Kruskal-Wallis test: Testing if all groups are equal
– Test statistic: Based on sums of ranks for each group; all obs. ranked together
– Distribution: $\chi^2_{K-1}$
• Goal of test: Two-way ANOVA: Testing if all groups are equal, when you also have blocking
– Test statistic: $\frac{SSG/(K-1)}{SSE/((K-1)(H-1))}$
– Distribution: $F_{K-1,\, (K-1)(H-1)}$
• Goal of test: Two-way ANOVA with interaction: Testing if groups and blocking variable interact
– Test statistic: $\frac{SSI/((K-1)(H-1))}{SSE/(HK(L-1))}$
– Distribution: $F_{(K-1)(H-1),\, HK(L-1)}$


Studying population proportions

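• Goal of test: Test of population proportion in one group (large samples)
– Test statistic: $\frac{p - \pi_0}{\sqrt{\pi_0 (1 - \pi_0)/n}}$
– Distribution: standard normal, $N(0,1)$
• Goal of test: Comparing the population proportions in two groups (large samples)
– Test statistic: $\frac{p_x - p_y}{\sqrt{\frac{p_0 (1 - p_0)}{n_x} + \frac{p_0 (1 - p_0)}{n_y}}}$ (p0 common estimate)
– Distribution: standard normal, $N(0,1)$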
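The two-group statistic can be computed as in the sketch below. The slides only call p0 the "common estimate"; the code assumes this means the pooled proportion over both groups. The function name and the counts are hypothetical.

```python
import math

def two_proportion_z(successes_x, n_x, successes_y, n_y):
    """Large-sample z statistic for comparing two population proportions."""
    p_x = successes_x / n_x
    p_y = successes_y / n_y
    # Assumption: the common estimate p0 is the pooled proportion.
    p0 = (successes_x + successes_y) / (n_x + n_y)
    se = math.sqrt(p0 * (1 - p0) / n_x + p0 * (1 - p0) / n_y)
    return (p_x - p_y) / se

# Hypothetical counts: 60 of 100 in group x and 40 of 100 in group y.
z = two_proportion_z(60, 100, 40, 100)
print(round(z, 3))
```

The result is compared with the standard normal distribution.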

Regression tests

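• Goal of test: Test of regression slope: Is it $\beta^*$?
– Test statistic: $\frac{b_1 - \beta^*}{s_{b_1}}$
– Distribution: $t_{n-2}$
• Goal of test: Test on partial regression coefficient: Is it $\beta^*$?
– Test statistic: $\frac{b_j - \beta^*}{s_{b_j}}$
– Distribution: $t_{n-K-1}$
• Goal of test: Test on sets of partial regression coefficients: Can they all be set to zero (i.e., removed)?
– Test statistic: $\frac{(SSE(r) - SSE)/r}{s_e^2}$
– Distribution: $F_{r,\, n-K-r-1}$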


Model tests

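• Goal of test: Contingency table test: Test if there is an association between the two attributes in a contingency table
– Test statistic: $\sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
– Distribution: $\chi^2_{(r-1)(c-1)}$
• Goal of test: Goodness-of-fit test: Counts in K categories, compared to expected counts, under H0
– Test statistic: $\sum_{i=1}^{K} \frac{(O_i - E_i)^2}{E_i}$
– Distribution: $\chi^2_{K-1}$
• Tests for normality:
– Kolmogorov-Smirnov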
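The goodness-of-fit statistic is a direct sum over the K categories. The sketch below uses hypothetical counts and a hypothetical function name; it returns the statistic together with the K-1 degrees of freedom.

```python
def chi_square_gof(observed, expected):
    """Sum over categories of (O_i - E_i)^2 / E_i, with K - 1 degrees of freedom."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1

# Hypothetical counts in K = 3 categories, with equal expected counts under H0.
stat, df = chi_square_gof([10, 20, 30], [20, 20, 20])
print(stat, df)
```

The statistic is compared with the chi-square distribution with the returned degrees of freedom.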

Tests for correlation

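• Goal of test: Test for zero population correlation (normal distribution)
– Test statistic: $\frac{r \sqrt{n-2}}{\sqrt{1 - r^2}}$
– Distribution: $t_{n-2}$
• Goal of test: Test for zero correlation (nonparametric): Spearman rank correlation
– Test statistic: Compute ranks of x-values, and of y-values, and compute correlation of these ranks
– Distribution: Special distribution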


Tests for autocorrelation

• Goal of test: The Durbin-Watson test (based on normal assumption), testing for autocorrelation in regression data
– Test statistic: $\frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$
– Distribution: Special distribution
• Goal of test: The runs test (nonparametric), testing for randomness in time
– Test statistic: Counting the number of ”runs” above and below the median in the time series
– Distribution: Special distribution, or standard normal, $N(0,1)$, for large samples

From problem to choice of method

• Example: You have the grades of a class of students from this year's statistics course, and from last year's statistics course. How to analyze?

• You have measured the blood pressure, working habits, eating habits, and exercise level for 200 middle-aged men. How to analyze?


From problem to choice of method

• Example: You have asked 100 married women how long they have been married, and how happy they are (on a specific scale) with their marriage. How to analyze?

• Example: You have data for how satisfied (on some scale) 50 patients are with their primary health care, from each of 5 regions of Norway. How to analyze?
