Upload
others
View
30
Download
0
Embed Size (px)
Citation preview
1
Some econometrics, resultpresentation and review of tests
Tron Anders Moger14.11.2007
Econometrics
• ”Econometrics is the field of economics thatconcerns itself with the application ofmathematical statistics and the tools of statisticalinference to the empirical measurement ofrelationships postulated by economic theory”
• Is the unification of– economic statistics– quantitative economic theory– mathematical economics
2
About econometrics• Variations and extensions of the regression model
– Time series models– Correct for heteroscedasticity– Correct for autocorrelation– Panel data– Non-linear regression models– Multivariate regression
• Matrix computations (linear algebra) is almostindispensable tool
• Time series data • Introductory book in econometrics: Econometric
analysis, Willian H. Greene
Time series models• A time series is a set of measurements, ordered
over time• Time series issues:
– Identifying trends, seasonality, cycles, and irregularity– Predicting future values: Forecasting
• Autoregressive models:– Explicit models for time dependencies:
• (Box-Jenkins, ARMA models)
AR(1)
AR(2)ttt YY εγβ ++= −110
tttt YYY εγγβ +++= −− 22110
jjtt YYCorr 1),( γ=−
3
The runs test (for random samples)• First step for time-series data is to do a runs test• In a random sample, the probability that an observation is
above or below the median is independent of whether theprevious observation is.
• A run is a (maximal) sequence of observations such that all are above the median, or all are below.
• For n observations, the number of runs has a knowndistribution under the assumption of no patterns in thedata. With too few runs, the null hypothesis of no patternscan be rejected. (Table 14 in Newbold).
• H0: No pattern• For large samples, a formula based on a normal
approximation can be used.
Example: Index for volume of sharestraded at New York Stock Exchange• Data: Day 1: 98, D2: 93, D3: 82, D4: 103,
D5: 113, D6: 111, D7: 104, D8: 103, D9: 114, D10: 107, D11: 111, D12: 109, D13: 109, D14: 108, D15: 128, D16: 92
• Median: 107.5• Runs: ----++--+-+++++-• n=16 and R=7, p-value=0.214 from table 14• Test statistic (large samples):Reject H0 if / 2 / 22
/ 2 1 or 2
4( 1)
R n z zn n
n
α α− − < − >
−−
4
Dependence over time – Time series data
• Sometimes, y1, y2, …, yn are not completelyindependent observations (given theindependent variables). – Lagged values: yi may depend on yi-1 in
addition to its independent variables– Autocorrelated errors: The residuals εi are
correlated• Often relevant for time-series data
Time series data - lagged values• In this case, we may run a multiple regression just as
before, but including the previous dependent variable Yi-1 as a predictor variable for Yi.
• Use the model Yt=β0+β1X1+γYt-1+εt
• A 1-unit increase in x1 in first time period yields an expected increase in y of β1, an increase β1γ in thesecond period, β1γ2 in the third period and so on
• Total expected increase in all future is β1/(1-γ)
5
Example: Pension funds from textbook CD
• Want to use the market return for stocks (say, in millon $) as a predictor for the percentage ofpension fund portifolios at market value (y) at theend of the year
• Have data for 25 yrs->24 observationsModel Summaryb
,980a ,961 ,957 2,288 1,008Model1
R R SquareAdjustedR Square
Std. Error ofthe Estimate
Durbin-Watson
Predictors: (Constant), lag, returna.
Dependent Variable: stocksb. Coefficientsa
1,397 2,359 ,592 ,560 -3,509 6,303,235 ,030 ,359 7,836 ,000 ,172 ,297,954 ,042 1,041 22,690 ,000 ,867 1,042
(Constant)returnlag
Model1
B Std. Error
UnstandardizedCoefficients
Beta
StandardizedCoefficients
t Sig. Lower Bound Upper Bound95% Confidence Interval for B
Dependent Variable: stocksa.
Get the model:• Yt=1.397+0.235*stock return+0.954*yt-1
• A one million $ increase in stock return oneyear yields a 0.24% increase in pensionfund portifolios at market value
• For the next year: 0.235*0.954=0.22%• And the third year: 0.235*0.9542=0.21%• For all future: 0.235/(1-0.954)=5.1%• What if you have a 2 million $ increase?
6
Autocorrelations
• Recall: When for example the data is from a time series, the random errors for adjacent time stepsmight be correlated!
• Improvements in model might reduce problem• Standard regression methods are not optimal• Modelling and estimating the autoregression gives
improved results
Autocorrelation – how to detect? • Plotting residuals against time!
• The Durbin-Watson test compares thepossibility of independent errors with a first-order autoregressive model: 1t t tuε ρε −= +
21
2
2
1
( )n
t tt
n
tt
e ed
e
−=
=
−=∑
∑Test statistic:
Option in SPSS
Test depends on K (no. of independent variables), n (no. observations) andsig.level αTest H0: ρ=0 vs H1: ρ=0Reject H0 if d<dLAccept H0 if d>dUInconclusive if dL<d<dU
7
Example: Pension fundsModel Summaryb
,980a ,961 ,957 2,288 1,008Model1
R R SquareAdjustedR Square
Std. Error ofthe Estimate
Durbin-Watson
Predictors: (Constant), lag, returna.
Dependent Variable: stocksb.
•Want to test ρ=0 on 5%-level•Test statistic d=1.008•Have one independent variable (K=1 in table 12 on p. 876) and n=24•Find critical values of dL=1.27 and dU=1.45•Reject H0
Autocorrelation – what to do? • It is possible to use a two-stage regression
procedure: – If a first-order auto-regressive model with
parameter is appropriate, the model
will have uncorrelated errors• Estimate from the Durbin-Watson
statistic, and estimate from the model above
ρ1 0 1 1 1, 1 , 1 1(1 ) ( ) ... ( )t t t t K Kt K t t ty y x x x xρ β ρ β ρ β ρ ε ρε− − − −− = − + − + + − + −
1t tε ρε −−
ρ
8
Heteroscedasticity• Recall: When the variances of independent errors in the
model vary, the model is heteroscedastic. • Example: In a regression model of house size against
income, the variance of house sizes might increase withincome
• In case of heteroscedasticity, ordinary regression modelsare not optimal.
• Previously, we mentioned variable transformation as a possible solution
• Much more advanced solutions exist, when theheteroscedasticity is known or can be estimated: Generalized least squares,…
Panel data
• Data collected for the same sample, at repeated time points
• Corresponds to longitudinal epidemiological studies
• A combination of cross-sectional data and time series data; collect the same cross-sectional data at repeated time points
• Increasingly popular study type
9
Analyzing panel data• Model looks like linear regression, with a touch of
ANOVA• Fixed effects: Standard regression, but using a
constant term differing for each group– We get a parameter for each group!– Model yit=αi+βTxit+εit
• Random effects: A stochastic variable modelsvariation connected to each individual– The individual variation is assumed drawn from a
distribution with fixed variance– Model yit=α+βTxit+ui+εit– A generalization of least squares is needed for
computations
Non-linear regression models
• Ordinary regression is very useful, but it is limited by the linear form of the equations
• Sometimes, variable transformations can bring theconnection between variables to a linear form
• Other times, this is not possible: The relationshipdescribes the dependent variable as some functionof independent variables and some random error.
• The model may still be estimated by minimizingthe errors. This is non-linear regression.
10
Multivariate regression
• Instead of one dependent variable, one canhave a vector of dependent variables
• A theory of multivariate multiple regressioncan be developed (with the help of matrixalgebra): Many similar results to ordinarymultiple regressions
• Captures the dependencies betweendependent variables
Presentation of results
• Written presentation (paper):• 1. Title• 2. Summary• 3. Introduction (Why did you do the study?)• 4. Methods (What have you done?)• 5. Results (What did you find?)• 6. Discussion (What does this mean? Weaknesses of
your study?)• Oral presentation: Summary moved to the last slide,
Goal of study mentioned after the title slide
11
Methods• Goals, main hypotheses• Describe what you have done in detail.
Others should be able to repeat your study• Design
– Observational study (control group, matching, retrospective, cross-sectional or prospective, time series: data collected at which time points)
– Experimental studies (Def. interventionregimes, blinding, matching)
Methods, cont’d.
• Inclusion- and exclusion criteria: Number ofobservations, which ones are included, which ones are excluded, and why
• Which population do you generalize to?• How did you collect your data?
Questionnaire? Interview?• Representativity?• Validity?
12
Statistical analysis• What methods did you use?• In which situations did you use your
different methods?• Are some continuous variables categorized,
or categories in categorical variables collapsed?
• Significance level, one-sided, two-sided?• Statistical software
Results• Descriptive table of your variables (both
”background” variables like gender, age etcand main variables)
• Describe discrepancies from your original design, drop-outs, non-responders etc
• Are assumptions of methods checked and arethey fulfilled?
• Results from the analyses
13
Discussion• Summary• Interpretation of results• Comparison to previous studies?• Strengths/weakneses
– Of your design?– Of the study?– Have you done many tests?– Power
• Further work?
Results
• In the presentation there are three importantquantities:– Effect measure (mean, median, proportion,
regression coefficient– Confidence interval/standard error of the effect
measure!– P-value
14
Numerical precision
• Data: Usually enough with one or nodecimals (46% women, mean weight 65.5 kg)
• P-values: 2 decimals common. Write the p-value, not p>0.05, p<0.05 or p=NS
• P-values less than 0.01: p<0.01• P-values less than 0.001: p<0.001
Descriptive presentation of data:• Normal data:
• Skewed data:
80-25014-45709-4990Range129.8 (30.6)23.2 (5.3)2944.7 (729.0)Mena (SD*)189189189NMother’s weight (pounds)Mother’s age (years)Infant’s weight (g)
*SD=Standard deviation
110.0-140.519.0-26.02412.0-3481.0Q1-Q3*121.023.02977.0Median189189189NMother’s weight (pounds)Mother’s age (years)Infant’s weight (g)
*Q1=1. quartile and Q2=3. quartile
15
Presenation of analysis:
• Two sample t-test, smokers vs non-smokers:
• Multiple linear regression:
0.012773.1 (2620.3,2926.2)74Smokers3055.0 (2916.0,3193.9)115Non-smokers
P-valueBirth weight (g)*n
*Mean and 95% confidence interval
Smoking*
Mother’sweight
Variable
-281,7
4.43
Univariateeffects
(-492.7,-70.7)
(1.05,7.81)
95% CI
0.01(-478.3,-61.7)-270.00.01
0.01(0.91,7.57)4.240.01
P-value95% CIMultivariateeffects
P-value
*Smokers vs non-smokers
Presentation of figures
• Can show e.g. scatter plots for the relationship betweendependent variable and independent variables.
• Can include fitted regression lines• Error bar plot: Analyzed birth weight and ethnicity
with one-way ANOVA (p=0.01). An illustrationfollows on the next slide (In SPSS: Graph-error bar-simple, Variable: BWT, Category axis: RAC)
16
white black other
race
2400,00
2600,00
2800,00
3000,00
3200,00
95%
CI b
irthw
eigh
t
Review of tests
• Below is a listing of most of the statisticaltests encountered in Newbold.
• It gives a grouping of the tests by application area
• For details, consult the book or previousnotes!
17
One group of normally distributedobservations
Distribution: Test statistic: Goal of test:
Chi-kvadrat, n-1 d.f.Testing variance ofnormal population
t-distribution, n-1 d.f.:Testing mean ofnormal distribution, variance unknown
standard normal: Testing mean ofnormal distribution,
variance known
0
/X
nµ
σ−
(0,1)N
0
/x
Xs n
µ−
2
20
( 1) xn sσ−
21nχ −
1nt −
Comparing two groups ofobservations: matched pairs
Wilcoxon signed rank statistic
T=min(T+,T-); T+ / T- are sum ofpositive/negative ranks
Wilcoxon signed rank test: Compare ranks and signs ofdifferences
S = the number of pairs with positive difference. Large samples(n>20):
Sign test: Compareonly whichobservations arelargest
Assuming normal distributions, unknownvariance: Comparemeans
0
/D
D Ds n
−
(D1, …, Dn differences)1nt −
( ,0.5)Bin n
(0,1)NLarge samples: * 0.5
0.5S n
n−
18
Comparing two groups ofobservations: unmatched data
Standard normal (n>10)Based on sum of ranks ofobs. from one group; all obs. ranked together
Assuming identical translateddistributions: test equalmeans: Mann Whitney U test
Testing equality of variancesfor two normal populations
Diff. between pop. means: Unknown and unequalvariances
Diff. between pop. means: Unknown but equal variances
Standard normalDiff. between pop. means: Known variances
22
0( ) / yx
x yn nX Y D σσ− − + (0,1)N
2 2
0( ) / p p
x y
s sn nX Y D− − + 2x yn nt + −
22
0( ) / yx
x y
ssn nX Y D− − + tν
see book for d.f.
2 2/x ys s 1, 1x yn nF − −
(0,1)N
Comparing more than two groups ofdata
Two-way ANOVA withinteraction: Testing if groupsand blocking variable interact
Two-way ANOVA: Testing ifall groups are equal, whenyou also have blocking
Based on sums of ranks for each group; all obs. ranked together
Kruskal-Wallis test: Testing ifall groups are equal
One-way ANOVA: Testing ifall groups are equal (norm.)
/( 1)/( )
SSG KSSW n K
−− 1,K n KF − −
21Kχ −
/( 1)/(( 1)( 1))SSG K
SSE K H−
− − 1,( 1)( 1)K K HF − − −
/(( 1)( 1))/( ( 1))
SSI K HSSE HK L
− −− ( 1)( 1), ( 1)K H HK LF − − −
19
Studying population proportions
Standard normalComparing thepopulationproportions in twogroups (largesamples)
Standard normalTest of populationproportion in onegroup (largesamples)
0
0 0(1 ) /p
nπ
π π−− (0,1)N
0 0 0 0(1 ) (1 )x y
x y
p pp p p p
n n
−− −+
(p0 common estimate)
(0,1)N
Regression tests
Test on sets of partialregression coefficients: Can they all be set to zero (i.e., removed)?
Test on partial regressioncoefficient: Is it ?
Test of regression slope: Is it ?
1
1 *
b
bs
β−2nt −
*β
*
j
j
b
bs
β−*β 1n Kt − −
2
( ( ) ) /
e
SSE r SSE rs−
, 1r n K rF − − −
20
Model tests
Tests for normality:•Kolmogorov-Smirnov
Goodness-of-fit test: Countsin K categories, compared to expected counts, under H0
Contingency table test: Test ifthere is an associationbetween the two attributes in a contingency table
2
1 1
( )r cij ij
i j ij
O EE= =
−∑∑ 2
( 1)( 1)r cχ − −
2
1
( )Ki i
i i
O EE=
−∑ 21Kχ −
* *
Tests for correlation
Specialdistribution
Compute ranks of x-values, and of y-values, and computecorrelation of theseranks
Test for zero correlation(nonparametric): Spearmanrank correlation
Test for zero populationcorrelation (normal distribution) 2
21
r nr−
− 2nt −
21
Tests for autocorrelation
Special distribution, or standard normal
for large samples
Counting the numberof ”runs” above and below the median in the time series
The runs test (nonparametric), testing for randomness in time
Special distributionThe Durbin-Watson test (based on normal assumption) testing for autocorrelation in regression data
21
2
2
1
( )n
t ti
n
ti
e e
e
−=
=
−∑
∑
(0,1)N
From problem to choice of method
• Example: You have the grades of a class ofstudents from this years statistics course, and from last years statistics course. How to analyze?
• You have measured the blood pressure, working habits, eating habits, and exerciselevel for 200 middleaged men. How to analyze?
22
From problem to choice of method
• Example: You have asked 100 marriedwomen how long they have been married, and how happy they are (on a specific scale) with their marriage. How to analyze?
• Example: You have data for how satisfied(on some scale) 50 patients are with theirprimary health care, from each of 5 regions of Norway. How to analyze?