L2 flash cards quantitative methods - SS3

Study Session 3, Reading 11

Scatter Plot

scatter plot - a graphical representation of the relationship between two variables


Correlation and Covariance Analysis

Correlation analysis - expresses the relationship between two variables with the help of a single number. It measures both the extent and direction of the linear relationship between two variables.

Formula: Sample covariance of X and Y for a sample of size n:

$$\mathrm{Cov}(X,Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$


Correlation and Covariance Analysis (cont.)

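Formula: Sample Correlation Coefficient:

$$r = \frac{\mathrm{Cov}(X,Y)}{s_X s_Y}$$

Where: $s_X$ - standard deviation of variable X

$s_Y$ - standard deviation of variable Y

Formula: Sample Standard Deviation:

$$s_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$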
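
The covariance, standard deviation and correlation formulas on these cards can be checked numerically. The following is a minimal sketch in plain Python; the function names and the five data points are hypothetical and chosen only for illustration.

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance of two equal-length lists, using the n - 1 divisor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n - 1)


def sample_std(values):
    """Sample standard deviation, using the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def sample_correlation(xs, ys):
    """Sample correlation: covariance divided by the product of the standard deviations."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


# Hypothetical paired observations, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

cov_xy = sample_covariance(x, y)  # 1.5
r_xy = sample_correlation(x, y)   # about 0.775
print(cov_xy, r_xy)
```

Because the correlation divides the covariance by both standard deviations, it is unit-free and always lies between -1 and +1.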


Limitations to Correlation Analysis

Outliers

Outliers are a small number of observations that are at either extreme of a sample.

Spurious Correlation

The correlation between two variables that shows a chance relationship in a particular data set is called spurious correlation.

The correlation between two variables that arises not from a direct relationship between them but from their relation to a third variable is also called spurious correlation.


Hypothesis Testing For Population Correlation Coefficient

Proposed Hypothesis:

null hypothesis - H0, that the population correlation is 0 ($\rho = 0$)

alternative hypothesis - Ha, that the population correlation is different from 0 ($\rho \neq 0$)

Formula: t-test (with n - 2 degrees of freedom):

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
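
As an illustration of the t-test for a correlation coefficient, the sketch below computes the test statistic for a hypothetical sample correlation of 0.6 from 18 observations. The critical value is assumed to be read from a t-table (two-tailed, 5% significance, 16 degrees of freedom); it is not computed by the code.

```python
import math


def correlation_t_stat(r, n):
    """t statistic for H0: population correlation = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r ** 2)


# Hypothetical inputs, for illustration only.
r = 0.6
n = 18

t_stat = correlation_t_stat(r, n)  # about 3.0

# Assumed table value: two-tailed 5% critical t with 16 degrees of freedom.
critical_t = 2.120
reject_null = abs(t_stat) > critical_t
print(t_stat, reject_null)
```

With these inputs the statistic exceeds the critical value, so the null hypothesis of zero correlation would be rejected at the 5% level.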


Dependent and Independent Variables in Linear Regression

independent variable (denoted as X) - the variable that is used to explain changes in the dependent variable

dependent variable (denoted as Y) - the variable that is to be explained.

Linear regression involves the use of one variable to make a prediction about another variable. It also involves testing hypotheses about the relation between the two variables and quantifying the strength of the relationship between them.


Dependent and Independent Variables in Linear Regression (cont.)

Regression equation that defines the linear relation between the dependent and independent variable:

$$Y_i = b_0 + b_1 X_i + \varepsilon_i$$

Where: Y - dependent variable

b0 - intercept

b1 - slope coefficient

X - independent variable

$\varepsilon$ - error term


Dependent and Independent Variables in Linear Regression (cont.)

In linear regression, the estimated or fitted parameters $\hat{b}_0$ and $\hat{b}_1$ are chosen to minimize the sum of the squared vertical distances between the observations and the regression line:

$$\sum_{i=1}^{n} (Y_i - \hat{b}_0 - \hat{b}_1 X_i)^2$$

cross-sectional data - uses many observations on the dependent and independent variables for the same time period

time-series data - uses many observations from different time periods
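
The least-squares criterion on this card has a closed-form solution for one independent variable: the slope is the ratio of the cross-deviation sum to the squared-deviation sum of X, and the intercept makes the line pass through the sample means. A minimal sketch, with a hypothetical function name and made-up data:

```python
def fit_simple_regression(xs, ys):
    """Return (b0_hat, b1_hat) that minimize the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b1_hat = s_xy / s_xx
    b0_hat = mean_y - b1_hat * mean_x  # the fitted line passes through the means
    return b0_hat, b1_hat


# Hypothetical observations, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0_hat, b1_hat = fit_simple_regression(x, y)  # about 2.2 and 0.6
print(b0_hat, b1_hat)
```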


Assumptions of a Classical Linear Regression Model

1. There is a linear relationship between the independent and dependent variable.

2. The independent variable is not random.

3. The expected value of the error term is 0.

4. The error term is normally distributed.

5. The error term is uncorrelated across observations.

6. The variance of the error term is the same for all observations (Homoskedasticity Assumption).


Standard Error of Estimate

Standard Error of Estimate (also called the standard error of regression) - used to measure how accurately a regression model fits the data.

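Formula:

$$SEE = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{b}_0 - \hat{b}_1 X_i)^2}{n - 2}} = \sqrt{\frac{SSE}{n - 2}}$$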


Coefficient of Determination

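coefficient of determination - used in measuring the proportion of the variance in the dependent variable that is explained by the independent variable

Formula:

$$R^2 = \frac{\text{Total variation} - \text{Unexplained variation}}{\text{Total variation}} = 1 - \frac{SSE}{TSS}$$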
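
Both the standard error of estimate and the coefficient of determination follow from the regression residuals. The sketch below fits a one-variable regression on made-up data and then computes the two measures; all names and numbers are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_y - b1 * mean_x, b1


def see_and_r_squared(xs, ys):
    """Return (standard error of estimate, R-squared) for a fitted line."""
    b0, b1 = fit_line(xs, ys)
    n = len(xs)
    mean_y = sum(ys) / n
    fitted = [b0 + b1 * x for x in xs]
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained variation
    tss = sum((y - mean_y) ** 2 for y in ys)             # total variation
    see = math.sqrt(sse / (n - 2))
    r_squared = 1.0 - sse / tss
    return see, r_squared


# Hypothetical observations, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

see, r_squared = see_and_r_squared(x, y)  # about 0.894 and 0.6
print(see, r_squared)
```

With one independent variable, R-squared equals the square of the sample correlation between X and Y.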


Confidence Interval for Regression Coefficient

regression coefficient - the average change in the dependent variable for every one unit change in the value of the independent variable.

Things needed to estimate a confidence interval for the regression coefficient:

1. estimated parameter value for a sample

2. standard error of the estimated coefficient

3. significance level for the t-distribution

4. degrees of freedom (n - 2)

Formula:

$$\hat{b}_1 \pm t_c s_{\hat{b}_1}$$

Where: $t_c$ - critical t value at a chosen significance level

$s_{\hat{b}_1}$ - standard error of the estimated coefficient


Hypothesis Testing for a Population Value of the Regression Coefficient

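Formula: When testing a hypothesis using a regression model with a t-test of significance, the t statistic (with n - 2 degrees of freedom) is computed as:

$$t = \frac{\hat{b}_1 - b_1}{s_{\hat{b}_1}}$$

where $b_1$ is the hypothesized population value of the slope and $s_{\hat{b}_1}$ is the standard error of the estimated slope.

Formula: The confidence interval for the test is given as:

$$\hat{b}_1 - t_c s_{\hat{b}_1} \le b_1 \le \hat{b}_1 + t_c s_{\hat{b}_1}$$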
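
The t-test and the confidence interval for a slope coefficient use the same three inputs: the estimate, its standard error and a critical t value. A minimal sketch follows; the estimate and standard error are hypothetical, and the critical value is assumed to come from a t-table (two-tailed 5%, 3 degrees of freedom).

```python
def slope_t_stat(b1_hat, hypothesized_b1, se_b1):
    """t statistic for H0: b1 = hypothesized_b1."""
    return (b1_hat - hypothesized_b1) / se_b1


def slope_confidence_interval(b1_hat, se_b1, t_crit):
    """Interval b1_hat +/- t_crit * se_b1."""
    half_width = t_crit * se_b1
    return b1_hat - half_width, b1_hat + half_width


# Hypothetical regression output from n = 5 observations (n - 2 = 3 degrees of freedom).
b1_hat = 0.6
se_b1 = 0.2828
t_crit = 3.182  # assumed table value

t_stat = slope_t_stat(b1_hat, 0.0, se_b1)
low, high = slope_confidence_interval(b1_hat, se_b1, t_crit)
reject_null = abs(t_stat) > t_crit
print(t_stat, (low, high), reject_null)
```

The two views agree: the null hypothesis b1 = 0 is rejected exactly when 0 lies outside the confidence interval. Here the interval contains 0 and the t statistic is below the critical value, so the null is not rejected.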


Calculating a Predicted Value for the Dependent Variable

Two sources of uncertainty in using a regression model:

1. the error term

2. the estimated parameters ($\hat{b}_0$ and $\hat{b}_1$)

Given the regression model $Y_i = b_0 + b_1 X_i + \varepsilon_i$, if the estimated parameters $\hat{b}_0$ and $\hat{b}_1$ are known, the predicted value of the dependent variable, Y, can be calculated as:

$$\hat{Y} = \hat{b}_0 + \hat{b}_1 X$$


Calculating a Predicted Value for the Dependent Variable (cont.)

The prediction interval for a regression equation for a particular predicted value of the dependent variable is computed as:

$$\hat{Y} \pm t_c s_f$$

Where: $\hat{Y}$ - predicted value of the dependent variable

$s_f$ - square root of the estimated variance of the prediction error

$t_c$ - critical level for the t-statistic at the chosen significance level

The confidence level is taken as $1 - \alpha$, where $\alpha$ is the chosen significance level.


Calculating a Predicted Value for the Dependent Variable (cont.)

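The estimated variance of the prediction error (of Y) is calculated as:

$$s_f^2 = s^2 \left[ 1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{(n - 1)\, s_x^2} \right]$$

Where: $s^2$ - squared standard error of estimate

$s_x^2$ - variance of the independent variable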
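
To see how the variance of the prediction error widens the interval as X moves away from its mean, the sketch below evaluates the formula for a hypothetical fitted model. Every input (coefficients, standard error of estimate, sample statistics and the table-based critical t) is assumed for illustration.

```python
import math


def prediction_interval(b0, b1, x_new, see, n, mean_x, var_x, t_crit):
    """Return (predicted Y, variance of prediction error, (lower, upper))."""
    y_hat = b0 + b1 * x_new
    var_f = see ** 2 * (1.0 + 1.0 / n + (x_new - mean_x) ** 2 / ((n - 1) * var_x))
    s_f = math.sqrt(var_f)
    return y_hat, var_f, (y_hat - t_crit * s_f, y_hat + t_crit * s_f)


# Hypothetical inputs: fitted line Y = 2.2 + 0.6 X from n = 5 observations.
y_hat, var_f, interval = prediction_interval(
    b0=2.2,
    b1=0.6,
    x_new=6.0,
    see=math.sqrt(0.8),
    n=5,
    mean_x=3.0,
    var_x=2.5,
    t_crit=3.182,  # assumed table value, two-tailed 5%, 3 degrees of freedom
)
print(y_hat, var_f, interval)
```

The third term in the bracket grows with the squared distance of the new X from the sample mean, which is why forecasts far outside the observed range carry wide intervals.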


Calculating ANOVA in Regression Analysis

Analysis of Variance (ANOVA) - a statistical procedure that is used to determine how well the independent variable or variables explain the variation in the dependent variable.

F-test - the statistical test that is used in the analysis of the variance


F-test

An F-statistic is used to test whether the slope coefficients in a linear regression are equal to 0 or not.

In a regression equation with one independent variable:

Null Hypothesis H0 : b1 = 0

Alternative Hypothesis Ha : b1 ≠ 0

Things required to undertake an F-test:

1. the total number of observations

2. the total number of parameters to be estimated

3. the sum of squared errors (SSE)

4. regression sum of squares (RSS)


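F-test (cont.)

Formula: SSE

$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

Formula: RSS

$$RSS = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$$

Formula: Total Variation (TSS) = SSE + RSS

$$TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

Formula: F-statistic in a regression with one independent variable

$$F = \frac{RSS / 1}{SSE / (n - 2)}$$

Where: $\hat{Y}_i$ - fitted value for observation i

$\bar{Y}$ - sample mean of the dependent variable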
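
The ANOVA quantities on this card can be computed directly from the observations and the fitted values. A minimal sketch for one independent variable, assuming the coefficients passed in are the least-squares estimates (otherwise TSS = SSE + RSS need not hold); data and names are hypothetical.

```python
def anova_one_variable(xs, ys, b0, b1):
    """Return (SSE, RSS, TSS, F) for a fitted one-variable regression."""
    n = len(ys)
    mean_y = sum(ys) / n
    fitted = [b0 + b1 * x for x in xs]
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained variation
    rss = sum((f - mean_y) ** 2 for f in fitted)         # explained variation
    tss = sum((y - mean_y) ** 2 for y in ys)             # total variation
    f_stat = (rss / 1) / (sse / (n - 2))
    return sse, rss, tss, f_stat


# Hypothetical data with its least-squares fit Y = 2.2 + 0.6 X.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

sse, rss, tss, f_stat = anova_one_variable(x, y, b0=2.2, b1=0.6)
print(sse, rss, tss, f_stat)  # about 2.4, 3.6, 6.0, 4.5
```

With a single independent variable, the F-statistic equals the square of the t-statistic on the slope coefficient, so the two tests lead to the same conclusion.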


Limitations to Regression Analysis

Parameter instability

In investment analysis, regression models can have limited use because public knowledge of regression relationships can negate their usefulness in the future

Violations of assumptions can make hypothesis tests and predictions invalid


Multiple Regression Equation

multiple regression equation - used to determine how a dependent variable is affected by more than one independent variable

log-log regression model - used when proportional changes in the dependent variable bear a constant relationship to proportional changes in the independent variables

General Form of the Multiple Regression Model

$$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_k X_{ki} + \varepsilon_i, \quad i = 1, 2, \dots, n$$

Where: $Y_i$ - the ith observation of the dependent variable Y

$X_{ji}$ - the ith observation of the independent variable $X_j$, j = 1, 2, …, k

$b_0$ - the intercept of the equation

$b_1, \dots, b_k$ - the slope coefficients for each of the independent variables

$\varepsilon_i$ - the error term

n - the number of observations


Hypothesis Testing for a Population Value of a Regression Coefficient

Under the null hypothesis, the hypothesized population value of a regression coefficient is taken as 0.

The degrees of freedom in the test are the number of observations minus the number of independent variables plus 1, i.e. n - (k + 1).

Formula: Hypothesis testing using t-test:

$$t = \frac{\hat{b}_j - b_j}{s_{\hat{b}_j}}$$

Where: $\hat{b}_j$ - regression estimate of the coefficient

$b_j$ - hypothesized value of the coefficient

$s_{\hat{b}_j}$ - estimated standard error of $\hat{b}_j$


Hypothesis Testing for a Population Value of a Regression Coefficient (cont.)

p-value

The p-value for a regression coefficient is the smallest level of significance at which the null hypothesis that the population value of the coefficient is 0 can be rejected in a two-sided test.

The lower the p-value, the stronger the evidence against the null hypothesis.


Confidence Interval for the Population Value and Predicted Value for the Dependent Variable

Two types of uncertainty in predicting the dependent variable using a linear regression model:

1. uncertainty in the regression model itself, as reflected in the standard error of estimate

2. uncertainty about the estimates of the regression model parameters

The computation of the prediction interval to accommodate the uncertainties is done with the help of matrix algebra.


Points to be considered for predicting a dependent variable

Assumptions required for using a regression model must be met.

Caution should be exercised on predictions that are based on the value of independent variables that are outside the range of data used for estimating the model.


Steps in predicting the value of the dependent variable

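1. Obtaining estimates of the regression parameters ($\hat{b}_0, \hat{b}_1, \dots, \hat{b}_k$).

2. Determining assumed values of the independent variables ($\hat{X}_{1i}, \hat{X}_{2i}, \dots, \hat{X}_{ki}$).

3. Computing the predicted value of the dependent variable using the equation:

$$\hat{Y}_i = \hat{b}_0 + \hat{b}_1 \hat{X}_{1i} + \hat{b}_2 \hat{X}_{2i} + \dots + \hat{b}_k \hat{X}_{ki}$$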


Assumptions of a Multiple Regression Model

1. There exists a linear relationship between the dependent variable and the independent variables.

2. There is no exact linear relationship between two or more of the independent variables, and the independent variables are not random.

3. The error term is normally distributed.

4. The error term is uncorrelated across observations.

5. The variance of the error term is the same for all of the observations.

6. The expected value of the error term, conditioned upon the independent variables, is 0.


F-statistic in Regression Analysis

F-statistic - used to test whether at least one of the slope coefficients of the independent variables is not equal to 0

null hypothesis - all the slope coefficients in the multiple regression model are equal to 0, presented as:

$$H_0: b_1 = b_2 = \dots = b_k = 0$$

alternative hypothesis - at least one slope coefficient is not equal to 0.


F-statistic in Regression Analysis (cont.)

Things required for F-test

1. Total number of observations (n).

2. Total number of regression coefficients to be estimated (k + 1), where k is the number of slope coefficients.

3. Sum of squared errors (SSE) (Unexplained Variation)

4. Regression sum of squares (RSS) (Explained Variation)


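F-statistic in Regression Analysis (cont.)

Calculating the F-statistic

$$F = \frac{RSS / k}{SSE / (n - (k + 1))} = \frac{\text{Mean regression sum of squares}}{\text{Mean squared error}}$$

Degrees of freedom in the test

1) k (numerator degrees of freedom)

2) n - (k + 1) (denominator degrees of freedom)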
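
For a multiple regression, the F-statistic divides each sum of squares by its own degrees of freedom before taking the ratio. The following sketch uses made-up sums of squares; the function name and inputs are hypothetical.

```python
def regression_f_stat(rss, sse, n, k):
    """F = (RSS / k) / (SSE / (n - (k + 1))) for k slope coefficients."""
    mean_regression_ss = rss / k           # numerator, k degrees of freedom
    mean_squared_error = sse / (n - (k + 1))  # denominator, n - (k + 1) degrees of freedom
    return mean_regression_ss / mean_squared_error


# Hypothetical ANOVA inputs: 25 observations, 4 independent variables.
f_stat = regression_f_stat(rss=60.0, sse=40.0, n=25, k=4)
print(f_stat)  # 15.0 / 2.0 = 7.5
```

The result would be compared with a critical F value for 4 and 20 degrees of freedom, taken from an F-table at the chosen significance level.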


R2 and Adjusted R2 in Multiple Regression

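$R^2$ measures how well the regression model fits the data; it is an appropriate measure of fit when there is one independent variable.

Adjusted $R^2$ ($\bar{R}^2$) is used in place of $R^2$ when there is more than one independent variable, because $R^2$ never falls when another variable is added.

Relationship:

$$\bar{R}^2 = 1 - \left( \frac{n - 1}{n - k - 1} \right) (1 - R^2)$$

Where: n - the number of observations

k - number of independent variables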
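
A quick numerical check of the relationship between R-squared and adjusted R-squared is sketched below. The inputs are hypothetical.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R-squared for n observations and k independent variables."""
    return 1.0 - ((n - 1) / (n - k - 1)) * (1.0 - r_squared)


# Hypothetical regression: R-squared of 0.75 with 25 observations and 4 regressors.
adj = adjusted_r_squared(0.75, n=25, k=4)
print(adj)  # 1 - (24 / 20) * 0.25 = 0.70
```

Adjusted R-squared is never larger than R-squared when k is at least 1, and it can fall when a variable with little explanatory power is added.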


Dummy Variables

Dummy variables - used in regression models to determine whether a qualitative independent variable explains the dependent variable

A dummy variable has a value of 1 if a particular qualitative condition is true and 0 if that condition is false.

In order to distinguish between n categories, n – 1 dummy variables are required.
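
The n - 1 rule can be illustrated with quarterly data: four quarters need three dummy variables, with the omitted quarter serving as the baseline captured by the intercept. A minimal sketch; the helper name and the sample labels are hypothetical.

```python
def encode_dummies(values, baseline):
    """Return (column names, rows of 0/1 dummies), omitting the baseline category."""
    categories = sorted(set(values) - {baseline})
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, rows


# Hypothetical quarter labels for five observations; Q4 is the omitted baseline.
quarters = ["Q1", "Q2", "Q3", "Q4", "Q1"]
columns, rows = encode_dummies(quarters, baseline="Q4")

print(columns)  # ['Q1', 'Q2', 'Q3']
for row in rows:
    print(row)
```

Including a dummy for every category together with an intercept would create an exact linear relationship among the independent variables, which violates a regression assumption.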


Heteroskedasticity and its Effect on Statistical Inference

Heteroskedasticity - a violation of the regression assumption that the variance of the errors in a regression is constant across observations.

Two types of heteroskedasticity:

1. unconditional heteroskedasticity

2. conditional heteroskedasticity

Breusch-Pagan test - widely used when testing for conditional heteroskedasticity.

Two methods used for correcting conditional heteroskedasticity:

1. Robust Standard Errors

2. Generalized Least Squares


Heteroskedasticity and its Effect on Statistical Inference (cont.)

Durbin-Watson test - used to test for serial correlation, which generally arises in time-series regressions

Consequences of Heteroskedasticity

The F-test does not provide reliable results.

t-tests for the significance of individual regression coefficients do not provide reliable results.

Standard errors and test statistics have to be adjusted in order to derive reliable results.


Unconditional Heteroskedasticity and Conditional Heteroskedasticity

Unconditional heteroskedasticity arises when the heteroskedasticity of an error variance does not correlate with the independent variables. This heteroskedasticity is not a major problem for statistical inference.

Conditional heteroskedasticity arises when heteroskedasticity in the error variance is correlated with the independent variables. This heteroskedasticity is a major problem for statistical inference.


Methods for Correcting for Heteroskedasticity

1. Under the robust standard error method, the standard errors of a linear regression model's estimated coefficients are corrected.

2. Under the generalized least squares method, the original equation is modified and the new, modified regression equation is estimated.


Consequences of Serial Correlation

Incorrect estimates of the regression coefficient standard errors.

If the independent variable is a lagged value of the dependent variable, it will make the parameter estimates invalid.

In positive serial correlation, a positive (negative) error for one observation increases the chance of a positive (negative) error for another observation.

Positive serial correlation has no effect on the consistency of estimated regression coefficients, but it affects the validity of statistical tests.


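Durbin-Watson Test

Formula:

$$DW = \frac{\sum_{t=2}^{T} (\hat{\varepsilon}_t - \hat{\varepsilon}_{t-1})^2}{\sum_{t=1}^{T} \hat{\varepsilon}_t^2}$$

Where: $\hat{\varepsilon}_t$ - regression residual for period t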
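
The Durbin-Watson statistic compares squared changes in consecutive residuals with the squared residuals themselves. The sketch below evaluates it for two made-up residual series, one that alternates in sign and one that stays on the same side for several periods.

```python
def durbin_watson(residuals):
    """Sum of squared first differences of residuals over sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals, for illustration only.
alternating = [1.0, -1.0, 1.0, -1.0]             # sign flips every period
persistent = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]   # long runs of the same sign

dw_alternating = durbin_watson(alternating)  # 12 / 4 = 3.0
dw_persistent = durbin_watson(persistent)    # 4 / 6, about 0.667
print(dw_alternating, dw_persistent)
```

For large samples the statistic is approximately 2(1 - r), where r is the sample correlation between consecutive residuals. Values near 2 suggest no serial correlation, values well below 2 suggest positive serial correlation, and values well above 2 suggest negative serial correlation.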


Methods to correct for Serial Correlation

1. The coefficient standard errors for the linear regression parameter estimates can be adjusted.

2. Regression equation can be modified to eliminate serial correlation.


Multicollinearity in Regression Analysis

Multicollinearity - a violation of the regression assumption that there is no exact linear relationship between two or more independent variables

Consequences of Multicollinearity

Estimates of regression coefficients become unreliable.

It is not possible to ascertain how individual independent variables affect the dependent variable.


Model Misspecification in Regression Analysis

Model specification - the set of variables that are included in the regression and the regression equation’s functional form

Misspecified Functional Form

A regression has a misspecified functional form when:

It omits one or more important variables from the regression.

One or more regression variables need to be transformed before estimating the regression.

Data has been pooled from different samples that should not be pooled.


Model Misspecification in Regression Analysis (cont.)

Reasons for time-series misspecification

Inclusion of lagged dependent variables as independent variables in regressions which have serially correlated errors.

A function of the dependent variable being included as an independent variable.

Independent variables that are measured with error.


Models With Qualitative Dependent Variables

Qualitative dependent variables are dummy variables that are used as dependent variables.

1. Probit model - based on the normal distribution; used to estimate the probability of a discrete outcome given the values of the independent variables used to explain that outcome

2. Logit model - based on the logistic distribution; used to estimate the probability of a discrete outcome given the values of the independent variables used to explain that outcome


Calculating the Predicted Trend Value for a Time Series

Linear Trend Models - the dependent variable changes by a constant amount with time

Formula:

$$y_t = b_0 + b_1 t + \varepsilon_t$$

Where: $y_t$ - value of the time series at time t

$b_0$ - the y-intercept term

$b_1$ - the slope coefficient (trend coefficient)

t - time (independent variable)

$\varepsilon_t$ - a random error term


Calculating the Predicted Trend Value for a Time Series (cont.)

Log-Linear Trend Models - used when the time series tends to grow at a constant rate

Formula:

$$\ln y_t = b_0 + b_1 t + \varepsilon_t$$

Predicted trend value of $y_t$:

$$\hat{y}_t = e^{\hat{b}_0 + \hat{b}_1 t}$$
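
The difference between the two trend models shows up in the predicted values: a linear trend adds the same amount each period, while a log-linear trend multiplies by the same factor. A minimal sketch with hypothetical coefficient estimates:

```python
import math


def linear_trend(b0, b1, t):
    """Predicted value from a linear trend model."""
    return b0 + b1 * t


def log_linear_trend(b0, b1, t):
    """Predicted value from a log-linear trend model, ln(y_t) = b0 + b1 * t."""
    return math.exp(b0 + b1 * t)


# Hypothetical estimates, for illustration only.
b0_hat, b1_hat = 2.0, 0.05

level_10 = log_linear_trend(b0_hat, b1_hat, 10)
level_11 = log_linear_trend(b0_hat, b1_hat, 11)
growth_rate = level_11 / level_10 - 1.0  # the same in every period

linear_step = linear_trend(b0_hat, b1_hat, 11) - linear_trend(b0_hat, b1_hat, 10)
print(level_10, growth_rate, linear_step)
```

In the log-linear model the implied growth rate per period is exp(b1) - 1, which is close to b1 itself when b1 is small.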


Limitations of the Use of Trend Models for a Given Time Series

Trend models can suffer from the limitation of serially correlated errors.

If trend models have errors that are serially correlated, better forecast models for such time series are required than trend models.


Covariance Stationary

The following should be finite and constant in all periods:

1. Expected value of the time series.

2. Variance of the time series.

3. Covariance of the time series with itself for a fixed number of periods in the past or future.

Implications if the Time Series is not Covariance Stationary

Estimates of an autoregressive time series obtained by using linear regression will not be valid.

The hypothesis tests will provide invalid results.


Structure of an Autoregressive Model of Order p

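In an autoregressive model, a time series is regressed on its own past values, which shows the relationship between current-period values and past-period values.

pth-order Autoregressive Model:

$$x_t = b_0 + b_1 x_{t-1} + b_2 x_{t-2} + \dots + b_p x_{t-p} + \varepsilon_t$$

First Order Autoregression

$$x_t = b_0 + b_1 x_{t-1} + \varepsilon_t$$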


Autocorrelation for Time Series

Autocorrelation of a time series - the correlation of the time series with its past values

Formula: kth-order autocorrelation

$$\rho_k = \frac{\mathrm{Cov}(x_t, x_{t-k})}{\sigma_x^2}$$


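Autocorrelation for Error Term

Error autocorrelation is estimated by using sample autocorrelations of the residuals, called residual autocorrelations, and their sample variance.

Formula: kth-order residual autocorrelation

$$\hat{\rho}_{\varepsilon,k} = \frac{\sum_{t=k+1}^{T} \hat{\varepsilon}_t \hat{\varepsilon}_{t-k}}{\sum_{t=1}^{T} \hat{\varepsilon}_t^2}$$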


Mean Reversion

A time series shows mean reversion if it tends to rise when its level is below its mean and to fall when its level is above its mean.

Formula: Mean Reverting Level (for a first-order autoregressive model)

$$x_t = \frac{b_0}{1 - b_1}$$


Mean Reversion (cont.)

Interpretation of Mean Reversion Level

If the current value of the time series is b0 /(1 - b1), it will neither increase nor decrease.

If the current value is below b0 /(1 - b1), the time series will increase.

If the current value is above b0 /(1 - b1), the time series will decrease.


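Mean Reversion (cont.)

Multiple Periods of Forecasting and the Chain Rule of Forecasting

Formula: AR Model (first order), with its one-period-ahead forecast

$$x_{t+1} = b_0 + b_1 x_t + \varepsilon_{t+1}, \qquad \hat{x}_{t+1} = \hat{b}_0 + \hat{b}_1 x_t$$

Formula: Two-period ahead forecast

$$\hat{x}_{t+2} = \hat{b}_0 + \hat{b}_1 \hat{x}_{t+1}$$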
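
The chain rule of forecasting feeds each forecast back into the first-order autoregressive equation to obtain the next one, and the forecasts drift toward the mean-reverting level. A minimal sketch with hypothetical coefficients:

```python
def mean_reverting_level(b0, b1):
    """Level b0 / (1 - b1) at which an AR(1) series is expected to stay put."""
    if b1 == 1:
        raise ValueError("b1 = 1 is a unit root; no finite mean-reverting level")
    return b0 / (1 - b1)


def chain_forecast(b0, b1, x_t, periods):
    """Forecasts for 1, 2, ..., periods steps ahead, each built on the previous one."""
    forecasts = []
    current = x_t
    for _ in range(periods):
        current = b0 + b1 * current
        forecasts.append(current)
    return forecasts


# Hypothetical AR(1) estimates and current value, for illustration only.
b0_hat, b1_hat, x_now = 1.0, 0.5, 4.0

level = mean_reverting_level(b0_hat, b1_hat)           # 2.0
forecasts = chain_forecast(b0_hat, b1_hat, x_now, 2)   # [3.0, 2.5]
print(level, forecasts)
```

Because the current value of 4.0 is above the mean-reverting level of 2.0, each successive forecast is lower than the one before it.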


In-Sample and Out-of-Sample Forecasts

In-sample forecasts can be defined as the in-sample predicted values from the estimated time series model.

Out-of-sample forecasts are made from estimated time-series models for a period that is different from the period from which the model was estimated.

Root Mean Squared Error (RMSE) (calculated as square root of average squared error) - used for comparing the out-of-sample forecasting accuracy of different time series models.
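
As a small illustration of RMSE, the sketch below compares made-up out-of-sample forecasts with the values that were later observed. The function name and the numbers are hypothetical.

```python
import math


def rmse(actual, forecast):
    """Square root of the average squared forecast error."""
    squared_errors = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical out-of-sample observations and forecasts from one model.
actual = [2.0, 4.0, 6.0]
forecast = [1.0, 4.0, 8.0]

error = rmse(actual, forecast)  # sqrt(5 / 3), about 1.291
print(error)
```

When several models are compared on the same out-of-sample period, the one with the smallest RMSE is judged the most accurate forecaster.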


Instability of Coefficients in Time-Series Models

Coefficient estimates in time-series models are:

Generally unstable across different sample periods

Different between models that are estimated based on longer or shorter sample periods

Dependent upon the sample period chosen


Random Walk

random walk - a time series model in which the value of a series in one period is calculated as the value of the series in the previous period plus an unpredictable random error

Formula:

$$x_t = x_{t-1} + \varepsilon_t$$

A random walk with a drift increases or decreases by a constant amount in each period

Formula:

$$x_t = b_0 + x_{t-1} + \varepsilon_t, \quad b_0 \neq 0$$


Random Walk (cont.)

First-differencing - differencing a time series by creating a new time series that in each period is equal to the difference between $x_t$ and $x_{t-1}$.

Formula:

$$y_t = x_t - x_{t-1}$$
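
First-differencing is a one-line transformation; the differenced series has one fewer observation than the original. A minimal sketch with a made-up series:

```python
def first_difference(series):
    """Return the series of changes x_t - x_(t-1)."""
    return [series[t] - series[t - 1] for t in range(1, len(series))]


# Hypothetical price levels, for illustration only.
levels = [10.0, 12.0, 11.0, 15.0]

changes = first_difference(levels)  # [2.0, -1.0, 4.0]
print(changes)
```

For a random walk, the differenced series is just the sequence of random errors, which is why differencing is used before fitting an autoregressive model to such data.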


Dickey Fuller Unit Root Test

Formula:

$$x_t - x_{t-1} = b_0 + g_1 x_{t-1} + \varepsilon_t$$

Where: g1 = (b1 – 1)

Null Hypothesis is H0 : g1 = 0

Alternative Hypothesis is Ha : g1 < 0


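Seasonality in a Time-Series Model

Seasonality of a time series occurs when regular patterns of movement within the year are observed.

Formula: Seasonal lag in autoregressive model (for example, quarterly data with a fourth-lag seasonal term)

$$x_t = b_0 + b_1 x_{t-1} + b_2 x_{t-4} + \varepsilon_t$$

Formula: Forecasted Value

$$\hat{x}_t = \hat{b}_0 + \hat{b}_1 x_{t-1} + \hat{b}_2 x_{t-4}$$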


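ARCH Models

Autoregressive Conditional Heteroskedasticity (ARCH) - present if the variance of errors in a time series model depends on the variance of errors in previous periods

Formula: Linear regression of the squared errors (first-order ARCH)

$$\hat{\varepsilon}_t^2 = a_0 + a_1 \hat{\varepsilon}_{t-1}^2 + u_t$$

Where: $u_t$ = error term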


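ARCH Models (cont.)

Predicting Variance of Errors

Formula: Predicted variance of the error term in the next period

$$\hat{\sigma}_{t+1}^2 = \hat{a}_0 + \hat{a}_1 \hat{\varepsilon}_t^2$$

Formula: Calculate the variance of the error term in the current period

$$\hat{\sigma}_t^2 = \hat{a}_0 + \hat{a}_1 \hat{\varepsilon}_{t-1}^2$$

Where: $\hat{a}_0$, $\hat{a}_1$ - estimated coefficients from regressing the squared residuals on their own lagged values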
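
Assuming a first-order ARCH specification, in which the squared residual is regressed on its own one-period lag, the predicted error variance is a linear function of the most recent squared residual. The sketch below uses hypothetical coefficient estimates.

```python
def arch1_predicted_variance(a0, a1, last_residual):
    """Predicted error variance for the next period under an assumed ARCH(1) model."""
    return a0 + a1 * last_residual ** 2


# Hypothetical ARCH(1) estimates, for illustration only.
a0_hat, a1_hat = 0.2, 0.5

after_large_error = arch1_predicted_variance(a0_hat, a1_hat, 2.0)   # 0.2 + 0.5 * 4.0
after_small_error = arch1_predicted_variance(a0_hat, a1_hat, 0.2)   # 0.2 + 0.5 * 0.04
print(after_large_error, after_small_error)
```

A large error in one period raises the predicted variance for the next period, which is the clustering of volatility that ARCH models are meant to capture.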


Analysis of Time-Series Variables Prior To Linear Regression

Two time series are said to be cointegrated if they share a long-term financial or economic relationship such that they do not diverge from each other without bound in the long run.

The (Engle Granger) Dickey Fuller test is used to determine whether time series are cointegrated.


Analysis of the Appropriate Time-Series Model Given an Investment Problem

Regression models or time series models can be used in the analysis of investment problems.

In a regression model, the future value of a variable is predicted on the basis of a hypothesized causal relationship with other variables.

In a time series model, the future behavior of the variable is predicted on the basis of the past behavior of that variable.


Explanation of the Dependent Variable by Analysing the Regression Equation and ANOVA Table Key

analysis of variance (ANOVA) - used to provide information about a regression model’s explanatory power

F-statistics are used to test the explanatory power of the independent variables

If independent variables do not explain the dependent variables, the value of the F-statistic is 0.

Variability in values of the dependent variable can be divided into two parts:

Total Sum of Squares = Regression Sum of Squares + Residual Sum of Squares


Uses of Multiple Regression Analysis in Financial Analysis

Used in various finance and investment decisions

The effect of various parameters on investment decisions can be measured

To predict the expected return of a fund or portfolio

Dummy variables can be used in various financial analysis models

If there are any violations of assumptions, they should be adjusted for by analysts before making any decisions
