  • BUILDING THE REGRESSION MODEL
    - Data preparation
    - Variable reduction
    - Model selection
    - Model validation
    - Procedures for variable reduction

  • List the independent variables that could conceivably be related to the dependent variable under study. Some of them can be screened out. An independent variable:
    - may not be fundamental to the problem,
    - may be subject to large measurement errors, and/or
    - may effectively duplicate another independent variable in the list.

    Independent variables that cannot be measured may either be deleted or replaced by proxy variables that are highly correlated with them.

  • The number of cases to be collected depends on the size of the pool of independent variables (usually 6 to 10 cases for every variable in the pool).
    After the data have been collected, edit checks and plots should be performed to identify gross data errors as well as extreme outliers.
    The formal modeling process can then begin. A variety of diagnostics should be employed to identify the important independent variables, the functional forms in which they should enter the regression model, and important relationships.

  • [Flowchart] Steps shown, in order:
    - Collect data
    - Preliminary checks on data quality
    - Diagnostics for relationships and strong interactions
    - Are remedial measures needed? (Yes / No)
    - Remedial measures

  • When selecting a few good subsets of X variables, the candidates should include not only the potential independent variables in first-order form but also any needed quadratic and other curvature terms and any necessary interaction terms.
    Several reasons for reducing the number of independent variables:
    - A regression model with a large number of independent variables is difficult to maintain.
    - Regression models with a limited number of independent variables are easier to work with and understand.
    - The presence of many highly intercorrelated independent variables may add little to the predictive power of the model while substantially increasing the sampling variation of the regression coefficients, detracting from the model's descriptive abilities, and increasing the problem of roundoff errors.

  • After successfully reducing the number of independent variables, select a small number of potentially good regression models, each of which contains those independent variables that are known to be essential.
    More detailed checks of curvature and interaction effects are desirable.
    Diagnostics on residuals are needed to identify influential outlying observations, multicollinearity, etc.

  • The final step in the model-building process is to validate the selected regression model.
    Model validity refers to the stability and reasonableness of the regression coefficients, the plausibility and usability of the regression function, and the ability to generalize inferences drawn from the regression analysis.

  • Three basic ways of validating a regression model are:
    - Collection of new data to check the model and its predictive ability.
    - Comparison of results with theoretical expectations, earlier empirical results, and simulation results.
    - Use of a hold-out sample to check the model and its predictive ability.

  • The purpose of collecting new data is to examine whether the regression model developed from the earlier data is still applicable to the new data.
    Some methods of examining the validity of the regression model against the new data:
    - Re-estimate the model form chosen earlier using the new data, then compare the estimated coefficients and various characteristics of the fitted values to those of the regression model based on the earlier data.
    - Re-estimate from the new data all of the good models that had been considered, to see whether the selected regression model is still the preferred model according to the new data.
    - Use a method designed to calibrate the predictive capability of the selected regression model.

  • Data splitting
    Collecting new data, the preferred way to validate a model, is often neither practical nor feasible.
    A reasonable alternative when the data set is large enough is to split the data into two sets: the model-building set and the validation (or prediction) set.
    This type of validation is called cross-validation.
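
The data split described above can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the function name `split_cases`, the 50/50 split, the fixed seed, and the use of integer case IDs are all assumptions made for the example.

```python
import random


def split_cases(cases, build_fraction=0.5, seed=0):
    """Randomly split cases into a model-building set and a validation set.

    build_fraction and seed are illustrative defaults, not values from the slides.
    """
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_build = int(round(len(shuffled) * build_fraction))
    return shuffled[:n_build], shuffled[n_build:]


# Hypothetical case IDs; any list of records would work the same way.
cases = list(range(54))
build_set, validation_set = split_cases(cases)
print(len(build_set), len(validation_set))
```

The model is then fitted on `build_set` only, and `validation_set` is held back to check its predictive ability.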

  • A means of measuring the actual predictive capability of the selected regression model is to use this model to predict each case in the new data set and then to calculate the mean of the squared prediction errors, denoted MSPR (mean squared prediction error):

    $$MSPR = \frac{\sum_{i=1}^{n^*} (Y_i - \hat{Y}_i)^2}{n^*}$$

    where
    - $Y_i$: the value of the response in the ith validation case
    - $\hat{Y}_i$: the predicted value for the ith validation case based on the model-building data set
    - $n^*$: the number of cases in the validation data set

    If the MSPR is fairly close to the MSE based on the regression fit to the model-building data set, then the MSE for the selected regression model is not seriously biased and gives an appropriate indication of the predictive ability of the model.
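
As a small sketch of the MSPR calculation, the function below averages the squared prediction errors over the validation cases. The function name `mspr` and the four response and prediction values are made up for illustration; they are not from the slides.

```python
def mspr(observed, predicted):
    """Mean squared prediction error over the validation cases."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    n_star = len(observed)  # number of cases in the validation set
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)) / n_star


# Illustrative values only: responses in the validation set and the
# predictions made by a model fitted to the model-building set.
y_validation = [10.0, 12.0, 15.0, 11.0]
y_predicted = [11.0, 12.0, 13.0, 12.0]
print(mspr(y_validation, y_predicted))  # (1 + 0 + 4 + 1) / 4 = 1.5
```

The result would then be compared with the MSE from the model-building fit.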

  • Some procedures for variable reduction are:
    - Forward procedure
    - Backward procedure

    Some criteria for comparing the regression models: $R^2_p$, $MSE_p$, $C_p$ and $PRESS_p$.
    Here P is the number of potential parameters and p is the number of parameters in a given subset model. The all-possible-regressions approach assumes that the number of observations n exceeds the maximum number of potential parameters, n > P, and that 1 ≤ p ≤ P.
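
To show how quickly the all-possible-regressions approach grows, the sketch below enumerates every subset of a pool of candidate X variables. The helper name `all_subsets` and the four variable names are assumptions for the example. Each subset model has p = (number of X variables) + 1 parameters, counting the intercept.

```python
from itertools import combinations


def all_subsets(variables):
    """Yield every subset of the candidate X variables, smallest first."""
    for size in range(len(variables) + 1):
        for subset in combinations(variables, size):
            yield subset


# Four hypothetical candidate variables, so P - 1 = 4 and P = 5.
candidates = ["X1", "X2", "X3", "X4"]
subsets = list(all_subsets(candidates))

for subset in subsets:
    p = len(subset) + 1  # parameters in this subset model, intercept included
    print(p, subset)

print(len(subsets))  # 2 ** 4 = 16 candidate models
```

With P - 1 candidate variables there are 2^(P-1) subset models, which is why the comparison criteria are needed to narrow the field.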

  • The $R^2_p$ criterion examines the coefficient of multiple determination $R^2$ for each subset model:

    $$R^2_p = \frac{SSM_p}{SST} = 1 - \frac{SSE_p}{SST}$$

    where SST is constant for all possible regressions. $R^2_p$ varies inversely with $SSE_p$, and $R^2_p$ is at its maximum when all P-1 potential X variables are included in the regression model.
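
A short sketch of the R²p calculation follows. The SST and SSEp values are invented for illustration, chosen so that SSEp shrinks as variables are added, which is what happens with nested subsets.

```python
def r2_p(sse_p, sst):
    """Coefficient of multiple determination for a subset model."""
    return 1.0 - sse_p / sst


# Illustrative numbers only: SST is the same for every subset,
# while SSE_p shrinks as p (the number of parameters) grows.
sst = 100.0
sse_by_p = {2: 40.0, 3: 25.0, 4: 22.0, 5: 21.5}

r2_by_p = {p: r2_p(sse, sst) for p, sse in sse_by_p.items()}
for p, r2 in r2_by_p.items():
    print(p, round(r2, 3))
```

Because R²p can only go up as variables are added, the criterion is used to find the point where adding more X variables stops being worthwhile, not to find the maximum.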

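  • The $MSE_p$ criterion is equivalent to the adjusted coefficient of multiple determination $R^2_a$, which takes the number of parameters in the model into account through the degrees of freedom:

    $$R^2_a = 1 - \left(\frac{n-1}{n-p}\right)\frac{SSE_p}{SST} = 1 - \frac{MSE_p}{SST/(n-1)}$$

    Seek the minimum $MSE_p$, which corresponds to the maximum $R^2_a$.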
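
The sketch below computes MSEp and the adjusted R² for a few subset sizes and confirms that the smallest MSEp and the largest adjusted R² pick the same model. The sample size n = 20 and the SSE values are assumptions made up for the example.

```python
def mse_p(sse_p, n, p):
    """Error mean square for a subset model with p parameters."""
    return sse_p / (n - p)


def adjusted_r2(sse_p, sst, n, p):
    """Adjusted coefficient of multiple determination."""
    return 1.0 - (n - 1) / (n - p) * sse_p / sst


# Illustrative numbers only (not from the slides).
n = 20
sst = 100.0
sse_by_p = {2: 40.0, 3: 25.0, 4: 22.0, 5: 21.5}

best_by_mse = min(sse_by_p, key=lambda p: mse_p(sse_by_p[p], n, p))
best_by_r2a = max(sse_by_p, key=lambda p: adjusted_r2(sse_by_p[p], sst, n, p))
print(best_by_mse, best_by_r2a)  # both criteria select the same p
```

Unlike R²p, MSEp can increase when an added variable reduces SSEp by too little to offset the lost degree of freedom.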

  • The $C_p$ criterion is concerned with the total mean squared error of the n fitted values for each subset regression model. The model that includes all P-1 potential X variables is assumed to have been carefully chosen so that $MSE(X_1, \ldots, X_{P-1})$ is an unbiased estimator of $\sigma^2$, and $SSE_p$ is the error sum of squares for the fitted subset regression model with p parameters. $C_p$ is defined as follows:

    $$C_p = \frac{SSE_p}{MSE(X_1, \ldots, X_{P-1})} - (n - 2p)$$

    The $C_p$ criterion suggests seeking subsets for which $C_p$ is small and close to p.
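
A minimal sketch of the Cp calculation, using made-up SSE values and an assumed sample size of n = 20. The function name `mallows_cp` is a label chosen for the example. One useful check is that the full model always has Cp equal to P.

```python
def mallows_cp(sse_p, mse_full, n, p):
    """C_p for a subset model, given the MSE of the full model."""
    return sse_p / mse_full - (n - 2 * p)


# Illustrative numbers only (not from the slides).
n = 20
P = 5  # parameters in the full model, intercept included
sse_by_p = {2: 40.0, 3: 25.0, 4: 22.0, 5: 21.5}

# MSE of the model containing all P - 1 candidate X variables.
mse_full = sse_by_p[P] / (n - P)

cp_values = {p: mallows_cp(sse, mse_full, n, p) for p, sse in sse_by_p.items()}
for p, cp in cp_values.items():
    print(p, round(cp, 3))
```

For the full model the formula reduces to (n - P) - (n - 2P) = P, so its Cp carries no information; the criterion is only informative for the smaller subsets.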

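  • The PRESS (prediction sum of squares) selection criterion is based on the deleted residuals $d_i$:

    $$PRESS_p = \sum_{i=1}^{n} (Y_i - \hat{Y}_{i(i)})^2 = \sum_{i=1}^{n} d_i^2$$

    where $\hat{Y}_{i(i)}$ is the predicted value for the ith case when that case is omitted from the fit. Models with small PRESS values are considered good candidate models.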
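
The deleted residuals can be illustrated by literally refitting the model without each case in turn. To keep the sketch short it assumes a single X variable (simple linear regression); the helper names and the five data points are invented for the example.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def sse(x, y):
    """Ordinary error sum of squares from the fit to all cases."""
    b0, b1 = fit_line(x, y)
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


def press(x, y):
    """Sum of squared deleted residuals, refitting without each case in turn."""
    total = 0.0
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        b0, b1 = fit_line(x_rest, y_rest)
        d_i = y[i] - (b0 + b1 * x[i])  # deleted residual for case i
        total += d_i ** 2
    return total


# Illustrative data only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(press(x, y), sse(x, y))
```

PRESS is never smaller than SSE, because each case is predicted by a model that did not see it. In practice the n refits are usually avoided with the identity d_i = e_i / (1 - h_ii), where e_i is the ordinary residual and h_ii the leverage of case i.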

  • Example: predicting survival in patients undergoing a particular type of liver operation:
    - X1: blood clotting score
    - X2: prognostic index, which includes the age of the patient
    - X3: enzyme function test score
    - X4: liver function test score
    - Y: survival time
    - n = 54 patients