Regression Tutorial 201 With NumXL


  • 7/30/2019 Regression Tutorial 201 With NumXL


Regression 201 Tutorial, Spider Financial Corp, 2013

Tutorial: Regression 201

This is the third entry in our regression analysis and modeling series. In this tutorial, we continue the analysis discussion we started earlier by leveraging a more advanced technique, influential data analysis, to help us improve the model and, as a result, the reliability of the forecast.

Again, we will use a sample data set gathered from 20 different salespersons. The regression model attempts to explain and predict the weekly sales for each person (dependent variable) using two explanatory variables: intelligence (IQ) and extroversion.

Data Preparation

Similar to what we did in our earlier tutorial, we organize our sample data by placing the values of each variable in a separate column and each observation in a separate row.

Next, we introduce the mask. The mask is a Boolean array (0, 1) that chooses which variables are included in (or excluded from) the analysis.

Initially, at the top of the table, let's insert the array of mask cells, each with a value of 1 (i.e. included). The array is shown highlighted below:

In this example, we have 20 observations and two independent (explanatory) variables. The response or dependent variable is the weekly sales.
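As a rough illustration of what the mask does, the following Python sketch keeps only the columns whose mask entry is 1. It is not NumXL code: the helper name `apply_mask` and the sample values are hypothetical and only mirror the idea of the Boolean array described above.

```python
def apply_mask(rows, mask):
    """Keep only the columns whose mask entry is 1 (included)."""
    if any(len(row) != len(mask) for row in rows):
        raise ValueError("each row must have one value per mask entry")
    return [[value for value, keep in zip(row, mask) if keep] for row in rows]


# Hypothetical observations: (IQ, extroversion). Values are made up.
rows = [
    [110, 22],
    [120, 15],
    [95, 30],
]

both_variables = apply_mask(rows, [1, 1])     # both variables included
extroversion_only = apply_mask(rows, [0, 1])  # intelligence excluded

print(both_variables)
print(extroversion_only)
```

Setting a mask entry to 0 leaves the data table untouched and only changes which columns reach the regression, which is why a variable can later be dropped without editing the sample.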

Process

Now we are ready to conduct our regression analysis. First, select an empty cell in your worksheet where you wish the output to be generated, then locate and click on the regression icon in the NumXL tab (or toolbar).

Now the Regression Wizard will appear.

Select the cells range for the response/dependent variable values (i.e. weekly sales). Select the cells range for the explanatory (independent) variables' values. For Variables (X) Mask, select the cells at the top of the data table (Boolean array).

Notes:

1. The cells range may optionally include the heading (label) cell, which is then used in the output tables wherever they reference those variables.

2. The explanatory variables (i.e. X) are already grouped by columns (each column represents a variable), so we don't need to change that.

3. By default, the output cells range is set to the currently selected cell in your worksheet.

Please note that, once we select the X and Y cells ranges, the Options, Forecast and Missing Values tabs become available (enabled).

Next, select the Options tab.


Initially, the tab is set to the following values:

• The regression intercept/constant is left blank. This indicates that the regression intercept will be estimated by the regression. To set the regression intercept to a fixed value (e.g. zero (0)), enter it there.

• The significance level (a.k.a. α) is set to 5%.

• In the output section, the most common regression analysis is selected.

• For auto-modeling, check this option.

Now, click the Missing Values tab.


In this tab, you can select an approach to handle missing values in the data set (X and Y). By default, any missing value found in X or in Y in any observation would exclude the observation from the analysis.

This treatment is a good approach for our analysis, so let's leave it unchanged.
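The default treatment described above is often called listwise deletion. The sketch below shows the idea in Python under stated assumptions: `None` and NaN stand in for Excel's #N/A, and the helper names and sample values are hypothetical rather than part of NumXL.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (a stand-in for Excel's #N/A)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_incomplete(x_rows, y_values):
    """Return only the observations with no missing value in X or in Y."""
    kept_x, kept_y = [], []
    for x_row, y in zip(x_rows, y_values):
        if is_missing(y) or any(is_missing(v) for v in x_row):
            continue  # one missing cell excludes the whole observation
        kept_x.append(x_row)
        kept_y.append(y)
    return kept_x, kept_y


# Hypothetical sample: (IQ, extroversion) and weekly sales. Values are made up.
x_rows = [[110, 22], [None, 15], [95, 30], [101, float("nan")]]
y_values = [2500, 3100, None, 2800]

clean_x, clean_y = drop_incomplete(x_rows, y_values)
print(clean_x, clean_y)
```

This is also why inserting #N/A into a single input cell is enough to take an entire observation out of the regression.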

Now, click OK to generate the output tables.

To assess the influence that each observation exerts on our model, we calculate a couple of statistical measures: leverage and Cook's distance.


Select the cell next to the response variable. In the formula bar, type in the MLR_FITTED function, then click the fx button.

The Function Wizard pops up. Select the input cells range, mask, and a Return type of 4 for the leverage statistics. Click OK.

MLR_FITTED returns an array of values, but you will initially only see the first value. To display the full array, select all the cells below (to the end of the sample). Press F2, then press CTRL+SHIFT+ENTER to copy the array formula.


Now, to calculate the Cook's distance, select the cell next to Leverage and repeat the same steps, but with the return type = 5.

Analysis

Now that we have the leverage and Cook's distance statistics, let's interpret their findings.


1. Leverage Statistics (H)

Leverage statistics measure the distance of an observation from the center of the data. In our example, the intelligence and extroversion values for Salesman 11 are furthest from the average. Does this mean Salesman 11 is an outlier? Does this mean he exerts influence on the calculation of the regression coefficients?

To examine this assumption, let's remove Salesman 11 from our input data and examine the resulting regression. To do so, just insert an #N/A value in any input variable of this observation.

[Regression output tables: full data set vs. omitting Salesman #11]

Dropping observation 11 made things at best the same as earlier, so we opted to restore this observation to the sample.

In sum, the leverage statistics do not necessarily imply an outlier, but merely a distant observation with few neighbors.
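For readers who want to see where a leverage value comes from, here is a minimal Python sketch. It assumes the standard textbook definition, the diagonal of the hat matrix X(X'X)^-1 X' with an intercept column added to X; whether MLR_FITTED computes leverage in exactly this way is an assumption. The function names and the sample values are hypothetical.

```python
def solve(a, b):
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def leverages(x_rows):
    """Return h_i = x_i' (X'X)^-1 x_i for each row, with an intercept added."""
    design = [[1.0] + [float(v) for v in row] for row in x_rows]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    return [sum(a * b for a, b in zip(row, solve(xtx, row))) for row in design]


# Hypothetical (IQ, extroversion) values; the fifth row sits far from the rest.
x_rows = [[100, 20], [105, 22], [98, 18], [102, 21], [140, 40], [101, 19]]
h_values = leverages(x_rows)
for number, h in enumerate(h_values, start=1):
    print(f"observation {number}: leverage = {h:.3f}")
```

Note that only the explanatory variables enter the calculation. Leverage ignores the response entirely, which is consistent with the point above: a high value flags a distant observation, not necessarily an outlier.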

2. Cook's Distance (D)

The Cook's distance corrects for a weakness in the leverage statistics, and is thus more indicative of influential data. Furthermore, there are a few heuristics for the threshold value of Cook's distance used to detect an influential datum. For our analysis, we often use 4/N as a threshold (which translates to 20% for the 20 observations in our data set).

[Figure: Cook's Distance (D) by observation, 1 to 20]

[Figure: Leverage (H) by observation, 1 to 20]
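To make the 4/N rule of thumb concrete, the sketch below computes Cook's distance from residuals and leverages and flags values above the threshold. It assumes the common textbook form D_i = e_i^2 / (p s^2) * h_i / (1 - h_i)^2, where p is the number of estimated coefficients (including the intercept) and s^2 is the residual mean square; NumXL's exact computation may differ. The function names and the three-point example are hypothetical.

```python
def cooks_distance(residuals, leverage, n_params):
    """Cook's distance for each observation (textbook form, see lead-in)."""
    n = len(residuals)
    if n <= n_params:
        raise ValueError("need more observations than estimated coefficients")
    mse = sum(e * e for e in residuals) / (n - n_params)  # residual mean square
    return [
        (e * e / (n_params * mse)) * h / (1.0 - h) ** 2
        for e, h in zip(residuals, leverage)
    ]


def influence_threshold(n_obs):
    """Rule-of-thumb cutoff of 4/N used in this tutorial."""
    return 4.0 / n_obs


# Made-up example: x = 1, 2, 3 with an intercept, so p = 2 coefficients.
residuals = [1.0, -2.0, 1.0]
leverage = [5 / 6, 1 / 3, 5 / 6]

distances = cooks_distance(residuals, leverage, n_params=2)
cutoff = influence_threshold(len(residuals))
flagged = [i + 1 for i, d in enumerate(distances) if d > cutoff]

print(distances, cutoff, flagged)
print(influence_threshold(20))  # the 20-observation case from the text
```

Because the formula multiplies the squared residual by a leverage term, an observation needs both a poor fit and a distant position to score highly, which is the sense in which Cook's distance corrects for the weakness of leverage alone.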

Using the threshold or just looking at the earlier plot, we detect that Salesman 16 exerts the highest influence on our regression, so let's void this observation (by setting #N/A in one of the input variables).

Note that the leverage statistics and Cook's distance return #N/A for this missing value.

Let's now examine the regression statistics before and after we dropped the sixteenth observation.

[Regression output tables: full data set vs. without Salesman #16]

As you may already have noticed, the regression improved significantly on every dimension (e.g. R-square, std error, etc.). Salesman #16 seems to be an influential outlier, so we'll drop him.


To help explain what makes an observation influential, let's examine the extroversion vs. weekly sales graph below:

We draw the linear trend as a proxy for our regression model. The black (circle) data point represents Salesman 16. Its location (extroversion and weekly sales value) is pulling the regression (dashed) line toward it, affecting the value of the regression slope and intercept.

Dropping this observation releases the regression line, adjusting it to better fit the remaining points.
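The pulling effect described above can be reproduced with a small single-variable example. The following Python sketch fits an ordinary least-squares line with and without one distant point; the data are made up for illustration and are not the tutorial's sales figures, and the helper names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_without(xs, ys, index):
    """Refit the line after leaving out the observation at `index`."""
    keep = [i for i in range(len(xs)) if i != index]
    return fit_line([xs[i] for i in keep], [ys[i] for i in keep])


# Made-up data: the first four points lie on y = 2x; the last one does not.
xs = [1.0, 2.0, 3.0, 4.0, 10.0]
ys = [2.0, 4.0, 6.0, 8.0, 0.0]

full_slope, full_intercept = fit_line(xs, ys)
reduced_slope, reduced_intercept = fit_without(xs, ys, index=4)

print(f"all points:       slope = {full_slope:.2f}, intercept = {full_intercept:.2f}")
print(f"without point 5:  slope = {reduced_slope:.2f}, intercept = {reduced_intercept:.2f}")
```

In this toy case a single point is enough to flip the sign of the slope, and removing it lets the line return to the trend of the remaining points.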

Let's take another look at the Cook's distance plot (without Salesman 16, and with a threshold of 4/19 ≈ 21%).


The Cook's distance values for the different observations are distributed somewhat uniformly, so we may stop there.

Note: Bear in mind that our threshold rule is merely a heuristic (rule of thumb), and should not be taken rigidly, but rather as a guideline.

Conclusion

In this tutorial, we have shown that excluding observation #16 is beneficial to our modeling efforts, as it exerts significant influence on our coefficient calculation.

Next, using the remaining 19 observations, let's recalculate (SHIFT+F9) the regression statistics, ANOVA, residuals diagnosis, stepwise regression, etc.


The optimal set of input variables is the same as earlier. Let's drop the intelligence variable (by setting its value to 0 in the mask), and recalculate.

The regression error is $307 (vs. $332 before we removed Salesman #16).


The final question we may ask ourselves: is the regression stable over the sample data set? We leave that for the next issue.

[Figure: plot by observation, 1 to 20; vertical axis from $1,500 to $4,500]