It’s the Effect Size, Stupid [1]
What effect size is and why it is important

Paper presented at the British Educational Research Association annual conference, Exeter, 12-14 September, 2002

Robert Coe

School of Education, University of Durham, Leazes Road, Durham DH1 1TA
Tel 0191 334 4184; Fax 0191 334 4180; Email [email protected]

[1] During the 1992 US Presidential election campaign, Bill Clinton’s fortunes were transformed when his advisors helped him to focus on the main issue by writing ‘It’s the economy, stupid’ on a board they put in front of him every time he went out to speak.

Abstract

Effect size is a simple way of quantifying the difference between two groups that has many advantages over the use of tests of statistical significance alone. Effect size emphasises the size of the difference rather than confounding this with sample size. However, primary reports rarely mention effect sizes and few textbooks, research methods courses or computer packages address the concept. This paper provides an explication of what an effect size is, how it is calculated and how it can be interpreted. The relationship between effect size and statistical significance is discussed and the use of confidence intervals for the latter outlined. Some advantages and dangers of using effect sizes in meta-analysis are discussed and other problems with the use of effect sizes are raised. A number of alternative measures of effect size are described. Finally, advice on the use of effect sizes is summarised.

‘Effect size’ is simply a way of quantifying the size of the difference between two groups. It is easy to calculate, readily understood and can be applied to any measured outcome in Education or Social Science. It is particularly valuable for quantifying the effectiveness of a particular intervention, relative to some comparison. It allows us to move beyond the simplistic, ‘Does it work or not?’ to the far more sophisticated, ‘How well does it work in a range of contexts?’ Moreover, by placing the emphasis on the most important aspect of an intervention (the size of the effect) rather than its statistical significance (which conflates effect size and sample size), it promotes a more scientific approach to the accumulation of knowledge. For these reasons, effect size is an important tool in reporting and interpreting effectiveness.

The routine use of effect sizes, however, has generally been limited to meta-analysis (for combining and comparing estimates from different studies) and is all too rare in original reports of educational research (Keselman et al., 1998). This is despite the fact that measures of effect size have been available for at least 60 years (Huberty, 2002), and the American Psychological Association has been officially encouraging authors to report effect sizes since 1994, but with limited success (Wilkinson et al., 1999). Formulae for the calculation of effect sizes do not appear in most statistics textbooks (other than those devoted to meta-analysis), are not featured in many statistics computer packages and are seldom taught in standard research methods courses. For these reasons, even the researcher who is convinced by the wisdom of using measures of effect size, and is not afraid to confront the orthodoxy of conventional practice, may find that it is quite hard to know exactly how to do so.

The following guide is written for non-statisticians, though inevitably some equations and technical language have been used. It describes what effect size is, what it means, how it can be used and some potential problems associated with using it.

1. Why do we need effect size?

Consider an experiment conducted by Dowson (2000) to investigate time-of-day effects on learning: do children learn better in the morning or afternoon? A group of 38 children were included in the experiment. Half were randomly allocated to listen to a story and answer questions about it (on tape) at 9am, the other half to hear exactly the same story and answer the same questions at 3pm. Their comprehension was measured by the number of questions answered correctly out of 20.

The average score was 15.2 for the morning group, 17.9 for the afternoon group: a difference of 2.7. But how big a difference is this? If the outcome were measured on a familiar scale, such as GCSE grades, interpreting the difference would not be a problem. If the average difference were, say, half a grade, most people would have a fair idea of the educational significance of the effect of reading a story at different times of day. However, in many experiments there is no familiar scale available on which to record the outcomes. The experimenter often has to invent a scale or to use (or adapt) an already existing one, but generally not one whose interpretation will be familiar to most people.

Figure 1: panels (a) and (b)

One way to get over this problem is to use the amount of variation in scores to contextualise the difference. If there were no overlap at all and every single person in the afternoon group had done better on the test than everyone in the morning group, then this would seem like a very substantial difference. On the other hand, if the spread of scores were large and the overlap much bigger than the difference between the groups, then the effect might seem less significant. Because we have an idea of the amount of variation found within a group, we can use this as a yardstick against which to compare the difference. This idea is quantified in the calculation of the effect size. The concept is illustrated in Figure 1, which shows two possible ways the difference might vary in relation to the overlap. If the difference were as in graph (a) it would be very significant; in graph (b), on the other hand, the difference might hardly be noticeable.

2. How is it calculated?

The effect size is just the standardised mean difference between the two groups. In other words:

$$\text{Effect Size} = \frac{[\text{Mean of experimental group}] - [\text{Mean of control group}]}{\text{Standard Deviation}}$$

Equation 1

If it is not obvious which of two groups is the ‘experimental’ (i.e. the one which was given the ‘new’ treatment being tested) and which the ‘control’ (the one given the ‘standard’ treatment, or no treatment, for comparison), the difference can still be calculated. In this case, the ‘effect size’ simply measures the difference between them, so it is important in quoting the effect size to say which way round the calculation was done.

The ‘standard deviation’ is a measure of the spread of a set of values. Here it refers to the standard deviation of the population from which the different treatment groups were taken. In practice, however, this is almost never known, so it must be estimated either from the standard deviation of the control group, or from a ‘pooled’ value from both groups (see question 7, below, for more discussion of this).

In Dowson’s time-of-day effects experiment, the standard deviation (SD) = 3.3, so the effect size was (17.9 - 15.2)/3.3 = 0.8.

3. How can effect sizes be interpreted?

One feature of an effect size is that it can be directly converted into statements about the overlap between the two samples in terms of a comparison of percentiles.

An effect size is exactly equivalent to a ‘Z-score’ of a standard Normal distribution. For example, an effect size of 0.8 means that the score of the average person in the experimental group is 0.8 standard deviations above the average person in the control group, and hence exceeds the scores of 79% of the control group. With the two groups of 19 in the time-of-day effects experiment, the average person in the ‘afternoon’ group (i.e. the one who would have been ranked 10th in the group) would have scored about the same as the 4th highest person in the ‘morning’ group. Visualising these two individuals can give quite a graphic interpretation of the difference between the two effects.

Table I shows conversions of effect sizes (column 1) to percentiles (column 2) and the equivalent change in rank order for a group of 25 (column 3). For example, for an effect size of 0.6, the value of 73% indicates that the average person in the experimental group would score higher than 73% of a control group that was initially equivalent. If the group consisted of 25 people, this is the same as saying that the average person (i.e. ranked 13th in the group) would now be on a par with the person ranked 7th in the control group. Notice that an effect size of 1.6 would raise the average person to be level with the top ranked individual in the control group, so effect sizes larger than this are illustrated in terms of the top person in a larger group. For example, an effect size of 3.0 would bring the average person in a group of 740 level with the previously top person in the group.
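The arithmetic behind Equation 1 and the percentile reading of an effect size can be sketched in a few lines of Python. This is an illustrative sketch rather than part of the original paper: the function names `effect_size` and `percentile_of_control` are hypothetical, and the percentile conversion assumes Normally distributed scores, as Table I does.

```python
from statistics import NormalDist


def effect_size(mean_experimental, mean_control, sd):
    """Standardised mean difference (Equation 1)."""
    return (mean_experimental - mean_control) / sd


def percentile_of_control(d):
    """Proportion of the control group scoring below the average
    experimental-group member, assuming Normal distributions."""
    return NormalDist().cdf(d)


# Figures quoted in the text for the time-of-day experiment.
d = effect_size(17.9, 15.2, 3.3)
print(round(d, 1))                               # 0.8
print(round(percentile_of_control(0.8) * 100))   # 79
print(round(percentile_of_control(0.6) * 100))   # 73
```

The last two lines reproduce the 79% and 73% figures quoted in the text for effect sizes of 0.8 and 0.6.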

Table I: Interpretations of effect sizes

Columns:
(1) Effect size
(2) Percentage of control group who would be below average person in experimental group
(3) Rank of person in a control group of 25 who would be equivalent to the average person in experimental group
(4) Probability that you could guess which group a person was in from knowledge of their ‘score’
(5) Equivalent correlation, r (= difference in percentage ‘successful’ in each of the two groups, BESD)
(6) Probability that person from experimental group will be higher than person from control, if both chosen at random (= CLES)

(1)    (2)      (3)                        (4)     (5)     (6)
0.0    50%      13th                       0.50    0.00    0.50
0.1    54%      12th                       0.52    0.05    0.53
0.2    58%      11th                       0.54    0.10    0.56
0.3    62%      10th                       0.56    0.15    0.58
0.4    66%      9th                        0.58    0.20    0.61
0.5    69%      8th                        0.60    0.24    0.64
0.6    73%      7th                        0.62    0.29    0.66
0.7    76%      6th                        0.64    0.33    0.69
0.8    79%      6th                        0.66    0.37    0.71
0.9    82%      5th                        0.67    0.41    0.74
1.0    84%      4th                        0.69    0.45    0.76
1.2    88%      3rd                        0.73    0.51    0.80
1.4    92%      2nd                        0.76    0.57    0.84
1.6    95%      1st                        0.79    0.62    0.87
1.8    96%      1st                        0.82    0.67    0.90
2.0    98%      1st (or 1st out of 44)     0.84    0.71    0.92
2.5    99%      1st (or 1st out of 160)    0.89    0.78    0.96
3.0    99.9%    1st (or 1st out of 740)    0.93    0.83    0.98

Another way to conceptualise the overlap is in terms of the probability that one could guess which group a person came from, based only on their test score, or whatever value was being compared. If the effect size were 0 (i.e. the two groups were the same) then the probability of a correct guess would be exactly a half, or 0.50. With a difference between the two groups equivalent to an effect size of 0.3, there is still plenty of overlap, and the probability of correctly identifying the groups rises only slightly to 0.56. With an effect size of 1, the probability is now 0.69, just over a two-thirds chance. These probabilities are shown in the fourth column of Table I. It is clear that the overlap between experimental and control groups is substantial (and therefore the probability is still close to 0.5), even when the effect size is quite large.
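The guessing probabilities quoted above can be reproduced under a simple assumption that the text does not spell out: both groups are Normal with equal spread and equal size, and one guesses ‘experimental’ whenever a score lies above the midpoint between the two means. Under that assumption the chance of a correct guess is the standard Normal cumulative probability at d/2. The function name below is hypothetical.

```python
from statistics import NormalDist


def prob_correct_guess(d):
    """Chance of guessing group membership from a single score,
    using the midpoint between the two means as the cut-off.
    Assumes equal-sized Normal groups with equal standard deviations."""
    return NormalDist().cdf(d / 2)


for d in (0.0, 0.3, 1.0):
    print(d, round(prob_correct_guess(d), 2))
# prints 0.0 0.5, then 0.3 0.56, then 1.0 0.69
```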

A slightly different way to interpret effect sizes makes use of an equivalence between the standardised mean difference (d) and the correlation coefficient, r. If group membership is coded with a dummy variable (e.g. denoting the control group by 0 and the experimental group by 1) and the correlation between this variable and the outcome measure calculated, a value of r can be derived. By making some additional assumptions, one can readily convert d into r in general, using the equation $r^2 = d^2 / (4 + d^2)$ (see Cohen, 1969, pp 20-22 for other formulae and conversion table). Rosenthal and Rubin (1982) take advantage of an interesting property of r to suggest a further interpretation, which they call the binomial effect size display (BESD). If the outcome measure is reduced to a simple dichotomy (for example, whether a score is above or below a particular value such as the median, which could be thought of as ‘success’ or ‘failure’), r can be interpreted as the difference in the proportions in each category. For example, an effect size of 0.2 indicates a difference of 0.10 in these proportions, as would be the case if 45% of the control group and 55% of the treatment group had reached some threshold of ‘success’. Note, however, that if the overall proportion ‘successful’ is not close to 50%, this interpretation can be somewhat misleading (Strahan, 1991; McGraw, 1991). The values for the BESD are shown in column 5.
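As a rough illustration of the conversion from d to r and of the BESD reading, the following sketch applies the equation quoted above. The function names are hypothetical, and the BESD split assumes success rates centred on 50%, which is the case the text describes.

```python
import math


def d_to_r(d):
    """Convert a standardised mean difference to a correlation,
    using r^2 = d^2 / (4 + d^2) as quoted in the text."""
    return d / math.sqrt(4 + d * d)


def besd(r):
    """Binomial effect size display: 'success' rates in the control and
    experimental groups, centred on 50% and differing by r."""
    return 0.5 - r / 2, 0.5 + r / 2


r = d_to_r(0.2)
control, treated = besd(r)
print(round(r, 2))                            # 0.1
print(round(control, 2), round(treated, 2))   # 0.45 0.55
```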

Finally, McGraw and Wong (1992) have suggested a ‘Common Language Effect Size’ (CLES) statistic, which they argue is readily understood by non-statisticians (shown in column 6 of Table I). This is the probability that a score sampled at random from one distribution will be greater than a score sampled from another. They give the example of the heights of young adult males and females, which differ by an effect size of about 2, and translate this difference to a CLES of 0.92. In other words ‘in 92 out of 100 blind dates among young adults, the male will be taller than the female’ (p361).
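The CLES values can be approximated with a short sketch. It assumes that both groups are Normal with the same standard deviation, so that the difference between two independent draws has a standard deviation of √2 times the common value; the function name `cles` is hypothetical.

```python
import math
from statistics import NormalDist


def cles(d):
    """Probability that a random score from the higher group exceeds a
    random score from the lower group, for equal-variance Normal groups."""
    return NormalDist().cdf(d / math.sqrt(2))


print(round(cles(2.0), 2))  # 0.92
print(round(cles(0.5), 2))  # 0.64
```

With d = 2 this gives the 0.92 quoted for the heights example.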

It should be noted that the values in Table I depend on the assumption of a Normal distribution. The interpretation of effect sizes in terms of percentiles is very sensitive to violations of this assumption (see question 7, below).

Another way to interpret effect sizes is to compare them to the effect sizes of differences that are familiar. For example, Cohen (1969, p23) describes an effect size of 0.2 as ‘small’ and gives to illustrate it the example that the difference between the heights of 15 year old and 16 year old girls in the US corresponds to an effect of this size. An effect size of 0.5 is described as ‘medium’ and is ‘large enough to be visible to the naked eye’. A 0.5 effect size corresponds to the difference between the heights of 14 year old and 18 year old girls. Cohen describes an effect size of 0.8 as ‘grossly perceptible and therefore large’ and equates it to the difference between the heights of 13 year old and 18 year old girls. As a further example he states that the difference in IQ between holders of the Ph.D. degree and ‘typical college freshmen’ is comparable to an effect size of 0.8.

Cohen does acknowledge the danger of using terms like ‘small’, ‘medium’ and ‘large’ out of context. Glass et al. (1981, p104) are particularly critical of this approach, arguing that the effectiveness of a particular intervention can only be interpreted in relation to other interventions that seek to produce the same effect. They also point out that the practical importance of an effect depends entirely on its relative costs and benefits. In education, if it could be shown that making a small and inexpensive change would raise academic achievement by an effect size of even as little as 0.1, then this could be a very significant improvement, particularly if the improvement applied uniformly to all students, and even more so if the effect were cumulative over time.

Table II: Examples of average effect sizes from research

Intervention                         Outcome                                  Effect Size   Source
Reducing class size from 23 to 15    Students’ test performance in reading    0.30          Finn and Achilles (1990)
                                     Students’ test performance in maths      0.32
Small ( […]                          Attitudes of students                    0.47

[…] in a spelling age from 11 to 12 corresponds to an effect size of about 0.3, but seems to vary according to the particular test used.

In England, the distributions of GCSE grades in compulsory subjects (i.e. Maths and English) have standard deviations of between 1.5 and 1.8 grades, so an improvement of one GCSE grade represents an effect size of 0.5-0.7. In the context of secondary schools therefore, introducing a change in practice whose effect size was known to be 0.6 would result in an improvement of about a GCSE grade for each pupil in each subject. For a school in which 50% of pupils were previously gaining five or more A*-C grades, this percentage (other things being equal, and assuming that the effect applied equally across the whole curriculum) would rise to 73%. Even Cohen’s ‘small’ effect of 0.2 would produce an increase from 50% to 58%, a difference that most schools would probably categorise as quite substantial. Olejnik and Algina (2000) give a similar example based on the Iowa Test of Basic Skills.
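The rise from 50% to 73% can be checked with a short sketch. It assumes that the underlying attainment measure is Normally distributed and that the whole distribution shifts upward by d standard deviations while the pass threshold stays where it was; both are assumptions made here for illustration, and the function name is hypothetical.

```python
from statistics import NormalDist


def shifted_pass_rate(baseline_rate, d):
    """Proportion above a fixed threshold after the whole (Normal)
    distribution is shifted upward by d standard deviations."""
    z = NormalDist().inv_cdf(baseline_rate)  # z-score equivalent of the baseline pass rate
    return NormalDist().cdf(z + d)


print(round(shifted_pass_rate(0.50, 0.6) * 100))  # 73
print(round(shifted_pass_rate(0.50, 0.2) * 100))  # 58
```

The same function can be given a baseline other than 50%, in which case the gain in percentage points for a given effect size will differ.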

Finally, the interpretation of effect sizes can be greatly helped by a few examples from existing research. Table II lists a selection of these, many of which are taken from Lipsey and Wilson (1993). The examples cited are given for illustration of the use of effect size measures; they are not intended to be the definitive judgement on the relative efficacy of different interventions. In interpreting them, therefore, one should bear in mind that most of the meta-analyses from which they are derived can be (and often have been) criticised for a variety of weaknesses, that the range of circumstances in which the effects have been found may be limited, and that the effect size quoted is an average which is often based on quite widely differing values.

It seems to be a feature of educational interventions that very few of them have effects that would be described in Cohen’s classification as anything other than ‘small’. This appears particularly so for effects on student achievement. No doubt this is partly a result of the wide variation found in the population as a whole, against which the measure of effect size is calculated. One might also speculate that achievement is harder to influence than other outcomes, perhaps because most schools are already using optimal strategies, or because different strategies are likely to be effective in different situations, a complexity that is not well captured by a single average effect size.

4. What is the relationship between ‘effect size’ and ‘significance’?

Effect size quantifies the size of the difference between two groups, and may therefore be said to be a true measure of the significance of the difference. If, for example, the results of Dowson’s ‘time of day effects’ experiment were found to apply generally, we might ask the question: ‘How much difference would it make to children’s learning if they were taught a particular topic in the afternoon instead of the morning?’ The best answer we could give to this would be in terms of the effect size.

However, in statistics the word ‘significance’ is often used to mean ‘statistical significance’, which is the likelihood that the difference between the two groups could just be an accident of sampling. If you take two samples from the same population there will always be a difference between them. The statistical significance is usually calculated as a ‘p-value’, the probability that a difference of at least the same size would have arisen by chance, even if there really were no difference between the two populations. For differences between the means of two groups, this p-value would normally be calculated from a ‘t-test’. By convention, if p […]

[…] of the sample. One would get a ‘significant’ result either if the effect were very big (despite having only a small sample) or if the sample were very big (even if the actual effect size were tiny). It is important to know the statistical significance of a result, since without it there is a danger of drawing firm conclusions from studies where the sample is too small to justify such confidence. However, statistical significance does not tell you the most important thing: the size of the effect. One way to overcome this confusion is to report the effect size, together with an estimate of its likely ‘margin for error’ or ‘confidence interval’.
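The way statistical significance mixes effect size with sample size can be illustrated with a small sketch. It assumes two groups of equal size and uses a Normal approximation to the t-test (chosen here so that only the Python standard library is needed); the function name and the sample sizes other than 19 are hypothetical.

```python
import math
from statistics import NormalDist


def approx_p_value(d, n_per_group):
    """Two-sided p-value for a standardised difference d between two
    groups of equal size, using a Normal approximation to the t-test."""
    z = d * math.sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Large effect, small sample versus tiny effect, huge sample.
print(approx_p_value(0.8, 19) < 0.05)      # True
print(approx_p_value(0.05, 10000) < 0.05)  # True
# The same tiny effect in a small sample is nowhere near 'significant'.
print(approx_p_value(0.05, 19) < 0.05)     # False
```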

5. What is the margin for error in estimating effect sizes?

Clearly, if an effect size is calculated from a very large sample it is likely to be more accurate than one calculated from a small sample. This ‘margin for error’ can be quantified using the idea of a ‘confidence interval’, which provides the same information as is usually contained in a significance test: using a ‘95% confidence interval’ is equivalent to taking a ‘5% significance level’. To calculate a 95% confidence interval, you assume that the value you got (e.g. the effect size estimate of 0.8) is the ‘true’ value, but calculate the amount of variation in this estimate you would get if you repeatedly took new samples of the same size (i.e. different samples of 38 children). For every 100 of these hypothetical new samples, by definition, 95 would give estimates of the effect size within the ‘95% confidence interval’. If this confidence interval includes zero, then that is the same as saying that the result is not statistically significant. If, on the other hand, zero is outside the range, then it is ‘statistically significant at the 5% level’. Using a confidence interval is a better way of conveying this information since it keeps the emphasis on the effect size, which is the important information, rather than the p-value.

A formula for calculating the confidence interval for an effect size is given by Hedges and Olkin (1985, p86). If the effect size estimate from the sample is d, then it is Normally distributed, with standard deviation:

$$\sigma[d] = \sqrt{\frac{N_E + N_C}{N_E N_C} + \frac{d^2}{2(N_E + N_C)}}$$

Equation 2

(Where $N_E$ and $N_C$ are the numbers in the experimental and control groups, respectively.)

Hence a 95% confidence interval for d would be from

$$d - 1.96 \times \sigma[d] \quad \text{to} \quad d + 1.96 \times \sigma[d]$$

Equation 3

To use the figures from the time-of-day experiment again, $N_E = N_C = 19$ and $d = 0.8$, so $\sigma[d] = \sqrt{0.105 + 0.008} = 0.34$. Hence the 95% confidence interval is [0.14, 1.46]. This would normally be interpreted (despite the fact that such an interpretation is not strictly justified; see Oakes, 1986 for an enlightening discussion of this) as meaning that the ‘true’ effect of time-of-day is very likely to be between 0.14 and 1.46. In other words, it is almost certainly positive (i.e. afternoon is better than morning) and the difference may well be quite large.
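The confidence interval calculation can be sketched as follows. The sketch assumes that the standard deviation of d takes the form sqrt((N_E + N_C)/(N_E N_C) + d²/(2(N_E + N_C))), which reproduces the 0.105 and 0.008 terms in the worked example; the function names are hypothetical.

```python
import math


def effect_size_se(d, n_e, n_c):
    """Standard deviation of the effect size estimate (Equation 2)."""
    return math.sqrt((n_e + n_c) / (n_e * n_c) + d ** 2 / (2 * (n_e + n_c)))


def confidence_interval_95(d, n_e, n_c):
    """95% confidence interval for d (Equation 3)."""
    half_width = 1.96 * effect_size_se(d, n_e, n_c)
    return d - half_width, d + half_width


se = effect_size_se(0.8, 19, 19)
low, high = confidence_interval_95(0.8, 19, 19)
print(round(se, 2))                    # 0.34
print(round(low, 2), round(high, 2))   # 0.14 1.46
```

Because the lower limit is above zero, the result counts as statistically significant at the 5% level in the sense described in question 5.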

6. How can knowledge about effect sizes be combined?

One of the main advantages of using effect size is that when a particular experiment has been replicated, the different effect size estimates from each study can easily be combined to give an overall best estimate of the size of the effect. This process of synthesising experimental results into a single effect size estimate is known as ‘meta-analysis’. It was developed in its current form by an educational statistician, Gene Glass (see Glass et al., 1981), though the roots of meta-analysis can be traced a good deal further back (see Lepper et al., 1999), and is now widely used, not only in education, but in medicine and throughout the social sciences. A brief and accessible introduction to the idea of meta-analysis can be found in Fitz-Gibbon (1984).

Meta-analysis, however, can do much more than simply produce an overall ‘average’ effect size, important though this often is. If, for a particular intervention, some studies produced large effects, and some small effects, it would be of limited value simply to combine them together and say that the average effect was ‘medium’. Much more useful would be to examine the original studies for any differences between those with large and small effects and to try to understand what factors might account for the difference. The best meta-analysis, therefore, involves seeking relationships between effect sizes and characteristics of the intervention, the context and study design in which they were found (Rubin, 1992; see also Lepper et al. (1999) for a discussion of the problems that can be created by failing to do this, and some other limitations of the applicability of meta-analysis).

The importance of replication in gaining evidence about what works cannot be overstressed. In Dowson’s time-of-day experiment the effect was found to be large enough to be statistically and educationally significant. Because we know that the pupils were allocated randomly to each group, we can be confident that chance initial differences between the two groups are very unlikely to account for the difference in the outcomes. Furthermore, the use of a pre-test of both groups before the intervention makes this even less likely. However, we cannot rule out the possibility that the difference arose from some characteristic peculiar to the children in this particular experiment. For example, if none of them had had any breakfast that day, this might account for the poor performance of the morning group. However, the result would then presumably not generalise to the wider population of school students, most of whom would have had some breakfast. Alternatively, the effect might depend on the age of the students. Dowson’s students were aged 7 or 8; it is quite possible that the effect could be diminished or reversed with older (or younger) students. This illustrates the danger of implementing policy on the basis of a single experiment. Confidence in the generality of a result can only follow widespread replication.

An important consequence of the capacity of meta-analysis to combine results is that even small studies can make a significant contribution to knowledge. The kind of experiment that can be done by a single teacher in a school might involve a total of fewer than 30 students. Unless the effect is huge, a study of this size is most unlikely to get a statistically significant result. According to conventional statistical wisdom, therefore, the experiment is not worth doing. However, if the results of several such experiments are combined using meta-analysis, the overall result is likely to be highly statistically significant. Moreover, it will have the important strengths of being derived from a range of contexts (thus increasing confidence in its generality) and from real-life working practice (thereby making it more likely that the policy is feasible and can be implemented authentically).
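A minimal sketch of how several small studies might be combined is given below. It assumes a fixed-effect, inverse-variance weighted average, which is one common approach and not necessarily the one the author has in mind, and it reuses the standard deviation formula assumed for Equation 2. The five studies are invented purely for illustration.

```python
import math


def effect_size_se(d, n_e, n_c):
    """Standard deviation of an effect size estimate (Equation 2)."""
    return math.sqrt((n_e + n_c) / (n_e * n_c) + d ** 2 / (2 * (n_e + n_c)))


def pool_fixed_effect(studies):
    """Inverse-variance weighted mean effect size and its standard error.
    `studies` is a list of (d, n_experimental, n_control) tuples."""
    weights = [1 / effect_size_se(d, n_e, n_c) ** 2 for d, n_e, n_c in studies]
    pooled = sum(w * d for w, (d, _, _) in zip(weights, studies)) / sum(weights)
    return pooled, math.sqrt(1 / sum(weights))


# Hypothetical data: five classroom-sized studies, 15 pupils per group.
studies = [(0.3, 15, 15), (0.6, 15, 15), (0.5, 15, 15), (0.4, 15, 15), (0.7, 15, 15)]

for d, n_e, n_c in studies:
    # Each study alone has a 95% interval that includes zero.
    print(d - 1.96 * effect_size_se(d, n_e, n_c) < 0)  # True

pooled, se = pool_fixed_effect(studies)
print(pooled - 1.96 * se > 0)  # True: the combined interval excludes zero
```

No single invented study is statistically significant, yet the pooled estimate is, which is the point made in the paragraph above.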

One final caveat should be made here about the danger of combining incommensurable results. Given two (or more) numbers, one can always calculate an average. However, if they are effect sizes from experiments that differ significantly in terms of the outcome measures used, then the result may be totally meaningless. It can be very tempting, once effect sizes have been calculated, to treat them as all the same and lose sight of their origins. Certainly, there are plenty of examples of meta-analyses in which the juxtaposition of effect sizes is somewhat questionable.

In comparing (or combining) effect sizes, one should therefore consider carefully whether they relate to the same outcomes. This advice applies not only to meta-analysis, but to any other comparison of effect sizes. Moreover, because of the sensitivity of effect size estimates to reliability and range restriction (see below), one should also consider whether those outcome measures are derived from the same (or sufficiently similar) instruments and the same (or sufficiently similar) populations.

It is also important to compare only like with like in terms of the treatments used to create the differences being measured. In the education literature, the same name is often given to interventions that are actually very different, for example, if they are operationalised differently, or if they are simply not well enough defined for it to be clear whether they are the same or not. It could also be that different studies have used the same well-defined and operationalised treatments, but the actual implementation differed, or that the same treatment may have had different levels of intensity in different studies. In any of these cases, it makes no sense to average out their effects.

7. What other factors can influence effect size?

Although effect size is a simple and readily interpreted measure of effectiveness, it can also be sensitive to a number of spurious influences, so some care needs to be taken in its use. Some of these issues are outlined here.

Which ‘standard deviation’?

The first problem is the issue of which ‘standard deviation’ to use. Ideally, the control group will provide the best estimate of standard deviation, since it consists of a representative group of the population who have not been affected by the experimental intervention. However, unless the control group is very large, the estimate of the ‘true’ population standard deviation derived from only the control group is likely to be appreciably less accurate than an estimate derived from both the control and experimental groups. Moreover, in studies where there is not a true ‘control’ group (for example the time-of-day effects experiment) then it may be an arbitrary decision which group’s standard deviation to use, and it will often make an appreciable difference to the estimate of effect size.

For these reasons, it is often better to use a ‘pooled’ estimate of standard deviation. The pooled estimate is essentially an average of the standard deviations of the experimental and control groups (Equation 4). Note that this is not the same as the standard deviation of all the values in both groups ‘pooled’ together. If, for example, each group had a low standard deviation but the two means were substantially different, the true pooled estimate (as calculated by Equation 4) would be much lower than the value obtained by pooling all the values together and calculating the standard deviation. The implications of choices about which standard deviation to use are discussed by Olejnik and Algina (2000).

$$SD_{pooled} = \sqrt{\frac{(N_E - 1)\,SD_E^2 + (N_C - 1)\,SD_C^2}{N_E + N_C - 2}}$$

Equation 4

(Where $N_E$ and $N_C$ are the numbers in the experimental and control groups, respectively, and $SD_E$ and $SD_C$ are their standard deviations.)

The use of a pooled estimate of standard deviation depends on the assumption that the two calculated standard deviations are estimates of the same population value. In other words, that the experimental and control group standard deviations differ only as a result of sampling variation. Where this assumption cannot be made (either because there is some reason to believe that the two standard deviations are likely to be systematically different, or if the actual measured values are very different), then a pooled estimate should not be used.

In the example of Dowson’s time-of-day experiment, the standard deviations for the morning and afternoon groups were 4.12 and 2.10 respectively. With $N_E = N_C = 19$, Equation 4 therefore gives $SD_{pooled}$ as 3.3, which was the value used in Equation 1 to give an effect size of 0.8. However, the difference between the two standard deviations seems quite large in this case. Given that the afternoon group mean was 17.9 out of 20, it seems likely that its standard deviation may have been reduced by a ‘ceiling effect’, i.e. the spread of scores was limited by the maximum available mark of 20. In this case therefore, it might be more appropriate to use the morning group’s standard deviation as the best estimate. Doing this will reduce the effect size to 0.7, and it then becomes a somewhat arbitrary decision which value of the effect size to use. A general rule of thumb in statistics when two valid methods give different answers is: ‘If in doubt, cite both.’
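The pooled standard deviation and the two competing effect sizes can be reproduced with the sketch below. It assumes the usual pooled formula, the square root of ((N_E - 1)SD_E² + (N_C - 1)SD_C²)/(N_E + N_C - 2), which matches the 3.3 quoted in the text; the function name is hypothetical.

```python
import math


def pooled_sd(sd_e, n_e, sd_c, n_c):
    """Pooled standard deviation of two groups (Equation 4)."""
    return math.sqrt(((n_e - 1) * sd_e ** 2 + (n_c - 1) * sd_c ** 2) / (n_e + n_c - 2))


difference = 17.9 - 15.2                  # afternoon mean minus morning mean
sd = pooled_sd(2.10, 19, 4.12, 19)        # afternoon and morning SDs from the text

print(round(sd, 1))                       # 3.3
print(round(difference / sd, 1))          # 0.8, using the pooled SD
print(round(difference / 4.12, 1))        # 0.7, using the morning group's SD
```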

Corrections for bias

Although using the pooled standard deviation to calculate the effect size generally gives a better estimate than the control group SD, it is still unfortunately slightly biased and in general gives a value slightly larger than the true population value (Hedges and Olkin, 1985). Hedges and Olkin (1985, p80) give a formula which provides an approximate correction to this bias.

In Dowson’s experiment with 38 values, the correction factor will be 0.98, so it makes very little difference, reducing the effect size estimate from 0.82 to 0.80. Given the likely accuracy of the figures on which this is based, it is probably only worth quoting one decimal place, so the figure of 0.8 stands. In fact, the correction only becomes significant for small samples, in which the accuracy is anyway much less. It is therefore hardly worth worrying about it in primary reports of empirical results. However, in meta-analysis, where results from primary studies are combined, the correction is important, since without it this bias would be accumulated.
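The correction formula itself is not reproduced in this text. A widely quoted approximation attributed to Hedges and Olkin multiplies d by 1 - 3/(4(N_E + N_C) - 9); the sketch below assumes that form, which does reproduce the factor of 0.98 quoted for Dowson’s 38 pupils. The function name is hypothetical.

```python
def bias_correction_factor(n_e, n_c):
    """Approximate small-sample correction factor for d,
    assumed here to be 1 - 3 / (4 * (n_e + n_c) - 9)."""
    return 1 - 3 / (4 * (n_e + n_c) - 9)


factor = bias_correction_factor(19, 19)
print(round(factor, 2))                        # 0.98
print(round(0.82 * factor, 2))                 # 0.8
print(round(bias_correction_factor(5, 5), 2))  # 0.9
```

The last line shows how the correction grows as samples shrink: with only five people per group the estimate would be reduced by about a tenth.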

Restricted range

Suppose the time-of-day effects experiment were to be repeated, once with the top set in a highly selective school and again with a mixed-ability group in a comprehensive. If students were allocated to morning and afternoon groups at random, the respective differences between them might be the same in each case; both means in the selective school might be higher, but the difference between the two groups could be the same as the difference in the comprehensive. However, it is unlikely that the standard deviations would be the same. The spread of scores found within the highly selected group would be much less than that in a true cross-section of the population, as for example in the mixed-ability comprehensive class. This, of course, would have a substantial impact on the calculation of the effect size. With the highly restricted range found in the selective school, the effect size would be much larger than that found in the comprehensive.

Ideally, in calculating effect size one should use the standard deviation of the full population, in order to make comparisons fair. However, there will be many cases in which unrestricted values are not available, either in practice or in principle. For example, in considering the effect of an intervention with university students, or with pupils with reading difficulties, one must remember that these are restricted populations. In reporting the effect size, one should draw attention to this fact; if the amount of restriction can be quantified it may be possible to make allowance for it. Any comparison with effect sizes calculated from a full-range population must be made with great caution, if at all.
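The effect of range restriction is simple arithmetic, as the sketch below suggests. The raw difference and the mixed-ability standard deviation are the figures used earlier for the time-of-day experiment; the narrower standard deviation for the selective group is a hypothetical value chosen only to illustrate the point.

```python
def effect_size(mean_difference, sd):
    """Standardised mean difference for a given raw difference and SD."""
    return mean_difference / sd


raw_difference = 2.7      # same raw gain in both schools (figure from the text)
sd_mixed = 3.3            # spread in a mixed-ability class (pooled SD from the text)
sd_selective = 1.5        # hypothetical narrower spread in a selective top set

print(round(effect_size(raw_difference, sd_mixed), 2))      # 0.82
print(round(effect_size(raw_difference, sd_selective), 2))  # 1.8
```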

    NonNormaldistributionsTheinterpretationsofeffectsizesgiveninTableIdependontheassumption

    thatbothcontrolandexperimentalgroupshaveaNormaldistribution,i.e.thefamiliarbellshapedcurve,shown,forexample,inFigure1.Needlesstosay,ifthisassumptionisnottruethentheinterpretationmaybealtered,andinparticular,itmaybedifficulttomakeafaircomparisonbetweenaneffectsizebasedonNormaldistributionsandonebasedonnonNormaldistributions.

    [Figure 2: Comparison of Normal and non-Normal distributions. Curves shown: a standard Normal distribution (S.D. = 1) and a similar-looking distribution with fatter extremes (S.D. = 3.3); horizontal axis from -4 to 4.]

    An illustration of this is given in Figure 2, which shows the frequency curves for two distributions, one of them Normal, the other a 'contaminated normal' distribution (Wilcox, 1998), which is similar in shape, but with somewhat fatter extremes. In fact, the latter does look just a little more spread out than the Normal distribution, but its standard deviation is actually over three times as big. The consequence of this in terms of effect size differences is shown in Figure 3. Both graphs show distributions that differ by an effect size equal to 1, but the appearance of the effect size difference from the graphs is rather dissimilar. In graph (b), the separation between experimental and control groups seems much larger, yet the effect size is actually the same as for the Normal distributions plotted in graph (a). In terms of the amount of overlap, in graph (b) 97% of the 'experimental' group are above the control group mean, compared with the value of 84% for the Normal distribution of graph (a) (as given in Table I). This is quite a substantial difference and illustrates the danger of using the values in Table I when the distribution is not known to be Normal.

    [Figure 3: Normal and non-Normal distributions with effect size = 1. Panel (a): Normal distributions; panel (b): non-Normal distributions.]
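The comparison can be approximated numerically. The sketch below assumes a particular contaminated normal: a mixture in which 90% of values come from a standard Normal and 10% from a Normal with standard deviation 10. These mixture parameters are an assumption, not something stated in the text; they are chosen because they reproduce the quoted standard deviation of about 3.3.

```python
from math import sqrt
from statistics import NormalDist

# Assumed mixture: 90% N(0, 1) plus 10% N(0, 10). Hypothetical parameters.
W_MAIN, W_WIDE, SD_WIDE = 0.9, 0.1, 10.0

phi = NormalDist().cdf  # standard Normal cumulative distribution function

# Both components have mean 0, so the mixture variance is the weighted sum of variances.
mixture_sd = sqrt(W_MAIN * 1.0 ** 2 + W_WIDE * SD_WIDE ** 2)


def mixture_cdf(x: float) -> float:
    """Cumulative distribution function of the assumed contaminated normal."""
    return W_MAIN * phi(x) + W_WIDE * phi(x / SD_WIDE)


# An effect size of 1 shifts the experimental group up by one standard deviation.
shift = 1.0 * mixture_sd
above_mixture = 1.0 - mixture_cdf(-shift)  # share of experimental group above control mean
above_normal = 1.0 - phi(-1.0)             # the same share for two Normal distributions

print(f"mixture S.D.: {mixture_sd:.2f}")
print(f"above control mean (contaminated normal): {above_mixture:.1%}")
print(f"above control mean (Normal): {above_normal:.1%}")
```

Under these assumed parameters the Normal case gives the 84% of Table I, while the contaminated case comes out in the mid-90s per cent, in the region of the figure quoted above. The exact value depends on the mixture chosen.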

    Measurement reliability

    A third factor that can spuriously affect an effect size is the reliability of the measurement on which it is based. According to classical measurement theory, any measure of a particular outcome may be considered to consist of the 'true' underlying value, together with a component of 'error'. The problem is that the amount of variation in measured scores for a particular sample (i.e. its standard deviation) will depend on both the variation in underlying scores and the amount of error in their measurement.

    To give an example, imagine the time-of-day experiment were conducted twice with two (hypothetically) identical samples of students. In the first version the test used to assess their comprehension consisted of just 10 items and their scores were converted into a percentage. In the second version a test with 50 items was used, and again converted to a percentage. The two tests were of equal difficulty and the actual effect of the difference in time-of-day was the same in each case, so the respective mean percentages of the morning and afternoon groups were the same for both versions. However, it is almost always the case that a longer test will be more reliable, and hence the standard deviation of the percentages on the 50-item test will be lower than the standard deviation for the 10-item test. Thus, although the true effect was the same, the calculated effect sizes will be different.
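The size of this distortion can be sketched under classical measurement theory. Two standard results are assumed here, neither of which is spelled out in the text: the Spearman-Brown formula for the reliability of a lengthened test, and the relation that observed variance equals true-score variance divided by reliability. The reliability of 0.60 for the 10-item test is a hypothetical value.

```python
from math import sqrt


def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when a test is lengthened by length_factor."""
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


def observed_effect_size(true_d: float, reliability: float) -> float:
    """Observed S.D. is the true-score S.D. divided by sqrt(reliability),
    so the observed effect size is the true one multiplied by sqrt(reliability)."""
    return true_d * sqrt(reliability)


rel_10_items = 0.60                               # hypothetical reliability, 10-item test
rel_50_items = spearman_brown(rel_10_items, 5.0)  # same items, five times as many
true_d = 1.0                                      # effect in true-score S.D. units

d_10_items = observed_effect_size(true_d, rel_10_items)
d_50_items = observed_effect_size(true_d, rel_50_items)

print(f"50-item reliability: {rel_50_items:.2f}")
print(f"observed d, 10 items: {d_10_items:.2f}")
print(f"observed d, 50 items: {d_50_items:.2f}")
```

With these assumed figures the same underlying effect would be reported as roughly 0.77 on the short test and roughly 0.94 on the long one.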

    In interpreting an effect size, it is therefore important to know the reliability of the measurement from which it was calculated. This is one reason why the reliability of any outcome measure used should be reported. It is theoretically possible to make a correction for unreliability (sometimes called 'attenuation'), which gives an estimate of what the effect size would have been, had the reliability of the test been perfect. However, in practice the effect of this is rather alarming, since the worse the test was, the more you increase the estimate of the effect size. Moreover, estimates of reliability are dependent on the particular population in which the test was used, and are themselves anyway subject to sampling error. For further discussion of the impact of reliability on effect sizes, see Baugh (2002).
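A brief sketch shows why the correction looks alarming. It assumes the common form of the correction, in which the observed effect size is divided by the square root of the reliability; the observed value of 0.5 and the three reliabilities are hypothetical.

```python
from math import sqrt


def correct_for_attenuation(observed_d: float, reliability: float) -> float:
    """Estimated effect size had the outcome measure been perfectly reliable
    (assumed form: observed d divided by sqrt(reliability))."""
    return observed_d / sqrt(reliability)


observed_d = 0.5  # hypothetical observed effect size
corrected = {rel: correct_for_attenuation(observed_d, rel) for rel in (0.9, 0.6, 0.3)}

for rel, d in corrected.items():
    print(f"reliability {rel:.1f}: corrected d = {d:.2f}")
```

The same observed result is inflated by about 5% when the test is good and by more than 80% when it is poor, so the largest adjustments rest on the weakest measurements.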


    8. Are there alternative measures of effect size?

    A number of statistics are sometimes proposed as alternative measures of effect size, other than the 'standardised mean difference'. Some of these will be considered here.

    Proportion of variance accounted for

    If the correlation between two variables is 'r', the square of this value (often denoted with a capital letter: R²) represents the proportion of the variance in each that is 'accounted for' by the other. In other words, this is the proportion by which the variance of the outcome measure is reduced when it is replaced by the variance of the residuals from a regression equation. This idea can be extended to multiple regression (where it represents the proportion of the variance accounted for by all the independent variables together) and has close analogies in ANOVA (where it is usually called 'eta-squared', η²). The calculation of r (and hence R²) for the kind of experimental situation we have been considering has already been referred to above.
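As a reminder of how the conversion works, the sketch below assumes the standard relation for two groups of equal size, r = d / sqrt(d² + 4). Whether this is exactly the form used earlier in the paper is an assumption, and the function names are illustrative.

```python
from math import sqrt


def d_to_r(d: float) -> float:
    """Correlation equivalent of a standardised mean difference,
    assuming two groups of equal size: r = d / sqrt(d^2 + 4)."""
    return d / sqrt(d * d + 4.0)


def variance_accounted_for(d: float) -> float:
    """Proportion of variance accounted for, R^2 = r^2."""
    return d_to_r(d) ** 2


r = d_to_r(0.8)
r_squared = variance_accounted_for(0.8)

print(f"d = 0.8  ->  r = {r:.2f}, R^2 = {r_squared:.2f}")
print(f"d = -0.8 ->  R^2 = {variance_accounted_for(-0.8):.2f}")
```

An effect size of 0.8 corresponds to only about 14% of variance accounted for, and squaring discards the sign, so an effect of -0.8 gives the same R².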

    Because R² has this ready convertibility, it (or alternative measures of variance accounted for) is sometimes advocated as a universal measure of effect size (e.g. Thompson, 1999). One disadvantage of such an approach is that effect size measures based on variance accounted for suffer from a number of technical limitations, such as sensitivity to violation of assumptions (heterogeneity of variance, balanced designs) and their standard errors can be large (Olejnik and Algina, 2000). They are also generally more statistically complex and hence perhaps less easily understood. Further, they are non-directional; two studies with precisely opposite results would report exactly the same variance accounted for. However, there is a more fundamental objection to the use of what is essentially a measure of association to indicate the strength of an 'effect'.

    Expressing different measures in terms of the same statistic can hide important differences between them; in fact, these different 'effect sizes' are fundamentally different, and should not be confused. The crucial difference between an effect size calculated from an experiment and one calculated from a correlation is in the causal nature of the claim that is being made for it. Moreover, the word 'effect' has an inherent implication of causality: talking about 'the effect of A on B' does suggest a causal relationship rather than just an association. Unfortunately, however, the word 'effect' is often used when no explicit causal claim is being made, but its implication is sometimes allowed to float in and out of the meaning, taking advantage of the ambiguity to suggest a subliminal causal link where none is really justified.

    This kind of confusion is so widespread in education that it is recommended here that the word 'effect' (and therefore 'effect size') should not be used unless a deliberate and explicit causal claim is being made. When no such claim is being made, we may talk about the 'variance accounted for' (R²) or the 'strength of association' (r), or simply, and perhaps most informatively, just cite the regression coefficient (Tukey, 1969). If a causal claim is being made it should be explicit and justification provided. Fitz-Gibbon (2002) has recommended an alternative approach to this problem. She has suggested a system of nomenclature for different kinds of effect sizes that clearly distinguishes between effect sizes derived from, for example, randomised-controlled, quasi-experimental and correlational studies.

    Other measures of effect size

    It has been shown that the interpretation of the 'standardised mean difference' measure of effect size is very sensitive to violations of the assumption of normality. For this reason, a number of more robust (non-parametric) alternatives have been suggested. An example of these is given by Cliff (1993). There are also effect size measures for multivariate outcomes. A detailed explanation can be found in Olejnik and Algina (2000). Finally, a method for calculating effect sizes within multilevel models has been proposed by Tymms et al. (1997). Good summaries of many of the different kinds of effect size measures that can be used and the relationships among them can be found in Snyder and Lawson (1993), Rosenthal (1994) and Kirk (1996).
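To give a flavour of the non-parametric alternatives, the sketch below computes the dominance statistic usually known as Cliff's delta: the proportion of experimental/control pairs in which the experimental value is higher, minus the proportion in which it is lower. It is offered as an illustration of the general idea rather than a full account of Cliff's (1993) methods, and the scores are hypothetical.

```python
from typing import Sequence


def cliffs_delta(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Dominance statistic: P(x > y) - P(x < y) over all pairs (x, y)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical scores for two small groups.
experimental = [12, 15, 17, 19, 22]
control = [10, 11, 14, 16, 18]

delta = cliffs_delta(experimental, control)
print(f"Cliff's delta = {delta:.2f}")
```

Because the statistic uses only the ordering of the scores, it is unaffected by fat tails or by any monotonic rescaling of the outcome measure. It ranges from -1 to 1, with 0 indicating complete overlap.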

    Finally, a common effect size measure widely used in medicine is the 'odds ratio'. This is appropriate where an outcome is dichotomous: success or failure, a patient survives or does not. Explanations of the odds ratio can be found in a number of medical statistics texts, including Altman (1991), and in Fleiss (1994).
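For readers unfamiliar with it, the odds ratio for a two-by-two table can be sketched as follows. The counts are hypothetical, and no confidence interval is attempted here.

```python
def odds_ratio(success_treated: int, failure_treated: int,
               success_control: int, failure_control: int) -> float:
    """Odds of success in the treated group divided by the odds in the control group."""
    odds_treated = success_treated / failure_treated
    odds_control = success_control / failure_control
    return odds_treated / odds_control


# Hypothetical trial: 30 of 40 treated patients survive, 20 of 40 controls survive.
ratio = odds_ratio(30, 10, 20, 20)
print(f"odds ratio = {ratio:.1f}")
```

Here the odds of survival are 3 to 1 with treatment and 1 to 1 without, giving an odds ratio of 3. A value of 1 would indicate no difference between the groups.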

    Conclusions

    Advice on the use of effect sizes can be summarised as follows:

    Effect size is a standardised, scale-free measure of the relative size of the effect of an intervention. It is particularly useful for quantifying effects measured on unfamiliar or arbitrary scales and for comparing the relative sizes of effects from different studies.

    Interpretation of effect size generally depends on the assumptions that 'control' and 'experimental' group values are Normally distributed and have the same standard deviations. Effect sizes can be interpreted in terms of the percentiles or ranks at which two distributions overlap, in terms of the likelihood of identifying the source of a value, or with reference to known effects or outcomes.

    Use of an effect size with a confidence interval conveys the same information as a test of statistical significance, but with the emphasis on the significance of the effect, rather than the sample size.

    Effect sizes (with confidence intervals) should be calculated and reported in primary studies as well as in meta-analyses.

    Interpretation of standardised effect sizes can be problematic when a sample has restricted range or does not come from a Normal distribution, or if the measurement from which it was derived has unknown reliability.

    The use of an 'unstandardised' mean difference (i.e. the raw difference between the two groups, together with a confidence interval) may be preferable when: the outcome is measured on a familiar scale; the sample has a restricted range; the parent population is significantly non-Normal; control and experimental groups have appreciably different standard deviations; the outcome measure has very low or unknown reliability.

    Care must be taken in comparing or aggregating effect sizes based on different outcomes, different operationalisations of the same outcome, different treatments, or levels of the same treatment, or measures derived from different populations.

    The word 'effect' conveys an implication of causality, and the expression 'effect size' should therefore not be used unless this implication is intended and can be justified.

    1 This calculation is derived from a probit transformation (Glass et al., 1981, p136), based on the assumption of an underlying normally distributed variable measuring academic attainment, some threshold of which is equivalent to a student achieving 5+ A*-Cs. Percentages for the change from a starting value of 50% for other effect size values can be read directly from Table I. Alternatively, if $\Phi(z)$ is the standard normal cumulative distribution function, $p_1$ is the proportion achieving a given threshold and $p_2$ the proportion to be expected after a change with effect size, $d$, then,

    $$p_2 = \Phi\{\Phi^{-1}(p_1) + d\}$$
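The formula in this note is easy to evaluate directly. The sketch below uses Python's standard library for the Normal distribution function and its inverse; the function name and the example values are illustrative only.

```python
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def shifted_proportion(p1: float, d: float) -> float:
    """Proportion expected above a threshold after a shift of d standard deviations,
    given that a proportion p1 was above it before: Phi(Phi^-1(p1) + d)."""
    return _STANDARD_NORMAL.cdf(_STANDARD_NORMAL.inv_cdf(p1) + d)


# Starting from 50% above the threshold, an effect size of 0.8:
p2 = shifted_proportion(0.50, 0.8)
print(f"{p2:.0%}")

# The same effect size applied to a lower starting proportion:
print(f"{shifted_proportion(0.30, 0.8):.0%}")
```

Starting from 50%, an effect size of 0.8 gives about 79%. The gain in percentage points depends on the starting proportion, which is why the starting value has to be stated.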


    References

    ALTMAN, D.G. (1991) Practical Statistics for Medical Research. London: Chapman and Hall.

    BANGERT, R.L., KULIK, J.A. AND KULIK, C.C. (1983) Individualised systems of instruction in secondary schools. Review of Educational Research, 53, 143-158.

    BANGERT-DROWNS, R.L. (1988) The effects of school-based substance abuse education: a meta-analysis. Journal of Drug Education, 18, 3, 243-65.

    BAUGH, F. (2002) Correcting effect sizes for score reliability: A reminder that measurement and substantive issues are linked inextricably. Educational and Psychological Measurement, 62, 2, 254-263.

    CLIFF, N. (1993) Dominance Statistics: ordinal analyses to answer ordinal questions. Psychological Bulletin, 114, 3, 494-509.

    COHEN, J. (1969) Statistical Power Analysis for the Behavioral Sciences. NY: Academic Press.

    COHEN, J. (1994) The Earth is Round (p


    HEDGES, L. AND OLKIN, I. (1985) Statistical Methods for Meta-Analysis. New York: Academic Press.

    HEMBREE, R. (1988) Correlates, causes, effects and treatment of test anxiety. Review of Educational Research, 58(1), 47-77.

    HUBERTY, C.J. (2002) A history of effect size indices. Educational and Psychological Measurement, 62, 2, 227-240.

    HYMAN, R.B., FELDMAN, H.R., HARRIS, R.B., LEVIN, R.F. AND MALLOY, G.B. (1989) The effects of relaxation training on medical symptoms: a meta-analysis. Nursing Research, 38, 216-220.

    KAVALE, K.A. AND FORNESS, S.R. (1983) Hyperactivity and diet treatment: a meta-analysis of the Feingold hypothesis. Journal of Learning Disabilities, 16, 324-330.

    KESELMAN, H.J., HUBERTY, C.J., LIX, L.M., OLEJNIK, S., CRIBBIE, R.A., DONAHUE, B., KOWALCHUK, R.K., LOWMAN, L.L., PETOSKEY, M.D., KESELMAN, J.C. AND LEVIN, J.R. (1998) Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA, and ANCOVA analyses. Review of Educational Research, 68, 3, 350-386.

    KIRK, R.E. (1996) Practical Significance: A concept whose time has come. Educational and Psychological Measurement, 56, 5, 746-759.

    KULIK, J.A., KULIK, C.C. AND BANGERT, R.L. (1984) Effects of practice on aptitude and achievement test scores. American Educational Research Journal, 21, 435-447.

    LEPPER, M.R., HENDERLONG, J. AND GINGRAS, I. (1999) Understanding the effects of extrinsic rewards on intrinsic motivation - Uses and abuses of meta-analysis: Comment on Deci, Koestner, and Ryan. Psychological Bulletin, 125, 6, 669-676.

    LIPSEY, M.W. (1992) Juvenile delinquency treatment: a meta-analytic inquiry into the variability of effects. In T.D. Cook, H. Cooper, D.S. Cordray, H. Hartmann, L.V. Hedges, R.J. Light, T.A. Louis and F. Mosteller (Eds) Meta-analysis for explanation. New York: Russell Sage Foundation.

    LIPSEY, M.W. AND WILSON, D.B. (1993) The Efficacy of Psychological, Educational, and Behavioral Treatment: Confirmation from meta-analysis. American Psychologist, 48, 12, 1181-1209.

    MCGRAW, K.O. (1991) Problems with the BESD: a comment on Rosenthal's 'How Are We Doing in Soft Psychology'. American Psychologist, 46, 1084-6.

    MCGRAW, K.O. AND WONG, S.P. (1992) A Common Language Effect Size Statistic. Psychological Bulletin, 111, 361-365.

    MOSTELLER, F., LIGHT, R.J. AND SACHS, J.A. (1996) 'Sustained inquiry in education: lessons from skill grouping and class size.' Harvard Educational Review, 66, 797-842.

    OAKES, M. (1986) Statistical Inference: A Commentary for the Social and Behavioral Sciences. New York: Wiley.

    OLEJNIK, S. AND ALGINA, J. (2000) Measures of Effect Size for Comparative Studies: Applications, Interpretations and Limitations. Contemporary Educational Psychology, 25, 241-286.


    ROSENTHAL, R. (1994) Parametric Measures of Effect Size. In H. Cooper and L.V. Hedges (Eds.), The Handbook of Research Synthesis. New York: Russell Sage Foundation.

    ROSENTHAL, R. AND RUBIN, D.B. (1982) A simple, general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74, 166-169.

    RUBIN, D.B. (1992) Meta-analysis: literature synthesis or effect-size surface estimation. Journal of Educational Statistics, 17, 4, 363-374.

    SHYMANSKY, J.A., HEDGES, L.V. AND WOODWORTH, G. (1990) A reassessment of the effects of inquiry-based science curricula of the 60's on student performance. Journal of Research in Science Teaching, 27, 127-144.

    SLAVIN, R.E. AND MADDEN, N.A. (1989) What works for students at risk? A research synthesis. Educational Leadership, 46(4), 4-13.

    SMITH, M.L. AND GLASS, G.V. (1980) Meta-analysis of research on class size and its relationship to attitudes and instruction. American Educational Research Journal, 17, 419-433.

    SNYDER, P. AND LAWSON, S. (1993) Evaluating Results Using Corrected and Uncorrected Effect Size Estimates. Journal of Experimental Education, 61, 4, 334-349.

    STRAHAN, R.F. (1991) Remarks on the Binomial Effect Size Display. American Psychologist, 46, 1083-4.

    THOMPSON, B. (1999) Common methodology mistakes in educational research, revisited, along with a primer on both effect sizes and the bootstrap. Invited address presented at the annual meeting of the American Educational Research Association, Montreal. [Accessed from, January 2000]

    TYMMS, P., MERRELL, C. AND HENDERSON, B. (1997) The First Year at School: A Quantitative Investigation of the Attainment and Progress of Pupils. Educational Research and Evaluation, 3, 2, 101-118.

    VINCENT, D. AND CRUMPLER, M. (1997) British Spelling Test Series Manual 3X/Y. Windsor: NFER-Nelson.

    WANG, M.C. AND BAKER, E.T. (1986) Mainstreaming programs: Design features and effects. Journal of Special Education, 19, 503-523.

    WILCOX, R.R. (1998) How many discoveries have been lost by ignoring modern statistical methods? American Psychologist, 53, 3, 300-314.

    WILKINSON, L. AND TASK FORCE ON STATISTICAL INFERENCE, APA BOARD OF SCIENTIFIC AFFAIRS (1999) Statistical Methods in Psychology Journals: Guidelines and Explanations. American Psychologist, 54, 8, 594-604.