Effect Size


It's the Effect Size, Stupid [1]
What effect size is and why it is important

Paper presented at the British Educational Research Association annual conference, Exeter, 12-14 September, 2002

Robert Coe

School of Education, University of Durham, Leazes Road, [email protected]

Abstract

Effect size is a simple way of quantifying the difference between two groups that has many advantages over the use of tests of statistical significance alone. Effect size emphasises the size of the difference rather than confounding this with sample size. However, primary reports rarely mention effect sizes and few textbooks, research methods courses or computer packages address the concept. This paper provides an explication of what an effect size is, how it is calculated and how it can be interpreted. The relationship between effect size and statistical significance is discussed and the use of confidence intervals for the latter outlined. Some advantages and dangers of using effect sizes in meta-analysis are discussed and other problems with the use of effect sizes are raised. A number of alternative measures of effect size are described. Finally, advice on the use of effect sizes is summarised.

'Effect size' is simply a way of quantifying the size of the difference between two groups. It is easy to calculate, readily understood and can be applied to any measured outcome in Education or Social Science. It is particularly valuable for quantifying the effectiveness of a particular intervention, relative to some comparison. It allows us to move beyond the simplistic, 'Does it work or not?' to the far more sophisticated, 'How well does it work in a range of contexts?' Moreover, by placing the emphasis on the most important aspect of an intervention (the size of the effect) rather than its statistical significance (which conflates effect size and sample size), it promotes a more scientific approach to the accumulation of knowledge. For these reasons, effect size is an important tool in reporting and interpreting effectiveness.

The routine use of effect sizes, however, has generally been limited to meta-analysis, for combining and comparing estimates from different studies, and is all too rare in original reports of educational research (Keselman et al., 1998). This is despite the fact that measures of effect size have been available for at least 60 years (Huberty, 2002), and the American Psychological Association has been officially encouraging authors to report effect sizes since 1994, but with limited success (Wilkinson et al., 1999). Formulae for the calculation of effect sizes do not appear in most statistics text books (other than those devoted to meta-analysis), are not featured in many statistics computer packages and are seldom taught in standard research methods courses. For these reasons, even the researcher who is convinced by the wisdom of using measures of effect size, and is not afraid to confront the orthodoxy of conventional practice, may find that it is quite hard to know exactly how to do so.

[1] During the 1992 US Presidential election campaign, Bill Clinton's fortunes were transformed when his advisors helped him to focus on the main issue by writing 'It's the economy, stupid' on a board they put in front of him every time he went out to speak.

The following guide is written for non-statisticians, though inevitably some equations and technical language have been used. It describes what effect size is, what it means, how it can be used and some potential problems associated with using it.

1. Why do we need 'effect size'?

Consider an experiment conducted by Dowson (2000) to investigate time of day effects on learning: do children learn better in the morning or afternoon? A group of 38 children were included in the experiment. Half were randomly allocated to listen to a story and answer questions about it (on tape) at 9am, the other half to hear exactly the same story and answer the same questions at 3pm. Their comprehension was measured by the number of questions answered correctly out of 20.

The average score was 15.2 for the morning group, 17.9 for the afternoon group: a difference of 2.7. But how big a difference is this? If the outcome were measured on a familiar scale, such as GCSE grades, interpreting the difference would not be a problem. If the average difference were, say, half a grade, most people would have a fair idea of the educational significance of the effect of reading a story at different times of day. However, in many experiments there is no familiar scale available on which to record the outcomes. The experimenter often has to invent a scale or to use (or adapt) an already existing one, but generally not one whose interpretation will be familiar to most people.

[Figure 1: panels (a) and (b)]

One way to get over this problem is to use the amount of variation in scores to contextualise the difference. If there were no overlap at all and every single person in the afternoon group had done better on the test than everyone in the morning group, then this would seem like a very substantial difference. On the other hand, if the spread of scores were large and the overlap much bigger than the difference between the groups, then the effect might seem less significant. Because we have an idea of the amount of variation found within a group, we can use this as a yardstick against which to compare the difference. This idea is quantified in the calculation of the effect size. The concept is illustrated in Figure 1, which shows two possible ways the difference might vary in relation to the overlap. If the difference were as in graph (a) it would be very significant; in graph (b), on the other hand, the difference might hardly be noticeable.


2. How is it calculated?

The effect size is just the standardised mean difference between the two groups. In other words:

\[ \text{Effect Size} = \frac{[\text{Mean of experimental group}] - [\text{Mean of control group}]}{\text{Standard Deviation}} \]

Equation 1

If it is not obvious which of two groups is the 'experimental' (i.e. the one which was given the 'new' treatment being tested) and which the 'control' (the one given the 'standard' treatment, or no treatment, for comparison), the difference can still be calculated. In this case, the 'effect size' simply measures the difference between them, so it is important in quoting the effect size to say which way round the calculation was done.

The 'standard deviation' is a measure of the spread of a set of values. Here it refers to the standard deviation of the population from which the different treatment groups were taken. In practice, however, this is almost never known, so it must be estimated either from the standard deviation of the control group, or from a 'pooled' value from both groups (see question 7, below, for more discussion of this).

In Dowson's time-of-day effects experiment, the standard deviation (SD) = 3.3, so the effect size was (17.9 - 15.2)/3.3 = 0.8.

3. How can effect sizes be interpreted?

One feature of an effect size is that it can be directly converted into statements about the overlap between the two samples in terms of a comparison of percentiles.

An effect size is exactly equivalent to a 'Z-score' of a standard Normal distribution. For example, an effect size of 0.8 means that the score of the average person in the experimental group is 0.8 standard deviations above the average person in the control group, and hence exceeds the scores of 79% of the control group. With the two groups of 19 in the time-of-day effects experiment, the average person in the 'afternoon' group (i.e. the one who would have been ranked 10th in the group) would have scored about the same as the 4th highest person in the 'morning' group. Visualising these two individuals can give quite a graphic interpretation of the difference between the two effects.

Table I shows conversions of effect sizes (column 1) to percentiles (column 2) and the equivalent change in rank order for a group of 25 (column 3). For example, for an effect size of 0.6, the value of 73% indicates that the average person in the experimental group would score higher than 73% of a control group that was initially equivalent. If the group consisted of 25 people, this is the same as saying that the average person (i.e. ranked 13th in the group) would now be on a par with the person ranked 7th in the control group. Notice that an effect size of 1.6 would raise the average person to be level with the top ranked individual in the control group, so effect sizes larger than this are illustrated in terms of the top person in a larger group. For example, an effect size of 3.0 would bring the average person in a group of 740 level with the previously top person in the group.
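As a minimal sketch of Equation 1 and of the percentile interpretation described above, the calculation might look like this in Python. The function names are illustrative rather than taken from the paper, and the conversion to a percentile assumes Normally distributed scores.

```python
from statistics import NormalDist


def effect_size(mean_experimental, mean_control, sd):
    """Standardised mean difference (Equation 1)."""
    return (mean_experimental - mean_control) / sd


def percent_of_control_below(d):
    """Percentage of the control group scoring below the average member
    of the experimental group, assuming Normal distributions."""
    return 100 * NormalDist().cdf(d)


# Figures from Dowson's time-of-day experiment (afternoon vs morning).
d = effect_size(17.9, 15.2, 3.3)
print(round(d, 1))                           # 0.8
print(round(percent_of_control_below(0.8)))  # 79
```

Swapping the two means changes only the sign of the result, which is why the direction of the comparison needs to be stated alongside the figure.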

Table I: Interpretations of effect sizes

Column key:
(1) Effect size
(2) Percentage of control group who would be below average person in experimental group
(3) Rank of person in a control group of 25 who would be equivalent to the average person in experimental group
(4) Probability that you could guess which group a person was in from knowledge of their 'score'
(5) Equivalent correlation, r (= difference in percentage 'successful' in each of the two groups, BESD)
(6) Probability that person from experimental group will be higher than person from control, if both chosen at random (= CLES)

(1)    (2)      (3)                          (4)    (5)    (6)
0.0    50%      13th                         0.50   0.00   0.50
0.1    54%      12th                         0.52   0.05   0.53
0.2    58%      11th                         0.54   0.10   0.56
0.3    62%      10th                         0.56   0.15   0.58
0.4    66%      9th                          0.58   0.20   0.61
0.5    69%      8th                          0.60   0.24   0.64
0.6    73%      7th                          0.62   0.29   0.66
0.7    76%      6th                          0.64   0.33   0.69
0.8    79%      6th                          0.66   0.37   0.71
0.9    82%      5th                          0.67   0.41   0.74
1.0    84%      4th                          0.69   0.45   0.76
1.2    88%      3rd                          0.73   0.51   0.80
1.4    92%      2nd                          0.76   0.57   0.84
1.6    95%      1st                          0.79   0.62   0.87
1.8    96%      1st                          0.82   0.67   0.90
2.0    98%      1st (or 1st out of 44)       0.84   0.71   0.92
2.5    99%      1st (or 1st out of 160)      0.89   0.78   0.96
3.0    99.9%    1st (or 1st out of 740)      0.93   0.83   0.98

Another way to conceptualise the overlap is in terms of the probability that one could guess which group a person came from, based only on their test score, or whatever value was being compared. If the effect size were 0 (i.e. the two groups were the same) then the probability of a correct guess would be exactly a half, or 0.50. With a difference between the two groups equivalent to an effect size of 0.3, there is still plenty of overlap, and the probability of correctly identifying the groups rises only slightly to 0.56. With an effect size of 1, the probability is now 0.69, just over a two-thirds chance. These probabilities are shown in the fourth column of Table I. It is clear that the overlap between experimental and control groups is substantial (and therefore the probability is still close to 0.5), even when the effect size is quite large.
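The paper does not state how these probabilities were obtained. One sketch that reproduces the quoted values assumes two equal-sized Normal groups with the same standard deviation and a guesser who assigns each score to the group whose mean is nearer; under those assumptions the chance of a correct guess is the standard Normal cumulative probability at d/2. The function name is illustrative.

```python
from statistics import NormalDist


def probability_of_correct_guess(d):
    """Chance of naming the right group from a single score, assuming
    equal-sized Normal groups with equal spread and a cut-off placed
    midway between the two means."""
    return NormalDist().cdf(abs(d) / 2)


for d in (0.0, 0.3, 1.0):
    print(d, round(probability_of_correct_guess(d), 2))
```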


A slightly different way to interpret effect sizes makes use of an equivalence between the standardised mean difference (d) and the correlation coefficient, r. If group membership is coded with a dummy variable (e.g. denoting the control group by 0 and the experimental group by 1) and the correlation between this variable and the outcome measure calculated, a value of r can be derived. By making some additional assumptions, one can readily convert d into r in general, using the equation $r^2 = d^2 / (4 + d^2)$ (see Cohen, 1969, pp. 20-22 for other formulae and conversion table). Rosenthal and Rubin (1982) take advantage of an interesting property of r to suggest a further interpretation, which they call the binomial effect size display (BESD). If the outcome measure is reduced to a simple dichotomy (for example, whether a score is above or below a particular value such as the median, which could be thought of as 'success' or 'failure'), r can be interpreted as the difference in the proportions in each category. For example, an effect size of 0.2 indicates a difference of 0.10 in these proportions, as would be the case if 45% of the control group and 55% of the treatment group had reached some threshold of 'success'. Note, however, that if the overall proportion 'successful' is not close to 50%, this interpretation can be somewhat misleading (Strahan, 1991; McGraw, 1991). The values for the BESD are shown in column 5.
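The conversion from d to r, and the BESD reading of r, can be sketched as follows. The function names are illustrative, and the BESD proportions assume an overall success rate of 50%, as the text cautions.

```python
import math


def d_to_r(d):
    """Correlation equivalent to a standardised mean difference,
    from r^2 = d^2 / (4 + d^2)."""
    return d / math.sqrt(4 + d * d)


def besd(d):
    """Binomial effect size display: 'success' proportions in the control
    and experimental groups, centred on an overall rate of 50%."""
    r = d_to_r(d)
    return 0.5 - r / 2, 0.5 + r / 2


print(round(d_to_r(0.2), 2))              # 0.1
print([round(p, 2) for p in besd(0.2)])   # [0.45, 0.55]
```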

Finally, McGraw and Wong (1992) have suggested a 'Common Language Effect Size' (CLES) statistic, which they argue is readily understood by non-statisticians (shown in column 6 of Table I). This is the probability that a score sampled at random from one distribution will be greater than a score sampled from another. They give the example of the heights of young adult males and females, which differ by an effect size of about 2, and translate this difference to a CLES of 0.92. In other words 'in 92 out of 100 blind dates among young adults, the male will be taller than the female' (p. 361).
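For two Normal distributions with equal standard deviations, the difference between two randomly drawn scores is itself Normal with a standard deviation √2 times larger, so the CLES can be sketched as the standard Normal cumulative probability at d/√2. This is an assumption-laden illustration (equal spread, Normality, independent draws), and the function name is not from the paper.

```python
import math
from statistics import NormalDist


def cles(d):
    """Probability that a random score from the higher group exceeds a
    random score from the other, assuming equal-variance Normal groups."""
    return NormalDist().cdf(d / math.sqrt(2))


print(round(cles(2.0), 2))  # 0.92
```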

It should be noted that the values in Table I depend on the assumption of a Normal distribution. The interpretation of effect sizes in terms of percentiles is very sensitive to violations of this assumption (see question 7, below).

Another way to interpret effect sizes is to compare them to the effect sizes of differences that are familiar. For example, Cohen (1969, p. 23) describes an effect size of 0.2 as 'small' and gives to illustrate it the example that the difference between the heights of 15 year old and 16 year old girls in the US corresponds to an effect of this size. An effect size of 0.5 is described as 'medium' and is 'large enough to be visible to the naked eye'. A 0.5 effect size corresponds to the difference between the heights of 14 year old and 18 year old girls. Cohen describes an effect size of 0.8 as 'grossly perceptible and therefore large' and equates it to the difference between the heights of 13 year old and 18 year old girls. As a further example he states that the difference in IQ between holders of the Ph.D. degree and 'typical college freshmen' is comparable to an effect size of 0.8.

Cohen does acknowledge the danger of using terms like 'small', 'medium' and 'large' out of context. Glass et al. (1981, p. 104) are particularly critical of this approach, arguing that the effectiveness of a particular intervention can only be interpreted in relation to other interventions that seek to produce the same effect. They also point out that the practical importance of an effect depends entirely on its relative costs and benefits. In education, if it could be shown that making a small and inexpensive change would raise academic achievement by an effect size of even as little as 0.1, then this could be a very significant improvement, particularly if the improvement applied uniformly to all students, and even more so if the effect were cumulative over time.


Table II: Examples of average effect sizes from research

Intervention                         Outcome                                   Effect Size   Source
Reducing class size from 23 to 15    Students' test performance in reading     0.30          Finn and Achilles (1990)
                                     Students' test performance in maths       0.32
Small ( […]                          Attitudes of students                     0.47

[…]


[…] in a spelling age from 11 to 12 corresponds to an effect size of about 0.3, but seems to vary according to the particular test used.

In England, the distributions of GCSE grades in compulsory subjects (i.e. Maths and English) have standard deviations of between 1.5 and 1.8 grades, so an improvement of one GCSE grade represents an effect size of 0.5 to 0.7. In the context of secondary schools therefore, introducing a change in practice whose effect size was known to be 0.6 would result in an improvement of about a GCSE grade for each pupil in each subject. For a school in which 50% of pupils were previously gaining five or more A*-C grades, this percentage (other things being equal, and assuming that the effect applied equally across the whole curriculum) would rise to 73%. Even Cohen's 'small' effect of 0.2 would produce an increase from 50% to 58%, a difference that most schools would probably categorise as quite substantial. Olejnik and Algina (2000) give a similar example based on the Iowa Test of Basic Skills.
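The rise from 50% to 73% (or 58%) can be reproduced with a short sketch, assuming Normally distributed attainment and a uniform shift of d standard deviations for every pupil. The function name and this way of framing the calculation are illustrative, not taken from the paper.

```python
from statistics import NormalDist


def share_above_threshold(baseline_share, d):
    """Share of pupils above a fixed threshold after every score rises by
    d standard deviations, given the share above it beforehand
    (0 < baseline_share < 1)."""
    std = NormalDist()
    threshold = std.inv_cdf(1 - baseline_share)  # threshold as a z-score
    return 1 - std.cdf(threshold - d)


print(round(100 * share_above_threshold(0.50, 0.6)))  # 73
print(round(100 * share_above_threshold(0.50, 0.2)))  # 58
```

The same function can be tried with a baseline other than 50%, where the gain from a given effect size will differ.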

Finally, the interpretation of effect sizes can be greatly helped by a few examples from existing research. Table II lists a selection of these, many of which are taken from Lipsey and Wilson (1993). The examples cited are given for illustration of the use of effect size measures; they are not intended to be the definitive judgement on the relative efficacy of different interventions. In interpreting them, therefore, one should bear in mind that most of the meta-analyses from which they are derived can be (and often have been) criticised for a variety of weaknesses, that the range of circumstances in which the effects have been found may be limited, and that the effect size quoted is an average which is often based on quite widely differing values.

It seems to be a feature of educational interventions that very few of them have effects that would be described in Cohen's classification as anything other than 'small'. This appears particularly so for effects on student achievement. No doubt this is partly a result of the wide variation found in the population as a whole, against which the measure of effect size is calculated. One might also speculate that achievement is harder to influence than other outcomes, perhaps because most schools are already using optimal strategies, or because different strategies are likely to be effective in different situations, a complexity that is not well captured by a single average effect size.

4. What is the relationship between 'effect size' and 'significance'?

Effect size quantifies the size of the difference between two groups, and may therefore be said to be a true measure of the significance of the difference. If, for example, the results of Dowson's 'time of day effects' experiment were found to apply generally, we might ask the question: 'How much difference would it make to children's learning if they were taught a particular topic in the afternoon instead of the morning?' The best answer we could give to this would be in terms of the effect size.

However, in statistics the word 'significance' is often used to mean 'statistical significance', which is the likelihood that the difference between the two groups could just be an accident of sampling. If you take two samples from the same population there will always be a difference between them. The statistical significance is usually calculated as a 'p-value', the probability that a difference of at least the same size would have arisen by chance, even if there really were no difference between the two populations. For differences between the means of two groups, this p-value would normally be calculated from a 't-test'. By convention, if p […]

[…] of the sample. One would get a 'significant' result either if the effect were very big (despite having only a small sample) or if the sample were very big (even if the actual effect size were tiny). It is important to know the statistical significance of a result, since without it there is a danger of drawing firm conclusions from studies where the sample is too small to justify such confidence. However, statistical significance does not tell you the most important thing: the size of the effect. One way to overcome this confusion is to report the effect size, together with an estimate of its likely 'margin for error' or 'confidence interval'.

5. What is the margin for error in estimating effect sizes?

Clearly, if an effect size is calculated from a very large sample it is likely to be more accurate than one calculated from a small sample. This 'margin for error' can be quantified using the idea of a 'confidence interval', which provides the same information as is usually contained in a significance test: using a '95% confidence interval' is equivalent to taking a '5% significance level'. To calculate a 95% confidence interval, you assume that the value you got (e.g. the effect size estimate of 0.8) is the 'true' value, but calculate the amount of variation in this estimate you would get if you repeatedly took new samples of the same size (i.e. different samples of 38 children). For every 100 of these hypothetical new samples, by definition, 95 would give estimates of the effect size within the '95% confidence interval'. If this confidence interval includes zero, then that is the same as saying that the result is not statistically significant. If, on the other hand, zero is outside the range, then it is 'statistically significant at the 5% level'. Using a confidence interval is a better way of conveying this information since it keeps the emphasis on the effect size (which is the important information) rather than the p-value.

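A formula for calculating the confidence interval for an effect size is given by Hedges and Olkin (1985, p. 86). If the effect size estimate from the sample is d, then it is Normally distributed, with standard deviation:

\[ \sigma[d] = \sqrt{\frac{N_E + N_C}{N_E \times N_C} + \frac{d^2}{2(N_E + N_C)}} \]

Equation 2

(Where $N_E$ and $N_C$ are the numbers in the experimental and control groups, respectively.)

Hence a 95% confidence interval for d would be from

\[ d - 1.96 \times \sigma[d] \quad \text{to} \quad d + 1.96 \times \sigma[d] \]

Equation 3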

To use the figures from the time-of-day experiment again, $N_E = N_C = 19$ and d = 0.8, so $\sigma[d] = \sqrt{0.105 + 0.008} = 0.34$. Hence the 95% confidence interval is [0.14, 1.46]. This would normally be interpreted (despite the fact that such an interpretation is not strictly justified; see Oakes, 1986 for an enlightening discussion of this) as meaning that the 'true' effect of time-of-day is very likely to be between 0.14 and 1.46. In other words, it is almost certainly positive (i.e. afternoon is better than morning) and the difference may well be quite large.
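The worked figures above can be checked with a short sketch of the standard deviation of d and the resulting 95% interval. The form of the standard deviation used here is the one implied by the worked numbers (0.105 + 0.008 under the square root); the function names are illustrative.

```python
import math


def effect_size_sd(d, n_e, n_c):
    """Standard deviation of the effect size estimate d for groups of
    size n_e (experimental) and n_c (control)."""
    return math.sqrt((n_e + n_c) / (n_e * n_c) + d * d / (2 * (n_e + n_c)))


def confidence_interval_95(d, n_e, n_c):
    """Approximate 95% confidence interval for d."""
    margin = 1.96 * effect_size_sd(d, n_e, n_c)
    return d - margin, d + margin


low, high = confidence_interval_95(0.8, 19, 19)
print(round(effect_size_sd(0.8, 19, 19), 2))  # 0.34
print(round(low, 2), round(high, 2))          # 0.14 1.46
```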

6. How can knowledge about effect sizes be combined?

One of the main advantages of using effect size is that when a particular experiment has been replicated, the different effect size estimates from each study can easily be combined to give an overall best estimate of the size of the effect. This process of synthesising experimental results into a single effect size estimate is known as 'meta-analysis'. It was developed in its current form by an educational statistician, Gene Glass (see Glass et al., 1981), though the roots of meta-analysis can be traced a good deal further back (see Lepper et al., 1999), and is now widely used, not only in education, but in medicine and throughout the social sciences. A brief and accessible introduction to the idea of meta-analysis can be found in Fitz-Gibbon (1984).

Meta-analysis, however, can do much more than simply produce an overall 'average' effect size, important though this often is. If, for a particular intervention, some studies produced large effects, and some small effects, it would be of limited value simply to combine them together and say that the average effect was 'medium'. Much more useful would be to examine the original studies for any differences between those with large and small effects and to try to understand what factors might account for the difference. The best meta-analysis, therefore, involves seeking relationships between effect sizes and characteristics of the intervention, the context and study design in which they were found (Rubin, 1992; see also Lepper et al. (1999) for a discussion of the problems that can be created by failing to do this, and some other limitations of the applicability of meta-analysis).

The importance of replication in gaining evidence about what works cannot be overstressed. In Dowson's time-of-day experiment the effect was found to be large enough to be statistically and educationally significant. Because we know that the pupils were allocated randomly to each group, we can be confident that chance initial differences between the two groups are very unlikely to account for the difference in the outcomes. Furthermore, the use of a pre-test of both groups before the intervention makes this even less likely. However, we cannot rule out the possibility that the difference arose from some characteristic peculiar to the children in this particular experiment. For example, if none of them had had any breakfast that day, this might account for the poor performance of the morning group. However, the result would then presumably not generalise to the wider population of school students, most of whom would have had some breakfast. Alternatively, the effect might depend on the age of the students. Dowson's students were aged 7 or 8; it is quite possible that the effect could be diminished or reversed with older (or younger) students. This illustrates the danger of implementing policy on the basis of a single experiment. Confidence in the generality of a result can only follow widespread replication.

An important consequence of the capacity of meta-analysis to combine results is that even small studies can make a significant contribution to knowledge. The kind of experiment that can be done by a single teacher in a school might involve a total of fewer than 30 students. Unless the effect is huge, a study of this size is most unlikely to get a statistically significant result. According to conventional statistical wisdom, therefore, the experiment is not worth doing. However, if the results of several such experiments are combined using meta-analysis, the overall result is likely to be highly statistically significant. Moreover, it will have the important strengths of being derived from a range of contexts (thus increasing confidence in its generality) and from real-life working practice (thereby making it more likely that the policy is feasible and can be implemented authentically).
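The paper does not prescribe a particular way of combining estimates. One common choice, used here purely as an illustration, is a fixed-effect average in which each study is weighted by the inverse of the variance of its effect size. The five studies below are invented, each with fewer than 30 pupils, and the function names are hypothetical.

```python
import math


def effect_size_variance(d, n_e, n_c):
    """Sampling variance of d for groups of size n_e and n_c."""
    return (n_e + n_c) / (n_e * n_c) + d * d / (2 * (n_e + n_c))


def combine_fixed_effect(studies):
    """Inverse-variance weighted mean of (d, n_e, n_c) tuples.
    Returns the pooled effect size and its standard deviation."""
    weights = [1 / effect_size_variance(d, n_e, n_c) for d, n_e, n_c in studies]
    pooled = sum(w * s[0] for w, s in zip(weights, studies)) / sum(weights)
    return pooled, math.sqrt(1 / sum(weights))


# Invented classroom-scale studies: (effect size, N experimental, N control).
studies = [(0.5, 14, 14), (0.3, 13, 14), (0.7, 12, 13), (0.4, 14, 13), (0.6, 13, 13)]

for d, n_e, n_c in studies:
    margin = 1.96 * math.sqrt(effect_size_variance(d, n_e, n_c))
    print(f"single study: d={d}, 95% CI [{d - margin:.2f}, {d + margin:.2f}]")

pooled, sd = combine_fixed_effect(studies)
low, high = pooled - 1.96 * sd, pooled + 1.96 * sd
print(f"combined: d={pooled:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

With these made-up figures every single-study interval includes zero while the combined interval does not, which is the point made in the paragraph above. A fixed-effect average of this kind presupposes that the studies estimate the same underlying effect.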


One final caveat should be made here about the danger of combining incommensurable results. Given two (or more) numbers, one can always calculate an average. However, if they are effect sizes from experiments that differ significantly in terms of the outcome measures used, then the result may be totally meaningless. It can be very tempting, once effect sizes have been calculated, to treat them as all the same and lose sight of their origins. Certainly, there are plenty of examples of meta-analyses in which the juxtaposition of effect sizes is somewhat questionable.

In comparing (or combining) effect sizes, one should therefore consider carefully whether they relate to the same outcomes. This advice applies not only to meta-analysis, but to any other comparison of effect sizes. Moreover, because of the sensitivity of effect size estimates to reliability and range restriction (see below), one should also consider whether those outcome measures are derived from the same (or sufficiently similar) instruments and the same (or sufficiently similar) populations.

It is also important to compare only like with like in terms of the treatments used to create the differences being measured. In the education literature, the same name is often given to interventions that are actually very different, for example, if they are operationalised differently, or if they are simply not well enough defined for it to be clear whether they are the same or not. It could also be that different studies have used the same well-defined and operationalised treatments, but the actual implementation differed, or that the same treatment may have had different levels of intensity in different studies. In any of these cases, it makes no sense to average out their effects.

7. What other factors can influence effect size?

Although effect size is a simple and readily interpreted measure of effectiveness, it can also be sensitive to a number of spurious influences, so some care needs to be taken in its use. Some of these issues are outlined here.

Which 'standard deviation'?

The first problem is the issue of which 'standard deviation' to use. Ideally, the control group will provide the best estimate of standard deviation, since it consists of a representative group of the population who have not been affected by the experimental intervention. However, unless the control group is very large, the estimate of the 'true' population standard deviation derived from only the control group is likely to be appreciably less accurate than an estimate derived from both the control and experimental groups. Moreover, in studies where there is not a true 'control' group (for example the time-of-day effects experiment) then it may be an arbitrary decision which group's standard deviation to use, and it will often make an appreciable difference to the estimate of effect size.

For these reasons, it is often better to use a 'pooled' estimate of standard deviation. The pooled estimate is essentially an average of the standard deviations of the experimental and control groups (Equation 4). Note that this is not the same as the standard deviation of all the values in both groups 'pooled' together. If, for example, each group had a low standard deviation but the two means were substantially different, the true pooled estimate (as calculated by Equation 4) would be much lower than the value obtained by pooling all the values together and calculating the standard deviation. The implications of choices about which standard deviation to use are discussed by Olejnik and Algina (2000).

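\[ SD_{pooled} = \sqrt{\frac{(N_E - 1)\,SD_E^2 + (N_C - 1)\,SD_C^2}{N_E + N_C - 2}} \]

Equation 4

(Where $N_E$ and $N_C$ are the numbers in the experimental and control groups, respectively, and $SD_E$ and $SD_C$ are their standard deviations.)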

The use of a pooled estimate of standard deviation depends on the assumption that the two calculated standard deviations are estimates of the same population value. In other words, that the experimental and control group standard deviations differ only as a result of sampling variation. Where this assumption cannot be made (either because there is some reason to believe that the two standard deviations are likely to be systematically different, or if the actual measured values are very different), then a pooled estimate should not be used.

In the example of Dowson's time-of-day experiment, the standard deviations for the morning and afternoon groups were 4.12 and 2.10 respectively. With $N_E = N_C = 19$, Equation 4 therefore gives $SD_{pooled}$ as 3.3, which was the value used in Equation 1 to give an effect size of 0.8. However, the difference between the two standard deviations seems quite large in this case. Given that the afternoon group mean was 17.9 out of 20, it seems likely that its standard deviation may have been reduced by a 'ceiling effect', i.e. the spread of scores was limited by the maximum available mark of 20. In this case therefore, it might be more appropriate to use the morning group's standard deviation as the best estimate. Doing this will reduce the effect size to 0.7, and it then becomes a somewhat arbitrary decision which value of the effect size to use. A general rule of thumb in statistics when two valid methods give different answers is: 'If in doubt, cite both'.
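The two candidate effect sizes can be reproduced with a brief sketch of the pooled standard deviation, using the usual degrees-of-freedom weighting, which matches the 3.3 quoted for Dowson's data. The function name is illustrative.

```python
import math


def pooled_sd(sd_e, n_e, sd_c, n_c):
    """Pooled standard deviation of two groups, weighting each group's
    variance by its degrees of freedom."""
    return math.sqrt(
        ((n_e - 1) * sd_e ** 2 + (n_c - 1) * sd_c ** 2) / (n_e + n_c - 2)
    )


mean_difference = 17.9 - 15.2
sd_both = pooled_sd(2.10, 19, 4.12, 19)      # afternoon and morning groups
print(round(sd_both, 1))                     # 3.3
print(round(mean_difference / sd_both, 1))   # 0.8 (pooled SD)
print(round(mean_difference / 4.12, 1))      # 0.7 (morning group SD only)
```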

Corrections for bias

Although using the pooled standard deviation to calculate the effect size generally gives a better estimate than the control group SD, it is still unfortunately slightly biased and in general gives a value slightly larger than the true population value (Hedges and Olkin, 1985). Hedges and Olkin (1985, p. 80) give a formula which provides an approximate correction to this bias.

In Dowson's experiment with 38 values, the correction factor will be 0.98, so it makes very little difference, reducing the effect size estimate from 0.82 to 0.80. Given the likely accuracy of the figures on which this is based, it is probably only worth quoting one decimal place, so the figure of 0.8 stands. In fact, the correction only becomes significant for small samples, in which the accuracy is anyway much less. It is therefore hardly worth worrying about it in primary reports of empirical results. However, in meta-analysis, where results from primary studies are combined, the correction is important, since without it this bias would be accumulated.
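The correction formula itself is not reproduced in the text. The sketch below uses the widely quoted approximation, a factor of 1 - 3/(4(N_E + N_C) - 9), on the assumption that this is the correction intended; it does yield the factor of 0.98 quoted for 38 values. The function name is illustrative.

```python
def bias_correction_factor(n_e, n_c):
    """Approximate small-sample correction factor for the standardised
    mean difference (assumed form: 1 - 3 / (4 * (n_e + n_c) - 9))."""
    return 1 - 3 / (4 * (n_e + n_c) - 9)


factor = bias_correction_factor(19, 19)
print(round(factor, 2))         # 0.98
print(round(0.82 * factor, 2))  # 0.8
```

For much smaller groups the factor falls further below 1, which is why the correction matters more when small studies are combined.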

Restricted range

Suppose the time-of-day effects experiment were to be repeated, once with the top set in a highly selective school and again with a mixed-ability group in a comprehensive. If students were allocated to morning and afternoon groups at random, the respective differences between them might be the same in each case; both means in the selective school might be higher, but the difference between the two groups could be the same as the difference in the comprehensive. However, it is unlikely that the standard deviations would be the same. The spread of scores found within the highly selected group would be much less than that in a true cross-section of the population, as for example in the mixed-ability comprehensive class. This, of course, would have a substantial impact on the calculation of the effect size. With the highly restricted range found in the selective school, the effect size would be much larger than that found in the comprehensive.

Ideally, in calculating effect size one should use the standard deviation of the full population, in order to make comparisons fair. However, there will be many cases in which unrestricted values are not available, either in practice or in principle. For example, in considering the effect of an intervention with university students, or with pupils with reading difficulties, one must remember that these are restricted populations. In reporting the effect size, one should draw attention to this fact; if the amount of restriction can be quantified it may be possible to make allowance for it. Any comparison with effect sizes calculated from a full-range population must be made with great caution, if at all.
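To give a feel for the size of this distortion, the sketch below assumes a Normally distributed outcome and a selective group made up of the top fraction of the population on that same outcome, and uses the standard variance formula for a truncated Normal distribution. The raw difference of 0.3 population standard deviations and the selection fractions are invented for illustration.

```python
from statistics import NormalDist


def sd_of_top_fraction(fraction):
    """SD of the top `fraction` (0 < fraction < 1) of a standard Normal
    population, from the truncated-Normal variance formula."""
    std = NormalDist()
    cut = std.inv_cdf(1 - fraction)    # selection threshold as a z-score
    ratio = std.pdf(cut) / fraction    # mean of the selected group
    return (1 + cut * ratio - ratio ** 2) ** 0.5


raw_difference = 0.3  # hypothetical, in full-population SD units
for fraction in (0.5, 0.2):
    sd = sd_of_top_fraction(fraction)
    print(f"top {fraction:.0%}: SD = {sd:.2f}, effect size = {raw_difference / sd:.2f}")
```

Under these assumptions the same raw difference produces an apparent effect size roughly twice as large in the top fifth of the population as it would against the full-range standard deviation.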

Non-Normal distributions

The interpretations of effect sizes given in Table I depend on the assumption that both control and experimental groups have a 'Normal' distribution, i.e. the familiar 'bell-shaped' curve, shown, for example, in Figure 1. Needless to say, if this assumption is not true then the interpretation may be altered, and in particular, it may be difficult to make a fair comparison between an effect size based on Normal distributions and one based on non-Normal distributions.

[Figure 2: Comparison of Normal and non-Normal distributions. Curves shown: Standard Normal Distribution (S.D. = 1); similar-looking distribution with fatter extremes (S.D. = 3.3). Horizontal axis from -4 to 4.]

An illustration of this is given in Figure 2, which shows the frequency curves for two distributions, one of them Normal, the other a 'contaminated normal' distribution (Wilcox, 1998), which is similar in shape, but with somewhat fatter extremes. In fact, the latter does look just a little more spread-out than the Normal distribution, but its standard deviation is actually over three times as big. The consequence of this in terms of effect size differences is shown in Figure 3. Both graphs show distributions that differ by an effect size equal to 1, but the appearance of the effect size difference from the graphs is rather dissimilar. In graph (b), the separation between experimental and control groups seems much larger, yet the effect size is actually the same as for the Normal distributions plotted in graph (a). In terms of the amount of overlap, in graph (b) 97% of the 'experimental' group are above the control group mean, compared with the value of 84% for the Normal distribution of graph (a) (as given in Table I). This is quite a substantial difference and illustrates the danger of using the values in Table I when the distribution is not known to be Normal.

[Figure 3: Normal and non-Normal distributions with effect size = 1. Panels (a) and (b).]
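The exact distribution plotted in the figures is not specified in the text. A rough sketch of the comparison can be made by assuming a contaminated normal that mixes 90% of a standard Normal with 10% of a Normal whose standard deviation is 10; this mixture has a standard deviation of about 3.3, matching Figure 2, but the mixture weights are an assumption. Under it, the share of the 'experimental' group above the control mean comes out at roughly 96%, close to but not identical with the 97% quoted above, so the paper's distribution may differ slightly.

```python
import math
from statistics import NormalDist

# Assumed mixture: 90% N(0, 1) plus 10% N(0, 10^2).
CORE_WEIGHT, TAIL_WEIGHT, TAIL_SD = 0.9, 0.1, 10.0


def mixture_sd():
    """Standard deviation of the assumed contaminated normal."""
    return math.sqrt(CORE_WEIGHT * 1.0 + TAIL_WEIGHT * TAIL_SD ** 2)


def mixture_cdf(x):
    """Cumulative probability of the assumed contaminated normal."""
    return (CORE_WEIGHT * NormalDist(0, 1).cdf(x)
            + TAIL_WEIGHT * NormalDist(0, TAIL_SD).cdf(x))


def share_above_control_mean(d):
    """Share of the experimental group above the control group mean when
    both follow the mixture and differ by d of its standard deviations."""
    return 1 - mixture_cdf(-d * mixture_sd())


print(round(mixture_sd(), 1))                   # 3.3
print(round(share_above_control_mean(1.0), 2))  # 0.96
print(round(NormalDist().cdf(1.0), 2))          # 0.84 for Normal groups
```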

Measurement reliability

A third factor that can spuriously affect an effect size is the reliability of the measurement on which it is based. According to classical measurement theory, any measure of a particular outcome may be considered to consist of the 'true' underlying value, together with a component of 'error'. The problem is that the amount of variation in measured scores for a particular sample (i.e. its standard deviation) will depend on both the variation in underlying scores and the amount of error in their measurement.

To give an example, imagine the time-of-day experiment were conducted twice with two (hypothetically) identical samples of students. In the first version the test used to assess their comprehension consisted of just 10 items and their scores were converted into a percentage. In the second version a test with 50 items was used, and again converted to a percentage. The two tests were of equal difficulty and the actual effect of the difference in time-of-day was the same in each case, so the respective mean percentages of the morning and afternoon groups were the same for both versions. However, it is almost always the case that a longer test will be more reliable, and hence the standard deviation of the percentages on the 50 item test will be lower than the standard deviation for the 10 item test. Thus, although the true effect was the same, the calculated effect sizes will be different.

In interpreting an effect size, it is therefore important to know the reliability of the measurement from which it was calculated. This is one reason why the reliability of any outcome measure used should be reported. It is theoretically possible to make a correction for unreliability (sometimes called 'attenuation'), which gives an estimate of what the effect size would have been, had the reliability of the test been perfect. However, in practice the effect of this is rather alarming, since the worse the test was, the more you increase the estimate of the effect size. Moreover, estimates of reliability are dependent on the particular population in which the test was used, and are themselves anyway subject to sampling error. For further discussion of the impact of reliability on effect sizes, see Baugh (2002).
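A small sketch can make the 10-item versus 50-item example concrete. It assumes the classical definition of reliability as the ratio of true-score variance to observed variance (so the observed effect size is the true one multiplied by the square root of the reliability) and uses the Spearman-Brown formula for the reliability of a lengthened test. The starting reliability of 0.6 and the true effect size of 1.0 are invented, and the function names are illustrative.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability of a test made length_factor times longer."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def observed_effect_size(true_d, reliability):
    """Effect size seen on an unreliable measure, where true_d is
    expressed in standard deviations of the error-free scores."""
    return true_d * reliability ** 0.5


def corrected_effect_size(observed_d, reliability):
    """Correction for attenuation: estimate of the error-free effect size."""
    return observed_d / reliability ** 0.5


short_test = 0.6                            # hypothetical 10-item reliability
long_test = spearman_brown(short_test, 5)   # same items, five times as many
for label, rel in (("10 items", short_test), ("50 items", long_test)):
    print(label, round(rel, 2), round(observed_effect_size(1.0, rel), 2))
```

The correction simply divides by the square root of the reliability, which shows why a poor test leads to the largest upward adjustment.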


8. Are there alternative measures of effect size?

A number of statistics are sometimes proposed as alternative measures of effect size, other than the 'standardised mean difference'. Some of these will be considered here.

Proportion of variance accounted for

If the correlation between two variables is 'r', the square of this value (often denoted with a capital letter: $R^2$) represents the proportion of the variance in each that is 'accounted for' by the other. In other words, this is the proportion by which the variance of the outcome measure is reduced when it is replaced by the variance of the residuals from a regression equation. This idea can be extended to multiple regression (where it represents the proportion of the variance accounted for by all the independent variables together) and has close analogies in ANOVA (where it is usually called 'eta-squared', $\eta^2$). The calculation of r (and hence $R^2$) for the kind of experimental situation we have been considering has already been referred to above.

Because R² has this ready convertibility, it (or alternative measures of variance accounted for) is sometimes advocated as a universal measure of effect size (e.g. Thompson, 1999). One disadvantage of such an approach is that effect size measures based on variance accounted for suffer from a number of technical limitations, such as sensitivity to violation of assumptions (heterogeneity of variance, balanced designs), and their standard errors can be large (Olejnik and Algina, 2000). They are also generally more statistically complex and hence perhaps less easily understood. Further, they are non-directional: two studies with precisely opposite results would report exactly the same variance accounted for. However, there is a more fundamental objection to the use of what is essentially a measure of association to indicate the strength of an 'effect'.
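The non-directional property is easy to see in a small sketch. It assumes the commonly used conversion between a standardised mean difference d and a correlation r for two groups of equal size, r = d / sqrt(d² + 4); the function names and the example effect sizes are illustrative.

```python
import math


def d_to_r(d):
    """Convert a standardised mean difference to a correlation.

    Assumes two groups of equal size: r = d / sqrt(d^2 + 4).
    """
    return d / math.sqrt(d * d + 4)


def variance_accounted_for(d):
    """Proportion of variance accounted for (r squared) implied by d."""
    return d_to_r(d) ** 2


# Two hypothetical studies with precisely opposite results.
for d in (0.5, -0.5):
    print(f"d = {d:+.1f}: r = {d_to_r(d):+.3f}, R^2 = {variance_accounted_for(d):.3f}")
```

Both studies give the same R², although r (like d) keeps the sign of the difference.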

Expressing different measures in terms of the same statistic can hide important differences between them; in fact, these different 'effect sizes' are fundamentally different, and should not be confused. The crucial difference between an effect size calculated from an experiment and one calculated from a correlation is in the causal nature of the claim that is being made for it. Moreover, the word 'effect' has an inherent implication of causality: talking about 'the effect of A on B' does suggest a causal relationship rather than just an association. Unfortunately, however, the word 'effect' is often used when no explicit causal claim is being made, but its implication is sometimes allowed to float in and out of the meaning, taking advantage of the ambiguity to suggest a subliminal causal link where none is really justified.

This kind of confusion is so widespread in education that it is recommended here that the word 'effect' (and therefore 'effect size') should not be used unless a deliberate and explicit causal claim is being made. When no such claim is being made, we may talk about the 'variance accounted for' (R²) or the 'strength of association' (r), or simply (and perhaps most informatively) just cite the regression coefficient (Tukey, 1969). If a causal claim is being made it should be explicit and justification provided. Fitz-Gibbon (2002) has recommended an alternative approach to this problem. She has suggested a system of nomenclature for different kinds of effect sizes that clearly distinguishes between effect sizes derived from, for example, randomised-controlled, quasi-experimental and correlational studies.

Other measures of effect size

It has been shown that the interpretation of the 'standardised mean difference' measure of effect size is very sensitive to violations of the assumption of normality. For this reason, a number of more robust (non-parametric) alternatives have been suggested. An example of these is given by Cliff (1993). There are also effect size measures for multivariate outcomes. A detailed explanation can be found in Olejnik and Algina (2000). Finally, a method for calculating effect sizes within multilevel models has been proposed by Tymms et al. (1997). Good summaries of many of the different kinds of effect size measures that can be used and the relationships among them can be found in Snyder and Lawson (1993), Rosenthal (1994) and Kirk (1996).

Finally, a common effect size measure widely used in medicine is the 'odds ratio'. This is appropriate where an outcome is dichotomous: success or failure, a patient survives or does not. Explanations of the odds ratio can be found in a number of medical statistics texts, including Altman (1991), and in Fleiss (1994).
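For a dichotomous outcome the calculation is short enough to sketch directly. The counts below are hypothetical, and the function name is illustrative.

```python
def odds_ratio(success_treated, failure_treated, success_control, failure_control):
    """Odds of success in the treated group divided by the odds in the control group."""
    counts = (success_treated, failure_treated, success_control, failure_control)
    if any(count <= 0 for count in counts):
        raise ValueError("all four cell counts must be positive")
    odds_treated = success_treated / failure_treated
    odds_control = success_control / failure_control
    return odds_treated / odds_control


# Hypothetical 2 x 2 table: 30 of 40 treated cases succeed, 20 of 40 controls succeed.
print(odds_ratio(30, 10, 20, 20))  # 3.0
```

An odds ratio of 1 indicates no difference between the groups; values above 1 favour the treated group and values below 1 favour the control group.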

Conclusions

Advice on the use of effect sizes can be summarised as follows:

Effect size is a standardised, scale-free measure of the relative size of the effect of an intervention. It is particularly useful for quantifying effects measured on unfamiliar or arbitrary scales and for comparing the relative sizes of effects from different studies.

Interpretation of effect size generally depends on the assumptions that 'control' and 'experimental' group values are Normally distributed and have the same standard deviations. Effect sizes can be interpreted in terms of the percentiles or ranks at which two distributions overlap, in terms of the likelihood of identifying the source of a value, or with reference to known effects or outcomes.

Use of an effect size with a confidence interval conveys the same information as a test of statistical significance, but with the emphasis on the significance of the effect, rather than the sample size.

Effect sizes (with confidence intervals) should be calculated and reported in primary studies as well as in meta-analyses.

Interpretation of standardised effect sizes can be problematic when a sample has restricted range or does not come from a Normal distribution, or if the measurement from which it was derived has unknown reliability.

The use of an 'unstandardised' mean difference (i.e. the raw difference between the two groups, together with a confidence interval) may be preferable when:

    the outcome is measured on a familiar scale
    the sample has a restricted range
    the parent population is significantly non-Normal
    control and experimental groups have appreciably different standard deviations
    the outcome measure has very low or unknown reliability

Care must be taken in comparing or aggregating effect sizes based on different outcomes, different operationalisations of the same outcome, different treatments, or levels of the same treatment, or measures derived from different populations.

The word 'effect' conveys an implication of causality, and the expression 'effect size' should therefore not be used unless this implication is intended and can be justified.
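The advice to report an effect size together with its confidence interval can be sketched as follows. The sketch assumes the large-sample approximation to the standard error of d given by Hedges and Olkin (1985) and a Normal sampling distribution; the function name and the example values (d = 0.8 with 19 pupils in each group) are illustrative.

```python
import math
from statistics import NormalDist


def effect_size_confidence_interval(d, n_experimental, n_control, level=0.95):
    """Approximate confidence interval for a standardised mean difference.

    Assumes the large-sample standard error
    sqrt((nE + nC) / (nE * nC) + d^2 / (2 * (nE + nC)))
    and a Normal sampling distribution for d.
    """
    if n_experimental < 1 or n_control < 1:
        raise ValueError("group sizes must be positive")
    if not 0 < level < 1:
        raise ValueError("level must be between 0 and 1")
    total = n_experimental + n_control
    standard_error = math.sqrt(total / (n_experimental * n_control) + d * d / (2 * total))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d - z * standard_error, d + z * standard_error


lower, upper = effect_size_confidence_interval(0.8, 19, 19)
print(f"d = 0.80, 95% CI from {lower:.2f} to {upper:.2f}")
```

An interval that excludes zero corresponds to a statistically significant difference at the matching level, but the width of the interval also shows how precisely the size of the effect has been estimated.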

1 This calculation is derived from a probit transformation (Glass et al., 1981, p136), based on the assumption of an underlying normally distributed variable measuring academic attainment, some threshold of which is equivalent to a student achieving 5+ A*-Cs. Percentages for the change from a starting value of 50% for other effect size values can be read directly from Table I. Alternatively, if $\Phi(z)$ is the standard normal cumulative distribution function, $p_1$ is the proportion achieving a given threshold and $p_2$ the proportion to be expected after a change with effect size, $d$, then,

$$p_2 = \Phi\{\Phi^{-1}(p_1) + d\}$$
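The formula in this note can be evaluated with a few lines of code. The sketch below uses the standard normal distribution from the Python standard library; the function name and the example values are illustrative.

```python
from statistics import NormalDist


def proportion_after_change(p1, d):
    """Proportion expected to reach a threshold after a shift of d standard deviations.

    Implements p2 = Phi(Phi^-1(p1) + d), where Phi is the standard normal
    cumulative distribution function. Requires 0 < p1 < 1.
    """
    if not 0 < p1 < 1:
        raise ValueError("p1 must be strictly between 0 and 1")
    standard_normal = NormalDist()
    return standard_normal.cdf(standard_normal.inv_cdf(p1) + d)


# Hypothetical example: 50% reach the threshold before a change with d = 0.6.
p2 = proportion_after_change(0.50, 0.6)
print(f"{p2:.1%} expected to reach the threshold after the change")
```

With a starting value of 50% the inverse term is zero, so the result is simply the value of the cumulative distribution function at d.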


References

ALTMAN, D.G. (1991) Practical Statistics for Medical Research. London: Chapman and Hall.

BANGERT, R.L., KULIK, J.A. AND KULIK, C.C. (1983) Individualised systems of instruction in secondary schools. Review of Educational Research, 53, 143-158.

BANGERT-DROWNS, R.L. (1988) The effects of school-based substance abuse education: a meta-analysis. Journal of Drug Education, 18, 3, 243-65.

BAUGH, F. (2002) Correcting effect sizes for score reliability: A reminder that measurement and substantive issues are linked inextricably. Educational and Psychological Measurement, 62, 2, 254-263.

CLIFF, N. (1993) Dominance Statistics: ordinal analyses to answer ordinal questions. Psychological Bulletin, 114, 3, 494-509.

COHEN, J. (1969) Statistical Power Analysis for the Behavioral Sciences. NY: Academic Press.

COHEN, J. (1994) The Earth is Round (p


HEDGES, L. AND OLKIN, I. (1985) Statistical Methods for Meta-Analysis. New York: Academic Press.

HEMBREE, R. (1988) Correlates, causes, effects and treatment of test anxiety. Review of Educational Research, 58(1), 47-77.

HUBERTY, C.J. (2002) A history of effect size indices. Educational and Psychological Measurement, 62, 2, 227-240.

HYMAN, R.B., FELDMAN, H.R., HARRIS, R.B., LEVIN, R.F. AND MALLOY, G.B. (1989) The effects of relaxation training on medical symptoms: a meta-analysis. Nursing Research, 38, 216-220.

KAVALE, K.A. AND FORNESS, S.R. (1983) Hyperactivity and diet treatment: a meta-analysis of the Feingold hypothesis. Journal of Learning Disabilities, 16, 324-330.

KESELMAN, H.J., HUBERTY, C.J., LIX, L.M., OLEJNIK, S., CRIBBIE, R.A., DONAHUE, B., KOWALCHUK, R.K., LOWMAN, L.L., PETOSKEY, M.D., KESELMAN, J.C. AND LEVIN, J.R. (1998) Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA, and ANCOVA analyses. Review of Educational Research, 68, 3, 350-386.

KIRK, R.E. (1996) Practical Significance: A concept whose time has come. Educational and Psychological Measurement, 56, 5, 746-759.

KULIK, J.A., KULIK, C.C. AND BANGERT, R.L. (1984) Effects of practice on aptitude and achievement test scores. American Educational Research Journal, 21, 435-447.

LEPPER, M.R., HENDERLONG, J. AND GINGRAS, I. (1999) Understanding the effects of extrinsic rewards on intrinsic motivation - Uses and abuses of meta-analysis: Comment on Deci, Koestner, and Ryan. Psychological Bulletin, 125, 6, 669-676.

LIPSEY, M.W. (1992) Juvenile delinquency treatment: a meta-analytic inquiry into the variability of effects. In T.D. Cook, H. Cooper, D.S. Cordray, H. Hartmann, L.V. Hedges, R.J. Light, T.A. Louis and F. Mosteller (Eds) Meta-analysis for explanation. New York: Russell Sage Foundation.

LIPSEY, M.W. AND WILSON, D.B. (1993) The Efficacy of Psychological, Educational, and Behavioral Treatment: Confirmation from meta-analysis. American Psychologist, 48, 12, 1181-1209.

MCGRAW, K.O. (1991) Problems with the BESD: a comment on Rosenthal's 'How Are We Doing in Soft Psychology'. American Psychologist, 46, 1084-6.

MCGRAW, K.O. AND WONG, S.P. (1992) A Common Language Effect Size Statistic. Psychological Bulletin, 111, 361-365.

MOSTELLER, F., LIGHT, R.J. AND SACHS, J.A. (1996) 'Sustained inquiry in education: lessons from skill grouping and class size.' Harvard Educational Review, 66, 797-842.

OAKES, M. (1986) Statistical Inference: A Commentary for the Social and Behavioral Sciences. New York: Wiley.

OLEJNIK, S. AND ALGINA, J. (2000) Measures of Effect Size for Comparative Studies: Applications, Interpretations and Limitations. Contemporary Educational Psychology, 25, 241-286.


ROSENTHAL, R. (1994) Parametric Measures of Effect Size, in H. Cooper and L.V. Hedges (Eds.), The Handbook of Research Synthesis. New York: Russell Sage Foundation.

ROSENTHAL, R. AND RUBIN, D.B. (1982) A simple, general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74, 166-169.

RUBIN, D.B. (1992) Meta-analysis: literature synthesis or effect-size surface estimation. Journal of Educational Statistics, 17, 4, 363-374.

SHYMANSKY, J.A., HEDGES, L.V. AND WOODWORTH, G. (1990) A reassessment of the effects of inquiry-based science curricula of the 60's on student performance. Journal of Research in Science Teaching, 27, 127-144.

SLAVIN, R.E. AND MADDEN, N.A. (1989) What works for students at risk? A research synthesis. Educational Leadership, 46(4), 4-13.

SMITH, M.L. AND GLASS, G.V. (1980) Meta-analysis of research on class size and its relationship to attitudes and instruction. American Educational Research Journal, 17, 419-433.

SNYDER, P. AND LAWSON, S. (1993) Evaluating Results Using Corrected and Uncorrected Effect Size Estimates. Journal of Experimental Education, 61, 4, 334-349.

STRAHAN, R.F. (1991) Remarks on the Binomial Effect Size Display. American Psychologist, 46, 1083-4.

THOMPSON, B. (1999) Common methodology mistakes in educational research, revisited, along with a primer on both effect sizes and the bootstrap. Invited address presented at the annual meeting of the American Educational Research Association, Montreal. [Accessed from, January 2000]

TYMMS, P., MERRELL, C. AND HENDERSON, B. (1997) The First Year at School: A Quantitative Investigation of the Attainment and Progress of Pupils. Educational Research and Evaluation, 3, 2, 101-118.

VINCENT, D. AND CRUMPLER, M. (1997) British Spelling Test Series Manual 3X/Y. Windsor: NFER-Nelson.

WANG, M.C. AND BAKER, E.T. (1986) Mainstreaming programs: Design features and effects. Journal of Special Education, 19, 503-523.

WILCOX, R.R. (1998) How many discoveries have been lost by ignoring modern statistical methods? American Psychologist, 53, 3, 300-314.

WILKINSON, L. AND TASK FORCE ON STATISTICAL INFERENCE, APA BOARD OF SCIENTIFIC AFFAIRS (1999) Statistical Methods in Psychology Journals: Guidelines and Explanations. American Psychologist, 54, 8, 594-604.
