Upload
rizki-amelia
View
29
Download
4
Embed Size (px)
DESCRIPTION
Effect Size
Citation preview
1
ItstheEffectSize,Stupid1Whateffectsizeisandwhyitisimportant
PaperpresentedattheBritishEducationalResearchAssociationannualconference,Exeter,1214September,2002
RobertCoe
SchoolofEducation,UniversityofDurham,LeazesRoad, [email protected]
AbstractEffectsizeisasimplewayofquantifyingthedifferencebetweentwogroups
thathasmanyadvantagesovertheuseoftestsofstatisticalsignificancealone.Effectsizeemphasisesthesizeofthedifferenceratherthanconfoundingthiswithsamplesize.However,primaryreportsrarelymentioneffectsizesandfewtextbooks,researchmethodscoursesorcomputerpackagesaddresstheconcept.Thispaperprovidesanexplicationofwhataneffectsizeis,howitiscalculatedandhowitcanbeinterpreted.Therelationshipbetweeneffectsizeandstatisticalsignificanceisdiscussedandtheuseofconfidenceintervalsforthelatteroutlined.Someadvantagesanddangersofusingeffectsizesinmetaanalysisarediscussedandotherproblemswiththeuseofeffectsizesareraised.Anumberofalternativemeasuresofeffectsizearedescribed.Finally,adviceontheuseofeffectsizesissummarised.
Effectsizeissimplyawayofquantifyingthesizeofthedifferencebetweentwogroups.Itiseasytocalculate,readilyunderstoodandcanbeappliedtoanymeasuredoutcomeinEducationorSocialScience.Itisparticularlyvaluableforquantifyingtheeffectivenessofaparticularintervention,relativetosomecomparison.Itallowsustomovebeyondthesimplistic,Doesitworkornot?tothefarmoresophisticated,Howwelldoesitworkinarangeofcontexts?Moreover,byplacingtheemphasisonthemostimportantaspectofanintervention thesizeoftheeffectratherthanitsstatisticalsignificance(whichconflateseffectsizeandsamplesize),itpromotesamorescientificapproachtotheaccumulationofknowledge.Forthesereasons,effectsizeisanimportanttoolinreportingandinterpretingeffectiveness.
Theroutineuseofeffectsizes,however,hasgenerallybeenlimitedtometaanalysisforcombiningandcomparingestimatesfromdifferentstudiesandisalltoorareinoriginalreportsofeducationalresearch(Keselman etal.,1998).Thisisdespitethefactthatmeasuresofeffectsizehavebeenavailableforatleast60years(Huberty,2002),andtheAmericanPsychologicalAssociationhasbeenofficiallyencouragingauthorstoreporteffectsizessince1994butwithlimitedsuccess(Wilkinson etal.,1999).Formulaeforthecalculationofeffectsizesdonotappearinmoststatisticstextbooks(otherthanthosedevotedtometaanalysis),arenotfeaturedinmanystatisticscomputerpackagesandareseldomtaughtinstandardresearchmethodscourses.Forthesereasons,eventheresearcherwhoisconvincedbythe
1Duringthe1992USPresidentialelectioncampaign,BillClintonsfortunesweretransformedwhenhisadvisorshelpedhimtofocusonthemainissuebywritingItstheeconomy,stupidonaboardtheyputinfrontofhimeverytimehewentouttospeak.
2
wisdomofusingmeasuresofeffectsize,andisnotafraidtoconfronttheorthodoxyofconventionalpractice,mayfindthatitisquitehardtoknowexactlyhowtodoso.
Thefollowingguideiswrittenfornonstatisticians,thoughinevitablysomeequationsandtechnicallanguagehavebeenused.Itdescribeswhateffectsizeis,whatitmeans,howitcanbeusedandsomepotentialproblemsassociatedwithusingit.
1.Whydoweneedeffectsize?ConsideranexperimentconductedbyDowson(2000)toinvestigatetimeof
dayeffectsonlearning:dochildrenlearnbetterinthemorningorafternoon?Agroupof38childrenwereincludedintheexperiment.Halfwererandomlyallocatedtolistentoastoryandanswerquestionsaboutit(ontape)at9am,theotherhalftohearexactlythesamestoryandanswerthesamequestionsat3pm.Theircomprehensionwasmeasuredbythenumberofquestionsansweredcorrectlyoutof20.
Theaveragescorewas15.2forthemorninggroup,17.9fortheafternoongroup:adifferenceof2.7.Buthowbigadifferenceisthis?Iftheoutcomeweremeasuredonafamiliarscale,suchasGCSEgrades,interpretingthedifferencewouldnotbeaproblem.Iftheaveragedifferencewere,say,halfagrade,mostpeoplewouldhaveafairideaoftheeducationalsignificanceoftheeffectofreadingastoryatdifferenttimesofday.However,inmanyexperimentsthereisnofamiliarscaleavailableonwhichtorecordtheoutcomes.Theexperimenteroftenhastoinventascaleortouse(oradapt)analreadyexistingonebutgenerallynotonewhoseinterpretationwillbefamiliartomostpeople.
(a) (b)
Figure1
Onewaytogetoverthisproblemistousetheamountofvariationinscorestocontextualisethedifference.Iftherewerenooverlapatallandeverysinglepersonintheafternoongrouphaddonebetteronthetestthaneveryoneinthemorninggroup,thenthiswouldseemlikeaverysubstantialdifference.Ontheotherhand,ifthespreadofscoreswerelargeandtheoverlapmuchbiggerthanthedifferencebetweenthegroups,thentheeffectmightseemlesssignificant.Becausewehaveanideaoftheamountofvariationfoundwithinagroup,wecanusethisasayardstickagainstwhichtocomparethedifference.Thisideaisquantifiedinthecalculationoftheeffectsize.TheconceptisillustratedinFigure1,whichshowstwopossiblewaysthedifferencemightvaryinrelationtotheoverlap.Ifthedifferencewereasingraph(a)itwouldbeverysignificantingraph(b),ontheotherhand,thedifferencemighthardlybenoticeable.
3
2.Howisitcalculated?Theeffectsizeisjustthestandardisedmeandifferencebetweenthetwo
groups.Inotherwords:
EffectSize=
Equation1
Ifitisnotobviouswhichoftwogroupsistheexperimental(i.e.theonewhichwasgiventhenewtreatmentbeingtested)andwhichthecontrol(theonegiventhestandardtreatment ornotreatment forcomparison),thedifferencecanstillbecalculated.Inthiscase,theeffectsizesimplymeasuresthedifferencebetweenthem,soitisimportantinquotingtheeffectsizetosaywhichwayroundthecalculationwasdone.
Thestandarddeviationisameasureofthespreadofasetofvalues.Hereitreferstothestandarddeviationofthepopulationfromwhichthedifferenttreatmentgroupsweretaken.Inpractice,however,thisisalmostneverknown,soitmustbeestimatedeitherfromthestandarddeviationofthecontrolgroup,orfromapooledvaluefrombothgroups(seequestion7,below,formorediscussionofthis).
InDowsonstimeofdayeffectsexperiment,thestandarddeviation(SD)=3.3,sotheeffectsizewas(17.915.2)/3.3=0.8.
3.Howcaneffectsizesbeinterpreted?Onefeatureofaneffectsizeisthatitcanbedirectlyconvertedintostatements
abouttheoverlapbetweenthetwosamplesintermsofacomparisonofpercentiles.AneffectsizeisexactlyequivalenttoaZscoreofastandardNormal
distribution.Forexample,aneffectsizeof0.8meansthatthescoreoftheaveragepersonintheexperimental groupis0.8standarddeviationsabovetheaveragepersoninthecontrolgroup,andhenceexceedsthescoresof79%ofthecontrolgroup.Withthetwogroupsof19inthetimeofdayeffectsexperiment,theaveragepersonintheafternoongroup(i.e.theonewhowouldhavebeenranked10th inthegroup)wouldhavescoredaboutthesameasthe4thhighestpersoninthemorninggroup.Visualisingthesetwoindividualscangivequiteagraphicinterpretationofthedifferencebetweenthetwoeffects.
TableIshowsconversionsofeffectsizes(column1)topercentiles(column2)andtheequivalentchangeinrankorderforagroupof25(column3).Forexample,foraneffectsizeof0.6,thevalueof73%indicatesthattheaveragepersonintheexperimentalgroupwouldscorehigherthan73%ofacontrolgroupthatwasinitiallyequivalent.Ifthegroupconsistedof25people,thisisthesameassayingthattheaverageperson(i.e.ranked13th inthegroup)wouldnowbeonapar withthepersonranked7th inthecontrolgroup.Noticethataneffectsizeof1.6wouldraisetheaveragepersontobelevelwiththetoprankedindividualinthecontrolgroup,soeffectsizeslargerthanthisareillustratedintermsofthetoppersoninalargergroup.Forexample,aneffectsizeof3.0wouldbringtheaveragepersoninagroupof740levelwiththepreviouslytoppersoninthegroup.
[Meanofexperimentalgroup] [Meanofcontrolgroup]
StandardDeviation
4
TableI: Interpretationsofeffectsizes
EffectSize
Percentageofcontrolgroupwhowouldbebelowaveragepersonin
experimentalgroup
Rankofpersoninacontrol
groupof25whowouldbe
equivalenttotheaveragepersonin
experimentalgroup
Probabilitythatyoucouldguesswhichgroupapersonwasin
fromknowledgeoftheirscore.
Equivalentcorrelation,r(=Differenceinpercentage
successfulineachofthetwogroups,BESD)
Probabilitythatpersonfromexperimentalgroupwillbehigherthanpersonfrom
control,ifbothchosenatrandom(=CLES)
0.0 50% 13th 0.50 0.00 0.50
0.1 54% 12th 0.52 0.05 0.53
0.2 58% 11th 0.54 0.10 0.56
0.3 62% 10th 0.56 0.15 0.58
0.4 66% 9th 0.58 0.20 0.61
0.5 69% 8th 0.60 0.24 0.64
0.6 73% 7th 0.62 0.29 0.66
0.7 76% 6th 0.64 0.33 0.69
0.8 79% 6th 0.66 0.37 0.71
0.9 82% 5th 0.67 0.41 0.74
1.0 84% 4th 0.69 0.45 0.76
1.2 88% 3rd 0.73 0.51 0.80
1.4 92% 2nd 0.76 0.57 0.84
1.6 95% 1st 0.79 0.62 0.87
1.8 96% 1st 0.82 0.67 0.90
2.0 98% 1st (or1stoutof
44)0.84 0.71 0.92
2.5 99% 1st (or1stoutof
160)0.89 0.78 0.96
3.0 99.9% 1st (or1stoutof
740)0.93 0.83 0.98
Anotherwaytoconceptualisetheoverlapisintermsoftheprobabilitythatonecouldguesswhichgroupapersoncamefrom,basedonlyontheirtestscore orwhatevervaluewasbeingcompared.Iftheeffectsizewere0(i.e.thetwogroupswerethesame)thentheprobabilityofacorrectguesswouldbeexactlyahalf or0.50.Withadifferencebetweenthetwogroupsequivalenttoaneffectsizeof0.3,thereisstillplentyofoverlap,andtheprobabilityofcorrectly identifyingthegroupsrisesonlyslightlyto0.56.Withaneffectsizeof1,theprobabilityisnow0.69,justoveratwothirdschance.Theseprobabilitiesareshowninthefourthcolumnof TableI.Itisclearthattheoverlapbetweenexperimentalandcontrolgroupsissubstantial(andthereforetheprobabilityisstillcloseto0.5),evenwhentheeffectsizeisquitelarge.
5
Aslightlydifferentwaytointerpreteffectsizesmakesuseofanequivalencebetweenthestandardisedmeandifference(d)andthecorrelationcoefficient,r.Ifgroupmembershipiscodedwithadummyvariable(e.g.denotingthecontrolgroupby0andtheexperimentalgroupby1)andthecorrelationbetweenthisvariableandtheoutcomemeasurecalculated,avalueof rcanbederived.Bymakingsomeadditionalassumptions,onecanreadilyconvertd intoringeneral,usingtheequationr2= d2 / (4+d2)(seeCohen,1969,pp2022forotherformulaeandconversiontable).RosenthalandRubin(1982)takeadvantageofaninterestingpropertyof rtosuggestafurtherinterpretation,whichtheycallthebinomialeffectsizedisplay(BESD).Iftheoutcomemeasureisreducedtoasimpledichotomy(forexample,whetherascoreisaboveorbelowaparticularvaluesuchasthemedian,whichcouldbethoughtofassuccessorfailure),rcanbeinterpretedasthedifferenceintheproportionsineachcategory.Forexample,aneffectsizeof0.2indicatesadifferenceof0.10intheseproportions,aswouldbethecaseif45%ofthecontrolgroupand55%ofthetreatmentgrouphadreachedsomethresholdofsuccess.Note,however,thatiftheoverallproportionsuccessfulisnotcloseto50%,thisinterpretationcanbesomewhatmisleading(Strahan,1991McGraw,1991).ThevaluesfortheBESDareshownincolumn5.
Finally,McGrawandWong(1992)havesuggestedaCommonLanguageEffectSize(CLES)statistic,whichtheyargueisreadilyunderstoodbynonstatisticians(shownincolumn6ofTableI).Thisistheprobabilitythatascoresampledatrandomfromonedistributionwillbegreaterthanascoresampledfromanother.Theygivetheexampleoftheheightsofyoungadultmalesandfemales,whichdifferbyaneffectsizeofabout2,andtranslatethisdifferencetoaCLESof0.92.Inotherwordsin92outof100blinddatesamongyoungadults,themalewillbetallerthanthefemale(p361).
ItshouldbenotedthatthevaluesinTableIdependontheassumptionofaNormaldistribution.Theinterpretationof effectsizesintermsofpercentilesisverysensitivetoviolationsofthisassumption(seequestion7,below).
Anotherwaytointerpreteffectsizesistocomparethemtotheeffectsizesofdifferencesthatarefamiliar.Forexample,Cohen(1969,p23)describesaneffectsizeof0.2assmallandgivestoillustrateittheexamplethatthedifferencebetweentheheightsof15yearoldand16yearoldgirlsintheUScorrespondstoaneffectofthissize.Aneffectsizeof0.5isdescribedasmedium andislargeenoughtobevisibletothenakedeye.A0.5effectsizecorrespondstothedifferencebetweentheheightsof14yearoldand18yearoldgirls.Cohendescribesaneffectsizeof0.8asgrosslyperceptibleandthereforelargeandequatesittothedifferencebetweentheheightsof13yearoldand18yearoldgirls.AsafurtherexamplehestatesthatthedifferenceinIQbetweenholdersofthePh.D.degreeandtypicalcollegefreshmeniscomparabletoaneffectsizeof0.8.
Cohendoesacknowledgethedangerofusingtermslikesmall,mediumandlargeoutofcontext.Glassetal.(1981,p104)areparticularlycriticalofthisapproach,arguingthattheeffectivenessofaparticularinterventioncanonlybeinterpretedinrelation tootherinterventionsthatseektoproducethesameeffect.Theyalsopointoutthatthepracticalimportanceofaneffectdependsentirelyonitsrelativecostsandbenefits.Ineducation,ifitcouldbeshownthatmakingasmallandinexpensivechangewouldraiseacademicachievementbyaneffectsizeofevenaslittleas0.1,thenthiscouldbeaverysignificantimprovement,particularlyiftheimprovementapplieduniformlytoallstudents,andevenmoresoiftheeffectwerecumulativeovertime.
6
TableII: Examplesofaverageeffectsizesfromresearch
Intervention OutcomeEffectSize
Source
Studentstestperformanceinreading 0.30Reducingclasssizefrom23to15 Studentstestperformanceinmaths 0.32
FinnandAchilles,(1990)
Attitudesofstudents 0.47Small(
7
inaspellingagefrom11to12correspondstoaneffectsizeofabout0.3,butseemstovaryaccordingtotheparticulartestused.
InEngland,thedistributionofGCSEgradesincompulsorysubjects(i.e.MathsandEnglish)havestandarddeviationsofbetween1.5 1.8 grades,soanimprovementofoneGCSEgraderepresentsaneffectsizeof0.50.7.Inthecontextofsecondaryschoolstherefore,introducingachangeinpracticewhoseeffectsizewasknowntobe0.6wouldresultinanimprovementofaboutaGCSEgradeforeachpupilineachsubject.Foraschoolinwhich50%ofpupilswerepreviouslygainingfiveormoreA*Cgrades,thispercentage(otherthingsbeingequal,andassumingthattheeffectappliedequallyacrossthewholecurriculum)wouldriseto73%.1 EvenCohenssmalleffectof0.2wouldproduceanincreasefrom50%to58% adifferencethatmostschoolswouldprobablycategoriseasquitesubstantial.OlejnikandAlgina(2000)giveasimilarexamplebasedontheIowaTestofBasicSkills
Finally,theinterpretationofeffectsizescanbegreatlyhelpedbyafewexamplesfromexistingresearch.TableIIlistsaselectionofthese,manyofwhicharetakenfromLipseyandWilson(1993).Theexamplescitedaregivenforillustrationoftheuseof effectsizemeasurestheyarenotintendedtobethedefinitivejudgementontherelativeefficacyofdifferentinterventions.Ininterpretingthem,therefore,oneshouldbearinmindthatmostofthemetaanalysesfromwhichtheyarederivedcanbe(andoftenhavebeen)criticisedforavarietyofweaknesses,thattherangeofcircumstancesinwhichtheeffectshavebeenfoundmaybelimited,andthattheeffectsizequotedisanaveragewhichisoftenbasedonquitewidelydifferingvalues.
ItseemstobeafeatureofeducationalinterventionsthatveryfewofthemhaveeffectsthatwouldbedescribedinCohensclassificationasanythingotherthansmall.Thisappearsparticularlysoforeffectsonstudentachievement.Nodoubtthisispartlyaresultofthewidevariationfoundinthepopulationasawhole,againstwhichthemeasureofeffectsizeiscalculated.Onemightalsospeculatethatachievementishardertoinfluencethanotheroutcomes,perhapsbecausemostschoolsarealreadyusingoptimalstrategies,orbecausedifferentstrategiesarelikelytobeeffectiveindifferentsituationsacomplexitythatisnotwellcapturedbyasingleaverageeffectsize.
4.Whatistherelationshipbetweeneffectsizeandsignificance?Effectsizequantifiesthesizeofthedifferencebetweentwogroups,andmay
thereforebesaidtobeatruemeasureofthesignificanceofthedifference.If,forexample,theresultsofDowsonstimeofdayeffectsexperimentwerefoundtoapplygenerally,wemightaskthequestion:Howmuchdifferencewoulditmaketochildrenslearningiftheyweretaughtaparticulartopicintheafternooninsteadofthemorning?Thebestanswerwecouldgivetothiswouldbeintermsoftheeffectsize.
However,instatisticsthewordsignificanceisoftenusedtomeanstatisticalsignificance,whichisthelikelihoodthatthedifferencebetweenthetwogroupscouldjustbeanaccidentofsampling.Ifyoutaketwosamplesfromthesamepopulationtherewillalwaysbeadifferencebetweenthem.Thestatisticalsignificanceisusuallycalculatedasapvalue,theprobabilitythatadifferenceofatleastthesamesizewouldhavearisenbychance,eveniftherereallywerenodifferencebetweenthetwopopulations.Fordifferencesbetweenthemeansoftwogroups,thispvaluewouldnormallybecalculatedfromattest.Byconvention,ifp
8
ofthesample.Onewouldgetasignificantresulteitheriftheeffectwereverybig(despitehavingonlyasmallsample)orifthesamplewereverybig(eveniftheactualeffectsizeweretiny).Itisimportanttoknowthestatisticalsignificanceofaresult,sincewithoutitthereisadangerofdrawingfirmconclusionsfromstudieswherethesampleistoosmalltojustifysuchconfidence.However,statisticalsignificancedoesnot tellyouthemostimportantthing:thesizeoftheeffect.Onewaytoovercomethisconfusionistoreporttheeffectsize,togetherwithanestimateofitslikelymarginforerrororconfidenceinterval.
5.Whatisthemarginforerrorinestimatingeffectsizes?Clearly,ifaneffectsizeiscalculatedfromaverylargesampleitislikelytobe
moreaccuratethanonecalculatedfromasmallsample.Thismarginforerrorcanbequantifiedusingtheideaofaconfidenceinterval,whichprovidesthesameinformationasisusuallycontainedinasignificancetest:usinga95%confidenceintervalisequivalenttotakinga5%significancelevel.Tocalculatea95%confidenceinterval,youassumethatthevalueyougot(e.g.theeffectsizeestimateof0.8)isthetruevalue,butcalculatetheamountofvariationinthisestimateyouwouldgetifyourepeatedlytooknewsamplesofthesamesize(i.e.differentsamplesof38children).Forevery100ofthesehypotheticalnewsamples,bydefinition,95wouldgiveestimatesoftheeffectsizewithinthe95%confidenceinterval.Ifthisconfidenceintervalincludeszero,thenthatisthesameassayingthattheresultisnotstatisticallysignificant.If,ontheotherhand,zeroisoutsidetherange,thenitisstatisticallysignificantatthe5%level.Usingaconfidenceintervalisabetterwayofconveyingthisinformationsinceitkeepstheemphasisontheeffectsizewhichistheimportantinformation ratherthanthepvalue.
AformulaforcalculatingtheconfidenceintervalforaneffectsizeisgivenbyHedgesandOlkin(1985,p86).Iftheeffectsizeestimatefromthesampleisd,thenitisNormallydistributed,withstandarddeviation:
Equation2
(WhereNEandNCarethenumbersintheexperimentalandcontrolgroups,respectively.)
Hencea95%confidenceintervalfordwouldbefrom
d1.96 s[d] to d+1.96 s[d]Equation3
Tousethefiguresfromthetimeofdayexperimentagain,NE=NC=19andd=0.8,so s[d]= (0.105+0.008)=0.34.Hencethe95%confidenceintervalis[0.14, 1.46].Thiswouldnormallybeinterpreted(despitethefactthatsuchaninterpretationisnotstrictlyjustifiedseeOakes,1986foranenlighteningdiscussionofthis)asmeaningthatthetrueeffectoftimeofdayisverylikelytobebetween
9
0.14and1.46.Inotherwords,itisalmostcertainlypositive(i.e.afternoonisbetterthanmorning)andthedifferencemaywellbequitelarge.
6.Howcanknowledgeabouteffectsizesbecombined?Oneofthemainadvantagesofusingeffectsizeisthatwhenaparticular
experimenthasbeenreplicated,thedifferenteffectsizeestimatesfromeachstudycaneasilybecombinedtogiveanoverallbestestimateofthesizeoftheeffect.Thisprocess ofsynthesisingexperimentalresultsintoasingleeffectsizeestimateisknownasmetaanalysis.Itwasdevelopedinitscurrentformbyaneducationalstatistician,GeneGlass(SeeGlassetal.,1981)thoughtherootsofmetaanalysiscanbetracedagooddealfurtherback(seeLepperetal.,1999),andisnowwidelyused,notonlyineducation,butinmedicineandthroughoutthesocialsciences.AbriefandaccessibleintroductiontotheideaofmetaanalysiscanbefoundinFitzGibbon(1984).
Metaanalysis,however,candomuchmorethansimplyproduceanoverallaverageeffectsize,importantthoughthisoftenis.If,foraparticularintervention,somestudiesproducedlargeeffects,andsomesmalleffects,itwouldbeoflimitedvaluesimply tocombinethemtogetherandsaythattheaverageeffectwasmedium.Muchmoreusefulwouldbetoexaminetheoriginalstudiesforanydifferencesbetweenthosewithlargeandsmalleffectsandtotrytounderstandwhatfactorsmightaccountforthedifference.Thebestmetaanalysis,therefore,involvesseekingrelationshipsbetweeneffectsizesandcharacteristicsoftheintervention,thecontextandstudydesigninwhichtheywerefound(Rubin,1992seealsoLepperetal. (1999)foradiscussionof theproblemsthatcanbecreatedbyfailingtodothis,andsomeotherlimitationsoftheapplicabilityofmetaanalysis).
Theimportanceofreplicationingainingevidenceaboutwhatworkscannotbeoverstressed.InDowsonstimeofdayexperimenttheeffectwasfoundtobelargeenoughtobestatisticallyandeducationallysignificant.Becauseweknowthatthepupilswereallocatedrandomlytoeachgroup,wecanbeconfidentthatchanceinitialdifferencesbetweenthetwogroupsareveryunlikelytoaccountforthedifferenceintheoutcomes.Furthermore,theuseofapretestofbothgroupsbeforetheinterventionmakesthisevenlesslikely.However,wecannotruleoutthepossibilitythatthedifferencearosefromsomecharacteristicpeculiartothechildreninthisparticularexperiment.Forexample,ifnoneofthemhadhadanybreakfastthatday,thismightaccountforthepoorperformanceofthemorninggroup.However,theresultwouldthenpresumablynotgeneralisetothewiderpopulationofschoolstudents,mostofwhomwouldhavehadsomebreakfast.Alternatively,theeffectmightdependontheageofthestudents.Dowsonsstudentswereaged7or8itisquitepossiblethattheeffectcouldbediminishedorreversedwitholder(oryounger)students.Thisillustratesthedangerofimplementingpolicyonthebasisofasingleexperiment.Confidenceinthegeneralityofaresultcanonlyfollowwidespreadreplication.
Animportantconsequenceofthecapacityofmetaanalysistocombineresultsisthatevensmallstudiescanmakeasignificantcontributiontoknowledge.Thekindofexperimentthatcanbedonebyasingleteacherinaschoolmightinvolveatotaloffewerthan30students.Unlesstheeffectishuge,astudyofthissizeismostunlikelytogetastatisticallysignificantresult.Accordingtoconventionalstatisticalwisdom,therefore,theexperimentisnotworthdoing.However,iftheresultsofseveralsuchexperimentsarecombinedusingmetaanalysis,theoverallresultislikelytobehighlystatisticallysignificant.Moreover,itwillhavetheimportantstrengthsofbeingderivedfromarangeofcontexts(thusincreasingconfidenceinitsgenerality)andfromreallifeworkingpractice(therebymakingitmorelikelythatthepolicyisfeasibleandcanbeimplementedauthentically).
10
Onefinalcaveatshouldbemadehereaboutthedangerofcombiningincommensurableresults.Giventwo(ormore)numbers,onecanalwayscalculateanaverage.However,iftheyareeffectsizesfromexperimentsthatdiffersignificantlyintermsoftheoutcomemeasuresused,thentheresultmaybetotallymeaningless.Itcanbeverytempting,onceeffectsizeshavebeencalculated,totreatthemasallthesameandlosesightoftheirorigins. Certainly,thereareplentyofexamplesofmetaanalysesinwhichthejuxtapositionofeffectsizesissomewhatquestionable.
Incomparing(orcombining)effectsizes,oneshouldthereforeconsidercarefullywhethertheyrelatetothesameoutcomes.Thisadviceappliesnotonlytometaanalysis,buttoanyothercomparisonofeffectsizes.Moreover,becauseofthesensitivityofeffectsizeestimatestoreliabilityandrangerestriction(seebelow),oneshouldalsoconsiderwhetherthoseoutcomemeasuresarederivedfromthesame(orsufficientlysimilar)instrumentsandthesame(orsufficientlysimilar)populations.
Itisalsoimportanttocompareonlylikewithlikeintermsofthetreatmentsusedtocreatethedifferencesbeingmeasured.Intheeducationliterature,thesamenameisoftengiventointerventionsthatareactuallyverydifferent,forexample,iftheyareoperationaliseddifferently,oriftheyaresimplynotwellenoughdefinedforittobeclearwhethertheyarethesameornot.Itcouldalsobethatdifferentstudieshaveusedthesamewelldefinedandoperationalisedtreatments,buttheactualimplementationdiffered,orthatthesametreatmentmayhavehaddifferentlevelsofintensityindifferentstudies.Inanyofthesecases,itmakesnosensetoaverageouttheireffects.
7.Whatotherfactorscaninfluenceeffectsize?Althougheffectsizeisasimpleandreadilyinterpretedmeasureof
effectiveness,itcanalsobesensitivetoanumberofspuriousinfluences,sosomecareneedstobetakeninitsuse.Someoftheseissuesareoutlinedhere.
Whichstandarddeviation?Thefirstproblemistheissueofwhichstandarddeviationtouse.Ideally,the
controlgroupwillprovidethebestestimateofstandarddeviation,sinceitconsistsofarepresentativegroupofthepopulationwhohavenotbeenaffectedbytheexperimentalintervention.However,unlessthecontrolgroupisverylarge,theestimateofthetruepopulationstandarddeviationderivedfromonlythecontrolgroup islikelytobeappreciablylessaccuratethananestimatederivedfromboththecontrolandexperimentalgroups.Moreover,instudieswherethereisnotatruecontrolgroup(forexamplethetimeofdayeffectsexperiment)thenitmaybeanarbitrarydecisionwhichgroupsstandarddeviationtouse,anditwilloftenmakeanappreciabledifferencetotheestimateofeffectsize.
Forthesereasons,itisoftenbettertouseapooledestimateofstandarddeviation.Thepooledestimateisessentially anaverageofthestandarddeviationsoftheexperimentalandcontrolgroups(Equation 4).Notethatthisisnotthesameasthestandarddeviationofallthevaluesinbothgroupspooledtogether.If,forexampleeachgrouphadalowstandarddeviationbutthetwomeansweresubstantiallydifferent,thetruepooledestimate(ascalculatedby Equation 4)wouldbemuchlowerthanthevalueobtainedbypoolingallthevaluestogetherandcalculatingthestandarddeviation.TheimplicationsofchoicesaboutwhichstandarddeviationtousearediscussedbyOlejnikandAlgina(2000).
11
Equation4
(WhereNEandNCarethenumbersintheexperimentalandcontrolgroups,respectively,andSDEandSDCaretheirstandarddeviations.)
Theuseofapooledestimateofstandarddeviationdependsontheassumptionthatthetwocalculatedstandarddeviationsareestimatesof thesamepopulationvalue.Inotherwords,thattheexperimentalandcontrolgroupstandarddeviationsdifferonlyasaresultofsamplingvariation.Wherethisassumptioncannotbemade(eitherbecausethereissomereasontobelievethatthetwostandarddeviationsarelikelytobesystematicallydifferent,oriftheactualmeasuredvaluesareverydifferent),thenapooledestimateshouldnotbeused.
IntheexampleofDowsonstimeofdayexperiment,thestandarddeviationsforthemorningandafternoongroupswere4.12and2.10respectively.WithNE=NC=19,Equation2thereforegivesSDpooledas3.3,whichwasthevalueusedin Equation1 togiveaneffectsizeof0.8.However,thedifferencebetweenthetwostandarddeviationsseemsquitelargeinthiscase.Giventhattheafternoongroupmeanwas17.9outof20,itseemslikelythatitsstandarddeviationmayhavebeenreducedbyaceilingeffect i.e.thespreadofscoreswaslimitedbythemaximumavailablemarkof20.Inthiscasetherefore,itmightbemoreappropriatetousethemorninggroupsstandarddeviationasthebestestimate.Doingthiswillreducetheeffectsizeto0.7,anditthenbecomesasomewhatarbitrarydecisionwhichvalueoftheeffectsizetouse.Ageneralruleofthumbinstatisticswhentwovalidmethodsgivedifferentanswersis:Ifindoubt,citeboth.
CorrectionsforbiasAlthoughusingthepooledstandarddeviationtocalculatetheeffectsize
generallygivesabetterestimatethanthecontrolgroupSD,itisstillunfortunatelyslightlybiasedandingeneral givesavalueslightlylargerthanthetruepopulationvalue(HedgesandOlkin,1985).HedgesandOlkin(1985,p80)giveaformulawhichprovidesanapproximatecorrectiontothisbias.
InDowsonsexperimentwith38values,thecorrectionfactorwillbe0.98,soitmakesverylittledifference,reducingtheeffectsizeestimatefrom0.82to0.80.Giventhelikelyaccuracyofthefiguresonwhichthisisbased,itisprobablyonlyworthquotingonedecimalplace,sothefigureof0.8stands.Infact,thecorrectiononlybecomessignificantforsmallsamples,inwhichtheaccuracyisanywaymuchless.Itisthereforehardlyworthworryingaboutitinprimaryreportsofempiricalresults.However,inmetaanalysis,whereresultsfromprimarystudiesarecombined,thecorrectionisimportant,sincewithoutitthisbiaswouldbeaccumulated.
RestrictedrangeSupposethetimeofdayeffectsexperimentweretoberepeated,oncewiththe
topsetinahighlyselectiveschoolandagainwithamixedabilitygroupin acomprehensive.Ifstudentswereallocatedtomorningandafternoongroupsatrandom,therespectivedifferencesbetweenthemmightbethesameineachcasebothmeansintheselectiveschoolmightbehigher,butthedifferencebetweenthetwogroupscouldbethesameasthedifferenceinthecomprehensive.However,itisunlikelythatthestandarddeviationswouldbethesame.Thespreadofscoresfound
12
withinthehighlyselectedgroupwouldbemuchlessthanthatinatruecrosssectionofthepopulation,asforexampleinthemixedabilitycomprehensiveclass.This,ofcourse,wouldhaveasubstantialimpactonthecalculationoftheeffectsize.Withthehighlyrestrictedrangefoundintheselectiveschool,theeffectsizewouldbemuchlargerthanthatfoundinthecomprehensive.
Ideally,incalculatingeffectsizeoneshouldusethestandarddeviationofthefullpopulation,inordertomakecomparisonsfair.However,therewillbemanycasesinwhichunrestrictedvaluesarenotavailable,eitherinpracticeorinprinciple.Forexample,inconsideringtheeffectofaninterventionwithuniversitystudents,orwithpupilswithreadingdifficulties,onemustrememberthatthesearerestrictedpopulations.Inreportingtheeffectsize,oneshoulddrawattentiontothisfactiftheamountofrestrictioncanbequantifieditmaybepossibletomakeallowanceforit.Anycomparisonwitheffectsizescalculatedfromafullrangepopulationmustbemadewithgreatcaution,ifatall.
NonNormaldistributionsTheinterpretationsofeffectsizesgiveninTableIdependontheassumption
thatbothcontrolandexperimentalgroupshaveaNormaldistribution,i.e.thefamiliarbellshapedcurve,shown,forexample,inFigure1.Needlesstosay,ifthisassumptionisnottruethentheinterpretationmaybealtered,andinparticular,itmaybedifficulttomakeafaircomparisonbetweenaneffectsizebasedonNormaldistributionsandonebasedonnonNormaldistributions.
4 3 2 1 0 1 2 3 4
StandardNormalDistribution
(S.D.=1)
Similarlookingdistributionwithfatterextremes
(S.D.=3.3)
Figure2: ComparisonofNormalandnonNormaldistributions
AnillustrationofthisisgiveninFigure2,whichshowsthefrequencycurvesfortwodistributions,oneofthemNormal,theotheracontaminatednormaldistribution(Wilcox,1998),whichissimilarinshape,butwithsomewhatfatterextremes.Infact,thelatterdoeslookjustalittlemorespreadoutthantheNormaldistribution,butitsstandarddeviationisactuallyoverthreetimesasbig.TheconsequenceofthisintermsofeffectsizedifferencesisshowninFigure3.Bothgraphsshowdistributionsthatdifferbyaneffectsizeequalto1,buttheappearanceoftheeffectsizedifferencefromthegraphsisratherdissimilar.Ingraph(b),the
13
separationbetweenexperimentalandcontrolgroupsseemsmuchlarger,yettheeffectsizeisactuallythesameasfortheNormaldistributionsplottedingraph(a).Intermsoftheamountofoverlap,ingraph(b)97%ofthe'experimental'groupareabovethecontrolgroupmean,comparedwiththevalueof84%fortheNormaldistributionofgraph(a)(asgiveninTableI).ThisisquiteasubstantialdifferenceandillustratesthedangerofusingthevaluesinTableIwhenthedistributionisnotknowntobeNormal.
3 2 1 0 1 2 3 4 3 2 1 0 1 2 3 4 5 6
(a) (b)
Figure3: NormalandnonNormaldistributionswitheffectsize=1
MeasurementreliabilityAthirdfactorthatcanspuriouslyaffectaneffectsizeisthereliabilityofthe
measurementonwhichitisbased.Accordingtoclassicalmeasurementtheory,anymeasureofaparticularoutcomemaybeconsideredtoconsistofthetrueunderlyingvalue,togetherwithacomponentoferror.Theproblemisthattheamountofvariationinmeasuredscoresforaparticularsample(i.e.itsstandarddeviation)willdependonboththevariationinunderlyingscoresandtheamountoferrorintheirmeasurement.
Togiveanexample,imaginethetimeofdayexperimentwereconductedtwicewithtwo(hypothetically)identicalsamplesofstudents.Inthefirstversionthetestusedtoassesstheircomprehensionconsistedofjust10itemsandtheirscoreswereconvertedintoapercentage.Inthesecondversionatestwith50itemswasused,andagainconvertedtoapercentage.Thetwotestswereofequaldifficultyandtheactualeffectofthedifferenceintimeofdaywasthesameineachcase,sotherespectivemeanpercentagesofthemorningandafternoongroupswerethesameforbothversions.However,itisalmostalwaysthecasethatalongertestwillbemorereliable,andhencethestandarddeviationofthepercentagesonthe50itemtestwillbelowerthanthestandarddeviationforthe10itemtest.Thus,althoughthetrueeffectwasthesame,thecalculatedeffectsizeswillbedifferent.
Ininterpretinganeffectsize,itisthereforeimportanttoknowthereliabilityofthemeasurementfromwhichitwascalculated.Thisisonereasonwhythereliabilityofanyoutcomemeasureusedshouldbereported.Itistheoreticallypossibletomakeacorrectionforunreliability(sometimescalledattenuation),whichgivesanestimateofwhattheeffectsizewouldhavebeen,hadthereliabilityofthetestbeenperfect.However,inpracticetheeffectofthisisratheralarming,sincetheworsethetestwas,themoreyouincreasetheestimateoftheeffectsize.Moreover,estimatesofreliabilityaredependentontheparticularpopulationinwhichthetestwasused,andarethemselvesanywaysubjecttosamplingerror.Forfurtherdiscussionoftheimpactofreliabilityoneffectsizes,seeBaugh(2002).
14
8.Aretherealternativemeasuresofeffectsize?Anumberofstatisticsaresometimesproposedasalternativemeasuresof
effectsize,otherthanthestandardisedmeandifference.Someofthesewillbeconsideredhere.
ProportionofvarianceaccountedforIfthecorrelationbetweentwovariablesisr,thesquareofthisvalue(often
denotedwithacapitalletter:R2)representstheproportionofthevarianceineachthatisaccountedforbytheother.In otherwords,thisistheproportionbywhichthevarianceoftheoutcomemeasureisreducedwhenitisreplacedbythevarianceoftheresidualsfromaregressionequation.Thisideacanbeextendedtomultipleregression(whereitrepresentstheproportionofthevarianceaccountedforbyalltheindependentvariablestogether)andhascloseanalogiesinANOVA(whereitisusuallycalledetasquared, h2).Thecalculationof r(andhenceR2)forthekindofexperimentalsituationwehavebeenconsideringhasalreadybeenreferredtoabove.
BecauseR2hasthisreadyconvertibility,it(oralternativemeasuresofvarianceaccountedfor)issometimesadvocatedasauniversalmeasureofeffectsize(e.g.Thompson,1999).Onedisadvantageofsuchanapproachisthateffectsizemeasuresbasedonvarianceaccountedforsufferfromanumberoftechnicallimitations,suchassensitivitytoviolationofassumptions(heterogeneityofvariance,balanceddesigns)andtheirstandarderrorscanbelarge(OlejnikandAlgina,2000).Theyarealsogenerallymorestatisticallycomplexandhenceperhapslesseasilyunderstood.Further,theyarenondirectionaltwostudieswithpreciselyoppositeresultswouldreportexactlythesamevarianceaccountedfor.However,thereisamorefundamentalobjectiontotheuseofwhatisessentiallyameasureofassociationtoindicatethestrengthofaneffect.
Expressingdifferentmeasuresintermsofthesamestatisticcanhideimportantdifferencesbetweentheminfact,thesedifferenteffectsizesarefundamentallydifferent,andshouldnotbeconfused.Thecrucialdifferencebetweenaneffectsizecalculatedfromanexperimentandonecalculatedfromacorrelationisinthecausalnatureoftheclaimthatisbeingmadeforit.Moreover,thewordeffecthasaninherentimplicationofcausality:talkingabouttheeffectofAonBdoessuggestacausalrelationshipratherthanjustanassociation.Unfortunately,however,thewordeffectisoftenusedwhennoexplicitcausalclaimisbeingmade,butitsimplicationissometimesallowedtofloatinandoutofthemeaning,takingadvantageoftheambiguitytosuggestasubliminalcausallinkwherenoneisreallyjustified.
Thiskindofconfusionissowidespreadineducationthatitisrecommendedherethatthewordeffect(andthereforeeffectsize)shouldnotbeusedunlessadeliberateandexplicitcausalclaimisbeingmade.Whennosuchclaimisbeingmade,wemaytalkaboutthevarianceaccountedfor(R2)orthestrengthofassociation(r),orsimply andperhapsmostinformatively justcitetheregressioncoefficient(Tukey,1969).Ifacausalclaimisbeingmadeitshouldbeexplicitandjustificationprovided.FitzGibbon(2002)hasrecommendedanalternativeapproachtothisproblem.Shehassuggestedasystemofnomenclaturefordifferentkindsofeffectsizesthatclearlydistinguishesbetweeneffectsizesderivedfrom,forexample,randomisedcontrolled,quasiexperimentalandcorrelationalstudies.
OthermeasuresofeffectsizeIthasbeenshownthattheinterpretationofthestandardisedmeandifference
measureofeffectsizeisverysensitivetoviolationsoftheassumptionofnormality.Forthisreason,anumberofmorerobust(nonparametric)alternativeshavebeensuggested.AnexampleoftheseisgivenbyCliff(1993).Therearealsoeffectsize
15
measuresformultivariateoutcomes.AdetailedexplanationcanbefoundinOlejnikandAlgina(2000).Finally,amethodforcalculatingeffectsizeswithinmultilevelmodelshasbeenproposedbyTymmsetal.(1997).GoodsummariesofmanyofthedifferentkindsofeffectsizemeasuresthatcanbeusedandtherelationshipsamongthemcanbefoundinSnyderandLawson(1993),Rosenthal(1994)andKirk(1996).
Finally,acommoneffectsizemeasurewidelyusedinmedicineistheoddsratio.Thisisappropriatewhereanoutcomeisdichotomous:successorfailure,apatientsurvivesordoesnot.Explanationsoftheoddsratiocanbefoundinanumberof medicalstatisticstexts,includingAltman(1991),andinFleiss(1994).
ConclusionsAdviceontheuseofeffectsizescanbesummarisedasfollows:
Effectsizeisastandardised,scalefreemeasureoftherelativesizeoftheeffectofanintervention. Itisparticularlyusefulforquantifyingeffectsmeasuredonunfamiliarorarbitraryscalesandforcomparingtherelativesizesofeffectsfromdifferentstudies.
InterpretationofeffectsizegenerallydependsontheassumptionsthatcontrolandexperimentalgroupvaluesareNormallydistributedandhavethesamestandarddeviations.Effectsizescanbeinterpretedintermsofthepercentilesorranksatwhichtwodistributionsoverlap,intermsofthelikelihoodofidentifyingthesourceofavalue,orwithreferencetoknowneffectsoroutcomes.
Useofaneffectsizewithaconfidenceintervalconveysthesameinformationasatestofstatisticalsignificance,butwiththeemphasisonthesignificanceoftheeffect,ratherthanthesamplesize.
Effectsizes(withconfidenceintervals)shouldbecalculatedandreportedinprimarystudiesaswellasinmetaanalyses.
InterpretationofstandardisedeffectsizescanbeproblematicwhenasamplehasrestrictedrangeordoesnotcomefromaNormaldistribution,orifthemeasurementfromwhichitwasderivedhasunknownreliability.
Theuseofanunstandardisedmeandifference(i.e.therawdifferencebetweenthetwogroups,togetherwithaconfidenceinterval)maybepreferablewhen:
theoutcomeismeasuredonafamiliarscale thesamplehasarestrictedrange theparentpopulationissignificantlynonNormalcontrolandexperimentalgroupshaveappreciablydifferentstandard
deviations theoutcomemeasurehasveryloworunknownreliability
Caremustbetakenincomparingoraggregatingeffectsizesbasedondifferentoutcomes,differentoperationalisationsofthesameoutcome,differenttreatments,orlevelsofthesametreatment,ormeasuresderivedfromdifferentpopulations.
Thewordeffectconveysanimplicationofcausality,andtheexpressioneffectsizeshouldthereforenotbeusedunlessthisimplicationisintendedandcanbejustified.
1Thiscalculationisderivedfromaprobittransformation(Glassetal.,1981,p136),basedontheassumptionofanunderlyingnormallydistributedvariablemeasuringacademicattainment,somethresholdofwhichisequivalenttoastudentachieving5+A* Cs.Percentagesforthechangefromastartingvalueof50%forothereffectsizevaluescanbereaddirectlyfromTable I.Alternatively,if F(z)isthestandardnormalcumulativedistributionfunction, p1istheproportionachievingagiventhresholdand p2theproportiontobeexpectedafterachangewitheffectsize, d,then,
p2 = F{F1(p1)+ d}
16
References
ALTMAN,D.G.(1991) PracticalStatisticsforMedicalResearch.London:ChapmanandHall.
BANGERT, R.L., KULIK, J.A.ANDKULIK, C.C.(1983)Individualisedsystemsofinstructioninsecondaryschools. ReviewofEducationalResearch,53,143158.
BANGERTDROWNS, R.L. (1988)Theeffectsofschoolbasedsubstanceabuseeducation:ametaanalysis. JournalofDrugEducation,18,3,24365.
BAUGH, F.(2002)Correctingeffectsizesforscorereliability:Areminderthatmeasurementandsubstantiveissuesarelinkedinextricably.EducationalandPsychologicalMeasurement,62,2,254263.
CLIFF, N.(1993)DominanceStatisticsordinalanalysestoanswerordinalquestions PsychologicalBulletin,114,3.494509.
COHEN, J.(1969)StatisticalPowerAnalysisfortheBehavioralSciences.NY:AcademicPress.
COHEN, J.(1994)TheEarth isRound(p
17
HEDGES, L.ANDOLKIN, I.(1985) StatisticalMethodsforMetaAnalysis.NewYork:AcademicPress.
HEMBREE, R. (1988)Correlates,causeseffectsandtreatmentoftestanxiety. ReviewofEducationalResearch,58(1),4777.
HUBERTY, C.J..(2002)Ahistoryofeffectsizeindices.EducationalandPsychologicalMeasurement,62,2,227240.
HYMAN, R.B, FELDMAN, H.R., HARRIS, R.B., LEVIN, R.F.ANDMALLOY, G.B.(1989)Theeffectsofrelaxationtrainingonmedicalsymptoms:ameatanalysis.NursingResearch,38,216220.
KAVALE, K.A.ANDFORNESS, S.R.(1983)Hyperactivityanddiettreatment:ameatanalysisoftheFeingoldhypothesis. JournalofLearningDisabilities,16,324330.
KESELMAN, H.J., HUBERTY, C.J., LIX, L.M., OLEJNIK, S. CRIBBIE, R.A., DONAHUE, B.,KOWALCHUK, R.K., LOWMAN, L.L., PETOSKEY, M.D., KESELMAN, J.C.ANDLEVIN, J.R. (1998)Statisticalpracticesofeducationalresearchers:AnanalysisoftheirANOVA,MANOVA,andANCOVAanalyses.ReviewofEducationalResearch,68,3,350386.
KIRK, R.E.(1996)PracticalSignificance:Aconceptwhosetimehascome.Educationaland PsychologicalMeasurement,56,5,746759.
KULIK, J.A., KULIK, C.C.ANDBANGERT, R.L.(1984)Effectsofpracticeonaptitudeandachievementtestscores. AmericanEducationResearchJournal,21,435447.
LEPPER, M.R., HENDERLONG, J., ANDGINGRAS, I. (1999)Understandingtheeffectsofextrinsicrewardsonintrinsicmotivation Usesandabusesofmetaanalysis:CommentonDeci,Koestner,andRyan. PsychologicalBulletin,125,6,669676.
LIPSEY, M.W.(1992)Juveniledelinquencytreatment:ametaanalyticinquiryintothevariabilityofeffects.InT.D.Cook,H.Cooper,D.S.Cordray,H.Hartmann,L.V.Hedges,R.J.Light,T.A.LouisandF.Mosteller(Eds) Metaanalysisforexplanation.NewYork:RussellSageFoundation.
LIPSEY, M.W.ANDWILSON, D.B. (1993)TheEfficacyofPsychological,Educational,andBehavioralTreatment:Confirmationfrommetaanalysis. AmericanPsychologist,48,12,11811209.
MCGRAW, K.O. (1991)ProblemswiththeBESD:acommentonRosenthalsHowAreWeDoinginSoftPsychology.AmericanPsychologist,46,10846.
MCGRAW, K.O.ANDWONG, S.P. (1992)ACommonLanguageEffectSizeStatistic.PsychologicalBulletin,111,361365.
MOSTELLER, F., LIGHT, R.J.ANDSACHS, J.A.(1996)'Sustainedinquiryineducation:lessonsfromskillgroupingandclasssize.' HarvardEducationalReview,66,797842.
OAKES, M.(1986) StatisticalInference:ACommentaryfortheSocialandBehavioralSciences.NewYork:Wiley.
OLEJNIK, S.ANDALGINA, J.(2000)MeasuresofEffectSizeforComparativeStudies:Applications,InterpretationsandLimitations. ContemporaryEducationalPsychology,25,241286.
18
ROSENTHAL, R.(1994)ParametricMeasuresofEffectSizeinH.CooperandL.V.Hedges(Eds.), TheHandbookofResearchSynthesis.NewYork:RussellSageFoundation.
ROSENTHAL, R,ANDRUBIN, D.B. (1982)Asimple,generalpurposedisplayofmagnitudeofexperimentaleffect. JournalofEducationalPsychology,74,166169.
RUBIN, D.B.(1992)Metaanalysis:literaturesynthesisoreffectsizesurfaceestimation. JournalofEducationalStatistics,17,4,363374.
SHYMANSKY, J.A., HEDGES, L.V.ANDWOODWORTH, G.(1990)Areassessmentoftheeffectsofinquirybasedsciencecurriculaofthe60sonstudentperformance.JournalofResearchinScienceTeaching,27,127144.
SLAVIN, R.E.ANDMADDEN, N.A. (1989)Whatworksforstudentsatrisk?Aresearchsynthesis. EducationalLeadership,46(4),413.
SMITH, M.L.ANDGLASS, G.V. (1980)Metaanalysisofresearchonclasssizeanditsrelationshiptoattitudesandinstruction. AmericanEducationalResearchJournal,17,419433.
SNYDER, P. ANDLAWSON, S. (1993)EvaluatingResultsUsingCorrectedandUncorrectedEffectSizeEstimates.JournalofExperimentalEducation,61,4,334349.
STRAHAN, R.F. (1991)RemarksontheBinomialEffectSizeDisplay.AmericanPsychologist,46,10834.
THOMPSON, B. (1999)Commonmethodologymistakesineducationalresearch,revisited,alongwithaprimeronbotheffectsizesandthebootstrap.InvitedaddresspresentedattheannualmeetingoftheAmericanEducationalResearchAssociation,Montreal.[Accessedfrom,January2000]
TYMMS, P., MERRELL, C.ANDHENDERSON, B. (1997)TheFirstYearasSchool:AQuantitativeInvestigationoftheAttainmentandProgressofPupils.EducationalResearchandEvaluation,3,2,101118.
VINCENT, D.ANDCRUMPLER, M. (1997) BritishSpellingTestSeriesManual3X/Y.Windsor:NFERNelson.
WANG, M.C.ANDBAKER, E.T.(1986)Mainstreamingprograms:Designfeaturesandeffects. JournalofSpecialEducation,19,503523.
WILCOX, R.R. (1998)Howmanydiscoverieshavebeenlostbyignoringmodernstatisticalmethods?.AmericanPsychologist,53,3,300314.
WILKINSON, L.AND TASK FORCEON STATISTICAL INFERENCE, APABOARDOFSCIENTIFICAFFAIRS(1999)StatisticalMethodsinPsychologyJournals:GuidelinesandExplanations.AmericanPsychologist,54,8,594604.