High Stakes Quality Measures in Early Care and Education: Reconsidering the Evidence
Rachel A. Gordon, Professor
Sociology and Institute of Government and Public Affairs, University of Illinois at Chicago
The Challenge of Using Observational Quality Rating Tools in Accountability Systems and Strategies to Address Them
National Research Conference on Early Childhood
July 13, 2016 (v3)
Acknowledgments
• This work draws primarily from collaborative examinations of the psychometric properties of measures of classroom quality and children’s socio-emotional development in early childhood funded by IES and NIH:
  – IES R305A090065
  – NIH R01HD060711
  – IES R305A130118
  – IES R305A160010
  – IES R305H130012
• Results reflect our teams’ interpretation (not necessarily those of our funders).
• Presentation reflects my synthesis (not necessarily that of individual team members).
Preview of Talk
• Brief reminder of policy context, especially high stakes use of measures.
• Highlights from research findings.
  – General importance of freshly considering the evidence base specifically for each use of a measure.
    • High stakes, professional development, research, self assessment.
  – Current measures in high stakes use.
    • Do they predict large school readiness gains?
    • Do they sharply measure constructs aligned with readiness gains?
    • Are they constructed for maximal precision (high signal vs. noise)?
Brief Reminder: Public Investments and High Stakes Use of Measures
Expanding Public Investments in Early Care and Education
• NIEER, State-Funded Pre-K Enrollment: percentage of 4-year-olds enrolled in state pre-k doubled, 2002 to 2010.
• Federal and state child care spending tripled, 1997 to 2003.
Policy Focus on Quality Early Care and Education
• Policy initiatives focus on high-quality early care and education.
• Typically with at least part of the goal being support for children’s school readiness and later school and life success.
• This reflects a sensible desire to ensure that public dollars are invested in high quality settings.
• But it is a desire that is difficult to put into practice. Ideally, consider:
  – What are the policy goals?
  – What aspects of quality align with these goals?
  – What are the best strategies for assessing these aspects of quality for this particular use?
• As the desire to assure quality in publicly funded programs grew rapidly, decision makers turned to existing measures.
ECERS-R and CLASS
• Two observational measures most widely used: ECERS-R and CLASS.
• Similarities and differences:
  – Both have observers visit classrooms for several hours to rate actual classroom activities and interactions.
  – Both produce ratings on a 1 to 7 scale.
  – But, different origins and structures.
ECERS
• Developed in 1970s from a checklist to help practitioners improve the quality of their settings.
• Reflects the early childhood education field’s concept of developmentally appropriate practice (whole child approach, child-initiated activities, teacher facilitation responsive to individual needs).
• ECERS-R: 43 scores assigned based on 400+ observed indicators.
• New version: ECERS-3.
CLASS
• Developed in 1990s/2000s, beginning in a research study and later aimed at professional development and coaching.
• Reflects developmental theory and research and emphasizes teacher-student (adult-child) interactions as the primary mechanism of development and learning.
• Observers assimilate what they see to assign scores to just a few items.
The manual advises: “Because of the highly inferential nature of the CLASS, scores should never be given without referring to the manual.” (Pianta, La Paro & Hamre, p. 17, bold in original)
Use in State Quality Rating and Improvement Systems
http://qriscompendium.org/top-ten/question-3/
2010: ~85% ECERS-R, ~7% CLASS
2014: ~87% ECERS-R, ~37% CLASS
ERS = Broader suite of measures for preschools (ECERS-R), infant/toddler centers (ITERS-R) and homes (FCCERS-R).
Example: Illinois’ QRIS Learning Environment
http://www.excelerateillinois.com/docman/resources/2-gold-excelerate-illinois-chart/file
Use in Head Start
Highlights: Evidence for High Stakes Use
I am briefly highlighting findings in the interest of time. I have listed citations, and would be happy to share full publications or additional details.
Evidence for High Stakes Use
• Alternative options for high stakes use.
  – Choose the relatively best measure available at the time and use “as is” (even if evidence limited)?
  – Choose existing measure but build in rigorous evidence building and potential for modifications to measure during use?
  – Require an absolute level of evidence before use?
• Potential for limits of ECERS-R and CLASS based on absolute level of evidence.
  – Both were designed for purposes different from their current high stakes use.
  – Both were embedded in practice/professional and conceptual/empirical knowledge.
  – But, neither used a fully modern measurement development and psychometric approach (e.g., IRT) during development, especially one aligning with this particular high stakes use.
Do ECERS-R and CLASS predict large school readiness gains?
Accumulating Evidence: Small Associations (Often Not Sig.)
• Burchinal, Kainz and Cai (2011)
  – Effect sizes (adjusted correlations) of .14 and below across published studies with a range of child outcomes.
  – Even when focusing on low-income children and aligning subscales with language, math, social, and behavioral outcomes in new analyses, 32 of 36 adjusted correlations at or below .10.
• Keys and colleagues (2013)
  – Average effect sizes between .01 and .05 for language, math, social, and behavioral outcomes.
• Some evidence of thresholds (stronger effects in higher quality region; Burchinal, Zaslow, & Tarullo, 2016).
  – But still small for ECERS-R and CLASS.
  – And demonstrated small sample sizes across regions of quality and the need for data collected specifically to test this question (oversample higher quality).
New Data Syntheses (IES R305A130118)
• Meta-analyses
  – 13 datasets with multiple waves and subgroups
• Integrative data analysis (stacked datasets)
  – FACES 2000, 2003, 2006 and 2009
  – Greater sample sizes across the quality continuum
• Used numerous strategies to examine non-linearity (dummy variable, quadratic, piecewise, non-parametric).
• Used numerous strategies to examine sensitivity (e.g., multiple approaches to missing data, complex samples, quality and outcome scoring).
Example: Predicting Growth in Children’s Vocabulary (PPVT)
Source: Stacked analysis of four FACES cohorts. Adjusting for fall PPVT score as well as child’s gender, race-ethnicity, and disability status, family income, whether 3 or 4 at HS enrollment, month of fall assessments, months between fall and spring assessment, and age in months at spring assessment.
[Figure: four panels plotting Spring PPVT (Standardized), y-axis 78 to 92, against classroom quality.
Panel titles: ECERS-R Teaching and Interactions, Linear (x-axis 1 to 7); CLASS Instructional Support, Linear (x-axis 1 to 5); ECERS-R Teaching and Interactions, Dummy Variable (categories < 3, 3 to < 4, 4 to < 5, 5 to < 6, 6 to 7); CLASS Instructional Support, Dummy Variable (categories < 1.5, 1.5 to < 2, 2 to < 2.5, 2.5 to < 3, >= 3).
Annotations on the panels: Effect size: .05; Effect size: .04; Max Std. Diff: .13; Max Std. Diff: .17.]
Do ECERS-R and CLASS sharply measure constructs aligned with readiness gains?
Importance of Dimensions of Quality
• Ideally, high stakes measures would be created specifically to measure aspects of quality aligned with policy goals.
  – Content-focused aspects of quality aligned with particular readiness domains may show stronger relationships.
• Alternatively, if measures designed for other purposes are used, they should have clear definitions of the aspects of quality measured and empirical evidence that items measure them.
  – Yet, evidence for the subscales defined in ECERS-R and CLASS manuals (and often used in policy) is limited.
ECERS-R Dimensions: One, Seven, or Two (Three)?
• The ERS presumes a quality program supports three basic needs (health/safety, positive relationships, opportunities for learning from experience) and “no one is more or less important than the others” (http://ers.fpg.unc.edu/).
  – The ECERS-R scale developers sometimes describe the measure as capturing a single global aspect of quality.
  – But items are organized into seven subscales, some of which on the surface align with particular aspects of quality (personal care, interaction, activities).
  – Some QRIS, like Illinois, rely on either the total or subscale scores.
• On the other hand, factor analyses have identified 2-3 dimensions, and the most aligned of these are often somewhat more highly correlated with outcomes; these dimensions are sometimes used in QRIS.
Two Dimensions Replicate (IES R305A130118)
• To solidify evidence related to dimensionality, and encourage consistent practice/policy use, we factor-analyzed data from eight surveys (with 14 waves) and synthesized the results.
• Two broad dimensions replicated across the datasets:
  – Language-Reasoning/Interaction (LR: Items 16-18; Int: Items 29-33).
  – Space Furnishings/Activities (SF: Items 2-6; Act: Items 19-26 & Item 28).
• But associations with outcomes still small.
CLASS PreK Dimensions: Three Domains or a “Bi-Factor”?
• The CLASS PreK manual produces scores in three broad domains: Emotional Support, Classroom Organization, Instructional Support. (http://teachstone.com/)
• Instructional Support captures dimensions of teacher practice meant to cut across content (Concept Development, Quality of Feedback, Language Modeling).
• Dimensions sometimes more strongly related to outcomes when aligned by domain (but still small).
• CLASS developers recently published a “bi-factor” structure for the CLASS PreK (Hamre, Hatfield, Pianta & Jamil, 2014) that differs from the subscales written into policy.
  – One general dimension (responsive teaching).
  – Two specific dimensions (proactive management and routines; cognitive facilitation).
  – Somewhat more conceptually consistent pattern with outcomes, although effect sizes still small (< .10).
• We replicated this bi-factor structure, although, like the CLASS team, we had problems with convergence (perhaps due to item skewness and correlation).
Are ECERS-R and CLASS constructed for maximal precision (high signal vs. noise)?
Scoring Strategies May Produce Noise
• The structures of ECERS-R and CLASS are quite different, but each may increase noise.
  – ECERS-R: checklist origin with 400+ indicators, but uses “stop scoring” so not all indicators are rated.
  – CLASS: a highly inferential approach, where coders assimilate all they’ve seen in their heads (rather than in checklists).
ECERS-R Standard “Stop Scoring”
• Conditions in the indicators of lower scores must be met before indicators of higher scores are evaluated.
• Higher score may not always reflect higher quality, especially for aspects of quality relevant for school readiness.
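
The stop-scoring logic can be illustrated with a minimal Python sketch. It assumes a simplified rule consistent with the description in this talk: lower-level indicators must be satisfied before higher levels are evaluated, even scores require at least half of the next level's indicators, and odd scores require all of them. The function name and the indicator lists are hypothetical; the published ECERS-R manual remains the authoritative statement of the scoring rules.

```python
def stop_score(problems_1, met_3, met_5, met_7):
    """Simplified stop-scoring sketch for one item (hypothetical layout).

    problems_1: booleans, True when a level-1 (inadequate) condition is observed.
    met_3, met_5, met_7: booleans, True when an indicator at that level is met.
    Each list is assumed to be non-empty.
    """
    if any(problems_1):
        return 1
    score = 1
    for odd_level, met in ((3, met_3), (5, met_5), (7, met_7)):
        share_met = sum(met) / len(met)
        if share_met == 1.0:
            score = odd_level      # all indicators met: move up and keep going
            continue
        if share_met >= 0.5:
            score = odd_level - 1  # at least half met: even score
        break                      # stop scoring: higher levels are not evaluated
    return score


# One missed level-3 indicator stops the score at 2,
# even though every level-5 and level-7 indicator is met.
print(stop_score([False, False], [True, True, False], [True, True], [True, True]))
```

In this toy case the strengths recorded at levels 5 and 7 never enter the score, which is one way a higher score can fail to track higher quality on the aspects that matter for readiness.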
ECERS-R Item 10: Meals/Snacks
Source: Harms, T., Clifford, R.M., & Cryer, D. (1998). Early Childhood Environment Rating Scale, Revised Edition. New York, NY: Teachers College Press.
What Would Order Look Like?
• If higher scores reflect higher quality, then average quality scores should be higher for centers rated in higher categories versus lower categories.
• Alternatively, we may see unexpected flat regions or dips in average quality at some higher category scores.
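
A minimal Python sketch of this check follows. It uses toy data, and the function names and the choice of criterion (for example, a classroom's average on the remaining items) are assumptions for illustration, not the analysis code used in the studies cited here.

```python
from collections import defaultdict
from statistics import mean


def category_means(item_scores, criterion):
    """Average criterion value among classrooms rated in each category."""
    groups = defaultdict(list)
    for score, value in zip(item_scores, criterion):
        groups[score].append(value)
    return {score: mean(values) for score, values in sorted(groups.items())}


def non_ordered_pairs(means):
    """Adjacent category pairs whose means fail to increase (flat or dip)."""
    cats = sorted(means)
    return [(lo, hi) for lo, hi in zip(cats, cats[1:]) if means[hi] <= means[lo]]


# Toy data: one item's category scores and an overall quality criterion.
item = [1, 1, 2, 2, 3, 3, 4, 4]
overall = [2.0, 2.2, 3.0, 3.2, 2.8, 3.0, 4.0, 4.2]

means = category_means(item, overall)
print(non_ordered_pairs(means))  # flags the dip between categories 2 and 3
```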
Non-Order in Category Averages (IES R305A130118)
• Category averages out of order for some items in all 8 datasets.
• In analysis of stacked data files (with greatest precision), over two-thirds of items had at least one pair of adjacent category means that did not progress in a stair-step fashion.
• Most common location of non-order was categories 2-to-3, followed by categories 4-to-5.
• The problem was most evident in the Personal Care Routines items and least evident in the Language-Reasoning items.
• For nearly ¾ (26 of 36) of the items, at least one category-total point-biserial correlation was negative.
IRT Models Also Identify Disorder (IES R305A130118)
• Partial Credit Model
  – For every item, at least one pair of adjacent thresholds (the latent level of quality at which a rater is equally likely to choose either of two adjacent categories) was out of order.
  – This disorder generally involved the 3rd and 5th categories.
• Nominal Response Model
  – Category boundary discriminations negative (and scoring function values out of order) in at least one place for 15 of 36 items (42%).
  – Category boundary discriminations small (0 to 0.5) (and scoring function values progressed minimally) for 35 of 36 items.
  – This non-order typically occurred between scores 2 and 3 and scores 4 and 5.
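
To make "disordered thresholds" concrete, the sketch below computes category probabilities under the standard partial credit model for a hypothetical three-category item. The threshold values and function names are illustrative assumptions, not estimates from the ECERS-R analyses.

```python
import math


def pcm_probs(theta, thresholds):
    """Category probabilities under the partial credit model.

    thresholds[k - 1] is the point on the latent scale where
    categories k - 1 and k are equally likely.
    """
    cumulative = [0.0]
    for delta in thresholds:
        cumulative.append(cumulative[-1] + (theta - delta))
    weights = [math.exp(c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def modal_categories(thresholds, grid):
    """Categories that are the most likely response somewhere on the grid."""
    modal = set()
    for theta in grid:
        probs = pcm_probs(theta, thresholds)
        modal.add(probs.index(max(probs)))
    return modal


grid = [x / 10 for x in range(-40, 41)]     # latent quality from -4 to 4
print(modal_categories([-1.0, 1.0], grid))  # ordered thresholds
print(modal_categories([1.0, -1.0], grid))  # disordered thresholds
```

With ordered thresholds each category is the most likely rating over some range of quality. With the thresholds reversed, the middle category is never the most likely rating at any level of quality, which is the pattern that threshold disorder signals.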
Visualization
Interpretations
• Consistent location may reflect different scoring rules for even and odd categories (requiring half vs. all indicators to be met).
• Greatest problems evident for personal care routines may reflect mixing of different aspects of quality (see also the indicator analysis in Gordon et al., EED, 2015).
Implications
• These results caution against use of the simple sum (average) of the raw scores, including in relation to high stakes cutoffs.
• Preston and Reise (2015, p. 392)
  – When CBD values are not positive, a response in a higher category “does not indicate more of the trait” than a response in a lower category.
  – “when category distinctions fail to discriminate, a researcher would not want to use a scoring strategy that aggregates raw integer item scores.”
Preston, K. S. J. & Reise, S. P. (2015). Detecting faulty within-item category functioning with the nominal response model. In S. P. Reise & D. A. Revicki (Eds.), Handbook of item response theory modeling: Applications to typical performance assessment (pp. 386-405). New York: Routledge.
CLASS Inter-Rater Reliability: Is “Within One” Good Enough?
The CLASS (like the ECERS-R and other observational systems) assesses agreement “within one”, encompassing broad regions of the 7-point scale.
• A score of 5, 6 or 7 is considered in agreement with a master score of 6.
• A score of 3, 4 or 5 is considered in agreement with a master score of 4.
• A score of 1, 2 or 3 is considered in agreement with a master score of 2.
Pianta, R. C., La Paro, K. M., & Hamre, B. K. (2008). Classroom Assessment Scoring System – PreK Manual. Baltimore, MD: Brookes Publishing.
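
The gap between the two agreement criteria can be seen in a small Python sketch. The scores below are made-up illustrations and the function name is hypothetical.

```python
def agreement_rates(rater_scores, master_scores):
    """Share of ratings matching the master score exactly and within one point."""
    pairs = list(zip(rater_scores, master_scores))
    exact = sum(r == m for r, m in pairs) / len(pairs)
    within_one = sum(abs(r - m) <= 1 for r, m in pairs) / len(pairs)
    return exact, within_one


master = [6, 4, 2, 5]
rater = [7, 3, 2, 3]
exact, within_one = agreement_rates(rater, master)
print(exact, within_one)  # 0.25 exact agreement, 0.75 within-one agreement
```

The same set of ratings looks far more reliable under the within-one criterion, even though three of the four scores differ from the master score.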
Exact Agreement Difficult on the Highly Inferential System
Challenge of Rater Variance
• Based on Head Start training, CLASS developers (Cash, Hamre, Pianta, & Myers, 2012) reported:
  – Exact agreement was low: 41% overall exact agreement with master score in training of over 2,000 Head Start staff.
  – Black and Latino raters placed their Instructional Support scores farther from the master score, as did raters who disagreed with intentional teaching beliefs.
• Recent report on rater errors in CLASS-S (McCaffrey et al., Educational Measurement).
Conclusions
Summary: Limits of Adopting Existing Measures for High Stakes Use
• When scrutinizing these measures, which were developed for other purposes, it is not surprising that there are limitations for the ways in which they have been adopted for high stakes policy uses.
• The limitations of the reliability and validity evidence for ECERS-R and CLASS may, in part, explain the relatively low associations with children’s developmental gains during preschool.
Alternative Approach to Evaluating Level of Evidence
• Consistent with the latest Standards for Educational and Psychological Testing, we may need to step back and:
  – consider the intents of each research, practice and policy use,
  – weigh the full body of reliability and validity evidence against each use,
  – build in continuous and local validation of measures selected for these uses,
  – allow for the refinement of measures over place and time.
http://www.apa.org/science/programs/testing/standards.aspx
In short
• A measure is not statically “reliable and valid”.
• The evidence must be fully evaluated and regularly revisited (including locally) for each use.
  – The body of evidence needed to demonstrate reliability and validity for program self-assessment
  – May be different from reliability and validity for teacher professional development
  – Which may be different from reliability and validity for policy decision making and accountability
Alternative Approach to Evaluating Level of Evidence
• As a concrete example, if it is desirable to distinguish classrooms that fall above and below specific cut points, as in current policy uses, then measures with very high information (and low error) at those cut points are needed.
• If the policy goal is to improve children’s school readiness, then we need agreement on definitions of readiness and the aspects of quality that support it, and measures designed and evaluated to assess those aspects of quality.
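
The idea of "information at a cut point" can be sketched in Python under a strong simplifying assumption: a dichotomous Rasch model with hypothetical item difficulties. ECERS-R and CLASS items are polytomous, so this illustrates the principle only, and all names and values below are made up.

```python
import math


def rasch_probability(theta, difficulty):
    """Probability of a positive response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def scale_information(theta, difficulties):
    """Sum of item information p * (1 - p) at latent level theta."""
    total = 0.0
    for b in difficulties:
        p = rasch_probability(theta, b)
        total += p * (1.0 - p)
    return total


def standard_error(theta, difficulties):
    """Approximate standard error of measurement at theta."""
    return 1.0 / math.sqrt(scale_information(theta, difficulties))


cut_point = 1.5                         # hypothetical policy cut point
easy_items = [-2.0, -1.5, -1.0, -0.5]   # items located well below the cut
targeted_items = [1.0, 1.5, 2.0, 2.5]   # items located around the cut

for label, items in (("easy", easy_items), ("targeted", targeted_items)):
    info = scale_information(cut_point, items)
    print(label, round(info, 2), round(standard_error(cut_point, items), 2))
```

With the same number of items, the set located near the cut point yields more information and a smaller standard error there, so classrooms just above and just below the cut are separated more reliably.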
Consider Alternative Approaches to Accumulating Evidence
• Continuous and local validation and improvement of measures.
• This approach could potentially benefit from viewing measures as:
  – Not fixed in stone (moving away from single copyrighted measure controlled by publisher).
  – Jointly owned (moving away from financial/professional stake in a fixed item/measure).
Local Validation Can Also Encompass Contextual Diversity
• Does a measure capture a single conception of quality?
  – Is that conception explicit or implicit?
  – Does that conception match with policy goals and with on-the-ground practice, across local contexts?
• Does a measure capture well all children’s experiences?
  – Or, “average” teacher quality, “average” child, “substantial portion of the day”?
Future Directions
• New ECERS-3
  – Some important changes (e.g., expanded content on language, literacy and math).
  – Other aspects remained the same (the manual encourages scoring all indicators, and alternative scoring is in development, but the manual retains stop scoring).
  – IES-funded validation study of ECERS-3.
• New measure development
  – IES Research Network on Early Childhood Education.
  – Includes assessment team (headed by Carol Connor).
  – New measure development for QRIS.
Future Directions (cont)
• Early Investments Initiative (Gordon, Zinsser, Sheridan, Main, Curby, et al., IGPA) & EMOTERS (Zinsser, Curby, Gordon, et al. IES R305A160010)
  – New measure of Social and Emotional Teaching.
    • The variety of activities and practices that promote SEL in children.
  – One of the novel design features: video weeks
    • Video full week per classroom (panoramic & close up).
    • Easier to have many coders, to examine IRR.
    • Facilitate identification of within/across day variation.
    • Using in iterative measure development.
    • Using generalizability theory to parse sources of variation.
Future Directions (cont)
• IES R305H130012
  – Researcher-Practitioner Partnerships in Education Research grant
  – Creating a Monitoring System for School Districts to Promote Academic, Social, and Emotional Learning: A Researcher-Practitioner Partnership
  – CASEL & Washoe County School District
• Using IRT (Rasch) approach during iterative item development.
  – Refining construct definition
  – Developing item pool
  – Continuous refinement
  – Anchor (common) and new (unique) items
The Rasch Ruler
• Marks = Competencies; Measures = Kids’ Levels.
• Top of the ruler: competencies that are really HARD for most kids; kids who have the MOST competency.
• Bottom of the ruler: competencies that are really EASY for most kids; kids who have the LEAST competency.
• If we had marks only at the bottom of the ruler (just the easy math items, such as 2 + 2), we couldn’t separate the students with moderately to highly competent math skills.
[Figure: ruler scale running from -2.00 to 2.00 in steps of 0.50.]
The way WCSD depicted the ruler, showing our most improved set of items: Relationship Skills (1 = Hardest to Do, 6 = Easiest to Do).
1. Sharing what I am feeling with others.
2. Joining a group I don’t usually sit with at lunch.
3. Talking to an adult when I have problems at school.
4. Introducing myself to a new student at school.
5. Getting along with my classmates.
6. Being polite to adults.