High Stakes Quality Measures in Early Care and Education · development and coaching. • Reﬂects developmental theory and research and emphasizes teacher-student (adult-child)

High Stakes Quality Measures in Early Care and Education: Reconsidering the Evidence

Rachel A. Gordon, Professor
Sociology and Institute of Government and Public Affairs
University of Illinois at Chicago

The Challenge of Using Observational Quality Rating Tools in Accountability Systems and Strategies to Address Them
National Research Conference on Early Childhood

July 13, 2016 (v3)

Acknowledgments

•  This work draws primarily from collaborative examinations of the psychometric properties of measures of classroom quality and children’s socio-emotional development in early childhood funded by IES and NIH:

–  IES R305A090065
–  NIH R01HD060711
–  IES R305A130118
–  IES R305A160010
–  IES R305H130012

•  Results reflect our teams’ interpretation (not necessarily those of our funders).

•  Presentation reflects my synthesis (not necessarily individual team members).

Preview of Talk

•  Brief reminder of policy context, especially high stakes use of measures.

•  Highlights from research findings.

–  General importance of freshly considering the evidence base specifically for each use of a measure.

•  High stakes, professional development, research, self assessment.

–  Current measures in high stakes use.
•  Do they predict large school readiness gains?
•  Do they sharply measure constructs aligned with readiness gains?
•  Are they constructed for maximal precision (high signal vs. noise)?

Brief Reminder: Public Investments and High Stakes Use of Measures

Expanding Public Investments in Early Care and Education

NIEER: State-Funded Pre-K Enrollment

Percentage of 4 year olds enrolled in state pre-k doubled, 2002 to 2010

Expanding Public Investments in Early Care and Education

Federal and state child care spending tripled, 1997 to 2003

Policy Focus on Quality Early Care and Education

•  Policy initiatives focus on high-quality early care and education.

•  Typically with at least part of the goal being support for children’s school readiness and later school and life success.

Policy Focus on Quality Early Care and Education

•  This reflects a sensible desire to ensure that public dollars invest in high quality settings.

•  But, a desire that is difficult to put into practice. Ideally, consider:

–  What are the policy goals?
–  What aspects of quality align with these goals?
–  What are the best strategies for assessing these aspects of quality for this particular use?

•  As the desire to assure quality in publicly funded programs grew rapidly, decision makers turned to existing measures.

ECERS-R and CLASS

•  Two observational measures most widely used: ECERS-R and CLASS.

•  Similarities and differences:

–  Both have observers visit classrooms for several hours to rate actual classroom activities and interactions.
–  Both produce ratings on a 1 to 7 scale.
–  But, different origins and structures.

ECERS

•  Developed in 1970s from a checklist to help practitioners improve the quality of their settings.

•  Reflects the early childhood education field’s concept of developmentally appropriate practice (whole child approach, child-initiated activities, teacher facilitation responsive to individual needs).

•  ECERS-R: 43 scores assigned based on 400+ observed indicators.

•  New version: ECERS-3.

CLASS

•  Developed in 1990s/2000s beginning in a research study and later aimed at professional development and coaching.

•  Reflects developmental theory and research and emphasizes teacher-student (adult-child) interactions as the primary mechanism of development and learning.

•  Observers assimilate what they see to assign scores to just a few items.

The manual advises: “Because of the highly inferential nature of the CLASS, scores should never be given without referring to the manual.” (Pianta, La Paro & Hamre, p. 17, bold in original)

Use in State Quality Rating and Improvement Systems

http://qriscompendium.org/top-ten/question-3/

2010: ~85% ECERS-R, ~7% CLASS

2014: ~87% ECERS-R, ~37% CLASS

ERS = Broader suite of measures for preschools (ECERS-R), infant/toddler centers (ITERS-R) and homes (FCCERS-R).

Example: Illinois’ QRIS Learning Environment

http://www.excelerateillinois.com/docman/resources/2-gold-excelerate-illinois-chart/file

Use in Head Start

Highlights: Evidence for High Stakes Use

I am briefly highlighting findings in the interest of time. I have listed citations, and would be happy to share full publications or additional details.

Evidence for High Stakes Use

•  Alternative options for high stakes use.

–  Choose the relatively best measure available at the time and use “as is” (even if evidence limited)?

–  Choose existing measure but build in rigorous evidence building and potential for modifications to measure during use?

–  Require an absolute level of evidence before use?

•  Potential for limits of ECERS-R and CLASS based on absolute level of evidence.
–  Both were designed for purposes different from their current high stakes use.
–  Both were embedded in practice/professional and conceptual/empirical knowledge.
–  But, neither used a fully modern measurement development and psychometric approach (e.g., IRT) during development, especially one aligning with this particular high stakes use.

Do ECERS-R and CLASS predict large school readiness gains?

Accumulating Evidence: Small Associations (Often Not Sig.)

•  Burchinal, Kainz and Cai (2011)
–  Effect sizes (adjusted correlations) of .14 and below across published studies with a range of child outcomes.
–  Even when focusing on low-income children and aligning subscales with language, math, social, and behavioral outcomes in new analyses, 32 of 36 adjusted correlations at or below .10.

•  Keys and colleagues (2013)
–  Average effect sizes between .01 and .05 for language, math, social, and behavioral outcomes.

•  Some evidence of thresholds (stronger effects in higher quality region; Burchinal, Zaslow, & Tarullo, 2016).
–  But still small for ECERS-R and CLASS.
–  And demonstrated small sample sizes across regions of quality and need for data collected specifically to test this question (oversample higher quality).

New Data Syntheses (IES R305A130118)

•  Meta-analyses
–  13 data sets with multiple waves and subgroups

•  Integrative data analysis (stacked data sets)
–  FACES 2000, 2003, 2006 and 2009
–  Greater sample sizes across the quality continuum

•  Used numerous strategies to examine non-linearity (dummy variable, quadratic, piecewise, non-parametric).

•  Used numerous strategies to examine sensitivity (e.g., multiple approaches to missing data, complex samples, quality and outcome scoring).

Example: Predicting Growth in Children’s Vocabulary (PPVT)

Source: Stacked analysis of four FACES cohorts. Adjusting for fall PPVT score as well as child’s gender, race-ethnicity, and disability status, family income, whether 3 or 4 at HS enrollment, month of fall assessments, months between fall and spring assessment, and age in months at spring assessment.

[Figure: four panels plotting Spring PPVT (Standardized), y-axis 78 to 92, against classroom quality. Linear panels: ECERS-R Teaching and Interactions (x-axis 1 to 7) and CLASS Instructional Support (x-axis 1 to 5). Dummy Variable panels: ECERS-R Teaching and Interactions (categories < 3, 3 to < 4, 4 to < 5, 5 to < 6, 6 to 7) and CLASS Instructional Support (categories < 1.5, 1.5 to < 2, 2 to < 2.5, 2.5 to < 3, >= 3). Values annotated on the figure: Effect size: .05; Effect size: .04; Max Std. Diff: .13; Max Std. Diff: .17.]

Do ECERS-R and CLASS sharply measure constructs aligned with readiness gains?

Importance of Dimensions of Quality

•  Ideally, high stakes measures would be created specifically to measure aspects of quality aligned with policy goals.

–  Content-focused aspects of quality aligned with particular readiness domains may show stronger relationships.

•  Alternatively, if measures designed for other purposes are used, they should have clear definitions of the aspects of quality measured and empirical evidence that items measure them.

–  Yet, evidence for the subscales defined in ECERS-R and CLASS manuals (and often used in policy) is limited.

ECERS-R Dimensions: One, Seven, or Two (Three)?

•  The ERS presumes a quality program supports three basic needs (health/safety, positive relationships, opportunities for learning from experience) and “no one is more or less important than the others” (http://ers.fpg.unc.edu/).

–  The ECERS-R scale developers sometimes describe the measure as capturing a single global aspect of quality.

–  But items are organized into seven subscales, some of which on the surface align with particular aspects of quality (personal care, interaction, activities).

–  Some QRIS, like Illinois, rely on either the total or subscale scores.

•  On the other hand, factor analyses have identified 2-3 dimensions, and the most aligned of these are often somewhat more highly correlated with outcomes; these dimensions are sometimes used in QRIS.

Two Dimensions Replicate (IES R305A130118)

•  To solidify evidence related to dimensionality, and encourage consistent practice/policy use, we factor-analyzed data from eight surveys (with 14 waves) and synthesized the results.

•  Two broad dimensions replicated across the data sets:
–  Language-Reasoning/Interaction (LR: Items 16-18; Int: Items 29-33).
–  Space Furnishings/Activities (SF: Items 2-6; Act: Items 19-26 & Item 28).

•  But associations with outcomes still small.

CLASS PreK Dimensions: Three Domains or a “Bi-Factor”?

•  CLASS PreK manual produces scores in three broad domains: Emotional Support, Classroom Organization, Instructional Support. (http://teachstone.com/)

•  Instructional Support captures dimensions of teacher practice meant to cut across content (Concept Development, Quality of Feedback, Language Modeling).

•  Dimensions sometimes more strongly related to outcomes when aligned by domain (but still small).

CLASS PreK Dimensions: Three Domains or a “Bi-Factor”?

–  CLASS developers recently published a different “bi-factor” structure for the CLASS PreK (Hamre, Hatfield, Pianta & Jamil, 2014) that differs from the subscales written into policy.
•  One general dimension (responsive teaching).
•  Two specific dimensions (proactive management and routines; cognitive facilitation).
•  Somewhat more conceptually consistent pattern with outcomes, although effect sizes still small (< .10).

–  We replicated this bi-factor structure, although like the CLASS team had problems with convergence (perhaps due to item skewness and correlation).

Are ECERS-R and CLASS constructed for maximal precision (high signal vs. noise)?

Scoring Strategies May Produce Noise

•  The structures of ECERS-R and CLASS are quite different, but each may increase noise.

–  ECERS-R checklist origin of 400+ indicators, but used “stop scoring” so not all were rated.

–  CLASS a highly inferential approach, where coders assimilated all they’ve seen in their heads (rather than in checklists).

ECERS-R Standard “Stop Scoring”

§  Conditions in the indicators of lower scores must be met before indicators of higher scores are evaluated.

§  Higher score may not always reflect higher quality, especially for aspects of quality relevant for school readiness.
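
A minimal sketch of how stop scoring can hide information, assuming the commonly described ECERS-R convention: a 1 if any level-1 (inadequate) indicator is present, an even score when at least half of the next level's indicators are met, and an odd score when all of them are met. The function names, indicator counts, and example classroom are hypothetical, and N/A indicators are ignored.

```python
def stop_score(ind):
    """Score one item under a simplified stop-scoring rule (an assumption).

    ind maps anchor level (1, 3, 5, 7) to a list of booleans.
    Level 1 lists negative conditions (True = problem observed);
    levels 3, 5, 7 list positive conditions (True = met).
    """
    if any(ind[1]):
        return 1
    score = 1
    for level in (3, 5, 7):
        met, n = sum(ind[level]), len(ind[level])
        if met == n:            # all indicators met: award the odd score, keep going
            score = level
            continue
        if 2 * met >= n:        # at least half met: award the even score below it
            score = level - 1
        break                   # stop: higher levels are never evaluated
    return score


def positives_met(ind):
    """Count positive indicators met across levels 3, 5, 7, ignoring stop scoring."""
    met = sum(sum(ind[level]) for level in (3, 5, 7))
    total = sum(len(ind[level]) for level in (3, 5, 7))
    return met, total


# Hypothetical classroom: one level-3 indicator missed, everything above it met.
room = {
    1: [False, False],
    3: [True, True, True, False],
    5: [True, True, True, True],
    7: [True, True, True],
}
print(stop_score(room))      # 2
print(positives_met(room))   # (10, 11)
```

Under these assumptions the classroom meets 10 of 11 positive indicators yet receives a 2, because scoring stops at the first level that is not fully met.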

ECERS-R Item 10: Meals/Snacks

Source: Harms, T., Clifford, R.M., & Cryer, D. (1998). Early Childhood Environment Rating Scale, Revised Edition. New York, NY: Teachers College Press.

What Would Order Look Like?

•  If higher scores reflect higher quality, then average quality scores should be higher for centers rated in higher categories versus lower categories.

•  Alternatively, may see unexpected flat regions or dips in average quality at some higher category scores.

Non-Order in Category Averages (IES R305A130118)

•  Category averages out of order for some items in all 8 data sets.

•  In analysis of stacked data files (with greatest precision) over two-thirds of items had at least one pair of adjacent category means that did not progress in a stair step fashion.

•  Most common location of non-order was categories 2-to-3 followed by categories 4-to-5.

•  The problem was most evident in the Personal Care Routines items and least evident in the Language-Reasoning items.

•  For nearly ¾ (26 of 36) items, at least one category-total point biserial correlation was negative.
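
The category-average check can be sketched as follows. This is an illustration only: it assumes "average quality" is the mean of the remaining items (a rest score), and the ten classrooms and their scores are made up, not taken from the study data.

```python
from collections import defaultdict


def category_means(item_scores, rest_scores):
    """Mean rest score (quality on the other items) within each item category."""
    sums = defaultdict(lambda: [0.0, 0])
    for cat, rest in zip(item_scores, rest_scores):
        sums[cat][0] += rest
        sums[cat][1] += 1
    return {cat: total / n for cat, (total, n) in sorted(sums.items())}


def disordered_pairs(means):
    """Adjacent category pairs whose means fail to rise in stair step fashion."""
    cats = sorted(means)
    return [(lo, hi) for lo, hi in zip(cats, cats[1:]) if means[hi] <= means[lo]]


# Hypothetical data: one item's category for ten classrooms, with each
# classroom's average score on the remaining items.
item = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
rest = [2.0, 2.4, 3.6, 3.8, 3.1, 3.3, 4.5, 4.7, 4.4, 4.2]

means = category_means(item, rest)
print(means)
print(disordered_pairs(means))   # [(2, 3), (4, 5)]
```

In this made-up example the dips fall at categories 2-to-3 and 4-to-5, the locations the slide reports as most common.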

IRT Models Also Identify Disorder (IES R305A130118)

•  Partial Credit Model
–  For every item, at least one pair of adjacent thresholds (the latent level of quality where a rater is equally likely to choose between adjacent categories) was out of order.
–  This disorder generally involved the 3rd and 5th categories.

•  Nominal Response Model
–  Category boundary discriminations negative (and scoring function values out of order) in at least one place for 15 of 36 items (42%).
–  Category boundary discriminations small (0 to 0.5) (and scoring function values progressed minimally) for 35 of 36 items.
–  This non-order typically occurred between scores 2 and 3 and scores 4 and 5.
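
What a disordered threshold means can be sketched with the standard partial credit model, in which the probability of category k is proportional to exp of the summed (theta - threshold) terms up to k. The three-category item and its threshold values below are hypothetical, chosen only to show that when thresholds are out of order the middle category is never the most likely rating at any level of quality.

```python
import math


def pcm_probs(theta, thresholds):
    """Category probabilities under the partial credit model.

    thresholds[j] is the latent level at which categories j and j + 1
    are equally likely.
    """
    cum = [0.0]
    for d in thresholds:
        cum.append(cum[-1] + (theta - d))
    top = max(cum)                          # subtract the max for numerical stability
    weights = [math.exp(c - top) for c in cum]
    total = sum(weights)
    return [w / total for w in weights]


def ever_modal(thresholds, grid):
    """Set of categories that are the most likely rating somewhere on the grid."""
    modal = set()
    for theta in grid:
        p = pcm_probs(theta, thresholds)
        modal.add(p.index(max(p)))
    return modal


grid = [x / 10 for x in range(-40, 41)]     # latent quality from -4 to 4

print(ever_modal([-1.0, 1.0], grid))        # ordered thresholds: {0, 1, 2}
print(ever_modal([1.0, -1.0], grid))        # disordered thresholds: {0, 2}
```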

Visualization

Interpretations

•  Consistent location may reflect different scoring rules for even and odd categories (requiring half vs. all indicators to be met).

•  Greatest problems evident for personal care routines may reflect mixing of different aspects of quality (also indicator analysis in Gordon et al., EED, 2015).

Implications

•  These results caution against use of the simple sum (average) of the raw scores, including in relation to high stakes cutoffs.

•  Preston and Reise (2015, p. 392)
–  When CBD (category boundary discrimination) values are not positive, a response in a higher category “does not indicate more of the trait” than a response in a lower category.
–  “when category distinctions fail to discriminate, a researcher would not want to use a scoring strategy that aggregates raw integer item scores.”

Preston, K. S. J. & Reise, S. P. (2015). Detecting faulty within-item category functioning with the nominal response model. In S. P. Reise & D. A. Revicki (Eds.), Handbook of item response theory modeling: Applications to typical performance assessment (pp. 386-405). New York: Routledge.

CLASS Inter-Rater Reliability: Is “Within One” Good Enough?

The CLASS (like the ECERS-R and other observational systems) assesses agreement “within one” encompassing broad regions of the 7-point scale.

A score of 5, 6 or 7 is considered in agreement with a master score of 6.

Pianta, R. C., La Paro, K. M., & Hamre, B. K. (2008). Classroom Assessment Scoring System – PreK Manual. Baltimore, MD: Brookes Publishing.
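
A short sketch of how far apart the two agreement standards can be. The rater and master scores are invented for illustration; `agreement` is a hypothetical helper, not part of any CLASS software.

```python
def agreement(rater, master, tolerance=0):
    """Share of scores within `tolerance` points of the master score."""
    hits = sum(abs(r - m) <= tolerance for r, m in zip(rater, master))
    return hits / len(master)


# Hypothetical master codes and one trainee's scores on a 7-point scale.
master = [6, 6, 6, 3, 3, 5, 2, 4, 7, 5]
rater = [5, 7, 6, 4, 2, 5, 3, 6, 6, 4]

print(agreement(rater, master))               # exact agreement: 0.2
print(agreement(rater, master, tolerance=1))  # "within one" agreement: 0.9
```

Here a rater who matches the master score only twice in ten still shows 90% agreement under the within-one standard.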

Exact Agreement Difficult on the Highly Inferential System

Challenge of Rater Variance

•  Based on Head Start training, CLASS developers (Cash, Hamre, Pianta, & Myers, 2012) reported:

–  Exact agreement was low: 41% overall exact agreement with master score in training of over 2,000 Head Start staff.

–  Black and Latino raters placed their Instructional Support scores farther from the master score, as did raters who disagreed with intentional teaching beliefs.

•  Recent report on rater errors in CLASS-S (McCaffrey et al., Educational Measurement).

Conclusions

Summary: Limits of Adopting Existing Measures for High Stakes Use

•  When scrutinizing these measures, which were developed for other purposes, it is not surprising that there are limitations for the ways in which they have been adopted for high stakes policy uses.

•  The limitations of the reliability and validity evidence for ECERS-R and CLASS may, in part, explain the relatively low associations with children’s developmental gains during preschool.

Alternative Approach to Evaluating Level of Evidence

•  Consistent with the latest Standards for Educational and Psychological Testing, may need to step back and:
–  consider the intents of each research, practice and policy use,
–  weigh the full body of reliability and validity evidence against each use,
–  build in continuous and local validation of measures selected for these uses,
–  allow for the refinement of measures over place and time.

http://www.apa.org/science/programs/testing/standards.aspx

In short

•  A measure is not statically “reliable and valid.”

•  The evidence must be fully evaluated and regularly revisited (including locally) for each use.

–  The body of evidence needed to demonstrate reliability and validity for program self-assessment
–  May be different from reliability and validity for teacher professional development
–  Which may be different from reliability and validity for policy decision making and accountability

Alternative Approach to Evaluating Level of Evidence

•  As a concrete example, if it is desirable to distinguish classrooms that fall above and below specific cut points, as in current policy uses, then measures with very high information (and low error) at those cut points are needed.

•  If the policy goal is to improve children’s school readiness, then need agreement on definitions of readiness and the aspects of quality that support them, and measures designed and evaluated to assess those aspects of quality.

Consider Alternative Approaches to Accumulating Evidence

•  Continuous and local validation and improvement of measures.

•  This approach could potentially benefit from viewing measures as:
–  Not fixed in stone (moving away from single copyrighted measure controlled by publisher).
–  Jointly owned (moving away from financial/professional stake in a fixed item/measure).

Local Validation Can Also Encompass Contextual Diversity

•  Does a measure capture a single conception of quality?
–  Is that conception explicit or implicit?
–  Does that conception match with policy goals and with on-the-ground practice, across local contexts?

•  Does a measure capture well all children’s experiences?
–  Or, “average” teacher quality, “average” child, “substantial portion of the day”

Future Directions

•  New ECERS-3
–  Some important changes (e.g., expanded content on language, literacy and math).
–  Others remained the same (encourages scoring all indicators, and alternative scoring in development, but manual retains stop scoring).
–  IES-funded validation study of ECERS-3.

•  New measure development
–  IES Research Network on Early Childhood Education.
–  Includes assessment team (headed by Carol Connor).
–  New measure development for QRIS.

Future Directions (cont)

•  Early Investments Initiative (Gordon, Zinsser, Sheridan, Main, Curby, et al., IGPA) & EMOTERS (Zinsser, Curby, Gordon, et al. IES R305A160010)
–  New measure of Social and Emotional Teaching.

•  The variety of activities and practices that promote SEL in children.

–  One of the novel design features: video weeks
•  Video full week per classroom (panoramic & close-up).
•  Easier to have many coders, to examine IRR.
•  Facilitate identification of within/across day variation.
•  Using in iterative measure development.
•  Using generalizability theory to parse sources of variation.

Future Directions (cont)

•  IES R305H130012
–  Researcher-Practitioner Partnerships in Education Research grant
–  Creating a Monitoring System for School Districts to Promote Academic, Social, and Emotional Learning: A Researcher-Practitioner Partnership
–  CASEL & Washoe County School District

•  Using IRT (Rasch) approach during iterative item development.
–  Refining construct definition
–  Developing item pool
–  Continuous refinement
–  Anchor (common) and new (unique) items

The Rasch Ruler

[Figure: a ruler. Marks = Competencies, running from competencies that are really EASY for most kids to competencies that are really HARD for most kids. Measures = Kids’ Levels, running from kids who have the LEAST competency to kids who have the MOST competency.]

The Rasch Ruler

2+2

If we had marks only at the bottom of the ruler (just the easy math items) we couldn’t separate the students with moderately to highly competent math skills.

[Figure: ruler scale from -2.00 to 2.00 in steps of 0.50.]

The way WCSD depicted the ruler, showing our most improved set of items: Relationship Skills.

Hardest to Do

1. Sharing what I am feeling with others.
2. Joining a group I don’t usually sit with at lunch.
3. Talking to an adult when I have problems at school.
4. Introducing myself to a new student at school.
5. Getting along with my classmates.
6. Being polite to adults.

Easiest to Do

1 = Hardest to Do, 6 = Easiest to Do
