
Using Student Test Scores to Distinguish Good Teachers from Bad

Edward Haertel
School of Education
Stanford University

AERA Presidential Session
Measuring and Developing Teacher Effectiveness: An Assessment of Research, Policy, and Practice

New Orleans, Louisiana

April 10, 2011

1

FRAMING THE PROBLEM

Policy makers wish to measure teachers' effectiveness.

This goal is pursued using complex statistics to model student test scores.

2

Complicating Factors

Start-of-year student achievement varies due to
◦ Home background and community context
◦ Individual interests and aptitudes
◦ Peer culture
◦ Prior schooling

3

Complicating Factors

End-of-year student achievement varies due to
◦ Start-of-year differences
◦ Continuing effects of out-of-school factors, peers, and individual aptitudes and interests
◦ Instructional effectiveness

4

Complicating Factors

Instructional effectiveness reflects
◦ District and state policies
◦ School policies and climate
◦ Available instructional materials and resources
◦ Student attendance
◦ The teacher

5

Two Simplified Assumptions

Teaching matters, and some teachers teach better than others

…simplified to…

There is a stable construct we may refer to as a teacher’s “effectiveness”

6

Two Simplified Assumptions

Student achievement is a central goal of schooling

Valid tests can measure achievement

…simplified to…

Achievement is a one-dimensional continuum

Brief, inexpensive achievement tests locate students on that continuum

7


8

More later on these simplified assumptions.

Let us press ahead.

Logic of the Statistical Model

What is a “Teacher Effect”?
◦ Student growth (change in test score) attributable to the teacher
◦ I.e., caused by the teacher

9


Logic of the Statistical Model

10

Teacher Effect on One Student = Student’s Observed Score − Student’s Predicted Score

“Predicted Score” is Counterfactual – an estimate of what would have been observed with a hypothetical average teacher, all else being equal

These (student-level) “Teacher Effects” are averaged up to the classroom level to obtain an overall score for the teacher.
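As a concrete but deliberately simplified illustration of this observed-minus-predicted logic, the sketch below assumes a hypothetical data set with prior_score, current_score, and teacher columns; it is not the specification used in any of the studies discussed here, which condition on many more variables.

```python
import numpy as np
import pandas as pd

def estimate_teacher_effects(df: pd.DataFrame) -> pd.Series:
    """Minimal observed-minus-predicted sketch of a value-added 'teacher effect'."""
    # Predict each student's current score from the prior-year score alone;
    # real VAMs use richer models and additional covariates.
    X = np.column_stack([np.ones(len(df)), df["prior_score"].to_numpy()])
    y = df["current_score"].to_numpy()
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    predicted = X @ beta

    # Student-level "teacher effect" = observed score minus predicted (counterfactual) score,
    # then averaged up to the classroom level to give one score per teacher.
    return df.assign(effect=y - predicted).groupby("teacher")["effect"].mean()
```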

In or out?

District leadership
School norms, academic press
Quality of school instructional staff
Early childhood history; medical history
Quality of schooling in prior years
Parent involvement
Assignment of pupils (to schools, to classes)
Peer culture
Students’ school attendance histories
…

11

Controlling for prior-year score is not sufficient

First problem, measurement error: prior-year achievement is imperfectly measured

Second problem, omitted variables: models with additional variables predict different prior-year true scores as a function of
◦ additional test scores
◦ demographic / out-of-school factors

12
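A small simulation (illustrative only, not from the talk) makes the measurement-error problem concrete: when two groups of students differ in true starting achievement and instruction is identical, adjusting for a noisy prior-year score leaves part of the starting gap behind, where it can masquerade as a teacher effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical setup: two student groups with different true starting achievement,
# identical instruction (zero true teacher effect), and an error-prone prior-year test.
true_prior = np.where(rng.random(n) < 0.5, 0.0, 1.0)       # group gap in true achievement
observed_prior = true_prior + rng.normal(0.0, 0.7, n)      # measurement error
current = true_prior + rng.normal(0.0, 0.5, n)             # growth depends on true prior only

# Regress the current score on the *observed* prior score and inspect the residual gap.
X = np.column_stack([np.ones(n), observed_prior])
beta, *_ = np.linalg.lstsq(X, current, rcond=None)
residuals = current - X @ beta

gap = residuals[true_prior == 1].mean() - residuals[true_prior == 0].mean()
print(f"Residual group gap after adjusting for the noisy prior score: {gap:.2f}")
# Prints roughly 0.6-0.7 here: the adjustment under-corrects, so part of the
# pre-existing difference survives and could be misread as an instructional effect.
```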

Controlling for prior-year score is not sufficient

Third problem, different trajectories: students with identical prior-year true scores have different expected growth depending on
◦ individual aptitudes
◦ out-of-school supports for learning
◦ prior instructional histories

13

A small digression: Student Growth Percentiles

Construction
◦ Each student’s SGP score is the percentile rank of that student’s current-year score within the distribution for students with the same prior-year score

14

[Figure: scatterplot of current-year score against prior-year score]
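For readers who find code clearer than prose, here is a rough sketch of the SGP construction just described. It bins students by identical prior-year score; operational SGP systems instead use quantile regression over one or more prior scores, so treat this as illustrative only (column names are hypothetical).

```python
import pandas as pd

def student_growth_percentiles(df: pd.DataFrame) -> pd.Series:
    """Percentile rank (expressed 0-100) of each student's current-year score
    among students who had the same prior-year score."""
    return (
        df.groupby("prior_score")["current_score"]
          .rank(pct=True)     # within-group percentile as a fraction
          .mul(100)
    )
```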

Student Growth Percentiles

Interpretation
◦ How much this student has grown relative to others who began at the “same” (prior-year) starting point

Advantages
◦ Invariant under monotone transformations of the score scale
◦ Directs attention to the distribution of outcomes, versus a point estimate

15


Is anything really new here?

16

Thanks to Andrew Ho and Katherine Furgol for this graphic

EXAMINING THE EVIDENCE

Stability of “effectiveness” estimates
◦ That first “simplified assumption”

Problems with the tests
◦ That second “simplified assumption”

Random assignment?

Professional consensus

17

EXAMINING THE EVIDENCE

Stability of “effectiveness” estimates
◦ That first “simplified assumption”

Problems with the tests
◦ That second “simplified assumption”

Random assignment?

Professional consensus

18

Stability of “Effectiveness” Estimates

Newton, Darling-Hammond, Haertel, & Thomas (2010) compared high school math and ELA teachers’ VAM scores across
◦ Statistical models
◦ Courses taught
◦ Years

19

Full report at http://epaa.asu.edu/ojs/article/view/810

Sample* for Math and ELA VAM Analyses

Academic Year         2005-06   2006-07
Math teachers              57        46
ELA teachers               51        63
Students, Grade 9         646       881
Students, Grade 10        714       693
Students, Grade 11        511       789

*Sample included all teachers who taught multiple courses. Ns in the table are for teachers x courses. There were 13 math teachers for 2005-06 and 10 for 2006-07, and 16 ELA teachers for 2005-06 and 15 for 2006-07.

Findings from Newton, et al.

21

% of Teachers Whose Effectiveness Ratings Change …

                     By at least 1 decile   By at least 2 deciles   By at least 3 deciles
Across models*       56-80%                 12-33%                  0-14%
Across courses*      85-100%                54-92%                  39-54%
Across years*        74-93%                 45-63%                  19-41%

*Depending on the model
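The kind of comparison summarized in this table can be sketched as follows. This is a hypothetical illustration of the decile-rank calculation, not the authors' code.

```python
import pandas as pd

def decile_rank(scores: pd.Series) -> pd.Series:
    # 1 = bottom decile of the teacher distribution, 10 = top decile
    return pd.qcut(scores, 10, labels=False) + 1

def pct_changing_at_least(est_a: pd.Series, est_b: pd.Series, k: int) -> float:
    """Percent of teachers whose decile rank differs by at least k deciles
    between two sets of effectiveness estimates (e.g., two models, courses, or years)."""
    shift = (decile_rank(est_a) - decile_rank(est_b)).abs()
    return 100 * (shift >= k).mean()
```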


22

One Extreme Case: An English language arts teacher

[Figure: paired bar charts showing this teacher’s decile rank in Year 1 vs. Year 2 (0–10 scale) and the class composition each year: % ELL, % low-income, % Hispanic (0–80% scale)]

Comprehensive high school

Not a beginning teacher

White
Teaching English I

Estimates control for:

◦ Prior achievement

◦ Demographics

◦ School fixed effect

EXAMINING THE EVIDENCE

Stability of “effectiveness” estimates
◦ That first “simplified assumption”

Problems with the tests
◦ That second “simplified assumption”

Random assignment?

Professional consensus

23

11th Grade History/Social Studies

US11.11.2 Discuss the significant domestic policy speeches of Truman, Eisenhower, Kennedy, Johnson, Nixon, Carter, Reagan, Bush, and Clinton (e.g., education, civil rights, economic policy, environmental policy).


Item Testing US11.11.2

9th Grade English-Language Arts

9RC2.8 Expository Critique: Evaluate the credibility of an author’s argument or defense of a claim by critiquing the relationship between generalizations and evidence, the comprehensiveness of evidence, and the way in which the author’s intent affects the structure and tone of the text (e.g., in professional journals, editorials, political speeches, primary source material).


Item Testing 9RC2.8

Algebra I

25.1 Students use properties of numbers to construct simple, valid arguments (direct and indirect) for, or formulate counterexamples to, claimed assertions.


Item Testing 25.1

High School Biology

BI6.f Students know at each link in a food web some energy is stored in newly made structures but much energy is dissipated into the environment as heat. This dissipation may be represented in an energy pyramid.


Item Testing BI6.f

EXAMINING THE EVIDENCE

Stability of “effectiveness” estimates
◦ That first “simplified assumption”

Problems with the tests
◦ That second “simplified assumption”

Random assignment?

Professional consensus

32

Student Assignments Affected By

Teachers’ particular specialties
Children’s particular requirements
Parents’ requests
Principals’ judgments
Need to separate children who do not get along

33

Teacher Assignments Affected By

Differential salaries / working conditions
Seniority / experience
Match to school’s culture and practices
Residential preferences
Teachers’ particular specialties
Children’s particular requirements

34


Does Non-Random Assignment Matter? A falsification test

Logically, future teachers cannot influence past achievement

Thus, if a model predicts significant effects of current-year teachers on prior-year test scores, then it is flawed or based on flawed assumptions

35
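A schematic version of such a falsification check (illustrative only; Rothstein's actual specifications are more elaborate) regresses the prior-year gain on indicators for the current-year teacher and asks whether those indicators are jointly predictive. Variable names here are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

def falsification_test(df: pd.DataFrame) -> float:
    """Regress the PRIOR-year gain (grade 3 to grade 4) on dummies for the
    CURRENT-year (grade 5) teacher; return the p-value of the joint F-test."""
    df = df.assign(prior_gain=df["score_g4"] - df["score_g3"])
    model = smf.ols("prior_gain ~ C(teacher_g5)", data=df).fit()
    # With only teacher dummies in the model, the overall F-test asks whether
    # grade-5 teachers jointly "explain" gains that happened before grade 5.
    return model.f_pvalue

# A very small p-value signals "effects" of future teachers on past growth,
# i.e., student sorting that the value-added model has not removed.
```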

Falsification Test Findings

Rothstein (2010) examined three VAM specifications using a large data set and found “large ‘effects’ of fifth grade teachers on fourth grade test score gains.”

36

Falsification Test Findings

Briggs & Domingue (2011) applied Rothstein’s test to LAUSD teacher data analyzed by Richard Buddin for the LA Times
◦ For Reading, ‘effects’ from next year’s teachers were about the same as from this year’s teachers
◦ For Math, ‘effects’ from next year’s teachers were about 2/3 to 3/4 as large as from this year’s teachers

37

EXAMINING THE EVIDENCE

Stability of “effectiveness” estimates
◦ That first “simplified assumption”

Problems with the tests
◦ That second “simplified assumption”

Random assignment?

Professional consensus

38


Professional Consensus

We do not think that their analyses are estimating causal quantities, except under extreme and unrealistic assumptions. – Donald Rubin

39


Professional Consensus

The research base is currently insufficient to support the use of VAM for high-stakes decisions about individual teachers or schools. – Researchers from RAND Corp.

40

Professional Consensus

VAM estimates of teacher effectiveness that are based on data for a single class of students should not be used to make operational decisions because such estimates are far too unstable to be considered fair or reliable.
– 2009 Letter Report from the Board on Testing and Assessment, National Research Council

41

UNINTENDED EFFECTS

Narrowing of curriculum and instruction
◦ What doesn’t get tested doesn’t get taught

Instructional focus on students expected to make the largest or most rapid gains
◦ Student winners and losers will depend on details of the model used

Erosion of teacher collegial support and cooperation

42

VALID AND INVALID USES

VALID
◦ Low-stakes
◦ Aggregate-level interpretations
◦ Background factors as similar as possible across groups compared

INVALID
◦ High-stakes, individual-level decisions; comparisons across highly dissimilar schools or student populations

43


UNINTENDED EFFECTS

44

“The most pernicious effect of these [test-based accountability] systems is to cause teachers to resent the children who don’t score well.”

—Anonymous teacher, in a workshop many years ago


45

Thank you