1

Exploring Value-Added Across Multiple Dimensions: A Bifactor Approach

Derek Briggs
Ben Domingue

University of Colorado
Maryland Assessment Conference

October 18, 2012

2

Outline

• Motivation:
  – Value-Added Across Different Outcomes
  – Coming to Decisions in a High-Stakes Accountability Context
• Longitudinal Item Response Data
• A Bifactor Analysis
• Value Added to What?

3

A Brief Aside on Vertical Scales

• The original title of this talk was "Multidimensionality, Vertical Scales and Value-Added Models."
• There is a simple bottom line on this: vertical scales are not needed for value-added modeling. It's a non-issue.
• Even for models that focus on repeated measures and growth trajectories, the approach taken to create a vertical scale will rarely have an impact on teacher or school rankings.
  – For details, see the working paper by Briggs & Domingue, "The Gains from Vertical Scaling."
• Vertical scales can play an important role in supporting inferences about student growth in absolute magnitudes.
  – For a critique of current practice, see Briggs, "Measuring Growth with Vertical Scales" (in press), JEM.

4

Motivation

5

“Don’t measure yourself by what you have accomplished, but by what you should have accomplished with your ability.”

John Wooden, Basketball Coach, 1910-2010

6

Our Theories and Intuition Tell Us Academic Success is Multidimensional

According to the Common Core State Standards, students who are college and career ready in

• reading, writing, speaking and listening:
  – Demonstrate independence, build strong content knowledge, comprehend as well as critique, value evidence, use technology and digital media, understand other perspectives and cultures.
• mathematics:
  – Make sense of problems and persevere in solving them, reason abstractly, construct viable arguments and critique the reasoning of others, model with mathematics, attend to precision, etc.

7

Previous Empirical Evidence

The variability in VA by outcome measure is greater than the variability by model specification.
  – Lockwood et al., JEM (2007)
  – MET Study, "Learning about Teaching" (2010)
  – Papay, AERJ (2011)

These studies focused on correlations between VA based on different tests within the same content domain (math, reading).

8

Math vs. Reading

Data Source   Unit of Analysis   Sample Size   Model                                                                                         r(Math, Reading)
Hawaii        Schools            272           Colorado Growth Model (MGPs)                                                                  0.74
Wyoming       Schools            214           Colorado Growth Model (MGPs)                                                                  0.53
Denver PSD    Teachers           180           Colorado Growth Model (MGPs)                                                                  0.58
LAUSD         Teachers           10,794        Fixed effects regression, student demographics, no classroom or school covariates ("LAVAM")   0.60
LAUSD         Teachers           3,306         Fixed effects regression, with classroom and school covariates ("altVAM")                     0.46

9

Making High-Stakes Decisions about Teachers/Schools

Categorical Outcomes (K):
  4 = Highly Effective
  3 = Effective
  2 = Partially Effective
  1 = Ineffective

Evidence of Value-Added in Student Outcomes

Direct Observations of Practice, Other Sources of Evidence

10

Combining Information about Value-Added: Two Approaches

• Compensatory
  – Take a simple or weighted average of the value-added indicator across test outcomes.
  – Classify teachers/schools on the basis of quantiles of the distribution or confidence intervals.
• Conjunctive
  – Classify teachers/schools into i categories on the basis of j outcomes.
  – Make rules that simplify the i × j decision matrix to k categories.
  – Ensure that no teacher/school is ineffective on a given outcome.

(A sketch contrasting the two rules follows.)
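To make the contrast concrete, here is a minimal sketch. It is not from the slides: the simulated VA values, the equal weights, the quartile cut points, and the particular collapsing rule are all hypothetical choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
va_math = rng.normal(size=200)   # hypothetical VA estimates for 200 teachers/schools
va_read = rng.normal(size=200)

def to_categories(x):
    # Map a VA distribution to categories 1-4 by quartiles
    # (1 = Ineffective, ..., 4 = Highly Effective).
    return np.digitize(x, np.quantile(x, [0.25, 0.50, 0.75])) + 1

# Compensatory: average the indicators across outcomes, then classify the composite.
composite = 0.5 * va_math + 0.5 * va_read
compensatory = to_categories(composite)

# Conjunctive: classify each outcome separately, then collapse the decision matrix,
# e.g. cap the overall rating at 2 ("Partially Effective") for any unit that falls in
# the bottom category on either outcome.
cat_math, cat_read = to_categories(va_math), to_categories(va_read)
overall = np.rint((cat_math + cat_read) / 2).astype(int)
conjunctive = np.where((cat_math == 1) | (cat_read == 1), np.minimum(overall, 2), overall)

Under the compensatory rule a strong math result can offset a weak reading result; under the conjunctive rule it cannot.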

11

What Are Tests Designed to “Measure?”

12

“Tests Measure Student Achievement”

• An achievement score is a function of the content sampled from an instructional domain.
• Teachers/schools may vary in their ability to teach different subject matter.
• Agnostic about the underlying latent variable.
• Observed achievement is an estimate of a true score or universe score (G Theory).
  – Each achievement domain has a different hypothetical universe score.
• Consistent with compensatory and conjunctive approaches?

13

“Tests Measure Student Ability (θ)”

• This is a latent variable perspective.
• But math and reading "abilities" are poorly defined latent variables.
• What is distinct and what is the same about these variables?
• What if reading and math items are really just measuring the same unidimensional latent variable?
• Spearman's g?
• Should this be the focus of value-added inferences?

14

A Novel Application of a Bifactor IRT Model

[Path diagram: a common factor ("g"?) underlies all items; a math-specific factor ("Math Knowledge & Skills?") underlies Items 1-45 from a math test; a reading-specific factor ("Reading K & S?") underlies Items 1-54 from a reading test.]

15

Research Questions

1. Is "achievement" distinct from "ability"?
• If we remove the influence that is common to both math and reading test performance, what is left?
• Are the subject-specific variables substantively interpretable across grades?
• How do the three "theta" variables from the bifactor model compare to the "theta" variables from successive unidimensional IRT models?

2. What insights does a bifactor model give us about different approaches to combining estimates of value-added across test outcomes?

16

Exploratory Strategy

• Leverage longitudinal item response data to estimate six "theta" variables:
  UNIDIMENSIONAL
  1. Math (2PL IRT)
  2. Reading (2PL IRT)
  3. Math + Reading (unidimensional 2PL IRT)
  MULTIDIMENSIONAL
  4. Bifactor math (Bifactor 2PL)
  5. Bifactor reading (Bifactor 2PL)
  6. Bifactor g (Bifactor 2PL)
• Examine the characteristics of each as a "measure."
• Compare the use of these different variables as the outcome in a (simple) value-added model.

17

Data & Methods

18

Bifactor Model

i = items, j = item-specific factors, g = general factor

Technical Details
Software: IRTPro 2.1 (Cai, Thissen, du Toit)
Estimation method: Bock-Aitkin EM, 49 quadrature points
References: Cai, 2010; Cai, Yang & Hansen, 2011; Rijmen, 2009; Rijmen et al., 2008.
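The model equation on the original slide did not survive extraction. As a hedged reconstruction consistent with the legend above, the bifactor 2PL for a correct response to item i can be written as

P(X_i = 1 \mid \theta_g, \theta_j) = \frac{1}{1 + \exp\!\left[-\left(a_{ig}\,\theta_g + a_{ij}\,\theta_j + c_i\right)\right]}

where item i loads on the general factor \theta_g and on exactly one item-specific factor \theta_j (math or reading), a_{ig} and a_{ij} are the corresponding discriminations, c_i is the item intercept, and the factors are typically assumed orthogonal.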

19

CSAP Tests in Math and Reading

Math, Grades 3-10
Content Standards:
1. Number & Operation Sense
2. Algebra, Patterns and Functions
3. Statistics and Probability
4. Geometry
5. Measurement
6. Computational Techniques
All 6 standards emphasize application of content for problem solving and communication.
Mix of MC and CR items.

Reading, Grades 3-10
Content Standards:
1. Reading Comprehension
2. Thinking Skills
3. Use of Literary Information
4. Literature
Subcontent: Fiction, Nonfiction, Vocabulary, Poetry
Mix of MC and CR items.

20

Longitudinal Item Response Structure: Students nested in Schools

Source: Denver Public School District

21

Student & School Characteristics

Across grades 5-9:
• About 62% of DPS students are eligible for free or reduced-price lunch services (FRL).
• About 10% receive special education services (SPED).
• Between 10-20% are English Language Learners (ELL).

Across DPS schools:

Variable   Mean   SD
FRL        65%    28%
SPED       11%    8%
ELL        14%    13%

22

Students per School (40 schools)

Min        62
1st Qu.    128
Median     210
Mean       236.6
3rd Qu.    315.2
Max        578

23

Value-Added Model

• Fixed effects regression
  – Pools Grade 6 estimates (middle school) and Grade 9 estimates (high school).
• Outcome: one of the six "theta" variables created.
• Covariates:
  – Prior-grade achievement in the same outcome
  – Free/reduced-price lunch status
  – English Language Learner status
  – Special education status
  – Grade 9 dummy variable (grade 6 omitted)
• Empirical Bayes shrinkage estimators

(A regression sketch follows this list.)
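A minimal sketch of this kind of model in Python, using synthetic data and hypothetical column names (the slides use their own estimation pipeline; the shrinkage step below is one common empirical Bayes formulation, not necessarily the exact one used here):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the student-level analysis file (hypothetical columns).
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "school": rng.integers(0, 40, n).astype(str),   # 40 hypothetical schools
    "prior_theta": rng.normal(size=n),
    "frl": rng.integers(0, 2, n),
    "ell": rng.integers(0, 2, n),
    "sped": rng.integers(0, 2, n),
    "grade9": rng.integers(0, 2, n),
})
df["theta"] = 0.8 * df["prior_theta"] - 0.05 * df["frl"] + rng.normal(scale=0.5, size=n)

# Fixed effects regression: current theta on prior theta, student covariates,
# a grade-9 dummy, and a full set of school dummies (no intercept).
fit = smf.ols("theta ~ 0 + C(school) + prior_theta + frl + ell + sped + grade9",
              data=df).fit()

# Raw school effects (centered) and their standard errors.
school_terms = [name for name in fit.params.index if name.startswith("C(school)")]
raw = fit.params[school_terms] - fit.params[school_terms].mean()
se = fit.bse[school_terms]

# Empirical Bayes shrinkage: multiply each raw effect by its reliability,
# lambda_s = tau^2 / (tau^2 + SE_s^2), where tau^2 is the between-school variance
# of the raw effects net of average sampling error.
tau2 = max(raw.var() - (se ** 2).mean(), 0.0)
shrunken = raw * tau2 / (tau2 + se ** 2)

Shrinkage pulls noisy effects for small schools toward zero, which is why the reported school-effect SDs later in the deck are described as "shrunken VA estimates."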

24

Caveats

• This is a very simple VAM.
• Limited set of covariates, no school-level variables.
• No teacher linkages, only schools.
• Only a single longitudinal cohort of students.
• No adjustment for measurement error.
  – (Though we did examine possible adjustments; results not shown here.)

25

Results

26

Correlational Patterns Across Grades for Unidimensional Math and Reading

        5       6       7       8       9
5     (.76)    .88     .87     .87     .84
6      .85    (.78)    .91     .89     .88
7      .82     .86    (.78)    .91     .89
8      .77     .82     .86    (.76)    .89
9      .77     .80     .83     .85    (.74)

Math: lower triangle; reading: upper triangle; main diagonal: math/reading correlations.

Note how strong these correlations are even after 4 years.

27

Bifactor Loadings (Grade 7 2005)

Horizontal blue line at loading of .3

28

Bifactor Loadings (Grade 7 2006)

29

Bifactor Loadings (Grade 7 2004)

It seems clear that something is amiss with the grade 7 data in 2005, so we omit this grade in the analyses that follow.

30

Marginal Reliabilities

The bifactor math- and reading-specific estimates are rather noisy, with low reliability at the student level.
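For reference (a standard formulation, not quoted from the slides), marginal reliability for an IRT scale score can be written as

\bar{\rho} = \frac{\hat{\sigma}^2_{\hat{\theta}} - \overline{SE^2(\hat{\theta})}}{\hat{\sigma}^2_{\hat{\theta}}}

i.e., the proportion of score variance that is not average error variance; low values for the bifactor math and reading dimensions mean noisy student-level estimates of those dimensions.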

31

Correlational Patterns Across Grades for Bifactor Math & Reading

        5        6        8        9
5     (-.23)    .52      .45      .44
6      .37     (-.13)    .54      .52
8      .03     -.02     (-.20)    .56
9      .06     -.10      .27     (-.18)

Math: lower triangle; reading: upper triangle; main diagonal: math/reading correlations.

32

Regression Results with Unidimensional Outcomes

Unidimensional Approach

                                           Math     Reading   Combined
Prior Grade Theta                          0.79*    0.81*     0.86*
Free/Reduced-Price Lunch Eligible         -0.04    -0.07*    -0.03
English Language Learner                  -0.04    -0.01     -0.01
Student has an IEP                        -0.11*   -0.15*    -0.10*
R2 for model w/ school fixed effects       0.761    0.800     0.855
R2 for model w/ NO school fixed effects    0.734    0.785     0.838
Increase in R2 due to schools              0.027    0.014     0.016

Note: Each outcome is standardized, so coefficients can be interpreted in an effect size metric. * p < .05

33

Regression Results with Bifactor Outcomes

Multidimensional Approach

                                           Math     Reading   g
Prior Grade Theta                          0.33*    0.48*     0.84*
Free/Reduced-Price Lunch Eligible         -0.04    -0.21*     0.00
English Language Learner                   0.07    -0.20*     0.01
Student has an IEP                        -0.02    -0.22*    -0.11*
R2 for model w/ school fixed effects       0.147    0.356     0.814
R2 for model w/ NO school fixed effects    0.116    0.331     0.793
Increase in R2 due to schools              0.032    0.025     0.021

Note: Each outcome is standardized, so coefficients can be interpreted in an effect size metric. * p < .05

34

School “Effects” Distributions from Unidimensional vs. Bifactor Outcomes

Unidimensional outcomes: SD Math = 0.22; SD Read = 0.13; SD Comp = 0.18
Bifactor outcomes: SD Math = 0.11; SD Read = 0.11; SD g = 0.21

Note: These are shrunken VA estimates.

35

Unidimensional vs. Bifactor Math

SD Uni Math = 0.22
SD BF Math = 0.11

Note: These are shrunken VA estimates

36

Unidimensional vs. Bifactor Reading

SD Uni Read = 0.13
SD BF Read = 0.11

Note: These are shrunken VA estimates

37

VA Comparisons: Uni Math, Uni Reading vs. g

Value-added for math seems mostly redundant with value-added for g (r = .98); but looking at reading separately yields some unique information (r = .82).

38

VA g is equivalent to VA from combined math and reading

39

Math vs. Reading: With and Without g

40

Math vs. Reading VA within Method

Panels: Bifactor VA by Subject; Unidimensional VA by Subject

41

Relationship of VA with School-Level Status Variables

If low correlations with these variables were considered an indication of a VA indicator that successfully leveled the playing field, the school effects associated with bifactor math outcomes would "win."

42

Discussion

43

Summary

• When math and reading outcomes are quantitatively combined (VAcomp, or taking the average of VA across subjects), this is essentially equivalent to estimating VA for "g."
• Math and reading items tend to load strongly on g.
  – Math items load weakly on the math-specific bifactor.
  – Reading items have moderate loadings on the reading-specific bifactor.
• Evidence that the math and reading bifactors are not just noise.
• School fixed effects explain more variability in the math/reading factors than in traditional unidimensional measures.
• There is unique information about reading that would be missed if math and reading were combined.

44

Limitations & Next Steps

Limitations:
• No links to teachers.
• No access to the actual test forms (and items) that were administered.

Next steps:
• Examine loadings by content and process standards in test blueprints.
• Do results generalize to
  – schools & districts throughout the state?
  – multiple cohorts of students?
  – other tests?
  – more complex VAMs (control for unit-level aggregates)?

45

Tough Conceptual Questions

• What is g?
  – Is it sensitive to instruction?
  – Is it what we want to hold teachers and schools accountable for increasing?
• If a test measures something beyond g, what is that something? Can it be distinguished?
• Value-added to what?

46

Claims from the Smarter Balanced Large-Scale Assessment Consortium

In the domain of mathematics:

Claim 1: Students can explain and apply mathematical concepts and interpret and carry out mathematical procedures with precision and fluency.
Claim 2: Students can solve a range of complex, well-posed problems in pure and applied mathematics, making productive use of knowledge and problem-solving strategies.
Claim 3: Students can clearly and precisely construct viable arguments to support their own reasoning and to critique the reasoning of others.
Claim 4: Students can analyze complex, real-world scenarios and can construct and use mathematical models to interpret and solve problems.

Should each of these claims be measured with a unique score? Should we expect variability in teacher efficacy on each? Or are all of these claims wrapped up in g?