
RUNNING HEAD: Measurement and Impact of the Treatment

On the Measurement and Impact of Formative Assessment on

Students’ Motivation, Achievement, and Conceptual Change

Yue Yin1

Carlos C. Ayala2

Richard J. Shavelson3

Maria Araceli Ruiz-Primo4

Miki K. Tomita3

Erin M. Furtak5

Paul R. Brandon1

Donald B. Young1

1. College of Education, University of Hawaii

2. School of Education, Sonoma State University

3. School of Education, Stanford University

4. University of Colorado at Denver and Health Sciences Center

5. Max Planck Institute for Human Development


Abstract

Formative assessment holds the promise of promoting achievement and

motivation. The effects of formative assessments, however, have rarely been examined

experimentally in regular education settings. Motivational beliefs are hypothesized to be

important factors that influence conceptual change. Because the motivational beliefs

expected to be modified by formative assessment happen to be the motivational beliefs

hypothesized to influence conceptual change, the study connected the formative

assessment and conceptual change frameworks and examined the effect of formative

assessments on student outcomes. In particular, the study explored whether formative

assessments would improve student motivation, achievement, and consequently

conceptual change. A randomized experimental design in a field setting was used and 12

middle school science teachers and their students participated in the study. It was found

that student motivation, achievement, and conceptual change varied significantly among

the participating teachers, but the formative assessment treatment did not have a significant impact on these outcome variables.


On the Measurement and Impact of Formative Assessment on

Students’ Motivation, Achievement, and Conceptual Change

The logic of formative assessment—identify learning goals, assess where students

are with respect to those goals, and use effective teaching strategies to close the gap—is

compelling and has led to the expectation that formative assessment would improve

students’ learning and achievement (Ramaprasad, 1983; Sadler, 1989). Moreover,

substantial empirical evidence has supported the effectiveness of formative assessment

(Black & Wiliam, 1998a; Black & Wiliam, 1998b). However, the evidence mainly comes

from either laboratory studies or anecdotal records (e.g., Black & Wiliam, 1998a). As

Black and Wiliam (1998a) pointed out, studies conducted in laboratory contexts may

suffer “ecological validity” problems and encounter reality obstacles when applied in

classrooms. The effects of formative assessments have rarely been examined

experimentally in regular education settings.

The study reported in this paper examined the effect of formative assessments on

student outcomes by using a randomized experimental design in a field setting. The study

explored whether formative assessments would improve student motivation,

achievement, and conceptual change. This paper presents (a) the conceptual framework

of the study, (b) the research design, (c) the instruments used to measure the outcome


variables—motivation, achievement, and conceptual change—and the technical qualities

of each instrument; and (d) results of the experiment.

Conceptual Framework

This section reviews literature on conceptual change and formative assessment.

Based on the literature review, the theoretical framework for this study is laid out and

research questions are raised.

Achievement is viewed as acquiring knowledge and the capacity to reason with

that knowledge. As described in Shavelson et al. (this issue), knowledge can be

distinguished as four types: declarative, procedural, schematic, and strategic. The focus

of this study is on schematic knowledge and more specifically how to move students

from a naïve conception of the natural world to a scientifically justifiable one as to “why

things sink and float”. Hence we focus on conceptual change and factors that affect it.

Conceptual Change

When students come into science classrooms, they often hold deep-rooted prior

knowledge or conceptions about the natural world. These conceptions influence how they

come to understand what they are taught. Some of their existing prior knowledge

provides good foundations for formal schooling, for example, their prior knowledge of

number and language. However, other prior conceptions are incompatible with currently

accepted scientific knowledge; these conceptions are commonly referred to as

misconceptions (NRC, 2001) or alternative conceptions (Abimbola, 1988).


Usually students derive misconceptions through limited observation and

experience. Consequently, learning is not only the acquisition of new knowledge but also

the interaction between new knowledge and students’ prior knowledge. For example,

everyday life experience leads young children to believe that the earth is flat. In fact,

even when some children learn that the earth is round, they believe that it is like a

pancake—round but still flat (Vosniadou & Brewer, 1992).

The topic of this study, “Why things sink and float,” is addressed in many middle

school physical science curricula and students’ misconceptions about the topic have been

well documented (e.g., M. G. Hewson, 1986; Kohn, 1993; Yin, Tomita, & Shavelson, in

press). Although sinking and floating is a common experience in everyday life, it is in

fact a sophisticated scientific phenomenon. To fully understand the fundamental reason

underlying sinking and floating, students need to have complicated knowledge that

includes an analysis of forces (buoyancy and gravity) and water pressure. That

knowledge, however, is either not introduced or not sufficiently addressed in middle

school curricula, most likely because of the difficulty of learning it. Rather, some curriculum

developers take a shortcut and use relative density to simplify the explanation of sinking

and floating (e.g., Pottenger & Young, 1992). Even so, relative density itself is

challenging for many students because density is a concept involving the ratio of mass to

volume (e.g., Smith, Snir, & Grosslight, 1992) and relative density involves comparing

two ratio variables, the density of an object and the density of a medium.
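To make the relative-density rule concrete, the short sketch below expresses it in a few lines of Python; the function names and the example numbers are illustrative only and are not part of the curriculum materials.

```python
def density(mass_g: float, volume_cm3: float) -> float:
    """Density is the ratio of mass to volume (g/cm^3)."""
    return mass_g / volume_cm3


def floats_in(object_density: float, medium_density: float) -> bool:
    """An object floats when its density is less than the medium's density,
    regardless of the object's size or mass (the relative-density rule)."""
    return object_density < medium_density


# A 100 g block with a volume of 120 cm^3 (about 0.83 g/cm^3) floats in water (1.0 g/cm^3);
# a 100 g block with a volume of 80 cm^3 (1.25 g/cm^3) sinks.
print(floats_in(density(100, 120), 1.0))  # True
print(floats_in(density(100, 80), 1.0))   # False
```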


Despite its complexity in science, sinking and floating is a common phenomenon.

Most students have rich experiences and personal theories or mental models for

explaining sinking and floating. Unfortunately, many of their theories are either

misconceptions or conceptions that are only valid under certain circumstances--for

example, big/heavy things sink, things with holes in them sink, hollow things float, things

with air in them float, or flat things float. Those conceptions are so deeply rooted in

students’ minds that it is very difficult for students to change them, even after students

have been intensively exposed to scientific conceptions such as, “If an object’s density is

less than a liquid’s density, the object will float in the liquid regardless of the size or mass

of the object” (Yin et al., in press).

Similar to children’s misunderstanding of the shape of the planet earth and “why

things sink and float,” many other misconceptions are deeply rooted in everyday

experiences, which then inhibit students from acquiring scientific conceptions. These

alternative conceptions exist widely across different subject domains, among people of

different ages, across different cultures, and through the history of the development of

scientifically justifiable ideas (Yin, 2005).

Given the presence and persistence of alternative conceptions, the restructuring or

reorganization of existing knowledge, or conceptual change, has become a very

important component of teaching and learning (Carey, 1984; Schnotz, Vosniadou, &

Carretero, 1999). To fully establish scientifically justifiable conceptions of the natural


world, sometimes students have to experience conceptual change (Carey, 1984) and

transform misconceptions to complete and accurate conceptions (NRC, 2001).

From a cognitive perspective, researchers have described the mechanics of

conceptual change (Demastes, Good, & Peebles, 1996; Posner, Strike, Hewson, &

Gertzog, 1982); developed theories about the nature of conceptual change (Chi, 1992;

Chi, Slotta, & deLeeuw, 1994); and explored practical ways to foster conceptual change

(Chinn & Malhotra, 2002; P. W. Hewson & Hewson, 1984). Our knowledge of the

cognitive aspects of alternative conceptions and conceptual change has been greatly

enriched over the past 20 years. However, the traditional approaches have been criticized

as “cold cognition” because they could not adequately explain why some students

changed their conceptions while others did not (Dreyfus, Jungwirth, & Eliovitch, 1990;

Pintrich, 1999; Pintrich, Marx, & Boyle, 1993).

Motivation researchers proposed that the conceptual change process is influenced

by students’ motivational beliefs, such as goal orientations, epistemological beliefs,

interest or value, and efficacy (e.g., Pintrich, Marx, & Boyle, 1993; Pintrich, 1999). They

suggested that conceptual change could be facilitated by students’ acquisition of a task goal

orientation, incremental intelligence belief, high self-efficacy, and strong interest; in

contrast, conceptual change could also be prevented by ego goal orientation and fixed

intelligence beliefs. Each of these possible influences on conceptual change will be

discussed in more detail below.


Goal orientations are self-constructed theories about what it means to succeed.

They help students make meaning of their own learning in context. Goal orientations are

generally classified into two main categories: intrinsic (also called mastery or task-involved orientation) and extrinsic (also called performance or ego-involved orientation)

(Dweck & Leggett, 1988). Task goal orientation means that the goal of studying is to

learn; Ego-involved orientation can be further classified to ego approach orientation (t

those with an ego goal orientation, students with a task goal orientation tended to focus

on learning, understanding, and mastering the tasks, and engaged in deeper processing

strategies, which are helpful for conceptual change.

The intelligence belief concerns whether intelligence or ability is fixed or malleable.

Schommer (1992) found that individuals who believe intelligence is a malleable quality

that can be developed (i.e., incremental intelligence) are more likely to be cognitively

engaged in tasks. Dweck and Leggett (1988) suggested that intelligence theory can

influence students’ goal orientation, which, in turn, influences students’ learning. In their

model, a student with fixed intelligence beliefs is more likely to adopt a performance goal

orientation. As discussed in the task goal orientation section, performance goal

orientation is considered to constrain conceptual change by limiting the level of deeper

cognitive engagement.

Interest is students’ general attitude or preference for the content or task

(Eccles, 1983). Pintrich (1999) argued that interests help facilitate the process of


conceptual change because conceptual change demands that students maintain their

cognitive engagement in order to accommodate new conflicting information.

Self-efficacy beliefs are individuals’ beliefs about their performance capabilities in

a particular domain (Bandura, 1986; Pajares, 1996). Self-efficacy is related to task choice

(Bandura, 1997), effort taken, task persistence (Bandura, 1986; Bandura & Cervone,

1983), and cognitive engagement (Pintrich & Schrauben, 1992). Pintrich suggested that

the students who are confident in their ability to change their ideas and to learn to use

proper cognitive strategies to integrate and synthesize divergent ideas are more likely

than others to realize conceptual change.

Pintrich and his colleagues’ greatest contribution is that they pointed out an

important missing component in the previous cognitive conceptual change model,

expanding researchers’ horizons from a narrow cognitive view to a broader one that

involves individuals’ motivations. They suggested that motivational beliefs influence

people’s cognitive engagement and depth of information processing, which in turn

determines whether conceptual change can occur. Moreover, Pintrich insisted that all

motivational beliefs are amenable to change, even though some of them are robust or

relatively stable (Pintrich, 1999; Pintrich et al., 1993). For instance, an educational

context that encourages a task goal orientation will encourage students to develop a task goal orientation. Similarly, an ego goal-oriented context encourages an ego goal orientation

among students.


Few studies have been conducted to examine empirically motivation theorists’

claims about the influence of motivation on conceptual change. One possible approach to

bringing about conceptual change that integrates cognition and motivation is formative

assessment.

Formative Assessment

Building on earlier reviews, Black and Wiliam reviewed 580 articles (or chapters)

from over 160 journals in a nine-year period and concluded that formative assessments

have a positive impact on students’ learning, as well as on students’ motivation (Black,

Harrison, Lee, Marshall, & Wiliam, 2002; Black & Wiliam, 1998a; Black & Wiliam,

1998b; Hattie & Timperley, 2007).

In summative assessment, scores are often related to students’ rank compared to

their peers, and performance is the most important concern. This environment produces

several negative effects on motivation and achievement: (a) Students develop an ego-

involving goal orientation (Schunk & Swartz, 1993); (b) students, especially the low

achievers, tend to attribute achievement to capability (Siero & Van Oudenhoven, 1995)

and regard ability as inborn and intelligence as fixed (Vispoel & Austin, 1995), which

might discourage them from putting effort into future learning; (c) students are reluctant

to seek help because they fear their questions will be regarded as evidence demonstrating

low ability (Blumenfeld, 1992); and (d) students tend to be engaged in superficial

information processes, for example, rote learning and concentrating on the recall of

isolated details (Butler, 1987; Butler & Neuman, 1995).


In contrast, formative assessment offers substantial advantages for motivation

and achievement: (a) Because formative assessment emphasizes the learning process and

closing the gap between students’ current situation and the desired goal, students are

more likely to form a task-involving goal orientation, which encourages students to

process information more deeply than does the performance orientation (Schunk &

Swartz, 1993); (b) Because formative assessment concentrates on improving students’

learning, it is less likely to cause students to lose confidence. Instead, students tend to

believe in incremental intelligence; that is, ability is just a collection of skills that people

can master over time (Vispoel & Austin, 1995). Formative assessment shapes self-

efficacy in a way that benefits students’ learning and, consequently, students’

achievement, especially that of lower achievers (Geisler-Brenstein & Schmeck, 1996). (c)

The activities involved in formative assessment can improve students’ interest in

learning. (d) Students engaged in formative assessments increase their self-regulation,

reasoning, and planning, which are all important for effective learning and conceptual

change (Black & Wiliam, 1998a).

In sum, formative assessment is expected to encourage the motivational beliefs

hypothesized to promote conceptual change, such as, task goal orientation, incremental

intelligence beliefs, self-efficacy, and interest. Meanwhile, formative assessment

discourages the motivational beliefs hypothesized to prevent conceptual change, such as

ego goal orientation and fixed intelligence beliefs (Figure 1). For convenience in the


discussion that follows, the former motivational beliefs are called positive motivation and the latter

are called negative motivation.

--------------------------------

Insert Figure 1 about Here

--------------------------------

Because the two models share the motivational beliefs, it seems plausible to

connect the motivational model of conceptual change with formative assessment. We

conjectured that students who participate in formative assessments would be more likely

to change motivational beliefs in a way that promotes conceptual change, demonstrate

improved achievement, and realize conceptual change than those who did not participate

in formative assessments.

As promising as it sounds, formative assessment places greater demands on

teachers than regular teaching due to its uncertainty and flexibility (Bell & Cowie, 2001).

Black and Wiliam suggested that teachers “need to see examples of what doing better

means in practice” (1998b, p.146). To date, there is a paucity of systematic, detailed, and

practical information to address adequately the challenge of how to design and implement

formative assessments in teaching to help improve student learning and motivation.

Shavelson et al. (this issue) characterized formative assessment techniques as

falling on a continuum based on the amount of planning involved and the formality of

technique used, benchmarking points along the continuum as—(a) on-the-fly formative

assessment, which occurs when teachable moments unexpectedly arise in the classroom;


(b) planned-for-interaction formative assessment, which is used during instruction but

prepared deliberately before class to align closely with instructional goals; and (c) formal

and embedded-in-the-curriculum formative assessment, which is designed to be embedded

after a curriculum unit to make sure that students have reached important goals before

moving to the next lesson. The first two kinds of formative assessments require teachers

to have rich teaching experience and/or strong spontaneous teaching skills. To assist

teachers to implement formative assessments, formal formative assessment, Embedded

Formative Assessment, was taken in this study’s examination of the impact of assessment

on student outcomes.

The study attempts to answer the following three questions: (a) Can Embedded

Formative Assessment improve students’ motivational beliefs, which were conjectured to

influence conceptual change? (b) Can Embedded Formative Assessment improve

students’ achievement? (c) Can Embedded Formative Assessment improve conceptual

change? Based on prior theory we expected that formative assessment could improve all

three outcomes.

Method

Design

To answer the three research questions, a small randomized experimental study

was conducted. The study involved two groups of teachers, one that employed Embedded

Formative Assessment while teaching a science unit and another that taught the same unit


without formative assessments. Four steps were taken in the study: (a) Twelve teacher

participants along with their students were randomly assigned to either the experimental

group or the control group; (b) All of the students were given a pretest motivation survey

and science achievement test (including conceptual change items); (c) Both groups of

teachers taught the same curriculum unit, but the teachers in the experimental group used

formative assessments and were expected to use the information collected by formative

assessment to help improve teaching and learning; (d) All the students were given a

posttest motivation survey and achievement test. By comparing the experimental and

control students’ motivation and achievement scores on the pretest and posttest, we

examined whether the formative assessment treatment affected students’ motivation,

achievement, and conceptual change.

Details about the participants and procedures are addressed by Shavelson et al. (this issue). Details about the curriculum and treatment are described by Ayala et al. (this

issue). This paper focuses on the instruments and results of the study1.

Instruments

A motivation questionnaire and achievement assessments were developed to

examine the impact of embedded formative assessments.

1 Due to space limitations, the discussion of instruments and results in this paper is rather brief. For more details about the instruments and results, see Yin (2005).


Motivation Questionnaire

Motivation questionnaire development. The motivation questionnaire measures

the constructs that were addressed in both conceptual change literature and formative

assessment literature (Black et al., 2002; Pintrich, 1999; Pintrich et al., 1993). Among

them, four motivational beliefs promote conceptual change: (a) task goal orientation

(e.g., “An important reason I do my science work is because I like to learn new things.”),

(b) perceived task-goal orientation context (e.g., “Our teacher really thinks it is very

important to try hard”), (c) self-efficacy in science (e.g., “I am certain I can figure out

how to do difficult work in science”), and (d) interest in science (e.g., “I find science

interesting”). The other four motivational beliefs prevent conceptual change: (a) ego

approach orientation (e.g., “I want to do better than other students in my science class.”),

(b) ego avoidance orientation (e.g., “It is very important to me that I do not look stupid in

my science class”), (c) perceived performance-goal orientation context (e.g., “Our teacher

lets us know which students get the highest scores on tests”), and (d) inborn intelligence

epistemic beliefs (e.g., “I can’t change how smart I am in science”).

Most motivation items were drawn from Lau and Roeser’s study (2002). A five-

point Likert-type scale from 1 (strongly disagree) to 5 (strongly agree) was used in the

questionnaire to measure the degree to which students agreed or disagreed with each

statement. The same motivation questionnaire was given to the control and experimental

groups at pretest and posttest.


Technical quality of the motivation questionnaire. Internal consistency analysis,

exploratory factor analysis, and confirmatory factor analysis were conducted to validate

the motivation constructs at pretest and posttest and diagnose ill-matched items.

Coefficient alpha was calculated for each final scale. Exploratory factor analysis was used to examine whether the motivation items measured the constructs they were designed to measure and to identify ill-matched items. Principal Axis Factoring extraction and

Promax with Kaiser Normalization rotation were applied assuming the motivation

constructs to be correlated with each other. Factors with eigenvalues greater than 1 were

extracted. The factor loadings of each item in the pattern matrix were examined to

indicate the fit of each item with its corresponding construct. Finally, confirmatory

factor analysis was conducted to further validate the motivation constructs and

modification indices (MIs) were examined to identify misfitting items in the model. Details

of these analyses are available from the first author.
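As a rough illustration of this kind of item analysis, the Python sketch below computes coefficient alpha for one subscale and fits a Principal Axis Factoring solution with Promax rotation using the third-party factor_analyzer package. The file and column names are hypothetical, and the sketch is not the analysis code used in the study.

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer  # third-party package, assumed installed


def cronbach_alpha(items: pd.DataFrame) -> float:
    """Coefficient alpha: k/(k-1) * (1 - sum of item variances / variance of the total)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)


# One row per student, one column per Likert item (1-5); column names are hypothetical.
responses = pd.read_csv("motivation_posttest_items.csv")
print("alpha, task goal orientation:",
      round(cronbach_alpha(responses.filter(like="task_goal_")), 2))

# Principal Axis Factoring with Promax (oblique) rotation; eight factors are requested
# here to mirror the intended constructs, while the eigenvalue-greater-than-1 rule can
# be checked against the unrotated eigenvalues.
fa = FactorAnalyzer(n_factors=8, method="principal", rotation="promax")
fa.fit(responses)
eigenvalues, _ = fa.get_eigenvalues()
print("factors with eigenvalue > 1:", int((eigenvalues > 1).sum()))
pattern = pd.DataFrame(fa.loadings_, index=responses.columns)  # rotated pattern loadings
print(pattern.round(2))
```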

Table 1 presents the internal consistency of each motivation construct and the

correlations among them, based on the posttest data. The results based on pretest data,

which were similar to those based on the posttest, were omitted to conserve space.

--------------------------------

Insert Table 1 about here

--------------------------------

Overall, the motivation sub-scales have acceptable internal consistency, except

for fixed intelligence (.44), possibly due to the small number of items (3) (Table 1). With this


caveat in mind, the average score across items in each subscale was calculated to form a

composite score. Ten composite scores corresponding to the eight constructs were used

for further data analyses. Most motivation constructs were significantly correlated with

each other (Table 1). All the positive motivation constructs were positively correlated

with each other, as were all the negative motivation constructs. With the exception of the correlation between ego approach and self-efficacy, whenever a correlation between a positive and a negative motivation construct was significant, it was negative. The correlation pattern among the motivation constructs provided evidence

for discriminant validity of the motivation measurement.

Achievement Assessments

In this section, the multiple-choice test and open-ended assessments are described

and their technical qualities examined.

Achievement assessment development. Assessment development was based on

two concerns: (a) to adequately cover the curriculum’s instructional objectives and (b) to

comprehensively represent different types of knowledge--in particular, declarative

knowledge (knowing that), procedural knowledge (knowing how), and schematic

knowledge (knowing why)--the knowledge-reasoning-type framework set forth in

Shavelson et al. (this issue). Four assessments were developed: a multiple-choice test, a

performance assessment, a short-answer assessment, and a predict-observe-explain

assessment. A description of predict-observe-explain assessment can be found in Ayala et

al. (this issue).


Thirty-eight items were included in the multiple-choice pretest, covering the

important instructional objectives across three types of knowledge. Multiple-choice test

item examples can be found in Shavelson et al. (this issue). In addition to the 38 items on

the pretest, five more items mainly focusing on common misconceptions about sinking

and floating were included on the posttest. Although we tried not to include any identical

achievement assessment items in the formative assessments, some of the content or style

of the formative assessments for the experimental group might be unavoidably similar to

that of the items or tests in the achievement assessments. To reduce the possible bias, we

selected 14 items from external sources, such as the National Assessment of Educational

Progress and the Trends in International Mathematics and Science Study, to tap the spread of

the treatment effect. Because the correlation between the external items and internal items

was high, r = .70, p < .01, the external items were analyzed together with the internal items in this paper.

The performance assessment was designed to assess students’ procedural and

schematic knowledge. In the first part of the performance assessment, given the mass

information of a rectangular block and some equipment, students were asked to find the

density of the block. To solve the problem, students needed to (a) use either a ruler or an

overflow container to measure the volume of a solid block and (b) apply the density

formula to calculate the density of the block. In the second part of the performance

assessment, given three blocks with density information and a bottle of mystery liquid

with unknown density, students were asked to find the density of the mystery liquid. To


solve this problem, students needed to conduct an experiment, observe the sinking and

floating behavior of three blocks in mystery liquid, and infer density information of the

mystery liquid.
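The inference required in the second part of the task can be sketched in a few lines of Python; this is only an illustration of the reasoning, with hypothetical block densities, not the scoring logic used in the study.

```python
def mystery_liquid_density_bounds(observations):
    """Bracket an unknown liquid's density from sink/float observations.

    observations: iterable of (block_density, floated) pairs, density in g/cm^3.
    A floating block is less dense than the liquid (a lower bound on the liquid's
    density); a sinking block is denser than the liquid (an upper bound).
    """
    lower = max((d for d, floated in observations if floated), default=None)
    upper = min((d for d, floated in observations if not floated), default=None)
    return lower, upper


# Hypothetical observations: blocks of 0.8 and 1.1 g/cm^3 float, a 1.4 g/cm^3 block sinks.
low, high = mystery_liquid_density_bounds([(0.8, True), (1.1, True), (1.4, False)])
print(f"The mystery liquid's density lies between {low} and {high} g/cm^3.")
```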

The short-answer assessment was designed to measure students’ declarative and

schematic knowledge. It asked students to explain “why do things sink or float” and to

provide evidence and an example to support their argument. The same short answer

question was repeatedly posed in the Embedded Formative Assessment for the

experimental group, bringing advantages as well as disadvantages to the study. The

advantage is that because the same question was asked at different times in the

experimental group, the experimental students’ conceptual growth could

be tracked until the end of the instruction. Two disadvantages were possible: (a) the

experimental group had been unavoidably sensitized to this question and may have

benefited from their teachers’ feedback and (b) the experimental group might have

become bored with answering the same question three times, losing their motivation to

provide their best answers on the posttest. Therefore, caution should be taken when

interpreting the difference between experimental and control groups in their responses to

this question.

The predict-observe-explain assessment was also designed to assess students’

schematic knowledge. In the predict-observe-explain assessment, a test administrator

demonstrated that a bar of soap sank in water. Then the test administrator cut the soap

into two unequal pieces—1/4 and 3/4—and asked students to predict what would happen


when the two pieces were put in water and why. After students turned in their predictions

and reasons, the test administrator put the two pieces of soap in water, and asked students

to record their observations and to reconcile them with their predictions. The soap predict-observe-

explain assessment was intended to examine whether students understood two main

points: (a) density is a property of a material and will not change with the size of the

object, and (b) an object sinks or floats depending on its density (relative to the medium’s

density) instead of its volume and mass.

The multiple-choice test was administered at both pretest and posttest.

The performance, short-answer, and predict-observe-explain assessments were administered

at posttest only for the following reasons: (a) The performance assessment was heavily

content loaded. The pilot study showed that students tended to use trial and error to solve

the performance assessment problem before they had learned the relevant curriculum

content. Therefore the performance assessment could not provide valuable information at

pretest. (b) The short-answer assessment asked a succinct and straightforward question,

making it easy to memorize. We worried that it might sensitize control-group teachers

and lead them to prepare students to answer it, which might unnecessarily boost control-

group students’ performance at posttest. And (c) the predict-observe-explain assessment

was so memorable that once students knew what happened to the two pieces of soap, they

might not forget the result.

Research team members administered the instruments in the classrooms of each

teacher immediately before and after the target unit. The motivation questionnaire was


given before the achievement tests, so that students’ report of their motivation would not

be influenced by their performance on the achievement tests. On the posttest, the

achievement tests were given in the following order: multiple-choice test, performance

assessment, short-answer assessment, and predict-observe-explain assessment (Yin,

2005). The predict-observe-explain assessment was given last because of its instructional

effect.

Technical Quality of the Achievement Assessments. The quality of the multiple-

choice test items was examined with respect to (a) item difficulties, (b) instructional sensitivity, (c)

factorial structure, and (d) internal consistency. With respect to difficulty, most items

appropriately fell in the moderate range (p-values of .40 to .80 at posttest). Some items were

found to be too difficult for the students, and these items were the ones that assessed

whether students still had misconceptions. The proportions of students who correctly

answered these questions were close to or even lower than what would be expected by

guessing at random (.25). The results suggested that most students still held alternative

conceptions after instruction.

Instructional sensitivity (D) was used to examine whether multiple-choice test

items discriminated between examinees who had and those who had not been taught. It is

defined as the difference between an item’s proportion of students responding correctly

(p-value) after instruction (posttest) and the p-value before instruction (pretest) (Crocker

& Algina, 1986). A high positive D value indicates that the item can well discriminate

between examinees who have received instruction and those who have not. All the items


had positive instructional sensitivity (D) values. That is, for all the items given at pre- and posttest,

more students chose correct answers at posttest than at pretest. This result provided

evidence for the content validity of the multiple-choice test items. That is, the items were

linked to the curriculum content and would be expected to reflect the effect of instruction.
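A minimal sketch of these item statistics is shown below, assuming the scored (0/1) item responses are stored one column per item with matching column names in the pretest and posttest files; both the file and column layouts are assumptions for illustration.

```python
import pandas as pd

# Scored multiple-choice responses: 1 = correct, 0 = incorrect (hypothetical files).
pre = pd.read_csv("mc_pretest_scored.csv")
post = pd.read_csv("mc_posttest_scored.csv")

common = pre.columns.intersection(post.columns)  # posttest-only items have no D value
p_pre = pre[common].mean().rename("p_pre")       # item difficulty before instruction
p_post = post[common].mean().rename("p_post")    # item difficulty after instruction
D = (p_post - p_pre).rename("D")                 # instructional sensitivity: p_post - p_pre

item_stats = pd.concat([p_pre, p_post, D], axis=1)
print(item_stats.round(2))
print("items near or below chance (.25) at posttest:",
      list(item_stats.index[item_stats["p_post"] <= 0.25]))
print("items with negative D:", list(item_stats.index[item_stats["D"] < 0]))
```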

Exploratory factor analysis was used to examine whether the multiple-choice test

items measured the constructs they were designed to measure, such as content knowledge

and knowledge types. When items in a test can be classified into different constructs along

only one dimension, items measuring the same construct would share similar loading

patterns in exploratory factor analysis. However, multiple-choice test items were

designed to tap two dimensions: the curriculum instructional objectives and knowledge

types. If the former was dominant, items measuring the same instructional objective

should have similar loading patterns in the exploratory factor analysis. If the latter was

dominant, items measuring the same knowledge type should have similar loading

patterns. If neither classification was dominant, connections might not be clear between

items’ loading patterns and the original classifications, because both classifications, or neither of them, might influence the items’ loading patterns.

The results indicated that items measuring the same content knowledge clustered

together, while items measuring the same type of knowledge were scattered across

different factors, regardless of whether all 13 factors with eigenvalues greater than 1 were extracted or the analysis was constrained to three factors reflecting knowledge types. Internal consistency reliability was then calculated for the total score to examine whether the items could


reliably assess students’ achievement. The internal consistency of the 43 posttest

multiple-choice test items was .86, which indicated that the items in general reliably

measured students’ achievement. For conciseness, the total score of the multiple-choice

test items, rather than any subscales, will be used in the following data analysis.

Because the three open-ended assessments (performance, predict-observe-explain,

and short-answer) shared many features, the three measures are examined together in this

section.

In order to sufficiently capture information about students’ achievement reflected

in students’ responses to the assessments, analytical scoring systems were carefully

developed for each assessment. The scoring systems were aligned with the conceptual

developmental trajectory (Ayala et al., this issue) and important instructional objectives.

Using scoring system drafts, three researchers scored students’ response samples to

examine whether the preliminary scoring systems could adequately capture the variations

among students’ responses, differentiate high understanding levels from low

understanding levels, and assist scorers in reaching satisfactory inter-rater reliability.

Doctoral students at the Stanford Educational Assessment Laboratory scored the three

open-ended assessments: performance assessment (six raters), short-answer assessment

(six raters), and predict-observe-explain assessment (four raters). Before independent

scoring, all raters received scoring training until they reached satisfactory inter-rater

reliability (above .80) or agreement (above 80%). In the scoring training process, when

raters encountered disagreements, they reached consensus through discussions. Based on


the consensus, scoring systems were further improved and specific rules were clarified.

This process was implemented continuously until the scoring system could effectively

collect information about students’ performance and raters could reliably score students’

responses.

The scoring systems were finalized prior to independent scoring and raters had

reached satisfactory inter-rater reliability or agreement before independently scoring

student work. Performance assessments were scored with numerical values assigned by

six raters; therefore, generalizability theory and the G-coefficient were used to estimate

reliability. Based on 48 randomly selected student responses on the performance

assessment, the average G-coefficient for the six raters (three in each group) was 0.83.

Short-answer and predict-observe-explain assessments were scored with categorical

values; therefore, percent agreement was used to evaluate inter-rater consistency. Based on 48

randomly selected student responses on short-answer, the average agreement of nine

raters (two, three, and four raters in each group) was 87%. Based on 56 randomly

selected student responses on predict-observe-explain, the four raters’ agreement was

92%.
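The reliability indices described above can be approximated with the short sketch below: a relative G-coefficient for a fully crossed persons-by-raters design, estimated from the two-way ANOVA mean squares, and exact percent agreement for categorical scores. The toy data are purely illustrative; this is not the study's scoring code.

```python
import numpy as np


def relative_g_coefficient(scores: np.ndarray) -> float:
    """Relative G-coefficient for a fully crossed persons x raters design.

    scores: 2-D array of shape (n_persons, n_raters) holding numerical ratings.
    Variance components come from the two-way ANOVA expected mean squares; the
    relative error divides the interaction/error component by the number of raters
    averaged over.
    """
    n_p, n_r = scores.shape
    grand = scores.mean()
    ss_p = n_r * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_r = n_p * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_pr = ((scores - grand) ** 2).sum() - ss_p - ss_r
    ms_p = ss_p / (n_p - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))
    var_p = max((ms_p - ms_pr) / n_r, 0.0)  # universe-score (person) variance
    return var_p / (var_p + ms_pr / n_r)


def percent_agreement(rater_a, rater_b) -> float:
    """Proportion of responses on which two raters assigned the same category."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    return float((a == b).mean())


rng = np.random.default_rng(0)
person_effect = rng.normal(0, 2.0, size=(48, 1))  # true differences among 48 responses
noise = rng.normal(0, 1.0, size=(48, 3))          # response-by-rater error, 3 raters
toy_ratings = 5 + person_effect + noise
print("G =", round(relative_g_coefficient(toy_ratings), 2))
print(percent_agreement([2, 3, 1, 0], [2, 3, 2, 0]))  # 0.75
```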

Internal consistency and discriminant validity. After all the assessments were

scored, the internal consistency of each assessment and the correlations among the

assessments were calculated. Because the four assessments measure students’ knowledge

about the same topic but with different emphases, high internal consistency within each


assessment and moderately high correlations among assessments were expected. Table 2

shows high internal consistency, empirically confirming this expectation.

-------------------------------

Insert Table 2 about here

-------------------------------

Conceptual change score and achievement total score. Five multiple-choice test

items, seven predict-observe-explain assessment items, one performance assessment

item, and one short-answer assessment item focused on conceptual change. Internal

consistency of the 14 items was .79. The composite score was interpreted as a conceptual

change score, with a maximum score of 14.

In addition to scores on the individual tests, an overall achievement total score

was calculated. It combined scores from all four achievement assessments and had 72

items, an internal consistency reliability of 0.91, and a maximum total score of 86. This achievement

total score was also used in the overall analyses reported below.
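A compact sketch of how such composites might be assembled from item-level score files is given below; the file names and the lists of conceptual-change item columns are placeholders, since the actual item identifiers are not reported in the paper.

```python
import pandas as pd

# Item-level posttest scores from the four assessments, one row per student
# (all file and column names below are hypothetical placeholders).
mc = pd.read_csv("mc_post_items.csv", index_col="student_id")
poe = pd.read_csv("poe_items.csv", index_col="student_id")
perf = pd.read_csv("performance_items.csv", index_col="student_id")
short = pd.read_csv("short_answer_items.csv", index_col="student_id")

conceptual_change_columns = (
    ["mc_cc1", "mc_cc2", "mc_cc3", "mc_cc4", "mc_cc5"]   # 5 multiple-choice items
    + [f"poe_cc{i}" for i in range(1, 8)]                 # 7 predict-observe-explain items
    + ["perf_cc", "sa_cc"]                                # 1 performance + 1 short-answer item
)
all_items = pd.concat([mc, poe, perf, short], axis=1)

composites = pd.DataFrame({
    "conceptual_change": all_items[conceptual_change_columns].sum(axis=1),  # 14 items
    "achievement_total": all_items.sum(axis=1),                             # all 72 items
})
print(composites.describe().round(2))
```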

Data Analysis Plan

Because the teachers were located in different states or in different school districts within a state, teaching contexts, such as the number and size of classes a teacher taught, varied dramatically across teachers. Moreover, students’ achievement and motivation subscale scores

varied greatly across classes and teachers at pretest, even though pairs of teachers were

matched on school-level variables and randomly assigned to the treatment or control

group (Shavelson et al., this issue). To reduce these differences, we selected one class


from each teacher based on students’ pretest motivation and achievement scores for

intensive study such that the pairs of classes selected were as close as possible on mean

pretest scores. The twelve classes selected were called focus classes.

In this report, only the focus classes were included to examine the treatment effect

for three reasons: (a) The focus classes were the most closely matched classes of teachers

in the control and experimental groups; (b) the focus classes were the classes from which

we collected the most elaborate implementation information; and (c) the answers to the

research questions based on focus classes were similar to those based on all the students

(Yin, 2005).

To check on the mean equivalence of pretest scores, mean score differences

between each experimental-control teacher pair’s focus classes were compared statistically (Appendix A). Significant differences existed among three teacher pairs on motivation

scores. Experimental group teacher Aden’s students had a significantly higher mean score

than control group teacher Rachel’s students on two positive motivation scales: perceived

task goal orientation, and self-efficacy. Aden’s students also had a significantly lower

mean score than Rachel’s students on a negative motivation scale: perceived performance

goal orientation. Experimental group teacher Carol’s students scored significantly lower,

on average, than control group teacher Sofia’s students on three negative motivation

scales: ego approach orientation, ego avoidance orientation, and perceived performance

goal. Experimental group teacher Diana’s students had a significantly higher mean score

than control group teacher Ben’s students on all the positive motivation scales.


As for achievement, the students of two control group teachers, Lenny and Ben, scored significantly higher on the multiple-choice test, on average, than did their experimental group counterparts, the students of teachers Andy and Diana, respectively.

Overall, experimental and control groups were not completely equivalent at the

beginning of the study. The experimental group started with better average motivation

scores but worse achievement scores than the control group at pretest. While not

a complete remedy, covariate adjustments were made in subsequent analyses.
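The paper does not report which statistic was used for these pairwise checks; one plausible way to reproduce such a comparison is an independent-samples t-test per matched pair and scale, as in the hypothetical sketch below.

```python
import pandas as pd
from scipy import stats


def compare_pair(df: pd.DataFrame, pair_id: str, scale: str):
    """Welch t-test between the experimental and control focus classes of one
    matched teacher pair on one pretest scale (column names are hypothetical)."""
    pair = df[df["pair"] == pair_id]
    experimental = pair.loc[pair["group"] == "experimental", scale].dropna()
    control = pair.loc[pair["group"] == "control", scale].dropna()
    return stats.ttest_ind(experimental, control, equal_var=False)


pretest = pd.read_csv("focus_class_pretest.csv")  # columns: pair, group, scale scores, ...
t_stat, p_value = compare_pair(pretest, "Aden-Rachel", "self_efficacy")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```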

The data were first analyzed descriptively and statistically for the teacher pairs—

matched roughly for their school-level characteristics. The overall differences between

the control and experimental group on the outcome measures were then compared and

discussed. Finally, because students were nested in teachers, and teachers were nested in

treatment groups, a two-level hierarchical structure existed in the data. To deal with this

data structure, two-level hierarchical linear modeling (HLM) was then applied to

examine the formative assessment treatment’s influence statistically. The limitation of

using HLM is that the analysis lacks statistical power due to the small sample size of only

six teachers in each group.2

2 The intent of the study was exploratory, and we and the National Science Foundation agreed on the small sample size for cost reasons and as a first step in providing an “existence proof” that might lead to future funding.

Results

The analysis presented in this paper addresses three research questions: (1) Did formative assessment improve students’ motivational beliefs? (2) Did formative


assessment improve students’ science achievement? And (3) did formative assessment

bring about conceptual change? Descriptive analyses of students’ motivation,

achievement, and conceptual change scores at pretest and posttest are first presented for

individual teachers in two groups. Following that, the effects of formative assessment on

students’ motivation, achievement and conceptual change are reported.

Initial Analysis of Pretest and Posttest Scores

In this section, the experimental-control group focus class pairs were compared

and then the overall experimental-control group differences were analyzed.

Analysis of Experimental-Control Group Pairs

In contrast to the pretest, overall the experimental group did not have a

significantly better average motivation score than the control group at posttest. Instead, two experimental classes had better motivation scores than their control pairs, as did two control classes. Control group teacher Lenny’s students had higher scores than

experimental group teacher Andy’s students on three positive motivation scales: task goal

orientation, perceived task goal orientation, and interest. Control group teacher Sofia’s

students also scored significantly higher than experimental group teacher Carol’s students

on three positive motivation scales: task goal orientation, perceived task goal orientation,

and interest. Similar to the pretest, experimental group teacher Diana’s students still had

significantly higher scores than control group teacher Ben’s students on all the positive

motivation scales. Moreover, Diana’s students had lower scores than Ben’s students on a

negative motivation scale: perceived performance goal orientation. Finally, experimental


group teacher Robert’s students scored significantly higher than control group teacher

Serena’s students on two positive motivation scales: perceived task goal orientation and interest.

As for achievement and conceptual change, two control group teachers’ students

outperformed their counterparts in the experimental group (Appendix B). Similar to the

pretest, control group teacher Lenny’s students, on average, still outperformed the

experimental group teacher Andy’s students on achievement scores and conceptual

change scores at posttest, with the exception of the short-answer question (Appendix B).

Moreover, control group teacher Rachel’s students outperformed experimental group

teacher Aden’s students on the achievement tests, except for the performance assessment

and conceptual change items (Appendix B), although they did not differ significantly on

the multiple-choice pretest (Appendix A). The only “winning” experimental-group class

among all the pairs was Diana’s. Control group teacher Ben’s students did not

significantly outperform experimental-group teacher Diana’s students as they did at

pretest.

Experimental-Control Group Comparisons

Due to the nested structure, it is not statistically legitimate to combine all the

students in each treatment group for data analysis. However, the descriptive analyses

based on the combined groups can provide a preliminary overall picture of the two-group

comparison and treatment effect at posttest. Table 3 indicates that overall the

experimental group scored higher than the control group on most positive motivation


scores but scored lower on the achievement tests. This result is consistent with the previous

individual teacher pair comparison.

-------------------------------

Insert Table 3 about here

-------------------------------

Table 4 shows that the experimental group still scored higher than the control group on positive motivation scales at posttest, but the only significant difference was in perceived task goal orientation. As for achievement, overall the control group

still outperformed the experimental group, scoring significantly higher on the multiple-

choice test, performance assessment, and total achievement score. Based on the

comparison of Table 3 and Table 4, the treatment seemed to have no influence on student

motivation, achievement or conceptual change at posttest.

-------------------------------

Insert Table 4 about here

-------------------------------

However, in general, experimental group students had significantly lower variance than control group students on the predict-observe-explain assessment (F = 4.09, p < .05) and the conceptual change score (F = 3.88, p = .05); that is, the achievement gap between higher and lower achievers in the experimental group was not as large as that in the control group. This is consistent with the literature suggesting that formative assessment is particularly useful for low achievers (Black & Wiliam, 1998a).
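The paper does not spell out how these F statistics were obtained; a variance-ratio F test is one conventional way to compare two groups' score variances, sketched below on toy data.

```python
import numpy as np
from scipy import stats


def variance_ratio_test(group_a, group_b):
    """Two-sided F test of equal variances: F = larger sample variance / smaller."""
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    stats_a, stats_b = (a.var(ddof=1), len(a)), (b.var(ddof=1), len(b))
    (v1, n1), (v2, n2) = (stats_a, stats_b) if stats_a[0] >= stats_b[0] else (stats_b, stats_a)
    f = v1 / v2
    p = min(2 * stats.f.sf(f, n1 - 1, n2 - 1), 1.0)
    return f, p


rng = np.random.default_rng(1)
control = rng.normal(7, 3.0, 60)       # toy control-group scores with a larger spread
experimental = rng.normal(7, 1.5, 60)  # toy experimental-group scores with a smaller spread
f, p = variance_ratio_test(control, experimental)
print(f"F = {f:.2f}, p = {p:.3f}")
```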


Hierarchical Linear Modeling

Variables Used in the HLM

We were interested in the influence of formative assessment on students’

motivation, achievement, and conceptual change; therefore posttest motivation,

achievement, and conceptual change scores served as dependent variables in the model.

Specifically, these dependent variables include the ten posttest motivation subscale

scores, all posttest achievement scores, the composite achievement score, and the

composite conceptual change score.

Independent variables, or predictors, include pretest scores that corresponded to

the posttest score and treatment group. The pretest scores of individual students,

including pretest motivation scores and pretest multiple-choice test score, were the

predictors at the first level. Treatment group was the predictor at the second level. Due to

heterogeneous school contexts and teacher backgrounds, there could be many possible teacher- and school-level predictors at level-2, such as school size, ethnic composition, percent of students receiving free or reduced-price lunch, students’ grade level, teachers’ teaching experience, highest educational degree, science background, teaching load, and length of instruction. However, due to the constraint of the small sample size (there were only 12 teachers at level-2), only the treatment group, the focus of this study, was introduced into the model as the predictor at level-2.

Model Construction

Three steps were taken to analyze all the outcome variables.


Step 1: Unconditional model. An unconditional model was first applied to

examine the variation in all of the outcome variables at the two different levels--student

and teacher. A one-way ANOVA model was applied for each outcome variable at level-

1. The unconditional model was specified by equations 1 and 2.

Level-1

Yij = β0j + rij,  rij ~ N (0, σ2) [1]

Level-2

β0j = γ00 + μ0j,  μ0j ~ N (0, τ00) [2]

Yij was the outcome variable, such as a motivation subscale score or an achievement score. β0j was the intercept, in this case the mean score for teacher j’s students. rij was the random error at the student level, with variance σ2; this variance represented the score variation at level-1, that is, within teachers. β0j was modeled as the grand mean, γ00, plus a random error, μ0j, with variance τ00. Significant variance, τ00, due to μ0j indicated that teacher mean scores varied at level-2.
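
As an illustration only (not necessarily the software used in the study), the unconditional model in equations [1] and [2] can be fit as a random-intercept mixed model. The Python sketch below uses statsmodels and assumes a data frame with hypothetical columns 'post' (one outcome) and 'teacher' (the level-2 grouping).

import pandas as pd
import statsmodels.formula.api as smf

def fit_unconditional_model(df: pd.DataFrame):
    """Random-intercept-only model: Yij = b0j + rij, with b0j = g00 + u0j.

    Assumes hypothetical columns 'post' (outcome) and 'teacher' (grouping).
    """
    model = smf.mixedlm("post ~ 1", data=df, groups=df["teacher"])
    result = model.fit(reml=True)
    tau00 = result.cov_re.iloc[0, 0]   # between-teacher variance (tau00)
    sigma2 = result.scale              # within-teacher residual variance (sigma^2)
    icc = tau00 / (tau00 + sigma2)     # share of variance lying between teachers
    return result, icc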

Step 2: Conditional model with covariate(s) at level-1. According to Raudenbush

and Bryk (2002), statistical adjustments for individuals’ background are important,

because (a) “persons are not usually assigned at random to organizations, failure to

control for background may bias the estimates of organization effect” and (b) “if level-1

predictors (or covariates) are strongly related to the outcome of interest, controlling for


them will increase the precision of any estimates of organization effects and the power of

hypothesis tests by reducing unexplained level-1 error variance” (p. 111).

In this study, students began with rather different motivation and achievement scores; therefore, pretest scores were added to the model as covariates at level-1 to account for outcome variance within teachers. Only the corresponding pretest score was introduced at level-1. For example, when the posttest motivation subscale interest was the outcome variable, only the pretest interest score was introduced as the level-1 predictor; when a posttest achievement or conceptual change score was the outcome variable, only the pretest multiple-choice test score was introduced as the level-1 predictor. There were two reasons for including only one covariate at level-1. First, the corresponding pretest variable was the most relevant covariate of its posttest score. Second, preliminary analyses indicated that, once a pretest motivation score was included in a model to explain the corresponding posttest score, an additional pretest motivation score was usually not a significant predictor, because the motivation constructs were correlated with one another. Since the pretest scores were continuous variables, they were grand-mean centered (Raudenbush & Bryk, 2002).

The model was specified as follows.

Level-1:

Yij = β0j + β1j(Xij - X̄..) + rij,   rij ~ N(0, σ2)   [3]

Level-2:

β0j = γ00 + μ0j,   μ0j ~ N(0, τ00)   [4]

β1j = γ10 + μ1j,   μ1j ~ N(0, τ11)   [5]

In this model, Xij was the pretest score used as the predictor for the corresponding posttest score, and X̄.. was its grand mean. The intercept, β0j, was the expected outcome for a student whose value on Xij equaled the grand mean, X̄... The slope, β1j, represented the regression of students’ posttest scores on their pretest scores, and rij was the random error after controlling for students’ pretest scores. β0j and β1j could vary across teachers in the level-2 model, each as a function of a grand mean and a random error. Because X was grand-mean centered, γ00 was the adjusted mean of students’ posttest scores across teachers, and γ10 was the average pretest-posttest regression slope across teachers. μ0j was the unique increment to the intercept associated with teacher j; significant variance in μ0j indicated that mean posttest scores varied across teachers. μ1j was the unique increment to the slope associated with teacher j; significant variance in μ1j indicated that the pretest-posttest regression slope varied across teachers.

Analysis showed that the variance of μ1j was not significantly different from 0; that is, the regression coefficients between the outcome variable and the predictor within teachers at level-1 did not vary across teachers at level-2 (homogeneity of regression slopes). Therefore, μ1j was eliminated from the final model, and equation [5] was simplified into equation [6].

β1j = γ10   [6]
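
A comparable sketch of the Step 2 model (equations [3], [4], and [6]) is shown below, again as an assumption-laden illustration in Python: the pretest covariate is grand-mean centered, the pretest-posttest slope is treated as fixed across teachers, and the intercept varies randomly by teacher. Column names 'post', 'pre', and 'teacher' are placeholders.

import pandas as pd
import statsmodels.formula.api as smf

def fit_covariate_model(df: pd.DataFrame):
    """Random-intercept model with a grand-mean-centered pretest covariate.

    post_ij = b0j + g10 * (pre_ij - grand mean of pre) + r_ij, b0j = g00 + u0j.
    Column names 'post', 'pre', and 'teacher' are placeholders.
    """
    df = df.copy()
    df["pre_c"] = df["pre"] - df["pre"].mean()   # grand-mean centering
    model = smf.mixedlm("post ~ pre_c", data=df, groups=df["teacher"])
    return model.fit(reml=True)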


Step 3: Model with intercepts-as-outcomes. One important purpose of using HLM is to examine cross-level effects. Specifically, in this study we examined whether the regression coefficients (the regression slopes of posttest scores on the covariates) at the student level varied across teachers. Analyses indicated that the level-1 regression coefficients did not vary significantly across teachers (i.e., homogeneity of regression slopes). Therefore, the level-1 regression coefficients were fixed at level-2, and only the level-1 intercept was treated as an outcome at level-2. As mentioned earlier, treatment group was the only predictor at level-2. Since treatment group was a dummy variable (experimental group = 1, control group = 0), it was added to the model without being centered (Raudenbush & Bryk, 2002).

Analyses showed that τ00 in equation [4] differed significantly from 0; that is, the level-1 intercept varied significantly across teachers at level-2. The level-2 predictor, treatment group, was then included in the model to account for this intercept variance at level-2. The model was specified in equation [7].

β0j = γ00 + γ01W1j + μ0j,   μ0j ~ N(0, τ00)   [7]
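
Although the combined form is not printed here, substituting equations [6] and [7] into equation [3] yields the single mixed-model equation implied by the specification above:

Yij = γ00 + γ01W1j + γ10(Xij - X̄..) + μ0j + rij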


W1j was the treatment variable at level-2, coded “0” for the control group and “1” for the experimental group; because it was a dummy variable, it was entered into the model without being centered.
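
For completeness, a minimal Python sketch of this intercepts-as-outcomes model is given below; because the treatment dummy varies only between teachers, entering it as a fixed effect alongside the random teacher intercept corresponds to the level-2 specification in equation [7]. Column names 'post', 'pre', 'treatment', and 'teacher' are assumptions for illustration.

import pandas as pd
import statsmodels.formula.api as smf

def fit_intercepts_as_outcomes(df: pd.DataFrame):
    """Final model: centered pretest at level-1, treatment dummy at level-2.

    Combined form: post_ij = g00 + g01*treatment_j + g10*(pre_ij - pre mean)
                             + u0j + r_ij.
    Column names 'post', 'pre', 'treatment', and 'teacher' are placeholders.
    """
    df = df.copy()
    df["pre_c"] = df["pre"] - df["pre"].mean()        # grand-mean centering
    # The 0/1 treatment dummy enters uncentered (cf. Raudenbush & Bryk, 2002).
    model = smf.mixedlm("post ~ pre_c + treatment", data=df,
                        groups=df["teacher"])
    return model.fit(reml=True)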

HLM Results

Tables 4 and 5 present the changes in the variance components for the motivation and achievement scores, respectively, as the different models were fit. Most motivational and achievement outcomes varied significantly across teachers at level-2: all motivation scores except ego avoidance, self-efficacy, and fixed ability (Table 4), and all achievement scores except conceptual change (Table 5).

-------------------------------

Insert Table 4 about here

--------------------------------

--------------------------------

Insert Table 5 about here

--------------------------------

Tables 6 and 7 present the results when the pretest scores were included as covariates at level-1 and treatment was included as a predictor at level-2. The results reported in these tables indicate that the pretest scores were significant predictors of the corresponding posttest scores at level-1; however, treatment was not a significant predictor of the variance across teachers at level-2. For the achievement indicators, the negative level-2 coefficients for treatment indicate a very slight, non-significant adjusted-mean advantage for the control group. An important reason for the non-significant group effect might be the small level-2 sample size (12 teachers).

--------------------------------

Insert Table 6 about here

---------------------------------

---------------------------------

Insert Table 7 about here

---------------------------------

In sum, the HLM analyses showed that students’ motivation, achievement, and conceptual change scores varied greatly among teachers; however, the formative-assessment treatment did not explain that variation. On average, experimental-group students did not appear to benefit from the formative assessments they received.

Conclusion

In this study, one questionnaire and four types of achievement tests were used to measure students’ motivational beliefs, achievement, and conceptual change.


Empirical evidence was provided in support of the reliability and validity of scores from

these instruments.

Both the descriptive statistics and the HLM analyses show that the assessments embedded in the FAST curriculum used by the experimental group did not, on average, have a significant influence on students’ motivation, achievement, or conceptual change compared with students in the control group, who used FAST without the embedded assessments. That said, regardless of treatment group, teachers varied greatly in the motivation and achievement outcomes their students attained. This finding did not support our conjectures about the salutary effect of formative assessment on student outcomes, nor the findings reported in literature reviews.

More elaborate analyses showed that teachers’ educational background, teaching experience, teaching load, and age, as well as school characteristics (such as school size and the percentage of students receiving free or reduced-price lunch), were not significant predictors of the differences in students’ motivation and achievement either (Yin, 2005). Why, then, did the Embedded Formative Assessments not work as expected? What might account for the differences in outcomes among students taught by different teachers? Furtak et al. (this issue) suggest answers to these questions.


References

Abimbola, I. O. (1988). The problem of terminology in the study of student conceptions

in science. Science Education, 72(2), 175-184.

Bandura, A. (1986). Social foundations of thought and action: A social cognitive theory.

Englewood Cliffs, NJ: Prentice Hall.

Bandura, A. (1997). Self-efficacy: The exercise of control. New York: Freeman.

Bandura, A., & Cervone, D. (1983). Self-evaluative and self-efficacy mechanisms

governing the motivational effects of goal systems. Journal of Personality and

Social Psychology, 45(5), 1017-1028.

Bell, B., & Cowie, B. (2001). Formative Assessment and Science Education. Dordrecht,

The Netherlands: Kluwer Academic Publishers.

Black, P., Harrison, C., Lee, C., Marshall, B., & Wiliam, D. (2002). Testing, Motivation

and Learning. Cambridge: University of Cambridge, Faculty of Education: The

Assessment Reform Group.

Black, P., & Wiliam, D. (1998a). Assessment and classroom learning. Assessment in

Education, 5, 7-68.

Black, P., & Wiliam, D. (1998b). Inside the black box: raising standards through

classroom assessment. Phi Delta Kappan, 80(2), 139-148.

Blumenfeld, P. C. (1992). Classroom learning and motivation: Clarifying and expanding

goal theory. Journal of Educational Psychology, 84(3), 272-281.


Butler, R. (1987). Task-involving and ego-involving properties of evaluation. Journal of Educational Psychology, 79, 474-482.

Butler, R., & Neuman, O. (1995). Effects of task and ego achievement goals on help-

seeking behaviors and attitudes. Journal of Educational Psychology, 87, 261-271.

Carey, S. (1984). Conceptual Change in Childhood. Cambridge, MA: MIT Press.

Chi, M. T. H. (1992). Conceptual change within and across ontological categories:

Examples from learning and discovery in science. In Cognitive models of science:

Minnesota studies in the philosophy of science (pp. 129-160). Minneapolis, MN:

University of Minnesota Press.

Chi, M. T. H., Slotta, J. D., & deLeeuw, N. (1994). From things to processes: A theory of

conceptual change for learning science concepts. Learning and Instruction, 4, 27-

43.

Chinn, C. A., & Malhotra, B. A. (2002). Children's responses to anomalous scientific data: How is conceptual change impeded? Journal of Educational Psychology, 94(2), 327-343.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Harcourt Brace Jovanovich College Publishers.

Demastes, S. S., Good, R. G., & Peebles, P. (1996). Patterns of conceptual change in

evolution. Journal of Research in Science Teaching, 33, 407-431.


Dreyfus, A., Jungwirth, E., & Eliovitch, R. (1990). Applying the 'Cognitive Conflict'

strategy for conceptual change - some implications, difficulties and problems.

Science Education, 74(5), 555-569.

Dweck, C. S., & Leggett, E. L. (1988). A social-cognitive approach to motivation and personality. Psychological Review, 95(2), 256-273.

Eccles, J. (1983). Expectancies, values and academic behaviors. San Francisco: Freeman.

Geisler-Brenstein, E., & Schmeck, R. R. (1996). The Revised Inventory of Learning

Processes: A multifaceted perspective on individual differences in learning. In M.

Birenbaum & F. J. R. C. Dochy (Eds.), Alternatives in Assessment of

Achievements, Learning Processes and Prior Knowledge. Boston, MA: Kluwer

Academic Publishers.

Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81-112.

Hewson, M. G. (1986). The acquisition of scientific knowledge: Analysis and

representation of student conceptions concerning density. Science Education,

70(2), 159-170.

Hewson, P. W., & Hewson, M. G. (1984). The role of conceptual conflict in conceptual

change and the design of science instruction. Instructional Science, 13(1), 1-13.

Kohn, A. S. (1993). Preschoolers' reasoning about density: Will it float? Child

Development, 64, 1637-1650.


Lau, S., & Roeser, R. W. (2002). Cognitive abilities and motivational processes in high

school students' situational engagement and achievement in science. Educational

Assessment, 8(2), 139-162.

NRC. (2001). Knowing What Students Know. Washington, DC: National Academies

Press, National Research Council.

Pajares, F. (1996). Self-Efficacy beliefs in academic settings. Review of Educational

Research, 66(4), 543-578.

Pintrich, P. R. (1999). Motivational beliefs as resources for and constraints on conceptual

change. In W. Schnotz, S. Vosniadou & M. Carretero (Eds.), New Perspectives in

Conceptual Change Research. Oxford, England: Pergamon Press.

Pintrich, P. R., Marx, R., & Boyle, R. (1993). Beyond cold conceptual change: The role

of motivational beliefs and classroom contextual factors in the process of

conceptual change. Review of Educational Research, 63, 167-199.

Pintrich, P. R., & Schrauben, B. (1992). Student’s motivational beliefs and their cognitive

engagement in classroom academic tasks. In D. Schunk & J. Meece (Eds.),

Student perceptions in the Classroom: Causes and Consequences (pp. 149-183).

Hillsdale, NJ: Lawrence Erlbaum.

Posner, G. J., Strike, K. A., Hewson, P. W., & Gertzog, W. F. (1982). Accommodation of

a scientific conception: toward a theory of conceptual change. Science Education,

66, 211-227.


Pottenger, F. M., III, & Young, D. B. (1992). The Local Environment: FAST 1, Foundational Approaches in Science Teaching. Honolulu: University of Hawaii Curriculum Research & Development Group.

Ramaprasad, A. (1983). On the definition of feedback. Behavioral Science, 28(1), 4-13.

Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and

data analysis methods (2 ed.). Newbury Park, CA: Sage.

Sadler, R. (1989). Formative assessment and the design of instructional systems.

Instructional Science, 18, 119-144.

Schnotz, W., Vosniadou, S., & Carretero, M. (1999). Preface. In W. Schnotz, S.

Vosniadou & M. Carretero (Eds.), New Perspectives in Conceptual Change

Research. Oxford, England: Pergamon Press.

Schunk, D. H., & Swartz, C. W. (1993). Goals and progress feedback: Effects on self-

efficacy and writing achievement. Contemporary Educational Psychology, 18,

337-354.

Siero, F., & Van Oudenhoven, J. P. (1995). The effects of contingent feedback on

perceived control and performance. European Journal of Psychology of

Education, 10, 13-24.

Smith, C., Snir, J., & Grosslight, L. (1992). Using conceptual models to facilitate

conceptual change: the case of weight-density differentiation. Cognition and

Instruction, 9(3), 221-283.


Vispoel, W. P., & Austin, J. R. (1995). Success and failure in junior high school: A

critical incident approach to understanding students' attributional beliefs.

American Educational Research Journal, 32(2), 377-412.

Vosniadou, S., & Brewer, W. F. (1992). Mental models of the earth: A study of

conceptual change in childhood. Cognitive Psychology, 24, 535-585.

Yin, Y. (2005). The influence of formative assessment on student motivation, achievement, and conceptual change (Unpublished doctoral dissertation). Stanford University, Stanford, CA.

Yin, Y., Tomita, K. M., & Shavelson, R. J. (in press). Diagnosing and dealing with

student misconceptions about “Sinking and Floating”. Science Scope.


Appendix A

Average Class Pretest Motivation Scores, Multiple-choice Score, and Class Size for Focus Classes a

Variable                      Experimental (E): n, Mean, SD    Control (C): n, Mean, SD    Difference (E - C)

Pair 1 (E: Aden; C: Rachel)
Task Goal Orientation         29, 3.79, 0.69     29, 3.59, 0.73     0.21
Perceived Task Goal           29, 4.26, 0.45     29, 3.86, 0.44     0.40**
Self Efficacy                 29, 4.13, 0.61     29, 3.77, 0.59     0.37*
Interest                      29, 4.15, 0.80     29, 3.82, 0.60     0.33
Ego Approach Orientation      29, 2.82, 0.81     29, 2.89, 0.71     -0.07
Ego Avoidance Orientation     29, 2.90, 0.85     29, 2.94, 0.79     -0.04
Perceived Performance Goal    29, 2.10, 0.52     29, 2.52, 0.57     -0.42**
Fixed Ability                 29, 2.23, 0.81     29, 2.19, 0.67     0.04
Pretest Multiple-choice       29, 14.55, 5.08    29, 14.76, 3.90    -0.21

Pair 2 (E: Andy; C: Lenny)
Task Goal Orientation         25, 3.63, 0.65     19, 3.60, 0.64     0.03
Perceived Task Goal           25, 3.92, 0.50     19, 4.23, 0.45     -0.31*
Self Efficacy                 25, 3.84, 0.63     19, 4.02, 0.58     -0.19
Ego Approach Orientation      25, 2.90, 0.86     19, 2.96, 1.02     -0.06
Ego Avoidance Orientation     25, 3.06, 0.77     19, 3.23, 0.93     -0.17
Perceived Performance Goal    25, 2.25, 0.60     19, 2.41, 0.59     -0.17
Fixed Ability                 25, 2.55, 0.87     19, 2.11, 0.66     0.44
Interest                      25, 3.93, 0.96     19, 3.95, 0.88     -0.02
Pretest Multiple-choice       26, 13.12, 3.68    19, 15.95, 4.13    -2.83*

Pair 3 (E: Becca; C: Ellen)
Task Goal Orientation         20, 4.22, 0.51     23, 3.92, 0.52     0.30
Perceived Task Goal           20, 4.37, 0.45     23, 4.24, 0.40     0.13
Self Efficacy                 20, 4.03, 0.42     23, 3.80, 0.50     0.22
Interest                      20, 4.20, 0.61     23, 4.28, 0.50     -0.08
Ego Approach Orientation      20, 3.23, 0.99     23, 3.20, 0.61     0.03
Ego Avoidance Orientation     20, 2.97, 0.91     23, 2.99, 0.82     -0.02
Perceived Performance Goal    20, 2.59, 0.44     23, 2.82, 0.47     -0.23
Fixed Ability                 20, 2.87, 0.96     23, 2.54, 0.85     0.33
Pretest Multiple-choice       20, 12.25, 2.75    22, 13.23, 4.19    -0.98

Pair 4 (E: Carol; C: Sofia)
Task Goal Orientation         18, 3.88, 0.59     21, 3.99, 0.53     -0.11
Perceived Task Goal           18, 4.21, 0.42     21, 4.23, 0.40     -0.02
Self Efficacy                 18, 4.10, 0.57     21, 4.03, 0.57     0.07
Interest                      18, 4.09, 0.77     21, 4.31, 0.67     -0.22
Ego Approach Orientation      18, 2.40, 0.75     21, 3.26, 0.85     -0.86**
Ego Avoidance Orientation     18, 2.56, 0.91     21, 3.15, 0.58     -0.60*
Fixed Ability                 18, 2.15, 0.69     21, 2.31, 0.91     -0.16
Perceived Performance Goal    18, 1.89, 0.53     21, 2.49, 0.62     -0.60**
Pretest Multiple-choice       19, 13.16, 4.49    24, 15.21, 3.96    -2.05

Pair 5 (E: Diana; C: Ben)
Task Goal Orientation         25, 4.00, 0.58     21, 3.36, 0.65     0.64**
Perceived Task Goal           25, 4.41, 0.29     21, 4.08, 0.59     0.33*
Self Efficacy                 25, 4.11, 0.58     21, 3.74, 0.53     0.37*
Interest                      25, 4.28, 0.58     21, 3.38, 0.80     0.90**
Ego Approach Orientation      25, 3.16, 0.88     21, 3.18, 0.86     -0.02
Ego Avoidance Orientation     25, 3.19, 0.74     21, 3.18, 0.72     0.01
Perceived Performance Goal    25, 2.39, 0.78     21, 2.49, 0.62     -0.10
Fixed Ability                 25, 2.29, 0.85     21, 2.33, 0.58     -0.03
Pretest Multiple-choice       25, 15.52, 3.61    21, 18.38, 5.11    -2.86*

Pair 6 (E: Robert; C: Serena)
Task Goal Orientation         27, 3.84, 0.52     20, 3.87, 0.67     -0.03
Perceived Task Goal           27, 4.09, 0.52     20, 4.21, 0.48     -0.12
Self Efficacy                 27, 4.04, 0.43     20, 4.16, 0.61     -0.12
Interest                      27, 4.17, 0.41     20, 4.03, 0.81     0.14
Ego Approach Orientation      27, 3.13, 0.74     20, 2.98, 0.90     0.15
Ego Avoidance Orientation     27, 3.06, 0.91     20, 3.05, 0.96     0.01
Perceived Performance Goal    27, 2.21, 0.51     20, 2.30, 0.67     -0.09
Fixed Ability                 27, 2.31, 0.72     20, 2.17, 0.84     0.14
Pretest Multiple-choice       26, 16.00, 4.58    20, 15.20, 4.64    0.80

a: * p < 0.05; ** p < 0.01.


Appendix B

Average Class Posttest Motivation and Achievement Scores and Class Size for Focus Classes a

Variable                      Experimental (E): n, Mean, SD    Control (C): n, Mean, SD    Difference (E - C)

Pair 1 (E: Aden; C: Rachel)
Task Goal Orientation         29, 3.56, 0.84     21, 3.10, 0.77     0.46
Perceived Task Goal           25, 3.96, 0.72     21, 3.63, 0.67     0.33
Self Efficacy                 25, 3.95, 0.80     21, 3.79, 0.39     0.16
Interest                      25, 3.71, 0.94     21, 3.14, 0.92     0.56*
Ego Approach Orientation      29, 2.68, 0.92     21, 2.64, 0.72     0.04
Ego Avoidance Orientation     25, 2.77, 1.08     21, 2.52, 0.79     0.24
Perceived Performance Goal    25, 2.01, 0.59     21, 2.33, 0.67     -0.32
Fixed Ability                 25, 2.08, 0.64     21, 2.29, 0.53     -0.21
Multiple-choice               25, 21.79, 7.19    21, 28.10, 6.18    -6.31**
Performance Assessment        22, 13.31, 8.91    22, 17.91, 5.94    -2.01
Predict-Observe-Explain       26, 2.15, 1.74     21, 3.33, 2.29     -2.01*
Short-Answer                  26, 1.54, 1.68     21, 3.10, 1.18     -3.59**
Achievement Total             20, 40.45, 17.00   19, 54.53, 11.76   -2.99**
Conceptual Change             20, 5.10, 3.26     19, 6.54, 3.20     -1.39

Pair 2 (E: Andy; C: Lenny)
Task Goal Orientation         25, 2.98, 0.73     19, 3.72, 0.66     -0.74**
Perceived Task Goal           22, 3.83, 0.61     18, 4.33, 0.44     -0.50**
Self Efficacy                 22, 3.85, 0.71     18, 4.08, 0.58     -0.23
Interest                      22, 2.85, 0.96     18, 3.89, 0.68     -1.04**
Ego Approach Orientation      25, 2.72, 0.79     19, 2.99, 0.78     -0.27
Ego Avoidance Orientation     22, 3.00, 0.85     18, 2.80, 0.60     0.19
Perceived Performance Goal    22, 2.60, 0.60     18, 2.23, 0.55     0.37
Fixed Ability                 22, 2.17, 0.79     18, 2.02, 0.66     0.15
Multiple-choice               22, 22.91, 5.92    18, 28.79, 6.84    -5.88**
Performance Assessment        22, 9.23, 4.41     17, 20.53, 7.86    -5.32**
Predict-Observe-Explain       20, 3.70, 2.03     16, 5.31, 2.41     -2.18*
Short-Answer                  22, 2.14, 1.21     19, 2.47, 1.26     -0.87
Achievement Total             18, 39.56, 11.09   13, 62.08, 10.57   -5.68**
Conceptual Change             18, 6.90, 3.41     13, 10.55, 3.22    -3.01**

Pair 3 (E: Becca; C: Ellen)
Task Goal Orientation         20, 3.85, 0.71     23, 4.04, 0.45     -0.19
Perceived Task Goal           16, 4.25, 0.37     21, 4.25, 0.52     -0.01
Self Efficacy                 16, 3.88, 0.53     21, 3.89, 0.58     -0.01
Interest                      16, 4.19, 0.69     21, 4.30, 0.83     -0.11
Ego Approach Orientation      20, 3.03, 1.03     23, 2.65, 0.85     0.38
Ego Avoidance Orientation     16, 2.83, 0.84     21, 2.56, 0.98     0.27
Perceived Performance Goal    16, 2.64, 0.78     21, 2.29, 0.61     0.34
Fixed Ability                 16, 2.35, 0.75     21, 2.29, 0.72     0.07
Multiple-choice               16, 20.25, 5.73    21, 22.18, 5.57    -1.93
Performance Assessment        16, 11.25, 2.98    24, 13.17, 5.74    -1.23
Predict-Observe-Explain       16, 2.36, 1.91     24, 1.54, 1.14     1.45
Short-Answer                  14, 2.07, 1.44     24, 1.42, 1.69     1.30
Achievement Total             14, 36.36, 8.66    22, 38.55, 11.75   -0.60
Conceptual Change             14, 4.58, 2.48     22, 3.51, 2.20     1.35

Pair 4 (E: Carol; C: Sofia)
Task Goal Orientation         18, 3.28, 0.64     21, 3.84, 0.68     -0.55*
Perceived Task Goal           17, 3.81, 0.66     22, 4.27, 0.57     -0.45*
Self Efficacy                 17, 3.76, 0.55     22, 3.95, 0.76     -0.19
Interest                      17, 3.27, 1.08     22, 4.21, 0.93     -0.94**
Ego Approach Orientation      18, 2.50, 0.87     21, 3.03, 0.83     -0.53
Ego Avoidance Orientation     17, 2.47, 0.85     22, 2.53, 1.02     -0.05
Perceived Performance Goal    17, 2.37, 0.77     22, 2.03, 0.80     0.34
Fixed Ability                 17, 2.10, 0.44     22, 2.24, 1.08     -0.14
Multiple-choice               17, 28.12, 6.27    22, 27.32, 5.80    0.80
Performance Assessment        15, 17.93, 8.57    21, 16.38, 6.09    0.60
Predict-Observe-Explain       17, 3.00, 2.50     21, 2.71, 1.31     0.41
Short-Answer                  18, 2.94, 1.00     22, 2.91, 1.31     0.09
Achievement Total             13, 51.92, 16.55   20, 49.9, 11.67    0.38
Conceptual Change             13, 6.55, 4.03     20, 5.32, 2.76     0.97

Pair 5 (E: Diana; C: Ben)
Task Goal Orientation         25, 3.90, 0.71     21, 2.76, 0.77     1.14**
Perceived Task Goal           23, 4.46, 0.46     21, 3.20, 0.72     1.26**
Self Efficacy                 23, 4.11, 0.58     21, 3.63, 0.82     0.48*
Interest                      23, 4.19, 0.72     21, 2.40, 1.11     1.79**
Ego Approach Orientation      25, 3.18, 0.89     21, 3.25, 0.93     -0.07
Ego Avoidance Orientation     23, 2.98, 0.74     21, 2.84, 0.90     0.15
Perceived Performance Goal    23, 2.18, 0.65     21, 2.59, 0.53     -0.41*
Fixed Ability                 23, 2.14, 0.94     21, 2.10, 0.63     0.04
Multiple-choice               23, 24.78, 4.49    21, 26.14, 7.88    -1.35
Performance Assessment        21, 16.71, 7.13    20, 14.95, 5.99    0.86
Predict-Observe-Explain       21, 2.29, 1.42     20, 2.35, 2.13     -0.11
Short-Answer                  21, 3.14, 1.42     22, 2.77, 1.51     0.83
Achievement Total             21, 47.05, 11.08   17, 45.29, 15.37   0.40
Conceptual Change             21, 4.86, 2.08     17, 4.60, 3.00     0.30

Pair 6 (E: Robert; C: Serena)
Task Goal Orientation         27, 3.70, 0.70     20, 3.76, 0.91     -0.06
Perceived Task Goal           22, 4.52, 0.39     20, 4.09, 0.54     0.44**
Interest                      22, 4.37, 0.65     20, 3.62, 0.92     0.75**
Self Efficacy                 22, 4.20, 0.36     20, 3.92, 0.63     0.28
Ego Approach Orientation      27, 2.95, 0.94     20, 3.38, 0.86     -0.42
Ego Avoidance Orientation     22, 2.60, 0.98     20, 3.11, 0.82     -0.51
Perceived Performance Goal    22, 2.12, 0.78     20, 2.42, 0.74     -0.30
Fixed Ability                 22, 2.06, 0.70     20, 2.55, 0.89     -0.49
Multiple-choice               22, 20.48, 6.88    20, 18.05, 4.62    2.43
Performance Assessment        21, 11.00, 5.06    19, 9.53, 6.10     0.84
Predict-Observe-Explain       22, 1.86, 1.54     17, 1.76, 2.05     0.17
Short-Answer                  22, 1.86, 1.42     20, 1.60, 1.39     0.61
Achievement Total             21, 35.86, 12.47   16, 31.9, 12.08    1.02
Conceptual Change             21, 4.34, 2.13     16, 3.69, 2.70     0.83

a: * p < 0.05; ** p < 0.01.


Table 1

Motivational Belief Constructs, Number of Items, Coefficient Alpha, and Correlations

among Constructs at Posttest

Construct (number of items)     1        2        3        4        5        6        7        8

Positive motivation
1 Task Goal (5)                 (.83)
2 Task Perceived (6)            .73**    (.86)
3 Self-Efficacy (5)             .77**    .72**    (.89)
4 Interest (3)                  .85**    .85**    .64**    (.81)

Negative motivation
5 Ego Approach (4)              .04      .04      .10*     -.00     (.77)
6 Ego Avoidance (5)             .02      .04      .03      -.02     .84**    (.78)
7 Performance Perceived (6)     -.36**   -.53**   -.42**   -.42**   .41**    .37**    (.70)
8 Fixed Ability (3)             -.26**   -.31**   -.39**   -.33**   .50**    .48**    .74**    (.44)

Note: a. * p < 0.05, ** p < 0.01, N = 788 ~ 826. b. Results in this table are the output from AMOS, but they

are arranged in a conventional way for display convenience. c. Correlations computed in AMOS were

attenuated by the reliability of each construct. d. Numbers in the diagonal of the matrix are internal

consistency of the corresponding construct.


Table 2

Internal Consistency and Correlations among All Scores a

                               (1)      (2)      (3)      (4)      Knowledge/Reasoning Measured
Item Number                    43       18       4        7
Total Score                    43       32       4        7
(1) Post Multiple-choice       (.86)b                              Declarative/Schematic/Procedural
(2) Performance Assessment     .69      (.81)                      Procedural/Schematic
(3) Short-Answer               .50      .49      (.83)             Declarative/Schematic
(4) Predict-Observe-Explain    .48      .42      .39      (.74)    Schematic

Note: a. N = 732 ~ 830, all the correlations are significant, p < 0.01. b. The numbers in

the diagonal are the internal consistency of the corresponding scale.


Table 3

Comparison of Experimental and Control Group Student Motivation and Achievement Scores at Pretest

Variable                      Experimental: N, Mean, SD    Control: N, Mean, SD    t (difference)

Pre Task Goal                 144, 3.88, .61               133, 3.72, .66          2.12*
Pre Task Perceived            144, 4.20, .47               133, 4.13, .48          1.38
Pre Self-Efficacy             144, 4.04, .55               133, 3.91, .58          1.99*
Pre Interest                  144, 4.14, .71               133, 3.96, .76          2.06*
Pre Ego Approach              144, 2.95, .86               133, 3.07, .82          -1.14
Pre Ego Avoidance             144, 2.97, .85               133, 3.08, .80          -1.04
Pre Performance Perceived     144, 2.24, .60               133, 2.51, .60          -3.82
Pre Fixed Ability             144, 2.39, .84               133, 2.27, .76          1.19
Pre Multiple-choice           145, 14.22, 4.30             135, 15.39, 4.49        -1.16*


Table 4

Comparison of Experimental and Control Group Student Motivation, Achievement, and

Conceptual Change Scores at Posttest

Variable                       Experimental (E): N, Mean, SD    Control (C): N, Mean, SD    t (difference)

Post Task Goal                 125, 3.55, .79                   123, 3.53, .84              0.14
Post Task Perceived            125, 4.14, .62                   123, 3.95, .71              2.25*
Post Self-Efficacy             125, 3.97, .62                   123, 3.87, .65              1.25
Post Interest                  125, 3.76, 1.00                  123, 3.59, 1.12             1.29
Post Ego Approach              125, 2.85, .92                   123, 2.99, .86              -1.24
Post Ego Avoidance             125, 2.79, .91                   123, 2.72, .88              0.57
Post Performance Perceived     125, 2.29, .72                   123, 2.31, .67              -0.25
Post Fixed Ability             125, 2.14, .72                   123, 2.25, .78              -1.14
Post Multiple-choice           129, 22.92, 6.59                 125, 25.15, 7.16            -2.58**
Post Performance Assessment    117, 13.05, 7.11                 123, 15.31, 6.99            -2.48*
Post Predict-Observe-Explain   120, 2.53, 1.93                  119, 2.74, 2.26             -0.79
Post Short-Answer              125, 2.24, 1.49                  128, 2.37, 1.53             -0.67
Post Achievement               107, 41.55, 13.94                107, 46.41, 15.31           -2.43*
Conceptual Change              107, 5.32, 2.99                  107, 5.44, 3.52             -0.29
Multiple-Choice Gain Score     127, 8.82, 5.83                  120, 9.58, 6.49             -0.96


Table 5

Comparison of Unconditional, Covariate, and Final Model Variance Components of

Posttest Achievement and Conceptual Change

                                                      Model 1            Model 2 (pretest           Model 3 (predictors
                                                      (unconditional)    multiple-choice at          at Level-1 and
                                                                         Level-1 only)               Level-2)

Post Multiple-choice Test Score
  Within teachers (σ2)                                38.832             27.680                      27.681
  Between teachers (τ00)                              10.987**           10.764**                    11.794**
Multiple-choice Gain Score
  Within teachers (σ2)                                28.127             27.680                      27.681
  Between teachers (τ00)                              11.196**           10.764**                    11.794**
Performance Assessment
  Within teachers (σ2)                                40.992             36.947                      36.949
  Between teachers (τ00)                              11.050**           10.255**                    10.620**
Short-Answer
  Within teachers (σ2)                                1.970              1.827                       1.827
  Between teachers (τ00)                              0.322**            0.270**                     0.304
Predict-Observe-Explain
  Within teachers (σ2)                                3.684              3.614                       3.613
  Between teachers (τ00)                              0.861**            0.783**                     0.881**
Achievement Total
  Within teachers (σ2)                                162.042            122.692                     122.688
  Between teachers (τ00)                              68.552**           68.541**                    73.778**
Conceptual Change Score
  Within teachers (σ2)                                9.361              8.634                       8.632
  Between teachers (τ00)                              3.399**            3.414**                     3.827**

Note: * p < 0.05; ** p < 0.01.


Table 6

Multilevel Regression Estimates for Posttest Motivational Beliefs a

Outcome Variable (Posttest)     Predictor                      Coefficient    SE          p value

Task Goal Orientation           Level-1: TASKGOAL, γ10         0.548162**     0.069905    0.000
                                Level-2: GROUP, γ01            -0.048264      0.197369    0.812
Perceived Task Goal             Level-1: PERCTASK, γ10         0.470225**     0.079936    0.000
                                Level-2: GROUP, γ01            0.164382       0.202496    0.436
Self Efficacy                   Level-1: EFFICACY, γ10         0.406299**     0.068589    0.000
                                Level-2: GROUP, γ01            0.078124       0.079747    0.351
Interest                        Level-1: INTEREST, γ10         0.598573**     0.075573    0.000
                                Level-2: GROUP, γ01            0.089683       0.310063    0.778
Ego Approach                    Level-1: EGOPERF, γ10          0.549829**     0.059008    0.000
                                Level-2: GROUP, γ01            -0.058126      0.135709    0.677
Ego Avoidance                   Level-1: EGOAVO, γ10           0.531549**     0.059259    0.000
                                Level-2: GROUP, γ01            0.187889       0.125347    0.165
Perceived Performance Goal      Level-1: PERCPERF, γ10         0.439538**     0.068559    0.000
                                Level-2: GROUP, γ01            0.138793       0.137974    0.339
Fixed Ability                   Level-1: FIXEDABILITY, γ10     0.369011**     0.057722    0.000
                                Level-2: GROUP, γ01            -0.128559      0.090237    0.185

Note: * p < 0.05; ** p < 0.01.


Table 7

Multilevel Regression Estimates for Posttest Achievement and Conceptual Change a

Outcome Variable (Posttest)        Predictor                  Coefficient    SE      p value

Post Multiple-choice Test Score    Level-1: PRETOTAL, γ10     0.81**         0.08    0.00
                                   Level-2: GROUP, γ01        -0.76          2.10    0.73
Multiple-choice Gain Score         Level-1: PRETOTAL, γ10     -0.19*         0.08    0.023
                                   Level-2: GROUP, γ01        -0.76          2.10    0.73
Performance Assessment             Level-1: PRETOTAL, γ10     0.50**         0.10    0.00
                                   Level-2: GROUP, γ01        -1.68          2.05    0.43
Short-Answer                       Level-1: PRETOTAL, γ10     0.09**         0.02    0.00
                                   Level-2: GROUP, γ01        -0.03          0.36    0.93
Predict-Observe-Explain            Level-1: PRETOTAL, γ10     0.09**         0.03    0.00
                                   Level-2: GROUP, γ01        -0.11          0.60    0.86
Achievement Total                  Level-1: PRETOTAL, γ10     1.53**         0.19    0.00
                                   Level-2: GROUP, γ01        -2.84          5.21    0.60
Conceptual Change Items            Level-1: PRETOTAL, γ10     0.22**         0.05    0.00
                                   Level-2: GROUP, γ01        0.03           1.21    1.00

Note: * p < 0.05; ** p < 0.01.


[Figure 1 content: Motivation encouraged (motivation promoting conceptual change): task goal orientation, incremental intelligence beliefs, interest, and self-efficacy. Motivation discouraged (motivation constraining conceptual change): ego goal orientation (including ego-avoidance and ego-approach) and fixed intelligence belief.]

Figure 1. Motivational beliefs involved in formative assessment and conceptual change.
