Centrality in a Modified-Angoff Standard Setting
PhD Dissertation - Proposal
by
Michael Scott Sommers (張夏石)
Candidate for the Degree
Doctor of Philosophy
Department of Educational Psychology & Counseling
National Taiwan Normal University
Dissertation Advisor: 林世華 (Sieh-Hwa Lin)
Table of Contents
Chapter 1 Introduction
1.1 Motivation & Purpose
1.2 Research Questions
1.3 Significance & Contribution
1.4 Definitions & Terms
Chapter 2 Literature Review
2.1 An Overview of the Standard Setting
2.2 The Angoff Standard Setting Method
2.3 Centrality
2.4 Centrality and the Angoff Standard Setting
Chapter 3 Methods
3.1 Data & Materials
3.2 The Operational Angoff Standard Setting
3.3 Procedures
3.4 Analysis and Expected Results
References
Appendixes
Appendix 1. Common European Framework of Reference - Global Scale
Appendix 2. Informed Consent Form
Appendix 3. Security Form
Appendix 4. Angoff Panelist Record Form
Appendix 5. Panelist Information Form
Appendix 6. Evaluation 1
Appendix 7. Evaluation 2
Appendix 8. Evaluation 3
Appendix 9. Evaluation 4
Appendix 10. Evaluation 5
Tables
Table 1. Contents of the English Proficiency Test (EPT)
Table 2. Angoff Panelists
Table 3. Contents of the Test Form Used in the Standard Setting
Table 4. B1 Listening and Reading Cutscores, Rounds 1 to 3
Table 5. Chronological Order of Panelist Evaluations
Table 6. Schedule for Completion of Dissertation
Figures
Figure 1. Characteristics of Kurtosis
Figure 2. Direction of Centrality Change
CHAPTER 1 Introduction
1.1 Motivation & Purpose
Beginning in the 1970s, the need for criterion-referenced testing to address problems with the norm-referencing of tests became increasingly obvious to educators and education policy makers (Glaser, 1963). With the pressure created by this demand came the realization that methods for producing criterion-referenced tests lagged far behind the idea. The decades that followed, the 1990s and 2000s, saw an explosion in understanding, and standard setting became the established method for determining the cutscores needed for criterion-referenced tests.
This research is about standard setting (Cizek, 2001; Cizek & Bunch, 2007).
In particular, this is about a family of standard setting procedures referred to as the
Angoff Method or the modified-Angoff Method. These methods are among the oldest
and most widely used to determine cutscores. Despite this, there are many problems
associated with their use and they remain among the most controversial. There are
obvious signs that the expert panel members used to make decisions during the
procedure have difficulty performing some of the tasks required of them. The Angoff
family of procedures requires that panelists use a scale to match their judgments about
item difficulty with judgments about populations of test takers. There is considerable
evidence that even the most highly trained experts have difficulty doing this in the
required fashion. This research is an examination of some of the aspects of why
panelists have this problem.
One of the most significant functions of ratings is as scales so that objects,
people, and performances can be compared and decisions made from these
comparisons. Ratings such as these in the Angoff standard setting are ubiquitous in
our modern society. Yet despite the widespread use of such ratings, raters cannot always make judgments in the fashion expected of them. Judges sometimes use some values of a scale more than expected. When this happens with the middle values of the scale, as opposed to the extreme values, it is called centrality (Wolfe, 2004; Saal, Downey & Lahey, 1980).
In the context of the Angoff standard setting, it has long been speculated that centrality plays a major role. Previous research has shown the difficulty that even the most well-trained experts have in estimating the difficulty measures of test items (Lorge & Kruglov, 1953; Shepard, 1994). This has been identified as a particular issue for the most and the least difficult items, a condition that would lead to centrality. Study after study, both of the ratings of panelists in high stakes Angoff panels and of the kinds of skills used in the Angoff method, has found that even the most highly trained experts have trouble making estimates about items that are extremely easy or extremely difficult. In a review of the literature pertinent to the Angoff standard setting procedure, Lavallee (2012, p. 14) concluded,
“…results consistent with a centrality effect have been found every time they have
been looked for” (italics in the original).
Centrality is one of the least well-understood of a class of phenomena often called 'rater effects' (Saal, Downey & Lahey, 1980); that is, aspects of the rating that are separate from the person, object, or performance being rated and intrinsic to the rater (Saal, Downey & Lahey, 1980; Yue, 2011). The result of this has been a focus on the centrality of raters as a measure of their performance. For example, it is often assumed that centrality is an error, and a demonstration of its existence in the ratings of a given rater is a sign that the ratings from that rater are problematic (Saal,
Downey & Lahey, 1980; Yue, 2011). Despite this, most of what has been written
about centrality tells us more about the kind of items that produce the centrality than
about the raters themselves or their ability to produce ratings, and this incongruence
has led to a gap in understanding of what centrality really is and how its presence
should be assessed. One of the most contentious problems with the issue of centrality
has been its measurement. There exists no consensus about the measurement of
centrality. While there are standard classical measures identified in the research
literature, many aspects of these measurements have not been clarified. For example, their relationship to each other is not at all clear, nor are there useful guidelines available for their application and interpretation. So, as Lavallee indicates, much of the research is "consistent" with the idea of centrality as a factor in the Angoff standard setting; however, it is not really clear what this means.
1.2 Research Questions
This study uses modern and classical measures of centrality to examine changes in centrality across the three rounds of a modified-Angoff standard setting. The purpose is to gain insight into the role of centrality as a factor in the modified-Angoff standard setting and into the usefulness of the concept in understanding the cutscore decisions of modified-Angoff panelists. In particular, these measures will be used to address the following research questions.
Question 1: Does centrality exist in the modified-Angoff standard setting?
Question 2: How does centrality change across the rounds of the modified-Angoff
procedure?
Question 3: Does centrality explain differences in panelist ratings between extreme (difficult and easy) items and items of median difficulty?
1.3 Significance & Contribution
A first contribution of this study is a greater understanding of rater effects and
the role of centrality. One of the main identified problems of the Angoff family of standard setting procedures involves difficulty in rating extremely difficult and extremely easy items and the way this affects cutscore decisions. Certainly this is widely speculated in the research literature, or at least the research literature consistently speculates in a manner indicating a major role for centrality. Is this true? Is it due to centrality that arises in raters, or in items?
This first contribution points to a wider understanding of raters, items, and the role of centrality that this research hopes to develop. The use of ratings and rating scales is ubiquitous in our modern society. They appear in our jobs, in our homes, in our leisure time, and in practically any other imaginable place. Greater understanding of such a central aspect of our lives is important. Centrality is a very poorly understood aspect of the rating process. Despite this, there are applications for the idea, such as in
the Angoff standard setting, where it plays a potentially major role. A stronger
understanding of the concept of centrality would contribute to a clearer understanding
of the general process of rating objects, people and performances.
A further contribution offered by this research is a clarification of the
measurement of centrality. While a number of classical and latent-trait methods have
been developed, their lack of widespread use has led to a poor understanding of these measurements. This research hopes to contribute a better developed sense of what
these measurements mean, as well as how and when they should be used.
A final, but central, contribution of this research is, of course, a greater
understanding of standard setting procedures, and particularly the modified-Angoff. The ways in which panelist ratings change in a modified-Angoff standard setting procedure have been studied extensively. And while a full understanding of these changes is still not complete, there is a growing picture of the way in which they operate during a modified-Angoff standard setting. Some aspects of the procedure, however, remain quite unclear. The role of centrality in final cutscore decisions is among these. What is the impact of centrality on standard setting decisions? What is the relationship between centrality and the other factors that we know about in the standard setting?
1.4 Definitions & Terms
Angoff Method Standard Setting -- In this study, the term Angoff Method refers
to the general category of standard setting methods derived from the original
suggestion of William Angoff. The term modified-Angoff Method refers to the group
of recently adopted versions of this that use percentage or proportion estimates instead
of Angoff's original yes/no judgment. While the term Angoff has come to refer to a wide range of different standard setting methods, the key element of these is that panelists are presented with items one at a time and asked to estimate, and vote on, an examinee's ability to answer the item correctly according to various criteria.
Centrality -- Centrality is present when a rater clusters his or her ratings around
the central point of a rating scale or around the center of a perceived range of
performances. This results in a compressed distribution of ratings and reduced
variation in scores. Centrality has been discussed in relationship to a large number of
different measurement concepts. Wolfe (2004) distinguishes between centrality and
restriction of range, stating that restriction of range is centrality combined with
leniency/harshness. The significance of this is that restriction of range also shows
skew. With respect to this study, there are two different kinds of centrality. Item
centrality exists when ratings are clustered for the same items across different raters.
Rater centrality exists when ratings are clustered for the same rater across different
items.
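To make the distinction concrete, the following minimal Python sketch contrasts the two kinds of centrality using an invented ratings matrix (not data from the operational standard setting): rater centrality appears as low spread within a row of the matrix, item centrality as low spread within a column.

    import numpy as np

    # Hypothetical Angoff ratings (percent-correct estimates): rows = panelists, columns = items.
    ratings = np.array([
        [55, 60, 58, 62, 57],   # a panelist whose ratings cluster near the middle of the scale
        [20, 45, 70, 90, 35],   # a panelist who uses the full range of the scale
        [50, 55, 60, 52, 58],
    ], dtype=float)

    # Rater centrality: low spread of one rater's ratings across different items (row-wise).
    rater_spread = ratings.std(axis=1, ddof=1)

    # Item centrality: low spread of different raters' ratings of the same item (column-wise).
    item_spread = ratings.std(axis=0, ddof=1)

    print("Spread of each rater's ratings across items:", rater_spread.round(2))
    print("Spread of raters' ratings for each item:", item_spread.round(2))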
Differences in panelist ratings – Panelist ratings are the estimates assigned by
panelists to individual items during the modified-Angoff method. The modified-
Angoff requires that panelists assign percentage estimates to items. Differences
between these estimates are the differences between panelist ratings.
CHAPTER 2 Literature Review
2.1 An Overview of the Standard Setting
Standard setting is a procedure used in the development of criterion-referenced
testing. This section will review some important aspects of the development of the
procedure and how the raters in the procedure influence the outcome of the standard
setting.
Standard setting refers to the family of procedures used to establish cutscores on
a scaled examination. Cutscores separate scaled scores into categories of performance
defined in a performance standard (Cizek, 1996; Cizek, 2001; Cizek & Bunch, 2007;
Cizek, Bunch & Koons, 2004). Standard setting is used in criterion-referenced
examinations to match standardized test scores with the verbal criteria defined in
performance level descriptors (PLDs) of the performance standards. Panels of judges
use different methods to compare PLDs with different types of information about
items or examinees. The term "standard setting" is used to refer to the different
procedures and materials used to make the cutscore decision. Since the first
suggestion of this idea in the 1950s (Nedelsky, 1954), dozens of different procedures
have been developed. In one recent survey (Kaftandjieva, 2010), more than 60
different methods were identified, 15 of which have appeared since the year 2000.
Standard setting grew out of the expanded role of criteria in testing.
Examinations can be defined as norm-referenced or criterion-referenced (Glaser,
1963; Shepard, 1980). Norm-referenced tests produce results that allow for
comparison between individuals and dominated high stakes testing for much of the
last century. Such tests are limited by an inability to indicate what the score means for
examinee ability. Criterion-referenced tests produce results that can assign a defined
meaning to a score. These abilities are typically defined in descriptions ranking them from least to most capable. Such descriptions are referred to as a 'performance standard' and the descriptions that define individual categories of performance as 'performance level descriptors' or PLDs. The standard setting allows for these ranked
descriptions - the PLDs - to be placed along scaled test scores providing latent trait
scores that correspond with the different categories of ability defined in the standard.
Cizek and Bunch (2007, p. 13) have stated that,
Standard setting refers to the process of establishing one or more cutscores on a
test...Cutscores function to separate a test score scale into two or more regions,
creating categories of performance or classifications of examinees.
A large number of different standard setting procedures have been developed
(Kaftandjieva, 2010; Cizek & Bunch, 2007) that allow for the assigning of cutscores
from a scaled test on to categories defined in performance level descriptors. While
these procedures vary enormously in the details, they all share one property. These
procedures present panels of experts with performance standards and different types
of information about items or examinees. Panelists are then taken through detailed
training and procedures producing a score on the test that separates the various
categories of the performance standards. The actual procedure used can vary
considerably and different procedures may use a wide range of different types of
information. A typical convention in contemporary standard setting procedures is to permit a significant amount of input from other sources to inform panelists about the impact of their cutscore decisions. For example, one common way to handle this is for panel organizers to allow discussion between panelists about their decisions and then to tell them what percentage of an examinee population would fall above and below their cutscore decisions.
As a result of this wide range of methods and procedures, different panels do not
always agree on the cutscore decision, even for the same test items and with the
provision of the same feedback information about pass/fail rates. It has long been
known that different methods produce estimates of cutscore decisions that are
systematically different (Buckendahl et al., 2002; Green et al., 2003; Hambleton & Pitoniak, 2006; Reckase, 2006; Yin & Schultz, 2005). Even small changes in standard setting procedures can result in changes in panelist decisions (Cross et al., 1984; Hertz & Chinn, 2002; Jaeger, 1982). Panelists, or even the same panelist, may not make the
same judgments under very similar conditions (George, Haque & Oyebode, 2006).
Very little has been written about the validity of the various standard setting
procedures. Much of what has been written about standard setting validity describes it
as an arbitrary procedure to set standards. A general theory of validity, sometimes
referred to as the argument-based concept of construct validity, has emerged as the
dominant model in psychological testing (Cronbach, 1988; Cronbach & Meehl, 1955;
Loevinger, 1957; Messick, 1981, 1989, 1998). Despite its dominance elsewhere, it has
not been incorporated into discussions of a general theory of validity for standard
setting. Examinations of validity in standard setting methods appear to be based on a
series of ad hoc principles (Kane, 1992, 2001) and derived from the approach that
accepts "just because a standard setting is arbitrary does not mean it is not useful"
(Hambleton, 1980, p. 102). In later work, Hambleton (2001; see also Schafer, 2005) builds on this suggestion, arguing that further information is necessary to determine the 'usefulness' of the standard setting, including (1) a definitional perspective, (2) a psychometric perspective, (3) a legal perspective, and (4) an institutional perspective.
The definitions for standard setting validity suggested by Cizek, Kane, and all other
modern standard setting theorists reject the conceptualization of standard setting as a
psychometric technique capable of discovering a knowable or estimable parameter
(Cizek & Bunch, 2007, p. 18).
Following in this tradition, Cizek and Bunch (2007) suggest that panel organizers
should report a number of statistical tests to support their argument for validity. In
contrast, Dixie McGinty (2005) points out that such statistics are really more
accurately thought of as indicators of reliability, and while useful in demonstrating
validity, are not themselves measures of validity. As a result of this confusion, in
comparison with other psychological assessment procedures, a scientific justification
for the validity of a particular procedural decision, such as choice of a method, or
variation in a procedure, is very rarely given and such justifications are typically
procedural in nature.
It is widely stated that standard setting procedures are dominated by two methods
that are historically linked - the modified-Angoff method and the Bookmark method.
The modified-Angoff method is derived from an original method named after William Angoff who, ironically, only briefly mentioned the method in a note and attributed the idea named after him to his colleague Ledyard Tucker (Cizek & Bunch, 2007). The
main principle of the method is that items are examined one-at-a-time and judged in
various ways for their suitability to make decisions about examinee classification.
Since the Angoff procedure is the main focus of this study, much more will be said
about it in the following sections; however, the Angoff is widely cited as being "the
most commonly used method for setting performance standards in contemporary use
in licensure and certification context" (Cizek & Bunch, 2007, p. 82). Regardless of the
literal accuracy of this statement, it is unquestionably widely used to make cutscore
decisions for high stakes tests.
The other widely used standard setting method is the Bookmark. The Bookmark
emerged from procedural difficulties with the Angoff method. It was first suggested
by Mitzel et al. (2001), although Cizek and Bunch (2007) trace its roots back to
procedures extended from the Angoff method and used in the 1990s by researchers at
American College Testing (ACT) for the National Assessment of Educational
Progress (NAEP). In the Bookmark, items are placed in a booklet, referred to as the
Ordered Item Booklet (OIB), where they are ordered by their difficulty measures.
Panelists then place a mark on the items that separate the various categories of the
performance standards. Engelhard (2007) speculates that, because of its widespread
use in assessments related to the American educational policy No Child Left Behind
(NCLB), the Bookmark is in fact now the most widely used standard setting method.
Standard setting is now a routine aspect of test development. Huge numbers of panels are
performed regularly as part of the development of state and private tests. Standard
setting panels are conducted as part of the vast network of accountability tests in
American schools known as No Child Left Behind (NCLB) (Linn, Baker, &
Betebenner, 2002; Linn, 2003), as well as in other public education accountability
projects throughout the world. Standard setting also plays a role in the development of
the examinations that establish standards for a wide variety of occupations and
professions (Nelson, 1994). In addition, panels similar to those in the standard setting
are increasingly used for other purposes. For example, Roach, McGrath, Wixson and
Talapatra (2010) describe a procedure similar to a standard setting panel to 'align' two
or more different types of assessments whose content is not directly comparable. The
results of their study resemble what could be produced from a mathematical equating
of different assessment procedures. Their use of the panel comparison instead of an
equating stems from the limited use of the assessments and hence limited numbers of
observations available to perform an equating.
Little is known about the growth in use of criterion-referenced testing or standard
setting. While numerous reviews of the different methods have been done (Norcini &
Shea, 1997; Shepard, 1980), including special issues of journals (see the 1994 special
edition of Educational Measurement: Issues & Practices, volume 13), and the
complexity and detail of methods has expanded enormously (Huynh & Schneider,
2005; Lissitz & Wei, 2008; Lissitz & Huynh, 2003), little is understood about the policy decisions that surround the procedure, such as what contributes to decisions to use various standard setting methods or the increasing move away from standard setting toward correlational methods of cutscore decision making, for example, value-added assessment (Sanders, 1998; Sanders & Horn, 1995; Sanders & Horn, 1998).
Although it is clear that political reasons play a significant part of what goes into the
decision-making around testing and assessment policy (Geisinger & McCormick,
2010; Giraud & Impara, 2005; Hein & Skaggs, 2009; McGinty, 2005; Nelson, 1994),
most of the discussion about this has been speculation and no research exists on what
factors are important in these decisions or what mechanisms are involved.
2.2 The Angoff Standard Setting Method
As mentioned above, the Angoff method is named after William Angoff (Angoff,
1971) who, ironically, only briefly mentioned the method in a note and attributed the
source of the method named after him to his colleague Ledyard Tucker (Cizek &
Bunch, 2007). The Angoff standard setting method is one of the oldest methods and is
reputed to be among the most widely used methods in the world for setting cutscores
(Cizek & Bunch, 2007). From a research point of view, the Angoff is particularly
useful because it produces many discrete values at points throughout the procedure
permitting the application of techniques derived from latent trait theory, such as Item
Response Theory (Embretson & Reise, 2000) and Rasch Modeling (Bond & Fox,
2001). This will be explained in more detail in the description of the method that
follows.
There are many different versions of the Angoff method in use today. For this
reason, methods that belong to the Angoff family of standard setting methods are
frequently described as a "modified-Angoff". It has been suggested there is no general
agreement on a definition of the Angoff method (Brandon, 2004; Reckase, 2000), and
Brandon (2004) lists five steps he believes characterize the modified-Angoff procedure:
1. selecting judges
2. training judges
3. defining and describing the performance level descriptors
4. estimating examinee performance at the level of each item
5. review of empirical information by judges and discussion of item estimates
This definition, while widely cited, is difficult to use. All of these points are
routine aspects of other standard setting methods and only (4) is an aspect distinctive
to the Angoff family of standard setting procedures. While estimation of examinee
performance at the item level is found in other methods, such as the Nedelsky procedure (Nedelsky, 1954), the way it is done in the Angoff offers a true distinction between the modified-Angoff and other standard setting methods.
The modified-Angoff is distinctive in its procedures for estimating the cutscore
in that,
1. Panelists are presented with items one-at-a-time.
2. Panelists are asked to estimate an examinee's ability to answer the item correctly.
3. Estimation of examinee ability to correctly answer the item is done item-by-item,
and items are not necessarily presented in any particular order.
The second point, estimation of an examinee's ability to answer the item
correctly, has been done in many different ways. Brandon (2004, p. 60 note 2)
provides a partial list of different ways.
Sometimes percentages are recorded instead of probabilities. Sometimes judges
specify the number of candidates out of 100 who could answer the problem
correctly. (e.g., Engelhard & Anderson, 1998; Impara & Plake, 1998).
Sometimes judges are given a choice of range of percentages or proportions, for
example, Cross, Impara, Frary and Jaeger (1984) and Plake and Giraud (1998)
instructed judges to select from deciles. Halpin, Sigmon and Halpin (1983)
printed the lowest acceptable probability and the highest probability. Angoff
also proposed a yes-no form of judgment...
In addition, Cizek and Bunch (2007) list several different versions of the
modified Angoff. This includes the yes/no Angoff procedure in which panelists
indicate only a yes or a no concerning their judgment of examinee ability to answer
the item correctly.
In addition to these procedures, modified-Angoff standard settings
conventionally incorporate a number of other procedures to produce a convergence of
scores across panelists. These are referred to in point (5) of Brandon's (2004) list
above. These additional procedures include,
1. Panelists have several opportunities to refine their estimations, referred to as
'rounds'. The current convention is to perform a standard setting in sometimes two,
but often three, rounds (Cizek & Bunch, 2007).
2. In between rounds, panelists have the opportunity to compare their estimations
and discuss why they made their individual decisions. This is referred to as
“discussion”.
3. In addition to discussion, panelists are presented with data that reflects the
impact of their decisions. For example, panelists may be shown the percentage of
examinees who would fall above or below their estimated cutscores. This is referred
to as 'impact data' or 'feedback'.
From a statistical point of view, these procedures produce a narrowing of the standard deviation of cutscore decisions, which appears as a growing consensus between panelists from round to round. Many of the unanswered issues about standard setting that are addressed in this paper relate to the nature of this convergence and to the factors in standard setting panels that influence it.
One final procedural issue is the method used to calculate the final recommended
cutscore decision. Cizek and Bunch (2007) recommend that the final cutscore for the standard Angoff be calculated by averaging the rater or the item means, typically from the final round of panelist votes. This is the conventional method used to handle the
recommendation, although some debate exists about this. Issues related to this are
beyond the scope of this study and will not be covered in this paper.
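As an illustration of the conventional calculation described above, the sketch below uses an invented matrix of final-round ratings rather than the operational data. Each panelist's recommended raw cutscore is the sum of his or her item estimates, and the panel recommendation is the mean of these; summing the item means across panelists gives the same value, which is why averaging "the rater or the item means" leads to the same recommended cutscore.

    import numpy as np

    # Hypothetical final-round ratings (probability that a borderline examinee answers
    # the item correctly): rows = panelists, columns = items. Invented for illustration.
    ratings = np.array([
        [0.60, 0.45, 0.80, 0.30],
        [0.55, 0.50, 0.70, 0.35],
        [0.65, 0.40, 0.85, 0.20],
    ])

    # Each panelist's recommended raw cutscore is the sum of his or her item estimates.
    rater_cutscores = ratings.sum(axis=1)

    # Averaging over raters and summing the item means yield the same panel recommendation.
    cutscore_from_raters = rater_cutscores.mean()
    cutscore_from_items = ratings.mean(axis=0).sum()

    print(rater_cutscores)
    print(cutscore_from_raters, cutscore_from_items)   # identical values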
A wide range of external influences have been investigated for their potential
effect on the outcome of a standard setting. These can be thought of as social
influences, situational influences, and rater effects. Research on social and situational influences has not produced strong evidence of an impact, although relatively little research has been conducted on these aspects of influence. Rater effects have been the most completely investigated, and a large and
robust research literature has documented their significant influence on the outcome
of the Angoff and modified-Angoff standard setting procedures.
All standard settings, no matter which method is used, demonstrate a convergence of cutscore decisions across rounds when conducted with multiple rounds, between-round discussion among panelists, and the provision of impact data. So
characteristic is this result that Cizek (2001a, p. 10) refers to this as a “common
feature of standard settings”. This convergence is not unanticipated. Experts given the
opportunity to discuss data relevant to their expertise will develop elaborate
explanations for the data based on information drawn from their shared background
(Chi, Glaser & Farr, 1988; Johnson, 1988; Larkin, McDermott, Simon & Simon,
1980). It is thus reasonable to interpret the convergence of cutscore decisions as a
growing expert consensus about the contents of the standard setting and its panels.
This, however, is not the commonly understood meaning of the convergence of standard setting decisions. The generally accepted interpretation is that, rather than reflecting a decision-making process, the convergence is the result of influences that have distracted panelists away from an optimal decision (Fitzpatrick, 1989; Hurtz & Auerbach, 2003; Hertz & Chinn, 2002; Wessen, 2010).
The exact nature of these effects is not well understood and much speculation has been generated about their origins. Many types of effects have been suggested as a
potential issue in the decisions of panelists. For example, it is widely suggested in the
standard setting literature that social influences during the discussion drive cutscores
toward agreement (Fitzpatrick, 1989; Hurtz & Auerbach, 2003; Hertz & Chinn, 2002;
Wessen, 2010). Researchers have suggested that a number of social influences, as well as
the effects of dominant individuals or group conformity, may be driving panelists to
report cutscore decisions that are more and more similar to each other. The exact
mechanism of this has not been well-established. Despite widespread speculation
about the role of these social influences (Fitzpatrick, 1989) and some empirical
examinations (Hurtz & Auerbach, 2003; Hertz & Chinn, 2002; Wessen, 2010), there is
no well-constructed theoretical or empirical demonstration of how social influences
could operate in a standard setting.
The effects of situational influences have been more completely investigated, but
the evidence that they play a major role in the outcome of standard setting panels is
not compelling. Wiley and Guille (2002) provide a review of the idea that an
"occasion facet" could affect the outcome of a panel. The aim of their study was to
examine the effect of an Angoff ratings obtained by panelists at home rather than
through the conventional tables of panelists seated together. While the authors
reported some interaction with other factors of the standard setting, they were unable
to find evidence that the occasion facet plays a significant role in the outcome of the
standard setting panel. The only meaningful finding reported in the study was that the
variance of scores in the 'at-home' panels was greater than that of traditional panels.
These findings add to the consistent results of other studies that could not find an
effect related to an occasion facet.
So while there are a wide range of suggested effects operating in the standard
setting, there is little understanding of their nature. In fact, it is not clearly understood
how such effects operate or even if they exist in a fashion that would affect the
outcome of the standard setting. Effects operating in the standard setting have not been clearly documented and categorized; as a result, there continues to be confusion about how cutscores are influenced.
2.3 Centrality
Variation in judgment and complete use of a rating scale is widely interpreted as
a positive indicator of rater ability. Failure to use the range of options that a scale
provides is interpreted as a flaw in the rater's ability. While a full discussion of this is
beyond the scope of this research, this distinction is one that appears regularly in the
research literature dealing with centrality and its related issues.
A variety of names have become attached to patterns in the failure to use the complete range of options available on the rating scale. Centrality is the term used to describe a lack of variation in ratings given by raters that center around the middle of the rating scale. The term is used synonymously with central tendency. Most contemporary research on centrality traces discussion of the concept back to the 1980 paper of Saal, Downey, and Lahey. In this paper, the authors reviewed a large number of related definitions used by researchers of the time, as well as classical methods of measuring centrality. Koran (1971, pp. 180-181), for example, defined
central tendency as, " the tendency to rate all rating objects around the 'middle' or
mean of the rating continuum and not around the extremes." In contrast, DeCotiis
(1977) uses the almost metaphorical understanding of central tendency as a rater's
unwillingness to go out on the proverbial limb in either the favorable or unfavorable
direction. The definition has been restated more recently by researchers using latent
trait theory. Reproducing the point that centrality reflects rating error, Farrokhi,
Esfandiari and Daili (2011, p. 71) define centrality in terms of a lack of variability in
rater scores, citing a definition from Wolfe and McVay (2010) of central tendency as,
"systematic error that causes the assigned ratings to be more tightly clustered than
ratings that do not contain error."
Using the latent trait concept of measurement, Wolfe (2004, pp. 39-40) contrasts
centrality with a number of related terms from the literature,
...centrality...results in a concentration of assigned ratings in the middle of the
rating scale...restricted range exists when centrality is combined with leniency or
harshness. That is, the restriction of range results in a restricted range around a non-central location on the rating scale. The converse of rater centrality occurs
when raters tend to overuse the extreme rating scale categories - a rater effect
called extremism.
In a doctoral dissertation supervised by Wolfe (Yue, 2011, p. 10), a similar point is
made about the distinction between centrality and restricted range.
(3) central tendency (i.e., a rater overuses the middle category of a rating scale
while avoiding the extreme categories), and (4) restriction of range (i.e., a rater
overuses any point on a rating continuum).
Few of these definitions have had a major impact on the methods used by researchers
to examine centrality or central tendency. Classical methods of analysis have varied
enormously. Saal, Downey, and Lahey (1980) described four techniques used in the
research to look for central tendency.
1) standard deviation
2) distance from the mean
3) kurtosis, and
4) rater × ratee ANOVA.
None of these methods have ever been compared for their ability to detect centrality.
It is difficult to see from this list of measures what common characteristic of central tendency is being addressed. For example, standard deviation is a measure of the dispersion of scores. Kurtosis has historically been interpreted as a measure of the peakedness of a curve and its deviation from the normal distribution (DeCarlo, 1997; Darlington, 1970; Dittmar, 2002). As such, it appears the two measures would indicate centrality in different situations. Indeed, studies of the relationship between the two measures have found only poor correlations (Murphy & Balzer, 1989). In a study of simulated data, Yue (2011) found that standard deviation was able to protect against Type I error while maintaining statistical power as well as latent trait measures did, indicating that measures of dispersion may provide an effective operational definition. She did not compare measures of kurtosis.
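To make the contrast between the classical measures concrete, the short sketch below computes the standard deviation and the excess kurtosis of two hypothetical raters' ratings on a 0-100 Angoff percentage scale; the data are invented, and no cutoff values are implied, since the literature reviewed here provides none.

    import numpy as np
    from scipy.stats import kurtosis

    # Two hypothetical raters on a 0-100 Angoff percentage scale (invented data).
    central_rater = np.array([48, 50, 52, 49, 51, 50, 53, 47], dtype=float)
    full_range_rater = np.array([10, 25, 40, 55, 70, 85, 95, 30], dtype=float)

    for label, r in [("central", central_rater), ("full range", full_range_rater)]:
        sd = r.std(ddof=1)                 # dispersion of the ratings
        kurt = kurtosis(r, fisher=True)    # excess kurtosis (0 for a normal distribution)
        print(f"{label:10s}  SD = {sd:6.2f}   excess kurtosis = {kurt:6.2f}")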
Measures of centrality developed by Wolfe and his students (Myford & Wolfe,
2003; 2004; Wolfe, 2004; Wolfe, Moulder & Myford, 2001; Yue, 2011) rely on latent
trait measures and particularly their residuals to detect centrality. Wolfe (2004, p. 40)
provides the following rationale for his measures.
Rater centrality and extremism manifest themselves in both the pattern and the spread of the residuals. When centrality occurs, the observed ratings regress toward the center of the rating scale. As a result, residuals tend to be large and positive for low expected ratings and large and negative for high expected ratings.
From this understanding, he has derived a number of different measures for the
detection of centrality. Yue (2011) used simulated data to examine these measures and
their detection of centrality. While she also included the classical measure standard
deviation, most of the measures she compared were derived from latent trait theory.
1. Mean-square fit statistics
2. Expected-residual correlations
3. Ratee measures and their residuals derived from Multi-Faceted Rasch
Measurement models (MFRM)
4. Correlation of Rasch measures and measures from raters
5. Rater slope (point biserial)
Yue reported that some of the latent trait measures, such as measure-residual
correlation and expected-residual correlations were effective at detecting centrality.
However, she also reported that standard deviation showed comparable ability to
protect against Type I error while maintaining an acceptable level of statistical power.
Latent trait measures, while not necessarily providing better detection than classical measures, do offer some significant advantages. In addition to lacking a common definition of centrality, classical measures provide no guidelines for the detection of centrality, such as values of the measures at which centrality can be assumed to exist. In contrast, Wolfe's latent trait measures provide guidelines for the detection of centrality. For example, if there is no centrality at all, the expected-measure correlation should be 0. Under conditions of perfect centrality, the expected-measure correlation should be -1.0, while under conditions of perfect extremism, the expected-measure correlation should be 1.0 (Wolfe, 2004). Most cases that fall in between can thus be separated into cases that are central or non-central based on whether their measures are negative or positive.
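A rough sketch of the logic behind these residual-based indicators is given below. It is not Wolfe's operational statistic; the "expected" ratings are invented placeholders standing in for values that would come from a fitted Rasch/MFRM model. The sketch only illustrates the sign pattern described above: a correlation near -1 is consistent with centrality, near +1 with extremism, and near 0 with neither.

    import numpy as np

    # Hypothetical model-expected ratings for one rater (in practice these would come
    # from a fitted Rasch/MFRM model) and the ratings the rater actually assigned.
    expected = np.array([1.2, 2.0, 3.1, 4.0, 4.8, 2.5, 3.6, 1.6])
    observed = np.array([2.4, 2.6, 3.0, 3.3, 3.5, 2.9, 3.2, 2.5])  # compressed toward the middle

    # Under centrality, residuals are positive where expected ratings are low and
    # negative where they are high, so this correlation is strongly negative.
    residuals = observed - expected
    print(round(float(np.corrcoef(expected, residuals)[0, 1]), 2))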
One final point to address is the cause of the 'centrality effect'. Central raters are described as being in error and needing closer examination. In Wolfe and McVay (2012),
central raters are described as "interesting raters" and identified as raters whose
ratings should be handled differently from non-central raters. This assumption about
centrality is widespread and assumed even outside education and psychology.
Research on issues related to job appraisal, for example, has identified centrality as
one of the more important problems (Berger, Harbinger & Sliwka, 2010; Bol, 2006;
Moers, 2005; Prendergast, 1999). As Bol (2006, pp. 9-10) states,
The lack of distinction between performance ratings of different employees is
expected to influence employee incentives...for below-average performers, as
compression influences their ratings in a positive way. Because of centrality bias,
their performance seems similar to that of top performers... with above-average
performers, the lack of variance in performance ratings might also negatively
affect the motivation of below-average performers as their marginal cost to
improve performance ratings may be higher than the marginal benefit they
receive from the performance rating improvement
This research has not drawn solutions to the problem from measurement theory
or educational research. Instead, solutions revolve around the identification and
comparison of ratings with objective measures of job performance and comparisons
of their various statistical properties.
Likewise, research on the judgments of experts describes a lack of variance in ratings as a sign of a lack of expertise. Based on their work observing different kinds of experts in both rating and prediction settings, Shanteau and colleagues (Shanteau, 1992, 2001; Weiss & Shanteau, n.d.-a, 2003, 2004) have developed a measure of expertise based on the variance of scores given. The Cochran-Weiss-Shanteau approach defines the CWS Index (Weiss & Shanteau, n.d.-b, p. 1) as "a numerical value that captures the degree of expertise demonstrated in a response set." Stating that "expertise at its core requires the ability to evaluate the stimuli in one's domain" (p. 2), the CWS is defined as
CWS = Discrimination / Inconsistency                        (1)

where

Discrimination = variance among averaged responses to different stimuli
Inconsistency = variance among responses to the same stimuli
In terms of evaluating judges' ratings, Inconsistency is interpreted as the item variance of ratings given by a single judge. Discrimination is interpreted as the judge variance of ratings given by many judges to the same object. The CWS Index is, in effect, a measure of centrality. The less central the estimates given by a judge, the more 'expert' he or she is seen to be.
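The sketch below illustrates the CWS ratio using the Weiss and Shanteau definitions quoted above, with invented data in which one judge rates each of five items on two occasions. A judge whose averaged responses barely separate the items, that is, a central judge, would show a much smaller index.

    import numpy as np

    # Hypothetical data: one judge rates each of 5 items on two occasions (invented values).
    ratings = np.array([
        [20, 25],
        [40, 35],
        [60, 65],
        [80, 75],
        [55, 50],
    ], dtype=float)

    # Discrimination: variance among the judge's averaged responses to different stimuli.
    discrimination = ratings.mean(axis=1).var(ddof=1)

    # Inconsistency: variance among responses to the same stimulus, averaged over stimuli.
    inconsistency = ratings.var(axis=1, ddof=1).mean()

    cws = discrimination / inconsistency
    print(f"CWS index = {cws:.1f}")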
And, once again, while centrality appears to be a key concept for some
researchers in the identification of experts and expertise, their measures draw from a
completely different place from latent trait theory or even classical measures of
centrality.
2.4 Centrality and the Angoff Standard Setting
Centrality, as such, has never been studied in the Angoff standard setting or in
the standard setting situation in general. However, much available research indicates
that centrality exists as an issue that should be taken into account when considering
the results of the modified- Angoff standard setting method.
In its modified form, the Angoff standard setting method asks panelists to rate
the difficulty of test items. The ability of expert judges to make such estimates is
crucial to the validity of the method, and as such, a large and comprehensive research
literature has been developed to address the issue. A large number of references are
typically cited questioning the ability of even the most highly trained experts to
accurately estimate the difficulty of test items (Brandon, 2004; Goodwin, 1999;
Impara & Plake, 1998; Lorge & Kruglov, 1953; Linn & Shepard, 1997; Norcini et al.,
1987; Norcini, Shea & Kanya, 1988; Shepard, 1994; Smith & Smith 1988; Taube,
1997).
Smith and Smith (1988) asked 15 Angoff judges to estimate the ability of
minimally competent students to correctly answer 64 reading items from a statewide
high school graduation test using the Angoff method. These were compared with the
estimates obtained from 16 judges using the similar Nedelsky method (Nedelsky,
1954). The overall agreement of the Angoff estimates with the empirical p-values
from the items was .60, which was higher than the estimates obtained for the
Nedelsky method (r = .37).
Goodwin (1999) used items from an administration of a licensing examination
for financial planners. She asked judges to estimate the proportion of all examinees
who could answer the question correctly, as well as the proportion of borderline
examinees who could do so. Fourteen judges estimated 140 items that had recently
been administered to 972 examinees in a high stakes examination. She found that
results for the two different estimates were very similar, with a correlation of .51 for
the predicted p-values and .55 for the predicted performance of the minimally
competent examinees.
A review of results obtained from various Angoff method procedures conducted
by Brandon (2004) concluded that typically, the values obtained by correlating the
measured p-values with the estimates obtained from Angoff panelists range from
around .40 to .70, indicating that at best, actual measurements of the p-value can
rarely account for more than half of the variance in a judged estimate. In conclusion,
he states (p. 71), results of this level, "show that the ordering of item estimates -
particularly those in operational standard setting studies - can be expected to mirror
moderately the ordering of item difficulty, thus supporting the validity of the modified
Angoff cutscores."
Of greater importance to the issue of centrality is which items are more likely to
be incorrectly estimated. Are items that are easier or more difficult also more difficult
for judges to estimate? Without accurate measures of centrality, one indication of this
would be that estimated values for difficulty would have a smaller standard deviation
than the actual measurements, indicating that estimated values for easy items and for
difficult items are not correct. This, in fact, is a commonly observed aspect of the
research. Lavellee (2012, p. 14) reviewed the literature related to this issue and
concluded, “…results consistent with a centrality effect have been found every time
they have been looked for” (italics in the original). In addition, the tendency for judges
to cluster estimates of actual values in tighter distributions than the actual values
themselves, has been the subject of comment for almost as long as there has been
systematic scientific investigation into standard setting results.
Lorge and Kruglov's original (1953) study found a standard deviation of 16.3 for the judges' estimates compared with 23.7 for the empirical difficulty values. More recently, Goodwin (1999) reported, in her study of the results of the financial planner licensing exam, that the judges' estimated p-values were "more homogeneous" than the actual results obtained from the administration of the items to candidates. The standard deviations for the estimates of the total group and for borderline examinees were .09 and .10 respectively. The actual observed values were .19 and .18. Van de Watering and van der Rijt (2006) compared the estimates of difficulty values made by teachers and students. They found high rates of inaccuracy among both these groups. Interestingly, their student group did not overestimate the difficulty of easy items, although they showed dramatic underestimation of difficult items. Teachers' estimates of easy items showed much more centrality and systematically underestimated the easiest items.
While Lavallee's (2012) remark that "…results consistent with a centrality effect have been found every time they have been looked for" is technically correct and overwhelmingly true (Clauser, Swanson & Harik, 2002; Clauser et al., 2009; Humphry, 2005), some studies have used methods other than correlation to examine estimates and actual p-values, and have found a different pattern. In a widely cited study, Impara and Plake (1998) had 26 school teachers make estimates for an administration of a highly standardized test to their students. Teachers were asked to estimate the performance of students who would get a D/F (borderline students) and of the class as a whole. Teachers' estimates of performance were reasonably accurate, with the estimates for borderline students being more accurate than those for the group as a whole. The more widely cited result (p. 77) is based on a calculation in which estimated scores were subtracted from (not correlated with) the actual scores obtained during test administration.
…teachers did not systematically overestimate (or underestimate) performance
on the easy items or overestimate (or underestimate) performance on the hard
items regardless of the target group. Instead, the systematic over- and
underestimation was a function of the teacher’s perception of the ability level of
the students for whom the estimate was provided
While this study did not find the centrality suggested in other studies, it is significant
that their method for detection involved the raw difference between scores, rather than
their correlation.
So while the literature on standard setting does not directly address the issue of
centrality, much of the information reported in published papers is consistent with the
suggestion that centrality exists as a problem in the ratings of panelists. However, the
standard setting research that addresses centrality is confusing. It is not the direct
object of any of the published research, nor is it specifically reported in any existing
study. In fact, descriptions of studies provided in pertinent research are such that it is
not always clear what is being addressed (See Brandon, 2004 for a discussion of
deficiencies in the reported procedures and results in standard setting research).
CHAPTER 3 Methods
3.1 Data & Materials
The data used in this study is drawn from standard setting meetings held at a
Taiwan university to link a university-level English proficiency exam to the Common
European Framework of Reference (CEFR) (See Appendix 1). It is the same data used
by Lavallee (2012) in his examination of many-faceted Rasch modeling in a modified-Angoff standard setting. Readers with questions about the data that are not answered in this paper may be able to find answers in his dissertation.
The exam used in this study, the English Proficiency Test or EPT, is an
examination of English as a Foreign Language. The EPT is a series of in-house language proficiency tests developed to meet the needs of the Practical English (PE) program adopted at Ming Chuan University. The PE exams are multiple-choice exams. They examine a range of listening, reading, and vocabulary skills in a number of different practical contexts. The tests have a long history at the school and were originally designed to be administered four times a year to undergraduate non-English majors at Ming Chuan University during the midterm examination and final examination periods. Beginning in 2004, a series of structural
and administrative reforms reorganized the tests, and beginning in 2010 all students in
the school, including students majoring in Applied English, began taking the EPT.
The EPT is administered as part of a larger English proficiency program, the
Practical English program, developed by Ming Chuan University to address
proficiency demands of students at the school. It is divided into 8 sections. Students
progress through 2 sections of Practical English each year: PE 1 and 2 are taught to
freshmen (1st year), PE 3 and 4 to sophomores (2nd year), PE 5 and 6 to juniors (3rd
year), and PE 7 and 8 to graduating seniors (4th year). Budgetary and other reforms
have resulted in major changes to the program and the structure of instruction is
rapidly becoming quite different. However, at the time of the standard setting (2010),
the program was still organized according to its traditional structure. Although
different students take the EPT, the composition of the test is now identical from year
to year of administration. An outline of the test organization is detailed in Table 1.
The original construction of the EPT test was based on traditional concepts of
test writing used in Taiwan. Test validity was determined by expert judges who were
themselves teachers. There were few specifications to follow during item writing and
determination of ‘error’ in test construction was based on issues more related to face
validity, such as spelling and grammar errors, or confusion over selecting answers,
than to psychometric properties. In fact, initially, there were no provisions made to
analyze the psychometric results of testing sessions, and determination of a successful
test was based on complaints from teachers and students about quality-related issues.
Items for the EPT are written by the classroom teachers of the PE program
under the supervision of test editors, who are assigned by the school and have
historically been compensated by the school for their work. More recently, this
compensation has been withdrawn following budget cutbacks. Items are written
following extremely vague item specifications and aimed at covering the materials
contained in school textbooks, which were also written by the classroom teachers of
the PE program. Items are then sent to a proofreader and finally returned to the
editors. The test editor returns the test to the school, which then prints and distributes the
test forms to students. The various tests of the EPT are administered on a single day.
So, for example, freshman students all receive the test at the same time. Sophomores receive the test at a different time from the freshman students.
Following student examinations, test results are collated and sent to a test
coordinator. The current test coordinator is Dr. Joseph Lavallee. Results are calibrated
with the Winsteps Rasch modeling software (Linacre, 2012), and all test items are placed on a
single difficulty scale. Items are sorted by their point-biserial correlations and
difficulty values, and stored in an item bank for later use. Currently, most items that
appear on PE tests are drawn from this item bank, although teachers continue to write
new items to expand the item bank.
The test items used in this standard setting were drawn from several different
midterm examinations. All items are calibrated using Rasch modeling onto a single
scale. This standard setting project was designed to establish cutscores along the scale
used to calibrate all items in the item bank and not along a raw score scale
corresponding to a single test form. Accordingly, the test form used in the project was
actually a composite form, with its items drawn from a series of different test forms
administered during the midterm examination period for first, second, third and fourth
year students. The tests shared a number of items which were used to equate them and
calibrate them together onto the same scale.
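For readers unfamiliar with the calibration step, the brief sketch below shows the dichotomous Rasch model that underlies this kind of common-scale work: the probability of a correct response depends only on the difference between examinee ability and item difficulty in logits, which is what allows items from different forms, linked through common items, to be expressed on one scale. The numbers are illustrative only; the operational calibration was carried out in Winsteps.

    import numpy as np

    def rasch_probability(ability, difficulty):
        """Dichotomous Rasch model: P(correct) for ability and item difficulty in logits."""
        return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

    # Illustrative values only: a hypothetical examinee at 0.5 logits meeting three items
    # whose difficulties have been calibrated onto the common EPT scale.
    for b in (-1.0, 0.5, 2.0):
        print(f"item difficulty {b:+.1f} logits -> P(correct) = {rasch_probability(0.5, b):.2f}")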
Table 1. Contents of the English Proficiency Test (EPT)

Listening (45 items, ~45 min)
  What's Next?            Student hears 2 conversational turns and is asked to choose the next response.  (20 items)
  Dialogues               Student hears a short conversation of about 8-14 turns and answers 3-5 comprehension questions.  (10 items)
  Extended Listening      Student hears a short monologue and answers 3-5 comprehension questions.  (15 items)

Reading (50 items, ~55 min)
  Fill in the Blank       Student chooses a word or short phrase to complete a sentence.  (10 items)
  Cloze Reading           Student chooses words or short phrases to complete a short passage (multiple-choice cloze).  (10 items)
  Reading with Questions  Student reads a short passage (150-300 words) and answers 3-5 comprehension questions based on the text.  (30 items)

Total: 95 items, ~100 min

(Lavallee, 2012)
3.2 The Operational Angoff Standard Setting
Selection and training of panelists was extremely conventional and followed
suggestions given in such authoritative sources as Cizek (2001), Cizek and Bunch (2007), and the Council of Europe (2009).
In the modified-Angoff study, 18 panelists were asked to set a cutscore only
for the B1 level. Panelists were selected from faculty and staff at the university who had an intimate understanding of both the PE tests and the student body with which the tests were regularly used, and who were familiar with the EPT. All MCU panelists had served in some capacity as item writers, proofreaders, and test editors for the PE tests, and all had
served as school-appointed proctors invigilating the test administration. In addition,
each panel was supplemented by a teacher from a similar program at a different
university in Taiwan. The level of familiarity of the panelists with the tests, the test
items, students, the CEFR and with standard setting in general was very high. Of the
18 panelists, 15 were faculty and staff teachers who had taught students in the PE
program at the university. Ten panelists had served as item writers for the PE tests and
some had written items that were used in the test form of the standard setting. Two of
the panelists and both facilitators had served on a preliminary yes/no Angoff panel
held earlier in the year (Lavallee, 2012). In addition, the CEFR is an important aspect
of instruction at the university. All MCU university faculty and staff from the English
Language Center would be familiar to some degree with the CEFR, having been
exposed to it during internal departmental training. One of the panelists had presented
numerous papers at local conferences and led several school-sponsored training
sessions about the use of the CEFR in the language classroom for ELC staff and
34
faculty.
External panelists were selected to add diversity to the standard setting
decisions. These panelists were selected because of their experience teaching students
at similar universities in Taiwan. Two of the external panelists were faculty members
at similar universities in the Taipei area, and one was a doctoral candidate at a Taipei
university who had taught remedial classes at the university at which she was studying.
Table 2 provides a summary of the panelists and a brief description of the background
of each.
Table 2. Angoff Panelists
Group        Judge         Gender   English NS/NNS   Position
I
(Mon)
Panelist 1 F NNS Administrator, former teacher
Panelist 2 F NNS Teacher
Panelist 3 M NNS Teacher
Panelist 4 F NS Teacher
Panelist 5 F NNS Teacher
Panelist 6 F NNS Teaching Assistant, recently graduated
student
II
(Wed)
Panelist 7 M NS Teacher
Panelist 8 M NS Teacher, External University
Panelist 9 F NNS Teacher
Panelist 10 M NNS Teacher
Panelist 11 F NNS Teaching Assistant; recently graduated
student
Panelist 12 M NS Teacher
III
(Fri)
Panelist 13 F NNS Administrator, Teacher
Panelist 14 F NNS Administrator, former teacher
Panelist 15 F NNS Teacher
Panelist 16 F NNS Teacher, External University
Panelist 17 F NNS Teacher
Panelist 18 F NNS Teacher
F=female, M = male
NS = native English speaker, NNS = non-native English speaker
(Lavallee, 2012).
A one-day training/orientation session was held on Saturday, July 10, 2010 for
all the participants. The panelists themselves were then separated into three different
panels which were held on Monday, July 12, Wednesday, July 14, and Friday, July 16
in 2010. A group of six panelists participated on each day. The individual panels were
conducted on three separate days to ensure that proper procedures were followed,
particularly during the discussion period. The facilitators acted as moderators of each
of the panels' discussion sessions, which required that the groups meet on separate
days.
As noted above, the test form presented to each of the panels was a
composite form drawn from a series of different tests in the Practical English series.
The items were drawn from test forms administered as part of the annual EPT for all
four year levels of the program, and the composite form, whose contents are shown in
Table 3, differed slightly from the EPT exam described earlier.
Table 3. Contents of the Test Form Used in the Standard Setting
Listening
What’s Next? 16 items
Dialogues 12 items (3 listening texts)
Extended Listening 12 items (3 listening texts)
Reading
Fill in the Blank 10 items
Text Completion 16 items
Reading with Questions 14 items
(Lavallee, 2012).
To acclimatize panelists to the difficulties encountered in taking the test and to
provide them with the experience of taking the exam, a training form was created with
the same format as the regular exam. The test form used in the operational standard
setting did not contain the scripts for the listening passages, so a separate form was
created for the listening test which contained both the listening scripts and the
associated items. During the training session, panelists took the test using the training
form. During the operational standard setting meeting, panelists were not able to hear
the taped version of the questions but were instead provided with the scripts of the
listening questions.
The week prior to the training session, an email was sent to all panelists
containing:
- an introductory letter containing a link to a CEFR familiarization website,
  <www.CEFtrain.net>;
- an agenda for the training session;
- the training materials, consisting of adapted versions of pages 33-36 from the
  CEFR, the listening and reading components of the CEFR self-assessment grid
  (CEFR Table 2), and a link to the website; and
- two forms collecting personal information and agreements concerning test
  security and informed consent for the research portion of the project (see also
  Appendices 2, 3, and 4).
As homework, panelists were asked to refer to the website and level summaries, and to
use the self-assessment grid to assess themselves (in any second language) and their
students in terms of the CEFR levels (Council of Europe, 2009).
As stated above, training of panelists was extremely conventional and
followed suggestions given in such authoritative sources as Cizek (2001), Cizek and
Bunch (2007), and the Council of Europe (2009).
On the day of the training, panelists were given a brief PowerPoint
presentation explaining the purpose of the project, describing the EPT, and explaining
how it was developed and validated. Following guidelines provided by Cizek (2001)
and Cizek and Bunch (2007), a great deal of effort was expended during training to
familiarize panelists with the descriptors used for the panels. They then
took part in a CEFR familiarization process. After a brief description of the CEFR,
their understanding of descriptors was tested. Panelists were given a sheet containing
the global level descriptors from the CEFR Table of Global Descriptors. The
descriptors had been rearranged, and the panelists were asked to sort them back into
the correct order (first individually, then in pairs). After providing them with a copy of
the original CEFR Table and discussing the correct answers, the panelists were asked
to take out their ‘homework’ activity in which they rated their own ability and that of
their students using the CEFR levels, and to discuss their answers in pairs.
The session then shifted to the CEFR reading Performance Level Descriptors
(PLDs). The first activity was another sorting activity, in which panelists were asked
to sort (individually, and then in pairs) 20 CEFR reading descriptors from CEFR
levels A1 to B2. They were then given the CEFR reading descriptors from the scales
used in the study (see section 1.2 above), for CEFR levels A1 to B2. Next, panelists
were given a 13-item reading test, taken from training material made available by the
Council of Europe. For each item, the panelists were asked to first attempt to answer
the item and then to assign the item to a CEFR level, based on the skills required to
correctly answer the item. After sharing their answers in pairs, the answers from the
original CoE study were shown and discussed. This concluded training for the reading
PLDs.
The training for the listening PLDs was conducted in parallel fashion.
Panelists were asked, individually and in pairs to sort 20 PLDs taken from the CEFR
A1 to B2 levels. After they finished, correct answers were provided along with a full
list of the listening PLDs from the scales used in the study (see section 1.2 above) for
levels A1 to B2. Panelists were then given a 6-item listening test, taken from the CoE
training material mentioned above, and asked to attempt to respond to the item and
then assign a CEFR-level to the item based on its perceived difficulty. Panelists
shared their answers in pairs, and then the recommended answers from the CoE study
were shown and discussed by the whole group. This concluded training for the
listening PLDs.
After a break for lunch, panelists took the practice test that was described
above. The panelists were then divided into the three groups of six people in each of
the operational panels. The groups were asked to sit together in a circle with the other
members of their standard setting panel. A group leader was chosen, and each group
was asked to go through the test form, item by item. As a group they were asked to
discuss what knowledge, skills and abilities were required to answer each item, and
how the items differed in terms of difficulty. One hour and fifteen minutes was
allotted to this task. The discussions were taped by the facilitators for later
transcription.
Following this activity, the panelists were introduced to the concept of the
barely proficient B1 student (B1 BPS). They were then given a form which contained
space for their notes on the BPS. They were asked to refer to their listening and
reading PLDs for the A2 and B1 levels, and to summarize on the forms what they felt
to be the key characteristics of a B1 BPS for both listening and reading. They were
then asked to discuss their summaries in pairs or small groups. This was the final
training activity of the day. Panelists were then given the opportunity to ask any
questions they had about what had been discussed to that point. They were told that
when they returned for the actual meeting, they would complete one practice round
before performing the actual standard setting. This concluded
the training session.
The Angoff meetings were held over the period of one week, on July 12, 14,
and 16. Each meeting was divided into two sessions, with standards set for the reading
test in the morning and the listening test in the afternoon. Before beginning, panelists
were given a brief review of the contents of the previous training session. This
included a review of the B1 BPS. Panelists were then asked to estimate, based on their
understanding of students in the PE program (or Taiwanese university students in
general for panelists who were instructors at other universities), the percentage of
students who had reached the B1 level for the skill in question and to write down this
estimate. Then the test form and the round 1 rating form (see Appendix 4) for the
reading test were distributed to the panelists and a practice round was conducted.
The rating form contained a single column for each item being rated, with each
column containing a list of probabilities from 0.1 to 0.9 in increments of 0.1,
with a space between each figure. Panelists were asked to “circle or insert” the
probability that a just-B1 level student would answer the item correctly, and to write
their answer at the bottom of the column. Panelists were instructed not to attempt to
include guessing in their calculation of probabilities. They were then given a practice
round, in which they were asked to write their ratings for the first few items. It was
made clear this was simply a practice round, to ensure that they understood the
procedure and that they could change their answers later. The facilitators circulated
from panelist to panelist while they were performing the practice round to make sure
the procedures were understood. Once all panelists had finished, they were asked if
there were any remaining questions. After questions were answered, the first round of
ratings was then conducted.
After returning from a break, panelists were given forms containing both
impact data and empirical item-difficulty data. They were given ‘impact data’ in the
form of a distribution of actual students in the program at different score levels on
the test. The rating form for the second and third rounds incorporated further feedback
data. The range of scale scores was divided into 40 categories of approximately equal
size. A column was added to the left side of the form. Each row in the column
contained one of the 40 categories, from low to high. Once again, there was one
column per item and the columns contained probabilities in increments of 0.1. This
time the probabilities were placed in rows corresponding to the scale scores in the
leftmost column. Based on empirical results from the Spring 2010 administration of
the EPT, the probabilities were placed in the particular scale-score row to correspond
to the approximate probability that a student in that scale-score category would
answer the item correctly. Panelists were guided in the use of the feedback material
so that they could use their initial estimates of the percentage of students at the B1
level, the distributional data, and the second rating form to contrast their round 1
ratings with the ratings implied by their estimates of the number of students at
the B1 level. Finally, at the bottom of the column for each item was the empirical p-
value for all PE students who took the midterm EPT. The listening form also
contained this information for graduating students. For reading, the difference
between graduating students and all students was not large, so this information was
omitted.
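A sketch of how such a feedback form can be assembled: student measures are cut into 40 score categories, and for each category the observed proportion answering a given item correctly is tabulated. The data here are simulated; the operational forms were built from the Spring 2010 EPT results.

    import numpy as np

    rng = np.random.default_rng(0)
    theta = rng.normal(0.0, 1.0, 2000)                       # simulated student measures (logits)
    p_true = 1.0 / (1.0 + np.exp(-(theta - 0.4)))            # one hypothetical item of difficulty 0.4
    responses = rng.binomial(1, p_true)                      # simulated right/wrong responses

    # Divide the range of the score scale into 40 categories of equal width
    edges = np.linspace(theta.min(), theta.max(), 41)
    category = np.clip(np.digitize(theta, edges[1:-1]), 0, 39)

    for c in range(40):
        mask = category == c
        if mask.any():
            print(f"category {c + 1:2d}: observed p = {responses[mask].mean():.2f}")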
After the panelists had been instructed in the use of the feedback information, a discussion
session was held. For each item, the panelists announced their round 1 ratings and
briefly explained why they had given the rating. The assistant facilitator entered
ratings into the computer as they were announced. Once the discussion period was
finished, the cutscores were calculated and shown to the panelists. Using the
distribution data, panelists were asked to contrast the percentage of students they had
initially estimated to be at the B1 level with the percentage of students who would be
classified at the B1 level based on their round one ratings. They were then asked to
make their round two ratings. It was emphasized that they did not need to change their
ratings.
The round two ratings were entered into the computer and cutscores were
calculated. (There was no discussion of individual decisions following round 2; rather,
panelists handed their rating forms to the facilitators who entered their ratings into the
computer while panelists took a break.) Panelists were again asked to consider the
impact (distributional) data, and given the opportunity to ask questions or make
comments. Following this, they were asked to make their final round of ratings. The
ratings for the final round were used to derive the recommended cutscores.
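The cutscore arithmetic itself is the standard Angoff computation: each panelist's item probabilities are summed to give that panelist's implied raw cutscore, and the panel's recommended cutscore is the mean of these values. The ratings below are invented; the operational cutscores were additionally expressed on the Rasch scale rather than the raw-score scale.

    import numpy as np

    # Hypothetical ratings: rows = panelists, columns = items
    # (each value is the probability that a just-B1 student answers the item correctly)
    ratings = np.array([[0.4, 0.6, 0.7, 0.3, 0.5],
                        [0.5, 0.5, 0.8, 0.4, 0.6],
                        [0.3, 0.7, 0.6, 0.4, 0.5]])

    panelist_cutscores = ratings.sum(axis=1)                 # each panelist's implied raw cutscore
    print("panelist cutscores:", panelist_cutscores)
    print(f"panel cutscore = {panelist_cutscores.mean():.2f} "
          f"(SD = {panelist_cutscores.std(ddof=1):.2f})")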
Table 4. B1 Listening and Reading Cutscores, Rounds 1 to 3

CEFR Level       Round 1 Mean (SD)    Round 2 Mean (SD)    Round 3 Mean (SD)
B1 Reading       6.06 (0.41)          6.22 (0.21)          6.31 (0.16)
B1 Listening     6.83 (0.25)          6.84 (0.15)          6.89 (0.25)
At the opening training meeting, all participants were asked to sign a research
consent form releasing all the data generated from the standard setting to the school
for any research and administrative purposes that were necessary (See Appendix 2 and
3). In addition, panelists’ feedback about their confidence and understanding of the
training was gathered regularly throughout both the training and the operational
standard setting panels. This included:
Day 1
1. pre-training assessment of individual preparation (Appendix 5)
2. Four different assessments throughout the training day assessing progress
and confidence in their task (Appendices 6, 7, and 8)
Day 2
3. An assessment at the opening of the operational panel to address
confidence and preparation in the day's coming activities. (Appendix 9)
4. A final assessment of panelist confidence in their final cutscore decision
and satisfaction with the manner in which the standard setting training
and panels had been conducted. This was modeled after the sample form
contained in Cizek and Bunch (2007).
Table 5 shows the chronological order of the evaluations throughout the two days of
training and standard setting.
Table 5. Chronological Order of Panelist Evaluations
Panelist Information Form (Appendix 5)
DAY 1 - Outline of the Training Session
Introduction to the project evaluation 1 (Appendix 6)
Introduction to CEFR evaluation 2 (Appendix 7)
Introduction to the test system of the school evaluation 3 (Appendix 8)
Introduction to the Angoff Standard Setting Procedure
Sorting of CEFR descriptors
Lunch
Group discussion of item difficulty
Take the test
DAY 2 - Outline of the Operational Panel
Review of standard setting procedures
Evaluation 4 (Appendix 9)
Reading Panel
Lunch
Listening Panel
Evaluation 5 (Appendix 10)
In addition to the feedback forms, the group discussion activities described
earlier were recorded and later transcribed. These recordings were made with hand-held
analog tape recorders with the full knowledge of all the participants. At no point in the
gathering of this data were participants deceived. Full disclosure of all data gathering
practices was conducted throughout.
Following the operational standard setting, one of the panels volunteered to
take part in a focus group to discuss their impressions of the standard setting. This
focus group was recorded and later transcribed for use in understanding panelist
perceptions of the standard setting, its procedures and its outcomes.
3.3 Procedures
Table 6 presents the key steps to date and the projected steps remaining to be
taken.
Table 6. Schedule for Completion of Dissertation
July 2009 – May 2010 Refining of standard setting design, preparation for
operational standard setting
June 2010 Operational Study – Modified-Angoff Standard
Setting
Sept –Jan 2012 Development of research questions, preparation of
proposal
May 24, 2013 Oral Defense of PhD proposal
Feb-Aug 2013 Quantitative analysis of data
Aug-Sept 2013 Preparation of dissertation
Oct 2013 Defense of dissertation
3.4 Analysis and Expected Results
The proposed analysis in this study involves an examination of centrality using
the classical measures of standard deviation and kurtosis, and latent trait measures of
centrality. These measures will be used in the ways described below to examine the
centrality of panelists and items and answer the research questions of this study.
Standard deviation: Standard deviation is a measure of the variability of a
population on a characteristic. It is also widely used to measure confidence in
statistical outcomes and is one of the most widely reported of all statistics. The
standard deviation of a population is calculated by

$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}} \qquad (2)$$

where
$\sum$ indicates the sum of the calculations,
$X$ = observation value,
$\mu$ = population mean,
$\sigma$ = population standard deviation, and
$N$ = number of scores in the population.

The standard deviation of a sample is estimated using the following formula,

$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} \qquad (3)$$

where
$\sum$ indicates the sum of the calculations,
$X$ = observation value,
$\bar{X}$ = sample mean,
$s$ = sample standard deviation, and
$n$ = number of scores in the sample.

Standard deviation expresses the typical distance of observations from the mean;
standardizing individual scores by the standard deviation produces values that range
between negative and positive infinity with a mean of 0. In a normal distribution,
approximately 68.27% of the sample falls between plus and minus one standard
deviation of the mean.
Standard deviation is one of the most studied of all statistics. It is widely
incorporated into descriptive statistics and a standard part of the reports of test and
research results. All of the major software packages provide the measure as a part of
output. It is also available on many types of pocket and desktop calculators. This
study reports both the standard deviation of panelist cutscore estimates (judge
standard deviation) and the standard deviation of estimates for individual items (item
standard deviation). These will be calculated and reported separately using Microsoft
Excel 2010 (Microsoft, 2013a).
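A minimal sketch of the two quantities, assuming the ratings for one round are arranged as a panelist-by-item matrix; the operational calculations use Excel's STDEV function, and the values here are invented.

    import numpy as np

    # Hypothetical round ratings: rows = panelists, columns = items
    ratings = np.array([[0.4, 0.6, 0.7, 0.3],
                        [0.5, 0.5, 0.8, 0.4],
                        [0.3, 0.7, 0.6, 0.4]])

    # ddof=1 gives the sample formula (3)
    judge_sd = ratings.std(axis=1, ddof=1)   # spread of each panelist's ratings across items
    item_sd = ratings.std(axis=0, ddof=1)    # spread of panelists' ratings for each item
    print("judge SD:", np.round(judge_sd, 3))
    print("item SD: ", np.round(item_sd, 3))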
Standard deviation is one of the classical measures of centrality described in
Saal, Downey, and Lahey (1980). In addition, it is a recommended statistic for
panelist final cutscore results (Cizek & Bunch, 2007) in the Angoff or other methods
for standard setting. As such, it provides a clear measure of a distribution's central
tendency. It was reported for this data set in Lavallee (2012).
If centrality increases across the 3 rounds of the modified-Angoff standard
setting, then the standard deviation of ratings given to items by individual raters
should decrease across rounds. This would not be a surprising finding, since, as
reported earlier, a decrease in the standard deviation of rater scores (judge standard
deviation) across the rounds of Angoff-type standard settings is a typical feature of
the different versions of the method (Cizek, 2001; Cizek & Bunch, 2007). The
standard deviation of individual panelist estimates of different items is widely
reported in Angoff-type standard settings, as well as other methods. A decrease in
standard deviation across rounds is a “common feature of standard settings” (Cizek,
2001a, p. 10) and is generally interpreted as an indicator of the validity of the
particular standard setting being examined. In her 2011 study of rater centrality, Yue
appears to interpret standard deviation in a different manner from Cizek. She
designates it as a measure of rater centrality, stating that
One would expect central tendency to be indicated by neutral ratings with little
variability. Therefore, for a given instrument, a displaced mean rating in
conjunction with a relatively large standard deviation would lead to the
conclusion that the instrument is sensitive to criterion performance, and therefore
resistant to central tendency (pp. 16-17).
Yue (2011) cautions that the kind of interpretation stated above by Cizek
(2001a), as well as the use of standard deviation as a measure of centrality, could be
problematic because the measure has a tendency to become inflated with random
error:
As a result, it would be unclear whether raters who produce ratings with small
standard deviations are engaging in centrality or are simply accurate raters in a
sample of inaccurate raters (p. 17).
However, Yue's study of simulated data was able to demonstrate that standard
deviation performed best among the measures she assessed, and was even better than
latent trait measures at distinguishing central raters from non-central raters while still
maintaining an acceptable level of power.
Kurtosis: Kurtosis is a classical measure used to detect central tendency in ratings. It
was not investigated in Yue (2011) or in Lavallee (2012), but it is one of the classical
measures of rater bias described in Saal, Downey, and Lahey (1980). It is statistically
defined as the fourth population moment around the mean, and is a measure of
deviation from the normal distribution.
$$\beta_2 = \frac{\sum (X - \mu)^4}{N\sigma^4} \qquad (4)$$

where
$\sum$ indicates the sum of the calculations,
$X$ = observed value,
$\mu$ = population mean,
$\sigma$ = population standard deviation, and
$N$ = number of scores in the population.
As the formula indicates, the normal distribution has kurtosis of 3. This is
conventionally rescaled to 0 by subtracting 3 from the result. This formula is
commonly referred to as Pearson kurtosis after its originator, Karl Pearson. Many
different versions of the formula have been derived. One of the most workable is
$$k = \frac{1}{N}\sum \left(\frac{X - \bar{X}}{s}\right)^4 - 3 \qquad (5)$$

where
$\sum$ indicates the sum of the calculations,
$X$ = the observation value,
$\bar{X}$ = the sample mean,
$N$ = number of scores in the sample, and
$s$ = standard deviation of the sample.
Kurtosis continues to be widely used in finance studies of the stock market
(Corrado & Su, 1996; Dittmar, 2002), but is less widely recognized as a useful
measure in educational or psychological research and has fallen out of use. The
general terminology associated with kurtosis for the shape of distributions is
illustrated in the diagram below.
Figure 1. Characteristics of Kurtosis
As stated above, a normal distribution has a kurtosis of 0. A platykurtic curve is
less peaked than the normal distribution and has a negative kurtosis. A leptokurtic
curve is more peaked than the normal distribution and has a positive kurtosis.
Kurtosis is a measure of a non-normal distribution's deviation from the normal
curve. It is generally interpreted as a measure of the peakedness of a curve (DeCarlo,
1997), but this interpretation has received heavy criticism. Darlington (1970) states
that kurtosis may also be measuring a variety of other things in addition to peakedness,
including the thickness of the tails. A large number of other measures of peakedness
have been identified (Balanda & MacGillivray, 1988; DeCarlo, 1997; Wilcox &
Keselman, 2003). However, these measures have rarely been used outside their initial
proposals, and they remain difficult to work with and poorly understood.
This study will report the Pearson kurtosis of panelist cutscores (judge kurtosis),
as well as the kurtosis of estimates for individual items (item kurtosis). Along with
standard deviation, Pearson kurtosis is one of the oldest statistics in use today. It is
well understood and easily managed by all of the major commercial software packages
and calculation tools. In this study it will be calculated using Microsoft Excel 2010
(Microsoft, 2013b), whose KURT function returns a sample excess kurtosis rescaled so
that a normal distribution has a kurtosis of 0, as described above.
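A sketch of the kurtosis calculation for a single panelist's ratings. The first value follows the moment formulation of equation (5); the second uses scipy's bias-corrected estimator, which appears to match the adjusted formula behind Excel's KURT. The ratings are invented, and the two values differ slightly because of the small-sample correction.

    import numpy as np
    from scipy.stats import kurtosis

    ratings = np.array([0.4, 0.5, 0.5, 0.6, 0.5, 0.4, 0.6, 0.5])   # one panelist's ratings (hypothetical)

    # Moment-based excess kurtosis, as in equation (5): mean fourth power of the
    # standardized deviations, minus 3 so that a normal distribution gives 0
    z = (ratings - ratings.mean()) / ratings.std(ddof=1)
    k_moment = np.mean(z ** 4) - 3.0

    # Bias-corrected excess kurtosis (the adjusted estimator used by Excel's KURT)
    k_excel = kurtosis(ratings, fisher=True, bias=False)

    print(f"moment-based excess kurtosis = {k_moment:.3f}")
    print(f"bias-corrected excess kurtosis = {k_excel:.3f}")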
Latent-trait measures: This study will use Rasch modeling to examine latent trait
characteristics of rater and item estimates from the modified-Angoff standard setting.
As originally suggested by Georg Rasch, there exists an invariant relationship between
the ability of the person and the item parameters. Rasch's original formulation of the
stochastic function of person and item parameters was stated for dichotomous data
(Humphry, 2005):

$$\ln\!\left(\frac{P_{ni1}}{1 - P_{ni1}}\right) = \beta_n - \delta_i \qquad (6)$$

where $\beta_n$ is the location of person $n$ along the underlying latent trait, $\delta_i$ is the
location of item $i$ along the same latent variable, and $P_{ni1}$ and $P_{ni0}$ are the
probabilities of person $n$ on item $i$ scoring 1 and 0, respectively.
A comparison of each expected value with the observed value leaves a residual,

$$x_{nr} - E_{nr} \qquad (7)$$

where $x_{nr}$ is the observed value for rater $r$ and $E_{nr}$ is the expected value for
rater $r$.
Edward Wolfe and his students (Wolfe, 2004; Wolfe, Moulder & Myford, 2001;
Myford & Wolfe, 2003; 2004; Yue, 2011) have identified several methods to use these
residual value of the Rasch measure to detect centrality. In her study of simulated
data, Yue (2011) identified the correlation between the expected value of the rating
and the residual of the expected and observed value as one of the most successful
measures at detecting centrality at a reasonable level of power.
The expected-residual correlation ($r_{exp,res}$) is the correlation of the expected latent-
trait (in this case Rasch) value for the rating with the residual. Unlike the classical
measures of centrality, latent trait measures provide a clear indication of when and
what type of centrality exists. Under the condition of perfect centrality, $r_{exp,res}$ should
be -1.0. Similarly, under conditions of perfect extremism, $r_{exp,res}$ should be 1.0. If there
is no centrality at all, $r_{exp,res}$ should be 0 (Wolfe, 2004).
Despite the precise definitions of centrality, the application of these
measures has been quite limited. The only study outside of Wolfe's group that has
reported results from these measures is Lavallee (2012), and this was not an attempt
to document their limitations or validity. As such, they remain largely theoretical and
poorly understood. However, unlike the classical measures discussed above, there are
guidelines for how to interpret their results in terms of identifying centrality. The
dissertation of Yue (2011) contains the SAS PROC commands for the calculation of
these values. In this study, they are calculated from item and panelist measures
supplied by WINSTEPS Rasch Modeling software (Linacre, 2012). The calculations
themselves are performed in Excel 2010 using instructions provided in Embretson
and Reise (2000).
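A sketch of the expected-residual correlation under the dichotomous formulation of equation (6). The person and item measures below are invented; in the study they come from the Winsteps output, and the observed values are the panelists' ratings rather than 0/1 responses, so this is illustrative only.

    import numpy as np

    def rasch_expected(beta, delta):
        # Expected value under the dichotomous Rasch model of equation (6)
        return 1.0 / (1.0 + np.exp(-(beta - delta)))

    # Hypothetical measures for one rater and five items (logits)
    beta_rater = 0.3
    delta_items = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])
    observed = np.array([0.9, 0.7, 0.6, 0.5, 0.4])      # that rater's observed ratings

    expected = rasch_expected(beta_rater, delta_items)
    residual = observed - expected                       # equation (7): observed minus expected

    # Near -1 suggests centrality, near +1 extremism, near 0 neither (Wolfe, 2004)
    r_exp_res = np.corrcoef(expected, residual)[0, 1]
    print(f"r_exp,res = {r_exp_res:.2f}")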
Standard deviation, kurtosis, and the latent trait measures of centrality will be
used in the following manner to address the research questions stated earlier.
Research question 1: Does centrality exist in the modified-Angoff standard setting?
The standard deviation, kurtosis, and latent trait measures of each of the items
and each of the judges during the different rounds will be examined. As stated above,
there is no accepted conventional interpretation of how much standard deviation or
kurtosis indicates centrality.
The interpretation of this is not straightforward, and a successful answer may
not be possible using classical measures. Despite its widespread use and its previous
use in centrality studies, there are no existing guidelines for the use of standard
deviation in the detection of centrality. A similar problem affects kurtosis. A
distribution is said to be leptokurtic when the rescaled Pearson kurtosis exceeds 0, but
no existing guidelines are available for the use of kurtosis in the detection of
centrality. It is expected that latent trait
measures, which provide stronger guidelines in the identification of centrality, should
find that at some point in the standard setting, panelist ratings show rater centrality.
When panelist centrality exists, the expected-residual correlation will be negative.
This research makes no prediction about when rater centrality will become
apparent in a modified-Angoff standard setting. It is possible that panelist ratings as
early as the first round will show centrality; however, this is not of significance to the
expected results of the study. By the final round, indicators of centrality should show
that it is present.
Question 2: How does centrality change across the rounds of the modified-Angoff
procedure?
This research question is a corollary of the question above. What are the patterns
of change for the measures examined in this study between the rounds of the
modified-Angoff standard setting? Do they increase? Or do they decrease? Are their
patterns among panelists? Do difficult items differ from easy items in terms of their
centrality?
In addition to the analysis of the item and judge classical and latent trait
measures described above, I propose an index similar in spirit to the measure used in
Impara and Plake (1998), in which the absolute difference between estimated and
actual scores was examined. Here, a nominal index is constructed from a combination
of the two classical measures, standard deviation and kurtosis. To obtain the values
for this matrix, the first round values will be subtracted from the final values in the
third round. The value for kurtosis will be positive or negative depending on whether
it, and the centrality of the raters or items, is increasing or decreasing across the
standard setting. A similar operation will be performed for the standard deviation.
Combining these two measures produces the following matrix, whose quadrants
should produce a hierarchy of centrality.
                              Kurtosis change across 3 rounds
                              positive                    negative

St. dev.       smaller        Quadrant A                  Quadrant B
change                        (HIGHEST CENTRALITY)
across
3 rounds       larger         Quadrant C                  Quadrant D
                                                          (LOWEST CENTRALITY)

Figure 2. Direction of Centrality Change
The four quadrants of the 2 × 2 matrix show different combinations of kurtosis and
standard deviation. The A Quadrant shows the highest level of centrality, containing
those items and panelists with a decreasing standard deviation and increasingly
positive kurtosis. The B Quadrant contains those items and panelists with decreasing
standard deviation but increasingly negative kurtosis. The C Quadrant contains those
items and panelists with an increasing standard deviation and an increasingly positive
kurtosis. The final D Quadrant contains those items with the least centrality, with an
increasing standard deviation and an increasingly negative kurtosis.
Latent trait measures can also be used in this operation, and similar indexes can
be constructed using combinations of these and classical measures to assess the
change in centrality between rounds.
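A sketch of how the quadrant assignment could be automated: the round-3-minus-round-1 change in standard deviation and in kurtosis is computed for each panelist (or item) and mapped onto the quadrants of Figure 2. The change values below are invented.

    def centrality_quadrant(sd_change, kurtosis_change):
        # Quadrants of Figure 2, based on round 3 minus round 1 changes
        if sd_change < 0 and kurtosis_change > 0:
            return "A (highest centrality)"
        if sd_change < 0 and kurtosis_change < 0:
            return "B"
        if sd_change > 0 and kurtosis_change > 0:
            return "C"
        return "D (lowest centrality)"

    # Hypothetical (SD change, kurtosis change) pairs for three panelists
    changes = {"Panelist 1": (-0.08, 0.6),
               "Panelist 2": (-0.03, -0.2),
               "Panelist 3": (0.05, -0.4)}

    for name, (d_sd, d_kurt) in changes.items():
        print(name, "->", centrality_quadrant(d_sd, d_kurt))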
Do the estimates of panelists fall in some quadrants more than others? What is
the movement of panelist estimates between rounds? What is the movement of item
estimates between rounds? Do item estimates and panelist estimates differ? I expect
that most panelist estimates will fall in Quadrant A and Quadrant B, indicating that
centrality has increased across rounds. Few, if any, items or panelists will fall in
Quadrant C or Quadrant D, which would indicate a decrease in centrality.
Question 3: Does centrality explain differences in panelist ratings between extreme
(difficult and easy) items and items of median difficulty?
This research question is the primary question of the project. Once patterns
consistent with centrality have been established, are these patterns also consistent with
the idea that items of extreme difficulty values - both the most difficult and the easiest
- show the most centrality?
A measure of how extreme an item's difficulty is will be constructed from the
difficulty measures of the items. The absolute value of an item's distance from the
midpoint of the logit scores will be used as the measure of item extremism. This
measure can in
turn be used in calculations of the correlation with classical measures of centrality,
such as standard deviation or kurtosis, as well as the various latent trait measures. In
addition, it can be compared with the index of centrality change described above to
determine if items most likely to change in the hierarchy of centrality measures are
the most extreme items.
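A sketch of the extremism measure and one such correlation, reading "the midpoint of the logit scores" as the centre of the observed difficulty range; the difficulties and item standard deviations below are invented.

    import numpy as np

    item_difficulty = np.array([-2.1, -0.9, -0.2, 0.3, 1.1, 2.4])   # Rasch difficulties (logits, hypothetical)
    item_sd = np.array([0.22, 0.15, 0.12, 0.13, 0.17, 0.25])        # SD of panelist ratings per item (hypothetical)

    midpoint = (item_difficulty.min() + item_difficulty.max()) / 2.0
    extremism = np.abs(item_difficulty - midpoint)                  # distance from the midpoint of the logit scores

    r = np.corrcoef(extremism, item_sd)[0, 1]
    print(f"correlation between item extremism and item SD = {r:.2f}")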
It is expected that correlations between the measure of item extremism and all
measures of centrality will be significant, or at least that these correlations will
increase across the three rounds of the modified-Angoff standard setting, indicating a
strong association between item extreme value and its centrality in a collection of
rated items. It is expected that the most extreme items, as identified by this measure of
item extremism, will be the items most likely to show the greatest centrality change in
the hierarchy of centrality described above. That is, items with the most extreme
values of item difficulty will most likely be found in Quadrant A.
References
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In: R. L. Thorndike
(Ed.), Educational Measurement (pp. 508-600). Washington, DC: American
Council on Education.
Balanda, K. P.; and H. L. MacGillivray. (1988). Kurtosis: a critical review. American
Statistician, 42, 111-119.
Berger, J., Harbring, C.& Sliwka D. (2010). Performance appraisals and the
impact of forced distribution: An experimental investigation. Discussion Paper
No. 5020, The Institute for the Study of Labor, Bonn, Germany.
Bol, J. C. (2006).The determinants and performance effects of supervisor bias. IESE
Business School. University of Navarra.
Bond, T. G., & Fox, C. M. (2001).Applying the Rasch model: Fundamental
measurement in human sciences. Mahwah, NJ: Erlbaum.
Brandon, P. R. (2004). Conclusions about frequently studied modified Angoff
standard-setting topics. Applied Measurement in Education, 17, 59–88.
Buckendahl, C.W., Smith, R.W., Impara, J.C. & Plake, B.S. (2002). A comparison of
Angoff and bookmark standard setting methods. Journal of Educational
Measurement, 39, 253-263.
Chi, M. T. H., Glaser, R., & Farr, M. J. (Eds.). (1988). The nature of
expertise. Hillsdale, NJ: Erlbaum.
Cizek, G. J. (1996). An NCME instructional module on setting passing scores.
Educational Measurement: Issues and Practice, 15, 20-31.
Cizek, G. J. (ed.) (2001) Setting performance standards: concepts, methods, and
perspectives. Mahwah. NJ: Erlbaum.
Cizek, G. J., & Bunch, M. B. (2007).Standard setting: A guide to establishing and
evaluating performance standards on tests. Thousand Oaks, CA: Sage.
Cizek, G.J., Bunch, M.B., & Koons, H. (2004).Setting performance standards:
contemporary methods. Educational Measurement: Issues and Practice, 23,
31–50.
Clauser, B. E., Harik, P., Margolis, M. J., McManus, I. C., Mollon, J., Chris, L. &
Williams, S. (2009). An empirical examination of the impact of group
discussion and examinee performance information on judgments made in the
Angoff standard-setting procedure. Applied Measurement in Education, 22,
1-21
Clauser, B. E., Swanson, D. B., & Harik, P. (2002).A multivariate generalizability
analysis of the impact of training and examinee performance information on
judgments made in an Angoff-style standard-setting procedure. Journal of
Educational Measurement, 39, 269–290.
Corrado, C., and T. Su. (1996). Skewness and Kurtosis in S&P 500 Index Returns
Implied by Option Prices. Journal of Financial Research, 19,175-192.
Cronbach, L. J. (1988). Five perspectives on validation argument. In H. Wainer & H.
Braun (Eds.), Test validity (pp. 3–17). Hillsdale, NJ: Erlbaum.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests.
Psychological Bulletin, 52, 281–302.
Cross, L. H., Impara, J. C., Frary, R. B., & Jarger, R. M. (1984). A comparison of
three methods on the National Teacher Examination. Journal of Educational
Measurement, 21, 113-129.
Council of Europe (2001). Common European Framework of Reference for
Languages: learning, teaching, assessment. Cambridge: Cambridge University
Press. Retrieved March 25, 2013 from
http://www.coe.int/t/DG4/Portfolio/documents/Framework_EN.pdf
Council of Europe (2009). Relating language examinations to the Common European
Framework of Reference for Languages: Learning, teaching, assessment
(CEFR). Cambridge: Cambridge University Press. Retrieved March 25, 2013
from http://www.coe.int/t/dg4/education/elp/elp-reg/Source/Publications/
Standards_language_EN.pdf
Cross, L. H., Frary, R. B., Kelly, P. P., Small, R. C., & Impara, J. C. (1985).
Establishing minimum standards for essays: Blind versus informed reviews.
Journal of Educational Measurement, 22,137-146.
Darlington, R. B. (1970). Is kurtosis really "peakedness"? American Statistician, 24,
19-22.
DeCarlo, L. T. (1997). On the meaning and use of kurtosis. Psychological Methods, 2,
292-306.
DeCotiis, T. A. (1977). An analysis of the external validity and applied
relevance of three rating formats. Organizational Behavior and Human
Performance, 19, 247-266.
Dittmar, R., (2002). Nonlinear Pricing Kernels, Kurtosis Preference, and Evidence
from the Cross-Section of Equity Returns. Journal of Finance, 57, 369-403.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists.
Mahwah, NJ: Lawrence Erlbaum Associates.
Engelhard, G. (2007). Evaluating bookmark judgments. Rasch measurement
Transactions, 21, 1097-1098.
Engelhard, G., & Anderson, D. W. (1998). A binomial trials model for examining the
ratings of standard setting judges. Applied Measurement in Education, 11,
209-230.
Farrokhi, F., Esfandiari, R., & Dalili, M. V. (2011). Applying the Many-Facet Rasch
Model to Detect Centrality in Self-Assessment, Peer-Assessment and Teacher
assessment. World Applied Science Journal, 15, 70-77.
Fitzpatrick, A. R. (1989). Social influences in standard setting: The effects of social
interaction on group judgments. Review of Educational Research, 59, 315-328.
Geisinger, K., & McCormick, C. (2010). Adopting cut scores: Post-standard-setting
panel considerations for decision makers. Educational Measurement: Issues and
Practice, 10, 17-22.
George, S., Haque, M. S.& Oyebode, F. (2006). Standard setting: Comparison of two
methods, BMC Medical Education, 6, 46.
Giraud, G. T., & Impara, J. C. (2005).Making the cut: The cut score setting process in
a public school district. Applied Measurement in Education, 18, 289-312.
Glaser, R. (1963). Instructional technology and the measurement of learning
outcomes. American Psychologist,18, 519–522.
Goodwin, L.D. (1999). Relations between observed item difficulty levels and Angoff
minimum passing levels for a group of minimally competent examinees. Applied
Measurement in Education, 12, 13-28.
Green, D. R., Trimble, C. S., & Lewis, D. M. (2003).Interpreting the results of three
standard-setting procedures. Educational Measurement: Issues and Practice, 22,
22–32.
Halpin, G., Sigmon, G., & Halpin, G. (1983).Minimum competency standards set by
three divergent groups of raters using a three judgmental procedures:
Educational and Psychological Measurement, 47, 977-983.
Hambleton , R. K. (1980) Test score validity and standard-setting methods. In R. A.
Berk (Ed.),Criterion - referenced measurement: The state of the art. Baltimore,
MD: Johns Hopkins University Press.
Hambleton, R.K., (2001). Setting performance standards on educational assessments
and criteria for evaluating the process. In G.J. Cizek (Ed.), Setting performance
standards: Concepts, methods, and perspectives. (pp.89-116).Mahwah, NJ:
Lawrence Erlbaum Associates
Hambleton, R. K., & Pitoniak. M.(2006).Setting performance standards. In R. L.
Brennan (Ed.), Educational measurement (pp. 433-470). Westport, CT:
Praeger Publishers.
Hein, S. F. & Skaggs, G. E. (2009).A qualitative investigation of panelists’
experiences of standard setting using two variations of the bookmark method.
Applied Measurement in Education, 22, 207-228.
Hertz, N. R., & Chinn, R. N. (2002, April).The role of deliberation style in standard
setting for licensing and certification examinations. Paper presented at the
annual meeting of the National Council on Measurement in Education, New
Orleans, LA.
Humphry, S. (2005). Maintaining a common arbitrary unit in social measurement.
Unpublished doctoral dissertation, Murdoch University, Western Australia.
Hurtz, G. M., & Auerbach, M. A. (2003).A meta-analysis of the effects of
modifications to the Angoff method on cutoff scores and judgment consensus.
Educational and Psychological Measurement, 63, 584–601
Huynh, H. & Schneider, C. (2005). Vertically moderated standards: Background, p
assumptions, and practices. Applied Measurement in Education, 18, 99-113.
Impara, J.C., & Plake, B.S. (1998). Teachers’ ability to estimate item difficulty: A test
of the assumptions in the Angoff standard-setting method. Journal of
Educational Measurement, 35,69-81.
Jaeger, R. M. (1982). An iterative structures judgment process for establishing
standards on competency tests: Theory and application. Educational evaluation
and Policy Analysis, 4, 461-475.
Johnson, E. J. (1988). Expertise and decision under uncertainty: Performance and
process. In M. Chi, R. Glaser, & M. J. Farr (Eds.), The nature of expertise.
(pp. 209-228). Hillsdale, NJ: Lawrence Erlbaum Associates.
Kaftandjieva, F. (2010). Methods for Setting Cut Scores in Criterion-references
Achievement Tests. A Comparative Analysis of Six Recent Methods with an
Application to Tests of Reading in EFL. EALTA publication. Retrieved March
25, 2013 from
http://www.ealta.eu.org/documents/resources/FK_second_doctorate.pdf
Kane, M. T. (2001). So much remains the same: conception and status of validation in
setting standards. In G. J. Cizek (Ed.),Setting performance standards: concepts,
methods and perspectives (pp. 19–51). Mahwah, NJ: Lawrence Erlbaum
Associates, Inc.
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin,
112, 527-535.
Larkin, J. H., McDermott, J., Simon, D. P., & Simon, H. A. (1980). Expert and novice
performance in solving physics problems. Science, 208, 1335-1342.
Lavallee, J. (2012). Validation Issues in an Angoff Standard Setting: A Facets-based
investigation. Unpublished PhD Dissertation, Department of Counseling and
Educational Psychology, National Taiwan Normal University, Taipei, Taiwan.
Linacre, J. (2012). Facets: Many-Facet Rasch Analysis. Retrieved June 27, 2012 from
http://www.winsteps.com/facets.htm
Linn, R. L., Baker, E. L., & Betebenner, D. W. (2002). Accountability systems:
Implications of requirements of the No Child Left Behind Act of 2001.
Educational Researcher, 31, 3–16.
Linn, R. L. (2003). Accountability: Responsibility and reasonable expectations.
Educational Researcher, 32, 3-13.
Lissitz, R. W. & Huynh H. (2003). Vertical equating for state assessments: Issues
and solutions in determination of adequate yearly progress and school
accountability. Practical Assessment, Research & Evaluation, 8 (10). Retrieved
March 25, 2012 from http://pareonline.net/getvn.asp?v=8&n=10
Lissitz, R. W. & Wei, H. (2008).Consistency of standard setting in an augmented
state testing system. Educational Measurement, 27, 46-56.
Loevinger, J. (1957). Objective tests as instruments of psychological theory.
Psychological Reports, 3, 635–694.
Lorge, L, & Kruglov, L. (1953).A suggested technique for the improvement of
difficulty prediction of test items. Educational and Psychological Measurement,
12, 554-561.
McGinty, D. (2005). Illuminating the “Black Box” of standard setting: An exploratory
qualitative study. Applied Measurement in Education, 18, 269–287
Messick, S. (1981). Constructs and their vicissitudes in educational and
psychological measurement. Psychological Bulletin, 89, 575–588.
Messick, S. (1989).Validity. In R. L. Linn (Ed.), Educational measurement (pp. 13–
103). Washington, DC: American Council on Education and National Council on
Measurement in Education.
Messick, S. (1998). Test validity: A matter of consequence. Social Indicators
Research, 45, 35–44
Microsoft (2013a) STDEV in Welcome to Office. Retrieved January 23, 2012 from
http://office.microsoft.com/en-us/excel-help/stdev-HP005209277.aspx?CTT=3.
Microsoft (2013b) KURT in Welcome to Office. Retrieved January 23, 2013 from
http://office.microsoft.com/en-us/excel-help/kurt-HP005209150.aspx.
Mitzel, H. C., Lewis, D. M., Patz, R. J., & Green, D. R. (2001). The Bookmark
procedure: Psychological perspectives. In G. J. Cizek (Ed.). Setting performance
standards: Concepts, methods, and perspectives (pp. 249-281). Mahwah, NJ:
Erlbaum.
Moers, F.(2005). Discretion and bias in performance evaluation: the impact of
diversity and subjectivity. Accounting, Organizations and Society, 30, 67–80.
Murphy, K. R., & Balzer, W. K. 1989. Rater errors and rating accuracy. Journal of
Applied Psychology, 74, 619-624.
Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using
many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4,
386–422.
Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using
many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5,
189–227.
Nedelsky, L. (1954). Absolute Grading Standards for Objective Tests. Educational
and Psychological Measurement, 14, 3-19.
Nelson, D. S., (1994) Job Analysis for Licensure and Certification Exams: Science or
Politics? Educational Measurement: Issues and Practice, 29-35.
Norcini, J. J. & Shea, J. A. (1997). The credibility and comparability of standards.
Applied Measurement in Education, 10, 39–59
Norcini, J., Lipner, R., Langdon, L., & Strecker, C. (1987). A comparison of three
variations on a standard-setting method. Journal of Educational Measurement,
24, 56-64.
Plake, B., & Giraud, G. (1998) Effect of a modified Angoff strategy for Obtaining
item performance estimates in a standard setting study. Paper presented at the
Annual Meeting of the American Educational Research Association. San Diego,
Calf.
Prendergast, C., 1999, The provision of incentives in firms. Journal of Economic
Literature, 37, 7-63.
Reckase, M. D. (2000). The ACT/NAGB standard setting process: How "modified"
does it have to be before it is no longer a modified-Angoff process? Paper
presented at the annual meeting of the American Educational Research
Association, New Orleans.
Reckase, M. D. (2006) Rejoinder: Evaluating standard setting methods using error
models proposed by Schulz. Educational Measurement, 25, 14-17
Roach, A. T., McGrath, D., Wixon, C., & Talapatra, D. (2010).Aligning an early
childhood assessment to state kindergarten content standards: application of a
nationally recognized alignment framework. Educational Measurement: Issues
and Practice, 29(1), 25-37.
Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings: Assessing the
psychometric quality of rating data. Psychological Bulletin, 88, 413-428.
Schafer, W. D. (2005).Criteria for standard setting from the sponsor’s perspective.
Applied Measurement in Education, 18, 61-81.
Saunders, W. L. (1998). Value added assessment. The School Administrator, 55,
24-32.
Saunders, W. L., & Horn, S. P. (1998). Research findings from the Tennessee value-
added assessment system (TVAAS) database: Implications for educational
evaluation and research. Journal of Personnel Evaluation in Education, 12,
247–256.
Saunders, W. L. & Horn, S P. (1995). The Tennessee Value-Added Assessment System
(TVAAS): Mixed Model Methodology in Educational Assessment. In
Shrinkfietd, A. J. & Stufflebeam, D. (Eds.), Teacher evaluation: Guide to effective
practice (pp. 337-350). Boston, MA: Kluwer.
Shanteau, J.(1992). Competence in experts: the role of task characteristics.
Organizational Behavior and Human Decision Processes. 53, 252–266.
Shanteau, J.(2001). What does it mean when experts disagree? In E. Salas & G. Klein
(Eds.), Linking expertise and naturalistic decision making (pp. 229–244).
Mahwah, NJ: Erlbaum
Shepard, L.A. (1980). Standard setting issues and methods. Applied Psychological
Measurement, 4, 447-467.
Shepard, L. A. (1994). Implications for standard setting of the National Academy of
Educational Evaluation of the National Assessment of Educational Progress
achievement levels. In: Proceedings of the joint conference on standard setting
for large-scale assessments of the National Assessment Governing Board and the
National Center for Educational Statistics (pp. 143–159). Washington, DC: U.S.
Government Printing Office.
Smith, R. L. and Smith, J. S. (1988). Differential use of item information by judges
using Angoff and Nedelsky procedures. Journal of Educational Measurement,
25, 259-274.
Taube, K.T. (1997). The incorporation of empirical item difficulty data in the Angoff
standard-setting procedure. Evaluation and the Health Professions, 20, 479-498.
van de Watering G., van der Rijt J. (2006). Teachers’ and students’ perceptions of
assessments: a review and a study into the ability and accuracy of estimating the
difficulty levels of assessment items. Educational Research Review,1, 133–147.
Weiss, D. J., & Shanteau, J. (2004). The vice of consensus and the virtue of
consistency. In C. Smith, J. Shanteau, & P. Johnson (Eds.), Psychological
investigations of competent decision making (pp. 226–240). Cambridge, UK:
Cambridge University Press.
Wessen, C. (2010). Analysis of Pre- and Post-Discussion Angoff ratings for evidence
of social influence effects. Unpublished MA Dissertation, Department of
Psychology, University of California, Sacramento.
Wiley, A., & Guille, R. (2002, April).The occasion effect for “at-home” Angoff
ratings. Paper presented at the annual meeting of the National Council on
Measurement in Education, New Orleans, LA.
Weiss, D. J., & Shanteau, J. (2003). Empirical assessment of expertise. Human Factors, 45, 104–116.
Weiss, D. J. & Shanteau, J. (no date, a). CWS: A user's guide. Retrieved January 23,
2013 from https://www.ksu.edu/psych/cws/pdf/using_cws.pdf.
Weiss, D. J. & Shanteau, J. (no date, b). Do judgments alone provide sufficient
information to determine expertise of the judge who made them? Retrieved
January 23, 2013 from http://www.k-state.edu/psych/cws/pdf/judgment.pdf.
Wilcox R. R. & Keselman H. J. (2003). Modern robust data analysis methods:
Measures of central tendency. Psychological Methods, 8, 254–274.
Wolfe, E.& McVay, A. (2012) Application of latent trait models to identifying
substantively interesting raters. Educational Measurement: Issues and Practice,
31, pp. 31-37.
Wolfe, E. W. (2004). Identifying rater effects using latent trait models. Psychology
Science, 46, 35–51.
Wolfe, E. W., Moulder, B. C., & Myford, C. M. (2001). Detecting differential rater
functioning over time (DRIFT) using a Rasch multi-faceted rating scale model.
Journal of Applied Measurement, 2, 256–280.
Yin, P., & Schultz, E. M. (2005, April).A comparison of cut scores and cut score
variability from Angoff-based and Bookmark-based procedures in standard
setting. Paper presented at the annual meeting of the National Council on
Measurement in Education, Montreal, Canada.
Yue, Xiaohui (2011). Detecting Rater Centrality Effect: Using Simulation Methods
and Rasch Measurement Analysis. Unpublished PhD Dissertation, Department of
Educational Research and Evaluation, Virginia Polytechnic Institute, Blacksburg,
Virginia.
Appendix 1. Common European Framework of Reference - Global Scale
Level Performance Level Descriptors
C2
Can understand with ease virtually everything heard or read. Can summarise
information from different spoken and written sources, reconstructing
arguments and accounts in a coherent presentation. Can express him/herself
spontaneously, very fluently and precisely, differentiating finer shades of
meaning even in more complex situations.
C1
Can understand a wide range of demanding, longer texts, and recognise implicit
meaning. Can express him/herself fluently and spontaneously without much
obvious searching for expressions. Can use language flexibly and effectively for
social, academic and professional purposes. Can produce clear, well-structured,
detailed text on complex subjects, showing controlled use of organisational
patterns, connectors and cohesive devices.
B2
Can understand the main ideas of complex text on both concrete and abstract
topics, including technical discussions in his/her field of specialisation. Can
interact with a degree of fluency and spontaneity that makes regular interaction
with native speakers quite possible without strain for either party. Can produce
clear, detailed text on a wide range of subjects and explain a viewpoint on a
topical issue giving the advantages and disadvantages of various
options.
B1
Can understand the main points of clear standard input on familiar matters
regularly encountered in work, school, leisure, etc. Can deal with most
situations likely to arise whilst travelling in an area where the language is
spoken. Can produce simple connected text on topics which are familiar or of
personal interest. Can describe experiences and events, dreams, hopes and
ambitions and briefly give reasons and explanations for opinions and plans.
A2
Can understand sentences and frequently used expressions related to areas of
most immediate relevance (e.g. very basic personal and family information,
shopping, local geography, employment). Can communicate in simple and
routine tasks requiring a simple and direct exchange of information on familiar
and routine matters. Can describe in simple terms aspects of his/her background,
immediate environment and matters in areas of immediate need.
A1
Can understand and use familiar everyday expressions and very basic phrases
aimed at the satisfaction of needs of a concrete type. Can introduce him/herself
and others and can ask and answer questions about personal details such as
where he/she lives, people he/she knows and things he/she has. Can interact in a
simple way provided the other person talks slowly and clearly and is prepared to
help.
Source: CoE, 2001, p. 24.
Appendix 2. Informed Consent Form
Informed Consent Form for ELC Standard Setting Pilot Studies (July 2010)
The ELC assessment committee is doing research on the standard setting process used to link
tests to the Common European Framework of Reference (CEFR). The results of this process
are important, because they determine what test scores count as ‘proof’ that a student has
reached a certain ability level. However, the process itself is very subjective, and there is no
way to prove that a given score means that a student has “really” reached a given ability level.
The purpose of this study is thus to help us better understand the factors that influence the
decision-making process, as part of the longer-term goal of improving the process.
You are being invited to take part because of your background in the TEFL field. Your
participation is entirely voluntary, and your choice will have no bearing on your job or on any
work-related evaluations or reports. If you accept, you will be asked to complete a short
preparatory assignment and to participate in two one-day workshops. At these meetings, you
will receive more training and then you will be asked to make a series of judgments
concerning the difficulty level of items from MCU’s English exams in relation to the CEFR
descriptors.
Audio and video recordings will be made of the group discussion and the interviews, and the
recordings will be transcribed. The recordings will be treated as confidential and no real
names will be used in the transcripts. No one outside of the Assessment Committee will have
access to the transcripts.
We expect that your participation will help us to better understand how judges make standard
setting decisions. We will share any findings with you. The findings may also be shared with
other researchers in the field through presentations and/or publications.
All participants will receive an honorarium.
If you have any questions about any aspect of the study, please do ask.
Joseph Lavallee, Principal Investigator
ELC Assessment Committee, Ming Chuan University
I have read the foregoing information. I have had the opportunity to ask questions about it, and any questions I have asked have been answered to my satisfaction. I consent voluntarily to participate in this study.
Name (please print clearly): ________________________
Signature: ________________________
Date (day/month/year): ________________________
Appendix 3. Security Form
Standard Setting Security Agreement Form
I, _______________________ (print panelist name here), understand and accept the following terms and conditions.
1. Panelists will follow all test security procedures set forth in writing or verbally by English Language Center representatives.
2. Panelists will turn over to English Language Center representatives all products of the standard setting meeting at the close of the session or as directed by said representatives.
3. Panelists relinquish any claim or right to any and all products turned over to the English Language Center.
4. All of the materials used at the meeting are considered secure, and panelists are expected to turn in all such materials at the close of the session or as directed by English Language Center representatives and to maintain complete confidentiality regarding these materials.
________________________________________    ___________
Panelist signature                           Date
Appendix 4. Angoff Panelist Record Form
PANELIST NAME: ________________    Listening - ROUND 1
Circle or insert the probability that a just-B1 level student would get the item correct. Then write your probability for each item in the "Prob." row at the bottom of the table.
ITEM NO.    1    2    3    4    5    6    7    8
0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2
0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3
0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4
0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6
0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7
0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8
0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9
Prob.
PANELIST NAME: ________________ Reading - ROUND 1
Circle or insert the probability that a just-B1 level student would get the item correct. Then write your probability for each item in the "Prob." row at the bottom of the table.
ITEM NO.    1    2    3    4    5    6    7    8
0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2
0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3
0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4
0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6
0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7
0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8
0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9
Prob.
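For reference only (this note is not part of the operational form), the short Python sketch below illustrates how the item probabilities recorded on the forms above are conventionally combined into an Angoff cutscore: each panelist's eight item probabilities are summed to give that panelist's recommended cutscore, and the panel's recommended cutscore is the mean of those sums. The panelist labels and probability values shown are hypothetical.

# Illustrative sketch only; panelist labels and values are hypothetical.
ratings = {
    # Each panelist records the probability that a just-B1 student answers
    # each of the 8 items correctly (the values circled on the form above).
    "Panelist A": [0.6, 0.4, 0.7, 0.5, 0.3, 0.8, 0.6, 0.5],
    "Panelist B": [0.5, 0.5, 0.6, 0.4, 0.4, 0.7, 0.5, 0.6],
}

# A panelist's recommended cutscore is the sum of his or her item probabilities,
# i.e., the expected number of items a just-B1 student would answer correctly.
panelist_cutscores = {name: sum(probs) for name, probs in ratings.items()}

# The panel's recommended cutscore is the mean of the panelists' sums.
panel_cutscore = sum(panelist_cutscores.values()) / len(panelist_cutscores)

print({name: round(score, 2) for name, score in panelist_cutscores.items()})
print(round(panel_cutscore, 2))  # approximately 4.3 for the values above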
Appendix 5. Panelist Information Form
Panelist Information Form
Please answer the following questions.
Name (in native language)
_____________________________________________________________________
Current Job Title/Employer
_____________________________________________________________________
Highest relevant degree attained (please list the name of the degree and the granting institution): ____________________________________________________________________
_____________________________________________________________________
Years of teaching experience with university-level students in Taiwan:
_____________________________________________________________________
Years of experience with test design/development:
_____________________________________________________________________
Have you ever lived in an English-speaking country? Where? How long?
_____________________________________________________________________
Please circle the answer that best shows your opinion
I am familiar with the test design and construction process.
Disagree 1 2 3 4 Agree
I am familiar with the Common European Framework of Reference (CEFR).
Disagree 1 2 3 4 Agree
I am familiar with Rasch or IRT modeling.
Disagree 1 2 3 4 Agree
I am familiar with the standard setting process.
Disagree 1 2 3 4 Agree
Appendix 6. Evaluation 1
PART I. Procedures    Panelist Number _____
Please circle the answer that best shows your opinion
I have completed the introduction to the procedures of a standard setting.
Disagree 1 2 3 4 Agree
The group leader answered all of my questions.
Disagree 1 2 3 4 Agree
I understand the instructions so far.
Disagree 1 2 3 4 Agree
I understand why a standard setting is important.
Disagree 1 2 3 4 Agree
I have made decisions about student standards before.
Disagree 1 2 3 4 Agree
I feel qualified to make this kind of judgment.
Disagree 1 2 3 4 Agree
I feel I have the experience to make this kind of judgment.
Disagree 1 2 3 4 Agree
Standard setting is common sense.
Disagree 1 2 3 4 Agree
Appendix 7. Evaluation 2
PART II. Common European Framework    Panelist Number _____
I have completed the introduction to the CEFR.
Disagree 1 2 3 4 Agree
The group leader answered all of my questions.
Disagree 1 2 3 4 Agree
I understand the instructions so far.
Disagree 1 2 3 4 Agree
I understand the CEFR.
Disagree 1 2 3 4 Agree
I understand the differences between the levels of the CEFR.
Disagree 1 2 3 4 Agree
I understand the B1 level.
Disagree 1 2 3 4 Agree
The CEFR is a useful way to think about teaching English.
Disagree 1 2 3 4 Agree
I agree with the order of the CEFR levels.
Disagree 1 2 3 4 Agree
The CEFR is common sense.
Disagree 1 2 3 4 Agree
Appendix 8. Evaluation 3
PART III. Ming Chuan University Practical English Test    Panelist Number ______
I have completed the introduction to the Practical English Test.
Disagree 1 2 3 4 Agree
The group leader answered all of my questions.
Disagree 1 2 3 4 Agree
The practice test helped me understand more about the test.
Disagree 1 2 3 4 Agree
I understand the instructions so far.
Disagree 1 2 3 4 Agree
I understand item difficulty.
Disagree 1 2 3 4 Agree
I understand item proficiency scales.
Disagree 1 2 3 4 Agree
I understand scale scores.
Disagree 1 2 3 4 Agree
I understand why some items are more difficult than others.
Disagree 1 2 3 4 Agree
Appendix 9. Evaluation 4
Review of Standard Setting Procedures    Panelist Number ______
Please circle the answer that best shows your opinion
I have completed Angoff Standard Setting training.
Disagree 1 2 3 4 Agree
I understand Angoff Standard Setting.
Disagree 1 2 3 4 Agree
I understand the procedures of an Angoff Standard Setting.
Disagree 1 2 3 4 Agree
I understand the Common European Framework.
Disagree 1 2 3 4 Agree
I understand item difficulty.
Disagree 1 2 3 4 Agree
I feel qualified to perform an Angoff Standard Setting.
Disagree 1 2 3 4 Agree
I understand the B1 level of the CEFR.
Disagree 1 2 3 4 Agree
I feel ready to start the Angoff Standard Setting.
Disagree 1 2 3 4 Agree
I am ready to begin the Angoff Standard Setting.
____________________________________ (signature) _____________ (date)
Appendix 10. Evaluation 5
Angoff Standard Setting. Final Evaluation    (page 1 of 2)    Panelist Number ______
Please circle the answer that best shows your opinion
The training and practice exercises helped me understand how to perform the task.
Disagree 1 2 3 4 Agree
The time provided for discussions was adequate.
Disagree 1 2 3 4 Agree
There was an equal opportunity for everyone in my group to contribute his/her ideas and opinions.
Disagree 1 2 3 4 Agree
I was able to follow the instructions and complete the evaluation accurately.
Disagree 1 2 3 4 Agree
The discussions after the first round of ratings were helpful to me.
Disagree 1 2 3 4 Agree
The discussions after the second round of ratings were helpful to me.
Disagree 1 2 3 4 Agree
I am confident about the defensibility and appropriateness of the final recommended cut scores.
Disagree 1 2 3 4 Agree
The facilities and food service helped create a productive and efficient working environment.
Disagree 1 2 3 4 Agree
The information showing the distribution of student scores was helpful to me.
Disagree 1 2 3 4 Agree
TURN PAPER OVER
Angoff Standard Setting. Final Evaluation    (page 2 of 2)    Panelist Number ______
I found the discussion between rounds to be useful.
Disagree 1 2 3 4 Agree
I changed my scores between rounds.
Disagree 1 2 3 4 Agree
The discussion between rounds influenced me to change my score.
Disagree 1 2 3 4 Agree
The information about student performance influenced me to change my score.
Disagree 1 2 3 4 Agree
I am confident in my final decisions.
Disagree 1 2 3 4 Agree
I believe my final decisions are the best decisions I can make.
Disagree 1 2 3 4 Agree