Centrality in a Modified-Angoff Standard Setting
PhD Dissertation - Proposal
by
Michael Scott Sommers (張夏石)
Candidate for the Degree
Doctor of Philosophy
Department of Educational Psychology & Counseling
National Taiwan Normal University
Dissertation Advisor: 林世華 (Sieh-Hwa Lin)
Table of Contents
Chapter 1 Introduction
1.1 Motivation & Purpose
1.2 Research Questions
1.3 Significance & Contribution
1.4 Definitions & Terms
Chapter 2 Literature Review
2.1 An Overview of the Standard Setting
2.2 The Angoff Standard Setting Method
2.3 Centrality
2.4 Centrality and the Angoff Standard Setting
Chapter 3 Methods
3.1 Data & Materials
3.2 The Operational Angoff Standard Setting
3.3 Procedures
3.4 Analysis and Expected Results
References
Appendixes
Appendix 1. Common European Framework of Reference - Global Scale
Appendix 2. Informed Consent Form
Appendix 3. Security Form
Appendix 4. Angoff Panelist Record Form
Appendix 5. Panelist Information Form
Appendix 6. Evaluation 1
Appendix 7. Evaluation 2
Appendix 8. Evaluation 3
Appendix 9. Evaluation 4
Appendix 10. Evaluation 5
Tables
Table 1. Contents of the English Proficiency Test (EPT)
Table 2. Angoff Panelists
Table 3. Contents of the Test Form Used in the Standard Setting
Table 4. B1 Listening and Reading Cutscores, Rounds 1 to 3
Table 5. Chronological Order of Panelist Evaluations
Table 6. Schedule for Completion of Dissertation
Figures
Figure 1. Characteristics of Kurtosis
Figure 2. Direction of Centrality Change
CHAPTER 1 Introduction
1.1 Motivation & Purpose
Beginning in the 1970s, the need for criterion-referenced testing to address problems with the norm-referencing of tests became increasingly obvious to educators and education policy makers (Glaser, 1963). With the pressure created by this demand came the realization that methods for producing criterion-referenced tests lagged far behind the idea. The decades that followed, the 1990s and 2000s, saw an explosion in understanding, and standard setting became the established method for determining the cutscores needed for criterion-referenced tests.
This research is about standard setting (Cizek, 2001; Cizek & Bunch, 2007).
In particular, this is about a family of standard setting procedures referred to as the
Angoff Method or the modified-Angoff Method. These methods are among the oldest
and most widely used to determine cutscores. Despite this, there are many problems
associated with their use and they remain among the most controversial. There are
obvious signs that the expert panel members used to make decisions during the
procedure have difficulty performing some of the tasks required of them. The Angoff
family of procedures requires that panelists use a scale to match their judgments about
item difficulty with judgments about populations of test takers. There is considerable
evidence that even the most highly trained experts have difficulty doing this in the
required fashion. This research is an examination of some of the aspects of why
panelists have this problem.
One of the most significant functions of ratings is as scales so that objects,
people, and performances can be compared and decisions made from these
comparisons. Ratings such as these in the Angoff standard setting are ubiquitous in
our modern society. Yet despite the widespread use of such ratings, raters cannot always make judgments in the fashion expected of them. Judges sometimes use some values of a scale more than expected. When this happens with the middle values of the scale, as opposed to the extreme values, it is called centrality (Wolfe, 2004; Saal, Downey & Lahey, 1980).
In the context of the Angoff standard setting, it has long been speculated that centrality plays a major role. Previous research has shown the difficulty that even the most well-trained experts have in estimating the difficulty measures of test items (Lorge & Kruglov, 1953; Shepard, 1994). This has been identified as a particular issue for the most and the least difficult items, a condition that would lead to centrality. Study after study, both of the ratings of panelists in high stakes Angoff panels and of the kinds of skills used in the Angoff method, has found that even the most highly trained experts have trouble making estimates about items that are extremely easy or extremely difficult. In a review of the literature pertinent to the Angoff standard setting procedure, Lavallee (2012, p. 14) concluded,
“…results consistent with a centrality effect have been found every time they have
been looked for” (italics in the original).
Centrality is one of the least well-understood of a class of phenomena often called 'rater effects' (Saal, Downey & Lahey, 1980); that is, aspects of the rating that are separate from the person, object, or performance being rated and intrinsic to the rater (Saal, Downey & Lahey, 1980; Yue, 2011). The result of this has been a focus on the centrality of raters as a measure of their performance. For example, it is often assumed that centrality is an error, and a demonstration of its existence in the ratings of a given rater is a sign that the ratings from that rater are problematic (Saal,
Downey & Lahey, 1980; Yue, 2011). Despite this, most of what has been written
about centrality tells us more about the kind of items that produce the centrality than
about the raters themselves or their ability to produce ratings, and this incongruence
has led to a gap in understanding of what centrality really is and how its presence
should be assessed. One of the most contentious problems with the issue of centrality
has been its measurement. There exists no consensus about the measurement of
centrality. While there are standard classical measures identified in the research
literature, many aspects of these measurements have not been clarified. For example, their relationship to each other is not at all clear, nor are there useful guidelines available for their application and interpretation. So, as Lavallee indicates, much of the research is "consistent" with the idea of centrality as a factor in the Angoff standard setting; however, it is not really clear what this means.
1.2 Research Questions
This study uses modern and classical measures of centrality to examine changes in centrality across the three rounds of a modified-Angoff standard setting. The purpose is to gain insight into the role of centrality as a factor in the modified-Angoff standard setting and into the usefulness of the concept in understanding the cutscore decisions of modified-Angoff panelists. In particular, these measures will be used to address the following research questions.
Question 1: Does centrality exist in the modified-Angoff standard setting?
Question 2: How does centrality change across the rounds of the modified-Angoff
procedure?
Question 3: Does centrality explain differences in panelist ratings between extreme (difficult and easy) items and items of median difficulty?
1.3 Significance & Contribution
A first contribution of this study is a greater understanding of rater effects and
the role of centrality. One of the main identified problems of the Angoff family of standard setting procedures involves difficulty in rating extremely difficult and extremely easy items and the way this affects cutscore decisions. Certainly this is widely speculated in the research literature, or at least the research literature consistently speculates in a manner indicating a major role for centrality. Is this true? Is it due to centrality that arises in raters, or in items?
This first contribution points to a wider understanding of raters, items, and the role of centrality that this research hopes to develop. The use of ratings and rating scales is ubiquitous in our modern society. They appear in our jobs, in our homes, in our leisure time, and in practically any other imaginable place. Greater understanding of such a central aspect of our lives is important. Centrality is a very poorly understood aspect of the rating process. Despite this, there are applications for the idea, such as in
the Angoff standard setting, where it plays a potentially major role. A stronger
understanding of the concept of centrality would contribute to a clearer understanding
of the general process of rating objects, people and performances.
A further contribution offered by this research is a clarification of the
measurement of centrality. While a number of classical and latent-trait methods have
been developed, their lack of widespread use has led to a poor understanding of these measurements. This research hopes to contribute a better developed sense of what
these measurements mean, as well as how and when they should be used.
A final, but central, contribution of this research is, of course, a greater
understanding of standard setting procedures, and particularly the modified-Angoff. The ways in which panelist ratings change in a modified-Angoff standard setting procedure have been studied extensively. And while a full understanding of these changes is still not complete, there is a growing picture of the way in which they operate during a modified-Angoff standard setting. Some aspects of the procedure, however, remain quite unclear. The role of centrality in final cutscore decisions is among these. What is the impact of centrality on standard setting decisions? What is the relationship between centrality and the other factors that we know about in the standard setting?
1.4 Definitions & Terms
Angoff Method Standard Setting -- In this study, the term Angoff Method refers
to the general category of standard setting methods derived from the original
suggestion of William Angoff. The term modified-Angoff Method refers to the group
of recently adopted versions of this that use percentage or proportion estimates instead
of Angoff's original yes/no judgment. While the term Angoff has come to refer to a wide range of different standard setting methods, the key element of these is that panelists are presented with items one at a time and asked to estimate, and vote on, an examinee's ability to answer the item correctly according to various criteria.
Centrality -- Centrality is present when a rater clusters his or her ratings around
the central point of a rating scale or around the center of a perceived range of
performances. This results in a compressed distribution of ratings and reduced
variation in scores. Centrality has been discussed in relationship to a large number of
different measurement concepts. Wolfe (2004) distinguishes between centrality and
restriction of range, stating that restriction of range is centrality combined with
leniency/harshness. The significance of this is that restriction of range also shows
skew. With respect to this study, there are two different kinds of centrality. Item
centrality exists when ratings are clustered for the same items across different raters.
Rater centrality exists when ratings are clustered for the same rater across different
items.
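To make the distinction concrete, the following minimal Python sketch contrasts the two kinds of centrality using an invented ratings matrix (not data from the operational standard setting): rater centrality appears as low spread within a row of the matrix, item centrality as low spread within a column.

    import numpy as np

    # Hypothetical Angoff ratings (percent-correct estimates): rows = panelists, columns = items.
    ratings = np.array([
        [55, 60, 58, 62, 57],   # a panelist whose ratings cluster near the middle of the scale
        [20, 45, 70, 90, 35],   # a panelist who uses the full range of the scale
        [50, 55, 60, 52, 58],
    ], dtype=float)

    # Rater centrality: low spread of one rater's ratings across different items (row-wise).
    rater_spread = ratings.std(axis=1, ddof=1)

    # Item centrality: low spread of different raters' ratings of the same item (column-wise).
    item_spread = ratings.std(axis=0, ddof=1)

    print("Spread of each rater's ratings across items:", rater_spread.round(2))
    print("Spread of raters' ratings for each item:", item_spread.round(2))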
Differences in panelist ratings – Panelist ratings are the estimates assigned by
panelists to individual items during the modified-Angoff method. The modified-
Angoff requires that panelists assign percentage estimates to items. Differences
between these estimates are the differences between panelist ratings.
CHAPTER 2 Literature Review
2.1 An Overview of the Standard Setting
Standard setting is a procedure used in the development of criterion-referenced
testing. This section will review some important aspects of the development of the
procedure and how the raters in the procedure influence the outcome of the standard
setting.
Standard setting refers to the family of procedures used to establish cutscores on
a scaled examination. Cutscores separate scaled scores into categories of performance
defined in a performance standard (Cizek, 1996; Cizek, 2001; Cizek & Bunch, 2007;
Cizek, Bunch & Koons, 2004). Standard setting is used in criterion-referenced
examinations to match standardized test scores with the verbal criteria defined in
performance level descriptors (PLDs) of the performance standards. Panels of judges
use different methods to compare PLDs with different types of information about
items or examinees. The term "standard setting" is used to refer to the different
procedures and materials used to make the cutscore decision. Since the first
suggestion of this idea in the 1950s (Nedelsky, 1954), dozens of different procedures
have been developed. In one recent survey (Kaftandjieva, 2010), more than 60
different methods were identified, 15 of which have appeared since the year 2000.
Standard setting grew out of the expanded role of criteria in testing.
Examinations can be defined as norm-referenced or criterion-referenced (Glaser,
1963; Shepard, 1980). Norm-referenced tests produce results that allow for
comparison between individuals and dominated high stakes testing for much of the
last century. Such tests are limited by an inability to indicate what the score means for
examinee ability. Criterion-referenced tests produce results that can assign a defined
meaning to a score. These abilities are typically defined in descriptions ranking them from least to most capable. Such descriptions are referred to as a 'performance standard' and the descriptions that define individual categories of performance as 'performance level descriptors' or PLDs. The standard setting allows for these ranked
descriptions - the PLDs - to be placed along scaled test scores providing latent trait
scores that correspond with the different categories of ability defined in the standard.
Cizek and Bunch (2007, p. 13) have stated that,
Standard setting refers to the process of establishing one or more cutscores on a
test...Cutscores function to separate a test score scale into two or more regions,
creating categories of performance or classifications of examinees.
A large number of different standard setting procedures have been developed
(Kaftandjieva, 2010; Cizek & Bunch, 2007) that allow for the assigning of cutscores
from a scaled test on to categories defined in performance level descriptors. While
these procedures vary enormously in the details, they all share one property. These
procedures present panels of experts with performance standards and different types
of information about items or examinees. Panelists are then taken through detailed
training and procedures producing a score on the test that separates the various
categories of the performance standards. The actual procedure used can vary
considerably and different procedures may use a wide range of different types of
information. A typical convention in contemporary standard setting procedures is to permit a significant amount of input from other sources to inform panelists about the impact of their cutscore decisions. For example, one common way to handle this is for panel organizers to allow discussion between panelists about their decisions and then to tell them what percentage of an examinee population would fall above and below their cutscore decisions.
As a result of this wide range of methods and procedures, different panels do not
always agree on the cutscore decision, even for the same test items and with the
provision of the same feedback information about pass/fail rates. It has long been
known that different methods produce estimates of cutscore decisions that are
systematically different (Buckendahl et al., 2002; Green et al., 2003; Hambleton & Pitoniak, 2006; Reckase, 2006; Yin & Schultz, 2005). Even small changes in standard setting procedures can result in changes in panelist decisions (Cross et al., 1984; Hertz & Chinn, 2002; Jaeger, 1982). Panelists, or even the same panelist, may not make the
same judgments under very similar conditions (George, Haque & Oyebode, 2006).
Very little has been written about the validity of the various standard setting
procedures. Much of what has been written about standard setting validity describes it
as an arbitrary procedure to set standards. A general theory of validity, sometimes
referred to as the argument-based concept of construct validity, has emerged as the
dominant model in psychological testing (Cronbach, 1988; Cronbach & Meehl, 1955;
Loevinger, 1957; Messick, 1981, 1989, 1998). Despite its dominance elsewhere, it has
not been incorporated into discussions of a general theory of validity for standard
setting. Examinations of validity in standard setting methods appear to be based on a
series of ad hoc principles (Kane, 1992, 2001) and derived from the approach that
accepts "just because a standard setting is arbitrary does not mean it is not useful"
(Hambleton, 1980, p. 102). In later work, Hambleton (2001; see also Schafer, 2005) builds on this suggestion, arguing that further information is necessary to determine the 'usefulness' of the standard setting, including (1) a definitional perspective, (2) a psychometric perspective, (3) a legal perspective, and (4) an institutional perspective.
The definitions for standard setting validity suggested by Cizek, Kane, and all other
modern standard setting theorists reject the conceptualization of standard setting as a
psychometric technique capable of discovering a knowable or estimable parameter
(Cizek & Bunch, 2007, p. 18).
Following in this tradition, Cizek and Bunch (2007) suggest that panel organizers
should report a number of statistical tests to support their argument for validity. In
contrast, Dixie McGinty (2005) points out that such statistics are really more
accurately thought of as indicators of reliability, and while useful in demonstrating
validity, are not themselves measures of validity. As a result of this confusion, in
comparison with other psychological assessment procedures, a scientific justification
for the validity of a particular procedural decision, such as choice of a method, or
variation in a procedure, is very rarely given and such justifications are typically
procedural in nature.
It is widely stated that standard setting procedures are dominated by two methods
that are historically linked - the modified-Angoff method and the Bookmark method.
The modified-Angoff method is derived from an original method named after William Angoff who, ironically, only briefly mentioned the method in a note and attributed the idea named after him to his colleague Ledyard Tucker (Cizek & Bunch, 2007). The
main principle of the method is that items are examined one-at-a-time and judged in
various ways for their suitability to make decisions about examinee classification.
Since the Angoff procedure is the main focus of this study, much more will be said
about it in the following sections; however, the Angoff is widely cited as being "the
most commonly used method for setting performance standards in contemporary use
in licensure and certification context" (Cizek & Bunch, 2007, p. 82). Regardless of the
literal accuracy of this statement, it is unquestionably widely used to make cutscore
decisions for high stakes tests.
The other widely used standard setting method is the Bookmark. The Bookmark
emerged from procedural difficulties with the Angoff method. It was first suggested
by Mitzel et al. (2001), although Cizek and Bunch (2007) trace its roots back to
procedures extended from the Angoff method and used in the 1990s by researchers at
American College Testing (ACT) for the National Assessment of Educational
Progress (NAEP). In the Bookmark, items are placed in a booklet, referred to as the
Ordered Item Booklet (OIB), where they are ordered by their difficulty measures.
Panelists then place a mark on the items that separate the various categories of the
performance standards. Engelhard (2007) speculates that, because of its widespread
use in assessments related to the American educational policy No Child Left Behind
(NCLB), the Bookmark is in fact now the most widely used standard setting method.
Standard setting is now a routine aspect of test development. Huge numbers of panels are
performed regularly as part of the development of state and private tests. Standard
setting panels are conducted as part of the vast network of accountability tests in
American schools known as No Child Left Behind (NCLB) (Linn, Baker, &
Betebenner, 2002; Linn, 2003), as well as in other public education accountability
projects throughout the world. Standard setting also plays a role in the development of
the examinations that establish standards for a wide variety of occupations and
professions (Nelson, 1994). In addition, panels similar to those in the standard setting
are increasingly used for other purposes. For example, Roach, McGrath, Wixson and
Talapatra (2010) describe a procedure similar to a standard setting panel to 'align' two
or more different types of assessments whose content is not directly comparable. The
results of their study resemble what could be produced from a mathematical equating
of different assessment procedures. Their use of the panel comparison instead of an
equating stems from the limited use of the assessments and hence limited numbers of
observations available to perform an equating.
Little is known about the growth in use of criterion-referenced testing or standard
setting. While numerous reviews of the different methods have been done (Norcini &
Shea, 1997; Shepard, 1980), including special issues of journals (see the 1994 special
edition of Educational Measurement: Issues & Practices, volume 13), and the
complexity and detail of methods has expanded enormously (Huynh & Schneider,
2005; Lissitz & Wei, 2008; Lissitz & Huynh, 2003), little is understood about the policy decisions that surround the procedure, such as what contributes to decisions to use various standard setting methods or the increasing move away from standard setting toward correlational methods of cutscore decision making, for example, value-added assessment (Sanders, 1998; Sanders & Horn, 1995; Sanders & Horn, 1998).
Although it is clear that political reasons play a significant part of what goes into the
decision-making around testing and assessment policy (Geisinger & McCormick,
2010; Giraud & Impara, 2005; Hein & Skaggs, 2009; McGinty, 2005; Nelson, 1994),
most of the discussion about this has been speculation and no research exists on what
factors are important in these decisions or what mechanisms are involved.
2.2 The Angoff Standard Setting Method
As mentioned above, the Angoff method is named after William Angoff (Angoff,
1971) who, ironically, only briefly mentioned the method in a note and attributed the
source of the method named after him to his colleague Ledyard Tucker (Cizek &
Bunch, 2007). The Angoff standard setting method is one of the oldest methods and is
reputed to be among the most widely used methods in the world for setting cutscores
(Cizek & Bunch, 2007). From a research point of view, the Angoff is particularly
useful because it produces many discrete values at points throughout the procedure
permitting the application of techniques derived from latent trait theory, such as Item
Response Theory (Embretson & Reise, 2000) and Rasch Modeling (Bond & Fox,
2001). This will be explained in more detail in the description of the method that
follows.
There are many different versions of the Angoff method in use today. For this
reason, methods that belong to the Angoff family of standard setting methods are
frequently described as a "modified-Angoff". It has been suggested there is no general
agreement on a definition of the Angoff method (Brandon, 2004; Reckase, 2000), and
Brandon (2004) lists five steps he believes characterize the modified-Angoff procedure:
1. selecting judges
2. training judges
3. defining and describing the performance level descriptors
4. estimating examinee performance at the level of each item
5. review of empirical information by judges and discussion of item estimates
This definition, while widely cited, is difficult to use. All of these points are
routine aspects of other standard setting methods and only (4) is an aspect distinctive
to the Angoff family of standard setting procedures. While estimation of examinee
performance at the item level is found in other methods, such as the Nedelsky procedure (Nedelsky, 1954), the way it is done in the Angoff offers a true distinction between the modified-Angoff and other standard setting methods.
The modified-Angoff is distinctive in its procedures for estimating the cutscore
in that,
1. Panelists are presented with items one-at-a-time.
2. Panelists are asked to estimate an examinee's ability to answer the item correctly.
3. Estimation of examinee ability to correctly answer the item is done item-by-item,
and items are not necessarily presented in any particular order.
The second point, estimation of an examinee's ability to answer the item
correctly, has been done in many different ways. Brandon (2004, p. 60 note 2)
provides a partial list of different ways.
Sometimes percentages are recorded instead of probabilities. Sometimes judges
specify the number of candidates out of 100 who could answer the problem
correctly. (e.g., Engelhard & Anderson, 1998; Impara & Plake, 1998).
Sometimes judges are given a choice of range of percentages or proportions, for
example, Cross, Impara, Frary and Jaeger (1984) and Plake and Giraud (1998)
instructed judges to select from deciles. Halpin, Sigmon and Halpin (1983)
printed the lowest acceptable probability and the highest probability. Angoff
also proposed a yes-no form of judgment...
In addition, Cizek and Bunch (2007) list several different versions of the
modified Angoff. This includes the yes/no Angoff procedure in which panelists
indicate only a yes or a no concerning their judgment of examinee ability to answer
the item correctly.
In addition to these procedures, modified-Angoff standard settings
conventionally incorporate a number of other procedures to produce a convergence of
scores across panelists. These are referred to in point (5) of Brandon's (2004) list
above. These additional procedures include,
1. Panelists have several opportunities to refine their estimations, referred to as
'rounds'. The current convention is to perform a standard setting in sometimes two,
but often three, rounds (Cizek & Bunch, 2007).
2. In between rounds, panelists have the opportunity to compare their estimations
and discuss why they made their individual decisions. This is referred to as
“discussion”.
3. In addition to discussion, panelists are presented with data that reflects the
impact of their decisions. For example, panelists may be shown the percentage of
examinees who would fall above or below their estimated cutscores. This is referred
to as 'impact data' or 'feedback'.
From a statistical point of view, these procedures produce a narrowing of the standard deviation of cutscore decisions, which appears as a growing consensus between panelists from round to round. Many of the unanswered issues about standard setting that are addressed in this paper relate to the nature of this convergence and to the factors in standard setting panels that influence it.
One final procedural issue is the method used to calculate the final recommended
cutscore decision. Cizek and Bunch (2007) recommend that the final cutscore for the standard Angoff be calculated by averaging the rater or the item means, typically from the final round of panelist votes. This is the conventional method used to handle the
recommendation, although some debate exists about this. Issues related to this are
beyond the scope of this study and will not be covered in this paper.
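As an illustration of the conventional calculation described above, the sketch below uses an invented matrix of final-round ratings rather than the operational data. Each panelist's recommended raw cutscore is the sum of his or her item estimates, and the panel recommendation is the mean of these; summing the item means across panelists gives the same value, which is why averaging "the rater or the item means" leads to the same recommended cutscore.

    import numpy as np

    # Hypothetical final-round ratings (probability that a borderline examinee answers
    # the item correctly): rows = panelists, columns = items. Invented for illustration.
    ratings = np.array([
        [0.60, 0.45, 0.80, 0.30],
        [0.55, 0.50, 0.70, 0.35],
        [0.65, 0.40, 0.85, 0.20],
    ])

    # Each panelist's recommended raw cutscore is the sum of his or her item estimates.
    rater_cutscores = ratings.sum(axis=1)

    # Averaging over raters and summing the item means yield the same panel recommendation.
    cutscore_from_raters = rater_cutscores.mean()
    cutscore_from_items = ratings.mean(axis=0).sum()

    print(rater_cutscores)
    print(cutscore_from_raters, cutscore_from_items)   # identical values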
A wide range of external influences have been investigated for their potential
effect on the outcome of a standard setting. These can be thought of as social
influences, situational influences, and rater effects. Research on social and situational influences has not produced strong evidence of an impact, although relatively little research has been conducted on these aspects of influence. Rater effects have been the most completely investigated, and a large and
robust research literature has documented their significant influence on the outcome
of the Angoff and modified-Angoff standard setting procedures.
All standard settings, no matter which method is used, demonstrate a convergence of cutscore decisions across rounds when conducted with multiple rounds, between-round discussion among panelists, and the provision of impact data. So
characteristic is this result that Cizek (2001a, p. 10) refers to this as a “common
feature of standard settings”. This convergence is not unanticipated. Experts given the
opportunity to discuss data relevant to their expertise will develop elaborate
explanations for the data based on information drawn from their shared background
(Chi, Glaser & Farr, 1988; Johnson, 1988; Larkin, McDermott, Simon & Simon,
1980). It is thus reasonable to interpret the convergence of cutscore decisions as a
growing expert consensus about the contents of the standard setting and its panels.
This, however, is not the commonly understood meaning of the convergence of standard setting decisions. The generally accepted interpretation is that, rather than reflecting a decision-making process, the convergence is the result of influences that have distracted panelists away from an optimal decision (Fitzpatrick, 1989; Hurtz & Auerbach, 2003; Hertz & Chinn, 2002; Wessen, 2010).
The exact nature of these effects is not well understood and much speculation has been generated about their origins. Many types of effects have been suggested as a
potential issue in the decisions of panelists. For example, it is widely suggested in the
standard setting literature that social influences during the discussion drive cutscores
toward agreement (Fitzpatrick, 1989; Hurtz & Auerbach, 2003; Hertz & Chinn, 2002;
Wessen, 2010). Researchers have suggested that a number of social influences, as well as
the effects of dominant individuals or group conformity, may be driving panelists to
report cutscore decisions that are more and more similar to each other. The exact
mechanism of this has not been well-established. Despite widespread speculation
about the role of these social influences (Fitzpatrick, 1989) and some empirical
examinations (Hurtz & Auerbach, 2003; Hertz & Chinn, 2002; Wessen, 2010), there is
no well-constructed theoretical or empirical demonstration of how social influences
could operate in a standard setting.
The effects of situational influences have been more completely investigated, but
the evidence that they play a major role in the outcome of standard setting panels is
not compelling. Wiley and Guille (2002) provide a review of the idea that an
"occasion facet" could affect the outcome of a panel. The aim of their study was to
examine the effect of an Angoff ratings obtained by panelists at home rather than
through the conventional tables of panelists seated together. While the authors
reported some interaction with other factors of the standard setting, they were unable
to find evidence that the occasion facet plays a significant role in the outcome of the
standard setting panel. The only meaningful finding reported in the study was that the
variance of scores in the 'at-home' panels was greater than that of traditional panels.
These findings add to the consistent results of other studies that could not find an
effect related to an occasion facet.
So while there are a wide range of suggested effects operating in the standard
setting, there is little understanding of their nature. In fact, it is not clearly understood
how such effects operate or even if they exist in a fashion that would affect the
outcome of the standard setting. Effects operating in the standard setting have not been clearly documented and categorized; as a result, there continues to be confusion about how cutscores are influenced.
2.3 Centrality
Variation in judgment and complete use of a rating scale is widely interpreted as
a positive indicator of rater ability. Failure to use the range of options that a scale
provides is interpreted as a flaw in the rater's ability. While a full discussion of this is
beyond the scope of this research, this distinction is one that appears regularly in the
research literature dealing with centrality and its related issues.
A variety of names have become attached to patterns in the failure to use the complete range of options available on the rating scale. Centrality is the term used to describe a lack of variation in ratings given by raters that center around the middle of the rating scale. The term is used synonymously with central tendency. Most contemporary research on centrality traces discussion of the concept back to the 1980 paper of Saal, Downey, and Lahey. In this paper, the authors reviewed a large number of related definitions used by researchers of the time, as well as classical methods of measuring centrality. Koran (1971, pp. 180-181), for example, defined
central tendency as, " the tendency to rate all rating objects around the 'middle' or
mean of the rating continuum and not around the extremes." In contrast, DeCotiis
(1977) uses the almost metaphorical understanding of central tendency as a rater's
unwillingness to go out on the proverbial limb in either the favorable or unfavorable
direction. The definition has been restated more recently by researchers using latent
trait theory. Reproducing the point that centrality reflects rating error, Farrokhi,
Esfandiari and Daili (2011, p. 71) define centrality in terms of a lack of variability in
rater scores, citing a definition from Wolfe and McVay (2010) of central tendency as,
"systematic error that causes the assigned ratings to be more tightly clustered than
ratings that do not contain error."
Using the latent trait concept of measurement, Wolfe (2004, pp. 39-40) contrasts
centrality with a number of related terms from the literature,
...centrality...results in a concentration of assigned ratings in the middle of the
rating scale...restricted range exists when centrality is combined with leniency or
harshness. That is, the restriction of range results in a restricted range around a non-central location on the rating scale. The converse of rater centrality occurs
when raters tend to overuse the extreme rating scale categories - a rater effect
called extremism.
In a doctoral dissertation supervised by Wolfe (Yue, 2011, p. 10), a similar point is
made about the distinction between centrality and restricted range.
(3) central tendency (i.e., a rater overuses the middle category of a rating scale
while avoiding the extreme categories), and (4) restriction of range (i.e., a rater
overuses any point on a rating continuum).
Few of these definitions have had a major impact on the methods used by researchers
to examine centrality or central tendency. Classical methods of analysis have varied
enormously. Saal, Downey, and Lahey (1980) described four techniques used in the
research to look for central tendency.
1) standard deviation
2) distance from the mean
3) kurtosis, and
4) rater × ratee ANOVA.
None of these methods have ever been compared for their ability to detect centrality.
It is difficult to see from this list of measures what common characteristic of central tendency is being addressed. For example, standard deviation is a measure of the dispersion of scores. Kurtosis has historically been interpreted as a measure of the peakedness of a curve and its deviation from the normal distribution (DeCarlo, 1997; Darlington, 1970; Dittmar, 2002). As such, it appears the two measures would indicate centrality in different situations. Indeed, studies of the relationship between the two measures have found only poor correlations (Murphy & Balzer, 1989). In a study of simulated data, Yue (2011) found that standard deviation was able to protect against Type I error while maintaining statistical power as well as latent trait measures did, indicating that measures of dispersion may provide an effective operational definition. She did not compare measures of kurtosis.
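To make the contrast between the classical measures concrete, the short sketch below computes the standard deviation and the excess kurtosis of two hypothetical raters' ratings on a 0-100 Angoff percentage scale; the data are invented, and no cutoff values are implied, since the literature reviewed here provides none.

    import numpy as np
    from scipy.stats import kurtosis

    # Two hypothetical raters on a 0-100 Angoff percentage scale (invented data).
    central_rater = np.array([48, 50, 52, 49, 51, 50, 53, 47], dtype=float)
    full_range_rater = np.array([10, 25, 40, 55, 70, 85, 95, 30], dtype=float)

    for label, r in [("central", central_rater), ("full range", full_range_rater)]:
        sd = r.std(ddof=1)                 # dispersion of the ratings
        kurt = kurtosis(r, fisher=True)    # excess kurtosis (0 for a normal distribution)
        print(f"{label:10s}  SD = {sd:6.2f}   excess kurtosis = {kurt:6.2f}")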
Measures of centrality developed by Wolfe and his students (Myford & Wolfe,
2003; 2004; Wolfe, 2004; Wolfe, Moulder & Myford, 2001; Yue, 2011) rely on latent
trait measures and particularly their residuals to detect centrality. Wolfe (2004, p. 40)
provides the following rationale for his measures.
Rater centrality and extremism manifest themselves in both the pattern and the spread of the residuals. When centrality occurs, the observed ratings regress toward the center of the rating scale. As a result, residuals tend to be large and positive for low expected ratings and large and negative for high expected ratings.
From this understanding, he has derived a number of different measures for the
detection of centrality. Yue (2011) used simulated data to examine these measures and
their detection of centrality. While she also included the classical measure standard
deviation, most of the measures she compared were derived from latent trait theory.
1. Mean-square fit statistics
2. Expected-residual correlations
3. Ratee measures and their residuals derived from Multi-Faceted Rasch
Measurement models (MFRM)
4. Correlation of Rasch measures and measures from raters
5. Rater slope (point biserial)
Yue reported that some of the latent trait measures, such as measure-residual
correlation and expected-residual correlations were effective at detecting centrality.
However, she also reported that standard deviation showed comparable ability to
protect against Type I error while maintaining an acceptable level of statistical power.
Latent trait measures, while not necessarily providing better detection than classical measures, do offer some significant advantages. In addition to lacking a common definition of centrality, classical measures provide no guidelines for the detection of centrality, such as values of the measures at which centrality can be assumed to exist. In contrast, Wolfe's latent trait measures provide guidelines for the detection of centrality. For example, if there is no centrality at all, the expected-measure correlation should be 0. Under conditions of perfect centrality, the expected-measure correlation should be -1.0, while under conditions of perfect extremism, the expected-measure correlation should be 1.0 (Wolfe, 2004). Most cases that fall in between can thus be separated into cases that are central or non-central based on whether their measures are negative or positive.
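A rough sketch of the logic behind these residual-based indicators is given below. It is not Wolfe's operational statistic; the "expected" ratings are invented placeholders standing in for values that would come from a fitted Rasch/MFRM model. The sketch only illustrates the sign pattern described above: a correlation near -1 is consistent with centrality, near +1 with extremism, and near 0 with neither.

    import numpy as np

    # Hypothetical model-expected ratings for one rater (in practice these would come
    # from a fitted Rasch/MFRM model) and the ratings the rater actually assigned.
    expected = np.array([1.2, 2.0, 3.1, 4.0, 4.8, 2.5, 3.6, 1.6])
    observed = np.array([2.4, 2.6, 3.0, 3.3, 3.5, 2.9, 3.2, 2.5])  # compressed toward the middle

    # Under centrality, residuals are positive where expected ratings are low and
    # negative where they are high, so this correlation is strongly negative.
    residuals = observed - expected
    print(round(float(np.corrcoef(expected, residuals)[0, 1]), 2))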
One final point to address is the cause of the 'centrality effect'. Central raters are described as being in error and needing closer examination. In Wolfe and McVay (2012),
central raters are described as "interesting raters" and identified as raters whose
ratings should be handled differently from non-central raters. This assumption about
centrality is widespread and assumed even outside education and psychology.
Research on issues related to job appraisal, for example, has identified centrality as
one of the more important problems (Berger, Harbinger & Sliwka, 2010; Bol, 2006;
Moers, 2005; Prendergast, 1999). As Bol (2006, pp. 9-10) states,
The lack of distinction between performance ratings of different employees is
expected to influence employee incentives...for below-average performers, as
compression influences their ratings in a positive way. Because of centrality bias,
their performance seems similar to that of top performers... with above-average
performers, the lack of variance in performance ratings might also negatively
affect the motivation of below-average performers as their marginal cost to
improve performance ratings may be higher than the marginal benefit they
receive from the performance rating improvement
This research has not drawn solutions to the problem from measurement theory
or educational research. Instead, solutions revolve around the identification and
comparison of ratings with objective measures of job performance and comparisons
of their various statistical properties.
Likewise, research on the judgments of experts describes a lack of variance in ratings as a sign of a lack of expertise. Based on their work observing different kinds of experts in both rating and prediction settings, Shanteau and colleagues (Shanteau, 1992, 2001; Weiss & Shanteau, n.d.-a, 2003, 2004) have developed a measure of expertise based on the variance of scores given. The Cochran-Weiss-Shanteau approach defines the CWS Index (Weiss & Shanteau, n.d.-b, p. 1) as "a numerical value that captures the degree of expertise demonstrated in a response set." Stating that "expertise at its core requires the ability to evaluate the stimuli in one's domain" (p. 2), the CWS is defined as
CWS = Discrimination / Inconsistency                        (1)

where

Discrimination = variance among averaged responses to different stimuli
Inconsistency = variance among responses to the same stimuli
In terms of evaluating judges' ratings, Inconsistency is interpreted as the item variance of ratings given by a single judge. Discrimination is interpreted as the judge variance of ratings given by many judges to the same object. The CWS Index is, in effect, a measure of centrality. The less central the estimates given by a judge, the more 'expert' he or she is seen to be.
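The sketch below illustrates the CWS ratio using the Weiss and Shanteau definitions quoted above, with invented data in which one judge rates each of five items on two occasions. A judge whose averaged responses barely separate the items, that is, a central judge, would show a much smaller index.

    import numpy as np

    # Hypothetical data: one judge rates each of 5 items on two occasions (invented values).
    ratings = np.array([
        [20, 25],
        [40, 35],
        [60, 65],
        [80, 75],
        [55, 50],
    ], dtype=float)

    # Discrimination: variance among the judge's averaged responses to different stimuli.
    discrimination = ratings.mean(axis=1).var(ddof=1)

    # Inconsistency: variance among responses to the same stimulus, averaged over stimuli.
    inconsistency = ratings.var(axis=1, ddof=1).mean()

    cws = discrimination / inconsistency
    print(f"CWS index = {cws:.1f}")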
And, once again, while centrality appears to be a key concept for some
researchers in the identification of experts and expertise, their measures draw from a
completely different place from latent trait theory or even classical measures of
centrality.
2.4 Centrality and the Angoff Standard Setting
Centrality, as such, has never been studied in the Angoff standard setting or in
the standard setting situation in general. However, much available research indicates
that centrality exists as an issue that should be taken into account when considering
the results of the modified- Angoff standard setting method.
In its modified form, the Angoff standard setting method asks panelists to rate
the difficulty of test items. The ability of expert judges to make such estimates is
crucial to the validity of the method, and as such, a large and comprehensive research
literature has been developed to address the issue. A large number of references are
typically cited questioning the ability of even the most highly trained experts to
accurately estimate the difficulty of test items (Brandon, 2004; Goodwin, 1999;
Impara & Plake, 1998; Lorge & Kruglov, 1953; Linn & Shepard, 1997; Norcini et al.,
1987; Norcini, Shea & Kanya, 1988; Shepard, 1994; Smith & Smith 1988; Taube,
1997).
Smith and Smith (1988) asked 15 Angoff judges to estimate the ability of
minimally competent students to correctly answer 64 reading items from a statewide
high school graduation test using the Angoff method. These were compared with the
estimates obtained from 16 judges using the similar Nedelsky method (Nedelsky,
1954). The overall agreement of the Angoff estimates with the empirical p-values
from the items was .60, which was higher than the estimates obtained for the
Nedelsky method (r = .37).
Goodwin (1999) used items from an administration of a licensing examination
for financial planners. She asked judges to estimate the proportion of all examinees
who could answer the question correctly, as well as the proportion of borderline
examinees who could do so. Fourteen judges estimated 140 items that had recently
been administered to 972 examinees in a high stakes examination. She found that
results for the two different estimates were very similar, with a correlation of .51 for
the predicted p-values and .55 for the predicted performance of the minimally
competent examinees.
A review of results obtained from various Angoff method procedures conducted
by Brandon (2004) concluded that typically, the values obtained by correlating the
measured p-values with the estimates obtained from Angoff panelists range from
around .40 to .70, indicating that at best, actual measurements of the p-value can
rarely account for more than half of the variance in a judged estimate. In conclusion,
he states (p. 71), results of this level, "show that the ordering of item estimates -
particularly those in operational standard setting studies - can be expected to mirror
moderately the ordering of item difficulty, thus supporting the validity of the modified
Angoff cutscores."
Of greater importance to the issue of centrality is which items are more likely to
be incorrectly estimated. Are items that are easier or more difficult also more difficult
for judges to estimate? Without accurate measures of centrality, one indication of this
would be that estimated values for difficulty would have a smaller standard deviation
than the actual measurements, indicating that estimated values for easy items and for
difficult items are not correct. This, in fact, is a commonly observed aspect of the
research. Lavellee (2012, p. 14) reviewed the literature related to this issue and
concluded, “…results consistent with a centrality effect have been found every time
they have been looked for” (italics in the original). In addition, the tendency for judges
to cluster estimates of actual values in tighter distributions than the actual values
themselves, has been the subject of comment for almost as long as there has been
systematic scientific investigation into standard setting results.
Lorge and Kruglov's original (1953) study found a standard deviation of 16.3 for the judges' estimates compared with 23.7 for the empirical difficulty values. More recently, Goodwin (1999) reported, in her study of the results of the financial planner licensing exam, that the judges' estimated p-values were "more homogeneous" than the actual results obtained from the administration of the items to candidates. The standard deviations for the estimates of the total group and for borderline examinees were .09 and .10 respectively. The actual observed values were .19 and .18. Van de Watering and van der Rijt (2006) compared the estimates of difficulty values made by teachers and students. They found high rates of inaccuracy among both these groups. Interestingly, their student group did not overestimate the difficulty of easy items, although they showed dramatic underestimation of difficult items. Teachers' estimates of easy items showed much more centrality and systematically underestimated the easiest items.
While Lavallee's (2012) remark that "…results consistent with a centrality effect have been found every time they have been looked for" is technically correct and overwhelmingly true (Clauser, Swanson & Harik, 2002; Clauser et al., 2009; Humphry, 2005), some studies have used methods other than correlation to examine estimates and actual p-values, and have found a different pattern. In a widely cited study, Impara and Plake (1998) had 26 school teachers make estimates for an administration of a highly standardized test to their students. Teachers were asked to estimate the performance of students who would get a D/F (borderline students) and of the class as a whole. Teachers' estimates of performance were reasonably accurate, with the estimates for borderline students being more accurate than those for the group as a whole. The more widely cited result (p. 77) is based on a calculation in which estimated scores were subtracted from (not correlated with) the actual scores obtained during test administration.
…teachers did not systematically overestimate (or underestimate) performance
on the easy items or overestimate (or underestimate) performance on the hard
items regardless of the target group. Instead, the systematic over- and
underestimation was a function of the teacher’s perception of the ability level of
the students for whom the estimate was provided
While this study did not find the centrality suggested in other studies, it is significant
that their method for detection involved the raw difference between scores, rather than
their correlation.
So while the literature on standard setting does not directly address the issue of
centrality, much of the information reported in published papers is consistent with the
suggestion that centrality exists as a problem in the ratings of panelists. However, the
standard setting research that addresses centrality is confusing. It is not the direct
object of any of the published research, nor is it specifically reported in any existing
study. In fact, descriptions of studies provided in pertinent research are such that it is
not always clear what is being addressed (See Brandon, 2004 for a discussion of
deficiencies in the reported procedures and results in standard setting research).
CHAPTER 3 Methods
3.1 Data & Materials
The data used in this study is drawn from standard setting meetings held at a
Taiwan university to link a university-level English proficiency exam to the Common
European Framework of Reference (CEFR) (See Appendix 1). It is the same data used
by Lavallee (2012) in his examination of many-faceted Rasch modeling in a modified-Angoff standard setting. Readers with questions about the data that are not answered in this paper may be able to find answers in his dissertation.
The exam used in this study, the English Proficiency Test or EPT, is an
examination of English as a Foreign Language. The EPT is a series of in-house language proficiency tests developed to meet the needs of the Practical English (PE) program adopted at Ming Chuan University. The PE exams are multiple-choice exams. They examine a range of listening, reading, and vocabulary skills in a number of different practical contexts. The tests have a long history at the school and were originally designed to be administered four times a year to undergraduate non-English majors at Ming Chuan University during the midterm examination and final examination periods. Beginning in 2004, a series of structural
and administrative reforms reorganized the tests, and beginning in 2010 all students in
the school, including students majoring in Applied English, began taking the EPT.
The EPT is administered as part of a larger English proficiency program, the
Practical English program, developed by Ming Chuan University to address
proficiency demands of students at the school. It is divided into 8 sections. Students
progress through 2 sections of Practical English each year: PE 1 and 2 are taught to
freshmen (1st year), PE 3 and 4 to sophomores (2nd year), PE 5 and 6 to juniors (3rd
year), and PE 7 and 8 to graduating seniors (4th year). Budgetary and other reforms
have resulted in major changes to the program and the structure of instruction is
rapidly becoming quite different. However, at the time of the standard setting (2010),
the program was still organized according to its traditional structure. Although
different students take the EPT, the composition of the test is now identical from year
to year of administration. An outline of the test organization is detailed in Table 1.
The original construction of the EPT test was based on traditional concepts of
test writing used in Taiwan. Test validity was determined by expert judges who were
themselves teachers. There were few specifications to follow during item writing and
determination of ‘error’ in test construction was based on issues more related to face
validity, such as spelling and grammar errors, or confusion over selecting answers,
than to psychometric properties. In fact, initially, there were no provisions made to
analyze the psychometric results of testing sessions, and determination of a successful
test was based on complaints from teachers and students about quality-related issues.
Items for the EPT are written by the classroom teachers of the PE program
under the supervision of test editors, who are assigned by the school and have
historically been compensated by the school for their work. More recently, this
compensation has been withdrawn following budget cutbacks. Items are written
following extremely vague item specifications and aimed at covering the materials
contained in school textbooks, which were also written by the classroom teachers of
the PE program. Items are then sent to a proofreader and finally returned to the
editors. The test editor returns the test to the school, which then prints and distributes the
test forms to students. The various tests of the EPT are administered on a single day.
So, for example, freshman students all receive the test at the same time. Sophomores receive the test at a different time from the freshman students.
Following student examinations, test results are collated and sent to a test
coordinator. The current test coordinator is Dr. Joseph Lavallee. Results are calibrated
with the Winsteps Rasch modeling software (Linacre, 2012), and all test items are placed on a
single difficulty scale. Items are sorted by their point-biserial correlations and
difficulty values, and stored in an item bank for later use. Currently, most items that
appear on PE tests are drawn from this item bank, although teachers continue to write
new items to expand the item bank.
The test items used in this standard setting were drawn from several different
midterm examinations. All items are calibrated using Rasch modeling onto a single
scale. This standard setting project was designed to establish cutscores along the scale
used to calibrate all items in the item bank and not along a raw score scale
corresponding to a single test form. Accordingly, the test form used in the project was
actually a composite form, with its items drawn from a series of different test forms
administered during the midterm examination period for first, second, third and fourth
year students. The tests shared a number of items which were used to equate them and
calibrate them together onto the same scale.
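For readers unfamiliar with the calibration step, the brief sketch below shows the dichotomous Rasch model that underlies this kind of common-scale work: the probability of a correct response depends only on the difference between examinee ability and item difficulty in logits, which is what allows items from different forms, linked through common items, to be expressed on one scale. The numbers are illustrative only; the operational calibration was carried out in Winsteps.

    import numpy as np

    def rasch_probability(ability, difficulty):
        """Dichotomous Rasch model: P(correct) for ability and item difficulty in logits."""
        return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

    # Illustrative values only: a hypothetical examinee at 0.5 logits meeting three items
    # whose difficulties have been calibrated onto the common EPT scale.
    for b in (-1.0, 0.5, 2.0):
        print(f"item difficulty {b:+.1f} logits -> P(correct) = {rasch_probability(0.5, b):.2f}")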
Table 1. Contents of the English Proficiency Test (EPT)

Listening (45 items, ~45 min)
  What's Next?            Student hears 2 conversational turns and is asked to choose the next response.  (20 items)
  Dialogues               Student hears a short conversation of about 8-14 turns and answers 3-5 comprehension questions.  (10 items)
  Extended Listening      Student hears a short monologue and answers 3-5 comprehension questions.  (15 items)

Reading (50 items, ~55 min)
  Fill in the Blank       Student chooses a word or short phrase to complete a sentence.  (10 items)
  Cloze Reading           Student chooses words or short phrases to complete a short passage (multiple-choice cloze).  (10 items)
  Reading with Questions  Student reads a short passage (150-300 words) and answers 3-5 comprehension questions based on the text.  (30 items)

Total: 95 items, ~100 min

(Lavallee, 2012)
3.2 The Operational Angoff Standard Setting
Selection and training of panelists was extremely conventional and followed
suggestions given in such authoritative sources as Cizek (2001), Cizek and Bunch (2007), and the Council of Europe (2009).
In the modified-Angoff study, 18 panelists were asked to set a cutscore only
for the B1 level. Panelists were selected from faculty and staff at the university who had an intimate understanding of both the PE tests and the student body with which the tests were regularly used, and who were familiar with the EPT. All MCU panelists had served in some capacity as item writers, proofreaders, and test editors for the PE tests, and all had
served as school-appointed proctors invigilating the test administration. In addition,
each panel was supplemented by a teacher from a similar program at a different
university in Taiwan. The level of familiarity of the panelists with the tests, the test
items, students, the CEFR and with standard setting in general was very high. Of the
18 panelists, 15 were faculty and staff teachers who had taught students in the PE
program at the university. Ten panelists had served as item writers for the PE tests and
some had written items that were used in the test form of the standard setting. Two of
the panelists and both facilitators had served on a preliminary yes/no Angoff panel
held earlier in the year (Lavallee, 2012). In addition, the CEFR is an important aspect
of instruction at the university. All MCU university faculty and staff from the English
Language Center would be familiar to some degree with the CEFR, having been
exposed to it during internal departmental training. One of the panelists had presented
numerous papers at local conferences and led several school-sponsored training
sessions about the use of the CEFR in the language classroom for ELC staff and
34
faculty.
External panelists were selected to add diversity to the standard setting
decisions. These panelists were selected because of their experience teaching students
at similar universities in Taiwan. Two of the external panelists were faculty members
at similar universities in the Taipei area, and one was a doctoral candidate at a Taipei
university who had taught remedial classes at the university at which she was studying.
Table 2 provides a summary of the panelists and a brief description of the background
of each.
Table 2. Angoff Panelists
Group        Judge         Gender   English NS/NNS   Position
I
(Mon)
Panelist 1 F NNS Administrator, former teacher
Panelist 2 F NNS Teacher
Panelist 3 M NNS Teacher
Panelist 4 F NS Teacher
Panelist 5 F NNS Teacher
Panelist 6 F NNS Teaching Assistant, recently graduated
student
II
(Wed)
Panelist 7 M NS Teacher
Panelist 8 M NS Teacher, External University
Panelist 9 F NNS Teacher
Panelist 10 M NNS Teacher
Panelist 11 F NNS Teaching Assistant; recently graduated
student
Panelist 12 M NS Teacher
III
(Fri)
Panelist 13 F NNS Administrator, Teacher
Panelist 14 F NNS Administrator, former teacher
Panelist 15 F NNS Teacher
Panelist 16 F NNS Teacher, External University
Panelist 17 F NNS Teacher
Panelist 18 F NNS Teacher
F=female, M = male
NS = native English speaker, NNS = non-native English speaker
(Lavallee, 2012).
A one-day training/orientation session was held on Saturday, July 10, 2010 for
all the participants. The panelists themselves were then separated into three different
panels which were held on Monday, July 12, Wednesday, July 14, and Friday, July 16
in 2010. A group of six panelists participated on each day. The individual panels were
conducted on three separate days to ensure that proper procedures were followed,
particularly during the discussion period. The facilitators acted as moderators of each
of the panels' discussion sessions, which required that the groups meet on separate
days.
As noted above, the test form presented to each of the panels was a
composite form drawn from a series of different tests in the Practical English series.
The items were drawn from test forms administered as part of the annual EPT for all
four year levels of the program, and the composite form, whose contents are shown in
Table 3, differed slightly from the EPT exam described earlier.
Table 3. Contents of the Test Form Used in the Standard Setting
Listening
What’s Next? 16 items
Dialogues 12 items (3 listening texts)
Extended Listening 12 items (3 listening texts)
Reading
Fill in the Blank 10 items
Text Completion 16 items
Reading with Questions 14 items
(Lavallee, 2012).
To acclimatize panelists to the difficulties encountered in taking the test and to
provide them with the experience of taking the exam, a training form was created with
the same format as the regular exam. The test form used in the operational standard
setting did not contain the scripts for the listening passages, so a separate form was
created for the listening test which contained both the listening scripts and the
associated items. During the training session, panelists took the test using the training
form. During the operational standard setting meeting, panelists were not able to hear
the taped version of the questions but were instead provided with the scripts of the
listening questions.
The week prior to the training session, an email was sent to all panelists
containing:
- an introductory letter containing a link to a CEFR familiarization website,
  <www.CEFtrain.net>;
- an agenda for the training session;
- the training materials, consisting of adapted versions of pages 33-36 from the
  CEFR, the listening and reading components of the CEFR self-assessment grid
  (CEFR Table 2), and a link to the website; and
- two forms collecting personal information and agreements concerning test
  security and informed consent for the research portion of the project (see also
  Appendices 2, 3, and 4).
As homework, panelists were asked to refer to the website and level summaries, and to
use the self-assessment grid to assess themselves (in any second language) and their
students in terms of the CEFR levels (Council of Europe, 2009).
As stated above, training of panelists was extremely conventional and
followed suggestions given in such authoritative sources as Cizek (2001), Cizek and
Bunch (2007), and the Council of Europe (2009).
On the day of the training, panelists were given a brief PowerPoint
presentation explaining the purpose of the project, describing the EPT, and explaining
how it was developed and validated. Following guidelines provided by Cizek (2001)
and Cizek and Bunch (2007), a great deal of effort was expended during training to
familiarize panelists with the descriptors used for the panels. They then
took part in a CEFR familiarization process. After a brief description of the CEFR,
their understanding of descriptors was tested. Panelists were given a sheet containing
the global level descriptors from the CEFR Table of Global Descriptors. The
descriptors had been rearranged, and the panelists were asked to sort them back into
the correct order (first individually, then in pairs). After providing them with a copy of
the original CEFR Table and discussing the correct answers, the panelists were asked
to take out their ‘homework’ activity in which they rated their own ability and that of
their students using the CEFR levels, and to discuss their answers in pairs.
The session then shifted to the CEFR reading Performance Level Descriptors
(PLDs). The first activity was another sorting activity, in which panelists were asked
to sort (individually, and then in pairs) 20 CEFR reading descriptors from CEFR
levels A1 to B2. They were then given the CEFR reading descriptors from the scales
used in the study (see section 1.2 above), for CEFR levels A1 to B2. Next, panelists
were given a 13-item reading test, taken from training material made available by the
Council of Europe. For each item, the panelists were asked to first attempt to answer
the item and then to assign the item to a CEFR level, based on the skills required to
correctly answer the item. After sharing their answers in pairs, the answers from the
original CoE study were shown and discussed. This concluded training for the reading
PLDs.
The training for the listening PLDs was conducted in parallel fashion.
Panelists were asked, individually and in pairs to sort 20 PLDs taken from the CEFR
A1 to B2 levels. After they finished, correct answers were provided along with a full
list of the listening PLDs from the scales used in the study (see section 1.2 above) for
levels A1 to B2. Panelists were then given a 6-item listening test, taken from the CoE
training material mentioned above, and asked to attempt to respond to the item and
then assign a CEFR-level to the item based on its perceived difficulty. Panelists
shared their answers in pairs, and then the recommended answers from the CoE study
were shown and discussed by the whole group. This concluded training for the
listening PLDs.
After a break for lunch, panelists took the practice test that was described
above. The panelists were then divided into the three groups of six people in each of
the operational panels. The groups were asked to sit together in a circle with the other
members of their standard setting panel. A group leader was chosen, and each group
was asked to go through the test form, item by item. As a group they were asked to
discuss what knowledge, skills and abilities were required to answer each item, and
how the items differed in terms of difficulty. One hour and fifteen minutes was
allotted to this task. The discussions were taped by the facilitators for later
transcription.
Following this activity, the panelists were introduced to the concept of the
barely proficient B1 student (B1 BPS). They were then given a form which contained
space for their notes on the BPS. They were asked to refer to their listening and
reading PLDs for the A2 and B1 levels, and to summarize on the forms what they felt
to be the key characteristics of a B1 BPS for both listening and reading. They were
then asked to discuss their summaries in pairs or small groups. This was the final
training activity of the day. Panelists were then given the opportunity to ask any
questions they had about what had been discussed to that point. They were told that
when they returned for the actual meeting, they would complete one practice round
before performing the actual standard setting. This concluded
the training session.
The Angoff meetings were held over the period of one week, on July 12, 14,
and 16. Each meeting was divided into two sessions, with standards set for the reading
test in the morning and the listening test in the afternoon. Before beginning, panelists
were given a brief review of the contents of the previous training session. This
included a review of the B1 BPS. Panelists were then asked to estimate, based on their
understanding of students in the PE program (or Taiwanese university students in
general for panelists who were instructors at other universities), the percentage of
students who had reached the B1 level for the skill in question and to write down this
estimate. Then the test form and the round 1 rating form (see Appendix 4) for the
reading test were distributed to the panelists and a practice round was conducted.
The rating form contained a single column for each item being rated, with each
column containing a list of probabilities from 0.1 to 0.9 in increments of 0.1,
with a space between each figure. Panelists were asked to “circle or insert” the
probability that a just-B1 level student would answer the item correctly, and to write
their answer at the bottom of the column. Panelists were instructed not to attempt to
include guessing in their calculation of probabilities. They were then given a practice
round, in which they were asked to write their ratings for the first few items. It was
made clear this was simply a practice round, to ensure that they understood the
procedure and that they could change their answers later. The facilitators circulated
from panelist to panelist while they were performing the practice round to make sure
the procedures were understood. Once all panelists had finished, they were asked if
there were any remaining questions. After questions were answered, the first round of
ratings was then conducted.
After returning from a break, panelists were given forms containing both
impact data and empirical item-difficulty data. They were given ‘impact data’ in the
form of a distribution of actual students in the program at different score levels on
the test. The rating form for the second and third rounds incorporated further feedback
data. The range of scale scores was divided into 40 categories of approximately equal
size. A column was added to the left side of the form. Each row in the column
contained one of the 40 categories, from low to high. Once again, there was one
column per item and the columns contained probabilities in increments of 0.1. This
time the probabilities were placed in rows corresponding to the scale scores in the
leftmost column. Based on empirical results from the Spring 2010 administration of
the EPT, the probabilities were placed in the particular scale-score row to correspond
to the approximate probability that a student in that scale-score category would
answer the item correctly. Panelists were guided in the use of the feedback material
so that they could use their initial estimates of the percentage of students at the B1
level, the distributional data, and the second rating form to contrast their round 1
ratings with the ratings implied by their estimates of the number of students at
the B1 level. Finally, at the bottom of the column for each item was the empirical p-
value for all PE students who took the midterm EPT. The listening form also
contained this information for graduating students. For reading, the difference
between graduating students and all students was not large, so this information was
omitted.
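A sketch of how such a feedback form can be assembled: student measures are cut into 40 score categories, and for each category the observed proportion answering a given item correctly is tabulated. The data here are simulated; the operational forms were built from the Spring 2010 EPT results.

    import numpy as np

    rng = np.random.default_rng(0)
    theta = rng.normal(0.0, 1.0, 2000)                       # simulated student measures (logits)
    p_true = 1.0 / (1.0 + np.exp(-(theta - 0.4)))            # one hypothetical item of difficulty 0.4
    responses = rng.binomial(1, p_true)                      # simulated right/wrong responses

    # Divide the range of the score scale into 40 categories of equal width
    edges = np.linspace(theta.min(), theta.max(), 41)
    category = np.clip(np.digitize(theta, edges[1:-1]), 0, 39)

    for c in range(40):
        mask = category == c
        if mask.any():
            print(f"category {c + 1:2d}: observed p = {responses[mask].mean():.2f}")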
After the panelists had been instructed in the use of the feedback information, a discussion
session was held. For each item, the panelists announced their round 1 ratings and
briefly explained why they had given the rating. The assistant facilitator entered
ratings into the computer as they were announced. Once the discussion period was
finished, the cutscores were calculated and shown to the panelists. Using the
distribution data, panelists were asked to contrast the percentage of students they had
initially estimated to be at the B1 level with the percentage of students who would be
classified at the B1 level based on their round one ratings. They were then asked to
make their round two ratings. It was emphasized that they did not need to change their
ratings.
The round two ratings were entered into the computer and cutscores were
calculated. (There was no discussion of individual decisions following round 2; rather,
panelists handed their rating forms to the facilitators who entered their ratings into the
computer while panelists took a break.) Panelists were again asked to consider the
impact (distributional) data, and given the opportunity to ask questions or make
comments. Following this, they were asked to make their final round of ratings. The
ratings for the final round were used to derive the recommended cutscores.
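The cutscore arithmetic itself is the standard Angoff computation: each panelist's item probabilities are summed to give that panelist's implied raw cutscore, and the panel's recommended cutscore is the mean of these values. The ratings below are invented; the operational cutscores were additionally expressed on the Rasch scale rather than the raw-score scale.

    import numpy as np

    # Hypothetical ratings: rows = panelists, columns = items
    # (each value is the probability that a just-B1 student answers the item correctly)
    ratings = np.array([[0.4, 0.6, 0.7, 0.3, 0.5],
                        [0.5, 0.5, 0.8, 0.4, 0.6],
                        [0.3, 0.7, 0.6, 0.4, 0.5]])

    panelist_cutscores = ratings.sum(axis=1)                 # each panelist's implied raw cutscore
    print("panelist cutscores:", panelist_cutscores)
    print(f"panel cutscore = {panelist_cutscores.mean():.2f} "
          f"(SD = {panelist_cutscores.std(ddof=1):.2f})")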
Table 4. B1 Listening and Reading Cutscores, Rounds 1 to 3

CEFR Level       Round 1 Mean (SD)    Round 2 Mean (SD)    Round 3 Mean (SD)
B1 Reading       6.06 (0.41)          6.22 (0.21)          6.31 (0.16)
B1 Listening     6.83 (0.25)          6.84 (0.15)          6.89 (0.25)
At the opening training meeting, all participants were asked to sign a research
consent form releasing all the data generated from the standard setting to the school
for any research and administrative purposes that were necessary (See Appendix 2 and
3). In addition, panelists’ feedback about their confidence and understanding of the
training was gathered regularly throughout both the training and the operational
standard setting panels. This included:
Day 1
1. pre-training assessment of individual preparation (Appendix 5)
2. Four different assessments throughout the training day assessing progress
and confidence in their task (Appendices 6, 7, and 8)
Day 2
3. An assessment at the opening of the operational panel to address
confidence and preparation in the day's coming activities. (Appendix 9)
4. A final assessment of panelist confidence in their final cutscore decision
and satisfaction with the manner in which the standard setting training
and panels had been conducted. This was modeled after the sample form
contained in Cizek and Bunch (2007).
Table 5 shows the chronological order of the evaluations throughout the two days of
training and standard setting.
Table 5. Chronological Order of Panelist Evaluations
Panelist Information Form (Appendix 5)
DAY 1 - Outline of the Training Session
Introduction to the project evaluation 1 (Appendix 6)
Introduction to CEFR evaluation 2 (Appendix 7)
Introduction to the test system of the school evaluation 3 (Appendix 8)
Introduction to the Angoff Standard Setting Procedure
Sorting of CEFR descriptors
Lunch
Group discussion of item difficulty
Take the test
DAY 2 - Outline of the Operational Panel
Review of standard setting procedures
Evaluation 4 (Appendix 9)
Reading Panel
Lunch
Listening Panel
Evaluation 5 (Appendix 10)
In addition to the feedback forms, the group discussion activities described
earlier were recorded and later transcribed. These recordings were made with hand-held
analog tape recorders with the full knowledge of all the participants. At no point in the
gathering of this data were participants deceived. Full disclosure of all data gathering
practices was conducted throughout.
Following the operational standard setting, one of the panels volunteered to
take part in a focus group to discuss their impressions of the standard setting. This
focus group was recorded and later transcribed for use in understanding panelist
perceptions of the standard setting, its procedures and its outcomes.
3.3 Procedures
Table 6 presents the key steps to date and the projected steps remaining to be
taken.
Table 6. Schedule for Completion of Dissertation
July 2009 – May 2010 Refining of standard setting design, preparation for
operational standard setting
June 2010 Operational Study – Modified-Angoff Standard
Setting
Sept –Jan 2012 Development of research questions, preparation of
proposal
May 24, 2013 Oral Defense of PhD proposal
Feb-Aug 2013 Quantitative analysis of data
Aug-Sept 2013 Preparation of dissertation
Oct 2013 Defense of dissertation
3.4 Analysis and Expected Results
The proposed analysis in this study involves an examination of centrality using
the classical measures of standard deviation and kurtosis, and latent trait measures of
centrality. These measures will be used in the ways described below to examine the
centrality of panelists and items and answer the research questions of this study.
Standard deviation: Standard deviation is a measure of the variability of a
population on a characteristic. It is also widely used to measure confidence in
statistical outcomes and is one of the most widely reported of all statistics. The
standard deviation of a population is calculated by

$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}} \qquad (2)$$

where
$\sum$ indicates the sum of the calculations,
$X$ = observation value,
$\mu$ = population mean,
$\sigma$ = population standard deviation, and
$N$ = number of scores in the population.

The standard deviation of a sample is estimated using the following formula,

$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} \qquad (3)$$

where
$\sum$ indicates the sum of the calculations,
$X$ = observation value,
$\bar{X}$ = sample mean,
$s$ = sample standard deviation, and
$n$ = number of scores in the sample.

Standard deviation expresses the typical distance of observations from the mean;
standardizing individual scores by the standard deviation produces values that range
between negative and positive infinity with a mean of 0. In a normal distribution,
approximately 68.27% of the sample falls between plus and minus one standard
deviation of the mean.
Standard deviation is one of the most studied of all statistics. It is widely
incorporated into descriptive statistics and a standard part of the reports of test and
research results. All of the major software packages provide the measure as a part of
output. It is also available on many types of pocket and desktop calculators. This
study reports both the standard deviation of panelist cutscore estimates (judge
standard deviation) and the standard deviation of estimates for individual items (item
standard deviation). These will be calculated and reported separately using Microsoft
Excel 2010 (Microsoft, 2013a).
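A minimal sketch of the two quantities, assuming the ratings for one round are arranged as a panelist-by-item matrix; the operational calculations use Excel's STDEV function, and the values here are invented.

    import numpy as np

    # Hypothetical round ratings: rows = panelists, columns = items
    ratings = np.array([[0.4, 0.6, 0.7, 0.3],
                        [0.5, 0.5, 0.8, 0.4],
                        [0.3, 0.7, 0.6, 0.4]])

    # ddof=1 gives the sample formula (3)
    judge_sd = ratings.std(axis=1, ddof=1)   # spread of each panelist's ratings across items
    item_sd = ratings.std(axis=0, ddof=1)    # spread of panelists' ratings for each item
    print("judge SD:", np.round(judge_sd, 3))
    print("item SD: ", np.round(item_sd, 3))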
Standard deviation is one of the classical measures of centrality described in
Saal, Downey, and Lahey (1980). In addition, it is a recommended statistic for
panelist final cutscore results (Cizek & Bunch, 2007) in the Angoff or other methods
for standard setting. As such, it provides a clear measure of a distribution's central
tendency. It was reported for this data set in Lavallee (2012).
If centrality increases across the 3 rounds of the modified-Angoff standard
setting, then the standard deviation of ratings given to items by individual raters
should decrease across rounds. This would not be a surprising finding, since, as
reported earlier, a decrease in the standard deviation of rater scores (judge standard
deviation) across the rounds of Angoff-type standard settings is a typical feature of
the different versions of the method (Cizek, 2001; Cizek & Bunch, 2007). The
standard deviation of individual panelist estimates of different items is widely
reported in Angoff-type standard settings, as well as other methods. A decrease in
standard deviation across rounds is a “common feature of standard settings” (Cizek,
2001a, p. 10) and is generally interpreted as an indicator of the validity of the
particular standard setting being examined. In her 2011 study of rater centrality, Yue
appears to interpret standard deviation in a different manner from Cizek. She
designates it as a measure of rater centrality, stating that
One would expect central tendency to be indicated by neutral ratings with little
variability. Therefore, for a given instrument, a displaced mean rating in
conjunction with a relatively large standard deviation would lead to the
conclusion that the instrument is sensitive to criterion performance, and therefore
resistant to central tendency (pp. 16-17).
Yue (2011) cautions that the kind of interpretation stated above by Cizek
(2001a), as well as the use of standard deviation as a measure of centrality, could be
problematic because the measure has a tendency to become inflated with random
error:
As a result, it would be unclear whether raters who produce ratings with small
standard deviations are engaging in centrality or are simply accurate raters in a
sample of inaccurate raters (p. 17).
However, Yue's study of simulated data was able to demonstrate that standard
deviation performed best among the measures she assessed, and was even better than
latent trait measures at distinguishing central raters from non-central raters while still
maintaining an acceptable level of power.
Kurtosis: Kurtosis is a classical measure used to detect central tendency in ratings. It
was not investigated in Yue (2011) or in Lavallee (2012), but it is one of the classical
measures of rater bias described in Saal, Downey, and Lahey (1980). It is statistically
defined as the fourth population moment around the mean, and is a measure of
deviation from the normal distribution.
$$\beta_2 = \frac{\sum (X - \mu)^4}{N\sigma^4} \qquad (4)$$

where
$\sum$ indicates the sum of the calculations,
$X$ = observed value,
$\mu$ = population mean,
$\sigma$ = population standard deviation, and
$N$ = number of scores in the population.
As the formula indicates, the normal distribution has kurtosis of 3. This is
conventionally rescaled to 0 by subtracting 3 from the result. This formula is
commonly referred to as Pearson kurtosis after its originator, Karl Pearson. Many
different versions of the formula have been derived. One of the most workable is
$$k = \frac{1}{N}\sum \left(\frac{X - \bar{X}}{s}\right)^4 - 3 \qquad (5)$$

where
$\sum$ indicates the sum of the calculations,
$X$ = the observation value,
$\bar{X}$ = the sample mean,
$N$ = number of scores in the sample, and
$s$ = standard deviation of the sample.
Kurtosis continues to be widely used in finance studies of the stock market
(Corrado & Su, 1996; Dittmar, 2002), but is less widely recognized as a useful
measure in educational or psychological research and has fallen out of use. The
general terminology associated with kurtosis for the shape of distributions is
illustrated in the diagram below.
Figure 1. Characteristics of Kurtosis
As stated above, a normal distribution has a kurtosis of 0. A platykurtic curve is
less peaked than the normal distribution and has a negative kurtosis. A leptokurtic
curve is more peaked than the normal distribution and has a positive kurtosis.
Kurtosis is a measure of a non-normal distribution's deviation from the normal
curve. It is generally interpreted as a measure of the peakedness of a curve (DeCarlo,
1997), but this interpretation has received heavy criticism. Darlington (1970) states
that kurtosis may also be measuring a variety of other things in addition to peakedness,
including the thickness of the tails. A large number of other measures of peakedness
have been identified (Balanda & MacGillivray, 1988; DeCarlo, 1997; Wilcox &
Keselman, 2003). However, these measures have rarely been used outside their initial
proposals, and they remain difficult to work with and poorly understood.
This study will report the Pearson kurtosis of panelist cutscores (judge kurtosis),
as well as the kurtosis of estimates for individual items (item kurtosis). Along with
standard deviation, Pearson kurtosis is one of the oldest statistics in use today. It is
well understood and easily managed by all of the major commercial software packages
and calculation tools. In this study it will be calculated using Microsoft Excel 2010
(Microsoft, 2013b), whose KURT function returns a sample excess kurtosis rescaled so
that a normal distribution has a kurtosis of 0, as described above.
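A sketch of the kurtosis calculation for a single panelist's ratings. The first value follows the moment formulation of equation (5); the second uses scipy's bias-corrected estimator, which appears to match the adjusted formula behind Excel's KURT. The ratings are invented, and the two values differ slightly because of the small-sample correction.

    import numpy as np
    from scipy.stats import kurtosis

    ratings = np.array([0.4, 0.5, 0.5, 0.6, 0.5, 0.4, 0.6, 0.5])   # one panelist's ratings (hypothetical)

    # Moment-based excess kurtosis, as in equation (5): mean fourth power of the
    # standardized deviations, minus 3 so that a normal distribution gives 0
    z = (ratings - ratings.mean()) / ratings.std(ddof=1)
    k_moment = np.mean(z ** 4) - 3.0

    # Bias-corrected excess kurtosis (the adjusted estimator used by Excel's KURT)
    k_excel = kurtosis(ratings, fisher=True, bias=False)

    print(f"moment-based excess kurtosis = {k_moment:.3f}")
    print(f"bias-corrected excess kurtosis = {k_excel:.3f}")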
Latent-trait measures: This study will use Rasch modeling to examine latent trait
characteristics of rater and item estimates from the modified-Angoff standard setting.
As originally suggested by Georg Rasch, there exists an invariant relationship between
the ability of the person and the item parameters. Rasch's original formulation of the
stochastic function of person and item parameters was stated for dichotomous data
(Humphry, 2005):

$$\ln\!\left(\frac{P_{ni1}}{1 - P_{ni1}}\right) = \beta_n - \delta_i \qquad (6)$$

where $\beta_n$ is the location of person $n$ along the underlying latent trait, $\delta_i$ is the
location of item $i$ along the same latent variable, and $P_{ni1}$ and $P_{ni0}$ are the
probabilities of person $n$ on item $i$ scoring 1 and 0, respectively.
A comparison of each expected value with the observed value leaves a residual,

$$x_{nr} - E_{nr} \qquad (7)$$

where $x_{nr}$ is the observed value for rater $r$ and $E_{nr}$ is the expected value for
rater $r$.
Edward Wolfe and his students (Wolfe, 2004; Wolfe, Moulder & Myford, 2001;
Myford & Wolfe, 2003; 2004; Yue, 2011) have identified several methods to use these
residual value of the Rasch measure to detect centrality. In her study of simulated
data, Yue (2011) identified the correlation between the expected value of the rating
and the residual of the expected and observed value as one of the most successful
measures at detecting centrality at a reasonable level of power.
The expected-residual correlation ($r_{exp,res}$) is the correlation of the expected latent-
trait (in this case Rasch) value for the rating with the residual. Unlike the classical
measures of centrality, latent trait measures provide a clear indication of when and
what type of centrality exists. Under the condition of perfect centrality, $r_{exp,res}$ should
be -1.0. Similarly, under conditions of perfect extremism, $r_{exp,res}$ should be 1.0. If there
is no centrality at all, $r_{exp,res}$ should be 0 (Wolfe, 2004).
Despite the precise definitions of centrality, the application of these
measures has been quite limited. The only study outside of Wolfe's group that has
reported results from these measures is Lavallee (2012), and this was not an attempt
to document their limitations or validity. As such, they remain largely theoretical and
poorly understood. However, unlike the classical measures discussed above, there are
guidelines for how to interpret their results in terms of identifying centrality. The
dissertation of Yue (2011) contains the SAS PROC commands for the calculation of
these values. In this study, they are calculated from item and panelist measures
supplied by WINSTEPS Rasch Modeling software (Linacre, 2012). The calculations
themselves are performed in Excel 2010 using instructions provided in Embretson
and Reise (2000).
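A sketch of the expected-residual correlation under the dichotomous formulation of equation (6). The person and item measures below are invented; in the study they come from the Winsteps output, and the observed values are the panelists' ratings rather than 0/1 responses, so this is illustrative only.

    import numpy as np

    def rasch_expected(beta, delta):
        # Expected value under the dichotomous Rasch model of equation (6)
        return 1.0 / (1.0 + np.exp(-(beta - delta)))

    # Hypothetical measures for one rater and five items (logits)
    beta_rater = 0.3
    delta_items = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])
    observed = np.array([0.9, 0.7, 0.6, 0.5, 0.4])      # that rater's observed ratings

    expected = rasch_expected(beta_rater, delta_items)
    residual = observed - expected                       # equation (7): observed minus expected

    # Near -1 suggests centrality, near +1 extremism, near 0 neither (Wolfe, 2004)
    r_exp_res = np.corrcoef(expected, residual)[0, 1]
    print(f"r_exp,res = {r_exp_res:.2f}")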
Standard deviation, kurtosis, and the latent trait measures of centrality will be
used in the following manner to address the research questions stated earlier.
Research question 1: Does centrality exist in the modified-Angoff standard setting?
The standard deviation, kurtosis, and latent trait measures of each of the items
and each of the judges during the different rounds will be examined. As stated above,
there is no accepted conventional interpretation of how much standard deviation or
kurtosis indicates centrality.
The interpretation of this is not straightforward, and a successful answer may
not be possible using classical measures. Despite its widespread use and its previous
use in centrality studies, there are no existing guidelines for the use of standard
deviation in the detection of centrality. A similar problem affects kurtosis. A
distribution is said to be leptokurtic when the rescaled Pearson kurtosis exceeds 0, but
no existing guidelines are available for the use of kurtosis in the detection of
centrality. It is expected that latent trait
measures, which provide stronger guidelines in the identification of centrality, should
find that at some point in the standard setting, panelist ratings show rater centrality.
When panelist centrality exists, the expected-residual correlation will be negative.
This research makes no prediction about when rater centrality will become
apparent in a modified-Angoff standard setting. It is possible that panelist ratings as
early as the first round will show centrality; however, this is not of significance to the
expected results of the study. By the final round, indicators of centrality should show
that it is present.
Question 2: How does centrality change across the rounds of the modified-Angoff
procedure?
This research question is a corollary of the question above. What are the patterns
of change for the measures examined in this study between the rounds of the
modified-Angoff standard setting? Do they increase? Or do they decrease? Are their
patterns among panelists? Do difficult items differ from easy items in terms of their
centrality?
In addition to the analysis of the item and judge classical and latent trait
measures described above, I propose an index similar in spirit to the measure used in
Impara and Plake (1998), in which the absolute difference between estimated and
actual scores was examined. Here, a nominal index is constructed from a combination
of the two classical measures, standard deviation and kurtosis. To obtain the values
for this matrix, the first round values will be subtracted from the final values in the
third round. The value for kurtosis will be positive or negative depending on whether
it, and the centrality of the raters or items, is increasing or decreasing across the
standard setting. A similar operation will be performed for the standard deviation.
Combining these two measures produces the following matrix, whose quadrants
should produce a hierarchy of centrality.
                              Kurtosis change across 3 rounds
                              positive                    negative

St. dev.       smaller        Quadrant A                  Quadrant B
change                        (HIGHEST CENTRALITY)
across
3 rounds       larger         Quadrant C                  Quadrant D
                                                          (LOWEST CENTRALITY)

Figure 2. Direction of Centrality Change
The four quadrants of the 2 × 2 matrix show different combinations of kurtosis and
standard deviation. The A Quadrant shows the highest level of centrality, containing
those items and panelists with a decreasing standard deviation and increasingly
positive kurtosis. The B Quadrant contains those items and panelists with decreasing
standard deviation but increasingly negative kurtosis. The C Quadrant contains those
items and panelists with an increasing standard deviation and an increasingly positive
kurtosis. The final D Quadrant contains those items with the least centrality, with an
increasing standard deviation and an increasingly negative kurtosis.
Latent trait measures can also be used in this operation, and similar indexes can
be constructed using combinations of these and classical measures to assess the
change in centrality between rounds.
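A sketch of how the quadrant assignment could be automated: the round-3-minus-round-1 change in standard deviation and in kurtosis is computed for each panelist (or item) and mapped onto the quadrants of Figure 2. The change values below are invented.

    def centrality_quadrant(sd_change, kurtosis_change):
        # Quadrants of Figure 2, based on round 3 minus round 1 changes
        if sd_change < 0 and kurtosis_change > 0:
            return "A (highest centrality)"
        if sd_change < 0 and kurtosis_change < 0:
            return "B"
        if sd_change > 0 and kurtosis_change > 0:
            return "C"
        return "D (lowest centrality)"

    # Hypothetical (SD change, kurtosis change) pairs for three panelists
    changes = {"Panelist 1": (-0.08, 0.6),
               "Panelist 2": (-0.03, -0.2),
               "Panelist 3": (0.05, -0.4)}

    for name, (d_sd, d_kurt) in changes.items():
        print(name, "->", centrality_quadrant(d_sd, d_kurt))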
Do the estimates of panelists fall in some quadrants more than others? What is
the movement of panelist estimates between rounds? What is the movement of item
estimates between rounds? Do item estimates and panelist estimates differ? I expect
that most panelist estimates will fall in Quadrant A and Quadrant B, indicating that
centrality has increased across rounds. Few, if any, items or panelists will fall in
Quadrant C or Quadrant D, which would indicate a decrease in centrality.
Question 3: Does centrality explain differences in panelist ratings between extreme
(difficult and easy) items and items of median difficulty?
This research question is the primary question of the project. Once patterns
consistent with centrality have been established, are these patterns also consistent with
the idea that items of extreme difficulty values - both the most difficult and the easiest
- show the most centrality?
A measure of how extreme an item's difficulty is will be constructed from the
difficulty measures of the items. The absolute value of an item's distance from the
midpoint of the logit scores will be used as the measure of item extremism. This
measure can in
turn be used in calculations of the correlation with classical measures of centrality,
such as standard deviation or kurtosis, as well as the various latent trait measures. In
addition, it can be compared with the index of centrality change described above to
determine if items most likely to change in the hierarchy of centrality measures are
the most extreme items.
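A sketch of the extremism measure and one such correlation, reading "the midpoint of the logit scores" as the centre of the observed difficulty range; the difficulties and item standard deviations below are invented.

    import numpy as np

    item_difficulty = np.array([-2.1, -0.9, -0.2, 0.3, 1.1, 2.4])   # Rasch difficulties (logits, hypothetical)
    item_sd = np.array([0.22, 0.15, 0.12, 0.13, 0.17, 0.25])        # SD of panelist ratings per item (hypothetical)

    midpoint = (item_difficulty.min() + item_difficulty.max()) / 2.0
    extremism = np.abs(item_difficulty - midpoint)                  # distance from the midpoint of the logit scores

    r = np.corrcoef(extremism, item_sd)[0, 1]
    print(f"correlation between item extremism and item SD = {r:.2f}")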
It is expected that correlations between the measure of item extremism and all
measures of centrality will be significant, or at least that these correlations will
increase across the three rounds of the modified-Angoff standard setting, indicating a
strong association between item extreme value and its centrality in a collection of
rated items. It is expected that the most extreme items, as identified by this measure of
item extremism, will be the items most likely to show the greatest centrality change in
the hierarchy of centrality described above. That is, items with the most extreme
values of item difficulty will most likely be found in Quadrant A.
References
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In: R. L. Thorndike
(Ed.), Educational Measurement (pp. 508-600). Washington, DC: American
Council on Education.
Balanda, K. P.; and H. L. MacGillivray. (1988). Kurtosis: a critical review. American
Statistician, 42, 111-119.
Berger, J., Harbring, C.& Sliwka D. (2010). Performance appraisals and the
impact of forced distribution: An experimental investigation. Discussion Paper
No. 5020, The Institute for the Study of Labor, Bonn, Germany.
Bol, J. C. (2006).The determinants and performance effects of supervisor bias. IESE
Business School. University of Navarra.
Bond, T. G., & Fox, C. M. (2001).Applying the Rasch model: Fundamental
measurement in human sciences. Mahwah, NJ: Erlbaum.
Brandon, P. R. (2004). Conclusions about frequently studied modified Angoff
standard-setting topics. Applied Measurement in Education, 17, 59–88.
Buckendahl, C.W., Smith, R.W., Impara, J.C. & Plake, B.S. (2002). A comparison of
Angoff and bookmark standard setting methods. Journal of Educational
Measurement, 39, 253-263.
Chi, M. T. H., Glaser, R., & Farr, M. J. (Eds.). (1988). The nature of
expertise. Hillsdale, NJ: Erlbaum.
Cizek, G. J. (1996). An NCME instructional module on setting passing scores.
Educational Measurement: Issues and Practice, 15, 20-31.
Cizek, G. J. (ed.) (2001) Setting performance standards: concepts, methods, and
perspectives. Mahwah. NJ: Erlbaum.
Cizek, G. J., & Bunch, M. B. (2007).Standard setting: A guide to establishing and
evaluating performance standards on tests. Thousand Oaks, CA: Sage.
Cizek, G.J., Bunch, M.B., & Koons, H. (2004).Setting performance standards:
contemporary methods. Educational Measurement: Issues and Practice, 23,
31–50.
Clauser, B. E., Harik, P., Margolis, M. J., McManus, I. C., Mollon, J., Chris, L. &
Williams, S. (2009). An empirical examination of the impact of group
discussion and examinee performance information on judgments made in the
Angoff standard-setting procedure. Applied Measurement in Education, 22,
1-21
Clauser, B. E., Swanson, D. B., & Harik, P. (2002).A multivariate generalizability
analysis of the impact of training and examinee performance information on
judgments made in an Angoff-style standard-setting procedure. Journal of
Educational Measurement, 39, 269–290.
Corrado, C., and T. Su. (1996). Skewness and Kurtosis in S&P 500 Index Returns
Implied by Option Prices. Journal of Financial Research, 19,175-192.
Cronbach, L. J. (1988). Five perspectives on validation argument. In H. Wainer & H.
Braun (Eds.), Test validity (pp. 3–17). Hillsdale, NJ: Erlbaum.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests.
Psychological Bulletin, 52, 281–302.
Cross, L. H., Impara, J. C., Frary, R. B., & Jarger, R. M. (1984). A comparison of
three methods on the National Teacher Examination. Journal of Educational
Measurement, 21, 113-129.
Council of Europe (2001). Common European Framework of Reference for
Languages: learning, teaching, assessment. Cambridge: Cambridge University
Press. Retrieved March 25, 2013 from
http://www.coe.int/t/DG4/Portfolio/documents/Framework_EN.pdf
Council of Europe (2009). Relating language examinations to the Common European
Framework of Reference for Languages: Learning, teaching, assessment
(CEFR). Cambridge: Cambridge University Press. Retrieved March 25, 2013
from http://www.coe.int/t/dg4/education/elp/elp-reg/Source/Publications/
Standards_language_EN.pdf
Cross, L. H., Frary, R. B., Kelly, P. P., Small, R. C., & Impara, J. C. (1985).
Establishing minimum standards for essays: Blind versus informed reviews.
Journal of Educational Measurement, 22,137-146.
Darlington, R. B. (1970). Is kurtosis really "peakedness"? American Statistician, 24,
19-22.
DeCarlo, L. T. (1997). On the meaning and use of kurtosis. Psychological Methods, 2,
292-306.
DeCotiis, T. A. (1977). An analysis of the external validity and applied
relevance of three rating formats. Organizational Behavior and Human
Performance, 19, 247-266.
Dittmar, R., (2002). Nonlinear Pricing Kernels, Kurtosis Preference, and Evidence
from the Cross-Section of Equity Returns. Journal of Finance, 57, 369-403.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists.
Mahwah, NJ: Lawrence Erlbaum Associates.
Engelhard, G. (2007). Evaluating bookmark judgments. Rasch measurement
Transactions, 21, 1097-1098.
Engelhard, G., & Anderson, D. W. (1998). A binomial trials model for examining the
ratings of standard setting judges. Applied Measurement in Education, 11,
209-230.
Farrokhi, F., Esfandiari, R., & Dalili, M. V. (2011). Applying the Many-Facet Rasch
Model to Detect Centrality in Self-Assessment, Peer-Assessment and Teacher
assessment. World Applied Science Journal, 15, 70-77.
Fitzpatrick, A. R. (1989). Social influences in standard setting: The effects of social
interaction on group judgments. Review of Educational Research, 59, 315-328.
Geisinger, K., & McCormick, C. (2010). Adopting cut scores: Post-standard-setting
panel considerations for decision makers. Educational Measurement: Issues and
Practice, 10, 17-22.
George, S., Haque, M. S.& Oyebode, F. (2006). Standard setting: Comparison of two
methods, BMC Medical Education, 6, 46.
Giraud, G. T., & Impara, J. C. (2005).Making the cut: The cut score setting process in
a public school district. Applied Measurement in Education, 18, 289-312.
Glaser, R. (1963). Instructional technology and the measurement of learning
outcomes. American Psychologist,18, 519–522.
Goodwin, L.D. (1999). Relations between observed item difficulty levels and Angoff
minimum passing levels for a group of minimally competent examinees. Applied
Measurement in Education, 12, 13-28.
Green, D. R., Trimble, C. S., & Lewis, D. M. (2003).Interpreting the results of three
standard-setting procedures. Educational Measurement: Issues and Practice, 22,
22–32.
Halpin, G., Sigmon, G., & Halpin, G. (1983).Minimum competency standards set by
three divergent groups of raters using a three judgmental procedures:
Educational and Psychological Measurement, 47, 977-983.
Hambleton , R. K. (1980) Test score validity and standard-setting methods. In R. A.
Berk (Ed.),Criterion - referenced measurement: The state of the art. Baltimore,
MD: Johns Hopkins University Press.
Hambleton, R.K., (2001). Setting performance standards on educational assessments
and criteria for evaluating the process. In G.J. Cizek (Ed.), Setting performance
standards: Concepts, methods, and perspectives. (pp.89-116).Mahwah, NJ:
Lawrence Erlbaum Associates
Hambleton, R. K., & Pitoniak. M.(2006).Setting performance standards. In R. L.
Brennan (Ed.), Educational measurement (pp. 433-470). Westport, CT:
Praeger Publishers.
Hein, S. F. & Skaggs, G. E. (2009).A qualitative investigation of panelists’
experiences of standard setting using two variations of the bookmark method.
Applied Measurement in Education, 22, 207-228.
Hertz, N. R., & Chinn, R. N. (2002, April).The role of deliberation style in standard
setting for licensing and certification examinations. Paper presented at the
annual meeting of the National Council on Measurement in Education, New
Orleans, LA.
Humphry, S. (2005). Maintaining a common arbitrary unit in social measurement.
Unpublished doctoral dissertation, Murdoch University, Western Australia.
Hurtz, G. M., & Auerbach, M. A. (2003).A meta-analysis of the effects of
modifications to the Angoff method on cutoff scores and judgment consensus.
Educational and Psychological Measurement, 63, 584–601
Huynh, H. & Schneider, C. (2005). Vertically moderated standards: Background, p
assumptions, and practices. Applied Measurement in Education, 18, 99-113.
Impara, J.C., & Plake, B.S. (1998). Teachers’ ability to estimate item difficulty: A test
of the assumptions in the Angoff standard-setting method. Journal of
Educational Measurement, 35,69-81.
Jaeger, R. M. (1982). An iterative structures judgment process for establishing
standards on competency tests: Theory and application. Educational evaluation
and Policy Analysis, 4, 461-475.
Johnson, E. J. (1988). Expertise and decision under uncertainty: Performance and
process. In M. Chi, R. Glaser, & M. J. Farr (Eds.), The nature of expertise.
(pp. 209-228). Hillsdale, NJ: Lawrence Erlbaum Associates.
Kaftandjieva, F. (2010). Methods for Setting Cut Scores in Criterion-references
Achievement Tests. A Comparative Analysis of Six Recent Methods with an
Application to Tests of Reading in EFL. EALTA publication. Retrieved March
25, 2013 from
http://www.ealta.eu.org/documents/resources/FK_second_doctorate.pdf
Kane, M. T. (2001). So much remains the same: conception and status of validation in
setting standards. In G. J. Cizek (Ed.),Setting performance standards: concepts,
methods and perspectives (pp. 19–51). Mahwah, NJ: Lawrence Erlbaum
Associates, Inc.
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin,
112, 527-535.
Larkin, J. H., McDermott, J., Simon, D. P., & Simon, H. A. (1980). Expert and novice
performance in solving physics problems. Science, 208, 1335-1342.
Lavallee, J. (2012). Validation Issues in an Angoff Standard Setting: A Facets-based
investigation. Unpublished PhD Dissertation, Department of Counseling and
Educational Psychology, National Taiwan Normal University, Taipei, Taiwan.
Linacre, J. (2012). Facets: Many-Facet Rasch Analysis. Retrieved June 27, 2012 from
http://www.winsteps.com/facets.htm
Linn, R. L., Baker, E. L., & Betebenner, D. W. (2002). Accountability systems:
Implications of requirements of the No Child Left Behind Act of 2001.
Educational Researcher, 31, 3–16.
Linn, R. L. (2003). Accountability: Responsibility and reasonable expectations.
Educational Researcher, 32, 3-13.
Lissitz, R. W. & Huynh H. (2003). Vertical equating for state assessments: Issues
and solutions in determination of adequate yearly progress and school
accountability. Practical Assessment, Research & Evaluation, 8 (10). Retrieved
March 25, 2012 from http://pareonline.net/getvn.asp?v=8&n=10
Lissitz, R. W. & Wei, H. (2008).Consistency of standard setting in an augmented
state testing system. Educational Measurement, 27, 46-56.
Loevinger, J. (1957). Objective tests as instruments of psychological theory.
Psychological Reports, 3, 635–694.
Lorge, L, & Kruglov, L. (1953).A suggested technique for the improvement of
difficulty prediction of test items. Educational and Psychological Measurement,
12, 554-561.
McGinty, D. (2005). Illuminating the “Black Box” of standard setting: An exploratory
qualitative study. Applied Measurement in Education, 18, 269–287
Messick, S. (1981). Constructs and their vicissitudes in educational and
psychological measurement. Psychological Bulletin, 89, 575–588.
Messick, S. (1989).Validity. In R. L. Linn (Ed.), Educational measurement (pp. 13–
103). Washington, DC: American Council on Education and National Council on
Measurement in Education.
Messick, S. (1998). Test validity: A matter of consequence. Social Indicators
Research, 45, 35–44
Microsoft (2013a) STDEV in Welcome to Office. Retrieved January 23, 2012 from
http://office.microsoft.com/en-us/excel-help/stdev-HP005209277.aspx?CTT=3.
Microsoft (2013b) KURT in Welcome to Office. Retrieved January 23, 2013 from
http://office.microsoft.com/en-us/excel-help/kurt-HP005209150.aspx.
Mitzel, H. C., Lewis, D. M., Patz, R. J., & Green, D. R. (2001). The Bookmark
procedure: Psychological perspectives. In G. J. Cizek (Ed.). Setting performance
standards: Concepts, methods, and perspectives (pp. 249-281). Mahwah, NJ:
Erlbaum.
Moers, F.(2005). Discretion and bias in performance evaluation: the impact of
diversity and subjectivity. Accounting, Organizations and Society, 30, 67–80.
Murphy, K. R., & Balzer, W. K. 1989. Rater errors and rating accuracy. Journal of
Applied Psychology, 74, 619-624.
Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using
many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4,
386–422.
Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using
many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5,
189–227.
Nedelsky, L. (1954). Absolute Grading Standards for Objective Tests. Educational
and Psychological Measurement, 14, 3-19.
Nelson, D. S., (1994) Job Analysis for Licensure and Certification Exams: Science or
Politics? Educational Measurement: Issues and Practice, 29-35.
Norcini, J. J. & Shea, J. A. (1997). The credibility and comparability of standards.
Applied Measurement in Education, 10, 39–59
Norcini, J., Lipner, R., Langdon, L., & Strecker, C. (1987). A comparison of three
variations on a standard-setting method. Journal of Educational Measurement,
24, 56-64.
Plake, B., & Giraud, G. (1998) Effect of a modified Angoff strategy for Obtaining
item performance estimates in a standard setting study. Paper presented at the
Annual Meeting of the American Educational Research Association. San Diego,
Calf.
Prendergast, C., 1999, The provision of incentives in firms. Journal of Economic
Literature, 37, 7-63.
Reckase, M. D. (2000). The ACT/NAGB standard setting process: How "modified"
does it have to be before it is no longer a modified-Angoff process? Paper
presented at the annual meeting of the American Educational Research
Association, New Orleans.
Reckase, M. D. (2006) Rejoinder: Evaluating standard setting methods using error
models proposed by Schulz. Educational Measurement, 25, 14-17
Roach, A. T., McGrath, D., Wixon, C., & Talapatra, D. (2010).Aligning an early
childhood assessment to state kindergarten content standards: application of a
nationally recognized alignment framework. Educational Measurement: Issues
and Practice, 29(1), 25-37.
Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings: Assessing the
psychometric quality of rating data. Psychological Bulletin, 88, 413-428.
Schafer, W. D. (2005).Criteria for standard setting from the sponsor’s perspective.
Applied Measurement in Education, 18, 61-81.
Saunders, W. L. (1998). Value added assessment. The School Administrator, 55,
24-32.
Saunders, W. L., & Horn, S. P. (1998). Research findings from the Tennessee value-
added assessment system (TVAAS) database: Implications for educational
evaluation and research. Journal of Personnel Evaluation in Education, 12,
247–256.
Saunders, W. L. & Horn, S P. (1995). The Tennessee Value-Added Assessment System
(TVAAS): Mixed Model Methodology in Educational Assessment. In
Shrinkfietd, A. J. & Stufflebeam, D. (Eds.), Teacher evaluation: Guide to effective
practice (pp. 337-350). Boston, MA: Kluwer.
Shanteau, J.(1992). Competence in experts: the role of task characteristics.
Organizational Behavior and Human Decision Processes. 53, 252–266.
Shanteau, J.(2001). What does it mean when experts disagree? In E. Salas & G. Klein
(Eds.), Linking expertise and naturalistic decision making (pp. 229–244).
Mahwah, NJ: Erlbaum
Shepard, L.A. (1980). Standard setting issues and methods. Applied Psychological
Measurement, 4, 447-467.
Shepard, L. A. (1994). Implications for standard setting of the National Academy of
Educational Evaluation of the National Assessment of Educational Progress
achievement levels. In: Proceedings of the joint conference on standard setting
for large-scale assessments of the National Assessment Governing Board and the
National Center for Educational Statistics (pp. 143–159). Washington, DC: U.S.
Government Printing Office.
Smith, R. L. and Smith, J. S. (1988). Differential use of item information by judges
using Angoff and Nedelsky procedures. Journal of Educational Measurement,
25, 259-274.
Taube, K.T. (1997). The incorporation of empirical item difficulty data in the Angoff
standard-setting procedure. Evaluation and the Health Professions, 20, 479-498.
van de Watering G., van der Rijt J. (2006). Teachers’ and students’ perceptions of
assessments: a review and a study into the ability and accuracy of estimating the
difficulty levels of assessment items. Educational Research Review,1, 133–147.
Weiss, D. J., & Shanteau, J. (2004). The vice of consensus and the virtue of
consistency. In C. Smith, J. Shanteau, & P. Johnson (Eds.), Psychological
investigations of competent decision making (pp. 226–240). Cambridge, UK:
Cambridge University Press.
Wessen, C. (2010). Analysis of Pre- and Post-Discussion Angoff ratings for evidence
of social influence effects. Unpublished MA Dissertation, Department of
Psychology, University of California, Sacramento.
Wiley, A., & Guille, R. (2002, April).The occasion effect for “at-home” Angoff
ratings. Paper presented at the annual meeting of the National Council on
Measurement in Education, New Orleans, LA.
Weiss, D. J., & Shanteau, J. (2003). Empirical assessment of expertise. Human Factors, 45, 104–116.
Weiss, D. J. & Shanteau, J. (no date, a). CWS: A user's guide. Retrieved January 23,
2013 from https://www.ksu.edu/psych/cws/pdf/using_cws.pdf.
Weiss, D. J. & Shanteau, J. (no date, b). Do judgments alone provide sufficient
information to determine expertise of the judge who made them? Retrieved
January 23, 2013 from http://www.k-state.edu/psych/cws/pdf/judgment.pdf.
Wilcox R. R. & Keselman H. J. (2003). Modern robust data analysis methods:
Measures of central tendency. Psychological Methods, 8, 254–274.
Wolfe, E.& McVay, A. (2012) Application of latent trait models to identifying
substantively interesting raters. Educational Measurement: Issues and Practice,
31, pp. 31-37.
Wolfe, E. W. (2004). Identifying rater effects using latent trait models. Psychology
Science, 46, 35–51.
Wolfe, E. W., Moulder, B. C., & Myford, C. M. (2001). Detecting differential rater
functioning over time (DRIFT) using a Rasch multi-faceted rating scale model.
Journal of Applied Measurement, 2, 256–280.
Yin, P., & Schultz, E. M. (2005, April).A comparison of cut scores and cut score
variability from Angoff-based and Bookmark-based procedures in standard
setting. Paper presented at the annual meeting of the National Council on
Measurement in Education, Montreal, Canada.
Yue, Xiaohui (2011). Detecting Rater Centrality Effect: Using Simulation Methods
and Rasch Measurement Analysis. Unpublished PhD Dissertation, Department of
Educational Research and Evaluation, Virginia Polytechnic Institute, Blacksburg,
Virginia.
Appendix 1. Common European Framework of Reference - Global Scale
Level Performance Level Descriptors
C2
Can understand with ease virtually everything heard or read. Can summarise
information from different spoken and written sources, reconstructing
arguments and accounts in a coherent presentation. Can express him/herself
spontaneously, very fluently and precisely, differentiating finer shades of
meaning even in more complex situations.
C1
Can understand a wide range of demanding, longer texts, and recognise implicit
meaning. Can express him/herself fluently and spontaneously without much
obvious searching for expressions. Can use language flexibly and effectively for
social, academic and professional purposes. Can produce clear, well-structured,
detailed text on complex subjects, showing controlled use of organisational
patterns, connectors and cohesive devices.
B2
Can understand the main ideas of complex text on both concrete and abstract
topics, including technical discussions in his/her field of specialisation. Can
interact with a degree of fluency and spontaneity that makes regular interaction
with native speakers quite possible without strain for either party. Can produce
clear, detailed text on a wide range of subjects and explain a viewpoint on a
topical issue giving the advantages and disadvantages of various
options.
B1
Can understand the main points of clear standard input on familiar matters
regularly encountered in work, school, leisure, etc. Can deal with most
situations likely to arise whilst travelling in an area where the language is
spoken. Can produce simple connected text on topics which are familiar or of
personal interest. Can describe experiences and events, dreams, hopes and
ambitions and briefly give reasons and explanations for opinions and plans.
A2
Can understand sentences and frequently used expressions related to areas of
most immediate relevance (e.g. very basic personal and family information,
shopping, local geography, employment). Can communicate in simple and
routine tasks requiring a simple and direct exchange of information on familiar
and routine matters. Can describe in simple terms aspects of his/her background,
immediate environment and matters in areas of immediate need.
A1
Can understand and use familiar everyday expressions and very basic phrases
aimed at the satisfaction of needs of a concrete type. Can introduce him/herself
and others and can ask and answer questions about personal details such as
where he/she lives, people he/she knows and things he/she has. Can interact in a
simple way provided the other person talks slowly and clearly and is prepared to
help.
Source: CoE, 2001, p. 24.
Appendix 2. Informed Consent Form
Informed Consent Form for ELC Standard Setting Pilot Studies (July 2010)
The ELC assessment committee is doing research on the standard setting process used to link
tests to the Common European Framework of Reference (CEFR). The results of this process
are important, because they determine what test scores count as ‘proof’ that a student has
reached a certain ability level. However, the process itself is very subjective, and there is no
way to prove that a given score means that a student has “really” reached a given ability level.
The purpose of this study is thus to help us better understand the factors that influence the
decision-making process, as part of the longer-term goal of improving the process.
You are being invited to take part because of your background in the TEFL field. Your
participation is entirely voluntary, and your choice will have no bearing on your job or on any
work-related evaluations or reports. If you accept, you will be asked to complete a short
preparatory assignment and to participate in two one-day workshops. At these meetings, you
will receive more training and then you will be asked to make a series of judgments
concerning the difficulty level of items from MCU’s English exams in relation to the CEFR
descriptors.
Audio and video recordings will be made of the group discussion and the interviews, and the
recordings will be transcribed. The recordings will be treated as confidential and no real
names will be used in the transcripts. No one outside of the Assessment Committee will have
access to the transcripts.
We expect that your participation will help us to better understand how judges make standard
setting decisions. We will share any findings with you. The findings may also be shared with
other researchers in the field through presentations and/or publications.
All participants will receive an honorarium.
If you have any questions about any aspect of the study, please do ask.
Joseph Lavallee, Principal Investigator
ELC Assessment Committee, Ming Chuan University
I have read the foregoing information. I have had the opportunity to ask questions about it, and any questions I have asked have been answered to my satisfaction. I consent voluntarily to participate in this study.
Name (please print clearly): ________________________
Signature: ________________________
Date (day/month/year): ________________________
Appendix 3. Security Form
Standard Setting Security Agreement Form
I, _______________________ (print panelist name here), understand and accept the following terms and conditions.
1. Panelists will follow all test security procedures set forth in writing or verbally by English Language Center representatives.
2. Panelists will turn over to English Language Center representatives all products of the standard setting meeting at the close of the session or as directed by said representatives.
3. Panelists relinquish any claim or right to any and all products turned over to the English Language Center.
4. All of the materials used at the meeting are considered secure, and panelists are expected to turn in all such materials at the close of the session or as directed by English Language Center representatives and to maintain complete confidentiality regarding these materials.
________________________________________    ___________
Panelist signature                           Date
Appendix 4. Angoff Panelist Record Form
PANELIST NAME: ________________    Listening - ROUND 1
Circle or insert the probability that a just-B1 level student would get the item correct. Then write your probability for each item in the "Prob." row at the bottom of the table.
ITEM NO.    1    2    3    4    5    6    7    8
0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2
0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3
0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4
0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6
0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7
0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8
0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9
Prob.
PANELIST NAME: ________________ Reading - ROUND 1
Circle or insert the probability that a just-B1 level student would get the item correct. Then write your probability for each item in the "Prob." row at the bottom of the table.
ITEM NO.    1    2    3    4    5    6    7    8
0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2
0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3
0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4
0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6
0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7
0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8
0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9
Prob.
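For reference only (this note is not part of the operational form), the short Python sketch below illustrates how the item probabilities recorded on the forms above are conventionally combined into an Angoff cutscore: each panelist's eight item probabilities are summed to give that panelist's recommended cutscore, and the panel's recommended cutscore is the mean of those sums. The panelist labels and probability values shown are hypothetical.

# Illustrative sketch only; panelist labels and values are hypothetical.
ratings = {
    # Each panelist records the probability that a just-B1 student answers
    # each of the 8 items correctly (the values circled on the form above).
    "Panelist A": [0.6, 0.4, 0.7, 0.5, 0.3, 0.8, 0.6, 0.5],
    "Panelist B": [0.5, 0.5, 0.6, 0.4, 0.4, 0.7, 0.5, 0.6],
}

# A panelist's recommended cutscore is the sum of his or her item probabilities,
# i.e., the expected number of items a just-B1 student would answer correctly.
panelist_cutscores = {name: sum(probs) for name, probs in ratings.items()}

# The panel's recommended cutscore is the mean of the panelists' sums.
panel_cutscore = sum(panelist_cutscores.values()) / len(panelist_cutscores)

print({name: round(score, 2) for name, score in panelist_cutscores.items()})
print(round(panel_cutscore, 2))  # approximately 4.3 for the values above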
Appendix 5. Panelist Information Form
Panelist Information Form
Please answer the following questions.
Name (in native language)
_____________________________________________________________________
Current Job Title/Employer
_____________________________________________________________________
Highest relevant degree attained (please list the name of the degree and the granting institution): ____________________________________________________________________
_____________________________________________________________________
Years of teaching experience with university-level students in Taiwan:
_____________________________________________________________________
Years of experience with test design/development:
_____________________________________________________________________
Have you ever lived in an English-speaking country? Where? How long?
_____________________________________________________________________
Please circle the answer that best shows your opinion
I am familiar with the test design and construction process.
Disagree 1 2 3 4 Agree
I am familiar with the Common European Framework of Reference (CEFR).
Disagree 1 2 3 4 Agree
I am familiar with Rasch or IRT modeling.
Disagree 1 2 3 4 Agree
I am familiar with the standard setting process.
Disagree 1 2 3 4 Agree
Appendix 6. Evaluation 1
PART I. Procedures    Panelist Number _____
Please circle the answer that best shows your opinion
I have completed the introduction to the procedures of a standard setting.
Disagree 1 2 3 4 Agree
The group leader answered all of my questions.
Disagree 1 2 3 4 Agree
I understand the instructions so far.
Disagree 1 2 3 4 Agree
I understand why a standard setting is important.
Disagree 1 2 3 4 Agree
I have made decisions about student standards before.
Disagree 1 2 3 4 Agree
I feel qualified to make this kind of judgment.
Disagree 1 2 3 4 Agree
I feel I have the experience to make this kind of judgment.
Disagree 1 2 3 4 Agree
Standard setting is common sense.
Disagree 1 2 3 4 Agree
Appendix 7. Evaluation 2
PART II. Common European Framework    Panelist Number _____
I have completed the introduction to the CEFR.
Disagree 1 2 3 4 Agree
The group leader answered all of my questions.
Disagree 1 2 3 4 Agree
I understand the instructions so far.
Disagree 1 2 3 4 Agree
I understand the CEFR.
Disagree 1 2 3 4 Agree
I understand the differences between the levels of the CEFR.
Disagree 1 2 3 4 Agree
I understand the B1 level.
Disagree 1 2 3 4 Agree
The CEFR is a useful way to think about teaching English.
Disagree 1 2 3 4 Agree
I agree with the order of the CEFR levels.
Disagree 1 2 3 4 Agree
The CEFR is common sense.
Disagree 1 2 3 4 Agree
Appendix 8. Evaluation 3
PART III. Ming Chuan University Practical English Test    Panelist Number ______
I have completed the introduction to the Practical English Test.
Disagree 1 2 3 4 Agree
The group leader answered all of my questions.
Disagree 1 2 3 4 Agree
The practice test helped me understand more about the test.
Disagree 1 2 3 4 Agree
I understand the instructions so far.
Disagree 1 2 3 4 Agree
I understand item difficulty.
Disagree 1 2 3 4 Agree
I understand item proficiency scales.
Disagree 1 2 3 4 Agree
I understand scale scores.
Disagree 1 2 3 4 Agree
I understand why some items are more difficult than others.
Disagree 1 2 3 4 Agree
Appendix 9. Evaluation 4
Review of Standard Setting Procedures    Panelist Number ______
Please circle the answer that best shows your opinion
I have completed Angoff Standard Setting training.
Disagree 1 2 3 4 Agree
I understand Angoff Standard Setting.
Disagree 1 2 3 4 Agree
I understand the procedures of an Angoff Standard Setting.
Disagree 1 2 3 4 Agree
I understand the Common European Framework.
Disagree 1 2 3 4 Agree
I understand item difficulty.
Disagree 1 2 3 4 Agree
I feel qualified to perform an Angoff Standard Setting.
Disagree 1 2 3 4 Agree
I understand the B1 level of the CEFR.
Disagree 1 2 3 4 Agree
I feel ready to start the Angoff Standard Setting.
Disagree 1 2 3 4 Agree
I am ready to begin the Angoff Standard Setting.
____________________________________ (signature) _____________ (date)
Appendix 10. Evaluation 5
Angoff Standard Setting. Final Evaluation    (page 1 of 2)    Panelist Number ______
Please circle the answer that best shows your opinion
The training and practice exercises helped me understand how to perform the task.
Disagree 1 2 3 4 Agree
The time provided for discussions was adequate.
Disagree 1 2 3 4 Agree
There was an equal opportunity for everyone in my group to contribute his/her ideas and opinions.
Disagree 1 2 3 4 Agree
I was able to follow the instructions and complete the evaluation accurately.
Disagree 1 2 3 4 Agree
The discussions after the first round of ratings were helpful to me.
Disagree 1 2 3 4 Agree
The discussions after the second round of ratings were helpful to me.
Disagree 1 2 3 4 Agree
I am confident about the defensibility and appropriateness of the final recommended cut scores.
Disagree 1 2 3 4 Agree
The facilities and food service helped create a productive and efficient working environment.
Disagree 1 2 3 4 Agree
The information showing the distribution of student scores was helpful to me.
Disagree 1 2 3 4 Agree
TURN PAPER OVER
Angoff Standard Setting. Final Evaluation    (page 2 of 2)    Panelist Number ______
I found the discussion between rounds to be useful.
Disagree 1 2 3 4 Agree
I changed my scores between rounds.
Disagree 1 2 3 4 Agree
The discussion between rounds influenced me to change my score.
Disagree 1 2 3 4 Agree
The information about student performance influenced me to change my score.
Disagree 1 2 3 4 Agree
I am confident in my final decisions.
Disagree 1 2 3 4 Agree
I believe my final decisions are the best decisions I can make.
Disagree 1 2 3 4 Agree