Methods of Standard Setting

Natalia Gaponova


Introduction

• All standard setting methods involve expert judgemental decision making at some level... (Jaeger, 1979)

• There is no such thing as a true standard, but there is a theoretical cut-score that would be set by a judge if he or she totally understood the process, the test, the content, and the policy and had a true score on the test in mind as the standard. The question is whether the standard setting method can recover the theoretical cut-score assuming a judge performed every task consistently and without error (Reckase, 2000)

• Many different terms are used in the measurement literature to refer to performance standards: “passing scores”, “cut scores”, “cutoff scores”, “performance levels”, “achievement levels”, “mastery levels”, “proficiency levels”, “thresholds” and “standards” (Hambleton, 2001)


The importance of standard-setting

• The cut-score is crucial for all participants in testing

• It must be reasoned and fair

• Methods are needed that make this possible with mathematical precision


Interpretation of the mass-testing results

Participants in testing need

• to compare themselves with other examinees

• to estimate correctly and adequately their level of mastery of the material

Common solution: setting cut-scores and dividing examinees into groups in accordance with their ability level

Policy-makers

• are interested in the overall level of educational achievement, which should reflect the real situation in the schools and classes of a region


Why is it important to establish reasonable and fair cut-scores?

1. Those who conduct testing bear professional and ethical responsibility for the results they provide

2. The interpretation of the results should be understandable to any audience and should not provoke obvious disagreement

3. The interpretation of the results should reflect the real situation and be informative for policy-makers

4. The interpretation of the results should be unambiguous: examinees in one group should have genuinely different ability levels from examinees in another group



Classification of Standard-Setting Methods

• Test-centered

• Criterion-referenced

• Norm-referenced

• Examinee-centered


The most commonly used classification scheme nowadays is the one suggested by Jaeger (1989), who splits the standard setting methods into two large groups

Test-centered

• Angoff

• Ebel

• Nedelsky

• Jaeger

• Objective Standard Setting

• Bookmark

• Etc.

Examinee-centered

• Method of Contrasting Groups

• Method of Borderline Group

• Etc.


ANGOFF

Test-centered method


The Angoff method is one of the most preferred, widely and frequently used methods

Angoff variants: Traditional, Modified


Procedure of standard setting (traditional Angoff method)

Experts rate the probability that a barely or minimally satisfactory or qualified person would answer each test item correctly

For each item, these probabilities are averaged across judges or raters; the sum of the item averages is the cutoff score
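The calculation can be sketched as follows. This is only an illustration, assuming the common convention that each item's probabilities are averaged across judges and the item averages are then summed to a raw-score cutoff. The function name `angoff_cut_score` and the ratings are hypothetical and do not come from the slides.

```python
from statistics import mean


def angoff_cut_score(ratings):
    """Return a raw-score cutoff from Angoff ratings.

    ratings[j][i] is the probability, judged by expert j, that a minimally
    qualified examinee answers item i correctly (hypothetical layout).
    """
    n_items = len(ratings[0])
    if any(len(judge) != n_items for judge in ratings):
        raise ValueError("every judge must rate every item")
    # Average each item's probability across judges, then sum over items.
    item_means = [mean(item) for item in zip(*ratings)]
    return sum(item_means)


# Hypothetical ratings: 3 judges, 4 items.
example_ratings = [
    [0.6, 0.4, 0.8, 0.5],
    [0.7, 0.5, 0.9, 0.5],
    [0.5, 0.3, 0.7, 0.5],
]
print(round(angoff_cut_score(example_ratings), 2))  # 2.3 out of 4 items
```

Dividing the result by the number of items expresses the same cutoff as a proportion of the maximum score.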


Advantages and disadvantages

+

• Transparency and clarity

• Simplicity

• Flexibility

-

• Questionable objectivity: experts must judge the probability of a correct answer by a minimally competent examinee

• Only one round of rating: variable values (fluctuating rated probabilities)


EBEL

Test-centered method


Procedure of Standard Setting

• 2 rounds

• Experts independently classify test items by:

I. Level of difficulty: easy, medium, hard

II. Level of relevance: essential, important, acceptable, questionable


For each judge, every item is then classified into one of 12 cells in a 3×4 grid defined by the three difficulty and four relevance categories, as in the example:

A = number of items in a category; B = % of correctly performed items

                            Expert №3           Expert №4           Expert №5
Relevance     Difficulty    A     B     A*B     A     B     A*B     A     B     A*B
Essential     Easy          11    60    660     10    70    700     13    75    975
Essential     Medium        1     25    25      3     25    75      1     0     0
Essential     Hard          0     10    0       1     0     0       0     0     0
……
Questionable  Easy          0     0     0       0     0     0       0     0     0
Questionable  Medium        0     0     0       0     0     0       0     0     0
Questionable  Hard          0     0     0       0     0     0       0     0     0
Mean                        25.1                26.7                35

Mean for all experts: 28

Cut-score: 12


How to calculate a cut-score

• Judges indicate the percentage of items within each of the 12 cells that a student should answer correctly in order to be judged minimally competent

• Each item is assigned to one of the 12 cells based on the expert’s ratings

• The percent-passing judgment for a cell is multiplied by the number of items in the cell

• These products are summed over all 12 cells to get an overall passing score for a judge

• These passing scores are averaged over judges to get the composite passing score
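The steps above can be sketched in a few lines. This is a minimal illustration with hypothetical data, not the figures from the example table. It assumes each judge's grid is given as a mapping from a (relevance, difficulty) cell to a pair of item count and percent-passing judgment, and it divides by 100 so that the result is expressed in items; the function names are assumptions as well.

```python
def ebel_passing_score(cells):
    """Passing score for one judge.

    cells maps (relevance, difficulty) to (number_of_items, percent_correct).
    Empty cells can simply be left out.
    """
    # Multiply each cell's percentage by its item count and sum over cells.
    return sum(n_items * percent / 100
               for n_items, percent in cells.values())


def ebel_cut_score(judges):
    """Composite passing score: the average of the judges' passing scores."""
    scores = [ebel_passing_score(cells) for cells in judges]
    return sum(scores) / len(scores)


# Hypothetical judgments for a 20-item test (unlisted cells hold no items).
judge_a = {
    ("essential", "easy"): (10, 80),
    ("important", "medium"): (6, 50),
    ("acceptable", "hard"): (4, 25),
}
judge_b = {
    ("essential", "easy"): (8, 75),
    ("important", "medium"): (8, 50),
    ("acceptable", "hard"): (4, 25),
}
print(ebel_passing_score(judge_a))        # 12.0
print(ebel_passing_score(judge_b))        # 11.0
print(ebel_cut_score([judge_a, judge_b])) # 11.5
```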


Advantages and disadvantages

+

• Can be used with different types of items (not only multiple-choice)

-

• It may be challenging for standard setting participants to keep the two dimensions of difficulty and relevance distinct, because those dimensions may, in some situations, be highly correlated

• A validity concern arises from the judgments about item relevance: including items judged to be of questionable relevance appears, on its face, to weaken the validity evidence supporting a defensible interpretation of the total test scores


NEDELSKY

Test-centered


General concept

Nedelsky proposed considering the characteristics and performance of a hypothetical borderline examinee that he referred to as the “F-D student”. Responses (distractors) which the lowest D-student should be able to reject as incorrect, and which therefore should be attractive to [failing students] are called F-responses… Students who possess just enough knowledge to eliminate F-responses and must choose among the remaining responses at random are called F-D students.


Procedure of Standard Setting

• The experts independently determine the F-responses, which minimally competent examinees would be able to eliminate as incorrect

• The number of remaining options determines the probability that the candidate answers the question correctly: 1 plausible answer = 100%, 2 = 50%, 3 = 33%, 4 = 25%, and 5 = 20% probability of a correct answer


An example

• Participants judged that, for a certain five-option item, borderline examinees would be expected to rule out two of the options as incorrect, leaving them to choose from the remaining three options. The Nedelsky rating for this item would be 1/3 = 0.33. Repeating the judgment process for each item would give a number of Nedelsky values equal to the number of items in the test (n). The sum of the n values can be directly used as a raw score cut score. For example, a 50-item test consisting entirely of items with Nedelsky ratings of 0.33 would yield a recommended passing score of 16.5 (i.e., 50 × 0.33 = 16.5)
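The same arithmetic can be written as a short sketch. The function names are hypothetical, and each item is assumed to be described by its number of options and the number of options a borderline examinee can eliminate. Note that the unrounded rating 1/3 gives about 16.67 for the 50-item test, while the figure of 16.5 above comes from rounding each rating to 0.33 first.

```python
def nedelsky_rating(n_options, n_eliminated):
    """Probability that a borderline examinee guesses the item correctly
    after eliminating n_eliminated incorrect options."""
    remaining = n_options - n_eliminated
    if remaining < 1:
        raise ValueError("at least the correct option must remain")
    return 1 / remaining


def nedelsky_cut_score(items):
    """Raw-score cut score: the sum of the Nedelsky ratings of all items.

    items is a list of (n_options, n_eliminated) pairs, one per item.
    """
    return sum(nedelsky_rating(n, e) for n, e in items)


# 50 five-option items, two options eliminated on each (as in the example).
test_items = [(5, 2)] * 50
print(round(nedelsky_rating(5, 2), 2))            # 0.33
print(round(nedelsky_cut_score(test_items), 2))   # 16.67
print(round(50 * 0.33, 2))                        # 16.5, using rounded ratings
```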


Advantages and disadvantages

+

• The Nedelsky method has been used for many years to establish threshold scores. It has probably remained popular because the procedure is clear to experts: they can quickly decide which responses a minimally competent examinee would be able to eliminate as incorrect.

• It can be used without preliminary piloting of a test

-

• Can be used only with multiple-choice items

• Raters tend not to assign probabilities of 1.00 (i.e., to judge that a borderline examinee could rule out all incorrect response options). This tends to create a downward bias in item ratings (i.e., a rating of .50 is assigned to an item instead of 1.00), with the overall result being a somewhat lower passing score than the participants may have intended to recommend, and somewhat lower passing scores compared to other methods


BOOKMARK

Test-centered (based on Item Response Theory)


Essential materials


Standard Setting

Basic steps of the procedure

Round I: Experts are informed about the required number of cut-scores to establish. Experts work in small groups, and all the essential material is introduced to them

Round II: Overview of the cut-scores established by every expert, followed by a repetition of the same procedure as in the first step

Round III: Presentation of the percentage of students falling into each performance level and of each median cut-score from Round 2. After discussion, experts make individual judgments


Round 1

• The main goals are to get panelists familiar with the ordered item booklet, set initial bookmarks, and then discuss the placements.

• Panelists are asked to discuss and determine the content that students should master for placement into a given performance level.

• Their independent judgments of cut-scores are expressed by simply placing a bookmark between the items judged to represent a cut-point. One bookmark is placed for each of the required cut-points.

• Items preceding the participant's bookmark reflect content that all students at the given performance level are expected to know and be able to perform successfully with a probability of at least 0.67 or 0.50.


Round 2

• The first activity in Round 2 involves having each member place bookmarks in his/her ordered item booklet where each of the other panelists in their small group made their bookmark placement. For a group of 6 people, each panelist’s ordered booklet will have 6 bookmarks for each cut point.

• Discussions are then focused on the items between the first and last bookmarks for each performance level. Upon completion of this discussion, the panelists then independently reset their bookmarks. The median of the Round 2 bookmarks for each cut point is taken as that group’s recommendation for that cut-point.


Round 3

• The percentage of students falling into each performance level is presented, given each group’s median cut-score from Round 2.

• With this information of how students actually performed, the panelists discuss the bookmarks in the large group and then make their Round 3 independent judgments of where to place the bookmarks.

• The median for the large group is considered to be the final cut-point for a given performance level.
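The aggregation in Rounds 2 and 3 can be sketched as below. This is only an illustration with hypothetical bookmark placements. Bookmarks are assumed to be recorded as positions in the ordered item booklet for a single cut point; converting a position into a scale score is left out, and the function names are assumptions.

```python
from statistics import median


def group_recommendations(round2_bookmarks):
    """Round 2: the median bookmark of each small group for one cut point.

    round2_bookmarks maps a group name to its panelists' bookmark positions.
    """
    return {group: median(positions)
            for group, positions in round2_bookmarks.items()}


def final_cut_point(round3_bookmarks):
    """Round 3: the median bookmark over all panelists in the large group."""
    return median(round3_bookmarks)


# Hypothetical placements: two small groups of 6 panelists each.
round2 = {
    "group_1": [12, 14, 15, 15, 17, 18],
    "group_2": [13, 13, 16, 16, 18, 20],
}
# Hypothetical independent Round 3 placements of all 12 panelists.
round3 = [13, 14, 14, 15, 15, 15, 16, 16, 16, 17, 17, 18]

print(group_recommendations(round2))  # {'group_1': 15.0, 'group_2': 16.0}
print(final_cut_point(round3))        # 15.5
```

The median is less sensitive than the mean to a single panelist who places a bookmark far from the rest of the group.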


METHOD OF CONTRASTING GROUPS

Examinee-centered


Method of contrasting groups

• The procedure includes testing two groups of examinees

• The test score distributions of the examinees classified into each category are compared

• The cut-score is set at the point where the two distributions intersect

[Figure: score distributions of the competent and non-competent groups]
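One way to locate the intersection is sketched below. It assumes that the scores of each group are approximately normally distributed, which the slides do not state, and it finds the crossing point of the two fitted densities between the group means by bisection. The function name and the scores are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def contrasting_groups_cut(non_competent, competent, tol=1e-6):
    """Score at which the two fitted normal densities intersect.

    Assumes the non-competent group has the lower mean score.
    """
    low = NormalDist(mean(non_competent), stdev(non_competent))
    high = NormalDist(mean(competent), stdev(competent))

    def diff(x):
        # Positive where the non-competent density is the higher one.
        return low.pdf(x) - high.pdf(x)

    lo, hi = low.mean, high.mean
    if lo >= hi or diff(lo) <= 0 or diff(hi) >= 0:
        raise ValueError("no single crossing between the group means")
    # Bisection: keep the sign change of diff inside [lo, hi].
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if diff(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical test scores of examinees classified in advance.
non_competent_scores = [10, 12, 14, 16, 18]
competent_scores = [20, 22, 24, 26, 28]
cut = contrasting_groups_cut(non_competent_scores, competent_scores)
print(round(cut, 2))  # 19.0, midway between the means since the spreads are equal
```

With real data the two spreads usually differ, so the crossing point shifts away from the midpoint of the means.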



Advantages and disadvantages

+

• Can be used with any item type

-

• The objectivity of classifying students as competent or non-competent is doubtful


THANK YOU FOR YOUR ATTENTION

Your questions?
