Standard Setting In Medical Exams

29/11/2009

Standard Setting and Medical Students’ Assessment

Dr. Sanjoy Sanyal

Associate Professor – Neurosciences

Medical University of the Americas

Nevis, St. Kitts-Nevis, WI


Sanjoy Sanyal Modified 11/29/2009

List of topics

• Summative assessment

• Standard-setting

• Classification of standards

• Standard-setting models

– Test-centered models

– Examinee-centered models

• Modified Angoff approach

• The Hofstee method

• Evaluation of standards

• Future perspectives

• Conclusion

• References

What is Summative Assessment?

• In the context of a Caribbean medical

school training students to be future

doctors, summative assessment can be

interpreted as any of the following:

– End-point / end-semester assessment

– Certification examination

– Licensing examination

Reasons for Post-training Summative Assessment

1. Trainee motivation: Assessment drives

learning

2. Recognition of achievement

3. Rite of passage: Initiation to the profession

4. Reputation of the discipline

5. Patient safety

6. Quality marker for patients

Characteristics of Good Summative Assessment

VALIDITY*             RELIABILITY**    FEASIBILITY

Content validity      Accurate         Practicable

Construct validity    Consistent       Cost-effective

Predictive validity   Fair             Proportionate

Assessment Methods

Adapted from Roger Neighbour 2006

Factors Playing a Role in Recruitment of Examiners

Qualities

• Credibility

• Can ‘rank order’

• Trainable

• Impartial

• Team players

Incentives

• Status

• Influence

• Stimulation

• ‘Make a difference’

• Financial

Selection of Standard-setting Panelists

1. Experts in related field of examination

2. Familiar with examination methods

3. Good problem solvers

4. Familiar with level of candidates

5. Interested in education (teachers)

Establishing Standards – Policy Decision

Deciding who should pass or fail

should be a matter of policy

decision rather than a statistical

exercise

Why Is Standard Setting Needed?

• To provide an educational tool for deciding the cut-off point on the scoring scale that separates non-competent from competent candidates

• To determine the standards of performance that separate the competent from the non-competent candidate

Pertinent Questions Regarding Standard Setting

• What is the main purpose of assessment?

• What is at stake?

– For students

– For patients

– For organization

• Who has an interest in the outcome?

• What message do we wish to convey?

• What may be the effect of high / low pass rate?

Pertinent Questions Regarding Standard Setting

• What are the rules of combination in a multi-component examination?

• Who should set the standards?

– Examiners?

– Clinical practitioners?

– Patients?

• Should the standards be absolute or relative?

• What happens to those who fail under the current standards?

• Is there an appeals procedure?

Qualities of Good Standards for Assessments

• Transparent marking and standard-setting process

• High reliability indices (Cronbach’s α > 0.8; Cohen’s κ > +0.4)*

• Corrections for test variance and Error of Measurement

• Low examiner variability (recruitment, training, feedback)

• Fair appeals procedure
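The Cronbach's α threshold quoted above can be checked directly from item-level marks. The sketch below is illustrative only: the function name and the tiny score matrix are assumptions, and it uses the standard formula α = k/(k−1) × (1 − Σ item variances / variance of total scores).

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a score matrix (hypothetical helper).

    item_scores: one row per candidate, one column per test item.
    """
    k = len(item_scores[0])  # number of items
    # Variance of each item across candidates
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of candidates' total scores
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 4 candidates, 3 items scored 0/1
marks = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
print(round(cronbach_alpha(marks), 2))
```

With real examinations the matrix would hold hundreds of candidates; a value below 0.8 would suggest the test is not consistent enough for a high-stakes decision.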

Educational Benefits of Standard Setting

• Faculty development

• Quality control of test materials

Standards – Classification

• Norm-referenced standards vs. Criterion-referenced standards

• Compensatory standards vs. Conjunctive standards

Orthogonal bipolar standards

Norm-referenced Standards

• Standard is based on performance of an

external large representative sample (‘Norm

group’) equivalent to candidates taking the

test

• May result in reasonable standards provided

the group is representative of candidates’

population, heterogeneous and large

Criterion-referenced Standards

• Links the standard to a set criterion of the

competence level under consideration

• Can be:

– Relative criterion standard

– Absolute criterion standard

Relative Criterion Standard

• A relative standard can be set at the mean

performance of candidates;

• Or by defining the units of SD from mean

• These standards may vary from year to year

due to shifts in ability of the group

• May result in a fixed annual percentage of

failing students
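As a rough illustration of a relative standard, the cut score can be computed as the cohort mean minus a chosen number of standard deviations. The function name, the `sd_units` parameter and the scores below are hypothetical; the slide does not prescribe a particular number of SD units.

```python
from statistics import mean, stdev


def relative_cut_score(scores, sd_units=1.0):
    """Cut score placed sd_units sample SDs below the cohort mean."""
    return mean(scores) - sd_units * stdev(scores)


# Hypothetical cohort of percentage scores
cohort = [52, 61, 58, 70, 66, 49, 75, 63]
cut = relative_cut_score(cohort, sd_units=1.0)
failures = [s for s in cohort if s < cut]
print(round(cut, 1), failures)
```

Because both the mean and the SD come from the cohort itself, the cut score moves whenever the ability of the group shifts, which is the year-to-year variation noted above.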

Absolute Criterion Standard

• Absolute criterion standard stays same over

multiple administrations of the test, relative

to the content specifications of the test

• Failure rate may vary due to changes in the

group’s ability, from one test administration

to the other

Compensatory Standards

• The standard is set on the total test score

• Candidates can compensate for poor performance in some parts of the exam with good performance in others

Conjunctive Standards

• Standards are set for individual components of the examination

• Candidates cannot compensate for poor performance in one part

– Each skill component is considered separately

– Allows diagnostic feedback to candidates

– The higher the correlation among test components, the greater the inclination towards a compensatory standard

Standard-setting Models

• Test-centered models: Judges review test

items and provide judgments as to ‘just

adequate’ level of performance on these

items

• Examinee-centered models: Judges identify

(and sort) an actual (not hypothetical) group

of examinees

Test-centered Models

• Angoff model

• Ebel’s approach

• Nedelsky approach

• Jaeger’s method

Angoff Model

• A judgemental approach

• Group of expert judges make judgements

about how borderline candidates would

perform on items in the examination

• Details described later…

Ebel’s Approach

• Judges categorise items in a test according

to levels of difficulty and relevance to the

decision to be made

• Then they decide on proportion of items in

each category that a hypothetical group of

examinees could respond to correctly
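One common way to turn Ebel-style judgments into a passing score is a weighted average: each category's judged proportion correct is weighted by the number of items in that category. The sketch below assumes that reading; the function name and the category figures are hypothetical.

```python
def ebel_cut_score(categories):
    """Passing score as a proportion of all items.

    categories: list of (number_of_items, judged_proportion_correct),
    one entry per difficulty-by-relevance category.
    """
    total_items = sum(n for n, _ in categories)
    expected_correct = sum(n * p for n, p in categories)
    return expected_correct / total_items


# Hypothetical panel output: (items in category, proportion answered correctly)
panel = [(10, 0.9),   # easy, essential
         (20, 0.6),   # medium, important
         (10, 0.3)]   # hard, acceptable
print(round(ebel_cut_score(panel), 2))
```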

Nedelsky Approach

• Originally designed for multiple choice

items

• For each item, judges decide on how many

of the distractors (response options) a

minimally competent examinee would

recognise as being incorrect
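In the usual formulation of the Nedelsky approach, the minimally competent examinee is assumed to guess at random among the options that remain after eliminating the recognised distractors, so each item contributes 1 / (remaining options) to the expected score. A minimal sketch under that assumption, with hypothetical names and data:

```python
def nedelsky_cut_score(items):
    """Expected raw score of a minimally competent examinee.

    items: list of (total_options, distractors_eliminated) per MCQ item.
    """
    expected = 0.0
    for total_options, eliminated in items:
        if not 0 <= eliminated < total_options:
            raise ValueError("cannot eliminate the correct option")
        remaining = total_options - eliminated
        expected += 1 / remaining  # chance of guessing correctly
    return expected


# Hypothetical judgments for three items
judged = [(5, 3), (5, 4), (4, 0)]
print(nedelsky_cut_score(judged))  # expected items correct out of 3
```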

Jaeger’s method

• Emphasises the need to sample all

populations that have a legitimate interest in

outcomes of competency testing

• Focuses on passing examinees rather than

on borderline or minimally competent

Examinee-centered Models

• Borderline-group method

• Contrasts-by-group approach

• Hofstee method

Borderline-group method

• Judges identify an actual (not hypothetical)

borderline group

• The median score for this group is used as

the passing score
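The borderline-group calculation is simple enough to sketch in a few lines. The label strings and the scores below are hypothetical; the only step taken from the slide is using the median score of the judged borderline group.

```python
from statistics import median


def borderline_group_cut_score(scores, judgments):
    """Median test score of the examinees judged 'borderline'."""
    borderline = [s for s, j in zip(scores, judgments) if j == "borderline"]
    return median(borderline)


# Hypothetical test scores and the judges' global ratings of the same examinees
scores = [45, 52, 58, 61, 70]
judgments = ["fail", "borderline", "borderline", "borderline", "pass"]
print(borderline_group_cut_score(scores, judgments))
```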

Contrasts-by-Group approach

• Panellists sort examinees into 2 groups: competent and not-competent

– This judgement is based on prior characteristics of examinees rather than the current test scores

– Test scores are not known to panellists during the sorting process

• After sorting is completed, score distributions for the competent and not-competent groups are plotted

• The point of intersection of the two distributions is taken as the passing score
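Reading the intersection off a plot is hard to automate with small samples, so the sketch below swaps in a related numerical rule: it picks the observed score that misclassifies the fewest examinees, which lies near the crossing point when the two groups are of similar size. This is an approximation, not the graphical procedure itself, and all names and data are hypothetical.

```python
def contrasting_groups_cut_score(competent, not_competent):
    """Observed score that minimises total misclassification.

    A candidate passes if score >= cut. Ties go to the lowest such score.
    """
    candidates = sorted(set(competent) | set(not_competent))

    def misclassified(cut):
        false_fail = sum(s < cut for s in competent)       # competent but failed
        false_pass = sum(s >= cut for s in not_competent)  # not competent but passed
        return false_fail + false_pass

    return min(candidates, key=misclassified)


# Hypothetical scores for the two groups sorted by the panellists
competent = [60, 65, 70, 75, 80]
not_competent = [40, 45, 50, 55, 62]
print(contrasting_groups_cut_score(competent, not_competent))
```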

Hofstee Method

• A standard setting approach that

incorporates advantages of both relative

and absolute standard setting procedures

• Details described later…

Two Common Standard-Setting Procedures

• Modified Angoff procedure

– A Test-centered model

– Judgmental approach

– Suitable for MCQ examinations

• The Hofstee method

– An Examinee-centered model

– Compromise relative/absolute method

– Suitable for overall pass/fail decisions

– Approved by USMLE

Modified Angoff Procedure

• Judges discuss characteristics of a borderline candidate ‘only just good enough to pass’

• They make judgements about borderline candidate’s likelihood to respond correctly to each test item

• For each test item, judges estimate % of borderline candidates that is likely to answer the item correctly

• The pass / fail standard is the average of these percentages across all items (and all judges)
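The arithmetic of the procedure can be sketched as follows, assuming each judge supplies one percentage per item and that judges are weighted equally. The function name and the ratings are hypothetical.

```python
from statistics import mean


def angoff_cut_score(ratings):
    """Pass mark (%) from a judges-by-items matrix of Angoff estimates.

    ratings[j][i]: judge j's estimate of the % of borderline candidates
    who would answer item i correctly.
    """
    # Average across judges for each item, then across items
    item_means = [mean(item) for item in zip(*ratings)]
    return mean(item_means)


# Hypothetical estimates from two judges on three items
ratings = [
    [60, 40, 80],
    [70, 50, 90],
]
print(angoff_cut_score(ratings))
```

In the modified procedure the judges typically discuss divergent estimates and re-rate, so the same function would be run on the second-round ratings.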

The Hofstee Method

• This takes advantage of both relative and absolute standard-setting procedures and arrives at a compromise between the two

• A reference group of judges agrees on the following:

– Lowest acceptable fail rate (A)

– Highest acceptable fail rate (B)

– Lowest permissible passing grade (C)

– Highest required passing score (D)
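In the usual graphical form of the method, a straight line is drawn from the point (C, B) to the point (D, A), and the passing score is where the observed cumulative fail-rate curve crosses that line. The sketch below assumes that construction and reads D as the highest passing score the judges would require (an assumption about the slide's wording); the function name, the grid step and the data are hypothetical.

```python
def hofstee_cut_score(scores, min_fail, max_fail, min_cut, max_cut, step=0.5):
    """Compromise cut score within the judges' bounds.

    min_fail, max_fail: acceptable fail rates A and B (as fractions).
    min_cut, max_cut:   acceptable passing scores C and D.
    """
    n = len(scores)
    steps = int(round((max_cut - min_cut) / step))
    for k in range(steps + 1):
        cut = min_cut + k * step
        # Fail rate that would actually result from this cut score
        observed_fail = sum(s < cut for s in scores) / n
        # Height of the line from (C, B) down to (D, A) at this cut score
        diagonal = max_fail + (min_fail - max_fail) * (cut - min_cut) / (max_cut - min_cut)
        if observed_fail >= diagonal:
            return cut  # first grid point where the curve meets the line
    return max_cut  # curve never reaches the line: use the upper bound D


# Hypothetical cohort and judges' limits: A = 10%, B = 40%, C = 50, D = 70
cohort = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
print(hofstee_cut_score(cohort, 0.10, 0.40, 50, 70, step=1))
```

If the cohort is so weak that the fail rate already exceeds B at score C, the function returns C; if it is so strong that the fail rate never reaches the line, it returns D.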

The Hofstee Method

Adapted from Roger Neighbour 2006

Evaluation of Standards

• Standard setting process should be evaluated

• Evaluation includes data on 1st and 2nd ratings of panellists for each test item rated

• This should demonstrate increased consensus among raters (Cohen’s κ inter-rater reliability)

• A questionnaire should be administered to

panellists at end of standard setting process
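Cohen's κ compares the observed agreement between two raters with the agreement expected by chance. The sketch below assumes the panellists' ratings have been reduced to categories (Angoff percentages would first need banding, which is not shown); the function name and the data are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical ratings of the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal category frequencies
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of four items by two panellists
first = ["easy", "easy", "hard", "hard"]
second = ["easy", "hard", "hard", "hard"]
print(cohen_kappa(first, second))
```

Computing κ separately for the first and second rating rounds would show whether consensus among raters has increased.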

Future Perspectives

• Much work is still needed to establish

effective standard setting procedures

• Length of procedures should be considered

• Ways to shorten the process are needed

• Fully compensatory models should be

considered, in which test items are

averaged to produce a test standard

Future Perspectives

• Obtained standards should be checked

against other information available on the

test-taker to ensure construct validity

• Effective methods of training panellists to

recognise borderline characteristics are

essential if Angoff approach is widely used

Conclusion

• The more standard setting procedures are applied to a variety of tests,

• the more the practice of high quality testing will be enhanced, and

• the higher will be the confidence in the testing of professional competencies

References

• Neighbour, Roger. Summative assessment and standard setting. www.jafm.org/edu/20060128/sem4_060129.pdf

• Friedman Ben-David. Standard setting in student assessment – An extended summary of AMEE Medical Education Guide No 18. Medical Teacher (2000) 22, 2, pp 120-130. www.medev.ac.uk/resources/features/AMEE_summaries/Guide18summaryMar04.pdf
