
Version 1.2
April 2007

Guide to Writing Objective Tests

A guide to writing selected response questions and creating objective tests. Produced as part of the E-only project.


INTRODUCTION TO SELECTED RESPONSE QUESTIONS

This guide was written as part of the E-Only project, which sought to develop SQA’s first (fully) online qualification. A package of support materials was developed as part of that project – and this document is part of that package.

It was written after an extensive literature review of UK and international (particularly US) publications relating to objective testing. It is not a procedural guide for SQA appointees. Specific sectors and subjects will have their own procedures for producing objective tests and this guide does not seek to replace that guidance. However, it does aim to describe best practice in the construction of objective tests from a generic perspective.

This document has gone through a number of revisions since the first draft version was written in July 2006. Special thanks to everyone who took the time to contribute through SQA Academy (http://www.sqaacademy.com). The document is revised frequently. An online forum is available to discuss its contents at the following URL, where the latest version of the Guide can be found:

http://groups.google.com/group/objectivetesting

Bobby Elliott ([email protected])

April 2007

PURPOSE OF THIS GUIDE

Although most SQA units employ “conventional” assessment, some subject areas (mostly Science-related) have a tradition of using objective tests. For example, Biology uses multiple choice questions at Intermediate, Higher and Advanced Higher levels; and HNC Computing uses an objective test as part of the Graded Unit.

More recently, there has been greater emphasis on objective testing due to its suitability for computer-based assessment; as a result, an increasing number of unit specifications (at both National and Higher National levels) involve an element of objective testing.

This guide will be of assistance to any SQA Officer (or appointee) involved in creating objective tests. It has three objectives:

1. to provide advice about the construction of objective questions;
2. to explain how to combine questions into an objective test;
3. to provide guidance on authoring items.

A subsidiary objective is to standardise our vocabulary. Objective testing is a technical area with lots of jargon – some of which is used inconsistently. This guide is the result of a wide-ranging literature review and seeks to harmonise our terminology with that used internationally.


Although some topics (such as item banking) overlap with computer-assisted assessment (CAA), this guide focuses on the production of paper-based objective tests – although much of the advice is directly transferable to a CAA environment.

The document has eight sections:

Section 1  Introduction to selected response questions
Section 2  Types of selected response questions
Section 3  Choosing selected response questions
Section 4  Writing multiple choice questions
Section 5  Writing questions for higher level skills
Section 6  Item analysis
Section 7  Constructing tests
Section 8  Dealing with guessing.

While the focus of this guide is objective questions, it does not seek to promote one type of assessment over another. Traditional forms of assessment remain as valid today as they have ever been – but, where appropriate, objective approaches have a role to play too. Neither does it seek to explain what you already know. Most SQA staff have a good knowledge of objective testing - this guide simply seeks to provide a single source of advice for busy Officers and appointees.

There are no rules for writing objective tests; there’s only advice. Do whatever you think is right for your particular test. Assessment is an art – not a science. There is no substitute for human judgement.

MISCONCEPTIONS ABOUT OBJECTIVE TESTING

Although this document does not seek to promote one type of assessment over another, it does aim to dispel commonly-held, but inaccurate, views about objective testing. Some of the most common misconceptions are rehearsed below.

1. Objective tests dumb-down education; objective tests are easy. Objective tests are as “dumb” or “smart” as you choose to make them. Many high stakes tests (such as university medical examinations in the UK and the SAT in the United States) use objective tests.

2. Objective tests can only be used to assess basic knowledge. While this is largely true in practice, there is nothing inherent in the design of objective tests to make them unsuitable for assessing high level skills.

3. Objective tests encourage guessing. The problem of guessing can be resolved through one of a number of recognised techniques.

4. Writing an objective test is easy. While most teachers can create simple objective tests, the construction of high quality objective questions is highly skilled and requires significant knowledge and experience.

5. Objective testing is only fashionable because of e-assessment. It’s true that objective tests are well suited to computer-assisted assessment – but they are also valid and reliable forms of assessment in their own right.

6. Objective tests aren’t appropriate for my subject. While objective tests have traditionally been used in the physical and social sciences (such as Physics and Psychology), they can be used in any subject.

QUESTION TYPES

SQA has traditionally employed a variety of question types within Unit and Course assessment. These question types can be categorised under two headings:

• constructed response questions
• selected response questions.

Note: Some of the terminology in this guide might not be familiar to you. It has been used because it is widely employed in international testing literature and it was considered best to use “industry standard” nomenclature rather than “Scottish” terminology.

CONSTRUCTED RESPONSE QUESTIONS

Constructed response questions (also known as “open-ended” questions) are questions that require the candidate to create (“construct”) an answer. Examples of constructed response questions (CRQs) include short answer questions and essays. CRQs can be sub-divided into two sub-categories:

• restricted response questions
• extended response questions.

A restricted response question (RRQ) is a question whose answer is limited to a few words. Examples of RRQs include complete-the-sentence, missing word and short answer questions (see Example 1).

Example 1 ~ Constructed response question

Translate “Good morning mother” into Spanish.

Write here:

Figure 1 - RRQs and ERQs


An extended response question (ERQ) is one whose answer requires the candidate to write longer responses, normally consisting of two or more paragraphs. Examples of ERQs include reports, essays and dissertations. There is no hard-and-fast rule about where a restricted response question ends and an extended response question begins.

Note that many SQA assessments use a combination of restricted response questions and extended response questions. Some question papers have two sections, one employing RRQs and the other using ERQs.

SELECTED RESPONSE QUESTIONS

A selected response question (SRQ) is a question whose answer is pre-determined and involves the candidate choosing (“selecting”) the response from a list of options. Because the answer is pre-determined and there is only one correct answer, these types of questions are often referred to as “objective” questions. Examples of SRQs include true/false, multiple choice and matching questions.

Example 2 ~ Selected response question

The capital of the United States is New York. True/False

SQA’s question papers typically consist of constructed response questions. Lower levels (up to SCQF level 4/5) generally use restricted response questions and higher levels (SCQF level 5 and up) generally employ extended response questions (although sometimes a mixed approach is used). A limited number of subjects employ selected response questions.

This guide focuses on SRQs, which are becoming increasingly popular for a variety of reasons.

ADVANTAGES OF SRQS

1. SRQs take less time to answer – reducing the amount of time that candidates spend on assessment and increasing learning time.

2. SRQs are quick to mark – reducing the time teachers spend on assessment and increasing teaching time.

3. SRQs are well suited to formative assessment – since candidates’ responses can be analysed and used to provide detailed feedback.

4. SRQs are good for assessing breadth of knowledge - they are ideal for assessing a broad range of topics in a short time.

5. SRQs are more reliable than CRQs - because they get around some of the marking problems associated with written answers.

6. SRQs are well suited to computer-assisted assessment - and facilitate item banking.

The low writing load of SRQs means that the focus is on the candidate’s knowledge rather than the candidate’s writing or language skills – which is a common problem with constructed response questions. Also, the speed of answering SRQs addresses another common criticism of assessment – that it takes up too much time for both students and teachers.

Research into the marking of CRQs and SRQs has shown significant differences in the reliability of the two approaches – with objective tests proving to be significantly more reliable than written tests. This has been the major reason for the widespread adoption of objective tests in the United States, where testing organisations operate in a more litigious environment.

The compatibility of objective tests with computer-assisted assessment is a major driver for the renewed popularity of objective testing. SQA, along with other awarding bodies, is in the process of building banks of questions (“item banks”) which can be computerised and delivered to candidates over the Internet.

DISADVANTAGES OF SRQS

1. SRQs are not suitable for assessing certain abilities, such as communication skills or creativity. They are also not appropriate when candidates are required to construct an argument or provide an original response.

2. SRQs may be less valid than CRQs and suffer from low professional credibility.

3. SRQs that assess higher order skills are difficult (and time consuming) to produce.

4. SRQs can be wordy and require high order reading skills.

The first and second disadvantages are linked. There is nothing inherent in the design of SRQs to make them less valid than CRQs – but because they have often been used inappropriately (to measure skills that cannot be properly measured by this style of question) they have established a reputation for being invalid among some practitioners.

Most teachers are comfortable with using SRQs to assess low order skills (such as factual recall, typified by Example 2). They are less comfortable with their use in assessing deeper knowledge and understanding. Most currently available examples of SRQs re-affirm this view by focussing on the assessment of surface knowledge; even examples of SRQs that are meant to assess deeper knowledge often only assess surface knowledge – albeit less well known surface knowledge!

Traditionally, the costs of carrying out assessment come at the end of the process – the setting of the question paper is relatively speedy; the time consuming part comes when the papers have to be marked. Objective tests reverse this model – the time consuming bit is the production of the questions, with marking taking very little time. It is, therefore, something of a culture shock to move from traditional assessment to objective testing.

Another criticism of SRQs is that they can atomise teaching and learning, encouraging “teaching to the test” and surface learning. This, combined with their efficiency in assessing large numbers of students in short periods of time, has resulted in them acquiring a reputation as “weapons of mass instruction”, with poor standing among many educationalists.

USES OF SELECTED RESPONSE QUESTIONS

As previously mentioned, objective tests are used in a number of SQA summative assessments (such as Higher Physics and some HN units). This style of assessment is well suited to rapid, focussed assessment and is traditionally employed to assess factual recall and basic understanding. It is less commonly used to assess deeper knowledge and understanding, and there are few examples (within SQA or elsewhere) of objective tests being used to assess higher level skills. When used summatively, objective testing tends to be used for low-stakes assessment rather than high stakes assessment, which largely remains the preserve of constructed response questions. However, some subjects (such as Advanced Higher Biology) do employ objective testing and Higher Education has a long tradition of using objective testing for high-stakes summative purposes in some fields (such as Medicine).

Objective testing is well suited to formative assessment since it is quick to administer and assess (lack of time is often cited as the main reason for not using formative assessment). It is particularly suited to diagnostic assessment since it can be used to identify specific misunderstandings or weaknesses.

Historically, objective testing has been widely used for psychometric testing (testing of intellect and attitudes) and, more recently, it has been widely applied to job competence testing. It is also used in entry examinations for some professional bodies (such as ACCA).

Objective tests are widely used internationally – including high stakes assessments such as the SAT in the United States, which is used for university entry. They are also widely used within vendor examinations (such as Microsoft’s global certification programme). Awarding bodies in every country are focusing on computer-assisted assessment, which has resulted in a renewed interest in objective testing. These organisations share the view that the increasing popularity of e-learning will drive demand for e-assessment – which will be underpinned by item banks consisting of large numbers of selected response questions.


TYPES OF SELECTED RESPONSE QUESTIONS

There are several types of selected response questions (SRQs). Although they share some common characteristics, they each have unique features and applications. But they all share a fundamental characteristic – they have one unambiguously correct answer.

TYPES OF SELECTED RESPONSE QUESTIONS

There are seven types of SRQ. These are:

1. true/false questions
2. matching questions
3. multiple choice questions (MCQ)
4. multiple response questions (MRQ)
5. ranking/sequencing questions
6. assertion/reason questions
7. Likert scale questions.

Each type of SRQ is now described and exemplified.

Note: This section simply introduces each type of question. It does not aim to explain how or when to use them.

TRUE/FALSE QUESTIONS

A true/false question (T/F) is a statement (not a question!) that is either true or false. The candidate must select one of two possible responses - “true” or “false”. Because candidates have a 50/50 chance of answering these questions correctly, this type of question is considered “easy” and is associated with low order knowledge. However, true/false questions can assess higher order skills; and setting an appropriate pass mark can eliminate the effects of guessing.

Example 3 ~ True/false question

(x+1) is a factor of x² + 2x - 3. True/False
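The pass mark point above can be made concrete with a little arithmetic. The sketch below (Python; an illustration added here, not part of the original guidance) computes the probability that a candidate who guesses every item on a true/false test reaches a given pass mark, using the binomial distribution:

from math import comb

def p_pass_by_guessing(n_items: int, pass_mark: int, p_correct: float = 0.5) -> float:
    """Probability that blind guessing scores at least pass_mark out of n_items."""
    return sum(comb(n_items, k) * p_correct**k * (1 - p_correct)**(n_items - k)
               for k in range(pass_mark, n_items + 1))

print(p_pass_by_guessing(20, 10))  # ~0.588: a 50% cut-off offers little protection
print(p_pass_by_guessing(20, 15))  # ~0.021: a 75% cut-off makes guessing a poor strategy

Raising the pass mark, or lengthening the test, quickly squeezes out the guesser.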


Note: Any question that has one of two possible answers is considered a true/false question (for example, the responses might be “yes” or “no” rather than “true” or “false”). These questions are also known as “alternative response” items.

MATCHING QUESTIONS

This type of question requires candidates to match an object with one or more associated characteristics.

The objects on the left are called “stimulators” and the matching statements on the right are called “responses”. No more than seven stimulators should be included in any one question.

This type of question is often used to assess candidates’ knowledge of the characteristics of certain objects. It is particularly well suited to computer-based assessment since it can be implemented as drag-and-drop (dragging each response onto an associated stimulator).

Example 4 ~ Matching question

Match the list of storage technologies on the left with the list of memory characteristics on the right. Match each technology (A, B, C or D) with one characteristic (1, 2, 3 or 4) only.

A. Hard disk        1. Non-volatile
B. Flash memory     2. Volatile
C. RAM              3. High capacity
D. ROM              4. Low cost

A. ____   B. ____   C. ____   D. ____
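When such an item is delivered on a computer, scoring reduces to comparing the candidate’s pairings against a stored key. A minimal sketch (Python; the dictionary format and the key below are invented for illustration and are not the answer to Example 4):

def score_matching(key: dict[str, str], response: dict[str, str]) -> int:
    """Count stimulators that were matched to the keyed response."""
    return sum(1 for stimulator, choice in response.items()
               if key.get(stimulator) == choice)

# Hypothetical key and candidate response for a four-pair matching item.
key = {"A": "3", "B": "4", "C": "2", "D": "1"}
response = {"A": "3", "B": "1", "C": "2", "D": "4"}
print(score_matching(key, response))  # 2 pairs matched correctly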


MULTIPLE CHOICE QUESTIONS

A multiple choice question (MCQ) consists of a question (or incomplete statement) followed by a list of possible responses from which candidates must select one. There are normally three to five options, with four being the most common.

Example 5 ~ Multiple choice question

In psychiatry, holding two contradictory views about the same thing is called:

A cognitive dissonance

B dementia

C dissociative disorder

D factitious disorder

Note that a multiple choice question with two options is effectively a true/false question. Or, to put it more accurately, a T/F question is a multiple choice question with two options.

MCQs are the most common type of selected response question – and the one that this guide focuses on in later sections.

MULTIPLE RESPONSE QUESTIONS

A multiple response question (MRQ) is similar to a multiple choice question (MCQ) but has two or more correct responses (as opposed to an MCQ’s single correct response).

Example 6 ~ Multiple response question

Which of the following statements about earthquakes is/are true?

A An earthquake generates seismic waves.

B The boundary of tectonic plates is called the fault plane.

C The point of origin of seismic waves is called its epicentre.

D The severity of an earthquake is measured by its magnitude and intensity.


There are some misconceptions about MRQs. They are not necessarily more difficult than MCQs; they are as hard or as easy as you choose to make them. There is no need to indicate the number of correct options; this only encourages guessing. And there is nothing wrong with making every option correct; in fact, prohibiting this possibility reduces the reliability of MRQs.

Note that MCQs normally begin: “Which one of the following…”, and MRQs usually begin: “Which of the following…”.

RANKING QUESTIONS

A ranking question involves ordering the options in some defined sequence. The sequence can be an ordered list of numbers, chronological sequence or series of events.

Example 7 ~ Ranking question

Rank the following countries in order of their population densities (lowest density first).

I   France
II  Germany
III Spain
IV  United Kingdom

A ____   B ____   C ____   D ____

Ranking questions are easily implemented by computers using drag-and-drop.
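Scoring a ranking question also invites a design choice: all-or-nothing marking, or partial credit for a nearly-right order. The sketch below (Python; the scoring rule and the keyed order are illustrative assumptions, not SQA practice) awards the fraction of pairs placed in the correct relative order:

from itertools import combinations

def ordered_pairs_correct(key: list[str], response: list[str]) -> float:
    """Fraction of item pairs that appear in the same relative order as the key."""
    pos = {item: i for i, item in enumerate(response)}
    pairs = list(combinations(key, 2))   # every pair, taken in keyed order
    hits = sum(1 for a, b in pairs if pos[a] < pos[b])
    return hits / len(pairs)

key = ["Spain", "France", "Germany", "United Kingdom"]       # illustrative key only
response = ["France", "Spain", "Germany", "United Kingdom"]  # two items transposed
print(ordered_pairs_correct(key, response))  # 5 of 6 pairs correct, ~0.83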

ASSERTION/REASON QUESTIONS

This type of question consists of a statement (assertion) and a possible explanation (reason). Candidates must decide if the assertion and reason are true, and whether the reason is a correct explanation of the assertion.


Example 8 ~ Assertion-reason question

The following assertion and reason relate to World War II. Read the assertion and associated reason and then choose a corresponding letter (A-E) to indicate whether the assertion and/or reason is/are true.

Assertion: Japan’s lack of raw materials was a cause of World War II in Asia.
Reason: Japan lacked natural raw materials except for small deposits of coal and iron.

A The assertion is true and the reason is true, and the reason is a correct explanation of the assertion.
B The assertion is true and the reason is true, but the reason is not a correct explanation of the assertion.
C The assertion is true but the reason is false.
D The assertion is false but the reason is true.
E The assertion is false and the reason is false.

Assertion-reason questions are similar to multiple true-false questions.

LIKERT SCALE QUESTIONS

This type of SRQ was named after Rensis Likert, who invented the scale in 1932. It is widely used within questionnaires to gauge respondents’ attitudes. The classic Likert scale consists of five possible responses:

1. Strongly disagree

2. Disagree

3. Neither agree nor disagree

4. Agree

5. Strongly agree

Some psychometricians add or remove options (the neutral option – “neither agree nor disagree” – is often removed).


Example 9 ~ Likert Scale question

My manager supports me when necessary but otherwise allows me to work without interference.

A Strongly disagree.

B Disagree.

C Neither agree nor disagree.

D Agree.

E Strongly agree.

This type of SRQ is almost exclusively used for attitudinal assessments and is rarely employed within formal SQA assessments. It is not discussed further in this guide.

BEST ANSWER AND EXCEPTIONS

Although the existence of a single, unambiguous, correct response is a fundamental feature of SRQs, the usefulness of SRQs can be extended through “best answer” and “exception” type questions. These techniques increase the flexibility of SRQs at the expense of some of their objectivity.

BEST ANSWER QUESTIONS

A “best answer” question is one whose answer is the closest (“best”) answer selected from a list of possible answers of which more than one may be true. Used carefully, best answer questions can be almost as objective as standard SRQs.

Example 10 ~ Best answer question

A user wishes to use a search engine to look for information relating to Celtic music that originated in Scotland. Which one of the following queries is likely to produce the best results?

A Celtic music Scotland

B “Celtic music” Scotland -football

C Scotland +celtic +music +originate

D “Celtic music that originated in Scotland”


Note that more than one of the responses is correct (in fact, they are all more-or-less correct). But only one option is the best answer (B).

The use of best answer questions is particularly appropriate to the social sciences and arts subjects, which tend not to have a definitive body of knowledge like the physical sciences. Best answer questions can also be used to assess some higher order skills since they frequently require an element of judgment.

EXCEPTION QUESTIONS

An exception question is one where all of the options are correct except one of the possible responses. This type of question effectively reverses the logic of the standard SRQ.

Example 11 ~ Exception question

Smoking is a contributory factor in the following conditions EXCEPT:

A diabetes.

B heart disease.

C lung cancer.

D Parkinson’s disease.

A question that includes “not” in the stem is effectively an exception question. For example, the above question could be re-phrased: “Which one of the following conditions is NOT caused by smoking?”.

Exception (and negative) questions are not ideal – but should not be completely avoided since their use can simplify questions and/or increase the number of questions that can be asked.

VARIANTS & CLONES

A question that assesses the same content as another question is known as a variant. The stems of variants are worded differently and the options may be different – but, fundamentally, variants assess the same learning objective.

A question that is (almost) identical to another question is known as a clone. Clones differ only in their variables. For example, the question below is a clone of Example 3, the only difference being the expression to be factorised.


Example 12 ~ Clone

(x+2) is a factor of x² + 2x - 3. True/False

Variants and clones have significant implications for e-assessment since they provide a quick and simple way of rapidly populating an item bank.
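As a concrete illustration, a clone family in the style of Examples 3 and 12 could be generated mechanically by varying the numbers in a template. The sketch below (Python; illustrative only, no SQA tooling is implied) builds factorisation true/false clones together with their computed keys:

import random

def sign_fmt(n: int) -> str:
    """Render a constant as '+ n' or '- n' for display inside an expression."""
    return f"+ {n}" if n >= 0 else f"- {-n}"

def make_factor_clone() -> dict:
    """Generate one true/false clone: is (x + k) a factor of a random quadratic?"""
    r1, r2 = random.sample(range(-5, 6), 2)            # roots of the quadratic
    b, c = -(r1 + r2), r1 * r2                         # (x - r1)(x - r2) = x^2 + bx + c
    k = random.choice([n for n in range(-5, 6) if n])  # constant offered in the stem
    stem = f"(x {sign_fmt(k)}) is a factor of x^2 {sign_fmt(b)}x {sign_fmt(c)}. True/False"
    return {"stem": stem, "key": k in (-r1, -r2)}      # true iff -k is a root

print(make_factor_clone())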

Each type of SRQ has its strengths and weaknesses, and each has its best uses. The next section looks at choosing SRQs for different purposes.


CHOOSING SELECTED RESPONSE QUESTIONS

The previous section explored the characteristics of different types of selected response question. This section looks at how each type is best used.

TAXONOMIES OF LEARNING

One of the key determinants in the selection of SRQs is the kind of knowledge or understanding that you are seeking to assess. For example, factual recall can be adequately assessed using true/false questions; deeper understanding may require more complex question types such as multiple response questions.

As a starting point, we need a method of classifying knowledge and understanding. The most widely used classification system is Bloom’s Taxonomy.

BLOOM’S TAXONOMY

Benjamin Bloom wrote Taxonomy of Educational Objectives, Book 1: Cognitive Domain in 1956 in an attempt to standardise the terminology used by teachers to describe academic abilities. Until the publication of this book, different people used different words to describe the same thing; or, worse, used the same words to describe different things.

His book described a classification system that could be used to categorise cognitive abilities. The taxonomy (which became known as Bloom’s Taxonomy) is widely used within the educational community.

Note: Bloom’s Taxonomy is not the only way to classify academic abilities. There are many alternative methods – some linked to Bloom’s (but more up-to-date) and some entirely different from Bloom’s. But Bloom’s Taxonomy remains the most widely used classification system.

Bloom’s Taxonomy classifies academic abilities into six categories:

1. Knowledge
2. Comprehension
3. Application
4. Analysis
5. Synthesis
6. Evaluation.

A brief description of each cognitive skill follows.


Knowledge: Knowledge involves the recall of specific facts and figures, or the recall of specific methods and processes. Knowledge is the bottom of Bloom’s Taxonomy but underpins the higher order abilities. There are three types of knowledge: knowledge of specifics, knowledge of methods, and knowledge of universals. At the higher levels (knowledge of methods and universals) it can be intellectually demanding. This category includes: knowledge of terminology, knowledge of specific facts, knowledge of conventions, knowledge of trends and sequences, knowledge of classifications, knowledge of criteria, knowledge of methodology, knowledge of principles and generalisations, and knowledge of theories and structures.

Comprehension: Comprehension differs from knowledge in that it relates to the mental processes of organising and re-organising information for a particular purpose. It includes: translation, interpretation and extrapolation. Translation relates to the ability to translate (or decode) a communication from one format (or language) to another. Interpretation involves the explanation or summarisation of a communication. Whereas translation involves a mechanistic, part-for-part rendering of a communication, interpretation involves a more holistic re-ordering or re-arrangement of the information. Extrapolation involves extending trends or sequences beyond the given data to infer consequences or corollaries.

Application: Application involves the use of knowledge and comprehension in specific situations. For example, knowledge of computing terminology and procedures, combined with an understanding of the principles of computer hardware and software, can be applied to the assembly of a computer system.

Analysis: Analysis involves the breakdown of a communication into its constituent parts so that the relationship between the elements is made clear. Analysis is intended to clarify or explain communications or processes. This cognitive skill includes the ability to: (1) analyse elements (identification of the components of the communication); (2) analyse relationships (the ability to check the consistency or accuracy of a hypothesis, and skills in comprehending the inter-relationships among different ideas or concepts); and (3) analyse organisational principles (the ability to recognise form and pattern in a communication, and the ability to recognise general techniques used within a subject area).

Synthesis: Synthesis involves combining the parts so as to form a whole. It involves combining and arranging parts or pieces of a communication to create something new. It may involve: (1) the production of a unique communication; (2) the production of a plan; and (3) the derivation of a set of abstract relations to represent physical phenomena.

Evaluation: Evaluation involves making judgements about the value of particular phenomena for given purposes. Evaluation is carried out using criteria, which may be given or created, and involves qualitative and quantitative judgements based on those criteria. This includes measuring the internal consistency of the communication, using criteria such as quality of writing, accuracy of the information contained within it, and consistency of argument; and measuring the external consistency of the communication, which requires the evaluator to have a detailed knowledge of the type of phenomena under review, since it will be evaluated in terms of the general criteria applied to phenomena of this type.

Table 1 - Bloom's Taxonomy


Bloom’s Taxonomy is a hierarchy in that each category builds on the one below. For example, application depends on comprehension, which in turn depends on knowledge. Or, to put it more simply: you can’t apply something until you understand it; and you can’t understand something until you know about it. Figure 2 illustrates this hierarchy, with knowledge at the bottom and evaluation at the top.

Figure 2 - Bloom's hierarchy

It is worth noting that, in practice, every level of Bloom’s Taxonomy can be reduced to knowledge if the candidate answers the question through rote learning. The most sophisticated evaluation can be answered correctly if the candidate has studied that specific scenario and learned the correct response. Or, to put it another way, one person’s evaluation is another person’s knowledge.

IDENTIFYING THE LEVEL OF A QUESTION

Bloom’s Taxonomy can be used to categorise the cognitive demands of a question. For example, a question asking a candidate to “describe” something is normally associated with the knowledge domain; another question asking the candidate to “explain” something is normally associated with analysis. In fact, the verb in the question can provide a clue to the question’s intellectual demands.

Table 2 associates some verbs with the levels within Bloom’s Taxonomy.


Knowledge: define, describe, label, list, name, recall, show, who, when, where
Comprehension: compare, discuss, distinguish, estimate, interpret, predict, summarise
Application: apply, calculate, demonstrate, illustrate, relate, show, solve
Analysis: analyse, arrange, categorise, compare, connect, explain, infer, order, separate
Synthesis: arrange, combine, compose, create, design, formulate, hypothesize, integrate, invent, modify, plan
Evaluation: assess, compare, decide, defend, discriminate, evaluate, judge, justify, measure, rank, recommend

Table 2 - Verbs associated with Bloom's Taxonomy

So, for example, a question that commences “Define…” is likely to assess basic knowledge; a question that begins “Compare…” is likely to assess analytical or evaluative skills.
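Table 2 could even drive a rough screening tool for draft questions. The sketch below (Python; the word lists come from Table 2, everything else is an illustrative assumption) flags the levels whose verbs appear in a stem. Note that verbs such as “compare” sit in several levels, so the output is a clue, not a verdict:

BLOOM_VERBS = {
    "Knowledge": {"define", "describe", "label", "list", "name", "recall", "show", "who", "when", "where"},
    "Comprehension": {"compare", "discuss", "distinguish", "estimate", "interpret", "predict", "summarise"},
    "Application": {"apply", "calculate", "demonstrate", "illustrate", "relate", "show", "solve"},
    "Analysis": {"analyse", "arrange", "categorise", "compare", "connect", "explain", "infer", "order", "separate"},
    "Synthesis": {"arrange", "combine", "compose", "create", "design", "formulate", "hypothesize", "integrate", "invent", "modify", "plan"},
    "Evaluation": {"assess", "compare", "decide", "defend", "discriminate", "evaluate", "judge", "justify", "measure", "rank", "recommend"},
}

def likely_levels(stem: str) -> list[str]:
    """Return the Bloom levels whose indicator verbs appear in the stem."""
    words = {w.strip(".,?;:!\"'").lower() for w in stem.split()}
    return [level for level, verbs in BLOOM_VERBS.items() if words & verbs]

print(likely_levels("Explain why objective tests are reliable."))  # ['Analysis']
print(likely_levels("Compare RRQs and ERQs."))  # ['Comprehension', 'Analysis', 'Evaluation']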

Note: SQA does not formally use a recognised taxonomy for assessments. However, when one is employed by Officers or appointees, it is usually Bloom’s. Some SQA question papers fall foul of Bloom’s Taxonomy, asking candidates to “explain” something but actually awarding marks for descriptions (or vice-versa).

DIFFICULTY AND DEMAND

Bloom’s Taxonomy provides an indication of the demand of a question – it does not define its difficulty. A question’s demand is a measure of its intellectual requirements; its difficulty is “how hard” it is. Although difficulty and demand are related (most demanding questions are difficult), a question can have high demand and low difficulty - or low demand and high difficulty.

Example 13 ~ Low demand, high difficulty question

Describe the main processes that take place during nuclear fusion.

This question has low demand (relating to factual recall) but high difficulty because it relates to a complex topic (nuclear fusion). Similarly, crossing the road involves evaluation skills (Is the road clear? Is it safe to cross? How far away is that car?), which are at the top of Bloom’s hierarchy – but it is not a difficult task for most people. So, merely climbing Bloom’s Taxonomy is no guarantee of difficulty.

The concept of difficulty and demand has important implications for question setting. Most SQA tests employ low difficulty/low demand questions; but even the “more demanding” questions may not be – they might simply assess knowledge in a more difficult way (by, for example, assessing little known knowledge).

QUESTION TYPES AND DEMANDS

Each question type can be related to one or more levels in Bloom’s Taxonomy. While it’s possible to use any one of the question types for almost any of Bloom’s levels, some are better than others for specific levels, as the following table describes.

True/False: While mostly used to assess knowledge, T/F questions can, in fact, be used to assess knowledge, comprehension and application levels.
Matching: Again, mostly used to assess basic knowledge but can be used to assess knowledge and comprehension.
MCQ: MCQs are the most flexible type of SRQ and can assess all levels; they are particularly suitable for knowledge, comprehension, application and analysis.
MRQ: MRQs can assess the same range of levels as MCQs – but have the potential to create more difficult questions within each category.
Ranking: Ranking questions are well suited to assessing application and analysis.
Assertion: Suitable for knowledge, comprehension and analysis.

Table 3 - Question types and demand

So, in theory, SRQs can assess all of Bloom’s levels. However, in practice, it is uncommon to come across SRQs that assess anything other than knowledge and comprehension. But this is not an inherent limitation in their design. Assessing higher order skills can be done – but it is a time consuming and skilled task to do so.


ADVANTAGES & DISADVANTAGES OF QUESTION TYPES

As stated previously, each question type has its unique characteristics and uses. The applications of each type are determined by its strengths and weaknesses.

True/False
Advantages: Well suited to basic knowledge. Easy to write. Rapid to mark. Suited to dichotomous knowledge. Good for formative assessment – especially diagnostic assessment.
Disadvantages: Limited applications (best suited to dichotomous knowledge).

Matching
Advantages: Relatively easy to write. Quick to mark. Good for assessing knowledge of characteristics/features or relationships between variables. Well suited to computerisation (drag-and-drop).
Disadvantages: Limited to knowledge and comprehension. Best used for homogenous content, i.e. classifying types.

MCQ/MRQ
Advantages: Can assess a wide range of cognitive abilities (up to analysis). Scenario-based questions can assess higher order skills. Well suited to diagnostic assessment (distractors can target learning difficulties). Item analysis provides detailed feedback (to assessors and candidates). Simple MCQs are quick and easy to construct. High re-usability of items.
Disadvantages: Good MCQs (at any level) are difficult and time consuming to construct. MCQs that assess high level abilities require skilled authors. Unsuitable for assessing synthesis and evaluative skills.

Assertion
Advantages: Well suited to assessing relationships between variables. Well suited to assessing understanding of cause-and-effect. Good for constructing demanding items.
Disadvantages: Difficult to construct. Limited applications (compared to MCQs). Wordy – difficult to read and understand.

Table 4 - Advantages and disadvantages of question types


In practice, the main barriers to constructing high quality items are the skills and experience of the authors. A talent for writing traditional question papers does not necessarily translate to writing SRQs – so experienced setting teams may struggle to create high quality item banks. Even after training, some writers don’t “get” SRQs – while others are veritable question factories.

Note: If a unit writer wishes to use objective testing, s/he should not prescribe the particular type of SRQ in the unit specification itself. It’s better to simply state that selected response questions may be used – and leave the choice of SRQ to the assessment writers (although the Support Notes may suggest specific forms of SRQ).


WRITING MULTIPLE CHOICE QUESTIONS

This section focuses on the construction of a specific type of selected response question (SRQ) – the multiple choice question (MCQ). However, much of the advice is transferable to other forms of SRQs.

Multiple choice questions are the most common type of SRQ; they’re also the most flexible and most difficult to construct. MCQs are used in all types of objective testing (including high stakes assessment) and are the most common form of SRQ employed by SQA.

ANATOMY OF AN MCQ

A single, complete multiple-choice question is called an item. It poses a question and allows a candidate to select the correct answer from a list of possible options. An MCQ has the following structure:

Figure 3 - Anatomy of an MCQ

Stem (or stimulus): the question or problem.
Options (or responses or alternatives): the list of possible answers.
Key: the correct (or best) answer.
Distractors: the incorrect alternatives to the key.

Note the spelling of “distractor” – which is the US-English spelling rather than the International English spelling (“distracter”).
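One way this anatomy might be represented in an item bank is sketched below (Python; the class and field names are illustrative, not SQA's format). The data is Example 5, whose key is option A:

from dataclasses import dataclass

@dataclass
class Item:
    """One complete multiple choice question: stem, options, key, distractors."""
    stem: str                # the question or problem
    options: dict[str, str]  # option label -> option text
    key: str                 # label of the correct (or best) answer

    def distractors(self) -> list[str]:
        """Labels of the incorrect alternatives, i.e. every option except the key."""
        return [label for label in self.options if label != self.key]

item = Item(
    stem="In psychiatry, holding two contradictory views about the same thing is called:",
    options={"A": "cognitive dissonance", "B": "dementia",
             "C": "dissociative disorder", "D": "factitious disorder"},
    key="A",
)
print(item.distractors())  # ['B', 'C', 'D']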

WRITING MULTIPLE CHOICE QUESTIONS

There is no formula for constructing high quality items. However, there is some guidance that aids their construction.


THE ITEM

The key to writing good items (“authoring”) is to ensure that the question directly relates to the underlying Arrangements, is clearly presented, and is free from unnecessary details. A question should not be a test of reading ability; the focus must be on the knowledge or skill that it is seeking to assess.

• Ensure that each item is relevant to the course/unit outcomes.
• Ensure that the level of language is appropriate to the target cohort.
• Assess one thing at a time (unless you intend to ask an integrative question).
• One correct answer only.
• Don’t write questions in isolation.
• Don’t include unnecessary words.
• Pre-test items whenever possible.

The most difficult part of writing an item is to ensure that there is only one correct answer. Having more than one potentially correct answer is the most common complaint from teachers and candidates. It’s a challenge to write items with one clearly correct answer – at least non-trivial items. It’s easy to be subjective or context dependent (i.e. the key is correct in some circumstances but not others). One solution is to spell out the context – but this may make the item clumsy or wordy, or give clues to the correct answer. Another option is to use words and phrases like “best” or “most likely” in the stem (it’s easier to argue that the key is the most likely answer rather than the only answer).

Although the initial construction of questions has to be the work of an individual, it’s vital that items are reviewed prior to being used operationally. It’s impossible for a single author to both write and review items independently.

SRQs are well suited to pre-testing – which means trying them out on students before using them operationally. Pre-testing will confirm the item’s suitability (or not). It also generates valuable data about the question that can be used in item analysis (see Section 6).
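As a taste of that data, the sketch below (Python; the responses are invented) computes an item's facility value, the proportion of pre-test candidates who answered it correctly, which is one of the statistics examined in item analysis:

def facility(responses: list[str], key: str) -> float:
    """Proportion of candidates who chose the keyed option (0.0 to 1.0)."""
    return sum(1 for r in responses if r == key) / len(responses)

# Hypothetical pre-test responses to a four-option MCQ whose key is "A".
responses = ["A", "C", "A", "A", "B", "A", "D", "A", "A", "C"]
print(facility(responses, "A"))  # 0.6: six of ten candidates answered correctly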

STYLE GUIDE

Each item should follow an agreed house-style to provide guidance on language use. A style guide for item writing would normally include advice about:

• spelling
• punctuation
• use of emphasis
• prose style
• language.

For example, spelling advice would include the treatment of numbers (spelled in words or written as digits?); punctuation advice would include information on the punctuation to use within options (should they end with a period or without any form of punctuation?); emphasis rules would include the use of bold and italics; prose style and language would provide general advice about the type and level of language to be used.

THE STEM

It’s best to phrase the stem as a self-contained question rather than a partial

statement – although the latter approach is neither uncommon nor invalid.

• Try to phrase the stem as a complete question (unless this is too contrived – when an incomplete statement may be used).

• Use clear, straight-forward language – suitable for the target cohort in terms of level of language.

• Place necessary wording in the stem – not in each of the options.

• Avoid irrelevant or unnecessary information.

• Avoid negative wording if possible – or use negatives sparingly.

• Specify any standards implied.

• Avoid the use of personal pronouns (“I”, “You” etc.).

• Avoid subjectivity e.g. “Which one of the following do you think is…” (what the candidate “thinks” is subjective – and her response cannot be wrong).

Any words that would be repeated in each of the options should be included in the

stem. Options should not begin or end with identical words and phrases.


Example 14 ~ Repeated text

If the pressure of a certain amount of gas is held constant, what will happen if its volume is

increased?

A The temperature of the gas will decrease.

B The temperature of the gas will increase.

C The temperature of the gas will remain the same.

Example 15 ~ Repeated text removed

If the pressure of a certain amount of gas is held constant, what will happen to the

temperature if its volume is increased?

A Decrease.

B Increase.

C Remain the same.

Avoid words like “could” and “would”. For example, asking a candidate “What would

you do…” cannot be answered incorrectly (since only the candidate can know what

she would do in any given circumstance) – instead write: “What should you do…”.

The following example illustrates a poor question.

Example 16 ~ Using subjective wording

A computer is running slowly. What could be responsible?

A Insufficient memory

B Over-heating

C Small hard drive

D Virus


The author intends D to be the correct answer – but any of the options could be

correct. Here is an improved version.

Example 17 ~ Subjective wording removed

A computer suddenly runs slowly without any changes to its configuration. What is most

likely to be responsible?

A Insufficient memory

B Over-heating

C Small hard drive

D Virus

Notice the added contextual information in the stem to improve the clarity of the

question – and the replacement of “could” with “most likely”.

Specify any standards implied. If an item calls for a judgement, specify the authority

or standard upon which the correct answer is based.

Example 18 ~ Standards specified

According to the American Medical Association, the diet of the average American provides

vitamins in amounts that are what?

A Adequate for normal consumption.

B Inadequate for normal consumption.

C In excess of normal requirements.

D Variable in relation to individual requirements.

The key to good stem construction is to keep the question (or statement) as short as

possible – consistent with providing sufficient information to clearly pose the

question. But don’t be tempted to reduce the length of the stem by moving

information into each of the options; this complicates the question and increases the

candidate’s reading time.

Negative wording is not prohibited but it’s better to word a question positively when

this is possible. Double negatives should be completely avoided i.e. two negatives in

the stem or a negative in the stem and a negative in the options. However, some


questions can be made unnecessarily complex by avoiding a single negative – in

which case, use negatives. When negatives are used, emphasise “NOT” (or whatever

construct is used) in the stem (or the options).

THE OPTIONS

• Provide between three and five options – four options is most common.

• Options should be internally consistent (e.g. all people’s names, not three names and a measurement).

• All of the options should be plausible.

• All of the options should be of equivalent quality.

• Ordering of the options should follow a consistent and logical sequence.

• The length of options should be comparable.

• Options should be mutually exclusive.

• Only one correct (or best) answer.

• The one correct answer (key) should be actually correct.

• The key should not be worded in a way that would make it likely to change over time.

• Ensure that none of the distractors is conditionally correct (depending on circumstances or context – unless these are defined in the stem).

• Do not create distractors that are too close to the key.

• Don’t use words such as “not”, “never” or “always” to make an option incorrect.

• Avoid the use of “All of the above”.

• “None of the above” should be used sparingly (and when used should be the correct answer some of the time).

• Avoid pejorative language (such as “bad”, “low”, “ignore” etc.).

• Avoid syllogistic reasoning e.g. “Both A and B are correct”.

Some of the advice is conflicting – such as “Stems should

be short and simple” and “Move information to the stem

rather than repeat it in each option”. Dealing with these

tensions is the art of item construction!


The advice about pejorative language is quite subtle. Any option that uses words

such as “bad”, “low” and “ignore” is usually a distractor – authors rarely use such

words in the key.

At higher levels of understanding, it can be difficult to construct questions with one

objectively correct answer and it is a common error in such questions to offer

options that include more than one potentially correct answer. Careful wording

(“Which one of the following is likely to be the best answer…”) can get round this

potential problem.

SEQUENCING OPTIONS

Ordering of the options within an item should follow a logical order. If using numbers

or dates then they should be displayed numerically or chronologically in ascending or

descending order (normally ascending). Text answers should normally be sorted

alphabetically unless there is a “natural” sequence to the options, in which case the

natural sequence should be used in preference to alphabetical order. Do not order the options to try to evenly distribute the answers (i.e. to ensure each option – A, B, C and D – is used approximately the same number of times) nor attempt to avoid clustering keys (e.g. A-B-B-B-C) since both of these strategies reduce the randomness of the test.

USE OF “NONE OF THE ABOVE”

The option “None of the above” should be used sparingly. It is preferable to avoid the

use of “None of the above” as well as “All of the above”. Studies have shown that

they decrease item discrimination and test score reliability (see Section 6). However,

“None of the above” can be used if authors ensure that:

• it is used in several items in a test

• it is sometimes the correct option (but not always)

• it is not used after a negative stem

• it is not used as “padding” (because you are short of other options).

“None of the above” may be particularly useful in questions that require candidates

to carry out calculations, since this option effectively mops-up a large range of

potential errors. But, if it’s used, it must sometimes be the key.


Example 19 ~ Good use of “None of the above”

Which one of the following is the solution for x in the equation 5(x-1) = 10?

A 0

B 2

C 4

D None of the above

ADVICE ON WRITING DISTRACTORS

The quality of distractors has a huge impact on the quality of the question.

Distractors have a particularly important role to play in formative assessment since

their careful selection can provide a wealth of diagnostic information about the

candidate’s present understanding. In summative assessment, carefully selected

distractors can catch out unprepared (or under prepared) candidates. Writing

distractors, therefore, requires as much thought as writing the key.

Distractors should be as plausible as the key; do not use unrealistic or humorous

distractors as this effectively reduces the number of (real) options.

• Distractors should be as plausible as the key – no silly distractors – although some can be relatively weak.

• Common misunderstandings make good distractors.

• Incorrect paraphrasing of the question makes for good distractors.

• Correct-sounding distractors are effective against poorly prepared candidates.

• True statements that do not answer the question are good distractors.

There is a balance to be struck between writing good distractors and trying to dupe

candidates. Distractors should not “entrap” candidates – that is, catch out

candidates through clever wording, very fine distinctions or tricks-of-the-trade. If you

want to write a difficult question then do so through the knowledge and skills

required to answer it – not by tricking the candidate into giving the wrong answer.


ADVICE ON AVOIDING CUEING

“Cueing” is the tendency for the stem (or the options) to give away the key. It is a common

problem with SRQs. The following question has only one option (A) which is

grammatically correct (the stem ends with “an” and only option A begins with a

vowel).

Example 20 ~ Cueing

A word used to describe a noun is called an:

A adjective

B conjunction

C pronoun

D verb

• The wording in the stem should not provide obvious clues to the correct answer.

• Avoid grammatical clues: ensure that all of the options flow from the stem, are in the same format and tense, and are grammatically correct.

• Don’t allow the wording of the options to provide obvious clues to the correct answer.

• Avoid the use of “always” and “never” in the options since these responses are rarely correct.

• Avoid the use of “sometimes” and “often” in the options since these responses are often correct.

• Avoid using stereotypical language that could give away the answer.

• Avoid using phrases from textbooks.

• Avoid pejorative wording (“bad”, “low” etc.) since these words are rarely used in the key.


• Avoid complex language in one option compared with the other options (this option tends to be the correct answer).

• Avoid similar language in the stem and the options since the option with the most similar language is most likely to be the key.

• Avoid visual cueing, i.e. one option being much longer or standing out in some other way from the other options – this one is likely to be the key.

The length of options should be similar. An option that stands out from the others

can indicate to a student that it is the right answer. If different lengths are

unavoidable then use two long options adjacent to each other and two short options

adjacent to each other.

The following example illustrates some of this guidance.

Example 21 ~ Advice in context

“Shakespeare wrote plays and they reflect both the depth of human emotion and the

complexity of human society.”

Which one of the following phrases improves the wording of the underlined fragment?

A “Shakespeare wrote plays who reflect…”

B “Shakespeare wrote plays that reflect…”

C “Shakespeare wrote plays which reflect…”

D “Shakespeare wrote plays being that they reflect…”

The question appears to be a valid assessment of candidates’ knowledge of English

grammar (presuming that this is what the author intended to assess) – although a

more familiar context could have assessed the same knowledge (the mere mention

of Shakespeare can disorientate candidates).

The question is clearly worded – although some of the language in the stem is

unnecessarily complex (words such as “fragment” could confuse candidates).

The options look homogenous, with none standing out (no visual cueing). They have

been ordered in a logical sequence (sentence length). They are all plausible to the

under-prepared candidate. There is some repeated text in the options that a rewording

of the stem may avoid (but maybe not without making the question less clear). The

distractors have been chosen to reflect common misunderstandings among

candidates with respect to the use of “that”, “which” and “who”. And there is one

unambiguously correct option (B).


All in all, a reasonable (albeit imperfect) question.

DISCLOSERS

A concept associated with cueing is disclosing. A discloser is a question that

contains the answer to another question. Unless otherwise intended, every question

should be independent of every other question and should contain the minimum

information required to answer the question. However, it can happen that the stem

or options in one question inadvertently help candidates to answer another question.

Disclosure is a particular problem in item banking when it is impossible to predict

which items will be included in a particular instance of a test (such tests are usually

dynamically generated by a computer – and a computer is unlikely to spot the

subtleties of disclosure).

A checklist, summarising the advice for item construction, is provided in the

appendices.


WRITING QUESTIONS FOR HIGHER LEVEL SKILLS

Multiple choice questions (MCQs) have gained a reputation for being a quick-and-

dirty way of assessing low level knowledge. However, they can also be used to

assess higher level skills – but this requires a great deal more effort on the part of

the writer. This section explores the potential of MCQs to assess higher level skills.

As has been previously stated, MCQs can be used to assess all of the levels within

Bloom’s Taxonomy – although they are more suited to the lower levels. This section

explores a couple of techniques for writing higher order questions and exemplifies

this against each level in Bloom’s Taxonomy.

Writing MCQs to assess higher order skills frequently contradicts some of the

previous advice about writing good items. For example, such questions often involve

long stems; complex language is frequently used; standards are often omitted (or

the question becomes one of knowledge of the standard); and they often require an

element of judgement on the part of the candidate (and, as a consequence, are less

objective).

Note: There is a fundamental distinction between writing

questions that assess higher level skills and writing items

that assess lower order skills in a “difficult” way. Writing a

question that assesses some esoteric piece of knowledge

is not a higher order item – although few candidates will

answer it correctly, it is still only assessing a low level

ability albeit in a difficult way (see previous discussion on

difficulty and demand).

TECHNIQUES FOR WRITING HIGHER ORDER QUESTIONS

Writing higher level questions is easier in some subjects than others. Some fields,

such as mathematics, are problem solving based and in such subjects it is relatively

straight-forward to produce questions that assess more than knowledge and

comprehension (see example 3 for a straight-forward application level question in

Maths). In other subjects it’s not so easy.

However, there are a few techniques that can be used to help authors produce more

demanding MCQs. We will look at two:

1. scenario questions

2. passage-based reading.

Before we do, there is a very simple technique that can be used to transform a

simple knowledge question into one that is more demanding. Instead of asking

“What…?”, ask “Why…?”. For example, in a Geography test, instead of asking:

“Which one of the following cities is the capital of the United States?” (which assesses basic knowledge), ask why Washington is the capital of the US (which requires an


explanation).

Example 22 ~ Upgrading questions

Why is Washington DC the capital of the United States?

A It is a planned city, capital by design.

B It is the largest city in the United States.

C It is located beside a large river and manufacturing base.

D It is located in a position safe from British troops during the American

Revolution.

This is a quick-and-dirty technique to generate more demanding questions,

upgrading basic knowledge questions to comprehension or analysis levels.

SCENARIO QUESTIONS

The main method of writing demanding items is to present a scenario to candidates

and then pose one or more related questions. The scenario can be anything from a

paragraph to a page (although a very long scenario really requires a number of

follow-on questions to justify its length). The associated question(s) may involve a

range of cognitive abilities including interpretation (comprehension), prediction

(comprehension), calculation (application), problem solving (application), explanation

(analysis), inference (analysis), categorisation (analysis) and decision making

(analysis and evaluation).

Scenarios can be used in all subjects but are particularly suitable in the social

sciences. Science subjects are inherently suited to problem solving and it is easier in

these areas to pose demanding questions without the need for lengthy scenarios.

The examples provided in this section are given without detailed comment. You are

encouraged to critically appraise each question yourself. When you do, you will

appreciate that no (non-trivial) question is without its weaknesses.

A scenario question has a straight-forward construction. It consists of some text,

which may be illustrated with a diagram or photograph, and one or more associated

questions. The scenario can take one of a number of forms including:

• a description of a specific environment

• a description of a specific situation

• a description of a principle or theorem


• a description of a problem

• an explanation of an event

• the results of an experiment (or the results of research).

Most scenario questions involve an element of interpretation on the part of the

candidate.

The candidate will take more time to process a scenario question as it often requires

a high level of reading skills. This should be taken into account when determining

the duration of a test (see Section 7).

Example 23 ~ Application skills

Julie is 14 years old and frequently uses an online community called MyParty, which is a

social network used by many of her friends. However, the service is open to any member

of the public. She has become very friendly with Jamie, who is another user of the service,

whom she has never met. Jamie’s profile reports that he is 16 years old and attends a

nearby school. Julie and Jamie share many common interests and Jamie has asked to

meet Julie, who wants to meet him.

Which one of the following is Julie’s best course of action?

A Refuse to meet with him.

B Agree to meet with him but accompanied by a responsible adult.

C Agree to meet with him but accompanied by a friend.

D Agree to meet with him.

This question uses a specific situation to ask a question that involves application

skills. Any question that uses a scenario that the candidate is unfamiliar with is, in

effect, assessing application skills.


Example 24 ~ Application and analysis skills

A user is having problems reading files from a flashdrive. While most files work correctly,

any attempt to access a few specific files results in an operating system error message:

“Cannot read file. Storage device may be corrupt.” Which one of the following is normally the best course of action in such circumstances?

A Copy the readable files from the device and do not re-use the device.

B Copy the readable files from the device, reformat it and recopy the files to the

device.

C Ignore the error and continue to use the part of the device that is usable.

D Reformat the device and re-use it.

Note that this question is an example of problem solving. Note also that there are at

least two weaknesses. The key (B) “looks” correct (it is the longest and most detailed

option); at least one of the distractors is weak (C) and uses pejorative language

(“ignore”). But it has its strengths too. The key is clearly the best answer (not always

an easy task when writing demanding questions) and it’s a challenging question

(admittedly made easier by the options). And the author didn’t resort to “None of the

above” as a final option! It is a moot point whether this item can be “fixed” or

whether it has to be discarded.

The following example uses a single scenario and a number of linked questions of

increasing demand.


Example 25 ~ Application and analysis skills

Raj and Sophie, who have never been married, have two children – Ben aged 8 and

Shazia aged 2. Raj and Sophie’s relationship has ended, and Sophie has married Carlton.

Raj has agreed that the children can live with Sophie and Carlton for the time being.

For questions 1-4, the options are:

A Raj and Sophie.

B Raj, Sophie and Carlton.

C Sophie and Carlton.

D Sophie only.

E Raj only.

1 Who has parental responsibility for the children at present?

2 If Section 8 orders are required in respect of the children, who could apply as of right

(without leave) for any Section 8 order?

3 Who would be able to apply as of right (without leave) for a residence or contact

order?

4 If Raj obtained a contact order to see the children every week, who would have

parental responsibility for the children?

PASSAGE-BASED READING

A second technique to aid the writing of demanding questions is to use passage-

based items. This involves presenting a passage of around 100 to 800 words and

asking one or more linked questions about the passage.

The passage can be narrative, argumentative or expository in nature. The questions

can ask candidates about the meaning of words in the passage (vocabulary in

context); ask questions about significant information the passage is seeking to

impart (literal comprehension); or measure candidates’ ability to analyse information

as well as to evaluate the assumptions made and the techniques used by the author

(extended reasoning).


Example 26 ~ Passage-based question

1   “Psychoanalysis has been criticised on a variety of grounds by Karl Popper,
2   Adolf Grünbaum, Mario Bunge, Hans Eysenck, L. Ron Hubbard and others.
3   Popper argues that it is not scientific because it is not falsifiable. Grünbaum
4   argues that it is falsifiable, and in fact turns out to be false. The other schools of
5   psychology have produced alternative methods of psychotherapy, including
6   behaviour therapy, cognitive therapy, primal therapy and person-centred
7   psychotherapy.
8   An important consequence of the wide variety of psychoanalytic theories is that
9   psychoanalysis is difficult to criticise as a whole. Many critics have attempted to
10  offer criticisms of psychoanalysis that were in fact only criticisms of specific
11  ideas present in one or more theories, rather than in all of psychoanalysis.
12  For example, it is common for critics of psychoanalysis to focus on Freud’s
13  ideas, even though only a fraction of contemporary analysts still hold to Freud’s
14  major theses.” (Wikipedia)

A number of linked questions could be asked about this passage. For example, a

vocabulary-in-context question could ask about the meaning of a word (or term) such

as “falsifiable” (line 3) or “cognitive therapy” (line 6); a literal comprehension

question could ask about the candidate’s understanding of this passage (such as

asking her to choose the best one-line summary of the passage); and a number of

extended reasoning questions could be posed (such as one asking about criticisms

of Freudian psychoanalysis).

Passage-based reading can also be used to measure evaluation skills by asking

candidates to judge the logical consistency of written material, the validity of

experimental results, the interpretation of data, or the quality of writing.


Example 27 ~ Evaluation skills

The Fibonacci sequence of numbers can be defined by the following mathematical recurrence relation: F(0) = F(1) = 1; F(n) = F(n-1) + F(n-2) for n > 1.

The following Java method (i.e. function) specifies an implementation of this

recurrence relation.

public static int fibonacci (int n) {
    if (n == 0 || n == 1) {
        return 1;
    } else {
        return fibonacci (n-1) + fibonacci (n-2);
    }
}

Which one of the following statements best evaluates this function?

A The algorithm will produce the correct result and is efficient.

B The algorithm will produce the correct result but is inefficient.

C The algorithm will not produce the correct result.

D The algorithm will fail.


ITEM ANALYSIS

One of the major advantages of selected response questions (SRQs) is that they can

be easily analysed.

Item analysis permits a more scientific approach to assessment. If you know the

properties of each question (for example, how difficult it is or how well it separates

candidates of differing abilities) then you can construct a better test.

This section explores two classical ways of analysing items: (1) measuring their

difficulty; and (2) measuring how well they separate candidates. The next section

explains how these measures can be used to construct tests.

FACILITY VALUE

The facility value (FV) of an item is a measure of its difficulty – or, more accurately,

its “easiness”. It represents the proportion of candidates who answer the item

correctly and is expressed as a decimal fraction between zero and one.

A FV of zero means that no-one answered the question correctly; a FV of one means

that everyone answered the question correctly; and a FV of 0.6 means that 60% of

the test takers answered it correctly. The lower the FV, the more difficult the item;

the higher the FV, the easier the item (hence, it is better thought of as an “easy

index”). A very easy item might have a FV of 0.9 (meaning that 90% of candidates

are expected to answer it correctly) and a very difficult item might have a FV of 0.1

(meaning that 10% of candidates are expected to answer it correctly).
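By way of illustration only, the calculation is a single division. The following Java sketch shows it; the class and method names are invented for this guide and are not part of any SQA system.

class ItemStats {

    // Facility value: the proportion of candidates who answered the item
    // correctly, expressed as a decimal fraction between zero and one.
    static double facilityValue(int correctResponses, int totalCandidates) {
        return (double) correctResponses / totalCandidates;
    }
}

For example, facilityValue(18, 60) returns 0.30.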

Note: In a competency-based system (such as SQA’s), the FV measures the probability of a minimally competent candidate answering the question correctly – not a typical candidate.

Facility values are best assigned during pre-testing. Once a sample group of students

has attempted the item (assuming that this sample is representative of the target

cohort), an initial FV can be assigned. If pre-testing is not possible (or, more likely,

not feasible) a predicted facility value (PFV) can be assigned by the test authors.

Predicted FVs are assigned by subject matter experts (SMEs) and represent the

“best guess” of two or more SMEs. This initial estimate can be re-calibrated once the

item is used operationally.

Note that a FV is a relative measure of an item’s difficulty

– relative to the target cohort’s age and stage. For

example, a simple addition question might have a low FV

for Primary 2 pupils but a high FV for Primary 4 pupils.

Note that, in theory, any SRQ will have a minimum FV greater than zero. For example,

any true/false question will have a minimum FV of 0.5 (which represents the 50-50


chance of guessing the answer correctly) and any MCQ (with four options) will have a

minimum FV of 0.25 (no matter how difficult it is). However, in practice, some FVs

will be lower than this due to the way the item has been constructed – with a distractor attracting more than its fair share of candidates and the key attracting very few.

It is recommended that items with FVs greater than 0.9 are discarded (too easy);

similarly FVs lower than 0.1 should be avoided (too difficult).

DISCRIMINATION INDEX

The discrimination index (DI) of an item is a measure of how well that item separates

candidates. It relates each candidate’s test score with his/her performance on a

specific item, and then compares the top candidates with the bottom candidates.

For example, if 30 candidates attempt an item, the DI measures the performance of

the top third (top 10) of the candidates with the bottom third (bottom 10) of

candidates (based on final test scores). If eight of the top ten answered the item

correctly and two of the bottom third answered it correctly then the item’s DI is:

DI = (8-2)/10 = 6/10 = 0.6.

DI values range from +1 (all of the top candidates answered it correctly and none of

the bottom candidates) to -1 (all of the bottom candidates answered it correctly and

none of the top candidates!); a DI of zero means that the same number of top and

bottom students answered it correctly. A positive DI is essential (which shows some

discrimination). If an item yields a zero or negative DI, discard it. The above example

illustrates good discrimination. It is recommended that an item has a DI of at least 0.2; items with DI values of 0.4 and above are considered to have good discrimination.
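The calculation can be sketched in Java as follows. This is illustrative only – the method name, and the assumption that candidates arrive already ranked by final test score, are this guide’s, not part of any standard tool.

class ItemDiscrimination {

    // Discrimination index: compares the top third of candidates with the
    // bottom third on a single item. itemCorrect[i] is true if candidate i
    // answered the item correctly; candidates are assumed to be sorted by
    // final test score, best first.
    static double discriminationIndex(boolean[] itemCorrect) {
        int third = itemCorrect.length / 3;
        int top = 0, bottom = 0;
        for (int i = 0; i < third; i++) {
            if (itemCorrect[i]) top++;                              // top third
            if (itemCorrect[itemCorrect.length - 1 - i]) bottom++;  // bottom third
        }
        return (double) (top - bottom) / third;
    }
}

With the figures above (30 candidates; eight of the top ten and two of the bottom ten correct), the method returns (8-2)/10 = 0.6.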

Discrimination indices cannot be predicted. They must be derived through pre-

testing or operational use.

There is a link between a question’s facility value and its discrimination index. A

“good” question that is designed to be difficult will have a low facility value and high

discrimination. But not all questions with low FVs will have high DIs. A poorly

designed question that is difficult to answer due to lack of clarity or inappropriate

language may have a low FV and low discrimination (since few candidates can

answer it – and poor candidates are as likely to get it right as good candidates).

The following example illustrates the facility value and discrimination

index for a specific question. The item was designed to assess the mathematical

knowledge of S2 candidates. It was pre-tested on 60 candidates of whom 18

answered it correctly; 15 in the top third and three in the bottom third. This gave the

following item analysis:

FV = 0.30

DI = 0.60


Example 28 ~ Item analysis

If the radius of a circle is increased by 20%, which one of the following represents the

corresponding increase in the circle’s area?

A 40%

B 44%

C 120%

D 144%

This item is difficult. Given that blind guessing would produce a one-in-four chance of

answering it correctly (FV=0.25), the recorded FV of 0.30 (representing 30% of the

sample) is very low. It also discriminates well, meaning that it is likely to separate

candidates and aid grading.

It is worth noting that this item is slightly cued. The digits “44” appear in two of the options (B and D) – which might encourage some candidates to assume one of these

options is correct (which would be a correct assumption – the key is B). This could

have been avoided by selecting a different value for D (such as 160%).

OTHER METRICS

There is a range of other metrics that can be calculated for SRQs. Most are complex

and, unlike facility values and discrimination indices, have no “real” interpretation.

However, the distractor pattern provides useful information about which of the options candidates choose. For example, the following distractor pattern illustrates the choices made by 100 candidates for Example 28 (above).

Option    Frequency of selection
A         15
B         40
C         10
D         35

This distribution suggests that distractors A and C are under-performing and need

to be strengthened or replaced. It might also indicate that distractor D is too strong


and may require weakening. It would appear that this question comes down to a

straight choice between options B and D for most candidates.

There isn’t a perfect distribution for the options – but options that are rarely selected

or a distractor that is more popular than the key warrant attention.
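Producing a distractor pattern is a simple tally. The following sketch shows one way of doing it; the names are invented for illustration.

class DistractorAnalysis {

    // Count how often each option (A, B, C, ...) was selected.
    // 'selections' holds one chosen option per candidate, e.g. 'B'.
    static int[] distractorPattern(char[] selections, int numberOfOptions) {
        int[] counts = new int[numberOfOptions];
        for (char selected : selections) {
            counts[selected - 'A']++;  // map 'A' to 0, 'B' to 1, and so on
        }
        return counts;
    }
}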

Item analysis provides a means of evolving item banks by identifying under-

performing (“weak”) items – and eliminating them (“survival of the fittest”). The

initial calibration of items can be done formally (through field testing items prior to

their use) or informally (using predicted facility values for example) and these initial

values can be re-calibrated once the items are used in earnest. However, to be

effective, item bank evolution (like biological evolution) needs a mechanism to

identify weak items and replace these with stronger ones.


CONSTRUCTING TESTS

AUTHORING TESTS

This section looks at the process of combining questions into a test. The following

diagram illustrates the test generation procedure.

Figure 4 - Test generation procedure

TEST SPECIFICATION

The test specification is the document (or “blueprint”) that defines the precise

nature of the test. It is normally created by the Principal Assessor (or equivalent)

under advice from the SQA Officer. The test specification will include the following

information:

• description (including links with source unit(s) and outcome(s))

• question format(s)

• number of questions

• duration

• rubric (including the marking scheme)

• pass mark (including grade boundaries where applicable)

• conditions of assessment.

A sample test specification is provided in the appendices.

The description of the test must (at a minimum) define the learning objectives that

the test is seeking to measure. In the context of SQA, this would mean the unit(s)

and outcome(s) that the assessment is testing (its “domain”).

The question format defines the type of question that the test will employ. This might

be true/false, matching, multiple choice or multiple response – or a mix of these

types. For example, a test might use 15 MCQs and 5 MRQs – the test spec’ should

spell this out.

The number of questions is self-evident but note that where more than one question

type is employed, the spec’ should specify the number of each type.

The duration of the test will depend on the number of questions and the complexity of the questions. Simplistic formulas for the duration of a test (“two minutes per question”) should be avoided. Scenario questions, in particular, take time to read,

assimilate and answer. The duration should be based on a typical test undertaken by

a typical candidate. If in doubt, err on the side of generosity – unless speed of

response is a critical aspect of the assessment.

The rubric defines the marking scheme and provides instructions to candidates.

Setters may adopt a simple marking scheme (one mark per question) or more

complex schemes (involving one, two or more marks for each item depending on its

importance or complexity). Simple marking schemes are recommended. This section

should also provide any special instructions for candidates.

The pass mark (or cutting score) is the minimum mark that candidates must gain in

order to achieve a pass in the test. There are a number of techniques for setting

pass marks, some of which are discussed later in this section. But pulling a figure

out of thin air is not one of them. And 50% is rarely a suitable cut score for an

objective test (due to the effects of guessing – see below).

If a test is graded (beyond the basic pass/fail threshold), the grade boundaries must

be defined. The grade boundaries define the marks required to gain an A or B or C

pass. For example, a C pass might require a total score between 60% and 74%, B

between 75% and 89%, and an A pass 90% or more.
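Using the example boundaries above, grading reduces to a simple banding function. The sketch below is illustrative only – real boundaries are defined per test in the specification.

class Grading {

    // Assign a grade from a percentage score using the example boundaries:
    // C = 60-74%, B = 75-89%, A = 90% or more; anything lower is a fail.
    static String grade(double percent) {
        if (percent >= 90) return "A";
        if (percent >= 75) return "B";
        if (percent >= 60) return "C";
        return "Fail";
    }
}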

Finally, the test spec’ should describe any special conditions that have not already

been described elsewhere in the specification. Examples include: access to

reference material (Is the assessment open book? Or open web?) and permitted

materials (such as calculators or special instruments).

ASSEMBLING THE TEST TEAM

The test team is responsible for constructing the test, using the test specification as

a blueprint. This team will normally consist of an SQA Officer and a number of setters

– or, in testing terminology, a test expert (the SQA Officer) and a number of subject

matter experts (the setters). The SMEs should have prior knowledge and experience

of writing SRQs. The size of the team will depend on a number of factors such as the

number of items required and the time available to write them. The more items

required and the less time available, the greater the number of SMEs needed.

Subject matter experts may need training in the construction of selected response

questions. This can be done at the authoring event (see below) or prior to this event,

at a specific training event.


AUTHORING EVENT

Due to the collaborative nature of item writing, it is recommended that questions are

produced over a short period of intensive activity rather than the more traditional

SQA approach to question setting. For example, a team of four SMEs might be asked

to produce 200 items over an intensive working weekend. A suggested workflow

during the authoring event is provided below.

Allocate learning outcomes to SMEs → agree targets with SMEs → write item → add item to batch; when a batch is complete, pass it to the reviewer → reviewer checks each item in the batch → accepted items are added to the item bank; rejected items are revised or discarded.

Figure 5 - Authoring event workflow

Authors need to be crystal clear about the learning objectives (outcomes) that they

are to assess. Where more than one outcome is to be covered by an individual SME,

the number of questions for each outcome should be agreed. Each author’s targets

should also include the types of question and number of each type of question (for

example: “Twenty multiple choice questions and 10 multiple response questions”),

the average facility value for their set of questions (see below), and the expected

productivity rate (for example, five items per hour).

Writing items is a solitary activity. Although authors may seek advice when they write

questions, the act of putting pen to paper (or, more likely, finger to keyboard) is an

individual task. Authors should be provided with a question template before

commencing. This template (which is normally a Word document) defines the precise

format of the question and will include metadata about the item (such as the


associated keywords and its predicted facility value). A sample template is provided

in the appendices.

If the items are being written for a test with a known pass mark, authors will require

to know the target facility value (FV) to aim for. For example, if the writers are

producing items for a test with a pass mark of 15/20 then the target FV will be 0.75

and each author should ensure that each batch of questions has an average FV of

0.75 (so that the overall item bank has a “correct” FV).

Authors should batch items before passing a group of questions to a designated

reviewer for checking. The reviewer will then review each item and do one of three

things: (1) accept it without change; (2) accept it with revisions; or (3) reject it. It is unlikely that the author and reviewer will be unable to reach a compromise about a disputed item but, in such cases, the Principal Assessor should make the final decision.

Reviewing is best done blind (i.e. without knowing the identity of the author) to

prevent personality conflicts from interfering with the process. While group reviewing

is a good means of training writers and reviewers, it is an inefficient way to create

large numbers of items.

The output from the authoring event will be an item bank of approved and calibrated

items. The SQA officer will play a crucial role in maintaining workflow and ensuring a productive event. Target setting and regular milestones will play an important part in

ensuring a successful outcome. At various points during the event, the officer should

convene review meetings when progress can be measured, and problems or

bottlenecks can be collectively identified and addressed.

DETERMINING TEST LENGTH

Determining the number of questions to include in a test is an important decision.

The length of a test has a direct relationship with the test’s reliability – the longer the test (and, by implication, the more questions in the test), the more reliable that test will be as a measure of the candidate’s ability.

There are a number of factors that affect test length including:

• the importance of the test

• the size of the domain being assessed

• the range of knowledge and skills contained within the domain

• the time available.

A high stakes test needs to be more reliable than a low stakes test – and therefore

needs to be longer. However, the improvement in reliability levels off beyond a certain

number of questions.

The number of learning objectives being assessed also has a bearing on the size of

the test. A test that assesses several outcomes (or one large outcome) will obviously

require more items than one that assesses fewer outcomes (or smaller outcomes).

However, even a test that assesses a single outcome may require lots of questions if

that outcome covers a broad range of knowledge and skills.


And, finally, the time available needs to be considered. There is no point in designing

a test with 60 questions, requiring two hours to complete, if this is disruptive to

centres. For example, most Scottish schools operate a 50-minute period and tests

that last longer than this can be difficult to administer.

There is no formula for test length. Criticality, domain size and practical

considerations need to be balanced. However, in most instances of unit assessment

it is best to keep tests as short as possible to reduce the assessment burden on

centres (and candidates).

TECHNIQUES FOR SETTING PASS MARKS IN OBJECTIVE TESTS

There are a number of ways to set a pass mark. We will look at three methods:

1. Informed judgement

2. Angoff method

3. Contrasting groups

Some are more “scientific” than others but, no matter which method is used, none of

them replace the need for human judgement.

INFORMED JUDGEMENT

This technique involves the most human judgement and, as a consequence, is the

most subjective way of setting pass marks (it is also the method most similar to the

way that SQA sets cut-scores).

At its most basic level, informed judgement involves the opinion of the members of

the setting team. These subject matter experts (SMEs) agree a sensible pass mark

based on their expert judgement and the following considerations:

• the minimum mark achievable through guessing

• the criticality of the judgement being made about candidates

• the complexity of the subject domain

• the difficulty of the test items

• the age and stage of the candidates.

No matter how little a candidate knows, s/he is unlikely to score zero marks in an

objective test due to the effects of guessing. For example, in an objective test

consisting of 100 multiple choice items, each with four options, blind guessing

should produce a minimum mark of 25% (representing the one in four chance of

guessing the correct answer to each question). For this reason, the pass mark in an

objective test is usually higher than 50%.
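The arithmetic can be expressed in a line of Java. This sketch assumes one mark per question and blind (uniform) guessing; the names are invented for illustration.

class Guessing {

    // Expected marks from blind guessing: one chance in 'optionsPerItem'
    // of guessing each item correctly.
    static double expectedGuessScore(int items, int optionsPerItem) {
        return (double) items / optionsPerItem;  // 100 items, 4 options -> 25.0
    }
}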

The importance of the assessment also has a bearing on the pass mark. For

example, an assessment that grants a license to practice as a surgeon is more


important that an assessment that confers a pass in a unit. Where it is critical that

candidates possess particular competences both the test duration (see above) and

the pass mark (see below) should be increased.

If there is an existing item bank, the difficulty of the items in the bank can be used to determine the pass mark. For example, a bank of difficult questions would result in a lower pass mark; conversely, a bank of easy questions would lead to a higher pass mark. Associated with this is the complexity of the subject domain. For example, a test on nuclear physics might have a lower pass mark than one on multiplication tables – although this is dependent on the age and stage of the candidates.

In practice, the informed judgement would be based on all of these considerations – some of which may drive the pass mark up and some of which may push it down. For example, an undergraduate true/false test for medical students would have a significantly higher pass mark than a multiple response test for a low-level unit.

The initial judgement may be refined after further consultation or pre-testing. For example, practising teachers may be asked for their views on the proposed pass mark; and/or the assessment may be field-tested and the pass mark adjusted in the light of the resulting scores.

ANGOFF METHOD

This method of determining the pass mark is less subjective than the informed judgement approach. It involves aggregating the facility values (FVs) for each item and estimating the pass mark from this figure. The following example illustrates the method.

Question     FV
1            0.8
2            0.6
3            0.6
4            0.3
5            0.4
Total        2.7
Pass mark    3/5

Table 5 – Setting pass marks using Angoff

Recall that the facility value is a measure of the probability (between 0 and 1) of minimally competent candidates answering the question correctly. For example, based on the above table, there is an 80% probability that candidates will answer question one correctly (FV = 0.8). Adding the FVs for each question therefore provides an indication of the total score that a minimally competent candidate should achieve (in this case 2.7). Subject matter experts would then round this value up or down using their professional judgement (in this case the aggregate FV was rounded up). The resulting pass mark for this test is three out of five.

In practice, pass marks are defined in the test specification, and the task therefore becomes one of selecting questions with FVs that aggregate to this pass mark. We effectively reverse engineer the Angoff method. For example, if the test specification defines a pass mark of 7/10 then the test should consist of questions whose FVs add to seven (give or take a decimal place). This is a very simple task for a computer, as the sketch below shows.
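A minimal sketch of that reverse engineering (the item bank and its FVs are invented for illustration): from a bank of calibrated items, pick the fixed-size subset whose FVs aggregate most closely to the pass mark in the test specification. An exhaustive search is fine for a small bank; a large bank would need a heuristic.

    from itertools import combinations

    def select_items(bank, test_size, target_fv_sum):
        """Return the subset of items whose FVs sum closest to the target."""
        return min(combinations(bank, test_size),
                   key=lambda subset: abs(sum(fv for _, fv in subset) - target_fv_sum))

    bank = [("Q1", 0.8), ("Q2", 0.6), ("Q3", 0.6), ("Q4", 0.3),
            ("Q5", 0.4), ("Q6", 0.7), ("Q7", 0.5), ("Q8", 0.9)]

    # A specification demanding five questions with a pass mark of 3/5
    # calls for items whose FVs aggregate to roughly 3.
    print(select_items(bank, test_size=5, target_fv_sum=3.0))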

CONTRASTING GROUPS

This method, unlike the previous ones, requires pre-testing. The test is issued to two groups of students – one group who are expected to pass and one group who are expected to fail. The actual scores are then plotted on a chart, and the intersection of the two distributions provides an initial pass mark. This initial pass mark is then refined using the SMEs' expert judgement.

The graph below illustrates the results for two groups of students – one group (the blue line) expected to fail and one group (the red line) expected to pass.

[Figure: two overlapping score distributions, with marks (0–100) on the x-axis and number of candidates (0–30) on the y-axis.]

Figure 6 – Setting pass marks using contrasting groups

The initial cut score would be around 55% (the approximate intersection of the two lines). Raising this to 60% would reduce the number of "incompetent" students who would pass the test – but increase the number of "competent" students who would fail. Conversely, decreasing the pass mark to 50% would reduce the number of "false fails" but increase the number of "false passes". The final decision is based on the professional judgement of the SMEs; the initial intersection itself, however, can be located mechanically, as the sketch below shows.
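A minimal sketch of locating the initial cut score (the score data are invented): bin each group's marks into histograms and find the first mark at which the expected-to-pass group outnumbers the expected-to-fail group.

    from collections import Counter

    def initial_cut_score(fail_scores, pass_scores, bin_width=5):
        fail_hist = Counter((s // bin_width) * bin_width for s in fail_scores)
        pass_hist = Counter((s // bin_width) * bin_width for s in pass_scores)
        # Scan the mark range for the approximate intersection of the two curves.
        for mark in range(0, 101, bin_width):
            if pass_hist[mark] > fail_hist[mark]:
                return mark
        return None

    fail_scores = [30, 35, 40, 40, 45, 45, 50, 50]    # invented data
    pass_scores = [50, 55, 60, 60, 65, 70, 75, 80]    # invented data
    print(initial_cut_score(fail_scores, pass_scores))  # 55 for this data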

These methods can be used alone or in combination. They all provide some scientific basis for the process of setting the pass mark. The alternative – plucking a pass mark out of thin air – is not an option.

DEALING WITH GUESSING

Guessing is often cited as a major problem with selected response questions, and it is true that blind guessing can produce relatively high marks for candidates in an objective test. For example, blind guessing in a true/false test should produce a result of approximately 50%. However, there are well-established ways of dealing with guessing: setting an appropriate pass mark, negative marking and correction-for-guessing.

SETTING AN APPROPRIATE PASS MARK

The simplest way of dealing with guessing is to adjust the pass mark accordingly.

Instead of the "traditional" 50% pass mark, the cut score can be made higher to compensate for the effects of guessing. For example, a multiple choice test that has a pass mark of 75% is unlikely to be passed by blind guessing. We have already seen three ways of determining the pass mark for an objective test (informed judgement, Angoff method and contrasting groups). Any of these methods will eliminate (or greatly reduce) the effects of guessing. The sketch below illustrates the arithmetic behind raising the cut score.
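A minimal sketch of that arithmetic (the formula is standard expected-value reasoning, not taken from this guide): a candidate who genuinely knows a proportion k of the material, and blindly guesses the remaining four-option items, has an expected mark of k + (1 - k)/4 – so a 75% pass mark demands substantially more than guesswork.

    def expected_mark(k, num_options=4):
        # Known items are answered correctly; the rest are blind guesses.
        return k + (1 - k) / num_options

    print(expected_mark(0.0))   # 0.25 -- blind guessing alone scores about 25%
    print(expected_mark(0.67))  # ~0.75 -- real knowledge is needed to reach 75%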

NEGATIVE MARKING

Negative marking involves deducting marks for incorrect answers. For example, the following table illustrates a candidate's scoring pattern in a five item test where one mark is awarded for a correct answer, zero marks where a question is not answered, and one mark is deducted for an incorrect answer.

Question    Mark
1            1
2            1
3            0
4           -1
5            1
Total        2

The main problem with negative marking is that it penalises partial knowledge. Selecting a "good" distractor is better than choosing a "bad" distractor – but both choices will result in the loss of a mark. A sketch of this marking scheme follows.
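A minimal sketch of the scheme above (the answer key and responses are invented; None denotes an unanswered question):

    def negative_marking_score(responses, key):
        score = 0
        for given, correct in zip(responses, key):
            if given is None:
                continue            # unanswered: no marks awarded or deducted
            score += 1 if given == correct else -1
        return score

    key       = ["b", "d", "a", "c", "b"]
    responses = ["b", "d", None, "a", "b"]   # matches the table: 1, 1, 0, -1, 1
    print(negative_marking_score(responses, key))  # 2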

CORRECTION-FOR-GUESSING

This technique involves deducting a certain number of marks from every candidate to compensate for the effects of guessing. The number of marks deducted can be worked out in a number of ways, ranging from the crude (a fixed number of marks deducted from every candidate) to the more sophisticated (where the number of marks deducted is not fixed but is based on an estimate of how many guesses each candidate has made). An example of the second approach follows.

In a 50 item test, where each item is a multiple choice question consisting of four options (a key and three distractors), a candidate scores 38/50. The number of marks deducted is based on the number of incorrect answers (which are assumed to be guesses) and is worked out as follows:

No. of marks deducted = No. of wrong answers × (1 / No. of distractors)

In this case:

No. of marks deducted = 12 × 1/3 = 4 marks

So, four marks would be deducted from this candidate, giving her an adjusted score of 34.
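A minimal sketch of this calculation (assuming, as the formula does, that every question was attempted, so that all wrong answers are guesses):

    def corrected_score(raw_score, num_items, num_options):
        wrong = num_items - raw_score            # wrong answers, assumed to be guesses
        num_distractors = num_options - 1
        return raw_score - wrong / num_distractors

    print(corrected_score(38, 50, 4))  # 34.0 -- the worked example above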

While less crude than negative marking, this method suffers from similar problems. It penalises partial knowledge as much as no knowledge, and it disproportionately affects risk-averse candidates, who may choose not to attempt a question rather than answer it for fear of losing marks – resulting in many unanswered questions and deflated marks.

SEQUENCING QUESTIONS

When deciding the order of items in a test, it should be borne in mind that tests should begin with relatively simple questions and progress to more complex ones. It is also advisable to group item types together – for example, all true/false items together and all MCQs together. So, in most cases, a test should begin with straightforward, low-difficulty true/false questions and progress to more complex, higher-difficulty MRQ or assertion/reason items. The sketch below shows one way to automate this ordering.
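A minimal sketch of such an ordering (the items, their types and their FVs are invented): group items by type, order the types from simplest to most complex, and sort within each type from easiest (high FV) to hardest (low FV).

    # Simpler item types come first; within a type, higher FV (easier) comes first.
    TYPE_ORDER = {"true/false": 0, "MCQ": 1, "MRQ": 2, "assertion/reason": 3}

    def sequence_items(items):
        return sorted(items, key=lambda item: (TYPE_ORDER[item["type"]], -item["fv"]))

    items = [
        {"id": "Q1", "type": "MRQ",        "fv": 0.4},
        {"id": "Q2", "type": "true/false", "fv": 0.9},
        {"id": "Q3", "type": "MCQ",        "fv": 0.7},
        {"id": "Q4", "type": "true/false", "fv": 0.8},
        {"id": "Q5", "type": "MCQ",        "fv": 0.5},
    ]

    for item in sequence_items(items):
        print(item["id"])   # Q2, Q4, Q3, Q5, Q1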

APPENDIX 1 – SAMPLE TEST SPECIFICATION

Source unit
Provide details about the unit, outcomes and performance criteria that the test is assessing.

Title: Internet Safety        Ref. no.: 10 1234        SCQF level: 4

Outcome(s) and performance criteria:

Outcome    Performance criteria    No. of questions
1          All                     9
2          d, e                    9
3          a, b, c                 7

Test details
Provide details about the test.

No. of questions: 25        Duration: 50 min.

Question format(s):

Type    Number    Additional info.
MCQ     20        4 options for each question
MRQ     5         4 options for each question

Selection of questions
Explain selection criteria for questions.

There must be a fixed number of questions for each outcome (see the distribution above). The question types (MCQ and MRQ) can be distributed between outcomes as desired.

Pass mark(s)
Including grading thresholds where applicable.

16/25

Rubric
Marking instructions, instructions to candidates, assessment conditions etc.

Marking instructions: One mark per question.

Assessment conditions (such as reference materials, location, authentication): No access to reference material (paper or web). Candidate authentication is required.

Instructions to candidates: No special instructions.

Author: Bobby Elliott        Date: 12 May 2006

APPENDIX 2 – SAMPLE TEMPLATE FOR MCQS

Item

Stem:

Options:
  Key:
  Distractor 1:
  Distractor 2:
  Distractor 3:

Metadata

Outcome:
PC(s):
PFV:
Tags:

Workflow

Writing:      Writer        Date        Time
Reviewing:    Reviewer      Date        Time
Banking:      Banker        Date        Time

APPENDIX 3 – CHECKLIST FOR MULTIPLE CHOICE QUESTIONS

Test ID:            Item ID:            Reviewer:

ITEM

The question relates to learning outcome(s) and performance criteria.

The level of language is appropriate to the target candidates.

The question is set at an appropriate level of difficulty.

There is one unambiguously correct answer.

Cueing is avoided.

STEM

The stem is phrased as a question.

Unnecessary information is not included.

Necessary standards are specified.

Negative wording is avoided.

Personal pronouns (“you”, “we”, etc.) are avoided.

Subjective wording is not used e.g. “What do you think…”.

OPTIONS

Options are sequenced in a definite order.

The options are similar in length.

Options are mutually exclusive.

The key is not distinctive in terms of length, wording etc.

Distractors are incorrect in every context (unless a specific context is given).

Definitive wording (“never”, “always”, etc.) is not used.

Pejorative wording is avoided e.g. “bad”, “little”, etc.

“None of the above” is used sparingly.

“All of the above” is not used.

COMMENTS