Confidence-based assessment in the 1st year medical end-of-year exam

Tony Gardner-Medwin, Physiology, UCL

a useful study tool - but why in exams?

it reflects the proper meaning of knowledge

conventional marking disadvantages able students

how did the students do in the exam?

confidence-based assessment was a more reliable measure of student ability

it saves on the number of questions required

What is Knowledge?

Knowledge depends on degree of belief, or confidence:

knowledge, uncertainty, ignorance, misconception, delusion

Scale of increasing nescience, measured as -log2(confidence*) for the truth of a true proposition:

  = 0    knowledge
  = 1    ignorance
  >> 1   delusion

Measurement of knowledge requires the eliciting of confidence (or *subjective probability) for the truth of correct statements.

This requires a proper scheme of incentives

LAPT confidence-based scoring scheme

Confidence Level       1       2       3
Score if correct       1       2       3
Score if incorrect     0      -2      -6
P(correct)           < 67%   > 67%   > 80%
Odds                 < 2:1   > 2:1   > 4:1

[Figure: Score (-8 to 4) plotted against -log2(Subjective Probability) (0 to 4)]

[Figure: Subjective Expectation of Score (0% to 100%) plotted against Subjective Probability (0.5 to 1), with one curve each for C=1, C=2 and C=3]
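The P(correct) thresholds in the scoring scheme follow from comparing the expected mark at each confidence level. The sketch below is a minimal illustration using only the marks from the table; the function names `expected_score` and `best_level` are hypothetical and are not part of LAPT.

```python
# Marks from the LAPT table: level -> (score if correct, score if incorrect)
MARKS = {1: (1, 0), 2: (2, -2), 3: (3, -6)}

def expected_score(p, level):
    """Expected mark at a confidence level when the answer is right with probability p."""
    right, wrong = MARKS[level]
    return p * right + (1 - p) * wrong

def best_level(p):
    """Confidence level giving the highest expected mark for probability p."""
    return max(MARKS, key=lambda level: expected_score(p, level))

# Print the best level for a range of subjective probabilities
for p in (0.5, 0.6, 0.7, 0.75, 0.85, 0.95):
    level = best_level(p)
    print(f"p={p:.2f}  best C={level}  expected mark={expected_score(p, level):.2f}")
```

Setting the expected marks equal gives the crossover points: p = 4p - 2 at p = 2/3 (C=1 vs C=2), and 4p - 2 = 9p - 6 at p = 0.8 (C=2 vs C=3). A student therefore maximises the expected mark by reporting confidence honestly.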


Suppose 4 students go for the same answer options in an exam: 75 correct, 25 incorrect

Ai is confident of all his answers

Bo is very hesitant about all her answers

Cy is realistic (expects 75%), but can’t distinguish reliable & uncertain answers

Di is confident of 50 answers (90% correct) and uncertain of the others (60% correct)

Clearly: Di > Cy > Bo, Ai

Di has extra insight - about her knowledge, or maybe about subtleties in questions

How can she use this insight?

Conventional scoring: Her only option is to omit uncertain answers:
% correct: Ai = Bo = Cy = 75%, Di = 45%
negative marking score (±1): Ai = Bo = Cy = 50%, Di = 40%

Confidence-based scoring: She can moderate her confidence:
Ai enters all at C=3, Bo at C=1: Ai = Bo = 25%
Cy enters all at C=2: Cy = 33%
Di splits answers C=3, C=1: Di = 48%
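The confidence-based figures for the four students can be checked with a short calculation. This is a sketch under stated assumptions: marks are those of the LAPT table, scores are expressed as a fraction of the maximum (3 marks per question), and Di is taken to enter exactly 50 answers at C=3 (45 right, 5 wrong) and 50 at C=1 (30 right, 20 wrong). The name `cb_score` is hypothetical.

```python
# Marks from the LAPT table: level -> (score if correct, score if incorrect)
MARKS = {1: (1, 0), 2: (2, -2), 3: (3, -6)}
MAX_MARK = 3  # best possible mark per question (assumed normalisation)

def cb_score(groups):
    """groups: list of (n_correct, n_incorrect, level); returns fraction of maximum."""
    total = sum(c * MARKS[lvl][0] + w * MARKS[lvl][1] for c, w, lvl in groups)
    n_questions = sum(c + w for c, w, _ in groups)
    return total / (MAX_MARK * n_questions)

students = {
    "Ai": [(75, 25, 3)],               # everything at C=3
    "Bo": [(75, 25, 1)],               # everything at C=1
    "Cy": [(75, 25, 2)],               # everything at C=2
    "Di": [(45, 5, 3), (30, 20, 1)],   # assumed 50/50 split between C=3 and C=1
}

for name, groups in students.items():
    print(name, f"{cb_score(groups):.0%}")
```

With these assumptions the sketch reproduces 25% for Ai and Bo and 33% for Cy. For Di it gives 45% rather than the 48% quoted, so the quoted figure presumably rests on details of the split or normalisation that are not recoverable here; the ordering Di > Cy > Ai = Bo is unchanged either way.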

Summary aims

• reward the ability to distinguish reliable and uncertain answers (whatever the reason for uncertainty)

• penalise confident errors more than errors from uncertainty

What people sometimes think is the aim!

• to penalise a general over-confidence or under-confidence - probably helped by practice & feedback, but not an exam issue

NB over-confidence can actually get you places !!

How well did students discriminate?

Exam: 500 T/F Qs, in 2 sessions, each 2 hrs

331 students: 190 F, 141 M

[Figure: % correct (50% to 100%) for answers entered at C=1, C=2 and C=3, shown for the groups i-c, exF and exM, with 5% and 95% percentiles]

[Figure A: confidence-based score (0% to 100%) plotted against conventional scaled score (“simple” score, 0% to 100%, where 0% corresponds to 50% correct). Curves labelled a to d. Legend entries: y = x^1.67; equality (only expected for a pure mix of certain knowledge and total guesses); scores if uncertainty is homogeneous and correctly reported; theoretical scores for homogeneous uncertainty, based on an information theoretic measure]

Breakdown of credit and variance due to uncertainty

[Figure: percentages (0% to 100%) contributed by answers entered at C=1, C=2 and C=3 to: answers, correct answers, and credit and variance under simple scores and under confidence scores]

Simple scores (scaled conventional scores): 65% of the variance came from answers at C=1, but only 18% of the credit.

Confidence-based scores: these give less weight to uncertain answers; uncertainty variance is then more in proportion to credit, and was reduced by 46% (relative to the variation of student marks)

Exam marks are determined by:

1. the student’s knowledge and skills in the subject area

2. the level of difficulty of the questions

3. chance factors in the way questions relate to details of the student’s knowledge

4. chance factors in the way uncertainties are resolved (luck)

(1) = “signal” (its measurement is the object of the exam)

(3,4) = “noise” (random factors obscuring the “signal”)

Confidence-based marks improve the “signal-to-noise ratio”

The most convincing test of this is to compare marks on one set of questions with marks for the same student on a different set. A good correlation means we are measuring something about the student, not just “noise”

The correlation, across students, between scores on one set of questions and another is higher for confidence than for simple scores.
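The comparison of one question set with another can be sketched as a random split-half correlation. The code below is illustrative only: the student marks are synthetic, the sizes (50 students, 40 questions) are arbitrary, and the function names are hypothetical. It shows the procedure, not the exam data.

```python
import random

def pearson_r2(xs, ys):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def split_half_r2(marks, rng):
    """marks[s][q] is student s's mark on question q.
    Randomly partition the questions into two sets and correlate the
    per-student totals on one set with the totals on the other."""
    n_q = len(marks[0])
    order = list(range(n_q))
    rng.shuffle(order)
    set_a, set_b = order[: n_q // 2], order[n_q // 2:]
    totals_a = [sum(row[q] for q in set_a) for row in marks]
    totals_b = [sum(row[q] for q in set_b) for row in marks]
    return pearson_r2(totals_a, totals_b)

# Synthetic data: each student answers each T/F question correctly
# with a personal probability between 0.5 and 1.0
rng = random.Random(0)
abilities = [rng.uniform(0.5, 1.0) for _ in range(50)]
marks = [[1 if rng.random() < a else 0 for _ in range(40)] for a in abilities]

# Average r^2 over 16 random partitionings of the questions
r2_values = [split_half_r2(marks, rng) for _ in range(16)]
print(f"mean r^2 over {len(r2_values)} partitionings: {sum(r2_values) / len(r2_values):.3f}")
```

To compare scoring schemes, the same partitionings would be applied to the per-question marks under each scheme (simple and confidence-based) and the resulting r^2 values compared.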

But perhaps they are just measuring ability to handle confidence?

[Figure, panels B to D: scatter plots of student scores on question set 2 against scores on question set 1]

B. set 1 (simple) vs set 2 (simple): R^2 = 0.735

C. set 1 (confidence) vs set 2 (confidence): R^2 = 0.814

D. set 1 (conf^0.6) vs set 2 (simple): R^2 = 0.776

No. Confidence scores are better than simple scores at predicting even the conventional scores on a different set of questions. This can only be because they are a statistically more efficient measure of knowledge.

How should one handle students with poor calibration?

Significantly overconfident: 2 students (1%)

e.g. 50% correct @C=1, 59%@C=2, 73%@C=3

Significantly underconfident: 41 students (14%)

e.g. 83% correct @C=1, 89%@C=2, 99%@C=3

Maybe one shouldn’t penalise such students

Adjusted confidence-based score:

Mark the set of answers at each C level as if they were entered at the C level that gives the highest score.

mean benefit = 1.5% ± 2.1% (median 0.6%)
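The adjustment rule can be sketched as follows. This is a minimal illustration, assuming the LAPT marks from the scoring scheme and, purely for the example, 100 answers at each confidence level for the overconfident student quoted above; the function names are hypothetical.

```python
# Marks from the LAPT table: level -> (score if correct, score if incorrect)
MARKS = {1: (1, 0), 2: (2, -2), 3: (3, -6)}

def mark(n_correct, n_incorrect, level):
    """Total mark for a group of answers marked at the given confidence level."""
    right, wrong = MARKS[level]
    return n_correct * right + n_incorrect * wrong

def raw_score(answers):
    """answers: {level entered: (n_correct, n_incorrect)}; marked as entered."""
    return sum(mark(c, w, level) for level, (c, w) in answers.items())

def adjusted_score(answers):
    """Mark each group as if entered at whichever level gives it the highest mark."""
    return sum(max(mark(c, w, level) for level in MARKS) for c, w in answers.values())

# Overconfident example: 50% correct @C=1, 59% @C=2, 73% @C=3
# (100 answers per level is an assumption made for illustration)
overconfident = {1: (50, 50), 2: (59, 41), 3: (73, 27)}
print(raw_score(overconfident), adjusted_score(overconfident))  # 143 201
```

In this example the C=2 answers score best when re-marked at C=1 and the C=3 answers when re-marked at C=2, which is what one would expect for a student whose accuracy at each level falls below the threshold for that level.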

[Figure A (repeated): confidence-based score (0% to 100%) plotted against simple scaled score (0% to 100%, where 0% corresponds to 50% correct and 100% to 100% correct). Curves labelled a to d. Legend entries: y = x^1.67; equality (only expected for a pure mix of certain knowledge and total guesses); scores if uncertainty is homogeneous and correctly reported; theoretical scores for homogeneous uncertainty, based on an information theoretic measure]

[Figure: mean r^2 (axis 0.700 to 0.900) for the pairings simple : simple, conf : conf, conf(adj) : conf(adj), simple : conf, and simple : conf(adj)]

Mean values of r^2 for 16 random partitionings of the 500 questions: score on one set vs score on the other

                                  simple   conf   conf (adj)
Signal / noise variance ratio:     2.8     5.3      4.3
Savings in no. of Qs required:      -      48%      35%
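The link between the two rows of the table can be illustrated with a simple model. This is an assumption, not necessarily the calculation used for the slide: if signal variance in a total score grows with the square of the number of questions and noise variance grows linearly, then the number of questions needed for a given reliability is inversely proportional to the signal/noise ratio. The function name `question_savings` is hypothetical.

```python
def question_savings(snr_baseline, snr_new):
    """Fraction of questions saved at equal reliability, assuming the number
    of questions required is inversely proportional to the signal/noise ratio."""
    return 1 - snr_baseline / snr_new

# Signal/noise variance ratios from the table (rounded to one decimal place)
print(f"conf:       {question_savings(2.8, 5.3):.0%}")  # 47%
print(f"conf (adj): {question_savings(2.8, 4.3):.0%}")  # 35%
```

Under this assumption the rounded ratios give about 47% and 35%, close to the 48% and 35% in the table; the small difference is consistent with the table having been computed from unrounded ratios.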

SUMMARY CONCLUSIONS

• Adjusted confidence scores seem the best scores to use (they don’t discriminate on the basis of the calibration of a person’s confidence judgements, and are also the best predictors of performance on a separate set of questions).

• Reliable discrimination of student knowledge can be achieved with one third fewer questions, compared with conventional scoring.

• Confidence scoring is not only fundamentally more fair (rewarding students who can correctly identify which answers are uncertain) but it is more efficient at measuring performance.

• www.ucl.ac.uk/~cusplap

Principles that students seem readily to understand:

• confident errors are far worse than acknowledged ignorance and are a wake-up call (-6!) to pay attention to explanations

• expressing uncertainty when you are uncertain is a good thing

• thinking about the basis and reliability of answers can help tie bits of knowledge together (to form “understanding”)

• checking an answer and rereading the question are worthwhile

• sound confidence judgement is a valued intellectual skill in every context, and one they can improve

• both under- and over-confidence are impediments to learning
