Educational portfolios in the assessment of general practice trainers: reliability of assessors
John Pitts,1 Colin Coles2 & Peter Thomas3

1Associate Director in Postgraduate General Practice Education, NHSE S&W and Honorary Research Fellow, King Alfred's University College, Winchester, UK; 2Professor in Medical Education and 3Reader in Statistics, Institute of Health and Community Studies, Bournemouth University, UK
Correspondence: Dr John Pitts, Chilgrove, Noads Way, Dibden Purlieu, Southampton, Hampshire SO45 4PD, UK
Objectives This paper reports a project that assessed a
series of portfolios assembled by a cohort of partici-
pants attending a course for prospective general
practice trainers.
Design The reliability of judgements about individual `components', together with an overall global judgement about performance, was studied.
Setting NHSE South & West, King Alfred's College, Winchester, and the Institute of Health and Community Studies, Bournemouth University.
Subjects Eight experienced general practice trainers re-
cruited from around Wessex, which incorporates
Hampshire, Dorset, Wiltshire and the Isle of Wight.
Results The reliability of individual assessors' judgements (i.e. their consistency) was moderate, but inter-rater reliability did not reach a level which could support making a safe summative judgement. The levels of reliability reached were similar to those of other subjective assessments, and perhaps reflected the individuality of the personal agendas of both the assessed and the assessors, and variations in portfolio structure and content.
Conclusions Suggestions for future approaches are made.
Keywords Education, medical, continuing, *standards;
family practice, *education; teaching, *standards.
Medical Education 1999;33:515–520
Background
In postgraduate medical education in the United
Kingdom the quality of training posts, and the devel-
opment and skills of educationalists, are issues rapidly
ascending the agendas of colleges, postgraduate deans
and national policy makers. More specifically in general practice, certification and re-certification of GP trainers is an important professional and political issue. Currently, educational competence is inferred from a range of disparate assessment measures which include the teaching environment and resources, as well as a subjective judgement of how good a trainer this person is likely to be. Performance as a teacher falls within this overall view, and there is an assumption that a trainer who can reach some (usually unexplained) `standard' is competent as a teacher. However, while the Joint Committee on Postgraduate Training for General Practice recognizes the difficulties in determining the teaching ability of trainers and stresses the importance of assessing this task within the regions, no strategies have so far been offered.1
Introduction
Earlier work by us exploring this field used an assessment tool based on valid criteria of `good' teaching.2 A panel of experienced trainers was asked to judge a series of video-recorded tutorials between general practice registrars and their trainers. The reliability of panel members' judgements about individual `components', together with an overall global judgement about performance, was studied. While the reliability of individual assessors' judgements (i.e. their consistency) was moderate, inter-rater reliability did not reach a level which could support making a safe summative judgement. While evidence suggested that increasing the number of teaching episodes would increase reliability towards acceptable levels, feasibility and cost persuaded us that this was impractical: too many videos would need to be assessed by too many assessors.
Following this work, and in recognition of other
considerations concerning the issue of professional
competence, a `new direction' based on portfolios was
suggested.3
The use of portfolios as a record for the purposes of
both learning and assessment has grown, from initial
application by the Armed Forces and later within the
arts. Embraced particularly in the professions supplementary to medicine and in nursing, at both undergraduate and postgraduate levels, various interpretations have evolved, ranging from a logbook resembling a curriculum vitae4 through case logs5 to deeply personal reflective accounts. In general practice in the UK, portfolios have been used in both vocational training with general practice registrars6 and continuing education with GP principals.7 The Royal College of Physicians and Surgeons of Canada is using an electronic learning portfolio in its re-certification processes.8 There is considerable agreement, both in the literature and in practice, about the value of keeping a portfolio (and receiving feedback on it) as a formative process, particularly at the higher levels of a taxonomy of portfolios that ranges from a simple record of actions, through reflections on those actions in order to explain, understand and expose reasoning processes, to observations of learning outcomes and the elaboration and application of learning to new tasks. In contrast, video recordings merely show actions at a particular moment in time.
However, before this approach is seized upon as the `latest' panacea to a problem with no ideal solution, studies need to be carried out to establish important background data. While a substantial amount of literature exists in many fields, primarily teaching and nursing, psychometric data to support the use of portfolios as a summative assessment tool are sparse, and lacking in the majority of published papers.
The aims of the study were, therefore: to define, discuss and develop with a group of assessors the proposed criteria for assessment; to assess the portfolios of a group of prospective general practice trainers; and to study the inter-rater and intra-rater reliability of the assessors' judgements.
Method
Study group
The assessors were eight experienced general practice
trainers recruited from around Wessex, which incorpo-
rates Hampshire, Dorset, Wiltshire and the Isle of
Wight. The portfolios used in the study were those
written by participants of the Wessex New Trainers
Course held in the autumn of 1997. The course comprised 5 separate days, each about 3–4 weeks apart. `Homework' and reading were carried out in the intervening periods. `Loose' guidelines about portfolio content were given, with further discussions held within the learning groups as they progressed through the course.
Development of assessment criteria
Previous work9,10 has provided a baseline for assessing
teaching behaviours likely to be effective. A training session prior to the study introduced, refined and reinforced six assessment criteria based on observable and recordable thought processes, taking into account our views of the complexity of professional practice.11–13 These included:
1 Evidence of `reflective learning'.
2 Awareness of `where they were': consideration of past learning experiences, identification of personal learning needs.
3 Recognition of effective teaching behaviours.
4 Ability to identify with being a learner.
5 Awareness of educational resources.
6 Drawing conclusions: overall reflections on the course and their future career development.
These criteria were incorporated into a marking
schedule (Appendix). A judgement was sought from the
assessors on whether and to what degree these could be
seen, together with an overall global judgement by them
as to whether the portfolio was deemed to be satisfac-
tory. All assessors examined all portfolios on two oc-
casions, 1 month apart. Assessors attended a debrie®ng
workshop after the project to discuss and record their
experiences, and identify points of dif®culty. Portfolios
varied greatly in their size, depth and complexity. Many
included copies of published material, and many in-
cluded details such as course records and timetables.
No attempt to `edit' these was made; the assessors
viewed the portfolios in their entirety.
Thirteen prospective general practice trainers attended the course; 12 agreed to the use of their portfolios in this project. Each portfolio was judged by the assessors against the six criteria plus a `global' pass/refer rating. The criteria were originally coded 0–5. To concentrate the data, scores of 0–2 were grouped as `refer' and 3–5 as `pass'.
The overall level of agreement (above that expected as a result of chance) between the eight assessors in rating the subjects as `pass' or `refer' was estimated using the kappa (κ) statistic.14 Values of κ over 0.8 indicate excellent agreement; between 0.61 and 0.8, substantial agreement; 0.41–0.6, moderate agreement; 0.21–0.4, fair agreement; 0–0.2, slight agreement; and less than 0, poor agreement.15 To assess whether the values of κ were significantly different from levels of agreement that would be expected by chance, they were compared with their asymptotic standard errors.14
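To make the computation concrete, the following minimal Python sketch (ours, not part of the original study) computes a Fleiss-style multirater κ for two categories; we assume this form is comparable to the Davies & Fleiss estimator14 for this balanced two-category design. Applied to the global pass counts read from Table 1, it yields κ ≈ 0.32, the value reported in Table 2.

```python
# Minimal sketch: Fleiss-style multirater kappa for pass/refer ratings.
def fleiss_kappa(pass_counts, n_raters):
    """pass_counts: number of `pass' verdicts per portfolio (2 categories)."""
    n_subjects = len(pass_counts)
    # Per-portfolio agreement: proportion of concordant rater pairs.
    p_i = [(p * p + (n_raters - p) ** 2 - n_raters)
           / (n_raters * (n_raters - 1)) for p in pass_counts]
    p_bar = sum(p_i) / n_subjects
    # Chance agreement from the marginal proportion of passes.
    p_pass = sum(pass_counts) / (n_subjects * n_raters)
    p_e = p_pass ** 2 + (1 - p_pass) ** 2
    return (p_bar - p_e) / (1 - p_e)

global_passes = [5, 7, 8, 1, 6, 7, 7, 8, 6, 2, 8, 7]  # Table 1, global column
print(round(fleiss_kappa(global_passes, n_raters=8), 2))  # -> 0.32
```

Significance can then be gauged by comparing κ with its asymptotic standard error (e.g. z = κ/SE), as in the analysis above.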
Results
Inter-rater reliability
This was explored using the first assessment made by each assessor. The overall scores of the eight assessors are shown in Table 1. Ten portfolios were judged to have passed on the global assessment by more than half (five or more) of the assessors. Only three were judged to have passed by all assessors. Of the two portfolios that were not passed by five or more assessors, one was passed by one assessor and the other by two. The number of overall passes given by the assessors ranged from 6 to 11, suggesting some variability in these assessors' tendency to pass portfolios. Agreement between assessors ranged from `slight' to `fair' (Table 2), but was significantly above the level expected by chance (P < 0.05).
Intra-rater reliability (rater consistency)

On re-assessment, agreement within assessors was calculated using an overall κ, the average of the eight κ values calculated for the individual assessors. Also shown is the number of portfolios for which the pass/refer assessment changed, averaged over the eight assessors (Table 3). Rater consistency is therefore characterized as `moderate' agreement. Consistency was greatest for the `global' and `reflectiveness' judgements and least for the `identification with being a learner' criterion.
Use as an assessment tool
A possible assessment system might use two assessors
to judge a portfolio. On the basis of this work, we can show how the results would be affected if two assessors were chosen at random from our panel of eight, which allows 28 possible pairs. Table 4 demonstrates that, with a trainer designated as `pass' if at least one assessor of the pair passed him or her, a trainer passed in this study by only one of the panel of eight would have a 25% chance of being passed by the paired system (row 2, column 2). A trainer who was passed by all but one assessor would be guaranteed a pass in the paired system (row 8, column 2). In summary, this system would be geared towards passing trainers (there would be many false passes). If both assessors had to award a pass, the system would be much harsher: for example, a trainer who was passed by seven of the eight assessors would have a 25% chance of being referred under a paired system.
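The Table 4 percentages follow from simple counting over the C(8,2) = 28 possible assessor pairs. A short sketch (ours, not part of the original analysis) reproduces them:

```python
# Sketch: reproduce Table 4 by enumerating all 28 assessor pairs.
from itertools import combinations

N_ASSESSORS = 8
for m in range(N_ASSESSORS + 1):        # m of 8 assessors passed the trainer
    verdicts = [True] * m + [False] * (N_ASSESSORS - m)
    pairs = list(combinations(verdicts, 2))   # the 28 possible pairs
    both = sum(a and b for a, b in pairs) / len(pairs)
    at_least_one = sum(a or b for a, b in pairs) / len(pairs)
    print(f"{m} passes: both {both:.0%}, at least one {at_least_one:.0%}, "
          f"both refer {1 - at_least_one:.0%}")
```

For m = 1 this gives 25% under the `at least one pass' rule; for m = 7 it gives both passing 75% of the time, i.e. a 25% chance of referral when both assessors must pass, matching the figures quoted above.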
Within our small sample, the percentage of global passes given by an assessor ranged from 50 to 92%. This variation was not statistically significant (P = 0.17, using the Cochran Q-test), suggesting that the data were not incompatible with equal pass rates for the different assessors.
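The Cochran Q-test compares the marginal pass rates of matched binary raters. The per-assessor-by-portfolio verdicts are not published here, so the sketch below (ours) shows the standard computation on a small hypothetical matrix for illustration only:

```python
# Sketch: Cochran's Q for k matched binary raters (rows = portfolios).
from math import fsum
from scipy.stats import chi2  # chi-square survival function for the P-value

def cochran_q(verdicts):
    """verdicts: list of rows, one per portfolio; entries 0/1 per assessor."""
    k = len(verdicts[0])                               # number of assessors
    col = [sum(row[j] for row in verdicts) for j in range(k)]
    row_sums = [sum(row) for row in verdicts]
    total = sum(row_sums)
    q = (k - 1) * (k * fsum(c * c for c in col) - total ** 2) \
        / (k * total - fsum(r * r for r in row_sums))
    return q, chi2.sf(q, df=k - 1)                     # Q, P-value

# Hypothetical 3-portfolio x 4-assessor matrix, purely illustrative.
example = [[1, 1, 0, 1],
           [1, 0, 0, 1],
           [1, 1, 1, 1]]
print(cochran_q(example))  # approximately (4.71, 0.19)
```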
Assessors' comments
Areas of difficulty identified by the assessors are listed in Table 5. The main areas of difficulty stemmed from the individuality of the portfolios, and variation in the `starting points' of each prospective trainer. Material included within portfolios was not always referred to in the discussion: some participants appeared to be `squirrels' who merely filed documentation without apparent further thought or connection, while others more clearly used and cross-referred to papers and other literature within their written reflections.
Discussion

This work has shown that, using valid criteria, a group of experienced trainers who had been trained as assessors through devising and agreeing the criteria to be used can achieve only a `fair' degree of agreement (inter-rater reliability) regarding a trainer's learning portfolio. These data are similar to those from our earlier work on video-recorded teaching, and do not reach a level where a summative judgement could be made safely. Similar problems were identified when GP trainers marked audit projects submitted by general practice registrars as part of the summative assessment process: the trainers failed to recognize basic audit methodology, despite using a marking schedule to which they themselves had contributed. Furthermore, although not qualified by statistical data, a group of `expert' assessors, used as comparators in that study, also failed to agree on the levels of competence for some of the criteria.16
One explanation for these inconsistencies has been highlighted by Phillips,17 quoting Thomas Kuhn, who popularized the notion that inquirers always work within a paradigm: a framework that determines the concepts that are used. He states that `in a sense, all inquirers are trapped within their own paradigms; they will judge certain things as being true (for them) that other inquirers will judge as being false (for them)'.
Overall, these data are similar to those from studies of clinical competence. For example, 32 assessors of a videotaped simulated patient encounter achieved κ scores of 0.3–0.4 on component scores, with 25% of the assessors failing the candidate, 50% rating the performance as `marginal' and 25% rating it as `satisfactory' on a global judgement.18 Similarly, a study of senior medical students produced reliabilities on components of competence of 0.2–0.4, with the authors concluding that serious questions can be raised about using such scores as indicators of student performance.19 These authors calculated that to achieve the recommended `summative' reliability of 0.8, a range of 45–170 cases per candidate
would need to be examined! It is possible that higher levels of rater reliability could be achieved through more training of the assessors. However, in a study of undergraduate examinations, training of examiners produced no significant improvement in reliability; in fact, the greatest improvement came from identifying and removing the most inconsistent examiners.20 Wilson et al.21 suggested that pass/fail decisions should not be made on the basis of clinical examinations because of the magnitude of examiner variation. Training of doctor examiners yielded no benefit compared with using medical students or lay staff as examiners.22
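The 45–170 figure quoted above comes from the cited authors' own generalizability analysis, but projections of this kind are often illustrated with the Spearman–Brown prophecy formula. The sketch below is ours, with assumed single-case reliabilities chosen purely for illustration, not the cited calculation:

```python
# Sketch: Spearman-Brown projection of the number of cases needed to
# raise an assumed single-case reliability r1 to a target reliability.
def cases_needed(r1, target=0.8):
    return target * (1 - r1) / (r1 * (1 - target))

for r1 in (0.05, 0.10, 0.20):   # assumed, purely illustrative values
    print(f"r1 = {r1:.2f}: about {cases_needed(r1):.0f} cases for 0.8")
```

Even modest per-case reliabilities translate into dozens of cases, which is why judgements based on a single episode fare so poorly.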
Does this mean that professional people are not capable of assessing the professional competence of others or, more particularly here, that portfolios should remain effective formative educational tools but be excluded from summative assessment? The evidence of this study suggests so, but as an assessment instrument portfolios have particular advantages. These stem primarily from their differences from a `typical' examination situation: a pressurized, stressful, time-limited event occurring at the end of a course or programme. Immediately identifiable is the fact that this method is not an examination; completing the portfolio over time allows multiple attempts and opportunities, allows for revision and reflection, and can address multiple tasks and use many forms of data entry.
Issues for further exploration include: can reliability be improved, perhaps by offering a structure to the portfolio that could assist the assessment process, given that those used in this work were relatively `free-range'? Would this diminish the strengths of this approach? Portfolios are claimed to have high face and content validity because they represent a comprehensive and personal record,23 but what is the relationship between what can be observed and assessed in a portfolio and the qualities of a `good' teacher?
Finally, and most importantly, the question we feel needs to be addressed is: should assessment methodology focus on an entirely different approach? It may be that traditional psychometric views of reliability and validity are actually limiting more meaningful educational approaches. A reductionist view of assessment would hold that criteria should be tightly defined and capable of being operationalized in behavioural terms; but portfolios are `narrative accounts'24 which require `interpretation'. Therefore, let us start from the position
that applying measures such as reliability and validity is not appropriate for portfolio-based learning.

Table 1 Portfolio assessment scores, by numbers of assessors (n = 8). Each cell gives the number of assessors rating that component `pass'/`refer'; `Final' is the overall pass/refer outcome. Column labels abbreviate the criteria: reflective learning processes; awareness of `present state', willingness to learn; recognition of effective teaching behaviours; identification with being a learner; awareness of educational resources; drawing conclusions, thinking about the future; and the `global' judgement.

Candidate  Reflectiveness  Willingness  Recognition  Learner  Resources  Conclusions  Global  Final
1          4/4             6/2          6/2          6/2      6/2        1/7          5/3     Pass
2          8/0             8/0          8/0          5/3      8/0        7/1          7/1     Pass
3          7/1             8/0          6/2          6/2      2/6        8/0          8/0     Pass
4          1/7             3/5          1/7          1/7      2/6        2/6          1/7     Refer
5          7/1             8/0          7/1          6/2      7/1        6/2          6/2     Pass
6          7/1             7/1          7/1          5/3      8/0        8/0          7/1     Pass
7          7/1             7/1          7/1          8/0      8/0        7/1          7/1     Pass
8          8/0             8/0          6/2          5/3      7/1        7/1          8/0     Pass
9          7/1             7/1          7/1          6/2      8/0        7/1          6/2     Pass
10         5/3             4/4          5/3          4/4      4/4        1/7          2/6     Refer
11         8/0             8/0          7/1          7/1      8/0        7/1          8/0     Pass
12         6/2             6/2          5/3          6/2      7/1        6/2          7/1     Pass

What
can we learn from qualitative approaches? Qualitative research takes an interpretive, naturalistic approach to its subject matter: qualitative researchers study things in their natural settings, attempting to make sense of, or interpret, phenomena in terms of the meanings that people bring to them.25 This sounds congruent with a portfolio approach. Lincoln and Guba26 and Guba and Lincoln27 use terminology such as `trustworthiness', comprising credibility, dependability, transferability and confirmability, as reflecting rigour in research findings. While accepting that ultimately a final `quantitative' outcome, i.e. pass/fail, will be necessary, basing portfolio assessments around such concepts will form the basis of our future work in this area.
References
1 Joint Committee on Postgraduate Training for General Practice (JCPTGP). Recommendations to Regions for the Establishment of Criteria for the Approval and Reapproval of Trainers in General Practice. London: JCPTGP; 1992.
2 Pitts J, Coles C, Thomas P. Exploring the introduction of a performance-based component into the certification and re-certification of general practice trainers. Educ General Pract 1998;9:316–24.
3 Pitts J, Coles C, Percy D. Performance-based certification and recertification of general practice trainers: a new direction. Educ General Pract 1998;9:291–8.
4 United Kingdom Central Council (UKCC) for Nursing. The Future of Professional Practice. The council's standard for education and practice following registration. London: UKCC; 1994.
5 Finlay IG, Maughan TS, Webster DJT. Portfolio learning: a proposal for undergraduate cancer teaching. Med Educ 1994;28:79–82.
6 Snadden D, Thomas ML, Griffin EM, Hudson H. Portfolio-based learning and general practice vocational training. Med Educ 1996;30:148–52.
7 Challis M, Mathers NJ, Howe AC, Field NJ. Portfolio-based learning: continuing medical education for general practitioners – a mid-point evaluation. Med Educ 1997;31:22–6.
8 Parboosingh J. Learning portfolios: potential to assist health professionals with self-directed learning. J Contin Educ Health Professions 1996;16:75–81.
9 Coles C. A review of learner-centred education and its applications in primary care. Educ General Pract 1994;5:19–25.
10 Pitts J. Pathologies of one-to-one teaching. Educ General Pract 1996;7:118–22.
11 Schon D. The Reflective Practitioner. San Francisco: Jossey-Bass; 1983.
12 Eraut M. Developing Professional Knowledge and Competence. London: Falmer Press; 1994.
13 Fish D, Coles C. Developing Professional Judgment in Health Care. Learning through the critical appreciation of practice. London: Butterworth Heinemann; 1998.
Table 2 Inter-rater reliability

Assessment criterion    Kappa    Asymptotic standard error
Global                  0.32     0.05
`Reflectiveness'        0.27     0.05
`Willingness'           0.19     0.05
`Recognition'           0.16     0.04
`Learner'               0.10     0.05
`Resources'             0.37     0.05
`Conclusions'           0.41     0.05

Table 3 Intra-rater reliability

Assessment criterion    Kappa    Average number of disagreements
Global                  0.54     2.3
`Reflectiveness'        0.54     1.9
`Willingness'           0.53     1.9
`Recognition'           0.41     2.9
`Learner'               0.38     3.3
`Resources'             0.53     2.5
`Conclusions'           0.49     2.9
Table 4 Results of assessment by a pair of assessors

Number of assessors    % pairs both    % pairs at least    % pairs both
passing candidate      passing*        one pass*           referring*

0                        0               0                 100
1                        0              25                  75
2                        4              46                  54
3                       11              64                  36
4                       21              79                  21
5                       36              89                  11
6                       54              96                   4
7                       75             100                   0
8                      100             100                   0

*Out of the 28 possible ways of pairing the eight assessors.
Table 5 Areas of difficulty reported by assessors

Variability in the individual `starting points' of the trainers
Unclear about the relevance of some inclusions
Unclear which were group and which were individual reflections
Applying standards: peer or criterion referencing?
Relating the global impression to individual components
The bulk of some portfolios was inhibiting
14 Davies M, Fleiss JL. Measuring agreement for multinomial data. Biometrics 1982;38:1047–51.
15 Kianifard F. Evaluation of clinimetric scales: basic principles and methods. Statistician 1994;43:475–82.
16 Lough JRM, Murray TS. Training for audit: lessons still to be learned. Br J General Pract 1997;47:290–2.
17 Phillips DC. Subjectivity and objectivity: an objective inquiry. In: Eisner EW, Peshkin A, editors. Qualitative Inquiry in Education – the continuing debate. New York: Teachers College Press; 1990.
18 Herbers JE, Noel GL, Cooper GS, Harvey J, Pangaro LN, Weaver MJ. How accurate are faculty evaluations of clinical competence? J General Intern Med 1989;4:202–8.
19 Colliver JA, Vu NV, Markwell SJ, Verhulst SJ. Reliability and efficiency of components of clinical competence assessed with five performance-based examinations using standardised patients. Med Educ 1991;25:303–10.
20 Newble DI, Hoare J, Sheldrake PF. The selection and training of examiners for clinical examinations. Med Educ 1980;14:345–9.
21 Wilson GM, Lever R, Harden RM, Robertson JIS, MacRitchie J. Examination of clinical examiners. Lancet 1969;i:37–40.
22 Van der Vleuten CPM, Van Luyk SJ, Van Ballegooijen AMJ, Swanson DB. Training and experience of examiners. Med Educ 1989;23:290–6.
23 Jolly B, Grant J. The Good Assessment Guide; a practical guide to assessment and appraisal for higher specialist training. London: Joint Centre for Education in Medicine; 1997.
24 Greenhalgh T, Hurwitz B. Why study narrative? BMJ 1999;318:48–50.
25 Denzin NK, Lincoln YS. Handbook of Qualitative Research. London: Sage Publications; 1994.
26 Lincoln YS, Guba EG. Naturalistic Inquiry. London: Sage Publications; 1985.
27 Guba EG, Lincoln YS. Fourth Generation Evaluation. London: Sage Publications; 1989.
Received 2 December 1998; editorial comments to authors 11 January
1999; accepted for publication 3 February 1999
Appendix

PORTFOLIO ASSESSMENT GUIDE – Candidate no: …

Each criterion is rated on a 0–5 scale, from 0 (unsatisfactory) to 5 (satisfactory).

1. Reflective learning process
   Demonstration of learning, proof of use, change
2. Awareness of `present state', willingness to learn
   Learning experiences, hopes and expectations, identification and definition of educational needs shown
3. Recognition of effective teaching behaviours
   Listening, questioning, identifying `wants' and `needs', defining and agreeing agenda, reflective teaching, summarizing, evaluating
4. Identifying with the learner
   Recognizing uncertainty, acknowledging ignorance, learning together
5. Awareness of educational resources
   Literature, peers, mentor, courses, own learning
6. Drawing conclusions, the future…
   Overall gain, career development, etc.

Global decision: Pass/refer