Educational portfolios in the assessment of general practice trainers: reliability of assessors
John Pitts,1 Colin Coles2 & Peter Thomas3

1Associate Director in Postgraduate General Practice Education, NHSE S&W and Honorary Research Fellow, King Alfred's University College, Winchester, UK; 2Professor in Medical Education and 3Reader in Statistics, Institute of Health and Community Studies, Bournemouth University, UK
Correspondence: Dr John Pitts, Chilgrove, Noads Way, Dibden Purlieu, Southampton, Hampshire SO45 4PD, UK
Objectives This paper reports a project that assessed a
series of portfolios assembled by a cohort of partici-
pants attending a course for prospective general
practice trainers.
Design The reliability of judgements about individual `components', together with an overall global judgement about performance, was studied.
Setting NHSE South & West, King Alfred's College, Winchester, and the Institute of Health and Community Studies, Bournemouth University.
Subjects Eight experienced general practice trainers re-
cruited from around Wessex, which incorporates
Hampshire, Dorset, Wiltshire and the Isle of Wight.
Results The reliability of individual assessors' judgements (i.e. their consistency) was moderate, but inter-rater reliability did not reach a level which could support making a safe summative judgement. The levels of reliability reached were similar to those of other subjective assessments, and perhaps reflected the individuality of the personal agendas of both the assessed and the assessors, and variations in portfolio structure and content.
Conclusions Suggestions for future approaches are made.
Keywords Education, medical, continuing, *standards;
family practice, *education; teaching, *standards.
Medical Education 1999;33:515–520
Background
In postgraduate medical education in the United
Kingdom the quality of training posts, and the devel-
opment and skills of educationalists, are issues rapidly
ascending the agendas of colleges, postgraduate deans
and national policy makers. More specifically in general practice, certification and re-certification of GP trainers is an important professional and political issue. Currently, educational competence is inferred from a range of disparate assessment measures which include the teaching environment and resources, as well as a subjective judgement of how good a trainer this person is likely to be. Performance as a teacher falls within this overall view, and there is an assumption that a trainer who can reach some (usually unexplained) `standard' is competent as a teacher. However, while the Joint Committee on Postgraduate Training for General Practice recognizes the difficulties in determining the teaching ability of trainers and stresses the importance of assessing this task within the regions, no strategies have so far been offered.1
Introduction
Earlier work by us exploring this field used an assessment tool based on valid criteria of `good' teaching.2 A panel of experienced trainers was asked to judge a series of video-recorded tutorials between general practice registrars and their trainers. The reliability of panel members' judgements about individual `components', together with an overall global judgement about performance, was studied. While the reliability of individual assessors' judgements (i.e. their consistency) was moderate, inter-rater reliability did not reach a level which could support making a safe summative judgement. While evidence suggested that increasing the number of teaching episodes would increase reliability towards acceptable levels, feasibility and cost persuaded us that this was impractical: too many videos would need to be assessed by too many assessors.
Following this work, and in recognition of other
considerations concerning the issue of professional
competence, a `new direction' based on portfolios was
suggested.3
The use of portfolios as a record for the purposes of
both learning and assessment has grown, from initial
application by the Armed Forces and later within the
arts. Embraced particularly in the professions supplementary to medicine and in nursing, at both undergraduate and postgraduate levels, various interpretations have evolved, ranging from a logbook resembling a curriculum vitae4 through case logs5 to deeply personal reflective accounts. In general practice in the UK, portfolios have been used in both vocational training with general practice registrars6 and continuing education with GP principals.7 The Royal College of Physicians and Surgeons of Canada is using an electronic learning portfolio in its re-certification processes.8 There is considerable agreement, both in the literature and in practice, about the value of keeping a portfolio (and receiving feedback on it) as a formative process, particularly at the higher levels of a taxonomy of portfolios that ranges from a simple record of actions, through reflections on those actions in order to explain, understand and expose reasoning processes, to observations of learning outcomes and the elaboration and application of learning to new tasks. In contrast, video recordings merely show actions at a particular moment in time.
However, before this approach is seized upon as the `latest' panacea to a problem with no ideal solution, studies need to be carried out to establish important background data. While a substantial amount of literature exists in many fields, primarily teaching and nursing, psychometric data to support the use of portfolios as a summative assessment tool are sparse, and lacking in the majority of published papers.
The aims of the study were, therefore: to define, discuss and develop with a group of assessors the proposed criteria for assessment; to assess the portfolios of a group of prospective general practice trainers; and to study the inter-rater and intra-rater reliability of the assessors' judgements.
Method
Study group
The assessors were eight experienced general practice
trainers recruited from around Wessex, which incorpo-
rates Hampshire, Dorset, Wiltshire and the Isle of
Wight. The portfolios used in the study were those
written by participants of the Wessex New Trainers
Course held in the autumn of 1997. The course comprised 5 separate days, each about 3–4 weeks apart. `Homework' and reading were carried out in the intervening periods. `Loose' guidelines about portfolio content were given, with further discussions held within the learning groups as they progressed through the course.
Development of assessment criteria
Previous work9,10 has provided a baseline for assessing
teaching behaviours likely to be effective. A training session prior to the study introduced, refined and reinforced six assessment criteria based on observable and recordable thought processes, taking into account our views of the complexity of professional practice.11–13 These included:
1 Evidence of `reflective learning'.
2 Awareness of `where they were': consideration of past learning experiences, identification of personal learning needs.
3 Recognition of effective teaching behaviours.
4 Ability to identify with being a learner.
5 Awareness of educational resources.
6 Drawing conclusions: overall reflections on the course and their future career development.
These criteria were incorporated into a marking
schedule (Appendix). A judgement was sought from the
assessors on whether and to what degree these could be
seen, together with an overall global judgement by them
as to whether the portfolio was deemed to be satisfac-
tory. All assessors examined all portfolios on two oc-
casions, 1 month apart. Assessors attended a debrie®ng
workshop after the project to discuss and record their
experiences, and identify points of dif®culty. Portfolios
varied greatly in their size, depth and complexity. Many
included copies of published material, and many in-
cluded details such as course records and timetables.
No attempt to `edit' these was made; the assessors
viewed the portfolios in their entirety.
Thirteen prospective general practice trainers attended the course; 12 agreed to the use of their portfolios in this project. Each portfolio was judged by the assessors against the six criteria plus a `global' pass/refer rating. The criteria were originally coded 0–5. To concentrate the data, scores of 0–2 were grouped as `refer' and 3–5 as `pass'.
The overall level of agreement (above that expected as a result of chance) between the eight assessors in rating the subjects as `pass' or `refer' was estimated using the kappa (κ) statistic.14 Values of κ over 0.8 indicate excellent agreement; between 0.61 and 0.8, substantial agreement; 0.41–0.6, moderate agreement; 0.21–0.4, fair agreement; 0–0.2, slight agreement; and less than 0, poor agreement.15 To assess whether the values of κ were significantly different from levels of agreement that would be expected by chance, they were compared with their asymptotic standard errors.14
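To make the computation concrete, the following minimal Python sketch (ours, not part of the original study) computes a Fleiss-style multirater κ for two categories; we assume this form is comparable to the Davies & Fleiss estimator14 for this balanced two-category design. Applied to the global pass counts read from Table 1, it yields κ ≈ 0.32, the value reported in Table 2.

```python
# Minimal sketch: Fleiss-style multirater kappa for pass/refer ratings.
def fleiss_kappa(pass_counts, n_raters):
    """pass_counts: number of `pass' verdicts per portfolio (2 categories)."""
    n_subjects = len(pass_counts)
    # Per-portfolio agreement: proportion of concordant rater pairs.
    p_i = [(p * p + (n_raters - p) ** 2 - n_raters)
           / (n_raters * (n_raters - 1)) for p in pass_counts]
    p_bar = sum(p_i) / n_subjects
    # Chance agreement from the marginal proportion of passes.
    p_pass = sum(pass_counts) / (n_subjects * n_raters)
    p_e = p_pass ** 2 + (1 - p_pass) ** 2
    return (p_bar - p_e) / (1 - p_e)

global_passes = [5, 7, 8, 1, 6, 7, 7, 8, 6, 2, 8, 7]  # Table 1, global column
print(round(fleiss_kappa(global_passes, n_raters=8), 2))  # -> 0.32
```

Significance can then be gauged by comparing κ with its asymptotic standard error (e.g. z = κ/SE), as in the analysis above.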
Results
Inter-rater reliability
This was explored using the first assessment made by each assessor. The overall scores of the eight assessors are shown in Table 1. Ten portfolios were judged to have passed on the global assessment by more than half (five or more) of the assessors. Only three were judged to have passed by all assessors. Of the two portfolios that were not passed by five or more assessors, one was passed by one assessor and the other by two. The number of overall passes given by the assessors ranged from 6 to 11, suggesting some variability in these assessors' tendency to pass portfolios. Agreement between assessors ranged from `slight' to `fair' (Table 2), but was significantly above the level expected by chance (P < 0.05).
Intra-rater reliability (rater consistency)

On re-assessment, agreement within assessors was calculated using an overall κ, the average of the eight κ values calculated for the individual assessors. Also shown is the number of portfolios for which the pass/refer assessment changed, averaged over the eight assessors (Table 3). Rater consistency is therefore characterized as `moderate' agreement. Consistency was greatest for the `global' and `reflectiveness' judgements and least for the `identification with being a learner' criterion.
Use as an assessment tool
A possible assessment system might use two assessors
to judge a portfolio. On the basis of this work, we can show how the results would be affected if two assessors were chosen at random from our panel of eight, which allows 28 possible pairs. Table 4 demonstrates that, with a trainer designated as `pass' if at least one assessor of the pair passed him or her, a trainer passed in this study by only one of the panel of eight would have a 25% chance of being passed by the paired system (row 2, column 2). A trainer who was passed by all but one assessor would be guaranteed a pass in the paired system (row 8, column 2). In summary, this system would be geared towards passing trainers (there would be many false passes). If both assessors had to award a pass, the system would be much harsher: for example, a trainer who was passed by seven of the eight assessors would have a 25% chance of being referred under a paired system.
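The Table 4 percentages follow from simple counting over the C(8,2) = 28 possible assessor pairs. A short sketch (ours, not part of the original analysis) reproduces them:

```python
# Sketch: reproduce Table 4 by enumerating all 28 assessor pairs.
from itertools import combinations

N_ASSESSORS = 8
for m in range(N_ASSESSORS + 1):        # m of 8 assessors passed the trainer
    verdicts = [True] * m + [False] * (N_ASSESSORS - m)
    pairs = list(combinations(verdicts, 2))   # the 28 possible pairs
    both = sum(a and b for a, b in pairs) / len(pairs)
    at_least_one = sum(a or b for a, b in pairs) / len(pairs)
    print(f"{m} passes: both {both:.0%}, at least one {at_least_one:.0%}, "
          f"both refer {1 - at_least_one:.0%}")
```

For m = 1 this gives 25% under the `at least one pass' rule; for m = 7 it gives both passing 75% of the time, i.e. a 25% chance of referral when both assessors must pass, matching the figures quoted above.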
Within our small sample, the percentage of global passes given by an assessor ranged from 50 to 92%. This variation was not statistically significant (P = 0.17, using the Cochran Q-test), suggesting that the data were not incompatible with equal pass rates for the different assessors.
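The Cochran Q-test compares the marginal pass rates of matched binary raters. The per-assessor-by-portfolio verdicts are not published here, so the sketch below (ours) shows the standard computation on a small hypothetical matrix for illustration only:

```python
# Sketch: Cochran's Q for k matched binary raters (rows = portfolios).
from math import fsum
from scipy.stats import chi2  # chi-square survival function for the P-value

def cochran_q(verdicts):
    """verdicts: list of rows, one per portfolio; entries 0/1 per assessor."""
    k = len(verdicts[0])                               # number of assessors
    col = [sum(row[j] for row in verdicts) for j in range(k)]
    row_sums = [sum(row) for row in verdicts]
    total = sum(row_sums)
    q = (k - 1) * (k * fsum(c * c for c in col) - total ** 2) \
        / (k * total - fsum(r * r for r in row_sums))
    return q, chi2.sf(q, df=k - 1)                     # Q, P-value

# Hypothetical 3-portfolio x 4-assessor matrix, purely illustrative.
example = [[1, 1, 0, 1],
           [1, 0, 0, 1],
           [1, 1, 1, 1]]
print(cochran_q(example))  # approximately (4.71, 0.19)
```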
Assessors' comments
Areas of difficulty identified by the assessors are listed in Table 5. The main areas of difficulty stemmed from the individuality of the portfolios, and variation in the `starting points' of each prospective trainer. Material included within portfolios was not always referred to in the discussion: some participants appeared to be `squirrels' who merely filed documentation without apparent further thought or connection, while others more clearly used and cross-referred to papers and other literature within their written reflections.
Discussion

This work has shown that, using valid criteria, a group of experienced trainers who had been trained as assessors through devising and agreeing the criteria to be used can achieve only a `fair' degree of agreement (inter-rater reliability) regarding a trainer's learning portfolio. These data are similar to those from our earlier work on video-recorded teaching, and do not reach a level where a summative judgement could be made safely. Similar problems were identified when GP trainers marked audit projects submitted by general practice registrars as part of the summative assessment process: the trainers failed to recognize basic audit methodology, despite using a marking schedule to which they themselves had contributed. Furthermore, although not qualified by statistical data, a group of `expert' assessors, used as comparators in that study, also failed to agree on the levels of competence for some of the criteria.16
One explanation for these inconsistencies has been highlighted by Phillips,17 quoting Thomas Kuhn, who popularized the notion that inquirers always work within a paradigm: a framework that determines the concepts that are used. He states that `in a sense, all inquirers are trapped within their own paradigms; they will judge certain things as being true (for them) that other inquirers will judge as being false (for them)'.
Overall, these data are similar to those from studies of clinical competence. For example, 32 assessors of a videotaped simulated patient encounter achieved κ scores of 0.3–0.4 on component scores, with 25% of the assessors failing the candidate, 50% rating the performance as `marginal' and 25% rating it as `satisfactory' on a global judgement.18 Similarly, a study of senior medical students produced reliabilities on components of competence of 0.2–0.4, with the authors concluding that serious questions can be raised about using such scores as indicators of student performance.19 These authors calculated that to achieve the recommended `summative' reliability of 0.8, a range of 45–170 cases per candidate
would need to be examined! It is possible that higher levels of rater reliability could be achieved through more training of the assessors. However, in a study of undergraduate examinations, training of examiners produced no significant improvement in reliability; in fact, the greatest improvement came from identifying and removing the most inconsistent examiners.20 Wilson et al.21 suggested that pass/fail decisions should not be made on the basis of clinical examinations because of the magnitude of examiner variation. Training of doctor examiners yielded no benefit compared with using medical students or lay staff as examiners.22
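The 45–170 figure quoted above comes from the cited authors' own generalizability analysis, but projections of this kind are often illustrated with the Spearman–Brown prophecy formula. The sketch below is ours, with assumed single-case reliabilities chosen purely for illustration, not the cited calculation:

```python
# Sketch: Spearman-Brown projection of the number of cases needed to
# raise an assumed single-case reliability r1 to a target reliability.
def cases_needed(r1, target=0.8):
    return target * (1 - r1) / (r1 * (1 - target))

for r1 in (0.05, 0.10, 0.20):   # assumed, purely illustrative values
    print(f"r1 = {r1:.2f}: about {cases_needed(r1):.0f} cases for 0.8")
```

Even modest per-case reliabilities translate into dozens of cases, which is why judgements based on a single episode fare so poorly.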
Does this mean that professional people are not capable of assessing the professional competence of others or, more particularly here, that portfolios should remain effective formative educational tools but be excluded from summative assessment? The evidence of this study suggests so, but as an assessment instrument portfolios have particular advantages. These stem primarily from their differences from a `typical' examination situation: a pressurized, stressful, time-limited event occurring at the end of a course or programme. Immediately identifiable is the fact that this method is not an examination; completing the portfolio over time allows multiple attempts and opportunities, allows for revision and reflection, and can address multiple tasks and use many forms of data entry.
Issues for further exploration include: can reliability be improved, perhaps by offering a structure to the portfolio that could assist the assessment process, given that those used in this work were relatively `free-range'? Would this diminish the strengths of this approach? Portfolios are claimed to have high face and content validity because they represent a comprehensive and personal record,23 but what is the relationship between what can be observed and assessed in a portfolio and the qualities of a `good' teacher?
Finally, and most importantly, the question we feel needs to be addressed is: should assessment methodology focus on an entirely different approach? It may be that traditional psychometric views of reliability and validity are actually limiting more meaningful educational approaches. A reductionist view of assessment would hold that criteria should be tightly defined and capable of being operationalized in behavioural terms; but portfolios are `narrative accounts'24 which require `interpretation'. Therefore, let us start from the position
that applying measures such as reliability and validity is not appropriate for portfolio-based learning.

Table 1 Portfolio assessment scores, by numbers of assessors (n = 8). Each cell gives the number of assessors rating that component `pass'/`refer'; `Final' is the overall pass/refer outcome. Column labels abbreviate the criteria: reflective learning processes; awareness of `present state', willingness to learn; recognition of effective teaching behaviours; identification with being a learner; awareness of educational resources; drawing conclusions, thinking about the future; and the `global' judgement.

Candidate  Reflectiveness  Willingness  Recognition  Learner  Resources  Conclusions  Global  Final
1          4/4             6/2          6/2          6/2      6/2        1/7          5/3     Pass
2          8/0             8/0          8/0          5/3      8/0        7/1          7/1     Pass
3          7/1             8/0          6/2          6/2      2/6        8/0          8/0     Pass
4          1/7             3/5          1/7          1/7      2/6        2/6          1/7     Refer
5          7/1             8/0          7/1          6/2      7/1        6/2          6/2     Pass
6          7/1             7/1          7/1          5/3      8/0        8/0          7/1     Pass
7          7/1             7/1          7/1          8/0      8/0        7/1          7/1     Pass
8          8/0             8/0          6/2          5/3      7/1        7/1          8/0     Pass
9          7/1             7/1          7/1          6/2      8/0        7/1          6/2     Pass
10         5/3             4/4          5/3          4/4      4/4        1/7          2/6     Refer
11         8/0             8/0          7/1          7/1      8/0        7/1          8/0     Pass
12         6/2             6/2          5/3          6/2      7/1        6/2          7/1     Pass

What
can we learn from qualitative approaches? Qualitative research takes an interpretive, naturalistic approach to its subject matter: qualitative researchers study things in their natural settings, attempting to make sense of, or interpret, phenomena in terms of the meanings that people bring to them.25 This sounds congruent with a portfolio approach. Lincoln and Guba26 and Guba and Lincoln27 use terminology such as `trustworthiness', comprising credibility, dependability, transferability and confirmability, as reflecting rigour in research findings. While accepting that ultimately a final `quantitative' outcome, i.e. pass/fail, will be necessary, basing portfolio assessments around such concepts will form the basis of our future work in this area.
References
1 Joint Committee on Postgraduate Training for General Practice (JCPTGP). Recommendations to Regions for the Establishment of Criteria for the Approval and Reapproval of Trainers in General Practice. London: JCPTGP; 1992.
2 Pitts J, Coles C, Thomas P. Exploring the introduction of a performance-based component into the certification and re-certification of general practice trainers. Educ General Pract 1998;9:316–24.
3 Pitts J, Coles C, Percy D. Performance-based certification and recertification of general practice trainers: a new direction. Educ General Pract 1998;9:291–8.
4 United Kingdom Central Council (UKCC) for Nursing. The Future of Professional Practice. The council's standard for education and practice following registration. London: UKCC; 1994.
5 Finlay IG, Maughan TS, Webster DJT. Portfolio learning: a proposal for undergraduate cancer teaching. Med Educ 1994;28:79–82.
6 Snadden D, Thomas ML, Griffin EM, Hudson H. Portfolio-based learning and general practice vocational training. Med Educ 1996;30:148–52.
7 Challis M, Mathers NJ, Howe AC, Field NJ. Portfolio-based learning: continuing medical education for general practitioners – a mid-point evaluation. Med Educ 1997;31:22–6.
8 Parboosingh J. Learning portfolios: potential to assist health professionals with self-directed learning. J Contin Educ Health Professions 1996;16:75–81.
9 Coles C. A review of learner-centred education and its applications in primary care. Educ General Pract 1994;5:19–25.
10 Pitts J. Pathologies of one-to-one teaching. Educ General Pract 1996;7:118–22.
11 Schon D. The Reflective Practitioner. San Francisco: Jossey-Bass; 1983.
12 Eraut M. Developing Professional Knowledge and Competence. London: Falmer Press; 1994.
13 Fish D, Coles C. Developing Professional Judgment in Health Care. Learning through the critical appreciation of practice. London: Butterworth Heinemann; 1998.
Table 2 Inter-rater reliability

Assessment criterion    Kappa    Asymptotic standard error
Global                  0.32     0.05
`Reflectiveness'        0.27     0.05
`Willingness'           0.19     0.05
`Recognition'           0.16     0.04
`Learner'               0.10     0.05
`Resources'             0.37     0.05
`Conclusions'           0.41     0.05

Table 3 Intra-rater reliability

Assessment criterion    Kappa    Average number of disagreements
Global                  0.54     2.3
`Reflectiveness'        0.54     1.9
`Willingness'           0.53     1.9
`Recognition'           0.41     2.9
`Learner'               0.38     3.3
`Resources'             0.53     2.5
`Conclusions'           0.49     2.9
Table 4 Results of assessment by a pair of assessors

Number of assessors    % pairs both    % pairs at least    % pairs both
passing candidate      passing*        one pass*           referring*

0                        0               0                 100
1                        0              25                  75
2                        4              46                  54
3                       11              64                  36
4                       21              79                  21
5                       36              89                  11
6                       54              96                   4
7                       75             100                   0
8                      100             100                   0

*Out of the 28 possible ways of pairing the eight assessors.
Table 5 Areas of difficulty reported by assessors

Variability in the individual `starting points' of the trainers
Unclear about the relevance of some inclusions
Unclear which were group and which were individual reflections
Applying standards: peer or criterion referencing?
Relating the global impression to individual components
The bulk of some portfolios was inhibiting
14 Davies M, Fleiss JL. Measuring agreement for multinomial data. Biometrics 1982;38:1047–51.
15 Kianifard F. Evaluation of clinimetric scales: basic principles and methods. Statistician 1994;43:475–82.
16 Lough JRM, Murray TS. Training for audit: lessons still to be learned. Br J General Pract 1997;47:290–2.
17 Phillips DC. Subjectivity and objectivity: an objective inquiry. In: Eisner EW, Peshkin A, editors. Qualitative Inquiry in Education – the continuing debate. New York: Teachers College Press; 1990.
18 Herbers JE, Noel GL, Cooper GS, Harvey J, Pangaro LN, Weaver MJ. How accurate are faculty evaluations of clinical competence? J General Intern Med 1989;4:202–8.
19 Colliver JA, Vu NV, Markwell SJ, Verhulst SJ. Reliability and efficiency of components of clinical competence assessed with five performance-based examinations using standardised patients. Med Educ 1991;25:303–10.
20 Newble DI, Hoare J, Sheldrake PF. The selection and training of examiners for clinical examinations. Med Educ 1980;14:345–9.
21 Wilson GM, Lever R, Harden RM, Robertson JIS, MacRitchie J. Examination of clinical examiners. Lancet 1969;i:37–40.
22 Van der Vleuten CPM, Van Luyk SJ, Van Ballegooijen AMJ, Swanson DB. Training and experience of examiners. Med Educ 1989;23:290–6.
23 Jolly B, Grant J. The Good Assessment Guide; a practical guide to assessment and appraisal for higher specialist training. London: Joint Centre for Education in Medicine; 1997.
24 Greenhalgh T, Hurwitz B. Why study narrative? BMJ 1999;318:48–50.
25 Denzin NK, Lincoln YS. Handbook of Qualitative Research. London: Sage Publications; 1994.
26 Lincoln YS, Guba EG. Naturalistic Inquiry. London: Sage Publications; 1985.
27 Guba EG, Lincoln YS. Fourth Generation Evaluation. London: Sage Publications; 1989.
Received 2 December 1998; editorial comments to authors 11 January
1999; accepted for publication 3 February 1999
Appendix

PORTFOLIO ASSESSMENT GUIDE – Candidate no: …

Each criterion is rated on a 0–5 scale, from 0 (unsatisfactory) to 5 (satisfactory).

1. Reflective learning process
   Demonstration of learning, proof of use, change
2. Awareness of `present state', willingness to learn
   Learning experiences, hopes and expectations, identification and definition of educational needs shown
3. Recognition of effective teaching behaviours
   Listening, questioning, identifying `wants' and `needs', defining and agreeing agenda, reflective teaching, summarizing, evaluating
4. Identifying with the learner
   Recognizing uncertainty, acknowledging ignorance, learning together
5. Awareness of educational resources
   Literature, peers, mentor, courses, own learning
6. Drawing conclusions, the future…
   Overall gain, career development, etc.

Global decision: Pass/refer