
SPECIAL ARTICLE

Cognitive, Social and Environmental Sources of Bias in Clinical Performance Ratings

Reed G. Williams
Department of Surgery
Southern Illinois University School of Medicine
Springfield, Illinois USA

Debra A. Klamen
Department of Psychiatry
University of Illinois at Chicago
Chicago, Illinois USA

William C. McGaghie
Office of Medical Education and Faculty Development
Northwestern University Feinberg School of Medicine
Chicago, Illinois USA

Background: Global ratings based on observing convenience samples of clinical performance form the primary basis for appraising the clinical competence of medical students, residents, and practicing physicians. This review explores cognitive, social, and environmental factors that contribute unwanted sources of score variation (bias) to clinical performance evaluations.
Summary: Raters have a one- or two-dimensional conception of clinical performance and do not recall details. Good news is reported more quickly and fully than bad news, leading to overly generous performance evaluations. Training has little impact on accuracy and reproducibility of clinical performance ratings.
Conclusions: Clinical performance evaluation systems should assure broad, systematic sampling of clinical situations; keep rating instruments short; encourage immediate feedback for teaching and learning purposes; encourage maintenance of written performance notes to support delayed clinical performance ratings; give raters feedback about their ratings; supplement formal with unobtrusive observation; make promotion decisions via group review; supplement traditional observation with other clinical skills measures (e.g., Objective Structured Clinical Examination); encourage rating of specific performances rather than global ratings; and establish the meaning of ratings in the manner used to set normal limits for clinical diagnostic investigations.

Teaching and Learning in Medicine, 15(4), 270–292. Copyright © 2003 by Lawrence Erlbaum Associates, Inc.

Correspondence may be sent to Reed G. Williams, Ph.D., Department of Surgery, Southern Illinois University School of Medicine, P.O. Box 19638, Springfield, IL 62794–9638 USA. E-mail: [email protected]

Global ratings based on observing convenience samples of clinical performance in vivo form the primary basis for appraising the clinical skills of medical students, residents, and practicing physicians (hereafter called practitioners). Global ratings are subject to several sources of bias that have not been the focus of earlier research and reviews because they involve cognitive, social, and environmental factors attendant to the ratings rather than measurement problems such as instrument design. Research shows that features of measurement instruments account for no more than 8% of the variance in performance ratings.1

This review article aims to explore the cognitive, social, and environmental factors that contribute unwanted sources of score variation (bias) to ratings of practitioners. Such bias compromises the value of performance ratings and constrains inferences about the clinical competence of practitioners. To start, we set the stage by defining clinical competence and describing characteristics that are pertinent to this review. Later, we describe characteristics of the natural clinical environment that lead to unwanted variation in clinical performance ratings.

Next, we establish four quality criteria that clinical ratings must meet to allow strong inferences about medical clinical competence. We also summarize psychometric data from the performance assessment literature in medicine, business, and psychology that document how well performance ratings achieve these four criteria and, in the process, better define these criteria.

We then summarize the evidence about cognitive, social, and environmental factors that can bias evaluators' ability to make valid and useful inferences about competence from performance ratings. The goal is to identify and describe factors worthy of attention because they are based on well-designed and conducted research studies and, in most cases, have stood the test of replication. It is not possible to describe all studies and findings in detail. Interested readers will have the references to allow further exploration as desired.

Finally, we present a set of 16 recommendations for clinical performance assessment practice and for research based on our analysis of performance rating problems and our review of the literature.

The article is not a review using a single, standard, and highly specific set of methodological inclusion criteria,2–4 nor is it an exhaustive review of the performance assessment research literature. Instead, it focuses selectively on cognitive, social, and environmental sources of error that compromise the value of clinical performance ratings as typically collected and the inferences about clinical competence that are based on this information. We try to be comprehensive in reviewing research studies that address these sources of error and to summarize the results when the research design of the studies and the preponderance of evidence earned our confidence. We cite a combination of field studies that document the extent of the problem, and experimental studies (randomized controlled trials) that allow us to disentangle the elements of the problem and document their relative strength. Using the taxonomy of literature reviews proposed by Cooper,5 the focus of this review is research findings; the goals are to formulate general statements from multiple studies, to bridge the gap in research literature from disciplines involved in performance assessment research, and to identify and refine key issues central to this field of practice and research; the perspective is to provide a neutral representation of the findings; the review describes a purposeful sample of the articles that represent the extensive literature we reviewed; the organization of the review is around a conceptual framework that we developed to guide future practice and research; and the intended audiences are those individuals responsible for the practice of clinical performance assessment and those who do research in this area. Hopefully, the review will also help those practitioners and researchers who focus on performance assessment outside the field of medicine. All factors discussed and recommendations made are based on findings from sound research studies that have also stood the test of replication, taking advantage of meta-analyses when available.

Clinical Competence Defined

Michael Kane6 defined clinical competence as the degree to which an individual can use the knowledge, skills, and judgment associated with the profession to perform effectively in the domain of possible encounters defining the scope of professional practice (p. 167). This definition highlights a key point about clinical competence appraisal. Evaluators are interested not only in how the clinician performs in observed situations (i.e., judging clinical performance) but also in using observed performance as grounds for generalizing about the practitioner's ability to perform a variety of other tasks in a range of similar clinical situations (i.e., estimating clinical competence).

Estimating a practitioner's professional competence requires at least three inferences: (a) judging the quality of the observed performance, (b) drawing general conclusions about the person's ability to perform in a universe of similar situations based on the observer's judgment of the observed performance (Kane's6 example: drawing general conclusions about skill in delivering babies from the observation of one or two deliveries), and (c) extrapolating from the appraisal context to expected performance in practice situations where the practitioner is not being observed. This third distinction, between what a practitioner can do and what the practitioner typically will do in practice, introduces the contribution of practitioner personal qualities (e.g., dependability, responsibility, ethical beliefs) to the mix of factors (knowledge, skills, and professional judgment) that shape practice behavior.

Clinical Competence Characteristics

Multidimensionality

The term clinical competence leads one to think in terms of a unitary performance capability. This oversimplifies medical practice. Most patient encounters require the practitioner to integrate and perform at least 10 separate component capabilities.7 These include (a) discerning what information is needed to clarify a patient's problem; (b) collecting data about the patient (history taking, physical examination, diagnostic tests, procedures); (c) organizing, integrating, and interpreting patient data to derive working diagnoses and refining them as more information becomes available; (d) establishing management goals and evaluating the effectiveness of treatment, adjusting management as needed; (e) performing procedures required for patient management; (f) communicating and relating effectively with patients and their families; (g) working effectively with other health professionals under the range of conditions that make up clinical practice; (h) critically appraising new information for adequacy, quality, and veracity; (i) monitoring personal knowledge and skills and updating these in an efficient and effective manner as needed; and (j) performing consistently in a dependable and responsible manner.

Case Specificity

Research on clinical problem solving,8 clinical performance assessment (primarily using standardized patient technology),9–12 and clinical practice has shown consistently that clinical performance is case specific. Practitioners who excel at diagnosing and managing one clinical problem are not necessarily skillful at diagnosing even closely related clinical situations.13 Practitioners differ in effectiveness in different clinical situations due to differences in experience, training, professional interest, and personal disposition.

Van der Vleuten and Swanson9 summarized the findings from 29 reports of large-scale standardized patient-based examinations of clinical performance conducted on residents and medical students in the 1980s. They concluded that the major source of score variation is due to variation in examinee performance from situation to situation (case to case), and that 16 hr of performance observation across a variety of clinical situations is necessary to achieve reproducible estimates of clinical competence. The studies that formed the basis for van der Vleuten and Swanson's9 conclusions involved (a) carefully selected situations and tasks where observation was tightly controlled; (b) instruments and rater training that assured commonality in rating focus; and (c) ratings that occurred immediately after the performance, limiting forgetting and memory distortion. Because this estimate was made under nearly ideal conditions, it is likely that more hours of performance observation would be required to achieve reproducible estimates of clinical competence in less controlled clinical settings. The van der Vleuten and Swanson9 review suggests that appraisals designed to determine whether clinical performance is acceptable rather than just determine how practitioners rank relative to each other will require observation of a broader clinical sample because the difficulty of the observed encounters must be taken into account.

Field research14,15 involving ratings of physician performance also suggests case specificity. Carline et al.15 recruited 269 internists from six Western states to participate in a study designed to validate the American Board of Internal Medicine (ABIM) certification process. Six professional colleagues (two internists, two surgeons, one head nurse, and one medical director or chief of staff) nominated by each physician were asked to evaluate the performance of that physician in eight areas of clinical competence and to provide a rating of overall clinical skills. The return rate averaged approximately 78% for the raters. Results of this study suggested that 10 to 17 ratings by peers are necessary to yield reproducible global ratings of competence in each competency area. Kreiter et al.16 did a similar study to investigate the measurement characteristics of ratings of medical student clerkship performance using standard clinical evaluation forms. Again, their results documented that ratings depended heavily on the individual rater to whom the student was assigned and on the nature of the patient–student encounter observed. They used decision study results to estimate the reliability of ratings for panels of up to eight raters per student. Panels of eight raters resulted in reliability estimates of at best 0.65. More than eight raters would be required to achieve reliabilities of 0.80, but the article does not allow us to be more precise. Clinical rating research is clear that many observations are needed, and attention needs to be directed toward assuring systematic sampling across the range of normal clinical encounters and tasks.
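
The magnitude implied by the Kreiter et al.16 result can be illustrated with the Spearman-Brown prophecy formula. The sketch below is ours, not the authors': it treats the eight-rater composite as a set of parallel ratings, which is a simplification of a full decision study, and the function names are our own.

```python
# Rough sketch (our assumption, not the authors' analysis): treat rater panels
# as parallel measurements and apply the Spearman-Brown prophecy formula.

def single_rater_reliability(composite_rel: float, n_raters: int) -> float:
    """Back out the implied reliability of a single rating from a composite."""
    return composite_rel / (n_raters - (n_raters - 1) * composite_rel)

def raters_needed(single_rel: float, target_rel: float) -> float:
    """Number of parallel ratings needed to reach a target composite reliability."""
    return target_rel * (1 - single_rel) / (single_rel * (1 - target_rel))

r1 = single_rater_reliability(0.65, 8)                 # ~0.19 for one rating
print(round(r1, 2), round(raters_needed(r1, 0.80), 1))  # ~0.19 and ~17 ratings
```

Under those simplifying assumptions, on the order of 17 ratings would be needed to reach 0.80, which is consistent with the article's statement that more than eight are required.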

Context

The typical performance assessment context is an uncertain, uncontrolled, and "noisy" information environment. The most common performance assessment approach is to observe, rate, and judge a practitioner's clinical performance in a natural clinical setting. The practitioner is observed performing routine clinical tasks under real clinical conditions. A global rating scale containing relatively nonspecific items designed to be used in a range of clinical situations is employed to direct the observer's attention to common, important aspects of clinical performance and to calibrate the ratings of performance quality. The result of this general approach is that practitioners are observed working up cases that vary in difficulty and in required skill. Likewise, performance standards used for the case may vary because only one expert observes the doctor–patient encounter, and the expert rarely studies the case in depth. Consequently, this method provides a weak basis for comparing clinicians because the practitioners perform different tasks, and the standards used by raters are idiosyncratic.

Contextual noise contributes error (unreliability) to performance measurements. The rater usually has many responsibilities beyond supervision, training, and evaluation of practitioners. As a result, the rater's database about a medical practitioner's clinical performance is often fragmentary and restricted to a small, unsystematic sample of clinical situations and tasks due to limited direct personal contact. Given what has been written about the multidimensionality and case specificity of clinical competence, problems produced by using small, unsystematic clinical performance samples are magnified. In addition, performance expectations are rarely articulated for the rater or the practitioner. A lack of shared understanding of responsibilities may influence clinical performance ratings. Also, clinical care in hospital and outpatient settings is usually a team responsibility. Therefore, it is often difficult to isolate and appraise the contributions and clinical competence of individual practitioners.

Quality of Clinical Ratings

Observational data must fulfill four quality criteria. First, the observational instrument and process must be accurate. That is, observations must yield scores that are defined by the behavior being observed and recorded, nothing else. Second, the observers must agree. A rating instrument may be capable of accurate use, but rater agreement depends on how it is actually used by observers in a field setting. If only one rater is using the instrument, intrarater reliability is the only issue; but this is rarely the case. When multiple raters use the instrument, interest also resides in agreement among different raters when observing the same performance. The third criterion concerns the ability to generalize from a set of observations to a range of pertinent professional performance situations. As described in the discussion of Kane's6 definition of professional competence, the goal of measurement and evaluation of clinical performance is to formulate accurate inferences about clinical competence generally from a series of observations of particular instances of performance. The third criterion addresses this issue. It asks, "To what extent do observations of this set of performances predict how that individual will perform in other situations?"

In addition to accuracy, rater agreement, and the ability to generalize across situations, the focus must include the fourth criterion: utility. Utility addresses the question about whether the generalizations are useful. Does knowing the rating score have any value in predicting future or concurrent performance of interest? Does knowing the score aid in making progress, teaching and learning, licensure, or certification decisions? If not, the ratings have no real value. Utility data also provide the basis for setting meaningful performance standards.

Accuracy

In a randomized, controlled trial designed to determine the accuracy of faculty evaluations of residents' clinical skills, Noel and colleagues17 had 203 faculty internists watch a videotape of one of two residents performing a new patient work-up on one of two patients. The faculty internists then immediately rated the clinical skills of the resident. One half of the raters used an open-ended evaluation form, and the other half used a structured form that prompted detailed observations of selected clinical skills. The unique contribution of this study was that it investigated and compared attending ratings of the same clinical performance. When observations were not specifically directed through use of a structured form, a condition typical of traditional clinical performance rating practice, participants recorded only 30% of known resident strengths and weaknesses. Strengths were omitted more frequently than weaknesses, an outcome consistent with findings from the general performance assessment literature.18,19 Rater accuracy increased to 60% or greater when participants used structured forms designed to direct attention to specific dimensions of performance. Comparable results were obtained in an earlier, smaller, randomized controlled trial.20

The only other study that addressed accuracy in a direct experimental manner was a study by Pangaro and associates21 done to document the accuracy of standardized patient-produced checklist scores. In this study, internal medicine residents were trained to act as ordinary candidates and to achieve target scores by performing to a predetermined "strong" or "weak" level on checklist items used by raters to record interviewing, physical examination, and communication skills. The strong standardized examinees (SEs) were trained to score 80% correct, and the weak SEs were trained to score 40% correct. Seven SEs took the Standardized Patient (SP) examination along with regular candidates. The mean score recorded by the raters was 77% for the strong SEs and 44% for the weak SEs. A videotape review revealed that most scoring discrepancies were in the area of communication skills.

Achieving the increases in accuracy reported by Noel and colleagues17 and by Herbers and colleagues20 would require standardized patient encounters or, at least, use of rating scales specific to the case being observed and appraised (i.e., rating scales specific to the chief complaint of patients). Obviously, this would come at added cost, extra effort, and added logistical complexity.

There is usually a time lag between observation of clinical behavior and rating of that behavior in the real world of performance evaluation. Delay introduces error into ratings. We know of no controlled study of the effects of time lag on accuracy of clinical ratings, but laboratory studies reported in the business literature and a survey of psychologists outlined the nature and magnitude of the relation.

In an experimental study with 180 participants, Heneman22 demonstrated that ratings recorded 1 to 3 weeks after observation were less accurate than ratings recorded immediately. However, the decrease in accuracy was not large. The delay accounted for only 2% of the variance in rating accuracy.

A related but "soft" source of evidence confirms the importance of immediate recording to maximize rating accuracy. Kassin et al.23 conducted a survey of 64 psychologists who are frequently asked to testify about the reliability of eyewitness testimony. The survey asked the psychologists to list findings that are adequately documented to present in court. The expert psychologists agreed there is sufficient evidence to support the conclusion that event memory loss is greatest right after the event and then levels off with time.

Rater Agreement (Intra- and Interrater Agreement)

Intrarater agreement is based on having a single rater evaluate the same (e.g., videotaped) performance on at least two different occasions. Differences in ratings reflect changes in rater attention, perspective, standards, or mood. These studies provide a good estimate of how much variability is introduced into a performance measurement as a result of these factors.

Kalet et al.24 had faculty raters rerate videotapes of 21 medical student interviews after a period of 3 to 6 months. They found substantial agreement between the earlier and later ratings only when raters judged the information obtained (i.e., data collection) during the interview. Even here, the level of agreement was far from perfect. There was little to no agreement in the earlier and later ratings of the interviewing process.

Interrater agreement is based on having different raters evaluate the same performance. Under these circumstances, differences in ratings reflect differences in rater focus and rigor as well as the sources of variability addressed in intrarater reliability studies. Weekley and Gier25 examined ratings gathered from world-class figure skating judges during the 1984 Olympics. These results provide some indication of the theoretical upper limit on interrater agreement using human judges. Olympic judges have at least 15 years of judging experience. Criteria for judging are clearly spelled out, and judges are periodically trained and retrained to use them. Nine judges are employed to observe and appraise the same performance. Olympic judges have no conflicting responsibilities during judging, and they receive continuous feedback about the consistency of their ratings with those of other judges because all ratings are posted immediately. This serves as a form of feedback that can refine agreement among future ratings and also is likely to increase the motivation to provide accurate ratings.

Interrater reliability coefficients were consistently high, ranging from 0.93 to 0.97 for the full complement of judges. More detailed analyses demonstrated that the judges were able to agree in their evaluation of the skaters, but primarily in a global sense. Judges' ratings of skaters did not differ much for the two dimensions rated, technical merit and artistic impression. The rating differences between technical merit and artistic impression were larger in the short program, where skaters were required to perform prescribed maneuvers, reinforcing the view that careful definition of tasks performed and rated helps raters to differentiate among performance dimensions. These results suggest that the ceiling on rating reliability using human judges is high. Virtually all of the variability in ratings was attributable to differences in skating performance.

Elliot and Hickam26 had groups of three physician faculty members observe the same student while the student performed a physical examination on a patient. As in the Weekley and Gier25 study, observers were familiar with the task, had limited environmental interruptions to disrupt the observation, and used a checklist with careful description of the tasks and finite classification categories for rating performance. Elliot and Hickam26 found that observers did not agree when judging 17 of the 53 physical examination maneuvers performed (32%). There was good interrater agreement for 18 of the remaining maneuvers and moderate agreement for the other 18. Agreement was lower on items requiring palpation and percussion, characteristics that are not subject to complete evaluation by observation alone. Elliot and Hickam26 also reported that trained nonphysician observers were able to reliably evaluate 83% of the skills that were reliably assessed by faculty.

Kalet et al.24 also had four judges each rate medical interviews. They concluded that interrater agreement was poor when judging the overall interview as well as when judging the information obtained and the interviewing process.

In the Noel et al.17 study described earlier, medical faculty raters were also asked to appraise the overall clinical performance of the resident. Raters disagreed about the residents' overall clinical performance. For the first resident, 5% of the judges rated the overall clinical performance as unsatisfactory, whereas 5% rated the performance as superior. Twenty-six percent rated the performance as marginal, and 64% rated the performance as satisfactory. For the second resident, the ratings were 6% unsatisfactory, 42% marginal, 49% satisfactory, and 3% superior. Agreement in overall ratings was not improved among raters who filled out the structured form. These results strongly suggest that judges use different criteria and standards to judge clinical performances. Therefore, multiple ratings need to be collected and combined to assure a fair and representative overall appraisal of clinical performance. These results are a conservative estimate of actual clinical performance assessment problems because the raters in this study all observed the entire performance and were able to complete their appraisals immediately under circumstances where no competing responsibilities interfered. Again, parallel findings were observed in the Herbers et al.20 study.

Martin and colleagues27 compared the score established by individual physician raters with a gold standard based on the average score of three physician raters who rated the same encounters from a videotaped record. This allowed the investigators to determine the absolute score deviation from the gold standard as well as conduct correlational analyses that compare relative rankings of individuals. The average correlation of the individual physician raters with the gold standard was approximately 0.54, with a fair amount of variation among clinical situations. The unique contribution of this study was to document that individual physician scores were almost always overestimates of true clinical competence (estimated by the gold standard consensus ratings). This tendency to give medical practitioners the benefit of the doubt has also been established in a study by Vu and colleagues.28

MacRae and colleagues29 compared physician ratings of 120 videotaped medical student clinical encounters across four cases using an eight-item rating scale covering information acquired and interviewing process. They found that the interrater reliability coefficients ranged from 0.65 to 0.93, with an average reliability of 0.85. These results demonstrate that good interrater agreement is possible when judging clinical performance. One factor that probably contributed to the high level of agreement is that the raters collaborated in developing the rating scales used in this study, resulting in a shared definition of good clinical performance.

The differences in overall appraisal of the same performance documented in this section are most likely due to the combination of two factors. First, different raters focus on (differentially weight) different aspects of clinical performance. Second, different raters have different expectations about levels of acceptable clinical performance. Anything that reduces these differences, such as in the MacRae et al.29 study, tends to improve interrater agreement.

Accuracy in Estimating Competence (Estimating From a Discrete Set of Observations to Performance Across a Range of Tasks and Situations)

Interrater agreement studies using ratings by judgeswho have observed different practitioner performancescombine variance due to differences in judges’ expec-tations with variance due to differences in demands(task and case characteristics) of the observed sampleof behavior. Viswesvaran et al.30 used meta-analyticmethods to investigate interrater and intraraterreliabilities of job performance ratings reported in 15primary organizational psychology and business jour-nals from their inception through January 1994. Thisresulted in studying 215 research reports. Viswesvaranet al.30 established that the mean interrater reliabilityestimate for ratings by supervisors equaled 0.52, withthe majority of interrater reliability coefficients fallingbetween 0.50 and 0.54. They reported that the meanintrarater reliability coefficient was 0.81, with most co-efficients ranging from 0.78 to 0.84. Based on the dif-ferences between inter- and intrarater reliability coeffi-cients, Viswesvaran and associates30 concluded thatapproximately 29% of the variance in ratings is attrib-utable to differences in the perspective of different rat-ers, including the sample of behavior they observed;whereas 5% is due to variations in a single rater attrib-utable to changes in moods, frames of reference, orother variables.
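
The 29% figure presumably reflects the gap between the two mean coefficients; under the usual assumption that reliability coefficients partition rating variance, the arithmetic is simply

\[ 0.81\ (\text{intrarater}) - 0.52\ (\text{interrater}) \approx 0.29, \]

that is, roughly 29% of rating variance is stable but specific to the individual rater and the sample of behavior that rater happened to observe.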

Recently, authors of a number of studies on clinical performance evaluation have been influenced by generalizability theory and analytic methods. This work has established the number of ratings needed to achieve a generalizability coefficient of 0.80, which is accepted as the minimum desired level for use in making progress decisions. Carline et al.14 collected 3,557 ratings of 328 medical students who had taken an internal medicine clerkship. They concluded that 7 independent ratings would be needed to judge overall clinical performance. The amount of rater experience did not change the results. Similar results have been reported by Kreiter and colleagues16 (more than 8 ratings needed), Kwolek et al.31 (7–8 ratings), Ramsey et al.32 (11 ratings), Violato et al.33 (10 ratings), and Kroboth et al.34 (6–10 ratings). All reports agree that somewhere between 7 and 11 ratings are necessary to achieve a generalizable global estimate of competence when raters are basing ratings on a nonsystematic sample of observations.
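
As a rough consistency check (our arithmetic, not the authors'), the Spearman-Brown relation used earlier can be inverted to show what these panel sizes imply about a single, unsystematically sampled rating:

```python
# Implied generalizability of a single rating if k ratings are needed to reach 0.80,
# again treating the ratings as parallel measurements (a simplification).
def implied_single_rating(target: float, k: int) -> float:
    return target / (k - (k - 1) * target)

for k in (7, 11):
    print(k, round(implied_single_rating(0.80, k), 2))   # ~0.36 and ~0.27
```

In other words, a single global rating collected under routine conditions carries a generalizability on the order of 0.3, which is why aggregation across many raters and encounters is essential.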

Although there has been only limited research on the component skills included under the broader category of clinical competence, it is reasonable to expect that these abilities will develop at different rates and may differ in their stability across situations. The work that has been done suggests that different numbers of observations will be required to establish stable estimates of clinical competence in various clinical competence areas. Carline et al.15 established that the number of ratings required to achieve stable estimates of competence in each of the eight areas reflected in their items ranged from 10 to 32 ratings, with an average of 18. The areas where the least agreement occurred included ratings of (a) verbal presentations; (b) written communications; and (c) management of patients with simple, uncomplicated problems (probably because these are managed in the office where care is less subject to observation than in the hospital). Because this investigation was a field study, it is not possible to determine whether these findings are due to less stable practitioner performance or to rater variability associated with lack of rater opportunity to observe. Carline et al.14 also reported that the number of independent ratings needed to reliably judge various individual aspects of clinical performance varied dramatically. To illustrate the range of requirements, these investigators reported that 7 ratings would be needed to judge data gathering reliably, whereas 27 would be needed to rate interpersonal relationships with patients. Wenrich et al.35 observed similar results when studying nurses' ratings of internists. They established that somewhere between 11 and 24 nurse ratings would be necessary to establish a generalizable estimate (generalizability coefficient = 0.80) of clinical competence, depending on the competence dimension being investigated. Interpersonal and professional behavior characteristics required more ratings than did cognitive and technical attributes. Meta-analytic reviews of the general performance assessment literature30,36 produced similar findings. More ratings are needed to provide stable estimates of communication and interpersonal competence than for specific skills (e.g., clinical skills).

We believe that the discrepancies in the number of ratings needed for interpersonal and professional behavior versus cognitive and technical attributes are best explained by the conceptual framework offered by Ginsburg and associates.37 These investigators noted that we tend to talk about people as being honest or dishonest, professional or unprofessional rather than talk about their behaviors. In other words, professionalism is thought of as a set of stable traits. They argued instead that professional behavior is much more context dependent than has usually been acknowledged. The implication of this is that generalizations about professional behavior require sampling much larger numbers of events (contexts and behaviors) than is true for some other aspects of competence. More research regarding the nature of aspects of clinical competence will help to increase our understanding of these individual competencies and will allow further refinement of performance assessment practice.

It is clear that multiple observations of students, residents, and physicians are needed to reliably estimate clinical competence. The results in the studies reviewed here suggest that a minimum of 7 to 11 observer ratings will be required to estimate overall clinical competence. More ratings will be necessary to obtain stable estimates of competence in more specific competence domains (e.g., data collection, communication, responsible professional behavior), with the most ratings (up to 30) needed in the noncognitive areas.

A few studies have been reported that investigate using nurses and patients to appraise aspects of clinical competence. We have included these studies under the heading of Accuracy in Estimating Competence because we believe that these ratings broaden the dimensions of competence observed more than they change the characteristics of raters. Wenrich and associates35 compared ratings of physicians by nurses and physician peers. They found that the correlation among these ratings for individual clinical performance components ranged from 0.46 to 0.72, with the average correlation being 0.59. Nurses' ratings were generally lower than those of physician peer raters for humanistic qualities, and their ratings were higher for medical knowledge and verbal communication. These results suggest that nurses' ratings are not a replacement for physician ratings. However, nurses' ratings do expand the number of perspectives incorporated into the estimates of clinical competence and broaden the range of clinical situations and competencies observed and judged. Nurses also are more likely to have the opportunity to observe how practitioners typically perform.

Tamblyn and associates38 investigated the feasibility and value of using satisfaction ratings obtained from patients to evaluate the doctor–patient relationship skills of medical residents. They found that patient satisfaction ratings provide valuable information about a resident's ability to establish an effective doctor–patient relationship. However, these investigators noted that approximately 30 patient ratings would be necessary to obtain a generalizable estimate of resident skill in these areas (0.80 generalizability coefficient), although a smaller number of ratings would be enough to detect outliers. The larger number of required ratings speaks to the increased heterogeneity of patient raters and raises feasibility issues. However, the use of patient satisfaction ratings provides the perspective of the client. Church's39 findings from a business environment suggested that service providers and their clients have different perceptual frames of reference than internal observers. In this regard, the addition of information from patients should provide an important perspective that has not often been captured systematically.

Utility

The bottom line for any measurement instrument revolves around its usefulness. A useful clinical performance measure is one that successfully estimates (predicts) performance in situations that we care about. There are very few studies that provide insight into how useful clinical ratings are for accomplishing this goal.

Hojat et al.40 investigated how effectively medical student clerkship ratings predicted residency supervisors' ratings of clinical performance. The investigators sought to determine if clerkship ratings provided information that was helpful in predicting which students performed poorly or well during their first year of residency. They found that 42% of students who received clerkship ratings that put them in the bottom 25% on global ratings of clinical competence were also in the bottom fourth based on residency program ratings of data gathering and processing skills. Only 17% of those in the bottom fourth on medical school clerkship ratings were found to achieve residency ratings that put them in the top fourth in their residency program. Thirty-seven percent of those individuals who received clerkship ratings that put them in the top performance quartile were found to achieve residency ratings that placed them in the top quartile. Conversely, only 14% of those individuals ended with residency ratings that placed them in the bottom fourth. Put another way, the 1,704 individuals were three times more likely to be classified in the same quartile during residency training and medical school than to be classified at the other extreme (i.e., classified in the bottom fourth during residency when they had been classified in the top fourth during medical school).
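
The "three times" characterization can be compared against the quartile percentages reported above; the sketch below simply recomputes the ratios (the percentages are from Hojat et al.40, the code and variable names are ours, and the authors' own ratio may have been computed over the full cross-classification of the 1,704 individuals rather than these marginal figures):

```python
# Quartile concordance figures quoted above (Hojat et al.); the ratios are ours.
bottom_to_bottom = 0.42   # bottom quartile in clerkship -> bottom quartile in residency
bottom_to_top    = 0.17   # bottom quartile in clerkship -> top quartile in residency
top_to_top       = 0.37   # top quartile in clerkship   -> top quartile in residency
top_to_bottom    = 0.14   # top quartile in clerkship   -> bottom quartile in residency

same_vs_opposite_low  = bottom_to_bottom / bottom_to_top   # ~2.5
same_vs_opposite_high = top_to_top / top_to_bottom         # ~2.6
print(round(same_vs_opposite_low, 1), round(same_vs_opposite_high, 1))
```

On these marginal figures the ratios come out at roughly 2.5 to 2.6, broadly consistent with the "three times" summary.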

Data from rating instruments acquire meaning by documenting consistent relations between scores on the instrument and scores from other measures of interest. Given this information, it is possible to use scores to make progress decisions with confidence that the scores have implications for concurrent or future performance states. In the case of the Hojat et al.40 research, it becomes possible to anticipate subsequent clinical performance during residency training. This information is critical for making informed decisions about medical student progress. There are surprisingly few examples of this type of investigation in the literature that we studied.

One of the primary obstacles to research on clinical ratings is the lack of a readily available, consensual gold standard for judging clinical competence. One approach used in standardized patient research is to develop a patient-based problem with known findings and have a group of experts decide in advance what information should be elicited to diagnose the case effectively, what diagnosis is most likely, and what management is indicated. The case can then be used to test the clinical performance of practitioners. Recent research using patient simulators has demonstrated the feasibility, utility, and validity of such clinical evaluations.41–47 A second approach to the gold standard problem is evaluation of a single performance by a group of expert judges who all observe the same performance and then rate the performance independently (perhaps using videotape). A third, more common approach in traditional clinical performance assessment involves a comparison of the ratings of a group of expert judges who observed the same practitioner across an uncontrolled and undocumented range of situations and tasks in natural environments. Although this approach has most of the weaknesses described earlier, it is the most prevalent. Cronbach48 contributed to refining this approach conceptually by proposing that rater performance be analyzed to determine whether (a) an individual rater tends to rate too high or too low averaged across all ratees and items (leniency–stringency), (b) the rater can accurately distinguish among practitioners based on the candidate's total job performance (i.e., the rater can accurately rank practitioners relative to each other), (c) the rater can accurately determine whether a group of practitioners are better at one aspect of their jobs or another, and (d) the rater can diagnose accurately the strengths and weaknesses of individual practitioners. This framework for thinking about and evaluating the quality of ratings is appealing and practical. Where called for, it is used in subsequent sections of this article.

Cognitive, Social, and Environmental Sources of Bias in Clinical Ratings

Research in experimental psychology has investigated the nature of human memory since the late 19th century. Since the 1980s, research in performance assessment has taken conceptual cues from this research. A number of experimental studies have investigated how humans store information and use that information in performance assessment settings. The research results provide partial answers to the following two questions: How are rating data organized? What are the mechanisms of data recall?

Data Organization

Research has been conducted to determine the strategy raters use in organizing and storing the information they acquire and to determine which organizational strategies result in the most accurate recall and ratings.

These research results remind us of the limits of human memory and the implications for performance assessment. Nathan and Lord49 concluded, "Even under optimal conditions many subjects will not or cannot independently use five dimensions to rate performance" (p. 112). Evidence supporting this conclusion was present in the Weekley and Gier25 study of world-class figure skating judges during the 1984 Olympics. Their analyses demonstrated that the judges agreed about their evaluations of the skaters, but chiefly in a global sense.

Cafferty et al.50 reported that raters organized information either by persons (performances of one person on all pertinent tasks, followed by performances for a second person) or by tasks (performances of all persons for one task, followed by performances of all persons for a second task, etc.). Not surprisingly, acquisition by person led to information being organized in memory by persons, whereas information acquired by task was organized in memory by tasks. Most important, raters who did not organize information in memory or were inconsistent in their organizational strategy had the least accurate recall and ratings. DeNisi and Peters51 investigated the effects of structured diary keeping on recall of performance information and on performance ratings. They found that raters who kept diaries structured by persons or by performance dimensions were better able to recall performance information than were those who kept no diary, even when they were not allowed to refer to the diary. Raters who kept diaries provided (a) lower overall ratings, (b) ratings of individual practitioners that varied more from dimension to dimension (provided a more varied profile for each person), and (c) ratings that discriminated more among ratees. Raters who organized their diaries by persons believed that their ratings were more accurate and defensible than did raters in the other conditions, and they believed that the use of diaries helped them in the appraisal process. Raters who were required to maintain diaries, but on whom no structure was imposed, consistently organized their diaries by persons.

Research by Williams and colleagues,52,53 in a laboratory setting, suggested that raters who acquired and stored performance information in an unorganized fashion can be prompted to structure the information stored in memory before a rating task by having them recall pertinent performance incidents and record them according to a given format. This structuring resulted in recall and rating accuracy results similar to those obtained when the information was initially obtained in a structured manner. DeNisi and Peters51 found that raters in a free recall situation chose to organize the performance information by persons. Most important, they found that raters asked to organize the information before the rating task provided less elevated and more discriminating ratings and reported more positive reactions to the rating task. This suggests that raters can be prompted to structure performance information in memory just before the appraisal task and realize a rating benefit without adding the logistic demands of a diary-keeping requirement. Although we assume a posture of healthy skepticism about these findings until they stand the test of multiple replications, they are interesting and promising.

Therefore, a series of laboratory and field studies suggest that (a) organization of information in memory is important for recall and rating accuracy; (b) recall and rating accuracy can be improved by prompting the rater to organize information before the rating, even if the information was acquired in an unorganized fashion; and (c) raters prefer to store information organized by persons, and this strategy produces the most accurate recall and ratings.

Data Recall

Stored information about practitioner clinical performance can be recalled and used by raters to formulate ratings in two ways. First, recall may involve specific details about practitioner performance in specific clinical situations. Second, recall may be based on having observed specific events, formed a general impression of the effectiveness of performance in that situation, and stored only the general impression. In the latter case, much of the specific information used in forming the impression is lost. When asked to make an appraisal, the rater recalls the pertinent general impression and uses it for making a judgment. Evidence regarding how raters perceive and recall clinical performance data is instructive. Results of factor analytic and correlational studies suggest that raters provide global ratings of clinical competence rather than differentiate individual aspects of competence. Verhulst et al.54 analyzed supervisors' ratings of first-year resident performance on a 13-item rating form and found that raters differentiated and rated two aspects of clinical performance: clinical skills and professional behavior. These two factors together accounted for 83.6% of the common variance in ratings, with the clinical skills factor accounting for 49.4% and the professional behavior factor accounting for 34.2% of the common variance. The results were virtually identical in each of the 7 years of the study.

In their study of peer ratings, Ramsey and colleagues32 also found that raters have a two-dimensional conception of clinical skills. One factor represented cognitive and clinical management skills, and the other represented humanistic qualities and management of psychosocial aspects of illness.

Other studies reported similar results suggesting that clinical performance ratings reflect a general impression of clinical performance rather than a differentiated view of individual competencies.33,35,54–58 These results suggest that either the individual competencies that make up clinical competence are highly correlated (i.e., individuals who perform one competency well are likely to perform well in the other competency areas), or that individual competencies are not highly correlated but raters perceive them as such and rate them accordingly. Nathan and Lord49 summarized their research on performance assessment by stating:


The true correlation among dimensions related to any naturally occurring category will usually be nonzero, because categories are built from natural clusters of characteristics. On the other hand, the correlation among rated dimensions will almost always be higher than true correlations, due to the nonrandom errors or simplifications of raters. (p. 112)

This is an area where a number of large-scale studies have been done regarding clinical performance assessment. Results from these studies show consistently that raters have either a one- or two-dimensional conception of clinical skills and do not differentiate further among the specific dimensions of clinical performance using current instruments and procedures. Efforts to elicit more differentiated ratings may result in rater time and effort being wasted, and this may result in inaccurate characterizations of true performance.

Frequency of Rater Observation

In a survey of 336 internal medicine residents from 14 New England residency programs, Stillman and colleagues11 found that 19% of the residents reported that no faculty member ever observed them performing a complete history and physical examination during medical school, and 30% reported they had never been observed performing a complete history and physical in residency. Over one half reported that they had been observed less than three times during medical school, and 45% reported that they had been observed once during residency, presumably during the required ABIM Clinical Evaluation Exercise (CEX). Burdick and Schoffstall59 reported similar results in a survey of 514 residents in 21 emergency medicine residency programs throughout the United States.

These results suggest that the frequency of direct clinical performance observation at the medical school and the residency level is very low. To the best of our knowledge, no one has reported on the total number of hours of observation per student or resident underlying clinical performance evaluation. Based on surveys of 3,600 business organizations throughout the United States, Bretz et al.60 reported that business supervisors commonly spend about 7 hr per year assessing the performance of each employee. This includes any special observation, other data collection, filling out assessment forms, and communicating with each employee. Because middle- and upper-level human resource managers filled out the survey, one might expect that these results paint a favorable picture of the process. We know of no comparable survey of clerkship directors, residency program directors, or health maintenance organization directors, but we have no reason to expect that more time is spent in clinical performance assessment. Therefore, this section is directed toward investigating available evidence about the impact of frequency of observation on the accuracy and reproducibility of ratings.

In a study of almost 10,000 pairs of raters from 79 business organizations, Rothstein61 established that interrater agreement increases as the length of exposure to the ratee increases. Interrater agreement improved slowly and reached only about 0.60 with 10 to 12 years of opportunity to observe long-time employees. Heneman,22 in a randomized controlled trial, also found that longer observation periods yield more accurate ratings. Finally, eyewitness testimony experts23 agreed that there was sufficient evidence to testify in court that the less time an eyewitness has to observe an event, the less accurate is the memory of the event. Therefore, there is evidence from a range of situations to reinforce the idea that the frequency of observation influences the reproducibility and accuracy of ratings. The quality and utility of rating data is likely reduced at the low levels of observation that are common in clinical performance assessment.

Focus of Attention

One constructive role that a clinical performance assessment instrument can play is to direct the attention of raters toward particular dimensions of competence. This may increase the level of rater agreement and the accuracy of ratings.

In the randomized controlled trials described earlier, Noel and colleagues17 and Herbers and colleagues20 found that participants recorded only 30% of resident strengths and weaknesses using global rating forms. Strengths were omitted more frequently than weaknesses, a finding consistent with results from the general performance assessment literature.18,19 Rater accuracy increased to 60% or greater when participants used structured forms designed to focus attention on specific performance dimensions. Noel and colleagues17 reported the percentage of raters who detected each strength and weakness, but the number of items is too small to allow generalizations about the attributes of performance that raters observed. More work of this type is needed to determine what expert raters attend to when rating clinical performance. If, as we suspect, the scope is limited to a few competencies (e.g., clinical skills, professional behavior), then efforts need to be directed to expanding rater focus.

Woehr62 and Day63 found that raters who are told about the task rating dimensions in advance rate the observed performance more accurately. Day63 found that raters were better able to distinguish between effective and ineffective behaviors when they had advance access to the rating instrument. Specifically, raters were able to more accurately identify behaviors that did not occur. Raters without prior knowledge of task rating dimensions were more dependent on a general impression of effectiveness.


Rater Reporting

Judgments are private evaluations of a practitioner's performance, whereas ratings are a public record. For a variety of reasons the two may not always coincide.

In a review chapter, Tesser and Rosen64 described a bias among communicators to transmit messages that are pleasant for recipients to receive and to avoid transmitting unpleasant messages. They labeled the tendency to avoid transmitting bad news the "Mum Effect." The review recounts the results of studies conducted by themselves and other investigators to probe this bias directly. Tesser and Rosen64 concluded that good news is communicated more frequently, more quickly, and more fully than bad news. They also reported this is true across a wide range of situations whether the communication is in person or in a less direct manner (e.g., written communication). This Mum Effect would seem to have clear implications for supervisors' communications of poor performance to practitioners.

A number of studies have been conducted to determine the impact of the Mum Effect on supervisors' communications about performance of employees and trainees. In a laboratory study involving performance assessment, Benedict and Levine65 found that participants scheduled low performers for feedback sessions later and evaluated these performers with more positive distortion. Two field studies suggested similar results. Waldman and Thornton66 found that ratings used for promotion and salary increases and those that were to be shared with employees were more lenient than confidential ratings of the same employees. Harris et al.67 had supervisors rate employees to aid in validating a performance examination. The supervisors had earlier rated the same employees for purposes of promotion and salary increases using the same instrument. The investigators reported that slightly more than 1% of the ratees were rated as in need of improvement when ratings were to be used for promotion and salary increases, whereas almost 7% were rated as in need of improvement in the ratings for research purposes that had no economic consequences for employees. However, the ranking of employees under the two conditions was similar.

Longenecker et al.68 reported survey research results showing that many managers knowingly provide employee ratings that are inflated due to actual and perceived negative consequences of accurate appraisal. Bernardin and Cardy69 found that 65% of police department supervisors indicated that the "typical" supervisor would willfully distort ratings. Further, they found that supervisors with lower levels of trust in the appraisal process rated their subordinates higher than did those with higher levels of trust.

In summary, there is research reported in the social psychology and the business management literature supporting the finding that supervisors are likely to distort ratings of their employees when they believe employees will see the ratings or when the ratings will be used for promotion or salary increase purposes. The Mum Effect research suggests that distortion in communications may occur even when there are no consequences for the employee. In all cases, the tendency is for ratings to be increased. We are not aware of any studies of the Mum Effect or of the degree of distortion in ratings as they apply to clinical performance ratings.

Leniency and Stringency

Littlefield et al.70 studied the ratings given by surgery faculty in five medical schools to determine the percentage of lenient and stringent raters. The study involved 107 raters who each evaluated four or more students and a total of 1,482 ratings. The investigators established that 13% of the raters were significantly more lenient in their ratings. Students with true clinical ability at the 50th percentile would receive ratings from these lenient raters that would rank them at the 76th percentile or higher. Conversely, 14% of the raters were classified as significantly more stringent in their rating patterns. Ratings by these stringent raters would rank average students (true clinical ability at the 50th percentile) at the 23rd percentile or lower. However, the total group of raters was normally distributed regarding stringency of ratings. Therefore, where multiple ratings are available for each student, the average rating should approximate the student's true clinical competence. Further, the investigators noted that the format of rating forms was not related to the tendency to provide extreme ratings. Ryan et al.71 studied emergency medicine supervisors and found that they also showed large variations in both leniency and range restriction, and that differences in individual evaluator scoring leniency were large enough to have a moderate effect on the overall rating received by the resident.
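The practical force of that last point can be shown with a small simulation. The sketch below is ours rather than anything reported by Littlefield et al.; the 1-to-9 scale, the rater-offset and noise standard deviations, and the eight-rating sample are illustrative assumptions, chosen only to show how averaging across a normally distributed pool of lenient and stringent raters pulls the mean back toward a ratee's true standing, whereas a single rating can land well above or below it.

# Minimal simulation sketch (ours, not from the studies cited): averaging
# ratings from raters who vary in leniency recovers a ratee's standing.
# The scale, offsets, and noise values are illustrative assumptions.
import random

random.seed(1)

N_RATERS = 100           # pool of available raters
RATINGS_PER_STUDENT = 8  # within the 7 to 11 range recommended later

# Each rater carries a personal leniency offset; the pool is normally
# distributed, as Littlefield et al. observed for their raters overall.
rater_offset = [random.gauss(0.0, 0.7) for _ in range(N_RATERS)]

def observed_ratings(true_ability, k=RATINGS_PER_STUDENT):
    """Sample k raters; each rating = true ability + rater offset + noise."""
    raters = random.sample(range(N_RATERS), k)
    return [true_ability + rater_offset[r] + random.gauss(0.0, 0.5)
            for r in raters]

true_ability = 5.0  # an "average" student on a 1-to-9 scale
single = observed_ratings(true_ability, k=1)[0]
pooled = sum(observed_ratings(true_ability)) / RATINGS_PER_STUDENT
print(f"single rating: {single:.2f}   mean of {RATINGS_PER_STUDENT} ratings: {pooled:.2f}")

Run repeatedly, single ratings scatter widely around 5.0 while the eight-rating mean stays comparatively close to it, which is the statistical argument for collecting multiple ratings per practitioner.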

Inflation of Ratings

Based on surveys of 3,587 business and industrial organizations, Bretz et al.60 found that 60% to 70% of an organization's workforce received ratings in the top two performance levels on the organization's performance rating scale. We know of no directly comparable large-scale survey regarding medical clinical performance assessment. However, there are a number of studies that report similar outcomes, and the results are consistent across the studies found. Magarian and Mazur72 surveyed all internal medicine clerkship directors in the United States and received completed surveys from 81% of them. Survey results indicated that 75% of students in internal medicine clerkships using an A through F grading system received ratings in the top two performance categories. In clerkships using descriptor grades (e.g., honors, high pass, pass, fail), 46% of the students received ratings in the top two performance categories. Because 86% of the internal medicine clerkships incorporated results of an objective knowledge test such as the National Board of Medical Examiners (NBME) internal medicine subject test into student grades,73 it is reasonable to expect that percentages in the top two grading categories would be higher if clinical ratings alone were used for grading. Objective test results generally lower grades.74

Shea et al.75 described program directors' ratings of all 17,916 internal medicine residents who sought certification in internal medicine during the years 1982 through 1985. They reported that the mean rating was 6.5 on a 9-point scale. Six is a "satisfactory" rating and 7 is a "superior" rating according to the scale used. Residents with "unsatisfactory" program directors' ratings were excluded from this study. Therefore, the average rating is raised artificially by a small degree. However, it is clear that the ratings assigned are concentrated in the higher scores on the scale.

Ramsey and associates76 reported that 91% of all practicing physicians in their study received a rating of 7 or above on a 9-point scale from peers. The mean clinical performance rating for all physicians was 7.71, with no physician receiving a rating below 6. Other studies16,55,71 reported similar findings across a range of practitioner rating situations (medical students, residents). Carline et al.15 found that out of 6,932 ratings in their study, only 136 (2.0%) were poor or fair ratings. Twenty-two percent of the physicians (58 of 269 physicians rated) received all of the 136 poor or fair ratings, with 15 physicians receiving three or more low ratings.

Some other evidence exists indicating that individual ratings of clinical performance are inflated. We reported earlier that Martin and colleagues27 found that individual physician ratings were almost always overestimates of true clinical competence as established by a consensual gold standard panel. Vu and colleagues28 established this tendency for individual raters to give ratees the benefit of the doubt as well. Schwartz et al.77 reported that faculty ward evaluations of fourth-year surgery residents were inflated and did not correlate with other, more objective measures of clinical competence (i.e., Objective Structured Clinical Examination [OSCE] scores).

Speer et al.78 reported on their efforts to modify the clinical rating process in an internal medicine clerkship toward the goal of reducing grade inflation. In the original grading system, the average grade assigned to students by preceptors was an A. The revised system involved using a more behaviorally anchored definition for each grade and assigning final grades through an "objective group review." This revised system resulted in students receiving an average grade of B+. The authors documented that preceptors believed the average student should receive a grade of B. Therefore, the revised grading approach was more consistent with preceptor expectations. Although this study does not establish the relative contributions of behavioral anchors and the group review process to decreasing the grades assigned, a study by Hemmer et al.79 supports our belief that the objective group review process was the more important factor. They found that a group review process significantly improved the detection of unprofessional behavior. In almost 25% of the cases, the group review process was the only evaluation method that detected unprofessional behavior.

Metheny56 reported that first-year residents in obstetrics–gynecology were more lenient than other raters in their grading of medical students. These results were based on findings in one clerkship at one medical school. It would be informative to have a larger scale study to determine whether such variables as amount of clinical supervising experience have systematic effects on ratings assigned.

Failure Rates

Another way to study the issues of leniency and stringency and grade inflation is to look at failure rates. The Liaison Committee on Medical Education conducts a survey of all medical schools in the United States each year that normally yields a 100% response rate. The results of the survey80 indicated that attrition from U.S. medical schools continues to be about 1% of total enrollment. Forty-one percent of those students who left medical school during the 1996 to 1997 year were either dismissed for reasons of academic failure or withdrew in poor academic standing. Of these students, only 31% were in the clinical years of training. Therefore, there were 95 students in the clinical years who were dismissed for academic failure or withdrew in poor academic standing from the 125 U.S. medical schools during the 1996 to 1997 academic year. This is less than one medical student per U.S. medical school. Therefore, the likelihood of a student leaving medical school for academic performance reasons during the clinical years of training was very low.

In an Association of American Medical Colleges survey of representatives from 10 medical schools, Tonesk and Buchanan81 reported that one of the most commonly reported problems was unwillingness to record negative performance evaluations. A second was unwillingness to act on negative evaluations. Results from the Magarian and Mazur72,73 surveys of internal medicine clerkship directors support the Tonesk and Buchanan81 findings. Thirty-six percent of the clerkship directors responding could not recall having ever given a failing grade in their clerkship. The highest failure rate reported for a medical school class was 10% (two schools). Failure rates in other studies were comparable. Yao and Wright82 reported that 6.9% of internal medicine residents were reported to have significant enough problems to require intervention by someone of authority. They also reported that 94% of residency programs had problem residents. Kwolek et al.83 reported that 1.3% of ratings in a surgery clerkship were unsatisfactory or marginal, and 5% of interns in the same program were rated as deficient.77

Schueneman et al.57 reported that 9 out of 310 (3%) surgery residents were asked to leave their residency training program over a period of 15 years. Vu et al.74 studied failure rates in six required clerkships (family medicine, internal medicine, obstetrics–gynecology, pediatrics, psychiatry, and surgery) in one medical school over a period of 5 years. They reported that the average failure rate was less than 1% in family medicine, 1% in obstetrics–gynecology and in psychiatry, 5% in surgery, 7% in internal medicine, and 8% in pediatrics (average failure rate: 3.67%). However, 70% of the failing students received passing marks based on clinical performance ratings alone. In these cases, performance on an objectively scored knowledge examination (the NBME subject test) was the factor that determined their failure. Jain et al.84 reported the results of a survey of physical medicine and rehabilitation residency programs. Eighty-three percent of program representatives responded. Only 50% of programs reported rating at least one resident unsatisfactory during a clinical rotation in the 3 years prior to the survey. Only 11% had reported one resident's overall performance as unsatisfactory to the parent organization.

As mentioned earlier, results from a study in a large business organization67 indicated that 1.3% of 223 ratees were rated as in need of improvement on regular ratings of employees for promotion and salary increases. When these same employees were rated to validate a performance test, a situation where presumably raters perceived no negative consequences of their ratings for the employees, 6.7% were rated as needing improvement. Therefore, the true rate of cases where performance is inadequate may be closer to 7% than to the 1% rate, which normal ratings yield.

Rater Calibration and Training

A common response to the perception that clinical performance ratings are less accurate and reproducible than they should be is to suggest implementing a rater training program. We found very few studies in medical clinical performance evaluation that addressed the effectiveness or cost effectiveness of rater training. Both studies that we found suggested that training may not lead to gains commensurate with the cost of the program. Noel and colleagues17 included a training condition in their randomized controlled trial that studied the accuracy of raters in appraising clinical performance. They found that physicians who received training were no more accurate than other raters who participated in the study without training. Newble et al.85 compared the interrater agreement for groups of physicians with no training, moderate training, and more extensive training. They reported that training did not improve the low rate of interrater agreement. However, interrater agreement was improved substantially by identifying the most inconsistent raters and removing them from the analysis. This approach of culling inconsistent raters may be a more cost-effective way of dealing with rater reliability problems.

The limited number of studies that have addressed the issue of rater training for clinical performance evaluation should be judged cautiously. The findings may be the result of the type or quantity of training provided. To illustrate, the training in the Noel et al.17 study only involved viewing a videotape; there was no practice and feedback component. The Newble et al.85 study involved a ½ hr training session with no feedback for the moderate training group and a 2½ hr training session including practice and feedback for the extensively trained group. The amount and type of training in the Newble et al.85 study is more in line with what would be expected to result in a positive training outcome, but it may take even more training to yield the desired improvement in reliability.

Many more studies investigating the effect of training on rater performance have been done in business settings. Woehr and Huffcutt86 reported a meta-analysis of 29 rater training studies. These studies yielded 71 meaningful comparisons of training programs with control groups. Most important, Woehr and Huffcutt86 sorted out four different types of training programs and investigated their effects separately. The most common programs (28 comparisons) were rater error training programs designed to minimize errors of leniency, central tendency, and halo by making raters aware of these tendencies. These programs proved to be moderately effective in reducing halo error and somewhat less effective with respect to rater leniency. Rater error training programs also yielded a modest increase in rater accuracy, with some being substantially more effective than others.

The second type of rater training was performance dimension training, which was designed to train raters to recognize and use appropriate dimensions of performance in their ratings. The idea behind this training is that raters will make dimension-relevant judgments as opposed to more global judgments. Woehr and Huffcutt86 reported that performance dimension training was moderately effective at reducing halo error but less effective with respect to increasing accuracy. This training procedure actually led to a slight increase in rater leniency.

Frame of reference training programs, the third approach, train raters with regard to performance standards as well as performance dimensionality. These programs typically provide raters with samples of behavior representing each dimension and each level of performance on the rating scale. The programs also include practice and feedback. These programs proved to be the single most effective training strategy for increasing accuracy of rating and observation. They also yielded small decreases in halo and leniency.

Finally, behavioral observation training programs focus on improving observation of behavior rather than evaluation of behavior. There have been few (four) such studies reported. Like frame of reference training, behavioral observation training programs were found to produce a medium to large positive effect on both rating and observational accuracy. These studies did not investigate effects on either leniency or halo.

A summary of results from Woehr and Huffcutt86 suggests that rater training programs have promise as a means of improving performance of raters observing and appraising clinical performance. However, the rater training studies in medical clinical evaluation have not been effective, most likely because the training was brief and included no practice or feedback. This suggests either that the training program characteristics have not been optimized or that physician raters are impervious to training.

Practitioner (Ratee) Characteristics

This section addresses practitioner characteristics that can affect performance ratings but are unrelated to practitioner performance ability. Practitioner characteristics other than performance ability can be subdivided into two categories: those that are relatively fixed and those that are more transitory and can be manipulated by the practitioner. Authors have used such terms as impression management and social control to describe the overt manipulation of transitory variables for personal gain.

Fixed Personal Characteristics of Practitioners

We found no studies investigating the impact of fixed personal characteristics (e.g., ethnicity, age, gender, physical attractiveness) on clinical performance ratings in medicine. However, several studies involving these variables have been done on performance evaluation ratings in the military and in business. Most of the reported research investigates the impact of rater and practitioner ethnic background. A few studies have investigated practitioner age and its impact on assigned ratings. Finally, one study investigated ratee attractiveness and affect.

The most comprehensive and informative studies of ethnicity have been those by Pulakos et al.87 and by Sackett and DuBois.88 The Pulakos et al.87 study involved 39,537 ratings of 8,642 army enlisted personnel in a variety of jobs. They performed separate analyses for the data set where both an African American and a Caucasian rater evaluated ratees. This data set included 4,951 ratees (1,663 African American and 3,288 Caucasian). Sackett and DuBois88 conducted a similar study using a data set including ratings for 36,000 individuals holding 174 civilian jobs in 2,876 firms. As in the Pulakos et al.87 study, Sackett and DuBois88 performed separate analyses for the 617 ratees (286 Caucasian, 331 African American) where both an African American and a Caucasian rater had submitted ratings for a ratee. Sackett and DuBois88 did a combined analysis that incorporated the data from the Pulakos et al.87 study and their study. They reported that African American and Caucasian raters provided virtually identical ratings of Caucasian ratees. African American ratees, on the other hand, received ratings from Caucasian raters that were slightly lower (0.02–0.10 SDs lower) than the ratings they received from African American raters. For comparison purposes, Sackett and DuBois88 noted that ratings for Caucasian ratees by African American raters ranged from 0.003 SDs lower to 0.003 SDs higher than ratings from Caucasian raters. They concluded that race of the rater accounted for an extremely small amount of variance in the rating data.

Like ethnicity, the individual studies of age produced mixed results. The best available basis for drawing conclusions lies in available meta-analyses designed to look at the combined results from multiple studies. Two meta-analyses were found that investigated the effects of ratee age on performance ratings. McEvoy and Cascio89 looked at the combined results from 96 independent samples of the age–performance relation from 65 studies published between 1965 and 1986. They found that the mean correlation between age and objective measures of productivity (e.g., units produced, patents received, articles published) was 0.07. The relation between age and supervisor's rating was similar (mean correlation = 0.03). Waldman and Avolio90 looked at the combined results from 37 samples reported in 13 research reports. Direct measures of productivity indicated that older workers were more productive. Older workers received lower ratings from their supervisors (average correlation = –0.14), but this relation was limited to nonprofessional workers. Based on the pooled results reported in these meta-analyses, there seems to be little relation between age and supervisors' ratings for professional workers.

The relation between gender and performance ratings has received surprisingly little attention. The study by Pulakos et al.87 described earlier also investigated gender. Due to the very large sample, statistically significant differences were obtained, but Pulakos et al.87 noted the proportion of variance in ratings accounted for by gender was minimal. Male raters tended to rate women slightly more positively than men. Female raters, on the other hand, tended to rate males slightly more positively.

Only one study was identified that studied attractiveness and its relation to performance ratings. Heilman and Stopeck91 conducted a laboratory study in which the attractiveness of the employee was varied experimentally by changing the picture on the employee's written performance report. Participants in the study were MBA students who were asked to evaluate the performance and estimate advancement potential. Heilman and Stopeck91 concluded that attractiveness of the ratee inflated ratings of nonmanagerial women, deflated ratings of managerial women, and had no impact on the ratings of men. Because this was a single laboratory study with a very small number of participants rating hypothetical employees from written performance reports, the results should at most be considered suggestive of an area for further investigation.

Impression Management

Certain practitioner characteristics are under personal control and can be manipulated to influence one's ratings. Only two studies were found that investigated the effect of practitioner characteristics other than clinical competence on clinical performance ratings. Wigton92 studied case presentation style and its effect on the ratings assigned to medical students. He coached five first-year medical students to present each of five prepared (standardized) case presentations that varied systematically in content and organization. Fifteen experienced faculty raters evaluated videotaped records of the presentations. Results showed that the ranking given each presentation depended as much on which student was giving the presentation as on the content of the presentation. Wigton concluded that the evaluation of clinical competence based on student case presentations could be significantly influenced by the personal characteristics of the students. This is the only experimental study of this type in medical education that we found. More experimental studies of this type would help us understand factors that influence clinical performance ratings. Kalet and colleagues24 conducted a study comparing faculty ratings on an OSCE to expert ratings. They concluded that faculty rated students on the basis of likeability rather than specific clinical skills.

There is a very active research community in social psychology investigating the processes by which people control the impressions others form of them, dating back at least to the theoretical work of Goffman.93 Haas and Shaffir94,95 conducted investigations of some of these variables in their study of the professional socialization of one class of medical students over the period of their entire medical school training. Haas and Shaffir's94,95 primary research methods included participant observation and interview research. They observed the students over the full range of their educational experiences. Referring to the sociological research on the process by which neophytes are socialized into professions, Haas and Shaffir94 noted that one characteristic is that neophytes must develop a "pretense of competence even though one may be privately uncertain" (p. 142). They proceeded to document the students' development of this cloak of competence. Haas and Shaffir94 concluded by stating the following:

A significant part of professionalization is an increased ability to perceive and adapt behavior to legitimators' (faculty, staff, and peer) expectations, no matter how variable or ambiguous … In this context of ambiguity, students in both settings accommodate themselves, individually and collectively, to convincing others of their developing competence by selective learning and by striving to control the impressions others receive of them. (p. 148)

There have been many studies conducted in business organizations to determine the effects of subordinates' use of impression management techniques on their supervisors' ratings. There have also been a number of studies in the laboratory designed to control and separate the variables involved. All of these studies address concerns that supervisors' evaluations of employees can be influenced substantially by factors other than job performance itself. The validity of performance ratings is compromised to the extent that this occurs. The results of these impression management studies are mixed, partly due to the variety of impression management techniques that have been studied and partly to the environmental conditions under which they have been studied. As a result, the best indication of the impact of impression management on performance ratings is found in the cumulative findings reflected in the meta-analysis conducted by Gordon.96 Gordon's96 meta-analysis covered 55 journal articles, including 69 studies, yielding 106 effect sizes. Gordon96 found that impression management techniques studied so far appear to have little effect on performance ratings in field settings. There was a slightly larger effect in laboratory experiments, but the average effect was still characterized as small based on conventional standards for interpreting effect sizes. In summary, there is little hard evidence that impression management tactics are an important influence on supervisors' ratings of performance.

Influence by Other Evaluators

There are two traditional approaches to evaluating medical students and residents. In one, all supervisors are asked to fill out the rating form independently and submit it to the clerkship director or residency program director. The clerkship director or residency program director then reviews the individual ratings and assigns a final grade. The second approach requires that raters complete individual rating forms in preparation for a group discussion of all practitioners by all supervisors. Each practitioner's performance is discussed in committee and then a final rating is assigned through some consensus building process.

The two approaches have benefits and disadvantages. The consensus building model has the potential benefit of providing each rater with multiple perspectives on the practitioner's performance and making all raters aware of data regarding practitioner performance in a variety of settings. However, many people express fear that such an approach may lead to a dominant individual unduly influencing the decision process. Although little data exist that address the benefits and weaknesses of the group model, the results of one study in the business literature are informative.

Martell and Borg97 conducted a laboratory study in which single individuals and groups of four rated the same performance either immediately or after a delay of 5 days. In the delayed rating condition, groups remembered performance details more accurately than individuals. However, groups also were more liberal in their rating criteria. On the other hand, Speer and colleagues78 found that grades assigned to medical students on an internal medicine clerkship were lower and closer to the faculty-stated appropriate average grade for clerks when the grades were assigned after a group discussion by faculty. These investigators attributed this difference to committee impartiality compared to individual faculty observers who worked with the students regularly on their internal medicine clerkship. Likewise, Hemmer et al.79 found that group performance assessment identified many professional behavior problems that were not otherwise disclosed. We found no studies to confirm or refute anecdotal reports that individuals with strong opinions or a persuasive style can sway group opinion inordinately in evaluative discussions. This is widely perceived as a disadvantage to the group process model and should be studied systematically.

Environmental Bias

As mentioned earlier, raters and practitioners interact in an environment with many distractions and confounding factors. Some of these factors are likely to influence judgments and ratings and are unrelated to true clinical competence. In this section, we discuss some of the confounding factors and provide evidence, where available, of their influence on performance ratings.

Time Pressures and Distractions

Anecdotal evidence and nonsystematic observation suggest that time pressures, distractions, and the pressure of competing responsibilities may compromise the ability of clinical supervisors to accurately observe and evaluate the clinical performance of physicians-in-training and practitioners. Time pressures and distractions likely influence the clinical performance assessment process most by decreasing the amount of observation that occurs and thus by decreasing the sampling of clinical performance across the domain of clinical situations and competencies of interest. They may also influence the rater's time for reflection and the quality of information integration that underlies the assigned ratings. We found no research that addresses this question directly in either the clinical or general performance assessment research literature. One study in the decision-making literature is suggestive. Wright98 found that consumers, facing time pressures and distractions when making purchasing decisions, reduced search time by collecting fewer pieces of information and generally searching for negative information. It is possible that a similar process occurs in performance assessment. Clinical performance assessment practice would profit from more research in this area.

Clinical Performance = Group Effort

Patient care is a group effort in hospital settings and in most office settings. Many individuals including specialists, attendings, residents, and students participate in the process of working up, diagnosing, and managing patients. Therefore, it is difficult to assess the clinical competence of any one practitioner based on patient presentations and write-ups or on individual conversations about patients. Weaknesses in practitioner knowledge and skill may be masked by the contributions of others to the diagnosis and care of the patient as reported. One of the primary advantages of standardized patient encounters for clinical performance evaluation is the opportunity they give to assess an individual's clinical proficiency knowing that the results reflect that individual's clinical competence alone. We found no research that investigated the impact of this source of bias on traditional clinical performance ratings, nor did we find any reports of attempts to control this source of bias in field settings.

A great deal of research and evaluation on appraising team effectiveness99 in addition to individual effectiveness has occurred in the military, in business, and in the nuclear power industry in recent years. The medical profession has much to learn from this research legacy.

Effects of Being Observed

One factor that makes clinical performance assessment artificial is the fact that practitioners usually know when they are being observed. It is customary to think that we are seeing a person's best performance under conditions where the person knows that he or she is being watched. However, it is also possible that the awareness of being observed creates anxiety and leads to poor performance. In either case, appraisals may not provide an accurate reflection of how the individual will perform in practice when not under observation.

We found one study using clinical performance ratings by Kopelow et al.100 that addressed the effects of being observed. Kopelow and colleagues100 had practitioners work up the same standardized patient cases (portrayed by different individuals) both in a setting where they knew they were being evaluated and also in practice without evaluation awareness. One group of the practitioners saw the patients in the office first and then saw the patients under examination conditions. A second group took the examination first and saw the patients in their office later. The practitioners agreed to have the standardized patients rotate through their practice but were not made aware when the patients were introduced. Physicians recognized the standardized patients in only 26 of the 263 encounters (10%). Most pertinent to this article, average overall performance scores were 65% to 66% of possible points in the examination setting when practitioners knew they were being assessed and 52% to 56% in the practice setting when practitioners were not aware they were being assessed (p < .01).

Sackett et al.101 performed a comparable study in supermarkets. They instructed 1,370 supermarket cashiers from 12 supermarket chains to process (check out) 25 standardized items in a grocery cart while being observed. Cashiers were told that they would be timed and errors would be recorded. They were asked to do their best and to place equal emphasis on speed and accuracy. Each cashier processed six different standard cartloads of groceries over a period of 4 weeks. Sackett et al.101 considered these performance tests measures of maximum performance. All the cashiers worked in supermarkets where the cash register system kept records of items checked per minute, a measure of speed, and voids per shift, a measure of accuracy. These records provided a measure of typical performance. Sackett and colleagues101 established that the correlations between maximum and typical performance (speed and accuracy) were very low, leading them to conclude that the two measures do not yield comparable information about the relative performance of cashiers. They also found that supervisors' ratings of the cashiers correlated more highly with the maximum than the typical performance measures, confirming the sampling weaknesses noted for evaluation based on nonsystematic observation of critical incidents (positive or negative).

The results of the Kopelow et al.100 and Sackett et al.101 studies document the likelihood that performance under formal test conditions is not a very good indicator of "normal" performance. Furthermore, these results suggest that differences under these two sets of conditions are likely to be large. Their findings demonstrate the importance of creating ways to sample clinical performance in unobtrusive as well as more formal ways. Patient satisfaction measures102 and nurse observations103 may prove to be good ways of capturing information about normal clinical performance.

Recommendations

The following recommendations for improvement in clinical performance assessment practice are based on the research covered in this article.

Broad, Systematic Sampling

Practitioners need to be observed and evaluated across a broad range of clinical situations and procedures to draw reasonable conclusions about overall clinical competence. Further, it is unreasonable to expect that practitioners are equally effective across a broad range of situations. Efforts need to be directed toward assuring that practitioners can handle important, specific situations competently rather than focusing solely on identifying good and bad practitioners. Advanced Trauma Life Support (ATLS) and Advanced Cardiac Life Support training are good examples of training and assessment protocols designed to assure competence.

Medical educators need to be much more selective and deliberate in planning performance observations. Observations should cross a representative sample of chief complaints, diagnoses, tasks, and patient types. This will require planning and record keeping to fulfill the desired systematic sampling. Standardized patient encounters9–11 and high-fidelity simulations44–46 can be used to compensate for deficits in the naturally occurring patient mix.104 Educators also need to increase the absolute amount of observation that is done in support of clinical performance evaluation. One way to accomplish this may be to limit the amount of observation time during any one encounter. Results reported by Shatzer and his colleagues105,106 suggest that short periods of observation may be enough for clinical evaluation. A strategy that distributes available observation time across a larger number and range of encounters is preferable.
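As a concrete illustration of the record keeping this implies, the sketch below is ours and entirely hypothetical in its particulars (the blueprint categories, the trainee name, and the encounter log are examples only). It tracks which cells of a chief complaint-by-task blueprint have been observed for a trainee and reports the gaps still to be sampled.

# Minimal record-keeping sketch for systematic sampling of observations.
# Blueprint categories and the encounter log are hypothetical examples.
from itertools import product

chief_complaints = ["chest pain", "abdominal pain", "dyspnea", "fever"]
tasks = ["history", "physical exam", "counseling"]

# Each (complaint, task) cell should be observed at least once per trainee.
blueprint = set(product(chief_complaints, tasks))

# Observations logged as they occur: (trainee, complaint, task)
observations = [
    ("trainee_A", "chest pain", "history"),
    ("trainee_A", "chest pain", "physical exam"),
    ("trainee_A", "fever", "history"),
]

def coverage_gaps(trainee, log):
    """Return blueprint cells not yet observed for this trainee."""
    seen = {(c, t) for name, c, t in log if name == trainee}
    return sorted(blueprint - seen)

for cell in coverage_gaps("trainee_A", observations):
    print("still unobserved:", cell)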

Observation by Multiple Raters

Although the primary goal should be to increase the range of situations and procedures observed, it is also the case that individual raters have idiosyncrasies regarding clinical focus (competencies they observe) and standards (stringency and leniency). Each practitioner should be rated by a number of different raters to balance the effects of these rater idiosyncrasies. The research that we have reviewed suggests that a minimum of 7 to 11 ratings should be collected to allow for a reproducible holistic estimate of general clinical competence. Acquiring 7 to 11 ratings per practitioner with multiple raters, although not a small logistical task, appears to be feasible. Ramsey and colleagues76 sought to acquire 10 or more ratings for each of 228 practicing physicians. They reported that this was achieved for 72% of the physicians with no follow-up mailings.
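The reproducibility gain from pooling ratings can be projected with the Spearman-Brown formula that underlies this kind of generalizability argument. The sketch below is illustrative only: the single-rating reliability of 0.25 is an assumption chosen for demonstration, not a value taken from the studies cited above.

# Spearman-Brown style projection of composite reliability as the number
# of independent ratings per practitioner grows. The single-rating
# reliability (0.25) is an illustrative assumption, not a cited value.
def composite_reliability(single_rating_reliability: float, k: int) -> float:
    r = single_rating_reliability
    return (k * r) / (1 + (k - 1) * r)

for k in (1, 3, 5, 7, 9, 11):
    print(k, round(composite_reliability(0.25, k), 2))
# With r = 0.25 per rating, roughly 7 to 11 pooled ratings are needed
# before the composite reaches the 0.7 to 0.8 range often sought for
# progress decisions.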

Keep Rating Instruments Short

Our review indicates that raters view clinical competence as one or two dimensional. The work by Ramsey and colleagues76 suggests that a rating scale composed of 10 specific items plus a global item is sufficient when the goal is holistic decision making (i.e., about progress or promotion). Adding more items only adds to rater time and makes the task harder. Results of the Kreiter et al.16 study suggest that relatively little is gained in generalizability by adding more than 5 items to the rating scale when progress decision making is the goal. Careful instrument construction using a process similar to that used by Ramsey et al.76 should simplify the rating task with no loss of information. When the goal of evaluation is to provide feedback to support learning, the observation should be specific to the performance observed, and the feedback should be immediate and interactive. A general performance rating form is not appropriate for this purpose.

Separate Appraisal for Teaching, Learning, and Feedback From Appraisal for Promotion

One of the clearest implications from our review is that the feedback and progress decision goals of clinical performance evaluation should be separate. Physicians recall information regarding two clinical performance factors: (a) clinical skills, and (b) professional behavior. They do not recall performance details at the end of rotations when asked to fill out rating forms. Consequently, physicians should be encouraged to comment immediately about specific performance attributes that should be reinforced or corrected. This should include an interactive conversation about the situation, the practitioner's behavior, desirable or undesirable consequences of the behavior, and some suggested alternatives if the behavior should be changed in the future. The implied message in performance assessment forms with room for written comments is that performance feedback should be saved and conveyed only through the form. We believe this is the wrong message to send.

Encourage Prompt Recording

Clinical performance evaluation usually involves completing an evaluation instrument at the end of a rotation based on observations that occurred during the rotation. As a result, there is some decrease in rating accuracy due to memory loss and distortion. Any method that encourages recording of observations and judgments at the time of the observation will minimize this problem. Encounter records can be referred to when completing end-of-rotation rating forms. The encounter records also provide a basis for providing detailed feedback to practitioners-in-training to help them improve their clinical performance. A number of approaches have been used to capture observations and judgments as they are made, including performance report cards (Rhoton and associates,107–109 Soglin,110 Brennan & Norman111), diaries,112 and clinical evaluations filled out immediately.113

Supplement Formal Observation With Unobtrusive Observation

Unobtrusive observation is valuable because it provides a better estimate of normal clinical performance than formal observation. To the extent that it increases the total number of clinical performance observations, it also helps to achieve the goal of broad sampling. Newble's103 reported use of nurse evaluations and Tamblyn and associates'102 use of patient evaluations are two effective ways of increasing unobtrusive evaluation.

Consider Making Promotion and Grading Decisions Via a Group Review

We believe that a group review process for making promotion and grading decisions has more benefits than problems associated with it. The group process provides each decision maker with a broader base of information on which to base the decision. It also provides added perspective because different raters tend to focus on different aspects of performance. Finally, available evidence suggests that groups are more likely to make unpopular promotion decisions than are individual raters making decisions in isolation.78,79 However, some physicians are reluctant to use a group process for fear that a person with a strong view will unduly sway the opinions of others. There is no evidence on this issue one way or the other.

Supplement Traditional Clinical Performance Ratings With Standardized Clinical Encounters and Skills Training and Assessment Protocols

Standardized clinical encounters have the advantage of allowing direct comparisons of practitioners when they perform exactly the same clinical tasks. They also allow for evaluation of individual physicians' skills in comparison to an established gold standard. Performance in these situations can also isolate the individual's clinical performance capabilities and deficits. There is no contamination of the clinical performance measure resulting from the contributions of other members of the health care team who have cooperated in working up the patient. Standardized clinical encounters usually provide a more in-depth look at clinical performance. Finally, they provide a means of looking at the performance of an entire class (e.g., all third-year residents) and assuring that critical skills are acquired and maintained.

Educate Raters

At a minimum, make sure that all raters are familiar with the rating form at the beginning of the rotation. This will help raters focus their observations and organize performance information in a way that will facilitate performance assessment. Consider employing frame of reference training to assure that raters are calibrated regarding both the definition of the performance dimensions and the performance quality rating categories. More research is needed to determine whether adding other types of more extensive training will yield higher quality clinical performance ratings by physicians. The research by Newble and associates85 suggests that eliminating raters who are extreme outliers (positive and negative) is a more cost-effective means of increasing the validity and reproducibility of clinical performance ratings than is training. Additional studies of this practice would increase confidence in the reproducibility of the finding; we mention it here as much to encourage research as to suggest practice. We realize that culling raters may have undesirable consequences for faculty morale that need to be considered by the program or clerkship director, but the positive and negative consequences of the practice merit consideration.

Provide Time for Rating

Do whatever you can to encourage intermittent observation and recording of clinical performance. Ratings and other types of clinical evaluation should be thoughtful, candid, and not rushed. Clinical evaluation of learners is an important professional activity that warrants protected time. We have heard anecdotally of departments that gather faculty members at a given time and place for the sole purpose of filling out performance ratings. The group rating process described earlier has the added benefit of protecting time for performance assessment.

Encourage Raters to Observe and Rate Specific Performances

The ABIM has explored use of the mini-CEX. This process involves having an evaluator observe and rate specific performance on a single case immediately, using an instrument tailored for that purpose. Given the problems of memory distortion associated with the normal rating process, we believe a supplemental process such as that proposed by the ABIM will provide additional, more accurate data about clinical performance that can be added to the mix when final promotion decisions are made. A still better process would use rating scales specifically designed for selected chief complaints or procedures.

Use No More Than Seven Quality Rating Categories (i.e., 1 [Poor] to 7 [Excellent])

Five to seven quality rating categories are optimal. Even at this, the extreme rating categories will rarely be used. We also discourage rating systems that spread a few quality levels across many numeric points (e.g., 1–3 [unsatisfactory], 4–6 [satisfactory], 7–9 [superior]). Additional quality rating categories add no performance information and just complicate the task for raters.

Establish the Meaning of Ratings

Given the tendency for raters to use primarily the top two categories and the reluctance to assign failing ratings, clinical performance assessment ratings only have meaning relative to each other. As with any measurement instrument, scores only acquire meaning through experience and systematic data analysis and interpretation. We encourage maintaining the same rating instrument and procedure. Clinical evaluators have more to gain by accumulating norms over a number of years than by repeatedly "refining" the rating instrument. Familiarity with a form due to extensive use enables directors to easily and quickly identify outliers (i.e., those who perform unusually well or poorly based on their ratings). One can also develop a sense for the significance of particular ratings by keeping and using historical records regarding the ultimate status (e.g., asked to leave program, chose to leave program, ultimately judged to be outstanding) of practitioners with particular scores. This will require some investment to collect and make sense of the data, but it is the only way scores acquire meaning. Keeping the same form also makes it more feasible to conduct validity studies such as those conducted by Hojat and colleagues.40 Changing the form repeatedly does nothing but require recalibration of the new form, while losing data captured with an older, calibrated form.
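A minimal sketch of how such norms might be used follows; the historical values and the two-standard-deviation threshold are hypothetical, included only to show the mechanics of flagging outliers against ratings accumulated on an unchanged form.

# Minimal sketch: use ratings accumulated on the same form over several
# years as norms for flagging outliers. Historical values and the +/- 2 SD
# cutoff are illustrative assumptions.
from statistics import mean, stdev

# Per-trainee mean ratings from prior years (hypothetical)
historical_means = [7.1, 6.8, 7.4, 7.0, 6.9, 7.2, 7.3, 6.7, 7.0, 7.1]
norm_mean, norm_sd = mean(historical_means), stdev(historical_means)

def flag_outlier(trainee_mean, z_cut=2.0):
    z = (trainee_mean - norm_mean) / norm_sd
    if z <= -z_cut:
        return "unusually low - review"
    if z >= z_cut:
        return "unusually high"
    return "within norms"

print(flag_outlier(6.2))  # unusually low - review
print(flag_outlier(7.1))  # within norms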


Give Raters Feedback About Stringency and Leniency

Much like feedback to Olympic skating judges, feedback to raters about their clinical performance assessments may help calibrate future evaluations and increase the motivation to provide accurate ratings. Admittedly, because clinical raters observe different samples of clinical performance, they will have more difficulty interpreting information about stringency and leniency. However, providing feedback to raters regarding their average ratings along with average ratings for all raters provides some idea of rater stringency and leniency, especially when the averages are based on many ratings. We are not convinced of the wisdom of handicapping raters arithmetically, as rater knowledge that handicapping is being used may lead them to even more extreme ratings in the future.
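The feedback itself requires very little machinery: compare each rater's mean rating with the mean across all raters, and report how many ratings each average rests on. The sketch below uses hypothetical rater names and scores and assumes ratings have been pooled over a rotation or year.

# Minimal sketch of stringency/leniency feedback: each rater's mean rating
# compared with the all-rater mean. The (rater, rating) pairs are
# hypothetical; in practice each average should rest on many ratings.
from collections import defaultdict
from statistics import mean

ratings = [("rater_a", 8), ("rater_a", 9), ("rater_a", 8),
           ("rater_b", 6), ("rater_b", 5), ("rater_b", 6),
           ("rater_c", 7), ("rater_c", 7), ("rater_c", 8)]

by_rater = defaultdict(list)
for rater, score in ratings:
    by_rater[rater].append(score)

grand_mean = mean(score for _, score in ratings)

for rater, scores in sorted(by_rater.items()):
    print(f"{rater}: mean {mean(scores):.2f} vs all-rater mean {grand_mean:.2f} "
          f"(n = {len(scores)})")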

Learn From Other Professions

Clinical medicine has much to learn about performance evaluation from other professions including the astronaut corps,114 aviation,115 business and industry,116 the clergy,117 the military,118,119 and nuclear power plant operation.120 Each of these professions employs personnel evaluation theories and technologies that are applicable and useful in clinical medicine. They also have made more progress in evaluating team performance,115 an important clinical competency.

Acknowledge the Limits of Ratings

The consistent message throughout this target article is that clinical ratings, as an evaluation tool, have serious limits. Users of clinical ratings need to be fully aware of the limitations and make necessary adjustments in their use for making inferences about clinical competence and in making progress decisions. Despite recent arguments to the contrary,121 our review shows that faculty observations and the ratings they produce are insufficient measures of clinical competence by themselves, although they provide an important source of information about clinical performance. The challenge ahead is to understand when and how to use traditional clinical ratings and to complement their use by adding other measures such as standardized clinical encounters and skills examinations (e.g., ATLS) to fill in the gaps regarding clinical performance.

References

1. Landy FJ, Farr JL. Performance rating. Psychological Bulletin1980;87:72–107.

2. Wolf FM. Lessons to be learned from evidence-based medicine:Practice and promise of evidence-based medicine and evi-dence-based education. Medical Teacher 2000;22:251–9.

3. Wolf FM, Shea JA, Albanese MA. Toward setting a researchagenda for systematic reviews of evidence of the effects of med-ical education. Teaching and Learning in Medicine2001;13:54–60.

4. Albanese M, Norcini J. Systematic reviews: What are they andwhy should we care? Advances in Health Sciences Education2002;7:147–51.

5. Cooper H. Editorial. Psychological Bulletin 2003;129:3–9.6. Kane MT. The assessment of professional competence. Evalua-

tion and the Health Professions 1992;15:163–82.7. American Institutes for Research. The definition of clinical

competence in medicine: Performance dimensions and ratio-nale for clinical skill areas. Palo Alto, CA: American Institutesfor Research in the Behavioral Sciences, 1976.

8. Elstein AS, Shulman LS, Sprafka SA. Medical problem solv-ing: An analysis of clinical reasoning. Cambridge, MA: Har-vard University Press, 1978.

9. Van der Vleuten CPM, Swanson DB. Assessment of clinicalskills with standardized patients: State of the art. Teaching andLearning in Medicine 1990;2:58–76.

10. Stillman PL, Swanson DB, Smee S, et al. Assessing clinicalskills of residents with standardized patients. Annals of InternalMedicine 1986;105:762–71.

11. Stillman PL, Swanson DB, Regan MB, et al. Assessment ofclinical skills of residents utilizing standardized patients: A fol-low-up study and recommendations for application. Annals ofInternal Medicine 1991;114:393–401.

12. Harper AC, Roy WB, Norman GR, Rand CA, Feightner JW.Difficulties in clinical skills evaluation. Medical Education1983;17:24–27.

13. Norman GR, Tugwell P, Feightner JW, Muzzin LJ, Jacoby LL.Knowledge and problem-solving. Medical Education1985;19:344–56.

14. Carline JD, Paauw DS, Thiede KW, Ramsey PG. Factors affect-ing the reliability of ratings of students’clinical skills in a medi-cine clerkship. Journal of General Internal Medicine1992;7:506–10.

15. Carline JD, Wenrich MD, Ramsey PG. Characteristics of rat-ings of physician competence by professional associates. Eval-uation and the Health Professions 1989;12:409–23.

16. Kreiter CD, Ferguson K, Lee WC, Brennan RL, Densen P. Ageneralizability study of a new standardized rating form used toevaluate students’ clinical clerkship performances. AcademicMedicine 1998;73:1294–8.

17. Noel GL, Herbers JEJ, Caplow MP, Cooper GS, Pangaro LN,Harvey J. How well do internal medicine faculty members eval-uate the clinical skills of residents? Annals of Internal Medicine1992;117:757–65.

18. Ganzach Y. Negativity (and positivity) in performance evalua-tion: Three field studies. Journal of Applied Psychology1995;80:491–9.

19. McIntyre M, James L. The inconsistency with which ratersweight and combine information across targets. Human Perfor-mance 1995;8:95–111.

20. Herbers JEJ, Noel GL, Cooper GS, Harvey J, Pangaro LN,Weaver MJ. How accurate are faculty evaluations of clinicalcompetence? Journal of General Internal Medicine1989;4:202–8.

21. Pangaro LN, Worth-Dickstein H, Macmillan K, Shatzer J,Klass D. Performance of “standardized examinees” in a stan-dardized patient examination of clinical skills. Academic Medi-cine 1997;72:1008–11.

22. Heneman RL. The effects of time delay in rating and amount ofinformation observed on performance rating accuracy. Acad-emy of Management Journal 1983;26:677–86.

23. Kassin S, Tubb V, Hosch H, Memon A. On the “general accep-tance” of eyewitness testimony research: A new survey of theexperts. American Psychologist 2001;56:405–16.

289

CLINICAL PERFORMANCE RATINGS

24. Kalet A, Earp JA, Kowlowitz V. How well do faculty evaluate the interviewing skills of medical students? Journal of General Internal Medicine 1992;7:499–505.

25. Weekly JA, Gier JA. Ceilings in the reliability and validity of performance ratings: The case of expert raters. Academy of Management Journal 1989;32:213–22.

26. Elliot DL, Hickam DH. Evaluation of physical examination skills: Reliability of faculty observers and patient instructors. Journal of the American Medical Association 1987;258:3405–8.

27. Martin JA, Reznick RK, Rothman A, Tamblyn RM, Regehr G. Who should rate candidates in an objective structured clinical examination? Academic Medicine 1996;71:170–5.

28. Vu NV, Marcy ML, Colliver JA, Verhulst SJ, Travis TA, Barrows HS. Standardized (simulated) patients’ accuracy in recording clinical performance checklist items. Medical Education 1992;26:99–104.

29. MacRae HM, Vu NV, Graham B, Word-Sims M, Colliver JA, Robbs RS. Comparing checklists and databases with physicians’ ratings as measures of students’ history and physical-examination skills. Academic Medicine 1995;70:313–7.

30. Viswesvaran C, Ones D, Schmidt F. Comparative analysis of the reliability of job performance ratings. Journal of Applied Psychology 1996;81:557–74.

31. Kwolek CJ, Donnelly MB, Sloan DA, Birrell SN, Strodel WE, Schwartz RW. Ward evaluations: Should they be abandoned? Journal of Surgical Research 1997;69:1–6.

32. Ramsey PG, Wenrich MD, Carline JD, Inui TS, Larson EB, LoGerfo JP. Use of peer ratings to evaluate physician performance. Journal of the American Medical Association 1993;269:1655–60.

33. Violato C, Marini A, Toews J, Lockyer J, Fidler H. Feasibility and psychometric properties of using peers, consulting physicians, co-workers, and patients to assess physicians. Academic Medicine 1997;72:S82–4.

34. Kroboth FJ, Hanusa BH, Parker S, et al. The inter-rater reliability and internal consistency of a clinical evaluation exercise. Journal of General Internal Medicine 1992;7:174–9.

35. Wenrich MD, Carline JD, Giles LM, Ramsey PG. Ratings of the performances of practicing internists by hospital-based registered nurses. Academic Medicine 1993;68:680–7.

36. Conway JM, Huffcutt AI. Psychometric properties of multisource performance ratings: A meta-analysis of subordinate, supervisor, peer, and self-ratings. Human Performance 1997;10:331–60.

37. Ginsburg S, Regehr G, Hatala R, et al. Context, conflict, and resolution: A new conceptual framework for evaluating professionalism. Academic Medicine 2000;75(October, Suppl.):S6–11.

38. Tamblyn R, Benaroya S, Snell L, McLeod P, Schnarch B, Abrahamowicz M. The feasibility and value of using patient satisfaction ratings to evaluate internal medicine residents. Journal of General Internal Medicine 1994;9:146–52.

39. Church AH. Do you see what I see? An exploration of congruence in ratings from multiple perspectives. Journal of Applied Social Psychology 1997;27:983–1020.

40. Hojat M, Gonnella JS, Veloski JJ, Erdmann JB. Is the glass half full or half empty? A reexamination of the associations between assessment measures during medical school and clinical competence after graduation. Academic Medicine 1993;68:S69–76.

41. Williams RG, Barrows HS, Vu NV. Direct, standardized assessment of clinical competence. Medical Education 1987;21:482–9.

42. Stillman PL, Regan MB, Philbin MM, Haley HL. Results of a survey on the use of standardized patients to teach and evaluate clinical skills. Academic Medicine 1990;65:288–92.

43. Colliver JA, Williams RG. Technical issues: Test application. Academic Medicine 1993;68:454–60.

44. Issenberg SB, McGaghie WC, Hart IR, et al. Simulation technology for health care professional skills training and assessment. Journal of the American Medical Association 1999;282:861–6.

45. Issenberg SB, Petrusa ER, McGaghie WC, et al. Effectiveness of a computer-based system to teach bedside cardiology. Academic Medicine 1999;74(October, Suppl.):S93–5.

46. Issenberg SB, McGaghie WC, Gordon DL, et al. Effectiveness of a cardiology review course for internal medicine residents using simulation technology and deliberate practice. Teaching and Learning in Medicine 2002;14:223–8.

47. Mangione S, Nieman LZ, Gracely E, Kaye D. The teaching and practice of cardiac auscultation during internal medicine and cardiology training. Annals of Internal Medicine 1993;119:47–54.

48. Cronbach LJ. Processes affecting scores on “understanding of others” and “assumed similarity.” Psychological Bulletin 1955;52:177–93.

49. Nathan BR, Lord RG. Cognitive categorization and dimensional schemata: A process approach to the study of halo in performance ratings. Journal of Applied Psychology 1983;68:102–14.

50. Cafferty T, DeNisi AS, Williams KJ. Search and retrieval patterns for performance information. Journal of Personality and Social Psychology 1986;50:676–83.

51. DeNisi AS, Peters LH. Organization of information in memory and the performance appraisal process: Evidence from the field. Journal of Applied Psychology 1996;81:717–37.

52. Williams KJ, DeNisi AS, Meglino BM, Cafferty T. Initial judgments and subsequent appraisal decisions. Journal of Applied Psychology 1986;71:189–95.

53. Williams KJ. The effect of performance appraisal salience on recall and ratings. Organizational Behavior and Human Decision Processes 1990;46:217–39.

54. Verhulst S, Colliver J, Paiva R, Williams RG. A factor analysis of performance of first-year residents. Journal of Medical Education 1986;61:132–4.

55. Thompson WG, Lipkin MJ, Gilbert DA, Guzzo RA, Roberson L. Evaluating evaluation: Assessment of the American Board of Internal Medicine Resident Evaluation Form (see comments). Journal of General Internal Medicine 1990;5:214–7.

56. Metheny WP. Limitations of physician ratings in the assessment of student clinical performance in an obstetrics and gynecology clerkship. Obstetrics and Gynecology 1991;78:136–41.

57. Schueneman AL, Carley JP, Baker WH. Residency evaluations: Are they worth the effort? Archives of Surgery 1994;129:1067–73.

58. Haber RJ, Avins AL. Do ratings on the American Board of Internal Medicine Resident Evaluation Form detect differences in clinical competency? Journal of General Internal Medicine 1994;9:140–5.

59. Burdick WP, Schoffstall J. Observation of emergency medicine residents at the bedside: How often does it happen? Academic Emergency Medicine 1995;2:909–13.

60. Bretz RD, Milkovich GT, Read W. The current state of performance appraisal research and practice: Concerns, directions, and implications. Journal of Management 1992;18:321–52.

61. Rothstein HR. Interrater reliability of job performance ratings: Growth to asymptote level with increasing opportunity to observe. Journal of Applied Psychology 1990;75:322–7.

62. Woehr DJ. Performance dimension accessibility: Implications for rating accuracy. Journal of Organizational Behavior 1992;13:357–67.

63. Day NE. Can performance raters be more accurate? Investigating the benefits of prior knowledge of performance dimensions. Journal of Managerial Issues 1995;7:323–42.

64. Tesser A, Rosen S. The reluctance to transmit bad news. In L Berkowitz (Ed.), Advances in experimental social psychology (Vol. 8, pp. 193–232). New York: Academic, 1975.

65. Benedict ME, Levine EL. Delay and distortion: Tacit influences on performance appraisal effectiveness. Journal of Applied Psychology 1988;73:507–14.

66. Waldman DA, Thornton GC, III. A field study of rating conditions and leniency in performance appraisal. Psychological Reports 1988;63:835–40.

67. Harris MM, Smith DE, Champagne D. A field study of performance appraisal purpose: Research- versus administrative-based ratings. Personnel Psychology 1995;48:151–60.

68. Longenecker CO, Sims HP, Gioia DA. Behind the mask: The politics of employee appraisal. The Academy of Management Executive 1987;1:183–93.

69. Bernardin HJ, Cardy RL. Appraisal accuracy: The ability and motivation to remember the past. Public Personnel Management 1982;11:352–7.

70. Littlefield JH, DaRosa DA, Anderson KD, Bell RM, Nicholas GG, Wolfson PJ. Accuracy of surgery clerkship performance raters. Academic Medicine 1991;66:S16–8.

71. Ryan JG, Mandel FS, Sama A, Ward MF. Reliability of faculty clinical evaluations of non-emergency medicine residents during emergency department rotations. Academic Emergency Medicine 1996;3:1124–30.

72. Magarian GJ, Mazur DJ. A national survey of grading systems used in medicine clerkships. Academic Medicine 1990;65:636–9.

73. Magarian GJ, Mazur DJ. Evaluation of students in medicine clerkships. Academic Medicine 1990;65:341–5.

74. Vu NV, Henkle JQ, Colliver JA. Consistency of pass–fail decisions made with clinical clerkship ratings and standardized-patient examination scores. Academic Medicine 1994;69:S40–1.

75. Shea J, Norcini J, Kimball H. Relationships of ratings of clinical competence and ABIM scores to certification status. Academic Medicine 1993;68:S22–4.

76. Ramsey PG, Carline JD, Blank LL, Wenrich MD. Feasibility of hospital-based use of peer ratings to evaluate the performances of practicing physicians. Academic Medicine 1996;71:364–70.

77. Schwartz RW, Donnelly MB, Sloan DA, Johnson SB, Strodel WE. The relationship between faculty ward evaluations, OSCE, and ABSITE as measures of surgical intern performance. American Journal of Surgery 1995;169:414–7.

78. Speer AJ, Solomon DJ, Ainsworth MA. An innovative evaluation method in an internal medicine clerkship. Academic Medicine 1996;71:S76–8.

79. Hemmer PA, Hawkins R, Jackson JL, Pangaro LN. Assessing how well three evaluation methods detect deficiencies in medical students’ professionalism in two settings of an internal medicine clerkship. Academic Medicine 2000;75:167–73.

80. Barzansky B, Jonas HS, Etzel SI. Educational programs in US medical schools, 1997–1998. Journal of the American Medical Association 1998;280:803–8.

81. Tonesk X, Buchanan RG. An AAMC pilot study by 10 medical schools of clinical evaluation of students. Journal of Medical Education 1987;62:707–18.

82. Yao DC, Wright SM. National survey of internal medicine residency program directors regarding problem residents. Journal of the American Medical Association 2000;284:1099–104.

83. Kwolek CJ, Donnelly MB, Sloan DA, Birrell SN, Strodel WE, Schwartz RW. Ward evaluations: Should they be abandoned? Journal of Surgical Research 1997;69:1–6.

84. Jain SS, DeLisa JA, Campagnolo DI. Methods used in the evaluation of clinical competency of physical medicine and rehabilitation residents. American Journal of Physical Medicine and Rehabilitation 1994;73:234–9.

85. Newble DI, Hoare J, Sheldrake PF. The selection and training of examiners for clinical examinations. Medical Education 1980;14:345–9.

86. Woehr DJ, Huffcutt AI. Rater training for performance appraisal: A quantitative review. Journal of Occupational and Organizational Psychology 1994;67:189–205.

87. Pulakos ED, White LA, Oppler SH, Borman WC. Examination of race and sex effects on performance ratings. Journal of Applied Psychology 1989;74:770–80.

88. Sackett PR, DuBois CL. Rater–ratee race effects on performance evaluation: Challenging meta-analytic conclusions. Journal of Applied Psychology 1991;76:873–7.

89. McEvoy GM, Cascio WF. Cumulative evidence of the relationship between employee age and job performance. Journal of Applied Psychology 1989;74:11–7.

90. Waldman DA, Avolio BJ. A meta-analysis of age differences in job performance. Journal of Applied Psychology 1986;71:33–8.

91. Heilman ME, Stopeck MH. Being attractive, advantage or disadvantage? Performance-based evaluations and recommended personnel actions as a function of appearance, sex, and job type. Organizational Behavior and Human Decision Processes 1985;35:202–15.

92. Wigton RS. The effects of student personal characteristics on the evaluation of clinical performance. Journal of Medical Education 1980;55:423–7.

93. Goffman E. The presentation of self in everyday life. Garden City, NY: Doubleday, 1959.

94. Haas J, Shaffir W. Ritual evaluation of competence. Work and Occupations 1982;9:131–54.

95. Haas J, Shaffir W. Becoming doctors: The adoption of a cloak of competence. Greenwich, CT: JAI, 1987.

96. Gordon R. Impact of ingratiation on judgments and evaluations: A meta-analytic investigation. Journal of Personality and Social Psychology 1996;71:54–70.

97. Martell RF, Borg MR. A comparison of the behavioral rating accuracy of groups and individuals. Journal of Applied Psychology 1993;78:43–50.

98. Wright P. The harassed decision maker: Time pressures, distractions, and the use of evidence. Journal of Applied Psychology 1974;59:555–61.

99. Brannick MT, Salas E, Prince C (Eds.). Team performance assessment and measurement: Theory, methods, and applications. Mahwah, NJ: Lawrence Erlbaum Associates, Inc., 1997.

100. Kopelow ML, Schnabl GK, Hassard TH, et al. Assessing practicing physicians in two settings using standardized patients. Academic Medicine 1992;67(October, Suppl.):S19–21.

101. Sackett P, Zedeck S, Fogli L. Relations between measures of typical and maximum job performance. Journal of Applied Psychology 1988;73:482–6.

102. Tamblyn R, Benaroya S, Snell L, McLeod P, Schnarch B, Abrahamowicz M. The feasibility and value of using patient satisfaction ratings to evaluate internal medicine residents. Journal of General Internal Medicine 1994;9:146–52.

103. Newble D. The critical incident technique: A new approach to the assessment of clinical performance. Medical Education 1983;17:401–3.

104. McGlynn TJ, Munzenrider RF, Zizzo J. A resident’s internal medicine practice. Evaluation and the Health Professions 1979;2:463–76.

105. Shatzer J, Wardrop J, Williams R, Hatch T. The generalizability of performance on different station length standardized patient cases. Teaching and Learning in Medicine 1994;6:54–8.

106. Shatzer JH, DaRosa D, Colliver JA, Barkmeier L. Station-length requirements for reliable performance-based examination scores. Academic Medicine 1993;68:224–9.

107. Rhoton MF, Furgerson CL, Cascorbi HF. Evaluating clinical competence in anesthesia: Using faculty comments to develop criteria. Annual Conference on Research in Medical Education 1986;25:57–62.

108. Rhoton MF. A new method to evaluate clinical performance and critical incidents in anesthesia: Quantification of daily comments by teachers. Medical Education 1989;23:280–9.

109. Rhoton MF, Barnes A, Flashburg M, Ronai A, Springman S. Influence of anesthesiology residents’ noncognitive skills on the occurrence of critical incidents and the residents’ overall clinical performances. Academic Medicine 1991;66:359–61.

110. Soglin DJ. The use of event report cards in the evaluation of pediatric residents: A pilot study. Chicago: University of Illinois, Department of Medical Education, 1996.

111. Brennan BG, Norman GR. Use of encounter cards for evaluation of residents in obstetrics. Academic Medicine 1997;72:S43–4.

112. Flanagan JC, Burns RK. The employee performance record: A new appraisal and development tool. Harvard Business Review 1955;33:95–102.

113. Norcini JJ, Blank LL, Arnold GK, Kimball HR. The mini-CEX (clinical evaluation exercise): A preliminary investigation. Annals of Internal Medicine 1995;123:795–9.

114. Santy PA. Choosing the right stuff: The psychological selection of astronauts and cosmonauts. Westport, CT: Praeger, 1994.

115. Wiener EL, Kanki BG, Helmreich RL (Eds.). Cockpit resource management. San Diego, CA: Academic, 1993.

116. Berk RA (Ed.). Performance assessment: Methods & applications. Baltimore: Johns Hopkins University Press, 1986.

117. Hunt RA, Hinkle JE, Malony HN (Eds.). Clergy assessment and career development. Nashville, TN: Abingdon, 1990.

118. Wigdor AK, Green BF (Eds.). Performance assessment for the workplace (Vol. 1). Washington, DC: National Academy Press, 1991.

119. Wigdor AK, Green BF (Eds.). Performance assessment for the workplace: Technical issues (Vol. 2). Washington, DC: National Academy Press, 1991.

120. Toquam JL, Macaulay JL, Westra CD, Fujita Y, Murphy SE. Assessment of nuclear power plant crew performance variability. In MT Brannick, E Salas, C Prince (Eds.), Team performance assessment and measurement: Theory, methods, and applications (pp. 253–287). Mahwah, NJ: Lawrence Erlbaum Associates, Inc., 1997.

121. Whitcomb ME. Competency-based graduate medical education? Of course! But how should competency be assessed? Academic Medicine 2002;77:359–60.

Received 23 September 2002
Final revision received 14 April 2003
