Yoon Soo Park, PhD
September 21, 2017
Validity Studies on the Assessment of Clinical Reasoning Skills Using Patient Notes
University of Tokyo Medical Education Lecture
Overview
1. Overview of Clinical Reasoning: Diagnostic Justification
2. USMLE Step 2 Clinical Skills Examination
3. Validity Studies of Patient Note Scoring Rubric
– Five Projects (Years 2012 to 2017)
– Gather Validity Evidence
4. Implications for Validity Research
Clinical Reasoning (1)
Data Gathering: History, Physical Examination, Lab Results
Differential Diagnosis → Working Diagnosis → Plan
Diagnoses listed do not always relate to actual findings!
[Figure: Findings 1–5 linked, with question marks, to Diagnoses A, B, and C]
Clinical Reasoning (2)
• Organize and use knowledge
• Collect pertinent data about patients
• Generate appropriate DDx given chief complaint
• Make good diagnostic decisions using data
Diagnostic Justification
1. How to assess?
2. Validity?
Clinical Reasoning – Diagnostic Justification
[Figure: Findings 1–5 explicitly linked to Diagnoses A and B]
Diagnostic Justification Study – 4th-year Medical Students
Williams and Klamen, Academic Medicine, 2012
Williams and Klamen Study
• Case specificity – Justification scores variable across cases
• Task specificity
• >20% of students had borderline or poor diagnostic justification on >50% of cases
USMLE Step 2 Clinical Skills: Patient Note Format
Old Format (Before June 2012):
1. Document H & PE
2. DDX
3. Plan for immediate workup
New Format (After June 2012):
1. Document H & PE
2. Ranked and Justified DDX
3. Plan for immediate workup
USMLE Step 2 Clinical Skills (1)
Standardized Patient Encounter (15 minutes) → Patient Note Writing (10 minutes)
[Figure: Medical student sees the patient, then writes the patient note]
Repeat for 12 stations
• Medical schools → align curriculum and assessment
• Develop scoring rubrics
• Train faculty raters to score PNs
SP Encounter (about 15 min.) → Patient Note (about 10 min.)
Scoring by Standardized Patients:
• Physical Examination
Scoring by Physician Raters:
1. Documentation
2. Differential Diagnosis
3. Diagnostic Justification
4. Workup Plan
USMLE Step 2 Clinical Skills (2)
History:
• 75-year-old woman, sudden onset back pain while lifting turkey from oven
• Pain in midline, mid-thoracic
• PMH: fracture of foot bone last year
• Meds: Acetaminophen for occasional headache …
Physical Examination:
• Well-nourished woman, looks her age, alert and lucid
• Appears in pain, especially with movements
• Point tenderness over T8
• Straight leg test negative; sensation intact …
Sample note excerpt:
• Diagnosis: Pulled back muscle
• Justification: sudden onset back pain with lifting; pain worse with movement; range of motion limited by pain
• Workup: imaging of back/spine
Five Sources of Validity Evidence (AERA, APA, & NCME, 2014)
• Content – content sampled, blueprint
• Response process – responses, rater issues
• Internal structure – reliability, psychometrics
• Relations to other variables – correlations between assessment scores and other measures
• Consequences – impact on pass-fail rates, curriculum
Education and Research Cycle: University of Illinois – College of Medicine
[Figure: Cycle – Assessments (Graduation Competency Examination) → Psychometric Analysis / Research Paper → Generate New Questions → Education Committee Discussion → Curricular Change / New Questions → Quality Improvement / Accreditation Agency]
• 1.1 Continuous Quality Improvement – A medical school engages in ongoing planning and continuous improvement processes … achievement of measurable outcomes …
• 8.4 Program Evaluation – A medical school collects and uses a variety of outcome data, including national norms of accomplishment, to demonstrate the extent to which medical students are achieving medical education program objectives and to enhance medical education program quality.
Standards for Accreditation of Medical Education Programs Leading to the MD Degree, Liaison Committee on Medical Education (LCME), 2016
Validity Research: Patient Note Rubric
[Timeline, 2012–2017: Paper 1 – Development; Paper 2 – Reliability and Psychometrics; Paper 3 – Rater Issues; Paper 4 – Composite Measure; Paper 5 – Multisite Study]
Validity Research Agenda (Sireci 2013; Kane 2013; Messick 1995)
Sources of validity evidence: Content, Internal Structure, Relations with Other Variables, Response Process, Consequences
Projects:
• Paper 1 (2012–2013)
• Paper 2 (2013–2015)
• Paper 3 (2014–2016)
• Paper 4 (2015–2016)
• Paper 5 (2015–2017)
[Table: each paper marked against the sources of validity evidence it addresses]
Patient Note Scoring Rubric – Task with Description, Score, and Anchors

1. Documentation: Documentation of findings in history (Hx) and physical examination (PE) (30 points)
   1. Key Hx and PE findings are missing or incorrect
   2. Some key positive findings present but poorly documented or disorganized, or missing pertinent negatives
   3. Most key positive findings well documented and organized; may miss a few pertinent negatives
   4. All key information present, concise and well organized, with little irrelevant information

2. DDX: Differential Diagnosis (30 points)
   1. [0–1 of 3] or [0 of 2] of the correct diagnoses listed
   2. [2 of 3] or [1 of 2] of the correct diagnoses listed, in any order
   3. All diagnoses listed, incorrect rank order
   4. All diagnoses listed and correctly rank ordered

3. Justification: Justification of Differential Diagnosis (30 points)
   1. No justification provided OR many missing or incorrect links between findings and Dx
   2. Some missing or incorrect links between findings and Dx
   3. Only a few missing or incorrect attributions, which would not impact Dx
   4. Links to diagnoses are correct and complete

4. Workup: Plan for Immediate Diagnostic Workup (10 points)
   1. Diagnostic workup places patient in unnecessary risk or danger
   2. Ineffective plan for diagnostic workup – essential tests missed, irrelevant tests included
   3. Reasonable plan for diagnostic workup; may have some unnecessary tests or miss a few essential tests
   4. Plan for diagnostic workup is effective and efficient, includes all essential tests, and few or no unnecessary tests
Literature Review and Interview/Focus Groups
• As described in the literature?
• What language do learners use?
Conceptualization of Construct ↔ Prospective Learners: do they match?
Literature Review
• Determine whether the construct already exists
• Alignment between construct and theory
Interview / Focus Group
Conduct Cognitive Interviews
• Read and paraphrase it – "tell me what you think it is asking, in your own words"
• Ask more focused questions to evaluate whether respondents agree:
  – Observable?
  – Relevant?
  – Wording clear?
  – Responses clear?
  – Other?
Pilot Study – Task Score: Documentation
1. Documentation of Key Hx and PE (30 points)
   1. Key Hx and PE findings are missing or incorrect – 7%
   2. Some key positive findings present but poorly documented or disorganized or missing pertinent negatives – 27%
   3. Most key positive findings well documented and organized, may miss a few pertinent negatives – 57%
   4. All key information present, concise and well organized with little irrelevant info – 9%
Pilot Study – Task Score: Differential Diagnosis
2. Differential Diagnosis (30 points)
   1. [0–1 of 3] or [0 of 2] of the correct diagnoses listed – 11%
   2. [2 of 3] or [1 of 2] of the correct diagnoses listed, in any order – 34%
   3. All diagnoses listed, incorrect rank order – 29%
   4. All diagnoses listed and correctly rank ordered – 25%
Pilot Study – Task Score: Justification
3. Justification (30 points)
   1. No justification provided OR many missing or incorrect links between findings and Dx – 6%
   2. Some missing or incorrect links between findings and Dx – 34%
   3. Only a few missing or incorrect attributions, which would not impact Dx – 55%
   4. Links to diagnoses are correct and complete – 5%
Pilot Study – Task Score: Workup
4. Plan for Immediate Dx Workup (10 points)
   1. Diagnostic workup places patient in unnecessary risk or danger – 7%
   2. Ineffective plan for diagnostic workup – essential tests missed, irrelevant tests included – 54%
   3. Reasonable plan for diagnostic workup, may have some unnecessary tests or miss a few essential tests – 36%
   4. Plan for diagnostic workup is effective and efficient, includes all essential tests, and few or no unnecessary tests – 3%
Students who received scores of "Poor" or "Borderline" on Justification
# Cases | % of students
0 | 9
1 | 31
2 | 30
3 | 21
4 | 8
5 | 1
Consistent with findings from the Williams Study –
>20% of students had borderline or poor diagnostic justification on >50% of cases.
Students need more practice linking findings and diagnoses!
[Figure: Findings 1–5 linked to Diagnoses A and B]
Yudkowsky et al, Academic Medicine, 2015
Paper 2: Reliability and Psychometrics (Project Years: 2013–2015)
True Variance
[Figure: Distribution of scores – x-axis: Score (X), y-axis: Frequency]
True Variance = Total Variance – Error
Total Variance = True Variance + Error
Reliability = "true" variance / ("true" variance + error variance)
Reliability: Example
Reliability = True Variance / Total Variance = True / (True + Error)
Example (Note: Hypothetical!)
Test Mean = 65, Standard Deviation = 5, Total Variance = 25
Total (True + Error) = 25 → True = 15, Error = 10
Reliability = 15 / 25 = 15 / (15 + 10) = .60
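As a quick illustration of the arithmetic above, here is a minimal Python sketch of the classical reliability ratio; the inputs are the hypothetical example values (true variance 15, error variance 10), not estimates from the study.

```python
# Minimal sketch of the classical reliability ratio illustrated above.
# Inputs are the hypothetical example values (true variance = 15, error variance = 10).

def reliability(true_variance: float, error_variance: float) -> float:
    """Reliability = "true" variance / ("true" variance + error variance)."""
    total_variance = true_variance + error_variance
    return true_variance / total_variance

print(reliability(true_variance=15, error_variance=10))  # prints 0.6
```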
Internal Structure: Reliability
• Reliability with 5 cases
  – G-coefficient: .47
  – Φ-coefficient: .43
• Variance components
  – Students: 6%
  – Student–case interaction (case specificity): 20%
• Need 15 cases to reach a Φ-coefficient of .70 (illustrated in the sketch after the citation below)
Park et al, Advances in Health Sciences Education, 2016
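To show how a projection like "need 15 cases to reach a Φ of .70" is obtained, here is a minimal decision-study (D-study) sketch for a persons × cases design; the variance components below are illustrative placeholders, not the components estimated in the paper.

```python
# Hypothetical D-study sketch for a persons x cases (p x c) design.
# Phi = person variance / (person variance + absolute error variance),
# where absolute error variance = (case variance + person-by-case variance) / n_cases.
# The variance components below are illustrative placeholders, not the published estimates.

def phi_coefficient(var_person: float, var_case: float, var_pc_error: float,
                    n_cases: int) -> float:
    """Absolute (Phi) coefficient when averaging scores over n_cases cases."""
    absolute_error = (var_case + var_pc_error) / n_cases
    return var_person / (var_person + absolute_error)

for n in (5, 10, 15, 20):
    print(n, round(phi_coefficient(var_person=0.06, var_case=0.05,
                                   var_pc_error=0.20, n_cases=n), 2))
```

With these placeholder components, Φ rises as cases are added, which is the logic behind projecting the number of cases needed for a target reliability.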
Paper 3: Rater Issues (Project Years: 2014–2016)
Rater Agreement
Rater 1 | Rater 2 | Agreement
3 | 2 | Adjacent
4 | 4 | Exact
1 | 3 | Discrepant
2 | 4 | Discrepant
1 | 4 | Discrepant
3 | 4 | Adjacent
1 | 1 | Exact
2 | 2 | Exact
4 | 2 | Discrepant
1 | 1 | Exact
• How well do raters agree?
• Exact: % exact match
• Adjacent: % match within ±1 adjacent point
• Discrepant: % discrepant by ±2 or more points
Agreement: % Exact = 40%, % Adjacent = 20%, % Discrepant = 40%
Other agreement indices:
• Kappa
• Intraclass Correlations
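A minimal sketch of the exact/adjacent/discrepant tallies, assuming the ten rater score pairs in the example table above:

```python
from collections import Counter

# Rater score pairs from the example table above (Rater 1, Rater 2).
pairs = [(3, 2), (4, 4), (1, 3), (2, 4), (1, 4),
         (3, 4), (1, 1), (2, 2), (4, 2), (1, 1)]

def classify(r1: int, r2: int) -> str:
    """Label a pair of ratings as exact, adjacent (±1), or discrepant (±2 or more)."""
    diff = abs(r1 - r2)
    if diff == 0:
        return "exact"
    if diff == 1:
        return "adjacent"
    return "discrepant"

counts = Counter(classify(r1, r2) for r1, r2 in pairs)
for label in ("exact", "adjacent", "discrepant"):
    print(f"% {label}: {100 * counts[label] / len(pairs):.0f}%")
# Expected: % exact: 40%, % adjacent: 20%, % discrepant: 40%
```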
[Figure: Projections in Absolute SEM – absolute SEM (y-axis, roughly 3.5 to 7) vs. number of cases (5–12, x-axis), plotted separately for 1, 2, 3, and 4 raters]
Adding a rater or creating a new case:
which is optimal?
Adding a rater (double scoring) ≈ 2 new cases!
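The trade-off can be checked with a small generalizability-theory sketch for a fully crossed persons × cases × raters design; the variance components here are made-up placeholders chosen only to illustrate how the absolute SEM responds to adding a rater versus adding cases.

```python
import math

# Hypothetical generalizability sketch for a fully crossed persons x cases x raters design.
# Absolute error variance = var_c/n_c + var_r/n_r + var_pc/n_c + var_pr/n_r
#                           + var_cr/(n_c*n_r) + var_pcr_e/(n_c*n_r).
# All variance components below are made-up placeholders.

def absolute_sem(n_cases: int, n_raters: int,
                 var_c=0.05, var_r=0.01, var_pc=0.20,
                 var_pr=0.02, var_cr=0.01, var_pcr_e=0.15) -> float:
    """Absolute standard error of measurement for n_cases cases and n_raters raters."""
    error_variance = (var_c / n_cases + var_r / n_raters
                      + var_pc / n_cases + var_pr / n_raters
                      + var_cr / (n_cases * n_raters)
                      + var_pcr_e / (n_cases * n_raters))
    return math.sqrt(error_variance)

print(round(absolute_sem(n_cases=5, n_raters=1), 3))  # baseline: 5 cases, single rater
print(round(absolute_sem(n_cases=5, n_raters=2), 3))  # add a rater (double scoring)
print(round(absolute_sem(n_cases=7, n_raters=1), 3))  # add two cases instead
```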
Rater Training
Training Materials
1. Detailed case information
2. "Gold Standard" exemplar note
3. Case-specific scoring guidelines
4. 3 sample notes (actual student notes)
5. 10 extra notes (optional cases to review)
Training (2 sessions: 3 hours total)
1. Moderator presented summary of case
2. Discussion of 3 sample notes
3. Discrepancies in scoring discussed
4. Update scoring guidelines and "Gold Standard" exemplar note
How to combine different educational experiences?
• Example: How can weights be specified to optimize signal?
[Figure: Patient Note, Written Exam, and SP Encounter (OSCE) combined, with weights to be determined ("?"), into a Composite Measure]
Composite Scores: Weighted Mean vs Kane Method
Weighted Mean Score:
• SP Physical Exam (SP-PE): Score = 80%, Weight = .30
• Patient Note (PN): Score = 65%, Weight = .70
• Weighted Mean Score = 80% × .30 + 65% × .70 = 69.5%
Composite Score: Kane Method:
• Standardize SP-PE and PN scores
• Take the weighted mean of the standardized scores
• Adjust composite score variance → Kane Composite Score
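Here is a minimal sketch contrasting the two approaches; the examinee scores are made-up placeholders, and the final "adjust composite score variance" step of the Kane method is represented here only as rescaling the standardized composite to a chosen reporting mean and SD.

```python
import statistics

# Hypothetical component scores (percent) for a small group of examinees.
sp_pe = [80, 72, 65, 90, 78]   # SP Physical Exam (SP-PE)
pn    = [65, 70, 60, 85, 72]   # Patient Note (PN)
w_sp_pe, w_pn = 0.30, 0.70     # nominal weights from the example above

# 1) Simple weighted mean of the raw percent scores.
weighted_mean = [w_sp_pe * a + w_pn * b for a, b in zip(sp_pe, pn)]

# 2) Kane-style composite (sketch): standardize each component first so the
#    nominal weights are not distorted by unequal component variances, then
#    take the weighted mean and rescale to a chosen reporting scale.
def z_scores(xs):
    mean, sd = statistics.mean(xs), statistics.stdev(xs)
    return [(x - mean) / sd for x in xs]

z_composite = [w_sp_pe * a + w_pn * b
               for a, b in zip(z_scores(sp_pe), z_scores(pn))]
target_mean, target_sd = 70, 10          # arbitrary reporting scale
sd_c = statistics.stdev(z_composite)
kane_composite = [target_mean + target_sd * (z / sd_c) for z in z_composite]

print(weighted_mean[0])    # 69.5 for the first examinee (80 x .30 + 65 x .70)
print(round(kane_composite[0], 1))
```

The design point this illustrates: with raw weighted means, a component with a larger variance carries more effective weight than its nominal weight; standardizing first keeps the intended weights.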
Weights and Composite Score Reliability
Psychometrically – what "weight" maximizes composite score reliability?
[Figure: Composite score reliability (y-axis, roughly .25 to .60) vs. Patient Note weight, 0–100% (x-axis), for two composites: SP-PE and PN; SP-Hx, SP-PE, and PN. Reliability values of .42, .46, and .51 are annotated on the curves.]
Pass-Fail Classification
[Figure: Composite score (y-axis, 0–100) vs. weighted GCE % score for the Integrated Clinical Encounter (x-axis, 40–80)]
Misclassification: n = 16, 4.5%
Weighted GCE % cutscore = 52%
Composite cutscore = 9
Misclassification: n = 1, 0.3%
Classification accuracy (κ) = .77
Park et al, Academic Medicine, in press
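To make the classification-accuracy figure concrete, here is a minimal sketch of Cohen's κ between two pass/fail decisions; the cutscores echo those above (GCE 52%, composite 9), but the student scores are made-up placeholders, so the resulting κ is only illustrative.

```python
# Sketch: classification agreement (Cohen's kappa) between two pass/fail decisions.
# Cutscores echo the slide (GCE 52%, composite 9); the scores are made-up placeholders.

gce_scores = [48, 55, 60, 50, 70, 53, 45, 66, 58, 51]
composite_scores = [8, 10, 12, 8, 15, 10, 7, 13, 11, 9]
gce_cut, composite_cut = 52, 9

gce_pass = [s >= gce_cut for s in gce_scores]
comp_pass = [s >= composite_cut for s in composite_scores]

n = len(gce_pass)
observed = sum(a == b for a, b in zip(gce_pass, comp_pass)) / n
p_gce, p_comp = sum(gce_pass) / n, sum(comp_pass) / n
expected = p_gce * p_comp + (1 - p_gce) * (1 - p_comp)   # chance agreement
kappa = (observed - expected) / (1 - expected)
print(round(kappa, 2))
```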
Paper 5: Multisite Study (Project Years: 2015–2016)
Validity Evidence and Scoring Guidelines for Standardized Patient Encounters and Patient Notes from a Multisite Study of Clinical Performance Examinations in Seven Medical Schools
Yoon Soo Park, PhD, Abbas Hyderi, MD, MPH, Nancy Heine MEd, RN, ANP, Win May MD, PhD, Andrew Nevins, MD, Ming Lee, PhD, Georges Bordage, MD, PhD, and Rachel Yudkowsky, MD, MHPE
Why multisite validity study?
• Enhance psychometric rigor → Validity
Psychometric
• Greater heterogeneity of learner samples
• Reliability and validity inferences
Educational
• Broaden mix of clinical cases
• Shared case development
• Develop common rater training protocols
• Evaluate program
• Many more!
Multisite Participants
Validity Evidence:
• UI College of Medicine (n = 160)
• California Consortium for the Assessment of Clinical Competence (CCACC) (n = 830)
  1. Loma Linda University
  2. UC Davis
  3. UCLA
  4. UC San Diego
  5. University of Southern California
  6. Stanford University
Generalize inferences? Cross-institutional consequential evidence
Multisite Validity Evidence
• Consistent with previous findings
• Notable school-effect differences on patient note scores
  – Evidence from variance components
  – Differences in effective teaching of clinical reasoning
• Ability to "justify" diagnoses → most discriminating
• Develop case-specific scoring guidelines – more accessible
[Figure: Patient Note scores (%) by School ID (1–7), with 95% confidence intervals; y-axis roughly 65–80%]
Program Evaluation
Why is performance on Medical School Clinical Skills Exam important?
[Figure: Correlations between medical school clinical skills exam components and the Step 2 CS Exam – Standardized Patient Encounter, Patient Note, and Communication and Interpersonal Skills; r = .32, r = .75, and r = .56]
Berg et al, Academic Medicine, 2008
Cuddy et al, Academic Medicine, 2016
Residency Performance: Significant Prediction!
Current Studies on the Patient Note (1) – Revisit Project (8 medical schools)
Reflect and Revisit Patient
Blatt et al, Journal of General Internal Medicine, 2007
Reflection Study – Standardized Patient Encounter
Current Studies on the Patient Note (2) – Non-Physician Rater (15 medical schools)
• Scoring Patient Note – Expensive
• Cost of training and scoring notes – UIC estimates:
– Physician: $40,869 (excluding case development costs)
– Non‐Physicians: $9,990
• California pilot study (4 medical schools) – Physician and Non-Physician correlation: r = .64
Translational Science in Education
[Figure: Translational pipeline comparison – McGaghie WC, Science Translational Medicine, 2010]
K-12 Education (example): Teacher training and development → Teacher knowledge and skills → Classroom teaching → Student achievement (Allen et al, Science, 2011)
Medical Education: Trainee clinical skills and knowledge → Clinics and offices for better health care → Patient or public health outcome
Stages: T1 (education setting) → T2 (clinic) → T3 (clinic and community)
Yoon Soo Park, PhD – email: [email protected]