Designing Trustworthy & Reliable GME Evaluations
Conference Session: SES85, 2011 ACGME Annual Education Conference
Nancy Piro, PhD, Program Manager/Ed Specialist; Alice Edler, MD, MPH, MA (Educ); Ann Dohn, MA, DIO; Bardia Behravesh, EdD, Manager/Ed Specialist
Stanford Hospital & Clinics, Department of Graduate Medical Education
Overall Questions
What is assessment… an evaluation?
How are they different?
What are they used for?
Why do we evaluate?
How do we construct a useful evaluation?
What is cognitive bias?
How do we eliminate bias from our evaluations?
What is validity?
What is reliability?
Defining the Rules of the “Game”
Assessment - Evaluation: What’s the difference and what are they used for?
Assessment… is the analysis and use of data by residents or sub-specialty residents (trainees), faculty, program directors and/or departments to make decisions about improvements in teaching and learning.
Assessment - Evaluation: What’s the difference and what are they used for?
Evaluation is the analysis and use of data by faculty to make judgments about trainee performance. Evaluation includes obtaining accurate performance-based, empirical information which is used to make competency decisions on trainees across the six domains.
Evaluation Examples
Example 1: A trainee delivers an oral presentation at a Journal Club. The faculty member provides a critique of the delivery and content accompanied by a rating for the assignment.
Example 2: A program director provides a final evaluation to a resident accompanied by an attestation that the resident has demonstrated sufficient ability and acquired the appropriate clinical and procedural skills to practice competently and independently.
Why do we assess and evaluate? (Besides the fact it is required…)
– Demonstrate and improve trainee competence in core and related competency areas - knowledge and application
– Ensure our programs produce graduates, each of whom “has demonstrated sufficient ability and acquired the appropriate clinical and procedural skills to practice competently and independently”
– Track the impact of curriculum/organizational change
– Gain feedback on program, curriculum and faculty effectiveness
– Provide residents/fellows a means to communicate confidentially
– Provide an early warning system
– Identify gaps between competency-based goals and individual performance
So What’s the Game Plan for Constructing Effective Evaluations?
Without a plan… evaluations can take on a life of their own!
How do we construct a useful evaluation?
STEP 1. Create the Evaluation (Plan)
– Curriculum (Competency) Goals, Objectives and Outcomes
– Question and Scale Development
STEP 2. Deploy (Do)
– Online/In-Person (Paper)
STEP 3. Analyze (Study/Check)
– Reporting, Benchmarking and Statistical Analysis
– Rank Order/Norms (Within the Institution or National)
STEP 4. Take Action (Act)
– Develop & Implement Learning/Action Plans
– Measure Progress Against Learning Goals
– Adjust Learning/Action Plans
Question and Response Scale Construction
Two Basic Goals:
1. Construct unbiased, unconfounded, and non-leading questions that produce valid data
2. Design and use unbiased and valid response scales
What is cognitive bias…
Cognitive bias is distortion in the way we perceive reality or information.
Response bias is a particular type of cognitive bias which can affect the results of an evaluation if respondents answer questions in the way they think they are designed to be answered, or with a positive or negative bias toward the examinee.
Where does response bias occur?
Response bias most often occurs in the wording of the question.
– Response bias is present when a question contains a leading phrase or words.
– Response bias can also occur in rating scales.
Response bias can also be in the raters themselves:
– Halo Effect
– Devil Effect
– Similarity Effect
– First Impressions
Step 1: Create the Evaluation
Question Construction
Example (1):
– "I can always talk to my Program Director about residency related problems.”
Example (2):
– “Sufficient career planning resources are available to me and my program director supports my professional aspirations.”
Question Construction
Example (3):
– “Incomplete, inaccurate medical interviews, physical examinations; incomplete review and summary of other data sources. Fails to analyze data to make decisions; poor clinical judgment.”
Example (4):
– "Communication in my sub-specialty program is good."
Create the Evaluation
Question Construction
Example (5):
– "The pace on our service is chaotic."
Exercise One
Review each question and share your thinking of what makes it a good or bad question.
Question Construction - Test Your Knowledge
Example 1: "I can always talk to my Program Director about residency related problems."
Problem: Terms such as "always" and "never" will bias the response in the opposite direction.
Result: Data will be skewed.
Question Construction - Test Your Knowledge
Example 2: “Career planning resources are available to me and my program director supports my professional aspirations."
Problem: Double-barreled --- resources and aspirations… Respondents may agree with one and not the other. The researcher cannot make valid assumptions about which part of the question respondents were rating.
Result: Data is useless.
Question Construction - Test Your Knowledge
Example 3: "Communication in my sub-specialty program is good."
Problem: Question is too broad. If the score is less than 100% positive, the researcher/evaluator still does not know what aspect of communication needs improvement.
Result: Data is of little or no usefulness.
Question Construction - Test Your Knowledge
Example 4: “Evidences incomplete, inaccurate medical interviews, physical examinations; incomplete review and summary of other data sources. Fails to analyze data to make decisions; poor clinical judgment.”
Problem: Septuple-barreled --- respondents may agree with some parts and not others. The evaluator cannot make assumptions about which part of the question respondents were rating.
Result: Data is useless.
Question Construction - Test Your Knowledge
Example (5): "The pace on our service is chaotic."
Problem: The question is negative, and broadcasts a bad message about the rotation/program.
Result: Data will be skewed, and the climate may be negatively impacted.
Evaluation Question Design Principles
Avoid ‘double-barreled’ questions
A double-barreled question combines two or more issues or “attitudinal objects” in a single question.
Avoiding Double-Barreled Questions
Example: Patient Care Core Competency
“Resident provides sensitive support to patients with serious illness and to their families, and arranges for on-going support or preventive services if needed.”
Minimal Progress    Progressing    Competent
Evaluation Question Design Principles
Combining two or more questions into one makes it unclear which attribute is being measured, as each part may elicit a different perception of the resident’s performance.
RESULT: Respondents are confused and results are confounded, leading to unreliable or misleading results.
Tip: If the word “and” or the word “or” appears in a question, check to verify whether it is a double-barreled question.
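The tip above can be automated as a first-pass screen over a question bank. This is a hypothetical sketch (the function name and sample items are ours, not the presenters'); it only flags candidates for human review, since "and"/"or" do not always signal a double-barreled item.

```python
import re

def flag_double_barreled(question: str) -> bool:
    """Flag a question for review if it contains 'and' or 'or' as whole
    words, per the tip above. A flag prompts human review, not a verdict."""
    return re.search(r"\b(and|or)\b", question, re.IGNORECASE) is not None

items = [
    "Career planning resources are available to me and my program "
    "director supports my professional aspirations.",
    "The resident communicates effectively with patients.",
]

for item in items:
    status = "REVIEW" if flag_double_barreled(item) else "ok"
    print(f"{status}: {item[:50]}...")
```

The word-boundary pattern keeps "or" inside words like "support" or "performance" from triggering a false flag.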
Evaluation Question Design Principles
Avoid questions with double negatives…
When respondents are asked for their agreement with a negatively phrased statement, double negatives can occur.
– Example: Do you agree or disagree with the following statement?
“Attendings should not be required to supervise their residents during night call.”
If you respond that you disagree, you are saying you do not think attendings should not supervise residents. In other words, you believe that attendings should supervise residents.
If you do use a negative word like “not”, consider highlighting the word by underlining or bolding it to catch the respondent’s attention.
Evaluation Question Design Principles
Because every question is measuring something, it’s important for each to be clear and precise.
Remember… your goal is for each respondent to interpret the meaning of each question in exactly the same way.
If your respondents are not clear on what is being asked in a question, their responses may result in data that cannot or should not be applied to your evaluation results…
"For me, further development of my medical competence is important enough to take risks"
– Does this mean to take risks with patient safety, risks to one's pride, or something else?
Evaluation Question Design Principles
Keep questions short. Long questions can be confusing.
Bottom line: Focus on short, concise, clearly written statements that get right to the point, producing actionable data that can inform individual learning plans (ILPs).
– Take only seconds to respond to/rate
– Easily interpreted
Evaluation Question Design Principles
Do not use “loaded” or “leading” questions
A loaded or leading question biases the response given by the respondent. A loaded question is one that contains loaded words.
– For example: “I’m concerned about doing a procedure if my performance would reveal that I had low ability”
Disagree    Agree
"I’m concerned about doing a procedure on my unit if my performance would reveal that I had low ability"
How can this be answered with “agree or disagree” if you think you have good abilities in appropriate tasks for your area?
Evaluation Question Design Principles
A leading question is phrased in such a way that it suggests to the respondent that a certain answer is expected:
– Example: Don’t you agree that nurses should show more respect to residents and attendings?
Yes, they should show more respect
No, they should not show more respect
Evaluation Question Design Principles
Use of Open-Ended Questions
Comment boxes after negative ratings
– To explain the reasoning and target areas for focus and improvement
General, open-ended questions at the end of the evaluation
– Can prove beneficial
– Often it is found that entire topics have been omitted from the evaluation that should have been included
Evaluation Question Design Principles – Exercise 2 “Post Test”
1. Please rate the general surgery resident’s communication and technical skills
2. Rate the resident’s ability to communicate with patients and their families
3. Rate the resident’s abilities with respect to case familiarization; effort in reading about the patient’s disease process and familiarizing with operative care and post-op care
4. Residents deserve higher pay for all the hours they put in, don’t they?
Evaluation Question Design Principles – Exercise 2 “Post Test”
5. Explains and performs steps in resuscitation and stabilization
6. Do you agree or disagree that residents shouldn’t have to pay for their meals when on-call?
7. Demonstrates an awareness of and responsiveness to the larger context of health care
8. Demonstrates ability to communicate with faculty and staff
Bias in the Rating Scales for Questions
The scale you construct can also skew your data, much like we discussed about question construction.
Evaluation Design Principles: Rating Scales
By far the most popular scale asks respondents to rate their agreement with the evaluation questions or statements – “stems”.
After you decide what you want respondents to rate (competence, agreement, etc.), you need to decide how many levels of rating you want them to be able to make.
Evaluation Design Principles: Rating Scales
Using too few levels yields less precise information, while using too many can make the question hard to read and answer (do you really need a 9- or 10-point scale?).
Determine how fine a distinction you want to be able to make between agreement and disagreement.
Evaluation Design Principles: Rating Scales
Psychological research has shown that a 6-point scale with three levels of agreement and three levels of disagreement works best. An example would be:
Disagree Strongly
Disagree Moderately
Disagree Slightly
Agree Slightly
Agree Moderately
Agree Strongly
Evaluation Design Principles: Rating Scales
This scale affords you ample flexibility for data analysis. Depending on the questions, other scales may be appropriate, but the important thing to remember is that the scale must be balanced, or you will build in a biasing factor. Avoid “neutral” and “neither agree nor disagree”… you’re just giving up 20% of your evaluation ‘real estate’.
Evaluation Design Principles: Rating Scales
1. Please rate the volume and variety of patients available to the program for educational purposes.
Poor    Fair    Good    Very Good    Excellent
2. Please rate the performance of your faculty members.
Poor    Fair    Good    Very Good    Excellent
3. Please rate the competence and knowledge in general medicine.
Poor    Fair    Good    Very Good    Excellent
Evaluation Design Principles: Rating Scales
The data will be artificially skewed in the positive direction using this scale because there are far more (4:1) positive than negative rating options… yet we see this scale being used all the time!
Gentle Words of Wisdom…
Avoid large numbers of questions…
– Respondent fatigue: the respondent tends to give similar ratings to all items without giving much thought to individual items, just wanting to finish
– In situations where many items are considered important, a large number can receive very similar ratings at the top end of the scale
– Items are not traded off against each other, and therefore many items that are not at the extreme ends of the scale, or that are considered similarly important, are given a similar rating
Gentle Words of Wisdom…
Avoid large numbers of questions… but ensure your evaluation is both valid and has enough questions to be reliable.
How many questions (raters) are enough?
– Not intuitive
– A little bit of math is necessary (sorry)
True Score = Observed Score ± Error Score
Why are we talking about reliability in a question-writing session?
– To create your own evaluation questions and ensure their reliability
– To share/use other evaluations that are assuredly reliable
– To read the evaluation literature
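The identity above (Observed = True ± Error) is the classical-test-theory reason more questions and raters help: averaging independent observations shrinks the error term. A small simulation, with an invented true score and error spread purely for illustration:

```python
import random

TRUE_SCORE = 80.0   # hypothetical "true" ability (illustrative)
ERROR_SD = 10.0     # spread of per-rating error (illustrative)

def observed_mean(n_ratings: int) -> float:
    """Average n noisy observations, each Observed = True + Error."""
    return sum(TRUE_SCORE + random.gauss(0, ERROR_SD)
               for _ in range(n_ratings)) / n_ratings

random.seed(1)
for n in (1, 5, 25):
    print(f"{n:2d} ratings -> observed mean {observed_mean(n):.1f}")
```

With 25 ratings the standard error of the mean drops from 10 to 10/√25 = 2, so the observed average sits much closer to the true score than any single rating does.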
Reliability
Reliability is the "consistency" or "repeatability" of your measures.
If you could create 1 perfect test question (unbiased and perfectly representative of the task), you would need only that one question.
OR, if you could find 1 perfect rater (unbiased and fully understanding the task), you would need only one rater.
Reliability Estimates
Test designers use four correlational methods to check the reliability of an evaluation:
1. the test-retest method (pre-test/post-test),
2. alternate forms,
3. internal consistency,
4. and inter-rater reliability.
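Internal consistency (method 3) is usually reported as Cronbach's α: α = k/(k−1) · (1 − Σ item variances / variance of totals). A minimal sketch with made-up ratings, not data from the presentation:

```python
def cronbach_alpha(items):
    """Cronbach's alpha. `items` holds one list of scores per question,
    each rated by the same respondents in the same order.
    Population variances are used throughout."""
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    k = len(items)
    totals = [sum(col) for col in zip(*items)]  # per-respondent total score
    return (k / (k - 1)) * (1 - sum(var(it) for it in items) / var(totals))

# Two perfectly consistent items give alpha = 1.0:
print(cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4]]))  # → 1.0
```

Items that track each other loosely give a lower α, which is the signal that the questions are not all measuring the same thing.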
Generalizability
One measure is based on score variances:
– Generalizability Theory
Problems with Correlation Methods
Based on comparing portions of a test to one another (split-half, coefficient α, ICC):
– Assumes that all portions are strictly parallel (measuring the same skill, knowledge, attitude)
Test-retest assumes no learning has occurred in the interim.
Inter-rater reliability only provides consistency of raters across an instrument of evaluation.
UNLIKE A MATH TEST, ALL CLINICAL SITUATIONS ARE NOT PARALLEL…
Methods Based on Score Variance
Generalizability Theory
– Based in Analysis of Variance (ANOVA)
– Can parse out the differences in the sources of error; for example, capture the essence of differing clinical situations
Generalizability Studies
Two types:
– G study
  ANOVA is derived from the actual # of facets (factors) that you put into the equation
  Produces a G coefficient (similar to r or α)
– D study
  Allows you to extrapolate to other testing formats
  Produces a D coefficient
G Study Example

Facet (Factor)                 Label    #
Professors scoring activity      P      3
Students in class                S     10
# items tested                   I      2

Source        Variance    % error
Professors      11.5        52%
Test items      0.09        0.4%

Coef G = 0.46
What can we do about this problem?
– Train the raters
– Increase the # of raters
– Would increasing the # of test items help?
D Study Example: Changing the Number of Raters

P        3     6    12    18    24    30
S       10    10    10    10    10    10
I        2     2     2     2     2     2
Coef G  0.45  0.61  0.75  0.82  0.85  0.89
D Study Example: Changing the Number of Items

P        3     3     3     3     3     3
S       10    10    10    10    10    10
I        2     4     8    16    32    40
Coef G  0.45  0.46  0.46  0.47  0.47  0.47
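If raters are treated as the only source of error, the raters table above is closely approximated by the Spearman-Brown prophecy formula, G' = mG / (1 + (m−1)G), where m is the factor by which the number of raters grows. A sketch (the slide's exact values differ slightly, since a real D study also carries the student and item facets):

```python
def project_g(g: float, m: float) -> float:
    """Spearman-Brown projection of a reliability/G coefficient
    when the number of raters (or items) is multiplied by m."""
    return (m * g) / (1 + (m - 1) * g)

base_g, base_raters = 0.45, 3  # starting point from the G study above
for raters in (3, 6, 12, 18, 24, 30):
    print(f"{raters:2d} raters -> G ≈ {project_g(base_g, raters / base_raters):.2f}")
```

The same projection applied to items explains the second table: with item variance near zero, multiplying items barely moves the coefficient.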
Reliability Goals
All reliability coefficients display the following qualities:
– < 0.50: poor
– 0.50–0.70: moderate
– 0.70–0.90: good
– > 0.90: excellent
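These cut-points can be encoded directly as a helper (the handling of coefficients exactly at a boundary is our assumption; the slide does not specify it):

```python
def reliability_band(coef: float) -> str:
    """Map a reliability coefficient (0-1 scale) to the quality bands above.
    Boundary values (0.50, 0.70, 0.90) are assigned to the higher band
    by assumption; the source does not specify them."""
    if coef < 0.50:
        return "poor"
    if coef < 0.70:
        return "moderate"
    if coef < 0.90:
        return "good"
    return "excellent"

print(reliability_band(0.46))  # the Coef G from the worked example above
```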
Interrater Reliability (Kappa)
IRR is not really a measure of the test's reliability, but rather a property of the raters.
– It does not tell us anything about the inherent variability within the questions themselves
– Rather, it reflects the quality of the raters, or the misalignment of one rater/examinee dyad
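The usual IRR statistic for categorical ratings is Cohen's κ = (p_o − p_e)/(1 − p_e), which discounts the agreement two raters would reach by chance. A minimal two-rater sketch with invented pass/fail ratings:

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters' categorical ratings of the same cases."""
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n            # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum((c1[c] / n) * (c2[c] / n) for c in set(r1) | set(r2))  # chance
    return (p_o - p_e) / (1 - p_e)

# Two raters scoring eight residents pass (1) / fail (0):
rater_a = [1, 1, 0, 1, 0, 0, 1, 0]
rater_b = [1, 1, 0, 0, 0, 0, 1, 1]
print(cohens_kappa(rater_a, rater_b))  # → 0.5
```

Here the raters agree on 6 of 8 cases (p_o = 0.75), but with balanced marginals half of that agreement is expected by chance (p_e = 0.5), so κ = 0.5: a corrected view of how aligned the raters really are.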
Reliability
Evaluation reliability (consistency) is an essential but not sufficient requirement for validity.
Validity
Validity is a property of evaluation scores. Valid evaluation scores are ones with which accurate inferences can be made about the examinee’s performance.
The inferences can be in the areas of:
– Content knowledge
– Performance ability
– Attitudes, behaviors and attributes
Three Types of Test Score Validity

1. Content
– Inferences from the scores can be generalized to a larger domain of items similar to those on the test itself
– Example (content validity): board scores

2. Criteria
– Score inferences can be generalized to performance on some real behavior (present or anticipated) of practical importance
– Examples:
  – Present behavioral generalization (concurrent validity): OSCE
  – Future behavioral generalization (predictive validity): MCAT
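Criterion validity is typically quantified as a correlation between test scores and the criterion behavior. A sketch with hypothetical data, correlating an admission-style test score against a later performance rating (the predictive-validity case):

```python
def pearson_r(x, y):
    """Pearson correlation between scores and a criterion measure."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical: admission test scores vs. later clinical performance ratings
scores = [500, 510, 480, 530, 495, 520]
ratings = [3.1, 3.4, 2.8, 3.8, 3.0, 3.5]
r = pearson_r(scores, ratings)
```

A strong correlation supports generalizing from the score to the future behavior; a weak one undermines the predictive-validity claim no matter how reliable the test is.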
Validity

3. Construct
– Score inferences have "no criterion or universe of content accepted as entirely adequate to define the quality to be measured" (Cronbach and Meehl, 1955), but the inferences can be drawn under the label of a particular psychological construct
– Example: professionalism
Example Questions:
– Does not demonstrate extremes of behavior
– Communicates well
– Uses lay terms when discussing issues
– Is seen as a role model
– Introduces oneself and role in the care team
– Skillfully manages difficult patient situations
– Sits down to talk with patients
Process of Validation

Define the intended purposes/use of inferences to be made from the evaluation.

Five Arguments for Validity (Messick, 1995):
– Content
– Substance
– Structure
– Generalizability
– Consequence
Generalizability

Inferences from this performance task can be extended to like tasks
– Task must be representative (not just simple to measure)
– Should represent the domain as fully as practically possible

Example: Multiple Mini Interview (MMI)
Why Are Validity Statements Critical Now?

Performance evaluation now sits at the crux of credentialing and certification.

We are asked to measure constructs, not just content and performance abilities.
Gentle Words of Wisdom: Begin with the End in Mind

What do you want as your outcomes? What is the purpose of your evaluation?

Be prepared to put in the time with pretesting for reliability and understandability.

The faculty member, nurse, patient, or resident has to be able to understand the intent of the question, and each must find it credible and interpret it in the same way.

Adding more items to the test may not always be the answer to increased reliability.
Gentle Words of Wisdom Continued…

Relevancy and Accuracy
If the questions aren't framed properly, if they are too vague or too specific, it's impossible to get any meaningful data.
– Question miswording can lead to skewed data with little or no usefulness.
– Ensure your response scales are balanced and appropriate.
– If you don't plan or know how you are going to use the data, don't ask the question!
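A balanced response scale pairs each negative anchor with a mirrored positive one. A quick sanity check along the lines of the advice above (the anchor wording is hypothetical):

```python
# Hypothetical anchor wording for a balanced 4-point scale:
# two negative options mirrored by two positive ones.
BALANCED_4PT = ["Strongly disagree", "Disagree", "Agree", "Strongly agree"]

def scale_length_ok(scale, lo=4, hi=6):
    """Check the scale length falls in the commonly recommended 4-6 point range."""
    return lo <= len(scale) <= hi
```

An unbalanced scale (say, one negative and three positive anchors) pushes respondents toward favorable answers and skews the aggregated data.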
Gentle Words of Wisdom Continued…

Use an appropriate number of questions based on your evaluation's purpose.

If you are using aggregated data, the statistical analyses must be appropriate for your evaluation; otherwise the numbers generated, however sophisticated and impressive they look, will be false and misleading.

Are differences really significant given your sample size?
Summary: Evaluation Do's and Don'ts

DO's
– Keep questions clear, precise and relatively short
– Use a balanced response scale (4-6 scale points recommended)
– Use open-ended questions
– Use an appropriate number of questions

DON'Ts
– Do not use double-barreled questions
– Do not use double-negative questions
– Do not use loaded or leading questions
– Don't assume there is no need for rater training
Ready to Play the Game?

Questions?