Deconstructing Paper-based Test Scoring and Test Analysis using the University of Maryland Baltimore Test Scoring Service
- Shannon Tucker, Director of Instructional Technology
The University of Maryland Center for Information Technology Services provides test scoring services to the campus for multiple choice exams using Remark Office OMR, a form-processing software package for surveys and tests. This software recognizes optical marks (bubbles and checkboxes) and barcodes1. To ensure assessments scored using this software are processed efficiently, the School of Pharmacy has standardized its test sheets to include the UMB One Card ID number and Program/Campus location as identifying information (Figure 1).
Figure 1: School of Pharmacy Test Sheet Detail, showing the Program/Campus Location bubbles (PharmD UMB, PharmD USG, Other UMB) and the UMB One Card ID Number/PIN grid. Students must fill in all bubbles for both fields.
The recent addition of the UMB One Card ID number as the Student ID number in the Blackboard gradebook gives faculty and teaching assistants a ready reference for this information when reviewing results returned by the campus test scoring service for any student registered in a course. Additionally, all faculty are provided with a Microsoft Excel master report for PharmD students and graduate students (if applicable) in the Pharmacy Portal as an additional reference for this information.
1 Remark Office OMR Software. Principia Products, 2007. Accessed 19 September 2007.
Submitting Test Sheets for Scoring
All individuals submitting test sheets for scoring are asked to provide the following information:

- School of Pharmacy Test Submittal Coversheet
- Examination Key
- Test Sheets
Results Received from the Test Scoring Service
Instructors or teaching assistants submitting assessments to be scored can expect to receive the following information as a result of scoring:

- Item and Test Statistics (printout and electronic files)
- A comma-separated spreadsheet containing all the data scanned into the Remark system that can be imported into Excel. This contains ID numbers, campus locations, and the options selected by the students.
- A tab-delimited text file containing ID information and the raw score that can be imported into Excel and used in conjunction with Blackboard.
- A tab-delimited text file containing ID information, individual item scores, and a total percentage score that can be imported into Excel.
The returned results give faculty and instructors an opportunity to review test and item statistics and to import grades into Excel for manipulation or for use with Blackboard.
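For instructors who prefer to script this step rather than work entirely in Excel, the following is a minimal Python sketch for reading the tab-delimited score file. The column names "Student ID" and "Raw Score" and the file name are assumptions for illustration; verify them against the header row of the file you actually receive from the scoring service.

import csv

# Minimal sketch: read the tab-delimited file of IDs and raw scores
# returned by the scoring service. The column names "Student ID" and
# "Raw Score" are assumed for illustration; check the actual header row.
def load_raw_scores(path):
    scores = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            scores[row["Student ID"]] = float(row["Raw Score"])
    return scores

scores = load_raw_scores("exam1_scores.txt")  # hypothetical file name
print(len(scores), "students scored")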
Working with Remark-Generated Statistics
Individuals using other test scoring products or statistical packages may notice both familiar and unfamiliar test and item statistics on returned results.
Selected Test Statistics2

Number of Tests Graded: Total number of tests that were graded.
Number of Graded Items: The number of items on the test that were graded.
Total Points Possible: The total number of points on the test.
Maximum Score: The highest score from the graded tests.
Minimum Score: The lowest score from the graded tests.
Median Score: The median of the scores from the graded tests.
Range of Scores: The distance between the highest and lowest score.
Percentile (25 and 75): Percentiles are values that divide a sample of data into one hundred groups containing (as far as possible) equal numbers of observations. For example, 25% of the data values lie below the 25th percentile.
2 Remark Office OMR User's Guide. Malvern, PA, 2004.
Inter-Quartile Range: The difference between the 75th percentile and the 25th percentile.
Mean Score: The average score of all the graded tests.
Variance: The amount that each score deviates from the mean, squared (multiplied by itself).
Standard Deviation: A statistic used to characterize the dispersion among the measures in a given population, calculated by taking the square root of the variance.
Confidence Interval (1, 5, 95, and 99%): A confidence interval gives an estimated range of values, calculated from a given set of sample data, that is likely to include an unknown population parameter. If independent samples are taken repeatedly from the same population, the stated percentage (confidence level) of the intervals will include the unknown population parameter. Remark Office OMR calculates confidence intervals of 1%, 5%, 95%, and 99%.
Kuder-Richardson Formula 20 (KR-20): An overall measure of internal consistency.
Coefficient Alpha: A coefficient that describes how well a group of items focuses on a single idea.
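For readers who want to verify these figures independently, here is a brief Python sketch that computes the descriptive statistics above from a list of total scores. It uses only the standard library; the percentile convention is a simple one and may differ slightly from Remark's interpolation method.

import statistics

def test_statistics(scores):
    # scores: one total score per graded test
    ordered = sorted(scores)
    n = len(ordered)
    q25 = ordered[int(0.25 * (n - 1))]  # simple percentile convention;
    q75 = ordered[int(0.75 * (n - 1))]  # Remark may interpolate differently
    return {
        "Number of Tests Graded": n,
        "Maximum Score": ordered[-1],
        "Minimum Score": ordered[0],
        "Range of Scores": ordered[-1] - ordered[0],
        "Median Score": statistics.median(ordered),
        "Mean Score": statistics.mean(ordered),
        "Variance": statistics.pvariance(ordered),
        "Standard Deviation": statistics.pstdev(ordered),
        "Inter-Quartile Range": q75 - q25,
    }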
Selected Item Statistics3
Label: The output label designated in the template.
Value: The corresponding numeric value for each output label.
Weight: The points assigned to the correct, incorrect, and missing responses. The weight statistic applies to grading only.
Frequency: The number of times a particular label was chosen (appears in the dataset).
Percent: The corresponding percentage of the frequency.
Cumulative Percent: The sum of the percents from the first response up to and including the current response.
Valid Percent: The percent not including missing items.
Cumulative Valid Percent: The sum of the valid percents from the first response up to and including the current response.
P-Value: A measurement of the difficulty of an item.
Point Biserial: A measurement of the discrimination of an item. It indicates the relationship between a response for a given item and the overall test score of the respondent. A high value indicates that students scoring well on the test chose this response. The point biserial statistic applies to grading only.
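The frequency-based columns are straightforward to reproduce. The sketch below tabulates one item's responses, with None standing in for a blank answer; it illustrates the definitions above and is not Remark's own code.

from collections import Counter

def item_frequencies(responses, labels="ABCDE"):
    # responses: the label each student chose for one item; None = blank
    counts = Counter(responses)
    total = len(responses)
    valid = total - counts.get(None, 0)  # exclude missing items
    cumulative = 0.0
    for label in labels:
        freq = counts.get(label, 0)
        percent = 100.0 * freq / total          # Percent
        cumulative += percent                   # Cumulative Percent
        valid_percent = 100.0 * freq / valid    # Valid Percent
        print(label, freq, round(percent, 2), round(cumulative, 2),
              round(valid_percent, 2))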
3 Remark Office OMR User's Guide. Malvern, PA, 2004.
Test Reliability
There are numerous indexes that may be used to assess the internal consistency of an assessment. Currently, the most widely used measure of reliability is Cronbach's Alpha (also known as the Coefficient Alpha)4. However, the test statistics included with every scored assessment include both the Kuder-Richardson Formula 20 (KR-20) and the Coefficient Alpha (Cronbach's Alpha). The Coefficient Alpha is most often used on instruments where items are not scored as simply right or wrong5. KR-20 is a special case of Cronbach's alpha for dichotomously scored (right/wrong) items and evaluates how consistent student responses are across the questions on an assessment6. In layman's terms, KR-20 best measures how well your exam measures a single subject (a single cognitive factor), while the Coefficient Alpha is better suited to surveys or attitude data.
Interpreting KR-207
The KR-20 formula takes into account the following (see the sketch below):

1. The number of test items on the exam
2. Student performance on every test item
3. The variance of the total test scores
Index Range: 0.00-1.00
Values near 0.00: Measuring many unknown factors, but not what you intended to measure
Values near 1.00: Close to measuring a single factor
Summary: An exam with a high KR-20 yields reliable student scores (consistent/true score)
How others use KR-20: Tulane University Office of Medical Education recommends a KR-20 of 0.60 or higher as acceptable.
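As a concrete illustration, here is a minimal Python sketch of the standard KR-20 formula, KR-20 = (k / (k - 1)) * (1 - sum(p*q) / variance), computed from a 0/1 scoring matrix. This is the textbook formula, not necessarily Remark's internal implementation.

def kr20(matrix):
    # matrix: one row per student, one 0/1 entry per item (1 = correct)
    k = len(matrix[0])                      # 1. number of test items
    n = len(matrix)
    totals = [sum(row) for row in matrix]
    mean = sum(totals) / n
    variance = sum((t - mean) ** 2 for t in totals) / n  # 3. score variance
    pq = 0.0
    for i in range(k):                      # 2. performance on every item
        p = sum(row[i] for row in matrix) / n
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / variance)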
Item Analysis
Conducting an item analysis following each administration of your assessment is important to identify questions that are not performing well due to inappropriate difficulty, scoring error, or other factors. When conducting an item analysis, item difficulty, item discrimination, and distractor quality should all be considered.
4 Streiner, David L. "Starting at the Beginning: An Introduction to Coefficient Alpha and Internal Consistency." Journal of Personality Assessment 80(1) (2003): 99-103.
5 "Reliability." Del Siegle Faculty Web Site, University of Connecticut. http://www.gifted.uconn.edu/siegle/research/Instrument%20Reliability%20and%20Validity/Reliability.htm. Accessed 19 September 2007.
6 "Scales and Standard Measures." North Carolina State University. Accessed 19 September 2007.
Item Difficulty (p-value)
Item Difficulty is a measure of the proportion of students/subjects who answered an item correctly and is most commonly referred to as the p-value.
Index Range: 0.00-1.00
Values near 0.00: A smaller proportion of students/subjects responded to the item correctly (more difficult)
Values near 1.00: A greater proportion of students/subjects responded to the item correctly (easier)
Summary: The p-value reports item difficulty relative to your assessed population.
How others use the p-value: Consulting company Professional Testing notes that for criterion-referenced tests (CRTs), with their emphasis on mastery testing, many items on an exam form will have p-values of 0.9 or above. Norm-referenced tests (NRTs) are designed to be harder overall and to spread out the examinees' scores; thus, many of the items on an NRT will have difficulty indexes between 0.4 and 0.6.8
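Computationally the p-value is just a proportion, as the short sketch below illustrates (the function name and arguments are mine, not Remark's).

def p_value(item_responses, key):
    # item_responses: label each student chose for one item
    # key: the correct label; the proportion matching it is the p-value
    return sum(1 for r in item_responses if r == key) / len(item_responses)

# Example reading: 0.92 suggests a mastery-style (CRT) item, while values
# between 0.4 and 0.6 are typical of norm-referenced (NRT) items.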
Item Discrimination
There are several indexes that can be used to compute item discrimination. While the discrimination index is a popular and valid measure of item quality9, it is not included in the item statistics Remark reports. Instead, Remark provides the Point Biserial Correlation.
Point Biserial Correlation
The Point Biserial Correlation quantifies the relationship between a student/subject's response to an item (correct or incorrect) and the overall assessment score.
Index Range: -1.00 to +1.00
Values Near -1.00: High scorers answered the item incorrectly more frequently than low scorers.
Values Near +1.00: High scorers answered the item correctly more frequently than low scorers.
Summary: A negative value indicates an item may have been misleading, keyed incorrectly, or the content was inadequately covered.
How others use the point biserial correlation: Tulane University Office of Medical Education suggests to faculty that a value of +0.20 or higher is desirable10.
They also suggest that there is an interaction between the item discrimination and item difficulty that should be considered by faculty:
8 "Step 9. Conduct the Item Analysis." Building High Quality Examination Programs. Professional Testing, 2005. Accessed 19 September 2007.
9 Pyrczak, Fred. "Validity of the Discrimination Index as a Measure of Item Quality." Journal of Educational Measurement 10(3) (1973): 227-231.
10 "Test and Item Analysis." Tulane University Office of Medical Education. http://www.som.tulane.edu/ome/helpful_hints/test_analysis.pdf. Accessed 19 September 2007.
- Very easy or very difficult test items have little discrimination.
- Items of moderate difficulty (60%-80% answering correctly) generally are more discriminating.

A sample Remark item analysis report illustrates these statistics:

Item Analysis: #11
Label   Value   Weight   Frequency   Percent   Point Biserial
A       1       1        19          95.00     0.57
B       2       0        0           0.00      -
C       3       0        1           5.00      -0.57
D       4       0        0           0.00      -
E       5       0        0           0.00      -
Total                    20          100.00
Sample results from Tulane University Office of Medical Education using the p-value with the point biserial correlation can be found in their Test and Item Analysis guide10.
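For reference, here is a hedged Python sketch of one common point biserial formula, r_pb = ((M_correct - M_all) / SD) * sqrt(p / q). Remark's exact computation may differ in details such as whether it uses the population or sample standard deviation.

import statistics

def point_biserial(item_correct, total_scores):
    # item_correct: 1 if the student answered this item correctly, else 0
    # total_scores: each student's overall test score, in the same order
    n = len(total_scores)
    correct = [s for s, c in zip(total_scores, item_correct) if c == 1]
    if not correct or len(correct) == n:
        return 0.0  # undefined when everyone, or no one, is correct
    p = len(correct) / n
    q = 1 - p
    mean_correct = statistics.mean(correct)
    mean_all = statistics.mean(total_scores)
    sd = statistics.pstdev(total_scores)
    return (mean_correct - mean_all) / sd * (p / q) ** 0.5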
Distractor Analysis
Unfortunately, neither item difficulty nor item discrimination accounts for the incorrect response options (distractors). Distractor analysis helps address performance issues associated with the incorrect options. On a well-designed multiple choice item, high-scoring students/subjects should select the correct option even in the presence of highly plausible distractors11, while those who are ill-prepared should select randomly from the available distractors. In this scenario, the item is a good discriminator of knowledge and should be considered for future assessments. In other scenarios, a distractor analysis may reveal an item that was mis-keyed, contained a proofreading error, or includes a distractor that appears plausible even to those who scored well on the assessment.
To be effective, incorrect options should be plausible yet unambiguously incorrect. Distractor analysis therefore examines the proportion of students/subjects who selected each of the response options. For the correct response, this proportion is equivalent to the item p-value, or item difficulty12. Summed across all response options, the proportions add up to 1.0, or 100% of student/subject selections. Reviewing the percentage of students/subjects who chose each response option will help you assess whether there are issues present in an item's distractors.
Locating Distractor Statistics
To make distractor analysis easier, Remark returns a separate item analysis report specifically for distractor analysis (Figure 2). Along with the label, value, weight, and frequency for each option, each question's item analysis also includes the percentage of respondents who selected the option and its corresponding point biserial correlation, for all distractors in addition to the correct answer.
Figure 2: Distractor Analysis
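The same point biserial calculation can be applied to every option, treating "chose this option" as the dichotomy, which is essentially what the report in Figure 2 presents. Here is a sketch that reuses the point_biserial function from the previous section; the function name and layout are mine, not Remark's.

def distractor_analysis(responses, key, total_scores, options="ABCDE"):
    # responses: the label each student chose for this item
    # key: the correct label; total_scores: overall score per student
    n = len(responses)
    for label in options:
        chose = [1 if r == label else 0 for r in responses]
        freq = sum(chose)
        percent = 100.0 * freq / n  # for the key, this equals the p-value
        rpb = point_biserial(chose, total_scores)
        tag = " (correct)" if label == key else ""
        print(f"{label}{tag}: {freq} chosen, {percent:.2f}%, "
              f"point biserial {rpb:+.2f}")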
11 Zurawski, Raymond M. "Making the Most of Exams: Procedures for Item Analysis." National Teaching & Learning Forum 7(6) (1998). Accessed 20 September 2007.
12 "Step 9. Conduct the Item Analysis." Building High Quality Examination Programs. Professional Testing, 2005.
Sample Item Analysis13
Good Item
P-Value: 0.72
Point Biserial: +0.22
Item          Frequency   Percent   Point Biserial
A (correct)   241         72.15     +0.22
B             9           2.70      -0.02
C             3           0.89      -0.10
D             11          3.30      -0.06
E             70          20.96     -0.19
Total         334         100
This item performs well because the point biserial correlation for the correct answer is above +0.20 and is higher than the corresponding value for any of the distractors.
Fair Item
P-Value: 0.39
Point Biserial: +0.12
Item          Frequency   Percent   Point Biserial
A             13          3.89      -0.18
B             87          26.05     -0.03
C             40          11.98     -0.10
D (correct)   130         38.92     +0.12
E             64          19.16     +0.05
Total         334         100
While the point biserial correlation for this question...