Data Collection and Score Interpretation
Dr. Jeffrey Oescher
27 January 2014

Technical Issues

Two technical issues:
• Validity
• Reliability

There are two issues to discuss: validity and reliability. In each case I'm discussing these in the context of data collection in general and the characteristics of instruments specifically.

Technical Issues
Validity – the extent to which inferences made on the basis of scores from an instrument are appropriate, meaningful, and useful
Characteristics:
• Refers to the interpretation of the results
• Is a matter of degree
• Is situation specific
• Is a unitary concept
• Involves an overall judgment

The formal definition of validity is written on this slide. If you think for a moment, the definition makes a lot of sense. When you give a test to the students in your class, you use the scores to make decisions about each student's work. If one student had a very high score, you usually infer this is a good student. If another student had a very low score, you could infer that student was having serious difficulties mastering the material. The question ultimately comes down to whether the inferences or decisions you make are appropriate, meaningful, or useful. The answer depends on two characteristics of the test.

Data Collection – Technical Issues

Validity evidence
• Content
  - Face
  - Content
• Construct
• Criterion-related
  - Predictive
  - Concurrent
• Situationally specific

If your test covered appropriate content for the instruction provided to students, then the extent to which your inferences are appropriate, meaningful, or useful is high. If, like the quiz I gave you, the content is not relevant to the instruction, your inferences are not appropriate, meaningful, or useful to anyone. This is known as content validity and is a fundamental characteristic of any test. Please note that whether or not a test has evidence of content validity, nothing stops someone from using the scores to make decisions. Has anyone ever taken an exam where the professor wrote items that had nothing to do with what was being taught? Did he or she still use your scores in your grades? Was that fair?

Sometimes the purpose of a test is not to measure specific, concrete content like the material we are studying. Often what is being measured is nebulous or abstract in nature. How would you measure my intelligence? Probably with an intelligence test, but would the test be developed around Binet's conception of intelligence as verbal and mathematical reasoning or Gardner's eight or nine (I forget the latest number) multiple intelligences? Obviously the tests would look very different depending on how the researcher interprets the construct of intelligence. While closely related to content validity in that we worry about whether the test measures what it is supposed to measure, construct validity is difficult to estimate. If a test has sufficient evidence to suggest it measures intelligence, my score on that test and your use of it is reasonable. If not, any decision you make on the basis of that score is not appropriate, meaningful, or useful.

Many times we find ourselves using test scores to predict a student's performance on some later task. The ACT, SAT, GRE, and MCAT are good examples of such tests. Scores on the ACT or SAT are supposed to predict a student's performance in the freshman year of college. Do they do so well? If so, we can make decisions about whether or not to admit a student to a university; if not, such a decision is not appropriate, meaningful, or useful. A sketch of how this predictive evidence is usually summarized appears below. I also need to caution you about the situationally specific nature of validity evidence. The quiz you took earlier was not content valid for this course, but it was taken from a History of Education exam where every question was appropriate to the instruction. In our case the test was not content valid; in the case of another course it is 100% content valid.

Data Collection – Technical Issues
Reliability – the extent to which scores are free from error
• Error is measured by consistency
• Two perspectives:
  - Test – the reliability of a test
  - Agreement – the reliability of an observation
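Criterion-related (predictive) evidence of the kind just described is usually summarized as a correlation between scores on the predictor test and scores on the later criterion. Below is a minimal sketch in Python; the admission scores and freshman GPAs are invented for illustration, not real ACT or SAT data.

    # Hypothetical illustration of predictive validity: the correlation between
    # an admissions test score and later freshman GPA. All numbers are invented.

    def pearson_r(xs, ys):
        """Pearson correlation coefficient between two equal-length lists."""
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        var_x = sum((x - mean_x) ** 2 for x in xs)
        var_y = sum((y - mean_y) ** 2 for y in ys)
        return cov / (var_x * var_y) ** 0.5

    admission_scores = [18, 22, 25, 27, 30, 33]        # predictor, measured now
    freshman_gpas    = [2.1, 2.6, 2.8, 3.0, 3.4, 3.7]  # criterion, measured later

    print(round(pearson_r(admission_scores, freshman_gpas), 2))
    # A coefficient near 1 would support using the test to predict later performance.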

Reliability is the second technical characteristic important to measurement. Reliability is basically the consistency with which we measure. If you took Exam 1 a first time and made a 40, a second time and made a 45, and a third time and made a 43, what score should I use to provide a reliable estimate of your knowledge of the material? There are three perspectives from which reliability is viewed: test reliability, score reliability, and agreement.

Data Collection – Technical Issues
Test reliability evidence
• Stability
  - Also known as test-retest
  - Measured on a scale of 0 to 1
• Equivalence
  - Also known as parallel forms
  - Measured on a scale of 0 to 1
• Internal consistency
  - Split half, KR 20, KR 21, Cronbach alpha
  - All measured on a scale from 0 to 1

When speaking of test reliability, we estimate the extent to which the results of a test are likely to be the same. An estimate could be calculated using two administrations of the same test. This is known as stability or test-retest reliability. Coefficients close to 1 suggest a test that produces very consistent scores; those close to 0 suggest a lack of consistency. Sometimes we don't want to give one test twice – what a pain for the students! Besides, there is often a high chance that you'll correct something from the first to the second administration of the test.

When we develop two tests that examine the same material with different items, we create an opportunity to estimate reliability through equivalence or parallel forms. Comparing the scores from Form 1 of a test to those of Form 2 results in a coefficient that ranges from 0 to 1. Again, the closer to 1, the more consistent the test. If one test is hard to develop, think about two! Think also about giving a second form of the test to your students! I'm sure they'd be delighted to help you out! Because of this limitation, researchers have developed an estimate of test reliability called internal consistency. In essence, we think of one test of, say, 100 items as two tests of 50 items each: we split the test into halves. The two most common estimates of internal consistency are the KR 20 and Cronbach alpha. The former is used when the items for a test are scored as right or wrong; the latter when the answers can fall on a continuous scale. An example of this is a Likert scale, where a student responds on a five-point scale ranging from strongly disagree to strongly agree. Regardless of which estimate is used, the coefficients always range from 0 to 1, with 1 representing greater reliability. A small sketch of these internal-consistency estimates follows below.

Data Collection – Technical Issues
Score reliability evidence
• Standard error of measurement (SEM)
  - A statistic that allows one to ascertain the probability that a student's score falls within a given range of scores
  - Usually reported as the student's score and the SEM (e.g., +/- 2.25)
  - You can add and subtract one (1) SEM to a student's score and be confident that their score falls within that range of scores 68% of the time
  - You can add and subtract two (2) SEM to a student's score and be confident that their score falls within that range of scores about 95% of the time
Agreement reliability evidence
• Percentage of agreement between observers
• More commonly known as inter-rater reliability
• Ranges on a scale from 0 to 1
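The internal-consistency estimates just described can be computed directly from a matrix of item scores. Here is a minimal sketch assuming a small invented matrix in which rows are students and columns are items scored 0/1; because the items are dichotomous, the Cronbach alpha value here coincides with KR 20. It is an illustration, not a production implementation.

    # Cronbach's alpha for an invented 5-student by 5-item score matrix.

    def variance(values):
        """Population variance of a list of scores."""
        n = len(values)
        mean = sum(values) / n
        return sum((v - mean) ** 2 for v in values) / n

    def cronbach_alpha(item_scores):
        """item_scores: one row of item scores per student."""
        k = len(item_scores[0])                      # number of items
        item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
        total_var = variance([sum(row) for row in item_scores])
        return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

    scores = [
        [1, 1, 1, 0, 1],
        [1, 0, 1, 0, 1],
        [0, 0, 1, 0, 0],
        [1, 1, 1, 1, 1],
        [0, 1, 0, 0, 1],
    ]
    print(round(cronbach_alpha(scores), 2))  # closer to 1 = more consistent test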

While researchers worry about the reliability of a test, they are often more concerned with the reliability of a student's score on a test. This is known as score reliability and is estimated with the calculation of a standard error of measurement (SEM). Once calculated, a SEM allows you to create a range of scores for the student rather than a single point estimate. Let's suppose you made a 40 on Exam 1, and the SEM was calculated as 1.50. If I added and subtracted one SEM from your score, I'd have a range of scores from 38.5 to 41.5. Due to the statistical properties of the SEM, I can surmise you would score in this range 68% of the times you took this test. If I added and subtracted two SEMs from your score, I'd have a range of scores from 37 to 43. Again, I can surmise that you would score in this range about 95% of the times you took this test. The SEM makes it much easier to make decisions with a sense of the reliability of a student's score. With a score of 40 on Exam 1 you earned a B based on the grading scale. If you needed to score at least 45 points to earn an A, I'm about 95% confident you would not be able to do so. On the other hand, if a B is between 40 and 44 and a C is any score between 35 and 39, I'm about 95% confident you could earn a B or C.

The final type of reliability estimate is agreement, and it is used exclusively when making observations. If two researchers watch the same classroom and come away with very different percentages of on-task behavior for the students, there is a reliability problem. Which percentage is correct? If both observers come away with very similar percentages, we feel much more comfortable making a decision related to on-task behavior. By knowing exactly what is being observed, what it looks like, and how to count it, and with sufficient training of observers, we can achieve very high levels of agreement between two researchers observing the same situation. The estimate of agreement is the percentage of agreement between the observers. This estimate, known as inter-rater reliability, ranges from 0 to 1, with values closer to 1 representing a higher level of agreement. Both calculations are sketched below.

Score Interpretation
Two types of interpretations: criterion-referenced and norm-referenced
Criterion-referenced
• You need to know the underlying scale (e.g., 0-100, 1-5, etc.) upon which the scores are based
• The interpretation of the test score is made relative to this underlying scale
• "The scores indicated the students mastered about three-fourths of the objectives"
• The scores are interpreted relative to what the students know
• The scores easily communicate some level of performance (e.g., good, bad, moderate, etc.)
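A minimal sketch of both ideas follows, using the numbers from the worked example above (an observed score of 40 with SEM = 1.5) and an invented pair of observer records; the observation codes are illustrative only.

    # Score band built from the standard error of measurement (SEM).
    def score_band(observed, sem, n_sems):
        """Return the (low, high) range observed +/- n_sems * SEM."""
        return observed - n_sems * sem, observed + n_sems * sem

    print(score_band(40, 1.5, 1))  # (38.5, 41.5) -> roughly 68% of retests
    print(score_band(40, 1.5, 2))  # (37.0, 43.0) -> roughly 95% of retests

    # Percent agreement between two observers coding the same intervals
    # (inter-rater reliability): the proportion of intervals where codes match.
    obs_a = ["on", "on", "off", "on", "off", "on"]
    obs_b = ["on", "on", "off", "off", "off", "on"]
    agreement = sum(a == b for a, b in zip(obs_a, obs_b)) / len(obs_a)
    print(round(agreement, 2))  # 0.83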

There is a final issue related to data collection that does not get mentioned often enough: score interpretation. Basically there are two types of score interpretations – criterion-referenced and norm-referenced. You've probably heard of a criterion-referenced test or a norm-referenced test. While there are major differences in how these two types of tests are developed, the most important distinction is the interpretation of the scores. To interpret a score from a criterion-referenced perspective, you need to know the underlying scale of measurement. For example, typical classroom tests measure students on a scale of 0 to 100. A score of 95 is interpreted relative to this scale as an excellent performance. More importantly, the score likely indicates what the student knows, as well as doesn't know, with respect to the tested material. When we use a criterion-referenced interpretation, we communicate what the student knows.

Score Interpretation
Norm-referenced
• You need to know the reference group (i.e., norming sample) against which the scores are being compared
• The interpretation of test scores is made in relation to the scores of students in the norming group
• "John's score put him in the 85th percentile"
• John's score indicates he performed better than 85% of the students in the norming group
• John's score doesn't tell us anything about what John knows in terms of content

The second type of interpretation is norm-referenced. Here a student's score is interpreted relative to the scores of others in a reference group who took the same test. This group is called the norming sample. Typical norm-referenced scores are percentiles. When Johnny's score is reported as the 85th percentile, it means Johnny scored better than 85% of the students in the norming sample; fifteen (15) percent scored better than he did. What is confusing about norm-referenced scores is that they tell us very little about what a student knows. They tell us how the student stands relative to everyone else, but not what he or she knows or doesn't know. A sketch of how a percentile rank is computed appears below.

I remember in high school taking a literature test and getting 91% of the items right. I received a C on the exam. When I complained, the teacher indicated my score of 91 was the average score for the class, and because it was average I received a C. It didn't matter that I was in an accelerated class with a number of really bright kids; I was just average relative to everyone in that group. The issue was really that she interpreted my score from a norm-referenced perspective when I wanted it interpreted from a criterion-referenced perspective. Who was right? For most issues related to school performance, I'd suggest using a criterion-referenced interpretation. When you have to sort or select kids on the basis of being the brightest or the most challenged (note that each of these labels represents a relative standing), a norm-referenced interpretation is most appropriate.

Score Interpretation
A note of caution
• Which of the following represents a criterion-referenced interpretation and which a norm-referenced interpretation?
  - "The scores for the experimental group were significantly higher than those for the control group."
  - "The scores for the experimental group indicated mastery of about 95% of the objectives, while the scores for the control group indicated only 65% mastery."
• These are common examples from the literature you will be reading
• Be careful about the first interpretation, as it only tells us which group is better; it does not tell us how well either group performed.
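Here is a minimal sketch of a percentile rank, assuming an invented norming sample of 20 scores and a hypothetical score for John. Note what the output does and does not tell you: it locates the score in the group, but says nothing about the content John has mastered.

    # Percentile rank: percentage of the norming sample scoring below a given score.
    def percentile_rank(score, norming_sample):
        below = sum(1 for s in norming_sample if s < score)
        return 100 * below / len(norming_sample)

    norming_sample = [52, 55, 60, 61, 63, 65, 68, 70, 72, 74,
                      75, 77, 79, 80, 81, 83, 85, 88, 90, 95]
    john = 86
    print(percentile_rank(john, norming_sample))  # 85.0 -> "the 85th percentile"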

When you read articles, you're going to find the criterion-referenced versus norm-referenced debate hidden in the conclusions. When you read a conclusion stating that the group using a training method called long, slow distance ran significantly faster times than the group using an interval method to train, you're reading a norm-referenced interpretation. One group's performance was faster than the other's. Was either group fast? Slow? Wouldn't that be very important information if you were making a decision about the effectiveness of the training methods? The sketch below contrasts the two readings.
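To make the caution concrete, here is a small sketch with invented scores for two groups: the norm-referenced reading only says which group did better, while the criterion-referenced reading says how much of the material each group actually mastered.

    # Invented data: objectives mastered out of 20 for each student.
    experimental = [19, 18, 20, 17, 19]
    control      = [13, 12, 14, 13, 12]

    mean_exp = sum(experimental) / len(experimental)
    mean_ctl = sum(control) / len(control)

    # Norm-referenced reading: one group outperformed the other.
    print(mean_exp > mean_ctl)            # True, but says nothing about how well either did

    # Criterion-referenced reading: performance against the 20 objectives.
    print(round(100 * mean_exp / 20), "% mastery")  # about 93 % mastery
    print(round(100 * mean_ctl / 20), "% mastery")  # about 64 % mastery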