Norm-Referenced Measurements, by Shadia Abd Elkader

Norm-Referenced Measurements

Norm-referenced: a metric based on comparing an individual's performance with that of a specific group.

A norm-referenced test is a type of test, assessment, or evaluation in which the tested individual is compared to a sample of his or her peers (referred to as a "normative sample"). The term "normative assessment" refers to the process of comparing one test-taker to his or her peers.

Norm-referenced measures are designed to compare students (i.e., disperse average student scores along a bell curve, with some students performing very well, most performing average, and a few performing poorly).

Tests that set goals for students based on the average student's performance are norm-referenced tests.

The SAT, Graduate Record Examination (GRE), and Wechsler Intelligence Scale for Children (WISC) compare individual student performance to the performance of a normative sample. Test-takers cannot "fail" a norm-referenced test, as each test-taker receives a score that compares the individual to others that have taken the test, usually given by a percentile.
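As a rough illustration of how a percentile score positions one test-taker against a normative sample, here is a minimal sketch. The sample scores and the function name `percentile_rank` are hypothetical, and counting ties as half is only one common convention, not necessarily the one used by any of the tests named above.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(score, norm_scores):
    """Percent of the normative sample below `score`, counting ties as half."""
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, score)           # scores strictly below
    ties = bisect_right(ordered, score) - below   # scores equal to `score`
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Hypothetical normative sample (illustrative data only).
norm_sample = [42, 47, 50, 50, 53, 55, 58, 61, 64, 70]
print(percentile_rank(55, norm_sample))
```

Note that the result depends entirely on who is in the normative sample: the same raw score yields a different percentile against a different group.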

Norm-referenced tests aim to produce an overall distribution of student scores that follows a 'normal' (bell-shaped) curve.

An obvious disadvantage of norm-referenced tests is that they cannot measure the progress of the population as a whole, only where individuals fall within the whole. Thus, only measuring against a fixed goal can be used to measure the success of an educational reform program which seeks to raise the achievement of all students against new standards which seek to assess skills beyond choosing among multiple choices.

However, while this is attractive in theory, in practice the bar has often been moved in the face of excessive failure rates, and improvement sometimes occurs simply because of familiarity with and teaching to the same test.

Designing Norm-Referenced Measures

• Selecting a conceptual model
• Developing objectives of the measure
• Developing a blueprint
• Construction of the measure

Definitions

Scientific method: Set of procedures for creating and answering questions

Assume in psychology behavior is lawful, determined, and understandable

The research design should provide empirical, objective, systematic, and controlled observations that can be replicated.

Goals: Describe, Explain, Predict, Control

Definitions

Applied research: solve an existing problem

Basic research: obtain knowledge

Theory: set of statements that organize a body of knowledge (Kuhn vs. Popper)

Model: explains the underlying process

Research Process

Research question

Hypothesis: formally stated expectation of behavior. Testable, falsifiable, rational, parsimonious.
• Null: no relationship
• Alternative: relationship

Operational definition: translating a construct into measurable observations

Variable: entity that takes on different values

Attribute: value on a variable

Definitions

Construct – hypothetical, non-observable entity

Validity: degree to which legitimate inferences can be made from the operationalization of the theoretical construct.
• Conclusion: establishing a relationship
• Internal: causal (how it works)
• Construct: legitimacy of inferences to the construct
• External: generalization to different subjects and settings

Converging Operations-different procedures producing same results (e.g. Schizophrenia, refrigerator mother or dopamine).

Big Picture

Setting Objectives

Stating the purpose of the study is the first step in designing the tool. If a conceptual model for the study already exists, stating and determining objectives becomes much easier. These objectives should be derived from, and be consistent with, the conceptual model appropriate to the topic under study.

The conceptual framework defines the relevant domain of content to be assessed by the measure and specifies the type of behavior the subject will exhibit to demonstrate that the purpose of the measure has been met.

Objectives should be stated according to the following approach:

(1) a description of the respondent;
(2) delineation of the kind of behavior the respondent will exhibit to demonstrate accomplishment of the objective; and
(3) a statement of the kind of content to which the behavior relates.

This approach to objective explication is quite useful within a norm-referenced measurement context because it results in an outline of content and a list of behaviors that can then be readily used in blueprinting.

The use of taxonomies in explicating and measuring objectives provides several advantages:

A critical aspect of any behavioral objective is the word selected to indicate expected behavior.

A behavioral term by definition is one that is observable and measurable (i.e., behavior refers to any action on the part of an individual that can be seen, felt, or heard by another person).

Cognitive and affective objectives, although they are concerned with thinking and feeling, which themselves are not directly observable, are inferred from psychomotor or behavioral acts. In reality, the same behavioral term can be seen, felt, or heard differently by different people.

Because it is impossible to measure every action inherent in a given behavior, different people frequently define the critical behavior to be observed, using a given objective, quite differently.

When taxonomies are employed, action verbs and critical behaviors to be observed are specified, hence decreasing the possibility that the behaviors will be interpreted differently and increasing the probability that the resulting measure will be reliable and valid.

A measurement must match the level of respondent performance stated in the behavioral objective; that is, a performance verb at the application level of the cognitive taxonomy must be assessed by a cognitive item requiring the same level of performance.

Any discrepancy between the stated objective and the performance required by the instrument or measurement device will result in decreased reliability and validity of the measurement process.

For example, suppose the objective for the measurement is to ascertain the ability of practicing nurses to apply gerontological content in their work with aging clients (application level of Bloom's taxonomy). If the measure constructed to assess the objective simply requires a statement in their own words of some principles important to the care of the gerontological patient (comprehension level of the taxonomy), the outcomes of the measurement are not valid, in that this tool does not measure what is intended.

Blueprinting

The next step is to develop a blueprint to establish the specific scope and emphasis of the measure.

For example, consider a blueprint for a measure to assess a patient's compliance with a discharge plan. The four major content areas to be assessed appear as column headings across the top of the table; critical behaviors to be measured are listed on the left-hand side of the table as row headings.

Blueprint for a measure to assess a patient's compliance with a discharge plan

Objectives                                      General health   Medications   Nutrition   Daily      Total
                                                knowledge                                  activity
Ascertain patient knowledge of the
contents of the discharge plan                        3               5            5           5        18
Determine patient attitudes toward the
contents of the discharge plan                        2               2            2           2         8
Evaluate patient compliance with the
contents of the discharge plan                        4              10           10          10        34
Total                                                 9              17           17          17        60

Each intersection or cell thus represents a particular content-objective pairing, and values in each cell reflect the actual number of each type of item to be included on the measure.

Hence, from the table it can be seen that three items will be constructed to assess the content-objective pairing patient knowledge of the contents of the discharge plan/general health knowledge.

The scope of the measure is defined by the cells, which are reflective of the domain of items to be measured, and the emphasis of the measure and/ or relative importance of each content-behavior pairing is ascertained by examining the numbers in the cells.

From the blueprint, one can readily tell the topics about which questions will be asked, the types of critical behaviors subjects will be required to demonstrate, and what is relatively important and unimportant to the constructor.
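The scope and emphasis described here can be checked mechanically. The sketch below stores the blueprint cells in a nested dictionary and derives the row totals, column totals, and the percentage emphasis of each objective. The cell counts are one reading of the discharge-plan blueprint table, and the shortened row labels and variable names are illustrative assumptions.

```python
# Item counts per (objective, content area) cell, as read from the
# discharge-plan blueprint; row labels are shortened for illustration.
blueprint = {
    "knowledge of plan": {
        "general health knowledge": 3, "medications": 5,
        "nutrition": 5, "daily activity": 5,
    },
    "attitudes toward plan": {
        "general health knowledge": 2, "medications": 2,
        "nutrition": 2, "daily activity": 2,
    },
    "compliance with plan": {
        "general health knowledge": 4, "medications": 10,
        "nutrition": 10, "daily activity": 10,
    },
}

# Row totals: number of items written for each objective.
row_totals = {obj: sum(cells.values()) for obj, cells in blueprint.items()}

# Column totals: number of items written for each content area.
col_totals = {}
for cells in blueprint.values():
    for area, n_items in cells.items():
        col_totals[area] = col_totals.get(area, 0) + n_items

grand_total = sum(row_totals.values())

# Relative emphasis of each objective, as a percentage of all items.
emphasis = {obj: round(100 * n / grand_total, 1) for obj, n in row_totals.items()}

print(row_totals, col_totals, grand_total, emphasis)
```

Expressing emphasis as percentages makes it easy to hand content experts a blueprint whose weighting they can judge without counting items.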

Given the blueprint, the number (or percentage) of items prescribed in each cell would be constructed. Content validity could then be assessed by presenting content experts with the blueprint and the test and having them judge:

(1) The adequacy of the measure as reflected in the blueprint, that is, whether or not the domain is adequately represented to ascertain that the most appropriate elements are being assessed;

(2) The fairness of the measure, that is, whether it gives unfair advantage to some subjects over others;

(3) The fit of the method to the blueprint from which it was derived.

Constructing the Measure

The type of measure to be employed is a function of the conceptual model and subsequent operational definition of key variables to be measured.

Every measure is composed of three components:

(1) directions for administration;
(2) a set of items;
(3) directions for obtaining and interpreting scores.

Administration

Considerations to be made in preparing instructions for the administration of a measure:

1. A description of who should administer the measure
• A statement of eligibility
• A list of essential characteristics
• A list of duties

2. Directions for those who administer the measure
• A statement of the purposes for the measure
• Amount of time needed for administration
• A statement reflecting the importance of adhering to directions
• Specifications for the physical environment
• A description of how material will be received and stored
• Specifications for maintaining security
• Provisions for supplementary materials needed
• Recommendations for response to subjects' questions
• Instructions for handling defective materials
• Procedures to follow when distributing the measure
• A schedule for administration
• Directions for collection of completed measures
• Specifications for the preparation of special reports (e.g., irregularity reports)
• Instructions for the delivery and/or preparation of completed measures for scoring
• Directions for the return or disposal of materials

3. Directions for respondents
• A statement regarding information to be given to subjects prior to the data collection session (e.g., materials to be brought along and procedures for how, when, and where data will be collected)
• Instructions regarding completion of the measure, including a request for cooperation, directions to be followed in completing each item type, and directions for when and how to record answers

4. Directions for users of results
• Suggestions for use of results
• Instructions for dissemination of results

The importance of providing this information as an essential component of any measure cannot be overemphasized.

Errors in administration are an important source of measurement error, and their probability of occurrence is greatly increased when directions for administration are not communicated clearly and explicitly in writing.

Standard error of measurement

An estimate of how often a researcher can expect errors of a given size on an instrument.

Characteristics of a Good Test

- The test is valid: does the test measure what it is being used to measure?

- The test is reliable: are scores consistent?

- The test is constructed to facilitate ease of taking and scoring

- The test is of an appropriate length

Reliability

Next to validity, reliability is the most important characteristic of assessment results.

Why?

1. It provides the consistency to make validity possible.

2. It indicates the degree to which various kinds of generalizations are justifiable.

Reliability

Reliability: the consistency of measurement, i.e. how consistent test scores or other assessment results are from one measurement to another.

Standard Error of Measurement (SEM)= the estimated amount of variation expected in a score.

Definitions of Reliability

Consistency of a measurement made with a particular test

Consistency of scores upon repeated measurement of the same individuals

Extent to which test produces the same results when used repeatedly under the same conditions

Reliability

Which is more reliable?

Reliability

e = error

Error variance is the variability that exists in a set of scores and is due to factors other than the one being assessed.
• Systematic: errors that are consistent.
• Random: errors that have no pattern.

Positive error (i.e., raises score):
• Lucky guesses.
• Items that give clues to the answer.
• Cheating (students, aides, teachers).

Negative error (i.e., lowers score):
• Not following directions.
• Mis-marking items.
• Room climate/atmosphere.
• Hunger, fatigue, illness, "need to go potty".
• Assemblies, ball games, fire drills, etc.
• Break-up of a relationship.

Summation of Reliability:

1. Reliability refers to the results and not to the instrument itself.

2. Reliability is a necessary but not sufficient condition for validity.

3. The more reliable the assessment, the better.

Evaluation of a Measurement: Reliability and Validity

Reliability refers to the consistency of the measurement.

Validity refers to the accuracy of the measurement.

Which is the most consistent and accurate?

Can one be very consistent but inaccurate?

Would you describe someone as very accurate when he/she is very inconsistent?

Validity: Accuracy of the Measurement

Does the instrument measure the property that it intends to measure?

A measurement device is valid if it measures what it is supposed to measure.

Validity

Validity: appropriateness of test usage in measurement and inference.

Content: item representativeness

Criterion: relationship of test score and other outcome or measure
• Concurrent: other similar measure
• Predictive: outcome in future

Construct: extent to which the test measures the theoretical construct.

Conclusion Validity

The extent to which the research design is sufficiently precise or powerful to detect effects on the operationalized variable, should they exist.

Threats to Conclusion Validity
• Low statistical power
• Violated assumptions of statistical tests
• Fishing and the error rate problem
• Low reliability of measures
• Poor reliability of treatment implementation
• Random irrelevancies in the setting
• Random heterogeneity of respondents

Descriptive: Central Tendency

Mean, median, mode

Variability: numbers that summarize the extent to which scores in a distribution differ
• Variance
• Standard deviation

Z-score

$$z = \frac{\text{score} - \text{mean}}{\text{standard deviation}}$$

Converts scores to a standard metric: mean = 0, SD = 1.

Example: test score 80, mean 85, SD 10: z = (80 - 85)/10 = -.5

Standard Scores

T score = 10(z) + 50

SAT = 100(z) + 500
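The conversion above is simple enough to sketch in a few lines. The function names are illustrative, and the standard deviation of 10 is taken from the worked example (80 - 85)/10.

```python
def z_score(score, mean, sd):
    """Express a raw score as a distance from the mean in SD units."""
    return (score - mean) / sd


def t_score(z):
    """Rescale a z-score to the T metric (mean 50, SD 10)."""
    return 10 * z + 50


z = z_score(80, mean=85, sd=10)
print(z, t_score(z))
```

A score half a standard deviation below the mean stays half a standard deviation below the mean on any rescaled metric; only the units change.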

Correlation and Relationship Patterns

• Positive
• Negative
• None
• Curvilinear

Correlation and Regression: The General Linear Model

Formula for a straight line:

$$y = b_0 + b_1 x$$

b0 = intercept

b1 = slope

where y = outcome and x = program.

Correlation Range: -1.0 to +1.0
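To make the straight-line model and the correlation range concrete, here is a minimal least-squares sketch in plain Python. The data points are hypothetical, and `fit_line` is an illustrative helper rather than anything defined in these slides.

```python
from statistics import mean


def fit_line(x, y):
    """Return (b0, b1, r): intercept, slope, and Pearson correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # co-deviation of x and y
    sxx = sum((a - mx) ** 2 for a in x)                   # squared deviations of x
    syy = sum((b - my) ** 2 for b in y)                   # squared deviations of y
    b1 = sxy / sxx                  # slope
    b0 = my - b1 * mx               # intercept
    r = sxy / (sxx * syy) ** 0.5    # always falls between -1.0 and +1.0
    return b0, b1, r


# Hypothetical program (x) and outcome (y) values.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1, r = fit_line(x, y)
print(b0, b1, r)
```

The slope and the correlation share a numerator, so they always have the same sign: a positive relationship gives a rising line and a negative one a falling line.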

Standard Error of Estimate

SEE demonstrates the accuracy of prediction.

$$SEE = SD_y\sqrt{1 - r_{xy}^2}$$

$$SEE = 15\sqrt{1 - .60^2} = 15\sqrt{1 - .36} = 15(.80) = 12$$
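The standard error of estimate is the SD of the predicted variable multiplied by the square root of one minus the squared correlation. A minimal sketch, reusing the slide's values (SD = 15, r = .60); the function name is an assumption:

```python
import math


def standard_error_of_estimate(sd_y, r_xy):
    """SEE = SD_y * sqrt(1 - r_xy^2)."""
    return sd_y * math.sqrt(1 - r_xy ** 2)


see = standard_error_of_estimate(sd_y=15, r_xy=0.60)
print(round(see, 2))
```

As the correlation approaches 1.0 the SEE shrinks toward zero; with no correlation it equals the SD of y, meaning the predictor adds nothing.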

Confidence Intervals

Provides an indicator of how certain we are in reporting scores.

CI = obtained score ± z(SEM)

68% level = 1.00 z score

85% level = 1.44

90% = 1.65

95% = 1.96

99% = 2.58

$$SEM = SD\sqrt{1 - \text{reliability}}$$
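A short sketch of how the SEM and a confidence band around an obtained score might be computed. The example values (obtained score 105, SD = 15, reliability = .89) are borrowed from the discrepancy example later in these slides; the function names are illustrative, and the 95% multiplier uses the standard two-sided value of 1.96.

```python
import math

# z multipliers per confidence level; 1.96 is the standard two-sided 95% value.
Z_FOR_LEVEL = {68: 1.00, 85: 1.44, 90: 1.65, 95: 1.96, 99: 2.58}


def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def confidence_interval(obtained, sd, reliability, level=95):
    """Band of obtained score plus or minus z * SEM for the given level."""
    half_width = Z_FOR_LEVEL[level] * sem(sd, reliability)
    return obtained - half_width, obtained + half_width


low, high = confidence_interval(105, sd=15, reliability=0.89, level=95)
print(round(sem(15, 0.89), 2), round(low, 1), round(high, 1))
```

The less reliable the test, the larger the SEM and the wider the band that has to be reported around any single score.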

Reliability

Reliability: consistency of measurement

Reliability coefficient: quantitative estimate of the degree of stability
• Test-retest
• Alternate form
• Internal consistency: Cronbach's alpha, Spearman-Brown, Kuder-Richardson
• Inter-rater
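As one example of an internal-consistency coefficient, the sketch below computes Cronbach's alpha from item-level scores using the usual formula, alpha = k/(k - 1) × (1 - sum of item variances / variance of total scores). The item data are hypothetical and the function name is an assumption.

```python
from statistics import pvariance


def cronbach_alpha(items):
    """items: one list of scores per item, aligned by respondent."""
    k = len(items)
    totals = [sum(responses) for responses in zip(*items)]  # total score per respondent
    sum_item_variances = sum(pvariance(scores) for scores in items)
    return k / (k - 1) * (1 - sum_item_variances / pvariance(totals))


# Hypothetical responses of five people to three items.
items = [
    [2, 3, 4, 4, 5],
    [1, 3, 3, 4, 5],
    [2, 2, 4, 5, 5],
]
print(round(cronbach_alpha(items), 3))
```

When items rise and fall together across respondents, the variance of the totals is large relative to the summed item variances and alpha approaches 1.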

Factors that influence reliability

• Test length
• Homogeneity of items
• Test-retest
• Score variability
• Guessing
• Sample size
• Examinee factors / situational factors

Examinee / Situational Factors

1. Examinee characteristics (health, fatigue, motivation, etc.)

2. Understanding directions, interpretation of directions, language, fictitious/malingering

3. Examiner factors: bias, rapport, complex directions, error in scoring or administration, failure to provide a suitable environment

Discrepancy Procedure

• Raw score difference
• Standard score difference
• Difference considering reliability
• Difference from regression-based prediction

Standard Error of Difference

SED provides a determination of a significant difference between any two standard scores.

• Requires the same SD for both tests
• Best when it takes into account the reliability and correlation of the two tests

When only reliability is known:

$$SED = SD\sqrt{2 - r_{xx} - r_{yy}}$$

General Procedure for Determining Score Discrepancy

1. Subtract ability from achievement score (don't do it!)

2. Take reliability of tests into consideration (SED)

3. Take reliability and correlation of tests into consideration (SER)

$$SED = SD\sqrt{2 - r_{xx} - r_{yy}}$$

Example: Ability SS = 105 (r_xx = .89); Reading SS = 85 (r_yy = .72)

$$SED = 15\sqrt{2 - .89 - .72} = 15(.624) = 9.37$$

9.37 = critical value for the difference at the 68% level.

• Multiply by 2.58 (p < .01): 24.16; need 25 pts
• Multiply by 1.96 (p < .05): 18.37; need 19 pts
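The SED example can be reproduced with a short sketch, assuming both tests share SD = 15 and using the reliabilities from the example (.89 for ability, .72 for reading). The function name and the final comparison against the observed 20-point gap (105 - 85) are illustrative.

```python
import math


def standard_error_of_difference(sd, r_xx, r_yy):
    """SED = SD * sqrt(2 - r_xx - r_yy), for two tests with the same SD."""
    return sd * math.sqrt(2 - r_xx - r_yy)


sed = standard_error_of_difference(sd=15, r_xx=0.89, r_yy=0.72)
critical_05 = 1.96 * sed  # difference needed at p < .05
critical_01 = 2.58 * sed  # difference needed at p < .01

observed_gap = 105 - 85   # ability SS minus reading SS
print(round(sed, 2), round(critical_05, 2), round(critical_01, 2))
print(observed_gap >= critical_05, observed_gap >= critical_01)
```

Under these assumptions the 20-point gap clears the p < .05 threshold but not the p < .01 one.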

SE Residual: adds correlation

Suppose Ability and Achievement r_xy = .55

1. Find the predicted achievement score from ability (steps 2 to 4)

2. Convert ability to a z-score:

$$z = \frac{raw - M}{SD} = \frac{105 - 100}{15} = .33$$

3. Multiply by the correlation:

$$z_{predicted} = r_{xy} \cdot z_{ability} = .55(.33) = .1815$$

4. Convert to the standard-score scale (M = 100, SD = 15):

$$SS_{predicted} = 100 + (z_{predicted} \cdot 15) = 100 + (.1815 \cdot 15) = 103$$

5. Subtract predicted from obtained:

$$SS_{predicted} - SS_{obtained} = 103 - 85 = 18 \text{ pts}$$

6. Calculate SE Residual:

$$r_{residual} = \frac{r_{yy} + r_{xx} r_{xy}^2 - 2 r_{xy}^2}{1 - r_{xy}^2} = \frac{.72 + (.89)(.55)^2 - 2(.55)^2}{1 - (.55)^2} = \frac{.72 + .269 - .605}{.6975} = .551$$

$$SER = SD\sqrt{1 - r_{xy}^2}\,\sqrt{1 - r_{residual}} = 15\sqrt{1 - .3025}\,\sqrt{1 - .551} = 15(.835)(.670) = 8.396$$

Multiply by 1.96 (p < .05 = 16.45): 17 pt difference

Multiply by 2.58 (p < .01 = 21.66): 22 pt difference
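The regression-based procedure can be sketched end to end as follows. It assumes a standard-score scale with M = 100 and SD = 15 and the example values r_xx = .89, r_yy = .72, r_xy = .55; the function names are illustrative. Because the code does not round intermediate values, the predicted score comes out as 102.75 rather than the rounded 103 used in the worked steps.

```python
import math


def predicted_ss(ability_ss, r_xy, mean=100, sd=15):
    """Predicted achievement standard score, regressed from ability."""
    z_ability = (ability_ss - mean) / sd
    return mean + r_xy * z_ability * sd


def residual_reliability(r_xx, r_yy, r_xy):
    """Reliability of the residual (obtained minus predicted) score."""
    return (r_yy + r_xx * r_xy ** 2 - 2 * r_xy ** 2) / (1 - r_xy ** 2)


def standard_error_of_residual(sd, r_xx, r_yy, r_xy):
    """SER = SD * sqrt(1 - r_xy^2) * sqrt(1 - r_residual)."""
    r_res = residual_reliability(r_xx, r_yy, r_xy)
    return sd * math.sqrt(1 - r_xy ** 2) * math.sqrt(1 - r_res)


predicted = predicted_ss(105, r_xy=0.55)
gap = predicted - 85  # predicted minus obtained reading score
ser = standard_error_of_residual(15, r_xx=0.89, r_yy=0.72, r_xy=0.55)
print(round(predicted, 2), round(gap, 2), round(ser, 3))
print(round(1.96 * ser, 2), round(2.58 * ser, 2))
```

Comparing the roughly 18-point gap with 1.96 × SER and 2.58 × SER shows how the conclusion here can differ from the one reached with SED.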

Conclusion?

Compare SED and SER methods

Predictive Utility

Valid positive (hit): predicted + and actual +

False positive (false alarm): predicted + but actual -

Valid negative (correct rejection): predicted - and actual -

False negative (miss): predicted - but actual +
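The four outcomes can be tallied from paired predictions and actual results. The sketch below uses hypothetical data, and the `classify` helper and variable names are illustrative.

```python
from collections import Counter


def classify(predicted, actual):
    """Name the decision outcome for one (predicted, actual) pair of booleans."""
    if predicted and actual:
        return "valid positive"
    if predicted and not actual:
        return "false positive"
    if not predicted and not actual:
        return "valid negative"
    return "false negative"


# Hypothetical (predicted, actual) pairs.
pairs = [(True, True), (True, False), (False, False), (False, True), (True, True)]
counts = Counter(classify(p, a) for p, a in pairs)

# Overall proportion of correct decisions (hits plus correct rejections).
accuracy = (counts["valid positive"] + counts["valid negative"]) / len(pairs)
print(dict(counts), accuracy)
```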

Decision Making

The Decision Matrix

In reality, one of two states holds:

• Null true, alternative false: there is no real program effect; there is no difference or gain; our theory is wrong.
• Null false, alternative true: there is a real program effect; there is a difference or gain; our theory is correct.

What we conclude is one of two decisions:

• Accept null, reject alternative. We say: there is no real program effect; there is no difference or gain; our theory is wrong.
• Reject null, accept alternative. We say: there is a real program effect; there is a difference or gain; our theory is correct.

What we conclude                      In reality: null true          In reality: null false
                                      (alternative false)            (alternative true)
Accept null, reject alternative       1-α  THE CONFIDENCE LEVEL      β  TYPE II ERROR
                                      (CORRECT)
Reject null, accept alternative       α  TYPE I ERROR                1-β  POWER
                                                                     (CORRECT)

THE CONFIDENCE LEVEL (1-α): the odds of saying there is no effect or gain when in fact there is none. The number of times out of 100 that, when there is no effect, we'll say there is none.

TYPE I ERROR (α): the odds of saying there is an effect or gain when in fact there is none. The number of times out of 100 that, when there is no effect, we'll say there is one.

TYPE II ERROR (β): the odds of saying there is no effect or gain when in fact there is one. The number of times out of 100 that, when there is an effect, we'll say there is none.

POWER (1-β): the odds of saying there is an effect or gain when in fact there is one. The number of times out of 100 that, when there is an effect, we'll say there is one.

If you try to increase power, you increase the chance of winding up in the bottom row and of Type I error.

If you try to decrease Type I errors, you increase the chance of winding up in the top row and of Type II error.
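The trade-off between Type I error and power can be illustrated numerically. The sketch below assumes a one-sided z-test with a known standard deviation, a standardized effect size of 0.5, and n = 20; these values, and the function name, are hypothetical choices made only for illustration.

```python
import math
from statistics import NormalDist


def power_one_sided_z(effect_size, n, alpha):
    """Probability of rejecting the null when the true standardized effect is `effect_size`."""
    std_normal = NormalDist()
    z_critical = std_normal.inv_cdf(1 - alpha)  # rejection cutoff under the null
    shift = effect_size * math.sqrt(n)          # mean of the test statistic under the alternative
    return 1 - std_normal.cdf(z_critical - shift)


power_at_05 = power_one_sided_z(effect_size=0.5, n=20, alpha=0.05)
power_at_01 = power_one_sided_z(effect_size=0.5, n=20, alpha=0.01)

# Type II error is the complement of power.
print(round(power_at_05, 3), round(1 - power_at_05, 3))
print(round(power_at_01, 3), round(1 - power_at_01, 3))
```

Tightening alpha from .05 to .01 lowers the chance of a Type I error but also lowers power, so the Type II error rate rises. When the true effect is zero, the rejection rate equals alpha itself.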
