Reliability and Validity
Reliability: Notation/Symbols
rxx = reliability of the predictor
ryy = reliability of the criterion
Reliability: Definitions
Informal: The consistency and stability of measurement.
Formal: "The extent of unsystematic variation in the quantitative description of an individual when that individual is measured a number of times." (Ghiselli, Campbell, & Zedeck, 1981, p. 482)
Reliability: "Classic Measurement Theory"
Observed Score = (True Score + Systematic Error) + Random Error
Reliability: Some Sources Affecting Reliability
A poor measurement instrument
A poor user of the measurement instrument
An unstable trait, characteristic, or attribute
Reliability: Forms of Reliability
Test-Retest
Use the same test with the same sample (people) on 2 different occasions; correlate the scores from the 2 administrations of the test.
Correlation sometimes referred to as the "Coefficient of Stability"
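As a concrete illustration, here is a minimal sketch of the computation; the scores are hypothetical:

```python
import numpy as np

# Hypothetical scores for the same 8 people on two occasions
time1 = np.array([12, 15, 11, 18, 14, 16, 10, 13])
time2 = np.array([13, 14, 12, 17, 15, 17, 11, 12])

# Test-retest reliability: correlation between the two administrations
# (the "Coefficient of Stability")
r_tt = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability (coefficient of stability): {r_tt:.2f}")
```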
Reliability: Forms of Reliability
Test-Retest: Some Sources of Error
Change in testing conditions
Change in test taker
Practice effects
Reliability: Forms of Reliability
Parallel Forms (a.k.a. equivalent or alternate forms)
Develop 2 or more similar tests designed to measure the same thing; correlate scores from the "parallel" forms, using the same people to complete both forms.
Correlation may be referred to as the "Coefficient of Equivalence"
2 approaches:
1. Immediate
2. Delayed
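Computationally this mirrors the test-retest sketch above; only the second set of scores comes from a parallel form rather than a second administration (data hypothetical):

```python
import numpy as np

# Hypothetical scores for the same 8 people on two parallel forms
form_a = np.array([21, 17, 25, 19, 23, 16, 20, 22])
form_b = np.array([20, 18, 24, 18, 24, 17, 19, 21])

# Coefficient of equivalence: correlation between the parallel forms
r_equiv = np.corrcoef(form_a, form_b)[0, 1]
print(f"Parallel-forms reliability (coefficient of equivalence): {r_equiv:.2f}")
```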
Reliability: Forms of Reliability
Parallel Forms: Some Sources of Error
Immediate – fatigue, content of tests, practice
Delayed – content of tests, time
Problems with constructing two versions of the measure
Reliability: Forms of Reliability
Internal Consistency
Extent to which responses are consistent to items designed to measure the same thing within a single test.
Approaches:
1. Split half
Give the measure once; somehow divide the measure into 2 parts with an equal number of items (e.g., odd/even split); correlate scores on the 2 halves. Correct this value! (It's an underestimate of overall internal consistency.)
Reliability: Forms of Reliability
Observed Score = (True Score + Systematic Error) + Random Error
Remember the "random error" component!
Other things being equal, the longer a test is (i.e., more items), the more "reliable" the test will be in terms of estimating the "true score." This is why you must correct the "split half" estimate of reliability!
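To see why, here is a small simulation of the classic measurement model (all data simulated): each item is a noisy reading of the true score, and averaging more items cancels more random error, so the observed score tracks the true score more closely.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people = 1000
true_score = rng.normal(0, 1, n_people)

def simulated_fidelity(n_items: int) -> float:
    # Each item = true score + random error; the test score is the item mean
    items = true_score[:, None] + rng.normal(0, 1, (n_people, n_items))
    test_score = items.mean(axis=1)
    # Correlation of observed test scores with the (normally unknowable) true scores
    return np.corrcoef(test_score, true_score)[0, 1]

for k in (2, 10, 50):
    print(f"{k:>2} items: r(observed, true) = {simulated_fidelity(k):.2f}")
```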
Reliability: Forms of Reliability
Spearman-Brown Prophecy Formula for the split-half estimate:
rfull test = 2rxx / (1 + rxx)
where rxx is the reliability estimate (correlation) from the split halves.
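A minimal sketch of the full split-half procedure, including the Spearman-Brown correction, assuming a hypothetical people-by-items score matrix:

```python
import numpy as np

# Hypothetical data: 6 people x 4 items, all intended to measure the same thing
items = np.array([
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [3, 2, 3, 3],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
])

# Odd/even split: sum the odd-numbered and even-numbered items separately
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

# Correlation between the two halves (reliability of a HALF-length test)
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction to estimate full-test reliability
r_full = (2 * r_half) / (1 + r_half)
print(f"Half-test r = {r_half:.2f}, corrected full-test r = {r_full:.2f}")
```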
Reliability: Forms of Reliability
Approach #2: Coefficient alpha (a.k.a. "Cronbach's alpha")
Give the measure once.
Assesses the consistency of responses to all of the items in a "test" of the same thing.
Mathematical estimate of the average of all possible split halves.
Do NOT have to correct this value!
Symbolized as α.
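A minimal sketch of coefficient alpha using the standard formula α = (k / (k - 1)) * (1 - Σ item variances / variance of total scores), again with hypothetical data:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for a (people x items) score matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

items = np.array([
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [3, 2, 3, 3],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
])
print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")
```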
Reliability: Forms of Reliability
Inter-rater Reliability (a.k.a. inter-rater agreement)
Extent to which 2 or more observers of the same behavior, using a similar measurement instrument, rate or score the behavior in a similar or consistent manner.
Examples – judges in Olympic events, assessors in an assessment center, profs grading term papers
Problems – inter-rater differences in temperament, motivation, observation skills, etc.
Improve with training, clear definitions, better measures
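A minimal sketch of two simple inter-rater indices (hypothetical ratings); more refined indices such as Cohen's kappa or the intraclass correlation follow the same logic:

```python
import numpy as np

# Hypothetical scores from two raters judging the same 8 performances
rater_a = np.array([7, 5, 8, 6, 9, 4, 7, 6])
rater_b = np.array([8, 5, 7, 6, 9, 5, 6, 6])

# Simple consistency index: correlation between the two raters
r = np.corrcoef(rater_a, rater_b)[0, 1]

# Simple agreement index: proportion of exact matches
agreement = np.mean(rater_a == rater_b)

print(f"Inter-rater correlation = {r:.2f}, exact agreement = {agreement:.0%}")
```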
Validity: Definitions
Simple: Extent to which a test measures what it is supposed to measure.
Formal: "... refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores. Test validation is the process of accumulating evidence to support such inferences." (Standards for Educational and Psychological Testing, 1985, p. 9)
Validity
Definition - Principles for the Validation and Use of Personnel Selection Procedures (SIOP, 2003):
"... the degree to which accumulated evidence and theory support specific interpretations of test scores entailed by proposed uses of a test" (AERA et al., 1999, p. 184).
"Validity is the most important consideration in developing and evaluating selection procedures. Because validation involves the accumulation of evidence to provide a sound scientific basis for the proposed score interpretations, it is the interpretations of these scores required by the proposed uses that are evaluated, not the selection procedure itself."
Validity: Types
1. Criterion-related validity (Symbol: rxy)
Predictive design (“Predictive validity”)
Concurrent design (“Concurrent validity”)
2. Content validity
3. Construct validity
Validity: Types
Criterion-Related: Predictive Design
Time 1: "Applicants" take predictor/test
Time 2: Collect job performance (criterion) data after "applicants" have been on the job for some time
Correlate predictor scores with job performance data
Validity: Types
Criterion-Related: Concurrent Design
Time 1: Job incumbents take predictor/test
Time 1: Collect job performance (criterion) data for job incumbents
Correlate predictor scores with job performance data
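Computationally the predictive and concurrent designs are identical; they differ only in who is tested and when the criterion is collected. A minimal sketch with hypothetical predictor and criterion scores:

```python
import numpy as np

# Hypothetical data: predictor (test) scores and job performance ratings
predictor = np.array([55, 62, 48, 71, 66, 59, 75, 50])            # test scores
performance = np.array([3.1, 3.8, 2.9, 4.2, 3.6, 3.3, 4.5, 3.0])  # criterion

# Criterion-related validity coefficient rxy
r_xy = np.corrcoef(predictor, performance)[0, 1]
print(f"Criterion-related validity (rxy) = {r_xy:.2f}")
```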
Validity: Types
Content Validity
Definition: "... refers to the degree to which the test items are a representative sample of behaviors exhibited in some performance domain." (Schneider & Schmitt, 1986, p. 237)
Almost entirely based on judgment of the process used to identify test content.
Tenopyr (1977) called it "content-oriented test construction."
Example: Robinson's (1981) "Construction Error Recognition Test"
Validity: Types
Construct Validity
Definition: ... extent to which a measure (indirectly) assesses an underlying concept or construct; ... the interpretation of what the scores on a measure represent (Spector, 1996)
2 issues:
1. Define the construct (what it is and what it is not)
2. Make judgments, usually over a series of studies, of how well a pattern of results from a measure matches up with the patterns expected for that construct
Convergent validity & discriminant validity (see the sketch below)
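As an illustration with simulated (hypothetical) data: convergent validity appears as a high correlation between two different measures of the same construct, discriminant validity as a low correlation with a measure of a different construct:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hypothetical latent constructs
conscientiousness = rng.normal(0, 1, n)
verbal_ability = rng.normal(0, 1, n)

# Two measures of the same construct, plus one measure of a different construct
measure_a = conscientiousness + rng.normal(0, 0.5, n)  # self-report scale
measure_b = conscientiousness + rng.normal(0, 0.5, n)  # peer-rating scale
measure_c = verbal_ability + rng.normal(0, 0.5, n)     # vocabulary test

r_convergent = np.corrcoef(measure_a, measure_b)[0, 1]    # expected: high
r_discriminant = np.corrcoef(measure_a, measure_c)[0, 1]  # expected: near zero

print(f"Convergent r = {r_convergent:.2f}, discriminant r = {r_discriminant:.2f}")
```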