Reliability and Validity
Reliability: Notation/Symbols
rxx = reliability of the predictor
ryy = reliability of the criterion
Reliability: Definitions
Informal: The consistency and stability of measurement.
Formal: "The extent of unsystematic variation in the quantitative description of an individual when that individual is measured a number of times." (Ghiselli, Campbell, & Zedeck, 1981, p. 482)
Reliability: "Classic Measurement Theory"
Observed Score = (True Score + Systematic Error) + Random Error
Reliability: Some Sources Affecting Reliability
A poor measurement instrument
A poor user of the measurement instrument
An unstable trait, characteristic, or attribute
Reliability: Forms of Reliability
Test-Retest
Use the same test with the same sample (people) on 2 different occasions; correlate the scores from the 2 administrations of the test.
Correlation sometimes referred to as the "Coefficient of Stability"
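As a concrete illustration, here is a minimal sketch of the computation; the scores are hypothetical:

```python
import numpy as np

# Hypothetical scores for the same 8 people on two occasions
time1 = np.array([12, 15, 11, 18, 14, 16, 10, 13])
time2 = np.array([13, 14, 12, 17, 15, 17, 11, 12])

# Test-retest reliability: correlation between the two administrations
# (the "Coefficient of Stability")
r_tt = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability (coefficient of stability): {r_tt:.2f}")
```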
Reliability: Forms of Reliability
Test-Retest: Some Sources of Error
Change in testing conditions
Change in test taker
Practice effects
Reliability: Forms of Reliability
Parallel Forms (a.k.a. equivalent or alternate forms)
Develop 2 or more similar tests designed to measure the same thing; correlate scores from the "parallel" forms, using the same people to complete both forms.
Correlation may be referred to as the "Coefficient of Equivalence"
2 approaches:
1. Immediate
2. Delayed
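Computationally this mirrors the test-retest sketch above; only the second set of scores comes from a parallel form rather than a second administration (data hypothetical):

```python
import numpy as np

# Hypothetical scores for the same 8 people on two parallel forms
form_a = np.array([21, 17, 25, 19, 23, 16, 20, 22])
form_b = np.array([20, 18, 24, 18, 24, 17, 19, 21])

# Coefficient of equivalence: correlation between the parallel forms
r_equiv = np.corrcoef(form_a, form_b)[0, 1]
print(f"Parallel-forms reliability (coefficient of equivalence): {r_equiv:.2f}")
```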
Reliability: Forms of Reliability
Parallel Forms: Some Sources of Error
Immediate – fatigue, content of tests, practice
Delayed – content of tests, time
Problems with constructing two versions of the measure
Reliability: Forms of Reliability
Internal Consistency
Extent to which responses are consistent to items designed to measure the same thing within a single test.
Approaches:
1. Split half
Give the measure once; somehow divide the measure into 2 parts with an equal number of items (e.g., odd/even split); correlate scores on the 2 halves. Correct this value! (It's an underestimate of overall internal consistency.)
Reliability: Forms of Reliability
Observed Score = (True Score + Systematic Error) + Random Error
Remember the "random error" component!
Other things being equal, the longer a test is (i.e., more items), the more "reliable" the test will be in terms of estimating the "true score." This is why you must correct the "split half" estimate of reliability!
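To see why, here is a small simulation of the classic measurement model (all data simulated): each item is a noisy reading of the true score, and averaging more items cancels more random error, so the observed score tracks the true score more closely.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people = 1000
true_score = rng.normal(0, 1, n_people)

def simulated_fidelity(n_items: int) -> float:
    # Each item = true score + random error; the test score is the item mean
    items = true_score[:, None] + rng.normal(0, 1, (n_people, n_items))
    test_score = items.mean(axis=1)
    # Correlation of observed test scores with the (normally unknowable) true scores
    return np.corrcoef(test_score, true_score)[0, 1]

for k in (2, 10, 50):
    print(f"{k:>2} items: r(observed, true) = {simulated_fidelity(k):.2f}")
```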
Reliability: Forms of Reliability
Spearman-Brown Prophecy Formula for the split-half estimate:
rfull test = 2rxx / (1 + rxx)
where rxx is the reliability estimate (correlation) from the split halves.
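A minimal sketch of the full split-half procedure, including the Spearman-Brown correction, assuming a hypothetical people-by-items score matrix:

```python
import numpy as np

# Hypothetical data: 6 people x 4 items, all intended to measure the same thing
items = np.array([
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [3, 2, 3, 3],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
])

# Odd/even split: sum the odd-numbered and even-numbered items separately
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

# Correlation between the two halves (reliability of a HALF-length test)
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction to estimate full-test reliability
r_full = (2 * r_half) / (1 + r_half)
print(f"Half-test r = {r_half:.2f}, corrected full-test r = {r_full:.2f}")
```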
Reliability: Forms of Reliability
Approach #2: Coefficient alpha (a.k.a. "Cronbach's alpha")
Give the measure once.
Assesses the consistency of responses to all of the items in a "test" of the same thing.
Mathematical estimate of the average of all possible split halves.
Do NOT have to correct this value!
Symbolized as α.
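A minimal sketch of coefficient alpha using the standard formula α = (k / (k - 1)) * (1 - Σ item variances / variance of total scores), again with hypothetical data:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for a (people x items) score matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

items = np.array([
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [3, 2, 3, 3],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
])
print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")
```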
Reliability: Forms of Reliability
Inter-rater Reliability (a.k.a. inter-rater agreement)
Extent to which 2 or more observers of the same behavior, using a similar measurement instrument, rate or score the behavior in a similar or consistent manner.
Examples – judges in Olympic events, assessors in an assessment center, profs grading term papers
Problems – inter-rater differences in temperament, motivation, observation skills, etc.
Improve with training, clear definitions, better measures
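A minimal sketch of two simple inter-rater indices (hypothetical ratings); more refined indices such as Cohen's kappa or the intraclass correlation follow the same logic:

```python
import numpy as np

# Hypothetical scores from two raters judging the same 8 performances
rater_a = np.array([7, 5, 8, 6, 9, 4, 7, 6])
rater_b = np.array([8, 5, 7, 6, 9, 5, 6, 6])

# Simple consistency index: correlation between the two raters
r = np.corrcoef(rater_a, rater_b)[0, 1]

# Simple agreement index: proportion of exact matches
agreement = np.mean(rater_a == rater_b)

print(f"Inter-rater correlation = {r:.2f}, exact agreement = {agreement:.0%}")
```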
Validity: Definitions
Simple: Extent to which a test measures what it is supposed to measure.
Formal: "... refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores. Test validation is the process of accumulating evidence to support such inferences." (Standards for Educational and Psychological Testing, 1985, p. 9)
Validity
Definition - Principles for the Validation and Use of Personnel Selection Procedures (SIOP, 2003):
"... the degree to which accumulated evidence and theory support specific interpretations of test scores entailed by proposed uses of a test" (AERA et al., 1999, p. 184).
"Validity is the most important consideration in developing and evaluating selection procedures. Because validation involves the accumulation of evidence to provide a sound scientific basis for the proposed score interpretations, it is the interpretations of these scores required by the proposed uses that are evaluated, not the selection procedure itself."
Validity: Types
1. Criterion-related validity (Symbol: rxy)
Predictive design (“Predictive validity”)
Concurrent design (“Concurrent validity”)
2. Content validity
3. Construct validity
Validity: Types
Criterion-Related: Predictive Design
Time 1: "Applicants" take predictor/test
Time 2: Collect job performance (criterion) data after "applicants" have been on the job for some time
Correlate predictor scores with job performance data
Validity: Types
Criterion-Related: Concurrent Design
Time 1: Job incumbents take predictor/test
Time 1: Collect job performance (criterion) data for job incumbents
Correlate predictor scores with job performance data
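Computationally the predictive and concurrent designs are identical; they differ only in who is tested and when the criterion is collected. A minimal sketch with hypothetical predictor and criterion scores:

```python
import numpy as np

# Hypothetical data: predictor (test) scores and job performance ratings
predictor = np.array([55, 62, 48, 71, 66, 59, 75, 50])            # test scores
performance = np.array([3.1, 3.8, 2.9, 4.2, 3.6, 3.3, 4.5, 3.0])  # criterion

# Criterion-related validity coefficient rxy
r_xy = np.corrcoef(predictor, performance)[0, 1]
print(f"Criterion-related validity (rxy) = {r_xy:.2f}")
```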
Validity: Types
Content Validity
Definition: "... refers to the degree to which the test items are a representative sample of behaviors exhibited in some performance domain." (Schneider & Schmitt, 1986, p. 237)
Almost entirely based on judgment of the process used to identify test content.
Tenopyr (1977) called it "content-oriented test construction."
Example: Robinson's (1981) "Construction Error Recognition Test"
Validity: Types
Construct Validity
Definition: ... extent to which a measure (indirectly) assesses an underlying concept or construct; ... the interpretation of what the scores on a measure represent (Spector, 1996)
2 issues:
1. Define the construct (what it is and what it is not)
2. Make judgments, usually over a series of studies, of how well a pattern of results from a measure matches up with the patterns expected for that construct
Convergent validity & discriminant validity (see the sketch below)
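As an illustration with simulated (hypothetical) data: convergent validity appears as a high correlation between two different measures of the same construct, discriminant validity as a low correlation with a measure of a different construct:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hypothetical latent constructs
conscientiousness = rng.normal(0, 1, n)
verbal_ability = rng.normal(0, 1, n)

# Two measures of the same construct, plus one measure of a different construct
measure_a = conscientiousness + rng.normal(0, 0.5, n)  # self-report scale
measure_b = conscientiousness + rng.normal(0, 0.5, n)  # peer-rating scale
measure_c = verbal_ability + rng.normal(0, 0.5, n)     # vocabulary test

r_convergent = np.corrcoef(measure_a, measure_b)[0, 1]    # expected: high
r_discriminant = np.corrcoef(measure_a, measure_c)[0, 1]  # expected: near zero

print(f"Convergent r = {r_convergent:.2f}, discriminant r = {r_discriminant:.2f}")
```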