Ail apresentation(kumazawa)

Evaluating validity of criterion-referenced test score interpretations and uses

Takaaki Kumazawa

Kanto Gakuin University

Kintai Bridge, Japan (wiki)

Purpose

• The purpose of my talk is to evaluate the validity of criterion-referenced placement test score interpretations and uses using Kane’s (2006) argument-based validity framework

• This presentation is based on a paper I published in the JALT Journal (http://jalt-publications.org/jj/issues/2013-05_35.1)

Classical view of validity

• Validity: the extent to which a test measures what it is supposed to measure

• Three types of validity

⇒ Criterion-related validity: correlation between an established valid measure and the test being developed

⇒ Content validity: experts’ judgment on whether items measure what they are supposed to measure

⇒ Construct validity: statistical examination of whether items measure what they are supposed to measure

Current view of validity

• Validity is “the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests” (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education [AERA, APA, & NCME], 1999, p. 9).

Argument-based validity framework

Interpretive argument: arguing that the inferences to be made from test scores are theoretically valid

Validity argument: evaluating the interpretive argument by providing warrants

Observation → (scoring) → Observed score → (generalization) → Universe score → (extrapolation) → Target score → (decision) → Use

Interpretive argument

• Scoring inference ⇒ to what extent do examinees get placement items correct, and do high-scoring examinees get more placement items correct?

• Generalization inference ⇒ to what extent are placement items consistently sampled from a domain and sufficient in number to reduce measurement error?

• Extrapolation inference ⇒ to what extent does the difficulty of placement items match the objectives of a reading course?

• Decision inference ⇒ to what extent do the decisions made to place examinees in their proper level of the course have an impact on washback in the course?

Participants

⇒ 428 Japanese first-year university students majoring in law

⇒ TOEIC scores of about 250-450

⇒ Three courses in the English program: Reading, Listening, TOEIC skills

• Proficiency-based program ⇒ Three levels

Level 1: 60 high-scoring students. Major objective of the reading course: improve reading skills such as fast reading

Level 2: about 300 students

Level 3: 50 low-scoring students. Major objective of the reading class: re-learn junior high and high school grammar

Criterion-referenced placement test

• Grammar (k = 40)

⇒ Items are taken from textbooks used in junior high and high schools

⇒ Grammar: present, past, and future tenses, continuous, relative pronoun, gerund, participle, etc.

⇒ Sample: Hi, I ( ) Ken.

1. am 2. are 3. is 4. be

• Vocabulary (k = 40)

⇒ Items are taken from high-frequency words at the 1000 to 3000 levels based on the JACET 8000 corpus

⇒ Sample: Bring

1. 送る (send) 2. 持ってくる (bring) 3. 鳴る (ring) 4. 購入する (buy)

• Reading (k = 10)

⇒ Two passages are taken from two textbooks used in Level 1 and Level 3 reading classes

⇒ Sample: How do they travel?

1. by plane 2. by bus 3. by car 4. by train

Procedures

• On the first day of the semester, the placement test was administered in 45 minutes

• A grammar pretest (k = 55, α = .85) was given on the first day of class in Level 2 classes (n = 51) and Level 3 classes (n = 49)

• Thirty 90-minute lessons over two semesters

• The same grammar test was given as a posttest (α = .92) on the last day of class to the same students (n = 51, 49)

• A course evaluation survey was given to the same students (n = 51, 49)

Backing for scoring inference

• Item facility ⇒ 7 items below .29

⇒ 62 items between .30 and .70

⇒ 21 items above .71

• Item discrimination ⇒ 4 items below .19

⇒ 86 items above .20

• Rasch item difficulty estimates ⇒ -3.79 to 2.33

• Infit MS ⇒ 0.80 to 1.30
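The two classical item statistics above can be sketched as follows. This is only an illustration on made-up data: the response matrix is hypothetical, and the slide does not say how discrimination was computed, so the upper-minus-lower index on top and bottom thirds is an assumption (a point-biserial correlation would be another common choice).

```python
# Toy 0/1 response matrix: rows are examinees, columns are items (hypothetical data).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]

def item_facility(rows):
    """Proportion of examinees answering each item correctly."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def item_discrimination(rows, fraction=1 / 3):
    """Item facility of the upper group minus that of the lower group.

    Groups are the top and bottom `fraction` of examinees by total score
    (an assumed grouping rule, not one stated on the slide).
    """
    ranked = sorted(rows, key=sum, reverse=True)
    g = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:g], ranked[-g:]
    return [u - l for u, l in zip(item_facility(upper), item_facility(lower))]

facility = item_facility(responses)
discrimination = item_discrimination(responses)

# Flag items the way the slide bands them: facility within .30-.70, discrimination >= .20.
for idx, (f, d) in enumerate(zip(facility, discrimination), start=1):
    in_band = 0.30 <= f <= 0.70
    print(f"item {idx}: IF = {f:.2f}, ID = {d:.2f}, IF in .30-.70: {in_band}")
```

With the real 90-item data, counting items per band would give the tallies reported on the slide.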

Backing for generalization inference

• Multivariate generalizability theory (decision study of a persons × items design)

⇒ Grammar (k = 40, ρ = .85, Φ = .83)

⇒ Vocabulary (k = 40, ρ = .86, Φ = .84)

⇒ Reading (k = 10, ρ = .58, Φ = .55)

⇒ Total (k = 90, ρ = .92, Φ = .91)
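To show where coefficients like these come from, here is a minimal univariate sketch of a persons × items decision study. It is not the multivariate analysis reported on the slide. It assumes ρ denotes the generalizability coefficient (relative decisions) and Φ the dependability index (absolute decisions); the function names and the variance components passed to `d_study` are hypothetical.

```python
def variance_components(scores):
    """ANOVA-based estimates for a crossed persons x items design.

    `scores` is a list of rows (one per person), one 0/1 score per item.
    Returns (person variance, item variance, residual variance).
    """
    n_p, n_i = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_i)
    p_means = [sum(row) / n_i for row in scores]
    i_means = [sum(col) / n_p for col in zip(*scores)]
    ss_p = n_i * sum((m - grand) ** 2 for m in p_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in i_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = (ss_total - ss_p - ss_i) / ((n_p - 1) * (n_i - 1))
    var_p = max(0.0, (ms_p - ms_res) / n_i)  # negative estimates are set to 0
    var_i = max(0.0, (ms_i - ms_res) / n_p)
    return var_p, var_i, ms_res

def d_study(var_p, var_i, var_res, k):
    """Generalizability coefficient and dependability index for a k-item test."""
    rel_error = var_res / k            # error for relative (norm-referenced) decisions
    abs_error = (var_i + var_res) / k  # error for absolute (criterion-referenced) decisions
    return var_p / (var_p + rel_error), var_p / (var_p + abs_error)

# Hypothetical variance components, used only to show the effect of test length.
for k in (10, 40, 90):
    g_coef, phi = d_study(0.04, 0.02, 0.18, k)
    print(f"k = {k}: generalizability = {g_coef:.2f}, dependability = {phi:.2f}")
```

Because both error terms are divided by the number of items, a short section such as Reading (k = 10) is expected to be less dependable than the 40-item sections, which matches the pattern on the slide.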

Cut point for Level 1: Level 1 reading

Cut point for Level 3: Junior High grammar and 1000 word level

Backing for extrapolation inference

Difficulty level estimates

Level                    Difficulty   SE     Infit MS
Junior High grammar        -0.65      0.03   1.00
High School grammar         0.29      0.02   1.00
1000 word level vocab      -0.94      0.03   1.00
2000 word level vocab       0.15      0.03   1.00
3000 word level vocab       0.12      0.05   1.00
Level 3 reading             0.30      0.05   1.00
Level 1 reading             0.73      0.05   1.10

FACETS map

[Figure: FACETS map placing students (* = 4), items (* = 1), and levels on a common logit scale (Measr) from about +3 to -4. Level labels on the map: Level 1 Reading, Basic HS Grammar, JACET2000, JACET3000, Jr H Grammar, JACET1000. Cut points for Levels 1, 2, 3: Level 1a (1.49), Level 1b (.77), Level 2 (.77 to -.70), Level 3a (-.70), Level 3b (-.99).]
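One way to read the match between cut points and level difficulties is through the dichotomous Rasch model, where the probability of success depends only on the gap between a person's measure and an item's difficulty. The sketch below uses the difficulty estimates and cut points reported above. Treating an average level difficulty as if it were a single item's difficulty is a simplifying assumption made here for illustration, not something the slides state.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Values copied from the difficulty table and the cut points on the map (logits).
level_difficulty = {
    "Level 1 reading": 0.73,
    "Junior High grammar": -0.65,
    "1000 word level vocab": -0.94,
}
cut_points = {"Level 1b": 0.77, "Level 3a": -0.70, "Level 3b": -0.99}

pairs = [
    ("Level 1b", "Level 1 reading"),
    ("Level 3a", "Junior High grammar"),
    ("Level 3b", "1000 word level vocab"),
]
probabilities = {
    (cut, level): rasch_p(cut_points[cut], level_difficulty[level])
    for cut, level in pairs
}
for (cut, level), p in probabilities.items():
    print(f"student at {cut} cut vs. {level}: P(correct) = {p:.2f}")
```

In each pairing the cut point sits within 0.05 logits of the level difficulty, so a student right at the cut has close to an even chance on material of that level.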

Backing for decision inference

Level 2 and Level 3 students’ (n = 51, 49) grammar pretest and posttest scores (k = 55)

               Grammar pretest (α = .85)    Grammar posttest (α = .92)
Class level    n     M       SD             n     M       SD
Level 2a       26    30.38   6.34           21    12.14   2.50
Level 2b       25    32.36   8.47           24    28.63   7.93
Level 2        51    31.35   7.45           45    20.93   10.24
Level 3c       25    20.80   5.09           22    26.82   5.21
Level 3d       24    19.88   4.29           23    26.78   5.95
Level 3        49    20.35   4.69           45    26.80   5.53

Pretest: Level 2 students scored higher

Posttest: Level 3 students scored higher

Level 2: 11 points down

Level 3: 6 points up
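The level rows in the table can be checked against the class rows with a weighted mean. The short sketch below does that with the n and M values from the table and then takes the posttest minus pretest difference per level. Since fewer students took the posttest, these are differences between group means, not paired gains; the helper name `pooled_mean` is illustrative.

```python
def pooled_mean(groups):
    """Mean of several classes combined, weighting each class mean by its n."""
    total_n = sum(n for n, _ in groups)
    return sum(n * m for n, m in groups) / total_n

# (n, M) pairs per class, copied from the table.
pre = {
    "Level 2": [(26, 30.38), (25, 32.36)],
    "Level 3": [(25, 20.80), (24, 19.88)],
}
post = {
    "Level 2": [(21, 12.14), (24, 28.63)],
    "Level 3": [(22, 26.82), (23, 26.78)],
}

change = {}
for level in pre:
    before, after = pooled_mean(pre[level]), pooled_mean(post[level])
    change[level] = after - before
    print(f"{level}: pretest {before:.2f}, posttest {after:.2f}, change {change[level]:+.2f}")
```

The pooled means reproduce the Level 2 and Level 3 rows of the table, with Level 2 ending roughly 10 points lower and Level 3 roughly 6 points higher.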

Validity argument

Interpretive argument

• Scoring inference

⇒ to what extent do examinees get placement items correct, and do high-scoring examinees get more placement items correct?

• Generalization inference

⇒ to what extent are placement items consistently sampled from a domain and sufficient in number to reduce measurement error?

• Extrapolation inference

⇒ to what extent does the difficulty of placement items match the objectives of a reading course?

• Decision inference

⇒ to what extent do the decisions made to place examinees in their proper level of the course have an impact on washback in the course?

Validity argument

• Scoring inference

⇒ Because most items were working well, the inference from observation to the observed score was valid

• Generalization inference

⇒ Because dependability was high and measurement error was small, the inference from the observed score to the universe score was valid

• Extrapolation inference

⇒ Because the difficulty of the items was appropriate to the objectives of the program, the inference from the universe score to the target score was valid

• Decision inference

⇒ Because Level 3 students were placed in the right level and were able to improve their grammar test scores, the inference from the target score to test use was valid

Conclusion

• “Validation is simple in principle, but difficult in practice. The argument-based framework provides a relatively pragmatic approach to validation” (Kane, 2012, p. 15).

William Jolly Bridge, Brisbane (wiki)

References

• Kane, M. (2006). Validation. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 17-64). Westport, CT: Greenwood Publishing.

• Kane, M. (2012). Validating score interpretations and uses. Language Testing, 29, 3-17. doi: 10.1177/0265532211417210

• Kumazawa, T. (2013). Evaluating validity for in-house placement test score interpretations and uses. JALT Journal, 35, 73-100.
