
How to Interpret Effect Size in CBT–PBT Comparability Studies

Presented by Leah Tepelunde Kaira

Dr. Nambury Raju Summer Internship Program

Page 2: How to Interpret Effect size in CBT–PBT Comparability Studies Presented By Leah Tepelunde Kaira Dr. Nambury Raju Summer Internship Program1

Order of Presentation

• Introduction
• Purpose of study
• Review of Literature
• Method
• Results
• Concluding remarks


Introduction

• Use of computerized testing has increased over the past decade
  – immediate scoring and reporting of results
  – more flexible test administration schedules
  – greater test administration efficiency

• Due to limited resources, education systems provide both computer-based (CBT) and paper-based (PBT) tests


Introduction continued

• The Standards (AERA et al., 1999) require a “clear rationale and supporting evidence” (Standard 4.10, p. 57) that scores obtained from CBT and PBT can be used interchangeably

• The International Test Commission (ITC) requires that testing agencies “provide clear documented evidence of equivalence …” (ITC, 2005, p. 21)


Introduction continued

• Although professional guidelines stipulate some methods that could be employed to examine comparability, they are silent with respect to how to judge comparability

• The lack of criteria has resulted in educational testing researchers using professional judgment or guidelines employed in other fields

• Among the most widely used guidelines are those suggested by Cohen (1988)

  – Problem: these may be misleading because in some areas (e.g., education), small effect sizes are more likely
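For reference, Cohen's (1988) index for a difference between two group means is the standardized mean difference

  d = (M_CBT − M_PBT) / s_pooled,  where  s_pooled = sqrt( ((n_CBT − 1) s_CBT^2 + (n_PBT − 1) s_PBT^2) / (n_CBT + n_PBT − 2) ),

with conventional benchmarks of roughly 0.2 (small), 0.5 (medium), and 0.8 (large).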


Purpose of study

• Provide guidelines for interpreting effect sizes in comparability studies

• Questions:
  – How should effect sizes in comparability studies be interpreted?
  – Does the size of the score scale have an impact on effect size?
  – Does sample size have an impact on effect size?
  – Does the magnitude of the effect size depend on the score distribution?


Related Literature

• Choi and Tinkler (2002) compared CBT and PBT scores from math and reading for grades 3 and 10.
  – Compared item difficulty estimates and calculated differences weighted by standard error
  – Compared mean ability estimates across the modes and grades to assess comparability
  – Reading items were coded based on their textual focus to assess the relationship between textual focus and item difficulty estimates
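The exact statistic is not shown on the slide, but a difficulty difference weighted by its standard error is commonly computed, for each item j, as

  z_j = (b_j,PBT − b_j,CBT) / sqrt( SE_j,PBT^2 + SE_j,CBT^2 ),

with items flagged when |z_j| exceeds a chosen threshold (the absolute d-value of 2 noted on the next slide would be such a threshold).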


Related literature continued

• More reading items were flagged than math items.

• Mean differences in item difficulty estimates were higher for 3rd graders than for 10th graders, and larger mean differences were observed in reading than in math.

• Within-grade comparisons showed that 3rd-grade reading items became harder on computer than on paper; the difference was negligible at 10th grade.

• The mode effect was larger for reading than for math.
  – Note that this study does not provide guidelines on how to evaluate the size of an effect. In addition, no empirical evidence is provided for using an absolute d-value of 2 to flag items that are differentially difficult across the two administration modes.


Related literature continued

• Pearson (2007) evaluated the comparability of online and paper field tests.

• Students were matched on reading, math, and writing scale scores, gender, ethnic group, and field test form.

• A standardized difference (Zdiff) was calculated for both the theta and difficulty parameter estimates.

• Cohen's (1992) guidelines were used to interpret effect sizes.

• Standardized mean differences in theta were also small, except on one form where larger standardized mean differences and effect sizes were observed for White students, Hispanic students, and students who indicated ‘other’ as their ethnicity. The observed effect sizes were small based on Cohen's guidelines.

• Comparison of difficulty parameters resulted in 24 items being flagged with standardized mean differences beyond ±1.96. However, the associated effect sizes for all flagged items were 0.20 or less.
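Zdiff is presumably a statistic of the same standard-error-weighted form sketched above (a difference in estimates divided by the standard error of that difference); under a standard normal reference distribution, ±1.96 are the two-tailed 5% critical values, which is the usual rationale for that flagging threshold.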


Related literature continued

• Kim and Huynh (2007) investigated the equivalence of scores from CBT and PBT versions of Biology and Algebra end-of-course exams.

• Results were analyzed by examining differences in scale scores, item parameters, and ability estimates at the content domain level.

• An effect size measure (g) was used to evaluate the differences. Cohen's criteria were used to judge the magnitude of g.


Related Literature continued

• Items were recalibrated and parameter estimates were compared to parameters in the bank. Robust Z and average absolute difference (AAD) statistics were used to examine significant differences.

• TCCs (test characteristic curves) and TIFs (test information functions) of CBT and PBT were also compared.

• Results showed small differences in scaled scores as measured by the effect size. High correlations were observed between recalibrated and bank item parameters.

• The AAD statistic ranged from 0.29 to 0.37, with small differences between CBT and PBT. TCCs and TIFs for CBT and PBT were generally comparable in both subjects.
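The slide does not define these statistics. In common use, with d_i = b_i,recalibrated − b_i,bank for item i,

  AAD = (1/n) Σ |d_i|   and   Robust Z_i = (d_i − median(d)) / (0.74 × IQR(d)),

where 0.74 × IQR approximates the standard deviation of d under normality, and items with large |Robust Z_i| are examined further.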


Related Literature continued

• Criteria used in evaluating comparability
  – Difference in mean scores
  – Difference in item difficulty estimates
  – Difference in ability parameter estimates
  – Difference in TCCs and TIFs


Method

• Study conditions
  – 2 score scale sizes
  – 4 score distributions
  – 4 sample sizes


Method

• Procedure (see the sketch after this list)
  a. Compute a baseline TCC using operational item parameters and theta values.
  b. Simulate the performance of CBT learners on the test by manipulating the item difficulty parameter such that the maximum difference in expected score between the CBT and PBT groups is 0.1. Compute a TCC.
  c. Repeat the procedure in (b) to reflect maximum differences in expected scores of 0.2 to 3.0 in increments of 0.1.
  d. For each of the simulated TCCs, compute scaled scores for the various raw scores.
  e. Using the scaled scores computed in step (d), compute the effect size between the two TCCs.
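A minimal Python sketch of this procedure follows. It is illustrative only: the 3PL model, the uniform difficulty shift, the randomly generated item and ability parameters, the 100–200 reporting scale, and the use of Cohen's d as the effect size index are assumptions made for the sketch; the slide does not specify these details.

```python
# Minimal sketch of the simulation procedure above (assumptions noted in the text).
import numpy as np

def tcc(theta, a, b, c):
    """Test characteristic curve: expected raw score at each theta under a 3PL model."""
    p = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta[:, None] - b)))
    return p.sum(axis=1)

def cohens_d(x, y):
    """Standardized mean difference with a pooled standard deviation."""
    nx, ny = len(x), len(y)
    sp = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2))
    return (x.mean() - y.mean()) / sp

rng = np.random.default_rng(0)
n_items, n_examinees = 60, 2000
a = rng.uniform(0.5, 2.0, n_items)          # discrimination (stand-in for operational values)
b = rng.normal(0.0, 1.0, n_items)           # difficulty (PBT values)
c = rng.uniform(0.1, 0.25, n_items)         # pseudo-guessing
theta = rng.normal(0.0, 1.0, n_examinees)   # ability sample

baseline = tcc(theta, a, b, c)              # step (a): baseline (PBT) TCC

def scale(raw):
    """Illustrative linear raw-to-scale conversion onto a 100-200 scale."""
    return 100 + 100 * raw / n_items

for target in np.arange(0.1, 3.01, 0.1):    # steps (b)-(c): target maximum raw-score gap
    shift, cbt = 0.0, baseline
    # enlarge a uniform difficulty shift until the largest TCC gap reaches the target
    while np.max(baseline - cbt) < target:
        shift += 0.01
        cbt = tcc(theta, a, b + shift, c)
    # steps (d)-(e): scale both sets of expected scores and compute the effect size
    print(f"max raw-score gap {target:.1f}: d = {cohens_d(scale(baseline), scale(cbt)):.4f}")
```

In the actual study, the operational item parameters, theta values, and raw-to-scale conversion would replace the randomly generated stand-ins used above.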


Results

[Table: 30 effect size values, ranging from 0.0069 to 0.2053 (presumably one per simulated TCC).]


Results – Empirical distribution

[Figure: effect size (y-axis, 0 to 0.45) plotted against the simulated TCCs (x-axis, 1–29), with one line per sample-size condition (n=1 to n=4).]


Results – Normal distribution

[Figure: effect size (y-axis, 0 to 0.45) plotted against the simulated TCCs (x-axis, 1–29), with one line per sample-size condition (n=1 to n=4).]


Results – Negatively skewed distribution

[Figure: effect size (y-axis, 0 to 0.45) plotted against the simulated TCCs (x-axis, 1–29), with one line per sample-size condition (n=1 to n=4).]


Results – Positively skewed distribution

[Figure: effect size (y-axis, 0 to 0.45) plotted against the simulated TCCs (x-axis, 1–29), with one line per sample-size condition (n=1 to n=4).]


Results – Summary

• Both sample size and score distribution have an impact on effect size

• Better results were obtained with roughly equal sample sizes

• Larger effect sizes were observed with skewed distributions than with the empirical and normal distributions


Concluding remark

• Researchers evaluating comparability of CBT and PBT scores may need to be more cautious in using Cohen’s guidelines to judge comparability


Thank You!

• Suggestions and comments are welcome!
