
1

Multi-faceted Rasch analysis in rating tasks in Japan’s university entrance examinations

Rie KOIZUMI (Juntendo University)


PROMS (Pacific Rim Objective Measurement Society) 2016, Orient Hotel, Xi'an, China, August 1




2

Reforms in university entrance examinations in Japan (MEXT, 2014; Yomiuri Shimbun Kyouikubu, 2016)

To keep up with internationalization

To increase the number of students who can think, judge, express ideas, and act proactively

Exam construct: knowledge, skills → + ability to think logically and express effectively

L2 English test: reading, listening → + writing, speaking

3

Current types of university entrance examinations

(1) General examinations

(2) Recommendation-based examinations

(3) Admissions Office (AO) examinations

(4) Special selection examinations

→Restructured? All should use both academic tests and other methods such as interviews and long essays.

4

Types of tests in general examinations

(a) Only the National Center Test for University Admissions (Center Test): administered only once a year to about 550,000 examinees nationwide

(b) Only a university-developed test: limited quality control

(c) Both

5

Reforms in the Center Test

Format: multiple-choice (MC) with single answers → + MC with single and multiple answers + constructed-response

×Multiple administrations per year

×Do not select applicants based on a one-point difference in test scores

L2 English four-skill test, with the speaking section using voice recorders

Analysis: CTT → + IRT

6

Concrete plan proposed for the Center Test reform

For math and L1 Japanese:

2020 to 2023: Elicit relatively short answers (40–80 Japanese characters)

From 2024: Administered digitally and elicit longer written responses (200–300 characters)

Scored initially by humans, but later by artificial intelligence

40–80 Japanese characters

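西安は観光するところも多く、滞在はとても充実しています。兵馬俑に行き、中国の歴史と文化を学びました。兵馬俑博物館では写真をとってよいことに驚きました。(76 Japanese characters)

[English translation: Xi'an has many places to visit, and my stay has been very fulfilling. I went to see the Terracotta Army and learned about Chinese history and culture. I was surprised that taking photographs is allowed at the Terracotta Army Museum.]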

7

8

Constructed responses require judgments using rating scales

Rating involves many factors (McNamara, 1996; Fulcher, 2003)

Human raters; automated system (e.g., characteristics, training)

Rating scale (e.g., orientation, construct definition)

Task (e.g., orientation, goals)

Interlocutor

Local performance condition

9

Multi-faceted Rasch analysis (MFRA)

Useful in analyzing data affected by many factors (Barkaoui, 2013; Bond & Fox, 2015; Eckes, 2011; Engelhard, 2013; McNamara & Knoch, 2012)

Facets: Test takers, tasks, raters, rating criteria, + α (Two or more facets)

Obtain detailed information that cannot be derived from raw-score data analysis

E.g., Do raters produce stable scores?

Do rating scales work effectively?

Do tasks work in an expected manner?
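For readers less familiar with the model, one common way of writing a many-facet Rasch model for the facets listed above is sketched here. It is an illustration, not the exact specification used in the analyses reported in this talk; the symbols are notation assumed for this sketch and do not appear on the slides.

```latex
% Log-odds of test taker n receiving category k rather than k-1
% on task t, from rater j, on criterion i (all symbols are assumed notation).
\[
  \ln\!\left(\frac{P_{ntjik}}{P_{ntji(k-1)}}\right)
    = \theta_n - \beta_t - \alpha_j - \delta_i - \tau_k
\]
% \theta_n : ability of test taker n
% \beta_t  : difficulty of task t
% \alpha_j : severity of rater j
% \delta_i : difficulty of rating criterion i
% \tau_k   : threshold between categories k-1 and k
```

In a rating scale model the thresholds are shared by all criteria, as written above; in a partial credit model each criterion has its own thresholds, so the last term would carry a criterion index as well.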

Argument-based approach to validity (Chapelle et al., 2008)

F. Utilization

The degree of impact on learning and teaching

E. Extrapolation

Relationship with other test scores

D. Explanation (reflection of the construct)

Factor structure

Function analysis

C. Generalization (reliability)

The test and raters provide stable estimates.

B. Evaluation (appropriate scoring)

Statistical characteristics of tasks/a rating scale

Sufficient difficulty spread of tasks

A. Domain Definition

Relevance and representativeness of the target domain

11

Issues in reforms

Radical proposal in 2014

Slow and limited progress

Background:

Conflicting principles (Kuramoto & Koizumi, 2016)

Principle of education

Principle of measurement

Methodological preferences

Some users of 2-parameter item response theory

Limited users of the Rasch model

12

Focus on speaking tests testing L2 English ability 1/2

Speaking tests for university entrance examinations

(1) Four-skill tests

TEAP

Eiken

GTEC CBT

GTEC for Students

IELTS

Cambridge English

TOEIC (LR + SW)

TOEFL iBT

TOEFL Junior Comprehensive

13

Focus on speaking tests testing L2 English ability 2/2

Separate speaking tests

As a supplement to the current exam (Mizohata, 2016)

SST

TSST

OPIc

Versant

TOEIC Speaking

14

TEAP (Test of English for Academic Purposes)

Developed by the Eiken Foundation of Japan and Sophia University (http://www.eiken.or.jp/teap/)

Construct: academic English proficiency required for learning and researching at universities

10-min face-to-face, one-on-one interviews

CEFR about A2 to B2

0 to 30 points

Rater: training session

15

TEAP tasks and criteria

Part 1: Interview

Part 2: Role play (interviewing an examiner)

Part 3: Monologue

Part 4: Extended interview

5 criteria

4 levels (0 to 3; Below A2, A2, B1, B2)

(a) Pronunciation, (b) Grammatical Range and Accuracy, (c) Lexical Range and Accuracy, (d) Fluency, (e) Interactional Effectiveness

16

TEAP analysis (Nakatsuhara, 2014)

Analysis of the pilot study data

120 test takers

5 criteria

6 trained raters

Facets, using a partial credit model

17

TEAP pilot study results

Favorable overall

High rater agreement (actual: 59.7%)

5.67 strata of test takers

2.8% underfitting test takers

Appropriate rating scale functions

Only a pilot study was analyzed using MFRM.

18

GTEC CBT (Global Test of English Communication Computer Based Testing)

Developed by the Benesse Corporation and Center for Entrance Examination Standardization (http://www.benesse-gtec.com/cbt/en)

Construct: English proficiency in four skills for academic purposes

20-min computer-based test

CEFR about A2 to B2

Maximum score: 350 points

19

GTEC CBT tasks

Part 1: Listening and responding (6 items)

E.g., Where do people like to travel to in your country?

When is the best time of year to visit your country?

Part 2: Delivering and asking for information (3 items)

Part 3: Expressing your opinion (3 items)

Do you think technology has changed the way we live?

20

GTEC CBT criteria

23 criteria in total. Each item has 1 to 5 criteria.

Full marks: 1 to 3 points

Example of criteria

Part 1: Listening and responding

Respond to simple questions appropriately and clearly

Part 2: Delivering and asking for information

Based on the provided information, give the factual information and preference, and ask questions

Part 3: Expressing your opinion

State an opinion and provide reasons to support the opinion

21

GTEC CBT rater training, quality maintenance

Practice session and certification

Trial session

Independent ratings by two raters.

Divergent responses are assessed by the third experienced rater.

Analysis (Koizumi, Okabe, & Kashimada, 2016): 648 test takers, 23 criteria, 13 trained raters; Facets (Version 3.71.4), using a partial credit model

GTEC CBT Results 1/2

22

Facet        Min to Max      M (SD)        Strata  Reliability  % of underfitting
Test takers  -6.94 to 5.10   0.92 (1.44)    6.35   .95          8%
Criteria     -3.18 to 3.35   0.00 (1.54)   29.84   1.00         0%
Raters       -0.41 to 0.17   0.00 (0.17)    3.64   .86          0%
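The strata and reliability values in results like these are linked through the separation index G, using the standard Rasch relations R = G²/(1 + G²) and strata = (4G + 1)/3. The following is a minimal sketch of that arithmetic, not a description of how Facets computes its output; the function names are hypothetical, and the only input taken from the slides is the rater reliability of .86.

```python
import math


def separation_from_reliability(reliability: float) -> float:
    """Separation index G = sqrt(R / (1 - R)); defined for 0 <= R < 1."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    return math.sqrt(reliability / (1.0 - reliability))


def strata_from_separation(separation: float) -> float:
    """Number of statistically distinct strata: (4G + 1) / 3."""
    return (4.0 * separation + 1.0) / 3.0


# Rater facet: reported reliability is .86 (rounded to two decimals).
rater_strata = strata_from_separation(separation_from_reliability(0.86))
print(round(rater_strata, 2))  # 3.64
```

Because the reported reliabilities are rounded, this back-calculation is only approximate: it reproduces the rater strata of 3.64, gives about 6.1 rather than 6.35 for test takers from a reliability of .95, and cannot be applied to a reliability that rounds to 1.00.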

23

GTEC CBT Results 2/2

Favorable overall

High rater agreement (actual: 79.4% > expected: 63.4%)

Criteria and rater fit to the model

Bias analysis:

Rater × criteria: 0%

Rater × test taker: 1.39%

Test taker × criteria: 0.01%

Off-topic responses received unexpectedly low scores.

Appropriate rating scale functions
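The "actual" rater agreement figure reported above is a percentage of exact matches between raters scoring the same response. A minimal sketch of that observed part is given below for a single pair of raters; the ratings are made-up illustrative data, the function name is hypothetical, and the model-based "expected" agreement that Facets reports is not reproduced here.

```python
def exact_agreement_percent(ratings_a, ratings_b):
    """Percentage of responses on which two raters gave the identical score."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must score the same non-empty set of responses")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)


# Hypothetical scores (0 to 3) from two raters on the same ten responses.
rater_1 = [3, 2, 2, 1, 0, 3, 2, 1, 1, 2]
rater_2 = [3, 2, 1, 1, 0, 3, 2, 2, 1, 2]

agreement = exact_agreement_percent(rater_1, rater_2)
print(agreement)  # 80.0
```

Observed agreement well above the expected value, as in the results above, suggests that raters agree more often than the model predicts for independent judges.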

24

TSST (Telephone Standard Speaking Test)

Developed by ALC (https://tsst.alc.co.jp/tsst/e_contact.html)

Based on ACTFL OPI

Construct: how well a person can answer function-based questions spontaneously

15-min telephone-mediated test

Prompts both in L1 Japanese and L2 English

1 to 9 levels (Novice to Advanced)

CEFR A1 to B2

25

TSST

10 tasks (no preparation time; speaking for 45 sec)

Questions selected randomly from a question pool

From intermediate level to advanced level tasks

E.g.,

26

TSST rater training, quality maintenance

Certification

Training sessions annually

Independent ratings by three raters

One of them is an experienced rater.

Analysis (Koizumi, 2016): 5406 test takers, 771 tasks, 32 trained raters; Facets (Version 3.71.4), using a rating scale model
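To make the difference between a rating scale model and a partial credit model concrete, the sketch below computes the probability of each score category for one rating under a many-facet Rasch model. It is an illustration under stated assumptions and not the Facets implementation; the function name, parameter names, and all numerical values are hypothetical.

```python
import math


def category_probabilities(ability, element_difficulty, rater_severity, thresholds):
    """Probabilities of categories 0..K for one rating.

    thresholds[k - 1] is the threshold for moving from category k - 1 to k.
    All quantities are in logits.
    """
    location = ability - element_difficulty - rater_severity
    # Log-numerator of category 0 is 0; each higher category adds (location - threshold).
    log_numerators = [0.0]
    for threshold in thresholds:
        log_numerators.append(log_numerators[-1] + location - threshold)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_numerators)
    weights = [math.exp(value - peak) for value in log_numerators]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical values: a test taker at 1.0 logits, an element of average
# difficulty, a slightly severe rater, and three thresholds (four categories).
probs = category_probabilities(1.0, 0.0, 0.2, [-1.5, 0.0, 1.5])
print([round(p, 3) for p in probs])
```

Under a rating scale model the same `thresholds` list would be passed for every task or criterion; under a partial credit model each task or criterion would have its own list.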

27

TSST Results 2/2

Favorable overall

Task and rater fit to the model

Task separation into 2 groups is intentional (intermediate- and advanced-level tasks)

High % of underfitting test takers

Appropriate rating scale function

Require further analysis

Toward reforms in Japanese university entrance exams

Recent use of multi-faceted Rasch analysis in L2 speaking tests

Should be expanded to other types of tests

Routine analysis and reporting to the public will benefit both test developers and test users.

Need for cooperation with content experts and measurement professionals

28

Acknowledgments

I thank Dr. Zhang and the organizing committee members for inviting me to PROMS 2016.

This work was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI, Grant-in-Aid for Scientific Research (C), Grant Number 26370737.

29

References

Barkaoui, K. (2013). Multifaceted Rasch analysis for test evaluation. In A. Kunnan (Ed.), The companion to language assessment (Vol. III: Evaluation, Methodology, and Interdisciplinary Themes, Part 10: Quantitative analysis, pp. 1301–1322). West Sussex, UK: John Wiley & Sons. doi:10.1002/9781118411360.wbcla070

Bond, T. G., & Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in the human sciences (3rd ed.). New York, NY: Routledge.

Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (Eds.). (2008). Building a validity argument for the Test of English as a Foreign Language™. New York, NY: Routledge.

Eckes, T. (2011). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments. Frankfurt am Main, Germany: Peter Lang.

Engelhard, G., Jr. (2013). Invariant measurement: Using Rasch models in the social, behavioral, and health sciences. New York, NY: Routledge.

Fulcher, G. (2003). Testing second language speaking. Essex, U.K.: Pearson Education Limited.

Kuramoto, N., & Koizumi, R. (2016). Current issues in large-scale educational assessment in Japan: Focus on national assessment of academic ability and university entrance. Unpublished manuscript. Submitted for a journal.

Koizumi, R. (2016). Validity argument for TSST. Unpublished manuscript.

Koizumi, R., Okabe, Y., & Kashimada, Y. (2016, August). Rater reliability in GTEC CBT speaking section: Using multi-faceted Rasch analysis. Paper presented at the 32 Japan Society of English Language Education (JASELE), Saitama 2016. Dokkyo University, Saitama.

31

McNamara, T. (1996). Measuring second language performance. Essex, U.K.: Addison Wesley Longman Limited.

McNamara, T., & Knoch, U. (2012). The Rasch wars: The emergence of Rasch measurement in language testing. Language Testing, 29, 555–576. doi:10.1177/0265532211430367

MEXT (Ministry of Education, Culture, Sports, Science & Technology). (2014). [On integrated reforms in high school and university education and university entrance examination aimed at realizing a high school and university articulation system appropriate for a new era―Creating a future for the realization of the dreams and goals of all young people: Report]. Retrieved from http://www.mext.go.jp/b_menu/shingi/chukyo/chukyo0/toushin/1354191.htm (for the abbreviated version in English, see http://www.mext.go.jp/english/topics/1356088.htm)

32

Mizohata, Y. (2016). Nyuushi to supiikingu no hyouka [Entrance examinations and speaking assessment]. In E. Izumi & S. Kadota (Eds.), Eigo Supiikingu shidou handobukku [Handbook of teaching English speaking] (pp. 282-287). Tokyo: Taishukan.

Nakatsuhara, F. (2014). A research report on the development of the Test of English for Academic Purposes (TEAP) speaking test for Japanese university entrants—Study 1 & Study 2. Retrieved from http://www.eiken.or.jp/teap/group/pdf/teap_speaking_report1.pdf

Yomiuri Shimbun Kyouikubu. (2016). Daigaku nyuushi kaikaku [University entrance examination reforms: Reports from Japan and abroad]. Tokyo: Chuokoron-Shinsha.

33
