
DOCUMENT RESUME

ED 400 713    FL 024 200

AUTHOR          Brown, James Dean, Ed.; Yamashita, Sayoko Okada, Ed.
TITLE           Language Testing in Japan.
INSTITUTION     Japan Association for Language Teaching, Tokyo.
REPORT NO       ISBN-4-9900370-1-6
PUB DATE        95
NOTE            193p.
AVAILABLE FROM  JALT Central Office, Urban Edge Bldg., 5th Floor, 1-37-9 Taito, Taito-ku, Tokyo 110, Japan.
PUB TYPE        Reports - Descriptive (141); Guides - Classroom Use - Teaching Guides (For Teacher) (052); Collected Works - General (020)
EDRS PRICE      MF01/PC08 Plus Postage.
DESCRIPTORS     Behavioral Objectives; Classroom Techniques; Cloze Procedure; College Entrance Examinations; College Freshmen; Comparative Analysis; Criterion Referenced Tests; Decision Making; English (Second Language); Foreign Countries; Higher Education; Industry; Language Proficiency; *Language Tests; Nonverbal Communication; Norm Referenced Tests; Oral Language; Program Development; Pronunciation Instruction; Second Language Instruction; *Second Languages; Standardized Tests; *Test Construction; *Test Use; Test Validity; *Verbal Tests
IDENTIFIERS     ACTFL Oral Proficiency Interview; *Japan; *Oral Proficiency Testing; Teaching to the Test

ABSTRACT
Papers on second language testing in Japan include: "Differences Between Norm-Referenced and Criterion-Referenced Tests" (James Dean Brown); "Criterion-Referenced Test Construction and Evaluation" (Dale T. Griffee); "Behavioral Learning Objectives as an Evaluation Tool" (Judith A. Johnson); "Developing Norm-Referenced Tests for Program-Level Decision-Making" (Brown); "Monitoring Student Placement: A Test-Retest Comparison" (Sayoko Okada Yamashita); "Evaluating Young EFL Learners: Problems and Solutions" (R. Michael Bostwick); "Good and Bad Uses of TOEIC by Japanese Companies" (Marshall Childs); "A Comparison of TOEFL and TOEIC" (Susan Gilfert); "English Language Entrance Examinations at Japanese Universities: 1993 and 1994" (Brown, Yamashita); "Exploiting Washback from Standardized Tests" (Shaun Gates); "Testing Oral Ability: ILR and ACTFL Oral Proficiency Interviews" (Hiroto Nagata); "The SPEAK Test of Oral Proficiency: A Case Study of Incoming Freshmen" (Shawn Clankie); "Making Speaking Tests Valid: Practical Considerations in a Classroom Setting" (Yuji Nakamura); "Cooperative Assessment: Negotiating a Spoken-English Grading Scheme with Japanese University Students" (Jeanette McLean); "Assessing the Unsaid: The Development of Tests of Nonverbal Ability" (Nicholas O. Jungheim); "Cloze Testing Options for the Classroom" (Cecilia B. Ikeguchi); and "The Validity of Written Pronunciation Questions: Focus on Phoneme Discrimination" (Shin'ichi Inoi). (MSE)

U.S. DEPARTMENT OF EDUCATION
Office of Educational Research and Improvement
EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC)

This document has been reproduced as received from the person or organization originating it. Minor changes have been made to improve reproduction quality. Points of view or opinions stated in this document do not necessarily represent official OERI position or policy.

PERMISSION TO REPRODUCE AND DISSEMINATE THIS MATERIAL HAS BEEN GRANTED TO THE EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC).

[Scanned cover: Language Testing in Japan]


COMING SOON!

On JALT 95: Curriculum and Evaluation
Proceedings of the 21st JALT International Conference on Language Learning/Teaching

So you wanted to come to the 1995 JALT International Conference, but for some reason you just couldn't make it. What happened there? What did you miss that's professionally important to you? How can you find out?

In the past you couldn't get the answers to these questions, except through word-of-mouth or less than satisfying presentation reviews that would appear in JALT's monthly magazine, The Language Teacher.

This year that's going to change. The answers (at least some of them) will be in On JALT 95: Curriculum and Evaluation, the proceedings of the 21st JALT International Conference on Language Teaching/Learning. Finally, a book that addresses the events and concerns of the members of Asia's largest language teaching professional organization, the Japan Association for Language Teaching. On JALT 95 will feature the plenary addresses and speeches of the Conference's invited guests, in addition to articles based on the broad spectrum of colloquia, roundtable discussions, and presentations made at the 1995 conference in Nagoya.

This volume is a "must read" for anyone interested in language teaching and teaching research in Japan andAsia in general. It is the most comprehensive collection of articles in English that has appeared on FLT in Japanin two decades, and it will be sent to all Conference attendees as part of their registration package. If youweren't an attendee, here's how to order a copy:

Price*: (in Japan) ¥2500. Orders should be sent by postal transfer (yūbin furikae) using the form at the back of TLT, or a regular postal transfer form bearing the account #YOKOHAMA-9-70903-JALT. In the message area be sure to write "JALT 95 Conference Proceedings." (International) ¥3500. Checks or international postal money orders must be made out in the proper yen (¥) amount. Sorry, no credit card orders accepted. Orders should be sent to:

JALT 95 Proceedings, JALT Central Office
2-32-10-301 Nishi-Nippori, Arakawa-ku, Tokyo 116

On JALT 95: Curriculum and Evaluation ISBN: 4-9900370-1-6
Published by the Japan Association for Language Teaching

*NB: Remember, if you attended the JALT 95 Conference, you do not need to order a copy.


JALT Applied Materials

Language Testing in Japan

JAMES DEAN BROWN & SAYOKO OICADA YAMASHITA, EDITORS

DALE T. GRIFFEE, SERIES EDITOR

GENE VAN TROYER, SUPERVISING EDITOR

The Japan Association for Language Teaching

Tokyo, Japan



JALT Applied Materials

A series sponsored by the Publications Board of the Japan Association for Language Teaching.

JALT President: David McMurray
Publications Board Chair: Gene van Troyer

Cover design and graphics: Junji Maeda, Press Associates Nagoya
Design and typography: The Word Works, Ltd., Yokohama

Printing and binding: Koshinsha K.K., Osaka

LANGUAGE TESTING IN JAPAN

Copyright © 1995 by the Japan Association for Language Teaching, James Dean Brown, and Sayoko Yamashita.

All rights reserved. Printed in Japan. No part of this book may be used or reproduced in any form whatsoever without written permission of the editors, except in cases of brief quotations embodied in scholarly articles and reviews. For information address JALT, 2-32-10-301 Nishi-Nippori, Arakawa-ku, Tokyo 116, Japan.

Cataloging Data

Brown, James Dean, & Yamashita, Sayoko Okada (eds.).Language testing in Japan.

Bibliography: p.1. Applied linguisticsLanguage testingSecond language teaming. I. Title.

1995ISBN 4-9900370-0-6


Contents

Foreword
Dale T. Griffee, Series Editor

About the Contributors 1

1. Introduction to Language Testing in Japan 5
James Dean Brown & Sayoko Okada Yamashita

SECTION I: CLASSROOM TESTING STRATEGIES

2. Differences between norm-referenced and criterion-referenced tests 12
James Dean Brown

3. Criterion-referenced test construction and evaluation 20
Dale T. Griffee

4. Behavioral learning objectives as an evaluation tool 29
Judith A. Johnson

SECTION II: PROGRAM-LEVEL TESTING STRATEGIES

5. Developing norm-referenced tests for program-level decision making 40
James Dean Brown

6. Monitoring student placement: A test-retest comparison 48
Sayoko Okada Yamashita

7. Evaluating young EFL learners: Problems and solutions 57
R. Michael Bostwick

SECTION III: STANDARDIZED TESTING

8. Good and bad uses of TOEIC by Japanese companies 66
Marshall Childs

9. A comparison of TOEFL and TOEIC 76
Susan Gilfert

10. English language entrance examinations at Japanese universities: 1993 and 1994 86
James Dean Brown & Sayoko Okada Yamashita

11. Exploiting washback from standardized tests 101
Shaun Gates

SECTION IV: ORAL PROFICIENCY TESTING

12. Testing oral ability: ILR and ACTFL oral proficiency interviews 108
Hiroto Nagata

13. The SPEAK test of oral proficiency: A case study of incoming freshmen 119
Shawn Clankie

14. Making speaking tests valid: Practical considerations in a classroom setting 126
Yuji Nakamura


SECTION V: INNOVATIVE TESTING

15. Cooperative assessment: Negotiating a spoken-English grading scheme with Japanese university students 136
Jeanette McLean

16. Assessing the unsaid: The development of tests of nonverbal ability 149
Nicholas O. Jungheim

17. Cloze testing options for the classroom 166
Cecilia B. Ikeguchi

18. The validity of written pronunciation questions: Focus on phoneme discrimination 179
Shin'ichi Inoi



Foreword

About JALT Applied Materials

Welcome to the JALT Applied Materials (JAM) series, of which Language Testing in Japan is the inaugural (and pilot) project. The concept behind JAM is that each proposed volume is a collection of papers focused on a single theme, written by classroom teachers, some of whom are experienced writers and researchers and some of whom are just beginning. It is a concept that has been received with widespread, even enthusiastic approval from language teaching professionals in Japan, and we believe that it is sure to stimulate much fruitful discussion and continuing research in years to come. We are especially pleased to be able to bring this collection out in conjunction with JALT 95: the 21st International Conference on Language Teaching/Learning and Educational Materials Exposition.

The Situation

First, a few words about the reasons for embarking on this kind of publishing project. At present, books and collections of papers on various subjects are written and published by authors from various parts of the world, mainly from English-speaking countries such as Australia, Canada, New Zealand, the U.K., and the U.S.A. The authors of these books are typically university professors who teach courses in graduate programs and speak at conferences around the world, including JALT's annual international conference. There exists, however, a gap between the international level at which these books are generated and written and the local teacher community in Japan and Asia. With rare exceptions, these books and collections, while of general professional value, bear little relevance to the L2 context in either Japan or Asia. We propose this new series of collected papers to help fill this gap, and JALT, as the largest L2 professional organization in Asia, is uniquely positioned to help bring this about.

Purpose

The JAM series is targeted at a) improving the quality of research and academic writing in Japan and Asia; b) publishing collections of articles on subjects of direct interest to classroom teachers which are theoretically grounded, reader-friendly, and classroom-oriented; c) giving Japan- and Asia-based classroom teachers a publication outlet not heretofore available; and d) helping teachers around the world implement new ideas in their classrooms.

How JAM Works

We feel that JALT has reached a level of professional development as an organization in which many of its members are capable of writing creative and professional papers. Research and writing, however, are not easy tasks, and we believe that more JALT members would be capable of producing quality papers if they had comprehensive and constructive editorial support and guidance. To this end, the JAM project is guided by an editorial network composed of a general series editor, one or more country-based "local" editors, and possibly an international editor for each collection of papers, all of whom work with the authors contributing to a specific collection, supported by the editorial expertise of JALT Publications' other editors.

Any person who is a resident of Japan, a member of JALT, and primarily a classroom teacher may request to be a JAM editor. Persons making such a request will be asked to submit a proposal stating the theme or area of focus for the collection, along with a CV and publications relevant to their subject. These Japan-based local editors could, if they desired, ask an international expert in the field (a "global" editor, for the sake of discussion) to assist them. After approval by the JAM committee and JALT Publications Board of the theme and editors, the next step would be for the local editor(s) to issue a call for abstracts. Deadlines for receipt of first and final drafts would be set by the local and series editors, and manuscripts must be approved by all editors: first by the local editor, then the global editor, and finally by the series editor. JAM collections are vetted through peer review, meaning that publication in one of them is a personal as well as professional achievement.

Contributing authors to a JAM collection are expected to be classroom teachers in Japan or a country with similar teaching conditions (e.g., Korea, China, Taiwan, or Thailand), to be familiar with JALT and its mission, and to submit an original article which has not been published in or submitted to another publication. The JAM editors welcome contributions from new as well as experienced writers, and contributors need not have published previously in the theme area.

Conclusion: A Unique Opportunity

JALT is a professional organization of classroom teachers who occupy the niche between new developments and traditional classroom practices. As such, we are historically positioned to interact creatively with theory, research, and the daily reality of the language classroom. We firmly believe that what we have to report will be of significant value to teachers throughout Asia and around the world. We are confident that you will concur.

Dale T. Griffee
Series Editor

N.B.: Persons who desire further information about JAM should contact the current series editor, Dale T. Griffee, at: Koruteju #601, 1452 Oazasuna, Omiya-shi, Saitama-ken 330, Japan.



About the Contributors

Mike Bostwick has lived in Japan for over 10 years and is a doctoral candidate in the TESOL Program at Temple University Japan. He is also director of the English Immersion Program at Katoh Gakuen, a private Japanese school in Numazu, Shizuoka. Students in the program receive 50% or more of their regular school instruction in the English language on a daily basis. The program begins in preschool at three years of age and is expected to continue through high school when fully implemented. Throughout the program, Japanese children acquire English proficiency naturally as they study the regular concepts and skills required in the Japanese Ministry of Education curriculum.

James Dean ("JD") Brown, Professor on the graduate faculty of the Department of ESL at the Uni-versity of Hawaii at Manoa, specializes in language testing, curriculum design, program evalua-tion, and research methods. He has taught in France, the People's Republic of China, SaudiArabia, Japan, Brazil, and the United States. He has served on the editorial boards of the TESOLQuarterly and JALTJournal, as well as on the TOEFL Research Committee, TESOL Advisory Com-mittee on Research, and Executive Board of TESOL. In addition to numerous journal articles andbook chapters, he has published three books entitled: Understanding Research in Second Lan-guage Learning: A teacher's guide to statistics and research design (Cambridge, 1988); The Ele-ments of Language Curriculum: A systematic approach to program development (Heinle & Heinle,1995); and Testing in Language Programs (Prentice-Hall, 1995).

Marshall R. Childs is a lecturer at Fuji Phoenix College in Gotemba, Japan, and a program manager in the international department of Katoh Schools in Numazu, Japan. He is a consultant in intercultural and language education. Mr. Childs was a market research manager for IBM in the United States until 1985. He then worked for five years in IBM's Asia Pacific headquarters in Tokyo. Since 1990, he has been a teacher, curriculum writer, and consultant in Japan. Mr. Childs holds an A.B. from Harvard and an M.B.A. from New York University. He is currently a doctoral candidate in the TESOL Program at Temple University Japan. His research interests include psycholinguistic development and the integration of testing and curriculum.

Shawn M. Clankie is an American-born EFL instructor, originally from Rockton, Illinois. He holds a B.A. in French and an M.A. in EFL from Southern Illinois University, and was also educated at l'Université de Savoie in France and l'Université Catholique de Louvain-la-Neuve in Belgium. Teaching has taken him to the intensive ESL program at St. Johnsbury Academy in Vermont, USA, and then on to two years as a visiting lecturer at Kansai Gaidai University in Osaka. An avid writer, traveller, and language learner, he has research interests that include TESOL, historical linguistics, and Japanese linguistics. His articles have appeared in publications such as The Language Teacher and TESL Reporter, and his current research is on the internal and external factors of lexical change in pre- to post-World War II Japanese. In 1994, he moved to England to begin work towards a Ph.D. in Linguistics and is currently an M.Phil. candidate in theoretical linguistics at the University of Cambridge, England.




Shaun Gates received his M.Sc. degree in Applied Linguistics from Edinburgh University. He is currently lecturing in English conversation and composition at Shiga Women's Junior College in Shiga Prefecture, Japan. He is also an oral examiner for the University of Cambridge Local Examinations Syndicate (UCLES) and an oral and writing examiner for the International English Language Testing System (IELTS).

Susan Gilfert received her M.A. degree from Ohio University and is currently lecturing at Nagoya University, Aichi Prefectural University, and Nagoya University of Foreign Studies. She also consults with Kawai Juku in Nagoya. She is co-author of TOEIC Strategies, a test-preparation text to be published by Macmillan Japan in fall 1995. She has written other texts, including books on TOEFL preparation, study-abroad preparation, a global issues discussion text, and a film appreciation demonstration text for Kawai Juku. She has studied and presented papers on comparative standardized testing at JALT and TESOL conferences, and is interested in learning more about other standardized tests (such as the STEP series of tests) in Japan.

Dale T. Griffee, Assistant Professor at Seigakuin University, has taught in Japan since 1976 at commercial English language conversation schools, company in-house programs, junior colleges, intensive language programs, and universities. He graduated from Baylor University and holds an M.A. in TESL from the School for International Training, Brattleboro, Vermont. His research interests are SLA and the effects of instruction. He is especially interested in testing, interlanguage pragmatics, and the learner-centered classroom as an alternative to the traditional lecture-format, teacher-fronted classroom. He is author of three textbooks: Listen and Act (1982, Lingual House), HearSay (1986, Addison-Wesley), and More HearSay (1992, Addison-Wesley). His most recent publication is Songs in Action (1992, Prentice-Hall). He has held several offices in JALT, including chapter president of the Sendai Chapter and the West Tokyo Chapter, and has served on the Long-Range Planning Committee and the Nominations and Elections Committee. He is general editor of the new JALT Applied Materials (JAM) series.

Cecilia B. Ikeguchi is a lecturer in the English Language Department at Dokkyo University and Tsukuba University in Japan. She obtained an M.A. degree in the teaching of English and a Ph.D. degree in Education from the University of Tsukuba, a national university in Japan. She has been a teacher of English as a second and foreign language for ten years. Her research interests include ESL testing and bilingualism, as well as comparative education and educational systems.

Shin'ichi Inoi was born and brought up in Aizu-wakamatsu-shi, Fukushima Prefecture. He received his Bachelor of Education degree from Fukushima University in 1979 and taught English at several high schools in Chiba and Fukushima Prefectures for more than 10 years. He received a Master of Education degree from Fukushima University in 1991 and has been a lecturer in the Literature Department of Ohu University in Koriyama since 1993, where he teaches phonetics, grammar, and linguistics. He is a member of JALT, JACET, and the Tohoku English Language Education Society. His research interests are in second language acquisition.

Judith Johnson, Associate Professor of Education at Kyushu Institute of Technology in Japan, has taught ESL/EFL, intercultural communication, linguistics, and teacher education courses, as well as conducted teacher-training workshops in the Americas and Asia. Her interest areas are teacher training, curriculum design, materials development, and global education.




Nicholas O. Jungheim, Associate Professor at the Center for Languages and Cultural Exchange of Ryutsu Keizai University in Ibaraki Prefecture, is a graduate of the TESOL Program at Temple University Japan. His research interests include issues in classroom-oriented language acquisition, language learners' use of nonverbal communication, and language testing. He has presented his research on nonverbal communication at TESOL, AERA, and JALT conferences.

Jeanette McLean has been involved since the beginning of her career in various facets of education and has lived and worked in Europe, Africa, and Asia. With an educational background based in EFL, personnel management, and training, she has had the opportunity to teach at primary, secondary, and adult levels. In both training and management capacities, she has also been involved in and instigated development projects to assist and encourage educationally disadvantaged learners. As a result of this experience, she developed an interest in intercultural communication. Her subsequent research at Tokyo Denki University, where she taught for the last three years, focused on spoken English, the assessment of spoken English, and difficulties involved in oral interaction.

Hiroto Nagata has worked for over a decade as a language program coordinator at such major Japanese companies as Marubeni, Nissan, and Panasonic, where he has conducted a series of language proficiency interviews. He has published a number of articles on language teaching, and a variety of books and dictionaries for both children and adult learners of English. He currently teaches at Meikai University and Panasonic Tokyo, and is a doctoral candidate in the TESOL Program at Temple University Japan, where he also received his M.Ed. in TESOL.

Yuji Nakamura, Associate Professor of English at Tokyo Keizai University, teaches courses in language testing and EFL. He received his Ph.D. in Applied Linguistics (specializing in Language Testing) in 1994 from International Christian University in Tokyo. His research interests include large-scale assessment of the oral proficiency of Japanese students, application of item response theory to listening comprehension tests, and detailed study of validity.

Sayoko Okada Yamashita is currently teaching Japanese as a second language in the Division of Languages at International Christian University and the teaching practicum for the Labo International Foundation. She has also been developing audio-video materials for language teaching as a committee member of the National Language Research Institute. She has published a textbook for Japanese language teaching and articles on Japanese language teaching and Applied Linguistics. She received her Master of Education degree in Curriculum and Instruction at the University of Houston, Texas. She is now a doctoral candidate in the TESOL Program at Temple University Japan in Tokyo. Currently her primary research interest is in the testing of pragmatics.



Chapter 1

Introduction

JAMES DEAN BROWN

SAYOKO OKADA YAMASHITA

The idea for a book focused on language testing issues in Japan took shape several years ago. It was originally suggested by Dale Griffee and then fleshed out as an idea and proposal by J. D. Brown (from his position at the University of Hawaii at Manoa in Honolulu). We knew that there were many books written about language testing as well as a number of edited collections of articles, but to our knowledge, no book had ever been dedicated specifically to the issues involved in language testing in Japan. The project became truly international when Sayoko Yamashita was invited to participate at the logistical and editorial levels from her position at International Christian University in Tokyo.

Given that JALT was not in the habit of publishing books on professional issues in language teaching, considerable debate accompanied JALT's eventual approval of the idea of producing this book. Nevertheless, once the idea was given a green light, the production of Language Testing in Japan was taken up enthusiastically by JALT. The call for papers went out through JALT's publications, JALT Journal and The Language Teacher, and a wide variety of interesting papers were received. In the fall of 1994, the papers in this book were selected, edited, and sent back to the authors for revisions. The contributors' final versions were submitted in winter 1995. Since then, the papers have been edited several more times in the process of producing this book.

From our perspective, the purpose of the book is to stimulate discussion about the testing that is done in the many language programs and classrooms throughout Japan. Naturally, testing serves many purposes in Japan, including making decisions about students' language proficiency and placement, as well as diagnosing their strengths and weaknesses, monitoring their progress, and measuring their achievement.

To help explore all of those uses of tests, the book offers chapters on classroom testing strategies, program-level testing strategies, standardized testing, oral proficiency testing, and innovative testing. To give readers an overview of the book, we will use the remainder of this chapter to summarize the other chapters. These summaries are based on the abstracts written by the authors themselves (please note that all citations will be found in the references of the respective chapters).

Classroom Testing Strategies

You are now reading Chapter 1, the introduction to the book. Next, you will find a section on classroom testing strategies, which includes Chapters 2, 3, and 4.

Chapter 2, entitled "Differences betweennorm-referenced and criterion-referenced tests"by James Dean Brown, addresses the distinc-tion between norm-referenced tests (NRTs) andcriterion-referenced tests (CRTs). He begins byexamining differences in their test characteris-tics, including differences in: (a) the underlyingtesting purposes, (b) the types of decisions thatare made with each test type, (c) the levels of

13LANGUAGE TESTING IN JAPAN 5

Page 14: FL 024 200 AUTHOR Ed.Brown, James Dean, Ed.; Yamashita, Sayoko Okada, Ed. Language Testing in Japan. Japan Association for Language Teaching, Tokyo. ISBN-4-9900370-1-6 95 193p. JALT

BROWN & YAMASHITA

generality of the material contained in the tests,(d) the students' expectations when they takethe tests, (e) how scores are interpreted, and(1) how scores are reported to the students.Then, logistical differences between the twotypes of tests are discussed, including groupsize, ranges of ability, test length, time alloca-tion, and cost. Finally, the author argues that:(a) the testing that teachers are doing in theirclasses is at least a good start, (b) scores onclassroom CRTs may not necessarily be nor-mally distributed, (c) certain testing respon-sibilities ought to rest with teachers and otherswith administrators, (d) CRTs and NRTs shouldbe developed in different ways, and (e) NRTscannot be expected to do all things well.

Chapter 3, entitled "Criterion-referencedtest construction and evaluation" by Dale T.Griffee, introduces the subject of classroomtest construction and evaluation to secondlanguage teachers, especially those teachersnot familiar with test analysis procedures.The differences between CRTs and NRTs aretouched on. The development and adminis-tration of pretests and posttests are explained.The subjects were 43 students in two classesat Seigakuin University. The chapter reportsthe results of those two test administrations.Descriptive statistics are presented and theItem Facility (IF) and Difference Index (DI)statistics are given for the first 15 items. Itemanalysis is discussed in some detail. Thechapter concludes with a discussion of theimplications of these classroom test construc-tion procedures and the importance of institu-tional learning objectives.

Chapter 4, entitled "Behavioral learningobjectives as an evaluation tool" by Judith A.Johnson, argues that one important key toenhancing student learning is to set andachieve meaningful and unambiguous in-structional objectives. Stating objectives asspecific, observable, and measurable learningoutcomes that must be demonstrated by thestudent is essential to the instructional pro-cess: teaching/learning and evaluation.Bloom's taxonomy (1956) of observable stu-dent behaviors provides a practical and flex-ible framework within which (a) teachers candetermine their instructional objectives, meth-

6 JALT APPLIED MATERIALS

ods, and materials, and (b) teachers and stu-dents can measure student achievement dur-ing both the teaching /learning and evaluationphases of the instructional process, enablingthem to determine which skills each studenthas mastered and which skills still need to behoned in order to fulfill the requirements ofeach objective. This chapter not only demon-strates how to write behavioral learning ob-jectives, but also how to incorporate theminto the instructional process as measurementtools in order to improve student learning.

Program-Level Testing Strategies

The next section of the book, consisting of Chapters 5, 6, and 7, groups three papers on program-level testing strategies.

Chapter 5, entitled "Developing norm-referenced language tests for program-leveldecision making" by James Dean Brown,describes the basic procedures involved indeveloping norm-referenced tests. Administra-tors and teachers can benefit in a number ofways from developing effective norm-refer-enced tests that can be used for aptitude,proficiency, and placement decisions. To helplanguage professionals do so, this chapteraddresses the following issues: (a) What arenorm-referenced tests, and what are theyused for? (b) What do administrators need intheir norm-referenced tests? And, (c) howshould norm-referenced tests be developedand improved?

This chapter also argues that norm-referenced tests must be accorded a much more important place in the administrator's program development plans. Strategies are suggested for finding or training staff members in this important area of curriculum development so that the important decisions that need to be made with norm-referenced tests can be made efficiently, effectively, and responsibly.

Chapter 6, "Monitoring student placement:A test-retest comparison" by Sayoko OkadaYamashita, investigates potential proficiencydifferences between subjects in the Fall andSpring terms of the Japanese Language Pro-gram (JLP) at International Christian Univer-sity, along with possible reasons for such

14

Page 15: FL 024 200 AUTHOR Ed.Brown, James Dean, Ed.; Yamashita, Sayoko Okada, Ed. Language Testing in Japan. Japan Association for Language Teaching, Tokyo. ISBN-4-9900370-1-6 95 193p. JALT

differences. The study is based on three sub-test scores of the JLP placement test given inFall as pretests and the same subtests given inSpring as posttests. Three research questionsare addressed: (a) Do continuing students inthe spring courses perform the same as thosewho had been placed into the equivalentcourses in the Fall? (b) If there are differencesin performance, are the differences observedat all levels or in some particular levels only?(c) Are there any differences in student per-formance on the different subtests? The studyreveals that there are differences between theFall and Spring populations. Possible reasonsof this phenomenon are discussed and peda-gogical implications are given.

Chapter 7, entitled "Evaluating young EFLlearners: problems and solutions" by R.Michael Bostwick, argues that, althoughsome types of paper and pencil tests may beappropriate for assessing language skills inolder EFL students, they are not likely to bevery suitable for younger EFL learners.Teachers who work with younger childrenneed alternative assessment procedures thatcan inform instruction and provide teachersand students with feedback on the effective-ness of classroom teaching and learning.Before alternative forms of foreign languageassessment can be employed, a number ofissues must be considered so that the assess-ment procedures that are developed arealigned with the goals and objectives of theprogram, as well as with the actual instruc-tional practices. This chapter examines sev-eral common problems foreign languageprograms for children face as they begin toconsider ways to evaluate student progressand the effectiveness of their instruction.Solutions to these problems are discussedand two examples of assessment instrumentsare provided in order to help other programsbegin to think about and develop their ownassessment procedures.

Standardized Testing

This next section of the book includes Chapters 8 to 11. Chapter 8, by Marshall Childs and titled "Good and bad uses of TOEIC by Japanese companies," examines the Test of English for International Communication (TOEIC), which enjoys wide popularity among those Japanese companies that encourage their employees to learn English. One such company uses TOEIC scores to gauge the learning achievement of groups, to measure the effectiveness of different schools or treatments, and to guide individuals in choosing curricula. To what extent are these wise uses of TOEIC? A study of the scores of 113 learners on four successive tests during a course of English instruction in this company yields some answers. With proper care, TOEIC can be used to measure the learning gains of groups. For individuals, however, the variability of scores makes gauging learning uncertain, and the absence of achievement measures makes guiding learner curricula difficult. Advice is offered to administrators who may want to use TOEIC for gauging learning.

Chapter 9, "A comparison of TOEFL andTOEIC" by Susan Gilfert, begins by pointingout that the TOEFL is a well-known multiple-choice measure of an examinee's receptiveEnglish skills and is considered a reasonablygood predictor of the examinee's productiveskills. The general register of the TOEFL isacademic English. The TOEIC is a lesser-known multiple-choice measure of anexaminee's English skills in the general regis-ter of real-life, business situations. The TOEFLis created, produced, and sold by the Educa-tional Testing Service (ETS) in Princeton, NewJersey. The TOEIC was originally created byETS, but is now entirely owned and operatedby the Japanese TOEIC office in Tokyo. Thischapter gives a short history of the two tests,the TOEFL coming from American academicrequests of the 1960s and the TOEIC comingfrom a Japanese Ministry of InternationalTrade and Industry request in the middle1970s. While both are multiple-choice, norm-referenced tests, the content of each testdiffers, as does the register (academic versusbusiness English). Both tests measure Englishskills, but the purpose of the TOEIC is toprovide "highly valid and reliable direct mea-sures of real-life reading and listening skills"

1_5LANGUAGE TESTING IN JAPAN 7

Page 16: FL 024 200 AUTHOR Ed.Brown, James Dean, Ed.; Yamashita, Sayoko Okada, Ed. Language Testing in Japan. Japan Association for Language Teaching, Tokyo. ISBN-4-9900370-1-6 95 193p. JALT

BROWN & YAMASHITA

(Woodford, 1982, p. 64). The chapter exam-ines the skills tested by the TOEFL and theskills tested by TOEIC, and reports the appar-ent similarities and differences in both tests,as well differences in how they.are scoredand used.

Chapter 10, entitled "English languageentrance examinations at Japanese universi-ties: 1993 and 1994," by James Dean Brownand Sayoko Okada Yamashita, defines andinvestigates the various types of test itemsused in the English language entrance exami-nations administered at various Japaneseuniversities in 1994 and compares these 1994results to a previous study of the 1993 exami-nations. Ten examinations were selected fromprivate universities and 10 from public universi-ties along with the nationwide "center"examination. These twenty-one examinationswere studied to find answers to the followingquestions: (a) How difficult are the variousreading passages used in the 1994 universityEnglish language entrance examinations?(b) Are there differences in the levels of read-ing passage difficulty in private and publicuniversity examinations in 1993 and 1994?(c) What types of items are used on the 1994English language entrance examinations, howvaried are they, and how does test length vary?(d) Are there differences in the types of itemsand test lengths found in private and publicuniversity examinations in 1993 and 1994?(e) What skills are measured on the 1993 and1994 English language entrance examinations?

To answer these questions, the examina-tions were analyzed item-by-item by a nativespeaker of English and a native speaker ofJapanese. Then computer software was usedto further analyze the level of difficulty of thereading and listening passages. The resultsshould help English teachers in Japan to pre-pare their students for taking such tests andto help their students in deciding on whichtests to take. Equally important, this studymay continue to provide some professionalpressure on those responsible for creatingentrance examinations to do the best they canto create high quality tests.
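The chapter does not specify which software or readability measure was used; purely as an illustration of the kind of passage-difficulty estimate such programs report, here is a sketch using the Flesch Reading Ease formula with a deliberately crude syllable counter. The sample passage is invented.

# Illustrative sketch: Flesch Reading Ease as one example of an automated
# passage-difficulty estimate (not necessarily the measure used in the study).
# Higher scores mean easier text. The syllable counter is deliberately crude.
import re

def count_syllables(word):
    """Rough syllable estimate: count groups of adjacent vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

passage = ("Entrance examinations exert a strong influence on how English "
           "is taught and studied in Japanese secondary schools.")
print(f"Flesch Reading Ease: {flesch_reading_ease(passage):.1f}")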

Chapter 11 is "Exploiting washback fromstandardized tests," by Shaun Gates. It ex-

8 JALT APPLIED MATERIALS 16

plores the washback effect, i.e., the influenceof testing on teaching and learning. If teach-ers and students know the format and contentof a test, they will tend to alter their class-room behavior to maximize performance onthe test. Whether washback from any particu-lar test is considered negative or positivedepends on one's point of view. In this chap-ter, washback is considered positive if it pro-motes communicative language use in theclassroom. Some of the factors identified ascontributing to washback are particularlystrong when considering standardized tests.Does standardized testing lead to a conflictwith what the teacher wishes to achieve inthe classroom? To answer this question, thewritten section of the IELTS test and the oralsection in the UCLES Preliminary English Test(PET) are examined. The author argues thatteachers preparing their students for thesetests can exploit washback for classroompurposes.

Oral Proficiency Testing

The next section of the book, consisting of Chapters 12, 13, and 14, is about oral proficiency testing, which the Japanese educational system as a whole has only recently begun to consider.

In Chapter 12, "Testing oral ability: ILR andACTFL oral proficiency interviews," HirotoNagata argues that, as language teachers, weare constantly facing the problem of reliablyappraising our students' oral proficiency, andare always looking for some rule-of-thumb,easy-to-handle yardsticks. This chapter pro-vides one such solution. Presented here aretwo interview models (the ILR and ACTFLoral proficiency interviews) with some practi-cal suggestions on how to use them effec-tively not only for assessing students' oralproficiency but also for making decisionsabout (a) the kinds of remedial treatmentsthat should be offered to students and (b)how to successfully add real-world reality toclassroom activities as well.

Chapter 13, "The SPEAK test of oral profi-ciency: A case study of incoming freshmen"by Shawn M. Clankie, provides a detailed

Page 17: FL 024 200 AUTHOR Ed.Brown, James Dean, Ed.; Yamashita, Sayoko Okada, Ed. Language Testing in Japan. Japan Association for Language Teaching, Tokyo. ISBN-4-9900370-1-6 95 193p. JALT

analysis of an attempt to institute the SPEAKtest of oral proficiency, as a measure for as-sessing incoming freshmen, into an existingEFL program at Kansai Gaidai University inOsaka. The SPEAK test, based on retired TSE(Test of Spoken English) tests, contains sixtaped sections. Each student responds to thetaped sections, and their voices are then re-corded on separate cassettes for later evalua-tion. The test was administered to 150students already admitted into the IntensiveEnglish Studies (IES) program on the basis oftheir TOEFL scores alone. This administrationserved as a preliminary run in preparation forcandidates seeking admittance into the fol-lowing year's classes. The TSE was adminis-tered and critiqued in terms of both itspracticality and its ability to be accuratelyapplied to Japanese learners. The chapter alsodescribes the SPEAK test and the IES pro-gram, followed by a discussion of the benefitsand failures of the TSE. In the end, the SPEAKtest was abandoned in favor of more tradi-tional face-to-face methods of oral evaluation.The reasons for this decision are explained.

Chapter 14, entitled "Making speaking testsvalid: Practical considerations in a classroomsetting" by Yuji Nakamura, argues that thecentral problem in foreign-language testing,as in all testing, is validity. Although there areother aspects such as reliability and practical-ity, there is no doubt that validity is the criti-cal element in constructing foreign languagetests. Validity concerns the question "Howmuch of an individual's test performance isdue to the language abilities we want to mea-sure?" (Bachman, 1990). In other words, valid-ity is about how well a test measures what it issupposed to measure (Thrasher, 1984). In thischapter, the author will (a) examine variouskinds of validity in language testing in general,(b) discuss which types of validity are mostsuitable for a test of English speaking ability ina classroom setting, and (c) describe a validitycase study which introduces the types of valid-ity (i.e., construct validity, concurrent validity,face validity, and washback validity) whichhe has used in the development of asemi- direct speaking test suitable for use withJapanese college students. The construct

BROWN & YAMASHITA

validity of the test was examined throughfactor analysis and task correlation analysis;concurrent validity was estimated by thecomparison of the test results and teacher'sestimates; face validity, and washback validitywere examined using a questionnaire andinterviews.

Innovative Testing

The final section of the book, including Chapters 15 to 18, addresses the matter of innovative testing. In Chapter 15, "Cooperative assessment: Negotiating a spoken-English grading scheme with Japanese university students," Jeanette McLean describes how a workable grading scheme for assessing Japanese university freshmen's spoken English performance can be arrived at through negotiation with the students themselves. The setting up of this project is briefly described, and some problems inherent in assessing spoken English in Japanese universities are discussed. The steps taken to develop the grading scheme are described to demonstrate how student participation can be encouraged and how, through the consultation process, ownership of the assessment scheme by learners can be promoted. Benefits to the participants are then considered. The chapter concludes that this approach to testing can be used as an effective teaching tool to increase confidence, build trust, and harness cooperation among learners, teachers, and assessors, and that the development of a grading scheme to which all participants contribute is a vital step towards achieving a valid test of spoken English in the educational context of Japan.

In Chapter 16, "Assessing the unsaid: Thedevelopment of tests of nonverbal ability,"Nicholas 0. Jungheim, presents a rationaleand a framework for testing nonverbal abilityas a part of language learners' communicativecompetence and describes the creation of twotests of nonverbal ability using traditional testconstruction methods. The nonverbal abilityframework, based on the CommunicativeLanguage Ability model (Bachman, 1988,1990), consists of nonverbal textual ability,sociolinguistic ability, and strategic ability.The first test is the Gestest, a test of the non-

7 LANGUAGE TESTING IN JAPAN 9

Page 18: FL 024 200 AUTHOR Ed.Brown, James Dean, Ed.; Yamashita, Sayoko Okada, Ed. Language Testing in Japan. Japan Association for Language Teaching, Tokyo. ISBN-4-9900370-1-6 95 193p. JALT

BROWN & YAMASHITA

verbal sociolinguistic ability to interpret NorthAmerican gestures. The collection of baselinedata, the application of traditional item analy-sis procedures, and the results of a pilot ad-ministration are described. The second testconsists of the Nonverbal Ability Scales, aseries of scales for rating language learners'use of head nods, gaze direction changes,and gestures in conversations. These scalesinvolve nonverbal textual ability, includingfrequency and appropriateness of head nodsand gaze direction changes, and nonverbalstrategic ability, including the appropriatenessand compensatory usage of gestures. Baselinedata-collection procedures, the constructionand interpretation of the scales, and the re-sults of their application are described. Thevalidity and reliability of both tests are dis-cussed. The chapter concludes with a discus-sion of problems related to testing nonverbalability and future prospects.

Chapter 17, by Cecilia B. Ikeguchi and titled "Cloze testing options for the classroom," points out that an apparent dichotomy exists between theoretical research on language testing and actual classroom testing. Specifically, the concept of cloze has often been confined within the walls of empirical research, R, which has often been considered distinctly different from practical language testing situations, labeled r by LoCastro (1991). This chapter brings the results of research on cloze testing into the practical classroom setting by offering a simplified review of empirical studies on four major types of cloze tests and presenting practical classroom applications supported by the research. A chronological discussion of four major types of cloze tests (fixed-rate deletion, rational deletion, multiple-choice cloze, and C-test) is presented, along with their theoretical justifications, the features of each, and their distinct merits and demerits. The accuracy of measuring the specific language traits language teachers want to assess may depend on the kind of cloze procedure that is selected.
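As a concrete illustration of the first of those four formats, a fixed-rate deletion cloze simply removes every nth word from a passage and keeps the deleted words as an answer key. The passage, deletion rate, and function name below are invented for the sketch, not materials from the chapter.

# Illustrative sketch: build a fixed-rate deletion cloze by replacing every
# nth word of a passage with a numbered blank, keeping the deleted words as
# the answer key. Passage and rate are invented for illustration.

def fixed_rate_cloze(passage, rate=7, start=2):
    """Delete every `rate`-th word beginning with word index `start`."""
    words = passage.split()
    blanks = []          # answer key: (blank number, deleted word)
    for i in range(start, len(words), rate):
        blanks.append((len(blanks) + 1, words[i]))
        words[i] = f"({len(blanks)}) ______"
    return " ".join(words), blanks

text = ("Language tests are used in almost every language program "
        "in the world for placement, diagnosis, progress, and achievement.")
cloze_text, key = fixed_rate_cloze(text, rate=5)
print(cloze_text)
print(key)

Rational deletion works the same way except that the teacher chooses which words to delete (for example, only content words or only grammatical items) rather than deleting at a fixed interval.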

Chapter 18, "The validity of written pronun-ciation questions: Focus on phoneme dis-

10 JALT APPLIED MATERIALS

crimination" by Shin'ichi Inoi, examines thevalidity of phoneme discrimination questionsfor Center Exams administered by the Minis-try of Education. To determine the validity ofthese types of questions, a written test of 30phoneme discrimination questions and theiroral versions were given to 60 Japanese col-lege freshmen. One type of written phonemediscrimination question involves the learnersbeing instructed to select from four alternativeresponses the word with the underlined partthat is the same as the word they hear pro-nounced on a tape. The other type of writtenquestion requires the learners to choose theword with the underlined part pronounceddifferently from the other alternatives. On theoral version of the test, learners were askedto pronounce each of the words from thewritten test questions. The learners'pronunciations were recorded onto cassettetapes for later analysis. The data were ana-lyzed in terms of the correlation of scoresbetween the written and oral versions and therate at which answers between the two testsagreed. The correlational analysis showedthat moderately strong relationships existedbetween scores on the written oral tests,indicating, to some degree, the validity of thewritten phoneme discrimination questions.However, the agreement rate analysis alsoraised some doubts about their validity: lowagreement rates prevailed, especially amongthe answers obtained from low-score achiev-ers; on some items, the agreement rate failedto reach 50%.

Conclusion

To summarize briefly, Language Testing in Japan provides three to four chapters each on classroom testing strategies, program-level testing strategies, standardized testing, oral proficiency testing, and innovative testing. We hope that readers will find this collection of papers on language testing in Japan as interesting as we do. More importantly, we hope that readers will find the papers accessible and useful, as well as provocative, and perhaps even inspiring.



Section I

Classroom Testing Strategies



Chapter 2

Differences Between Norm-Referenced and Criterion-Referenced Tests

JAMES DEAN BROWN

UNIVERSITY OF HAWAII AT MANOA

Language testing is an important aspect of the field of language teaching. Hence language teachers in Japan should be informed about the topic, as should language teachers everywhere else. A number of books have been written on how to construct and use language tests (see, for instance, Alderson, Clapham, & Wall, 1995; Bachman, 1990; Baker, 1989; Brown, 1995a; Carroll & Hall, 1985; Harris, 1969; Heaton, 1988; Hughes, 1989; Madsen, 1983; or Valette, 1977), and many other books and articles have been written about theoretical aspects of language testing. Indeed, a journal called Language Testing has flourished for a number of years, and a conference with the imposing name of the Language Testing Research Colloquium occurs every year. More importantly, in one way or another, language tests are used in almost every language program in the world in a variety of different ways, including testing for language aptitude, proficiency, placement, diagnosis, progress, and achievement.

With all these important uses, tests ought to be of interest to almost every language teacher. Yet in my professional travels, especially in Japan, I have found that most teachers are intimidated by tests and maybe even a little frightened of them. Why is that? I believe that one cause is the fact that nobody has ever taken the trouble to clearly explain to most teachers how very useful tests can be to them in their daily work. Hopefully, many of the articles in this book will provide explanations and examples of just how useful tests can be for language teachers in Japan.

As a basis for that process, I will begin in this chapter by explaining the differences between criterion-referenced and norm-referenced language tests. Some teachers might argue that such jargon (norm-referenced and criterion-referenced) is one of the primary problems with language testing. Perhaps that is true, but nonetheless, the jargon exists, and I think that this particular jargon is very important because understanding these particular concepts is crucial if language teachers are to truly understand the variety of useful ways that tests can function in their program decision making and in their individual classes.

I'll address the distinction by answering two central questions: How are criterion-referenced tests fundamentally different from norm-referenced tests? And, why is the distinction important to both language teachers and language program administrators?

Criterion-Referenced vs. Norm-Referenced

Criterion-referenced tests (CRT) and norm-referenced tests (NRT) differ in many ways, and an ever-increasing literature is developing within language testing on the distinction between CRTs and NRTs (Cartier, 1968; Cziko, 1982, 1983; Hudson & Lynch, 1984; Brown, 1984, 1988, 1989, 1990a & b, 1992, 1993, and 1995a & b; and Bachman, 1989, 1990). I have found that the primary differences between CRTs and NRTs can be classified as either differences in test characteristics or differences in the logistics involved. Hence, the following discussion will be organized around those two basic categories.

Table 1. Differences in Characteristics and Logistics for CRTs and NRTs (adapted from Brown, 1992)

                          CRTs                            NRTs

Test Characteristics
  Underlying Purposes     foster learning                 classify/group students
  Types of Decisions      diagnosis, progress,            aptitude, proficiency,
                          achievement                     placement
  Levels of Generality    classroom specific              overall/global
  Students' Expectations  know content to expect          do not know content
  Score Interpretations   percent                         percentile
  Score Report Strategies tests and answers go            only scores go to
                          to students                     students

Logistical Dimensions
  Group Size              relatively small group          large group
  Range of Abilities      relatively homogeneous          wide range of abilities
  Test Length             relatively few questions        large number of questions
  Time Allocated          relatively short time           long (2-4 hrs) administration
  Cost                    teacher time & duplication      test booklets, tapes,
                                                          proctor & a fee


Differences in Test Characteristics

As I have pointed out elsewhere (Brown, 1990b, 1992, & 1995a), six primary test characteristics distinguish criterion-referenced from norm-referenced tests: differences in underlying purpose, in the types of decisions that are made, in the levels of generality, in students' expectations, in score interpretations, and in score report strategies. The upper portion of Table 1 summarizes how CRTs and NRTs differ on these important test characteristics.

Underlying purpose. The basic purpose of criterion-referenced tests is to foster learning. Typically, teachers administer CRTs in order to encourage students to study, review, or practice the material being covered in a course and/or in order to give the students feedback on how well they have learned the material. In contrast, the underlying purpose for norm-referenced tests is usually to spread students' performances out along a continuum of scores so the students can be classified or grouped for admissions or placement purposes. While creating such groupings may be beneficial to learning, the NRTs are not typically designed to test material that is specifically and directly related to a single course or program. Thus NRTs are not directly created to foster learning.

Types of decisions. Some types of testing decisions are best made with criterion-referenced tests. For instance, CRTs are well suited to making diagnostic, progress, and achievement decisions. Such decisions usually involve giving students feedback on their learning: either detailed feedback on an objective-by-objective basis in diagnostic and progress testing, or less detailed feedback in the form of course grades, which are very often based, at least in part, on achievement test scores. Norm-referenced tests are more appropriately used for aptitude, proficiency, and placement decisions. Aptitude and proficiency test scores are typically used for determining who will be admitted to a particular institution or program of study. Later, after students have been admitted to an institution, a placement test may be used for deciding the appropriate level of study within that institution.


Levels of generality. Criterion-referenced tests are usually based on the very specific objectives of a course or program. CRTs are therefore likely to vary in content, form, and function from teacher to teacher and from course to course. In contrast, norm-referenced tests are usually designed to be institution-independent, that is, NRTs are typically used with students from many institutions, or at least many classrooms. Norm-referenced tests must therefore be based on knowledge, skills, or abilities that are common to a number of institutions, programs, or courses.

Students' expectations. Students generally know what to expect on a criterion-referenced test. Indeed, they will ask the teacher what they should study for the test, and the teacher usually ends up playing a kind of game with them: giving a list of the points that have been covered in the course (that is, the course objectives that the teacher wants the students to study, review, or practice) without giving away exactly what will be on the test. But in any case, the students should know from the course objectives what will be covered on a CRT. On norm-referenced tests, the students usually have little idea of what content to expect. They may have some idea of the types of test questions that will be on the test (for instance, multiple-choice, interview, cloze passage, writing task, role-play, etc.), but they will have virtually no idea of the exact content that the test questions will cover. Security is often a very important issue on NRTs.

Score interpretations. Criterion-referenced tests are typically scored in percentage terms. For example, a student who scores 95 percent on a CRT might get an A grade, or 88 percent might earn a student a B+, or 33 percent might be failing. In contrast, norm-referenced test scores are typically reported as "standardized" scores like the CEEB scores reported for the overall Test of English as a Foreign Language (TOEFL) (ETS, 1995), where 500 is the norm average and the scores range from 200 to 677 (see Brown, 1995a for a full explanation of how these standardized scores work). Some NRTs report percentile scores, which are also standardized scores. Percentile scores are generally easier for students to understand. Such NRT scores only have meaning with reference to a specific population of students (i.e., the group of students which served as the norm group). Thus, on an NRT, a student who is in the 91st percentile has scored better than 91 out of 100 of the students in the norm group, but worse than 9 out of 100. On NRTs, the concern is not with how many questions the student answered correctly (or what percent), but rather with how the student scored relative to all of the other students in the norm group (that is, the percentile). Such interpretations are very different from the straightforward percent typically used in thinking about CRT performance.
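Because the percentile interpretation is defined entirely by the norm group, it can be illustrated in a few lines of code. The sketch below is not from the chapter; the function name and the artificial norm group are assumptions for illustration only.

```python
def percentile_rank(raw_score, norm_scores):
    """Percent of the norm group scoring below the given raw score."""
    below = sum(1 for s in norm_scores if s < raw_score)
    return 100.0 * below / len(norm_scores)

# Hypothetical norm group of 100 scores (1, 2, ..., 100).
norm_group = list(range(1, 101))

# A raw score of 92 beats 91 of the 100 norm-group scores: the 91st percentile.
print(percentile_rank(92, norm_group))  # 91.0
```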

Score report strategies. In order to foster learning through criterion-referenced testing, teachers will often return the tests to the students when reporting their scores, and even go over the answers in class so the students can benefit from the experience. Hence the students usually know exactly how many questions they answered correctly and which ones. On norm-referenced tests, students typically do not get their tests back after taking the examination (see Note). In most cases, they are not even told the actual number of questions they answered correctly. Instead, they are given transformations of their actual "raw" scores into "standardized" scores. As a result, students seldom know how many questions they answered correctly on an NRT.

Logistical Differences

As shown in the bottom portion of Table 1, the logistical differences between CRTs and NRTs have to do with differences in group sizes, ranges of abilities, test length, time allocated, and cost.

Group size. Criterion-referenced tests are most often constructed to be used in classroom settings. Thus the group size involved is usually limited to the relatively small (say, 10 to 75 students) groups found in language classrooms. NRTs, on the other hand, are typically constructed to be administered to relatively large groups of students. For instance, the TOEFL has been administered to about a million students worldwide in each of the past three years (plus or minus a couple of hundred thousand). This fact, alone, has certain implications in terms of the test question formats that can be employed, and may account for the general prevalence of multiple-choice questions in norm-referenced tests. After all, multiple-choice questions are relatively easy to score, especially if the scoring can be done by machine.

Range of abilities. In addition, criterion-referenced tests are usually designed for a group of students who have been placed into a particular level of study and are studying exactly the same material. Thus the range of abilities involved in criterion-referenced testing situations is usually relatively narrow. In contrast, norm-referenced tests are typically normed on a population of students with very wide ranges of abilities. For example, the TOEFL is normed on a group with abilities that range from virtually no knowledge of the English language (that is, a student whose score is based entirely on guessing) to high native-speaker ability. Thus the TOEFL scores represent a very wide range of abilities, indeed.

Test length. Furthermore, CRTs tend to contain relatively few test questions because they are designed to test a relatively small body of knowledge or set of skills. In comparison, NRTs tend to be fairly long because of the large body of knowledge or skills that is being assessed. In addition, NRT developers are well aware that making their tests long enhances their chances of achieving good statistical characteristics, particularly reliability.

Time allocated. Because criterion-referenced tests are relatively short and are often administered during class time, they tend to be relatively quick, usually about one hour, depending on how long the class is scheduled to meet. Because norm-referenced tests have more questions and are not limited by class scheduling, they tend to take much more time to administer. The NRTs that I have worked on have generally ranged from about two hours to six hours.

Cost. The last logistical distinction between CRTs and NRTs involves cost. Criterion-referenced tests are usually viewed as being nearly free because they only involve the teacher's time (testing is usually considered part of the teacher's job) and whatever minimal duplication costs might be involved. Because norm-referenced tests are longer and more elaborate (including test booklets, tapes, answer sheets, official proctors, and considerable space), they usually cost much more to administer. Naturally, some NRTs are cheaper than others. For instance, the TOEFL with its multiple-choice machine-scorable formats is much cheaper than its companion tests, the Test of Spoken English (TSE) and the Test of Written English (TWE), which are also published by Educational Testing Service on a regular basis. The TSE and TWE are more expensive because they involve speech samples and written compositions, respectively. These productive language samples must each be judged by several paid human raters, and, of course, that is a relatively expensive process.

Two Contrasting Example Tests

In terms of the six test characteristics listed in Table 1, a typical example of a criterion-referenced test would be the intermediate level speaking test that we developed when I was teaching EFL in China from 1980 to 1982 (see Brown, 1995b, for further description of this program). The underlying purposes of this test were to foster learning by giving the students diagnostic feedback at the beginning of the course and progress feedback at the midterm, and by assessing their overall achievement at the end of the course. One of our hidden purposes was to get students who were accustomed to traditional grammar/translation language teaching to cooperate in role-play, pair-work, and other communicative activities that they thought were strange and pointless. Thus the content of the test was very specific and based on the precise objectives of our course. Our objectives were for the students to be able to effectively use 15 of the functions covered in the Gambits series (Keller & Warner, 1979) by the end of the course. It was very easy for the students to predict which functions were going to be tested because we told them precisely what our course objectives were and precisely what to expect on the test in terms of objectives, directions, format, timing, and tasks. In addition, they took the same test three times so, at least by the end of the course, everyone was sure to know what to expect on the test. The test took the form of a taped interview. To save time, each student randomly selected three cards from the fifteen that we had (one for each objective), and the interview proceeded. Various schemes were used to score these interviews, but the best one from my point of view asked two teachers (the student's own teacher and one other) to give the students separate ratings for fluency, content, effectiveness at getting their meaning across, using the correct exponents to accomplish the function, and stress/intonation. Each of these five categories was worth 20 points for a total of 100 points, which we interpreted in percentage terms as is typical of a CRT. We didn't give the tests back to the students, but we did give them feedback (on the diagnostic and progress tests) in the form of a teacher conference with each student, where we played the tape for them and gave them oral feedback from our notes.

In terms of logistical dimensions, this test was also clearly a CRT. Every ten-week term, the test was typically administered three times (weeks one, five, and ten) to three small classes of 20 students. All of the students were in the intermediate level so they were fairly homogeneous in terms of their range of abilities. The test was also relatively short with only three questions per student in an interview that took only about five minutes; and, this speaking test cost very little to administer because the administration and scoring were considered part of the teachers' jobs, and the materials were duplicated as a normal part of program expenses.

An example of a norm-referenced test is the TOEFL (ETS, 1995), which can also be examined in terms of the six test characteristics listed in Table 1. The underlying purpose of the TOEFL is to classify students along a continuum of general abilities (that is, overall English language proficiency for academic purposes) usually with the ultimate goal of making admissions decisions at universities in the United States. Thus the content of the test is very general within three broad categories: (a) listening comprehension, (b) writing and analysis, and (c) vocabulary and reading comprehension. Within those categories, it is very difficult to predict exactly what the content of the questions will be, and Educational Testing Service strives to maintain that situation by using a large test question pool, by developing 12 forms per year, by taking legal action against those who violate their copyright, and by enforcing strict test security measures worldwide. Hence, it is difficult for students to know exactly what to expect on the test when they take it. The scores on TOEFL are always interpreted as standardized scores. For instance, a student who scores 500 is average in comparison to the norm group, that is, 50 percent of the students would be below the student in the distribution of scores (and of course, 50 percent would be above the student, too). In addition, Educational Testing Service would prefer not to give the tests back to students. Indeed, students have to go out of their way to request a copy of the test before ETS will supply it. This is reasonable from the tester's point of view because giving out copies of the tests compromises test security and raises the costs of testing, costs that will have to be passed on to future examinees.

In terms of logistical dimensions, the TOEFL is also clearly an NRT. Every year, it is administered to many hundreds of thousands of students around the world who have a wide range of abilities from virtually no English to high native proficiency. This test is also relatively long, both in the number of test questions and the time that it takes to administer it (about two hours and 30 minutes). Finally, the TOEFL costs a great deal more than a CRT to administer, and as a result, it has a registration fee that amounts to a month's salary in some countries.

In Japan, numerous examinations are, or should be, considered NRTs. For instance, because of the types of admissions decisions being made, the high school and university entrance examinations should probably be considered norm-referenced. Similarly, the TOEIC and the various Eiken examinations are norm-referenced.

Typically, norm-referenced tests in the United States are analyzed qualitatively and statistically to determine the degree to which they are functioning well as norm-referenced tests (particularly in terms of test-item quality, descriptive statistics, test reliability, and validity). However, after considerable effort, I have found it impossible to obtain similar information about the norm-referenced tests in Japan. This raises the question of whether the quality of NRTs in Japan is analyzed at all (statistically or otherwise).

Why is the Distinction Important?

The distinction between CRTs and NRTs is primarily important to teachers in Japan because, based on the distinction, they should realize that: (a) the testing that they are doing in their classes is at least a good start, (b) scores on a classroom test may not necessarily be normally distributed because such tests should be criterion-referenced, (c) certain testing responsibilities ought to rest with teachers and others with administrators, (d) CRTs and NRTs are developed in different ways, and (e) NRTs cannot be expected to do all things well.

Most current classroom testing is a good start. I'm sure that in many cases, classroom criterion-referenced testing could be done better, but at the same time, I am also sure that most teachers are at least on the right track when they try to test the things that they have taught in their courses. The only teachers who might be on the wrong track are those who don't do any testing at all. Students need feedback on how they are doing. If asked, most students will even say that they like to be tested. But, to be effective, a CRT must match what is being taught in the class. So it would be inappropriate to teach communicative language skills and functions and then test the students with a multiple-choice grammar test.


CRT scores may not be normally distributed. Teachers will also be comforted to recognize that a normal distribution (commonly known as the bell curve) may not necessarily occur in the scores of their classroom tests. As mentioned above, the groups of students are usually small and homogeneous, and it is not reasonable to expect a normal distribution of scores in such groups. In addition, on CRTs, the ideal distributions would occur if all of the students scored zero at the beginning of a course (indicating that they all desperately needed to learn the material) and 100 percent at the end of the course (indicating that all of the students have learned all of the material perfectly). Neither of these ideals is ever really met, even with a good test, but the scores might logically be "scrunched up" toward the bottom of the range at the beginning of a course and toward the top of the range at the end of the course. Hence for a number of reasons, expecting a normal distribution in classroom testing is unreasonable. Nonetheless some administrators expect just that, usually in the name of "grading on a curve." Teachers now have the information to show administrators the error of their ways, though in many situations in Japan, teachers may prefer to keep their own counsel.

Testing roles of teachers and administrators. However, even in Japan, both teachers and administrators can benefit from thinking about the relationships of these different types of tests to their respective job responsibilities. I think that CRTs for testing diagnosis, progress, and achievement ought properly to be the responsibility of teachers, individually, or better yet, collectively in groups of relevant teachers working together. In contrast, any NRTs for testing aptitude, proficiency, or placement should primarily be the responsibility of the administrators. I am not saying that I think that administrators should not be involved in the criterion-referenced testing processes or that teachers should not help with the norm-referenced testing administrations. In most cases, the administrators probably need to help the teachers coordinate the development of CRTs and help them with the logistics of administering and scoring them. Equally important, the administrators will probably need help from the teachers in proctoring and scoring a placement test, and maybe in developing it. Nonetheless, because of the nature of the respective decisions that will be made with these tests, I strongly feel that the primary responsibility for CRTs ought to rest with teachers, while the primary responsibility for the NRTs should rest with administrators.

CRTs and NRTs are developed in different ways. One other difference exists between CRTs and NRTs that is not covered in this chapter. It is the difference in the strategies that are used for developing and improving CRTs and NRTs. Those issues are covered in other chapters later in the book: one on CRTs and one devoted to NRTs.

NRTs can't do it all. Teachers and administrators should not expect too much of NRTs. Often in Japan, I have found that NRTs like the TOEFL are used for many types of testing, including proficiency, placement, achievement, and even progress testing. While the TOEFL, as a very general NRT, can serve as an excellent proficiency test, it seems irresponsible to use it as a placement test because, in most cases, the TOEFL is too broad in nature to work well in placing the students in a particular institution. Similarly, the content of the TOEFL is entirely too broadly defined to be useful in tracking the progress of students, or measuring their achievement in semester-length, or even year-long English courses.

As Alderson (1990) put it at the RELC Seminar on Language Testing and Language Programme Evaluation in Singapore, norm-referenced tests simply are not sensitive enough for doing criterion-referenced testing. This lack of sensitivity is certainly related to the degree of specificity involved, but may also be due in part to the ways that criterion-referenced and norm-referenced tests differ along logistical dimensions.

Administrators and teachers alike should also realize that using NRTs for CRT purposes minimizes the possibilities that their program will look good. Program evaluations conducted by smart language professionals will use tests that are sensitive to the goals and objectives of the program involved in order to maximize the chances of their students showing gains in learning. Such program-sensitive tests (sometimes called program-fair tests) are by definition criterion-referenced, not norm-referenced.

Conclusion

In sum, because of all of the differences listed in Table 1 and the foregoing discussion, it seems clear that the characteristics of criterion-referenced tests make them unsuitable for classifying students into groups for admissions or placement decisions, while at the same time, the characteristics of norm-referenced tests make them inappropriate for assessing what percent of the material covered in class each student has learned for diagnostic, progress, or achievement decisions. Hence, one of the few truths available to us in language teaching is that criterion-referenced and norm-referenced language tests are fundamentally different from each other.

Perhaps the single most important message that this chapter contains is that tests, whether criterion-referenced or norm-referenced, are important tools in language programs. These tools can and should be used effectively and efficiently to help language teachers and administrators with the variety of different types of decisions that they must make in order to deliver excellent instruction to their students. However, all of this can only be accomplished if language professionals in Japan understand how to use both criterion-referenced and norm-referenced tests properly.


Note

Note that the Educational Testing Service has recently developed a policy whereby students who take the TOEFL and want a copy of the examination can obtain it by writing directly to Educational Testing Service.


References

Alderson, C. (1990). Language testing in the 1990's: How far have we come? How much further must we go? Plenary address at the 1990 RELC Seminar on Language Testing and Language Programme Evaluation, Singapore.

Alderson, C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.

Bachman, L. F. (1989). The development and use of criterion-referenced tests of language proficiency in language program evaluation. In K. Johnson (Ed.), Program design and evaluation in language teaching. Cambridge: Cambridge University Press.

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Baker, D. (1989). Language testing. London: Edward Arnold.

Brown, J. D. (1981). Newly placed versus continuing students: Comparing proficiency. In J. C. Fisher, M. A. Clarke, & J. Schachter (Eds.), On TESOL '80 building bridges: Research and practice in teaching English as a second language (pp. 111-119). Washington, DC: TESOL.

Brown, J. D. (1984). Criterion-referenced language tests: What, how, and why? Gulf Area TESOL Bi-annual, 1, 32-34.

Brown, J. D. (1988). Understanding research in second language learning: A teacher's guide to statistics and research design. Cambridge: Cambridge University Press.

Brown, J. D. (1989). Improving ESL placement tests using two perspectives. TESOL Quarterly, 23(1), 65-83.

Brown, J. D. (1990a). Short-cut estimates of criterion-referenced test consistency. Language Testing, 7(1), 77-97.

Brown, J. D. (1990b). Where do tests fit into language programs? JALT Journal, 12(1), 121-140.

Brown, J. D. (1992). Classroom-centered language testing. TESOL Journal, 1(4), 12-15.

Brown, J. D. (1993). A comprehensive criterion-referenced language testing project. In D. Douglas & C. Chapelle (Eds.), A new decade of language testing research (pp. 163-184). Washington, DC: TESOL.

Brown, J. D. (1995a). Testing in language programs. Englewood Cliffs, NJ: Prentice-Hall.

Brown, J. D. (1995b). The elements of language curriculum: A systematic approach to program development. New York: Heinle & Heinle.

Carroll, B. J., & Hall, P. J. (1985). Make your own language tests: A practical guide to writing language performance tests. Oxford: Pergamon.

Cartier, F. A. (1968). Criterion-referenced testing of language skills. TESOL Quarterly, 2(1), 27-32.

Cziko, G. A. (1982). Improving the psychometric, criterion-referenced and practical qualities of integrative language tests. TESOL Quarterly, 16(3), 367-379.

Cziko, G. A. (1983). Psychometric and edumetric approaches to language testing. In J. W. Oller, Jr. (Ed.), Issues in language testing research. Cambridge, MA: Newbury House.

ETS. (1995). Test of English as a foreign language. Princeton, NJ: Educational Testing Service.

Harris, D. P. (1969). Testing English as a second language. New York: McGraw-Hill.

Heaton, J. (1989). Writing English language tests (New edition). London: Longman.

Hudson, T., & Lynch, B. (1984). A criterion-referenced approach to ESL achievement testing. Language Testing, 1(2), 171-201.

Hughes, A. (1989). Testing for language teachers. Cambridge: Cambridge University Press.

Keller, E., & Warner, S. (1979). Gambits: Conversational tools (Books one, two, & three). Hull, Quebec: Canadian Government Printing Office.

Madsen, H. (1983). Techniques in testing. Oxford: Oxford University Press.

Valette, R. M. (1977). Modern language testing (2nd ed.). New York: Harcourt Brace Jovanovich.


Chapter 3

Criterion-Referenced Test Construction and Evaluation

DALE T. GRIFFEE

SEIGAKUIN UNIVERSITY

Despite the increasing number of teachers who have master's degrees in teaching English as a second language (TESOL) or other forms of training, there remains an ignorance of and even an aversion to the technical aspects of test construction on the part of many teachers. In the past three years, I have attempted classroom tests by several means, all of which proved unsatisfactory. Paper tests were either too easy or too difficult for my classes and interview tests proved exhausting for me. Eliminating tests altogether and basing grades on class participation and attendance also proved unsatisfactory for several reasons. First, I had the feeling of not being fair to my students. I flunked one student who was on the borderline of allowable absences, seldom participated in discussions, and on occasion slept in class. On the other hand, I gave high grades to students with good attendance but low class participation. In both cases, I felt on shaky ground and wished for additional criteria. Second, students expect a test, and I wonder how seriously they take a course without a final examination. Third, by not giving tests, I was not receiving any feedback on student progress. Fourth, without a pretest, I had no idea what the levels of my students were on entering my class or what their level of previous knowledge was. Fifth, without tests, especially a final test, I was getting no sense of closure and completion to my course. The purpose of this paper is to introduce criterion-referenced tests (CRT) to teachers who either dislike the whole idea of testing or who have for one reason or another avoided the issue of testing. Two questions will be addressed: What is the difference between criterion-referenced tests and norm-referenced tests (NRT)? And, how can criterion-referenced tests be evaluated and revised?

Norm-Referenced Tests and Criterion-Referenced Tests Defined

Most classroom teachers are familiar with norm-referenced tests by function if not by name. One of the well-known NRTs is the Test of English as a Foreign Language (TOEFL). However, few ESL/EFL teachers are familiar with the concept of criterion-referenced tests, perhaps because CRTs have not been discussed as much in the literature as NRTs. For example, in explaining NRTs and CRTs, one popular teacher training text (Savignon, 1983, p. 240) gives four paragraphs consisting of fifty-one lines and two tables to explaining NRTs, but gives only one paragraph consisting of six lines to explaining CRTs. Perhaps unfamiliarity with CRTs is due to the fact that NRTs have dominated testing since the mid-1970's (Bachman, 1989, p. 248). Perhaps another reason NRTs are more familiar to teachers than CRTs is that NRTs are used to decide proficiency and placement issues, which are of high interest to both program administrators and classroom teachers.


Table 1. Differences between norm-referenced and criterion-referenced tests (adapted from Brown, 1989)

Characteristic           Norm-referenced                   Criterion-referenced

Purpose of testing       To spread students out along      To determine the amount of
                         a continuum of general ability    material learned

Type of measurement      General language abilities        Specific language points
                         are measured                      are measured

Type of interpretation   Relative: a student's             Absolute: performance is
                         performance is compared with      compared only with a
                         that of all other students        pre-specified learning
                                                           objective

Knowledge of questions   Students have little or no        Students know exactly what
                         idea of what content to expect    content to expect in the
                         in the test questions             test questions

Score distribution       Normal distribution of scores,    If all students know all the
                         e.g. a bell curve                 material, all should score
                                                           100%

For whatever reason, the distinction between NRTs and CRTs has only recently been recognized by TESOL teachers (Brindley, 1989, p. 49; Brown, 1990a, p. 125; Brown, 1992). Table 1 summarizes the differences between NRTs and CRTs.

NRTs measure general language proficiency whereas CRTs measure specific objectives (Brown and Pennington, 1991, p. 7). An NRT cannot give specific information relative to what a student can or cannot do with language, and it is this characteristic that leads Bachman (1989, p. 243) to say that NRTs are a poor choice for program evaluation. NRTs interpret student scores relative to other student scores, whereas on CRTs students' scores are interpreted relative to an absolute standard, e.g., learning 25 vocabulary words by the end of the week. For NRTs to successfully compare students, they must involve a large enough sample of students (30 or more) to create what is called a normal distribution (see Richards, Platt, and Platt, 1992, p. 249). Figure 1 shows an example of a normal distribution.

Figure 1. Normal distribution
[Bar chart omitted: number of students (vertical axis) plotted against test scores from 0 to 100 (horizontal axis).]

If you were to connect the top of each bar with a line, you would see the familiar bell curve shape. A CRT, on the other hand, does not operate on the normal distribution concept. In fact, a good CRT would have a positively skewed distribution for the pretest, as seen in Figure 2, and a negatively skewed distribution for the posttest as seen in Figure 3.


Figure 2. Positively skewed distribution
[Bar chart omitted: number of students (vertical axis) plotted against test scores from 0 to 100 (horizontal axis).]

Figure 3. Negatively skewed distribution
[Bar chart omitted: number of students (vertical axis) plotted against test scores from 0 to 100 (horizontal axis).]

The reason a well-functioning CRT pretest might produce the distribution in Figure 2 is that, at the beginning of a course, it might reasonably produce many scores at the low end because students did not have the knowledge or skill being tested. In other words, in an ideal situation a CRT pretest would indicate that many students did not know the material and scored zero or close to it. At the end of the course, when the students had the benefit of instruction and took the posttest, they would probably score very high, which would result in a distribution like that shown in Figure 3.

Method

Subjects

The subjects in this study were 50 second-year students at Seigakuin University in the newly-formed Division of Euro-American Studies. The students were in two classes, one of which met on Tuesday and another which met on Thursday. Each class met once a week for 90 minutes. The Tuesday class had 25 students consisting of 11 women and 14 men, and the Thursday class had 25 students consisting of 12 women and 13 men. The students ranged in age from 19 to 21. Due to absences and late registration, 43 students (21 women and 22 men) took the pretest and 37 students (20 women and 17 men) took the posttest. All students were Japanese and all except two were from Saitama prefecture and the nearby Tokyo area. No university or department objectives existed at the time of this study and no TOEFL or other NRT test scores were available. The course syllabus was entirely up to the instructor. One grade was to be given to each student for the entire year based on attendance, homework, and the final test.

Materials

In light of the absence of any institutional course objectives, the test was based entirely on the course textbook More Hearsay (Griffee, 1992). Each unit in the textbook has approximately an equal number of listening and speaking exercises. The general criterion for test construction was that the test reflect the course book as much as possible. Other more specific criteria were that the test be a written test, at least half listening; that there be fifty items scored two points each; that the test contain no material directly taken from units one and two; that all questions be scorable as right or wrong; that all content items be explicitly taught in the text; and that, except for units one and two, the test items cover as much of the text as possible. Eleven formats used in the textbook were identified, and five were judged acceptable for inclusion in the test: cloze passages, multiple-choice questions, listen and write the word/number/prices you hear, listen and circle the word you hear, listen and identify what is described, and write the phrase you hear. Fourteen content areas were identified and eight were used because they had a wide distribution throughout the textbook: vocabulary, cultural items, numbers, schedules, cities, money, food, and travel. The final form of the test, not given here for test security reasons, had nine sections with a total of fifty items. The first six sections of the test, consisting of 26 questions, were for listening and included the following subtests: listen and write what you hear, count the number of words you hear in these sentences, listen and identify which state in the U.S. is being described. The last three sections contained only written items, which included matching words and specific content questions such as "What does ASAP mean?"

Procedures

The pretest was administered in April 1993 during the second class meeting of both classes. In both classes, the first meeting was taken up with a course introduction and Unit 1 of the textbook. No pretest makeup was administered to any student who was absent or transferred into the course after the second meeting. The posttest was administered in January 1994 during the last class session. Both the pretest and the posttest were administered using the same cassette tape, which included instructions as well as the listening passages. The tests were then collected and graded by the instructor, but not returned to the students. The statistics were calculated on a Macintosh LC 520 using the ClarisWorks version 2.0 spreadsheet program.

Analysis

In this paper, three types of statistics will be discussed: descriptive statistics, item statistics, and consistency estimates.

Descriptive Statistics

Descriptive statistics, as the name implies, give a basic description of the test. In this paper, seven descriptive statistics will be given. Five of the statistics are self-explanatory: they are the number of students, the number of test items, the minimum score, the maximum score, and the range of scores. The remaining two statistics, the mean (which is sometimes symbolized by the letter M or by x̄, pronounced "ex-bar") and standard deviation (SD), require some explanation. According to Richards, Platt, and Platt (1992, p. 349), the mean is the average of a set of scores. In other words, the mean is the sum of scores divided by the number of test scores. If the scores on a certain test are 2, 4, 6, and 4, their total is 16. Divide this sum by four (the number of scores) and the mean is 4. The standard deviation is an average of the difference of each score from the mean. The word "deviation" refers to how far each score is from the mean and the word "standard" is a kind of average. The formula for the standard deviation is as follows:

SD = √( Σ(x − M)² / N )

Where: SD = standard deviation
       M = mean
       x = scores
       N = number of students
       Σ = sum
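As a quick illustration of these two statistics (not part of the original chapter), the short Python sketch below computes the mean and the standard deviation exactly as defined above, using the 2, 4, 6, 4 example from the text; the function names are my own.

```python
import math

def mean(scores):
    """Sum of the scores divided by the number of scores."""
    return sum(scores) / len(scores)

def standard_deviation(scores):
    """Square root of the average squared deviation from the mean (dividing by N, as in the formula above)."""
    m = mean(scores)
    return math.sqrt(sum((x - m) ** 2 for x in scores) / len(scores))

scores = [2, 4, 6, 4]                         # the worked example from the text
print(mean(scores))                           # 4.0
print(round(standard_deviation(scores), 2))   # 1.41
```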

Item statistics

In a language test, each scorable piece of language is called an item, and a test question may contain one or more items. Item analysis is a way of obtaining some simple statistics to analyze the items on the test. Item analysis might be a new concept for many teachers, but it should be interesting for classroom teachers working with tests because it gives them a practical tool to evaluate, revise, and improve their classroom tests. Item analysis tells the teacher how students scored on each item. By using item analysis, a teacher can determine how well or how poorly each item in the test is functioning. The teacher can then revise the test by deciding which items to leave in the test and which items to delete or change. Two item statistics will be discussed later in this paper. They are item facility (IF) and the difference index (DI).

Consistency estimates

Consistency or dependability estimates for CRTs are comparable to the NRT notion of reliability. The central issue is the degree to which the teacher can expect the test to give the same results test after test. Because the main focus of this chapter is on item analysis, only one consistency estimate will be given here, the Kuder-Richardson formula number twenty-one (KR-21). KR-21 will be explained in more detail later.


Table 2. Descriptive statistics

Test       n    Total possible   Mean     SD       Min   Max   Range
pretest    43   100              52.047   13.756   26    80    54
posttest   37   100              66.590   12.577   34    96    62

Results

The pretest results show that, of the fifty students enrolled, only forty-three took the pretest. The results also show a fairly wide spread of 54 points, ranging from a low of 26 to a high of 80 points. The mean or average is about 52 points. Since the standard deviation was about 14 points, and about 34% of the scores in a normal distribution fall within one standard deviation on either side of the mean, roughly 68% of the students scored from 38 to 66 points. Assuming the traditional pass-fail cutpoint at .70, 84% of the students failed on the pretest. This indicates that the test was effective in that the majority of students did not know the material when they entered the class at the beginning of the school year.

Figure 4. Bar chart showing students' pretest scores.
[Chart omitted: horizontal bar chart of the pretest score distribution.]

Figure 5. Bar chart showing students' posttest scores.
[Chart omitted: horizontal bar chart of the posttest score distribution.]

In this paper, only the item analysis statistics for items 1-15 are given.

The posttest results show that thirty-seven students were still in the class when the final posttest was administered. The mean score increased from 52 to almost 67 points and the minimum and maximum scores also indicate some improvement. Another way to compare pretest and posttest scores is visually through the use of bar charts. You have already seen bar charts to show the normal distribution and skewed distributions. Figure 4 shows a bar chart in the horizontal view showing the distribution of pretest scores and Figure 5 shows a bar chart in the horizontal view showing the distribution of posttest scores. One student scored 70 on the pretest and 66 on the posttest. All other students improved from the pretest to the posttest.



Table 3. Item statistics for items 1-15

Item          1    2    3    4    5    6    7    8    9    10   11   12   13   14   15

Pretest IF   .70  .95  .59  .86  .70  .41  .76  .49  .73  .86  .03  .11  .79  .33  .49
Posttest IF  .86  .97  .70  .89  .57  .65  .86  .59  .89  .86  .03  .08  .94  .56  .70
DI           .16  .02  .11  .03 -.13  .24  .10  .10  .16  .00  .00 -.03  .15  .23  .21

Note: IF = item facility; DI = difference index

The item facility (IF) is an item statistic that gives the percent of correct answers (Brown, 1989). The formula is IF = N correct / N total. To determine the IF, divide the number of correct answers for an item by the total number of students who took the test. For example, on the pretest reported in this paper, the IF for item one was .70 and the IF for item eleven was .03. That means item one was answered correctly by 70% of the students and item eleven was answered correctly by only 3% of the students.
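The IF calculation is simple enough to express in a few lines of code. The sketch below is not from the chapter; the function name and the made-up response pattern are illustrative assumptions.

```python
def item_facility(item_responses):
    """IF = number of correct answers / number of students (1 = correct, 0 = incorrect)."""
    return sum(item_responses) / len(item_responses)

# Hypothetical responses from ten students on a single item.
responses = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
print(item_facility(responses))  # 0.7, i.e., 70% answered correctly
```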

There are four possible outcomes for any test item in a pretest-posttest situation. These outcomes are given in Table 4.

As can be seen in Table 4, in outcome one, an item is answered correctly on the pretest and then answered incorrectly on the posttest. This is a rather bizarre situation because it means that the student knew the answer on the pretest and then forgot or in some way unlearned the answer for the posttest. In such a situation, the student may have been distracted while taking the test. Several things might distract students, and the teacher should investigate such occurrences. Distraction might come from an uncomfortable room, poor lighting, or taking the test the day after an all-night party. In outcome two of Table 4, an item is answered correctly on both the pretest and the posttest. In this case the item was too easy, probably because it was taught in previous courses. In outcome three, an item is answered incorrectly on both the pretest and the posttest. Perhaps this item was too difficult for the students to learn, or perhaps the teacher did not adequately review the item. It could also be the case that the teacher did not actually cover the item in class. In outcome four, the item was answered incorrectly on the pretest and answered correctly on the posttest. This is the ideal case for a CRT test item. The students came to the class not knowing this point, and due to their hard work (and good teaching) students exited the course knowing the item.

Table 4. Possible CRT item outcomes

Possible outcome   Pretest item       Posttest item      Explanation

1                  correct answer     incorrect answer   student forgot or was distracted
2                  correct answer     correct answer     item was too easy
3                  incorrect answer   incorrect answer   item was too difficult, not taught,
                                                         or not reviewed
4                  incorrect answer   correct answer     ideal item for a CRT


To restate, the purpose of item analysis is to provide item statistics that enable the teacher to decide which of the four possible outcomes each item belongs to, so that the teacher can improve the test by deciding in future versions of the test which items to keep, which items to revise, or which items to reject. Let's see how this process works with the first fifteen items in our test. The IF index and how it was derived has already been explained. An IF index is calculated for each item in the pretest and each item in the posttest. The pretest IF is subtracted from the posttest IF and the resulting number is the difference index (DI). The formula is DI = posttest IF - pretest IF, and the higher the DI the better. The test was arranged in sections or groupings titled "tasks." The first fifteen items of the test include task 1 (items one through five), task 2 (items six through ten), and task 3 (items eleven through fifteen).
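The DI calculation, and the kind of flagging decision it supports, can be sketched in a few lines of Python. This is my own illustration, not the author's spreadsheet; the .20 cut-off is the one suggested later in the chapter.

```python
def difference_index(pretest_if, posttest_if):
    """DI = posttest IF minus pretest IF; higher values indicate more learning."""
    return posttest_if - pretest_if

pre  = [.70, .95, .59, .86, .70]   # pretest IF for items 1-5 (Table 3)
post = [.86, .97, .70, .89, .57]   # posttest IF for items 1-5 (Table 3)

for item, (p_if, q_if) in enumerate(zip(pre, post), start=1):
    di = difference_index(p_if, q_if)
    decision = "flag for revision" if di < .20 else "keep"
    print(f"item {item}: DI = {di:+.2f} ({decision})")
```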

To begin the revision of task one, we see that the DIs for items one through five are .16, .02, .11, .03, and -.13. These are not very impressive numbers. They indicate that item one showed an increase in scores of .16 or 16%, which is not bad, but item two is only .02 or 2 percent and number five is a disaster, with a minus sign indicating a net loss, probably because some students experienced outcome one in Table 4. Looking at the actual test, items one through five appear as five blank lines. Students are instructed to listen and write the number they hear. The numbers the students hear on the tape are: four hundred, ninety-nine, eight thousand, thirty-two thousand five hundred, a hundred thousand, and a hundred and fifty thousand. Each number is repeated two times. The item analysis suggests that apart from item one, we could improve the questions, especially question five. Unfortunately, item analysis does not give us any idea of what to do. To improve the questions, we must use our knowledge and imagination. My goal was to make these items more difficult (so that more students would miss them on the pretest) and easier (so that more students would get them correct on the posttest). What I decided, in fact, was to make the items easier by making the numbers smaller and to make the items more difficult by embedding them in a sentence. Item one was changed to, "Can you help me, I have to make twenty-five copies of this report." This sentence is taken directly from one of the units.

Items six through ten are also blank lines on the test sheet with instructions in print and on tape to "listen and write the prices you hear." Students hear various prices on the tape such as three dollars and seventy-five cents for item one and eighty-five cents for item ten. Item six seems to be functioning well with a DI of .24, but item ten has a DI of zero with a pretest IF of .86 and a posttest IF of .86, indicating that no learning took place, probably because the item was too easy. In the revised test, all prices were embedded in sentences. For item six, students now hear, "Is fifty-four cents enough for an airmail letter to France?" and for item ten students hear, "A ticket for the bus is fourteen fifty and you can pay the driver."

Items 11 through 15 are multiple-choice questions. The students see four words, listen to the tape, and circle the word they heard. Items eleven and twelve were not functioning well (DIs of .00 and -.03) while items thirteen, fourteen, and fifteen are functioning much better. I interpreted the item analysis statistics as an indication to revise items eleven and twelve. Looking closely at item eleven, we see the four options: 1) you'll, 2) I, 3) I'll, and 4) this. The utterance students heard was, "You copy and I'll collate," which is taken from one of the dialogues in the textbook. Students are selecting answer one perhaps because it is easier to hear than the correct answer, number three. My revision was to keep the utterance and change distractor number one to a word not containing the sound /ou/ or /l/. By doing this, I reduced the distractor interference and students should find it easier to circle the correct answer. However, item analysis next year will confirm or deny my supposition. Next school semester, this test will again be administered as a pretest in April and posttest in January, and all items will be evaluated and revised as becomes necessary, especially items with a DI of less than .20.

Reliability is a statistical concept which is used to estimate inconsistency in a test. Imagine a perfect test. Such a test might exist in a Platonic heaven of perfect tests, but never here on earth. On earth all we have are imperfect tests, all of which contain some inconsistency. We would generally like to know how reliable our imperfect test is.

This paper uses the Kuder-Richardson formula twenty-one, or KR-21, which is an NRT statistic and, as such, is technically not appropriate for CRTs. However, there is an advantage to using KR-21. According to Brown (1990b), KR-21 is a conservative consistency estimate of the phi coefficient, a CRT statistic beyond the scope of this paper to explain. Unlike the phi coefficient, KR-21 is easy to calculate because only three numbers are necessary, and they have already been calculated and reported in the descriptive statistics. They are the number of items, the mean, and standard deviation. The formula for KR-21 is

KR-21 = [k / (k − 1)] × [1 − M(k − M) / (k × s²)]

where k = the number of items
      M = the mean of the test scores
      s = the standard deviation of the test scores

In this calculation, M and s are based on the sample posttest raw score data: k = 100, M = 66.59, and s = 13.35, and KR-21 turns out to be .85. This indicates that the students' scores are 85 percent reliable and 15 percent unreliable.
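A minimal Python sketch of the same calculation follows (my own illustration, not the author's spreadsheet; the function name is assumed). Small differences between its output and the reported .85 would simply reflect rounding in the summary statistics that are plugged in.

```python
def kr21(k, m, s):
    """Kuder-Richardson formula 21 from the number of items (k), the test mean (m), and the standard deviation (s)."""
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * s ** 2))

# Figures quoted for the posttest in this chapter.
print(round(kr21(k=100, m=66.59, s=13.35), 2))
```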

Discussion

In this paper, the distinction between NRTs and CRTs has been clearly illustrated. NRTs are of little or no help to classroom teachers in diagnosing their students' strong and weak points, assessing achievement, or evaluating programs. I have also shown how CRTs can be designed and evaluated by using item analysis, which makes it possible to evaluate and improve the items. The results of the item analysis reported in this paper are sobering. The test described in this paper had been carefully designed (see Griffee, 1994, p. 34) by an English native speaker with an M.A. and many years of teaching experience. Despite these qualifications, many of the test items were shown by item analysis to be ineffective. This indicates that test items constructed by academically qualified and experienced teachers cannot be assumed to function as intended. Item analysis operates to flag certain test items and makes it possible for the teacher to identify those items in terms of one of the four outcomes in Table 4.

One weakness of this particular test is that the lack of institutional goals forced reliance on the textbook for test construction. Using a textbook instead of course objectives as the basis for the test raises two problems: one is the narrowness of the scope and the other is the limitation of sequence. The problem of narrowness of scope is that the focus of the test is restricted to a single textbook. In other words, lack of institutional program learning goals forces the teacher to make the textbook an end rather than a means toward an end. The problem of limited sequence is that, in any given course, the teacher has no way of relating to or supporting other courses in the curriculum. The lack of institutional or departmental objectives means there is a risk that each course in the curriculum will become an isolated island with no bridges to the other islands.

Acknowledgements

An earlier form of this paper benefited from comments by K. Anderson and D. Reid. I am indebted to W. G. Kroehler for help in setting up the spreadsheet program.

References

Bachman, L. (1989). The development and use of criterion-referenced tests of language ability in language program evaluations. In R. K. Johnson (Ed.), The second language curriculum (pp. 242-258). Cambridge: Cambridge University Press.

Brindley, G. (1989). Assessing achievement in the learner-centred curriculum. Sydney: National Centre for English Language Teaching and Research.

Brown, J. D. (1989). Improving ESL placement tests using two perspectives. TESOL Quarterly, 23(1), 65-83.

Brown, J. D. (1990a). Where do tests fit into language programs? JALT Journal, 12(1), 121-140.

Brown, J. D. (1990b). Short-cut estimates of criterion-referenced test consistency. Language Testing, 7(1), 77-97.

Brown, J. D. (in press). Testing in language programs. Englewood Cliffs, NJ: Prentice Hall.

Brown, J. D. (1992). Classroom-centered language testing. TESOL Journal, 1(4), 12-15.

Brown, J. D. (1993). A comprehensive criterion-referenced language testing project. In D. Douglas & C. Chapelle (Eds.), A new decade of language testing research (pp. 163-184). Washington, DC: Teachers of English to Speakers of Other Languages.

Brown, J. D., & Pennington, M. C. (1991). Developing effective evaluation systems for language programs. In M. C. Pennington (Ed.), Building better English language programs: Perspectives on evaluation in ESL (pp. 3-18). Washington, DC: NAFSA.

Griffee, D. T. (1992). More HearSay: Interactive listening and speaking. Reading, MA: Addison-Wesley.

Griffee, D. T. (1994). Criterion-referenced test construction: A preliminary report. Journal of Seigakuin University, 6, 31-40.

Richards, J., Platt, J., & Platt, H. (1992). Dictionary of language teaching & applied linguistics (2nd ed.). London: Longman.

Savignon, S. J. (1983). Communicative competence: Theory and classroom practice. Reading, MA: Addison-Wesley.


Chapter 4

Behavioral Learning Objectives as an Evaluation Tool

JUDITH A. JOHNSON

KYUSHU INSTITUTE OF TECHNOLOGY

If a poll were taken, probably most of the teachers around the world would say that they do not use the Taxonomy of Educational Objectives (Bloom, 1956) as an instructional aid. It is more than likely, however, that their teaching is in some way influenced by it, whether in the structure of the curriculum or syllabus they follow, the design of the textbooks they use, or the construction of the standardized tests they administer. The Taxonomy was originally conceived as an "educational-logical-psychological" system for classifying test items (p. 6) and "a method of improving the exchange of ideas and materials among test workers as well as other persons concerned with educational research and curriculum development" (p. 10). It was a very progressive schema in education at the time of its conception and quickly became a valuable tool for helping educators be more systematic in organizing educational and instructional objectives and assisting learners to develop "intellectual abilities and skills" (p. 38).

In addition to having an impact on large-scale achievement testing and curriculum evaluation, it has directly influenced classroom instruction and assessment as well. This influence was largely brought about by (a) the introduction of instructional and behavioral objectives and (b) the categorization of lower-order and higher-order objectives. In this chapter, first, the Taxonomy will be described; next, brief instructions on how to write behavioral objectives will be given; and then, explanations of how the use of objectives can improve planning, instruction, and evaluation of student learning will be presented. Finally, concerns about the use of the Taxonomy will be addressed, and its major contributions to classroom instruction and assessment will be discussed.

A Description of the Taxonomy

The Taxonomy of Educational Objectives (Bloom, 1956) is a list of six major classes of educational outcomes, based on the idea that the outcomes are hierarchically related as follows:

1.00 Knowledge
2.00 Comprehension
3.00 Application
4.00 Analysis
5.00 Synthesis
6.00 Evaluation

The most basic outcome is knowledge and the most complex is evaluation. In other words, the lower-order categories are prerequisite to achieving the higher ones.

Rohwer and Sloane (1994) identify six presuppositions about the nature of human learning and thinking upon which the structure and principles of the Taxonomy depend:


1. There is a relationship between learning and performance.

2. There are qualitatively different varieties of learning.

3. Learning is hierarchical and cumulative.

4. Learning is transferred to higher levels (vertically) and horizontally.

5. Generalized higher-order skills and abilities can be applied across content areas.

6. There is a difference in the way novices and experts learn new material.

Behavioral objectives came into existence when the authors of the Taxonomy decided that "virtually all educational objectives when stated in behavioral form have their counterparts in student behavior. These behaviors, then, could be observed and described, and the descriptions could be classified" (Bloom, 1994, p. 3). The use of behavioral objectives enables both the teacher and the learner to see what is expected of the learner.

The cognitive structure of the Taxonomy was perceived to consist of lower-order and higher-order objectives. Although the division between lower-order and higher-order objectives has not been agreed upon by researchers, Bloom (1963) referred to Analysis, Synthesis, and Evaluation as classes which required "higher mental processes." Because of the differences between the two levels of objectives, different teaching methods for each have been proposed and supported by research findings. Teaching methods such as lectures have been found to be more effective in teaching lower-level objectives while communication-oriented methods such as discussions and group work are more helpful in learning higher-order objectives (McKeachie, 1963; McKeachie & Kulik, 1975; Johnson & Johnson, 1991). Research into the two levels of objectives and teacher questioning practices has found a relationship between the types of questions asked by teachers and student achievement. When the greater number of questions posed by teachers are higher-order questions, student achievement is positively affected (Redfield & Rousseau, 1981).

Table 1. Classes of Instructional Objectives and Corresponding Behaviors

Knowledge: the ability to remember previously learned information
Behaviors: define, name, state, reproduce, match, list, identify, describe

Comprehension: the ability to grasp the meaning of material
Behaviors: distinguish, defend, predict, explain, estimate, give examples, infer, paraphrase, summarize

Application: the ability to use learned material in new and concrete situations
Behaviors: use, modify, predict, compute, demonstrate, prepare, produce, show, discover, manipulate

Analysis: the ability to break down material into its component parts so that its organized structure can be understood
Behaviors: differentiate, relate, subdivide, separate, break down, identify, diagram, illustrate

Synthesis: the ability to put parts together to form a new whole
Behaviors: categorize, modify, tell, write, rewrite, summarize, generate, plan, reconstruct, explain, create, design

Evaluation: the ability to judge the value of material for a given purpose
Behaviors: compare, conclude, support, justify, discriminate, critique, contrast, interpret, relate


Unfortunately, it appears that most teachers focus on lower-level questions (Anderson, Ryan, & Shapiro, 1989).

Table 1 is a summary of Taxonomy definitions of each class and corresponding observable behavior terms that can be used for stating specific learning outcomes.

Writing Behavioral Objectives

Behavioral objectives changed the focus of instruction from what the teacher will teach to what the student should learn. The objectives identify learner achievement in terms of behaviors which can be observed and measured by both the teacher and learners during and at the end of the instruction period. Comparing the following two objectives, the second objective is more clearly stated than the first.

1. Teach students how to write a business letter in English using an acceptable format.

2. Students will write a business letter in English, using the format studied in class.

A minimum level of acceptable performance must be determined and announced to the learners. It should be included in the objective:

1. Write a business letter in English, without errors, using the format studied in class.

Any other important conditions that need to be known must also be stated:

1. Using information provided, write a business letter in English, without errors, in the format studied in class. Dictionaries can be used.

Well-stated objectives include the learner as the person who will produce a visible action that corresponds to the instructional objective. The level of acceptable proficiency and the conditions under which the learner must perform the action should also be included.

The first step in writing behavioral objectives is to determine the desired instructional outcome level. For example, if the objective is for learners to understand the content of a reading assignment, the instructional objective is at the Comprehension level: By the end of the course, the students will be able to comprehend the meaning of written material.

Next, behaviors (i.e., visible actions) which correspond to this level are identified and selected, taking into consideration the nature of the reading materials, class size, length of instruction period, and other related factors.

By the end of the course, the students will be able to comprehend the meaning of the reading passage in Lesson 4 as demonstrated by their ability to:

1. Summarize the ideas
2. Explain the main ideas
3. Predict the outcome
4. Define key vocabulary items

Finally, the form of the behavior (written, oral, graphic, etc.) must be decided by the teacher, based on the curriculum objectives. (For more information on writing objectives see Mager, 1962; Popham, 1973; Gronlund, 1978; or Brown, 1995.)

Also, learning activities should be selected based on the potential of the activities to further curriculum objectives and benefit a specific group of learners. Lower-level behavioral objectives can be included, but objectives from cognitive levels higher than the level being taught generally should not be used, as they are probably too difficult for the learners. Mastery of lower-level objectives prepares learners to achieve higher-level objectives.

Planning Teaching/Learning Activities

Research related to lesson planning reveals that many teachers plan their lessons based on what they want the students to do rather than on what they want the students to learn (Peterson & Clark, 1986). In other words, they decide to use learning activities such as role-playing, writing a composition, or discussing a reading assignment without first identifying the specific aim(s) of the activity. As a result, before using an activity, teachers often fail to consider how the learners' acquisition of a new skill or knowledge will be evaluated. So at the end of the class, either no assessment or only very cursory (and perhaps subjective) assessment is made.

Using objectives in planning enables the teacher to see the inter-relatedness of all the phases of the instructional process (planning, teaching/learning, and evaluation) from the outset. A commonly stated instructional aim for a foreign language class is, "Have students write an essay about hobbies." First of all, notice that the objective is for the teacher, not the learner. Secondly, note that the mere act of writing is considered the objective of the lesson. The purpose for writing the essay is unknown. In reality, writing the essay is not an objective, but an activity that provides the opportunity for students to learn, practice, and apply, for example, specified language skills, composition rules, and vocabulary. More accurately stated objectives would be as follows:

(Application Level)

1. (The learner) uses the verb tenses studied in Lessons 3-6 with 90% accuracy when writing about given topics.

2. Uses specified punctuation with 80% accuracy when writing about given topics.

3. Uses given sentence structures related to hobbies with 90% accuracy.

In these objectives, it is understood that at the end of the instruction period, the learner is expected to use specific verb tenses, sentence structures, and punctuation when writing about hobbies. With these objectives in mind, appropriate activities can be planned. For example:

1. (Learners) listen to a recording of speakers discussing hobbies and identify the hobbies mentioned

2. Make a list of hobbies (see who can make the longest list)

3. Say which hobbies they think are interesting

4. Write a paragraph explaining their hobby (or interest), or saying when they began it, why they like it, how much time they spend on it, or when they do it

5. Correct another student's paragraph

6. Ask others about their hobbies

7. Fill in punctuation in given essays

When clearly stated objectives are used, lesson planning becomes an efficient and logical process, rather than a hit-or-miss affair. Equally important, the teacher and the learners can see the relationships among the behavioral objectives, the activities carried out in class, and the assessment of their achievement.

The concept of mastery learning, developed by Bloom, is an instructional strategy founded on the belief that students can attain a specified level of learning if they are provided with information and practice presented in a logical format and are given the learning time they require. According to the needs of the students, instruction can be individualized, carried out in small groups, or include the entire class. The use of mastery learning strategies has repeatedly been proven to be highly successful. When the mastery approach is employed, an appreciably larger percentage of the learners have attained set objectives (Chung, 1994). And remember, the Taxonomy is the foundation of mastery learning instruction.

Learning and Assessment

Typically, learners cannot estimate the progress they've made in learning a foreign language until they receive their test results. This is because, generally, learners are not given observable standards by which they can measure their language abilities. Behavioral objectives focus on visible evidence that a skill has been learned at an acceptable standard, under given conditions. With this knowledge, learners can gauge their own progress at any given time. For example, if the objective is for the learner to use the past tense with a minimum of 80% accuracy when telling someone about a past experience, the learner can practice by recording accounts (or conversations with another person) of past experiences. The recording can be replayed and the verbs checked for accuracy. Learners can check their own work and/or each others' work. Needless to say, the learners should be given adequate time during the class to practice each skill. While the learners are practicing, the teacher has the opportunity to give them feedback on their progress. Similarly, objectives for reading, writing, and listening can be evaluated during the practice period. In this way, assessment becomes an integral part of the learning phase of the instructional process.

Criterion-referenced testing (CRT), used in the evaluation stage of the instructional process, is closely linked to the use of instructional objectives. A criterion-referenced test, as defined by the Longman Dictionary of Language Teaching & Applied Linguistics (Richards, Platt, & Platt, 1992, p. 91), is "a test which measures a student's performance according to a particular standard or criterion which has been agreed upon. The student must reach this level of performance to pass the test, and a student's score is therefore interpreted with reference to the criterion score, rather than to the scores of other students." Docking (1986) explains that "the base of criterion-referenced assessment is not the syllabus but objectives. The syllabus is derived from the objectives domain in the same manner as are the tests. In this way the tests do not just assess what was taught, rather they test whether or not the students have achieved the objectives the teaching (syllabus) was intended to facilitate." In norm-referenced testing (NRT), a student's score obtained on standardized tests is compared with the scores of all the other learners. In CRT, scores from tests (generally made by the teacher) provide information about how well the learner can perform specified skills and tasks or master a given content domain, and the scores are compared to an established standard. CRT can also be used to determine the effectiveness of specific objectives, materials, learning activities, and teaching methods. As Brown (1992) points out, the primary purpose of classroom assessment is "to foster learning." To this end, criterion-referenced test scores supply teachers with valuable information that can be used to improve the instructional process (Brown, 1990). Significant beneficial effects that CRT can have on educational programs are identified in Table 2.

Table 2. Beneficial effects of criterion-referenced tests on students, teachers, and curriculum (adapted from Brown, 1992)

Students
  Diagnose strengths and weaknesses
  Motivate to study and review
  Reward for hard work

Teachers
  Focus teaching efforts where needed
  Review areas of weakness
  Evaluate effectiveness of teaching

Curriculum
  Reassess needs and objectives
  Revise and improve tests
  Modify materials and teaching

For the purpose of developing criterion-referenced evaluation tools such as achievement tests, quizzes, and random individual spot checks, behavioral objectives can be defined in terms of tasks. Sample criterion-referenced test items are given in Table 3. When selecting items as an assessment tool, the items should be based on the instructional objectives that were used to develop the syllabus and be categorized by cognitive levels so the teacher can be sure that the items cover what the learners are expected to know and yet are balanced with reference to difficulty.

CRT focuses on the learner's formative language development, permitting periodic self-assessment in addition to teacher assessment. Learner performance can be assessed in a variety of ways. Two common ways are as follows: (a) indicating whether or not the student meets the minimum level of accuracy required (e.g., can/cannot use the past tense), and (b) distinguishing levels of mastery (e.g., uses past tense with infrequent/frequent errors, but meaning is/isn't conveyed clearly using past tense). As many teachers use a point system for assessing learners' performance, points can be assigned to the levels of mastery of each objective.


Table 3. Sample criterion-referenced test items

(Knowledge) Know the meaning of vocabulary items.
1. Select the picture that expresses the meaning of ____.
2. State (say or write) a synonym for ____.
3. Write a definition for each of the words below.

(Comprehension) Understand written material.
1. Write a summary of the paragraph, below.
2. Give (say or write) two more examples of the type of ____ discussed in the passage.
3. Explain (say or write) why the main character of the story ____.

(Application) Follow oral directions.
1. Draw a line on the map along the route you are told to take by the speaker.
2. After listening to a series of directions you are given, identify your destination.

Give oral directions.
1. Guide someone to a place he/she wants to go in this school.
2. Give someone directions on how to get from ____ to the place he/she wants to go to near ____.

(Analysis) Identify the organizational structure of a composition.
1. Write an outline of the organization of the main and supporting ideas of the composition, below.

Point out unstated assumptions in given literary works.
1. After reading the following article, give (say/write) three things that the author assumes the reader accepts as truths.

(Synthesis) Combine information from different sources to tell about a topic.
1. In a five-minute speech, tell the class about ____ using information obtained from at least three different sources.
2. Write a play, story, or narration about ____ using parts of personal accounts you've heard from at least two people who've experienced ____.

(Evaluation) Critique a work of art.
1. Give (say or write) your opinion about the film, ____, citing specific incidents and facts to support your ideas.
2. Compare the ideas presented in ____ with what we know, today, as scientific fact related to this topic.

Discussion

While most educators, to varying degrees, accept the assumptions upon which the Taxonomy is based (see Gagne, 1977; Rohwer & Sloane, 1994), some contend that learning is not necessarily hierarchical. Others (Moore & Kennedy, 1971; Purves, 1971; Orlandi, 1971) point out that in some subject areas, comprehension is enhanced when the categories are ordered differently. Madaus, Woods, and Nuttall (1993) found the levels of synthesis and evaluation to be similar. However, other investigations by Ekstrand (1982) and Hill (1984) yield data that favor the cumulative-hierarchy theory. At this time, however, findings are still inconclusive.

Some teachers (Eisner, 1967; Broekhoff, 1979; Tumposky, 1984) have taken a very narrow view of the use of behavioral objectives in language teaching. Tumposky (1984) proposes that, due to the unpredictability and creativity involved in foreign language learning, objectives are not suitable for this type of instruction. However, the structural, phonetic, and social aspects of language learning are replete with predictable language patterns at all speaking levels. In fact, using higher-order objectives may also help students to exercise their creativity and their ability to express more sophisticated ideas. Objectives can bring clarity and purpose to activities and tasks students are often asked to perform in intermediate and advanced foreign language classes. Rather than merely receiving vague instructions to discuss a topic or write a composition on a given theme, the use of objectives enables learners to know the specific aims of the assignments.

Another issue raised by opponents of behavioral objectives is that not all types of learning have observable outcomes. This may be true. However, in foreign language learning, where using the target language is of primary importance, almost every learning outcome will have a corresponding behavior that can be described. Handbook II, the second volume of the Taxonomy (Krathwohl, Bloom, & Masia, 1964), identifies objectives in the affective domain, a domain which also greatly influences foreign language acquisition.

Some critics are afraid that the use of objectives will result in taking the spontaneity out of language learning (and teaching), especially if teachers become preoccupied with insignificant aspects of the language, dividing it into many isolated parts that are difficult to relate to actual language use. Foreign language learners in Japan would undoubtedly be of the opinion that this is the current state of most foreign language instruction. What they are learning in most language classes is remote from the ways in which language is used in normal, daily life. Viewed from the functional and notional aspects of language acquisition, behavioral objectives can help teachers plan instruction that would result in practical, meaningful, communication-oriented learning outcomes (van Ek & Alexander, 1977; Johnson, 1994).

Those who are unfamiliar with the Taxonomy have said it's too complicated to use and, therefore, requires a lot of planning time. Most skills take time to be learned. Foreign language students are expected and required to devote quite a lot of out-of-class time to the subject. Should teachers shortchange their students because they don't want to invest extra time in learning how to provide better instruction to the students? Actually, once learned, the very logical and systematic nature of the Taxonomy makes all phases of the instruction cycle easier to execute.

The difficulty in using the Taxonomy to classify objectives (which the authors also acknowledged) is that the teacher must know or rely on assumptions about the learners' prior educational experiences. Most teachers would probably agree, though, that this difficulty is not unique to the Taxonomy. CRT, which is most suitable for evaluating behavioral outcomes according to Bachman (1989), is not yet able to define language proficiency levels because the content of a criterion-referenced test must be taken from a "well-defined domain of ability," in which there must be "an absolute scale of ability" on which learners' performance can be measured (pp. 255-256). The upper and lower limits of foreign language proficiency and the standard against which the foreign language learner can be measured have yet to be determined. Although research is being carried out in the development of language competency scales and models, the difficulties being faced are the same as those found in present instruments used to evaluate language proficiency: terminology is relative, distinct levels of proficiency cannot be determined, ratings are biased by oral evaluators, and the like.

Some evidence exists that the ability to understand the events which take place in the classroom and the skill to guide classroom activities are important elements of a teacher's success in classroom management (Doyle, 1986). If teachers can identify the behavior they expect the learners to exhibit at the end of a lesson, planning the lesson becomes a logical process (Gower & Walters, 1983). Using objectives to design the course syllabus, select activities, and create evaluation instruments yields multiple sound results: the program of instruction can be logical and cohesive; teachers and students can clearly know what is to be learned and how that learning is to be assessed; once learners understand what they must do, they can take responsibility for their own learning and self-assessment; assessment of the learner's progress can take place during the learning process; a variety of teaching methods and techniques can be used to help students achieve objectives; teachers and students can focus on using the language, skill, material, etc. being studied; teachers can determine the general cognitive level of their instruction (which can be changed, if necessary); and teachers' classroom management skills can be enhanced (see Brown, 1995, for more on this topic). It is evident that the use of objectives can reap numerous benefits for teachers. However, it is important to remember that the central reason for using behavioral objectives is to improve students' learning.

Educators involved in language teaching are obligated to provide foreign language learners the best possible learning environment, instruction, and assessment of their language skills. The intelligent use of learning objectives combined with other sound educational practices appears to be the optimum way to fulfill this responsibility.

References

Anderson, D., Ryan, W., & Shapiro, B. J. (Eds.). (1989). The IEA classroom environment study. Oxford: Pergamon.

Bachman, L. F. (1989). The development and use of criterion-referenced tests of language proficiency in language program evaluation. In R. K. Johnson (Ed.), The second language curriculum. Cambridge: Cambridge University Press.

Bloom, B. S. (Ed.). (1956). Taxonomy of educational objectives, handbook I: Cognitive domain. New York: David McKay.

Bloom, B. S. (1963). Testing cognitive ability and achievement. In N. L. Gage (Ed.), Handbook of research on teaching. Chicago: Rand McNally.

Bloom, B. S. (1994). Reflections on the development and use of the taxonomy. In L. W. Anderson & L. A. Sosniak (Eds.), Bloom's taxonomy: A forty-year retrospective. Chicago: University of Chicago.

Broekhoff, M. (1979). Behavioral objectives and the English profession. English Journal, 68(6), 55-59.

Brown, J. D. (1990). Where do tests fit into language programs? JALT Journal, 12(1), 121-140.

Brown, J. D. (1992). Classroom-centered language testing. TESOL Journal, 1(4), 12-15.

Brown, J. D. (1995). The elements of language curriculum: A systematic approach to program development. Boston: Heinle & Heinle.

Chung, B. M. (1994). The taxonomy in the Republic of Korea. In L. W. Anderson & L. A. Sosniak (Eds.), Bloom's taxonomy: A forty-year retrospective. Chicago: University of Chicago.

Docking, R. A. (1986). Norm-referenced and criterion-referenced measurement: A descriptive comparison. Unicorn, 12(1).

Doyle, E. (1986). Classroom organization and management. In M. C. Wittrock (Ed.), Handbook of research on teaching. New York: Macmillan.

Eisner, E. W. (1967). Educational objectives: Help or hindrance? School Review, 75(2), 250-260.

Ekstrand, J. M. (1982). Methods of validating learning hierarchies with applications for mathematics learning. Paper presented at the Annual Meeting of the American Educational Research Association, New York City, 1982. (ERIC Document Reproduction Service ED 216 896).

Gagne, R. M. (1977). The conditions of learning. New York: Holt, Rinehart, and Winston.

Gower, R., & Walters, S. (1983). Teaching practice handbook. Oxford: Heinemann.

Gronlund, N. E. (1978). Stating objectives for classroom instruction. New York: Macmillan.

Hill, P. W. (1984). Testing hierarchy in educational taxonomies: A theoretical and empirical investigation. Evaluation in Education, 8, 179-278.

Johnson, E. W., & Johnson, R. T. (1991). Classroom instruction and cooperative learning. In H. C. Waxman & H. J. Walberg (Eds.), Effective learning: Current research. Berkeley, CA: McCutchan.

Johnson, J. A. (1994). Using behavioral learning objectives to improve the teaching of scientific subjects. The Bulletin of the Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, 7, 99-107.

Krathwohl, D. R., Bloom, B. S., & Masia, B. B. (1964). Taxonomy of educational objectives, handbook II: Affective domain. New York: Longman.

Madaus, G., Woods, E. M., & Nuttall, R. L. (1993). A causal model analysis of Bloom's taxonomy. American Educational Research Journal, 10, 253-262.

Mager, R. F. (1962). Preparing instructional objectives. Palo Alto, CA: Fearon.

McKeachie, W. J. (1963). Research on teaching at the college and university level. In N. L. Gage (Ed.), Handbook of research on teaching. Chicago: Rand McNally.

McKeachie, W. J., & Kulik, J. A. (1975). Effective college teaching. In F. N. Kerlinger (Ed.), Review of research in education. Itasca, IL: Peacock.

Moore, W. J., & Kennedy, L. D. (1971). Evaluation of learning in the language arts. In B. S. Bloom, J. T. Hastings, & F. Madaus (Eds.), Handbook on formative and summative evaluation of student learning. New York: McGraw-Hill.

Orlandi, L. R. (1971). Evaluation of learning in secondary school social studies. In B. S. Bloom, J. T. Hastings, & F. Madaus (Eds.), Handbook on formative and summative evaluation of student learning. New York: McGraw-Hill.

Peterson, P. L., & Clark, C. (1986). Teachers' thought processes. In M. C. Wittrock (Ed.), Handbook of research on teaching. New York: Macmillan.

Popham, W. J. (1973). The uses of instructional objectives. Belmont, CA: Fearon.

Purves, A. C. (1971). Evaluation of learning in literature. In B. S. Bloom, J. T. Hastings, & F. Madaus (Eds.), Handbook on formative and summative evaluation of student learning. New York: McGraw-Hill.

Redfield, D., & Rousseau, E. W. (1981). A meta-analysis of experimental research on teacher questioning behavior. Review of Educational Research, 51, 244.

Richards, J. C., Platt, J., & Platt, C. (1992). Longman dictionary of language teaching & applied linguistics. Essex: Longman House.

Rohwer, W. D., & Sloane, K. (1994). Psychological perspectives. In L. W. Anderson & L. A. Sosniak (Eds.), Bloom's taxonomy: A forty-year retrospective. Chicago: University of Chicago.

Tumposky, N. R. (1984). Behavioral objectives, the cult of efficiency and foreign language learning: Are they compatible? TESOL Quarterly, 18(2), 295-310.

van Ek, J. A., & Alexander, L. G. (1977). Systems development in adult language learning: Waystage. Strasbourg: Council of Europe.


Section II

Program-Level Testing Strategies


Chapter 5

Developing Norm-Referenced Language Tests for Program-Level Decision Making

JAMES DEAN BROWN

UNIVERSITY OF HAWAII AT MANOA

When I was an ESL and EFL teacher and administrator, I encountered two basic types of language tests: tests that administrators need in order to classify or group students in some way or other, and tests that teachers need to help them determine what their students have learned in a particular course. These two types of tests are usually called norm-referenced tests and criterion-referenced tests (for more on this distinction, see Chapter 2 of this book, or Hudson & Lynch, 1984; Brown, 1984, 1988, 1989, 1990, 1992, 1993, 1995a, & 1995b; Bachman, 1990).

At the same time, I have long recognized that language tests are most commonly used to make six basic types of decisions in language programs: language aptitude, proficiency, placement, diagnostic, progress, and achievement decisions. Because their basic purpose is to classify or group students, norm-referenced tests are best suited for making three of these types of decisions: aptitude, proficiency, and placement decisions.

For instance, language aptitude tests, like the one I took at the beginning of my military service, are used to decide who will most benefit from language training (or from the U.S. Army's point of view: who will be their best investment?). In contrast, general language proficiency tests are most commonly used for admissions tests (in the way that TOEFL scores are used to decide which international students should be admitted to American universities). Later, when students have already been admitted to a particular institution, it may be necessary to use placement tests to decide which levels of language study students should pursue (like the English Language Institute Placement Test, or ELIPT, used at the University of Hawaii at Manoa to decide whether students should study at the intermediate or advanced levels or be exempted from ESL study). In all three cases (aptitude, proficiency, and placement tests), the scores are needed to make decisions that compare each student to all other students so they can be classified (i.e., as a good or bad investment in the case of aptitude; as admitted or not admitted in the case of proficiency; or placed in levels of language study in the case of placement testing).

Most of the language testing literature that has developed over the years has been about norm-referenced language testing in one way or another (see Bachman, 1990; Brown, 1995a for overviews of the norm-referenced language testing literature). Yet, few papers have looked at norm-referenced testing as a set of tools for making decisions in language programs. In this chapter, I will provide answers to three basic questions about norm-referenced testing:

1. What are norm-referenced tests, and what are they used for?

2. What do administrators need in their norm-referenced tests?

3. How should norm-referenced tests be developed and improved?

Examining these three questions, one at a time, should help administrators and teachers to understand how important norm-referenced testing is, as well as why and how they should be doing their norm-referenced testing.

What Are Norm-referenced Tests, and What Are They Used For?

Norm-referenced tests will be defined here as those tests which are used in language programs to assess students' aptitude to learn a language, measure their proficiency, or determine their appropriate level of study in a particular language program.

In the language programs that I have known, the norm-referenced tests were either (a) adopted from other sources, (b) developed within the program, or (c) adapted to meet the needs of the program. As director of the ELI at the University of Hawaii, I was an administrator who sometimes chose to adopt already existing norm-referenced tests. For instance, we adopted a test for our overall proficiency testing called the Test of English as a Foreign Language (ETS, 1990). We also chose to develop our own tests for placement purposes. Our placement test was the English Language Institute Placement Test (see Brown, 1989a, 1995a, or 1995b for fuller descriptions). Since we had to deal with all students who were accepted by the university, language learning aptitude was not an issue for us, so we have never done any aptitude testing.

At other institutions, administrators may choose to adopt placement tests, or adapt tests that are already in place in their institutions. The point is that putting sound norm-referenced tests into place is an important responsibility. This responsibility will typically fall on the administrator(s). Classroom teachers may be involved in norm-referenced testing as a sort of labor pool for administering and scoring the tests, but seldom are they directly involved in the selection, development, or adaptation of such tests. I am not saying that teachers should not take an interest in the norm-referenced types of decisions that are closely linked to program-level decisions. I am just saying that teachers are likely to be most directly responsible for and most interested in the criterion-referenced types of decisions discussed in Chapters 2, 3, and 4 of this book.

Regardless of the level of interest they take in the norm-referenced tests, teachers must recognize that they benefit in a number of ways from norm-referenced tests. For example, because of aptitude, proficiency, and/or placement testing, teachers may have relatively homogeneous groups of students to teach in each of their classes. Such homogeneity may take the form of having only students with a relatively high degree of language learning aptitude in the class, or only students above a certain admissions level as determined by a proficiency test, or only students within a certain band of abilities as determined by a placement test, or all three. Once provided such homogeneity, teachers can carefully tailor their classroom activities, exercises, homework, and so forth to the needs of a clearly defined group of students. Any teacher who has ever had to teach students with a wide range of aptitudes or abilities will easily understand the value of this benefit, a benefit that comes from using sound norm-referenced testing practices.

What Administrators Need in a Norm-referenced Test

The question of what administrators need in a norm-referenced test is really a two-part question. First, what types of test information do administrators need? Second, what qualities are desirable in a norm-referenced test?

Types of test information. Naturally, there are many types of information that a norm-referenced test might usefully provide, and the types will vary from institution to institution and administrator to administrator. However, I most often find myself needing to know: whether the students' language proficiency is high enough to enter our institution, and what level (if any) of ESL study they should be placed into. These are questions that can't be answered on the basis of the information provided on typical criterion-referenced tests because such issues are very global and can only be addressed by classifying students into groups for comparison using well-developed norm-referenced tests.

Ideal norm-referenced test qualities. An ideal norm-referenced test is one that fits the group for which it was designed in terms of the difficulty of the items, and one that has items on it that discriminate well between the high ability students and the low ability students. In addition, an ideal norm-referenced test is one that efficiently spreads the students out along a continuum of abilities so that grouping decisions can be made easily and responsibly. Whether adopting, developing, or adapting norm-referenced tests, administrators still have to examine the tests very carefully and critically, just as they do with textbooks, to determine whether or not the tests are up to their standards.

Unfortunately, most programs are unable to find off-the-shelf tests that meet their needs, especially for purposes of placement. A given placement test is designed for making placement decisions within a particular population of students, and such a test may be totally inappropriate for the different population of students found in another program. As a result, programs often end up writing their own tests.

The norm-referenced test review checklist shown in Table 1 provides a list of crucial questions that should be addressed when developing such a norm-referenced test. If you go down the list and carefully address each of the questions, you should be able to produce a test of respectable quality.

How Should Norm-referenced Tests Be Developed and Improved?

All tests should be developed such that the test items make sense in terms of the curriculum


Table 1. Norm-referenced test review checklist

Are the test questions that I have written directly related to the teaching materials and activities being used in class?

Are there enough test questions so that I can eliminate some bad ones?

Has at least one colleague critically looked over the test?

Is the test easily reproducible on locally available equipment?

If I have previously used the test, did I revise it on the basis of what happened in that administration?

If I have previously used the test, does that mean that it is available to some of the students?

Do I want to develop the test in multiple forms?

Are the directions concise and adequate?

Does the test have clear, complete, and correct answer keys and directions for scoring?

Is score interpretation clear?

Have provisions been made for clearly reporting the scores to the students?

Is the test of demonstrably good quality?

for which they are being designed. In other words, the items should be carefully written to reflect the teaching points and types of activities that go on in the classroom. Because these aspects of language teaching vary considerably, the decisions about the types of items to include will have to be made locally. However, there are several general approaches that can be used in developing and improving norm-referenced tests. Here, I will describe what I call the minimum process of norm-referenced test development, as well as what I call the full process of norm-referenced test development.

The minimum process of norm-referenced test development. In fact, the checklist shown in Table 1 can be used at various stages of the test development process to maximize the quality of your norm-referenced tests, and thereby minimize the probability of making errors in the aptitude, proficiency (particularly admissions), or placement decisions that you must make based on the tests. At minimum, you should include the following steps in your test development process:

1. Once the test is actually adopted, adapted, or developed, you should critique it using the checklist in Table 1.

2. Take the test yourself before you administer it. This will help you to spot problems and insure that you have an accurate answer key to work from.

3. Have a number of colleagues look over the test before administering it.

4. Take notes during the test administration on anything that the students ask about, or on any problems that you notice while correcting the tests.

5. Most important, revise the test immediately after administering it (while the problems are still fresh in your mind) and use all that you have learned in steps one to four above in the revision process.

If you follow these minimum steps and use the checklist provided in Table 1, you could find yourself in the enviable position of providing relatively sound norm-referenced tests for your students.

The full process of norm-referenced test development. To find out how well the test questions are functioning in a norm-referenced test, administrators must examine each question to find out how students perform on it. To do this effectively, the administrator must examine at least one set of test results gathered either by pilot testing it with a group of students like the ones who will eventually be tested, or by administering the test operationally and then analyzing the results.

On a question-by-question basis, several statistics exist that can help teachers to decide which test questions fit the ability levels of their group of students and discriminate well between the high and low ability students. The statistics explained here require only that a test be administered on one occasion.


Calculating the facility index. The facility index is one very easy statistic that can help teachers determine how well a particular test question fits the ability levels of the students involved. As mentioned in Chapters 2 and 3, the facility index is calculated by adding up the number of students who correctly answered a particular test question and dividing by the number of students who took the test; these calculations yield the proportion of students who answered correctly. For instance, if 40 (out of 200) students correctly answered a test question, the facility index would be 40/200 = .20. Understanding this index is easy because moving the decimal point two places to the right turns this index into a percent. In the example, .20 would become 20 percent. This specific question might be considered fairly difficult for the students in this group because only 20 percent answered correctly.

Interpreting the facility index. Interpreting facility indexes is relatively easy. For instance, a facility index of .95 indicates that 95 percent of the students answered the question correctly, and that it was a very easy question for the particular group of students who were tested. An item facility of .10 indicates that the question was very difficult for the students because only 10 percent could answer it correctly.

An ideal question for a norm-referenced test is one that 50-60 percent of the students answer correctly. Typically, testing books say that items can be retained in the norm-referenced test revision process if their facility values fall between .30 and .70 because items outside of that range really ought to be considered either too hard or too easy for the group being tested.
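To make the arithmetic concrete, here is a small illustration in code. This sketch is not from the original chapter: the function name, the choice of Python, and the tiny answer matrix are the editor's assumptions; only the definition of the facility index and the .30-.70 retention band come from the discussion above.

def facility_indexes(scored_answers):
    """Facility index (IF) per item: proportion of students answering it correctly."""
    n_students = len(scored_answers)
    n_items = len(scored_answers[0])
    return [sum(student[i] for student in scored_answers) / n_students
            for i in range(n_items)]

# Hypothetical data: 4 students x 3 items, 1 = correct, 0 = incorrect.
answers = [[1, 0, 1],
           [1, 1, 0],
           [1, 0, 0],
           [1, 1, 1]]

for item_number, fi in enumerate(facility_indexes(answers), start=1):
    note = "" if 0.30 <= fi <= 0.70 else " (outside the .30-.70 band)"
    print(f"Item {item_number}: IF = {fi:.2f}{note}")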

Calculating the discrimination index. Another straightforward statistic, called the discrimination index, is used to examine the degree to which the high ability, or top scoring, students on a norm-referenced test item answer it correctly as compared to the low ability, or bottom scoring, students. To calculate a discrimination index, you must first decide which students belong in the top scoring group and which belong in the bottom scoring group. Typically, the scores on the whole test are lined up from high to low and then the top and bottom third are chosen to be the top and bottom groups, as shown in Table 2. (Depending on the number of students, it may be more convenient to use the top and bottom 25 percent or something between 25 and 33 percent.) Once the top and bottom groups are decided, the discrimination index is calculated in three steps:

1. Calculate the item facility for the top group (see the row labeled IFtop in Table 2)

2. Calculate the item facility for the bottom group (see the row labeled IFbot in Table 2)

3. Calculate the item discrimination index (ID) by subtracting the item facility for the bottom group from the item facility for the top group (IFtop - IFbot = ID)

For example, item 1 in Table 2 has an item facility for the top group of 1.00 and an item facility for the bottom group of 1.00. In other words, everybody in the top and bottom groups answered this item correctly. Calculating the item discrimination index using the formula (IFtop - IFbot = ID), ID would turn out to be 1.00 - 1.00 = 0.00. In other words, this item has zero discrimination, which makes sense because the top and bottom groups did equally well on this item.

Let's consider several other examples. Item 2 in Table 2 has an item facility for the top group of 0.00 and an item facility for the bottom group of 0.00. In other words, everybody in the top and bottom groups answered this item wrong. Calculating the item discrimination index using the formula (IFtop - IFbot = ID), ID would turn out to be 0.00 - 0.00 = 0.00. In other words, this item has zero discrimination, which makes sense because the top and bottom groups did equally poorly on the item.

In contrast, on item 3 in Table 2, the item facility for the top group was 1.00 and that for the bottom group was 0.00. In other words, everybody in the top group answered this item correctly and everyone in the bottom group answered it incorrectly.

Table 2. Calculating NRT Discrimination Indexes

Student     1     2     3     4     5     6     7     8     9    10   Total

YOKO        1     0     1     0     1     1     1     1     1     1      47
TOSHI       1     0     1     0     0     0     1     0     1     1      44
SAYOKO      1     0     1     0     1     1     1     0     1     0      42
HIDE        1     0     1     0     0     1     0     0     1     0      41
NAOYO       1     0     1     0     1     1     1     1     1     1      39
RIEKO       1     0     1     0     0     1     1     0     1     1      35
KEIKO       1     0     0     1     1     1     1     0     0     1      34
NAOMI       1     0     0     1     0     0     1     0     0     1      30
TAEKO       1     0     0     1     1     1     1     0     1     0      30
YUKIE       1     0     0     1     0     0     0     0     1     0      30
ASAKO       1     0     0     1     1     0     0     0     0     0      27
HIROTO      1     0     0     1     0     0     0     0     1     0      13

IF        1.00  0.00  0.50  0.50  0.50  0.58  0.67  0.17  0.75  0.50
IFtop     1.00  0.00  1.00  0.00  0.50  0.75  0.75  0.25  1.00  0.50
IFbot     1.00  0.00  0.00  1.00  0.50  0.25  0.25  0.00  0.75  0.00
ID        0.00  0.00  1.00 -1.00  0.00  0.50  0.50  0.25  0.25  0.50


Calculating the item discrimination index using the formula (IFtop - IFbot = ID), ID would turn out to be 1.00 - 0.00 = 1.00. This value of 1.00 is as high as the discrimination index can go. In other words, this item has perfect discrimination because it discriminated as well as is possible between the top and bottom groups (i.e., everyone in the top group did as well as possible and everyone in the bottom group did as poorly as possible).

Item 4 in Table 2 shows what happens when for some reason the students in the bottom group all get the item right (1.00) and everybody in the top group answers it incorrectly (0.00). In this case, ID would turn out to be 0.00 - 1.00 = -1.00. This value of -1.00 is as low as the discrimination index can go. It indicates that the item is doing something completely opposite from the test as a whole (i.e., everyone in the top group did as poorly as possible and everyone in the bottom group did as well as possible).

However, these first four examples are extreme examples because they were designed to illustrate the values that a discrimination index can take on. If a set of norm-referenced test items are well written, they will typically take on intermediary values more like those shown in items 5 to 10 in Table 2.
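The three-step procedure above is easy to automate once the students have been sorted by total score. The sketch below is the editor's illustration rather than part of the chapter; the function name and the use of Python are assumptions, and the example responses are simply the item 6 column of Table 2, listed in the same high-to-low order as the table.

def discrimination_index(item_responses, group_fraction=1/3):
    """ID for one item, given responses sorted by total test score (high to low)."""
    n = len(item_responses)
    group_size = round(n * group_fraction)        # top and bottom thirds by default
    top = item_responses[:group_size]
    bottom = item_responses[-group_size:]
    if_top = sum(top) / group_size                # step 1: IF for the top group
    if_bot = sum(bottom) / group_size             # step 2: IF for the bottom group
    return if_top - if_bot                        # step 3: ID = IFtop - IFbot

# Item 6 column from Table 2 (students already sorted by total score).
item_6 = [1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0]
print(discrimination_index(item_6))               # 0.75 - 0.25 = 0.50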

Interpreting the discrimination index. Remember, the point of item statistics is to select those items that discriminate best for a revised version of the test. Deciding which items to keep and which to eliminate will depend in part on how many items you need to keep. Let's say that four items are needed from this first batch of 10. The decision would then involve examining the ID values and keeping those items that, in this case, have IDs of .50 or higher (items 3, 6, 7, and 10). If six items were needed, then the standard for ID might drop considerably to .25 to include items 8 and 9.

You should also keep an eye on the average IF for the items that remain on a revised version of the test, because it indicates how difficult the test has become in the revision process. For instance, in the six-item revision in the above example, the average IF would be (.50 + .58 + .67 + .17 + .75 + .50)/6 = .53. In other words, the average percent answering the items correctly was 53 percent. Of course, if item 8 (with its IF of .17) is also eliminated from the test, the resulting average IF for the remaining five items will be (.50 + .58 + .67 + .75 + .50)/5 = .60. Since this indicates that 60 percent on average are answering correctly, the test will be considerably easier than the six-item version with its average IF of .53. You will simply have to decide how easy the test should ultimately be.
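A quick way to double-check this kind of averaging during revision is shown below. It is an editor's sketch, not part of the chapter, and it simply reuses the IF values quoted in the paragraph above.

# Average IF of the items kept in each hypothetical revision (values from above).
six_item_revision = [.50, .58, .67, .17, .75, .50]
five_item_revision = [.50, .58, .67, .75, .50]     # item 8 (IF = .17) dropped

print(round(sum(six_item_revision) / len(six_item_revision), 2))    # 0.53
print(round(sum(five_item_revision) / len(five_item_revision), 2))  # 0.6, i.e., 60 percent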

In short, selection of items in any norm-referenced test revision process must also take into account common sense and practical considerations like how long the test must ultimately be and how difficult it should be. As a result, the only rule of thumb I can give you for interpreting ID in a norm-referenced testing situation is to try to keep those items which have the highest discrimination indexes while keeping a sharp eye on the quality of the items themselves, the number of items that are needed, and the difficulty of the resulting test.

To expand the minimum process of norm-referenced test development given earlier, a number of steps can now be added, especially the steps that examine how students are performing on the individual test questions. The full process would thus include at least the following steps:

1. To improve test validity, you should cri-tique the test using the checklist in Table 1with special attention to checking that thetest questions match the sorts of thingstaught in your program.

2. Take the test yourself before you adminis-ter it. This will help you to spot problemsand insure that you have an accurate an-swer key to work from.

3. Have a colleague look it over before ad-ministering it.

4. Take notes during the test administrationon anything that the students ask about, oron any problems that you notice whilecorrecting the tests.

5. Compile the results (from a piloting ofyour test, or from an operational adminis-


6. Code each student's answers in such a way that you can calculate the percent of students who answered each test question correctly.

7. Calculate overall facility indexes (as shown in Table 2).

8. Arrange your item data so that they are tabulated according to the students' total scores, from high to low.

9. Identify a top and a bottom scoring group consisting of about 33 percent of the students each.

10. Calculate the item facility index for each group (as shown in Table 2).

11. Calculate a discrimination index for each item by subtracting the item facility for the bottom group from the item facility for the top group (a short computational sketch of steps 7 to 11 follows this list).

12. Interpret the facility and discrimination indexes in terms of their ramifications for revising the particular test involved, while keeping in mind the quality of the actual items, the number of items that are needed, and the difficulty of the resulting test.

13. Most important, revise the test immediately after administering and analyzing it (while the problems are still fresh in your mind) and use all that you have learned in steps one to twelve above.
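As a concrete illustration of steps 7 through 11, the sketch below works from a small, made-up matrix of dichotomously scored (0/1) answers, one row per student and one column per item; the data, the group-size choice, and the variable names are illustrative only.

    # Hypothetical answer matrix: 6 students x 4 items, scored 0/1.
    answers = [
        [1, 1, 1, 0],
        [1, 1, 0, 1],
        [1, 0, 1, 0],
        [0, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
    ]

    # Step 8: tabulate according to total scores, from high to low.
    ranked = sorted(answers, key=sum, reverse=True)

    # Step 9: top and bottom groups of about one third of the students each.
    third = max(1, len(ranked) // 3)
    top, bottom = ranked[:third], ranked[-third:]

    def facility(group, item):
        """Steps 7 and 10: proportion of the group answering the item correctly."""
        return sum(row[item] for row in group) / len(group)

    for item in range(len(answers[0])):
        if_overall = facility(ranked, item)                       # step 7
        id_value = facility(top, item) - facility(bottom, item)   # step 11
        print(f"item {item + 1}: IF = {if_overall:.2f}, ID = {id_value:.2f}")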

If you follow these 13 steps, you will have created (a) a better quality test, (b) a test that is more closely related to your students' ability levels, and (c) a test that spreads your students out more efficiently so that you can make more responsible norm-referenced decisions such as the admissions and placement decisions we make regularly at the University of Hawaii at Manoa.

Conclusions

This chapter began by defining norm-referenced tests and listing some of the different types of information that can be gathered using norm-referenced tests. It also provided a checklist for reviewing the quality of NRTs and explained the processes involved in developing and revising them, including two statistics, the facility and discrimination indexes.


In short, quite a bit of ground was covered in this discussion of NRTs, yet the subject is far from being exhausted.

The single greatest problem in norm-referenced language testing is that administrators don't recognize that most off-the-shelf tests may not be suitable for their particular language program. NRTs are developed using item facility and discrimination statistics to fit a particular group of students. Since groups of students vary widely from one institution to another, it may be totally inappropriate to use a norm-referenced placement test developed and published by one institution at another institution. For instance, as director of the ELI at the University of Hawaii at Manoa, I often got calls from directors of other language programs in Hawaii (and elsewhere) asking if they could copy or buy our placement test. My answer was always a resounding, though polite, "NO!" for two reasons: (a) I did not want to compromise the security of our six-test battery, which involved a great investment of time, knowledge, energy, and money; and (b) our tests were tailored by item analysis to fit our group of students (who ranged from about 500 to 600 on the TOEFL), and thus our tests would be totally inappropriate at most other institutions, where the ranges of ability are typically quite different.

Unfortunately, many administrators do not have the time or knowledge to build in-house NRTs that will be suitable for their programs. To address this issue, strategies must be developed for finding or training people who will have the time and knowledge to develop in-house NRTs that will help in making aptitude, proficiency (particularly admissions), and placement decisions as necessary. If you are too busy to apply the information supplied in this article, there are three other strategies you might try: (a) hire a teacher who is also trained in the important area of language testing and release that person from some teaching duties so that he/she can develop effective in-house tests; (b) identify one teacher who is interested in testing and send that teacher to one of the many ESL/EFL teacher training programs around the world where language testing courses are offered (for instance, the TESOL Summer Institutes; one or two of these occur every year); or (c) hire a testing specialist to come briefly to your program to work with and train selected staff members (perhaps in a hands-on manner by helping them to develop tests for your program).

In short, NRTs must be accorded a much higher priority in the administrator's program development plans. Important decisions about students' lives are made on the basis of these tests. Hence, they must be accorded a more important place in language curricula everywhere. To do so, language programs must marshal increasing resources in terms of time, money, materials, and training so that administrators and teachers can develop and use norm-referenced testing as an important, effective, and integral part of their decision-making processes.

In essence, resources must be found to help people like you work on your NRTs. Given even minimal support, you can develop sound NRTs that can in turn help you to make more effective decisions. Norm-referenced tests like the university entrance examinations in Japan, which may or may not be of good quality (see Brown & Yamashita in Chapter 10), are already a reality to most students in Japan, and that is all the more reason why you must take responsibility for creating sound norm-referenced tests in your language program so that at least the aptitude, admissions, and placement decisions that you make within your program will be efficient, effective, and fair to your students.

References

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Brown, J. D. (1988). Understanding research in second language learning: A teacher's guide to statistics and research design. Cambridge: Cambridge University Press.

Brown, J. D. (1989). Improving ESL placement tests using two perspectives. TESOL Quarterly, 23(1), 65-83.

Brown, J. D. (1990). Where do tests fit into language programs? JALT Journal, 12(1), 121-140.

Brown, J. D. (1992). Classroom-centered language testing. TESOL Journal, 1(4), 12-15.

Brown, J. D. (1993). A comprehensive criterion-referenced language testing project. In D. Douglas and C. Chapelle (Eds.), A new decade of language testing research (pp. 163-184). Washington, DC: TESOL.

Brown, J. D. (1995a). Testing in language programs. Englewood Cliffs, NJ: Prentice-Hall.

Brown, J. D. (1995b). The elements of language curriculum: A systematic approach to program development. New York: Heinle & Heinle.

ETS. (1990). Test of English as a foreign language. Princeton, NJ: Educational Testing Service.

Hudson, T., & Lynch, B. (1984). A criterion-referenced approach to ESL achievement testing. Language Testing, 1(2), 171-201.


Chapter 6

Monitoring Student Placement: A Test-Retest Comparison

SAYOKO OKADA YAMASHITA

INTERNATIONAL CHRISTIAN UNIVERSITY

Many Japanese language education institutions either use the Test of Japanese Language Proficiency¹ or a placement test that they developed in-house to measure the proficiency of their new students. The Japanese Language Program (JLP) at International Christian University (ICU) is one institution that uses its own placement test.

Every September at ICU, the general JLP Placement Test (JLPPT) is administered to students in the JLP. The JLPPT consists of an aural comprehension subtest, a comprehensive structure subtest, and a reading, kanji, and vocabulary subtest. After three terms (Fall, Winter, and Spring), or nine months later, the same placement test is given to the same group of students to retest their proficiency. A report of the results of the September placement test, the retest in June, and the difference between the two tests (or gain score) is made to each student. For example, if students who were placed in the third level Japanese course (J3) in September on the basis of their JLPPT scores pass this course with a grade of A to D, they will successfully advance to J4 in the Winter term and J5 in the Spring term. It is expected that the retest scores in June, at the end of the Spring term, will be equivalent to or at least close to the placement score necessary to get into J6. This should be the end-of-term proficiency score of a student who has mastered the J5 course requirements.
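For illustration only, the gain-score report described above amounts to a subtraction per subtest; the student record below is a made-up example, not actual JLPPT data.

    # Hypothetical placement (September) and retest (June) scores for one student.
    placement = {"aural": 17, "structure": 48, "reading": 60}   # TEST1
    retest    = {"aural": 19, "structure": 55, "reading": 68}   # TEST2

    for subtest in placement:
        gain = retest[subtest] - placement[subtest]
        print(f"{subtest}: {placement[subtest]} -> {retest[subtest]} (gain {gain:+d})")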


In reality, though, many students are disappointed to find that their retest scores are far below the scores which they expected. Why does this happen? Is there a similar effect for all students in all levels? In short, is there a mismatch between the standard required of those students who are placed into a level as compared to the proficiencies of those students who are promoted into the same level from lower courses?

Brown (1980) investigated proficiency data from two different student populations within some ESL classrooms and found that the two populations, newly placed students and continuing students (those promoted from lower courses), were significantly different. He also claimed that the difference in performance was expected to be greatest at the most advanced level (Brown, 1980, p. 113), but he did not investigate this last claim statistically. The results of Brown's study indicated that the newly placed students performed far better than continuing students on the three different measures: course grade, final examination, and cloze procedure. Brown suggested three possible causes: (a) the amount of instructional time devoted to the subject's ESL study, (b) the amount and nature of the subject's previous EFL study, and (c) the amount of time that had passed since that previous EFL study (Brown, 1980, p. 116).

The present study focuses on issues similar to those investigated by Brown, i.e., the potential mismatch in proficiency between the students finishing a certain level with a passing grade in the Spring and going into the next level in the Fall and students newly placed in that level in the Fall.


Unlike Brown's study, however, this study compares the proficiency score means of two different student groups in the same course level at different times (i.e., Fall and Spring terms) by using a pretest-posttest design.²

In the field of Japanese as a second language (JSL), there have been a number of studies about developing placement tests (Ichikawa & Ogawa, 1991; Ishida, Inagaki, & Nakamura, 1982; Kijima, 1988; Saegusa, 1986, 1988; Sakai, 1990), improving placement tests (Hiura, 1989; Suzuki, 1989; Taji, 1987, 1988), or measuring differences in the distribution of placement scores according to native language or other background variables (Sakai, 1988, 1990; Kano & Shimizu, 1991). However, to this researcher's knowledge, no studies have systematically compared and reported the placement test scores of different student populations across levels.

The purpose of the present study is to investigate potential proficiency differences between subjects in the Fall and Spring terms and possible reasons for such differences. If problems with placement are found, we should further study the appropriateness of decisions based on cut-point scores at each level, analyze the test items, and investigate the curriculum of the program. This study, then, formulates the following research questions:

1. Do continuing students in the Spring courses perform the same (at the end of their courses) as students at equivalent levels who are placed in the Fall (at the beginning of their courses)?

2. If there are differences in performance, are the differences observed at all levels or in some particular levels only?

Thus the following hypotheses are being studied:

H1: Newly placed students in the Fall will perform better than continuing students in the equivalent levels in the Spring.


H2: The differences between the two groups will be greater in the more advanced levels (following Brown's 1980 claim).

Method

Subjects

A total of 44 students from LEVEL4 to LEVEL7 took the JLPPT as a pretest (TEST1) and had scores on all subtests in the Fall semester of 1992. This Fall semester group included all newly entering students and some students from a previous summer course.³ In the Spring semester, a total of 63 students from LEVEL4 to LEVEL7 took the JLPPT as a retest (TEST2) and had scores on all three subtests. The reason for choosing levels 4 to 7 will be explained in the Procedures section.

On the retest given at the end of each term, the scores are expected to be equivalent to the scores of students beginning the next level. Hence, the means at the beginning of each level (i.e., LEVEL4 to LEVEL7) on the Fall 1992 pretest (TEST1) were compared with the means at the end of the Spring 1993 retest (TEST2) for the level just below it (see Table 1). It was generally believed that the students taking the retest (TEST2) at the end of one level would be equivalent to students just beginning the next level and taking the pretest (TEST1).

Table 1. Comparison of equivalent levels

LEVEL     TEST1 (beginning of course)    TEST2 (end of course)
LEVEL4    J4 (beginning)                 J3 (end)
LEVEL5    J5 (beginning)                 J4 (end)
LEVEL6    J6 (beginning)                 J5 (end)
LEVEL7    Advanced 1 (beginning)         J6 (end)

The subjects in this study are described in terms of nationality, gender, academic status, and major for each level in Tables 2, 3, 4, and 5, respectively.


Tables 2 through 5 indicate that the students on TEST1 and TEST2 (labeled T1 and T2) are similarly distributed in terms of nationality (Americans are predominant), gender, academic status (undergraduate is predominant), and major (language majors are predominant). Although these variables could become moderator variables in a study such as this one, the distributions appear to be approximately equivalent for the two groups, so the potential effects appear to be controlled and will not be considered in this study. Note that such variables have remained uncontrolled in language studies elsewhere (Brown, 1988).

Materials

The only material used in this study was the JLP Placement Test, which was used for both the pretest (TEST1) and the retest (TEST2). The separate scores on three subtests were used: the aural comprehension subtest, the comprehensive structure subtest, and the reading, kanji, and vocabulary subtest. Note that all of the questions on all three subtests were multiple-choice.

Basic descriptive statistics are given in Table 6. Notice in Table 6 that all three subtests are reasonably well-centered (as indicated by the means) and disperse the students well (as indicated by the relatively high standard deviations, SD).

Table 2. Nationality (T1 = TEST1, T2 = TEST2)

Nationality        L3(T2)  L4(T1)  L4(T2)  L5(T1)  L5(T2)  L6(T1)  L6(T2)  L7(T1)
USA                   6       7       3       7      17       5      19       2
Other countries*      7       5       3       8       2       3       6       7
TOTAL                13      12       6      15      19       8      25       9

*Canada, China, Denmark, Germany, Ireland, Japan, Korea, Mexico, the Netherlands, Russia, Taiwan, Turkey, the UK, and UK/Hong Kong (one to three students per cell).


Notice also that the reliability estimates (both alpha and K-R20) are reasonably high for the aural comprehension subtest and very high for the other two subtests.
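The reliability estimates reported in Table 6 can be computed directly from the item responses; K-R20 is the special case of Cronbach's alpha for dichotomously scored items. The sketch below uses a small made-up score matrix, not the JLPPT data or the software actually used for Table 6.

    # K-R20 for dichotomous items; `items` is a list of item columns (0/1 scores).
    def kr20(items):
        k = len(items)                       # number of items
        n = len(items[0])                    # number of examinees
        totals = [sum(col[s] for col in items) for s in range(n)]
        mean_total = sum(totals) / n
        var_total = sum((t - mean_total) ** 2 for t in totals) / (n - 1)
        sum_pq = sum((sum(col) / n) * (1 - sum(col) / n) for col in items)
        return (k / (k - 1)) * (1 - sum_pq / var_total)

    # Hypothetical data: 4 items scored for 6 examinees.
    items = [
        [1, 1, 1, 0, 1, 0],
        [1, 1, 0, 1, 0, 0],
        [1, 0, 1, 0, 0, 0],
        [1, 1, 1, 1, 0, 0],
    ]
    print(f"K-R20 = {kr20(items):.2f}")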

Table 3. Gender distribution (T1 = TEST1, T2 = TEST2)

Gender     L3(T2)  L4(T1)  L4(T2)  L5(T1)  L5(T2)  L6(T1)  L6(T2)  L7(T1)
Male          7       8       5      10      10       7      17       4
Female        6       4       1       5       9       1       8       5

Table 4. Academic status (T1 = TEST1, T2 = TEST2)

Academic status   L3(T2)  L4(T1)  L4(T2)  L5(T1)  L5(T2)  L6(T1)  L6(T2)  L7(T1)
Undergraduate       12       8       6      11      18       8      18       7
Graduate             1       4       0       4       1       0       7       2

Table 5. Major (T1 = TEST1, T2 = TEST2)

Major            L3(T2)  L4(T1)  L4(T2)  L5(T1)  L5(T2)  L6(T1)  L6(T2)  L7(T1)
Language            5       5       5       7      13       6      14       5
Other majors*       8       7       1       8       6       2      11       4
TOTAL              13      12       6      15      19       8      25       9

*Humanities, Social Science, International Studies, Education, and Graduate Program.

Procedures

TEST1 was administered in the beginning of the Fall semester in 1992 as a general placement test. As described in the Subjects section, newly entering students and some students from the summer course took TEST1.

The students themselves selected which course series they would take: either the Intensive Japanese Course Series (twenty-two 70-minute periods a week) or the semi-intensive course, which is called the Japanese Series (ten 70-minute periods a week).


Table 6. Placement test results (by subtest) for the two groups combined

                        Poss.                               Reliability
Subtest           k     score     n      Mean      SD      Alpha     K-R20
Aural Comp.      25       25     144    14.67     4.56     .7751     .7747
Structure        80      100     144    39.74    14.30     .9347     .9568
Reading         100      100     144    55.33    16.96     .9387     .9287

Then they were placed in the appropriate level in either series according to their scores on the three subtests of TEST1. In this study, the mean scores of each subtest for the student groups taking Japanese Course Series J3 (beginning level), J4-J5-J6 (intermediate levels), and Advanced 1 will be analyzed. For simplicity, the J3 to Advanced 1 levels will be labeled LEVEL3, LEVEL4, LEVEL5, LEVEL6, and LEVEL7, respectively. The scores of those students who were in the Intensive courses, or in J1, J2, and Advanced 2 of the Japanese series, were not included in this study because those courses were not offered regularly and were missing in either the Fall or Spring semesters.

Analysis

Essentially, the analyses in this study focus on the means for each level on TEST1 (administered as a pretest in the Fall term) as compared with the means of students at equivalent levels on TEST2 (which was conducted program-wide at the end of the Spring semester in June, 1993).

In this pretest-posttest design, mean comparisons were first made by using two-way multivariate analysis of variance (MANOVA) procedures, since there were three dependent variables (the three sets of scores on the three subtests). The two independent variables were the Tests (TEST1 and TEST2) and the Levels (LEVEL4 to LEVEL7). Pillais, Hotelling, and Wilks statistics were converted to F ratios to determine whether there were significant overall differences across the dependent variables for each factor (i.e., Tests and Levels). Then, where significant multivariate differences were found, univariate analyses were justified to discover where more specific significant differences might lie. Univariate analysis of variance procedures (and appropriate F ratios) were calculated to estimate the differences for Tests and Levels on the individual dependent variables, namely, the aural comprehension subtest (SCOREA), the comprehensive structure subtest (SCORES), and the reading, kanji, and vocabulary subtest (SCORER). It is important to note that the subjects in this study were only those who had complete data (i.e., the results for those subjects with any missing scores were not reported here and were not included in the analysis).
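For readers who want to run this kind of analysis themselves, the sketch below shows one way a two-way MANOVA with univariate follow-ups might be set up in Python with statsmodels. The DataFrame and its column names (SCOREA, SCORES, SCORER, Test, Level) are synthetic stand-ins for the variables described above; this is not the software or the data used in the original study.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols
    from statsmodels.multivariate.manova import MANOVA

    # Synthetic stand-in data with the same factor structure as the study.
    rng = np.random.default_rng(0)
    n = 80
    df = pd.DataFrame({
        "Test": rng.choice(["TEST1", "TEST2"], n),
        "Level": rng.choice(["LEVEL4", "LEVEL5", "LEVEL6", "LEVEL7"], n),
        "SCOREA": rng.normal(16, 4, n),
        "SCORES": rng.normal(45, 12, n),
        "SCORER": rng.normal(55, 15, n),
    })

    # Two-way MANOVA on the three subtests (Pillai's, Hotelling's, Wilks').
    manova = MANOVA.from_formula("SCOREA + SCORES + SCORER ~ C(Test) * C(Level)", data=df)
    print(manova.mv_test())

    # Univariate follow-ups for each subtest.
    for dv in ["SCOREA", "SCORES", "SCORER"]:
        model = ols(f"{dv} ~ C(Test) * C(Level)", data=df).fit()
        print(dv)
        print(sm.stats.anova_lm(model, typ=2))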

Table 7A. Descriptive statistics and significance of differences in the aural comprehension subtest (SCOREA)

           TEST1                    TEST2                    Mean difference
           n     Mean     S         n     Mean     S         (TEST2 - TEST1)
LEVEL4    12    14.17    2.86      13    12.77    2.20            -1.40
LEVEL5    15    17.20    3.75       6    14.33    2.50            -2.87
LEVEL6     8    21.13    2.17      19    16.63    4.33            -4.50
LEVEL7     9    21.56    2.19      25    18.44    1.71            -3.12


Figure 1A. Aural comprehension subtest (SCOREA): mean scores for Test 1 (x) and Test 2 (+) plotted by level (Level 4 to Level 7).

Null hypotheses of no differences between group means were adopted, and the alpha decision level was set at alpha < .01.

Results

The multivariate analysis of variance procedures (using Pillais, Hotelling, and Wilks tests) indicated significant overall differences for all three subtests taken together in Levels (p < .001; df = 3, 99) and Tests (p < .001; df = 1, 99), and showed no significant differences for the Levels by Tests interaction (p < .01; df = 3, 99). Univariate follow-up analyses for the main effects due to Levels and Tests indicated that significant differences were found (p < .001) for all three subtests on both variables. The descriptive statistics are given in Tables 7A, 7B, and 7C, and the results are shown graphically in Figures 1A, 1B, and 1C.

Discussion

In short, all three subtests, SCOREA, SCORES, and SCORER, were significant for Test and Level effects at p < .001, and none were significant (at p < .01) for the Levels by Tests interaction.

Figure 1B. Comprehensive structure subtest (SCORES): mean scores for Test 1 (x) and Test 2 (+) plotted by level (Level 4 to Level 7).

Table 7B. Descriptive statistics and significance of differences in the comprehensive structure subtest (SCORES)

           TEST1                    TEST2                    Mean difference
           n     Mean     S         n     Mean     S         (TEST2 - TEST1)
LEVEL4    12    35.58    4.68      13    29.15    7.26            -6.41
LEVEL5    15    48.67    5.01       6    36.67    4.89           -12.00
LEVEL6     8    56.63    6.57      19    45.58    9.91           -11.05
LEVEL7     9    68.44    4.75      25    49.00    7.89           -19.44


Table 7C. Descriptive statistics and significance of differences in the reading, kanji, and vocabulary subtest (SCORER)

           TEST1                    TEST2                     Mean difference
           n     Mean     S         n     Mean      S         (TEST2 - TEST1)
LEVEL4    12    52.35    7.94      13    40.62    10.60           -11.73
LEVEL5    15    60.60   15.23       6    49.67    11.50           -10.93
LEVEL6     8    69.13   11.26      19    59.26     9.47            -9.87
LEVEL7     9    87.89    4.35      25    64.60    10.99           -23.29

Answering the Research Questions

Research question 1 asked if continuing students in the Spring courses perform the same (at the end of their courses) as students at equivalent levels who are placed in the Fall (at the beginning of their courses). The results showed that the placed students in the Fall performed better at all levels. The findings supported the hypothesis (i.e., the newly placed students in the Fall will perform better than the continuing students in the equivalent levels in the Spring).

Research question 2 asked, if there are differences in performance, are the differences observed at all levels or in some particular levels only? In the SCOREA subtest, the differences become larger as the levels increase (from L4 to L5 to L6), but the difference becomes smaller again when it gets to L7. For the SCORES and SCORER subtests, the differences are similar for levels L4, L5, and L6, but get larger for L7.

Figure 1C. Reading, kanji, and vocabulary subtest (SCORER): mean scores for Test 1 (x) and Test 2 (+) plotted by level (Level 4 to Level 7).

However, since the MANOVA showed no significant interaction effect for Levels by Tests (p < .01), hypothesis 2 (i.e., Brown's hypothesis that the differences between the two groups would be greater in more advanced levels) is not supported by the results of this study.

Other Pertinent Issues

Sample sizes. In this study, only complete data which fitted the research problem were used. The purpose of the study was to explore whether there were any differences between the student populations in the Fall and Spring semesters. Hence the sample sizes were relatively small.

Sampling procedures. Another problem existed with TEST2. Although the retest procedure (TEST2) was a program-wide procedure and was conducted in each classroom using instructional time, it was still up to the students to decide whether or not to take the test.

Also, since advanced courses were offered separately (Aural Comprehension, Writing, and Reading), some students did not take some subtests, which caused some reduction in the data.

Also, since the subjects were all ICU exchange students, predominantly American, the generalizability of this study may be limited. Studies at other institutions with other types of students would be helpful in this regard.


Students' background knowledge. The length of Japanese language study, as well as the teaching methods and materials (including textbooks) in the previous institutions, may have affected the relative performances of the groups taking TEST1 and TEST2. In other words, if the average length of previous Japanese study for students in any given level in the Fall was longer than that of students in the comparable level in the Spring, it is possible that the mean of TEST1 would be higher than the mean of TEST2 for those students due to differences in background alone rather than to differences in placed or continuing student status. Students' background should be taken into account in future studies.

Familiarity with the test. All the test questions were multiple-choice, and a computer-readable answer sheet was used for the purpose of computer processing the scores. This may have caused problems for some students. Some European and Asian students who are at intermediate and upper levels are not familiar with such a test format. They may have performed more poorly than their real proficiency level on TEST1 simply because of the novel answer sheet. However, this variable is very difficult to control.

Score distribution in each level. After taking the initial placement test, the students are placed according to the score ranges shown in Table 8 for each subtest. These norms were set based on the experience of the teachers who were involved in making each subtest. However, the norms were never checked statistically. Mismatches between the norms and the mean scores of the real levels of proficiency attained by the students may have caused the differences in the scores on TEST1 and TEST2. Closer study of this variable is necessary and important. However, again, such study is beyond the scope of this chapter, and the researcher only suggests that it should be done in the future.
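As a rough illustration of how the Table 8 ranges work, the sketch below places a hypothetical aural comprehension score into a level. The function name and the handling of scores outside the listed ranges are assumptions made here for illustration, not part of the JLPPT procedure.

    # Inclusive (low, high) cut ranges for the aural comprehension subtest (Table 8).
    AURAL_RANGES = {
        "LEVEL4": (12, 15),
        "LEVEL5": (16, 18),
        "LEVEL6": (19, 21),
        "LEVEL7": (22, 23),
    }

    def place_by_aural(score):
        for level, (low, high) in AURAL_RANGES.items():
            if low <= score <= high:
                return level
        return None  # outside the ranges listed in Table 8

    print(place_by_aural(17))   # LEVEL5
    print(place_by_aural(10))   # None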

Reliability and validity of the test. As shown in Table 6, all three subtests were reliable (Aural Comprehension K-R20 = .77; Comprehensive Structure K-R20 = .96; Reading, Kanji, and Vocabulary K-R20 = .93).


Table 8. Score ranges for placement using the JLPPT

          Aural Comp.    Structure    Reading
LEVEL4      12 - 15       41 - 55     45 - 59
LEVEL5      16 - 18       56 - 65     60 - 69
LEVEL6      19 - 21       66 - 75     70 - 79
LEVEL7      22 - 23       76 - 85     80 - 84

However, the placement test was never studied to determine if it was correlated with other such tests (e.g., the Test of Japanese Language Proficiency) as a measure of criterion-related validity. Such a study and/or other content and construct validity studies would help in examining and improving the JLPPT.
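The criterion-related validity study suggested here would, at its simplest, correlate JLPPT scores with scores on an external criterion measure. A minimal sketch, with entirely hypothetical score pairs, might look like this:

    # Correlate JLPPT totals with an external criterion test (hypothetical data).
    from scipy.stats import pearsonr

    jlppt_total = [112, 95, 130, 88, 142, 121, 99, 135]
    criterion   = [310, 265, 350, 240, 372, 330, 270, 355]

    r, p = pearsonr(jlppt_total, criterion)
    print(f"r = {r:.2f}, p = {p:.4f}")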

Curriculum. In lower levels, course content is decided according to structural (grammatical) and functional syllabuses developed by the JLP. The placement test questions for lower levels were based on the new textbook which was recently developed by the JLP. So, the students in the lower levels may have a better chance of achieving higher scores on TEST2 after they have studied this particular syllabus. In contrast, the syllabus for the intermediate levels and above is still under development by the JLP. Consequently, the curriculum for the upper levels is not firm and varies slightly in each term depending on which instructors are teaching in any particular term. Thus, even if the test questions are reliable, they do not necessarily reflect what the students at these levels were studying in their courses during the regular year. As a result, the intermediate and advanced students may have performed somewhat more poorly on TEST2. That is, the curriculum may have affected the results. However, this issue is also beyond the scope of this study. Nevertheless, further study of this question is needed.

Conclusions and Pedagogical Implications

By comparing the scores of the placement test administered in the beginning of the program with scores on the retest administered nine months later, this study revealed that there were potential differences between populations that should be similar at the beginning and end of the school year. The differences were consistent.


There are several straightforward pedagogical implications of this study. First, the differences in the scores between TEST1 and TEST2 could be adjusted by applying more appropriate norms which reflect the scores of the population of each level on each subtest as they progress through the program. It may be important that such adjustments be made because the observed differences may discourage students who find, after taking the retest, that their scores were lower than expected after a year of study. This could cause a loss of interest or motivation in any student learning a language.

Second, the test questions for the intermediate and advanced levels should be carefully revised based on the syllabus taught in those levels so that they will reflect the content matter which is covered by the curriculum in each level. Of course, the purpose of a placement test is to spread students out into a normal distribution, and in that sense, it is a norm-referenced test. However, the researcher suggests that criterion-referenced type questions may also be added to reflect the "domain of language skills or material the students have learned" (as explained in Brown, 1988). In short, it would be helpful if the test were measuring what the newly entering students know and do not know about the syllabus of the particular institution in question.

This study on Japanese as a second language education found differences between newly placed students and continuing students just as Brown (1980) found in his ESL study at UCLA. The phenomenon may not be idiosyncratic to UCLA and ICU, but rather may be a more general trend at many language institutions. This phenomenon is not just an issue of test design, but also involves various topics such as curriculum design, students' background variables, etc. In fact, it is possible that effective testing that fits the curriculum is at the heart of any good language program.


Notes

¹ The Test of Japanese Language Proficiency (Society for Japanese Language Teaching) is administered once a year. The test consists of three subtests: aural comprehension, reading and writing (kanji), as well as grammar and vocabulary.

² The author sought and received permission to use the 1992 and 1993 placement test data from the Japanese Language Program at the International Christian University.

³ In more detail, the Summer course (SCJ '92) students who took the JLPPT in September 1992 had already taken the same placement test at the beginning of SCJ '92 in July but wanted to see if they could skip a level. Students from SCJ '92 who did not wish to skip a level (i.e., wanted to continue to the next level in the Fall term with a passing grade) did not take this placement test.

References

Brown, J. D. (1980). Newly placed students versus continuing students: Comparing proficiency. In J. C. Fisher, M. A. Clarke, and J. Schacter (Eds.), On TESOL '80 building bridges: Research and practice in teaching English as a second language (pp. 111-119). Washington, DC: TESOL.

Brown, J. D. (1988). Understanding research in second language learning. Cambridge: Cambridge University Press.

Hiura, K. (1989). Placement test ni tsuite no houkoku [A report on the placement test]. Bulletin of the ICU Summer Program in Japanese, 6, 136-141.

Ichikawa, Y., & Ogawa, T. (1991). Kishuryoku shindan test no tameno bunpo map no ichishian [Grammatical map for a proficiency test]. Annual Bulletin of Japanese Language Center, Tsukuba University, 7, 155-176.

Ishida, T., Inagaki, S., & Nakamura, T. (1982). Hi kanjiken gakusei oyobi kikoku gakusei betsu nihongo nouryoku shindan houhou no kaihatsu [Development of a proficiency test of the Japanese language for foreign students with non-Chinese background and returnees]. Monbusho kagaku kenkyuuhi kenkyuu houkoku [A study report for the Ministry of Education Scientific Research Fund] #551059.

Kano, C., & Shimizu, Y. (1991). Kanjiryoku no sokutei: Hyoka ni kansuru ichi shian [Assessment of kanji proficiency]. Annual Bulletin of Japanese Language Center, Tsukuba University, 7, 177-192.

Kijima, H. (1988). SPJ 1988 placement test houkoku [SPJ 1988 placement test report]. Bulletin of the ICU Summer Program in Japanese, 5, 109-114.

Saegusa, N. (1986). Development of item analysis and related programs for personal computer. Annual Bulletin of Japanese Language Center, Tsukuba University, 2, 193-200.

Saegusa, R. (1986). Placement test no toukeiteki shori no kokoromi [Statistical analysis of a placement test]. Annual Bulletin of Japanese Language Center, Tsukuba University, 2, 171-192.

Saegusa, R. (1988). Placement test no datousei to kongo no tenbo [Validity of a placement test and outlook for the future]. Annual Bulletin of Japanese Language Center, Tsukuba University, 4, 161-200.

Sakai, T. (1988). Placement test no bogobetsu bunseki [An analysis of a placement test for students with different native languages]. Annual Bulletin of Japanese Language Center, Tsukuba University, 4, 139-160.

Sakai, T. (1990). Placement test: Moji mondai ni kansuru ichi kousatsu [Placement testing: One view on writing]. Annual Bulletin of Japanese Language Center, Tsukuba University, 6, 167-186.

Suzuki, M. (1989). 1989 nen ICU kaki nihongo koza placement test ni kansuru houkoku [A report about the 1989 ICU Summer Program placement test]. Bulletin of the ICU Summer Program in Japanese, 6, 128-136.

Taji, Y. (1987). 1987 nen ICU Summer Program no placement test houkoku: Komoku bunseki no hitsuyousei ni tsuite [A report about the 1987 ICU Summer Program in Japanese placement test: The necessity of item analysis]. Bulletin of the ICU Summer Program in Japanese, 4, 120-121.

Taji, Y. (1988). Placement test komoku bunseki no houkoku [A report on the item analysis of a placement test]. Bulletin of the ICU Summer Program in Japanese, 5, 100-108.


Chapter 7

Evaluating Young EFL Learners: Problems and Solutions

R. MICHAEL BOSTWICK

KATOH GAKUEN

Assessment is one of several critical components of the instructional process because it lets the teacher know if the instruction has been effective and if the intended learning has occurred (Genesee & Upshur, in press). In essence, assessment answers the questions: How are we (students and teachers) doing? And, how can we do better? Unfortunately, far too few schools have systematic evaluations of their students' foreign language proficiency, evaluations that can provide the kind of information needed to improve and adjust classroom instruction. In particular, this lack seems to be especially prevalent in elementary school programs and conversation schools for young children.

Problems

Before selecting tests or designing their own instruments, teachers must be aware of three problems that can subvert assessment efforts in any English language program for children.

First, a lack of any clearly established program goals and objectives makes it very difficult to know what should be tested or if what is being tested is of any real importance (Winograd, 1994). Assessment is only effective and useful to the extent that the test content matches the primary goals of the class or program.

Teachers may have fairly clear goals for an individual lesson but are often more vague about what they would like their students to be able to do six months, or six years, from now. If your only goal was to teach the text and get to the end of the book, then your assessment would be fairly straightforward: just check to see if you got to the end of the textbook.

Unfortunately, there is no guarantee that the children will have acquired greater language proficiency in your march through the book. It is in just such cases that clearer proficiency goals can help you to maintain your focus where it should be: on developing the foreign language proficiency of the children. In short, assessment should always be tied to the intended goals of the language class or program involved because only with clearly understood goals can assessment be effective.

Second, teachers may not know exactly why they are assessing students, what type of information is desired, how they intend to use the information, and who the information is for. All of these questions are closely interrelated and inseparable. The answers to each must be clear in the mind of the teacher if assessment is to be of any value.

Third, a program may lack a clear and systematic plan for gathering information and assessing students' progress towards the school's goals and objectives. Without a plan, assessment becomes fragmented and disorganized, rendering it less effective and useful to all parties involved (Brown, 1992).


Traditionally, most formal English language assessment in Japan has been done to certify the level of knowledge of English and skill development so as to provide interested parties with information for selection (e.g., school entrance tests) or certification (e.g., Jido Eiken and Eiken STEP). The Eiken STEP test (from upper elementary students to adults) and the Jido Eiken test (for primary and upper elementary school students) are very popular in Japan. This is particularly true of the Eiken STEP test because the Japanese Ministry of Education (Mombusho) has authorized the test and given it their stamp of approval. Over 40,000,000 students have taken Eiken STEP in Japan over the last 32 years, a fact which attests to its popularity. The Eiken STEP claims to be a criterion-referenced test in that it specifies proficiency standards and attempts to identify whether the student can pass the pre-established standard. The test has seven different levels, and it is administered three times per year. The test is rewritten each time and a different version is used for each administration.

Unfortunately, the developers of the test do not publish information on the test's validity or reliability. This lack of information makes it difficult for people not in the Monbusho to know whether the test is doing what the publishers claim it can do, namely, identify students at distinct levels of language ability. Because there is no clear explanation of how levels or passing scores were determined, and because no information on the reliability (consistency of the test from one administration to the next) is provided, we cannot be sure that the imaginary line indicating that a student has passed a certain level is not in fact jumping up and down with each different administration of the exam. In other words, it is possible that the same level test may be easier or more difficult from one administration to the next. The Jido Eiken for younger children has many of the same problems as the STEP test. In fact, the Jido STEP may be even more problematic because it is a newer test with less background and development.

The general trend of these tests has been in a positive direction, with greater emphasis on communicative use and oral proficiency.


Still, they are limited in the ways they can inform and influence daily classroom practices. One reason these tests may be of limited usefulness to teachers is that the goals and objectives assessed in these tests may not be aligned with the goals of a particular program. It is also true that, by the time you get the results from these tests, it may be too late to do much about it in your class because the students may have already completed the class or program. The tests described above are examples of summative forms of assessment in that they are generally used to provide a picture of students' overall achievement at the end of a course of study.

This type of summative assessment can be used to answer the following questions:

1. What level of achievement have the students reached?

2. Are the students progressing as well as expected?

3. How effective has the instruction been?

Formative assessment also provides feedback to teachers and students on the effectiveness of instruction and learning, but in a much more diagnostic and prescriptive way (Herman, Aschbacher, & Winters, 1992). Formative types of assessment can render specific information on the progress of the students, what aspects of the instructional process need to be changed or reviewed, and what teachers may need to do next. Hence, formative assessment is generally more useful than summative assessment for improving teaching and learning.

Solutions

Teachers typically carry out formative types of assessment much of the time by monitoring learning and then adjusting instruction based upon informal observations and feedback from the students. However, if more formal and systematic assessment is to be carried out, the issues raised at the beginning of this chapter must be addressed. Only then can assessment serve its intended function.


Overcoming Problem 1: Have Program Goals and Objectives

As stated earlier, many English programs for children may lack clear goals or objectives other than the desire to complete the textbook. What exactly do you want the children to be able to do, know, or be like by the end of the class or program? In other words, what skills, knowledge, and attitudes would you like the students to be able to demonstrate as the result of being in your class? Defining clear goals and objectives helps to insure that what will be assessed is tied to the content of the course. Without clearly knowing where you are going and how you may get there, there is no guarantee that you may actually arrive.

One way to conceptualize this issue is by thinking of a continuum that ranges from goals that are very broad to goals that are extremely narrow (Oller, 1989). An example of a very general goal would be: Students will be able to understand and interpret written and spoken language on a variety of topics. An example of a very specific goal would be: Students can use the indefinite pronoun other in a sentence. The extremes presented here bring to mind a statement Albert Einstein is reported to have hung in his office: "Not everything that counts can be counted and not everything that can be counted counts." This certainly seems to be equally true in language testing as well (i.e., some objectives are important, others are not). The challenge for language programs is to identify those things that count and find a way to count them. Developing program objectives is no easy task because they must be at just the right level so that they identify meaningful, authentic, and important objectives while at the same time being specific enough so that they are observable, useful, and manageable for teachers. Once significant objectives are identified, programs must then ensure that the desired objectives actually drive assessment and instruction. It would make no sense to identify meaningful goals for a program and then continue to use the end-of-unit or end-of-book tests if these tests did not closely match the program goals.


The first step in designing effective assessment procedures is to be sure that you have well-articulated goals and objectives, without which assessment loses its usefulness. It could be argued that one of the main problems with the high school and university entrance examinations in Japan is (a) that there is no clear national consensus on what English language objectives should be and (b) that the assessments currently used by these schools have not been designed or aligned with such goals in mind. What then results is a tail-wagging-the-dog effect in which the assessments, not the goals, drive instruction.

Overcoming Problem 2: Know Who, What, How, and Why

Assessment can serve a number of different purposes and audiences, some of which may include:

1. Providing information to parents and students and addressing issues of accountability

2. Evaluating the strengths and weaknesses of a program and suggesting areas for improvement

3. Identifying the placement of a student in a program

4. Providing diagnostic information about a child's language proficiency and skill development

5. Determining skill or proficiency mastery and certifying promotion or graduation requirements

6. Helping teachers focus their instruction and providing feedback to evaluate instruction

7. Providing students with feedback to help them see what needs to be done to achieve their goal and to encourage them to take more control over their own learning

8. Confirming student learning and enhancing student motivation

Because assessment serves these many possible functions and audiences, it is important to be clear about exactly how the information is to be used and who it is for. Answers to these questions can greatly influence your assessment procedures (Genesee, 1994). It is also important to realize that assessment can be used to improve instruction and help students take control of their own learning.


That is more easily accomplished when assessment is authentic and tied to the instructional goals of the program. Clearly then, assessment can serve purposes and audiences well beyond its traditional functions.

Some forms of assessment lend themselves more to certain kinds of purposes. For example, assessment that is for accountability purposes may be best done using summative forms of assessment (end-of-term tests, standardized tests, exhibitions, etc.). If the purpose is to inform instruction and to make decisions about the learning process, then observational checklists, anecdotal records, conferences, portfolios, etc., may be more helpful (Genishi & Haas Dyson, 1987).

Overcoming Problem 3: Have a Plan

Programs need to have clear and systematic plans for gathering information and assessing students' progress toward the program goals if assessment is to be of any real benefit to the program (Brown, 1992). Without a plan, assessment becomes haphazard and disorganized, making it less effective and useful to all concerned. This is why assessment is integral to the learning process and must be planned before and during instruction. A plan should include: what will be assessed, how it will be assessed, when assessment will take place, and who will do the assessing. It should also include who the information is for, how the information will be used, and how the information will be reported.

Two Examples

There are many possible alternatives to traditional summative forms of assessment, several of which have been mentioned earlier. What follows are two specific examples of possible methods for collecting and documenting student progress. Both examples come from our English immersion program for elementary school children, in which Japanese students study content area subjects in English on a daily basis. The purpose of both instruments is to (a) establish accountability by providing a means to communicate progress to both students and parents, and (b) provide teachers and students with feedback on the effectiveness of the teaching and learning in the classroom.


The content for each instrument is tied to the program goals and objectives as determined by both teachers and school administration.

Attainment Level Standards

English Attainment Level Standards can be designed to articulate overall program goals with specific benchmarks for each level (in our case for each grade) of the program. The general goals run across all grade levels and are broken down into (a) interpersonal, (b) educational, and (c) presentational communication goals. These overall goals are often difficult to measure because they are so global in nature; therefore, benchmarks that anchor these goals to observable behaviors are identified for each level. These benchmarks provide examples of behaviors/abilities one would expect from students at each level. Only two benchmarks for each goal at each level have been listed in the example given. The actual document contains many more benchmarks that could be incorporated into the instructional program.

Teachers and students work toward these specific benchmarks at each level and include time within the class for students to demonstrate proficiency with each of these benchmarks. Creating attainment level standards like the example provided in Figure 1 helps teachers monitor the progress of the students and provides on-going feedback that is integrated with the instructional program.

Figure 1 is an abbreviated version of the English Attainment Level Standards under development for our immersion program. However, even this abbreviated version reflects fairly ambitious goals. To be used in another context, it would need to be adapted to match the goals of that program and the needs of its students. Figure 1 is offered only as a resource to assist others as they begin to think about and develop their own attainment level standards. Schools may also wish to refer to other guidelines available when creating their own benchmarks.


Figure 1. English Attainment Level Standards for Katoh School (abbreviated). The full figure is a matrix of outcomes by level (Pre-1 and Grades 1 through 6), with two observable benchmarks (A and B) listed for each outcome at each level. The three outcome areas are:

1. Interpersonal (using language for non-academic purposes): "Students engage in conversation, provide and obtain information, express feelings and emotions, and exchange personal opinions within informal, social contexts."

2. Educational (using language to learn and negotiate meaning in academic settings): "Students discuss, question, hypothesize, explain solutions to problems, form opinions, make judgments in the second language on topics related to the school curriculum."

3. Presentational (using language to express personal meaning to others): "Students present information, concepts, ideas, emotions, and opinions to an audience of listeners on a variety of scholastic topics."

The benchmarks grow in scope across the levels: at Pre-1, for example, students use greetings and set classroom phrases, answer questions about picture-book stories with teacher guidance, and sing songs or do finger plays; by Grade 6 they exchange opinions on topics of contemporary or personal interest, use reference resources to prepare oral reports, discuss and analyze authentic literature, summarize newspaper or magazine articles for peers of the target culture, and debate a topic, presenting and defending a position logically and persuasively.


In our own case, both the ALL Guidelines (Scarino, Vale, McKay, & Clark, 1988) and the ACTFL (1995) national standards were consulted in developing our own framework. Other schools may also find these resources helpful.

Unit Checklists

The Unit Checklists (sometimes referred to by our students in the past as star sheets because the teacher places a star next to each item mastered) are directly related to the objectives for each 3 to 6 week unit of instruction in the program. Each checklist contains approximately six tasks that students are expected to be able to do by the end of the unit. In a sense, they are mini-exhibitions of student achievement within each unit because the successful exhibition of these tasks demonstrates student mastery of the objectives of the unit.

Having clearly established objectives for each unit (objectives that are written and given to both students and teachers at the beginning of the unit) helps to provide direction to the learning and ensures that assessment, instruction, and objectives are aligned with each other. Because the objectives are achievable within a relatively short period of time, students have a clearer understanding of their responsibilities in the classroom and are motivated by successful accomplishment of each objective. Information gained from these activities provides direct feedback to students as to how they are progressing, i.e., objective evidence that they are learning. However, by far the greatest power of these checklists is that they clarify for the students what is believed by their teachers to be the important objectives of each unit.

Figure 2 is an example of a primary level unit on animals with six tasks. The tasks act as benchmarks of student achievement and assist teachers and students in documenting mastery of course objectives in an on-going formative manner.

Figure 2. A primary level unit on animals with six tasks

The Living World of Animals    Date: ____

1. Use five classroom phrases with your friends. (Copy three squares) (interpersonal)

2. Describe your favorite animal. See if other students can guess the name of the animal. Say why you like the animal. Answer questions from your friends about your animal. (Copy three squares) (interpersonal)

3. Retell an animal story that your teacher read. (Copy three squares) (educational)

4. Make an addition problem using animals. (Copy three squares) (educational)

5. Sing "Old MacDonald" and act it out with the class. (Copy three squares) (presentational)

6. Draw a picture of a wild animal and show it to the class. Say where it lives and what it eats. (Copy the rest of the squares.) (presentational)

"Copy Cat"    Name: ____

,")' fr .-------

,-...... \

.('(07-Th,s,

-------..._.?,-,'.---\

'''.

g

62 JALT APPLIED MATERIALS

6 9BEST COPY AVAILABLE

Page 70: FL 024 200 AUTHOR Ed.Brown, James Dean, Ed.; Yamashita, Sayoko Okada, Ed. Language Testing in Japan. Japan Association for Language Teaching, Tokyo. ISBN-4-9900370-1-6 95 193p. JALT

Although tasks normally listed at the unit level may be more detailed than many of the benchmarks identified in the Attainment Level Standards mentioned above, some of the tasks (benchmarks) on any given star sheet may actually come from this document.

In the example in Figure 2, each task is tied to the overall language goals of the program (interpersonal communication goal, educational communication goal, and presentational communication goal, as shown in Figure 1). Because immersion teachers are also responsible for teaching the content areas in the elementary school curriculum, the tasks reflect academic goals as well. The unit checklist provides two tasks for each of the three overall language goals of the program. Teachers check off students as they demonstrate their ability to actually do the task successfully. These records are kept and can later be shared with parents.

Conclusion

I have argued that assessment should ultimately be tied to the goals and objectives of the program and the needs of the students. Assessment should also reflect the purpose you have for collecting the information, and it should be done in a systematic way. In addition, assessment can serve several functions. It can inform teachers about the extent to which their students have learned what they set out to teach. But perhaps at an even more important level, assessment can be an integral part of the instructional process. Since effective assessment enables teachers to know how successful their instruction has been, the information collected should be used to inform and influence on-going classroom practice.

In short, assessment is an essential part of the learning process and not something added on at the end of a series of lessons (Brown, 1995). The two examples provided here demonstrate how assessment that (a) is aligned with the program goals, (b) is clear about its function and audience, and (c) has a systematic plan for gathering and reporting the information, can provide teachers and program managers with the kinds of information needed to improve instructional practice in their classrooms.

References

Brown, J. D. (1992). Testing in language programs. Tokyo: Temple University Japan.

Brown, J. D. (1995). The elements of language curriculum. Boston, MA: Heinle & Heinle.

Eiken, STEP. (1995). Eiken, STEP. Tokyo: Nihon Eigo Kentei Kyokai.

Genesee, F. (Ed.). (1994). Educating second language children. Cambridge: Cambridge University Press.

Genesee, F., & Upshur, J. (in press). Alternatives in second language assessment. Cambridge: Cambridge University Press.

Genishi, C., & Haas Dyson, A. (1987). Language assessment in the early years. Norwood, MA: Ablex.

Herman, J., Aschbacher, P., & Winters, L. (1992). A practical guide to alternative assessment. Alexandria, VA: Association for Supervision and Curriculum Development.

Jido Eiken. (1995). Jido Eiken. Tokyo: Nihon Jido Eigo Shinko Kyokai (JAPEC).

ACTFL. (1995). National standards in foreign language education. Washington, DC: American Council on the Teaching of Foreign Languages.

Oller, J. W., Jr. (1989). Testing and elementary school foreign language programs. In K. Muller (Ed.), Languages in elementary schools. New York: The American Forum.

Scarino, A., Vale, D., McKay, P., & Clark, J. (1988). ALL guidelines. Melbourne, Australia: Curriculum Corporation.

Winograd, P. (1994). Developing alternative assessments: Six problems worth solving. The Reading Teacher, 47, 420-423.


Section III

Standardized Testing


Chapter 8

Good and Bad Uses of TOEIC by Japanese Companies¹

MARSHALL CHILDS

FUJI PHOENIX COLLEGE

DIRECTIONS: This is a quiz. Read the following passage and answer the questions.

Passage

You are the Education Director of a Japanese company. You are in charge of the first-year training of a group of 113 new employees, recruited from college. Their average TOEIC score upon entry into your company was 269, thanks to the Japanese education system. Your program gave them 53 hours of classroom English during the course of one year. You have measured their English-language proficiency gains by means of TOEIC tests administered before and after the English instruction.

Now it is time for you to look at the TOEIC results, draw conclusions, and take action. You review what you know about the uses of TOEIC. For each of the following uses of TOEIC, write either Good, Bad, or Good Under Some Circumstances (GUSC).

Questions

1. Measuring overall group gains in proficiency.
2. Comparing the performance of different schools or treatments.
3. Gauging the progress of individual learners.
4. Counseling learners on their progress.
5. Guiding the courses of study of individual learners.

Readers may be pardoned if they do not know the answers to all these questions. Most company education directors do not know the answers, either. This chapter will give the answers in the course of discussing the issues, but for readers who require immediate resolution of ambiguity, an answer key is offered in the Conclusion section of this chapter.

The Test of English for International Communication (TOEIC) is both a boon and a bane for company education directors. It is a boon in that it is widely available, internationally understood, and relatively cheap to administer. It is a bane in that TOEIC is too easy to use for purposes that it should not serve, a circumstance that leads to ineffectiveness and in some cases harmful misunderstandings. One problem is a general lack of understanding of the difference between norm-referenced tests and criterion-referenced tests (see, for instance, Bachman, 1990; Brown, 1995; or Chapter 1 of this book). TOEIC is a norm-referenced test, which is to say that its chief virtue lies in its ability to spread the scores of test-takers into a bell-shaped curve of all test-takers. TOEIC is not a criterion-referenced test, which is to say that it is not designed to measure progress toward mastery of particular instructional objectives in the way that, for instance, final examinations measure the degree to which learners have mastered the subject matter of formal instruction.

In an ideal world, each type of test (norm-referenced and criterion-referenced) would be used in its own sphere, without overlap or confusion. In fact, however, tests are sometimes used inappropriately. In particular, the use of TOEIC in Japan often seems to be an attempt to combine the functions of norm-referenced and criterion-referenced tests. Although TOEIC is basically a norm-referenced test, the sellers of TOEIC themselves encourage users to regard the test as a measure of progress of English-language students (see, for example, applications described approvingly in the quarterly Japanese-language publication, TOEIC Newsletter). As a result of the above factors, many companies, governments, and educational institutions use TOEIC to measure both English ability level and language gains due to teaching/learning.

The Present Study

This chapter describes in greater detail the situation presented in the short quiz at the beginning, in which a Japanese company requires recently hired employees to take a series of TOEICs, not only to measure their overall English skill but also to measure their learning progress in groups and as individuals, and to guide their individual courses of study. Because the company employs different language schools for teaching, it also compares the schools on the basis of the TOEIC gains that occur under their tutelage. The purpose of this chapter is to examine the uses of TOEIC by this company in an effort to judge what are good and bad uses of the test results.


Questions of interest include the following: Is TOEIC a good measure of the average gain for a group of learners? Is an apparent difference between the average gains of two subgroups significant and meaningful? In a series of four TOEIC tests in which the final test does not show the highest average score, how is learning gain to be measured? How are we to understand an overall decline in mean scores after training? Is it possible to measure and interpret the standard error of measurement of the learners in this study? How can we understand the levels of individual learners when, in a period of generally increasing English proficiency, ups and downs introduce uncertainty into the interpretation of test scores? Given this uncertainty, how can we assign learners to ability groups for further English-language training and for job selection? How shall we counsel individual learners on their apparent progress or lack of it?

Method

Subjects

Subjects consisted of 113 recent college graduates (110 men and three women) who joined a Japanese manufacturing company as new employees in April 1992. All were native speakers of Japanese. Precise age data are not available, but it is certain that most subjects were 22, 23, or 24 years old during the course of the study. During the period of observation, English-language training was not the primary goal of the subjects. They were engaged in a grueling one-year initiation to the company that included many practical experiences as well as classes on various subjects. Outside the English classes, they rarely used English on the job.

The mean TOEIC starting score of this group of learners was 269. This mean starting score is approximately 100 points lower than the average scores of all recent college graduates in Japan according to the Vice-Chairman of IIBC, the official sponsor of TOEIC in Japan (Kitaoka, 1992). The difference may be due, in greater or lesser part, to three causes: (a) company hiring practices, (b) the fact that test-takers were mostly men, and (c) the fact that the company tends to assign beginning employees who achieve unusually high scores to a different course of English training, more advanced than the courses described here. A skewness statistic was applied in order to judge whether the initial scores of these employees formed a normal distribution. They did.

Materials

Testing of English-language ability was done by means of the TOEIC, a two-hour test of EFL listening and reading (TOEIC Steering Committee, 1991). This test was originally designed to:

1. spread out the scale of the TOEFL test, particularly in the lower half of that scale,
2. provide "highly valid and reliable measures of real-life reading and listening skills" as needed in business and government (Woodford, 1980, p. 5), and
3. offer means of score interpretation "that would allow score recipients to actually see the kind of English that examinees at different levels could read" (Woodford, 1980, p. 5).

Procedures

Learners were given a total of 53 hours of formal English teaching, consisting of (a) five-day (37-hour) intensive classes conducted in groups of 10 to 12 students during September and October 1992, and (b) four monthly half-day (4-hour) English classes conducted in November and December 1992, as well as January and February 1993. All 113 learners received the treatment; there was no control group. Sixty-five of the subjects began their training in Tokyo and were transferred to Osaka in late December. Forty-eight began in Osaka and transferred to Tokyo in late November.

The TOEIC was administered on four occasions: at the time of hiring in April 1992; at the end of the five-day intensive English courses in September and October 1992; and at the end of the second and fourth monthly half-day classes, i.e., in December 1992 and February 1993. These administrations will be referred to below as test 1, test 2, test 3, and test 4. Tests were administered by TOEIC-accredited test centers in Tokyo and Osaka.

Results

Group Means

Figure 1 shows the overall shape of the mean TOEIC results for all 113 learners and for the two subgroups who changed places between Osaka and Tokyo during the program. The major point is that overall scores increased. It is notable, however, that the highest average scores were recorded in test 2, at the end of the one-week intensive program for all groups.

Figure 1. Mean TOEIC test results for 113 learners (total group and two subgroups that finished in Osaka and Tokyo, respectively). [Line graph of mean TOEIC score (260-360) against the test sequence, test 1 through test 4. Key: dot = total group average; + = 48 people who started in Osaka and finished in Tokyo; * = 65 people who started in Tokyo and finished in Osaka.]


Also, the subgroup that began in Tokyo and finished in Osaka appears to have outperformed the overall mean, beginning below the mean and finishing above it.

Overall mean scores increased from 269 to 341 from test 1 to test 2. Overall mean scores for tests 3 and 4 were 322 and 326, respectively. A repeated-measures one-way ANOVA showed a significant difference among these means (F = 61.12; df = 3, 336; p < .05). Follow-up Scheffé analyses at p < .05 showed that the means of tests 2, 3, and 4 were all significantly different from the mean of test 1, and also that the mean of tests 3 and 4 taken together was significantly different from the mean of test 2, but that the means of tests 3 and 4 taken individually were not significantly different from test 2 or from each other. To say that a relationship is "statistically significant" means that the relationship is probably not due to chance alone (Brown, 1988, p. 122).
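For readers who want to see where the F ratio and its degrees of freedom come from, the repeated-measures computation can be sketched in a few lines of Python. The 113 x 4 score matrix itself is not published in this chapter, so the sketch below simulates scores with roughly the reported test means; the F value it prints is illustrative only and will not reproduce the 61.12 reported above.

    import numpy as np

    # Simulated stand-in for the unpublished 113 x 4 TOEIC score matrix.
    rng = np.random.default_rng(1)
    n_learners, n_tests = 113, 4
    test_means = np.array([269.0, 341.0, 322.0, 326.0])   # reported overall means
    scores = test_means + rng.normal(0, 60, size=(n_learners, n_tests))

    # Repeated-measures one-way ANOVA bookkeeping.
    grand = scores.mean()
    ss_subjects = n_tests * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_tests = n_learners * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_error = ((scores - grand) ** 2).sum() - ss_subjects - ss_tests

    df_tests = n_tests - 1                          # 3
    df_error = (n_tests - 1) * (n_learners - 1)     # 336
    f_ratio = (ss_tests / df_tests) / (ss_error / df_error)
    print(f"F({df_tests}, {df_error}) = {f_ratio:.2f}")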

Figure 1 shows that the Osaka and Tokyo group means follow the general pattern of the overall means. The group that began in Tokyo and finished in Osaka started with a mean lower than the overall mean (266 compared to 269) and finished with a mean higher than the overall mean (328 compared to 326). The group that began in Osaka and finished in Tokyo began higher (273 compared to 269) and finished lower (323 compared to 326). These differences were not found to be statistically significant.

Figure 2. Scores for learner #57: Actual scores, calculated proportional scores, and the differences between them. [Line graph of TOEIC scores against the test sequence, overlaid on the group averages of Figure 1; learner #57's proportional scores run about 35 points below the group average.]

Reliability

To assess the reliability of TOEIC in this setting, three procedures were used: an internal consistency reliability estimate, a standard error of measurement estimate, and an examination of the differences between actual scores and calculated proportional scores. To assess internal consistency reliability, the Kuder-Richardson formula 21 (K-R21) was applied to the scores for all four tests, using raw scores approximated from the standardized scores by reversing the procedure of score conversion (ETS, 1980).² Approximated raw scores ranged from 53 to 110 (out of a possible 200). The resulting K-R21 reliability estimate was .57. This means that 57% of the variance of test scores was consistent on the test, and 43% was from other sources. If it is in error, K-R21 tends to underestimate internal consistency reliability (Guilford & Fruchter, 1978, p. 429). Nonetheless, .57 is not high. Accordingly, an examination of the standard error of measurement will be of interest.
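Since K-R21 needs only the number of items and the mean and variance of the raw total scores, the calculation is easy to sketch. The raw-score statistics for this group are not reported, so the values fed in below are invented for illustration and will not reproduce the .57 figure.

    import statistics

    def kr21(n_items, raw_scores):
        """Kuder-Richardson formula 21: (k / (k - 1)) * (1 - M * (k - M) / (k * V))."""
        m = statistics.mean(raw_scores)
        v = statistics.pvariance(raw_scores)  # the chapter does not say whether a population or sample variance was used
        return (n_items / (n_items - 1)) * (1 - (m * (n_items - m)) / (n_items * v))

    # Invented raw totals in the reported 53-110 range, out of 200 items.
    print(round(kr21(200, [53, 61, 70, 78, 84, 92, 101, 110]), 2))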


The standard error of measurement (SEM) is a gauge of the extent to which an individual's actual score varies from his or her "true score" (the "true score" is defined as the score that a learner would achieve if he took the test a very large number of times without learning anything). With the present data, the SEM was calculated by the formula SEM = SD × √(1 − R), where SEM is the standard error of measurement, SD is the standard deviation, and R is internal consistency reliability as measured by K-R21. Based on this formula, the SEM for the students in this study turned out to be seven test questions, or 43 points in terms of the converted TOEIC score. This SEM statistic can be interpreted to mean that there is a 68 percent chance that a given test score is within 43 points of the learner's "true" score. It follows, of course, that there is a complementary 32 percent chance that a given test score differs from the true score by more than 43 points. Some consequences of this SEM will be discussed below.
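The 43-point figure follows directly from the formula and from the 6.25-points-per-item conversion given in note 2. The raw-score standard deviation is not reported, so the value of about 10.5 items used below is back-calculated from the reported SEM and is an assumption, not a figure from the study.

    import math

    def sem(sd, reliability):
        """Standard error of measurement: SEM = SD * sqrt(1 - reliability)."""
        return sd * math.sqrt(1 - reliability)

    sem_items = sem(10.5, 0.57)      # assumed raw-score SD of about 10.5 items
    print(round(sem_items, 1))       # about 7 items
    print(round(sem_items * 6.25))   # about 43 points on the converted TOEIC scale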

The fact that these learners took four tests in sequence permits us to gain some idea of the consequences of measurement error. If there were no measurement error, we would expect each learner to maintain his or her proportional distance from the group average on each of the four tests.

Figure 3. Differences between actual and calculated proportional scores for 452 TOEIC scores (grouped data). [Histogram: number of cases plotted against the difference of the actual score from the calculated proportional score, roughly -100 to +100 points.]


The learner with ID #57, for instance, averaged 35 points below the group means. Recall that the overall group means for the four tests were 269, 341, 322, and 326. Accordingly, we might expect that Mr. 57 would have scored 35 points below the mean in each case, chalking up scores of 234, 306, 287, and 291 (see Figure 2). Of course his actual scores were not so beautifully proportional; instead, they were 195, 245, 420, and 260 (Figure 2). They differed from the calculated proportional scores by, respectively, -40, -62, +133, and -31 points.
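The proportional-score bookkeeping is simple enough to show directly. The group means and learner #57's actual scores below come straight from the chapter; the offset is computed rather than fixed at exactly -35, which is why the resulting differences can disagree by a point or so with the rounded figures quoted above.

    group_means = [269, 341, 322, 326]   # overall means for tests 1-4
    actual_57 = [195, 245, 420, 260]     # learner #57's reported scores

    # Learner #57's average distance from the group mean across the four tests.
    offset = sum(a - m for a, m in zip(actual_57, group_means)) / len(group_means)

    proportional = [m + offset for m in group_means]                # expected scores if there were no measurement error
    differences = [a - p for a, p in zip(actual_57, proportional)]  # actual minus proportional
    print(round(offset, 1), [round(p, 1) for p in proportional], [round(d, 1) for d in differences])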

Such differences between actual observed scores and proportional scores were calculated for each of the 113 learners in the present study. A graph of grouped data is shown in Figure 3. Figure 3 is, in effect, a graph of the distribution of the standard error of measurement (SEM). It will be seen that the graph resembles a normal curve. The distribution may be further described as follows: differences within plus or minus 35 points make up 67.9% of the total, differences within plus or minus 70 points make up 95.4% of the total, and differences within plus or minus 105 points make up 98.7% of the total. Thus the distribution of differences from proportional scores is similar to that of a normal distribution with a standard deviation somewhat greater than 35. This distribution of differences is uniform in the sense that it is the same for learners at all levels: for the top third, the middle third, and the bottom third of the group of 113 learners, the distributions of differences all show the same form.

Using the present data to examine patterns of ups and downs, learner experiences were classified in terms of sequences of score increases and decreases for tests 1, 3, and 4. Test 2 results were omitted from this analysis because they appear to include an uncharacteristic "polish" effect (see the third part of the Discussion section) not found in the other three sets of test results. Table 1 shows the number of learners who experienced each of the four possible sequences of gain and no gain in the transitions from test 1 to test 3 and from test 3 to test 4. In a period when the mean score increased by 56 points, only 35 learners (31%) experienced two successive gains. Seventy-five learners (66%) had mixed experiences and three learners (3%) had two successive losses.
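Tallying the four gain / no-gain patterns of Table 1 is a matter of two comparisons per learner. The individual score records are not published, so the sketch below simply shows the classification rule applied to a couple of made-up sequences.

    def gain_pattern(test1, test3, test4):
        """Label a learner's test 1 -> test 3 -> test 4 sequence as in Table 1."""
        first = "Gain" if test3 > test1 else "No gain"
        second = "Gain" if test4 > test3 else "No gain"
        return first, second

    # Made-up score sequences, purely to exercise the rule.
    print(gain_pattern(250, 310, 330))   # ('Gain', 'Gain')
    print(gain_pattern(280, 340, 320))   # ('Gain', 'No gain')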

Discussion

What lessons for TOEIC users can be gleaned from the results of this analysis of the scores of 113 learners? We have seen that some group mean results are statistically significant and some are not. Using a standard internal consistency reliability estimate, TOEIC does not appear to be very reliable, at least for this group of students. And we have seen that the standard error of measurement is large for this group of students. Some further discussion of each of these areas, along with some related issues, is warranted. The following discussion will cover six topics: (a) measuring overall group gains, (b) facing ambiguity about the causes of gains, (c) assessing temporary versus long-term gains, (d) using test results to evaluate different teaching treatments, (e) gauging the learning of individuals and basing action upon it, and (f) counseling and guiding individual learners in the face of ups and downs in test scores.

Measuring Overall Group Gains

In the opening section of this chapter, the question was raised: "Is TOEIC a good measure of the average gain for a group of learners?"

Table 1. Patterns of gain and no gain for 113 test-takers in a sequence of three TOEIC tests

Test 1 to Test 3    Test 3 to Test 4    Number of Learners
Gain                Gain                 35
Gain                No gain              58
No gain             Gain                 17
No gain             No gain               3

Total number of learners: 113


Based on the present data, the answer is that, with appropriate statistical tools, significant gains can be discerned in the means of large groups. The data on 113 Japanese learners from the company in this study showed a significant gain, in the sense that for the overall group of 113 learners, tests 2, 3, and 4 were significantly different from test 1. One caution should be observed, however: the causes of the gain have not been demonstrated. The gain, or part of it, may well be due to formal classroom teaching, but without further information, alternative explanations of the improved performance cannot be excluded. This is a serious handicap in interpreting the results of this training and testing program. Some alternative explanations for the learning gain are explored in the following section.

Causes of Score Gains

In the present study, it is difficult to separate the effect of teaching from other possible causes of score gains. Four major causes may be at work: formal English teaching, differences in motivation, English in the environment, and practice effect. Each of these possible causes will be considered in turn.

The formal teaching of English. Certainly this is the first cause to look at. Formal teaching seems unlikely, however, to account for the whole gain. There were only 53 hours of formal teaching, and a gain of 56 points, or about one TOEIC point per hour of teaching. According to Saegusa (1983; 1985), in the TOEIC range of around 730 a gain of one TOEIC point would require about 2.5 hours of teaching. The ETS (1990) observes that the difficulty of achieving a TOEIC score increases out of proportion with the level of the score:

. . . it is much easier for an employee to improve a score at the lower end of the scale than at the upper end. The employee with a base score of 240 can improve 150 points in a much shorter time than, for example, the individual whose score is 710. (p. 6)


Although the 113 learners are generally in the TOEIC range of 250 to 400, a teaching efficiency of one TOEIC point per teaching hour would be a difference of 2.5 times from the range of the low 700s. A difference of this magnitude from one part of the scale to another seems rather large.³ It is probable that the observed learning gain of 56 points is due to a combination of effects, of which teaching is only one.

Differences in motivation levels. The lower results on test 1, compared to the later tests, may be due in part to a lack of focused motivation before the learners had actually joined the company where they now plan to do their life's work. In later months the learners may have become more motivated. Hence, there may be a noticeable difference in motivation between the first and subsequent tests.

Heeding English in the environment. For a 22-year-old college graduate, living in a major Japanese city for one year may by itself enhance proficiency in English. Working for a major company, watching television, listening to the radio, and reading advertisements in cosmopolitan Japan all involve exposure to English, and young professional people might glean some knowledge of that language from their environment.

Practice effect. It seems likely that a learner who takes the TOEIC several times will become more proficient at taking the TOEIC. For learners who learn greater test-taking skills, some portion of any score increases may be due to this practice effect, rather than to any increased proficiency in English.

Temporary versus Long-Term Gains (the Polish Effect)

The fact that the highest scores were achieved in the second test, rather than the third or fourth, deserves comment. The second test was administered at the end of a one-week full-time course in English. After that, the learners returned to their predominantly Japanese environments and experienced only one four-hour English class per month. Thus the high scores for the second test may be attributed to the active practice that immediately preceded that test administration. This effect, which is longer than short-term memory but which nevertheless degrades over time, will be called the "polish" effect here, in reference to the fact that at the end of an intensive week, the learners' English was highly polished. When active participation in English became less frequent, the polish gradually lost its luster, leaving only the underlying longer-term gains to be reflected in the results for tests 3 and 4.

The mean scores for tests 3 and 4 are still significantly higher than the starting results (test 1) but indicate retention of only about two-thirds of the polished gains shown in test 2. This fraction (two-thirds) may be a good rule of thumb for estimating the amount of learning that is retained after the polish of an intensive class fades. Further research may be rewarding, however, not only to check the magnitude of the polish effect under different conditions, but also to explore the effectiveness of different approaches to scheduling English classes. It is possible, for instance, that even after subtracting the loss due to the polish effect, teaching English intensively in large blocks will prove to be more effective than presenting it in small, diluted doses.

Comparing Teaching Treatments

The company hires commercial schools to teach English to its employees, and uses mean TOEIC gains of learners taught by the different schools to compare the schools' performances. Such a comparison is possible, with due caution. In the present data, we have seen that, using ANOVA statistical procedures, mean score differences of about 50 points for large groups of learners can be statistically significant when means and standard deviations are similar to those found here. However, in the present data, the differences in the means of the Osaka and Tokyo subgroups were not statistically significant. This may be taken as an indication of the need for caution (and for proper statistical procedures) in using TOEIC to assess differences in the performances of subgroups.

If the different gains of the Tokyo and Osaka subgroups are not statistically significant, how should we think of them? The differences in the means may still be of interest not only to education directors but also to school administrators as early warnings. Thus, the director of the Tokyo school might look upon these results as an occasion for re-examining the school's practices in an effort to maximize the school's teaching effectiveness.

In view of the discussion in the previous sections, two primary cautions should be observed in comparing any mean gains for subgroups. First, causes are almost always ambiguous, and should be considered as potential sources of doubt for any conclusions. Even if we can identify differences between two groups of learners, we cannot know how much of any reported gain is due to teaching. Second, as ETS (1990) says, it is more difficult for higher-level learners than for lower-level learners to gain a given amount. Thus, if the learners' beginning levels are different, and particularly if assignment to schools is based on level of proficiency, differences in TOEIC score gains should be interpreted very carefully. If it is really true, for instance, that for students in the 700 range it takes 2.5 times as many teaching hours as in the 240 range to raise a TOEIC score by one point, then a one-for-one comparison of mean score gains would be inappropriate. With these caveats in mind, administrators using proper statistical procedures can examine the differences in mean scores of learners in different schools or of learners given different treatments in in-house education programs.

Gauging Individual Learning

To summarize the findings on reliability, we have seen that the SEM was 43 points for the subjects in this study, and we have also seen (Figure 2) that in the sequence of tests these learners' TOEIC scores fell within 35 points of their calculated proportional scores only about two-thirds of the time. In addition, we have seen (Table 1) that, in a group whose overall mean score was increasing significantly, two-thirds of learners saw no gain in their scores in one or the other of two comparisons.

In light of these results, TOEIC seems a poor instrument for gauging the short-term learning gains of individuals like those in this study. Hence in a situation like ours, TOEIC should not be used to assess the progress of individuals in programs with relatively few teaching hours, or should be used only with extreme care. For gauging the effects on the individual of a teaching program of considerable length (perhaps in the range of 200 classroom hours), or for measuring very long-term growth of English ability (measured in years rather than months), TOEIC may be used with caution as a very general measure of change in proficiency. The caution to be borne in mind is, of course, that scores are subject to a standard error of measurement that may be in the range of the apparent proficiency gain.

The fact is simply that TOEIC, as a norm-referenced test, is not the best gauge of individual learning. Instead, criterion-referenced tests, which are specifically designed to measure individual learning, should be used. A well-designed criterion-referenced test measures the learner's degree of mastery of the specific material being taught. When applied as a pretest and a posttest, a criterion-referenced test can be a very reliable measure of learning gain. See Chapters 1, 2, and 3 for information on creating and using criterion-referenced tests.

Counseling and Guiding Learners

As Table 1 illustrates, in a sequence of four tests, individual learners can expect to see some inexplicable, and sometimes very disappointing, differences in their scores. The ups and downs make it difficult also for administrators to use TOEIC scores for counseling individuals on their progress and for prescribing specific instructional levels or experiences.

How should learners be counseled about the vicissitudes of their TOEIC scores? Probably the first message to give them is that time and chance happen to them all: that jumping around is in the nature of TOEIC scores. Showing learners Figure 2 or a similar diagram should help them understand the probabilistic relation between TOEIC scores and underlying proficiency. A second message is that TOEIC scores will very likely reflect large differences in learning; TOEIC score gains are unreliable only for differences that run afoul of the standard error of measurement. A third message is that TOEIC is not designed to provide information on specific strengths and weaknesses or on specific language topics that the learner should address.

For these reasons, TOEIC should not be used except in a very general way for placement decisions, selection of teaching treatments, and counseling of learners. Criterion-referenced tests should be used for other purposes like diagnostic testing, assessing achievement, or measuring learning gains. Program administrators should realize, too, that the learners share the responsibility for identifying strategic targets for their own learning. When that responsibility is taken seriously, the learner will seek out activities that lead to unmistakable improvements in TOEIC scores. The wise administrator will tap the energy and knowledge of the learners in selecting learning experiences. Together, the administrators and teachers with their general knowledge and the learners with their specific knowledge can set a better course than either group can do alone.

There is a positive aspect to the variability of TOEIC scores. Students may be counseled that if they take the test several times, they can expect that by chance alone they will achieve a score that is higher than their true score. This is a positive motivation for learners, for the TOEIC score that is kept on record is the highest score, not an average of scores. A sporting appreciation of the odds may encourage learners, who face repeated TOEIC tests, to attack the tests with vigor.

Conclusions

We are now in a position to answer the questions in the quiz at the beginning of this chapter. How good is TOEIC for the following uses?

1. Measuring overall group gains in proficiency: Good under some circumstances. We have seen that TOEIC can be used effectively to differentiate some group mean gains in a group of 113 learners, with the caveats that careful measurement of statistical significance is necessary in order to distinguish real gains from illusory ones, and that, even if the significance of gains is established, the causes of gains may remain problematic. In addition, the administrator must consider to what extent any differences measured by TOEIC are related to what has been taught.

2. Comparing the performance of different schools or treatments: Good under some circumstances. The cautions of answer 1 apply here. In addition, the administrator must be aware that the difficulty of raising a TOEIC score is considerably greater at the upper end of the scale than at the lower end. Therefore, TOEIC should be used with care for measuring the effectiveness of different teaching approaches.

3. Gauging the progress of individual learners: Bad. The use of TOEIC for gauging individual learning is, in general, inefficient or wrong. We have seen that, in a teaching program that totaled 53 hours, the variability of TOEIC results defeated their usefulness in measuring learning gains because the SEM of TOEIC was in the range of expected individual gains.

4. Counseling learners on their progress: Bad. Because of the SEM of TOEIC, test-to-test differences will display very great variability. For example, differences may be negative, or they may be very large, and somewhat illusory in both cases. Indeed, the lower results that are frequently encountered in successive tests can have the unfortunate side effect of demotivating learners.

5. Guiding the courses of study of individual learners: Bad. TOEIC is not a diagnostic test, and it cannot pinpoint learners' strengths and weaknesses. It can be a rough guide for gauging a learner's overall level, if the administrator clearly understands the statistical variability of the results, but TOEIC cannot help the administrator determine what a specific learner needs to be taught.


No matter how loudly we proclaim that TOEIC should not be used for purposes for which it is not suited, proclaiming alone is not a long-term solution. Having no practical alternative, education directors will continue to use TOEIC because it is familiar and, if inefficient, at least a known evil. Using TOEIC may be better, for instance, than allowing language schools and internal education departments to devise their own tests, for that, to paraphrase Saegusa (1983), would be putting the fox in charge of the hen house. Nevertheless, company education directors and language schools should be warned that short-term TOEIC results cannot be substituted for more specific measures of learning achievement. Test users await a series of criterion-referenced tests complementary to the norm-referenced TOEIC. The new series of tests would consist of achievement tests that can be integrated with curriculum and can indicate mastery of or need for specific types of learning modules.

Because of these conclusions, company education directors who incorporate TOEIC into their testing programs should do so thoughtfully. They should understand that the long-term solution to many of their needs will be not TOEIC but a series of tests that are in tune with the specific goals and methods of their English education programs. TOEIC should be used in its area of strength, which is to find the approximate location of learners on the global bell curve of English proficiency.


Notes

1. I gratefully acknowledge the assistance of J. D. Brown, Atsushi Kodera, and Rory Rosszell in helping me achieve what understanding I have of the use of TOEIC tests by Japanese companies.
2. The approximation used for relating TOEIC standard scores to raw scores was: Standard Score = (6.25 x Raw Score) - 225.
3. Some experiences reported by Kitaoka (1992) suggest an even greater level of difficulty in the 700-800 range: about 3.6 teaching hours to raise a TOEIC score one point.
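As a small illustration of how note 2's approximation was reversed to estimate raw scores, the snippet below converts three of learner #57's reported standard scores back to item counts; these particular inputs are chosen only as examples.

    def to_raw(standard_score):
        """Invert the note-2 approximation: Standard Score = (6.25 x Raw Score) - 225."""
        return (standard_score + 225) / 6.25

    print([round(to_raw(s), 1) for s in (195, 245, 420)])   # roughly 67, 75, and 103 items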

References

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Brown, J. D. (1988). Understanding research in second language learning. Cambridge: Cambridge University Press.

Brown, J. D. (1995). Testing in language programs. Englewood Cliffs, NJ: Prentice-Hall.

ETS (Educational Testing Service). (1980). Score conversion tables, Form 3BIC, in Test of English for International Communication (p. 153). Princeton, NJ: Educational Testing Service.

ETS (Educational Testing Service). (1990). Guide for TOEIC users. Princeton, NJ: Educational Testing Service.

Guilford, J. P., & Fruchter, B. (1978). Fundamental statistics in psychology and education (6th ed.). New York: McGraw-Hill.

Kitaoka, Y. (1992, July 30). How American firms in Japan should use TOEIC (Talk before the American Chamber of Commerce in Japan).

Saegusa, Y. (1983). TOEIC and in-company English training. Cross Currents, 10(1), 71-89.

Saegusa, Y. (1985). Prediction of English proficiency progress. Musashino English and American Literature, Vol. 18. Tokyo: Musashino Women's University.

TOEIC Newsletter (quarterly). Tokyo: Institute for International Business Communication.

TOEIC Steering Committee. (1991). Test of English for international communication (TOEIC): History & status. Princeton, NJ: Educational Testing Service.

Woodford, P. E. (1980). The test of English for international communication (TOEIC). Paper presented at English-Speaking Union Conference, London, England.


Chapter 9

A Comparison of TOEFL and TOEIC

SUSAN GILFERT

AICHI PREFECTURAL UNIVERSITY

What are these two tests, the Test of English as a Foreign Language (TOEFL) and the Test of English for International Communication (TOEIC)? What do they test? Are they useful? How are they used? What do the scores mean? Are the scores comparable in any way? These and other questions will be answered in this paper. Knowledge of the background of each test will help users understand the test, so a short history of each test will be provided, and a general structural comparison is made; then examples of question types will be given, subsection by subsection. A brief discussion will also be provided of how results of the exams are used or misused.

Comparative History of TOEFL & TOEIC

The TOEFL was generated by American academic needs in the early 1960s. American universities had many applications from students whose native language was not English, and the university administrators needed to know what to do with these students. ESL programs were ongoing everywhere, but without much cohesion. University administrators were familiar with the reputation and commercial value of the Educational Testing Service (ETS) tests because the Scholastic Achievement Tests (SATs) produced by ETS were being used successfully by university admissions offices. Universities requested ETS to create a general test of English "to evaluate the proficiency of people whose native language is not English" (ETS, 1993), specifically North American English. Today, it is used primarily in U.S. universities for admissions decisions, and sometimes misused for placement testing purposes in ESL settings. Most of the examinees are in their mid-teens to mid-twenties and are high school or university students.

The TOEIC grew out of requests from the Japanese Ministry of International Trade and Industry in the middle 1970s. It is "designed to measure the English-language listening comprehension (LC) and reading (R) skills of individuals whose native language is not English. The TOEIC is used primarily by corporate clients, worldwide" (Wilson, 1989). Most examinees are in their mid-twenties to late forties and work for a corporation. From its beginning nearly 20 years ago, the use of TOEIC has spread from Japan throughout Asia, and it is becoming more frequently used throughout Europe and South America.

Structure Comparison

The TOEFL is a well-known multiple-choice instrument designed to measure an examinee's receptive English skills and is considered a reasonably good predictor of the examinee's productive language skills. The general register of the TOEFL is academic English. The TOEIC is a lesser-known multiple-choice instrument designed to measure an examinee's receptive English skills, and is increasingly considered a reasonable predictor of these skills. The general register of the TOEIC is real-life, business-type English. The TOEFL is created, produced, and sold by Educational Testing Service in Princeton, New Jersey; the TOEIC was created by ETS, but is now entirely owned and operated by the Japanese TOEIC office in Tokyo.

The two tests are not radically different in structure (see Table 1). The topic treated in each test, however, is different. For example, in the reading comprehension subtests, the TOEFL uses passages which are purported to be found in first-year college textbooks, whereas the TOEIC tends to use business letters, short news items, and advertisements. However, the types of questions asked in the Reading Comprehension subsection (main idea, inference, attitude/tone, or application within the passage) are similar in the two tests. This same relationship (topic/type of question) exists in the Short Talks subsections. The Incomplete Sentence and Error Recognition subsections are also almost completely alike.

Listening Comprehension

Specific comparisons of listening sections. Section 1 of the TOEIC measures the ability of the examinee to recognize vocabulary in the context of a photo prompt (see Table 2). The TOEFL does not have any equivalent to this type of question. Examinees tend to feel that the photo prompt, providing visual context, is reassuring, even though both the question and possible answers are only spoken, not printed.

Section 2 of the TOEIC assesses the examinee's ability to listen to a question and choose the appropriate response (see Table 3).

Table 1. General comparison of tests

TOEFL:
- 3 major subtests; 5 subsections
- 150 questions
- scaled score ranges from 200 to 677
- examinees tend to be students (18-25 years old)
- results tend to determine schools to be attended and other academic matters
- I. Listening Comprehension: A. Short conversations (25 Qs); B. Short talks (25 Qs)
- II. Structure/Written Expression: C. Incomplete sentences (15 Qs); D. Error recognition (25 Qs)
- III. Reading Comprehension: E. Reading comprehension (30 Qs)

TOEIC:
- 2 major subtests; 7 subsections
- 200 questions
- scaled score ranges from 10 to 990
- examinees tend to be corporate-level employees (25-50 years old)
- results tend to determine overseas postings and other business-related matters
- I. Listening Comprehension: A. One photograph, spoken sentences (20 Qs); B. Spoken utterances, spoken response (30 Qs); C. Short conversations (30 Qs); D. Short talks (20 Qs)
- II. Reading Comprehension: E. Incomplete sentences (40 Qs); F. Error recognition (20 Qs); G. Reading comprehension (40 Qs)


Table 2. Question example comparison: listening comprehension

TOEFL: no comparable section on the TOEFL.

TOEIC: A. One photograph, spoken sentences (Gilfert & Kim, 1995):

Seen: A photo of two men talking across a table. An unused computer is in the background.
Heard:
(A) The two men are computing.
(B) The computer is having a meeting with the men.
(C) The two men are talking.
(D) One man is buying a computer.
(C) is the correct answer since it is closest in meaning to what is shown in the photo.

Some Japanese examinees have commented that this section seems to be mostly a structure test, listening for the grammatically correct response. Most examinees feel that this part of the TOEIC is the most difficult part of the listening component since both the prompt and the possible answers are only spoken, not printed.

Section 3 of the TOEIC is most similar to Section 1 of the TOEFL listening subtest (see Table 4). The TOEFL "short conversation" subsection of the tape follows an A: B: pattern, then a Narrator asks a question about the conversation. Four printed possible answers appear in the test booklet. The TOEIC "short conversation" tends to follow an A: B: A: pattern where the question and four possible answers are printed in the test booklet. Examinees tend to feel that the TOEIC is easier to understand in this section because both the question and possible answers are printed in the test book, which provides examinees more context into which to fit the conversation. In both tests, there is only one question per conversation.

Section 4 of the TOEIC is most similar to Section 2 of the TOEFL listening subtest, but the TOEIC talks tend to be shorter, and there tend to be fewer questions for each talk (see Table 5).

Table 3. Question example comparison: listening comprehension

TOEFL: Not comparable; look at the next section.

TOEIC: B. Spoken utterances, spoken response (Gilfert & Kim, 1995):

Heard: Hello, John.
Heard:
(A) Hi, John. How are you?
(B) Who's John?
(C) Good-bye, see you later.
The correct response is (A), since it is the most likely response to this greeting.


Table 4. Question example comparison: listening comprehension

TOEFL: A. Short conversations (adapted from ETS, 1993, p. 26):

Heard:
Man: Do you mind if I turn off the television?
Woman: Well, I'm watching this show.
Narrator: What does the woman imply?
Read:
(A) The man should show his watch to the woman.
(B) The man should leave the television on.
(C) The show will be over soon.
(D) The woman will show the television to the man.
The correct response is (B), since the woman implies that she is using the television.

TOEIC: C. Short conversation, four printed answers (Gilfert & Kim, 1995):

A: May I help you?
B: Yes, do you have this shirt in size 12?
A: Certainly. I'll get one for you.
Read:
Where is this conversation most likely taking place?
(A) in a hotel
(B) in a department store
(C) in a post office
(D) in an airport
The correct response is (B), since the conversation appears to be happening between a sales clerk and a customer.

The TOEFL "longer conversations" subtest normally has up to 1.5 minutes' worth of spoken language and asks 4 to 6 questions, whereas the TOEIC "short talks" subtest tends to have shorter talks (1-1.5 minutes) and ask 3-5 questions per talk. The content of the TOEFL talks is typically either long conversations (4-10 exchanges) between two people talking about school-related matters, or single speakers giving an introduction to a class or to a school club activity. The content of the TOEIC is typically made up of extended conversations (5 or 6 extended exchanges) between two people talking about office matters, or single speakers giving a news report or other information. Examinees tend to feel that the TOEIC material is less difficult than the TOEFL since both the amount of spoken language and the number of questions are reduced.

General comparison of listening subtests. Comparing the listening subtests in a general sense, there are 50 questions on tape in the TOEFL and 100 questions on tape in the TOEIC. The listening subtest of the TOEFL requires about 30-40 minutes to take; the listening subtest of the TOEIC takes about 40-50 minutes. Vocabulary is tested throughout both tests; it is no longer a separate section. The general register of the TOEFL listening subtest is academic, with the topics of conversation tending to be something about school or everyday life. The general register of the TOEIC listening subtest is "business," with more idioms being spoken than on the TOEFL and fewer polysyllabic words.

Grammar

The next two sections of the TOEIC and the TOEFL are identical in structure; the only difference is the register. These two sections are Incomplete Sentences and Error Recognition. They both assess the examinee's knowledge of English structure, or grammar.


Table 5. Question example comparison: listening comprehension

TOEFL: B. Longer conversations (Gilfert & Kim, 1990):

Heard: Questions 3, 4, 5, and 6 are based on the following conversation:
Woman: What do you think? Am I OK?
Man: Exhale slowly, please. Well, there is some congestion. I want to do some tests.
Woman: How soon will I get the results?
Man: Oh, you'll have the results before you leave the office, and here is some medicine that I believe will help you.
Question 3: What is the probable relationship between these two speakers?
Question 4: When will the woman receive the results of the tests?
Question 5: What does the man feel will help the woman?
Question 6: What is the woman's problem?
Read:
3. (A) dentist-patient (B) doctor-patient (C) teacher-student (D) pharmacist-customer
4. (A) in a few days (B) before leaving the office (C) very slowly (D) soon enough
5. (A) some medicine (B) some tests (C) exhaling slowly (D) filling her lungs with air
6. (A) She does not have enough air in her lungs. (B) She's exhaling too slowly. (C) She didn't do well in her tests. (D) She has a little congestion.
For question 3, (B) is the best answer.
For question 4, (B) is the best answer.
For question 5, (A) is the best answer.
For question 6, (D) is the best answer.

TOEIC: D. Short talks (Gilfert & Kim, 1995):

Heard: Sunshine is forecast for today after two damp days. Westerly winds will freshen by afternoon and chilly air will be transported across the metropolitan area. Clouds will overtake clear skies by morning.
Read:
1. How was the weather earlier this week? (A) Sunny (B) Cool and dry (C) Damp (D) Chilly
2. What kind of weather is expected tomorrow? (A) Cool and cloudy (B) Sunny and dry (C) Damp and windy (D) Cold and sunny
For question 1, (C) is the best answer. The announcer notes that the last two days have been damp.
For question 2, (A) is the best answer.


Table 6. Question example comparison: grammar

TOEFL
C. Incomplete Sentences (adapted from ETS, 1993, p. 27):

Read: The fact ______ traveller's checks can usually be easily changed to cash makes them a popular way to carry currency.
(A) of  (B) that  (C) is that  (D) which is

(B) is the most grammatically correct answer.

TOEIC
E. Incomplete Sentences (Gilfert & Kim, 1995):

Read: ______ girl over there is my sister.
(A) This  (B) These  (C) Those  (D) That

(D) is the answer that is grammatically correct here.

The TOEFL subtest is supposed to "measure ability to recognize language that is appropriate for standard written English" (ETS, 1993), and the TOEIC subtest measures much the same. The examples in Table 6 both test demonstrative pronoun usage. Either example would work for TOEFL or TOEIC. As noted before, this section on each test is practically identical. Register may differ slightly, but not significantly.

The examples in Table 7 both test word order. Again, either example would work for TOEFL or TOEIC. As noted before, this subtest on each test is practically identical. In comparing the grammar subtests, there are 40 questions in the Grammar subtest of the TOEFL and 60 questions in the Grammar subtest of the TOEIC. For timing, examinees should allow about 25 seconds for each question in this subtest, taking about 15 or 20 minutes for the TOEFL and about 25 minutes for the TOEIC. However, the Grammar and Reading Comprehension sections are timed together. If an examinee can quickly (and accurately) go through the Grammar section, then more time is left for the Reading Comprehension questions.

Reading Comprehension
The examples in Table 8 clearly show the difference in register between TOEFL and TOEIC Reading questions. The TOEFL reading example is likely adapted from some textbook, and the TOEIC example is a news report.

Table 7. Question example comparison: grammar (error recognition)

TOEFL
D. Error Recognition (adapted from ETS, 1993, p. 28):

Read: Good puzzles provide an (A) excellent way to explore the (C) area of thought abstract (D).

(D) is the error here; the words are in the wrong order.

TOEIC
F. Error Recognition (Gilfert & Kim, 1995):

Read: In today's class middle (A), both parents have to work (B) in (C) order to pay all their bills (D).

(A) is an error in word order, making it the correct answer.


Table 8. Question example comparison: reading comprehension

TOEFL
Questions 40-41 refer to the following passage (Gilfert & Kim, 1990, p. 93):

In 1920, after some thirty-nine years of problems with disease, high costs, and politics, the Panama Canal was officially opened, finally linking the Atlantic and Pacific Oceans by allowing ships to pass through the fifty-mile canal zone instead of traveling some seven thousand miles around Cape Horn. It takes a ship approximately eight hours to complete the trip through the canal and costs an average of fifteen thousand dollars, one-tenth of what it would cost an average ship to round the Horn. More than fifteen thousand ships pass through its locks each year. The French initiated the project but sold their rights to the United States. The latter will control it until the end of the twentieth century when Panama takes over its duties.

40. According to the passage, who currently controls the Panama Canal?
(A) France  (B) United States  (C) Panama  (D) Canal Zone

41. In approximately what year will a different government take control of the Panama Canal?
(A) 2000  (B) 2100  (C) 3001  (D) 2999

40. (B) is the correct answer. 41. (A) is the correct answer.

TOEIC
Questions 1-2 refer to the following report (Gilfert & Kim, 1995):

SAN FRANCISCO (NNN): Rains, accompanied by high winds, closed a number of schools on Monday, but the storm was welcomed because it brought some relief from a fifth straight year of drought.

1. What has the weather been like in California for the last five years?
(A) wet  (B) windy  (C) dry  (D) mild

2. Why would people want rain?
(A) California people enjoy walking in the rain.
(B) California roads need relief from the sun.
(C) California school children want to study rain.
(D) There has been no rain for five years.

1. (C) is the correct answer. 2. (D) is the correct answer.

However, the types of reading comprehension questions are mostly the same on each test: main idea, details, inference, and/or author's attitude.

The TOEFL reading comprehension subtest tends to have 3-5 questions per reading passage, and the TOEIC tends to have 3-4 questions per passage. In comparing the reading subtests, there are 30 questions in the Reading subtest of the TOEFL. There are 100 questions in the second section of the TOEIC: 60 questions in the Grammar subtest and 40 questions in the Reading subtest. The TOEIC Grammar/Reading sections are timed together, and the examinee is free to switch back and forth between subtests. The Grammar and Reading subtests of the TOEFL require 80 minutes to take; the Grammar/Reading subtest of the TOEIC takes 75 minutes.


Purposes of the Tests

Both the TOEIC and the TOEFL are useful tests, but each is used for different purposes. The stated purposes of both tests are to provide a general measure of the examinees' English ability. However, some institutions misuse the tests for purposes they were not designed to measure. For example, the TOEFL is used correctly as a measure of an examinee's overall ability to use academic English or to determine whether an examinee may enter full-time study at a university. Sometimes, however, TOEFL results are incorrectly used for placement purposes in EFL settings or as determiners for employment or placement within a company. Many Japanese universities and colleges regularly offer TOEFL-preparation courses to their students. These students may be thinking about going to graduate schools overseas, or simply need to get a high TOEFL score in order to get a job.

The TOEIC is correctly used to assess an examinee's overall English proficiency in a business context. TOEIC scores are increasingly being required by corporate employers of either entering employees or employees who are being considered for promotion and/or overseas assignments. Employers often use TOEIC scores as a screening device, hiring only those who meet a certain predetermined TOEIC score. As a result of this practice, Japanese colleges, universities, and tertiary-level vocational schools are now offering TOEIC-preparation courses in greater numbers than five years ago. TOEIC-preparation courses have already been offered by language schools throughout Japan for many years. Some corporate employers use the TOEIC incorrectly, by requiring their domestic employees (who do not use English on a regular basis) to obtain a certain score for promotion or raises.

Score Comparison

Some researchers and students of testing believe that the TOEFL shows the difference between intermediate-to-advanced levels in English proficiency very well, but does not discriminate well on scores lower than about 450 (generally considered intermediate-level). For most U.S. academic purposes, a TOEFL score of 500 is acceptable for a student to begin part-time studies, along with some ESL courses. A TOEFL score of 550 is generally considered acceptable for a non-native English speaker to do undergraduate studies full-time, and a TOEFL score of 600 is generally accepted for full-time graduate studies.

The TOEIC, on the other hand, is believed to show the differences between low-beginner and high-intermediate levels very well. A TOEIC score of 450 is frequently considered acceptable for hiring practices, with the understanding that the employee will continue English studies. A TOEIC score of 600 is frequently considered the minimum acceptable for working overseas. Domestic-based engineers who have a TOEIC score of 500 are considered reasonably proficient in English. If the same engineer is being considered for a posting overseas, he or she must usually try for a TOEIC score of about 625. A domestic-based desk worker with a TOEIC score of 600 is considered reasonably proficient in English. For the same desk worker to go overseas, she or he must usually have a TOEIC score of 685.
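To make these rough benchmarks concrete, the sketch below (Python, illustrative only) encodes them as a simple lookup. The role and posting labels and the meets_benchmark helper are hypothetical; the numbers are the informal guideline values described above, not official ETS cut scores.

    # Illustrative sketch only: the rough TOEIC benchmarks discussed above,
    # encoded as a lookup table. Values are informal guidelines, not official cut scores.
    TOEIC_BENCHMARKS = {
        ("engineer", "domestic"): 500,
        ("engineer", "overseas"): 625,
        ("desk worker", "domestic"): 600,
        ("desk worker", "overseas"): 685,
    }

    def meets_benchmark(role, posting, score):
        """True if the score reaches the rough benchmark for this role and posting."""
        return score >= TOEIC_BENCHMARKS[(role, posting)]

    print(meets_benchmark("engineer", "overseas", 610))     # False
    print(meets_benchmark("desk worker", "domestic", 640))  # True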

What do the Scores Mean?
The TOEIC office in Tokyo, Japan, has published a comparison between the Oral Proficiency Interview (the OPI is used by the U.S. Foreign Service), TOEFL scores, TOEIC scores, other tests, and the Japanese Eiken.¹ All these tests assess English reading, listening, and grammar proficiency. The OPI and Eiken series further test speaking ability. The Oral Proficiency Interview is considered one of the best tests since it provides a means of testing the examinees' productive language skills, as well as their receptive language skills. However, due to time and cost considerations, the OPI is an impractical test to administer for large numbers of people. The Educational Testing Service also administers the Test of Spoken English (TSE) and the Test of Written English (TWE). Of the 12 official (International) TOEFL tests administered every year, five include a TWE in addition to the regular TOEFL. The TSE has its own testing schedule, since it requires making an audio tape. Many schools do not require these additional tests as part of their admissions procedures, although non-native English speaking graduate students who wish to become Graduate Teaching Assistants are increasingly required to pass the TSE in order to get an assistantship. Both TOEFL and TOEIC test receptive skills (listening and reading) rather than productive skills (speaking and writing). It is possible for students to score very high on the TOEFL, but not be able to use oral or written English in context. Many examinees become expert in taking language tests, but do not learn how to use the language. Therefore, the author maintains that TOEFL and TOEIC tests operate in an "artificial reality."

The tests, when used alone, are valid and reliable in themselves, but not in a larger sense. Examinees who score well on these tests may have self-confidence in the language classroom, but using their language skills in the real world may be quite a different thing. In theory, examinees with a TOEFL score of 550 should be able to use their receptive English skills better than examinees with a TOEFL score of 500. But the examinee's English will also be used productively in the actual academic world. Hence, their English listening abilities have to be good enough to permit them to write notes in class, and their writing abilities must typically be equal to English native speakers' when creating papers or reports. Similarly, the language used in the TOEIC is intended to be used in receptive language contexts. An examinee with a score of 650 would be expected to operate in an English-speaking business context better than an examinee with a score of 600. In the real world, examinees will be reading and generating faxes and reports, listening to and making presentations, and using the telephone. Examinees who excel in taking paper tests, yet are unable to use their language productively, will be at a loss in the real world.


Are the Scores Comparable?
Finally, the question comes to this: What is the difference between the TOEFL and the TOEIC? Can they be compared at all? The scoring system is different and the number of questions is different, as is the amount of time needed to take each test. The register is also different. The reasons for taking each test (the examinees' motivation) can be different (except perhaps in the area of securing employment), and the ways of using the results of the tests are different. The vocabulary in the two tests has areas of similarity, but there are some noticeable differences. Some examinees feel that the TOEIC is easier than the TOEFL. Some students of testing consider that the TOEFL is a more accurate discriminator for higher-level examinees, and the TOEIC is a more accurate lower-level discriminator. The tests were both created by Educational Testing Service, and test American English. ETS has calculated a number of reliability and validity checks on both tests, so they are both considered accurate and useful when used within the guidelines published by ETS. The grammar subtests of both tests are quite similar, and the types of questions asked in the Reading Comprehension subtest (main idea, details, inference, and/or author's attitude) are similar. In short, with proper understanding of the TOEFL and the TOEIC, they can be useful, but they must be used properly, with full knowledge of their limitations.

Notes

1 The Japanese Eiken is a six-part series of non-standardized tests produced by the Society for Testing English Proficiency (STEP), published in Tokyo, Japan. The STEP test levels are level 4 (low beginning), 3 (high beginning), pre-2 (low intermediate), 2 (high intermediate), pre-1 (low advanced), and 1 (high advanced). STEP is offered by preregistration at a relatively low cost (¥2,500-¥3,000 per person per test) twice a year in Japan. There is no limit on how often a person can take the test. In contrast, the International TOEFL is available, by preregistration only, 12 times a year, anywhere in the world. The cost of taking a TOEFL is U.S. $42 on a Friday administration or $35 on a Saturday administration, payable in U.S. (or Canadian) funds only. Five times a year, the TWE is part of the TOEFL at no extra cost. The International TOEFL is the official TOEFL; the scores are sent at the examinee's request to examinee-selected schools. Five of the International TOEFL administrations are disclosed. Examinees can choose to give a self-addressed envelope and postage for 43 grams from the U.S. to the test administrators. The examinee's test booklet will be mailed a week or so later to the examinee. There are other versions of the TOEFL, called Institutional TOEFL. Institutions may choose to purchase and offer the TOEFL to their students or employees at any time they wish, as long as that date does not conflict with an International TOEFL. These scores are used within the institution which offers the TOEFL; the scores do not leave the institution for use in applying to schools in the U.S. or Canada. The TSE is offered at other times, for U.S. $80 (for TSE-A) or $110 (for TSE-P). The TOEIC is offered, by preregistration only, six times a year for ¥6,500, payable in Japanese yen or the equivalent in local funds. This information may change. Please check the latest bulletin of information for the latest prices, test dates, and availability. Bulletins are available free of charge at many bookstores and university or college campuses.

References

Gilfert, S., & Kim, V. (1990). TOEFL Strategies. Nagoya, Japan: Kawai Juku Press.

Gilfert, S., & Kim, V. (1995). TOEIC Strategies. Tokyo: Macmillan Japan K.K.

ETS. (1993). TOEFL 1993-4 bulletin of information. Princeton, NJ: Educational Testing Service.

Wilson, K. M. (1989). TOEIC research report #1. Princeton, NJ: Educational Testing Service.


Chapter 10

English Language Entrance Examinations at Japanese Universities: 1993 and 1994

JAMES DEAN BROWN

UNIVERSITY OF HAWAII AT MANOA

SAYOKO OKADA YAMASHITA

INTERNATIONAL CHRISTIAN UNIVERSITY

As we have pointed out in more detail in Brown and Yamashita (1995), prestigious secondary schools and universities in Japan administer their own entrance examinations on a yearly basis to students who would like to be admitted to their student bodies. To prepare for these examinations, students spend months and years working industriously in school, at home, and in jukus in a kind of language testing hysteria (Brown, 1993). The whole process is known as shiken jigoku, or examination hell. Examination hell is not a new phenomenon. Amano (1990, p. xx) says that it existed in its present form in the 1920s with roots back to the Meiji Restoration. Because most Japanese believe that the success of their children hinges on passing these examinations, ". . . families devote a surprising proportion of their resources toward assisting their children in exam preparation, and children devote long hours day after day to study" (Cummings, 1980, p. 206).

Many Japanese are not 100 percent happy with the current examination system. Frost (1991) gives an overview of ways that Japanese criticize, need, admire, and tolerate the entrance examination system. Tsukada (1991, p. 178) explains ways in which these examinations have "undesirable effects on curriculum, on foreign language instruction, on family life, and on children's emotional, physical, and intellectual development." As Frost (1991, p. 303) noted, however, "separate university achievement tests simply have too long a history and meet too many needs . . . to disappear simply because a new generation is beginning to be truly worried about them." Since this examination system so dramatically affects English teaching in Japan, we feel that EFL teachers should know as much as possible about it.

To that end, we published Brown & Yamashita (1995), a study of the English language entrance examinations at 21 Japanese universities in 1993. The present study expands on that project by describing the examinations produced by the same universities in 1994 and comparing the 1993 and 1994 examinations.

Definitions
Before looking at those examinations in detail, we will briefly define some key terms (fuller explanations are given in Brown & Yamashita, 1995):

Test item. The smallest distinctive unit on a test that yields separate information, that is, the smallest part that could be scored separately (after Brown, 1995). For instance, a single multiple-choice item is a test item, but so is an essay or a translation task.

Discrete point vs. integrative. A discrete-point item is one that is designed to test a single well-defined language point; for example, true-false, multiple-choice, matching, and sometimes fill-in items are usually discrete-point in nature. Integrative test items are harder to define and identify because they are situated in a language context and because they may interact with each other and other elements of the language context in relatively complex ways; for instance, dictations, cloze tests, essay writing tasks, and interviews are made up of integrative items.

Receptive vs. productive. Receptive item types are those in which the students receive language in the sense of reading it or hearing it, but are not required to produce language (i.e., are not required to speak or write anything); generally, true-false, multiple-choice, and matching items are receptive. Productive items require the students to actually produce language in one form or another; fill-in, short answer, and task-oriented items (e.g., compositions, interviews, role plays, etc.) are productive.

Translation items. Translation items require the students to translate a phrase, sentence, paragraph, etc. from their first language (L1) into their second language (L2), or vice versa; for instance, a student might be required to write out in Japanese the translation of a sentence in English. We argue in Brown and Yamashita (1995) that translation may be too difficult, demanding, and specialized a task to require of students who have only studied a language through junior and senior high school, and that putting translation tasks on a test signals the students that it is a legitimate strategy to use in normal communication.

Purpose
The purpose of this study is to further investigate the Japanese university English entrance examinations by examining some of those produced in 1994. To that end, we have set ourselves two goals. Our first goal is to describe the state of affairs in 1994 at some of the major universities in Japan in order to better understand how these examinations vary from university to university. Our second goal is to compare the 1994 examinations with those produced in 1993 to determine how the examinations vary from year to year.

To those ends, the following more formal research questions were posed:

1. How difficult are the various reading passages used in the 1994 university English language entrance examinations?

2. Are there differences in the levels of reading passage difficulty in private and public university examinations in 1993 and 1994?

3. What types of items are used on the 1994 English language entrance examinations, how varied are they, and how does test length vary?

4. Are there differences in the types of items and test lengths found in private and public university examinations in 1993 and 1994?

5. What skills are measured on the 1993 and 1994 English language entrance examinations?

It is hoped that answering these questions will help English teachers in Japan prepare their students for such tests and encourage responsible testing practices at the various universities that do this kind of testing.

Method

Materials
Two books served as our primary sources: Koko-Eigo Kenkyū, 1994a, and Koko-Eigo Kenkyū, 1994b. Both were readily available at commercial bookstores. Each book contains a number of English language entrance examinations along with hints in Japanese on studying for these examinations. Koko-Eigo Kenkyū, 1994a contained examinations from 67 private universities. The same ten universities used in Brown and Yamashita (1995) were selected for this study:


Aoyama University, Doshisha University, Keio University, Kansai Gaidai (Foreign Languages) University (KANGAI), Kansai University, Kyoto University of Foreign Studies (KYOTOUFS), Rikkyo University, Sophia University, Tsuda University, and Waseda University. Koko-Eigo Kenkyū, 1994b contained 56 public university examinations, 41 of which were from national universities, 15 from municipal universities, and one of which was the nationwide "center" examination. The same ten public universities used in our previous study were selected here. Eight were national universities: Hitotsubashi University (HITOTSU), Hokkaido University, Kyoto University, Kyushu University, Nagoya University, Osaka University, Tokyo University, and Tokyo University of Foreign Studies (TYOUFS). Two were municipal: Tokyo Municipal University (TORITSU) and Yokohama City University.

The Daigaku Nyuushi Sentaa (or so-called center examination), which is administered nationwide, was also included in both studies for a total of 21 examinations (see Brown & Yamashita, 1995, p. 12, for more information on the center examination).

We originally selected ten each of the most prestigious private and public universities with a view to also getting a good geographical distribution. The center examination was included because it is administered nationwide.

Analyses
We began by labeling each item for its type, its purpose, the number of options students had, the language(s) involved (Japanese or English), the task required, and any other salient features.

Data entry took several forms in this project, but all of the computer programs used were for IBM (MS-DOS) computers, as follows: 1) each item was coded for item type and recorded in the QuattroPro spreadsheet program (Borland, 1991); 2) all reading passages were typed into a word processing program using the WordPerfect (WordPerfect, 1988) computer program; 3) the reading passages in the entrance examinations were analyzed using the RightWriter computer program (Que Software, 1990) for such features as number of words, number of unique words, percent of unique words, number of sentences, syllables per word, and words per sentence, as well as the Flesch, Flesch-Kincaid, and Fog readability indexes. The number of words, percent of unique words, number of sentences, syllables per word, and words per sentence are self-explanatory, but some of the other characteristics may not be so clear. The number of unique words can also be described as the number of different words (i.e., no single word counts more than once). The Flesch, Flesch-Kincaid, and Fog readability indexes are all ways of estimating the reading-level difficulty of a passage. The Flesch scale ranges from 0 to 100, with a higher number indicating an easier passage. The Flesch-Kincaid and Fog readability indexes are both meant to show the grade level of students for which the reading passages should be appropriate. (For more on these readability indexes, see Brown & Yamashita, 1995; Que Software, 1990, pages 7.5 to 7.6; and Flesch, 1974.)
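For readers who would like to reproduce this kind of analysis, the Python sketch below is a minimal approximation of the passage statistics just listed. It is not the RightWriter program itself, and its syllable counter is a crude vowel-group heuristic, so its values will only roughly match those in Tables 1A and 1B; the formulas for the three indexes, however, are the standard published ones.

    # Minimal sketch (not RightWriter) of the passage statistics used in this study.
    # The syllable counter is a crude vowel-group heuristic, so results will differ
    # slightly from published readability tools.
    import re

    def count_syllables(word):
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def passage_statistics(text):
        words = re.findall(r"[A-Za-z']+", text)
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        n_words, n_sents = len(words), max(1, len(sentences))
        unique = len({w.lower() for w in words})
        syllables = sum(count_syllables(w) for w in words)
        complex_words = sum(1 for w in words if count_syllables(w) >= 3)
        wps = n_words / n_sents      # words per sentence
        spw = syllables / n_words    # syllables per word
        return {
            "words": n_words,
            "unique words": unique,
            "unique words (%)": 100.0 * unique / n_words,
            "sentences": n_sents,
            "syllables/word": spw,
            "words/sentence": wps,
            # Standard published formulas:
            "Flesch": 206.835 - 1.015 * wps - 84.6 * spw,
            "Flesch-Kincaid": 0.39 * wps + 11.8 * spw - 15.59,
            "Fog": 0.4 * (wps + 100.0 * complex_words / n_words),
        }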

Only descriptive statistics were used in this project to make comparisons between universities, types of universities, and the 1993 and 1994 results. These statistics consisted primarily of averages and percents. Because of the descriptive nature of this study, no inferential statistics were used.
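As a small illustration of these descriptive summaries, the sketch below converts one examination's item-type tallies into percentages (as in Tables 3A and 3B) and averages tallies across a group of universities (as in Table 4). The two example tallies are a small subset of the Table 3A frequencies; the function names are hypothetical.

    # Sketch of the descriptive summaries used in this study: per-examination
    # tallies are converted to percentages, and tallies are averaged across a group.
    from collections import defaultdict

    def to_percentages(tallies):
        total = sum(tallies.values())
        return {item_type: 100.0 * n / total for item_type, n in tallies.items()}

    def average_by_group(exams):
        sums = defaultdict(float)
        for tallies in exams.values():
            for item_type, n in tallies.items():
                sums[item_type] += n
        return {item_type: s / len(exams) for item_type, s in sums.items()}

    # A subset of the Table 3A tallies, for illustration only:
    private = {
        "AOYAMA":   {"multiple-choice": 10, "fill-in": 0, "translate (E->J)": 2},
        "DOSHISHA": {"multiple-choice": 34, "fill-in": 6, "translate (E->J)": 1},
    }
    print(to_percentages(private["AOYAMA"]))
    print(average_by_group(private))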

Results

The results of this study address the original research questions posed earlier. As such, the questions themselves will be used as headings to help organize the discussion.

1. How difficult are the various reading passages used in the 1994 university English language entrance examinations?

Since all of the examinations in this study used at least some reading passages as the basis for items on the test, we began by examining those passages. Tables 1A and 1B present the statistics for the reading passages on the examinations at the private and public universities, respectively. Notice that the universities are labeled across the top of the columns, while the statistics are labeled for the rows.


TABLE 1A: READING PASSAGE STATISTICS FOR PRIVATE UNIVERSITIES, 1994 (AVERAGES)

STATISTIC           AOYAMA  DOSHISHA   KEIO  KANGAI  KANSAI  KYOTOUFS  RIKKYO  SOPHIA   TSUDA  WASEDA
NO. OF PASSAGES          2         2      1       2       2         2       2       7       3       4
WORDS               445.50    622.00 914.00  423.50  778.00    471.50  507.00  410.57  332.67  565.75
UNIQUE WORDS        228.00    303.50 435.00  221.00  304.50    226.50  258.50  217.71  157.00  296.50
UNIQUE WORDS (%)     51.18     48.79  47.59   52.18   39.14     48.04   50.99   53.03   47.19   52.41
SYLLABLES/WORD        1.42      1.64   1.62    1.57    1.37      1.52    1.39    1.50    1.47    1.61
SENTENCES            27.00     29.00  36.00   23.50   55.50     29.50   22.50   19.71   16.67   28.50
WORDS/SENTENCE       17.75     19.91  22.29   19.84   14.20     15.62   21.81   20.41   23.86   19.44
FLESCH               69.09     47.90  46.85   53.96   76.93     62.52   67.11   59.29   58.52   51.39
FLESCH-KINCAID        8.04     11.53  12.26   10.66    6.06      8.42    9.32   10.06   11.03   10.92
FOG                  10.58     14.12  14.56   12.42    8.64      9.98   11.82   11.84   13.75   12.85

TABLE 1B: READING PASSAGE STATISTICS FOR PUBLIC UNIVERSITIES, 1994 (AVERAGES)

STATISTIC          HITOTSU  HOKKAIDO  KYOTO  KYUSHU  NAGOYA   OSAKA   TOKYO  TORITSU  TYOUFS  YOKOHAMA
NO. OF PASSAGES          2         4      2       3       2       3       3        2       7         4
WORDS               598.50    335.75 461.50  488.33  466.50  297.00  410.00   513.50  342.43    262.75
UNIQUE WORDS        312.00    177.00 234.50  258.33  267.00  184.33  210.67   247.00  176.14    156.25
UNIQUE WORDS (%)     52.13     52.72  50.81   52.90   57.23   62.07   51.38    48.10   51.44     59.47
SYLLABLES/WORD        1.45      1.43   1.45    1.47    1.68    1.57    1.42     1.48    1.47      1.59
SENTENCES            36.00     20.25  40.00   33.00   19.00   14.33   27.00    32.00   17.43     12.75
WORDS/SENTENCE       16.56     16.03  13.43   15.41   23.96   20.75   16.01    15.97   19.26     21.83
FLESCH               67.49     69.36  70.51   67.12   40.21   52.58   70.33    65.30   63.16     50.28
FLESCH-KINCAID        7.96      7.57   6.76    7.72   13.61   11.08    7.42     8.12    9.24     11.67
FOG                   9.93      9.29   9.09    9.95   15.28   13.59    9.61    10.83   11.53     13.65

Tables 1A and 1B indicate that all of the private and public universities used at least one passage and that one of the private universities, SOPHIA, and one of the public universities, TYOUFS, used as many as seven passages. The average lengths of the passages can be considered at the same time by examining the number of words. While SOPHIA and TYOUFS had numerous passages, they were shorter than the passages at most of the other universities, at 410.57 and 342.43 words per passage on average, respectively. In contrast, one of the private universities, KEIO, used only one passage, but used a relatively long one at 914 words.

The number of unique words represents the number of different words used in the passages because each word is counted only once in this statistic. The percent of unique words was calculated by dividing the number of unique words by the total number of words. The percent of unique words can be considered an indicator of the relative amount of variety in the vocabulary of the passages because it shows the proportion of words which were unique and is therefore easily comparable across passages, universities, and university categories. The number of syllables per word gives additional information about vocabulary in that it is a rough indicator of the difficulty of the words (based on the idea that longer words are usually more difficult than shorter ones).

The number of sentences per passage is relatively straightforward to interpret. More interesting perhaps is sentence length, which can be gauged by looking at the words-per-sentence statistic, which describes the average length of the sentences. Average sentence length is often considered a rough indication of the syntactic complexity of a passage.

The Flesch readability index indicates that the passages at the various universities ranged in overall reading level from "fairly easy" (76.93) at KANSAI to "difficult" (40.21) at NAGOYA. The Flesch-Kincaid readability index indicates that the passages would be appropriate for native speakers ranging generally from early sixth grade (about 11 years old) at KANSAI (6.06) to late thirteenth grade level (first year of university, or about 18 years old) at NAGOYA (13.61). The Fog index generally appears to agree with the Flesch-Kincaid one, but is consistently about two grades higher.
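The verbal labels used here ("fairly easy," "difficult") follow the conventional Flesch bands. The sketch below maps a score to those bands; the boundaries are the usual published convention (after Flesch, 1974) and should be read as approximate rather than exact cutoffs.

    # Map a Flesch reading ease score to the conventional verbal bands.
    # Band boundaries follow the usual published convention; treat them as approximate.
    def flesch_label(score):
        bands = [(90, "very easy"), (80, "easy"), (70, "fairly easy"),
                 (60, "standard"), (50, "fairly difficult"), (30, "difficult")]
        for lower, label in bands:
            if score >= lower:
                return label
        return "very difficult"

    print(flesch_label(76.93))  # "fairly easy"  (KANSAI passages)
    print(flesch_label(40.21))  # "difficult"    (NAGOYA passages)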


TABLE 2: READING PASSAGE STATISTICS SUMMARIZED BY UNIVERSITY TYPE (AVERAGES)

                          1993 EXAMS                          1994 EXAMS
STATISTIC          PRIVATE  PUBLIC  CENTER   TOTAL     PRIVATE  PUBLIC  CENTER   TOTAL
NO. UNIVERSITIES        10      10       1      21          10      10       1      21
NO. OF PASSAGES       2.30    3.40       3    2.86        2.70    3.20    3.00    2.95
WORDS               540.15  378.33  178.33  445.87      547.05  417.63  368.00  476.89
UNIQUE WORDS        272.58  196.95  100.67  228.38      264.82  222.32  189.67  241.01
UNIQUE WORDS (%)     50.74   52.94   53.88   51.84       49.05   53.83   51.54   51.44
SYLLABLES/WORD        1.51    1.52    1.41    1.51        1.51    1.50    1.49    1.50
SENTENCES            30.09   21.18    9.33   24.86       28.79   25.18   24.67   26.87
WORDS/SENTENCE       19.03   20.18   17.01   19.48       19.51   17.92   18.77   18.72
FLESCH               60.40   58.29   70.35   59.87       59.35   61.63   61.91   60.56
FLESCH-KINCAID        9.38   10.06    7.67    9.62        9.83    9.11    9.29    9.46
FOG                  11.18   12.19   10.17   11.61       12.05   11.28   10.83   11.63

2. Are there differences in the levels of reading passage difficulty in private and public university examinations in 1993 and 1994?

The summary statistics shown in Table 2 reveal some interesting overall (average) differences in 1993 and 1994 between the private and public universities and the center examination. Notice that the overall averages are also given for all 21 universities considered together (TOTAL).

In both years, the public universities appear to have more reading passages, but shorter ones, than the private universities, while the center examination is somewhere in between in terms of the number of passages, but with considerably shorter passages. Notice that on average the passages in every category are longer in 1994 than in 1993. So, it appears that there is a tendency for passages on the examinations to get longer.

In terms of percentage of unique words, syllables per word, and words per sentence, there are no particularly interesting differences between private, public, and center examinations, or between the 1993 and 1994 examinations. The average readability levels of the passages involved in these examinations vary, however, by type of university. The Flesch, Flesch-Kincaid, and Fog readability scales all indicate that the passages became slightly easier overall between 1993 and 1994. Examining the results in more detail reveals that the passages on the private university and center examinations became slightly more difficult on average between 1993 and 1994, while the passages in the public university examinations became slightly easier, on average.

3. What types of items are used on the 1994 English language entrance examinations, how varied are they, and how does test length vary?

Item types. Tables 3A and 3B summarize the different types of items that we found on the 1994 entrance examinations. Table 3A does so for the private universities, and Table 3B for the public ones. Notice that, as in the previous tables, the university names are provided across the top as column labels. Then, down the left-hand side, other labels indicate that the upper portion of each table contains frequencies while the lower part gives the equivalent percents. The frequencies are provided so that the actual tallies (or frequencies) can be examined by readers, but the percents are also given so that readers can make easier comparisons between universities (without the confusion caused by differing test lengths).

Notice further that, within the frequencies and percents, item types are categorized by skill area, with reading and writing collapsed together to represent written tests, translation treated as a separate skill (as discussed above in the Definitions section), and listening handled as yet a third skill. Within each skill area, we found a variety of different types of items. Within the reading/writing skills, we found multiple-choice, true-false, rephrasing or reordering, fill-in the blank, and short answer or essay items. Within translation, we found both English to Japanese translation and Japanese to English. Among the listening items, we found true-false, multiple-choice, fill-in, dictation, and short-answer items.


Item variety. Tables 3A and 3B provide a great deal of information. Naturally, we cannot hope to interpret all of that information here in prose. Some salient facts do, however, stand out in these two tables.

For example, notice that different universities use different combinations of item types. Some universities, like those labeled KANGAI and SOPHIA, place great emphasis on multiple-choice items, while other universities like KYOTO place heavy emphasis on translation. Notice also that other universities, like AOYAMA, TOKYO, and TYOUFS, prefer to put a fairly heavy emphasis on listening items, while the other universities do not. The reader should continue to explore Tables 3A and 3B to discover how very different the examinations are from university to university. The one thing that seems clear, at this point, is that the nature of the item types on the various university entrance examinations varies tremendously. Thus the types of items that a student will face depend to a great degree on which examination the student chooses to take.


As in all classifying and categorizing, however, this process of analyzing the item types oversimplifies reality. For example, within each of our seemingly simple categories, there were additional differences. Consider just the first category of reading/writing multiple-choice items. There was tremendous variation in this seemingly straightforward item type. For instance, in terms of the number of options supplied in the items, we found them ranging from three to five. Some of the multiple-choice items had the options in English, some in Japanese. Some of the items were based on the reading passages (in a wide variety of different ways) and some stood alone. Some items posed a straightforward question; others required the students to select the option that successfully filled a blank. Still others required the students to select the option that matched a word or phrase, and finally, some provided context for blanks in the form of multiple-choice cloze. In addition, some of the items were based on passages that the students had to read while others were relatively independent. All in all, with the various combinations of the factors just discussed, we found a tremendous variety of different types of items. However, even this is an oversimplification. In short, we were amazed at the variety of different types of multiple-choice items that the human mind can create.

TABLE 3A: THE VARIETY OF ITEM TYPES ON PRIVATE UNIVERSITY EXAMINATIONS, 1994

SKILL: ITEM TYPE       AOYAMA DOSHISHA   KEIO  KANGAI  KANSAI KYOTOUFS  RIKKYO  SOPHIA   TSUDA  WASEDA

FREQUENCIES
READING/WRITING:
MULTIPLE-CHOICE            10      34       0     79      43       43      23      80       8      17
TRUE-FALSE *                0       0       0      0       0        0       0       0       5       0
REPHRASE/REORDER            0       3       0      0       0        0       0       0       0       0
FILL-IN                     0       6       0      0       0        0      14       0      31      14
SHORT-ANSWER/ESSAY          1       0       4      0       3        1       1       0       0       1
TRANSLATION:
TRANSLATE (E->J)            2       1       3      0       2        1       1       0       2       0
TRANSLATE (J->E)            2       1       1      0       0        3       0       0       2       0
LISTENING:
TRUE-FALSE                  0       0       0      0       0        0       0       0       0       0
MULTIPLE-CHOICE            10       0       0      0       0        0       0       0       0       0
FILL-IN                     0       0       0      0       0        0       0       0       0       0
DICTATION                   0       0       0      0       0        0       0       0       1       0
SHORT-ANSWER *              0       0       0      0       0        0       0       0       0       0
TOTAL NO. OF ITEMS         25      45       8     79      48       48      39      80      49      32
TIME ALLOWED              100     100     120     90      90       80      95      90     100      90

PERCENTAGES
READING/WRITING:
MULTIPLE-CHOICE         40.00   75.56    0.00 100.00   89.58    89.58   58.97  100.00   16.33   53.13
TRUE-FALSE *             0.00    0.00    0.00   0.00    0.00     0.00    0.00    0.00   10.20    0.00
REPHRASE/REORDER         0.00    6.67    0.00   0.00    0.00     0.00    0.00    0.00    0.00    0.00
FILL-IN                  0.00   13.33    0.00   0.00    0.00     0.00   35.90    0.00   63.27   43.75
SHORT-ANSWER/ESSAY       4.00    0.00   50.00   0.00    6.25     2.08    2.56    0.00    0.00    3.13
TRANSLATION:
TRANSLATE (E->J)         8.00    2.22   37.50   0.00    4.17     2.08    2.56    0.00    4.08    0.00
TRANSLATE (J->E)         8.00    2.22   12.50   0.00    0.00     6.25    0.00    0.00    4.08    0.00
LISTENING:
TRUE-FALSE               0.00    0.00    0.00   0.00    0.00     0.00    0.00    0.00    0.00    0.00
MULTIPLE-CHOICE         40.00    0.00    0.00   0.00    0.00     0.00    0.00    0.00    0.00    0.00
FILL-IN                  0.00    0.00    0.00   0.00    0.00     0.00    0.00    0.00    0.00    0.00
DICTATION                0.00    0.00    0.00   0.00    0.00     0.00    0.00    0.00    2.04    0.00
SHORT-ANSWER *           0.00    0.00    0.00   0.00    0.00     0.00    0.00    0.00    0.00    0.00
TOTAL PERCENT          100.00  100.00  100.00 100.00  100.00   100.00  100.00  100.00  100.00  100.00

* New item types on the 1994 examinations not found on the 1993 examinations.


TABLE 3B: THE VARIETY OF ITEM TYPES ON PUBLIC UNIVERSITY EXAMINATIONS, 1994

SKILL: ITEM TYPE      HITOTSU HOKKAIDO   KYOTO  KYUSHU  NAGOYA   OSAKA   TOKYO TORITSU  TYOUFS YOKOHAMA

FREQUENCIES
READING/WRITING:
MULTIPLE-CHOICE             3      11       0       3      12       5       7       0      12       21
TRUE-FALSE *                0       0       0       0       0       0       0       0       0        0
REPHRASE/REORDER            0       0       0       0       0       0       0       0       1        0
FILL-IN                     0       9       0       1       3       0       3       0       0        2
SHORT-ANSWER/ESSAY          5       1       0       6       5       2       3       6      10        2
TRANSLATION:
TRANSLATE (E->J)            5       6      10       7       6       5       4       6       0        2
TRANSLATE (J->E)            2       4       2       3       4       2       1       5       0        4
LISTENING:
TRUE-FALSE                  0       0       0       0       0       0       0       0       0        0
MULTIPLE-CHOICE             0       0       0       0       0       0       0       0       5        0
FILL-IN                     0       0       0       0       0       0       0       0      12        0
DICTATION                   0       0       0       0       0       0       0       0       0        0
SHORT-ANSWER *              0       0       0       0       0       0      10       0       0        0
TOTAL NO. OF ITEMS         15      31      12      20      30      14      28      17      40       31
TIME ALLOWED              120      90     120     120      90     105     120     120     150       90

PERCENTAGES
READING/WRITING:
MULTIPLE-CHOICE         20.00   35.48    0.00   15.00   40.00   35.71   25.00    0.00   30.00    67.74
TRUE-FALSE *             0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
REPHRASE/REORDER         0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    2.50     0.00
FILL-IN                  0.00   29.03    0.00    5.00   10.00    0.00   10.71    0.00    0.00     6.45
SHORT-ANSWER/ESSAY      33.33    3.23    0.00   30.00   16.67   14.29   10.71   35.29   25.00     6.45
TRANSLATION:
TRANSLATE (E->J)        33.33   19.35   83.33   35.00   20.00   35.71   14.29   35.29    0.00     6.45
TRANSLATE (J->E)        13.33   12.90   16.67   15.00   13.33   14.29    3.57   29.41    0.00    12.90
LISTENING:
TRUE-FALSE               0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
MULTIPLE-CHOICE          0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   12.50     0.00
FILL-IN                  0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   30.00     0.00
DICTATION                0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
SHORT-ANSWER *           0.00    0.00    0.00    0.00    0.00    0.00   35.71    0.00    0.00     0.00
TOTAL PERCENT          100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00   100.00

* New item types on the 1994 examinations not found on the 1993 examinations.

We found similar variety within each of the other types of items shown in Tables 3A and 3B. We are not the first to notice this phenomenon. As Duke (1986) put it, ". . . written English tests in public schools can be quite intricate. They often require detailed knowledge of the techniques of the language, both written and oral" (p. 154).

One ramification of such intense variation in item types is that students are forced to change item types often within any given test. This means that new sets of directions (always in Japanese) are common. As a result, these examinations must assess the students' testwiseness, at least to some degree, in that they measure the students' abilities to handle varied and novel item types, read directions (in Japanese), and switch gears often. For years, tests in western countries have avoided such issues by keeping the item types similar within fairly large subtests. This practice is based on the belief that tests should assess the students' abilities in the content area (in this case, English) rather than their abilities to take tests.

Test length. Another fact that emerges from Tables 3A and 3B is that these examinations varied considerably in terms of sheer length, that is, in the time allowed and the numbers of items involved in each examination. The amount of time allowed ranged from 80 minutes for KYOTOUFS to 150 minutes for TYOUFS. The numbers of items involved varied from lows of eight, twelve, and fourteen at KEIO, KYOTO, and OSAKA, respectively, to highs of 79 and 80 at KANGAI and SOPHIA, respectively. Of course, such differences in numbers of items are related at least in part to the item types involved. For instance, the KEIO examination was made up of only 8 items, but half of those items were translation items (three English to Japanese and one Japanese to English), which are probably more involved tasks than multiple-choice items like those which dominate the SOPHIA examination.


Nonetheless, test length, in terms of numbers of items, is an important consideration, which can directly affect the reliability of an examination. As a result, examinations of this importance in the West, like the TOEFL examination, are typically much longer in terms of both numbers of items and time allotted.
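The link between test length and reliability alluded to here is usually expressed with the Spearman-Brown prophecy formula. The Python sketch below is illustrative only; the chapter reports no reliability estimates, and the example numbers are hypothetical.

    # Spearman-Brown prophecy formula: predicted reliability when a test is
    # lengthened by `length_factor` with comparable items (2.0 = doubled length).
    def spearman_brown(reliability, length_factor):
        return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

    # Hypothetical example: a 20-item exam with reliability .60, lengthened to 80 items.
    print(round(spearman_brown(0.60, 80 / 20), 2))  # 0.86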

4. Are there differences in the types of items and test lengths found in private and public university examinations in 1993 and 1994?

Table 4 summarizes the 1993 information presented in Brown and Yamashita (1995) and the 1994 information presented in the previous section for the private and public universities grouped together, along with the center examination statistics and the totals for all 21 universities in each year taken together. Remember, the figures in this table are mostly averages.


On the whole, most of the same types of items were used at least sometimes in 1993 and 1994 in both the private and public universities. However, there were several notable exceptions to that general statement. Listening true-false items were only used in the private universities in 1993, and listening dictation items were only used in the private university exams in 1993 and 1994, while listening fill-in items were only used in the public universities in 1993 and 1994. In addition, two new types of items were used in 1994 that had not appeared on the 1993 examinations: true-false reading/writing items, which were used only in the 1994 private university exams, and short-answer listening items, which were used only in the 1994 public university exams. Clearly, in both 1993 and 1994, the center examination used predominantly multiple-choice items, with a few rephrase/reorder types of items providing the only variety.

Notice how the balance maintained among the various item types is very different on average when comparing the private and public universities. For example, on average, the private universities relied much more heavily on multiple-choice items in 1993 and 1994, while the public universities put a heavier emphasis on the short-answer/essay and translation types of items in both years. In addition, the private examinations had considerably more items on average in 1993 and 1994 than the public examinations. The time allotted was also different, with the average private university examination being considerably shorter in both years than the average public university examination.

TABLE 4: ITEM TYPE VARIETY SUMMARIZED BY UNIVERSITY TYPE *

                           1993 EXAMS                      1994 EXAMS
SKILL: ITEM TYPE       PRIVATE  PUBLIC CENTER   TOTAL   PRIVATE  PUBLIC CENTER   TOTAL

FREQUENCIES
READING/WRITING:
MULTIPLE-CHOICE          32.10    3.60     48   19.29     33.70    7.40     55   22.19
TRUE-FALSE *              0.00    0.00      0    0.00      0.50    0.00      0    0.24
REPHRASE/REORDER          1.40    1.50      5    1.62      0.30    0.10      4    0.38
FILL-IN                   5.70    3.10      0    4.19      6.50    1.80      0    3.95
SHORT-ANSWER/ESSAY        0.30    5.30      0    2.67      1.10    4.00      0    2.43
TRANSLATION:
TRANSLATION (E->J)        1.40    4.20      0    2.67      1.20    5.10      0    3.00
TRANSLATION (J->E)        0.70    1.70      0    1.14      0.90    2.70      0    1.71
LISTENING:
TRUE-FALSE                1.00    0.00      0    0.48      0.00    0.00      0    0.00
MULTIPLE-CHOICE           1.00    1.60      0    1.24      1.00    0.50      0    0.71
FILL-IN                   0.00    4.00      0    1.90      0.00    1.20      0    0.57
DICTATION                 0.10    0.00      0    0.05      0.10    0.00      0    0.05
SHORT-ANSWER *            0.00    0.00      0    0.00      0.00    1.00      0    0.48
TOTAL NO. OF ITEMS       43.70   25.00     53   35.24     45.30   23.80     59   35.71
TIME ALLOWED             93.50  112.50     80  103.00     95.50  112.50     80  102.86

PERCENTAGES
READING/WRITING:
MULTIPLE-CHOICE          63.55   13.98  90.57   41.23     62.31   26.89  93.22   46.92
TRUE-FALSE *              0.00    0.00   0.00    0.00      1.02    0.00   0.00    0.49
REPHRASE/REORDER          2.54    4.94   9.43    4.01      0.67    0.25   6.78    0.76
FILL-IN                  16.82   12.97   0.00   14.19     15.62    6.12   0.00   10.35
SHORT-ANSWER/ESSAY        1.77   24.17   0.00   12.36      6.80   17.50   0.00   11.57
TRANSLATION:
TRANSLATION (E->J)        6.64   18.41   0.00   11.93      6.06   28.28   0.00   16.35
TRANSLATION (J->E)        2.61    7.88   0.00    4.99      3.31   13.14   0.00    7.83
LISTENING:
TRUE-FALSE                2.86    0.00   0.00    1.36      0.00    0.00   0.00    0.00
MULTIPLE-CHOICE           2.86    5.31   0.00    3.89      4.00    1.25   0.00    2.50
FILL-IN                   0.00   12.35   0.00    5.88      0.00    3.00   0.00    1.43
DICTATION                 0.36    0.00   0.00    0.17      0.20    0.00   0.00    0.10
SHORT-ANSWER *            0.00    0.00   0.00    0.00      0.00    3.57   0.00    1.70
TOTAL PERCENT           100.00  100.00 100.00  100.00    100.00  100.00 100.00  100.00

* All statistics for PRIVATE and PUBLIC universities as well as TOTAL are averages.


Thus a student who wants a relatively quick examination with a fairly large proportion of multiple-choice items should focus on taking private university examinations, or, even more so, on the center examination.

5. What skills were measured on the 1993 and 1994 English language entrance examinations?

As we explained in the Definitions section above, different skills are necessary to answer discrete-point test items and integrative ones, as is the case with receptive test items and productive ones. We also argued that translation is probably too difficult, demanding, and specialized a skill to require of students who have only studied a language through junior and senior high school. Tables 5A and 5B present the results for these different item categories. These two tables represent the private and public universities, respectively, and the tables are organized just like those which preceded.

Consider first the comparisons shown in Tables 5A and 5B between discrete-point, integrative, and translation items. Among the private universities (looking at the percents shown in Table 5A), the discrete-point items dominate all of the tests except for KEIO (which is 50 percent translation and 50 percent integrative). However, in Table 5B, the public universities generally appear to use fewer discrete-point items, and the use of discrete-point, integrative, and translation items appears to vary more.

In the same tables, receptive, productive, and translation items are also compared. In Table 5A the private universities have about the same pattern that occurred in the discrete-point results, that is, the receptive items dominated, probably because discrete-point items tend to be receptive (though some items can be discrete-point and productive). In Table 5B, the proportions of receptive items are generally less important in the public university tests, and the proportions of receptive, productive, and translation items vary more from university to university than among the private universities.

TABLE 5A: CATEGORIES OF ITEM TYPES ON PRIVATE UNIVERSITY EXAMINATIONS, 1994

ITEM CATEGORY          AOYAMA DOSHISHA   KEIO  KANGAI  KANSAI KYOTOUFS  RIKKYO  SOPHIA   TSUDA  WASEDA

FREQUENCIES
DISCRETE-POINT             20      43       0      79      43       43      37      80      44      31
INTEGRATIVE                 1       0       4       0       3        1       1       0       1       1
TRANSLATION                 4       2       4       0       2        4       1       0       4       0
TOTAL NO. OF ITEMS         25      45       8      79      48       48      39      80      49      32

RECEPTIVE                  20      37       0      79      43       43      23      80      13      17
PRODUCTIVE                  1       6       4       0       3        1      15       0      32      15
TRANSLATION                 4       2       4       0       2        4       1       0       4       0
TOTAL NO. OF ITEMS         25      45       8      79      48       48      39      80      49      32

PASSAGE DEPENDENT          22      30       7      14      33       20      21      80      15      16
PASSAGE INDEPENDENT         3      15       1      65      15       28      18       0      34      16
TOTAL NO. OF ITEMS         25      45       8      79      48       48      39      80      49      32

PERCENTAGES
DISCRETE-POINT          80.00   95.56    0.00  100.00   89.58    89.59   94.88  100.00   89.80   96.87
INTEGRATIVE              4.00    0.00   50.00    0.00    6.25     2.08    2.56    0.00    2.04    3.13
TRANSLATION             16.00    4.44   50.00    0.00    4.17     8.33    2.56    0.00    8.16    0.00
TOTAL % OF ITEMS       100.00  100.00  100.00  100.00  100.00   100.00  100.00  100.00  100.00  100.00

RECEPTIVE               80.00   82.23    0.00  100.00   89.58    89.59   58.98  100.00   26.53   53.12
PRODUCTIVE               4.00   13.33   50.00    0.00    6.25     2.08   38.46    0.00   65.31   46.88
TRANSLATION             16.00    4.44   50.00    0.00    4.17     8.33    2.56    0.00    8.16    0.00
TOTAL % OF ITEMS       100.00  100.00  100.00  100.00  100.00   100.00  100.00  100.00  100.00  100.00

PASSAGE DEPENDENT       88.00   66.67   87.50   17.72   68.75    41.67   53.85  100.00   30.61   50.00
PASSAGE INDEPENDENT     12.00   33.33   12.50   82.28   31.25    58.33   46.15    0.00   69.39   50.00
TOTAL % OF ITEMS       100.00  100.00  100.00  100.00  100.00   100.00  100.00  100.00  100.00  100.00


Most of the universities in Tables 5A and 5B rely heavily on items that are based on reading or listening passages, KANGAI and TSUDA being the only universities that based less than 40 percent of their items on passages of some sort.

Table 6 compares the 1993 and 1994 averages for private and public universities with the center examination and the total for all 21 universities. This table reinforces the observations made above that the private university examinations are much more prone to using discrete-point receptive items than the public ones, and that observation is true for both 1993 and 1994. However, while the public university examinations are laudably using more integrative items, they are also relying heavily on translation items.

The patterns in Table 6 for the receptive, productive, and translation items are similar. In neither 1993 nor 1994 are students being required to use any extensive amounts of productive language: not much written language, and absolutely no spoken language.

In both years, the public universities appear to use a higher percent of passage-based items than the private universities, and both types of universities appear to be using more passage-based items in 1994 than in 1993. Note also that the center examination used a much smaller proportion of passage-based items in both years than did either the public or private universities.

The listening skill, heavily promoted in the Monbusho guidelines that were implemented in spring of 1993 (Monbusho, 1989), appears in only a few of the examinations. According to Brown and Yamashita (1995), in 1993, six out of the 21 universities included listening comprehension items: AOYAMA, TSUDA, HITOTSU, KYOTO, TOKYO, and TYOUFS. Careful inspection of Tables 3A and 3B reveals that, in 1994, only four of the examinations (out of 21) included listening comprehension items of some sort: AOYAMA, TSUDA, TOKYO, and TYOUFS.

Discussion

One question remains: What do the results of this study mean to teachers and students of EFL in Japan?

Readability. Readability analysis of the passages used on the 1994 entrance examinations (see Tables 1A, 1B, and 2) indicated that all of the private and public universities used at least one passage and that some used as many as seven passages.

TABLE 5B: CATEGORIES OF ITEM TYPES ON PUBLIC UNIVERSITY EXAMINATIONS, 1994

ITEM CATEGORY         HITOTSU HOKKAIDO   KYOTO  KYUSHU  NAGOYA   OSAKA   TOKYO TORITSU  TYOUFS YOKOHAMA

FREQUENCIES
DISCRETE-POINT              3      20       0       4      15       5      10       0      30       23
INTEGRATIVE                 5       1       0       6       5       2      13       6      10        2
TRANSLATION                 7      10      12      10      10       7       5      11       0        6
TOTAL NO. OF ITEMS         15      31      12      20      30      14      28      17      40       31

RECEPTIVE                   3      11       0       3      12       5       7       0      18       21
PRODUCTIVE                  5      10       0       7       8       2      16       6      22        4
TRANSLATION                 7      10      12      10      10       7       5      11       0        6
TOTAL NO. OF ITEMS         15      31      12      20      30      14      28      17      40       31

PASSAGE DEPENDENT          13      14      10      14      16      13      21      12      26       13
PASSAGE INDEPENDENT         2      17       2       6      14       1       7       5      14       18
TOTAL NO. OF ITEMS         15      31      12      20      30      14      28      17      40       31

PERCENTAGES
DISCRETE-POINT          20.00   64.52    0.00   20.00   50.00   35.71   35.71    0.00   75.00    74.19
INTEGRATIVE             33.33    3.23    0.00   30.00   16.67   14.29   46.43   35.29   25.00     6.46
TRANSLATION             46.67   32.25  100.00   50.00   33.33   50.00   17.86   64.71    0.00    19.35
TOTAL % OF ITEMS       100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00   100.00

RECEPTIVE               20.00   35.48    0.00   15.00   40.00   35.71   25.00    0.00   45.00    67.75
PRODUCTIVE              33.33   32.26    0.00   35.00   26.67   14.29   57.14   35.29   55.00    12.90
TRANSLATION             46.67   32.26  100.00   50.00   33.33   50.00   17.86   64.71    0.00    19.35
TOTAL % OF ITEMS       100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00   100.00

PASSAGE DEPENDENT       86.67   45.16   83.33   70.00   53.33   92.86   75.00   70.59   65.00    41.94
PASSAGE INDEPENDENT     13.33   54.84   16.67   30.00   46.67    7.14   25.00   29.41   35.00    58.06
TOTAL % OF ITEMS       100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00   100.00


TABLE 6: CATEGORIES OF ITEM TYPES SUMMARIZED BY UNIVERSITY TYPE *

                           1993 EXAMS                      1994 EXAMS
ITEM CATEGORY          PRIVATE  PUBLIC CENTER   TOTAL   PRIVATE  PUBLIC CENTER   TOTAL

FREQUENCIES
DISCRETE-POINT           39.80   12.30     48   27.10     42.00   11.00     59   28.05
INTEGRATIVE               1.80    6.80      5    4.33      1.20    5.00      0    2.95
TRANSLATION               2.10    5.90      0    3.81      2.10    7.80      0    4.71
TOTAL NO. OF ITEMS       43.70   25.00     53   35.24     45.30   23.80     59   35.71

RECEPTIVE                41.20    9.80     53   26.81     35.50    8.00     59   23.52
PRODUCTIVE                0.40    9.30      0    4.62      7.70    8.00      0    7.48
TRANSLATION               2.10    5.90      0    3.81      2.10    7.80      0    4.71
TOTAL NO. OF ITEMS       43.70   25.00     53   35.24     45.30   23.80     59   35.71

PASSAGE DEPENDENT        15.30   12.30     15   13.86     25.80   15.20     14   20.19
PASSAGE INDEPENDENT      28.40   12.70     38   21.38     19.50    8.60     45   15.52
TOTAL NO. OF ITEMS       43.70   25.00     53   35.24     45.30   23.80     59   35.71

PERCENTAGES
DISCRETE-POINT           86.08   44.61  90.57   66.54     83.63   37.51 100.00   62.45
INTEGRATIVE               4.68   29.10   9.43   16.54      7.00   21.07   0.00   13.37
TRANSLATION               9.24   26.29   0.00   16.92      9.37   41.42   0.00   24.18
TOTAL % OF ITEMS        100.00  100.00 100.00  100.00    100.00  100.00 100.00  100.00

RECEPTIVE                88.63   37.19 100.00   64.68     68.00   28.39 100.00   50.66
PRODUCTIVE                2.13   36.52   0.00   18.40     22.63   30.19   0.00   25.16
TRANSLATION               9.24   26.29   0.00   16.92      9.37   41.42   0.00   24.18
TOTAL % OF ITEMS        100.00  100.00 100.00  100.00    100.00  100.00 100.00  100.00

PASSAGE DEPENDENT        41.71   52.66  28.30   46.29     60.48   68.39  23.73   62.49
PASSAGE INDEPENDENT      58.29   47.34  71.70   53.71     39.52   31.61  76.27   37.51
TOTAL % OF ITEMS        100.00  100.00 100.00  100.00    100.00  100.00 100.00  100.00

* All statistics for PRIVATE and PUBLIC universities as well as TOTAL are averages.

one passage and that some used as many as seven passages. The passages ranged considerably in length and readability. As with the 1993 examinations, the Flesch-Kincaid and Fog readability indexes indicated that the passages would be appropriate for native speakers ranging from eighth grade (about 13 years old) to thirteenth or even late fifteenth grade level (university age) at one university. The passages at the upper end of these scales, especially, should be considered very difficult reading material for EFL students just finishing high school. In 1993, four universities had university level reading passages according to the Fog index (KEIO, KYOTO, TOKYO, and YOKOHAMA), while, in 1994, six universities had passages at this level: three private (DOSHISHA, KEIO, and TSUDA) and three public (NAGOYA, OSAKA, and YOKOHAMA). Table 2 indicates that, in 1993, the public universities had more difficult passages than the private universities, but, in 1994, the reverse was true. However, on the whole, the average level of difficulty across all of the universities did not change much between 1993 and 1994, as indicated by the TOTAL averages.
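
Readers who wish to run a similar analysis on other passages can approximate the two indexes from their standard published formulas. The following Python sketch is only illustrative: the crude syllable counter is a rough stand-in, so its output will not exactly match the figures produced by the RightWriter software used for the tables in this chapter.

# Minimal sketch of the Flesch-Kincaid grade level and Gunning Fog index,
# using their standard published formulas.  The syllable counter is a rough
# vowel-group heuristic, so results only approximate a commercial checker.
def count_syllables(word):
    word = word.lower().strip(".,;:!?\"'()")
    vowels = "aeiouy"
    groups, prev_was_vowel = 0, False
    for ch in word:
        is_vowel = ch in vowels
        if is_vowel and not prev_was_vowel:
            groups += 1
        prev_was_vowel = is_vowel
    return max(groups, 1)

def readability(text):
    sentences = max(text.count(".") + text.count("!") + text.count("?"), 1)
    words = text.split()
    syllables = sum(count_syllables(w) for w in words)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    fk = 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59
    fog = 0.4 * ((len(words) / sentences) + 100.0 * (complex_words / len(words)))
    return fk, fog

print("Flesch-Kincaid grade: %.1f, Fog index: %.1f"
      % readability("The passages ranged considerably in length and readability."))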

For those teaching EFL in Japan, the 1994 readability results (like the 1993 results) indicate that, by the end of their studies, university-bound high school students would benefit from learning to read relatively difficult university level passages with good comprehension.

Item types. The 1993 analysis of the item types (see Tables 3A, 3B, and 4) revealed that the reading/writing items included multiple-choice, rephrasing or reordering, fill-in-the-blank, and short answer or essays, while the translations were both English to Japanese and Japanese to English, and the listening items were true-false, multiple-choice, fill-in, and dictations. The same item types were used in 1994 with the addition of true-false items in the reading/writing category and short-answer items in the listening category. Thus university-bound EFL students should be equipped to deal successfully with these types of items, and with the considerable variation in formats that they represent. In short, we should probably prepare students for considerable variety in item types and frequent changes in the instructions/directions that go with those item types.

In 1993, DOSHISHA, KANGAI, KANSAI, KYOTOUFS, SOPHIA, TSUDA, and WASEDA used 50 percent or more multiple-choice items, while, in 1994, DOSHISHA, KANGAI, KANSAI, KYOTOUFS, RIKKYO, SOPHIA, WASEDA, and YOKOHAMA did so. In 1993, only KEIO and TORITSU used more than 50 percent translation items, while, in 1994, KEIO, KYOTO, KYUSHU, OSAKA, and TORITSU did so. Students should probably be advised of the types of items that have predominated in the last two years at whatever universities they want to enter.

It is worth noting the considerable differences in the test lengths, both in time and in numbers of items. This fact can be important to individual test takers, but can also have important policy level implications because of the direct relationship between test length and test reliability (i.e., longer tests tend to be more reliable than shorter tests if all other factors are held constant; see Brown, 1995).
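
The standard way of quantifying that relationship is the Spearman-Brown prophecy formula. The sketch below is a generic illustration of the principle rather than a reanalysis of the examinations in question; the starting reliability of .60 and the item counts are purely hypothetical.

# Spearman-Brown prophecy formula: predicted reliability when a test is
# lengthened by `length_factor` parallel items (2.0 = doubled in length).
# The base reliability and item counts below are illustrative only.
def spearman_brown(reliability, length_factor):
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)

r_short = 0.60  # hypothetical reliability of a short, 20-item examination
for k in (1.0, 2.0, 3.0):
    print(f"{int(k * 20):3d} items -> predicted reliability {spearman_brown(r_short, k):.2f}")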

The skills analysis (Tables 5A, 5B, and 6) indicated that generally the proportions of discrete-point and receptive items were very high for the private university examinations, but less important in the public university tests, and that the proportions of receptive, productive, and translation items varied more from university to university among the public universities than among the private ones. It was also noted (from Tables 3A and 3B) that six of the 1993 examinations (AOYAMA, TSUDA, HITOTSU, KYOTO, TOKYO, and TYOUFS) out of these 21 included listening comprehension items, while in 1994 the number had dropped to four (AOYAMA, TSUDA, TOKYO, and TYOUFS).

Conclusions

Generally, the items on the tests that we analyzed in both 1993 and 1994 were reasonably well written, with very few malapropisms, typographical errors, unintentional grammatical errors, etc. However, that does not mean that the tests are perfect. In fact, it is safe to say that no language test is ever perfect. The remainder of this chapter will discuss some of the problems that we see with the examinations that we analyzed.

Many of the items on these examinations were based on reading passages, and as we pointed out earlier, a number of the passages were difficult. Hence, the ability of a given student to answer these questions will depend to some degree on their ability to deal with relatively high level language that is perhaps above the level of the simplified texts that are often used for pedagogical purposes in Japan.

In addition, the ability to answer many of the items may depend on the students' knowledge of the particular topics involved in the passages. In other words, chance knowledge of a particular topic and its vocabulary could be helping some students to be accepted into a particular university, while other students, who were not lucky enough to have that chance knowledge, are excluded.

Before analyzing our 1993 results, we were skeptical of the value of the Japanese obsession with sending their children to a juku (i.e., a cram school) or yobiko (i.e., a test-coaching school) to prepare for major entrance examinations. However, based on both the 1993 and 1994 results, we now believe that such preparation may be advisable. For one thing, there is considerable variation in the types of items used on these tests, especially in the public university examinations. Such variation means that students are often reading new directions/instructions and shifting gears in terms of the kinds of items that they are answering. The ability to understand directions and shift gears on an examination is part of what is known as testwiseness, and testwiseness is one of the issues dealt with in a typical juku or yobiko. Whether we like it or not, given the very competitive nature of these examinations, testwiseness, or the ability to take tests in general, may be as important or even more important than the students' actual proficiency in English.

Teachers should also recognize the relationship between the item types used on university entrance examinations and the pedagogical choices that they make in their classrooms. In 1993 and 1994, the private universities predominantly used discrete-point receptive items. This means that in effect they were endorsing a discrete-point receptive view of language teaching. Many of the public universities were predominantly using translation items, which means that they were tacitly endorsing the use of translation as a communication strategy.

Students who are preparing for examinations of one type or the other may quite reasonably want to focus on discrete grammar points or translation tasks, and have very little interest in the communicative language learning or task-based learning that a particular teacher may be using, if they do not see any direct relationship between what the teacher is doing and the examination that they must eventually take. This effect is called the washback effect, i.e., the effect of a test on the pedagogy associated with it (for more on the washback effect, see Gates in Chapter 11 of this book). The present nature of the university entrance examinations appears to lead to a negative washback effect on efforts to employ modern language teaching methods.

The predominant North American examination, the TOEFL, is also discrete-point in nature, and it has been argued for years that it has a negative washback effect on modern language teaching methods. In response, Educational Testing Service has initiated the TOEFL 2000 project, which is aimed at changing the TOEFL into a more communicative and task-oriented examination. Perhaps Japanese universities should begin to change their examinations in similar ways so that their washback effect can become a positive and progressive force for change in language teaching in Japan.

A contradiction has also developed between what is included on these university entrance examinations and the Monbusho (1989) guidelines implemented in April 1993 for junior and senior high school English teaching. The guidelines advocate the addition of listening and/or speaking to the curriculum, but our analysis indicates that only six universities in 1993 and four in 1994 included even a listening component. For students taking the examinations without a listening component, there is a distinct negative washback effect on their desire to improve their listening or speaking abilities. Perhaps the structure and nature of the entrance examination items should change over the next several years to reflect the Monbusho's new emphasis on oral/aural skills.

The universities in this study openly allow for publication of their examinations, which is beneficial because it allows for public scrutiny of the items that were used. However, these universities should also be held responsible for defending the reliability, validity, and practicality of their tests. We have reason to believe that the entrance examinations in both the 1993 and 1994 studies are weak in all three areas. For instance, many of the tests are relatively short, which may pose a threat to their reliability. In addition, the types of items involved are not consonant with current language teaching theory and practice, which is a serious threat to validity. Finally, the tests do not appear to be 100 percent practical because the item types change frequently and many items are passage and topic dependent. Moreover, none of the universities that we are aware of do any of the statistical analyses of reliability and validity that are standard practice on major examinations (even at individual universities) in the United States.

How can these problems be avoided? As we did in the previous study of the 1993 examinations, we strongly recommend that Japanese universities follow the Standards for Educational and Psychological Testing (CDSEPT, 1985) or adapt those standards to create a set that will be acceptable in Japan. In the United States, tests only became consistently reliable and valid when students filed lawsuits against the various institutions that developed tests. The Standards for Educational and Psychological Testing then appeared and began to be applied to tests.

In addition, the Mental Measurements Yearbook (e.g., Kramer & Conoley, 1992) provides periodic critical reviews of all tests published in the United States. Both the Standards and the MMY help to keep test developers honest. Similar institutions in Japan might help improve the reliability and validity of the entrance examinations used in Japanese universities.

Perhaps the single most important fact about these entrance examinations is that the results are used to make very important decisions about students' lives. As such, they must be of the highest quality if they are to be fair to the students. In fact, the entrance examinations in Japan are far too important to be left entirely up to groups of individual test designers. Even university professors must be made accountable for the important admissions decisions that they are making, because those decisions so profoundly affect young Japanese lives.

Notes:

1. We would like to thank Ryutaro Yamashita for his help in entering the reading passages into computer files.

References

Amano, I. (1990). Education and examination in modern Japan. Tokyo: Tokyo University Press. (Translated into English by K. & F. Cummings)

Borland. (1991). QuattroPro (version 3.0). Scotts Valley, CA: Borland International.

Brown, J. D. (1993). Language testing hysteria in Japan. The Language Teacher, 17(12), 41-43.

Brown, J. D. (1995). Testing in language programs. New York: Prentice-Hall.

Brown, J. D., & Yamashita, S. O. (1995). English language entrance examinations at Japanese universities: What do we know about them? JALT Journal, 17(1), 7-30.

CDSEPT (Committee to Develop Standards for Educational and Psychological Testing). (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

Cummings, W. K. (1980). Education and equality in Japan. Princeton, NJ: Princeton University.

Dore, R. P. (1990). Foreword to I. Amano, Education and examination in modern Japan (pp. vii-x). Tokyo: Tokyo University Press.

Duke, B. (1986). The Japanese school: Lessons for industrial America. NY: Praeger.

Flesch, R. (1974). The art of readable writing. New York: Harper & Row.

Frost, P. (1991). Examination hell. In E. R. Beauchamp (Ed.), Windows on Japanese education (pp. 291-305). New York: Greenwood Press.

Fujita, H. (1991). Education policy dilemmas as historic constructions. In B. Finkelstein, A. E. Imamura, & J. J. Tobin (Eds.), Transcending stereotypes: Discovering Japanese culture and education (pp. 147-161). Yarmouth, Maine: Intercultural Press.

Horio, T. (1988). Educational thought and ideology in modern Japan: State authority and intellectual freedom. Tokyo: Tokyo University Press.

Kiefer, C. W. (1970). The psychological interdependence of family, school, and bureaucracy in Japan. American Anthropologist, 72, 66-75.

Kitamura, K. (1991). Japan's dual educational structure. In B. Finkelstein, A. E. Imamura, & J. J. Tobin (Eds.), Transcending stereotypes: Discovering Japanese culture and education (pp. 162-166). Yarmouth, Maine: Intercultural Press.

Koko-Eigo Kenkyu. (1993a). '93 kokukoritsu daigaku-hen: Eigo mondai no tetteiteki kenkyu (public universities). Tokyo: Kenkyu-sha.

Koko-Eigo Kenkyu. (1993b). '93 shiritsu daigaku-hen: Eigo mondai no tetteiteki kenkyu (private universities). Tokyo: Kenkyu-sha.

Kramer, J. J., & Conoley, J. C. (1992). The eleventh mental measurements yearbook. Lincoln, NE: University of Nebraska.

Monbusho. (1989). Chugakko gakushu shido yoryo (Course of study for junior high school). Tokyo: Monbusho.

National Council on Education Reform (NCER). (1985). Report. In Gyosei (Ed.), Rinkyoshin to kyoiku kaikaku: Jiyuka kara koseishugi e (From liberalization to putting an emphasis on individuality). Tokyo: Gyosei.

Organization of Economic Cooperation and Development (OECD). (1971). A review of education: Japan. Paris: OECD.

Que Software. (1990). RightWriter: Intelligent grammar checker (version 4.0). Sarasota, FL: Que Software.

Rohlen, T. P. (1983). Japan's high schools. Berkeley, CA: University of California Press.

Singleton, J. (1991). The spirit of gambaru. In B. Finkelstein, A. E. Imamura, & J. J. Tobin (Eds.), Transcending stereotypes: Discovering Japanese culture and education (pp. 119-125). Yarmouth, Maine: Intercultural Press.

Shimahara, N. (1991). Examination rituals and group life. In B. Finkelstein, A. E. Imamura, & J. J. Tobin (Eds.), Transcending stereotypes: Discovering Japanese culture and education (pp. 126-134). Yarmouth, Maine: Intercultural Press.

Tsukada, M. (1991). Student perspectives on juku, yobiko, and the examination system. In B. Finkelstein, A. E. Imamura, & J. J. Tobin (Eds.), Transcending stereotypes: Discovering Japanese culture and education (pp. 178-182). Yarmouth, Maine: Intercultural Press.

Tsuneyoshi, R. K. (1991). Reconciling equality and merit. In B. Finkelstein, A. E. Imamura, & J. J. Tobin (Eds.), Transcending stereotypes: Discovering Japanese culture and education (pp. 167-177). Yarmouth, Maine: Intercultural Press.

Vogel, E. Japan's new middle class. Berkeley, CA: University of California Press.

White, M. (1987). The Japanese educational challenge: A commitment to children. NY: The Free Press.

WordPerfect. (1988). WordPerfect (version 5.1). Orem, UT: WordPerfect Corp.


Chapter 11

Exploiting Washback from Standardized Tests

SHAUN GATES

SHIGA WOMEN'S JUNIOR COLLEGE

The influence of testing on teaching and learning is known as washback. We can look at washback from two angles.

First, washback may be strong or weak. If washback is strong, students and teachers will tend to alter their classroom behaviors in order to achieve good marks in the test. In contrast, weak washback will have little or no effect in the classroom. Another way to look at washback is to ask whether it is positive or negative. At this stage, we could simply define washback as being positive if test and course objectives coincide. Negative washback occurs when these two sets of objectives differ. The different possible types of washback (from a test or part of a test) can be located on the following grid:

          Positive   Negative
Strong
Weak
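
Expressed schematically, the grid amounts to a two-way classification, a minimal sketch of which follows. The objective sets and the behavior-change flag are hypothetical placeholders, since the grid is defined here conceptually rather than computationally.

# Minimal sketch of the two-way washback grid: polarity depends on whether
# test and course objectives coincide, strength on whether classroom
# behavior actually changes.  The inputs are hypothetical placeholders.
def classify_washback(test_objectives, course_objectives, alters_classroom_behavior):
    polarity = "positive" if set(test_objectives) == set(course_objectives) else "negative"
    strength = "strong" if alters_classroom_behavior else "weak"
    return strength, polarity

# e.g., a class test whose objectives match the course but which carries few
# rewards or sanctions (the bottom left box of the grid)
print(classify_washback({"communicative"}, {"communicative"}, alters_classroom_behavior=False))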

Teachers might reasonably want to determine the type of washback that flows from a given test. I suspect the English tests I give my Japanese college students fall into the bottom left box. Washback is positive because both the course and test objectives stem from the communicative approach, but one reason it is weak is that I am limited in the rewards and sanctions I can attach to test outcomes.

In the situation outlined above, testing is done for the internal consumption of the school or college. This is not always the case. Other teachers may be involved in preparing students for a standardized test, which by its nature provides results with a much wider currency. This chapter has two purposes: it attempts to explain why washback from standardized tests is so strong, and it shows how teachers can exploit this washback. All of this is done with reference to the Preliminary English Test (PET, 1995) run by the University of Cambridge Local Exam Syndicate (UCLES).

Research into washback suggests that the phenomenon is more subtle than was at first thought (Alderson & Wall, 1993). For Weir, the strong washback that derives from communicative teaching and testing suggests that washback may be linked to construct validity (1990, p. 27). But while the exact nature and influence of washback still has to be established, its force is well recognized. Consider what happens when teaching and testing objectives diverge markedly, a situation that can arise if teaching and testing fail to develop together.

Davies (1990, pp. 96-100) provides some interesting examples of this problem. What he calls the problem of excessive conservatism occurs when progress in teaching is not matched by an equivalent advance in testing. The adoption of a communicative curriculum in a school may have little effect on students' communicative competence if their end-of-course test requires them to write a literature essay.

The opposite problem, unthinking radicalism, is the result of trying to impose desirable change in the classroom through the examination system. Imagine a situation where high school students have traditionally been required to translate a previously unseen prose passage in order to get into university. Then a new exam is introduced which calls for students to take an oral interview and a listening test. Unless teachers are retrained, new materials written, and students given time to adjust, the whole exercise is likely to end in frustration and even failure.

You will probably recognize the force of washback from your own experience of tests, whether they are academic, sporting, or whatever. As I write this chapter, I am also preparing (in vain) for a Japanese language test. My preferred style of learning is to dip into a range of books and then try out my new knowledge on friends and students. Now, with past test papers as a guide, I spend my time trying to memorize a large number of verb endings and kanji compounds.

Factors Affecting Washback

The following list gives some of the factors which may influence the strength of washback: prestige, accuracy, transparency, utility, monopoly, anxiety, familiarity, and practicality. Each of these factors deserves to be considered in more detail.

Prestige. A test will have strong washback if, like the Test of English as a Foreign Language (TOEFL), it is associated with a reputable, well-known organization (Educational Testing Service in this case). However, it is worth remembering that prestige does not necessarily mean widespread recognition. A test in interpretation, for example, may have stronger washback for potential interpreters than a better-known but more general language proficiency test.

Accuracy. Since language tests are used by employers and colleges to select suitable candidates, the score users are likely to base their decisions on those tests which give high levels of reliability and accuracy. This will in turn be picked up by students and teachers, who will concentrate on the relevant tests.

Transparency. The more closely a language test meets the final language needs of the student, the stronger the washback will be. Direct tests which closely resemble real-life language use should increase student motivation and hence the force of washback. Indirect tests are less transparent and may have weaker washback.

Utility. The more opportunities a passing score in a test offers, the stronger its washback. A high TOEFL score not only gives Japanese students the chance to study at American universities, it may also help them find a job in a Japanese company that has foreign dealings. And there are other reasons why Japanese students take this test in large numbers, as Brown has argued (1993). Similarly, a pass in the Cambridge Certificate of Proficiency (CPE) not only satisfies the English language requirements of British universities, but in some European countries, it also serves as a valid qualification to teach English in government schools.

Monopoly. The fewer competitors an exam has, the stronger its washback. One of the notable things about British EFL exams is the sheer number of them. In addition to the CPE, there are at least three other ways foreign students can satisfy a British university that their English abilities are adequate. They can show they have a pass in British high school exams, in the IELTS exam, or (at some universities) in the TOEFL. Since students can choose the test which suits their situation and inclinations, the washback of each test is diluted.

Anxiety. Naturally, a test which puts excessive stress on the student will have strong washback. Whatever the advantages of direct tests, we should accept that some tasks, such as participating in a role play or writing an expository essay, may deter students from taking these tests. Indirect tests can cause stress too (particularly if students don't know the answers!), but in this case, the stress can to some extent be controlled by the student; they can take a short break, guess the answer, or move on to an easier section. The more important the exam is in the student's eyes, the greater the level of anxiety is likely to be. Anxiety is also linked to the next factor.

Practicality. Tests which are convenient to sit, are held frequently, and are economical and short (without compromising their validity and reliability) will have stronger washback than tests that don't have these advantages. Other practicality issues would include the availability of tutors, published study guides, and practice tests.

Assuming that the strength of these factors varies between tests, it should be clear that standardized tests like the well-known English proficiency exams exhibit strong washback. The elements of a test that would appeal most to test takers (prestige, utility, and accuracy) are the washback factors that predominate in standardized tests. Furthermore, the standardized English tests that originate in the United States and Britain have a virtual monopoly in determining whether a foreign student's English is adequate for study on degree and vocational courses.

Even practicality, one of the strengths of school tests, has been undermined by the advance of international English exams. For Japanese students, at least, these exams are affordable and are held frequently throughout Japan. Role play and interviews may be familiar as classroom activities, but many Japanese students will be unfamiliar with their use as test instruments. As a result, the use of role play or interview procedures in communicative tests will probably lead to weak washback, although this may be offset somewhat by the transparency of these techniques.

In contrast, since most English tests in Japanese schools rely heavily on multiple-choice questions, the washback for TOEFL takers should be strong. Common sense suggests that the anxiety produced from a standardized test and a class test will be negative, but it will probably be much stronger in the former.

Exploiting Washback

Given that standardized tests have a strong influence on learning, teachers might ask how they can exploit this force to improve student performance. To answer that question, I must return to the definition of positive washback. It was stated above that positive washback happens when course and test objectives overlap. But what should those objectives be? There are two good reasons why I feel they should be communicative ones. First, the nature of language proficiency seems best captured by models that, in addition to a knowledge of grammar, also incorporate the instrumental role of language and the effect of context. Bachman's framework of communicative language ability represents one of the most comprehensive explanations along these lines and is based on a decade of model building and empirical research (1990, pp. 81-110). From a different perspective, Weir put forward a three-part framework to assess each of the four language skills (1993, pp. 28-29). While his framework focuses on performance rather than competence, it also argues for seeing language use as goal oriented and constrained by a range of conditions.

Second, communicative objectives fit in with most students' reasons for learning languages. Ask your students why they are learning English, and you will receive a variety of answers including for travel, to study abroad, and to talk to foreigners. All of these reasons for studying English, however, spring from a common desire to use English rather than dissect it.

There is a third and admittedly weak reason for choosing communicative objectives: default. In their choice of class textbooks and activities, many teachers have already instilled into their students the importance of communicative objectives.


Teachers who are responsible for their own courses and tests can achieve positive washback fairly easily. If they choose course and test objectives on the basis that they share the same orientation, there should be no tension between exam work and learning needs. Students will realize that class work gives both practice for the test and preparation for using English in authentic situations. Hughes gives useful advice on how to achieve beneficial backwash in class achievement tests (1989, pp. 44-47).

Unfortunately for the teacher who has students preparing for a standardized test, matters are not so clear cut. Individual teachers cannot influence the selection of content or the test techniques used in these standardized tests. The solution would seem to involve giving the students practice in all sections of the test irrespective of whether they lead to positive washback or not. The problem with this response is that it takes no account of the limited time available for learning English at most schools and colleges. A more efficient way to exploit these tests is to select only those items for classroom practice which meet the students' communicative goals. The remaining items can then be set for homework. An example of how this works is given for the UCLES Preliminary English Test (PET, 1995).

Table 1. PET Item Specifications

Q.   SKILL (OBJECTIVE)                     TEXT TYPE               TEST FORMAT

Reading Skills
1.   Understanding lexis                   Public notice or sign   Multiple-choice
2.   Understanding grammar and discourse   Short article, letter   Gap fill
3.   Reading for detail                    Adverts, brochure       Matching
4.   Scanning                              Brochures, notices      Matching
5.   Reading for gist                      Review, advert          Matching, or short answer

Writing Skills
6.   Grammatical accuracy                  Five sentences          Sentence rewriting
7.   Expressing                            Form, note, message     Directed writing
8.   Narrating                             Letter, note            Free writing

Listening Skills
9.   Listening for specific information    Short utterance(s)      Multiple-choice
10.  Listening for information             Factual report          Multiple-choice
11.  Listening for main ideas              Factual report          Multiple-choice
12.  Listening for gist                    Conversation            True-false, or Yes/No

Speaking Skills (An oral interview is used to test speaking subskills)
I.   Social conversation: personal information.
II.  Role play: involving requests, directions, etc.
III. Picture description: identification, comparison.
IV.  Discussion: likes, dislikes, personal experiences.


An Example of Exploitation

The PET is aimed at students whose English level is pre-intermediate. The test is divided into the four traditional skills, but emphasis throughout the test is on being able to use English in real life situations. Table 1 shows each question number, the skill tested, the type of text used, and the test format involved.

As can be seen in the table, the PET contains a variety of authentic texts and is closely tied to language needs, making it a suitable test for students following a communicative course at this level. Washback should thus be positive. It should also be strong since the test, the product of a well-known exam board, would be rated highly for prestige, accuracy, and utility. However, exploiting a test in the classroom is not always straightforward even if, like the PET, it has strong positive washback.

Students preparing for this test at the British Council in Kyoto can take a special course in which they attend classes over 12 weeks. Because class time is limited to two hours per week, there is pressure to cover a number of the shorter questions, particularly from the speaking and listening sections, rather than one or two questions from the reading and writing sections. (When I use PET practice tests with my university students, there is even less time for test preparation.)

To compound the problem, the speaking and listening sections lend themselves to class work, naturally fitting in with student demands for more conversation practice. Positive washback from these skills, which can be easily translated into classroom activities, may drown out the much weaker washback coming from the reading and writing sections. Writing in particular is a worry. Although writing, like the other sections, is given equal weighting in the marking scheme, this is probably offset by the great difficulty composition seems to cause Japanese students. Anecdotal evidence from the International English Language Testing System (IELTS) test suggests that the writing section is often seen as the biggest obstacle in a test.


One way to remedy the problem is to set the bulk of the reading questions for homework and get the students to do one major writing exercise in class. Fortunately, question 7 in Table 1 (on directed writing) can often be integrated with student requests for oral practice. For example, instead of filling in a form by himself, a student can interview his partner and complete the form with the partner's details. Question 8, the free writing question, which often requires the student to write a letter to an imaginary friend, will be more relevant if the student writes to a classmate, who then provides spoken or written feedback.

When dealing with the interview, the aim must be to give students practice but not let this section siphon off a disproportionate amount of class time. Mock interviews are quite common in many classrooms, and students will probably be familiar with the exchange of social formulas (part I in Table 1). However, the section which asks students to comment on a picture (III) may be novel, and students will almost certainly need practice to build up their skills and confidence for the discussion (IV).

Failure to understand the spoken language can be very demotivating for students, a feeling which will be reinforced in an exam where 25 percent of the marks are allocated to the listening section. Obviously, class time must be spent on listening practice, but at the same time, we need to remember that the listening skill demands intensive concentration from the student. In the exam, the listening section takes half an hour, but it would probably be counterproductive to cover all the listening questions in a two-hour lesson. The aim should be to cover one or two questions in detail, and where possible, integrate the listening with other activities in the lesson. If practice texts are available, the teacher and students can read through the tapescripts picking out the students' weak points for remedial study.

Conclusion

Though it may not be possible to quantify the force of individual washback factors in a standardized test, their combination will probably produce strong overall washback. The teacher's role should be to ensure that the washback is also positive. Certainly, students need to practice questions before a test, but in the classroom, this practice should be designed to meet their broader language needs. As the discussion of the PET indicated, tests which have the potential to create positive washback come under pressure from limited class time and the students' preferences for some question types over others. Given the bother this creates, the temptation is for the teacher not to tell the students too much about the test. This strategy is unlikely to be successful. Most test takers soon find out the details of their test, and they will resent doing class work which they see as irrelevant to improving test performance. It is better to accept the existence of washback and harness it, much the same way an expert in aikido exploits the force of an opponent. With luck, the outcome for test takers and their teachers will be less painful.

References

Alderson, J. C., & Wall, D. (1993). Examining washback: The Sri Lankan impact study. Language Testing, 10(1), 41-69.

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Brown, J. D. (1993). Language testing hysteria in Japan? The Language Teacher, 17(12), 41-43.

Davies, A. (1990). Principles of language testing. Oxford: Blackwell.

Hughes, A. (1989). Testing for language teachers. Cambridge: Cambridge University Press.

PET. (1995). A guide to the Preliminary English Test. Cambridge: UCLES.

Weir, C. J. (1990). Communicative language testing. Hemel Hempstead, UK: Prentice-Hall.

Weir, C. J. (1993). Understanding and developing language tests. Hemel Hempstead, UK: Prentice-Hall.


Section IV

Oral Proficiency Testing


Chapter 12

Testing Oral Ability: ILR and ACTFL

Oral Proficiency Interviews

HIROTO NAGATA

MEIKAI UNIVERSITY

OVERSEAS TRAINING CENTER AT PANASONIC, TOKYO

Assessing students' oral proficiency is a perplexing problem for many language teachers. We all know that pencil-and-paper tests are not valid measures of oral production, i.e., they cannot adequately appraise learners' abilities in the functional use of a foreign language, and we also know that oral interviews are best used for that purpose. However, we do not usually have enough time to participate in a training program to become a full-fledged, qualified interviewer. Nonetheless, with the Ministry of Education placing new emphasis on communicative goals in language teaching, the need for classroom teachers to be equipped with some measurement tools to evaluate students' oral proficiency is becoming more and more important.

This chapter presents two simple and easy-to-handle models called the Interagency Language Roundtable (ILR) oral proficiency interview (Lowe, 1982) and the ACTFL oral proficiency interview (ACTFL, 1986). Both the ILR oral proficiency interview and the ACTFL oral proficiency interview are derived from a similar test developed in the 1950s by the Foreign Service Institute of the United States State Department to assess the foreign language ability of United States Government agency staff. This Government rating scale was later modified for use with students in secondary school and college foreign language programs by the ACTFL and the Educational Testing Service.

In the first half of this chapter, the rating scales of both the ILR and its close derivative, the ACTFL oral proficiency interview, will be described. The structure of an interview, and some example elicitation questions and question types, will also be explained. In addition, information regarding the disadvantages of certain types of questions will be provided. In the second half of the chapter, some basic problems regarding oral interviews will be discussed in the light of dos and don'ts that interviewers should bear in mind. Advice and various suggestions concerning planning and conducting oral interviews will then follow.

The Rating Scales

An oral interview typically takes approximately 20 minutes in a face-to-face conversation between one interviewee and one or two interviewers. In the case of ILR ratings, on which the ACTFL oral proficiency interview is based, the conversations are rated on a scale of 0 (for no practical speaking proficiency) to 5 (for proficiency equivalent to that of an educated or well-informed native speaker). In between these six levels (0 to 5) are five 'plus' ratings (0+ to 4+), which indicate ability at almost the next higher level, making 11 levels altogether.

Table 1 (see Appendix) shows ILR guidelines at each level (ETS, 1982, pp. 34-36). Note that each description is broad enough to include both weaker and stronger performances over a significantly wide range. Interviewers are required to familiarize themselves with these criteria and have at their fingertips the characteristic features that distinguish each level. Ratings are determined by comparing the totality of a student's speaking performance to the descriptions at each level, i.e., no single instance of strength or weakness should determine the final rating.

As you can tell immediately, levels 3, 4, and 5 in the ILR scale are not really necessary for most purposes in dealing with second language learners in Japan. Most Japanese learners, unless they are well-educated bilinguals, recent returnees, or students exceptionally endowed with language learning ability, fall somewhere between 0+ and 2+, and very rarely 3, which is considered to be a highly proficient second language learner. Narrower and more precise descriptions are definitely needed here in Japan to discriminate among students who would score between levels 0 and 2+ on the ILR oral proficiency interview. This is where the ACTFL scale comes into the picture. Table 2 (see Appendix) displays the relationships between the ILR and ACTFL scales (ACTFL, 1989, 2-15). It should be clear that the ACTFL scale is a more detailed expansion of the ILR's lower levels, from 0 to 2+.

When rating, therefore, interviewers could first go from broader descriptions, placing the performance of an interviewee within the appropriate range (0 to 0+, 1 to 1+, or 2 to 2+ of the ILR scale, which corresponds to Novice, Intermediate, and Advanced of the ACTFL scale). Once the range has been determined, depending on the teacher's needs, she could proceed to refine the rating to reflect the sub-level descriptions of Low, Mid, or High. If the teacher only needs to roughly assign her students, broader descriptions, such as Novice, Intermediate, and Advanced, will suffice. However, if teachers want to give remedial guidance to each of their students, for instance, more finely-tuned descriptions will be called for.
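
That two-step route can be pictured as a simple lookup followed by an optional refinement. The Python sketch below follows the correspondence just stated and shown in Table 2; the function and variable names are illustrative placeholders, not part of either scale.

# Sketch of the two-step rating route: ILR bands 0/0+, 1/1+, and 2/2+ map to
# the ACTFL major levels, which can then be refined to Low/Mid/High.
ILR_TO_ACTFL = {
    "0": "Novice", "0+": "Novice",
    "1": "Intermediate", "1+": "Intermediate",
    "2": "Advanced", "2+": "Advanced",
}

def actfl_rating(ilr_band, sublevel=None):
    major = ILR_TO_ACTFL[ilr_band]
    return f"{major} {sublevel}" if sublevel else major

print(actfl_rating("1"))           # Intermediate
print(actfl_rating("0+", "High"))  # Novice High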

Table 3 (see Appendix) shows the proficiency descriptions of the ACTFL scale (ETS, 1982, i-iii). Note that the difference between Novice Low and Novice Mid is one of quantity, and that between Intermediate Low and Mid is affected by both the quantity and quality of interviewees' performances.

The Structure of an Interview

As shown in Figure 1 (see Appendix), an interview has four phases: warm-up, level check, probes, and wind-down. Let us look at each phase in terms of its purpose, and see how you should conduct yourself in each.

The Warm-up (3-5 Minutes)

The main purpose of the warm-up is to put the interviewees at ease and open the way for further conversational exchanges. Therefore, this phase, as an opening to an interview, consists of social amenities and such simple conversation as introducing yourselves. The length varies depending on the interviewee's level of proficiency. Those who have been out of practice, for instance, might need time to get back to the language gradually, while others having daily contact with the language, or those with a high proficiency level, may not need to be bothered with simple social rituals. The second, more important purpose of the warm-up phase is to get a preliminary indication of the interviewee's level, which is to be checked closely in the next phase, the level check.

The Level Check (8-10 Minutes)

The purpose of the level check is to find the highest level at which the interviewee can sustain speaking performance. In this phase, the interviewer should check a number of topics to see if the interviewee can perform consistently at the level in question. Once in a while, the level indication the interviewer obtained in the warm-up phase will be faulty and the level check might begin too low or too high. If this happens, interviewers can simply raise or bring down the level of the questions and resume the level check without making a big fuss over it. In order to find the interviewee's highest sustained level, fluency, accuracy, width of vocabulary, and content must be tested. Is the interviewee fluent enough to be at that certain level? Is her grammar accurate? How about her syntax? Can she perform the functions prescribed in the rating descriptions using suitable content? If the interviewee successfully passes this level check, her performance provides a "floor" for the rating. The next phase, the probes, attempts to find the "ceiling" of the interviewee's abilities.

The Probes (7-10 Minutes)

The purpose of this phase is to make sure the level the interviewer has been checking is the interviewee's highest sustained level. In order to determine this, the interviewer should take the interviewee one level above the present level several times, preferably three to four times. While the level check gives evidence of what an interviewee can do, the probes show what an interviewee cannot do. Therefore, if the interviewer's level check has been appropriate and the highest sustained level has been correctly established, probes will end up causing linguistic breakdown: a sharp drop in fluency or accuracy (e.g., a dramatic increase in grammatical or syntactic errors). The interviewee may also confess to the interviewer that the limit (the "ceiling") has been reached by saying something like, "I don't know how to say it," or "It's difficult to say." If, on the other hand, the interviewer has carried out the level check at too low a level, the interviewee will consistently react and perform well on the probes. If that happens, the interviewer must begin the level check and probes over again at a higher level, and continue the process until the interviewee's proficiency ceiling is established. It is imperative that assignment of the rating be done by comparing the totality of the interviewee's performance to the level descriptions and then finding the one which most closely matches that performance. Under no conditions should one interviewee's performance be compared to that of another. This is important because comparing students with each other is a built-in tendency for many language teachers.
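
The floor-and-ceiling logic of the level check and the probes can be summarized schematically as follows. The sustains_level predicate stands in for the interviewer's judgment of fluency, accuracy, vocabulary, and content at a given level; it is a placeholder, not part of the ILR/ACTFL procedure itself.

# Schematic sketch of the floor/ceiling logic: keep probing one rank above
# the working level; if the probes are handled well, the level check was too
# low, so move up; otherwise the current level is the rating (the floor).
ILR_LEVELS = ["0", "0+", "1", "1+", "2", "2+", "3", "3+", "4", "4+", "5"]

def rate(sustains_level, start="1"):
    idx = ILR_LEVELS.index(start)
    while idx + 1 < len(ILR_LEVELS) and sustains_level(ILR_LEVELS[idx + 1]):
        idx += 1  # probe succeeded, so raise the working level
    return ILR_LEVELS[idx]

# hypothetical interviewee who handles everything up to 1+ but breaks down at 2
print(rate(lambda level: ILR_LEVELS.index(level) <= ILR_LEVELS.index("1+")))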

Wind-down (2-5 Minutes)

The purpose of the wind-down is to return the interviewees to the level at which they perform best and let them leave the testing site with a sense of accomplishment.

Although most interviews follow the same general structure depicted above, these four phases might become indistinguishable at the very lowest levels, and a warm-up or wind-down might not be necessary at the very highest levels.

Useful Elicitation Questions and Question Types

Interview tests typically consist of a series of questions. Thus, it is no exaggeration to say that the success or failure of an interview depends largely on topic areas and types of questions asked. Table 4 (see Appendix) provides an overview of general topic areas and useful question types, as well as some information on the disadvantages of certain question types (adapted from ETS, 1982, pp. 43-57, 75-86).

Some Dos and Don'ts

Since many interviewers are also teachers, they might increase the interviewees' discomfort, and thus decrease their desire to talk, by bringing in certain behaviors deemed desirable in classrooms but not in an interview situation. For example, an overly helpful teacher who corrects students and finishes their sentences for them will not be a suitable interviewer. Teachers who have a tendency to fill in students' pauses by providing needed words, or filling in students' hesitations, cannot become good interviewers, either. Cutting an interviewee's answer short by giving long comments, or expressing opinions excessively or too frequently, also reduces an interviewee's chance to talk, thus depriving the interviewer of ratable material.

The following are some don'ts to keep in mind during an interview (based on ETS, 1982, pp. 37-39):

During the Warm-up:

1. Don't immediately launch into the question and answer routine without the introductory "Hello, how are you?" Instead, make the interviewee feel at ease.

2. Don't ask the interviewee "Are you nervous?" or say "Gee, you really look nervous." Instead, suggest some positive soothing action: "Would you like to smoke, move your chair, make yourself comfortable?"

3. Don't immediately pose a difficult question involving hard or obscure grammar, idioms, or vocabulary. Instead, talk about the weather, summer vacation, etc.

During the Level Check and the Probes:

4. Don't look uninterested in what the interviewee is saying by looking constantly toward the floor, window, or the clock. Instead, act interested in her experiences. Keep eye contact, nod, smile, and be alert.

5. Don't play the role of authority: "I don't think you understood how Koreans feel; the truth is . . ." Instead, judge the interviewee on the language in which she expresses her thoughts.

6. Don't insist on a topic which is not the interviewee's field. She then might be given a lower rating than she deserves. Instead, follow up every clue which might lead to an area of interest. Probe for this as much as for levels.

7. Don't correct the interviewee's grammar during the interview. Instead, find out what the interviewee can do. Ask for clarification, if unclear.


A Few Other Dos and Don'ts

In his book Testing for Language Teachers, Hughes (1989, p. 105) presents 11 pieces of advice on planning and conducting oral tests. Three of these pieces of advice have not been dealt with in this chapter so far (time, number of testers, and affective factors), so they will be touched upon here.

Time. Hughes says it is unlikely that reliable information can be obtained in an interview of less than 15 minutes, while 30 minutes can probably provide all the information necessary for most purposes. However, he also contends that as part of a placement test, a five to ten minute interview should be sufficient to prevent gross errors in assigning students to classes. For instance, by memorizing the simplified checklist shown in Table 5 (see Appendix), teachers can interview their incoming students, give ratings immediately after each interview, and assign a student to three to six classes of different levels (ACTFL's Novice, Intermediate, and Advanced, or ILR's 0, 0+, 1, 1+, 2, and 2+ levels) quite easily within a very short time (i.e., spending no more than five to ten minutes on each one).

Number of testers. Hughes recommends that a second tester be present for an interview. This is because of the difficulty of conducting an interview and keeping track of the candidate's performance. ACTFL (1989, p. 7-2) suggests that the interview be recorded so the rater can concentrate fully on rating the sample, and, in some difficult cases, seeking the opinion of another tester is recommended.

Affective factors. Hughes also warns against interviewers constantly reminding interviewees that they are being assessed. In particular, an interviewer should not be seen making notes about an interviewee's performance, either during the interview or at any other time. In the ILR and ACTFL oral proficiency interviews, taking notes is also prohibited. As mentioned in the previous section, in the ACTFL interviews, tape-recording is recommended. Hughes also recommends that transitions between topics be made as naturally as possible.


Conclusion

I have taken you quickly through the ILR proficiency rating scale and its close derivative, the ACTFL oral proficiency interview scale, including discussion of the structure of an interview, example elicitation questions and question types, and some dos and don'ts for interviewers.

Besides being a testing instrument, an interview provides one of the best ways to get to know your students. In the first place, it is fairly easy and quick to elicit the kinds of answers you need through an interview. It also gives you a thumbnail sketch of where your students stand on their way to acquiring procedural knowledge of the language. Face-to-face communication provides a good way to diagnose your students' weak points, and thus helps you offer remedial treatments rather than commenting on a sheet of paper. I have conducted ILR oral proficiency interviews for the past 15 years and kept records of the interview results of some interviewees to whom I gave the simplified checklist (see above) together with my comments. I also have records of those to whom I gave oral comments but not the checklist. Quite interestingly, those who were given the checklist have improved their performances much more than those who were not. Having something (here, a printed checklist) that they can turn to and review/preview as a sort of beacon to sail across the sea of language acquisition appears to facilitate their self-directed learning.

Another benefit of having an oral interview implemented in a language program is that it gives many of the classroom activities a new real-world reality. Many students in Japan still do not feel an imminent need to be capable of functioning in a foreign language. Classroom activities have very little reality if the students do not feel a need to use the language outside the classroom. To change the whole atmosphere of foreign language study from merely being an academic pursuit or expensive pastime or avocation, implementing an oral interview goes a long way.

For example, I use one part of each of my classroom hours to prepare for an interview test, or at least, that is the pretext I use, because in theory, you cannot really prepare for a proficiency interview. In my mind, this is a five Ws and one H activity, a question and response expansion exercise with some real-world meaning (see Nagata, 1993). It is a pair-work exercise in which student A first asks student B a yes/no question. Instead of answering that question, student B should convert that yes/no question into some Wh-questions without changing the general motive of the question. Student A then answers the Wh-question, and the two continue the interaction as if playing the parts of interviewer and interviewee interchangeably until they use up all the related topics they can talk about. One example might look something like the following:

Student A: Do you like Chinese food?
Student B: How do you like Chinese food? (or What do you think of Chinese food? or What kind of Chinese food do you like best?)
Student A: I like it very much, especially spring rolls.
Student B: What is a spring roll? (interaction continues)

Once in a while, I incorporate achievement elements in the interview. (Here, again, I am well aware that ILR and ACTFL oral interview tests are proficiency tests, not achievement tests.) Announcing, for instance, that a couple of S-1 Familiar Situations (see "Useful Elicitation Questions and Question Types") will be tested in the upcoming interview, though, gives students a strong incentive to enthusiastically practice those situational exchanges with their partners. It is not that they are practicing because those particular situational exchanges will be tested; rather, the practice itself now has real meaning for communication. Rote memorization of textbook dialogues, a common practice before implementation of the interview, is long gone from my classroom. Everybody knows memorized lines will not help much in the interview (and in real-world situations) because while dialogue situations are constant, real-world situations are always changing.


Tightly matching performance descriptions with the course entrance and exit requirements of a language program also helps learners visualize a clearly defined goal at each level. ILR and ACTFL oral proficiency interview ratings can be used profitably in that direction, too.

Teachers who have no time to worry about participating in an interview training program at this time can still benefit from the simplified checklist provided above. By simply listening to your students' interaction, you will not only be able to assign them to different level classes, but also be able to give them remedial guidance as well. For example, if one young woman can communicate simply through questions and answers, but cannot sustain control of past and future tenses or narrate and describe well, she is a 1 (on the ILR scale) or Intermediate (on the ACTFL scale). What she needs is the ability (a) to sustain control in the past, present, and future tense, (b) to narrate and describe, (c) to provide real conversational exchange (not just through questions and answers), and (d) to function in survival situations with a complication.

If we assume that the objective of teaching spoken language is the development of the ability to interact successfully in the language, and that this ability involves not only comprehension but production, oral proficiency interview skills are a must for language teachers. I hope this article will help teachers in Japan place more emphasis on the oral production ability of their students.

References

ACTFL (American Council on Teaching of Foreign Languages). (1986). ACTFL proficiency guidelines. New York: American Council on Teaching of Foreign Languages.

ACTFL (American Council on Teaching of Foreign Languages). (1989). ACTFL testers training manual. New York: American Council on Teaching of Foreign Languages.

ETS (Educational Testing Service). (1982). ETS oral proficiency testing manual. Princeton, NJ: Educational Testing Service.

Hughes, A. (1989). Testing for language teachers. Cambridge: Cambridge University Press.

Lowe, P., Jr. (1982). ILR handbook on oral interview testing. Washington, DC: DLI/LS Oral Interview Project.

Nagata, H. (1993). How to start a conversation: Small talk and praises. The English Teachers' Magazine, 41(12), 60-63.

Appendix: Figures and Tables

Figure 1. The structure of an interview

Level checks: To find the candidate's highest sustained level.

Probes: To take the candidate one rank above (to make sure that the level the interviewer has been checking is really the interviewee's highest sustained level).

Breakdowns: A sharp drop in fluency; a sudden groping for words; a dramatic increase in grammatical errors.


Table 1. ILR guidelines (ETS, 1982, pp. 34-36)

Level 0: No Practical Proficiency

Speaking: The examinee has no practical speaking proficiency. May have a few isolated words and phrases which are of no practical use.

Understanding: The examinee understands some isolated words and phrases, but is unable to participate even in a very simple conversation.

Level 1: Survival Proficiency

Speaking (Subject matter): The examinee has the minimum proficiency for survival on a day-to-day basis in the target country, i.e., functions in simple question-and-answer situations. Knows enough at this level to satisfy ordinary courtesy requirements. Able to ask and answer questions relating to situations of simple daily life and routine travel abroad. The examinee is also able to handle requests for services such as renting a hotel room and ordering a simple meal.

Speaking (Quality): The examinee at this level normally makes errors even in structures which are quite simple and common. Vocabulary is limited to the type of situations mentioned above, and even in these situations he or she sometimes uses the wrong word. Although pronunciation may be poor, he or she makes the minimum contrastive distinctions, including stress, intonation, and tone patterns, necessary to be understood.

Understanding: The examinee is able to understand simple questions and statements relating to simple transactions involved in situations of daily life and independent travel abroad, allowing for slowed speech with considerable repetition or paraphrasing.

Level 2: Limited Working Proficiency

Speaking (Subject matter): The examinee is able to talk in some detail about concrete subjects such as own personal and educational background, family, travel experiences, recreational activities, and familiar places.

Speaking (Quality): The examinee has enough control of the morphology of the language (in inflected languages) and of the most frequently used syntactical structures. Although vocabulary is sufficient to talk with confidence about the type of topics described above, the limited vocabulary fairly often reduces the examinee to verbal groping, or to momentary silence. A foreign intonation and rhythm may still be dominant.

Understanding: The examinee is able to comprehend questions and statements relating to common social topics when the language is spoken at normal conversational speed. Can get the gist of casual conversations with educated or well-informed native speakers talking about subjects on the level of current events, allowing for occasional repetitions or paraphrased statements.

Level 3: Professional Working Proficiency

Speaking (Subject matter): The examinee is able to converse and express opinions about such topics as current events, including political and social problems of a general nature.

Speaking (Quality): The examinee has good control of grammar, though there are occasional errors in low-frequency structures and in the most complex frequent structures. The vocabulary is broad enough that he or she rarely gropes for words in discussing the topics mentioned above. A foreign phonology, though apparent, is no longer dominant.

Understanding: The examinee can comprehend most of what is said at a normal conversational rate of speech. A person at this level is able to understand more complex formal discourse to a high degree, i.e., subjects on the level of panel discussions, news programs, etc.

Level 4: Distinguished Proficiency

Speaking (Subject matter): Although the subject matter that the examinee is able to handle at this level may not differ very much from that of Level 3, he or she is able to use the language in all nontechnical situations and can express opinions almost as fully and correctly as in the native language (assuming that the individual is a 5 in the native language). The examinee is able to tailor his or her speech to the audience, has near-perfect grammar, and speaks the language with extensive and precise vocabulary. Although the examinee may still have an accent, he or she very rarely mispronounces the language.

Understanding: The examinee can understand the content of all conversations and formal presentations within the range of his or her experience. With the exception of dialect variations and colloquialisms outside the range of experience, understands the type of language heard in speeches sprinkled with idioms and stylistic embellishments.

Level 5: Native or Bilingual Proficiency



Table 2. Relationship between the ILR scale and the ACTFL scale (ACTFL, 1989, pp. 2-15)

ACTFL SCALE            ILR SCALE
                       5    Native or bilingual proficiency
                       4+
                       4    Distinguished proficiency
                       3+
Superior               3    Professional working proficiency
Advanced High          2+
Advanced               2    Limited working proficiency
Intermediate High      1+
Intermediate Mid       1    Survival proficiency
Intermediate Low
Novice High            0+
Novice Mid             0    No practical proficiency
Novice Low


Table 3. ACTFL Rating Scale (ETS, 1982, pp. i-iii)

Novice Low (ILR's Level 0)

Unable to function in the spoken language. Oral productions limited to occasional isolated words. Essentially no communication ability.

Novice Mid (ILR's Level 0)

Able to operate only in a very limited capacity within very predictable areas of need. Vocabulary limited to that necessary to express simple elementary needs and basic courtesy formulae. Syntax is fragmented, inflections and word endings frequently omitted, confused, or distorted, and the majority of utterances consist of isolated words or short formulae. Utterances do not show evidence of creating with language or being able to cope with the simplest situations. They are marked by repetition of an interlocutor's words as well as by frequent long pauses. Pronunciation is frequently unintelligible and is strongly influenced by the first language. Can be understood only with difficulty, even by persons such as teachers who are used to speaking with non-native speakers.

Novice High (ILR's Level 0+)

Able to satisfy immediate needs using learned utterances. There is no real autonomy of expression, although there are some emerging signs of spontaneity and flexibility. There is a slight increase in utterance length, but frequent long pauses and repetition of the interlocutor's words may still occur. Can ask questions or make statements with reasonable accuracy only where this involves short memorized utterances or formulae. Most utterances are telegraphic, and word endings are often omitted, confused, or distorted. Vocabulary is limited to areas of immediate survival needs. Can produce more phonemes, but when they are combined in words or groups of words, errors are frequent and, in spite of repetition, may severely inhibit communication even with persons used to dealing with such learners. Little development in stress and intonation is evident.

Intermediate Low (ILR's Level 1)

Able to satisfy basic survival needs and minimum courtesy requirements. In areas of immediate need or on very familiar topics, can ask and answer simple questions, initiate and respond to simple statements, and maintain very simple face-to-face conversations. When asked to do so, is able to formulate some questions with limited constructions and much inaccuracy. Almost every utterance contains fractured syntax and other grammatical errors. Vocabulary inadequate to express anything but the most elementary needs. Strong interference from the L1 occurs in articulation, stress, and intonation. Misunderstandings frequently arise from limited vocabulary and grammar and erroneous phonology, but, with repetition, can generally be understood by native speakers in regular contact with foreigners attempting to speak their language. Little precision in information conveyed owing to the tentative state of grammatical development and little or no use of modifiers.

Intermediate Mid (ILR's Level 1)

Able to satisfy some survival needs and some limited social demands. Some evidence of grammatical accuracy in basic constructions, e.g., subject-verb agreement, noun-adjective agreement, some notion of inflection. Vocabulary permits discussion of topics beyond basic survival needs, e.g., personal history, leisure-time activities. Is able to formulate some questions when asked to do so.

Intermediate High (ILR's Level 1+)

Able to satisfy most survival needs and limited social demands. Developing flexibility in a range of circumstances beyond immediate survival needs. Shows spontaneity in language production, but fluency is very uneven. Can initiate and sustain a general conversation but has little understanding of the social conventions of conversation. The commoner tense forms occur, but errors are frequent in formation and selection. Can use most question forms. While some word order is established, errors still occur in more complex patterns. Cannot sustain coherent structures in longer utterances or unfamiliar situations. Ability to describe and give precise information is limited. Aware of basic cohesive features (e.g., pronouns, verb inflections), but many are unreliable, especially if less immediate in reference. Extended discourse is largely a series of short, discrete utterances. Articulation is comprehensible to native speakers used to dealing with foreigners, and can combine most phonemes with reasonable comprehensibility, but still has difficulty in producing certain sounds, in certain positions, or in certain combinations, and speech will usually be labored. Still has to repeat utterances frequently to be understood by the general public. Able to produce narration in either past or future.

Advanced (ILR's Level 2)

Able to satisfy routine social demands and limited work requirements. Can handle with confidence but not with facility most social situations, including introductions and casual conversations about current events, as well as work, family, and autobiographical information; can handle limited work requirements, needing help in handling any complications or difficulties. Has a speaking vocabulary sufficient to respond simply with some circumlocutions; accent, though often quite faulty, is intelligible; can usually handle elementary constructions quite accurately but does not have thorough or confident control of the grammar.

Advanced High (ILR's Level 2+)

Able to satisfy most work requirements and show some ability to communicate on concrete topics relating to particular interests and special fields of competence. Often shows remarkable fluency and ease of speech, but under tension or pressure language may break down. Weaknesses or unevenness in one of the foregoing areas or in pronunciation result in occasional miscommunication. Areas of weakness range from simple constructions such as plurals, articles, prepositions, and negatives to more complex structures such as tense usage, passive constructions, word order, and relative clauses. Normally controls general vocabulary, with some groping for everyday vocabulary still evident.

Superior (ILR's Level 3 and Above)

All performance above Advanced High is rated as Superior.


Table 4. Useful Question Types and Topic Areas

Levels 0-0+ (ILR) / Novice Low-High (ACTFL)

Yes/No questions
  Examples: "Do you live in ...?" "Are you a student at the university?"
  Disadvantages: Yes/No questions often provide no real information about the candidate's speech because they are so amenable to a one-word answer. To encourage conversation, Yes/No questions must be followed by Wh-questions, such as "Who?", "What?", "Where?", and "When?"

Choice questions
  Examples: "How did you get to work this morning, by bus or by car?" "Which do you like better, Japanese style breakfast or Western style breakfast?"
  Disadvantages: Choice questions give away vocabulary and sometimes grammar points that the interviewer may be trying to check. Therefore, they may deprive the candidate of an opportunity to produce speech on her own.

Topic areas: Name articles of clothing; name basic objects; name colors; name family members; give the weather; name weekdays and months; give today's date; give the year; tell time.

Levels 1-1+ (ILR) / Intermediate Low-High (ACTFL)

Polite requests
  Example: "Would you describe this room, please?"
  Disadvantages: Polite requests have to be carefully phrased in order to encourage speech production.

Information questions
  Examples: "Who was with you?" "When were you there?" "How did you get there?"
  Disadvantages: The speech sample elicited by information questions may be very short, in some instances only one or two words. If candidates are not talking naturally in the target language, too many information questions may lead them to feel that they are being interrogated. Testers may end up rating the factual content of the speech sample rather than its linguistic content.

Familiar situations (role-play)
  Example: "Please reserve a hotel room with a double bed and bath at the cheapest possible rate. I will play the role of the clerk."
  Disadvantages: Acting out familiar situations can be a problem for any candidate who dislikes role-playing.

Candidate interviews the tester
  Example: "Please ask me some questions, such as where I live, etc."
  Disadvantages: The candidate may simply repeat the questions which the tester gives as illustrations. Candidates often draw a blank when asked to produce speech out of context, such as "I've asked a lot of questions. Now I'd like you to ask me some questions."

Topic areas: Personal information; hotel, restaurant; money matters; welfare; directions; transportation; meeting; social; telephone; post office; car.

Levels 2-2+ (ILR) / Advanced and Advanced High (ACTFL)

Information questions (see above)

Familiar situations with complications (role-play)
  Examples: "You are in a restaurant. You have eaten most of your meal when you discover a bug under the steak. You feel ill. Call the waiter and announce that you are leaving immediately. Refuse to pay the bill." "You live in a condominium. Your upstairs neighbor waters the plants on her balcony and the water ends up on your balcony, dirtying the washing. Go to your neighbor and discuss the problem. I will play the part of the neighbor."
  Disadvantages: Some candidates dislike role-playing. Role-plays may turn into a translation exercise if the situations are too specific.

Topic areas: Like the Level 1 speaker, the linguistic repertoire is generally in the realm of "who," "what," "where," and "when," rather than "how" and "why" (a distinctive feature of the Level 3 speaker). Ask others for information about themselves, such as family status, residence, background, and interests; give this type of detailed information about oneself. Describe daily routine and talk about one's personal interests. Describe a person, an object, or a place. Give directions. Tell about a sequence of events. Tell in simple terms about the plots of books or movies (detailed discussion of them is a Level 3 task). Tell about news events in simple factual terms. Tell about future plans.

(based on ETS, 1982, pp. 43-57, 75-86)


Table 5. Simplified Checklist of ILR/ACTFL Level Descriptions

0+ / Novice High
1) Very little or no ability to create with the language
   * One- to two-word responses
   * Frequent repetition of the interviewer's words
   * Very limited communication using formulas, memorized phrases
2) Interviewer must repeat often

1 / Intermediate Low & Mid
1) No control of past and future (most speech in present tense)
2) Can create original utterances
3) Can communicate simply through Q & A
   * Can make simple descriptions about personal background
   * Can ask simple questions, directions, and instructions
   * Can handle minimum courtesy requirements
   * Areas of immediate need, or very familiar topics
4) Almost all utterances contain grammatical errors

1+ / Intermediate High
   * Can narrate and describe but unable to sustain control in past and future

2 / Advanced
1) Can sustain control in the past, present, and future
2) Can narrate and describe
3) Real conversation (not just Q & A)
4) Can provide limited discourse
5) Can get out of survival situations with complications

2+ / Advanced High
   * Supports opinion and hypothesizes in a limited manner

3 / Superior
1) Can discuss hypothetical situations
2) Can support opinions
3) Can provide extended discourse
4) Good control of grammar; only occasional errors
5) Has broad vocabulary and rarely gropes for words
6) Can handle unfamiliar topics or situations

(Pluses (+) indicate ability at almost the next level, not halfway ranges.)


Chapter 13

The SPEAK Test of Oral Proficiency: A Case Study of Incoming Freshmen¹

SHAWN M. CLANKIE

KANSAI GAIDAI UNIVERSITY

The SPEAK test, or Speaking Proficiency English Assessment Kit, was created by the Educational Testing Service (for more information, see ETS, 1993), the makers of other tests including TOEFL and TOEIC, and was developed "in response to the interest expressed by many institutions in an instrument to assess the spoken proficiency of foreign teaching assistants and other international students who are non-native speakers of English" (ETS, 1992, p. 5). This chapter presents the results of an attempt to adapt the SPEAK test to an existing English as a foreign language program at Kansai Gaidai University in Osaka. It examines in depth the benefits of the SPEAK test, some of its problems, and the reasons why we chose to abandon the test, opting for a different form of oral evaluation.

The Program

In the spring of 1993, Kansai Gaidai University instituted a new program called the IES, or Intensive English Studies, program. This program ran alongside the existing, less demanding program, and English majors entering the university were given the option of automatically entering the regular English program or trying to test into the IES program. The program was inaugurated at both the junior college and university levels, with three native-speaking university-level instructors and two native-speaking junior college instructors.

For those who applied for the intensive courses, the IES program was designed to afford the top 10-15% of the students the opportunity to study the language intensively. The program offered approximately double the standard number of class contact hours per class per week, with a restricted number of different native speaker instructors. Class size was restricted as well, to a maximum of 30 students per class. Classes were taught completely in English, and it was hoped that the increased exposure to English, with a limited number of good students and teachers, would foster higher levels of proficiency than those reached in the existing program, particularly in speaking ability at the end of the respective four- or two-year terms.

The number of students in the initial year was 85 and 47 at the university and junior college, respectively. The 132 charter students were selected from a group of several hundred on the basis of TOEFL scores alone and were placed by those same TOEFL scores into an IES section of students with similar scores. The university level consisted of three sections (A, B, and C), while the junior college had two sections (A and B).

Selection by TOEFL scores alone posed a potential problem in that the students selected, while strong in grammar and writing skills, were not necessarily the strongest in terms of oral production. This raised the possibility that the program might be losing students who had strong oral skills but lacked the skills necessary for success on the TOEFL. Realising this problem, the IES teachers began seeking alternatives to selection strictly on the basis of TOEFL, in particular looking for a measure of oral ability that could be administered quickly and used along with the TOEFL scores to gain a truer picture of the students with the best ability.

In looking for a test of oral ability, we had several concerns. We first needed a measure that could be used to test a large number of students; we were worried that it might be impossible to assess several hundred students accurately one at a time, given scheduling and staff limitations. Time was a second major factor. The school year begins on April 8, and, from the time of acceptance into Kansai Gaidai University, the students would need to apply to the IES program, be TOEFL tested, be selected into the IES program, be placed into a section of the program, and be notified of acceptance and scheduling, all in a matter of a couple of weeks. The administration of an oral measure would of course be in addition to the procedural steps above.

The SPEAK Test

In November of 1993, after a semester of looking at alternatives to selection strictly on the basis of TOEFL scores (and ever mindful of the time constraints mentioned above), the teachers of the IES program at Kansai Gaidai University began attempts to integrate the SPEAK test into a workable model for assessing the skills of incoming freshmen seeking entrance into the IES program in the spring of 1994. In addition, an attempt was made to find a way to combine SPEAK scores with TOEFL scores as the basis for admissions and placement decisions in the IES program.

The first trials of the SPEAK test were administered in November and December of 1993 to the three charter classes of university-level second-semester freshmen and the two charter classes of junior college second-semester freshmen already enrolled in the IES program. SPEAK was selected because it could easily be administered to large groups with fairly high interrater reliability.

The test kit comes with one set of thirty test booklets, the test cassette, and a guide to administering the test, as well as scoring sheets, rater training cassettes, and instructions. Additional versions of the SPEAK test are also available. It should be noted here that the SPEAK test is generated from the TSE (Test of Spoken English); retired TSE tests make up the three versions of SPEAK currently on the market. The cost of the kit at present is $350, with additional versions of the test at $150 each. All of the tests are of course reusable and would work best in a rotation system wherein each version of the test is used once every three years or semesters.

The actual testing procedure simply involves distributing the test booklets and playing the matching test cassette. Students respond to each question or problem, and their voices are recorded on separate cassette recorders.

The test contains seven sections, of which the last six are scored. The first section is meant to relax the students, test the equipment, and allow for modifications, with questions such as "What is your name?" In section two, students are asked to read a paragraph silently to themselves and then, when instructed, to read the paragraph aloud. In section three, students are given the beginnings of sentences and are asked to complete them. This is followed by a picture sequence used to elicit a narrative in section four. Section five involves a single picture, about which students are asked questions concerning what is happening, will happen, or has happened. Section six contains open-ended description questions which students are asked to explain. The final exercise asks the students to present a class schedule as if they were a teacher.

For each of the six scored sections of the test, students are scored in at least two of four categories: pronunciation, fluency, grammar, and comprehension. Each answer is rated on a scale of 0-3, three being the best. At the end of the test, the scores in each of the core categories are added together and then averaged, with the exception of the comprehension segment, where the average is multiplied by 100. The scores are then compared and matched to a chart defining the abilities of the student.
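For readers who want to see the arithmetic, the following is a minimal sketch of that aggregation step. The data layout, function name, and example ratings are illustrative assumptions, not ETS's own materials; only the add-and-average step and the times-100 treatment of the comprehensibility category come from the description above.

    # Minimal sketch (not ETS's software) of the SPEAK score aggregation described above.
    # `ratings` maps each category to the list of 0-3 ratings given across the scored sections.
    def speak_summary(ratings):
        summary = {}
        for category, values in ratings.items():
            average = sum(values) / len(values)
            if category == "comprehensibility":
                summary[category] = round(average * 100)   # reported on a 0-300 scale
            else:
                summary[category] = round(average, 1)      # reported on the 0-3 scale
        return summary

    example = {                       # hypothetical ratings for one student
        "pronunciation":     [2, 2, 3, 2, 2, 3],
        "fluency":           [2, 1, 2, 2, 2, 2],
        "grammar":           [2, 2, 2, 1, 2, 2],
        "comprehensibility": [2, 2, 2, 2, 3, 2],
    }
    print(speak_summary(example))
    # {'pronunciation': 2.3, 'fluency': 1.8, 'grammar': 1.8, 'comprehensibility': 217}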

Raters are trained using the several sample test cassettes that come with the test package. The students' cassettes are then distributed among the raters, who rate them and then meet to discuss and select the students to enter the program. This all seemed easy enough.

Our purpose in testing the students already taking part in the program was primarily to gain experience with the test and to work out any glitches before actually putting the test into practice. This first run was important, as we quickly found problems in using the SPEAK to assess Japanese learners. As the students taking part in the experiment were already in sections, we tested each section one class at a time over the span of a week. The tapes recorded by each student were then collected, and each was rated by two instructors (in the hope of increasing reliability and validity). In the case of the two junior college instructors (of which I was one), this meant that each of us took the tapes of one of the two junior college sections, rated them, then exchanged tapes and rated the other section as well. The results of the two raters were then matched to see whether the scores were similar, both in comparing students and in comparing classes.
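A comparison of that kind can be made concrete in a few lines. The sketch below uses hypothetical scores and simply reports the mean absolute difference and exact-agreement rate between two raters; the chapter does not specify how the matching was actually computed, so this is only one plausible way to do it.

    # Illustrative comparison of two raters' scores for the same students
    # (hypothetical data; the chapter does not give the actual comparison method).
    def rater_agreement(rater_a, rater_b):
        diffs = [abs(a - b) for a, b in zip(rater_a, rater_b)]
        return {
            "mean_abs_diff": sum(diffs) / len(diffs),
            "exact_agreement": sum(d == 0 for d in diffs) / len(diffs),
        }

    rater_a = [220, 180, 200, 150, 240, 190, 170, 210, 160, 200]  # comprehensibility, 0-300
    rater_b = [210, 190, 200, 160, 230, 200, 170, 200, 170, 190]
    print(rater_agreement(rater_a, rater_b))
    # {'mean_abs_diff': 8.0, 'exact_agreement': 0.2}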

Benefits of SPEAK

The obvious benefit of the SPEAK test is that it can be administered to a group or to a single student with the same ease. Moreover, the problem of some interviewers being more difficult than others in questioning or scoring is overcome in that all students take the same test, administered the same way, with each student having the same amount of time to respond. And of course there is little thinking time; students are expected to respond immediately, as in normal conversation. This prevents (in most cases) those long thinking pauses which often stall face-to-face oral interviews. There is another plus: if several versions of the test are at the disposal of a program, they may be rotated so that each entering group of students takes a different version of the test.

Sections of the test that call for free description of an open question or topic reflect tasks which are a definite practical concern in daily conversation. The test appears ideal for cases where a single student needs to be interviewed, for example, a graduate student seeking a teaching assistantship.

One final benefit: the test saves the time it takes to create the lists of questions and topics which are normally used in face-to-face interviews. In the case of a small program such as ours, it would appear that such benefits would make the test very useful. Yet we found significant drawbacks to using the test in its published format.

Drawbacks of SPEAK

The primary fault of the test (for our purposes) is generated by one of its most outstanding benefits, namely the fact that it can be given to a large group at the same time. The traditional oral examination is face-to-face with a live tester, who administers the test and rates the examinee during or directly after the interview. This means that each examination is given once, and once it is finished, it is finished.

However, the SPEAK test is the same test every time on every cassette, with varying levels of differences in responses. With all of the students taking the test at the same time and their answers being recorded, each tape must be listened to and rated. In essence, what one gains in giving the test quickly is lost in the amount of time it takes to listen to all of the cassettes. With each test roughly twenty minutes long, multiplied by a class of 25, a rater continuing non-stop must work more than eight hours. This is of course nearly impossible, and we found that rater reliability dropped severely and daydreaming increased during rating, particularly during the lull periods between sections, after only 3-5 tapes.

As mentioned in the opening of this essay, time was critical, as all rating, selection, and notification had to be done prior to the beginning of the semester. In addition to the students' voices, the original taped examiners' voices are also included on the tapes. Listening to the same test repeatedly had the effect of distracting the raters from concentrating on the intermittent test segments containing students' responses. Trying to fast-forward only to the answers was a hit-or-miss solution at best and, in any case, did not significantly reduce listening time.

Rater willingness to listen to the cassettes after the initial day of administration was another problem. Enthusiasm was high at the onset of the experiment. However, after the first day of listening to cassettes, the raters found themselves increasingly unwilling to continue with the SPEAK test ratings.

In addition, one particular part of the test, the second exercise, which asks students to read aloud, is somewhat questionable in terms of test validity. Hughes (1991, p. 110) discourages the use of reading aloud as a measure of oral proficiency because of "inevitable interference between the reading and speaking skills." He also argues that, if the same test were given to a group of native speakers, there would "almost certainly be considerable differences between candidates." Underhill (1987, p. 76) also supports this view.

Moreover, we found that the scoring system needed some modification to bring it more into line with the responses we anticipated receiving. While the test would seemingly work well for students possessing widely varying levels of mastery of English, it simply was not satisfactory when given to large numbers of students of similar ability. Japanese students leaving high school typically have years of training in areas such as grammar, yet in most cases their oral abilities are weak. The scoring system accompanying the test, based on a 0-3 rating, simply could not differentiate within its mid-range of 1-2. It was easy to spot students at the high end, often those with substantial overseas experience, for example, yet in the mid-range the task was much more difficult.

A similar problem arose when we discovered that some of the responses given by our students were direct exceptions to the scoring scale created by ETS. One question raised by one of the raters in our program was how to score a student who uses a longer, more complex utterance containing several errors versus a student who answers only in short yet perfectly correct responses. The student offering the longer responses was apparently trying to do his or her best, while the other student was simply taking the safe way out.

In the end, after attempting to rate a portion of the cassettes according to the prearranged rating scale, we found that change was needed. A meeting was held, and each of the teachers involved presented suggestions for making the scale more effective, primarily in the mid-levels. After sifting through each of the four rating categories (pronunciation, fluency, grammar, and comprehensibility), we found that adding half-levels in the mid-ranges offered greater differentiation in scoring. The original and adapted scoring scales are presented in Tables 1 and 2, respectively (see the Appendix to this chapter).

Hughes (1991, p. 105) points out a further disadvantage of group testing from a cassette when he states, "The obvious disadvantage to this format is its inflexibility: there is no way of following up candidates' responses."

As a final note, with regard to taped tests such as the SPEAK, Underhill (1987) also mentions simple practical problems, such as the possibility of technical difficulties which could result in some or none of the tapes being recorded. This actually happened on our first attempt to administer the test to a class and is definitely a practical concern.

Prospects for SPEAK and Conclusions

The question now arises as to what (if anything) can be done to make the SPEAK test applicable to the various English programs existing in Japan. If the test is simply to be used to check a handful of students' abilities, it will serve this purpose adequately. Obviously, the more raters that can be involved, the better the chance of success: the fewer tapes rated by each rater, the more reliable the ratings will be. Reducing the number of tapes rated at a single sitting to three would likely also increase rater reliability, but would take substantially more time overall. With new teachers being added each semester in our particular program, the possibility of reducing the number of tapes listened to by each rater has increased. As of summer 1994, 14 teachers were involved in the IES junior college and university programs. However, with an incoming group of potential IES students of 400 (after some form of preliminary elimination by TOEFL), each teacher would be responsible for listening to 28.5 tapes, or slightly more than a class. That is only if each tape is listened to by a single rater. With each tape being 20 minutes, the time per rater amounts to 9.5 hours. This is still far too high.
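The workload figures above are simple arithmetic, and a small sketch makes them easy to recompute for other class sizes or staffing levels. The function itself is only an illustration; the numbers plugged in are those given in this chapter.

    # Back-of-the-envelope rater workload: tapes per rater and listening hours per rater.
    def rating_load(candidates, raters, minutes_per_tape=20):
        tapes_per_rater = candidates / raters
        hours_per_rater = tapes_per_rater * minutes_per_tape / 60
        return tapes_per_rater, hours_per_rater

    print(rating_load(25, 1))    # one rater, one class of 25 -> (25.0, about 8.3 hours)
    print(rating_load(400, 14))  # 400 candidates, 14 raters  -> (about 28.6, about 9.5 hours)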

Moreover, repeatedly listening to the original test prompts on each of the students' cassettes is still a problem. For schools that have language labs with voice-activated recorders, this problem could be overcome by taping only the voices of the students, thereby eliminating several minutes of blank-tape waiting time. As many schools are not yet equipped with such a system, this problem may still plague users of the SPEAK test. Finally, some schools may find the test more successful if they adapt the scoring system to represent more accurately the problems and abilities of the Japanese learners in their individual programs (as we did in Table 2). Inevitably, some schools trying out the SPEAK test may choose to abandon it altogether, as was the case at Kansai Gaidai.

At Kansai Gaidai, given the constraints of our system, we found that TOEFL scores in conjunction with an oral interview, one consisting of two interviewers interviewing each student one at a time, was a more effective and time-efficient option than the SPEAK test. This has allowed us to interview, rate, and place the students all in the course of a single day.

We held a faculty meeting with the administration and the department coordinators and arranged a testing day, much the same way as when we were to conduct the SPEAK test. This was done roughly a month ahead of the actual oral testing date, allowing ample time for logistical concerns (scheduling interview rooms, notifying candidates, etc.).

Then the teachers met a couple of days prior to the test (e.g., over one or two lunch hours) to discuss possible topics for the interview. These were selected from current events, both international and domestic, from general experiences such as overseas travel, and from general conversation topics. On the testing day, each student was interviewed by two teachers in the program in an interview of roughly eight minutes, and the scores were then compared to the TOEFL scores. It was possible to have seven interviews occurring simultaneously throughout a three-hour morning session. The morning session consisted of interviewing all of the junior college candidates.

Scores were collected hourly by officials of the registrar's office and entered into the computer. Over a working lunch period, the faculty selected the students permitted to enter the two IES junior college classes. In the afternoon session, the university candidates were rated by the same process. Upon completion of the afternoon session, we examined the results and selected those who would take part in the four university-level IES classes. At the end of the one-day testing session, the teachers were taken to dinner in gratitude for the day's work (an important factor in ensuring that teachers volunteer to take part the following term). This manner of testing has, of course, taken away some of the objectivity which the SPEAK test offers, in that the same test is not being offered to each student.

As time is among the most pressing concerns for university English programs in Japan, the form of test used, its administration, its rating, and the reason for testing must all be carefully scrutinised. For those considering using the SPEAK test, this chapter should present a clearer picture of how the test may work in assessing students in a Japanese academic situation.

Note

1. I would like to give special thanks to Mary Everett at Kansai Gaidai for preserving many of our notes regarding this experiment, and for reading an earlier draft of this essay.


References

ETS (Educational Testing Service). (1992). Guide to SPEAK. Princeton, NJ: Educational Testing Service.

ETS. (1993). TOEFL 1993-1994 products and services catalogue. Princeton, NJ: Educational Testing Service.

Hughes, A. (1991). Testing for language teachers. Cambridge: Cambridge University Press.

Underhill, N. (1987). Testing spoken language. Cambridge: Cambridge University Press.

Appendix: Tables

Table 1. SPEAK Scoring Key

Pronunciation
0 Frequent phonemic errors and foreign stress and intonation patterns that cause the speaker to be unintelligible.
1 Frequent phonemic errors and foreign stress and intonation patterns that cause the speaker to be occasionally unintelligible.
2 Some consistent phonemic errors and foreign stress and intonation patterns, but speaker is intelligible.
3 Occasional non-native pronunciation errors, but speaker is always intelligible.

Fluency
0 Speech is so halting and fragmentary or has such a non-native flow that intelligibility is virtually impossible.
1 Numerous non-native pauses and/or a non-native flow that interferes with intelligibility.
2 Some non-native pauses but with a more nearly native flow so that the pauses do not interfere with intelligibility.
3 Speech is as smooth and as effortless as that of a native speaker.

Grammar
0 Virtually no grammatical or syntactical control except in simple stock phrases.
1 Some control of basic grammatical constructions but with major and/or repeated errors that interfere with intelligibility.
2 Generally good control in all constructions with grammatical errors that do not interfere with overall intelligibility.
3 Sporadic minor grammatical errors that could be made inadvertently by native speakers.

Comprehensibility
0 Overall comprehensibility too low in even the simplest type of speech.
1 Generally not comprehensible due to frequent pauses and/or rephrasing, pronunciation errors, limited grasp of vocabulary, or lack of grammatical control.
2 Comprehensible with errors in pronunciation, grammar, choice of vocabulary items, or infrequent pauses or rephrasing.
3 Completely comprehensible in normal speech with occasional grammatical or pronunciation errors.

Reprinted by permission of Educational Testing Service, the copyright owner (ETS, 1992, p. 10). No endorsement of this publication by Educational Testing Service should be inferred.


Table 2. Revised Scoring for SPEAK, Kansai Gaidai University

Comprehensibility
0 Does not comprehend the question.
1 Generally not comprehensible; with pauses and errors; limited vocabulary.
1.5 Overall comprehensibility low; many errors.
2 Generally comprehensible; short answers.
2.5 Completely comprehensible in normal speech with occasional grammatical or pronunciation errors.
3 Easily comprehensible with lengthy answers.

Fluency
0 Too halting or fragmentary to be intelligible.
1 Numerous non-native pauses and/or a non-native flow that interferes with intelligibility.
1.5 Numerous pauses but fairly intelligible.
2 Some non-native pauses but with a more nearly native flow so that the pauses do not interfere with intelligibility.
2.5 Pauses but intelligible; lengthy answers.
3 Smooth, effortless speech with few errors.

Grammar
0 Virtually no grammatical or syntactical control except in simple stock phrases.
1 Some control of basic grammatical constructions but with major and/or repeated errors that interfere with intelligibility.
1.5 Good control of basic constructions; fair intelligibility.
2 Generally good control in all constructions with grammatical errors that do not interfere with overall intelligibility.
2.5 Good control; good intelligibility; more complex structures.
3 Sporadic minor grammatical errors that could be made inadvertently by native speakers.

Pronunciation
0 Frequent phonemic errors and foreign stress and intonation patterns that cause the speaker to be unintelligible.
1 Frequent errors; often not intelligible.
1.5 Frequent phonemic errors and foreign stress and intonation patterns that cause the speaker to be occasionally unintelligible.
2 Some consistent phonemic errors and foreign stress and intonation patterns, but speaker is intelligible.
2.5 Some errors but speaker is generally intelligible.
3 Occasional errors but speaker is always intelligible.


Chapter 14

Making Speaking Tests Valid: Practical Considerations in a Classroom Setting

YUJI NAKAMURA

TOKYO KEIZAI UNIVERSITY

In this chapter, I will (a) examine various kinds of validity in language testing in general, (b) discuss which types of validity are most suitable for a test of English speaking ability in a classroom setting, and (c) describe a validity case study which illustrates the types of validity (i.e., construct validity, concurrent validity, face validity, and washback validity) which can be used in the development of a semi-direct speaking test suitable for Japanese college students.

Various Kinds of Validity in Language Testing

Spolsky (1975) says that the central problem in foreign language testing, as in all testing, is validity. Messick (1988) further states that although the modes and methods of measurement may change, the basic maxims of measurement, and especially of validity, will likely retain their essential character. Although there are other aspects of testing (such as reliability and practicality) which are crucial and should not be overlooked in language testing, there is no doubt that validity is the single most critical element in constructing foreign language tests.

Validity concerns the question of "How much of an individual's test performance is due to the language abilities we want to measure?" (Bachman, 1990, p. 161). In other words, it is concerned with how well a test measures what it is supposed to measure (Thrasher, 1984).

Traditionally, validity has been classified into several types, including at least content, criterion-related, and construct validity (Bachman, 1990). These are all concerned with the relationship between the test and the domain to be measured (Davies, 1990). Further, Messick (1993) and APA (1985) argue that validity is a unitary concept.

Davies (1990) discusses five kinds of validity (face, content, construct, predictive, and concurrent). Among these, predictive validity and concurrent validity are usually called criterion-related validity or external validity because they have criteria outside of the proposed new test. Content validity and construct validity, on the other hand, are concerned with internal aspects of validity.

Thrasher (1984) classifies these five types of validity into two categories. One includes predictive, concurrent, and construct validity; the other, face and content validity. He claims that the former three are prestigious because elegant statistical processes are available for computing them, while the other two lack prestige because no such statistical processes can be applied to them.


In the following discussion, I will follow Bachman's (1990) view that we still find it necessary to gather information about content validity, predictive validity, concurrent validity, etc., and then investigate validity further from more detailed points of view. He also mentions the following issues as being relevant to validity: (a) cultural background, (b) background knowledge, (c) cognitive characteristics, (d) native language, (e) sex, (f) age, (g) ethnicity, and (h) washback/backwash effect (or the effect of testing on the curriculum).

Interestingly, Thrasher (1984) adds a new type of validity which he calls educational validity. This type of validity applies to the relationship between positive test effects and students' study habits, among other variables, taking into consideration the educational and testing situation in Japan. It is similar to Bachman's claim that positive washback will result when testing procedures reflect the skills and abilities that are taught in the courses.

In this chapter, I will discuss the various types of validity, basically following Thrasher's (1984) categorization, in the following order: predictive, concurrent, construct, face, content, and educational validity.

Predictive Validity

Thrasher (1984) claims that predictive validity is an estimate of the goodness-of-fit between test results and performance in some posttest real-world activity. Consequently, predictive validity is concerned with the predictive force of a test, the extent to which test results predict some future outcome (Davies, 1990). As we know from TOEFL (Test of English as a Foreign Language) results, this fit between test results and performance in a posttest real-world activity is central to predictive validity. This is an appealing aspect of predictive validity and the associated statistics, since prediction is an important and justifiable use of language tests.

Predictive validity, however, also has some weak points. Since this type of validity compares a test with some criterion that is measured after the test results are in, the nature of the criterion skill or ability becomes very important. TOEFL results, for example, have been compared to college and graduate school performance, yet such performance is obviously affected by many factors other than English proficiency. In other words, measures that are valid predictors of some future performance are not necessarily valid indicators of language ability (Bachman, 1990). This point is crucial because language tests should measure language abilities, and nothing else. Therefore, we should be careful in deciding what abilities are being measured.

Concurrent Validity

Davies (1990) says that concurrent validity, in its purest form, can be established only when the test under scrutiny represents either a parallel or a simplified version of the criterion test and, as Hughes (1989) says, when the test and the criterion are administered at about the same time. This last condition is one of the things that differentiates concurrent validity from predictive validity as types of criterion-related validity.
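In practice, concurrent validity is usually estimated by correlating scores on the new test with scores on the criterion measure taken at about the same time. The sketch below, with hypothetical scores, shows that computation; it illustrates the general procedure rather than a method taken from this chapter.

    # Pearson correlation between a proposed speaking test and a criterion test
    # administered at about the same time (hypothetical data).
    from math import sqrt

    def pearson(x, y):
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
        sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
        return cov / (sd_x * sd_y)

    new_test  = [55, 62, 70, 48, 80, 66, 59, 73]   # scores on the new classroom speaking test
    criterion = [50, 60, 75, 45, 85, 60, 55, 70]   # scores on the established criterion test
    print(round(pearson(new_test, criterion), 2))  # a high coefficient supports concurrent validity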

Construct Validity

Bachman (1990) states that construct validity concerns the extent to which performance on a test is consistent with predictions that we make on the basis of a theory of abilities, or constructs. In brief, construct validity examines whether the test matches a theoretical construct (Thrasher, 1984).

Bachman further insists that construct validity is a unifying concept, which is supported by Messick's (1993) idea that construct validity is the unifying concept that integrates criterion and content considerations. What is it that constitutes this unifying concept of construct validity?

Bachman (1990) claims that construct validity requires both logical analysis and empirical investigation. Logical analysis is involved in defining the constructs theoretically and operationally. Construct validity has both strong points and weak points. One strength is that it can be analyzed statistically through the multitrait-multimethod approach or factor analysis, which avoids the reduction problem due to the external criterion which occurs in concurrent validity. One weak point is that testers tend to view technical terms (such as grammar, vocabulary, and reading) as almighty words which represent theoretical constructs and can measure the whole range of language abilities. Since the word construct refers to any underlying ability (or trait) which is hypothesized in a theory of language ability, and since this construct is complex, construct validity cannot be established overnight.

Moreover, as Hughes (1989) says, it is through construct validation that language testing can be put on a sounder, more scientific footing, and this is the strongest argument for establishing construct validity.

Face Validity

Thrasher (1984) questions the utility of face validity when he asks, "What is the meaning of the word 'reasonable' in the sentence 'the test looks reasonable'?" However, I support Heaton's (1989) belief that "in the past, face validity was regarded by many test writers simply as a public relation exercise. Today, most designers of communicative tests regard face validity as the most important of all types of test validity."

Face validity is hardly a scientific concept, but it is very important (Hughes, 1989). Test appearance (face validity) is a very important consideration in test use because a test which does not have face validity may not be accepted by candidates, teachers, etc. (Bachman, 1990). The most important part of face validity in school tests is that students' motivation is maintained if a test has good face validity, for most students will try harder if the test appears fair (Heaton, 1989). The weakest aspect of face validity is also discussed by Heaton (1989), who says that language tests which have been designed primarily for one country and are adopted by another country may lack face validity in the new setting. In general, tests usually have high face validity because they look like other tests of that skill that the student has previously taken. So if a test from one country does not look like the tests with which students are familiar in their own country, it will be weak in face validity.

Finally, face validity cannot stand alone. More precisely, in order to make face validity convincing, we must have very strong evidence of the other types of validity (cf. Thrasher, 1984).

Content Validity

Thrasher's (1984) statement about content validity is interesting. He says that even if the language teacher does not consider content validity, his or her students will demonstrate the need for it (probably by analyzing, criticizing, or complaining about the content of the test).

As Davies (1990) says, content validity is defended through professional judgments, either by teachers or by testers. The judges rely on their knowledge of the language to judge to what extent the test provides a satisfactory sample of the syllabus, whether real, imagined, or theoretical.

A test is said to have high content validity if its components constitute a representative sample of the language skills and structures that it is meant to test (Hughes, 1989). Accordingly, a careful analysis of the language being tested and of the particular course objectives is needed to demonstrate content validity. Two possible strong aspects of content validity are as follows:

1. We must consider a wide variety of knowledge and content coverage (e.g., definition of the content or ability domain, a list of content areas, test tasks, etc.) (Bachman, 1990).

2. The greater a test's content validity, the more likely it is to be an accurate measure of what it is supposed to measure (Hughes, 1989).

However, there are some weak points, too. Firstly, content validity says nothing about the appropriateness of what has been taught; in other words, we cannot be sure if we are teaching the right thing (Thrasher, 1984). Secondly, content validity (content relevance in Bachman, 1990) is by itself inadequate as a basis for making inferences about abilities, because content validity looks only at the test and does not consider how test takers perform (Bachman, 1990). Thirdly, if content validity is misunderstood and the content of a test is determined by what is easy to test rather than by what is important to test, the test is likely to have a harmful washback effect: areas which are not tested are likely to become areas ignored in teaching and learning (Hughes, 1989).

Since content validity has been the primary type used in achievement testing, the kind of measurement language teachers must do in their classrooms (Thrasher, 1984), the weak points related to the appropriateness of teaching should be reconsidered.

Educational Validity

As mentioned above, Thrasher (1984) offers educational validity as a sixth validity in addition to the other five standard validities (predictive, concurrent, construct, face, and content). After pointing out the characteristics of these five validities, he was still not satisfied because of the weak points of content validity, especially in terms of classroom tests (for achievement) rather than qualification tests (for proficiency). His reason for feeling this way was explained by one statement:

". . . content validity says nothing about the appropriateness of what has been taught although it tells you if the course and test content match," and ". . . we are not sure if we are teaching the right thing."

This is closely related to his criticism of Spolsky's (1975) statement that "foreign language tests used by classroom teachers have few problems in validity, because the textbook or syllabus writer has already specified what should be tested."

Thrasher (1984) thought that content validity was not sufficient from the viewpoint of the appropriateness of teaching, and he suggested using the notion of educational validity for considering the tight relationships among testing, teaching, study habits, test results, and course objectives in terms of the positive washback effects of the tests.

Although Bachman (1990) deals with washback effects regarding validity, Thrasher's (1984) idea of educational validity is more practical and relevant for Japan's educational system. Thrasher (1984) takes up the fact that college entrance exams have great influence in Japan. These tests have a crucial impact on what students learn, and he wonders if we can measure the result of the impact. Thrasher's concept of educational validity includes two assumptions: (a) any test has effects, good and bad, on student morale, study habits, and understanding of what the course of study is trying to accomplish; (b) such effects can be measured and judged beneficial or detrimental to the goal the teacher has laid down (Thrasher, 1984). He assumed two hypothetical objections: (a) test effects cannot be measured, and (b) study habits are determined not only by the test but also by a complex of cultural, psychological, and educational background factors. Thrasher refuted both objections convincingly, in my opinion.

Coincidentally or not, Bachman (1990) points out, in his definition of validity, some additional elements (such as the student's culture, educational background, washback effects, ethnicity, etc.) in addition to other traditionally established aspects of validity (such as logical or empirical analysis).

If educational validity can be established as a separate type of validity, I think classroom teachers will benefit greatly.

Types of Validity Suitable for Testing Speaking Ability

A test of speaking ability in a classroom setting is usually used as an achievement test. According to Davies (1990), an achievement test should have both face and content validities. I would argue that predictive validity and educational validity, as well as construct and concurrent validities, should also be analyzed.

First, as Davies says, content validity is unavoidable for a classroom speaking test which has the characteristics of an achievement test. Since content validity simply asks if the test content (vocabulary, grammar, and tasks) matches the content of the course of study, what testers (teachers) can do is to match the course objectives and syllabus design (which themselves should be based on construct validity) with the test items. This effort should reduce students' complaints about the content of the speaking test. In the traditional understanding of content validity, tasks are less important than the match between test and classroom vocabulary and grammar. This attitude toward tasks by teachers is crucial in a classroom test because teachers may unconsciously tend to use test tasks which are different from the course objectives, especially when oral/aural aspects are involved.

Second, construct validity is concerned with matching the theory of speaking and the tasks the test-maker requires the test takers to perform. This cannot easily be handled by classroom teachers because of the technical and statistical analyses involved and because of the abstract nature of language abilities. Construct validity is the most fundamental validity for a speaking test, however, even if it is difficult to carry out, because the test tasks themselves (the speaking tasks in this case) are of primary concern in construct validity.

Third, face validity is a must in a classroom speaking test. Semi-direct speaking tests like tape-recorded tests have much more face validity than indirect tests of speaking skills like paper-and-pencil tests; accordingly, students' motivation is promoted and maintained for speaking, and test results will be more reliable. Although there is an argument that the move from multiple-choice to productive tests usually means reduced reliability, we need to make a distinction between task reliability and scorer reliability when it comes to speaking tests. Task reliability will be enhanced by having the students do something they believe is a valid speaking activity, which means focusing on task reliability. Students' speaking abilities should be measured in a test in which they think they are taking a speaking test. Ideally, a direct speaking test such as an interview test is the best; few institutions can, however, conduct interviews because of financial and practical considerations.

Fourth, predictive validity is feasible in many schools for the following reason: We can check students' posttest real-world oral activities in English-speaking countries when they go overseas during the vacation, because many schools have a SEA (Study English Abroad) program and send students to English-speaking countries every year, for a few weeks to a year. We can compare students' in-class test results with their results in real-world communication in English-speaking countries.

Next is educational validity. Although this is not an established form of validity, the idea of the relationship among testing, teaching, study habits, and test results from the viewpoint of positive washback effect is highly recommended in a classroom speaking test. If students change their study habits (from focusing on grammar-centered study to focusing on spoken English, listening to English, and trying to speak out), that is one of the objectives of the course and the speaking test. This is a case where a speaking test can have good effects on students. Although there may be a long way to go before this sort of validity can be established, educational validity may turn out to be an important factor in justifying the usefulness of a classroom speaking test to promote the speaking ability of Japanese students.

Lastly, concurrent validity may not be easy to determine if we only think in terms of having students take two tests at about the same time and comparing the results. Regrettably, we cannot find a relevant criterion test, since it should itself be valid, reliable, and practical. There are, of course, some imported speaking tests such as the FSI (Foreign Service Institute) Interview test or the ACTFL (American Council on the Teaching of Foreign Languages) OPI (Oral Proficiency Interview) test, for intermediate and high level students. However, those tests are not relevant for lower level students because of the difficulty level of the test items.

Rather than waiting for the completion of another speaking test, classroom teachers could use native speaker teachers' assessments as a criterion. We could use conversation teachers' class grades or their estimates of speaking ability as criteria and investigate concurrent validity by comparing the test results and teachers' class grades or estimates of speaking ability.

What we need to do for this purpose is to achieve high interrater reliability for these assessments. One way to do so is to find a reliable teacher as an estimator. Another is to conduct a comprehensive training session to create high interrater reliability among native speakers (NS). Still another way is to have NS estimators focus only on the evaluation of speaking ability rather than non-language aspects like attendance, effort, or submission of homework. Since there are many native speakers coming to secondary schools through the JET (Japan Exchange and Teaching) or AET (Assistant English Teachers) programs, we could measure students' oral activity outside the testing situation with the help of these native speakers.

Even in colleges, the number of NSs is increasing. We could compare students' in-class speaking test results with their performance in talking with NSs under quasi-real-world situations. This type of validity investigation might be more practical in a school situation because non-native English teachers can easily compare the students' test results in their language classes with the same students' actual behavior or performance in English in native English speaking teachers' classes.

A Case Study

Background of the Test Development

Performance testing (especially testing oral proficiency) has become one of the most important issues in language testing since the role of speaking ability has become more central in language teaching with the advent of the Communicative Approach. There is a great discrepancy, however, between the expansion of the communication boom and the accurate measurement of communication ability (especially speaking ability) because of the many difficulties involved in the construction and administration of any speaking test.

In this case study (Nakamura, 1993), I constructed a speaking test based on Bachman's Communicative Language Ability model (1990) and examined the detailed components of Japanese students' English speaking ability based on that model. I also checked the validity, reliability, and practicality of these new testing procedures.

Four Kinds of Validity Actually Used

The following four sub-categorizations of validity were examined in this case study:

1. construct validity: whether the test matches a theoretical construct (Thrasher, 1984)

2. concurrent validity: whether a new test is measuring the same thing as another, more established one (Thrasher, 1984)

3. face validity: whether the test looks valid or reasonable to the examinees who take it (cf. Weir, 1990)

4. washback validity: the effect of testing on teaching and learning (Hughes, 1989)

Other validities such as predictive validity or educational validity (Thrasher, 1984) were not dealt with individually in the present research. Predictive validity could not be implemented because of practical limitations, and educational validity is included in the broader definition of washback validity that I used.

Methods

Eighty college students took the test consisting of four tasks (Task I - Speech Making; Task II - Visual-Material Description; Task III - Conversational Response Activities; Task IV - Sociolinguistic Competence Test named Mini Contexts).

Eleven raters (four Japanese and seven native English speakers), all of whom had been teaching English for at least one year, evaluated eighty audio tapes on which the students' responses had been recorded in the language laboratory. The raters used the scoring sheet and scoring criteria designed by the present author. The first two tasks were rated on a four-point scale (below average, average, above average, very good) in each linguistic component (grammar, vocabulary, pronunciation, etc.). Conversational responses were rated on a different four-point scale (no answer, conversationally inappropriate, conversationally appropriate, very good). Sociolinguistic competence answers were rated on yet a third four-point scale (no answer, sociolinguistically inappropriate, sociolinguistically appropriate, very good).

Data Analysis

Each rater's raw score for each task was summed up to obtain a total score. Then, interrater reliability was measured through each rater's total score on the 80 tapes using Pearson's formula. The internal consistency was examined through Cronbach's alpha to establish another measure of reliability.

The four tasks were examined from the viewpoint of content validity. The four tasks should be mutually exclusive and only moderately inter-dependent to be composites of the proposed framework of speaking ability.

Correlation coefficients were also calculated between the four tasks and independent criterion measures. The concurrent validity was examined by looking at the correlation between the four tasks and teachers' class grades and a teacher's estimates. Factor analysis was conducted to examine the construct validity.
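The reliability computations described above are straightforward to reproduce. The following is a minimal sketch, not the author's original analysis, showing with hypothetical data how interrater correlations (Pearson) and internal consistency (Cronbach's alpha) could be computed from a matrix of raters' total scores.

import numpy as np

def pearson_interrater(scores):
    # Pairwise Pearson correlations between raters.
    # scores: array of shape (n_raters, n_examinees), each row one
    # rater's total scores for all examinees.
    return np.corrcoef(scores)

def cronbach_alpha(items):
    # Cronbach's alpha for internal consistency.
    # items: array of shape (n_examinees, n_items), one column per
    # item (or per rater, if raters are treated as "items").
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical example: 3 raters x 5 examinees (the study itself used
# 10 raters and 80 tapes).
scores = np.array([[18, 22, 15, 20, 17],
                   [19, 21, 14, 22, 18],
                   [17, 23, 16, 21, 16]])
print(pearson_interrater(scores).round(2))   # rater-by-rater correlation matrix
print(round(cronbach_alpha(scores.T), 2))    # alpha treating raters as items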

In the questionnaire analysis, I asked the 80 students who took the speaking test to answer a questionnaire on their impressions of the test from the viewpoint of their study habits toward the improvement of their speaking ability. The central question was, "Do you think this speaking test will change your study habits toward the improvement of your speaking ability?"

Results

Reliability. The interrater reliability was acceptable (over .74 among ten raters). In addition, a reasonably high correlation (the range was .74-.90 between individual native English speaking raters and Japanese raters) was obtained. This fact indicated that Japanese teachers by themselves can conduct the test and score the results in a classroom setting with little help or even without any help of NSs (within the reliability range of .74-.90). An internal consistency reliability estimate (over .84 for 10 raters) was obtained, and it indicated that the items were measuring the students' speaking ability fairly consistently.

Task correlations. The task correlation results demonstrate a strong relationship between Task I (Speech Making) and Task II (Visual-Material Description), and a tight relationship between Task III (Conversational Response Activities) and Task IV (Mini Contexts). The two pairs of strongly related tasks were apparently playing complementary roles with each other.

There was also a high task correlation (over .81) between Japanese and native English speaking raters. In other words, Japanese raters will be able to evaluate students' speaking ability in almost the same way as native English speaking raters. This task correlation also supported the construct validity of the test in that speaking ability consists of four partially divisible tasks, which are theoretically motivated.

Factor analysis. Through factor analysis, I was able to obtain two factors: Factor 1 (Linguistic Ability) and Factor 2 (Interactional-Sociolinguistic Ability). Apparently, Linguistic Ability is measured primarily by two tasks (Speech Making and Visual-Material Description) and Interactional-Sociolinguistic Ability is found in the other two tasks (Conversational Response Activities and Mini Contexts).

Discussion of the Validity Findings

Construct validity was examined through factor analysis and task correlations. Through factor analysis, two factors (Linguistic Ability and Interactional-Sociolinguistic Ability) were obtained. Task correlations further supported the construct validity of the test in that speaking ability consisted of partially divisible tasks (Speech Making Test, Visual-Material Description Test, Conversational Response Test, and Sociolinguistic Test). The two-factor structure, with the help of task correlations, partially supported the present author's proposed framework of speaking ability based on Bachman's Communicative Language Ability model.


Concurrent validity was investigated by comparing the results of the present test with students' grades in English Conversation classes and a teacher's estimate of students' speaking ability. The concurrent validity of this test was supported not by the class grades but by the teacher's estimate. This is probably because students' grades include non-language proficiency elements such as attendance, effort, and participation, among other things.

Washback validity was examined through a questionnaire administered to the students. Students' responses suggested that they will change their study habits (a) by focusing more on the productive aspects of their language skills, and (b) by paying more attention to the context, etc. It is hoped that this change will reach teachers, too, so that teachers will modify their teaching styles from grammar-oriented classes to communication-oriented classes.

Face validity was partially supported by the present author's informal talk with students. They were excited about taking this type of unfamiliar, but seemingly authentic, speaking test. Therefore, they were highly motivated to speak out in the testing situation.

Conclusion

In studying our language tests, we should strike a good balance among the various kinds of validity depending on the situation in each institution. However, by far the most important validity is construct validity. Classroom teachers must always consider whether they are really measuring what they intend to measure. There are at least three ways to look at the construct of speaking: (a) the nature of speaking, (b) the theoretical or linguistic underpinnings of speaking, and (c) classroom teachers' ideas based on their teaching experiences. As Weir (1993) says, tests should be theory driven. The theoretical part is indispensable; nevertheless, the viewpoints of experienced teachers should also play a significant role in the process of test construction. By alternating back and forth between the theoretical and practical aspects of a speaking test, classroom teachers can always focus on what they are actually measuring.


Note

Eventually, 10 raters' (four Japanese and six native) results were used for the statistical analyses.

References

APA (American Psychological Association). (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Davies, A. (1990). Principles of language testing. Oxford: Basil Blackwell.

Heaton, J. B. (1989). Writing English language tests. London: Longman.

Hughes, A. (1989). Testing for language teachers. Cambridge: Cambridge University Press.

Messick, S. (1988). The once and future issues of validity: Assessing the meaning and consequences of measurement. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 33-45). Hillsdale, NJ: Lawrence Erlbaum Associates.

Messick, S. (1993). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed.) (pp. 13-103). New York: American Council on Education/Macmillan.

Nakamura, Y. (1993). Measurement of Japanese college students' English speaking ability in a classroom setting. Unpublished doctoral dissertation, International Christian University, Tokyo.

Spolsky, B. (1975). Language testing: The problem of validation. In L. Palmer & B. Spolsky (Eds.), Papers on language testing 1967-1974 (pp. 147-153). Washington, DC: TESOL.

Thrasher, R. H. (1984). Educational validity. Annual reports, International Christian University, 9, 67-84.

Weir, C. J. (1990). Communicative language testing. London: Prentice-Hall.

Weir, C. J. (1993). Understanding and developing language tests. London: Prentice-Hall.


Section V

Innovative Testing


Chapter 15

Negotiating a Spoken-English Scheme with Japanese University Students

JEANETTE MCCLEAN

TOKYO DENKI UNIVERSITY

Since I am a teacher of English "conversation" in a private Japanese university, my students are typical: aged between 18 and 22, they follow a four-year degree course for which English is not a major subject. Most are false beginners (i.e., learners previously exposed to English instruction but having weak aural skills). Part of my role is to set examinations for these students and award a final year-end grade. Until the start of the project described in this chapter, the students' oral English ability, like that of many others at universities throughout Japan, had been measured through reading or writing tests. The likelihood was that this practice would continue, despite warnings about the harmful effect of such indirect tests, which encourage candidates to develop their ability to handle indirect rather than realistic tasks (Hughes, 1989; Brindley, 1989; Weir, 1990).

To continue to test oral English proficiency using indirect testing techniques in a country which now places so much importance on spoken English proficiency is inappropriate. As Gruba (1993) states, "Test developers in Japan must seek to make examinations that are themselves responses to changes in the field" (p. 33). In the absence of any kind of framework for direct assessment of spoken English at this particular university, a research experiment was set up in 1992-93 to design and implement a direct test for assessing the oral ability of freshmen. This chapter is based on the action research conducted as a part of this project.

Problems in Assessing Spoken English at Japanese Universities

Before contemplating an innovation of this kind it is important to consider the context in which the new form of testing is to be implemented. By means of Heron's (1981) four-part assessment process, some of the problems that testers of spoken English in Japanese universities encounter are considered below.

What to Assess. In deciding what to assess, test developers face a fundamental dilemma about what actually constitutes oral competence in spoken English. Canale and Swain (1993) define communicative competence as an oral skill involving grammatical, sociolinguistic, discoursal, and strategic competence, while Bachman (1990) defines communicative language ability as "the ability to use language communicatively and involves both knowledge of or competence in the language, and the capacity for implementing, or using this competence" (p. 81). Weir (1990) stresses that, unless speakers are able to utilize the newly acquired language appropriately according to the demand of the situation, their knowledge of language effectively counts for nothing. Fairclough (1992), on the other hand, dismisses the whole idea of appropriateness as a sociolinguistic ploy and the "political objective of the dominant, 'hegemonic,' sections of a society in the domain of language as in other domains, but it has never been a sociolinguistic reality" (p. 49). Taking a more practical view of interaction, Weir and Bygate (1993) claim that competent speakers communicate by means of either informational routines or interactional routines: improvising, negotiating, dealing with communication problems as they arise, and managing the interaction, as well as controlling the topic, the development, and duration of the content.

While these notable theorists do hold in common the belief that communicative competence involves use of the language, appropriateness, and communicative strategies, the divergence of opinion about how this is manifested provides little reassurance for testers in Japanese universities, who are entirely responsible for deciding what they will teach to their students (Benson, 1991; Evanoff, 1993) and are also responsible for deciding how, and on what, students will be examined. It should come as little surprise, therefore, that what little oral assessment has been done in Japanese universities so far has been isolated, haphazard, lacking in form, and subjective. Testers are confused about what specifically to focus on when assessing test-takers' oral competence. The autonomous nature of teaching in Japanese universities and the lack of cohesion in testing spoken English is worrisome and exists despite efforts to work within a communicative paradigm and despite Monbusho encouragement to teach language communicatively.

Which Criteria to Use?

The second decision concerns which criteria to use for assessment. Testers are not only faced with the problem of what to assess, but must also face the immense cultural differences between English L1 speakers and Japanese speakers of English. Testers must ask themselves whether test-takers should be assessed on their "inner directed" behavior based on the social roles and models in Japanese society, or their "other directed" behavior based on Western, and particularly American, social norms. Alternatively, should the criteria adopted be more international and pertinent to all nonnative speakers of English, or specifically oriented to the profile of a Japanese university student learning English? Test-makers and test-markers (those who must mark and score the tests) face an agonizing dilemma over this issue alone.

For testers in Japanese universities, it is also important to bear in mind that the predominant method of grading learners is norm-referenced and has its roots firmly based in the traditional university system. The aim of the scoring procedure is to rank test-takers in relation to their fellow test-takers by means of the score or percentage attained. Unfortunately, this evaluation method bears little relation to the learner's competence in the target language. However, alternative criteria for language assessment based on performance objectives are being adopted more and more, utilizing techniques which seek to focus on the learners' mastery of the language. Criterion-referenced tests (CRT) evaluate an individual's performance of specific communicative skills (Brown, 1988), and provide information directly about what the learner can actually do with the target language. If testers in Japanese universities choose to adopt this method of assessing their students, not only must they be able to identify the skills inherent in spoken English, they must then create or select the appropriate criteria against which to assess their testees' performance. Such choices are made all the more difficult by the differences in cultural perspectives mentioned above.

How to Apply the Criteria

The next decision is concerned with how to apply the criteria selected. For those wishing to adopt a more performance-based approach to assessment, the difficulty of converting a CRT-oriented assessment scheme, which may involve several scores, to the required single grade is strained by the very structure of the grading system used in Japanese universities, a system which is not negotiable. In a fairly typical Japanese university, an "A" grade may be awarded for a score of 80% and above, a "B" for 60%-75%, a "C" for scores within a 5-point range around 60%, and a "D" (failure) for a score below 55%. A fundamental flaw exists in a grading structure of this type because the number of scores possible within the top and bottom grades (i.e., "A" and "D") is heavily weighted, while the number of scores possible in the middle range (i.e., grades "B" and "C") is limited. A graph plotted to show the number of scores possible within the given grade ranges shows a distribution in a U-shaped curve, the inverse of a normal distribution. A normal distribution (a bell curve) assumes the majority of scores fall within the middle range of scores (i.e., grades "B" and "C") while high and low scores are rare. Perhaps this U-shaped structure might account for some of the unusually high number of "A" grade graduates produced by Japanese universities.
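The weighting problem is easy to see if one simply counts how many integer percentage scores fall into each grade band. The sketch below uses simplified, hypothetical cut-offs (not the exact bands quoted above, which overlap as reported) purely to illustrate why the extreme grades "A" and "D" cover far more possible scores than "B" and "C".

# Hypothetical, non-overlapping cut-offs chosen only to illustrate the
# U-shaped weighting; actual bands vary by university.
bands = {
    "A": range(80, 101),  # 80-100
    "B": range(70, 80),   # 70-79
    "C": range(60, 70),   # 60-69
    "D": range(0, 60),    # below 60 (failure)
}

for grade, scores in bands.items():
    print(f"{grade}: {len(scores)} possible integer scores")
# A: 21, B: 10, C: 10, D: 60 -- the extremes dominate, the inverse of
# a bell curve centered on the middle grades.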

A paradox exists within the system: the importance of test results can be challenged by teachers. Teachers are able to dismiss student test results through experienced judgment, intuition, or for other reasons. Testers are rarely held accountable for their methods of grading. This, together with the absence of uniform test criteria, serves to show the power that teachers have to influence students' progress in the educational context (Heron, 1981).

In addition to the authority exercised by staff, Japanese university students are discouraged by their peers from displaying their oral ability in English. Inculcating a communicative teaching approach in a Japanese university is difficult because of the pressure to conform, rather than to display one's individual feelings, opinions, or personality in Japanese society (Nozaki, 1993). Japanese students are not trained in critical thinking and debate because social training has reinforced passive submission to authority (Jones, 1991a, 1991b; Nozaki, 1993; Davidson, 1994) where, to quote a well-known Japanese saying, "The nail that sticks up gets hammered down." The system offers little incentive for improving personal performance, a view apparently supported by the prevailing belief amongst many Japanese university students that the university is an "escalator system" where, "if both feet are placed in the bottom step, a student almost automatically progresses to graduation" (Nozaki, 1993, p. 28).

Doing the Assessment

For testers in Japanese universities the fourth part of the assessment process, actually doing the assessment, may well be like charting a course through a minefield. Time constraints for the completion of student assessment are one of the biggest obstacles to achieving reliable testing, and these are seldom negotiable. Add to this the limited number of staff and accommodation available for test administration, and circumstances dictate that the teachers/assessors must add a third role to their repertoires: that of test administrator.

The situation is compounded by one more factor, one that Jones (1992) cites as the loudest complaint of foreign staff in Japanese universities: large classes. In real terms, direct assessments of speaking ability, such as interviews, are time-consuming and difficult to administer if there are large numbers of candidates (Weir, 1990). In Japanese universities, it is commonplace for testers to be required to conduct English tests for 50 to 120 students within 90 minutes, and to have completed marking and grading from 200 to 600 students within one week prior to the examinations for "regular" subjects. Apart from the difficulty that such numbers impose, the likelihood of obtaining realistic discourse samples from the test-takers is hampered because the processing of interchanges under normal time constraints, as advocated by Weir (1990) and Morrow (1979), is simply not feasible under these conditions.

The staff shortage and large number of test-takers pose the additional problem of test security: how to prevent the students passing information on to each other. Fortunately, the very nature of spoken interaction is unpredictable, reciprocal, and adaptive (Bachman, 1990), which, for the student, means that it is difficult to replicate natural, interactive speech. For the assessor, these characteristics of spoken English mean that it is difficult to make tests for spoken English reliable. Weir (1990) stresses the importance of familiarity from the test-takers' point of view for a test to be acceptable; but Japanese university students are not used to being assessed orally and, despite careful briefing and preparation of candidates, it is difficult to make an oral performance test reliable in this context by limiting the element of unpredictability to the interchange alone. Furthermore, previous research has shown the difficulty of achieving marker consistency in assessing student oral performance in English (McClean, 1993; Nambiar & Goon, 1993).

Hence, in this context, testers are restricted in the form of assessment they can select to one that is predictable rather than desirable (Weir, 1990). This may, in part, explain the reticence thus far in Japanese universities for introducing direct tests to assess spoken performance in English, and the retention of entrance examinations with a strong bias towards grammar, reading, and writing skills (Benson, 1991) with no oral component (Jones, 1991a, 1991b).

Reasons for Initiation of the Testing Project

While acknowledging some of the more obvious problems affecting the introduction of direct testing of spoken English inherent in the present Japanese university system, I am ill-at-ease in the knowledge that unilateral authority of the teacher and unreliable evaluation methods leave students without a clear impression of their performance in spoken English. Learners are, instead, subject to the politics of knowledge present in an educational model which is rigid and highly authoritarian. I concur with Heron's (1981) observation that ". . . The time is ripe for an alternative, democratic model: that of equal human capacities which mutually support and enhance each other" (p. 61).

MCCLEAN

Kelly (1993) points out that the role of the university in Japanese society is the reverse of that seen in Western society. Whereas in the West the university phase is one of narrow specialization, in Japan university is seen as a broadening phase important for personal development, cultivating independence, and human relations. However, Japanese university students lack skills in critical thinking, they are transient, and they have only limited contact time with their teachers. Surely what we, as language teachers, should be doing is offering our students opportunities to experience different learning styles and providing valuable broadening experiences in which the learners can rehearse a degree of autonomy in their maturation process. Collaborative assessment offers one such avenue because it provides an intermediary stage between self-determination and unilateral assessment. As Kelly further notes,

Traditional Japanese patterns of interaction and problem solving are powerful in their own right, but limited: they are still rooted in an agricultural value system with a predisposition toward constancy rather than change. (p. 185)

Because of the low level of English, weak fluency skills, and limited amount of spoken language produced by Japanese university students in any given interactive situation, evaluation of their oral proficiency is very difficult for an assessor. It seems appropriate, therefore, that what English language teachers should now be seeking is a model of education in which intellectual, emotional, and interpersonal competence go hand in hand.

The democratic model generates a set of guiding norms for the management of feeling (Heron, 1981). Encouraging the opening of feelings in Japanese students is important not only because the L1 culture actively discourages the display of emotion, but also because they are learning a language in which communication is based on the intention of the speaker. For Japanese learners, English is a counterculture and provides much of what Japanese culture withholds: the sense of one's uniqueness, direct expression of one's opinions, and a focus on content rather than form (Kelly, 1993). Furthermore, English offers an alternative mindset for dealing with modern problems by offering a greater variety of responses to the increasingly diverse situations modern Japanese are faced with. In many ways, Japanese university students are not yet ready for oral assessment of their spoken English skills. If university level is the phase when Japanese students are expected to develop self-confidence, individuality, and a clearer understanding of life, then a teacher-dominated, lecture-test approach is inappropriate and counterproductive at this stage.

The outcome of two initial trials to develop a suitable test and grading scheme at this university showed that reliability in terms of test format and marker consistency was simply impossible to achieve in view of the problems described above. It therefore seemed more productive to switch the emphasis in testing at this university from the product, i.e., "what" language the students had learned or produced, to the process, i.e., "how" one goes about communicating in the L2. Heron (1981, p. 64) advocates that process assessment is more important than content assessment because ". . . procedural competence is more basic than product competence, since the former is a precondition of providing many good products."

We decided, therefore, to focus on the development of a new grading scheme through which students could be familiarized with the techniques and skills involved in the process of oral communication. To be of any real benefit to participants in the testing process, however, it was essential for all parties to negotiate and agree on assessment criteria, against which the performance of the freshmen would be assessed in their final examination. The process employed to arrive at a Negotiated Grading Scheme (NGS) is described below.

Developing an NGS: The Process

Step 1: Brainstorming. This involved searching for a common point of reference from which to start the collaborative process. As their teacher, I was concerned to find out what the freshmen actually understood about speaking ability and the skills involved in spoken English. When asked to describe a good speaker of English, it was distressing to discover that, after completing two-thirds of a course which emphasized the skills of spoken communication, the students were unable to even attempt a description of a good speaker of English.

This, then, was the starting point. I decided to present the freshmen with examples of good, fair, and weak English speakers from amongst the target test population to see if they could distinguish between speakers' proficiency. Two videos were presented of two pairs of students performing an information gap task (IGT) under examination conditions. After viewing the videos, the freshmen were asked to comment on whether they thought the speakers were good, fair, or weak English speakers. They responded that some candidates' performances were fair but others were "more fair"!

To give the freshmen the opportunity to focus more carefully on the first pair of speakers, half of the class were asked to observe Speaker 1 (S1) and the other half to observe Speaker 2 (S2). After viewing the speakers again, the students wrote three things their speaker was good at and three he was poor at. For instance, "S1 has clear pronunciation, he can use long sentences, and he appears confident and relaxed; however, he doesn't make eye contact or try to help his partner." Gradually, descriptions of the speakers' performances were developed by eliciting characteristics from the students and collating these on the board. The same process was repeated with the second pair of speakers. After viewing the second IGT, discussion followed about what the freshmen considered to be a good, fair, and weak speaker of English. For homework they were asked to write a short profile of these three levels of speaker.

Step 2: Correlating and Condensing. The next step was the development of a table of descriptors (Table 1). Descriptions were collected from the students and, together with input from the teachers, a comprehensive table describing the skills involved in spoken English was drawn up by the teachers.

To honor input from all participants (in the spirit of cooperation), it was important to retain the terminology used by the freshmen as much as possible. Students' responses were correlated and simplified into short descriptions. To simplify comparisons, wherever possible descriptors were matched across the three levels by adding intensifiers. Single descriptors considered important were also included in the table and placed under what the teachers considered the appropriate level.

Table 1. Descriptors for Speakers of English

A Poor Speaker: lacks confidence; is nervous; is shy; is afraid to make mistakes; makes no eye contact; does not use gestures; uses body to communicate, not words; speaks in a quiet voice; pronunciation is not clear; has a small vocabulary; speech is not natural; has flat intonation; thinks a long time between words, makes long silences; is hard to understand; can't describe what he wants to say; can sound rude.

A Fair Speaker: tries to communicate; is relaxed; is a little afraid to make mistakes; often makes eye contact; tries to use body language; sometimes can't be heard; sometimes makes pronunciation mistakes; uses easy words; sometimes speaks smoothly; has some intonation; tries to speak fast but makes pauses often; tries to speak but is sometimes hard to understand; is not perfect but listener can understand; can describe what he wants to say but it is not easy.

A Good Speaker: is confident; is not afraid to make mistakes; makes good eye contact; uses body language; speaks in a loud voice; has clear pronunciation; has a large vocabulary; speaks smoothly; speaks expressively; can speak fast or slow; is easy to understand; helps other speakers; communicates well with the listener; sounds polite.

It was interesting that amongst the criteria freshmen indicated as important in speaking English were qualities such as confidence, a positive attitude to learning, and characteristics which might be considered non-communicative. Brindley (1989) supports the practice of including non-communicative criteria in CRT-assessment if they are considered important. It is also interesting that Jones (1991a, 1991b) revealed from his survey of English language teachers in Japanese universities that these are the very same qualities teachers are most concerned to boost in their learners.

Step 3: Jigsaw Exercise. Step 3 was an important consultative phase in the collaboration process, when the descriptors were taken back to the freshmen for approval. To encourage them to think critically, the table of descriptors was cut up into individual descriptors, placed in an envelope and given to small groups of students. First they were asked to arrange the descriptors under three headings, thereby developing profiles of good, fair, and weak English speakers. They were invited to discard any descriptors they felt were irrelevant or inappropriate, or add any they felt should be included and had been omitted.

In the second task, the freshmen were asked to group the descriptors covering similar skills or characteristics into categories across the three ability levels (e.g., descriptors about nonverbal communication, expressiveness, etc.). Once this was completed, the freshmen were asked to rank the categories in order of importance. In the final task, they were asked to give a title to each cluster of descriptors.

Clearly this task was the most difficult for the students. Some interesting titles emerged, such as "Attitude/Heart/Emotion"; "Body Language/Gestures/Action"; "Understanding/Listener's think/Easy to understand"; "Knowledge/Vocabulary/Words/Communicativeness"; "Confidence/Relaxation/Feeling/Spirit," revealing the possibility of significant cultural differences in perception between participants about various elements of speech. Interestingly, very similar terminology was elicited on a separate occasion from Japanese English language teachers when performing the same exercise, which serves to show the importance of this step in arriving at mutually acceptable criteria and terminology.

Figure 1. Spoken English Ability Performance Profile

English Oral Examination, January 1995: Performance Profile (Name; Student No.)
Categories (columns of the grid): Attitude & Confidence; Body Language; Expressiveness (pronunciation, intonation & volume); Understandability (for the listener, is the message delivered clearly?); Communicative ability (can the speaker say what he/she wants to say?)
Points (rows of the grid): Good = 5; 4; Fair = 3; 2; Weak = 1
Total: 25


Step 4: Drawing up the Grading Scheme. Again the input of the freshmen was correlated by teachers, and two rating scales were developed: a Performance Profile and a Description of Performance.

1. The Performance Profile. In keeping with the CRT-orientation of communicative testing, the Performance Profile (PP) (see Figure 1) sought to present an individual profile informing freshmen about their individual mastery of certain skills of spoken English. The five most important categories identified by the freshmen were selected for inclusion in the PP, the underlying components of which are the criteria identified by participants in Steps 1, 2, and 3 above. The categories are represented on the horizontal axis of the assessment grid.

In response to the students' initial dissatisfaction with the distinctions "good," "fair," and "weak" speaking ability, two additional levels were included in the vertical axis to fudge the distinctions, but for the sake of consistency these original terms were retained. Five bands of ability, ranging from weak (level 1) to very good (level 5) and represented on the vertical axis, were included to allow the marker a wider scope and more flexibility in marking. Test-takers would be awarded a score of between 1 and 5 according to their performance in each category and given a total score out of 25 for their performance in all five categories.

Figure 2. Description of Performance

Very Good Speaker (21-25 points): Is very confident and relaxed; maintains eye contact and uses body language well; speech is clearly heard and pronunciation is clear and accurate; speaks expressively with varied intonation; seldom makes pauses and speaks fast or slow with ease; has a large vocabulary and can say what he/she wants to say with ease; helps other speakers; makes relevant replies.

Good Speaker (16-20 points): Is confident and not afraid to make mistakes; often uses body language and makes eye contact; usually speaks in a clear voice but occasionally mispronounces; is usually expressive and often sounds polite; usually speaks smoothly but occasionally rephrases; vocabulary is adequate and can describe what he/she wants to say; listener usually understands; replies are usually relevant.

Fair Speaker (11-15 points): Tries to communicate but is a little afraid to make mistakes; uses body language and frequently makes eye contact; sometimes can't be heard; frequently mispronounces; intonation is fairly expressive but occasionally sounds rude; makes pauses often, sometimes speaks smoothly; uses easy words, can describe what he/she wants to say but it is not easy; is frequently hard to understand; occasionally replies are off the point.

Developing Speaker (6-10 points): Often lacks confidence, is afraid to make mistakes; uses many Japanese gestures, occasionally makes eye contact; pronunciation is very Japanese and often unclear; intonation is flat and often sounds rude; speech is not smooth and has long pauses; has a small vocabulary and often has difficulty describing what he/she wants to say; is often difficult to understand; replies are frequently off the point.

Weak Speaker (1-5 points): Lacks confidence, is nervous, shy and often does not communicate for fear of making mistakes; uses body language to communicate, not words; does not make eye contact; speaks quietly, pronunciation is unclear, intonation is flat; speech is slow and makes long silences; has a very small vocabulary and cannot describe what he/she wants to say; is very difficult for listener to understand; replies are often off the subject.

2. The Description of Performance. In contrast to the PP, the Description of Performance (DOP) (see Figure 2) is a holistic scale and attempts to describe the learner's overall performance as an English speaker. In this respect, the DOP resembles a norm-referenced type of grading scheme. On this scale, each level is equivalent to a range of ability (i.e., weak, developing, fair, good, very good). For each level, a description of a model speaker was drawn up from the table of descriptors elicited from the freshmen in Steps 1, 2, and 3.

In formulating these two rating scales, it was anticipated that the test-taker would first be assessed on the PP and then separately on the DOP. The assessor would then compare the total score on the PP and the score band on the DOP to check that the scores were equivalent. Where the score given by the assessor on the PP did not fall into the equivalent band on the DOP, the assessor would need to reconsider the PP score, or re-mark the test-taker. In this way, the DOP acted as a check to ensure consistent scoring.
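The cross-check between the two scales is essentially a small lookup and comparison. The sketch below is a hypothetical illustration (not part of the original study) of how a PP total, built from the five category scores, could be mapped onto the DOP bands shown in Figure 2 and flagged when the assessor's holistic DOP rating disagrees.

# Hypothetical helper for the PP/DOP cross-check described above.
PP_CATEGORIES = ["Attitude & Confidence", "Body Language", "Expressiveness",
                 "Understandability", "Communicative ability"]

DOP_BANDS = [  # (lowest total, highest total, band label), as in Figure 2
    (21, 25, "Very Good Speaker"),
    (16, 20, "Good Speaker"),
    (11, 15, "Fair Speaker"),
    (6, 10, "Developing Speaker"),
    (1, 5, "Weak Speaker"),
]

def pp_total(category_scores):
    # Sum the five 1-5 category scores into a PP total out of 25.
    assert len(category_scores) == len(PP_CATEGORIES)
    assert all(1 <= s <= 5 for s in category_scores)
    return sum(category_scores)

def dop_band_for(total):
    # Return the DOP band that a PP total falls into.
    for low, high, label in DOP_BANDS:
        if low <= total <= high:
            return label
    raise ValueError("total out of range")

def check_consistency(category_scores, assessor_dop_band):
    # Flag a test-taker for re-marking if PP total and DOP band disagree.
    total = pp_total(category_scores)
    expected = dop_band_for(total)
    return {"pp_total": total,
            "expected_band": expected,
            "assessor_band": assessor_dop_band,
            "needs_remark": expected != assessor_dop_band}

# Example: PP scores of 4, 3, 4, 3, 3 give a total of 17 ("Good Speaker");
# an assessor's holistic "Fair Speaker" rating would trigger a re-mark.
print(check_consistency([4, 3, 4, 3, 3], "Fair Speaker"))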

Step 5: Trialing the Rating Scales. The Negotiated Grading Scheme (NGS) was trialed by both the freshmen and two testers. In the first trial, freshmen were shown pairs of university students performing an information gap task on video and were asked to rate one speaker's performance on the PP in one category only ("Expression"). Afterwards, scores were compared and difficulties experienced in marking were discussed. The process was repeated, but this time, freshmen were asked to rate the speaker's "Attitude/Confidence." Scores were again compared and discrepancies discussed. After the third showing of the video, the freshmen were required to mark the speaker's performance in the remaining categories on the PP.


After discussion about the scores, the students were asked to give an overall rating of each speaker on the DOP before totaling the score on the PP. This final step operated as a necessary brake on the marking process, and an important precaution. It made the evaluators stop and think critically about the speaker's overall performance and discouraged them from making an immediate comparison between the PP and the DOP scores. From the equivalence or discrepancy between scores on the two rating scales, the evaluators could see how consistent they were in assessing their peers' performance.

The NGS was also trialed by two teachers who utilized the two scales for assessing freshmen in their final oral examination. The test involved groups of four students, who were given 15 minutes to discuss a picture or photo of their choice. Each test-taker was required to ask and answer a minimum of one question in the discussion. These test-takers were assessed simultaneously by the assessor, but the interactions were also recorded on video.

Bearing in mind that the test format used was not necessarily the most desirable, but was the most practicable under the circumstances, the grading scheme proved easy to use for the assessor as it facilitated quick and simultaneous marking of 4 candidates within 15 minutes. Only occasional re-marks were necessary, and these were made possible by the video recordings. In this way, the testing schedule was not disrupted. Interestingly, the assessors endorsed the freshmen's finding that the PP was easier to use for marking than the DOP, but felt that the DOP acted as an important check against which to confirm the PP score. Results showed that random re-scoring of these test-takers by the assessor indicated a very high degree of marker consistency (99 percent using the Pearson Product-Moment Correlation method).

Step 6: Feedback to Students on Their Results. After their examination, during the final class period of the year, freshmen were invited to collect their PP and DOP from the assessor and informally discuss their results. Most students were concerned to know how they had performed and, although some were disappointed with their results, they gained a more realistic idea of their performance under exam conditions. As their teacher, I was able to indicate discrepancies between classroom and examination performances and give freshmen a clear idea about which skills to focus on in the future to improve their spoken English.

Benefits To Participants

All participants agree that several important benefits for the students emerged from this project. On a questionnaire completed just prior to their examination (n=104), 64% of the freshmen stated that they enjoyed being involved in the negotiation process, 73% responded that as a result of this process they understood what skills a good speaker of English displays, and 70% felt that the descriptions in the grading scheme were clear to follow. The collaborative process was helpful, as 56% of the freshmen reported that they did not know their own strengths and weaknesses as English speakers before their course began. Many also stated that they were now less afraid of making mistakes. Before the examination, 61% of the test-takers believed that knowing how they were going to be assessed would improve their performance in the examination, a view which, I believe, indicates that the majority of freshmen had taken ownership of this grading scheme before the test took place.

From this feedback, it can be concluded that the collaborative approach increased the learners' confidence and reduced anxiety as, by the time they performed the examination, the freshmen knew what they were going to be assessed on. As an exercise in critical thinking, the development of this grading scheme encouraged individual initiative and gave freshmen a sense of control over a learning environment which for so long has controlled them. That 72% of respondents agreed that developing the grading scheme increased their desire to improve their spoken English shows the significance of consultation in fostering a high level of motivation amongst university students.

Although the collaborative process still allows the teacher to retain the power of veto, this experiment suggests that these particular learners are capable of setting their own standards and will become more responsible for their own progress in the future. In bringing about a shift in responsibility from the teacher to the learner, Japanese university students can be prepared for a change of teacher from year to year and, more importantly, for lifelong learning. By adapting this technique for peer evaluation in the classroom, rather than being a drawback for the learners, utilizing an NGS could actually be less threatening than the unilateral examiner/test-taker method and prove to be a more meaningful method of assessment for them. Such preparation is increasingly important for young graduates in Japan today, in view of the insecure and competitive job market they face on exiting the university. Japanese university students do, however, need to be prepared and trained in the collaborative process.

For the assessor, there were several benefits in utilizing this NGS. Firstly, a very significant degree of intra-marker reliability was achieved on the total scores, which shows a high degree of certainty and consistency on the part of the marker. This suggests that, not only is the process of agreeing on criteria beneficial for students, but it also helps clarify the criteria in the assessor's mind prior to marking. The inclusion of a second rating scale may also have raised intra-marker reliability because the DOP provided an immediate reference check, taking the pressure off the assessor to provide a "cut-off" score or be exact to one point on the PP. By instilling this element of flexibility, fatigue was reduced for the assessor, which is an important consideration for testers working under the strict time constraints imposed by Japanese universities. For instance, when the assessor remained undecided about a test-taker's performance, recording test-takers' interactions on videotape provided the means for an additional check without delaying the test schedule or recalling the test-taker.


The overall results of this experiment were also encouraging. The distribution of test-takers' scores obtained using the PP revealed a normal distribution (a bell-shaped curve) with scores falling over a fairly wide range (range = 19 out of 25). This range, together with the standard deviation of 4.2, shows a wide dispersion of scores, suggesting that the assessor has flexibility to award scores over a wider range using an NGS than a commercial grading scheme. In the initial trial to develop a suitable oral test, the TEEP2 grading scheme was utilized and, although scores were distributed over a wide range (range = 17 out of 20), the standard deviation was narrower (3.91) and a negatively skewed distribution of scores was obtained. Using the NGS, a median and mean of 13.5 was achieved, which was higher than that attained with the TEEP (median = 9; mean = 9.32). This suggests that an NGS is more suitable for use with this group of test-takers because on this scheme they can achieve a higher overall score, whereas the TEEP scheme is oriented toward all non-native speakers of English.
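
The descriptive statistics referred to here can be reproduced with a few lines of code. The following minimal Python sketch (scores invented for illustration; the real PP data are not reproduced in this chapter) computes the range, mean, median, and standard deviation of a set of scores out of 25.

import statistics

# Invented PP scores out of 25, for illustration only.
scores = [6, 9, 11, 12, 13, 13, 14, 14, 15, 16, 17, 18, 19, 21, 23, 25]

print("range  =", max(scores) - min(scores))
print("mean   =", round(statistics.mean(scores), 2))
print("median =", statistics.median(scores))
print("st dev =", round(statistics.stdev(scores), 2))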

Relatively low correlations were obtained, however, in individual categories using the NGS (Understandability = 0.638; Intelligibility = 0.218), suggesting that some skills are more difficult to mark and/or that the tester was accommodating to Japanese learners and, in particular, to her students. This simply illustrates the necessity of a collaborative process to train assessors to a common standard, in order to attain a high degree of inter-marker reliability, and the use of at least two markers to counter subjectivity in marking.

For the teacher, the main benefit of this CRT type of assessment scheme is that it provides information about learners' performances. Strengths and weaknesses can be isolated across the whole test population, allowing adjustments to be made in both curriculum and course content to accommodate the learners' needs. It also permits specific information to be gained about an individual's performance, a distinct advantage in a learning context typified by large classes and limited contact time. An NGS provides a tangible base from which the teacher can advise and cooperate with individual students, and together they can agree on attainable target levels of performance to aim for by the end of the following year. And, while there is the drawback of testing in Japanese universities that the assessor and teacher are usually one and the same person, this situation does allow discrepancies between learners' examination and classroom performances to be recognized.

Through this cooperative process, teachers are also prompted to think critically and become better able to judge the appropriateness of different forms of evaluation. This is an important point to consider in the light of the existing situation in English language departments in Japanese universities, which employ large numbers of part-time teachers and staff on short-term contracts who have varying proficiency and experience in the teaching and assessing of EFL/ESL.

Conclusion

The collaborative process used in this experiment increased trust between the students, teachers, and assessors because all parties' contributions were respected. Consultation built consensus, a vital pre-requisite for the acceptance of innovation in Japanese society, and the test and grading scheme achieved face validity in this context because they appeared to measure what they were supposed to (Hughes, 1989), i.e., competence in spoken English. There is no question that this grading scheme has much room for improvement, and further effort is required to arrive at a test with high degrees of content validity and test format reliability. But, given the lack of awareness about communicative competence, a pervading fear of change, and regrettably little desire for inculcating communicative forms of assessment in Japanese universities, the problem, as Gruba (1993) puts it, is that "An understanding of the inability of tests to serve as perfectly neutral, perfectly effective instruments to inform decisions of resource management is crucial here" (p. 4).

Developing an NGS to fit the particular test population is an important step towards achieving political acceptability (Henning, 1987) for a test of oral English in Japanese universities. It opens up the mysticism of assessment, making test criteria transparent, and in so doing encourages ethical marking by assessors (Gruba, 1993) and conformity to stated criteria, holds teachers accountable for the quality of their courses, and gives students the benefit of more effective teaching and evaluation. A negotiated grading scheme is, therefore, a worthwhile tool for teaching and a pre-requisite to the development and introduction of alternate forms of evaluation in Japanese universities, out of which, it is hoped, more reliable and valid tests of spoken English may evolve.

Notes

1. The existing grading structure used at Tokyo Denki University.

2. The Test in English for Educational Purposes from the Associated Board, taken from Weir, C. J. (1990), Communicative language testing, pp. 147-148. For the oral section of the TEEP test, criteria focus on six categories of communicative competence (Appropriateness, Adequacy of Vocabulary, Grammatical Accuracy, Intelligibility, Fluency, and Relevance and Adequacy of Content). In each category four levels or bands of ability are described. Performance is rated from 0-3 according to the test-taker's fulfillment of the description in each band. A total score out of 20 is awarded.

References

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Benson, M. (1991). Attitudes and motivation towards English: A survey of Japanese freshmen. RELC Journal, 22(1), 34-48.

Brindley, G. (1989). Assessing achievement in the learner-centred curriculum. Sydney: National Centre for English Teaching.

Brown, J. D. (1988). Understanding research in second language learning. New York: Cambridge University Press.

Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1, 1-47.

Davidson, B. W. (1994). Critical thinking: A perspective and prescriptions for language teachers. The Language Teacher, 18(4), 20-25.

Evanoff, R. (1993). Making a career of university teaching in Japan. In P. Wadden (Ed.), A handbook for teaching English at Japanese colleges and universities (pp. 15-22). New York: Oxford University Press.

Fairclough, N. (1992). The appropriacy of "appropriateness." In N. Fairclough (Ed.), Critical language awareness (pp. 35-56). London: Longman.

Gruba, P. (1993). Student assessment final report. Chiba, Japan: Kanda Institute of Foreign Languages.

Henning, G. (1987). A guide to language testing. Cambridge, MA: Newbury House.

Heron, J. (1981). Assessment revisited. In D. Boud (Ed.), Developing student autonomy in learning (pp. 55-68). London: Kogan Page.

Hughes, A. (1989). Testing for language teachers. Cambridge: Cambridge University Press.

Jones, J. B. (1991a). The FCET survey. Bulletin of Joetsu University of Education, 10(2), 223-224.

Jones, J. B. (1991b). The FCET survey. Bulletin of Joetsu University of Education, 11(1), 159-174.

Jones, J. B. (1992). The FCET survey. Bulletin of Joetsu University of Education, 11(2), 263-273.

Kelly, C. (1993). The hidden role of the university. In P. Wadden (Ed.), A handbook for teaching English at Japanese colleges and universities (pp. 172-191). New York: Oxford University Press.

McClean, J. M. (1993). An experiment in test design for assessing the English oral ability of TDU freshmen 1992-3. Research Reports of the Faculty of Engineering General Education, Tokyo Denki University, 12, 43-50.

Morrow, K. (1979). Communicative language testing: Revolution or evolution. In C. J. Brumfit & K. Johnson (Eds.), The communicative approach to language teaching (pp. 143-158). Oxford: Oxford University Press.

Nambiar, M. K., & Goon, C. (1993). Assessment of oral skills: A comparison of scores obtained through audio recordings to those obtained through face-to-face evaluation. RELC Journal, 24(1), 15-31.

Nozaki, K. N. (1993). The Japanese student and the foreign teacher. In P. Wadden (Ed.), A handbook for teaching English at Japanese colleges and universities (pp. 27-34). New York: Oxford University Press.


Weir, C. J. (1990). Communicative language testing. London: Prentice Hall International.

Weir, C. J., & Bygate, M. (1993). Meeting the criteria of communicativeness in a spoken language test. Paper presented at RELC Conference on Testing and Evaluation, Singapore.


Chapter 16

Assessing the Unsaid: The Development of Tests of Nonverbal Ability

NICHOLAS O. JUNGHEIM

RYUTSU KEIZAI UNIVERSITY

As the definition of what constitutes communicative competence is expanded and refined, our view of the learner's needs in the foreign language classroom has also changed considerably. While classroom teachers have attempted to deal with these developments by introducing more communicative teaching methods, appropriate instruments are not always available to test the effectiveness of these methods. How do we know if the learner has really acquired the new communicative tools that the teacher has exposed them to?

The purpose of this chapter is to show how standard test construction methods can be applied to the development of tests which address abilities not usually considered in traditional language testing. This chapter will describe a theoretical framework for testing nonverbal behavior and the construction and administration of two tests of nonverbal ability: the Gesture Test (Gestest) for assessing the comprehension of English gestures and the Nonverbal Ability (NOVA) Scales for assessing nonverbal behavior in conversations.

Background

With the expansion of the communicative competence framework (Canale & Swain, 1980; Canale, 1983), efforts have been made to improve oral proficiency assessment using the communicative paradigm. Along these lines, Bachman and Palmer (1983) developed the Oral Interview Test of Communicative Proficiency, which includes scales for measuring grammatical competence, pragmatic competence, and sociolinguistic competence as part of what Bachman (1988, 1990) refers to as Communicative Language Ability (CLA), his testing model of communicative competence.

While Bachman and Palmer's test was a major step away from previous tests which have been criticized as discrete-point tests without a basis in theory or research (Bachman, 1988; Bachman & Savignon, 1986), it relied solely on the vocal channel for its evaluation of oral proficiency, in spite of clear indications that nonverbal behavior is also a part of the sociolinguistic and strategic competence components of communicative competence (Canale, 1983; Canale & Swain, 1980). While there have been ongoing calls to pay more attention to nonverbal behavior as an integral part of communication (Al-Shabbi, 1993; Baird, 1983; Hurley, 1992; Kellerman, 1992; Pennycook, 1985; von Raffler-Engel, 1980), little has been done to address this issue for language learners. On the other hand, there has been at least one attempt to deal with the basic nonverbal communication skills necessary to do well at American colleges with the Communication Competency Assessment Instrument (CCAI) (reported in Rubin, 1982), affirming the importance of having nonverbal communication skills.

A further rationale for creating tests to evaluate language learners' nonverbal behavior is provided by research that has examined how ratings of oral proficiency are affected by what the raters see. In Neu's (1990) comparison of two learners' oral proficiency ratings and their use of nonverbal behaviors, she found that raters were "fudging," or changing their ratings, according to their perceptions of each learner's nonverbal behavior. This resulted in a mismatch between oral proficiency ratings and actual linguistic data. The Japanese subject's scores were negatively affected by the raters' perception of his nonverbal behavior while the Saudi Arabian's were positively affected.

In their comparison of the face-to-face ratings of learners' oral proficiency with ratings of audio recordings of the same conversations two months later, Nambiar and Goon (1993) found that subjects received lower ratings on the subsequent audio recordings. Raters were "irritated" by long pauses and repeated grammatical and phonological errors which had been less noticeable in face-to-face ratings "because of the need to simultaneously focus on visual paralinguistic and extralinguistic cues" (p. 24). Face-to-face assessment resulted in the raters giving the subjects more favorable ratings.

In addition to problems related to oral proficiency testing illustrated by the above examples, the simple gesture is a nonverbal cue which can cause confusion when different cultures meet (Jungheim & Ushimaru, 1990; Kimura, 1979). A simple wave of the hand to "come here" in Japan can also look like a Japanese person is waving hello or goodbye; the American just smiles and waves back. An important tool for strategic competence, or getting the message across by all means available, is missing in this intercultural exchange if the American is not aware of the meaning of a simple Japanese gesture.


The acquisition of gestures, however, varies even in L1. Kumin and Lazar (1974) found significant differences between three- and four-year-olds in their ability to understand and use gestures included in a 30-item list of gestures called emblems, gestures that have a direct translation and meaning known to all the members of a given culture and are unambiguous even if taken out of context (Ekman, 1976), such as the "come here" gesture. They also found that both age groups were significantly better at understanding these gestures than using them.

Similar evidence exists for L2 gestures. Mohan and Helmer (1988) compared the understanding of English gestures by four- and five-year-old native English speakers and nonnative speakers and found significant effects for age and culture on the ability to understand an inventory of 36 emblems and illustrators, gestures used to underline, emphasize, or illustrate spoken language.

There is also evidence that the effectiveness of instruction in nonverbal communication in the foreign language classroom also depends on the learning style of the students. In his study of the classroom acquisition of emblematic gestures, Jungheim (1991) found that Japanese students who received instruction using a traditional presentation, practice, and production approach did significantly better on a posttest than students who were taught the same gestures using a more "communicative" approach.

The above discussion has shown some of the issues involving nonverbal communication as it pertains to language learners, but until now, there has been no theoretical framework to help language teachers deal with it.

Testing Framework

A basic framework for the development of tests of nonverbal communication was first presented in Jungheim (1994d). This framework expands Bachman's (1990, 1991) Communicative Language Ability (CLA) model to include a three-part nonverbal ability component. Nonverbal ability can be defined here as knowing how to use and interpret a variety of nonverbal behaviors or cues appropriately for the target language and culture. The three parts of this framework are nonverbal textual ability, nonverbal sociolinguistic ability, and nonverbal strategic ability.

Textual Ability

This ability is referred to as textual because in the CLA model (Bachman, 1990, 1991) textual competence includes written and spoken language. Nonverbal textual ability, then, will also include "ways in which interlocutors organize and perform the turns in conversational discourse . . ." (Bachman, 1990, p. 88). The primary nonverbal behaviors under textual ability are gestures, head nods, and gaze direction when used to facilitate the interaction process as in backchanneling or turn-taking signals. Facial expressions such as smiles and frowns can also be included under textual ability.

Textual ability can be interpreted in terms of both the frequency and appropriateness of a person's use of the nonverbal behaviors concerned. When Japanese use head nods as a backchannel signal, for example, they have been found to nod both more frequently and at different times than Americans when speaking English (Maynard, 1987, 1989, 1990). This is related to what is called aizuchi in Japanese. Nonverbal and other backchannel signals such as "uh-huh" or "yeah" are more likely to occur at the point of grammatical completion for Americans, whereas, for Japanese who frequently use aizuchi, they are also used by the listener at the speaker's pauses as a kind of encouragement to continue speaking. When and how often someone nods, therefore, can be used to assess their nonverbal textual ability.

Gaze direction can also be viewed in terms of frequency and appropriateness. Japanese, for example, can be expected to have different gaze behavior from Americans (Barnlund, 1989; Hattori, 1986). Asians in general have been found to focus less on the face or head of a speaker (Watson, cited in Argyle & Cook, 1976, p. 27).

Gaze direction, however, is consistent among members of a particular culture (Greenbaum, 1985; Jungheim, 1994b; LaFrance & Mayo, 1978). Therefore, differences among cultures may often become noticeable when someone is speaking a second or foreign language. Listeners also tend to look at their partners much of the time they are listening, but they have been consistently observed to look away prior to beginning a speaking turn (Duncan & Fiske, 1985). In the case of a language learner, this type of looking away to think prior to speaking also increases when they have less control of the foreign language. According to Bialystok (1990), effective control gives the impression of fluency or automaticity. Inappropriate use of head nods and gaze direction, thus, will affect the listener's impression of the learner's fluency. Frequency and appropriateness of gaze direction changes can therefore also be used to assess nonverbal textual ability.

Sociolinguistic Ability

Sociolinguistic ability includes the ability to recognize the appropriate use of nonverbal behaviors such as the less frequent use of head nodding by native speakers of English as well as the ability to use and interpret gestures that vary from culture to culture. It is related to CLA sociolinguistic competence as a sensitivity to differences in dialect or variety, to differences in register and to naturalness, and the ability to interpret cultural references and figures of speech (Bachman, 1990, p. 95).

Previous tests of nonnative speakers' gestural understanding such as Jungheim (1991) and Mohan & Helmer (1988) are examples of assessing nonverbal sociolinguistic ability. These tests primarily covered gestures called emblems, or gestures which have a direct verbal translation that is understood by all members of the same group or culture. As many as 67 emblems such as the "come here" gesture have been identified as used by North Americans (Johnson, Ekman, & Friesen, 1975). The Profile of Nonverbal Sensitivity (PONS) (Rosenthal, Hall, Archer, DiMatteo, & Rogers, 1979), while not designed for language learners, is a good example of how a test of nonverbal sociolinguistic ability could be designed. In this test, subjects are asked to judge nonverbal behavior in video performances that have the sound distorted to make the speech incomprehensible while leaving the intonation patterns intact.

It goes without saying that language learners need nonverbal sociolinguistic ability not only to improve communication but also to avoid what could be agonizing nonverbal misunderstandings. A simple "thumbs up" gesture that has a good meaning in North America is the equivalent of giving someone the "finger" (an obscene gesture in the United States) in some cultures. Nonverbal sociolinguistic ability not only enhances the language learner's communication but also helps to avoid embarrassing misunderstandings.

Strategic Ability

Nonverbal strategic ability is important for language learners because it covers the compensatory role of nonverbal behavior as well as its role in supporting and enhancing speech. This ability includes the learner's use of nonverbal behaviors such as gestures or mime to compensate for insufficient linguistic knowledge, when necessary, as well as the appropriateness of the learner's use of gestures to support or enhance speech.

Lack of the use of gestures does not, however, indicate poor strategic ability unless there is an unresolved linguistic deficiency or, on the contrary, the learner is perfectly understandable without gestures. On the other hand, nonverbal behaviors such as "nose pointing" and too frequent self-pointing for "me" (sometimes found among Japanese learners) would be rated lower for appropriateness for enhancing and supporting speech with gestures signifying spatial relationships or shapes.

Table 1 summarizes some of the productive uses of nonverbal behavior according to the three nonverbal abilities in the nonverbal ability framework. This list consists primarily of productive uses of nonverbal behavior and is not meant to be an exhaustive review of nonverbal behaviors. Rather, this is a list of some of the major nonverbal behaviors that could be of interest to language teachers as easily observable and ratable behaviors.

Developing Tests of Nonverbal Ability

The following sections will outline the process of developing two tests of nonverbal ability and the results of their administration to groups of Japanese learners of English. The first test is the Gesture Test (Gestest), developed to measure language learners' nonverbal sociolinguistic ability to interpret English gestures. The second test is the Nonverbal Ability (NOVA) Scales, developed to measure language learners' nonverbal textual ability and nonverbal strategic ability in conversations.

The Gestest

The Gestest described here is a completely redeveloped version of the test used in Jungheim (1991) as a research instrument. It consists of a female North American native speaker of English performing 30 gestures on video tape and a four-option multiple-choice answer sheet written in Japanese. The Gestest was developed as a norm-referenced test using item facility (IF) and item discrimination (ID) as guidelines for good items. The development process basically followed Brown's (1995) three steps of (a) piloting a large number of items, (b) analyzing the items, and (c) selecting the best items to make up a smaller and more effective version of the test.

The following research questions frame this approach to the construction of a test of nonverbal sociolinguistic competence:

1. What gestures are important for language learners?

2. What is the difference in the understanding of these gestures between native and nonnative speakers of English?

3. What are the characteristics of a test of these gestures when applied to language learners?

4. How reliable and valid is this test as a measure of nonverbal sociolinguistic ability?


Table 1. Use of nonverbal behaviors in a nonverbal ability framework

Textual Use

Gestures
  Hands are used by the speaker to emphasize speech.
  Vertical head movement (nod) is used as a backchannel signal by the listener to indicate attention, understanding, or agreement.
  Vertical head movement (nod) is used by the speaker as a within-turn or turn-end signal.
  Horizontal head movement (shake) is used by the listener to indicate disagreement or with laughter.

Gaze
  Listener-directed gaze is used at the end of an utterance to elicit a backchannel response.
  Terminal gaze (prolonged gaze starting just before the end of an utterance) is used to signal the end of the utterance.
  Speaker-directed gaze is used to signal attentiveness.

Facial Expressions
  Smiles are used to indicate attention or agreement.
  Frowns are used to indicate disagreement or lack of understanding.

Sociolinguistic Use

Gestures
  Gestures unaccompanied by speech are used to convey specific meanings in a given culture.

Strategic Use

Gestures
  Mime (hand gestures) is used to compensate for a linguistic deficiency such as the lack of a necessary lexical item.
  Hand gestures are used to support spoken language to communicate spatial relationships and physical shapes which are not always easily understood using spoken language alone.

Method

Subjects. The subjects in the Gestest development process were 14 North American native speakers (NS) of English (two females, 12 males) and 22 Japanese nonnative speakers (NNS), who were graduate students in TESOL at an American university in Japan. These subjects were used to pilot the gesture video. Subjects used to pilot the actual 30-item Gestest were 56 Japanese university students (12 females, 44 males) in three intact freshman English conversation classes at a Japanese university.

Materials. Materials consisted of a list of 54 gestures compiled from previous research (e.g., Jungheim, 1991; Kumin & Lazar, 1974; Mohan & Helmer, 1988) used to collect baseline data, a revised list of 38 gestures with an accompanying video, and a 30-item multiple-choice Gestest with a gesture video.


Table 2. Gestest descriptive statistics

Statistic        Pilot Video (NS)   Pilot Video (NNS)   Pilot Gestest   Revised Gestest
N                      14                  22                 56               56
k                      38                  38                 30               23
Mean                32.60               24.50              19.04            17.09
S                    2.49                4.06               3.49             3.51
Median              32.00               24.50              20.00            18.00
Low                 24.50               12.50              11.00             9.00
High                36.00               30.00              25.00            22.00
Reliability           .87 (r)             .90 (r)            .63 (α)          .75 (α)
SEM                   .90                 1.28               2.12             1.76

Procedures. The first step in the construction of the Gestest was to determine which gestures to use for the collection of baseline data for the initial piloting. A list of 54 gestures was compiled from previous studies of gesture comprehension (e.g., Kumin & Lazar, 1974; Jungheim, 1991; Mohan & Helmer, 1988). The lists were given to three North American native speakers (NS) of English in Japan who were asked to (a) cross out those gestures that they thought would not be useful for language learners or they did not understand, (b) write simple descriptions of how the remaining gestures would be performed (examples of useful expressions for describing gestures were included), and (c) perform the gestures for a Japanese nonnative speaker (NNS) of English who was instructed to write verbal equivalents for them in Japanese. Space was also included for adding any gestures that the three NSs thought were important.

As a result of the above elicitation task, 38 gestures that two out of three of the American NSs thought were important for language learners were chosen for the creation of a pilot video to collect basic item data. The gesture video was made by using a camcorder in a brightly lit room with a plain curtain for a backdrop. A female English teacher from the United States performed each of the gestures two times in succession while seated, according to the descriptions written by the above-mentioned three NSs.


The video was then copied by connecting the camcorder to an ordinary video cassette recorder and adding numbers before each video performance. This video was shown to two groups of NS and NNS English teachers who were asked to write the meaning of each gesture. The NSs (14 North Americans) were asked to write the answers in English and the NNSs (22 Japanese) in Japanese.

A Japanese and a NS English teacher then rated the answers using the gesture cues from the original 54-gesture list as a guide. These results were used to decide which gestures to include in the pilot version of the Gestest.

In order to determine how the understanding of each gesture differed between NSs and NNSs, the item facility (IF) for each of the gestures was first calculated separately for NSs and NNSs. IF is simply the proportion of correct responses to a question on a test (number of correct answers divided by the total number of persons taking the test). In this case, if 77 percent of the NSs correctly identified the gesture for "Okay," the IF would be .77, a relatively easy item.
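
As an illustration of this calculation, the following minimal Python sketch (with invented 0/1 responses) computes IF as the number of correct answers divided by the number of test-takers.

# Invented 0/1 (incorrect/correct) responses for two items, one score per test-taker.
responses = {
    "Okay":      [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
    "Come here": [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
}

def item_facility(scores):
    # Proportion of correct answers: correct count / number of test-takers.
    return sum(scores) / len(scores)

for item, scores in responses.items():
    print(f"IF for '{item}': {item_facility(scores):.2f}")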

Next, item discrimination (ID) was calculated for each item. Ordinarily, ID is used to see how well an item discriminates between a group of high scorers and a group of low scorers (see Brown, 1995 for a complete description of this method, or Chapter 5 for a brief description). In the present study ID was used to find out which gestures distinguished between NSs and NNSs. The IFs of NNSs were subtracted from those of NSs for each item to see how well each discriminated between NSs and NNSs.
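
A minimal sketch of this NS-minus-NNS form of ID, using the video IF values reported in Table 3 for three of the gestures, is shown below (Python; the dictionaries are simply a convenient container, not part of the original procedure).

# Video IF values for NSs and NNSs, taken from Table 3.
ns_if  = {"Okay": 1.00, "Peace": 0.80, "Come here": 1.00}
nns_if = {"Okay": 0.77, "Peace": 0.18, "Come here": 0.86}

for gesture in ns_if:
    item_discrimination = ns_if[gesture] - nns_if[gesture]  # ID = IF(NS) - IF(NNS)
    print(f"ID for '{gesture}': {item_discrimination:.2f}")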

Using the IF and ID statistics for each of the 38 gestures, a list of 30 gestures was compiled for the construction of a pilot test. Simply speaking, a good item should be very easy for NSs and have some degree of difficulty for NNSs. Gestures which all NNSs and NSs identified correctly, therefore, were also dropped. Gestures that were difficult for NSs were dropped, as well as gestures that were equally difficult for both groups. A number of these items were included, however, in spite of high IF and low ID statistics. The primary criterion for these items was that all of the NSs got the item correct or that the IF was below 80 percent for the NNSs. It was assumed that the high IFs for these items were a result of the English teachers' greater experience and longer years of speaking English. Such items/gestures were included to make sure that the pilot test had a larger number of items.

A new video was created using the original gesture video tape. Gestures were copied in random order preceded by a number. A four-option 30-item multiple-choice answer sheet was created with all options written in Japanese so that differences in English proficiency among the learners would not directly affect the results for gesture comprehension. The correct answers as well as the distractors were written using the answers written by the three Japanese persons who viewed performances from the initial 54-gesture list and answers for the pilot gesture video. These were arranged in random order for each item and reviewed by a native speaker of Japanese to assure the accuracy of the Japanese.
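
The mechanical part of this step, arranging a key and three distractors in random order for each item, can be sketched as follows (Python; the option texts below are invented placeholders, not the actual Gestest options).

import random

# Invented key and distractors for one hypothetical item.
key = "kotchi ni kite (come here)"
distractors = ["sayounara (goodbye)", "shizuka ni (be quiet)", "akushu (handshake)"]

options = [key] + distractors
random.shuffle(options)  # arrange the four options in random order

for label, option in zip("ABCD", options):
    print(f"{label}. {option}")
print("Key:", "ABCD"[options.index(key)])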


Table 3. Gesture Video and Pilot Test Item Facility

                                                      ---------- Video ----------   -------- Pilot Test --------
Item  English Cue                      Video Item      IF     NS     NNS     ID       IF    High    Low     ID
 1.   Okay                                 (1)        0.87   1.00   0.77   0.23      0.68   0.88   0.31   0.56
 2.   Me                                   (7)        0.87   1.00   0.77   0.23      0.77   0.94   0.56   0.38
 3.   Yes                                 (19)        0.95   1.00   0.91   0.09      0.98   1.00   0.94   0.06
 4.   I won't listen (too loud)           (27)        0.98   1.00   0.95   0.05      0.80   0.94   0.56   0.38
 5.   Hello (goodbye)                     (18)        0.98   1.00   0.95   0.05      0.91   1.00   0.69   0.31
 6.   I'm cold                            (24)        0.98   1.00   0.95   0.05      0.84   1.00   0.56   0.44
 7.   I'm happy*                          (36)        0.66   0.87   0.59   0.28      0.29   0.25   0.31  -0.06
 8.   Tastes good                         (28)        0.76   0.87   0.68   0.18      0.46   0.63   0.31   0.31
 9.   Peace*                              (31)        0.49   0.80   0.18   0.62      0.14   0.19   0.19   0.00
10.   What time is it?                    (33)        0.90   1.00   0.82   0.18      0.86   1.00   0.75   0.25
11.   Tastes awful*                       (26)        0.66   1.00   0.41   0.59      0.63   0.69   0.69   0.00
12.   Get up                              (25)        0.51   0.80   0.27   0.53      0.75   0.88   0.50   0.38
13.   You                                  (5)        0.59   0.87   0.36   0.50      0.86   0.94   0.69   0.25
14.   Blowing a kiss (love)               (20)        0.63   1.00   0.32   0.68      0.71   0.88   0.50   0.38
15.   No Good                              (4)        0.85   1.00   0.73   0.27      0.27   0.50   0.00   0.50
16.   Over there                          (37)        0.78   0.87   0.77   0.09      0.39   0.50   0.25   0.25
17.   Louder (I can't hear you)           (30)        0.95   1.00   0.91   0.09      0.96   1.00   0.88   0.13
18.   I'm tired                           (21)        0.37   0.73   0.14   0.60      0.00   0.00   0.00   0.00
19.   Crazy                               (13)        0.93   1.00   0.86   0.14      0.89   0.94   0.69   0.25
20.   I don't know                         (2)        0.87   0.93   0.82   0.12      0.71   0.94   0.38   0.56
21.   I'm hot                             (32)        0.66   0.80   0.64   0.16      0.63   0.88   0.25   0.63
22.   Give it to me*                      (11)        0.40   0.67   0.18   0.48      0.14   0.19   0.13   0.06
23.   Stop                                (12)        0.93   1.00   0.86   0.14      0.89   1.00   0.81   0.19
24.   No                                  (15)        0.98   1.00   0.95   0.05      0.77   1.00   0.44   0.56
25.   I'm sad*                            (34)        0.63   0.73   0.59   0.14      0.36   0.31   0.38  -0.06
26.   Naughty child (don't do that)*       (6)        0.49   0.93   0.23   0.71      0.05   0.06   0.00   0.06
27.   Come here                           (14)        0.93   1.00   0.86   0.14      0.77   0.94   0.56   0.38
28.   Oh, no!                              (3)        0.72   0.80   0.68   0.12      0.71   1.00   0.56   0.44
29.   Money*                               (9)        0.67   0.93   0.45   0.48      0.23   0.25   0.19   0.06
30.   It smells bad                       (29)        0.98   1.00   0.95   0.05      0.98   1.00   0.94   0.06
      Going my way (hitchhiking)           (8)        0.23   0.33   0.09   0.24
      Go away                             (10)        0.38   0.33   0.41  -0.08
      Quiet                               (16)        1.00   1.00   1.00   0.00
      Going to sleep                      (17)        1.00   1.00   1.00   0.00
      Punch in the nose (anger)           (22)        0.37   0.47   0.32   0.15
      Get out                             (23)        0.80   0.80   0.77   0.03
      Sit down                            (35)        0.54   0.40   0.64  -0.24
      Big and round                       (38)        0.66   0.67   0.68  -0.02

* Items eliminated for revised 23-item gesture test.


The resulting pilot Gestest was then administered to three groups of Japanese university students at their university's language laboratory. Students were able to view the video on small monitors on their desks as well as on four large televisions suspended from the classroom ceiling.

After correcting the tests, the IF and ID were calculated for each item. In this case, ID was the difference between the IF for the students who scored in the upper quarter of the 56 students for the whole test minus those who scored in the lower quarter. Descriptive statistics, including measures of reliability, were also calculated. Seven items which did not work well according to their IF and ID statistics were then eliminated and the descriptive statistics recalculated for a "revised" version of the Gestest with fewer items and improved reliability.
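
A minimal sketch of this upper-quarter versus lower-quarter form of ID for a single item is shown below (Python; the (total score, item correct) pairs are invented).

# Invented (total test score, 0/1 correctness on this item) pairs, one per student.
scores = [(25, 1), (24, 1), (23, 1), (22, 1), (20, 1), (19, 0), (18, 1), (17, 0),
          (16, 1), (15, 0), (14, 0), (13, 1), (12, 1), (11, 0), (10, 0), (9, 0)]

ranked = sorted(scores, key=lambda s: s[0], reverse=True)
quarter = len(ranked) // 4

upper = ranked[:quarter]   # highest total scorers
lower = ranked[-quarter:]  # lowest total scorers

if_upper = sum(correct for _, correct in upper) / quarter
if_lower = sum(correct for _, correct in lower) / quarter
print(f"ID = {if_upper - if_lower:.2f}")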

Results

Table 2 shows the descriptive statistics for NSs and NNSs for the pilot video, with interrater reliability estimates (r), and for the pilot Gestest and revised Gestest, with Cronbach's alpha (α) measuring internal consistency. All results appear to be normally distributed. Thanks to the careful selection of gestures and the thorough piloting of the video, the pilot Gestest already has a fairly acceptable level of internal consistency.
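
For reference, a minimal Python sketch of Cronbach's alpha over an invented test-takers-by-items matrix of 0/1 scores is given below; it illustrates the statistic itself, not the original analysis.

import numpy as np

# Invented 0/1 score matrix: rows are test-takers, columns are items.
data = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1],
])

k = data.shape[1]                               # number of items
item_variances = data.var(axis=0, ddof=1)       # variance of each item
total_variance = data.sum(axis=1).var(ddof=1)   # variance of total scores

alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's alpha = {alpha:.2f}")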

Table 3 shows the item analysis of the results of the pilot video and the pilot Gestest. The eight items at the bottom of the list were eliminated for the pilot Gestest. Those that had low IFs even for NSs may have had some problem with the video performance itself. This is very important to keep in mind for anyone attempting to construct a similar video test of nonverbal sociolinguistic ability.

Gestures marked with an asterisk were eliminated because they were either too difficult, too easy, or did not discriminate well enough between the highest and lowest scorers. The remaining items were used to recalculate the results as if this were a 23-item revised Gestest. As seen in Table 2, the acceptable reliability of the improved version (α = .75) was achieved by merely removing items that did not work well.

Discussion

The above description showed how traditional norm-referenced item analysis using IF and ID can be applied to the construction of a test of nonverbal sociolinguistic ability. It was found that:

1. There are 38 gestures that NSs feel are important for language learners in Japan.

2. Thirty of these gestures can be used to discriminate between NSs' and NNSs' understanding of North American gestures.

3. By eliminating seven of the thirty gesture items (because of their item statistics), a leaner and more effective version of the test could be created.

As for research question 4, the use of IF and ID to construct this test after the careful collection of gesture data from NSs and NNSs clearly resulted in a more reliable test. In addition, this test also appears to be a valid test of nonverbal sociolinguistic ability, at least in terms of content validity, for the task itself as well as for the construct being measured.

In terms of task, if these gestures collected from the literature are truly emblematic, they should be interpretable without the help of verbal or contextual clues. NSs were able to accurately identify the gestures to such a degree that many of them used exactly the same language as the English cues listed in Table 3 taken directly from the literature. ID statistics comparing NSs and NNSs provide some evidence of construct validity by showing that the ability to identify these gestures differs between North Americans and Japanese. As previously stated, sociolinguistic competence involves "a sensitivity to differences in variety . . ." (Bachman, 1990, p. 95). Item statistics revealed differences in the sensitivity toward these gestures and, along with the results of the test, provided evidence for the existence of a nonverbal sociolinguistic ability. Other evidence for validity can only be collected in the future by comparing Gestest results with the results of other yet-to-be constructed tests of nonverbal ability. A further discussion of validity and tests of nonverbal ability is included at the end of this chapter.

The development of the Gestest did not end here. Items for the revised version of the test were chosen but not, in fact, retested. Ultimately, a test such as this requires further fine tuning through what is called item quality analysis. By examining which distractors worked and did not work and revising the item options, the overall reliability could probably be improved further.
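
Such a distractor analysis amounts to counting how often each option was chosen. The sketch below (Python, with invented responses for one hypothetical item) flags the kind of pattern of interest: options that attract almost no one, or that attract more test-takers than the key.

from collections import Counter

# Invented responses to one four-option item; "B" is the key.
key = "B"
responses = ["B", "C", "B", "A", "B", "C", "B", "B", "D", "C", "B", "A"]

counts = Counter(responses)
for option in "ABCD":
    chosen = counts.get(option, 0)
    marker = " (key)" if option == key else ""
    print(f"Option {option}{marker}: chosen by {chosen} of {len(responses)}")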

Item 18 on the pilot test, for example, turned out to be a problematic one due to the translation of the correct answer. "I'm tired" was meant to mean tired as in sleepy. Unfortunately, both "overworked" and "sleepy" tired were included in the options. Because of the translation process, tsukareta or "I'm tired" was inadvertently designated as the correct answer, but nemui or "I'm sleepy" was also included. No one chose the "correct" answer, but item analysis showed that nemui was actually the correct answer, and thus, the item analysis in Table 2 uses that answer as a basis for analysis.

Another problem encountered had to do with the use of the video performances themselves. In the case of the Gestest, some of the gestures that were dropped were poor items more because of the ambiguity in the performance than because of the lack of validity of the gesture itself. It remains to be shown whether performances by a male or a different female would produce different results.

The NOVA Scales

The Nonverbal Ability (NOVA) Scales were developed to measure language learners' nonverbal textual and strategic ability in terms of head nods, gaze direction changes, and gestures. The test consists of a role play to elicit the target role play behaviors, the NOVA Rater's Guide (Jungheim, 1994c) containing a description of the test, rating guidelines to train raters, and a rater training video with examples of actual role play performances.

Since this test is based on a communicative task, it was created keeping in mind Brown's (1995) suggestions for avoiding complications that may arise with this type of test construction: (a) the task was clearly defined with specific instructions for the NS tester (in English) and for the NNS examinee (in Japanese); (b) the task was narrow enough in scope to fit into the allotted time, as confirmed by piloting the role play task; (c) scoring procedures were worked out clearly in advance and described in the above-mentioned rater's manual; (d) points on the rating scales were clearly defined to facilitate the raters' task; and (e) outside raters were used to make the rating as anonymous as possible.

This section will present a general outline of the steps taken to develop the NOVA Scales. The following research questions guided this approach to the construction of a test of nonverbal textual and strategic competence:

1. What nonverbal behaviors are used by language learners in conversations?

2. What kind of scales can be used to rate language learners' nonverbal ability in relation to these behaviors?

3. What are the characteristics of language learners' scores when rated using these scales?

4. To what extent are these scales reliable and valid for assessing nonverbal ability?

Method

Subjects. The subjects in the NOVA Scales development process were 28 nonnative speakers and 20 native speakers of English. The NNSs were educated middle-class Japanese (23 males and five females) EFL learners comprising 24 students, two faculty members, and two office staff members of a Japanese university. The NSs were 20 educated white middle-class North Americans (15 males and five females) who were EFL teachers or graduate students in TESOL at a Japanese branch of an American university in Japan. NS subjects provided the baseline data for nonverbal behaviors.


Table 4. Nonverbal ability (NOVA) scales

Textual Ability

Frequency
  0  Extremely limited use of head nods and infrequent changes in gaze direction toward partner in conversation
  1  Frequent use of head nods and changes in gaze direction that are not acceptable by native speaker norms
  2  Frequency of head nods and changes in gaze direction approaches native speaker norms
  3  Frequency of head nods and changes in gaze direction acceptable by native speaker norms

Appropriateness
  0  Totally inappropriate use of head nods and gaze direction by native speaker norms
  1  Frequent inappropriate use of head nods and changes in gaze direction
  2  Few inappropriate uses of head nods and changes in gaze direction
  3  Use of head nods and changes in gaze direction acceptable by native speaker norms

Strategic Ability

Compensatory Usage
  0  No evidence of hand gestures to solve considerable linguistic problems
  1  Limited use of hand gestures to solve linguistic problems with occasionally unsuccessful results
  2  Hand gestures successfully used to solve linguistic problems
  3  Few linguistic problems requiring the use of hand gestures for compensation

Appropriateness
  0  Never uses hand gestures to support or enhance meaning
  1  Occasionally uses hand gestures to support or enhance meaning, often inappropriately by native speaker norms
  2  Most hand gestures approach native speaker norms
  3  Use of hand gestures appropriate by native speaker norms


Materials. Materials used were role play cards describing the role play tasks, video tapes of the subjects performing the role plays, the NOVA Rater's Guide for training raters, training videos, and rating sheets.

Procedures. NNS subjects were recruited through either a poster placed on a university bulletin board or personally by the researcher. Next, they were paired according to their English proficiency level using previously administered linguistic and oral proficiency tests. This was followed by their performance of a thoroughly piloted series of four role plays (described in detail in Jungheim, 1994a). The first three were performed by each pair. The fourth (see Appendix B for role play instructions), which was used for the application of the NOVA Scales, was performed individually with the researcher's NS assistant. The second and third were then repeated in Japanese. Subjects performed the role plays in front of two video cameras in the researcher's office.


To collect baseline data, NSs followed the same procedure for performing the role plays in rooms at their university, with the exception of the repetition of the second and third role plays.

Videos of the role play performances were observed and transcribed. Head nods, gaze direction changes, and gestures were chosen for analysis and coded on the transcripts of the role plays. (Details of the further analyses of the role play performances as well as the proficiency tests can be found in Jungheim, 1994d.)

As a result of the analyses of nonverbal behavior used in the role play performances, four four-point (0-3) NOVA Scales were created on the basis of the above-mentioned CLA nonverbal ability framework (Jungheim, 1994d). They included the rating of frequency and appropriateness for nonverbal textual ability and compensatory usage and appropriateness for nonverbal strategic ability. See Table 4 for the details of these scales.

The NOVA Rater's Guide was then created to train raters. It included the role play instructions for role play four (a student requesting a letter of recommendation from his professor) used for the application of the rating scales, training instructions, the NOVA Scales, ratings of performances on the training video, explanations of the ratings to help raters interpret "native speaker norms," and a sample rating sheet. "Native speaker norms" were established by observing and analyzing the role play performances of the 20 North American NSs. This norm was chosen because of the need for a particular norm to act as a guide for rating. Since many English teachers in Japan are North Americans, their nonverbal behavior is actually an input variety for Japanese learners, and as such, it can serve as a norm (Kasper, 1992). It must be emphasized here that other norms would be equally valid, but for research purposes the North American norm was chosen.

A training video was created using pilot performances of role play four. A rating video was also created by copying the NNS subjects' performances of role play four in random order onto a separate video with each performance preceded by the subject's number.

Two raters were then given the training materials and rating video and asked to rate the performances. A follow-up rating was performed by the researcher as a third rater a month later because a number of ratings differed by more than one point between the first two raters. This was carried out to improve the reliability of the ratings. Although it is normally not desirable for the researcher to do this because of the issue of anonymity, it was justified here on the basis of the three-month lapse since he had last viewed the videos.

Finally, descriptive statistics including reliability estimates using intraclass correlation for three raters were calculated for each of the four scales and the average score for the scales of the NOVA Scales.

Table 5. Descriptive statistics for the NOVA scales

                  Textual Ability        Strategic Ability       Overall
Statistic          1*        2             3         4            Score
Mean              1.57      1.61          1.37      1.42           1.49
S                  .53       .50           .85       .81            .63
Median            1.67      1.67          1.67      1.67           1.54
Low                .33       .67           .00       .00            .25
High              2.67      3.00          3.00      3.00           2.92
Range             2.34      2.33          3.00      3.00           2.67
rkk                .67       .70           .81       .81            .93
SEM                .30       .29           .37       .35            .17

* 1 = Frequency, 2 = Appropriateness, 3 = Compensatory Usage, 4 = Appropriateness



Results

Table 5 shows the descriptive statistics for the three raters' ratings of role play four performances of the 28 Japanese subjects using the NOVA Scales. The high intraclass correlation (rkk) for the test average means that there was a high degree of agreement among the raters for the test as a whole (i.e., rkk = .93). Reliability estimates for each scale, rkk = .67 to .81, were also within acceptable limits. Textual ability ratings for the use of head nods and gaze direction changes appear to be normally distributed. The slightly positive skew (i.e., the average for the whole group is much lower than the median or "middle" score) for nonverbal strategic ability ratings may be related to the number of subjects who did not gesture at all even though they could have used gestures to compensate for their linguistic difficulties.
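
The chapter does not specify which intraclass correlation variant was used; one common average-measures form, ICC(3,k), can be computed like Cronbach's alpha with raters treated as items, as in the following Python sketch over an invented subjects-by-raters matrix.

import numpy as np

# Invented 0-3 ratings: rows are subjects, columns are the three raters.
ratings = np.array([
    [2, 2, 3],
    [1, 1, 1],
    [3, 2, 3],
    [0, 1, 1],
    [2, 3, 2],
    [1, 1, 2],
])

k_raters = ratings.shape[1]
rater_variances = ratings.var(axis=0, ddof=1)     # variance of each rater's scores
total_variance = ratings.sum(axis=1).var(ddof=1)  # variance of summed ratings

icc_avg = (k_raters / (k_raters - 1)) * (1 - rater_variances.sum() / total_variance)
print(f"Average-measures ICC = {icc_avg:.2f}")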

Discussion

The NOVA Scale results have shown that it is possible to construct a productive test of nonverbal textual and strategic ability by collecting baseline data to establish norms for the nonverbal behaviors to be tested and by taking necessary test construction precautions for creating a test based on a communicative task. It was found that:

1. Head nods, gaze direction changes, and gestures are nonverbal behaviors used by language learners in conversations.

2. Based on a CLA nonverbal ability model, nonverbal ability scales can be constructed to assess the frequency and appropriateness of the use of head nods and gaze direction changes as a part of textual ability and the compensatory usage and appropriateness of gestures as a part of strategic ability.

3. Ratings using the NOVA scales generally produce a normal distribution for textual ability but a slightly positively skewed distribution for strategic ability.

4. The NOVA Scales as a whole are a very reliable measure of nonverbal ability related to the use of head nods, gaze direction changes, and gestures for this group of students.

5. Content validity was assured by the use of the role play task, which was appropriate for the university subjects who might actually need to ask for a letter of recommendation in real life, and by reviewing the literature on nonverbal behavior and analyzing the behavior of the baseline data subjects.

The issues of reliability and validity will be discussed further in the next section.

Conclusions

This chapter has described how tests of various types of nonverbal behavior can be constructed under a communicative competence framework using standard language testing practices. However, these tests have been constructed and used primarily in a research context and are certainly not the only alternatives for assessing nonverbal behavior.

Outside of actual assessment for the purpose of grading students, such instruments can also be used as classroom tools. The Gestest, for example, has proven to be useful for heightening learners' awareness of cultural differences in nonverbal communication as a kind of pretest instrument in English conversation classes and intercultural communication seminars. Constructing it as a multiple-choice test with all of the options in Japanese has made it possible to administer the Gestest in only about 15 minutes, short enough to allow ample time for follow-up discussion in the classroom.

Problems

While the Gestest and the NOVA Scales have been shown to be reliable tests of nonverbal abilities, the issue of validity has still not been thoroughly investigated. In the above descriptions of these tests, only content validity has been dealt with. It should be noted, however, that although we often divide validity into content, criterion-related, and construct validities, validity is actually a unitary concept. Furthermore, not only is the test itself being validated but inferences are also being made about the uses of the test (APA, 1985). That is why it was emphasized that, in this study, the tests were primarily used as research instruments. The validity of these tests for other purposes would have to be considered in light of who the learners are as well as the purpose for testing their nonverbal ability.

Evidence for content validity has been provided for both tests in terms of the target nonverbal behaviors and the tasks themselves. In order to provide a fuller picture of these tests' validity, it is necessary to discuss criterion-related and construct validity.

Criterion-related validity can only be determined by comparing the results of a test with another measure of the same construct. Due to the lack of other measures of nonverbal ability, this is not possible. One answer, then, is to create other measures of nonverbal ability following the same careful steps as described for the Gestest and the NOVA Scales.

The PONS Test (Rosenthal, Hall, Archer, DiMatteo, & Rogers, 1979) offers some potential as a test of sociolinguistic competence, although its reliability and validity for language learners must first be investigated. In addition, Jungheim (1992) suggested two other tests of nonverbal ability, a gesture appropriateness test and a bicultural gesture measure. The gesture appropriateness test would consist of brief, randomly ordered video conversations. There would be both appropriate and inappropriate uses of the target gestures in addition to performances with six other easy gestures used appropriately. Subjects would be required to decide only whether the gesture was used correctly or incorrectly. A bicultural gesture measure could be patterned after the Bilingual Syntax Measure (Burt, Dulay, & Hernandez-Chavez, 1975) and used as a productive measure of nonverbal ability. Subjects would perform a dialogue that required the use of certain target gestures. Raters would rate videos of the performances for the number of correct usages of gestures in obligatory occasions. Furthermore, an alternate version of the Gestest itself could also be created using the same items with different video performances and arranged in different orders. These are just some of the possibilities for creating tests of nonverbal ability that could be used to establish criterion-related validity.

Construct validity is established by showing experimentally that a test actually assesses a particular psychological construct that cannot be measured directly. Factor analysis has shown that the NOVA Scales do measure a nonverbal ability (Jungheim, 1994d), thus confirming their construct validity. Further research, however, will need to include additional measures of nonverbal ability such as the Gestest in this analysis.

Future Test Development

This chapter has provided a framework for dealing with nonverbal ability in a communicative competence context and a number of examples of tests that could be used to assess learners in this respect. Others who are interested in developing their own tests should keep in mind three important aspects of developing tests of communicative competence: "what to look for, how to gather relevant information, and how to use the information" (Canale, 1988, p. 67).

What to look for should include a thorough review of what nonverbal behaviors are important for target learners. This may vary from culture to culture. The instruments developed here were constructed with Japanese university students in mind. How to gather relevant information will depend on the nature of the test itself. Aside from referring to the nonverbal communication literature, it is important to establish some norm and collect at least a limited amount of baseline data. How to use evaluation information is ultimately the most important aspect because it involves the issue of validity. The tests described here may be valid for research or as pedagogical tools, but not necessarily for evaluating the nonverbal ability of hotel trainees or teaching assistants at an American university. Teachers will need to take a hard look at how the results of their tests will be used.

It is hoped that this chapter will stimulate interest in the assessment of nonverbal ability and further contribute to the refinement of a framework for testing as well as teaching the use of nonverbal communication by foreign language learners.

Note

Anyone who is interested in obtaining a copy of the NOVA Rater's Guide or further information about any of the other materials described here should write the author at 5-22-10 Shimoigusa, Suginami-ku, Tokyo 167, Japan.

References

Al-Shabbi, A. E. (1993). Gestures in the communicative language teaching classroom. TESOL Journal, 2(3), 16-19.

APA. (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

Argyle, M., & Cook, M. (1976). Gaze and mutual gaze. Cambridge, England: Cambridge University Press.

Bachman, L. F. (1988). Problems in examining the validity of the ACTFL Oral Proficiency Interview. Studies in Second Language Acquisition, 10, 149-163.

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Bachman, L. F. (1991). What does language testing have to offer? TESOL Quarterly, 25(4), 671-704.

Bachman, L. F., & Palmer, A. S. (1983). Oral interview test of communicative proficiency in English. Urbana, IL: Photo-offset.

Bachman, L. F., & Savignon, S. (1986). The evaluation of communicative language proficiency: A critique of the ACTFL oral interview. The Modern Language Journal, 70(4), 380-390.

Baird, L. L. (1983). The search for communication skills. Educational Testing Service Research Report, No. 83-14.

Barnlund, D. (1989). Communicative styles of Japanese and Americans: Images and realities. Belmont, CA: Wadsworth.

Bialystok, E. (1990). Communication strategies: A psychological analysis of second-language use. Oxford: Basil Blackwell.

Brown, J. D. (1995). Testing in language programs. Englewood Cliffs, NJ: Prentice-Hall.

Burt, M. K., Dulay, H. C., & Hernandez-Chavez, E. (1975). Bilingual syntax measure I. San Antonio, TX: Psychological Corporation.

Canale, M. (1983). From communicative competence to communicative language pedagogy. In J. Richards & R. Schmidt (Eds.), Language and communication. London: Longman.

Canale, M. (1988). The measurement of communicative competence. Annual Review of Applied Linguistics, 8, 67-84.

Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second-language teaching and testing. Applied Linguistics, 1, 1-47.

Duncan, S., Jr., & Fiske, D. W. (1985). Interaction structure and strategy. Cambridge: Cambridge University Press.

Ekman, P. (1976). Movements with precise meanings. Journal of Communication, 26(3), 14-26.

Greenbaum, P. E. (1985). Nonverbal differences in communication style between American Indian and Anglo elementary classrooms. American Educational Research Journal, 22(1), 101-115.

Hattori, T. (1986). A study of nonverbal intercultural communication between Japanese and Americans focusing on the use of the eyes. JALT Journal, 8, 109-118.

Hurley, D. S. (1992). Issues in teaching pragmatics, prosody, and non-verbal communication. Applied Linguistics, 13(3), 259-281.

Johnson, H. G., Ekman, P., & Friesen, W. V. (1975). Communicative body movements: American emblems. Semiotica, 15(4), 335-353.

Jungheim, N. O. (1991). A study on the classroom acquisition of gestures in Japan. The Journal of Ryutsu Keizai University, 2X2), 61-68.

Jungheim, N. O. (1992). Interactions of explicit and implicit knowledge and the effect of practice on the acquisition of L2 gestures. Unpublished paper, Tokyo: Temple University Japan.

Jungheim, N. O. (1994a). Assessing the nonverbal ability of foreign language learners. Paper presented at the Annual Meeting of the American Educational Research Association (New Orleans, LA, April 4-8, 1994). (ERIC Document Reproduction Service No. 374 676).

Jungheim, N. O. (1994b). Through the learner's eyes: Nonverbal behavior and personality in the foreign language classroom. Temple University Japan Research Studies in TESOL, 2, 93-108.

Jungheim, N. O. (1994c). Nonverbal ability assessment scales: NOVA rater's guide. Unpublished manuscript, Temple University, Tokyo.

Jungheim, N. O. (1994d). Assessing nonverbal ability as a component of language learners' communicative competence. Unpublished doctoral dissertation, Temple University, Tokyo.

Jungheim, N., & Ushimaru, A. (1990). Kaiwa tatsujin e no michi "pafomansu" ("Performance": A key to the art of conversation). Hyakumannin no Eigo (English for Millions), 5, 20-25.

Kasper, G. (1992). Pragmatic transfer. Second Language Research, 8(3), 203-231.

Kellerman, S. (1992). 'I see what you mean': The role of kinesic behaviour in listening and implications for foreign and second language learning. Applied Linguistics, 13(3), 239-258.

Kimura, T. (1979). Language is culture: Culture-based differences in Japanese and English. In H. M. Taylor (Ed.), English & Japanese in contrast. New York: Regents.

Kumin, L., & Lazar, M. (1974). Gestural communication in preschool children. Perceptual and Motor Skills, 38, 708-710.

La France, M., & Mayo, C. (1978). Gaze direction in interracial dyadic communication. Ethnicity, 5, 167-173.

Maynard, S. K. (1987). Interactional functions of a nonverbal sign: Head movement in Japanese dyadic casual conversation. Journal of Pragmatics, 11, 589-606.

Maynard, S. K. (1989). Japanese conversation: Self-contextualization through structure and interactional management. Norwood, NJ: Ablex.

Maynard, S. K. (1990). Conversation management in contrast: Listener response in Japanese and American English. Journal of Pragmatics, 14, 397-412.

Mohan, B., & Helmer, S. (1988). Context and second language development: Preschoolers' comprehension of gestures. Applied Linguistics, 9(3), 275-292.

Nambiar, M. K., & Goon, C. (1993). Assessment of oral skills: A comparison of scores obtained through audio recordings to those obtained through face-to-face interaction. RELC Journal, 24(1), 15-31.

Neu, J. (1990). Assessing the role of nonverbal communication in the acquisition of communicative competence in L2. In R. C. Scarcella, E. S. Andersen, & S. D. Krashen (Eds.), Developing communicative competence in a second language. New York: Newbury House.

Pennycook, A. (1985). Actions speak louder than words: Paralanguage, communication, and education. TESOL Quarterly, 19(2), 259-282.

Rosenthal, R., Hall, J. A., Archer, D., DiMatteo, M. R., & Rogers, P. L. (1979). The PONS test manual: Profile of nonverbal sensitivity. New York: Irvington Publishers.

Rubin, R. B. (1982). Assessing speaking and listening competence at the college level: The communication competency assessment instrument. Communication Education, 31, 19-32.

von Raffler-Engel, W. (1980). Kinesics and paralinguistics: A neglected factor in second-language research and teaching. Canadian Modern Language Review, 36(2), 225-237.


Appendix A: Gestest Answer Sheet

[Answer sheet: 30 numbered items, each offering four response options (a-d) in Japanese.]

Appendix B: Role Play Instructions for Nonnative Speakers

You will be asked to perform a number of role plays. Read each situation and imagine that you are really the person in the role play. Specific details are not given, but try to speak in as much detail as possible.

Nonnative Speaker Subject's Role

You want to apply to an American university's graduate school to study in your major. You need a letter of recommendation from your teacher. You go to your teacher's office to ask him for one. Be ready to explain why you need the letter, what information it should contain, when you need it, and what your teacher should do with it in as much detail as possible.

Instructions for Native Speaker Assistant

1. Greet the student: "Hi, (name), what can I do for you?"

2. Remember: You do not know what the student wants beforehand.

3. Help the student speak, but do not put words in his/her mouth. The burden to communicate should be on the student as long as pauses are not too long. Be patient.

4. If the student does not give instructions about the letter of recommendation, ask questions about why he/she wants to study abroad, the information to be contained in the letter, when it is needed, where to send it, and so on.

5. You will have four minutes for the role play, but do not worry about the time. If there is time, close the conversation with small talk and a word of "good luck."

Native Speaker Assistant's Role

You are a professor at a Japanese university. A student comes to you to ask for a letter of recommendation so that he can study at an American university. Ask the student why he wants to study abroad, what information the letter should contain, when he needs it, and where he should send it. Try to get the student to explain in as much detail as possible.


Chapter 17

Cloze Testing Options for the Classroom

CECILIA B. IKEGUCHI

DOKKYO UNIVERSITY

Controversies in the field of cloze testing research are far from solved, and the debate on the pros and cons of the cloze procedure continues. Whatever has been said and done, the merits of cloze testing seem to outweigh its demerits (Chapelle & Abraham, 1990). Although research reports come up with inconsistent results from one cloze to another, cloze appears to be a measure of distinct language skills (Bachman, 1990). Aside from these empirical inconsistencies, there exists an unfortunate reality in terms of the practical applications of the cloze: the concept of cloze testing has often been confined within the walls of empirical research, "R," which is often considered irrelevant and remote from actual classroom testing situations, "r" (LoCastro, 1994). This chapter aims to bring research on cloze testing into the classroom setting by introducing simplified ways of testing language skills with cloze. Specifically, this chapter aims, first, to provide a simplified review of the research on cloze testing for practical classroom uses; it does not attempt to be comprehensive. In so doing, I wish to orient language teachers to what is going on in the research related to their actual teaching practices. This chapter seeks as much as possible to avoid theoretical discussions (since they are given in voluminous publications elsewhere), and the terminology is simplified. If theoretical background information is included, it is for the purpose of providing the necessary theoretical justifications for the practicality of cloze in the classroom.

Second, this chapter seeks to highlight the implications of empirical findings for classroom testing. More specifically and most importantly, it aims to present a systematic discussion of the four types of cloze tests (the fixed-rate deletion, the rational deletion, the multiple-choice cloze, and the C-test), their theoretical justifications, the features of each, as well as their distinct merits and demerits. Since the accuracy of measuring what the language teacher wants to assess may depend on the kind of deletion procedure that is selected, each procedure will be discussed in turn.

So much has been written about cloze testing in general that, on first thought, it seemed redundant to include a history of cloze testing in this chapter. However, to understand the different cloze deletions that are central to the present discussion, it is necessary to get a bit of information on the precedents and causes that gave birth to cloze testing.

Background of Cloze Testing

In the cloze procedure, the examinees are given a segment from which words have been deleted and replaced by blanks, and they must provide or choose the word that best fits each blank. The theory behind cloze is that a language learner, presented with a piece of language mutilated in this manner, can use his or her acquired competence to restore either the word in the original text or an acceptable word. As learners develop language competence, their ability to use clues from the text to restore missing items increases (Klein-Braley & Raatz, 1984).

The seeming simplicity of this technique, regardless of its theoretical justification, has made it an attractive instrument for both test constructors and classroom teachers. Cloze testing has been regarded by researchers from two opposing and yet complementary theoretical viewpoints. Some researchers claim that cloze is an integrative rather than a discrete-point test because it draws at once on the overall grammatical, semantic, and rhetorical knowledge of the language. To reconstruct the textual message, students have to understand key ideas and perceive relationships within a stretch of continuous discourse, and to produce, rather than simply recognize, an appropriate word for each blank. The focus of the task involved is more communicative than formal in nature, and is therefore considered to reflect a person's ability to function in the language (Hanania & Shikhani, 1986).

At the other end of the theoretical continuum are those researchers who argue that cloze testing measures only basic skills and lower levels of reading comprehension (Shanahan, Kamil, & Tobin, 1982). In answering a cloze item, the examinee relies on clues within the immediate environment of the blank, and as such, this type of test is only measuring lower-order skills. Correlational experiments with cloze reveal that cloze tests correlate more closely with grammar tests than with reading tests (Alderson, 1979).

My aim here is not to argue for or against any of these theories; I propose rather to outline those language skills, whether they be grammar, reading, or textual skills, which the four types of cloze are believed to assess. Validity research, after all, goes on and on. Meanwhile, teachers might as well make the best use of what research to date can offer.

The cloze was originally developed by Taylor in 1953 to measure text difficulty for native readers; other researchers have since shown that cloze can likewise measure foreign and second language proficiency (Anderson, 1976). Cloze can most appropriately be described as a learner-centered teaching and testing device in second language situations since it is thought to challenge the efficiency of the developing L2 grammar of a student in a way that reflects natural language processes (Oller, 1972). Moreover, cloze provides a contextualized challenge to learner grammar efficiency, displaying an inevitable simplicity which Oller calls nothing short of a stroke of raw genius (Jonz, 1976). This chapter maintains that cloze testing does hold potential for measuring aspects of students' written grammatical competence, "knowledge of vocabulary, morphology, syntax, and phonology," as well as textual competence, "knowledge of cohesive and rhetorical properties of text," in a second language (Bachman, 1990). The specific traits measured by a particular cloze test probably depend in part on methods of test construction, each of which will be described below.

Fixed-rate Deletion Cloze

Rationale for Fixed-rate Cloze

The rationale for using the cloze procedure to measure examinees' reading comprehension is that, if they understand the structure and content of a text, then they will be able to utilize redundancy in the text to recover the deleted words at a better-than-chance level. Cloze test scores are said to be a direct measure of a text's redundancy because they are obtained by deleting words from the text and asking the examinees to recover the deleted words. In this context, text redundancy refers to the degree to which the language in a text is predictable when only parts of it are unknown.

Originally, there were two deletion schemes for constructing cloze texts: the fixed-ratio (also called random) deletion method and the rational-deletion method. In fixed-ratio deletion, the test writer deletes every nth word, and in so doing produces a semi-random sample of the words from the passage. In rational deletion, the test writer chooses the words to be deleted in advance, e.g., content words, or pronouns, or prepositions, etc.

In fixed-rate deletion, the procedure of omitting words in a regular pattern inevitably samples various types of words, some of which are governed by local grammatical constraints, others of which are governed by long-range textual constraints. This means that the students' ability to answer the local-constraint test items depends on their ability to perceive and use grammar rules and their relationships, while answering long-range constraint items requires the students' overall comprehension of the passage.

Studies have shown, however, that differences in test results, subject to text difficulty and topic, are problematic for fixed-rate cloze testing (Brown, 1983). Alderson (1983) claims that differences in cloze test results are due not to deletion frequencies but rather to differences in the particular words deleted. According to Alderson, random deletion ignores the syntactic and semantic relationships in a text, and is therefore likely to yield inconsistent results depending upon what proportion of syntactic and textual functions are tapped. Those who support the use of cloze, Brown (1991) for instance, suggest that cloze items assess a wide range of language points, from morphemes and grammar rules at the clause level to pragmatic-level rules of cohesion and coherence, as well as discourse levels.

Format of the Fixed-ratio Cloze

It seems to me that the evidence supporting the claim that fixed-rate cloze items test different aspects of the examinee's language ability outweighs the criticisms against it, so I will continue the discussion along this line.

The fixed-ratio cloze test is constructed by deleting words from a passage according to a fixed pattern, traditionally every 7th word of a text. Why every 7th? Brown's (1983, 1988a) experiments revealed that cloze passages can be made to fit a group when the distance between items (or words in a passage) was no less than five words and no more than nine, giving an average of seven words. Although 7th-word deletion seems to be a popular deletion pattern, 7th-word cloze test results sometimes reveal that such tests are too difficult for particular groups of students, with, say, an average score of 25%. Brown (1983) suggested ways to increase the mean score and make the test less difficult, such as lengthening the passage and increasing the distance between the blanks from, say, every 7th to every 11th, or even every 15th word.

In constructing a fixed-rate cloze, in most cases, one or two sentences are left intact (i.e., no deletions are made) at the beginning of the passage, and one or two sentences are also left unmodified at the end of the passage to provide a complete context. There is usually a total of 30-50 blanks to be completed. Making an answer key while constructing the test will reveal to the teacher what words are being tested and the percentage of the different kinds of words deleted, or what balance of content and function words is being sampled. Usually, a 50-item cloze test is long enough to represent the different aspects of grammar, vocabulary, and comprehension skills being checked at the end of a course unit. Language teachers, especially those in high schools, usually give a term test or a course-end language check composed of 50-60 points of grammar and vocabulary. This traditional test type could easily be replaced by two fixed-rate cloze tests of at least 30 items each. University students usually take 30-40 minutes to answer a 40-50 item fixed-rate cloze. Because of a lack of research, high school language teachers will have to estimate the time needed by their students to answer a fixed-rate cloze test at the appropriate level of difficulty. However, high school students may be very similar to university students; we just don't know.
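To make the construction procedure concrete, here is a minimal sketch of a fixed-ratio cloze builder along the lines just described: it leaves the opening and closing sentences intact, blanks every nth word in between, and returns an answer key. The passage, the deletion rate, and the function name are illustrative only, not part of the original chapter.

import re

def make_fixed_ratio_cloze(text, nth=7, blank="______"):
    """Delete every nth word of a passage, leaving the first and last
    sentences intact, and return the mutilated text plus an answer key."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    lead, body, tail = sentences[0], sentences[1:-1], sentences[-1]

    words = " ".join(body).split()
    test_words, answer_key = [], []
    for i, word in enumerate(words, start=1):
        if i % nth == 0:                 # every nth word becomes a blank
            answer_key.append(word)
            test_words.append(blank)
        else:
            test_words.append(word)

    return " ".join([lead] + test_words + [tail]), answer_key

# Placeholder passage; in practice, choose one that fits the students'
# level and has enough running words for 30-50 blanks.
passage = ("Cloze tests are easy to construct. A teacher chooses a passage "
           "and deletes words from it at a regular interval. Students then "
           "try to restore the missing words from the surrounding context. "
           "The first and last sentences are usually left untouched.")
test_text, key = make_fixed_ratio_cloze(passage, nth=7)
print(test_text)
print(key)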

Scoring is usually done in one of two ways: the exact-word scoring method and the acceptable-word scoring method. The exact-word scoring method refers to any scoring method where only the word that was originally in the blank in the original passage is counted as correct. The exact-word scoring system has its merits for cloze tests used for research in that: (a) it yields high correlations with other scoring methods, and (b) a single correct answer for each blank is essential for some linguistic analyses.

For classroom testing, however, the acceptable-word scoring system has more advantages than disadvantages. True, the acceptable-word scoring system can be troublesome and a bit time-consuming, but this can be overcome if teachers meet ahead of time and decide which answers are correct and acceptable. It pays to exert a little extra effort at this stage because acceptable-word scoring is easier for students to accept, yields higher performance results, and often turns out to have higher test reliability.
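The difference between the two scoring methods can be expressed in a few lines of code. The sketch below is a hypothetical illustration only: with no list of alternatives it behaves as exact-word scoring, and with a teacher-approved list of alternatives for each blank it behaves as acceptable-word scoring.

def score_cloze(responses, answer_key, acceptable=None):
    """Score a cloze test.

    responses  -- the student's answers, one per blank
    answer_key -- the words originally deleted from the passage
    acceptable -- optional dict mapping blank index to extra answers the
                  teachers have agreed to accept (acceptable-word scoring)
    """
    acceptable = acceptable or {}
    score = 0
    for i, (given, exact) in enumerate(zip(responses, answer_key)):
        allowed = {exact.lower()} | {w.lower() for w in acceptable.get(i, set())}
        if given.strip().lower() in allowed:
            score += 1
    return score

key = ["large", "travel", "quickly"]
student = ["big", "travel", "fast"]

# Exact-word scoring credits only the original word.
print(score_cloze(student, key))                                     # -> 1
# Acceptable-word scoring also credits teacher-approved alternatives.
print(score_cloze(student, key, {0: {"big"}, 2: {"fast", "soon"}}))  # -> 3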

What Skills Does Fixed-ratio Cloze Assess?

It has been argued that there is a dichotomy between proficiency tests of relatively low-order linguistic skills and tests of higher-order skills. Language skills belonging to the low-order category include grammar and vocabulary, while the skills of reading comprehension might more appropriately be classified as higher-order skills. Some cloze researchers have argued that cloze items primarily assess students' ability to comprehend words and ideas at the sentence level, which are said to be lower-order skills. Other researchers claim that cloze items measure intersentential language components, which are said to be higher-order skills. If cloze assesses higher-order skills, these would probably include students' abilities to comprehend meaning and structural relationships among sentences.

Research using fixed-rate deletion and supporting the lower-order skills hypothesis includes Shanahan, Kamil, and Tobin (1982) and Porter (1983). In the Shanahan et al. (1982) study, which used intact and scrambled versions of cloze tests, no significant difference was found in students' performance on the two types of cloze tests. This has led researchers to conclude that cloze tests do not measure the ability to comprehend information beyond the sentence level. Porter used cloze tests varying the points at which the deletions started. His findings reveal that cloze items are not sensitive to contexts beyond five to six words.

However, I feel that the empirical findings from the other camp outweigh the arguments just posed. For example, Brown (1988b, 1991), using the fixed-rate deletion pattern, has argued that words in a cloze test provide a fair representation of the text at the word, clause, and sentence level, i.e., at least some items are sensitive to constraints ranging beyond the limits of sentence boundaries. If true, this would mean that cloze test items do assess a student's comprehension at the discourse and pragmatic level, which definitely involves higher-order skills. Chavez-Oller, Chihara, Weaver, and Oller, Jr. (1985) made similar claims that at least some items in a cloze test are sensitive to long-range constraints beyond five to ten words. Yamashita's (1994) experiment, although limited to the morphemic and clausal level, concluded that cloze tests using fixed-rate deletion are effective in analyzing the reading comprehension of native and non-native English learners.

This discussion of both sides of this theoretical issue is not intended to confuse the language practitioner, but is instead aimed at providing some background on what is happening in cloze testing. More importantly for the language teacher, these empirical results should provide a total picture of the reality of language testing. As for the higher-order and lower-order skills issue, the concern of most language teachers is to assess some or both of these skills during the time available.

I feel that, when teachers make cloze tests using fixed-rate deletion, they can rest assured that both of these levels of skills are being assessed and that each of the following structural categories is proportionately represented in the cloze test they are making: (a) word level, (b) clause level, (c) sentence level, and (d) paragraph level (Perkins & German, 1985). A word-level deletion, also called local level, means that for the students to retrieve the deleted word, they rely only on the clues found within two words of a blank. A clausal-level deletion means that students can find clues to answer the blank from within the clause but beyond two words of where the blank appears. A sentence-level deletion means that students must find clues within the sentence where the blank appears. Paragraph-level deletion means clues are found within the paragraph in order to retrieve the missing word. Thus the examinee must have understood at least part of the content of the entire paragraph in order to fill in the deleted words.

Previous research where these different levels of deletion were made on a single text is reported in Perkins and German (1985), where 10, 11, 5, and 9 deletions were made at the word, clause, sentence, and paragraph level, respectively. Chapelle and Abraham (1990), following these structural categories, had 4, 12, 6, and 3 such deletions, respectively. To me, this research clearly indicates that even in a single passage, teachers can construct a cloze test that can assess different levels of skills by systematically varying the number of deletions in each of the categories listed by Perkins and German (1985).

Pros and Cons of the Fixed-rate Deletion Pattern

The fixed-rate cloze is the most difficult of all the cloze types to answer because no deliberation over items takes place during test construction. In addition, the literature on fixed-rate deletion cloze is marked by inconsistent results from one study to another. These inconsistencies, which appear to occur even when different deletion rates are used for the same text, may be caused by uneven sampling of the skills measured from one cloze to another (Alderson, 1979). Another explanation given for cloze result inconsistencies is text and item difficulty, suggesting the need for passage fit (Brown, 1983). This probably means that teachers should use passages that fit their students in terms of difficulty level, topic, interest level, and the like. Obviously, cloze items are at the root of cloze test performance, and it may be possible to improve a cloze test by explicitly selecting the words to be deleted through the use of the rational-deletion cloze, the topic which I will consider next.


Rational-Deletion Cloze Technique

Rationale for Rational-deletion Cloze

While fixed-ratio cloze tests rely on regular sampling of words in a text by deleting words in a regular pattern, the rational cloze assumes that the different cloze items can be explicitly chosen to measure different language traits. Bachman (1982) provides evidence that the test writer can select words reflecting distinct aspects of the learners' grammatical and textual competence.

Considerable debate has occurred among cloze researchers as to whether all deletions in a cloze passage measure the same abilities. Some researchers had assumed that examinees' item responses to function-word deletions provide information about their understanding of the written text's structure, while their item responses to content words provide information about their comprehension of the written text's content. It would seem, then, that a rational-deletion cloze could be designed to measure a wide range of language abilities.

To develop a test that would potentially measure textual relationships (students' comprehension of structure and meaning) beyond the clause level, it would be necessary to identify a set of criteria for classifying and selecting the words to be deleted, perhaps using the following three deletion types: (a) syntactic (understanding of words and their relationships within the clause), (b) cohesive (based on the student's understanding of meaning and structure at the intersentential level), and (c) strategic (depending on long-range patterns of coherence, i.e., comprehension of the passage).

Halliday and Hasan (1976) developed a framework that can be used for constructing items that assess cohesion, but such a framework proved to be difficult and subjective, and almost impossible for the classroom teacher to apply. Another attempt was made by Bachman (1985), in which four types of deletions were defined according to four hypothesized levels of context required for closure: (a) within clause, (b) across clauses but within sentence, (c) across sentences, within text, and (d) extra-textual. To make a test primarily based on cohesion (and thus testing the higher-order skills of comprehension), the test maker would have to maximize deletions of types (b) and (c), and minimize deletions of types (a) and (d).

Skills Measured by Rational Cloze

Bachman's (1982) study found support for the claim that cloze tests can be used to measure higher-order skills of cohesion and coherence if a rational deletion procedure is used. Having control over the particular words deleted, the language teacher can construct cloze passages that measure textual relationships beyond clause boundaries. The target skills in a rational cloze can include one or all of the following: (a) syntactic (depending on the students' understanding at the clause level), (b) cohesive (depending on the students' comprehension of text, based on interclausal and intersentential clues of cohesion), and (c) strategic (depending on the students' comprehension of text, based on patterns of coherence).

If a teacher makes deletions using the categories just outlined, for example, a 30-item text with an average deletion of one word in twelve could probably be administered and completed in about 20 minutes by university students. Otherwise, a teacher could use the same contextual categories as those suggested in the section on fixed-rate deletion to explicitly choose items for a rational cloze test. For example, a rational cloze test with 3, 13, 5, and 14 words deleted at the word, clause, sentence, and paragraph level, respectively, which was given to university students, took 25 minutes to answer. These time limits are merely meant to be suggestive; teachers may have to modify the time allowed for students to complete a particular cloze test depending on their level and on the kind of deletion patterns made.
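Because the defining feature of the rational cloze is that the teacher, not a counting rule, decides which words disappear, it can be sketched as a function that takes a deletion plan drawn up in advance. The passage, word positions, and category labels below are hypothetical examples, not items from the studies cited.

def make_rational_cloze(text, deletion_plan, blank="______"):
    """Build a rational-deletion cloze from a plan drawn up in advance.

    deletion_plan -- dict mapping a category label (e.g., "syntactic",
                     "cohesive", "strategic") to the 0-based positions of
                     the words chosen for deletion in that category.
    """
    words = text.split()
    answer_key = {}
    for positions in deletion_plan.values():
        for pos in positions:
            answer_key[pos] = words[pos]
            words[pos] = blank
    return " ".join(words), answer_key

# Hypothetical plan: the teacher has decided that "Therefore" and "Although"
# test cohesion across sentences, and "after" tests clause-level syntax.
passage = ("Mari missed the bus. Therefore she walked to school. "
           "Although she hurried, she still arrived after the bell.")
plan = {"cohesive": [4, 9], "syntactic": [15]}
test_text, key = make_rational_cloze(passage, plan)
print(test_text)
print(key)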

Scoring a teacher-made rational-deletion cloze test can be done using either the acceptable-word or alternative-word criterion method (wherein a list of syntactically or semantically alternative answers can be prepared). Ideally, the list of acceptable answers will be made based on the responses of samples of native English speakers, if pilot-testing of the cloze test can be done. Almost always, however, using native speakers to establish an answer key is quite time-consuming and therefore not possible. The simplest way to make an answer key may be to draw up a list based on the judgements of the test development team or the teacher(s) making the test.

Advantages and Disadvantages of Rational-deletion Cloze

Empirical results on rational-deletion cloze item performance have been mixed. For instance, in Bachman's (1985) study, the rational-deletion cloze produced a test that was easier than the fixed-ratio cloze, while other results have shown that the overall test difficulty of the two types of cloze was almost the same (Chapelle & Abraham, 1990). The rational-deletion cloze has also been shown to produce results with higher reliability than the fixed-ratio cloze, but results have also been produced with about the same reliability for both rational- and fixed-deletion cloze tests (Bachman, 1985).

Multiple-choice Cloze Technique

Rationale for Multiple-choice Cloze

Both the fixed-ratio (or random-deletion) and the rational-deletion cloze discussed above are administered to the students with instructions to replace the missing items. The multiple-choice cloze gives students a limited range of choices with which to compare the responses they generate. The self-generated response in answering a multiple-choice cloze is what Jonz (1976) called the student-centered remodeling of the cloze procedure. However, the most important characteristic of the multiple-choice cloze is its ability to assess a wide range of language skills.

Generally, research on this type of cloze indicates that the number of items can be reduced without sacrificing the reliability of the test. Thus a shorter multiple-choice cloze can reduce the strain the student is exposed to and may represent a marked improvement from the teacher or test administrator's point of view because it introduces objectivity in scoring in addition to the smaller number of items to score. Research has demonstrated that constructing a test response (as in a fill-in type of cloze) is more difficult for test takers than selecting one (as in a multiple-choice cloze). Some language teachers might ask: Doesn't making a test less difficult endanger its validity and reliability? The answer is no. A shorter and easier test can be just as high in reliability as a longer and more difficult one. The research indicates that comparable reliability levels can be obtained for multiple-choice and fill-in versions of the cloze test, even if the multiple-choice versions are shorter.

For example, Bensousan and Ramraz (1984) show higher mean scores obtained on multiple-choice cloze tests (which are therefore easier than the other cloze types). Other studies show that multiple-choice cloze produces adequate reliabilities. For instance, Jonz (1976) found a reliability of .76; Bensousan and Ramraz (1984) report .82 and .84; and Hale, Stansfield, Rocks, Hicks, Butler, and Oller (1989) show a reliability of .88 for multiple-choice cloze.

While multiple-choice variants of the cloze procedure have been less thoroughly researched than others, this type of cloze seems to involve processes similar to those required by the traditional cloze types. For instance, research on multiple-choice cloze testing shows that the multiple-choice cloze obtains comparable levels of correlation with criterion measures (Hale et al., 1989).

Pros and Cons of Multiple-choice Cloze

Deletion rate. The deletion process involved in a multiple-choice cloze can be either random deletion or rational deletion (while providing a set of other responses to choose from). Whichever deletion pattern is used, research has shown the reliability of multiple-choice cloze tests to be higher than .70 (.76 in Jonz, 1976 & 1990, and .76 in Chapelle & Abraham, 1990).

Distractors. The major drawback with the use of the multiple-choice cloze lies in the fact that it is difficult to construct. Hinofotis and Snow (1980) argue that constructing a multiple-choice cloze is a considerably more complicated procedure than constructing an open-ended cloze test. The most important consideration in making multiple-choice cloze items is the process of providing distractors (i.e., the words, other than the correct answer, from which the student must choose).

Obtaining the distractors used to create a set of four options usually requires pretesting and involves checking students' responses for item facility and discrimination. The results of pretesting can be used to determine two things: the range of possible correct responses, and distractors for the final version of the test. The highest-frequency acceptable response can become the correct response on the multiple-choice test, and the unacceptable responses with the highest frequency can then be chosen as distractors (Jonz, 1976). Based on item analysis of the pilot administration, items that do not discriminate well can also be discarded or modified.

A simpler procedure that can be adopted is that used by Chapelle and Abraham (1990). In most of their items, there were four alternatives given, and these distractors were the same part of speech as the correct answer. The results yielded high reliability.
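The pretest-based procedure attributed to Jonz above can be mocked up in a few lines: tally the open-ended responses for a blank, take the most frequent acceptable response as the key, and take the most frequent unacceptable responses as distractors. The response data and the function name below are hypothetical, and item facility and discrimination checks are left out of this sketch.

from collections import Counter

def build_mc_item(pretest_responses, acceptable_answers, n_distractors=3):
    """Turn open-ended pretest responses for one blank into a multiple-choice
    item: the most frequent acceptable response becomes the key, and the most
    frequent unacceptable responses become the distractors."""
    counts = Counter(r.strip().lower() for r in pretest_responses)
    acceptable = {a.lower() for a in acceptable_answers}

    key = max((r for r in counts if r in acceptable), key=counts.get)
    distractors = [r for r, _ in counts.most_common()
                   if r not in acceptable][:n_distractors]
    return key, distractors

# Hypothetical pretest responses collected for a single blank.
responses = ["lake", "lake", "water", "cave", "lake", "sea", "water", "lake"]
key, distractors = build_mc_item(responses, acceptable_answers={"lake"})
print(key, distractors)   # -> lake ['water', 'cave', 'sea']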

Number of test items. A short (say, thirty-item) multiple-choice cloze has the following advantages: it takes the slowest university student 20 minutes to complete, and it takes only a minute to score each paper. In addition, such a test yields high reliability (Jonz, 1976), which indicates that the quality of the multiple-choice cloze is on par with other cloze test types. There is no minimum or maximum number of items required to come up with a good multiple-choice test, but apparently, one-third to one-tenth fewer items can be used on a multiple-choice cloze as compared to the other formats without sacrificing reliability.

Test administration. It has also been argued that any multiple-choice cloze which calls for a separate answer sheet (either containing lists of multiple-choice options or a machine response form) tends to distract the examinee and is less desirable than the fill-in cloze (Jonz, 1976). Generally, for classroom use, no separate sheet will be necessary because the options can simply be aligned in a rectangular frame within the text. A separate answer sheet is only really needed in order to facilitate scoring procedures in large-scale testing. In a classroom situation, where immediate feedback may be desirable, checking could even be done by the students with the teacher's guidance.

What Skills Does Multiple-choice Cloze Measure?

Now to the most important section, which I believe every concerned language teacher is curious about: which areas of second language competence does a multiple-choice cloze test assess? The section above mentioned that the deletion process chosen by the teacher in constructing a multiple-choice test will depend on what skills are targeted for the particular test. There are several empirically supported ways of constructing a multiple-choice cloze.

Generally, multiple-choice test items require the reader to focus on a specific amount of text in order to answer a question. Such focus can vary from one word within the sentence to the entire text. The ability to answer multiple-choice cloze items is not restricted to the comprehension of single words; it is likewise believed to tap understanding of a wider context. Thus, it is possible to employ rational-deletion procedures to measure skills of long-range comprehension by choosing items sensitive to long-range constraints (Bensousan & Ramraz, 1984).

One study comparing multiple-choice cloze results with other cloze types indicated that multiple-choice cloze scores correlate well with composition scores (Jonz, 1976), while another study (Hinofotis & Snow, 1980) claimed that multiple-choice cloze test results correlate highest with structure and reading test scores. Most empirical findings on multiple-choice cloze show strong correlations with reading tests, supporting the assertion that multiple-choice test items can assess reading skills in ESL (Ozete, 1977; Brown, 1985; and Chapelle & Abraham, 1990). It has also been argued that the multiple-choice cloze falls closer to the discrete-point end of the test continuum.

However, I feel that the degree to which multiple-choice cloze items are discrete-point must be determined on the basis of the kinds of items actually used in the test. Items that require learners to process the meaning of connected pieces of language rather than discrete segments are integrative (Chapelle, 1988). Some multiple-choice cloze items are aimed at reading comprehension (defined in terms of textual constraints ranging across clauses) as contrasted with knowledge of grammar (short-range surface syntax, and morphology or vocabulary). A number of research studies have indicated that multiple-choice cloze can be used for teaching and testing reading comprehension. Brown (1985), for instance, recommended ways of using cloze for the teaching of reading.

The classification scheme developed by Hale et al. (1989) is the most detailed so far, and can serve as a useful reference for teachers who would like to improve their multiple-choice cloze tests. Their scheme consisted of four categories of multiple-choice cloze items, based on the three skill areas of grammar, vocabulary, and reading comprehension. In answering each cloze item, the student has to use two of these skills simultaneously, with reading always involved. Each of the four categories will be explained in turn.

Reading comprehension and grammar items. In this type of multiple-choice cloze item, the student has to understand propositional information at an inter-clausal level, emphasizing knowledge of syntax. For instance: A ballad is a folk song; however, a folk song is not a ballad [because, if, whether, unless] it tells a story.

Reading comprehension and vocabulary items. To answer this type of item, the student has to comprehend inter-clausal relationships, and at the same time, knowledge of vocabulary is involved. For example: ... known as the Lost Sea. It is listed in the Guinness Book of World Records as the world's largest underground [water, body, lake, cave].

Grammar/reading comprehension. Items of this type tap the students' knowledge of surface syntax, while reading comprehension is involved only to the degree that the reader must understand within-clause propositional information. For instance: It is generally understood that a ballad is a song that tells a story (but) a folk song is not so [easy, easily, ease, easier] defined.

Vocabulary/reading comprehension. This item type checks on the student's vocabulary skill, although it involves reading comprehension based on information within clause boundaries. For example: In fact, there are many folk songs about occupations: railroading, [following, mustering, concentrating, herding] cattle, and so on.

These examples illustrate that, although reading comprehension of a text is basic in answering a multiple-choice cloze test item, other skills such as vocabulary and grammar can also be effectively tested. Classroom teachers have control over which aspects of grammar and vocabulary they would like to emphasize depending on which of these skills is the focal point of their assessment at the time.

C-Test

Rationale for the C-test

Research on the C-test has provided ample theoretical justification and proven the empirical reliability and validity of this cloze variant. Two major investigations, Alderson (1979) and Klein-Braley (1983), pointed to a number of problems with the ordinary cloze, as follows:

1. The results are unpredictable for various deletion rates. Also, different deletion rates and starting points of deletion, even in the same passage, can cause considerable variation in difficulty.

2. Particularly for homogeneous samples, cloze tests can result in unsatisfactory test reliability.

3. Students may perform differently on different cloze tests depending on the topic and difficulty level of the passage, and these variables can also result in different degrees of test reliability and validity.

4. The generally accepted practice of using only one passage for a cloze test was found to be a source of bias in scores.

5. Deleting every nth word in a cloze test may not always produce a set of words that represents the word classes found in the passage used.

6. There is the problem of lack of criterion referencing to native-speaker performance. In cloze testing with L1 examinees, even adult speakers rarely obtain a perfect score, and many experience a certain amount of frustration.

In response to these problems, an alternative testing system, christened the C-test by Klein-Braley and Raatz (1985), was developed and investigated. The C-test improves on the psychometric properties of the cloze by using an every-other-half-word deletion called the Rule of Two. First, by using several short texts from different passages in a C-test, students find it less frustrating than other types of cloze (Mochizuki, 1994) in terms of difficulty level and text topic, and thereby text bias in scores is avoided. Because of its every-other-word partial deletion pattern, the C-test was felt to adequately include words that are of the same part of speech or word class as in the texts used (Klein-Braley, 1985). The problem of lack of criterion referencing to native-speaker performance is also apparently overcome because, in experiments using the C-test, educated adult examinees have been found to achieve virtually perfect scores.

Format of the C-test

In the C-test, the second half of every second word is deleted instead of the whole word, according to the Rule of Two. A C-test consists of a number of short texts (usually five or six). The first sentence of each passage is left intact to provide context. Then, beginning in the second sentence, the second half of every second word is deleted until the desired number of mutilations is reached. Each short text (usually a paragraph which consists of a series of blanks) makes up one superitem of the test, with a total of five or six superitems. The students are required to fill in the blanks with even and odd numbers of letters alternately. For example: (1) sto___, (2) ph___ [one], (3) mou___ [th], (4) ov___. In word (1) two letters are deleted, in word (2) three letters, in word (3) two letters, and in word (4) three letters. Numerals and proper nouns (e.g., 5100 km, Mr. James Stewart, etc.) are typically disregarded in counting every second word (Mochizuki, 1994).

To overcome the problems found in the other types of cloze tests, a list of criteria was developed by Klein-Braley and Raatz (1984) with which a C-test should comply in order to be satisfactory: (a) the test should have several texts: there should be at least four paragraphs, each coming from a different passage; (b) the test should have at least 100 deletions; (c) adult native speakers should obtain virtually perfect scores; (d) the deletions should affect a representative sample of the text; (e) only exact-word scoring should be possible; and (f) the test should have high reliability (0.80 or higher) and empirical validity.

The theoretical justifications for these rules are also given in Klein-Braley and Raatz (1984). However, busy secondary ESL teachers may not have time to comply with all of these requirements, and in fact, the results of previous empirical research may justify ignoring them.

First, according to Klein-Braley and Raatz (1984), the C-test should be constructed using five or six short texts. In fact, it may be more important to be concerned about the appropriateness of whatever text is used for the test in terms of reliability, validity, and of course practicality. Mochizuki (1994) reports the results of an experiment with a C-test using four different kinds of texts: Narration, Explanation, Argumentation, and Description. The results showed that C-tests based on a long passage, especially Narration and Explanation texts, were more reliable. Especially noteworthy is the fact that the C-test using the long Narration passage met the requirement set by Klein-Braley and Raatz of high reliability (.90), though it had only moderate criterion-related validity (a .50 correlation with a criterion measure). Mochizuki concluded that a C-test based on a single long passage, even though there was only one kind of text, might work well with secondary school classes, not only because of its high reliability results but also because of its practicality in terms of exempting the teacher from the burden of finding numerous kinds of texts.

Second, Klein-Braley and Raatz (1984) argued that a C-test should have at least 100 deletions. It is easy to come up with a 100-item test using a long narrative text or explanation passage (if it has more than 200 words) from a high school reader, a textbook passage that has not yet been taken up in class, or a supplementary text. For the teacher who wants to use a one-text test, the text should preferably be a narration or explanation passage of four or more paragraphs. Then each paragraph can be treated as a superitem with at least 15 deletions in each paragraph. In most cases, the reliability should be relatively high. Chapelle and Abraham (1990) and Ikeguchi (unpublished ms.) each used fewer than 100 deletions, and yet the reliability was as high as .90. However, test makers who use C-tests for program-level decisions like placement may well want to use four or more short texts of 100 deletions each, in accordance with the theoretical justifications outlined by Klein-Braley and Raatz (1984).

Third, Klein-Braley and Raatz (1984) argued that adult native speakers should achieve virtually perfect scores. I feel that ESL teachers generally need not worry about this because researchers have shown that one of the most interesting features of the C-test is the consistently excellent empirical performance of native speakers on these tests. Educated adult native speakers almost always achieve virtually perfect scores on C-tests (Klein-Braley & Raatz, 1984).

Fourth, Klein-Braley and Raatz (1984) argued that deletions should be a representative sample of the text. I feel that the very nature of the deletion system in a C-test makes it possible for deleted words to adequately represent all of the word classes that are present in the passage. The language practitioner need not worry about this because empirical research gives considerable reason to believe that a representative sample will be created in most C-tests.

Fifth, Klein-Braley and Raatz (1984) said that only the exact-word scoring method should be possible. Typically, only exact-word answers occur on a C-test. Hence, scoring has been shown repeatedly to be a lot easier for the C-test than for most other cloze test types. One word of advice to make scoring much easier is to have students use a separate answer sheet on which they write whole-word answers. It is easier for scorers to recognize whole words than groups of letters. This strategy will lead to more practical, faster, and more efficient scoring.

Sixth, Klein-Braley and Raatz (1984) say that a C-test should have high reliability and validity. Chapelle and Abraham (1990) used one long passage with five paragraphs for their C-test experiment. Each paragraph was treated as a superitem with fifteen deletions in each paragraph. The reliability was found to be .81, and the lowest validity coefficient was .50. In Mochizuki's (1994) experiment, one long passage of the Narration type was used, with five paragraphs having a series of deletions. The results were higher than .90 reliability and .50 criterion-related validity. The results for Ikeguchi (unpublished ms.), which used one long passage with five paragraphs containing approximately fifteen deletions each, indicated a reliability of .90 and a .60 correlation with a criterion-related validity measure.
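For teachers who want to check whether their own C-test meets the 0.80 reliability criterion mentioned above, internal consistency can be estimated from superitem scores with Cronbach's alpha. The sketch below is a generic illustration with invented scores; the chapter does not specify which reliability formula the cited studies actually used.

import numpy as np

def cronbach_alpha(score_matrix):
    """Cronbach's alpha for a matrix of scores
    (rows = students, columns = items or superitems)."""
    scores = np.asarray(score_matrix, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

# Invented superitem scores for ten students on a five-paragraph C-test
# (each entry is the number of blanks restored correctly in that paragraph).
scores = [[12, 11, 13, 10, 12], [8, 9, 7, 8, 9], [14, 13, 15, 14, 13],
          [6, 7, 5, 6, 7], [10, 9, 11, 10, 10], [13, 12, 12, 13, 14],
          [7, 8, 6, 7, 7], [11, 10, 12, 11, 11], [9, 9, 8, 10, 9],
          [15, 14, 15, 15, 14]]
print(round(cronbach_alpha(scores), 2))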

What Does the C-test Measure?

The next important question that has to be dealt with at this point is what specific language traits the C-test technique measures. What the C-test is measuring is still open to question. There have always been two sides on this issue. On the one hand are those who argue that C-tests assess more grammatical competence than textual competence. As seen in the discussions above, the C-test only requires the student to fill in the second halves of words. In completing a given word, the most important clues for the test taker are in the immediate environment of the blank (Klein-Braley, 1985), including the first half of the word itself. On the basis of student performance, researchers have found that recognition of syntactic relationships comes first, after which understanding of semantic relationships is necessary for the true comprehension needed for good performance. Simply stated, the students probably look first for clues in the half of the word that is given; then, after formulating a guess, they gradually look for the relationship in meaning between their guess and the words around it.

Experiments with C-tests have primarily involved correlating C-test scores with other types of language tests. For example, the Chapelle and Abraham (1990) study revealed that C-tests correlated strongly with a vocabulary test. On this basis, they concluded that C-tests tend to measure the grammatical competence of students. On the other hand, validity research suggests that the C-test is a measure of overall language proficiency (Stansfield & Hansen, 1983). The C-test was explicitly developed as a test of general language proficiency (Klein-Braley & Raatz, 1984). Indeed, some results obtained with C-tests show meaningful relationships with other tests of general language knowledge and performance. Still other experiments reveal that C-test scores increase regularly and predictably with an individual's linguistic maturational level (Klein-Braley, 1985). This notion is further supported by the fact that adult native speakers perform well on C-tests.

When and How to Use the C-test

C-tests have been shown to be both empirically and theoretically valid. It is now time to go back to the more practical issues involved in classroom testing. As an objective test type, the C-test's most important characteristics are that it is easy to construct and easy to score. Depending on the length of the test, it can take 30-45 minutes to administer, at least at the high school level.

C-tests are, however, typically very difficult: "one expects that on average only half of the mutilations will be correctly restored" (Klein-Braley & Raatz, 1984). A test with a 50% average may be very frustrating for both the students and the teacher. Thus the C-test should not be used at the end of a course unit where mastery of the subject is required and where an average of about 80% may be desired. The main usefulness of the C-test lies at the start or end of a course, when it is necessary to determine the ranking of students in relation to each other within a group. In addition, the empirical research indicates that C-tests are grammatically based. Hence, teachers might want to make use of this test to measure students' grammatical competence and perhaps vocabulary development. Both teachers and language administrators can use classroom C-test results, together with other more complex language tests, in selecting and placing students in a language program. The section above on what skills the C-test measures cited some evidence that C-test performance increases with the student's linguistic maturational level. Thus the language teacher can use the results of a C-test to measure and compare the periodic language skills progress of the same group of students. This may sound like too big a task, but given the relative ease with which the C-test can be constructed, administered, and scored, it may prove worth trying.

Conclusion

This chapter has outlined four different types of cloze in relation to classroom testing. Cloze procedures hold potential for measuring several different aspects of students' second language competence (Bachman, 1990). The specific language traits measured by a particular cloze depend on the methods of cloze construction and on the types of responses required of students.

The first type, the fixed-ratio cloze, is intended to sample various types of words on a regular basis; some words may be governed by local, grammatical constraints, and others by long-range, textual constraints. The second type, the rational-deletion cloze, allows the test developer control over the types of words deleted, and thus the language traits measured. The third type, by altering the mode of expected response, requires the students to select the correct answer from a given set of choices. In the fourth type, the C-test, deletions are made on the second half of every other word. Because of the shorter segments of text and the importance of clues in the immediate environment (Chapelle & Abraham, 1990), this procedure most likely results in a test that is more directly related to grammatical competence.

While debate within the empirical research on cloze testing still continues, the best that second language teachers can do is to apply whatever the research can offer so far.

References

Alderson, J. C. (1979). The cloze procedure and proficiency in English as a second language. TESOL Quarterly, 13, 219-226.

Alderson, J. C. (1983). The cloze procedure and proficiency in English as a second language. In J. W. Oller, Jr. (Ed.), Issues in language testing research (pp. 205-217). Rowley, MA: Newbury House.

Anderson, J. (1976). Psycholinguistic experiments in foreign language testing. St. Lucia, Queensland, Australia: University of Queensland Press.

Bachman, L. (1982). The trait structure of cloze test scores. TESOL Quarterly, 16, 61-70.

Bachman, L. (1985). Performance on cloze tests with fixed-ratio and rational deletions. TESOL Quarterly, 19, 535-556.

Bachman, L. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Bensousan, M., & Ramraz, R. (1984). Testing reading comprehension using a multiple-choice rational cloze. Modern Language Journal, 68, 230-239.

Brown, J. D. (1983). A closer look at cloze: Validity and reliability. In J. W. Oller, Jr. (Ed.), Issues in language testing research (pp. 237-250). Rowley, MA: Newbury House.

Brown, J. D. (1985). Cloze procedure: A tool for teaching second language reading. TESOL Newsletter, 2(5), 1 & 7.

Brown, J. D. (1988a). Tailored cloze: Improved with classical item analysis techniques. Language Testing, 5(1), 19-31.

Brown, J. D. (1988b). What makes a cloze item difficult? University of Hawaii Working Papers in ESL, X(2), 17-39.

Brown, J. D. (1991). What test characteristics predict human performance on cloze test items? In Proceedings of the Third Conference on Language Research in Japan (pp. 1-26). Urasa, Japan: International University of Japan.

Chapelle, C. (1988). Field independence: A source of language test variance? Language Testing, 5, 62-82.

Chapelle, C., & Abraham, R. (1990). Cloze method: What difference does it make? Language Testing, 7, 121-146.

Chavez-Oller, M. A., Chihara, T., Weaver, K. A., & Oller, J. W., Jr. (1985). When are cloze items sensitive to constraints across sentences? Language Learning, 5, 62-82.

Hale, G., Stansfield, C., Rock, D., Hicks, M., Butler, F., & Oller, J. (1989). The relation of multiple-choice cloze items to the Test of English as a Foreign Language. Language Testing, 6, 49-78.

Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. London: Longman.

Hanania, E., & Shikhani, M. (1986). Interrelationships among three tests of language proficiency: Standardized ESL, cloze and writing. TESOL Quarterly, 20, 97-109.

Hinofotis, F., & Snow, B. (1980). An alternative cloze testing procedure: Multiple-choice format. In J. W. Oller, Jr., & K. Perkins (Eds.), Research in language testing (pp. 238-245). Rowley, MA: Newbury House.

Ikeguchi, C. (Unpublished ms.). The four cloze test types: To each its own. Dokkyo University, Soka-shi, Saitama Prefecture, Japan.

Jonz, J. (1976). Improving on the basic egg: The M-C cloze. Language Learning, 26, 255-265.

Jonz, J. (1990). Another turn in the conversation: What does cloze measure? TESOL Quarterly, 24, 61-83.

Klein-Braley, C. (1983). A cloze is a question. In J. W. Oller, Jr. (Ed.), Issues in language testing research (pp. 134-156). Rowley, MA: Newbury House.

Klein-Braley, C. (1985). A cloze-up on the C-test. Language Testing, 2, 76-104.

Klein-Braley, C., & Raatz, U. (1984). A survey of research on the C-test. Language Testing, 1, 134-146.

Lo Castro, V. (1994). Teachers helping themselves. The Language Teacher, 18(2), 4-7.

Mochizuki, A. (1994). Four kinds of texts, their reliability and validity. JALT Journal, 16, 41-54.

Ozete, P. (1977). The cloze procedure: A modification. Foreign Language Annals, 10, 565-568.

Perkins, J., & German, P. (1985). The effect of structure gain on different structural category deletions in a cloze test. Paper presented at Midwest TESOL, Milwaukee, WI.

Porter, D. (1983). The effect of quantity and quality on the ability to make predictions. In D. Porter & A. Hughes (Eds.), Current developments in language testing. London: Academic Press.

Shanahan, T., Kamil, M. L., & Tobin, A. W. (1982). Cloze as a measure of intersentential comprehension. Reading Research Quarterly, 17, 229-255.

Stansfield, C., & Hansen, J. (1983). Field dependence-independence as a variable in second language test performance. TESOL Quarterly, 17, 29-38.

Taylor, W. L. (1953). Cloze procedure: A new tool for measuring readability. Journalism Quarterly, 30, 415-433.

Yamashita, S. (1994). Is the reading comprehension performance of learners of Japanese as a second language the same as that of Japanese children? An analysis using a cloze test. Sekai no Nihongo Kyoiku, 4, 133-146.


Chapter 18

The Validity of Written Pronunciation Questions:

Focus on Phoneme Discrimination

SHIN'ICHI INOI

OHU UNIVERSITY

Pronunciation questions have been so popular that they are often included in entrance examinations to colleges and universities. They are even found in the Center Exams administered by the Ministry of Education. Some researchers, however, have recently cast doubt on the validity of pronunciation questions on written tests as a means of evaluating students' actual pronunciation ability. Among these critics are Katayama, Endo, Kakita, and Sasaki (1985), Takei (1989a, 1989b), and Wakabayashi and Negishi (1993). These critics all claim that the teacher cannot assess students' pronunciation performance with questions on a written test and that such questions cannot be a substitute for an oral test. They all stress the importance of an oral test in evaluating students' pronunciation ability. However, Sasaki and Tomohiko (1991) and Shirahata (1991) took a somewhat different view on this point. While they admitted the low content validity of the phoneme discrimination questions on their written test, they claimed that the primary stress questions were valid.

In a previous study (Inoi, 1994), I investigated whether pronunciation questions on written tests are a valid measurement of students' pronunciation ability. A written test of 20 questions on primary stress and another 20 questions on phoneme discrimination was administered to 44 Japanese college freshmen in a language laboratory. After the written test, an oral version was given to the same subjects. The written primary stress questions were all of the same type: the subjects were to choose which syllable of the word in question had primary stress. The written phoneme discrimination test, divided into two sections, had the same format as in the present study (see Appendix A). On the oral version of the primary stress test, the subjects were instructed to pronounce each of the words. On the oral phoneme discrimination test, the subjects were asked to pronounce all the words on the items, including alternative responses. The subjects' pronunciations on both the primary stress and phoneme discrimination tests were recorded on cassette tapes. The scoring criteria employed for the oral data, which were the same as those for the present study, are explained in detail later. The data obtained from the written test and the oral test were analyzed in terms of the agreement rate, that is, the extent to which the answers were identical between the two tests, as well as in terms of the correlation between the scores on the written and oral tests.


Figure 1. Contingency Table

                  Oral Test
Written Test      Correct     Wrong
  Correct         A           B
  Wrong           C           D

To calculate a subject's agreement rate, each subject's answers on the two tests were sorted into a two-by-two contingency table, as shown in Figure 1. Cell A shows the number of answers correct on both the written and oral tests. Cell B shows the pairs of answers correct on the written test but wrong on the oral test. Cell C shows the pairs correct on the oral test but wrong on the written test, and D shows the pairs wrong on both tests. The agreement rate was calculated as follows: the sum of A and D was divided by the total (i.e., A + B + C + D), and the value obtained was then multiplied by 100.
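
Expressed as code, the same calculation takes only a few lines. The Python sketch below is illustrative rather than the scoring program actually used in the study; it assumes each subject's answers have been coded as correct (True) or wrong (False) for the written and oral versions of the same items, and the toy data at the end are hypothetical.

def agreement_rate(written_correct, oral_correct):
    """Agreement rate as defined above: (A + D) / (A + B + C + D) * 100,
    where A-D are the cells of the contingency table in Figure 1."""
    a = sum(w and o for w, o in zip(written_correct, oral_correct))              # correct on both
    b = sum(w and not o for w, o in zip(written_correct, oral_correct))          # written only
    c = sum((not w) and o for w, o in zip(written_correct, oral_correct))        # oral only
    d = sum((not w) and (not o) for w, o in zip(written_correct, oral_correct))  # wrong on both
    return 100 * (a + d) / (a + b + c + d)

# Hypothetical data: one subject's 20 answers on each version of the test.
written = [True] * 14 + [False] * 6
oral = [True] * 12 + [False] * 8
print(round(agreement_rate(written, oral), 1))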

For the primary stress test, the average agreement rate of the subjects was 77.9 percent, or about 80 percent. A moderate correlation was observed between scores on the two tests (r = .65, df = 42, p < .01). Therefore, it was concluded that the primary stress questions on the written test were reasonably sound measures.

As for the phoneme discrimination questions, the average agreement rate was 67.7 percent, about 10 percent lower than the one for the primary stress questions. In one of the two sections of the test, a moderate correlation was found (r = .61, df = 42, p < .01), but in the other section, no statistically significant correlation was observed (r = .22, df = 42, NS). Thus, compared with the primary stress test, the phoneme discrimination test was relatively weak. The non-significant correlation indicated that the phoneme discrimination questions in that section may not be valid.

However, each of the two sections of the phoneme discrimination test contained only 10 questions, and the questions used were randomly taken from various college entrance examinations. There remains the possibility that a greater number of questions and a greater variety of words used for the questions may produce results different from those in my previous study. The present study is an attempt to pursue that possibility. By focusing on phoneme discrimination questions, this study further investigates the effectiveness of a written phoneme discrimination test.

Experimental Procedure

The experiment was administered to 60 college freshmen in the language laboratory of Ohu University in Koriyama, Fukushima-ken, on July 18, 1994. The subjects were all Japanese learners of English as a foreign language. As in the previous study, a written test was administered first and was then followed by an oral version. Each subject was given a test sheet of 30 phoneme discrimination items. The test was multiple-choice and was divided into two parts (Section A and Section B), each of which had 15 items. In Section A, the first five items were taken from the 1994 Center Exam, the next five from the 1993 exam, and the last five from the 1992 exam. In Section B, the first five were from the 1989 exam, the second five from the 1989 supplementary exam, the next four from the 1988 exam, and the last one from the 1988 supplementary exam (see Appendix). No use of dictionaries was allowed. All of the subjects finished the test in less than 20 minutes. After the written test, each subject was given another answer sheet, which was the same as the one used for the written test. The subjects were instructed to read aloud each of the words on the items. Their pronunciations of the words were recorded on tape. Both the written tests and the oral recordings were scored by the author.

The experimental procedure and the data-scoring criteria were the same as those employed in the earlier study because I wanted to compare the statistical results with those found in my previous study. The data were analyzed in terms of the correlational relationship between the written and oral test scores and in terms of the agreement rate of answers between the two.


The agreement rate of answers was calculated for each subject and for each item. The subjects' agreement rates were examined further to see whether they had any correlation with the subjects' accuracy scores. Based on the results, the validity of phoneme discrimination questions on a written test will be discussed along with some of the problems they pose.

Data Analysis

Each subject produced one pair of answers for each test item (i.e., one answer on the written test and another on the oral test). In total, each subject produced 30 pairs of answers: 15 pairs in Section A and another 15 in Section B. As in the previous study, the scoring of the oral test was based on the pronunciation of the whole word. This was done because, in general, teachers seem to pay most attention to the pronunciation of a word as a whole rather than to its parts when they evaluate how a student pronounces English words. The written test was only concerned with the underlined parts of words, as indicated in the instructions. In Section A, each subject was given a score when the subject correctly pronounced both the head word (i.e., the first word on the left in each item) and the correct option.¹ However, when either of the two words was pronounced with primary stress on the wrong syllable, a score was not given. On item 15, for instance, when the head word "dessert" was pronounced as [dezart], the answer was not judged to be correct. In Section B, each subject was given a correct score when the subject pronounced all four options correctly. When any one of the four options was pronounced with stress on the wrong syllable, the answer was counted as incorrect.

Results and Discussion

The reliability of the written and oral tests was estimated by using Kuder-Richardson Formula 20 (Brown, 1988). Table 1 shows the reliability coefficients for each section of the written and oral tests as well as for the test as a whole. Relatively high reliabilities of .79 and .86 were obtained for the whole written test and for the whole oral test, respectively.

Table 1. Reliability estimates of the written and oral tests

Test       Section    K-R20
Written    A          .64
           B          .69
           A+B        .79
Oral       A          .68
           B          .75
           A+B        .86
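
For readers who want to compute the same reliability estimate on their own data, KR-20 is defined as (k / (k - 1)) * (1 - sum of p(i)q(i) / variance of the total scores), where k is the number of items, p(i) is the proportion of subjects answering item i correctly, and q(i) = 1 - p(i). The Python sketch below is a minimal illustration of that formula, not the computation used for Table 1; the response matrix is hypothetical, and dividing by N (rather than N - 1) for the total-score variance is only one of the conventions in use.

def kr20(item_responses):
    """Kuder-Richardson Formula 20 for dichotomously scored (0/1) items.
    Rows are subjects, columns are items."""
    n_subjects = len(item_responses)
    k = len(item_responses[0])
    totals = [sum(row) for row in item_responses]
    mean_total = sum(totals) / n_subjects
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_subjects
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_responses) / n_subjects  # item facility
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)

# Hypothetical data: four subjects answering the 15 items of one section.
responses = [
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0],
]
print(round(kr20(responses), 2))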

Correlational Analyses

Tables 2, 3, and 4 indicate both the individual and combined results of the correlational analyses of Sections A and B. In each case, a fairly strong correlation was observed between written test scores and oral test scores. As shown in the tables, the correlation coefficient for Section A was .73, that for Section B was .74, and that for both Sections A and B combined was .81. These figures are considerably higher than those observed in the previous study, where the correlation coefficient for the section corresponding to Section A was .61, that for the section corresponding to Section B was only .22, and the overall coefficient was .51. Thus, in the previous study, I doubted the validity of the phoneme discrimination items in the section where no significant correlation was observed. In the present study, however, the stronger correlations that were observed generally support the validity of the phoneme discrimination questions on the written test. But the analyses of agreement rates below show that there may still be some problems with the validity of the written test.
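
The coefficients reported in Tables 2 through 4 are ordinary Pearson product-moment correlations between written and oral scores. The short Python sketch below shows the computation; it is a generic illustration rather than the study's own analysis, and the six pairs of scores are hypothetical.

from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical percentage scores for six subjects.
written = [47, 60, 33, 73, 40, 53]
oral = [40, 67, 27, 67, 33, 60]
print(round(pearson_r(written, oral), 2))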

Agreement Rate Analyses

Agreement rates of answers on the written and oral tests were analyzed from two different points of view: for each subject and for each item.

Subjects' agreement rates. In calculating a subject's agreement rate, the same procedure was employed as in the previous study. Table 5 indicates the agreement rates of each subject's answers between the written and oral tests.


Table 2. Pearson Product-Moment Correlation of scores between the written and oral tests on Section A

               N     Mean(%)   S      r
Written test   60    47.8      19.0   .73*
Oral test      60    46.6      19.5

*p < .01

Table 3. Pearson Product-Moment Correlation of scores between the written and oral tests on Section B

               N     Mean(%)   S      r
Written test   60    68.2      18.4   .74*
Oral test      60    46.3      20.6

*p < .01

Table 4. Pearson Product-Moment Correlation of scores between the written and oral tests on Sections A and B

               N     Mean(%)   S      r
Written test   60    48.0      16.9   .81*
Oral test      60    47.4      19.0

*p < .01

The average agreement rates for Section A, Section B, and both Sections A and B combined were 65.3, 68.3, and 66.8 percent, respectively. The overall agreement rate of 66.8 percent was almost the same as the 67.7 percent found in the previous study.

An analysis was done to detect any correlation between the subjects' agreement rates and their accuracy scores. As shown in Table 6, a weak but statistically significant correlation of .37 was observed between agreement rates and written test scores; a moderate and statistically significant correlation of .56 was found between agreement rates and oral test scores.

These correlations can be seen more clearly in Figure 2. It shows the accuracy scores on the written and oral tests attained by those subjects who belonged to four different agreement rate categories. The abscissa shows the agreement rate categories and the ordinate represents accuracy scores. The figures in parentheses represent the number of subjects who fall into each of the four agreement rate categories. The first pair of bars in Figure 2 indicates that, for those subjects whose agreement rate averaged 80 percent or more, the average accuracy score was 81.9 percent on the written test and 70.5 percent on the oral test. There is a clear tendency for high agreement rate achievers to have high accuracy scores and for low agreement rate achievers to have low accuracy scores on the tests. However, this tendency is less clear when comparing those subjects in the lowest rate category with those in the second lowest category.

This relationship between agreement rates and accuracy scores might be accounted for as follows: high accuracy score achievers may have known many of the English words used on the tests, which enabled them to choose the correct option on the written test and to pronounce the words correctly on the oral test. Low accuracy score achievers, on the other hand, may have had less knowledge of the test words. Thus, on the written test, low accuracy score achievers may have simply chosen answers through random guessing and, consequently, may have selected the correct answer on some items. On the oral test, however, these subjects may often have been unable to pronounce the words correctly, and they were less able to provide correct answers by chance alone. In this way, high agreement rates would not be found between the answers given on both tests by the subjects with low accuracy scores. In other words, the phoneme discrimination items used on the written test may not accurately assess low accuracy score achievers' actual pronunciation performance.

Agreement rates of items. Table 7 shows the agreement rates of items, or the extent to which the subjects' answers on each item were the same between the written and oral tests. On item 1, for example, the answers given by 46 subjects, or 76.7 percent of the total, were identical between the written and oral tests.


Table 5. Agreement rates of each subject's answers between the written and oral tests

              SECTION                              SECTION
Subject    A      B      A+B      Subject      A      B      A+B
S1         66.7   80.0   73.3     S31          66.7   73.3   70.0
S2         66.7   73.3   70.0     S32          46.7   60.0   53.3
S3         53.3   73.3   63.3     S33          40.0   66.7   53.3
S4         93.3   80.0   86.7     S34          73.3   66.7   70.0
S5         66.7   60.0   63.3     S35          60.0   53.3   56.7
S6         60.0   80.0   70.0     S36          53.3   73.3   63.3
S7         73.3   73.3   73.3     S37          60.0   53.3   56.7
S8         73.3   66.7   70.0     S38          46.7   73.3   60.0
S9         80.0   80.0   80.0     S39          60.0   66.7   63.3
S10        40.0   60.0   50.0     S40          46.7   46.7   46.7
S11        66.7   60.0   63.3     S41          66.7   86.7   76.7
S12        66.7   73.3   70.0     S42          66.7   80.0   73.3
S13        73.3   73.3   73.3     S43          73.3   60.0   66.7
S14        73.3   73.3   73.3     S44          66.7   73.3   70.0
S15        73.3   86.7   80.0     S45          66.7   73.3   70.0
S16        73.3   73.3   73.3     S46          53.3   80.0   66.7
S17        73.3   80.0   76.7     S47          66.7   53.3   60.0
S18        66.7   80.0   73.3     S48          73.3   60.0   66.7
S19        60.0   73.3   66.7     S49          73.3   60.0   66.7
S20        40.0   66.7   53.3     S50          93.3   73.3   83.3
S21        53.3   60.0   56.7     S51          60.0   73.3   66.7
S22        73.3   93.3   83.3     S52          73.3   46.7   60.0
S23        60.0   73.3   66.7     S53          73.3   66.7   70.0
S24        80.0   80.0   80.0     S54          60.0   53.3   56.7
S25        73.3   66.7   70.0     S55          46.7   73.3   60.0
S26        73.3   46.7   60.0     S56          66.7   60.0   63.3
S27        53.3   53.3   53.3     S57          80.0   73.3   76.7
S28        80.0   73.3   76.7     S58          66.7   93.3   80.0
S29        66.7   53.3   60.0     S59          66.7   73.3   70.0
S30        53.3   53.3   53.3     S60          66.7   33.3   50.0

AVERAGE    65.3   68.3   66.8

On three items, items 15, 19, and 20, the agreement rates failed to reach 50 percent.

A number of factors may have affected the low agreement rates on each of these items. On item 15, there were many subjects who chose the correct option on the written test but were unable to correctly pronounce either the head word "dessert" or the correct option "possess" on the oral test: the head word was incorrectly pronounced as [dozort], with primary stress on the wrong syllable, and the correct option was pronounced as [pzes], [pbuziz], or [pazia]. As revealed in their oral recordings of these words, these subjects pronounced the phonemes as /z/, and yet, on the written test, because they judged that both the phoneme in question (i.e., the underlined part of the head word) and the one in the correct option had the same pronunciation, they were able to select the appropriate answer. In fact, there were 25 such subjects, or 41.7 percent of the total, which could have led to the low agreement rate on this item.


Table 6. Pearson Product-Moment Correlation between agreement rates and accuracy scores

                                         N     Mean(%)   S      r
Agreement rate                           60    66.8      9.2    .81*
Accuracy score on the written test       60    58.0      16.9   .37*
Accuracy score on the oral test          60    47.4      19.0   .56*

*p < .01

On item 19, the low agreement rate was also caused by some subjects' inconsistent performance between the written and oral tests. As many as 25 subjects were not able to correctly pronounce the "bathe" option on the oral test, although they all chose this correct option on the written test. A close analysis of their oral recordings of the words on the item shows that they were able to pronounce all the options correctly except "bathe." It seems that the distractors in this item were too easy for these subjects, so they could eliminate the options as possible answers even if they did not know how to pronounce the correct answer. In short, the inclusion of easy words as distractors may have caused the low agreement rate on item 19.

Figure 2. Correlational relationship between agreement rate and accuracy. [Bar graph comparing average written and oral accuracy scores (%) for the four agreement rate categories 80+, 70+, 60+, and 50+, with 7, 22, 9, and 12 subjects, respectively.]

On item 20, as many as 22 subjects, about one-third of the total, were able to pronounce all four options correctly, but somehow they chose a wrong option as the answer on the written test. At the moment, no reasonable explanation can be given for these results.

Since the above three items did not seem to accurately assess the subjects' oral performance, their validity is in doubt.

Conclusion

The present study, a follow-up to my earlier study, addressed the issue of the validity of pronunciation questions on a written test for assessing the pronunciation of English words.

Table 7. Agreement rates of answers on items between the written and oral tests

Item      1     2     3     4     5     6     7     8     9     10
N*        46    38    38    38    34    41    42    35    43    38
AR(%)**   76.7  63.3  63.3  63.3  56.7  68.3  70.0  58.3  71.7  63.3

Item      11    12    13    14    15    16    17    18    19    20
N         46    35    44    44    24    37    32    41    29    27
AR(%)     76.7  58.3  58.3  73.3  40.0  61.7  53.3  68.3  48.3  45.0

Item      21    22    23    24    25    26    27    28    29    30
N         32    43    37    48    49    52    54    41    48    45
AR(%)     53.3  71.7  61.7  80.0  81.7  86.7  90.0  68.3  80.0  75.0

*N = Number of subjects whose answers were identical between the written and oral tests
**AR = Agreement rate


In the previous study, phoneme discrimination questions were taken from various college entrance examinations, and no significant correlation was observed between scores on the written and oral sections of the test. As a result, for this study, I chose to use phoneme discrimination questions taken from Center examinations administered by the Ministry of Education.

Unlike the earlier study, the results of the present study showed a relatively strong correlation between scores on the written and oral tests in each of the two sections (Section A and Section B). From a correlational point of view, the statistically significant correlations supported the validity of the phoneme discrimination items on the written test.

From an agreement-rate point of view, however, their validity was not so strongly supported. Though the subjects' average agreement rate between answers on the written and oral tests reached nearly 70 percent for each of the two sections, an analysis of the agreement rates showed that there was a tendency for high agreement rate achievers to get high accuracy scores and for low agreement rate achievers to get low accuracy scores. As for subjects with low accuracy scores, or presumably low-level students, the phoneme discrimination items on the written test did not appear to assess their actual performance on the oral test very accurately, and some doubt should be cast on the validity of such questions.

An analysis of the agreement rates on the items showed that there were three items whose agreement rates were below 50 percent. An examination of the subjects' answers on these items revealed some problems in the written test. One problem was that even if subjects did not know how to correctly pronounce either the head word or the correct option on one item in Section A, there was some chance for them to choose the correct answer on the written test. Another problem was that the distractors on one item in Section B were so easy that subjects could eliminate them as possible answers. Still another problem was the fact that good performance by some subjects on the oral test did not necessarily mean they would perform well on the written test. These problems indicate that some phoneme discrimination items on the written test were not valid means of assessing the subjects' actual ability to pronounce English words.

In making phoneme discrimination questions on a written test with a multiple-choice format, teachers should bear in mind the following points:

1. The distractors on an item should be carefully designed lest they be eliminated too easily.

2. Teachers may not be able to assess the pronunciation ability of low-level students very accurately with a written test.

3. Good performance on a written test does not always guarantee good performance on an oral test.

Suggestions for Future Research

The results of the present study were based on data collected from a written test of 30 phoneme discrimination questions and their oral versions. While the reliability estimate was, as a whole, relatively high for both the written and oral tests, it was not so high for each section of the test, particularly for Section A of the written test. This was probably because each section contained only 15 items. A test with a greater number of items is necessary to improve the reliability of a written phoneme discrimination test. Results may change for the same items when different samples of subjects are used, or when different test formats or scoring criteria are used. Results may also change depending upon learners' English proficiency levels. Future research should examine these points to further investigate the validity of pronunciation questions on a written test.

Note

1. Of course, there may be other ways of scoring the oral data, but I wanted to employ the same scoring procedure as the one used in the previous study.


References

Brown, J. D. (1988). Understanding research in second language learning: A teacher's guide to statistics and research design. New York: Cambridge University Press.

Henning, G. (1987). A guide to language testing: Development, evaluation, and research. Rowley, MA: Newbury House.

Inoi, S. (1994). A study of the validity of pronunciation questions on a written test. JACET Bulletin, 25, 39-52.

Katayama, Y., Endo, H., Kakita, N., & Sasaki, A. (1985). Shin-eigoka kyoiku no kenkyu. Tokyo: Taishukan.

Kyogakusha-shuppan senta. (1994). Daigaku nyushi senta shiken mondai kenkyu. Tokyo: Kyogakusha.

Sasaki, C., & Tomohiko, S. (1991). An investigation into the validity of paper test problems on stress and phonemes. Annual Review of English Language Education in Japan, 3, 119-127.

Shirahata, T. (1991). Validity of paper test problems on stress: Taking examples from Mombusho's daigaku nyushi senta shiken. Bulletin of the Faculty of Education, Shizuoka University, Educational Research Series, 23, 161-172.

Takei, A. (1989a). Paper and pencil tests for pronunciation: Are they valid? Bulletin of the Kanto-Koshinetsu English Language Education Society, 2, 17-22.

Takei, A. (1989b). More on the validity of paper and pencil tests for pronunciation. The IRLT Bulletin, 3, 1-21.

Wakabayashi, S., & Negishi, M. (1993). Musekininna tesuto ga ochikobore wo tsukuru. Tokyo: Taishukan.

Appendix

A. Choose the correct option whose underlined part is pronounced the same as that of the first word on the left.

1. wilderness    a. hive       b. myth         c. theme        d. triumph
2. raw           a. coast      b. naughty      c. notice       d. route
3. chimney       a. action     b. chemistry    c. natural      d. scholar
4. passion       a. assure     b. blossom      c. confess      d. scissors
5. conquer       a. conquest   b. liquid       c. quiet        d. unique
6. mood          a. flood      b. floor        c. shoot        d. wool
7. dear          a. beard      b. heart        c. pearl        d. wear
8. country       a. although   b. doubt        c. southern     d. though
9. recent        a. ancient    b. decorate     c. financial    d. society
10. language     a. argument   b. distinguish  c. guess        d. regular
11. allow        a. bowl       b. coward       c. grow         d. knowledge
12. imagine      a. capital    b. false        c. major        d. sacred
13. brush        a. bury       b. bush         c. rude         d. thumb
14. rough        a. brought    b. cough        c. ghost        d. thorough
15. dessert      a. assume     b. message      c. permission   d. possess

B. Choose the correct option whose underlined part is pronounced differently from the three other examples.

16. a. control     b. hostess    c. improve    d. postcard
17. a. breath      b. creature   c. feather    d. treatment
18. a. horizon     b. isolate    c. polite     d. risen
19. a. bathe       b. both       c. thirsty    d. thousand
20. a. accident    b. account    c. accuse     d. accustom
21. a. behavior    b. canal      c. label      d. parade
22. a. heaven      b. pleasant   c. steady     d. stream
23. a. family      b. reply      c. ugly       d. weekly
24. a. arch        b. March      c. speech     d. stomach
25. a. loose       b. news       c. poison     d. resemble
26. a. coffee      b. joke       c. nose       d. smoke
27. a. delight     b. describe   c. mild       d. wisdom
28. a. bear        b. dear       c. fear       d. near
29. a. chemistry   b. chimney    c. chorus     d. mechanical
30. a. eight       b. height     c. neighbor   d. weight


14 Reasons Why You Should Join The Japan Association of Language Teachers

Compiled by Don Modesto

1. Leading authorities in language teaching regularly visit us: H. Douglas Brown, John Fanselow, Jack Richards, Kathleen Graves, Alan Maley... (If you don't know who they are, come to JALT to find out.)

2. Insights on the job market, introductions... JALT plugs you into a network of over 3,600 language teaching professionals across Japan.

3. Ten special interest groups and their newsletters (and more of each on the way): Bilingualism, College and University Educators, Computer Assisted Language Learning, Global Issues in Language Education, Japanese as a Second Language, Learner Development, Materials Writers, Teacher Education, Team Teaching, and Video.

4. JALT is a place to call your professional home. And with 38 chapters across Japan, JALT is not far from your other home.

5. Monthly chapter programs and regular regional conferences provide both valuable workshops and the chance to share ideas and hone your presentation skills.

6. Professional organizations look great on a resume. Volunteer for a chapter position, work on a conference, or edit for the publications. You gain management and organizational skills in the bargain.

7. JALT maintains links with other important language teaching organizations such as TESOL and IATEFL. We have also forged links with our counterparts in Korea and Taiwan.

8. Publishing your research? Submit it to the JALT Journal.

9. Looking for a regular source of language teaching tips? Look to our celebrated magazine The Language Teacher, the only language teaching publication in the world published on a monthly basis.

10. JALT produces Asia's largest language teaching conference, with dozens of publishers displaying the latest materials, hundreds of presentations by leading educators, and thousands of attenders.

11. JALT nurtures a strong contingent of domestic speakers: Dale Griffee, Marc Helgesen, Kenji Kitao and Kathleen Kitao, Don Maybin, Nishiyama Sen, and Wada Minoru.

12. Conduct a research project? Apply for one of JALT's research grants.

13. Free admission to monthly chapter meetings, reduced conference fees, subscriptions to The Language Teacher and JALT Journal, and this for just ¥7,000 per year for individual membership, ¥6,000 for joint (2 people), and ¥4,500 if you hustle a little and get up a group of four to join with you.

14. Easy access to more information, application procedures, and the contact number of the chapter nearest you. Contact JALT's Central Office today: 03-3802-7121 (phone) or 03-3802-7122 (fax).


ISBN: 4-9900370-0-6  ¥2500

Language Testing in Japan is the first collection of articles in the JALT Applied Materials series, as well as the first contemporary collection of articles in English devoted to the issues of testing, assessment, and evaluation in Japan. Written by classroom teachers in Japan, Language Testing in Japan offers chapters on classroom testing strategies, program-level testing strategies, standardized testing, oral proficiency testing, and innovative testing, while at the same time providing information about actual testing that is currently being done in many of the language programs and classrooms throughout Japan. All of the chapters have been written to make them as practical, useful, straightforward, and easy to understand as possible.

Standardized tests such as TOEFL, TOEIC, SPEAK, and STEP are discussed and their strengths and weaknesses analyzed. Other issues addressed are university entrance exams, cloze tests and how to make them, how to make and improve teacher-made classroom tests, how to test young learners, whether paper pronunciation tests are valid, and whether it is possible to test nonverbal abilities. Technical testing terms such as norm-referenced and criterion-referenced tests, item facility, skewness, normal curve, washback, and so on are explained in detail.

Edited by the internationally recognized language testing expert James Dean Brown and Japan testing expert Sayoko Okada Yamashita, and written by language teachers for language teachers, Language Testing in Japan assumes no prior knowledge of language testing yet makes a long-needed contribution to this much discussed area. It is certain to stimulate discussion for years to come about language testing as it is practiced today in Japan.

A Special Supplement to The Language Teacher




ERIC REPRODUCTION RELEASE

I. Document Identification: ISBN 4-9900370-0-6 (Collection of articles)

Title: Language Testing in Japan

Author: James Dean Brown & Sayoko Okada Yamashita (eds.)

Corporate Source: Japan Association for Language Teaching (JALT)

Publication Date: September, 1995

II. Reproduction Release: (check one)

In order to disseminate as widely as possible timely and significantmaterials of interest to the educational community, documentsannounced in Resources in Education (RIE) are usually made available tousers in microfiche, reproduced in paper copy, and electronic/opticalmedia, and sold through the ERIC Document Reproduction Service(EDRS) or other ERIC vendors. If permission is granted to reproduce theidentified document, please check one of the following options and signthe release form.

[ XX I Level 1 - Permitting microfiche, paper copy, electronic, and opticalmedia reproduction.

[ ] Level 2 Permitting reproduction in other than paper copy.

Sign Here: "I hereby grant to the Educational Resources InformationCenter (ERIC) nonexclusive permission to reproduce this document asindicated above. Reproduction from the ERIC microfiche orelectronic /optical media by persons other than ERIC employees and itssystem contractors requires permission from the copyright holder.Exception is made for non-profit reproduction by libraries and otherservice agencies to satisfy information needs of educators in response todiscrete inquiries."

Signature:Gene van Troy

Printed Name: Gene van Troyer

osition: JALT President

Page 195: FL 024 200 AUTHOR Ed.Brown, James Dean, Ed.; Yamashita, Sayoko Okada, Ed. Language Testing in Japan. Japan Association for Language Teaching, Tokyo. ISBN-4-9900370-1-6 95 193p. JALT

Organization: Japan Association for Language Teaching

Address: JALT Central OfficeUrban Edge Bldg. 5th FL1-37-9 Taito, Taito-kuTokyo 110, JAPAN

Date: July 22, 1996

Telephone No: 03-3837-1630; (fax) -1631

III. Document Availability Information (from Non-ERIC Source):

Complete if permission to reproduce is not granted to ERIC, or if youwant ERIC to cite availability of this document from another source.

Publisher / Distributor: JALT

Address: (See above)

Price per copy: ¥2500    Quantity price: Standard bookseller discount