RR-00-14

AN EXPLORATORY DIMENSIONALITY ASSESSMENT OF THE TOEIC TEST

Kenneth M. Wilson

September 2000

RESEARCH REPORT

Princeton, New Jersey 08541


An Exploratory Dimensionality Assessment of the TOEIC Test

Kenneth M. Wilson

September 2000

Research Reports provide preliminary and limited dissemination of ETS research prior to publication. They are available without charge from the

Research Publications Office
Mail Stop 07-R
Educational Testing Service
Princeton, NJ 08541

ABSTRACT

To measure English language listening comprehension and reading comprehension skills in samples of nonnative-English speakers, the TOEIC (Test of English for International Communication) test uses seven different types of test items (four for assessing listening comprehension and three for assessing reading comprehension). Results of an exploratory factor analysis involving data for native speakers of Japanese and Korean, respectively, and other correlational evidence reported herein, suggest unidimensionality for the Listening Comprehension (LC) but not for the Reading Comprehension (RC) section: an RC item type labelled "reading" appears to be tapping aspects of "reading comprehension" that are psychometrically distinguishable from those measured in common by RC item types labelled "error recognition" and "incomplete sentences". Further research is needed to assess the consistency of these findings in samples of native speakers of Japanese and Korean, respectively, their possible generalizability to other native-language populations of TOEIC test takers, and their practical implications.

Key words: TOEIC test, factor structure, reading comprehension

ACKNOWLEDGMENTS

Essential indirect support from the ETS Research Division, helpful reviews of the manuscript by Brent Bridgeman and Isaac Bejar, and editorial suggestions by Larry Stricker are acknowledged with appreciation.


INTRODUCTION

The TOEIC (Test of English for International Communication) Test is a multiple-choice, norm-referenced ESL proficiency test designed to measure English language listening comprehension and reading comprehension skills in samples of nonnative-English speakersa (here and hereafter, see correspondingly lettered endnotes). TOEIC Test affairs worldwide are administered under the aegis of The Chauncey Group International (see, for example, The Chauncey Group International, 1999). Listening comprehension items included in the TOEIC test are designed to measure understanding of spoken English in real-life situations; and reading comprehension items are designed to measure examinees' ability to comprehend types of materials that people in the business world use, including manuals, reports, forms, notices, advertisements, periodicals, memoranda and so on.

To measure these skills, the test uses seven different types of test items or questions: four of the item types are designed to measure aspects of developed ability to comprehend utterances in English (listening comprehension) and the remaining three types are designed to measure aspects of developed ability to read and comprehend material written in English (reading comprehension). Brief designations of the seven item types and the number of items by type are provided in Table 1; the item types are described briefly in Table 2, and illustrative items of each type are shown in the appendix.

For scoring purposes, differences in performance on item types within the respective sections are not taken into account. Summary number-right raw scores on the respective 100-item sections (listening comprehension and reading comprehension) are translated into an arbitrarily defined standard-score scale with scores ranging from 5 to 495; a total score is derived by adding the two scaled section scores, hence can vary between 10 and 990. About two hours of actual testing time are involved.
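The score arithmetic just described can be sketched in Python. The raw-to-scaled conversion itself is a form-specific table not reproduced in this report, so the hypothetical helper below only combines already-scaled section scores and enforces the published ranges:

```python
def toeic_total(lc_scaled, rc_scaled):
    """Combine two scaled TOEIC section scores into a total score.

    Each section score is on the 5-495 scale described above, so the
    total necessarily falls between 10 and 990.  The raw-to-scaled
    conversion is a separate, form-specific table not shown here.
    """
    for score in (lc_scaled, rc_scaled):
        if not 5 <= score <= 495:
            raise ValueError("scaled section scores range from 5 to 495")
    return lc_scaled + rc_scaled
```

For example, `toeic_total(5, 5)` and `toeic_total(495, 495)` give the extreme totals of 10 and 990.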

Internal consistency reliability coefficients for the two section scores in samples used to develop new versions of the TOEIC test tend to exceed .90, and total score reliability typically is higher than that for either section. Thus, TOEIC test scores provide a highly reliable basis for inferences about individual and group differences in performance.

Evidence Bearing on TOEIC Test Validity

The two TOEIC test sections have general face validity as measures requiring the exercise of English-language listening comprehension skills and reading comprehension skills, respectively. Findings of previous research reviewed briefly


Table 1. Item-Type Composition of the Test of English for International Communication (TOEIC)

Enumeration of TOEIC Item Types

Listening Comprehension            N Items
  Single Picture                   ( 20)
  Question-Response                ( 30)
  Short Conversations              ( 30)
  Short Talks                      ( 20)
  Total (LC)                       (100)

Reading Comprehension
  Incomplete Sentences             ( 40)
  Error Recognition                ( 20)
  Reading Comprehension#           ( 40)
  Total (RC)                       (100)

TOEIC Total                        (200)

#To avoid confusing references to the "Reading Comprehension" item type with references to the identically named section in which it is found, throughout the remainder of this report, the item type itself will be referred to as the "Reading Passage(s) item type".


Table 2. Brief Description of TOEIC Item Types

Listening Comprehension

Single Picture  Each item involves a picture in the test booklet, showing a familiar situation (e.g., man writing at a desk, girl sitting on a park bench). The examinee is asked to choose the letter in the test booklet that corresponds to the most accurate of four spoken statements describing the picture.

Question-Response  A question in English, spoken only one time, is followed by three spoken responses, also spoken only one time in English. Questions pertain to situations deemed to be generally familiar (e.g., What will the weather be like tomorrow?; How do I get to the airport from here?). Examinees are asked to choose the letter in the test booklet that corresponds to the most accurate of the spoken responses.

Short Conversations  Examinees hear a short conversation between two people, followed by a brief written question and four short, written answers. They are to choose the best answer to each question and mark it on the answer sheet. Conversations are on general topics (e.g., need for resurfacing a stretch of highway, how was your trip to Manila?).

Short Talks  A short talk is presented on a general topic (e.g., sale on women's coats and hats, news-style account of how Big Ben stopped after more than a century). Two or more written questions and response options (four) are provided in the test booklet.

Reading Comprehension

Incomplete Sentences  Examinees are asked to identify the word or phrase that best completes a sentence from which a word or phrase has been omitted; no spoken material is involved.

Error Recognition  Four words or phrases are underlined and lettered in a sentence. Examinees are instructed to identify the one underlined word or phrase that should be corrected or rewritten and mark the corresponding letter on the answer sheet.

Reading Comprehension  A brief reading passage is followed by one or more written questions, each with four written answer options, to be answered on the basis of what is stated or implied in the written passage. Examinees are instructed to choose the one option that best answers a question. Questions are based on a variety of reading passages, such as notices, letters, newspaper and magazine articles (e.g., announcement for prospective museum visitors, welcome card for hotel visitors, travel agency blurb, and so on).

Note. To avoid confusing references to the "Reading Comprehension" item type with references to the identically named section in which it is found, throughout the remainder of this report, the item type itself will be referred to as the "Reading Passage(s) item type".


below provide evidence regarding concurrent- and discriminant-validity properties of TOEIC test scores.

In the initial TOEIC test validation study (Woodford, 1982a, 1982b) involving data for a sample (N = 99) of Japanese examinees from introductory TOEIC test administrations in Japan in 1979, TOEIC test LC scores were found to be relatively highly correlated (r = .83) with a direct measure of ESL speaking proficiency, namely the Language Proficiency Interview (LPI) procedure (see, for example, Lowe and Stansfield, 1988).b LPIs were administered by TOEIC-trained, native-English-speaking ESL professionals resident in Japan. Woodford, above, also assessed concurrent relationships between scores on the TOEIC test and scores on the Test of English as a Foreign Language (e.g., ETS, 1991), which is widely used to assess the English language skills of foreign ESL students applying for admission to U.S. and Canadian colleges and universities. Observed correlations were relatively high. For example, coefficients for TOEIC Reading Comprehension versus TOEFL Reading Comprehension and Vocabulary, and TOEFL Structure and Written Expression, respectively, were r = .85 and r = .87, in the sample of native-Japanese speakers involved.

Subsequent research (e.g., Saegusa, 1989; Wilson, 1989; Wilson and Chavanich, 1989; Wilson, Komarakul and Woodhead, 1994) has confirmed and extended Woodford's finding of strong and theoretically consistent patterns of correlation between TOEIC test scores and LPI ratings. Strong concurrent relationships between the TOEIC test and the TOEFL have been reported by Hemingway (1999) for a linguistically heterogeneous sample of ESL learners/users studying in the United States, and by Wilson, Berquist and Bell (1998) for a sample of native-French-speaking students, thus confirming and extending Woodford's findings regarding these two measures. The TOEIC/TOEFL findings suggest that the two tests tend to be measuring generally similar aspects of ESL proficiency.c

The empirical studies reviewed above were designed to address issues relating to the use and validity of TOEIC test summary (scaled) scores only; questions regarding the external validity properties of TOEIC test components, such as individual items or item types (see Tables 1 and 2, above), were not at issue in any of the studies involved. No such studies could be located in a review of pertinent journals, and none appears to have been conducted under either CGI or ETS auspices. This is understandable because it has been tacitly assumed as a working proposition that the four different types of questions included in the Listening Comprehension section constitute primarily somewhat different methods of measuring the same general underlying proficiency dimension, and that this also holds for the three item types in the Reading Comprehension section; also that the two general proficiency dimensions involved, albeit closely related, are psychometrically distinguishable.


Questions Regarding TOEIC's Dimensionality

Evidence of discriminant validity for TOEIC LC (Listening Comprehension) and RC (Reading Comprehension) scores, alluded to above, suggests that each of the scaled section scores conveys some unique information about individual differences. Such an inference is supported as well by results of internal analyses of intercorrelations of item-type subscores in samples used in routine test analyses involving samples of Japanese Secure Program examinees (e.g., ETS, 1992; also J. Breyer, personal communication, July, 1993). These results suggest, inter alia, that the aspects of proficiency being tapped by the four item types that involve spoken stimulus material exclusively or primarily are psychometrically distinguishable from those being tapped by the three item types that do not involve spoken stimulus material.d For example, after correction for attenuation due to measurement error, correlations between the Listening Comprehension and Reading Comprehension scores typically do not equal or exceed the .90 level--consistent with the working assumption that the two TOEIC sections are tapping psychometrically distinguishable aspects of proficiency, and with the current organization of the TOEIC test.

However, results of internal test analyses such as ETS (1992) also indicate that whereas intercorrelations of subscores corresponding to each of the four different listening comprehension item types tend to equal or exceed .90 after correction for attenuation due to measurement error, this does not tend to be true for subscores based on the three reading comprehension item types. For the latter, corrected within-section coefficients typically meet or exceed .90 for Error Recognition versus Incomplete Sentences subscores, but neither of these subscores meets the .90 correlation criterion with the subscore for the Reading Passages item type. These internal findings tend to support the assumption of unidimensionality for the LC section but suggest the possibility of potentially useful differentiation of relatively closely related but psychometrically distinguishable aspects of proficiency involving joint performance on the Incomplete Sentences and Error Recognition items, on the one hand, and performance on the Reading Passages items, on the other.

THE PRESENT STUDY

The study reported herein was undertaken as an exploratory assessment of TOEIC's dimensionality, using data from the May 1992 TOEIC Secure Program (SP) administrationsa in Japan and Korea.


Study Data and Procedures

The study used data from TOEIC Secure Program files for a random sample (N = 4,000) from a total of 33,522 Japanese examinees tested in the May 1992 Secure Program administration in Japan, and a sample (N = 4,214) of Korean examinees, representing the entire May 1992 Secure Program cohort in that country (ETS, 1992).

Preliminary Analysis of Intact Item-Type Subscores

Table 3 shows, for the Japanese and Korean samples initially selected, (a) observed correlations between number-right scores for the seven TOEIC item types (above the diagonal), (b) coefficients corrected for attenuation due to measurement error (below the diagonal), (c) the estimated reliability coefficients that were used in correcting the observed coefficients (in the diagonal) and (d) corresponding descriptive statistics. From inspection of Table 3, outcomes for the Japanese and Korean samples appear to be generally quite similar.
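The below-diagonal entries in Table 3 use the standard correction for attenuation: the observed correlation divided by the square root of the product of the two reliability estimates. A minimal sketch (the function name is ours):

```python
import math

def disattenuate(r_xy, rel_x, rel_y):
    """Correct an observed correlation for attenuation due to
    measurement error: r_xy / sqrt(rel_x * rel_y)."""
    return r_xy / math.sqrt(rel_x * rel_y)
```

For the Japanese sample, the observed Incomplete Sentences/Error Recognition correlation of .75, with estimated reliabilities of .87 and .75, gives `disattenuate(0.75, 0.87, 0.75)` of about .93, matching the corrected value shown below the diagonal in Table 3.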

As noted earlier, corrected coefficients at or above .90 for two or more item-type subscores can be thought of as being consistent with the hypothesis that the different types of items involved are simply different methods of measuring the same general underlying ability or construct. From findings reported in Table 3 it can be seen that scores on the four item types that make up the Listening Comprehension (LC) section correlate very highly with each other (corrected coefficients above .90), indicating strong within-section consistency; corresponding across-section corrected coefficients for LC item types with the three Reading Comprehension (RC) section item types average in the mid-.70s, suggesting their psychometric distinctiveness. However, the three RC item types do not demonstrate such internal consistency: for Incomplete Sentences and Error Recognition subscores, the corrected coefficient surpasses .90, but these two subscores correlate no more highly with the Reading Passages subscore than with the several LC-section subscores. As for the Reading Passages subscore, its correlations with all other subscores, regardless of type, clearly are well below the .90 level.

The general correlational findings just reviewed suggest that the acquired skills measured by items involving spoken stimuli (in the Listening Comprehension section) and those measured by the items that do not involve spoken stimuli (all from the Reading Comprehension section) may tend to tap distinguishable albeit closely related aspects of developed proficiency in English; and that this may be true as well for aspects of proficiency being tapped by the Reading Passages


Table 3. Observed Correlations (above Diagonal) and Correlations Corrected for Attenuation Due to Measurement Error (below Diagonal), Estimated Reliability Coefficients Used in Correcting Observed Coefficients, and Descriptive Statistics for TOEIC Item-Type Part Scores in Japanese and Korean TOEIC Secure Program Samples (May 1992)

Japanese examinees (N = 4,000)

Item-type subscore     Single  Short  Quest-   Short  Incomp. Error  Reading
                       Picture Conv.  Response Talks  Sent.   Recog. Passages

Single Picture         (.72)   .72    .72      .68    .58     .53    .64
Short Conversations    .93     (.84)  .80      .76    .65     .60    .71
Question-Response      .95     .97    (.81)    .74    .63     .58    .67
Short Talks            .93     .96    .98      (.75)  .62     .58    .69
Incomplete Sentences   .74     .76    .75      .77    (.87)   .75    .71
Error Recognition      .72     .71    .64      .78    .93     (.75)  .63
Reading Passages       .79     .81    .78      .84    .80     .76    (.91)

Korean examinees (N = 4,014)

Single Picture         (.74)   .73    .71      .62    .58     .56    .59
Short Conversations    .99     (.81)  .76      .70    .59     .58    .65
Question-Response      .93     .95    (.79)    .64    .59     .57    .60
Short Talks            .90     .96    .90      (.65)  .52     .51    .58
Incomplete Sentences   .73     .70    .72      .69    (.87)   .77    .71
Error Recognition      .74     .73    .73      .72    .94     (.78)  .64
Reading Passages       .73     .77    .72      .79    .84     .77    (.88)

No. items              20      30     30       20     40      20     40
Mean (Jpn)             15.1    17.0   18.0     10.9   27.0    12.1   28.6
Mean (Kor)             14.0    15.0   16.5     10.0   29.0    12.8   29.2
SD (Jpn)               3.1     5.5    5.6      3.8    6.9     3.9    7.9
SD (Kor)               3.2     5.3    5.3      3.4    6.7     4.1    6.9

Note. Entries in parentheses are estimated reliability coefficients used in correcting coefficients for attenuation. Reliabilities were estimated using the Rulon formula (e.g., Guilford, 1950, p. 497). In evaluating these coefficients, it should be kept in mind that internal consistency reliability estimates generally tend to overestimate reliability, especially when several items are referenced to the same stimulus--e.g., several questions associated with a single reading passage.
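The Rulon formula cited in the table note estimates split-half reliability as one minus the ratio of the variance of half-score differences to the variance of total scores. A minimal sketch, with made-up illustrative data (the report's item-level data are not reproduced here):

```python
from statistics import pvariance

def rulon_reliability(half1, half2):
    """Rulon split-half reliability: 1 - Var(difference) / Var(total),
    where half1 and half2 are examinees' scores on two half-tests."""
    diffs = [a - b for a, b in zip(half1, half2)]
    totals = [a + b for a, b in zip(half1, half2)]
    return 1 - pvariance(diffs) / pvariance(totals)
```

When the two halves rank examinees consistently, the difference variance is small relative to the total variance and the estimate approaches 1.0.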


items as opposed to the Incomplete Sentences and Error Recognition items.

Exploratory Factor Analysis

To provide more detailed evidence bearing on dimensionality, an exploratory factor analysis was conducted involving intercorrelations of scores on TOEIC item-type parcels (subsets of several items of the same type). In the present instance, the 200 TOEIC test items (100 from the Listening Comprehension section and 100 from the Reading Comprehension section) were divided into a total of 40 parcels, each including five items, as enumerated in Table 4. The decision to include five items in each parcel was arbitrary, based on the general assumption that division of the test into parcels of five items would provide enough parcels of each type to permit adequate identification of factors. The parcels themselves were defined primarily as "spaced" samples of the respective types (i.e., parcels included every fifth item, by type). A strict "equal spacing" design was modified somewhat for the last five items in each item-type grouping within the test; these were assigned to parcels in such a way that the last item in each item-type set was included in the same parcel as the first item in that set, the next-to-last item was included in the same parcel as the second item, and so on (see Table 4). This was done in order (a) to attain rough balance across parcels with respect to difficulty (assuming a tendency to place easier items earlier than more difficult items within item-type sets and test sections, respectively) and (b) to reduce the likelihood of interpretive complications associated with possible end-of-set, -section or -test effects ("practice" effects, speededness, fatigue, and so on).
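The assignment pattern shown in Table 4 can be reproduced programmatically. In the sketch below (our own helper, working with within-set item positions as inferred from Table 4), each parcel takes equally spaced items plus one item drawn from the end of the set in reverse order:

```python
def make_parcels(n_items, parcel_size=5):
    """Split an item-type set of n_items into n_items/parcel_size parcels.

    Parcel j takes items j, j+k, j+2k, ... (k = number of parcels), and
    its final item comes from the end of the set in reverse order, so
    parcel 1 gets the last item, parcel 2 the next-to-last, and so on.
    """
    k = n_items // parcel_size
    parcels = []
    for j in range(1, k + 1):
        items = [j + m * k for m in range(parcel_size - 1)]
        items.append(n_items + 1 - j)
        parcels.append(items)
    return parcels
```

For example, `make_parcels(20)` reproduces the Single Picture parcels of Table 4 in within-set positions: [1, 5, 9, 13, 20], [2, 6, 10, 14, 19], [3, 7, 11, 15, 18], [4, 8, 12, 16, 17].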

The factor analytic approach was exploratory in nature. First, principal components analyses were made of matrices of intercorrelations of scores on the 40 item-type parcels for the respective language groups (Japanese and Korean) and for the combined sample (N = 2,080). In each analysis, the first three principal components had eigenvalues of 1.0 or greater, a recognized criterion (e.g., Kaiser, 1960; Harman, 1976) for initial decisions regarding the number of common factors in an analysis (see Table 5, which shows the four largest eigenvalues only). Thus, the findings of the principal components analysis--like those of the analysis of intact item-type subscores described above--suggested exploration of a three-factor model in each of the samples. Accordingly, the three components were rotated to a three-factor, orthogonal (varimax) solution in each of the national samples and in the combined sample. For perspective, the corresponding two-factor solutions were also obtained. Table 6 shows factor loadings of parcels for the two-factor and three-factor solutions, respectively.
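The percent-of-variance figures in Table 5 follow directly from the eigenvalues, since each of the 40 parcels contributes one unit of variance to the correlation matrix; the Kaiser criterion then retains components with eigenvalues of 1.0 or greater. Using the combined-sample eigenvalues:

```python
# Combined-sample eigenvalues of the 40-parcel correlation matrix (Table 5)
eigenvalues = [15.36, 2.20, 1.41, 0.88]
n_parcels = 40

# Percent of variance: eigenvalue divided by the number of parcels
pct_var = [round(100 * e / n_parcels, 1) for e in eigenvalues]

# Kaiser criterion: retain components with eigenvalues of 1.0 or greater
n_retained = sum(1 for e in eigenvalues if e >= 1.0)
```

Here `pct_var` reproduces the combined-sample percentages in Table 5 (38.4, 5.5, 3.5, 2.2), and `n_retained` is 3, the basis for rotating three components.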


Table 4. Definition of TOEIC Item-Type Parcels

Item type             Parcel label and items making up parcel

Single Picture        Pic01 = L1 L5 L9 L13 L20
                      Pic02 = L2 L6 L10 L14 L19
                      Pic03 = L3 L7 L11 L15 L18
                      Pic04 = L4 L8 L12 L16 L17

Question-Response     Ques1 = L21 L27 L33 L39 L50
                      Ques2 = L22 L28 L34 L40 L49
                      Ques3 = L23 L29 L35 L41 L48
                      Ques4 = L24 L30 L36 L42 L47
                      Ques5 = L25 L31 L37 L43 L46
                      Ques6 = L26 L32 L38 L44 L45

Short Conversations   Conv1 = L51 L57 L63 L69 L80
                      Conv2 = L52 L58 L64 L70 L79
                      Conv3 = L53 L59 L65 L71 L78
                      Conv4 = L54 L60 L66 L72 L77
                      Conv5 = L55 L61 L67 L73 L76
                      Conv6 = L56 L62 L68 L74 L75

Short Talks           Talk1 = L81 L85 L89 L93 L100
                      Talk2 = L82 L86 L90 L94 L99
                      Talk3 = L83 L87 L91 L95 L98
                      Talk4 = L84 L88 L92 L96 L97

Sentence Completion   Snc01 = R1 R9 R17 R25 R40
                      Snc02 = R2 R10 R18 R26 R39
                      Snc03 = R3 R11 R19 R27 R38
                      Snc04 = R4 R12 R20 R28 R37
                      Snc05 = R5 R13 R21 R29 R36
                      Snc06 = R6 R14 R22 R30 R35
                      Snc07 = R7 R15 R23 R31 R34
                      Snc08 = R8 R16 R24 R32 R33

Error Recognition     Err01 = R41 R45 R49 R53 R60
                      Err02 = R42 R46 R50 R54 R59
                      Err03 = R43 R47 R51 R55 R58
                      Err04 = R44 R48 R52 R56 R57

Reading Passages      Rd001 = R61 R69 R77 R85 R100
                      Rd002 = R62 R70 R78 R86 R99
                      Rd003 = R63 R71 R79 R87 R98
                      Rd004 = R64 R72 R80 R88 R97
                      Rd005 = R65 R73 R81 R89 R96
                      Rd006 = R66 R74 R82 R90 R95
                      Rd007 = R67 R75 R83 R91 R94
                      Rd008 = R68 R76 R84 R92 R93

Note. "L" = listening item, "R" = reading item.


Table 5. Eigenvalues and Percentages of Variance Associated with the First Four Principal Components in Designated Samples

            Japanese examinees    Korean examinees    Combined sample

Component   Eigenvalue  Pct var   Eigenvalue  Pct var   Eigenvalue  Pct var

PC 1        15.85  (39.6)         14.97  (37.4)         15.36  (38.4)
PC 2         2.13  ( 5.3)          2.29  ( 5.7)          2.20  ( 5.5)
PC 3         1.44  ( 3.6)          1.40  ( 3.5)          1.41  ( 3.5)
PC 4         0.95  ( 2.4)          0.89  ( 2.2)          0.88  ( 2.2)

(N)         1,007                 1,073                 2,080


Table 6. Factor Loadings of Parcels in Two- and Three-Factor Orthogonal (Varimax) Rotations for Japanese, Korean and Combined Samples

Two factors

Japanese
Parcel   F1    F2

Pic01   .62   .16
Pic02   .60   .35
Pic03   .56   .30
Pic04   .52   .17
Ques1   .57   .30
Ques2   .55   .22
Ques3   .64   .26
Ques4   .67   .21
Ques5   .71   .22
Ques6   .61   .27
Conv1   .65   .29
Conv2   .60   .26
Conv3   .64   .24
Conv4   .71   .31
Conv5   .63   .29
Conv6   .57   .32
Talk1   .62   .29
Talk2   .53   .31
Talk3   .58   .33
Talk4   .50   .32

Snc01   .17   .59
Snc02   .34   .69
Snc03   .27   .61
Snc04   .16   .66
Snc05   .26   .62
Snc06   .23   .60
Snc07   .25   .62
Snc08   .17   .59
Err01   .21   .62
Err02   .32   .49
Err03   .25   .59
Err04   .25   .55

Rd001   .42   .49
Rd002   .35   .62
Rd003   .36   .59
Rd004   .33   .63
Rd005   .47   .58
Rd006   .50   .57
Rd007   .44   .52
Rd008   .45   .54

Korean
Parcel   F1    F2

Pic01   .26   .57
Pic02   .34   .60
Pic03   .26   .59
Pic04   .30   .49
Ques1   .32   .55
Ques2   .17   .53
Ques3   .24   .60
Ques4   .24   .56
Ques5   .23   .66
Ques6   .28   .59
Conv1   .28   .60
Conv2   .23   .64
Conv3   .18   .61
Conv4   .29   .69
Conv5   .31   .60
Conv6   .30   .65
Talk1   .24   .55
Talk2   .22   .52
Talk3   .29   .52
Talk4   .15   .54

Snc01   .57   .25
Snc02   .68   .25
Snc03   .66   .22
Snc04   .68   .23
Snc05   .61   .20
Snc06   .67   .23
Snc07   .68   .26
Snc08   .59   .17
Err01   .64   .27
Err02   .56   .25
Err03   .63   .29
Err04   .62   .28

Rd001   .55   .31
Rd002   .64   .31
Rd003   .56   .37
Rd004   .57   .27
Rd005   .61   .38
Rd006   .51   .39
Rd007   .57   .25
Rd008   .55   .41

Combined sample
Parcel   F1    F2

Pic01   .59   .22
Pic02   .62   .32
Pic03   .63   .22
Pic04   .51   .23
Ques1   .57   .30
Ques2   .54   .19
Ques3   .62   .25
Ques4   .60   .23
Ques5   .71   .19
Ques6   .62   .25
Conv1   .64   .27
Conv2   .60   .26
Conv3   .64   .18
Conv4   .72   .27
Conv5   .62   .29
Conv6   .65   .25
Talk1   .62   .22
Talk2   .52   .27
Talk3   .57   .29
Talk4   .49   .26

Snc01   .26   .53
Snc02   .26   .71
Snc03   .24   .64
Snc04   .17   .68
Snc05   .21   .63
Snc06   .24   .63
Snc07   .22   .67
Snc08   .11   .62
Err01   .27   .61
Err02   .27   .53
Err03   .22   .64
Err04   .25   .60

Rd001   .34   .54
Rd002   .31   .65
Rd003   .36   .58
Rd004   .28   .62
Rd005   .42   .60
Rd006   .46   .53
Rd007   .33   .56
Rd008   .43   .54


Table 6 concluded

Three factors

Japanese
Parcel   F1    F2    F3

Pic01   .62   .14   .15
Pic02   .57   .26   .30
Pic03   .55   .26   .21
Pic04   .50   .11   .19
Ques1   .57   .29   .17
Ques2   .57   .26   .04
Ques3   .65   .27   .13
Ques4   .66   .18   .17
Ques5   .69   .16   .24
Ques6   .59   .22   .23
Conv1   .64   .24   .23
Conv2   .58   .21   .21
Conv3   .63   .23   .15
Conv4   .68   .24   .28
Conv5   .59   .21   .29
Conv6   .52   .20   .34
Talk1   .60   .23   .23
Talk2   .49   .23   .27
Talk3   .54   .24   .31
Talk4   .49   .29   .19

Snc01   .17   .58   .20
Snc02   .32   .65   .31
Snc03   .26   .57   .25
Snc04   .17   .68   .17
Snc05   .27   .64   .17
Snc06   .21   .55   .28
Snc07   .25   .61   .21
Snc08   .17   .58   .19
Err01   .22   .64   .17
Err02   .33   .50   .15
Err03   .24   .56   .23
Err04   .26   .57   .15

Rd001   .31   .24   .61
Rd002   .25   .36   .65
Rd003   .24   .29   .69
Rd004   .22   .36   .65
Rd005   .37   .34   .60
Rd006   .40   .31   .64
Rd007   .32   .22   .69
Rd008   .33   .25   .68

Korean
Parcel   F1    F2    F3

Pic01   .57   .27   .09
Pic02   .59   .31   .19
Pic03   .58   .20   .22
Pic04   .48   .25   .19
Ques1   .53   .25   .23
Ques2   .54   .21   .02
Ques3   .60   .23   .12
Ques4   .55   .19   .18
Ques5   .66   .21   .14
Ques6   .58   .22   .22
Conv1   .59   .21   .22
Conv2   .61   .11   .30
Conv3   .61   .20   .06
Conv4   .68   .25   .19
Conv5   .58   .21   .29
Conv6   .63   .21   .27
Talk1   .53   .16   .24
Talk2   .51   .17   .19
Talk3   .49   .18   .29
Talk4   .53   .10   .15

Snc01   .25   .57   .18
Snc02   .25   .66   .26
Snc03   .22   .66   .22
Snc04   .22   .65   .26
Snc05   .20   .59   .23
Snc06   .22   .63   .27
Snc07   .25   .65   .26
Snc08   .15   .52   .29
Err01   .27   .67   .17
Err02   .25   .56   .18
Err03   .30   .65   .18
Err04   .29   .65   .16

Rd001   .24   .26   .64
Rd002   .24   .35   .66
Rd003   .30   .28   .62
Rd004   .20   .30   .60
Rd005   .31   .34   .62
Rd006   .32   .24   .60
Rd007   .18   .27   .65
Rd008   .34   .30   .58

Combined sample
Parcel   F1    F2    F3

Pic01   .59   .21   .11
Pic02   .61   .27   .22
Pic03   .62   .18   .17
Pic04   .50   .19   .16
Ques1   .56   .26   .19
Ques2   .55   .22   .04
Ques3   .62   .24   .12
Ques4   .59   .19   .18
Ques5   .70   .15   .17
Ques6   .61   .20   .20
Conv1   .62   .21   .22
Conv2   .58   .18   .25
Conv3   .64   .18   .10
Conv4   .71   .22   .21
Conv5   .60   .20   .28
Conv6   .62   .17   .25
Talk1   .60   .16   .21
Talk2   .50   .19   .23
Talk3   .54   .19   .28
Talk4   .48   .21   .19

Snc01   .27   .53   .15
Snc02   .25   .67   .29
Snc03   .24   .62   .22
Snc04   .18   .68   .21
Snc05   .22   .63   .20
Snc06   .23   .59   .25
Snc07   .22   .66   .23
Snc08   .10   .58   .25
Err01   .28   .64   .14
Err02   .28   .54   .17
Err03   .22   .63   .21
Err04   .26   .63   .14

Rd001   .26   .26   .62
Rd002   .23   .37   .65
Rd003   .28   .29   .65
Rd004   .20   .35   .63
Rd005   .35   .35   .61
Rd006   .39   .26   .62
Rd007   .24   .25   .68
Rd008   .36   .27   .62


It can be discerned in Table 6 that solutions involving three factors emerged relatively consistently and clearly across samples. For example, three comparably marked factors emerged in the same order in each analysis: the first factor was defined by uniformly high loadings for parcels of item types from the Listening Comprehension section, the second by similarly high loadings for Incomplete Sentences and Error Recognition parcels, and the third by loadings for Reading Passages parcels. Moreover, in each solution, loadings for parcels defining a factor were uniformly high for that factor and relatively uniformly low for the other factors--that is, the factors were relatively cleanly defined. Further, findings not shown in the table indicated that extraction of a third factor resulted in a reduction in the percentage of residuals greater than or equal to .05 in absolute value--for example, in the combined-sample analysis, from 15 percent to 9 percent.
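The residuals just mentioned are observed parcel correlations minus the correlations reproduced from the factor loadings (for orthogonal factors, the sum of cross-products of the two parcels' loadings). A sketch: the loadings below are the Pic01 and Pic02 rows of the Japanese three-factor solution in Table 6, while the observed correlation of .45 is hypothetical, since parcel intercorrelations are not tabled in this report.

```python
def residual(r_observed, loadings_i, loadings_j):
    """Residual correlation for one parcel pair after factor extraction:
    observed r minus the sum of cross-products of orthogonal loadings."""
    reproduced = sum(a * b for a, b in zip(loadings_i, loadings_j))
    return r_observed - reproduced

# Japanese three-factor loadings (Table 6); observed r is hypothetical
res = residual(0.45, [0.62, 0.14, 0.15], [0.57, 0.26, 0.30])
```

Here `res` is about .015, which would fall below the .05 threshold used in the residual counts reported above.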

With somewhat less across-sample consistency than that described above for the three-factor solutions, the several two-factor solutions were nonetheless generally similar. For example, in each analysis two similarly marked factors emerged, one relatively clearly defined by loadings for Listening Comprehension items and the other characterized by higher loadings for parcels involving the three nonlistening item types than for the Listening Comprehension parcels. However, lack of strong correlational affinity between the Reading Passages item type and the other two nonlistening item types, evidenced by the emergence of separate factors in the three-factor solutions, is suggested indirectly in the two-factor solutions by uneven patterns of loadings for the three sets of Reading Passages item-type parcels as compared to those of the two other Reading Comprehension section item types (cf. factor loadings). For example, the Reading Passages parcels had somewhat higher average loadings on the factor involving Listening Comprehension item types than did the Incomplete Sentences and Error Recognition parcels: median loadings (not shown in the table) for the Reading Passages parcels on the "listening" factor across analyses were .43, .34, and .35, whereas corresponding medians for the other two nonlistening item types were .25, .25 and .24. Note in Table 6 that such unevenness of factor loadings is not discernible in the three-factor solutions.

On balance, it appears that the correlational outcomes shown in Table 3, above, and the factor-related findings (in Tables 5 and 6, above) tend to support the current division of the TOEIC into sections measuring listening skills versus nonlistening skills, but at the same time they also indicate that further differentiation of nonlistening skills may be warranted. More specifically, it appears that the Reading Passages item type may be tapping aspects of reading that are psychometrically distinguishable from those being tapped in common by the Error Recognition and Incomplete Sentences item


types. The former involves the ability to read "connected discourse" and answer questions designed to assess understanding (comprehension) of the meaning, direct or implied, of material in a passage. Correct answers to Error Recognition and Incomplete Sentences items may be somewhat more dependent than are those for Reading Passages items upon relatively specific prior knowledge--especially but not exclusively knowledge of proper English usage, specific word knowledge, ability to recognize proper and improper usage when reading sentence-level material, and so on. Both of the "nonlistening" dimensions thus identified--tentatively labelled Reading and Usage, respectively--correspond to proficiency domains that appear to be given somewhat more emphasis than is listening comprehension in English language curricula in Japan (e.g., Saegusa, 1983) and very likely in Korea as well.e By inference, however, due to their focus on correct "usage" (reflecting formal knowledge of grammatical rules, for example), it is possible that performance on Usage items may tend to be more sensitive than is performance on Reading or Listening Comprehension items to differences in level of formal education in English as a foreign language, which is mandatory over the last six years of secondary school and the first two years of college or university study in Japan.

Further Analysis

Using data for the sample involved in the factor analysis (N = 2,080), exploratory analyses were undertaken to assess differences in performance on Usage and Reading subscores, as well as the summary Listening Comprehension section score, for subgroups of examinees classified by responses to background questions regarding gender, highest educational level attained, and patterns of use of/exposure to English as a second language (ESL). The latter variable reflects patterns of response to two different "Yes/No" questions regarding, respectively, daily use of English and experience abroad in an English-speaking environment.

A univariate analysis was made of differences among the respective subgroups with respect to each of the scores (LC, Usage and Reading) under consideration, using one-way analysis of variance. The univariate analysis was intended to permit a general evaluation of the relative differentiation of the subgroups on the variables involved, especially Usage and Reading. It was supplemented by a multivariate (multiple discriminant) analysis of differences among compound categories reflecting joint responses to the three background questions with respect to performance on Usage and Reading. For the multiple discriminant analysis (MDA), 16 compound categories were formed, reflecting classification of the sample by gender (two categories), educational level (collapsed into two categories, namely, less than university level vs. university or graduate level), and ESL use/exposure

15

(four categories). The specific categories of educationallevel, English use/exposure and gender involved in theunivariate analysis, as well as the 16 compound categoriesdefined for the multiple discriminant analysis are shown inTable 7. The MDA was designed to permit a systematicassessment of the possibility that if treated separately thetwo Reading Comprehension item type part-scores mightcontribute information about group differences that isobscured by their joint inclusion in the Reading Comprehensionscore, based on a rationale to be developed in detailfollowing consideration of the findings of the univariateanalysis.
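The 16 compound categories are simply the full crossing of gender (2) by collapsed educational level (2) by ESL use/exposure (4). As a minimal illustration of the crossing (the category labels below paraphrase Table 7 and are not the coding actually used in the study):

```python
from itertools import product

# Hypothetical labels paraphrasing Table 7's three background classifications.
genders = ["Male", "Female"]
use_exposure = ["No use, No stay abroad", "Use, No stay abroad",
                "No use, Stay abroad", "Use, Stay abroad"]
edu_levels = ["Less than university", "University or more"]

# Crossing 2 x 4 x 2 yields the 16 compound categories used in the MDA
# (gender varying slowest, educational level fastest, as in Table 7).
compound = [f"{g}/{e}/{u}" for g, u, e in product(genders, use_exposure, edu_levels)]
```

Each examinee then falls into exactly one of the resulting 16 mutually exclusive groups.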

Univariate Analysis of Group Differences

Table 8 shows descriptive statistics and results of one-way analysis of variance for subgroups reflecting differences in patterns of English use/exposure, educational level, and gender, respectively. Values of Eta-squared (e.g., Guilford, 1950, pp. 316-318) with subgroups as the dependent variables are also shown for additional perspective, keeping in mind that when the number of categories for a dependent variable is small, Eta may tend to underestimate the correlation (see, for example, Guilford, 1950, p. 323). Analyses involving each background question included only examinees with test data who also responded to that question; hence Ns varied slightly for these analyses. The subgroup means shown in Table 8 were expressed as deviations from the corresponding combined-sample means, in combined-sample standard deviation units. The corresponding z-scaled means are plotted in Figure 1.
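The effect-size index reported alongside F in Table 8 can be computed directly from the one-way ANOVA sums of squares. The sketch below uses synthetic scores only (loosely shaped like the LC row of Table 8, not the study's data), and the function name is ours:

```python
import numpy as np

def oneway_anova(groups):
    """Return (F, eta_squared) for a one-way ANOVA over a list of 1-D arrays.

    Eta-squared is the between-groups sum of squares as a proportion of the
    total sum of squares, i.e., the share of score variance associated with
    subgroup membership.
    """
    groups = [np.asarray(g, dtype=float) for g in groups]
    scores = np.concatenate(groups)
    grand_mean = scores.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = scores.size - len(groups)
    F = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return F, eta_squared

# Synthetic scores with group means/sizes loosely echoing Table 8 (not real data).
rng = np.random.default_rng(0)
sim_groups = [rng.normal(m, 15.0, size=n)
              for m, n in ((55.5, 986), (56.0, 729), (73.4, 151), (76.4, 149))]
F, eta2 = oneway_anova(sim_groups)
```

Because Eta-squared is a ratio of sums of squares, a large F with many cases can coexist with a modest Eta-squared, as in several rows of Table 8.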

From inspection of Table 8 and Figure 1 it can be seen that performance on Listening Comprehension varies more with English use/exposure, which reflects experience beyond the classroom, than with educational level, which reflects differences in amount of (required) formal instruction in English as a foreign language.f The opposite pattern obtains for both the Usage and Reading subscores. It may be seen in Figure 1, for example, that examinees with less extensive formal education, that is, less than university level, had relatively lower means for Usage than for Reading, whereas the opposite pattern was present, albeit less pronounced, for the highest two educational categories. On the other hand, when examinees are classified by English use/exposure, differences in Reading appear to be more pronounced than differences in Usage; differences in Listening Comprehension appear to be more pronounced than differences on either of the two reading-related subtests. Gender differences in performance are generally less pronounced than the differences associated with educational level and ESL use/exposure. Although the observed differences are relatively slight as indexed by the Eta-squared values, they are nonetheless of interest because they suggest the possibility of differential patterns of


Table 7. Classification of Examinees According to Responses to Selected Background Questions for Univariate and Multivariate Analyses

Univariate analysis

Level of use/exposure to English
  "No/No"    Daily use (No)  / Experience abroad (No)
  "Yes/No"   Daily use (Yes) / Experience abroad (No)
  "No/Yes"   Daily use (No)  / Experience abroad (Yes)
  "Yes/Yes"  Daily use (Yes) / Experience abroad (Yes)

Highest educational level attained

  Less than junior college
  Junior college
  University
  Graduate school

Gender

  Male
  Female

Compound categories used in multivariate analysis
  Male  / Less than university / No use, No stay abroad
  Male  / University or more   / No use, No stay abroad
  Male  / Less than university / Use, No stay abroad
  Male  / University or more   / Use, No stay abroad
  Male  / Less than university / No use, Stay abroad
  Male  / University or more   / No use, Stay abroad
  Male  / Less than university / Use, Stay abroad
  Male  / University or more   / Use, Stay abroad
  Female/ Less than university / No use, No stay abroad
  Female/ University or more   / No use, No stay abroad
  Female/ Less than university / Use, No stay abroad
  Female/ University or more   / Use, No stay abroad
  Female/ Less than university / No use, Stay abroad
  Female/ University or more   / No use, Stay abroad
  Female/ Less than university / Use, Stay abroad
  Female/ University or more   / Use, Stay abroad

Table 8. Descriptive Statistics and Selected Analysis of Variance Results for Subgroups Classified by English Use/Exposure, Educational Level, and Gender, Respectively

                          Listening Comp.     Usage            Reading
                          (raw LC score)      subscore         subscore
                      N     Mean    S.D.      Mean    S.D.     Mean    S.D.

English use/exposure(1)
  No/No             986    55.55   14.55     39.63    9.97    27.79    7.55
  Yes/No            729    56.04   14.87     39.58   10.19    28.37    7.34
  No/Yes            151    73.40   14.19     45.98    8.98    33.62    5.25
  Yes/Yes           149    76.36   14.46     46.80    9.94    33.42    6.93
  Total            2015    58.60   16.14     40.62   10.26    28.85    7.54
  F (df=3,2011)           146.4              39.4             49.2
  p                       <.001              <.001            <.001
  Eta-squared              .189               .065             .084

Educational level
  LtJrCol            90    52.97   18.21     33.06   12.21    24.24    9.19
  JrCol             145    54.95   16.39     33.63   10.72    24.01    8.62
  Univ             1631    59.14   15.82     41.56    9.67    29.41    7.05
  GradSch           189    59.65   16.44     43.40    9.50    30.76    7.01
  Total            2055    58.62   16.10     40.80   10.22    28.93    7.48
  F (df=3,2051)             7.1              51.7             41.0
  p                       <.001              <.001            <.001
  Eta-squared              .018               .075             .059

Gender
  Male             1535    56.85   15.86     41.21   10.39    28.74    7.54
  Female            545    63.71   15.69     39.36    9.71    29.32    7.41
  Total            2080    58.64   16.10     40.73   10.24    28.89    7.51
  F (df=1,2078)            75.7              13.2              2.42
  p                       <.001              <.001             .120
  Eta-squared              .037               .018             .005

(1) The categories are based on Yes/No answers to questions regarding use of English regularly and experience in an English-speaking environment, respectively (see Table 7).


Figure 1. Z-scaled subgroup means on the Listening Comprehension, Usage, and Reading scores for subgroups classified by English use/exposure, educational level, and gender (see Table 8).

+++Not Available in Electronic Form+++


performance by gender on Usage and Reading, as well as Listening Comprehension: males performed slightly better than females on Usage, but females performed somewhat better than did males on both Listening Comprehension and Reading.

Multivariate Analysis of Group Differences

Results of the factor analysis suggest that a subscore reflecting performance on Reading items (parcels of which defined a factor labelled "Reading") tends to tap aspects of proficiency that are factorially independent of those tapped by a subscore reflecting performance on Incomplete Sentences and Error Recognition items (parcels of which marked a factor labelled "Usage"). Such independence suggests that part-scores based on the corresponding item types may tend to provide some unique information about individual or group differences that is obscured when they are summarized in the Reading Comprehension score. For example, groups may tend to differ not only in level of average performance on Usage and Reading (which together make up the summary RC score), but also with respect to patterns of average performance on the two part-scores--for example, relatively better on Usage than on Reading, or vice versa, as suggested by the univariate results. To explore this possibility, the method of multiple discriminant analysis (using procedures described in Norusis, 1990, pp. 1-42) was used to analyze differences among 16 groups reflecting joint classification of members of the total sample by gender, educational level (collapsed into two categories--less than university level vs. university or graduate level), and ESL use/exposure, with respect to scores on Usage and Reading.

Generally speaking, given observations on p test or other variables for members of G groups, multiple discriminant analysis yields p or G-1 statistically uncorrelated functions (linear combinations of the p variables), whichever is smaller. Functions are derived in such a way that the first function (weighted linear composite of scores on the independent variables involved) accounts for the largest percentage of among-groups variation, the second, statistically independent linear function accounts for the second largest percentage, and so on. For an analysis such as that here under consideration (involving 16 groups and two independent variables) only two functions are derived. Given the nature of the groups and measures under consideration, the first function might be expected to reflect differences in level of performance on the Reading Comprehension (RC) section, which would be suggested by positive weighting of the two RC-component part-scores on that function. If there are subgroup differences with respect to relative levels of performance on Usage and Reading, this would be indicated by negative weighting for one of the part-scores on the second discriminant function.f Selected results of the multiple discriminant analysis are summarized in Tables 10a and 10b.
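The derivation just described can be illustrated numerically. In the conventional formulation, the discriminant functions are the eigenvectors of inv(W) B, where W and B are the pooled within-groups and between-groups sums-of-squares-and-cross-products (SSCP) matrices. The sketch below runs that computation on synthetic data (it is not the SPSS procedure cited in the text, and the function name is ours), retaining min(p, G-1) functions:

```python
import numpy as np

def discriminant_functions(X, labels):
    """Canonical discriminant functions via the eigenvectors of inv(W) @ B.

    X is an (n, p) matrix of scores; labels gives group membership.  With p
    variables and G groups, min(p, G - 1) functions are retained; the first
    carries the largest share of among-groups variation, and so on.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    p = X.shape[1]
    grand_mean = X.mean(axis=0)
    W = np.zeros((p, p))  # pooled within-groups SSCP matrix
    B = np.zeros((p, p))  # between-groups SSCP matrix
    for g in np.unique(labels):
        Xg = X[labels == g]
        dev = Xg - Xg.mean(axis=0)
        W += dev.T @ dev
        d = (Xg.mean(axis=0) - grand_mean).reshape(-1, 1)
        B += len(Xg) * (d @ d.T)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(eigvals.real)[::-1]  # largest eigenvalue first
    eigvals = eigvals.real[order]
    eigvecs = eigvecs.real[:, order]
    n_funcs = min(p, len(np.unique(labels)) - 1)
    # Percent of among-groups variation carried by each retained function.
    pct = 100.0 * eigvals[:n_funcs] / eigvals[:n_funcs].sum()
    return eigvecs[:, :n_funcs], pct

# Synthetic example: two part-scores, three groups (not the study's data).
rng = np.random.default_rng(1)
offsets = np.array([[0.0, 0.0], [2.0, 1.0], [1.0, 3.0]])
X = np.vstack([rng.normal(size=(50, 2)) + off for off in offsets])
labels = np.repeat([0, 1, 2], 50)
funcs, pct = discriminant_functions(X, labels)
```

With p=2 variables, as in the analysis reported here, exactly two functions are retained regardless of the number of groups, matching the "only two functions are derived" remark above.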


Table 10a. Selected Results of the Multiple Discriminant Analysis of Differences Among 16 Groups Reflecting Gender (2) by Educational Level (2) by English Use/Exposure (4) with Respect to Part-Scores for Usage and Reading

Function attribute            First function   Second function

Percent of variance                85.86            14.14
Canonical correlation                .39              .17
Standardized coefficients
  Usage                              .47             1.29
  Reading                            .62            -1.23

Note. Values of Wilks' lambda for overall differentiation of groups by both components and for differentiation involving the second component only were, respectively, .82 and .97 (smaller values indicating greater differentiation); corresponding chi-square values, with 30 and 14 degrees of freedom, respectively, were significant at p < .0001.


Table 10b. Canonical Discriminant Functions Evaluated at Group Means (Group Centroids; Standardized, M=0, SD=1)

                                         Discriminant function
                                           1                 2
                                      (.47 x Usage     (1.29 x Usage
Group                            N    + .62 x Reading) - 1.23 x Reading)

M/  lower edu / no use, no stay   43     -1.27              .22
M/  lower edu / use, no stay      36      -.98             -.38
Fem/lower edu / no use, no stay   59      -.82             -.19
Fem/lower edu / use, no stay      59      -.87             -.25

M/  higher edu/ no use, no stay  691      -.06              .14
M/  higher edu/ use, no stay     486      -.02              .05
Fem/higher edu/ no use, no stay  187       .03             -.23
Fem/higher edu/ use, no stay     133       .23             -.27
Fem/lower edu / no use, stay      11       .41             -.61
Fem/lower edu / use, stay         13      -.16              .12
M/  lower edu / no use, stay       6       .55              .37
M/  lower edu / use, stay          6       .03              .25

M/  higher edu/ no use, stay     112       .70             -.02
M/  higher edu/ use, stay         87       .83              .09
Fem/higher edu/ no use, stay      22       .74             -.48
Fem/higher edu/ use, stay         40       .87             -.11

Total                           1990       .00              .00

Note. The group designations indicate classification by gender (M=male, Fem=female), educational level (lower = less than university, higher = university or graduate school), and English language use/exposure (daily use vs. no use; stay in an English-speaking environment vs. no stay). Thus, for example, "Fem/higher edu/use, stay" includes 40 females with university or graduate level education who both use English in their work and spent some time in an English-speaking environment.


Findings summarized in Table 10a indicate that both discriminant functions convey statistically significant information about group differences. The first function, which accounted for some 86 percent of the variance among groups, reflects group differences in level of average performance on Reading and Usage and, as indicated by the standardized discriminant coefficients (.47 for Usage and .62 for Reading), somewhat more so for Reading than for Usage. The second (statistically independent) function reflects differences in patterns of performance on Usage and Reading, for which the standardized coefficients were 1.29 and -1.23, respectively. Table 10b summarizes mean values for the 16 groups on each of the standardized discriminant functions. The groups are arrayed generally from low to high in terms of average score on the first function. It can be seen that the four subgroups with the highest scores on the first function are made up of university graduates, without regard to gender, who have had some experience in an English-speaking environment, without regard to whether or not they use English in their work. On the other hand, the four lowest scoring subgroups are made up of individuals educated at less than university level, again without regard to gender, who have had no experience in an English-speaking environment. Regarding the second function, it is noteworthy that seven of the eight lowest scoring groups (those with negative means) are made up exclusively of females, while seven of the eight highest scoring groups are composed exclusively of males. Thus, by inference, the second function appears to reflect gender-related differences in the pattern of average performance on Usage relative to Reading (females tending to have relatively higher Reading than Usage scores, and vice versa for males), after controlling for level of performance on a composite of the two scores.
For present purposes it is sufficient to note that such differences would necessarily be obscured in an analysis of differences among these groups with respect to performance on Reading Comprehension (Usage plus Reading).
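Concretely, each group centroid in Table 10b is the pair of discriminant functions from Table 10a evaluated at the group's mean standardized part-scores. A small sketch (the coefficients are those reported in Table 10a; the input z-scores below are illustrative, not taken from the data):

```python
# Standardized discriminant coefficients from Table 10a.
USAGE_W = (0.47, 1.29)     # weights of Usage on functions 1 and 2
READING_W = (0.62, -1.23)  # weights of Reading on functions 1 and 2

def centroid(z_usage, z_reading):
    """Evaluate both discriminant functions at a group's mean z-scores."""
    f1 = USAGE_W[0] * z_usage + READING_W[0] * z_reading
    f2 = USAGE_W[1] * z_usage + READING_W[1] * z_reading
    return f1, f2

# A hypothetical group half an SD above average on both part-scores lands
# high on the "level" function (similar positive weights) but near zero on
# the "pattern" function (weights of opposite sign nearly cancel).
level, pattern = centroid(0.5, 0.5)
```

This makes the interpretation explicit: the first function sums the two part-scores, while the second contrasts Usage against Reading, so it registers only relative differences between the two.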

DISCUSSION

The findings that have been reviewed indicate a clear distinction between listening and nonlistening proficiency components, as reflected in the current division of item types in the TOEIC test. At the same time, the findings suggest the possibility of further, potentially useful differentiation of the "nonlistening" aspects of proficiency being tapped by the three item types in the TOEIC Reading Comprehension section. The findings lend support to a three-dimensional view of the TOEIC test as having the potential for measuring three related but psychometrically distinguishable ability or proficiency dimensions, labelled "Listening Comprehension", "Reading", and "Usage", respectively. Listening Comprehension would remain as currently defined by the four TOEIC test listening comprehension item types. "Reading" would be defined by the "reading" item type, involving a short reading passage and questions that can be answered based on what is stated or implied in the passage (e.g., reading materials such as notices, advertisements, newspaper and magazine articles). "Usage" would be defined operationally by the Error Recognition and Incomplete Sentences item types, which provide relatively limited, sentence-level context and appear to call for relatively specific prior knowledge of formal aspects of the English language per se (e.g., ability to recognize appropriate and inappropriate usage of words or phrases in sentences, knowledge of tense, and so on, as well as lexicon).

Findings of both univariate and multivariate analyses of subgroup performance on Usage and Reading suggest that, when treated separately, these part-scores may provide potentially useful information about group differences--information that necessarily is obscured by their inclusion in a single, summary Reading Comprehension score. Further research is needed to assess the extent to which findings based on Japanese and Korean samples are consistent for different forms of the TOEIC test and for samples from the other national/linguistic populations now being served by the TOEIC program. It is also important to assess the concurrent and discriminant validity properties of subscores reflecting performance on the item types that marked the two reading-related dimensions identified in this study. Such research should help to shed light on questions regarding the pragmatic implications of differences in relative performance on the two reading-related subdomains identified in this study.


REFERENCES

Brumfit, C. (Ed.) (1982). English for International Communication. New York: Pergamon Press.

Educational Testing Service (1992). Test analysis: Test of English for International Communication, Form 30IC2, May 31, 1992 administration (Unpublished report). Princeton, NJ: Author.

Educational Testing Service (1991). TOEFL Test and Score Manual. Princeton, NJ: Author.

Guilford, J. P. (1950). Fundamental Statistics in Psychology and Education. New York: McGraw-Hill.

Hale, G. A., Rock, D. A., and Jirele, T. (1989). Confirmatory factor analysis of the Test of English as a Foreign Language (TOEFL Research Report No. 32). Princeton, NJ: Educational Testing Service.

Harman, H. H. (1976). Modern Factor Analysis (Third Revised Edition). Chicago: The University of Chicago Press.

Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20, 141-151.

Lowe, P. L., Jr., and Stansfield, C. W. (Eds.) (1988a). Second Language Proficiency Assessment: Current Issues. Englewood Cliffs, NJ: Prentice Hall Regents.

Lowe, P. L., Jr., and Stansfield, C. W. (1988b). Introduction. In P. L. Lowe, Jr., and C. W. Stansfield (Eds.), Second Language Proficiency Assessment: Current Issues (pp. 1-11). Englewood Cliffs, NJ: Prentice Hall Regents.

McKinley, R. L., and Way, W. D. (1992). The feasibility of modelling secondary TOEFL ability dimensions using multidimensional IRT models (TOEFL Technical Report TR-5). Princeton, NJ: Educational Testing Service.

Norusis, M. J. (1990). SPSS/PC+ Statistics 4.0 for the IBM PC/XT/AT and PS/2. Chicago: SPSS Inc.

Perkins, K. (1987). Test of English for International Communication. In C. J. Alderson and C. W. Stansfield (Eds.), Reviews of English Language Proficiency Tests (pp. 81-83). Washington, D.C.: Teachers of English to Speakers of Other Languages.

Saegusa, Y. (1983). Japanese college students' reading proficiency in English. Musashino English and American Literature, 16, 99-117.

Saegusa, Y. (1989). Japanese company workers' English proficiency. WASEDA Studies in Human Sciences, 2, 1-12.

The Chauncey Group International (1999). TOEIC User Guide. Princeton, NJ: Author.

The Chauncey Group International (1996). TOEIC: Report on Test-Takers Worldwide, 1996. Princeton, NJ: Author.

Wilson, K. M., and Graves, K. (1999). Validity of the Secondary-Level English Proficiency Test at Temple University-Japan (ETS RR-99-11). Princeton, NJ: Educational Testing Service.

Wilson, K. M., Bell, I., and Berquist, A. (1998). Guidelines for comparing performance on two tests of ESL proficiency: The TOEIC Test and the TOEFL (Unpublished field study report). Princeton, NJ: Educational Testing Service.

Wilson, K. M., Komarakul, S., and Woodhead, R. (1997). TOEIC/LPI relationships in academic and employment contexts in Thailand (Unpublished field study report). Princeton, NJ: Educational Testing Service.

Wilson, K. M. (1989). Enhancing the interpretation of a norm-referenced second-language test through criterion referencing: A research assessment of experience in the TOEIC testing context (TOEIC Research Report No. 1 and ETS RR-89-39). Princeton, NJ: Educational Testing Service.

Wilson, K. M., and Chavanich, K. (1989). Further evidence of stability in TOEIC/LPI relationships across diverse samples (Unpublished field study report). Princeton, NJ: Educational Testing Service.

Wilson, K. M. (1986). The relationship of GRE General Test scores to undergraduate grades: An exploratory study for selected subgroups (GRE Board Professional Report No. 83-19P and ETS RR-86-37). Princeton, NJ: Educational Testing Service.

Wilson, K. M. (1984). The relationship of GRE General Test scores to undergraduate grades (GRE Board Professional Report GREB No. 81-22P and ETS RR-85-9). Princeton, NJ: Educational Testing Service.

Woodford, P. E. (1982). The Test of English for International Communication (TOEIC). In C. Brumfit (Ed.), English for International Communication (pp. 61-72). New York: Pergamon Press.


ENDNOTES

a. For an independent review and evaluation of the TOEIC test, see Perkins (1987, pp. 81-82); see Woodford (1982) for developmental detail regarding the TOEIC test, which was developed under the general aegis of the Educational Testing Service (ETS), Princeton, NJ, USA, and is now administered under the aegis of The Chauncey Group International, Princeton, NJ, USA. The TOEIC test is used primarily by corporate clients outside the United States (see, for example, The Chauncey Group International, 1999) to assess ESL proficiency in samples of educated ESL speakers in, or preparing for, positions requiring the use of English as a second language (for detail regarding the characteristics of TOEIC test takers and their patterns of test performance see, for example, The Chauncey Group International, 1996). Regularly scheduled national examinations requiring candidate pre-registration are offered in only two countries, namely, Japan and Korea, in the TOEIC Secure Program; in these two countries and in all other countries being served by the TOEIC program, the majority of examinees are tested in their places of work by representatives of the respective national TOEIC agencies.

b. Correlations at about this level were also reported for TOEIC-LC and/or TOEIC-R with concurrent direct measures of listening, reading, and writing that were developed ad hoc--the direct listening and reading measures involved, for example, taped and written English stimuli, respectively, with questions and answers in Japanese. After considering the findings reported by Woodford, TOEIC test content, and other psychometric properties of the test, Perkins (1987) offered the following summary conclusions in an independent technical review: "In sum, TOEIC is a standardized, highly reliable and valid measure of English, specifically designed to assess real-life reading and listening skills of candidates who will use English in a work context. Empirical studies indicate that it is also a valid indirect measure of speaking and writing. The items assess major grammatical structures and reading skills and, in addition to being an integrative test, TOEIC also appears to tap communicative competence in that the items require the examinee to utilize his or her sociolinguistic and strategic competence" (p. 82).

c. Each of these two tests contains item types not found in the other; the TOEFL is designed for use in academic settings and the TOEIC is designed for use in workplace contexts, and test content differs accordingly (see Perkins, 1987, for more on this latter point). Their score scales are different, and scores on the two tests are not interchangeable.

d. A similar distinction has been found to obtain consistently in numerous studies involving the Test of English as a Foreign Language (see Hale, Rock, and Jirele, 1989, for a critical review of studies of the TOEFL's internal structure in the context of their own comprehensive factor study; see also McKinley & Way, 1992).

e. Some indirect empirical evidence of greater emphasis on reading-related skills than on listening/speaking skills in Japanese secondary-level EFL curricula is provided in a study (Wilson and Graves, 1999) of the performance of recent graduates of Japanese secondary schools on an ETS-developed ESL proficiency test designed for use in academic screening. In a sample assessed for placement in ESL by Temple University-Japan, scores on the reading section of the ESL proficiency test correlated more highly than did scores on the listening comprehension section with secondary-school grades in English as a foreign language.

f. For further development of the rationale for employing multiple discriminant analysis to evaluate the extent to which item-type part scores contributed independently to differentiation of subgroups in another testing context, see Wilson (1984, 1986).


APPENDIX

+++Not Available in Electronic Form+++

Illustrative TOEIC Test Item Types

Listening Comprehension--

Single Picture
Question-Response
Short Conversations
Short Talks

Reading Comprehension--

Incomplete Sentences
Error Recognition
Reading Comprehension

(labelled "Reading Passages" in the text to avoid confusion with the identically labelled section score)