Administration, Scoring, and Reporting Scores

Ari Huhta, University of Jyväskylä, Finland

The Companion to Language Assessment, First Edition. Edited by Antony John Kunnan. © 2014 John Wiley & Sons, Inc. DOI: 10.1002/9781118411360.wbcla035

Introduction

    Administration, scoring, and reporting scores are essential elements of the testing process because they can significantly impact the quality of the inferences that can be drawn from test results, that is, the validity of the tests (Bachman & Palmer, 1996; McCallin, 2006; Ryan, 2006). Not surprisingly, therefore, professional language-testing organizations and educational bodies more generally cover these elements in some detail in their guidelines of good practice.

The Standards for Educational and Psychological Testing devote several pages to describing standards that relate specifically to test administration, scoring, and reporting scores (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education [AERA, APA, & NCME], 1999, pp. 61–6). Also, the three major international language-testing organizations, namely the International Language Testing Association (ILTA), the European Association for Language Testing and Assessment (EALTA), and the Association of Language Testers in Europe (ALTE), make specific recommendations about administration, scoring, and reporting scores for different contexts and purposes (e.g., classroom tests and large-scale examinations) and for different stakeholders (e.g., test designers, institutions, and test takers).

Although the detailed recommendations vary depending on the context, stakeholder, and professional association, the above guidelines endorse very similar practices. Guidelines on the administration of assessments typically aim at creating standardized conditions that would allow test takers to have a fair and equal opportunity to demonstrate their language proficiency. These include, for example, clear and uniform directions to test takers, an environment that is free of noise and disruptions, and adequate accommodations for disadvantaged test takers, such as extra time for people with dyslexia or a different version of the test for blind learners.


A slightly different consideration is test security: Individual test takers should not have an unfair advantage over others by accessing test material prior to the test or by copying answers from others during the test because of inadequate invigilation, for example. Administration thus concerns everything that is involved in presenting the test to the test takers: time, place, equipment, and instructions, as well as support and invigilation procedures (see Mousavi, 1999, for a detailed definition).

Scoring, that is, giving numerical values to test items and tasks (Mousavi, 1999), is a major concern for all types of testing, and professional associations therefore give several recommendations. From the point of view of test design, these associations emphasize the creation of clear and detailed scoring guidelines for all kinds of tests but especially for those that contain constructed response items and speaking and writing tasks. Accurate and exhaustive answer keys should be developed for open-ended items, raters should be given adequate training, and the quality of their work should be regularly monitored. Test scores and ratings should also be analyzed to examine their quality, and appropriate action should be taken to address any issues to ensure adequate reliability and validity.

The main theme in reporting, namely communicating test results to stakeholders (Cohen & Wollack, 2006, p. 380), is ensuring the intelligibility and interpretability of the scores. Reporting just the raw test scores is not generally recommended, so usually test providers convert test scores onto some reporting scale that has a limited number of score levels or bands, which are often defined verbally. An increasingly popular trend in reporting scores is to use the Common European Framework of Reference (CEFR) to provide extra meaning to scores. Other recommendations on reporting scores include that test providers give information about the quality (validity, reliability) of their tests, and about the accuracy of the scores, that is, how much the score is likely to vary around the reported score.

    Test Administration, Scoring, and Reporting Scores

    In the following, test administration, scoring, and reporting scores are described in terms of what is involved in each, and of how differences in the language skills tested and the purposes and contexts of assessment can affect the way tests are administered, scored, and reported. An account is also given of how these might have changed over time and whether any current trends can be discerned.

    Administration of Tests

The administration of language tests and other types of language assessments is highly dependent on the skill tested and task types used, and also on the purpose and stakes involved. Different administration conditions can significantly affect test takers' performance and, thus, the inferences drawn from test scores. As was described above, certain themes emerge in the professional guidelines that are fairly common across all kinds of test administrations. The key point is to create standardized conditions that allow test takers a fair opportunity to demonstrate what they can do in the language assessed, and so to get valid, comparable information about their language skills.


Clear instructions, a chance for the test taker to ask for clarifications, and an appropriate physical environment in terms of, for example, noise, temperature, ventilation, and space all contribute in their own ways to creating a fair setting (see Cohen & Wollack, 2006, pp. 356–60, for a detailed discussion of test administration and special accommodations).

A general condition that is certain to affect the administration, and also performance, is the time limit set for the test. Some tests are speeded on purpose, especially if they attempt to tap time-critical aspects of performance, such as a scanning task in which test takers have to locate specific information in a text quickly. Setting up a speeded task in an otherwise nonspeeded paper-based test is challenging administratively; on computer, task-specific time limits are obviously easy to implement. In most tests, time is not a key component of the construct measured, so enough time is given for almost everybody to finish the test. However, speededness can occur in nonspeeded tests when some learners cannot fully complete the test or have to change their response strategy to be able to reply to all questions. Omitted items at the end of a test are easy to spot, but other effects of unintended speededness are more difficult to discover (see Cohen & Wollack, 2006, pp. 357–8, on research into the latter issue).
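To make the check for unintended speededness concrete, the following minimal sketch counts trailing omitted items in one test taker's response record; the response coding (None for an item left blank) and the function name are illustrative assumptions rather than part of any actual testing system.

    from typing import List, Optional

    def count_not_reached(responses: List[Optional[str]]) -> int:
        """Count consecutive omitted items at the end of a test taker's record.

        A long run of trailing omissions is one rough indicator that the test
        may have been unintentionally speeded for this test taker.
        """
        count = 0
        for answer in reversed(responses):
            if answer is None:   # item left blank
                count += 1
            else:
                break            # stop at the last answered item
        return count

    # Example: the final three items were not reached.
    print(count_not_reached(["A", "C", "B", None, "D", None, None, None]))  # -> 3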

A major factor in test administration is the aspect of language assessed; in practice, this boils down to testing speaking versus testing the other skills (reading, writing, and listening). Most aspects of language can be tested in groups, sometimes in very large groups indeed. The prototypical test administration context is a classroom or a lecture hall full of learners sitting at their own tables writing in their test booklets. Testing reading and writing or vocabulary and structures can be quite efficiently done in big groups, which is obviously an important practical consideration in large-scale testing, as the per learner administration time and costs are low (for more on test practicality as an aspect of overall test usefulness, see Bachman & Palmer, 1996). Listening, too, can be administered to big groups, if equal acoustic reception can be ensured for everybody.

Certain tests are more likely to be administered to somewhat smaller groups. Listening tests and, more recently, computerized tests of any skill are typically administered to groups of 10–30 learners in dedicated language studios or computer laboratories that create more standardized conditions for listening tests, as all test takers can wear headphones.

Testing speaking often differs most from testing the other skills when it comes to administration. If the preferred approach to testing speaking is face to face with an interviewer or with another test taker, group administrations become almost impossible. The vast majority of face-to-face speaking tests involve one or two test takers at a time (for different oral test types, see Luoma, 2004; Fulcher, 2003; Taylor, 2011). International language tests are no exception: Tests such as the International English Language Testing System (IELTS), the Cambridge examinations, the Goethe Institut's examinations, and the French Diplôme d'études en langue française (DELF) and Diplôme approfondi de langue française (DALF) examinations all test one or two candidates at a time.

Interestingly, the practical issues in testing speaking have led to innovations in test administration such as the creation of semidirect tests. These are administered in a language or computer laboratory.


Test takers, wearing headphones and microphones, perform speaking tasks following instructions they hear from a tape or computer, and possibly also read in a test booklet. Their responses are recorded and rated afterwards. There has been considerable debate about the validity of this semidirect approach to testing speaking. The advocates argue that these tests cover a wider range of contexts, their administration is more standardized, and they result in very similar speaking grades compared with face-to-face tests (for a summary of research, see Malone, 2000). The approach has been criticized on the grounds that it solicits somewhat different language from face-to-face tests (Shohamy, 1994). Of the international examinations, the Test of English as a Foreign Language Internet-based test (TOEFL iBT) and the Test Deutsch als Fremdsprache (TestDaF), for example, use computerized semidirect speaking tests that are scored afterwards by human raters. The new Pearson Test of English (PTE) Academic also employs a computerized speaking test but goes a step further as the scoring is also done by the computer.

The testing context, purpose, and stakes involved can have a marked effect on test administration. The higher the stakes, the more need there is for standardization of test administration, security, confidentiality, checking of identity, and measures against all kinds of test fraud (see Cohen & Wollack, 2006, for a detailed discussion on how these affect test administration). Such is typically the case in tests that aim at making important selections or certifying language proficiency or achievement. All international language examinations are prime examples of such tests. However, in lower stakes formative or diagnostic assessments, administration conditions can be more relaxed, as learners should have fewer reasons to cheat, for example (though of course, if an originally low stakes test becomes more important over time, its administration conditions should be reviewed). Obviously, avoidance of noise and other disturbances makes sense in all kinds of testing, unless the specific aim is to measure performance under such conditions. Low stakes tests are also not tied to a specific place and time in the same way as high stakes tests are. Computerization, in particular, offers considerable freedom in this respect. A good example is DIALANG, an online diagnostic assessment system which is freely downloadable from the Internet (Alderson, 2005) and which can thus be taken anywhere, any time. Administration conditions of some forms of continuous assessment can also differ from the prototypical invigilated setting: Learners can be given tasks and tests that they do at home in their own time. These tasks can be included in a portfolio, for example, which is a collection of different types of evidence of learners' abilities and progress for either formative or summative purposes, or both (on the popular European Language Portfolio, see Little, 2005).

    Scoring and Rating Procedures

The scoring of test takers' responses and performances should be as directly related as possible to the constructs that the tests aim at measuring (Bachman & Palmer, 1996). If the test has test specifications, they typically contain information about the principles of scoring items, as well as the scales and procedures for the rating of speaking and writing. Traditionally, a major concern about scoring has been reliability: To what extent are the scoring and rating consistent over time and across raters?


The rating of speaking and writing performances, in particular, continues to be a major worry and considerable attention is paid to ensuring a fair and consistent assessment, especially in high stakes contexts. A whole new trend in scoring is computerization, which is quite straightforward in selected response items but much more challenging the more open-ended the tasks are. Despite the challenges, computerized scoring of all skills is slowly becoming a viable option, and some international language examinations have begun employing it.

As was the case with test administration, scoring, too, is highly dependent on the aspects of language tested and the task types used. The purpose and stakes of the test do not appear to have such a significant effect on how scoring is done, although attention to, for instance, rater consistency is obviously closer in high stakes contexts. The approach to scoring is largely determined by the nature of the tasks and responses to be scored (see Millman & Greene, 1993; Bachman & Palmer, 1996). Scoring selected response items dichotomously as correct versus incorrect is a rather different process from rating learners' performances on speaking and writing tasks with the help of a rating scale or scoring constructed response items polytomously (that is, awarding points on a simple scale depending on the content and quality of the response).

Let us first consider the scoring of item-based tests. Figure 58.1 shows the main steps in a typical scoring process: It starts with the test takers' responses, which can be choices made in selected response items (e.g., A, B, C, D) or free responses to gap-fill or short answer items (parts of words, words, sentences). Prototypical responses are test takers' markings on the test booklets that also contain the task materials. Large-scale tests often use separate optically readable answer sheets for multiple choice items. Paper is not, obviously, the only medium used to deliver tests and collect responses. Tape-mediated speaking tests often contain items that are scored rather than rated, and test takers' responses to such items are normally recorded on tape. In computer-based tests, responses are captured in electronic format, too, to be scored either by the computer applying some scoring algorithm or by a human rater.

In small-scale classroom testing the route to step 2, scoring, is quite straightforward. The teacher simply collects the booklets from the students and marks the papers. In large-scale testing this phase is considerably more complex, unless we have a computer-based test that automatically scores the responses. If the scoring is centralized, booklets and answer sheets first need to be mailed from local test centers to the main regional, national, or even international center(s). There the optically readable answer sheets, if any, are scanned into electronic files for further processing and analyses (see Cohen & Wollack, 2006, pp. 372–7, for an extended discussion of the steps in processing answer documents in large-scale examinations).

Figure 58.1  Steps in scoring item-based tests:
1. Individual item responses
2. Scoring (guided by the scoring key)
3. Individual item scores (item analyses may lead to deletion of items, changes to the key, etc.)
4. (Weighting of scores)
5. Sum of scores (score scale)
6. Application of cutoffs (informed by standard setting)
7. Score band or reporting scale

    Scoring key: An essential element of scoring is the scoring key, which for the selected response items simply tells how many points each option will be awarded. Typically, one option is given one point and the others zero points. However, sometimes different options receive different numbers of points depending on their degree of correctness or appropriateness. For productive items, the scoring can be considerably more complex. Some items have only one acceptable answer; this is typical of items focusing on grammar or vocabulary. For short answer items on reading and listening, the scoring key can include a number of different but acceptable answers but the scoring may still be simply right versus wrong, or it can be partial-credit and polytomous (that is, some answers receive more points than others).
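To illustrate how such a scoring key might be represented in a computer-based scoring system, the sketch below encodes both a dichotomous multiple choice item and a partial-credit short answer item as simple lookup tables; the item labels, answers, and point values are invented for illustration and do not come from any particular examination.

    # Hypothetical scoring key: each item maps acceptable (normalized) answers
    # to the points they earn; anything not listed scores zero.
    scoring_key = {
        "item_01": {"b": 1},                      # multiple choice, dichotomous
        "item_02": {"c": 1},
        "item_03": {"in the morning": 2,          # short answer, partial credit:
                    "morning": 1},                # a less complete answer earns less
    }

    def score_item(item_id: str, response: str) -> int:
        """Return the points awarded for a response; unlisted answers score 0."""
        normalized = response.strip().lower()
        return scoring_key.get(item_id, {}).get(normalized, 0)

    print(score_item("item_01", "B"))           # -> 1
    print(score_item("item_03", "morning"))     # -> 1
    print(score_item("item_03", "at night"))    # -> 0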

The scoring key is usually designed when the test items are constructed. The key can, however, be modified during the scoring process, especially for open-ended items. Some examinations employ a two-stage process in which a proportion of the responses is first scored by a core group of markers, who then complement the key for the marking of the majority of papers by adding to the list of acceptable answers based on their work with the first real responses.

Markers and their training: Another key element of the scoring plan is the selection of scorers or markers and their training. In school-based testing, the teacher is usually the scorer, although sometimes she may give the task to the students themselves or, more often, do it in cooperation with colleagues. In high stakes contexts, the markers and raters usually have to meet specified criteria to qualify. For example, they may have to be native speakers or non-native speakers with adequate proficiency, and they probably need to have formally studied the language in question.

Item analyses: An important part of the scoring process in professionally designed language tests is item analyses. The so-called classical item analyses are probably still the most common approach; they aim to find out how demanding the items are (item difficulty or facility) and how well they discriminate between good and poor test takers. These analyses can also identify problematic items or items tapping different constructs. Item analyses can result in the acceptance of additional responses or answer options for certain items (a change in the scoring key) or the removal of entire items from the test, which can change the overall test score.

Test score scale: When the scores of all items are ready, the next logical step is to combine them in some way into one or more overall scores. The simplest way to arrive at an overall test score is to sum up the item scores; here the maximum score equals the number of items in the test, if each item is worth one point. The scoring of a test comprising a mixture of dichotomously (0 or 1 point per item) scored multiple choice items and partial-credit/polytomous short answer items is obviously more complex. A straightforward sum of such items results in the short answer questions being given more weight because test takers get more points from them; for example, three points for a completely acceptable answer compared with only one point from a multiple choice item. This may be what we want, if the short answer items have been designed to tap more important aspects of proficiency than the other items. However, if we want all items to be equally important, each item score should be weighted by an appropriate number.
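A minimal sketch of this weighting idea follows; rescaling each item score by its maximum so that every item contributes equally is only one possible weighting scheme, and the item labels and scores are invented for illustration.

    # Hypothetical item scores and their maxima: two dichotomous multiple choice
    # items (max 1 point) and one partial-credit short answer item (max 3 points).
    item_scores = {"mc_01": 1, "mc_02": 0, "sa_01": 2}
    item_maxima = {"mc_01": 1, "mc_02": 1, "sa_01": 3}

    # Unweighted sum: the short answer item dominates simply because it is worth more.
    raw_total = sum(item_scores.values())                                   # -> 3

    # Equal-weight sum: each item score is divided by its maximum before summing,
    # so every item carries the same weight in the total (0-1 per item).
    equal_weight_total = sum(item_scores[i] / item_maxima[i] for i in item_scores)
    print(raw_total, round(equal_weight_total, 2))                          # -> 3 1.67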

Language test providers increasingly complement classical item analyses with analyses based on what is known as modern test theory or item response theory (IRT; one often-used IRT approach is Rasch analysis). What makes them particularly useful is that they are far less dependent than the classical approaches on the characteristics of the learners who happened to take the test and the items in the test. With the help of IRT analyses, it is possible to construct test score scales that go beyond the simple summing up of item scores, since they are adjusted for item difficulty and test takers' ability, and sometimes also for item discrimination or guessing. Most large-scale international language tests rely on IRT analyses as part of their test analyses, and also to ensure that their tests are comparable across administrations.
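For readers unfamiliar with the Rasch model mentioned above, its core equation (in standard notation, not tied to any of the tests discussed here) states that the probability of person n with ability θ_n answering item i of difficulty b_i correctly is

    P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}

Because ability and item difficulty are estimated on the same logit scale, scores derived from such a model are adjusted for how difficult the particular items were, which is what allows test forms to be compared across administrations.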

    An example of a language test that combines IRT analysis and item weighting in the computation of its score scale is DIALANG, the low stakes, multilingual diagnostic language assessment system mentioned above (Alderson, 2005). In the fully developed test languages of the system, the items are weighted differentially, ranging from 1 to 5 points, depending on their ability to discriminate.

Setting cutoff points for the reporting scale: Instead of reporting raw or weighted test scores, many language tests convert the score to a simpler scale for reporting purposes, to make the test results easier to interpret. The majority of educational systems probably use simple scales comprising a few numbers (e.g., 1–5 or 1–10) or letters (e.g., A–F). Sometimes it is enough to report whether the test taker passes or fails a particular test, and thus a simple two-level scale (pass or fail) is sufficient for the purpose. Alternatively, test results can be turned into developmental scores such as age- or grade-equivalent scores, if the group tested are children and if such age- or grade-related interpretations can be made from the particular test scores. Furthermore, if the reporting focuses on rank ordering test takers or comparing them with some normative group, percentiles or standard scores (z or T scores) can be used, for example (see Cohen & Wollack, 2006, p. 380).

The conversion of the total test score to a reporting scale requires some mechanism for deciding how the scores correspond to the levels on the reporting scale. The process through which such cutoff points (cut scores) for each level are decided is called standard setting (step 6 in Figure 58.1).
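Once standard setting has produced the cut scores, applying them is mechanical; the following sketch illustrates the conversion, with invented cut scores and band labels.

    # Hypothetical cut scores: the minimum total score required for each band,
    # listed from the highest band downwards.
    cut_scores = [(40, "Band 5"), (32, "Band 4"), (24, "Band 3"), (15, "Band 2"), (0, "Band 1")]

    def to_band(total_score: int) -> str:
        """Convert a total test score to a reporting band using fixed cutoffs."""
        for minimum, band in cut_scores:
            if total_score >= minimum:
                return band
        return "Band 1"   # scores below every cutoff fall into the lowest band

    print(to_band(27))    # -> Band 3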

Intuition and tradition are likely to play at least as big a role as any empirical evidence in setting the cutoffs; few language tests have the means to conduct systematic and sufficient standard-setting exercises. Possibly the only empirical evidence available to teachers, in particular, is to compare their students with each other (ranking), with the students' performances on previous tests, or with other students' performance on the same test (norm referencing). The teacher may focus on the best and weakest students and decide to use cutoffs that result in the regular top students getting top scores in the current test, too, and so on. If the results of the current test are unexpectedly low or high, the teacher may raise or lower the cutoffs accordingly.


Many large-scale tests are obviously in a better position to make more empirically based decisions about cutoff points than individual teachers and schools. A considerable range of standard-setting methods has been developed to inform decisions about cutoffs on test score scales (for reviews, see Kaftandjieva, 2004; Cizek & Bunch, 2006). The most common standard-setting methods focus on the test tasks; typically, experts evaluate how individual test items match the levels of the reporting scale. Empirical data on test takers' performance on the items or the whole test can also be considered when making judgments. In addition to these test-centered standard-setting methods, there are examinee-centered methods in which persons who know the test takers well (typically teachers) make judgments about their level. Learners' performances on the items and the test are then compared with the teachers' estimates of the learners to arrive at the most appropriate cutoffs.

Interestingly, the examinee-centered approaches resemble what most teachers are likely to do when deciding on the cutoffs for their own tests. Given the difficulty and inherent subjectivity of any formal standard-setting procedure, one wonders whether experienced teachers who know their students can in fact make at least equally good decisions about cutoffs as experts relying on test-centered methods, provided that the teachers also know the reporting scale well.

Sometimes the scale score conversion is based on a type of norm referencing where the proportion of test takers at the different reporting scale levels is kept constant across different tests and administrations. For example, the Finnish school-leaving matriculation examination for 18-year-olds reports test results on a scale where the highest mark is always given to the top 5% in the score distribution, the next 15% get the second highest grade, the next 20% the third grade, and so on (Finnish Matriculation Examination Board, n.d.).
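The mechanics of such a quota-based conversion can be sketched as follows; only the first three quotas (5%, 15%, 20%) come from the example above, while the remaining quotas and the grade labels are assumptions made purely for this illustration.

    import math

    # Proportion of test takers receiving each grade, best grade first. The first
    # three quotas follow the example in the text; the rest are invented so that
    # the proportions sum to 1.
    grade_quotas = [("7", 0.05), ("6", 0.15), ("5", 0.20),
                    ("4", 0.25), ("3", 0.20), ("2", 0.15)]

    def assign_quota_grades(scores):
        """Assign grades so that fixed proportions of test takers get each grade."""
        ranked = sorted(scores, reverse=True)            # best raw scores first
        grade_by_score, start = {}, 0
        for grade, share in grade_quotas:
            end = start + math.ceil(share * len(ranked))
            for score in ranked[start:end]:
                grade_by_score.setdefault(score, grade)  # tied scores keep the higher grade
            start = end
        return grade_by_score                            # maps raw score -> grade

    print(assign_quota_grades([55, 48, 47, 47, 30, 22, 19, 12, 9, 5]))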

A recent trend in score conversion concerns the CEFR. Many language tests have examined how their test scores relate to the CEFR levels in order to give added meaning to their results and to help compare them with the results of other language tests (for a review, see Martyniuk, 2011). This is in fact score conversion (or setting cutoffs) at a higher or secondary level: The first one involves converting the test scores to the reporting scale the test uses, and the second is about converting the reporting scale to the CEFR scale.

    Scoring Tests Based on Performance Samples

The scoring of speaking and writing tasks usually takes place with the help of one or more rating scales that describe test-taker performance at each scale level. The rater observes the test taker's performance and decides which scale level best matches the observed performance. Such rating is inherently criterion referenced in nature, as the scale serves as the criteria against which test takers' performances are judged (Bachman & Palmer, 1996, p. 212). This is in fact where the rating of speaking and writing differs the most from the scoring of tests consisting of items (e.g., reading or listening): In many tests the point or level on the rating scale assigned to the test taker is what will be reported to him or her. There is thus no need to count a total speaking score and then convert it to a different reporting scale, which is the standard practice in item-based tests.


The above simplifies matters somewhat because in reality some examinations use more complex procedures and may do some scale conversion and setting of cutoffs also for speaking and writing. However, in its most straightforward form, the rating scale for speaking and writing is the same as the reporting scale, although the wording of the two probably differs because they target different users (raters vs. test score users).

    It should be noted that instead of rating, it is possible to count, for example, features of language in speaking and writing samples. Such attention to detail at the expense of the bigger picture may be appropriate in diagnostic or formative assessment that provides learners with detailed feedback.

Rating scales are a specific type of proficiency scale and differ from the more general descriptive scales designed to guide the selection of test content and teaching materials or to inform test users about the test results (Alderson, 1991). Rating scales should focus on what is observable in test takers' performance, and they should be relatively concise in order to be practical. Most rating scales refer to both what the learners can and what they cannot do at each level; other types of scales may often avoid references to deficiencies in learners' proficiency (e.g., the CEFR scales focus on what learners can do with the language, even at the lowest proficiency levels).

Details of the design of rating scales are beyond the scope of this chapter; the reader is advised to consult, for example, McNamara (1996) and Bachman and Palmer (1996). Suffice it to say that test purpose significantly influences scale design, as do the designers' views about the constructs measured. A major decision concerns whether to use only one overall (holistic) scale or several scales. For obtaining broad information about a skill for summative, selection, and placement purposes, one holistic scale is often preferred as a quick and practical option. To provide more detailed information for diagnostic or formative purposes, analytic rating makes more sense. Certain issues concerning the validity of holistic rating, such as difficulties in balancing the different aspects lumped together in the level descriptions, have led to recommendations to use analytic rating and, if one overall score is required, to combine the component ratings (Bachman & Palmer, 1996, p. 211). Another major design feature relates to whether only language is to be rated or also content (Bachman & Palmer, 1996, p. 217). A further important question concerns the number of levels in a rating scale. Although a very fine-grained scale could yield more precise information than a scale consisting of just three or four levels, these benefits are cancelled out if raters are unable to distinguish between the levels. The aspect of language captured in the scale can also affect the number of points in the scale; it is quite possible that some aspects lend themselves to being split into quite a few distinct levels whereas others do not (see, e.g., the examples in Bachman & Palmer, 1996, pp. 214–18).
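Where a single overall score is needed from analytic ratings, the combination step can be as simple as averaging the component ratings, as in the brief sketch below; the criteria names and scale are illustrative assumptions.

    # Hypothetical analytic ratings for one writing performance on a 1-6 scale.
    analytic_ratings = {"content": 4, "organization": 5, "vocabulary": 4, "grammar": 3}

    # Equal-weight combination: the overall score is the mean of the component
    # ratings, rounded to the nearest scale level for reporting.
    overall = round(sum(analytic_ratings.values()) / len(analytic_ratings))
    print(overall)   # -> 4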

Since rating performances is usually more complex than scoring objective items, a lot of attention is normally devoted, in high stakes tests in particular, to ensuring the dependability of ratings. Figure 58.2 describes the steps in typical high stakes tests of speaking and writing. While most classroom assessment is based on only one rater, namely the teacher, the standard practice in most high stakes tests is for at least a proportion of performances to be double rated (step 3 in Figure 58.2). Sometimes the first rating is done during the (speaking) test (e.g., a rater is present in the Cambridge examinations but leaves the conduct of the test to an interlocutor), but often the first and second ratings are done afterwards from an audio or video recording, or from the scripts in the writing tests. Typically, all raters involved are employed and trained by the testing organization, but sometimes the first rater, even in high stakes tests, is the teacher (as in the Finnish matriculation examination), even if the second and decisive rating is done by the examination board.

Large-scale language tests employ various monitoring procedures to try to ensure that their raters work consistently enough. Double rating is in fact one such monitoring device, as it will reveal significant disagreement between raters; if this can be spotted while rating is still in progress, one or both of the raters can be given feedback and possibly retrained before being allowed to continue. Some tests use a small number of experienced master raters who continuously sample and check the ratings of a group of raters assigned to them. The TOEFL iBT has an online system that requires the raters to start each new rating session by assessing a number of calibration samples, and only if the rater passes them is he or she allowed to proceed to the actual ratings.
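A minimal sketch of the kind of consistency check that double rating makes possible is given below; the tolerance of one band and the flagging rule are illustrative assumptions rather than the procedure of any named examination.

    # Paired first and second ratings (on the same band scale) for a batch of scripts.
    first_ratings  = [4, 5, 3, 6, 2, 4]
    second_ratings = [4, 3, 3, 6, 4, 4]

    pairs = list(zip(first_ratings, second_ratings))

    # Scripts whose two ratings differ by more than one band would go to a third
    # rater; persistent disagreement would trigger feedback to the raters involved.
    flagged = [i for i, (r1, r2) in enumerate(pairs) if abs(r1 - r2) > 1]

    # Exact agreement rate is one simple indicator of rater consistency.
    exact_agreement = sum(r1 == r2 for r1, r2 in pairs) / len(pairs)

    print(flagged)                      # -> [1, 4]
    print(round(exact_agreement, 2))    # -> 0.67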

Figure 58.2  Steps in rating speaking and writing performances:
1. Performance during the test
2. First rating, using the rating scale(s) and benchmark samples: (a) during the test (speaking), or (b) afterwards from a recording or a script
3. Second rating (typically afterwards), also based on the rating scale(s) and benchmark samples
4. Identification of (significant) discrepancies between raters; identification of performances that are difficult to rate
5. Third and possibly more ratings
6. Compilation of the different raters' ratings
7. (Compilation of the different rating criteria into one, if analytic rating is used but only one score is reported)
8. (Sum of scores, if the final rating is not directly based on the rating scale categories or levels)
9. (Application of cutoffs, informed by standard setting)
10. Reporting of results on the reporting scale
Rater and rating analyses, and continuous monitoring, accompany the rating steps throughout.


A slightly different approach to monitoring raters involves adjusting their ratings up or down depending on their severity or lenience, which can be estimated with the help of multifaceted Rasch analysis. For example, the TestDaF, which measures German needed in academic studies, regularly adjusts reported scores for rater severity or lenience (Eckes et al., 2005, p. 373).
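In standard many-facet Rasch notation (not specific to the TestDaF), the adjustment works by adding a rater severity parameter to the basic model: the log-odds of rater j awarding person n a rating in category k rather than k − 1 on task or criterion i is modelled as

    \log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - b_i - c_j - \tau_k

where θ_n is the person's ability, b_i the difficulty of the task or criterion, c_j the severity of the rater, and τ_k the threshold of rating category k. Estimating c_j for each rater makes it possible to report scores that are adjusted for whether a test taker happened to be assessed by a severe or a lenient rater.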

Analytic rating scales appear to be the most common approach to rating speaking and writing in large-scale international language examinations, irrespective of language. Several English (IELTS, TOEFL, Cambridge, Pearson), German (Goethe Institut, TestDaF), and French (DELF, DALF) language examinations implement analytic rating scales, although they typically report speaking and writing as a single score or band.

It is usually also the case that international tests relying on analytic rating weigh all criteria equally and take the arithmetic or conceptual mean rating as the overall score for speaking or writing (step 7, Figure 58.2). Exceptions to this occur, however. The International Civil Aviation Organization (ICAO) specifies that all aviation English tests adhering to their guidelines must implement the five dimensions of oral proficiency in a noncompensatory fashion (Bachman & Palmer, 1996, p. 224). That is, the lowest rating across the five criteria determines the overall level reached by the test taker (ICAO, 2004).
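The noncompensatory rule reduces to taking the minimum across the criterion ratings, as the short sketch below shows; the criterion labels and ratings are invented for illustration.

    # Hypothetical ratings on the separate criteria of an aviation English speaking test.
    criterion_ratings = {"pronunciation": 5, "structure": 4, "vocabulary": 5,
                         "fluency": 4, "comprehension": 3}

    # Noncompensatory rule: the overall level is the lowest rating across criteria,
    # so strength in one area cannot compensate for weakness in another.
    overall_level = min(criterion_ratings.values())
    print(overall_level)   # -> 3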

    Reporting Scores

Score reports inform different stakeholders, such as test takers, parents, admission officers, and educational authorities, about individuals' or groups' test results for possible action. Thus, these reports can be considered more formal feedback to the stakeholders. Score reports are usually pieces of paper that list the scores or grades obtained by the learner, possibly with some description of the test and the meaning of the grades. Some, typically more informal, reports may be electronic in format, if they are based on computerized tests and intended only for the learners and their teachers (e.g., the report and feedback from DIALANG). Score reports use the reporting scale onto which raw scores were converted, as described in the previous section.

    Score reports are forms of communication and thus have a sender, receiver, content, and medium; furthermore, they serve particular purposes (Ryan, 2006, p. 677). Score reports can be divided into two broad types: reports on individuals and reports on groups. Reporting scores is greatly affected by the purpose and type of testing.

The typical sender of score reports on individual learners based on classroom tests is the teacher, who acts on behalf of the school and municipality and ultimately also as a representative of some larger public or private educational system. The sender of more formal end-of-term school reports or final school-leaving certificates is most often the school, again acting on behalf of a larger entity. The main audiences of both score reports and formal certificates are the students and their parents, who may want to take some action based on the results (feedback) given to them. School-leaving certificates also have other users, such as higher-level educational institutions or employers making decisions about admitting and hiring individual applicants.


School-external tests and examinations are another major originator of score reports for individuals. The sender here is typically an examination board, a regional or national educational authority, or a commercial test provider. Often such score reports are related to examinations that take place only at important points in the learners' careers, such as the end of compulsory education, the end of pre-university education, or when students apply for a place in a university. The main users of such reports are basically the same as for school-based reports except that in many contexts external reports are considered more prestigious and trustworthy, and may thus be the only ones accepted as proof of language proficiency, for instance for studying in a university abroad.

In addition to score reports on individuals' performance, group-level reports are also quite common. They may be simply summaries of individual score reports at the class, school, regional, or national level. Sometimes tests are administered from which no reports are issued to individual learners; only group-level results are reported. The latter are typically tests given by educational authorities to evaluate students' achievement across the regions of a country or across different curricula. International comparative studies on educational achievement exist, in language subjects among others. The best known is the Programme for International Student Assessment (PISA) by the Organisation for Economic Co-operation and Development (OECD), which regularly tests 15-year-olds' reading skills in their language of education and reports the results at the country level.

The content of score reports clearly depends on the purpose of assessment. The prototypical language score report provides information about the test taker's proficiency on the reporting scale used in the educational system or the test in question. Scales consisting of numbers or letters are used in most if not all educational systems across the world. With the increase in criterion-referenced testing, such simple scales are nowadays often accompanied by descriptions of what different scale points mean in terms of language proficiency. Entirely non-numeric reports also exist; in some countries the reporting of achievement in the first years of schooling consists of only verbal descriptions.

Score reports from language proficiency examinations and achievement tests often report on overall proficiency only, as a single number or letter (e.g., the Finnish matriculation examination). Some proficiency tests, such as the TOEFL iBT and the IELTS, issue subtest scores in addition to a total score. In many placement contexts, too, it may not be necessary to report more than an overall estimate of candidates' proficiency. However, the more the test aims at supporting learning, as diagnostic and formative tests do, the more useful it is to report profiles based on subtests or even individual tasks and items. For example, the diagnostic DIALANG test reports on test-, subskill-, and item-level performance.

    Current Research

Research on the three aspects of the testing process covered here is very uneven. Test administration appears to be the least studied (McCallin, 2006, pp. 639–40), except for the types of testing where it is intertwined with the test format, such as in computerized testing, which is often compared with paper-based testing, and in oral testing, where factors related to the setting and participants have been studied. Major concerns with computerized tests include the effect of computer familiarity on the test results and to what extent such tests are, or should be, comparable with paper-based tests (e.g., adaptivity is really possible only with computerized tests) (Chapelle & Douglas, 2006).

As far as oral tests are concerned, their characteristics and administration have been studied for decades. In particular, the nature of the communication and the effect of the tester (interviewer) have been hotly debated. For example, can the prototypical test format, the oral interview, represent normal face-to-face communication? The imbalance of power, in particular, has been criticized (Luoma, 2004, p. 35), which has contributed to the use of paired tasks in which two candidates interact with each other, in a supposedly more equal setting. Whether the pairs are in fact equal has also been a point of contention (Luoma, 2004, p. 37). Research seems to have led to more mixed use of different types of speaking tasks in the same test, such as both interviews and paired tasks. Another issue with the administration conditions and equal treatment of test takers concerns the consistency of interviewers' behavior: Do they treat different candidates in the same way? Findings indicating that they do not (Brown, 2003) have led the IELTS, for example, to impose stricter guidelines on their interviewers to standardize their behavior.

An exception to the paucity of research into the more general aspects of test administration concerns testing time. According to studies reviewed by McCallin (2006, pp. 631–2), allowing examinees more time on tests often benefits everybody, not just examinees with disabilities. One likely reason for this is that many tests that are intended to test learners' knowledge (power tests) may in fact be at least partly speeded.

Compared with test administration, research on the scoring and rating of performances has a long tradition. Space does not allow a comprehensive treatment, but a list of some of the important topics gives an idea of the research foci:

• analysis of factors involved in rating speaking and writing, such as the rater, rating scales, and participants (e.g., Cumming, Kantor, & Powers, 2002; Brown, 2003; Lumley, 2005);
• linking test scores (and reporting scales) with the CEFR (e.g., Martyniuk, 2011);
• validity of automated scoring of writing and speaking (e.g., Bernstein, Van Moere, & Cheng, 2010; Xi, 2010); and
• scoring short answer questions (e.g., Carr & Xi, 2010).

Research into reporting scores is not as common as studies on scoring and rating. Goodman and Hambleton (2004) and Ryan (2006) provide reviews of practices, issues, and research into reporting scores. Given that the main purpose of reports is to provide different users with information, Ryan's statement that whatever research exists presents a fairly consistent picture of the ineffectiveness of score reports to communicate meaningful information to various stakeholder groups (2006, p. 684) is rather discouraging. The comprehensibility of large-scale assessment reports, in particular, seems to be poor due to, for example, the use of technical terms, too much information too densely packed, and a lack of descriptive information (Ryan, 2006, p. 685). Such reports could be made more readable, for example, by making them more concise, by providing a glossary of the terms used, by displaying more information visually, and by supporting figures and tables with adequate descriptive text.

Ryan's own study on educators' expectations of the score reports from the statewide assessments in South Carolina, USA, showed that his informants wanted more specific information about the students' performance and better descriptions of what different scores and achievement levels meant in terms of knowledge and ability (2006, p. 691). The educators also reviewed different types of individual and group score reports for mathematics and English. The most meaningful report was the achievement performance level narrative, a four-level description of content and content demands that systematically covered what learners at a particular level could and could not do (Ryan, 2006, pp. 692–705).

    Challenges

Reviews of test administration (e.g., McCallin, 2006, p. 640) suggest that nonstandard administration practices can be a major source of construct-irrelevant variation in test results. The scarcity of research on test administration is therefore all the more surprising. McCallin calls for a more systematic gathering of information from test takers about administration practices and conditions, and for a more widespread use of, for example, test administration training courseware as effective ways of increasing the validity of test scores (2006, p. 642).

    Scoring and rating continue to pose a host of challenges, despite considerable research. The multiple factors that can affect ratings of speaking and writing, in particular, deserve further attention across all contexts where these are tested. One challenge such research faces is that applying such powerful approaches as multifaceted Rasch analysis in the study of rating data requires considerable expertise.

Automated scoring will increase in the future, and will face at least two major challenges. The first is the validity of such scoring: to what extent it can capture everything that is relevant in speaking and writing, in particular, and whether it works equally well with all kinds of tasks. The second is the acceptability of automated scoring, if used as the sole means of rating. Recent surveys of users indicate that the majority of test takers feel uneasy about fully automated rating of speaking (Xi, Wang, & Schmidgall, 2011).

    As concerns reporting scores, little is known about how different reports are actually used by different stakeholders (Ryan, 2006, p. 709), although something is already known about what makes a score report easy or difficult to understand. Another challenge is how to report reliable profile scores for several aspects of proficiency when each aspect is measured by only a few items (see, e.g., Ryan, 2006, p. 699). This is particularly worrying from the point of view of diagnostic and formative testing, where rich and detailed profiling of abilities would be useful.


    Future Directions

    The major change in the administration and scoring of language tests and in the reporting of test results in the past decades has been the gradual introduction of different technologies. Computer-based administration, automated scoring of fairly simple items, and the immediate reporting of scores have been technically possible for decades, even if not widely implemented across educational systems. With the advent of new forms of information and communication technologies (ICT) such as the Internet and the World Wide Web, all kinds of online and computer-based examinations, tests, and quizzes have proliferated.

High stakes international language tests have implemented ICT since the time optical scanners were invented. Some of the more modern applications are less obvious, such as the distribution of writing and speaking samples for online rating. The introduction of a computerized version of such high stakes examinations as the TOEFL in the early 2000s marked the beginning of a new era. The new computerized TOEFL iBT and the PTE are likely to show the way most large-scale language tests are headed.

The most important recent technological innovation concerns automated assessment of speaking and writing performances. The TOEFL iBT combines human and computer scoring in the writing test, and implements automated rating in its online practice speaking tasks. The PTE implements automated scoring in both speaking and writing, with a certain amount of human quality control involved (see also the Versant suite of automated speaking tests [Pearson, n.d.]). It can be predicted that many other high stakes national and international language tests will become computerized and will also implement fully or partially automated scoring procedures.

What will happen at the classroom level? Changes in major examinations will obviously impact schools, especially if the country has high stakes national examinations. Thus, the inevitable computerization of national examinations will have some effect on schools over time, irrespective of their current use of ICT. The effect may simply be a computerization of test preparation activities, but changes may be more profound, because there is another possible trend in computerized testing that may impact classrooms: more widespread use of computerized formative and diagnostic tests. Computers have potential for highly individualized feedback and exercises based on diagnosis of learners' current proficiency and previous learning paths. The design of truly useful diagnostic tools and meaningful interventions for foreign and second language learning is still in its infancy, and much more basic research is needed to understand language development (Alderson, 2005). However, different approaches to designing more useful diagnosis and feedback are currently being taken, including studies that make use of insights into dyslexia in the first language (Alderson & Huhta, 2011), analyses of proficiency tests for their diagnostic potential (Jang, 2009), and dynamic assessment based on dialogical views of learning (Lantolf & Poehner, 2004), all of which could potentially lead to tools that are capable of diagnostic scoring and reporting, and could thus have a major impact on language education.


    SEE ALSO: Chapter 51, Writing Scoring Criteria and Score Reports; Chapter 52, Response Formats; Chapter 56, Statistics and Software for Test Revisions; Chapter 59, Detecting Plagiarism and Cheating; Chapter 64, Computer-Automated Scoring of Written Responses; Chapter 67, Accommodations in the Assessment of English Language Learners; Chapter 80, Raters and Ratings

    References

Alderson, J. C. (1991). Bands and scores. In J. C. Alderson & B. North (Eds.), Language testing in the 1990s: The communicative legacy (pp. 71–86). London, England: Macmillan.
Alderson, J. C. (2005). Diagnosing foreign language proficiency: The interface between learning and assessment. New York, NY: Continuum.
Alderson, J. C., & Huhta, A. (2011). Can research into the diagnostic testing of reading in a second or foreign language contribute to SLA research? In L. Roberts, M. Howard, M. Laioire, & D. Singleton (Eds.), EUROSLA yearbook. Vol. 11 (pp. 30–52). Amsterdam, Netherlands: John Benjamins.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Bachman, L., & Palmer, L. (1996). Language testing in practice: Designing and developing useful language tests. Oxford, England: Oxford University Press.
Bernstein, J., Van Moere, A., & Cheng, J. (2010). Validating automated speaking tests. Language Testing, 27(3), 355–77.
Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency. Language Testing, 20(1), 1–25.
Carr, N., & Xi, X. (2010). Automated scoring of short-answer reading items: Implications for constructs. Language Assessment Quarterly, 7(2), 205–18.
Chapelle, C., & Douglas, D. (2006). Assessing language through computer technology. Cambridge, England: Cambridge University Press.
Cizek, G., & Bunch, M. (2006). Standard setting: A guide to establishing and evaluating performance standards on tests. London, England: Sage.
Cohen, A., & Wollack, J. (2006). Test administration, security, scoring, and reporting. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 355–86). Westport, CT: ACE.
Cumming, A., Kantor, R., & Powers, D. (2002). Decision making while rating ESL/EFL writing tasks: A descriptive framework. Modern Language Journal, 86, 67–96.
Eckes, T., Ellis, M., Kalnberzina, V., Pižorn, K., Springer, C., Szollás, K., & Tsagari, C. (2005). Progress and problems in reforming public language examinations in Europe: Cameos from the Baltic States, Greece, Hungary, Poland, Slovenia, France and Germany. Language Testing, 22(3), 355–77.
Finnish Matriculation Examination Board. (n.d.). Finnish Matriculation Examination. Retrieved July 14, 2011 from http://www.ylioppilastutkinto.fi
Goodman, D., & Hambleton, R. (2004). Student test score reports and interpretive guides: Review of current practices and suggestions for future research. Applied Measurement in Education, 17(2), 145–221.
International Civil Aviation Organization. (2004). Manual on the implementation of ICAO language proficiency requirements. Montréal, Canada: Author.
Jang, E. (2009). Cognitive diagnostic assessment of L2 reading comprehension ability: Validity arguments for Fusion Model application to LanguEdge assessment. Language Testing, 26(1), 31–73.
Kaftandjieva, F. (2004). Standard setting. Reference supplement to the preliminary pilot version of the manual for relating language examinations to the Common European Framework of Reference for Languages: Learning, teaching, assessment. Strasbourg, France: Council of Europe.
Lantolf, J., & Poehner, M. (2004). Dynamic assessment: Bringing the past into the future. Journal of Applied Linguistics, 1, 49–74.
Little, D. (2005). The Common European Framework and the European Language Portfolio: Involving learners and their judgments in the assessment process. Language Testing, 22(3), 321–36.
Lumley, T. (2005). Assessing second language writing: The rater's perspective. Frankfurt, Germany: Peter Lang.
Luoma, S. (2004). Assessing speaking. Cambridge, England: Cambridge University Press.
Malone, M. (2000). Simulated oral proficiency interview: Recent developments (EDO-FL-00-14). Retrieved July 14, 2011 from http://www.cal.org/resources/digest/0014simulated.html
Martyniuk, W. (Ed.). (2011). Aligning tests with the CEFR: Reflections on using the Council of Europe's draft manual. Cambridge, England: Cambridge University Press.
McCallin, R. (2006). Test administration. In S. Downing & T. Haladyna (Eds.), Handbook of test development (pp. 625–51). Mahwah, NJ: Erlbaum.
McNamara, T. (1996). Measuring second language performance. Boston, MA: Addison Wesley Longman.
Millman, J., & Greene, J. (1993). The specification and development of tests of achievement and ability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 335–66). Phoenix, AZ: Oryx Press.
Mousavi, S. E. (1999). A dictionary of language testing (2nd ed.). Tehran, Iran: Rahnama Publications.
Pearson. (n.d.). Versant tests. Retrieved July 14, 2011 from http://www.versanttest.com
Ryan, J. (2006). Practices, issues, and trends in student test score reporting. In S. Downing & T. Haladyna (Eds.), Handbook of test development (pp. 677–710). Mahwah, NJ: Erlbaum.
Shohamy, E. (1994). The validity of direct versus semi-direct oral tests. Language Testing, 11(2), 99–123.
Taylor, L. (2011). Examining speaking: Research and practice in assessing second language speaking. Cambridge, England: Cambridge University Press.
Xi, X. (2010). Automated scoring and feedback systems: Where are we and where are we heading? Language Testing, 27(3), 291–300.
Xi, X., Wang, Y., & Schmidgall, J. (2011, June). Examinee perceptions of automated scoring of speech and validity implications. Paper presented at the LTRC 2011, Ann Arbor, MI.

    Suggested Readings

Abedi, J. (2008). Utilizing accommodations in assessment. In E. Shohamy & N. Hornberger (Eds.), Encyclopedia of language and education. Vol. 7: Language testing and assessment (2nd ed., pp. 331–47). New York, NY: Springer.
Alderson, J. C. (2000). Assessing reading. Cambridge, England: Cambridge University Press.
Becker, D., & Pomplun, M. (2006). Technical reporting and documentation. In S. Downing & T. Haladyna (Eds.), Handbook of test development (pp. 711–23). Mahwah, NJ: Erlbaum.
Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). Mahwah, NJ: Erlbaum.
Buck, G. (2000). Assessing listening. Cambridge, England: Cambridge University Press.
Fulcher, G. (2003). Testing second language speaking. Harlow, England: Pearson.
Fulcher, G. (2008). Criteria for evaluating language quality. In E. Shohamy & N. Hornberger (Eds.), Encyclopedia of language and education. Vol. 7: Language testing and assessment (2nd ed., pp. 157–76). New York, NY: Springer.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced resource book. London, England: Routledge.
North, B. (2001). The development of a common framework scale of descriptors of language proficiency based on a theory of measurement. Frankfurt, Germany: Peter Lang.
Organisation for Economic Co-operation and Development. (n.d.). OECD Programme for International Student Assessment (PISA). Retrieved July 14, 2011 from http://www.pisa.oecd.org
Weigle, S. (2002). Assessing writing. Cambridge, England: Cambridge University Press.
Xi, X. (2008). Methods of test validation. In E. Shohamy & N. Hornberger (Eds.), Encyclopedia of language and education. Vol. 7: Language testing and assessment (2nd ed., pp. 177–96). New York, NY: Springer.