2 Principles of Language Assessment

2. Validity

'Validity' is an all-encompassing term which is related to questions about what the test is actually assessing. Is the test telling you what you want to know? Does it measure what it is intended to measure? A test is not valid, for example, if it is intended to test a student's level of reading comprehension in a foreign language but instead tests intelligence or background knowledge.

When a new test is constructed it should be assessed for validity in as many ways as possible. The aspects of validity which are looked at will, of course, depend on the purpose for which the test has been designed and will partly depend on the importance of the assessment. A teacher writing a classroom quiz will not have the time or the inclination to carry out many different investigations of validity, but the constructors of an examination which will affect candidates' futures are duty-bound to examine as many aspects of validity as possible.

There are different views on the best ways of assessing validity, but there are some key aspects, and it is good practice to investigate as many of these as possible:

2.1 Construct validity

The term 'construct validity' refers to the overall construct or trait being measured. It is an inclusive term which, according to some testing practitioners, covers all aspects of validity, and is therefore a synonym for 'validity'. If a test is supposed to be testing the construct of listening, it should indeed be testing listening, rather than reading, writing and/or memory. To assess construct validity the test constructor can use a combination of internal and external quantitative and qualitative methods. For more about this, see section 6. An example of a qualitative validation technique would be for the test constructors to ask test-takers to introspect while they take a test, and to say what they are doing as they do it, so that the test constructors can learn about what the test items are testing, as well as whether the instructions are clear, and so on. Construct validation also relates to the test method, so it is often felt that the test should follow current pedagogical theories. If the current theory of language teaching emphasises a communicative approach, for example, a test containing only out-of-context, single-sentence, multiple-choice items, which test only one linguistic point at a time, is unlikely to be considered to have construct validity.

2.2 Content validity

The content validity of a test is sometimes checked by subject specialists who compare test items with the test specifications to see whether the items are actually testing what the designers say they are testing. (On specifications as a whole, see Davidson & Lynch 2002, and Alderson, Clapham & Wall 1995: Chapter 2.) In the case of a classroom quiz, of course, there will be no test specifications, and the deviser of the quiz may simply need to check the teaching syllabus or the course textbook to see whether each item is appropriate for that quiz. One of the advantages of even the most rudimentary content validation is that it identifies those items which are easy to test but which add nothing to our knowledge of what the students know; it is tempting for a test writer, for example, to write easy-to-test items and to ignore essential aspects of a foreign language because they are difficult to assess.

2.3 Face validity

Face validity refers to the degree to which a test looks right, and appears to measure the knowledge or ability it claims to measure, based on the subjective judgment of the examinees who take it.

Face validity is an important aspect of a test; it relates to the question of whether non-professional testers such as parents and students think the test is appropriate. If these non-specialists do not think the test is testing candidates' knowledge in a suitable manner, they may, for example, complain vociferously and the candidates may not tackle the test with the required zeal. If the test lacks face validity, it may not work as it should, and may have to be redesigned. (See Alderson, Clapham & Wall 1995: 172-73.)

2.4 Criterion-related validity

The aspects of test validity described so far relate to the 'internal' validity of the test, but some methods, and these are widely used for 'high-stakes' tests, also assess the 'external', 'criterion-related' validity of a test. To assess criterion-related validity, the students' test scores may, for example, be correlated with other measures of the students' language ability such as teachers' rankings of the students, or with the scores on a similar test. Such measures assess the concurrent validity of the measure. Similarly the future ability of the students can be assessed (the test's predictive validity) to see if the test can accurately foretell how the candidates will fare in the future. For example, if a test is supposed to assess whether students have a high enough level of a foreign language to be able to teach that language to secondary school children, the test should be validated, perhaps by classroom observation, to see whether students who have passed the test do actually have enough of the foreign language to be able to teach it in the classroom.
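As a rough illustration of the concurrent-validity check described above, the following Python sketch correlates students' test scores with a teacher's ranking of the same students. The scores and rankings are invented for illustration, the helper names (pearson, ranks, spearman) are hypothetical, and Spearman's rank correlation is only one reasonable choice of statistic for ranked data.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when a list has no variation")
    return cov / (sx * sy)


def ranks(values):
    """Rank values from 1 (lowest); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented data for six students (assumption, not from any real test).
test_scores = [78, 65, 90, 52, 71, 84]   # higher = better
teacher_rank = [2, 5, 1, 6, 4, 3]        # 1 = strongest student

# Negate the teacher's ranks so that "higher = better" on both measures.
rho = spearman(test_scores, [-r for r in teacher_rank])
print(round(rho, 3))
```

With these invented figures the coefficient is about 0.94, which would suggest that the test orders the students in much the same way as the teacher does. How high a coefficient needs to be before it counts as adequate evidence of concurrent validity is a matter of judgement and depends on the stakes of the test.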

2.5 Consequential validity (impact)

Consequential validity concerns how well the use of assessment results accomplishes the intended purposes and avoids unintended effects.

3. Reliability

The reliability of a test is an estimate of the consistency of its marks; a reliable test is one where, for example, a student will get the same mark if he or she takes the test, possibly with a different examiner, on a Monday morning or a Tuesday afternoon. A test must be reliable, as a test cannot be valid unless it is reliable. However, the converse is not true: it is perfectly possible to have a reliable test which is not valid. For example, a multiple-choice test of grammatical structures may be wonderfully reliable, but it is not valid if teachers are not interested in the grammatical abilities of their students and/or if grammar is not taught in the related language course.

If the test consists of right/wrong items such as multiple-choice items or some sorts of short answer questions, a reliability estimate such as the Alpha Coefficient or Kuder-Richardson 21 may be calculated (see Alderson, Clapham & Wall 1995: 87-89); but if the test consists of an essay or an oral interview, for example, then other forms of test reliability must be estimated. A statistic which can be used by the statistically sophisticated is based on Generalizability Theory (see Crocker & Algina 1986: Chapter 8), but simpler measures can also be estimated, such as correlations between the scores a marker gives on Day 1 and Day 5 (intra-rater reliability) and correlations between two different markers' scores (inter-rater reliability), along with calculations of whether the levels of raters' marks, as well as the order of the scores, are similar.
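The estimates mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the response matrix and rater marks are invented, the function names (kr21, cronbach_alpha, rater_agreement) are hypothetical, and population variance is used throughout (some textbooks use sample variance, which gives slightly different values for KR-21). KR-21 also assumes that all items are of roughly equal difficulty.

```python
from statistics import mean, pvariance


def kr21(total_scores, n_items):
    """Kuder-Richardson 21 from students' total scores on right/wrong items."""
    k = n_items
    m = mean(total_scores)
    var = pvariance(total_scores)  # population variance (assumption)
    if k < 2 or var == 0:
        raise ValueError("need at least two items and some score variation")
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * var))


def cronbach_alpha(item_scores):
    """Alpha Coefficient; item_scores has one row per student, one column per item."""
    k = len(item_scores[0])
    total_var = pvariance([sum(row) for row in item_scores])
    if k < 2 or total_var == 0:
        raise ValueError("need at least two items and some score variation")
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def rater_agreement(marks_a, marks_b):
    """Return (Pearson correlation, mean of marks_b minus marks_a)."""
    ma, mb = mean(marks_a), mean(marks_b)
    cov = sum((a - ma) * (b - mb) for a, b in zip(marks_a, marks_b))
    sa = sum((a - ma) ** 2 for a in marks_a) ** 0.5
    sb = sum((b - mb) ** 2 for b in marks_b) ** 0.5
    if sa == 0 or sb == 0:
        raise ValueError("correlation is undefined when a rater's marks do not vary")
    return cov / (sa * sb), mb - ma


# Invented right/wrong responses: five students, four items (1 = correct).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
totals = [sum(row) for row in responses]

alpha = cronbach_alpha(responses)
kr = kr21(totals, n_items=4)
print(round(alpha, 3), round(kr, 3))

# Invented essay marks from two raters for the same five scripts.
rater_a = [10, 12, 15, 11, 14]
rater_b = [12, 14, 17, 13, 16]
r, level_gap = rater_agreement(rater_a, rater_b)
print(round(r, 3), level_gap)
```

With these invented data the Alpha Coefficient is 0.8 and KR-21 is about 0.67; KR-21 is the lower of the two here because the items differ in difficulty. The two raters' marks correlate perfectly, yet the second rater is consistently two marks more generous, which is why the level of the marks needs to be checked as well as their order.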

4. Washback

Any language test or piece of assessment must have positive washback (backwash), by which I mean that the effect of the test on the teaching must be beneficial. This should be kept in mind by the test constructors; it is only too easy to construct a test which leads, for example, to candidates learning material by heart or achieving high marks by simply applying test-taking skills rather than genuine language skills (see Wall 1997).
