Language Testing

A test is a sample of an individual's behaviour/performance on the basis of which inferences are made about the more general underlying competence of that individual.

The term "language test" refers to any kind of measurement/examination technique which aims at describing the test taker's foreign language proficiency, e.g. oral interview, listening comprehension task, free composition writing.

1. Kinds of tests and testing

Proficiency tests

Proficiency tests aim to measure students' L2 competence regardless of any training they previously had in the language. In these tests, designers specify what the candidates should be able to do to pass the test.

Achievement tests

Achievement tests assess whether learners have acquired specific elements of language that they were taught in the language course they took part in. There are two types of achievement tests: final tests at the end of the course and progress tests during the course.

Diagnostic tests

Diagnostic tests help identify learners' strengths and weaknesses in L2. Their main aim is to help teachers decide what needs to be taught to students.

Placement tests

With the help of placement tests, students can be placed in the learning group that is appropriate for their level of competence.

Direct versus indirect testing

In direct tests, candidates are required to perform the skill the test intends to measure.

Indirect tests aim to measure the skills that underlie performance in a particular task.

Discrete point versus integrative testing

In discrete point tests, every item focuses on one clear-cut segment of the target language without involving the others. Typical test format: written multiple-choice test.

In integrative tests, candidates need to use a number of language elements at the same time in completing the test tasks. For example: essay writing, dictation, cloze test.

Norm-referenced tests

In norm-referenced tests, candidates' performance is assessed in comparison with that of the other candidates. For this reason, the cut-off points (the line between fail and pass) are determined after the test results are obtained from the group of students, based on the distribution of the scores.

[Figure: histogram of TOTAL scores. x-axis: TOTAL (20.0 to 75.0); y-axis: Frequency (0 to 50). Mean = 53.1, Std. Dev = 11.96, N = 270.]

Criterion-referenced tests

Criterion-referenced tests compare all the testees to a predetermined criterion. In such tests everybody whose achievement comes up to the pre-set criterion will receive a pass mark, while those under it will fail. The criteria are often set in terms of tasks that students have to be able to perform (e.g. to interact with an interlocutor with ease; to ask for information and understand instructions).
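The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how a norm-referenced cut-off is derived from the score distribution, so the rule used here (one standard deviation below the group mean) is a hypothetical choice, as are the function names and the sample scores.

```python
from statistics import mean, stdev


def criterion_referenced(scores, pass_mark):
    """Pass everyone who reaches a pass mark fixed before the test."""
    return [score >= pass_mark for score in scores]


def norm_referenced(scores, sd_below_mean=1.0):
    """Derive the cut-off from the group's own results.

    ASSUMPTION: cut-off = mean - sd_below_mean * standard deviation.
    This rule is hypothetical; real exams may use other rules.
    """
    cut_off = mean(scores) - sd_below_mean * stdev(scores)
    return cut_off, [score >= cut_off for score in scores]


# Made-up scores for six candidates.
scores = [30, 45, 50, 55, 60, 70]

# The criterion is known in advance and does not depend on the group.
print(criterion_referenced(scores, pass_mark=60))

# The norm-referenced cut-off can only be computed after the test.
cut_off, results = norm_referenced(scores)
print(round(cut_off, 1), results)
```

With the same scores, the two approaches can pass different candidates, because only the norm-referenced cut-off moves with the performance of the group.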

Common European Framework of Reference for Languages

Proficient user

C2 Can understand with ease virtually everything heard or read. Can summarise information from different spoken and written sources, reconstructing arguments and accounts in a coherent presentation. Can express him/herself spontaneously, very fluently and precisely, differentiating finer shades of meaning even in more complex situations.

C1 Can understand a wide range of demanding, longer texts, and recognise implicit meaning. Can express him/herself fluently and spontaneously without much obvious searching for expressions. Can use language flexibly and effectively for social, academic and professional purposes. Can produce clear, well-structured, detailed text on complex subjects, showing controlled use of organisational patterns, connectors and cohesive devices.

Independent user

B2 Can understand the main ideas of complex text on both concrete and abstract topics, including technical discussions in his/her field of specialisation. Can interact with a degree of fluency and spontaneity that makes regular interaction with native speakers quite possible without strain for either party. Can produce clear, detailed text on a wide range of subjects and explain a viewpoint on a topical issue giving the advantages and disadvantages of various options.

B1 Can understand the main points of clear standard input on familiar matters regularly encountered in work, school, leisure, etc. Can deal with most situations likely to arise whilst travelling in an area where the language is spoken. Can produce simple connected text on topics which are familiar or of personal interest. Can describe experiences and events, dreams, hopes and ambitions and briefly give reasons and explanations for opinions and plans.

Basic user

A2 Can understand sentences and frequently used expressions related to areas of most immediate relevance (e.g. very basic personal and family information, shopping, local geography, employment). Can communicate in simple and routine tasks requiring a simple and direct exchange of information on familiar and routine matters. Can describe in simple terms aspects of his/her background, immediate environment and matters in areas of immediate need.

A1 Can understand and use familiar everyday expressions and very basic phrases aimed at the satisfaction of needs of a concrete type. Can introduce him/herself and others and can ask and answer questions about personal details such as where he/she lives, people he/she knows and things he/she has. Can interact in a simple way provided the other person talks slowly and clearly and is prepared to help.

Objective testing versus subjective testing

The scoring of a task is objective if the rater does not have to make a judgement because the scoring is unambiguous. For example: a multiple-choice test.

In subjective test tasks, raters have to make a judgement when assessing candidates' performance. For example: marking of an essay.

Consider the tasks in the 199 exam:

1. C-test

2. Gap-fill task

3. Summary writing

Decide whether these tasks are direct or indirect, subjective or objective, integrative or discrete point tasks.

Reliability

Reliability is the extent to which a test is free of random measurement error and produces consistent results when administered under similar conditions. This means that a reliable test is not affected by circumstances outside the test (e.g. the people who administer and mark the test, the time and place of the test).

Types of reliability:

internal consistency: whether the test items are related to each other and measure the same ability

parallel or alternate form reliability: how well parallel or alternate forms of the same test measure the same ability

test-retest reliability: whether test-takers perform similarly each time they complete the test

intra-rater reliability: whether the same rater assesses the test-takers' performance in the same way each time he/she evaluates the test

inter-rater reliability: whether two raters assess the test-takers' performance in the same way

Validity

Validity is the extent to which a test measures what it is supposed to measure and nothing else.

content validity: whether the content of the test is a representative sample of the language skills and structures it is intended to measure;

concurrent validity: whether the test takers' performance in a test correlates with their results in a different type of test;

predictive validity: whether the test results accurately predict future performance;

construct validity: whether the test appropriately represents the theory of language competence it is based on;

face validity: whether the test looks as if it measures what it is supposed to measure.

About the validity of the C-test

The item-related strategies used by the participants

Type of strategy                             Percentage of total
Lexical                                      12.41
Syntactic                                     9.97
Morphological                                 3.71
Textual                                       5.36
Background knowledge                          0.83
Translation                                   6.48
Counting the number of letters               15.31
No strategy used - automatically filled in   45.87
Total                                       100

3. Types of frequently used objective test tasks

Multiple choice

A multiple-choice item consists of a stem, for example:

1. He ______________ three letters since 9 o'clock.

and a set of options, one of which is correct while the others are distractors:

A writes

B has written

C has been written

D had written

Cloze test

It is a continuous text in which every Nth word is mechanically deleted. N is usually between five and ten. The examinees have to fill in these blanks. It aims to test reading comprehension, syntax and vocabulary.
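The mechanical deletion described above can be sketched in Python. This is a minimal illustration, not a standard tool: the function name `make_cloze`, the blank marker and the sample sentence are assumptions, and the sketch simply counts whitespace-separated words from the start of the text.

```python
import re


def make_cloze(text, n=7, blank="_____"):
    """Replace every n-th word of `text` with a blank.

    Returns the gapped text and the list of deleted words (the key).
    Punctuation attached to a deleted word is kept in place.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    gapped, answers = [], []
    for position, token in enumerate(text.split(), start=1):
        if position % n == 0:
            # Split the token into leading punctuation, word, trailing punctuation.
            lead, core, trail = re.match(r"^(\W*)(.*?)(\W*)$", token).groups()
            if core:
                answers.append(core)
                token = lead + blank + trail
        gapped.append(token)
    return " ".join(gapped), answers


sample = "The quick brown fox jumps over the lazy dog near the old river bank today"
gapped_text, key = make_cloze(sample, n=5)
print(gapped_text)
print(key)
```

Because the deletion is purely positional, the blanks fall on content words and function words alike, which is why the task draws on comprehension, syntax and vocabulary together.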

C-test

In the C-test the second half of every second word is left out. C-tests can provide a rough measure of learners' global level of proficiency.
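The deletion rule can be sketched in Python as follows. Several details are assumptions that the slide does not specify: counting starts at the first word of the text, words with an odd number of letters keep the shorter first part, one-letter words are left intact, and each deleted letter is shown as an underscore. The function name `make_c_test` is hypothetical.

```python
import re


def make_c_test(text):
    """Delete the second half of every second word.

    Returns the mutilated text and the list of deleted word endings.
    """
    gapped, answers = [], []
    for position, token in enumerate(text.split(), start=1):
        if position % 2 == 0:
            # Separate the word from any surrounding punctuation.
            lead, core, trail = re.match(r"^(\W*)(.*?)(\W*)$", token).groups()
            if len(core) > 1:  # ASSUMPTION: one-letter words stay intact
                keep = len(core) // 2  # ASSUMPTION: odd lengths lose the longer part
                answers.append(core[keep:])
                token = lead + core[:keep] + "_" * (len(core) - keep) + trail
        gapped.append(token)
    return " ".join(gapped), answers


mutilated, key = make_c_test("Language tests measure proficiency in many ways.")
print(mutilated)
print(key)
```

Showing one underscore per deleted letter is a design choice of this sketch; it makes the number of missing letters visible to the test taker.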

Dictation

The basis of the procedure is that each individual dictated chunk is long enough (10-25 words) to exceed the learner's short-term memory, and so the forgotten items have to be filled in from the context and the learner's knowledge of the language.

Editing

The editing test is the reverse of the cloze test: extra words are inserted into a text, and testees are required to cross these out.

For example:

extra words extra are inserted put placed gone into to a text, and testees are is required to crossing cross these out.

Matching

Candidates are given a list of possible answers which they have to match with another list of words.

For example:

Match the words on the left with those on the right to make other English words.

1 head A partner

2 room B wife

3 business C master

4 house D mate

Ordering

In ordering tasks, candidates have to put a group of words, sentences or paragraphs in order.

For example:

Put the following words in order to complete the sentence:

went yesterday I cinema friend to with.

The oral proficiency interview

Ideally, the oral proficiency interview consists of four phases:

1. Warm-up: usually not marked;

2. Level-check: getting an approximate idea of the learner's proficiency level and the topics he/she feels comfortable with;

3. Probes: actual rating starts only at this stage; the interviewee is pushed up to or beyond his/her level of competence;

4. Wind-up: rounding off the interview by turning back to activities within the learner's ability so as not to send him/her away with a feeling of failure.

Analysis of test results

The three simplest analyses of test results are the following:

1. Distribution curve shows the number of students scoring within a particular range.

[Figure: histogram of scores. x-axis: Score (0.0 to 14.0); y-axis: 0 to 20. Mean = 8.3, Std. Dev = 3.25, N = 61.]

2. Facility value expresses the proportion of students who responded correctly to an item. For example: if 100 students took part in a test, and 54 of them got the item right, the facility value is 0.54.

3. Discrimination index expresses how well an item can discriminate between stronger and weaker students. It ranges from -1 to 1.

Statistical features of good tests

The distribution curve should be bell-shaped.

Facility values should be between 0.3 and 0.7 (or in more lenient approaches to test design 0.2-0.8).

Discrimination indices should be above 0.4 (or in more lenient approaches to test design above 0.3).
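The thresholds above can be applied to a set of items with a short Python sketch. The function name `flag_items`, the item labels and the statistics in the example are made up for illustration; the default limits are the stricter ones from the text, and the lenient ones can be passed in instead.

```python
def flag_items(items, fv_range=(0.3, 0.7), min_di=0.4):
    """Return the items whose statistics fall outside the recommended limits.

    `items` maps an item label to a (facility value, discrimination index) pair.
    """
    low, high = fv_range
    flagged = {}
    for label, (fv, di) in items.items():
        problems = []
        if not low <= fv <= high:
            problems.append("facility value out of range")
        if di <= min_di:  # the index should be above the minimum
            problems.append("discrimination index too low")
        if problems:
            flagged[label] = problems
    return flagged


# Hypothetical item statistics: (facility value, discrimination index).
items = {
    "item 1": (0.54, 0.45),
    "item 2": (0.92, 0.10),
    "item 3": (0.25, 0.35),
}

print(flag_items(items))                                   # strict limits
print(flag_items(items, fv_range=(0.2, 0.8), min_di=0.3))  # lenient limits
```

In this made-up data, item 3 fails the strict limits but is acceptable under the lenient ones, while item 2 is too easy and discriminates poorly under both.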

Washback

Washback is the effect tests have on teaching and learning.

A beneficial washback effect occurs when a previously neglected skill (e.g. listening) becomes a focus of teaching because a newly introduced test gives scores in this skill an important role in determining the candidates' grades.

A negative washback effect occurs when most of the lesson time in secondary schools is spent on practising multiple-choice tests.

Tests have an effect on those who take them, the teachers who prepare students for them, the teaching materials (e.g. course-books), society and the educational system.

1. Explain the difference between

a) proficiency and achievement tests;

b) diagnostic and placement tests;

c) direct and indirect tests;

d) subjective and objective tests;

e) norm-referenced and criterion-referenced tests;

f) integrative and discrete point tests.

2. What is reliability? List the various types of reliability.

3. What is validity? List the various types of validity.

4. What are the most frequently used objective test tasks?

5. What are the most frequent statistical measures of test performance?

6. What effects can tests have on teaching and learning?
