
Videos (Summary)


DESCRIPTION

This was volunteer work done to summarize the points most relevant to me, my partner, and our classmates. It is based on Prof. Glenn Fulcher's videos, taken from his own website: http://languagetesting.info/video/main.html. Basically, we concentrated our work on the following videos: Pre-testing, J. D. Brown; Vocabulary, John Read; Reading, Caroline Clapham; Speaking, Glenn Fulcher; Listening, Gary Buck; Item Writing, Charles Stansfield; Writing, Liz Hamp-Lyons.



Testing summary (videos)

Diego Ulloa Lorena Salazar

Pre-testing J. D. Brown

What is pre-testing? For this author, pre-testing is also called piloting.

It is when you take a group of language items and try them out with a group of students.

It allows us to know which items will work best for all our students. This is done by seeing which items discriminate between top and bottom students.

Why is it important to do pre-testing? Because it helps us to know which items are useful and which aren't. Most teachers tend to think that all their items are good, but by pre-testing they can prove which ones really are and which are not. How do you select the items? Pre-testing helps you understand how well your items are working:

1) It is important to develop more items than the number you will actually need (if you need 40 items for the final test, it is better to develop about 60 items).

2) Then administer the items to a group of students (ideally students similar to, but different from, those who will take the final test).

3) Then write the items and results on a sheet of paper in columns (see the example in the video).

4) Then we are able to calculate the item difficulty, which is the percentage of students who answered a given item incorrectly.

5) Then we can calculate the item discrimination, which involves dividing the students into three groups according to their total scores. Doing this, you can calculate the item difficulty for each group.

You are then able to choose the items you are going to use in the final test, according to their item discrimination and item difficulty.
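A minimal sketch of the arithmetic behind steps 4 and 5 above, using made-up answer data (1 = correct, 0 = incorrect; all names and numbers are illustrative, not from the video):

```python
# Each row is one student's answers to four items: 1 = correct, 0 = incorrect.
answers = {
    "student_a": [1, 1, 1, 0],
    "student_b": [1, 1, 0, 0],
    "student_c": [1, 0, 0, 0],
}

def item_difficulty(answers, item):
    """Proportion of students who got the item WRONG (as defined in the summary)."""
    scores = [a[item] for a in answers.values()]
    return 1 - sum(scores) / len(scores)

def item_discrimination(answers, item):
    """Correct rate for the top third of total scorers minus the bottom third."""
    ranked = sorted(answers.values(), key=sum, reverse=True)
    third = max(1, len(ranked) // 3)
    top, bottom = ranked[:third], ranked[-third:]
    p_top = sum(a[item] for a in top) / len(top)
    p_bottom = sum(a[item] for a in bottom) / len(bottom)
    return p_top - p_bottom

print(item_difficulty(answers, 1))      # a third of students missed item 1
print(item_discrimination(answers, 1))  # item 1 separates top from bottom scorers
```

An item with near-zero or negative discrimination is a candidate for removal, since top scorers do no better on it than bottom scorers.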


Vocabulary John Read At first, vocabulary was not considered an important element of assessment.

Embedded vocabulary: when you look at vocabulary within the context of a larger construct

Vocabulary assessment is useful for:

- Placing students in language courses
- Measuring progress
- Diagnosis

Vocabulary knowledge is important because it supports fluent reading comprehension.

Measuring vocabulary size:

1) Take a sample of words from a word frequency list
2) Then test the learner to see whether those words are known or not
3) Then estimate the total vocabulary size of the native speaker or the second-language learner

Vocabulary tasks:

1) Descript vocabulary
2) Yes / no
3) Multiple choice
4) Matching task
5) Gap filling

Depth of vocabulary knowledge: this goes beyond testing only the meanings of words, to assess other aspects of word knowledge.

Word association format: the test taker is presented with the target word and then a series of other words, some of which are related; students have to select the words that are related to the target word.

Vocabulary knowledge scale: students must state how well they know a word's meaning. They are also asked to give synonyms or a translation, and to use the word in a sentence.

Vocabulary assessment also works for larger lexical units such as idioms, phrasal verbs, collocations, formulaic sequences and lexical phrases.
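The vocabulary-size measurement described above (sample a frequency list, test which sampled words are known, scale the hit rate back up) can be sketched as follows; the word list, known-word set and sampling step are all invented for illustration:

```python
def estimate_vocab_size(frequency_list, known_words, sample_step=10):
    """Take every Nth word of the frequency list as the sample,
    then extrapolate the proportion known to the whole list."""
    sample = frequency_list[::sample_step]
    hits = sum(1 for w in sample if w in known_words)
    proportion_known = hits / len(sample)
    return round(proportion_known * len(frequency_list))

# Stand-in data: a 1000-word "frequency list" and a learner who
# happens to know every third word on it.
freq = [f"word{i}" for i in range(1000)]
known = {f"word{i}" for i in range(0, 1000, 3)}
print(estimate_vocab_size(freq, known))
```

Real instruments (e.g. yes/no checklists) sample in frequency bands rather than uniformly, but the extrapolation idea is the same.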


Reading Caroline Clapham - Taking just one passage would test only the identification of facts such as addresses, dates and names. So it is better to think of the abilities that go into reading skills:

items that test inferencing, and the whole string of skills and strategies used in reading

- You should have a list specifying the skills needed in reading and then sample from that list:

skimming, scanning, looking for small pieces of information, and looking for the general meaning of a paragraph

How should you choose the reading passages? - Consider the purpose of the students:

Use the sorts of texts that people will read when they are using English (contextualization). These texts do not need to be totally authentic, but they should at least look authentic.

Use several texts: it makes the reading test far more reliable.
o You can vary things like the genre, or use topics some students may be more familiar with.
o Use short texts for intensive reading (when looking for specific facts).
o Use long texts for skimming and scanning in extensive reading (when looking for general meaning).

How should you test these passages? It all depends on both the students and the circumstances. It is better to use more than one sort of method:

Multiple Choice Questions (MCQ): very good for testing reading comprehension but very difficult to write, so use this alternative only if you are a trained and experienced MCQ writer. It is compulsory to pre-test them to see that they are working well.


Short Answer Questions: very revealing when the answer is limited to, for instance, three words. They are also fairly easy to mark.

Selection of Headings: a very good alternative if you are trying to assess understanding of the overall meaning of a passage, provided the headings are presented in a matching format.

Gapped Summary: for advanced readers, in which the test writer summarizes a text with gaps and students should fill in those gaps.

Information Transfer: students get the text, understand it and then transfer what they have read into a chart.

True or False Questions: they can work well if treated carefully, because of the problem that students have a 50% chance of getting each item right by guessing.
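The 50% guessing issue noted above can be quantified with the classical correction-for-guessing formula, score = right - wrong / (k - 1), where k is the number of answer options; the numbers below are illustrative:

```python
def corrected_score(right, wrong, options=2):
    """Penalise wrong answers to offset chance success.
    With two options (True/False), each wrong answer cancels one right answer."""
    return right - wrong / (options - 1)

# A blind guesser on 20 True/False items expects 10 right and 10 wrong,
# so the correction pulls the chance score back down to zero.
print(corrected_score(10, 10))
```

This is why True/False sections usually need more items than MCQ sections to reach the same reliability.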


Speaking

Glenn Fulcher

A speaking test is essentially composed of three parts:

1. Task: something that we give to the learners; it elicits language from them.
2. Rating Scale: it is used to grade the sample of language elicited by the task.
3. Rater: this may be the interlocutor or different people who grade that sample of language.

Where do rating scales come from, and why do they seem so important? They have been designed in a number of different ways to suit teachers' own purposes:

Intuitive: the different levels of the rating scale are created on the basis of the teachers' or interlocutors' experience. Fulcher calls this the "armchair method" because it is based mostly on one's own intuition.

Empirical:

I. By collecting samples of language from the students, showing them to teachers, and coming to an agreement on which one is the best.

II. By collecting samples of language and analyzing what the students are saying, perhaps using discourse analysis or conversation analysis, and then using those descriptors to generate the rating scales, i.e. to actually write the different bands.

What about the raters? Sometimes the rater is a separate person, and sometimes the interlocutor rates as well; a separate rater can concentrate on the rating process, which helps reliability. Reliability is divided into two kinds:

Intra-rater reliability: whether an individual rater agrees with him- or herself when rating the same sample of language over a period of time.

Inter-rater reliability: whether different raters agree with each other when they are marking or grading the same sample of language.
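The video names inter-rater reliability as a concept; one common statistic for it (not mentioned in the video) is Cohen's kappa, which discounts the agreement two raters would reach by chance. A minimal sketch with invented pass/fail ratings:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters on the same samples."""
    n = len(rater1)
    # Observed agreement: proportion of samples the raters graded identically.
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Expected agreement: chance overlap given each rater's grade frequencies.
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical grades from two raters on six speaking samples.
r1 = ["pass", "pass", "fail", "pass", "fail", "fail"]
r2 = ["pass", "pass", "fail", "fail", "fail", "fail"]
print(round(cohens_kappa(r1, r2), 3))
```

Kappa near 1 means strong agreement beyond chance; near 0 means the raters agree no more often than chance would predict.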

Is speaking the most difficult ability to test? Testing speaking brings its own special problems:


In terms of validity:
o Rating Scales: they have to be well written in terms of the definition and specification of what we want to measure.
o Task: we need to make sure that the kind of language we are eliciting can actually be rated using the rating scale previously created, so the rating scale and the task design have to go hand in hand.

o Generalisability: concerned with the use and understanding of the score the student is given, i.e., interpreting the score beyond the test context itself and making predictions about what the student is or is not capable of doing in the future or in a non-test context.

In terms of practicality:

o Speaking tests are difficult to organize, so they need to have enough trained raters.


Listening Gary Buck

o Listening comprehension is a process of constructing meaning in which two different sorts of knowledge play a great role: linguistic and non-linguistic knowledge.

o Listening comprehension is quite difficult because people do not speak in sentences as in the written language: we find many pauses and hesitations, and we do not finish our "oral sentences".

o Listening happens in real time. This means that the processing must be automatic and the information must then be held in memory.

We find two problems when testing listening comprehension:

1. We cannot directly examine a piece of listening comprehension.
2. There is interference: it is possible that people interpret the listening in different ways.

There are two types of listening:
- Interactive listening (when there is a speaker and a listener who exchange roles)
- Non-interactive listening (when there is only a listener, e.g. listening to the radio)

There are two purposes in listening interaction:

1. To convey information (transactional use)
2. To establish relationships (interactional / social use)

What type of item task should we use for listening?

Dictation: It is the easiest way but it tests a very narrow range of language skills. (It is easy to make and to mark)

Statement evaluation activities: in which we give people a statement and they evaluate if it is true or not. (it also tests a very narrow range of skills)

Listening for longer discourses or a deeper level: we use comprehension questions and information transfer items.

How can we make listening tests communicative?


There are two ways to make them communicative:

1. Comprehending explicit linguistic information.
2. Comprehending in terms of interpreting the meaning in a broad context.

Listening comprehension can be defined as understanding samples of realistic language, automatically and in real time. It also means understanding any inferences and implications of the content or context.


Item Writing Charles Stansfield What makes a good item writer?

They have to be excellent writers in the language in which they are writing.

They have to have teaching experience. Are there any principles of good item writing?

Those principles vary according to the paper one is developing. In general, good item writing starts with looking at:

o test specifications
o the audience itself
o the purpose of the test
o the test content
o item writing guidelines
o sample items

What do you look for when reviewing an item? This process is an integrated one that typically involves four issues:

Content review: the item is compared with the specifications; we are looking to see whether the reviewer agrees with the content classification of the item.

Key check: this step tends to be ignored. The key is actually specified by the item writer and submitted with the content classification, and the reviewer considers whether there are any other possible keys that should be added to the list of acceptable responses.

Bias review: there are two things we look for here.
o Content Bias: whether the nature of the item is going to favor a particular group of examinees in terms of familiarity with the context or the content. It has to be neutral.

o Sensitivity Bias: it has to do with wording of the item that is likely to be offensive or insensitive to any particular group of examinees.

Editorial review: this may include rewriting and rewording items, but the issues are mainly mechanics: punctuation, spelling, grammar, clarity and style of writing.

How many times should an item be reviewed?


As many times as needed; every item needs multiple reviews. The item review process consists of two stages:

o First, the items go out to more than one reviewer, and a set of comments comes back from them.

o Then, once those comments have been compared, the items are sent to a different set of reviewers, and a new set of comments comes back from those reviewers.


Testing Writing Liz Hamp-Lyons What is writing assessment?

The most important characteristic is that in this form of assessment the test taker actually writes as opposed to completing multiple choice items or some other non-productive form.

We also expect to have:
o a focused topic
o specification of audience
o specification of purpose for writing
o the method by which the writing is going to be judged
o people or a computer system to make those judgments
o a scale for reporting the performance of the writers
o a means of validating the assessment instrument

How reliable are writing tests? It depends on how carefully one works on ensuring reliability:

Having more than one item, though this is difficult because it takes so long to produce a text.

Having more than one judge, and a third when there is a disagreement between those two judges.

Scoring different dimensions of the text (grammar, organization, content, etc)

Having more than one writing occasion though it does not happen very often because it is so time consuming.
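The double-marking procedure above (two judges, plus a third when they disagree) might be operationalized like this; the function name, tolerance, and tie-breaking rule are illustrative assumptions, not part of the summary:

```python
def resolve_score(judge1, judge2, third_judge=None, tolerance=1):
    """Average two judges' scores if they are close enough;
    otherwise require a third judge and average the two closest scores."""
    if abs(judge1 - judge2) <= tolerance:
        return (judge1 + judge2) / 2
    if third_judge is None:
        raise ValueError("disagreement: a third judge's score is needed")
    # Average the two closest of the three scores, discarding the outlier.
    scores = sorted([judge1, judge2, third_judge])
    pairs = [(scores[0], scores[1]), (scores[1], scores[2])]
    a, b = min(pairs, key=lambda p: p[1] - p[0])
    return (a + b) / 2

print(resolve_score(4, 5))     # judges agree within tolerance
print(resolve_score(2, 6, 5))  # third judge resolves the disagreement
```

Other resolution rules exist (e.g. taking the third judge's score outright); what matters for reliability is that the rule is fixed in advance.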

What do we have to remember in order to end up with a writing test that is as reliable as it can be, while still keeping the validity that this kind of assessment has?

In terms of validity: there is a need to have people thinking about educational goals, curriculum and syllabus, in order to assess what has been taught.

In terms of reliability:
o The teacher has the opportunity to use more than one item.
o Teachers can work with other teachers to exchange pieces of writing and discuss them.

If I am looking at a writing test from the point of view of a teacher or future test taker, what should I look for in the test? What information should be provided to me?


From the teacher's viewpoint, the more information provided and the more positive the assessment is, the better, because teachers tend to pick out what students cannot do rather than what they actually can do.