Classroom Based Assessment
Assignment I
Analysis of Classroom Assessment Tool
MID-SEMESTER TEST (ULANGAN TENGAH SEMESTER)
SMP NEGERI 194 JAKARTA
ACADEMIC YEAR 2010/2011
CHITRA DWI RAHMASARI
08 DIK A REG
2215080093
ENGLISH DEPARTMENT
THE FACULTY OF LANGUAGES AND ARTS
STATE UNIVERSITY OF JAKARTA
2011
ACKNOWLEDGEMENT
First of all, praise be to Allah SWT, who gives His blessing and grace, so that the writer could finish this report on the analysis of a classroom assessment tool from SMPN 194 Jakarta.
The writer would like to give her gratitude, respect, and appreciation to the
following people who have supported, helped, and made this assignment possible.
1. Ma’am Sri Sulastini, the writer’s lecturer in this course, for her guidance and patience. Without her guidance, it would have been much harder for the writer to finish the assignment.
2. The writer’s beloved parents and family for their endless prayers for the
writer’s success
3. All her friends in English Education class of 08 DIK A REG
4. Everyone else whose names could not be mentioned, for their contribution, assistance, and prayers.
Finally, the writer hopes that Allah SWT blesses them all and that this assignment will be useful for whoever reads it, especially English Department students of the State University of Jakarta.
Jakarta, April 1, 2011
Chitra Dwi Rahmasari
CONTENTS
ACKNOWLEDGEMENT
CONTENTS
CHAPTER I INTRODUCTION
I. Background of Study
II. Problem Statements
III. Purposes of Study
IV. Structure of Study
CHAPTER II LITERATURE REVIEW
I. Practicality
II. Reliability
III. Validity
IV. Authenticity
V. Washback
CHAPTER III FINDINGS AND DISCUSSIONS
I. Practicality
II. Reliability
III. Validity
IV. Authenticity
V. Washback
CHAPTER IV CONCLUSION AND RECOMMENDATION
I. Conclusion
II. Recommendation
REFERENCES
APPENDIX
CHAPTER I
INTRODUCTION
I. Background of Study
Assessment reflects an ongoing process of student learning that covers a much wider domain than evaluation (Brown, 2004: 4). It helps teachers gain information about students’ abilities using a variety of procedures or sources, synthesize that information, and then make a judgment or evaluation about how well or how much each student has learned, in a manner that is appropriate, consistent, and conducive to learning. It also enables teachers to evaluate themselves in their teaching, to see the effectiveness of their teaching, and to see how far the objectives have been achieved by the students. Assessment can take various forms, from observations to paper-and-pencil tests. These procedures are then categorized according to their use: informal assessment (e.g., observations) and formal assessment (e.g., paper-and-pencil tests).
In this report, the writer analyzes an example of formal assessment, namely a summative test. According to the types of test proposed by Okulu (2008), the test analyzed is an achievement test, which measures the extent to which students have acquired language features that have already been taught and which is administered at the end of a unit or term of study. It is the second-semester mid-term test for Year VII students of SMP Negeri 194 Jakarta in the 2010/2011 academic year. In reference to Permen No. 20/2007 on Standar Penilaian Pendidikan (educational assessment standards), a mid-term test is an activity carried out by the teacher to measure student competency achievement after 8-9 weeks of learning activities, where the coverage of the test includes all indicators representing all basic competences taught during that period. The test is an objective test consisting of fifty multiple-choice questions, administered on Wednesday, March 16, 2011.
The writer analyzes the test since assessment is important in order to collect information about the students’ competency achievement, decide whether or not the program is successful, and improve teaching and learning activities. In analyzing the test, she uses the criteria for “testing a test” proposed by Brown (2004: 19) and Genesee & Upshur (1996), which are practicality, reliability, validity, authenticity, and washback.
II. Problem Statements
Based on the background above, the following questions arise:
1. How practical is the test?
2. How reliable is the test?
3. How valid is the test?
4. How authentic is the test?
5. What washback does the test provide?
III. Purposes of Study
This study aims at:
1. analyzing the practicality of the test
2. analyzing the reliability of the test
3. analyzing the validity of the test
4. analyzing the authenticity of the test
5. analyzing the washback of the test
IV. Structure of Study
To make the discussion easier to follow, the writer divides this report into four chapters: introduction, literature review, findings and discussions, and conclusion and recommendation. First of all, the introduction describes the background of the study, problem statements, purposes of study, and structure of study. Second, the literature review provides information about the criteria for “testing a test”: practicality, reliability, validity, authenticity, and washback. Then, in the findings and discussions, the writer elaborates on the points raised in the problem statements, namely the practicality, reliability, validity, authenticity, and washback of the test. Last, the writer draws a conclusion from what has been elaborated in the discussion and provides a recommendation derived from the analysis.
CHAPTER II
LITERATURE REVIEW
In doing the analysis, the writer summarizes the criteria for “testing a test” proposed by Brown (2004: 19) and Genesee & Upshur (1996), which are practicality, reliability, validity, authenticity, and washback.
I. Practicality
According to Brown (2004: 19), an effective test is practical. This means that it is not excessively expensive, stays within appropriate time constraints, is relatively easy to administer, and has a scoring/evaluation procedure that is specific and time-efficient. Genesee & Upshur (1996: 56-57) provide more detailed information about practicality. They state five practical aspects of information collection: cost, administrative time, compilation time, administrator qualifications, and acceptability. Cost concerns whether or not the method of information collection is affordable. Administrative time concerns the time available to collect information using the method, while compilation time concerns the time available to score and interpret the information. Administrator qualifications concern the teachers’ qualification to use the method of information collection, and acceptability concerns whether or not the method of collecting information is acceptable to students, parents, and the community. In other words, the test should be practical for students as test takers, teachers as raters, parents, and the community.
Based on both theories, the writer determines the practicality of the test
analyzed as follows:
Cost
- Is the cost of the test within budgeted limits?
Administrative time
- Is there enough time in class to collect information using the method?
- Can students complete the test reasonably within the set time frame?
Compilation time
- Is the scoring/evaluation system feasible in the teacher’s time frame?
- Are methods for reporting results determined in advance?
Administrator qualifications
- Are the teachers qualified to use the method of information collection?
Acceptability
- Is the method of collecting information acceptable to students, parents, and the community?

Practicality Checklist
(Adapted from Brown (2004) and Genesee & Upshur (1996))
II. Reliability
As stated by Brown (2004: 20), a reliable test is consistent and dependable. If you give the same test to the same student or matched students on two different occasions, the test should yield similar results. The issue of the reliability of a test may best be addressed by considering a number of factors that may contribute to its unreliability. Mousavi (2002: 804), in Brown (2004: 21), considers possible fluctuations in the student (student-related reliability), in scoring (rater reliability), in test administration (test administration reliability), and in the test itself (test reliability).
On the other hand, Genesee & Upshur (1996: 58) provide simpler terms to classify general sources of unreliability: rater reliability, person-related reliability, and instrument-related reliability. Rater reliability concerns instability in the person or among the people collecting the information. It can be enhanced if those persons know exactly how to get the desired information and if they are well trained and experienced with the information collection procedures. It is also advisable to use more than a single rater, with each rater making their assessment independently. Person-related reliability concerns the person about whom information is being collected. The instability here involves transitory moods, momentary distractions, time of day, fatigue, or hundreds of other factors beyond the control or even the recognition of the test taker or the assessor. Therefore, this kind of reliability can be enhanced by assessing on several occasions. Instrument-related reliability, finally, concerns the procedures used for collecting information. It can be improved by using a variety of methods of information collection; in this way, the bias or inaccuracy resulting from the use of one method will be offset by other methods.
In addition, Okulu (2008) provides three indices for designing multiple-choice items: Item Facility (IF), Item Discrimination (ID), and Distractor Efficiency (DE). Item Facility is the extent to which an item is easy or difficult for the proposed group of test takers. Item Discrimination is the extent to which an item differentiates between high- and low-ability test takers. Distractor Efficiency is the extent to which the distractors “lure” a sufficient number of test takers, especially lower-ability ones, with those responses somewhat evenly distributed across all distractors.
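For reference, the first two indices can be expressed with the standard formulas found in the testing literature (e.g., Brown, 2004):

IF = (number of test takers answering the item correctly) / (total number of test takers)
ID = (correct answers in the high-scoring group - correct answers in the low-scoring group) / (number of test takers in one group)

An IF near 0.5 indicates an item of medium difficulty, while an ID approaching 1.0 indicates an item that separates high- and low-ability test takers well.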
In this report, the writer uses the terms of Genesee & Upshur (1996: 58) to analyze the test. Since she only analyzes the test itself, without any information about the students’ condition or the raters of the test, the writer focuses on the analysis of instrument-related reliability and does not give a very specific analysis of rater and person-related reliability. The writer also includes the three indices proposed by Okulu (2008) as part of the instrument-related reliability analysis. The points to be analyzed are:
Rater reliability
- Does the test use experienced, trained raters?
- Does the test use more than one rater?
- Does the test use consistent sets of criteria for a correct response?
Person-related reliability
- Does the test yield the same result on several occasions?
Instrument-related reliability
- Does every student have a cleanly photocopied test sheet?
- Does the test provide clear instructions to the students?
- Does the test provide opportunity for guessing?
- Do objective scoring procedures leave little debate about correctness of an answer?
- Does the test meet the three indices of designing multiple-choice items (Okulu, 2008)?

Reliability Checklist
(Adapted from Brown (2004), Genesee & Upshur (1996), and Okulu (2008))
III. Validity
Validity is the extent to which inferences made from assessment results are appropriate, meaningful, and useful in terms of the purpose of the assessment (Gronlund, in Brown, 2004: 22). Similarly, Genesee & Upshur (1996: 62) define validity as “the extent to which the information you collect actually reflects the characteristic or attribute you want to know about”. In other words, validity concerns how closely the information collected matches what we expect to measure (our target). Brown (2004) divides validity into five types: content-related evidence, criterion-related evidence, construct-related evidence, consequential validity, and face validity. On the other hand, Genesee & Upshur (1996) provide only three types of validity: content relevance, criterion relatedness, and construct validity. For the purposes of this report, the writer divides validity into four types: content validity, criterion validity, construct validity, and face validity.
Content validity refers to the extent to which the content of the test reflects the materials that have been taught in class, materials which can be found in the curriculum. This type of validity is therefore related to the curriculum of the lesson (Gronlund, 1990: 72, in Brown, 2004). Besides being related to the curriculum, content validity also requires a match with the content of the course of study (Bachman, 1990, in Brown, 2004). In this report, content validity is analyzed based on the objectives of the Year VII syllabus. Second, criterion validity is the extent to which the “criterion” of the test has actually been reached (Brown, 2004: 24). It means that results obtained from the assessment agree with a set of ability criteria. There are two types of criterion validity: class criterion and norm-referenced criterion. The class criterion is related to the specification of the program in the syllabus, such as pronunciation, intonation, etc. The norm-referenced criterion is related to standardized testing and native speakers’ competence. In this report, the writer focuses on class criterion validity. Third is construct validity. It asks, “Does this test actually tap into the theoretical construct as it has been constructed?” A construct-valid test is in accordance with theories of language behavior or learning. Last, face validity refers to the degree to which a test looks right, and appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of the examinees who take it, the administrative personnel who decide on its use, and other psychometrically unsophisticated observers (Mousavi, 2002: 244, in Brown, 2004: 26). It means that, for face validity, the writer analyzes what the test looks like and whether it makes the students feel that they are being tested appropriately. In short, the points of validity to be analyzed are:
Content validity
- Are classroom objectives identified and appropriately framed?
- Are lesson objectives represented in the form of test specifications?
Criterion validity
- Do the results agree with the set of ability criteria?
Construct validity
- Does the test actually tap into the theoretical construct as it has been constructed?
Face validity
- Are the directions of the test clear?
- Is the structure of the test organized logically?
- Is its difficulty level appropriately pitched?
- Does the test have “surprises”?
- Is timing appropriate?

Validity Checklist
(Adapted from Brown (2004) and Genesee & Upshur (1996))
IV. Authenticity
Bachman & Palmer (1996: 23), in Brown (2004: 26), define authenticity as “the degree of correspondence of the characteristics of a given language test task to the features of a target language task”. It is related to the words ‘real’ and ‘natural’: natural, in terms of authenticity, means that items are contextualized, reflect common real-life use, and have topics relevant to the learners’ condition. In this report, the writer determines the points of authenticity to be elaborated as follows.
Authenticity
- Is the language in the test as natural as possible?
- Are items as contextualized as possible rather than isolated?
- Are topics and situations interesting, enjoyable, and/or humorous?
- Do tasks represent, or closely approximate, real-world tasks?

Authenticity Checklist
(Adapted from Brown (2004: 28))
V. Washback
A facet of consequential validity, discussed above, is “the effect of testing on teaching and learning” (Hughes, 2003: 1, in Brown, 2004: 28), otherwise known among language-testing specialists as washback. In large-scale assessment, washback generally refers to the effects that tests have on instruction in terms of how students prepare for the test. Another form of washback, which occurs more in classroom assessment, is the information that “washes back” to students in the form of useful diagnoses of strengths and weaknesses. Washback also includes the effects of an assessment on teaching and learning prior to the assessment itself, that is, on preparation for the assessment.
In this report, the writer determines the specific points of washback for the test analyzed as follows:

Washback
- Are there any effects of the test on teaching and learning?
- What kinds of effects are there?

Washback Checklist
(Adapted from Brown (2004: 28-29))
CHAPTER III
FINDINGS AND DISCUSSIONS
In this chapter, the writer elaborates on what she finds in the analysis of the test based on the criteria of “testing a test” proposed by Brown (2004) and Genesee & Upshur (1996). The criteria are practicality, reliability, validity, authenticity, and washback.
I. Practicality
The practicality of the test is analyzed based on cost, administrative time, compilation time, administrator qualifications, and acceptability. First of all is the cost. An objective test requires more paper, which means more money. From the point of view of a single photocopied set, the test cannot be expensive since it consists of only four pages. Also, students do not have to pay for the photocopying because it has been paid for by either the school or their parents. However, from the point of view of the whole print run, the test is expensive, as it is distributed to all Year VII students in the school. Its use is also limited to a single administration in the middle of the semester; the test cannot be reused the next year. If the teachers want to use the same test for the next mid-term test, for instance, they still need more paper to edit the general information on the test or even to revise some questions.
Second, the test can be administered to large groups, which makes it less time-consuming than tests that can be used only with individuals (Genesee & Upshur, 1996: 56). Raters need only around ten to fifteen minutes to distribute the test to all students in the class. However, the test does not state its time allocation, so the writer does not know whether the students can complete the test reasonably within the set time frame. If the allocation is 120 minutes, it is probably enough for the students to complete the test; if it is only 90 minutes, however, it will be difficult for the students to finish, because the unclear instructions make them spend more time answering the questions.
The third point is compilation time. Genesee & Upshur (1996: 56) emphasize that compilation time is not only about scoring time, but also about transforming results into a usable form. As stated by Brown (2004: 31), teachers should avoid the temptation to offer only quickly scored multiple-choice selection items that may be neither appropriate nor well designed, because this kind of test does not provide any time for the teachers to give feedback (comments and suggestions) to students on their tests. In this analysis, scoring time is practical since the instrument is a multiple-choice test with only one correct answer per item. The answer key can be provided to correctors, or the test can be scored by computer. In other words, the test is not time-consuming in terms of scoring. On the other hand, it does not provide any time to give feedback to students. The results of the test are already in the form of scores and will be compiled with other tests to produce the students’ final scores. This form of result can be useful for the teachers because it eases their work in measuring students’ achievement, strengths, and weaknesses. Unfortunately, this convenience does not extend to the students, since they are not informed of the results of the test analysis.
Next come administrator qualifications. For most multiple-choice language tests, examiner qualifications pose no problems: language teachers generally possess the qualities needed to administer such tests, and most classroom teachers could administer the test without special training (Genesee & Upshur, 1996: 56). In this analysis, the test is practical in terms of administrator qualifications, since the teachers do not need special training to administer it.
The last point of practicality is acceptability. The test is acceptable to students, parents, and the community because it fulfills their needs. For the students, the test helps them measure their ability over the half semester and evaluate their strengths and weaknesses. For the parents, the result of the test is beneficial for knowing their children’s achievement in learning English. For the community, especially the school, the test gives insight into how successful the program is and how to improve the teaching and learning activities.
II. Reliability
The reliability of the test is analyzed in three parts: rater reliability, person-related reliability, and instrument-related reliability. The first is rater reliability. As emphasized by Genesee & Upshur (1996: 60), the writer need not be greatly concerned with rater reliability when using multiple-choice tests, because a multiple-choice test poses little problem of rater reliability. It has consistent sets of criteria for a correct response; there is only one correct answer for each question. The test does not need experienced, trained raters because, as noted in the practicality analysis, administering it requires no special training. Anyone in the school could help score the test as long as they are given the answer key, which means the test can use more than one rater. In other words, two or more scorers will produce consistent scores because the test is an objective test with a clear answer key; only one answer per question is written in the key, which eases the raters’ work.
The second is person-related reliability. As stated in Chapter 2, the writer does not give a specific analysis of person-related reliability since she does not know the students’ condition when taking the test. What the writer can infer is that the test will yield different results on different occasions. Students’ preparation, temporary illness, and physical and psychological factors may all affect their performance on the test, and whether they are able to perform at their best also affects the reliability of the test.
The third is instrument-related reliability. First of all, every student has a cleanly photocopied test sheet. Unfortunately, there are no general instructions for the test, such as whether the answer should be crossed or circled, and there is no information about what time and for how long the test takes place. Also, there are no specific instructions for questions 1 to 7, for instance, which will leave the students confused about what is actually expected of them. Question 2, especially, will really confuse the students since there is no question at all; it reads only:

2. X: …
   Y: …

Besides, there is a mistyped word in question 50, option D: the word “homework’s” should be “homework”, since it is an uncountable noun. Also, this type of test provides opportunity for guessing (Genesee & Upshur, 1996: 58). Two students with equal abilities in English might get different results because one of them is lucky and guesses right, while the other might get the higher score if the same test were administered on another occasion. Okulu (2008) also states that with this kind of test, cheating may be facilitated. Therefore, the information collected would not be reliable.
Next are the scoring procedures. The procedures leave little debate about the correctness of an answer since it is an objective test with only one correct answer; only question 2 has no correct answer. In addition, the test meets the requirement of Item Facility (IF): it is arranged from the easiest questions on the first page to the most difficult questions in the last five items. The test thereby also fulfills the requirement of Item Discrimination (ID), the extent to which an item differentiates between high- and low-ability test takers: the high-ability test takers will easily answer the last five questions, while the low-ability test takers will have difficulty answering them. In addition, most questions meet the criterion of Distractor Efficiency (DE), that is, the extent to which the distractors “lure” a sufficient number of test takers, especially lower-ability ones, with those responses somewhat evenly distributed across all distractors. An example is question 13.

13. Do they have a swimming pool?
    a. yes, they do
    b. yes, they are
    c. no, they aren’t
    d. no, they don’t
In this question, options A, B, and C may distract the students. Distractor C, especially, will attract more responses from the high-ability group than from the low-ability group, while distractors A and B will attract more responses from students who do not pay attention to the text provided for questions 11-13.
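To make the distractor analysis concrete, a response tabulation for an item like question 13 might look as follows; the figures are purely hypothetical for illustration, not actual response data from this test, and D is assumed to be the key:

Option:           a    b    c    d*
High group (10):  1    0    2    7
Low group (10):   3    3    2    2

With these figures, IF = 9/20 = 0.45 (medium difficulty), ID = (7 - 2)/10 = 0.50 (acceptable discrimination), and each distractor attracts some responses, mostly from the low group, which is what efficient distractors are expected to do.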
III. Validity
As mentioned in Chapter 2, the validity of the test is analyzed in four types: content validity, criterion validity, construct validity, and face validity.
3.1. Content Validity
Brown (2004: 32) considers the first measure of an effective classroom test to be the identification of objectives, which are found in the syllabus. In reference to the Year VII syllabus, the objectives of the listening skill, for instance, state that “students should be able to: identify, respond, and answer the expressions of asking and giving service, asking and giving things, and asking and giving fact”. In these objectives, the modal should is ambiguous (Brown, 2004: 33). Also, no standards are stated for what fulfills the acts of “responding” and “answering”.
In addition, the test does not really cover all the materials for Year VII students. The materials are asking and giving service, asking and giving things, asking and giving fact, asking and giving opinion, showing like and dislike, asking for clarification, responding interpersonally, congratulations, shopping lists, announcements, descriptive text, and procedure text. In fact, there are no questions in the test related to asking and giving things, showing like and dislike, asking for clarification, or responding interpersonally, but there are questions related to asking and giving instruction, prohibition, and introducing oneself and others, materials which actually belong to the first semester of Year VII.
Next, the lesson objectives are not represented in the form of test specifications. The test is a written multiple-choice test that is expected to measure students’ achievement in the areas of listening, speaking, reading, and writing. Unfortunately, the test is not divided into a number of sections; there are no listening, speaking, or even writing sections corresponding to the objectives being assessed. Also, since it is a multiple-choice test, it has no specifications for a scoring rubric or for giving feedback (Brown, 2004: 33). Above all, the good points of this test are that its design offers students a variety of item types and gives an appropriate relative weight for Year VII students (Brown, 2004: 33).
3.2. Criterion Validity
The test does not fulfill class criterion validity since it is not related to the specification of the program in the syllabus. In the syllabus, the English skills involve listening, speaking, reading, and writing, but in the test only reading skill is measured. The writer cannot analyze the specific criteria of the reading skill since they are not specifically written in the syllabus; it states only “short written functional text: lists of things/announcement/greeting cards and vocabulary: shopping lists, room lists, congratulations, attention, etc.”
3.3. Construct Validity
This test cannot be used to assess students’ speaking, listening, and writing ability since it does not provide, for instance, a sample of oral production to fulfill the principle of construct validity for the speaking skill. It only assesses reading ability, for which the scoring analysis includes several factors such as scanning, skimming, and identifying the topic, main idea, supporting details, vocabulary use, and generic structure of the text. From this point of view, the test has fulfilled the criteria of construct validity for the reading skill. The text for questions 41-43, for example, leads students to identify the sentence order, the title, and the meaning of an underlined word. Questions 44-45 are also related to questions 41-43, which are about procedure text; in these questions, the students are asked to choose the right instruction based on the picture and to identify three parts of a procedure text. In other words, those sample questions reveal that the results of the test agree with the theoretical construct of the reading skill measured.
3.4. Face Validity
The test does not really meet the criteria of face validity because it does not give clear directions, either general directions at the beginning of the test or specific directions for each question. The test also has “surprises”: there are questions about introductions, material which should belong to the first semester of Year VII. Whether the timing is appropriate is also questionable, since the test does not provide any information about its time allocation. Despite these weaknesses, the structure of the test is organized logically and its difficulty level is appropriately pitched for Year VII students.
IV. Authenticity
The language used in this test is as natural as possible when seen in an everyday English context. It uses simple language and points directly to what the speakers expect as a response. An example is taken from question 3.

3. Mr. Arif : it’s very dark here
   Latief : of, course
However, the writer thinks it would be better if questions like number 3 indicated who is to respond to the last speaker, whether Mr. Arif or Latief, so that students could more easily tell what is expected of them. An example is as follows.

3. Mr. Arif : it’s very dark here
   Latief : of, course
   Mr. Arif : …
Then, questions 8 to 10 do not meet the criterion of natural language since, in daily life, students rarely state their plans and ask for their friends’ opinions right after greetings. Next, most of the test items are contextualized, such as questions 19 and 23 to 26. Questions 23 to 26, for instance, give the clear context that the text is a shopping list. Unfortunately, there are still some items which are isolated, such as question 2, which provides no information about what the dialogue is about. The topics and situations used in the test are interesting and enjoyable since they are closely related to students’ real lives: asking and giving service, shopping lists, etc. Last, the tasks do not represent, or closely approximate, real-world tasks since, in the real world, students are never asked to complete a paragraph as in questions 46 to 50.
V. Washback
The test analyzed is a formal, summative, multiple-choice test. According to Brown (2004: 29), summative tests, which provide assessment at the end of a course or program, do not need to offer much in the way of washback. As noted in the practicality analysis, this kind of test does not provide any time for the teachers to give feedback (comments and suggestions) to students on their tests. The students only receive a simple letter grade or a single overall numerical score without knowing their strengths and weaknesses in the test. In reality, letter grades and numerical scores are considered to give absolutely no information of intrinsic interest to the students, to reduce a mountain of linguistic and cognitive performance data to a single figure, and to give a relative indication of a formulaic judgment of performance as compared to others in the class, which fosters competitive, not cooperative, learning (Brown, 2004: 29). In addition, Okulu (2008) states that one of the weaknesses of multiple-choice items is that washback may be harmful.
In other words, for teachers, the test can be beneficial if they interpret the scores as an indication of their teaching effectiveness; from the results, they can evaluate their strengths and weaknesses and find ways to improve their teaching quality. However, the test does not provide any feedback for students, so they will not know where their mistakes lie. They are only informed of their scores after the test, or sometimes only in the report card. They will not know their strengths and weaknesses in the test. In fact, every language course or program is always the beginning of further pursuits, more learning, more goals, and more challenges (Brown, 2004: 30). Therefore, if the students are not shown their mistakes, they will have difficulty facing those challenges. Also, the test tends to be competitive because it encourages the students to aim for the best score, or a higher score than the others’. As a solution, the students can do self-assessment or peer discussion as alternative ways to enhance washback from the test (Brown, 2004: 37).
CHAPTER IV
CONCLUSION AND RECOMMENDATION
I. Conclusion
From the discussion, the writer concludes that the test is practical in terms of administrative time, compilation time, administrator qualifications, and acceptability. Unfortunately, it is not practical in cost, since it is an objective test which requires more paper and more money. Its use is also limited to a single administration in the middle of the semester; if the teachers want to use the same test for the next mid-term test, for instance, they still need more paper to edit the general information on the test or even to revise some questions.
Second, the reliability of the test is not high, since a multiple-choice test tends to yield different results on different occasions because of students’ preparation, temporary illness, and physical and psychological factors. The test provides no general instructions, no specific instructions for some questions, and no information about when and for how long the test takes place, and it provides opportunity for guessing. However, it has consistent sets of criteria for a correct response which leave little debate; there is only one correct answer for each question. The test also meets the requirements of the three indices for designing multiple-choice items (Okulu, 2008).
Third is the validity of the test. The test is not quite valid in terms of content, since there are materials which have not been covered in the test. For criterion validity, the writer cannot analyze the specific criteria of the reading skill, which is measured in the test, since they are not specifically written in the syllabus. Then, although the test cannot be used to assess students’ speaking, listening, and writing ability, the results of the test agree with the theoretical construct of the reading skill measured. Next, the test does not really meet the criteria of face validity, since it does not give clear directions, has “surprises” (questions about introductions), and does not state the timing of the test. Despite these weaknesses, the structure of the test is organized logically and its difficulty level is appropriately pitched for Year VII students.
Fourth, the authenticity of the test is generally high, since the language used in most questions is as natural as possible, most of the test items are contextualized, and the topics and situations used in the test are interesting and enjoyable. However, the tasks do not represent, or closely approximate, real-world tasks, since in the real world students are never asked to complete a paragraph as in questions 46 to 50.
The last principle of “testing a test” is washback. From the teachers’ point of view, the test can be beneficial if they interpret the scores as an indication of teaching effectiveness, an evaluation of their strengths and weaknesses, and a way to improve their teaching quality. Unfortunately, the test does not provide any feedback for students, so they will not know where their mistakes lie. They are only informed of their scores after the test, or sometimes only in the report card, and they will not know their strengths and weaknesses in the test. Also, the test tends to be competitive because it encourages the students to aim for the best score, or a higher score than the others’.
In short, the test has not met all the criteria for “testing a test” proposed by Brown (2004) and Genesee & Upshur (1996) which have been summarized in Chapter 2. The result of the analysis confirms the opinion of Okulu (2008), who states:
“There are a number of weaknesses in multiple-choice items:
- The technique tests only recognition knowledge.
- Guessing may have a considerable effect on test scores.
- The technique severely restricts what can be tested.
- It is very difficult to write successful items.
- Washback may be harmful.
- Cheating may be facilitated.”
II. Recommendation
From the conclusion, the writer notes two major problems with the test: validity and washback. To overcome the validity problem, the writer suggests that the test writers refer to the English syllabus used for the level. They might consider the materials taught during the semester and the objectives being assessed in order to enhance content validity; strong content validity will also influence the criterion, construct, and face validity of the test. Then, to overcome the washback problem, the writer recommends that the test provide feedback to the students. Since it is a mid-term test which aims at measuring student competency achievement after 8-9 weeks of learning activities (Permen No. 20/2007 on Standar Penilaian Pendidikan), it is important for students to know their strengths and weaknesses after learning English for about eight weeks. If it is not feasible to give the feedback directly in class, the students can do self-assessment or peer discussion as alternative ways to enhance washback from the test (Brown, 2004: 37).
REFERENCES
Brown, H. D. (2004). Language Assessment: Principles and Classroom Practices (Chapter 2, pp. 19-41). New York: Pearson Education.
Genesee, F., & Upshur, J. A. (1996). Classroom-Based Evaluation in Second Language Education (Chapter 4, pp. 54-73). Cambridge: Cambridge University Press.
Okulu. (2008). Chapter 3: Designing Classroom Language Tests. PDF file.
Permendiknas No. 20/2007 tentang Standar Penilaian Pendidikan.
Silabus SMP Negeri 1 Bubulan Kelas VII.
Susilohadi, G., et al. (2008). Contextual Teaching and Learning Bahasa Inggris: SMP/MTs Kelas IX, Edisi 4. Jakarta: Pusat Perbukuan, Depdiknas.
APPENDIX