Classroom Based Assessment
Assignment I
Analysis of Classroom Assessment Tool
MID-SEMESTER TEST (ULANGAN TENGAH SEMESTER)
SMP NEGERI 194 JAKARTA
ACADEMIC YEAR 2010/2011
CHITRA DWI RAHMASARI
08 DIK A REG
2215080093
ENGLISH DEPARTMENT
THE FACULTY OF LANGUAGES AND ARTS
STATE UNIVERSITY OF JAKARTA
2011
ACKNOWLEDGEMENT
First of all, praise be to Allah SWT, who gives His blessing and grace, so that the writer could finish this report on the analysis of a classroom assessment tool from SMPN 194 Jakarta.
The writer would like to give her gratitude, respect, and appreciation to the
following people who have supported, helped, and made this assignment possible.
1. Ma’am Sri Sulastini, the writer’s lecturer in this course, for her guidance and patience. Without her guidance, it would have been much harder for the writer to finish the assignment.
2. The writer’s beloved parents and family for their endless prayers for the
writer’s success
3. All her friends in English Education class of 08 DIK A REG
4. Everyone else whose names could not be mentioned, for their contribution, assistance, and prayers.
Finally, the writer hopes that Allah SWT blesses them all and that this assignment will be useful for whoever reads it, especially English Department students of the State University of Jakarta.
Jakarta, April 1, 2011
Chitra Dwi Rahmasari
CONTENTS
ACKNOWLEDGEMENT
CONTENTS
CHAPTER I INTRODUCTION
I. Background of Study
II. Problem Statements
III. Purposes of Study
IV. Structure of Study
CHAPTER II LITERATURE REVIEW
I. Practicality
II. Reliability
III. Validity
IV. Authenticity
V. Washback
CHAPTER III FINDINGS AND DISCUSSIONS
I. Practicality
II. Reliability
III. Validity
IV. Authenticity
V. Washback
CHAPTER IV CONCLUSION AND RECOMMENDATION
I. Conclusion
II. Recommendation
REFERENCES
APPENDIX
CHAPTER I
INTRODUCTION
I. Background of Study
Assessment reflects an ongoing process of student learning that covers a much wider domain than evaluation (Brown, 2004: 4). It helps teachers gain information about students’ abilities using a variety of procedures or sources, synthesize that information, and then make a judgment or evaluation about how well or how much each student has learned, in a manner that is appropriate, consistent, and conducive to learning. It also enables teachers to evaluate themselves in their teaching, to see the effectiveness of their teaching, and to see how far the objectives have been achieved by the students. Assessment can take various forms, from observations to paper-and-pencil tests. These procedures are then categorized according to their use: informal assessment (e.g., observations) and formal assessment (e.g., paper-and-pencil tests).
In this report, the writer analyzes an example of formal assessment, namely a summative test. According to the types of test proposed by Okulu (2008), the test analyzed is an achievement test, which measures the extent to which students have acquired language features that have already been taught and which is administered at the end of a unit or term of study. It is the second-semester mid-term test for Year VII students of SMP Negeri 194 Jakarta in the 2010/2011 academic year. In reference to Permen No. 20/2007 on Standar Penilaian Pendidikan (educational assessment standards), a mid-term test is an activity carried out by the teacher to measure student competency achievement after 8-9 weeks of learning activities, where the coverage of the test includes all indicators representing all basic competences taught during that period. The test is an objective test consisting of fifty multiple-choice questions, administered on Wednesday, March 16, 2011.
The writer analyzes the test since assessment is important in order to collect information about the students’ competency achievement, decide whether or not the program is successful, and improve teaching and learning activities. In analyzing the test, she uses the criteria for “testing a test” proposed by Brown (2004: 19) and Genesee & Upshur (1996), which are practicality, reliability, validity, authenticity, and washback.
II. Problem Statements
Based on the background above, the following questions arise:
1. How practical is the test?
2. How reliable is the test?
3. How valid is the test?
4. How authentic is the test?
5. What washback does the test provide?
III. Purposes of Study
This study aims at:
1. analyzing the practicality of the test
2. analyzing the reliability of the test
3. analyzing the validity of the test
4. analyzing the authenticity of the test
5. analyzing the washback of the test
IV. Structure of Study
To make the discussion easier to follow, the writer divides this report into four chapters: introduction, literature review, findings and discussions, and conclusion and recommendation. First of all, the introduction describes the background of the study, problem statements, purposes of study, and structure of study. Second, the literature review provides information about the criteria for “testing a test”: practicality, reliability, validity, authenticity, and washback. Then, in the findings and discussions, the writer elaborates on the points raised in the problem statements, namely the practicality, reliability, validity, authenticity, and washback of the test. Last, the writer draws a conclusion from what has been elaborated in the discussion and provides a recommendation derived from the analysis.
CHAPTER II
LITERATURE REVIEW
In doing the analysis, the writer summarizes the criteria for “testing a test” proposed by Brown (2004: 19) and Genesee & Upshur (1996), which are practicality, reliability, validity, authenticity, and washback.
I. Practicality
According to Brown (2004: 19), an effective test is practical. This means that it is not excessively expensive, stays within appropriate time constraints, is relatively easy to administer, and has a scoring/evaluation procedure that is specific and time-efficient. Genesee & Upshur (1996: 56-57) provide more detailed information about practicality. They state five practical aspects of information collection: cost, administrative time, compilation time, administrator qualifications, and acceptability. Cost concerns whether or not the method of information collection is affordable. Administrative time concerns the time available to collect information using the method, while compilation time concerns the time available to score and interpret the information. Administrator qualifications concern the teachers’ qualification to use the method of information collection, and acceptability concerns whether or not the method of collecting information is acceptable to students, parents, and the community. In other words, the test should be practical for students as test takers, teachers as raters, parents, and the community.
Based on both theories, the writer determines the practicality of the test
analyzed as follows:
Cost
- Is the cost of the test within budgeted limits?
Administrative time
- Is there enough time in class to collect information using the method?
- Can students complete the test reasonably within the set time frame?
Compilation time
- Is the scoring/evaluation system feasible in the teacher’s time frame?
- Are methods for reporting results determined in advance?
Administrator qualifications
- Are the teachers qualified to use the method of information collection?
Acceptability
- Is the method of collecting information acceptable to students, parents, and the community?

Practicality Checklist
(Adapted from Brown (2004) and Genesee & Upshur (1996))
II. Reliability
As stated by Brown (2004: 20), a reliable test is consistent and dependable. If you give the same test to the same student or matched students on two different occasions, the test should yield similar results. The issue of the reliability of a test may best be addressed by considering a number of factors that may contribute to its unreliability. Mousavi (2002: 804), in Brown (2004: 21), considers possible fluctuations in the student (student-related reliability), in scoring (rater reliability), in test administration (test administration reliability), and in the test itself (test reliability).
On the other hand, Genesee & Upshur (1996: 58) provide simpler terms to classify general sources of unreliability: rater reliability, person-related reliability, and instrument-related reliability. Rater reliability concerns instability in the person or among the people collecting the information. It can be enhanced if those persons know exactly how to get the desired information and if they are well trained and experienced with the information collection procedures. It is also advisable to use more than a single rater, with each rater making their assessment independently. Person-related reliability concerns the person about whom information is being collected. The instability here involves transitory moods, momentary distractions, time of day, fatigue, or hundreds of other factors beyond the control or even the recognition of the test taker or the assessor. Therefore, this kind of reliability can be enhanced by assessing on several occasions. Instrument-related reliability, finally, concerns the procedures used for collecting information. It can be improved by using a variety of methods of information collection; in this way, the bias or inaccuracy resulting from the use of one method will be offset by other methods.
In addition, Okulu (2008) provides three indices for designing multiple-choice items: Item Facility (IF), Item Discrimination (ID), and Distractor Efficiency (DE). Item Facility is the extent to which an item is easy or difficult for the proposed group of test takers. Item Discrimination is the extent to which an item differentiates between high- and low-ability test takers. Distractor Efficiency is the extent to which the distractors “lure” a sufficient number of test takers, especially lower-ability ones, with those responses somewhat evenly distributed across all distractors.
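For reference, the first two indices can be expressed with the standard formulas found in the testing literature (e.g., Brown, 2004):

IF = (number of test takers answering the item correctly) / (total number of test takers)
ID = (correct answers in the high-scoring group - correct answers in the low-scoring group) / (number of test takers in one group)

An IF near 0.5 indicates an item of medium difficulty, while an ID approaching 1.0 indicates an item that separates high- and low-ability test takers well.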
In this report, the writer uses the terms of Genesee & Upshur (1996: 58) to analyze the test. Since she only analyzes the test itself, without any information about the students’ condition or the raters of the test, the writer focuses on the analysis of instrument-related reliability and does not give a very specific analysis of rater and person-related reliability. The writer also includes the three indices proposed by Okulu (2008) as part of the instrument-related reliability analysis. The points to be analyzed are:
Rater reliability
- Does the test use experienced, trained raters?
- Does the test use more than one rater?
- Does the test use consistent sets of criteria for a correct response?
Person-related reliability
- Does the test yield the same result on several occasions?
Instrument-related reliability
- Does every student have a cleanly photocopied test sheet?
- Does the test provide clear instructions to the students?
- Does the test provide opportunity for guessing?
- Do objective scoring procedures leave little debate about correctness of an answer?
- Does the test meet the three indices of designing multiple-choice items (Okulu, 2008)?

Reliability Checklist
(Adapted from Brown (2004), Genesee & Upshur (1996), and Okulu (2008))
III. Validity
Validity is the extent to which inferences made from assessment results are appropriate, meaningful, and useful in terms of the purpose of the assessment (Gronlund, in Brown, 2004: 22). Similarly, Genesee & Upshur (1996: 62) define validity as “the extent to which the information you collect actually reflects the characteristic or attribute you want to know about”. In other words, validity concerns how closely the information collected matches what we expect to measure (our target). Brown (2004) divides validity into five types: content-related evidence, criterion-related evidence, construct-related evidence, consequential validity, and face validity. On the other hand, Genesee & Upshur (1996) provide only three types of validity: content relevance, criterion relatedness, and construct validity. For the purposes of this report, the writer divides validity into four types: content validity, criterion validity, construct validity, and face validity.
Content validity refers to the extent to which the content of the test reflects the materials that have been taught in class, materials which can be found in the curriculum. This type of validity is therefore related to the curriculum of the lesson (Gronlund, 1990: 72, in Brown, 2004). Besides being related to the curriculum, content validity also requires a match with the content of the course of study (Bachman, 1990, in Brown, 2004). In this report, content validity is analyzed based on the objectives of the Year VII syllabus. Second, criterion validity is the extent to which the “criterion” of the test has actually been reached (Brown, 2004: 24). It means that results obtained from the assessment agree with a set of ability criteria. There are two types of criterion validity: class criterion and norm-referenced criterion. The class criterion is related to the specification of the program in the syllabus, such as pronunciation, intonation, etc. The norm-referenced criterion is related to standardized testing and native speakers’ competence. In this report, the writer focuses on class criterion validity. Third is construct validity. It asks, “Does this test actually tap into the theoretical construct as it has been constructed?” A construct-valid test is in accordance with theories of language behavior or learning. Last, face validity refers to the degree to which a test looks right, and appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of the examinees who take it, the administrative personnel who decide on its use, and other psychometrically unsophisticated observers (Mousavi, 2002: 244, in Brown, 2004: 26). It means that, for face validity, the writer analyzes what the test looks like and whether it makes the students feel that they are being tested appropriately. In short, the points of validity to be analyzed are:
Content validity
- Are classroom objectives identified and appropriately framed?
- Are lesson objectives represented in the form of test specifications?
Criterion validity
- Do the results agree with the set of ability criteria?
Construct validity
- Does the test actually tap into the theoretical construct as it has been constructed?
Face validity
- Are the directions of the test clear?
- Is the structure of the test organized logically?
- Is its difficulty level appropriately pitched?
- Does the test have “surprises”?
- Is timing appropriate?

Validity Checklist
(Adapted from Brown (2004) and Genesee & Upshur (1996))
IV. Authenticity
Bachman & Palmer (1996: 23), in Brown (2004: 26), define authenticity as “the degree of correspondence of the characteristics of a given language test task to the features of a target language task”. It is related to the words ‘real’ and ‘natural’: natural, in terms of authenticity, means that items are contextualized, reflect common real-life use, and have topics relevant to the learners’ condition. In this report, the writer determines the points of authenticity to be elaborated as follows.
Authenticity
- Is the language in the test as natural as possible?
- Are items as contextualized as possible rather than isolated?
- Are topics and situations interesting, enjoyable, and/or humorous?
- Do tasks represent, or closely approximate, real-world tasks?

Authenticity Checklist
(Adapted from Brown (2004: 28))
V. Washback
A facet of consequential validity, discussed above, is “the effect of testing on teaching and learning” (Hughes, 2003: 1, in Brown, 2004: 28), otherwise known among language-testing specialists as washback. In large-scale assessment, washback generally refers to the effects that tests have on instruction in terms of how students prepare for the test. Another form of washback, which occurs more in classroom assessment, is the information that “washes back” to students in the form of useful diagnoses of strengths and weaknesses. Washback also includes the effects of an assessment on teaching and learning prior to the assessment itself, that is, on preparation for the assessment.
In this report, the writer determines the specific points of washback for the test analyzed as follows:

Washback
- Are there any effects of the test on teaching and learning?
- What kinds of effects are there?

Washback Checklist
(Adapted from Brown (2004: 28-29))
CHAPTER III
FINDINGS AND DISCUSSIONS
In this chapter, the writer elaborates on what she finds in the analysis of the test based on the criteria of “testing a test” proposed by Brown (2004) and Genesee & Upshur (1996). The criteria are practicality, reliability, validity, authenticity, and washback.
I. Practicality
The practicality of the test is analyzed based on cost, administrative time, compilation time, administrator qualifications, and acceptability. First of all is the cost. An objective test requires more paper, which means more money. From the point of view of a single photocopied set, the test cannot be expensive since it consists of only four pages. Also, students do not have to pay for the photocopying because it has been paid for by either the school or their parents. However, from the point of view of the whole print run, the test is expensive, as it is distributed to all Year VII students in the school. Its use is also limited to a single administration in the middle of the semester; the test cannot be reused the next year. If the teachers want to use the same test for the next mid-term test, for instance, they still need more paper to edit the general information on the test or even to revise some questions.
Second, the test can be administered to large groups, which makes it less time-consuming than tests that can be used only with individuals (Genesee & Upshur, 1996: 56). Raters need only around ten to fifteen minutes to distribute the test to all students in the class. However, the test does not state its time allocation, so the writer does not know whether the students can complete the test reasonably within the set time frame. If the allocation is 120 minutes, it is probably enough for the students to complete the test; if it is only 90 minutes, however, it will be difficult for the students to finish, because the unclear instructions make them spend more time answering the questions.
The third point is compilation time. Genesee & Upshur (1996: 56) emphasize that compilation time is not only about scoring time, but also about transforming results into a usable form. As stated by Brown (2004: 31), teachers should avoid the temptation to offer only quickly scored multiple-choice selection items that may be neither appropriate nor well designed, because this kind of test does not provide any time for the teachers to give feedback (comments and suggestions) to students on their tests. In this analysis, scoring time is practical since the instrument is a multiple-choice test with only one correct answer per item. The answer key can be provided to correctors, or the test can be scored by computer. In other words, the test is not time-consuming in terms of scoring. On the other hand, it does not provide any time to give feedback to students. The results of the test are already in the form of scores and will be compiled with other tests to produce the students’ final scores. This form of result can be useful for the teachers because it eases their work in measuring students’ achievement, strengths, and weaknesses. Unfortunately, this convenience does not extend to the students, since they are not informed of the results of the test analysis.
Next come administrator qualifications. For most multiple-choice language tests, examiner qualifications pose no problems: language teachers generally possess the qualities needed to administer such tests, and most classroom teachers could administer the test without special training (Genesee & Upshur, 1996: 56). In this analysis, the test is practical in terms of administrator qualifications, since the teachers do not need special training to administer it.
The last point of practicality is acceptability. The test is acceptable to students, parents, and the community because it fulfills their needs. For the students, the test helps them measure their ability over the half semester and evaluate their strengths and weaknesses. For the parents, the result of the test is beneficial for knowing their children’s achievement in learning English. For the community, especially the school, the test gives insight into how successful the program is and how to improve the teaching and learning activities.
II. Reliability
The reliability of the test is analyzed in three parts: rater reliability, person-related reliability, and instrument-related reliability. The first is rater reliability. As emphasized by Genesee & Upshur (1996: 60), the writer need not be greatly concerned with rater reliability when using multiple-choice tests, because a multiple-choice test poses little problem of rater reliability. It has consistent sets of criteria for a correct response; there is only one correct answer for each question. The test does not need experienced, trained raters because, as noted in the practicality analysis, administering it requires no special training. Anyone in the school could help score the test as long as they are given the answer key, which means the test can use more than one rater. In other words, two or more scorers will produce consistent scores because the test is an objective test with a clear answer key; only one answer per question is written in the key, which eases the raters’ work.
The second is person-related reliability. As stated in Chapter 2, the writer does not give a specific analysis of person-related reliability since she does not know the students’ condition when taking the test. What the writer can infer is that the test will yield different results on different occasions. Students’ preparation, temporary illness, and physical and psychological factors may all affect their performance on the test, and whether they are able to perform at their best also affects the reliability of the test.
The third is instrument-related reliability. First of all, every student has a cleanly photocopied test sheet. Unfortunately, there are no general instructions for the test, such as whether the answer should be crossed or circled, and there is no information about what time and for how long the test takes place. Also, there are no specific instructions for questions 1 to 7, for instance, which will leave the students confused about what is actually expected of them. Question 2, especially, will really confuse the students since there is no question at all; it reads only:

2. X: …
   Y: …

Besides, there is a mistyped word in question 50, option D: the word “homework’s” should be “homework”, since it is an uncountable noun. Also, this type of test provides opportunity for guessing (Genesee & Upshur, 1996: 58). Two students with equal abilities in English might get different results because one of them is lucky and guesses right, while the other might get the higher score if the same test were administered on another occasion. Okulu (2008) also states that with this kind of test, cheating may be facilitated. Therefore, the information collected would not be reliable.
Next are the scoring procedures. The procedures leave little debate about the correctness of an answer since it is an objective test with only one correct answer; only question 2 has no correct answer. In addition, the test meets the requirement of Item Facility (IF): it is arranged from the easiest questions on the first page to the most difficult questions in the last five items. The test thereby also fulfills the requirement of Item Discrimination (ID), the extent to which an item differentiates between high- and low-ability test takers: the high-ability test takers will easily answer the last five questions, while the low-ability test takers will have difficulty answering them. In addition, most questions meet the criterion of Distractor Efficiency (DE), that is, the extent to which the distractors “lure” a sufficient number of test takers, especially lower-ability ones, with those responses somewhat evenly distributed across all distractors. An example is question 13.

13. Do they have a swimming pool?
    a. yes, they do
    b. yes, they are
    c. no, they aren’t
    d. no, they don’t
In this question, options A, B, and C may distract the students. Distractor C, especially, will attract more responses from the high-ability group than from the low-ability group, while distractors A and B will attract more responses from students who do not pay attention to the text provided for questions 11-13.
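To make the distractor analysis concrete, a response tabulation for an item like question 13 might look as follows; the figures are purely hypothetical for illustration, not actual response data from this test, and D is assumed to be the key:

Option:           a    b    c    d*
High group (10):  1    0    2    7
Low group (10):   3    3    2    2

With these figures, IF = 9/20 = 0.45 (medium difficulty), ID = (7 - 2)/10 = 0.50 (acceptable discrimination), and each distractor attracts some responses, mostly from the low group, which is what efficient distractors are expected to do.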
III. Validity
As mentioned in Chapter 2, the validity of the test is analyzed in four types: content validity, criterion validity, construct validity, and face validity.
3.1. Content Validity
Brown (2004: 32) considers the first measure of an effective classroom test to be the identification of objectives, which are found in the syllabus. In reference to the Year VII syllabus, the objectives of the listening skill, for instance, state that “students should be able to: identify, respond, and answer the expressions of asking and giving service, asking and giving things, and asking and giving fact”. In these objectives, the modal should is ambiguous (Brown, 2004: 33). Also, no standards are stated for what fulfills the acts of “responding” and “answering”.
In addition, the test does not really cover all the materials for Year VII students. The materials are asking and giving service, asking and giving things, asking and giving fact, asking and giving opinion, showing like and dislike, asking for clarification, responding interpersonally, congratulations, shopping lists, announcements, descriptive text, and procedure text. In fact, there are no questions in the test related to asking and giving things, showing like and dislike, asking for clarification, or responding interpersonally, but there are questions related to asking and giving instruction, prohibition, and introducing oneself and others, materials which actually belong to the first semester of Year VII.
Next, the lesson objectives are not represented in the form of test specifications. The test is a written multiple-choice test that is expected to measure students’ achievement in the areas of listening, speaking, reading, and writing. Unfortunately, the test is not divided into a number of sections; there are no listening, speaking, or even writing sections corresponding to the objectives being assessed. Also, since it is a multiple-choice test, it has no specifications for a scoring rubric or for giving feedback (Brown, 2004: 33). Above all, the good points of this test are that its design offers students a variety of item types and gives an appropriate relative weight for Year VII students (Brown, 2004: 33).
3.2. Criterion Validity
The test does not fulfill class criterion validity since it is not related to the specification of the program in the syllabus. In the syllabus, the English skills involve listening, speaking, reading, and writing, but in the test only reading skill is measured. The writer cannot analyze the specific criteria of the reading skill since they are not specifically written in the syllabus; it states only “short written functional text: lists of things/announcement/greeting cards and vocabulary: shopping lists, room lists, congratulations, attention, etc.”
3.3. Construct Validity
This test cannot be used to assess students’ speaking, listening, and writing ability since it does not provide, for instance, a sample of oral production to fulfill the principle of construct validity for the speaking skill. It only assesses reading ability, for which the scoring analysis includes several factors such as scanning, skimming, and identifying the topic, main idea, supporting details, vocabulary use, and generic structure of the text. From this point of view, the test has fulfilled the criteria of construct validity for the reading skill. The text for questions 41-43, for example, leads students to identify the sentence order, the title, and the meaning of an underlined word. Questions 44-45 are also related to questions 41-43, which are about procedure text; in these questions, the students are asked to choose the right instruction based on the picture and to identify three parts of a procedure text. In other words, those sample questions reveal that the results of the test agree with the theoretical construct of the reading skill measured.
3.4. Face Validity
The test does not really meet the criteria of face validity because it does not give clear directions, either general directions at the beginning of the test or specific directions for each question. The test also has “surprises”: there are questions about introductions, material which should belong to the first semester of Year VII. Whether the timing is appropriate is also questionable, since the test does not provide any information about its time allocation. Despite these weaknesses, the structure of the test is organized logically and its difficulty level is appropriately pitched for Year VII students.
IV. Authenticity
The language used in this test is as natural as possible when seen in an everyday English context. It uses simple language and points directly to what the speakers expect as a response. An example is taken from question 3.

3. Mr. Arif : it’s very dark here
   Latief : of, course
However, the writer thinks it would be better if questions like number 3 indicated who is to respond to the last speaker, whether Mr. Arif or Latief, so that students could more easily tell what is expected of them. An example is as follows.

3. Mr. Arif : it’s very dark here
   Latief : of, course
   Mr. Arif : …
Then, questions 8 to 10 do not meet the criterion of natural language since, in daily life, students rarely state their plans and ask for their friends’ opinions right after greetings. Next, most of the test items are contextualized, such as questions 19 and 23 to 26. Questions 23 to 26, for instance, give the clear context that the text is a shopping list. Unfortunately, there are still some items which are isolated, such as question 2, which provides no information about what the dialogue is about. The topics and situations used in the test are interesting and enjoyable since they are closely related to students’ real lives: asking and giving service, shopping lists, etc. Last, the tasks do not represent, or closely approximate, real-world tasks since, in the real world, students are never asked to complete a paragraph as in questions 46 to 50.
V. Washback
The test analyzed is a formal, summative, multiple-choice test. According to Brown (2004: 29), summative tests, which provide assessment at the end of a course or program, do not need to offer much in the way of washback. As noted in the practicality analysis, this kind of test does not provide any time for the teachers to give feedback (comments and suggestions) to students on their tests. The students only receive a simple letter grade or a single overall numerical score without knowing their strengths and weaknesses in the test. In reality, letter grades and numerical scores are considered to give absolutely no information of intrinsic interest to the students, to reduce a mountain of linguistic and cognitive performance data to a single figure, and to give a relative indication of a formulaic judgment of performance as compared to others in the class, which fosters competitive, not cooperative, learning (Brown, 2004: 29). In addition, Okulu (2008) states that one of the weaknesses of multiple-choice items is that washback may be harmful.
In other words, for teachers, the test can be beneficial if they interpret the scores as an indication of their teaching effectiveness; from the results, they can evaluate their strengths and weaknesses and find ways to improve their teaching quality. However, the test does not provide any feedback for students, so they will not know where their mistakes lie. They are only informed of their scores after the test, or sometimes only in the report card. They will not know their strengths and weaknesses in the test. In fact, every language course or program is always the beginning of further pursuits, more learning, more goals, and more challenges (Brown, 2004: 30). Therefore, if the students are not shown their mistakes, they will have difficulty facing those challenges. Also, the test tends to be competitive because it encourages the students to aim for the best score, or a higher score than the others’. As a solution, the students can do self-assessment or peer discussion as alternative ways to enhance washback from the test (Brown, 2004: 37).
CHAPTER IV
CONCLUSION AND RECOMMENDATION
I. Conclusion
From the discussion, the writer concludes that the test is practical in terms of administrative time, compilation time, administrator qualifications, and acceptability. Unfortunately, it is not practical in cost, since it is an objective test which requires more paper and more money. Its use is also limited to a single administration in the middle of the semester; if the teachers want to use the same test for the next mid-term test, for instance, they still need more paper to edit the general information on the test or even to revise some questions.
Second, the reliability of the test is not high, since a multiple-choice test tends to yield different results on different occasions because of students’ preparation, temporary illness, and physical and psychological factors. The test provides no general instructions, no specific instructions for some questions, and no information about when and for how long the test takes place, and it provides opportunity for guessing. However, it has consistent sets of criteria for a correct response which leave little debate; there is only one correct answer for each question. The test also meets the requirements of the three indices for designing multiple-choice items (Okulu, 2008).
Third is the validity of the test. The test is not quite valid in terms of content, since there are materials which have not been covered in the test. For criterion validity, the writer cannot analyze the specific criteria of the reading skill, which is measured in the test, since they are not specifically written in the syllabus. Then, although the test cannot be used to assess students’ speaking, listening, and writing ability, the results of the test agree with the theoretical construct of the reading skill measured. Next, the test does not really meet the criteria of face validity, since it does not give clear directions, has “surprises” (questions about introductions), and does not state the timing of the test. Despite these weaknesses, the structure of the test is organized logically and its difficulty level is appropriately pitched for Year VII students.
Fourth, the authenticity of the test is generally high, since the language used in most questions is as natural as possible, most of the test items are contextualized, and the topics and situations used in the test are interesting and enjoyable. However, the tasks do not represent, or closely approximate, real-world tasks, since in the real world students are never asked to complete a paragraph as in questions 46 to 50.
The last principle of “testing a test” is washback. From the teachers’ point of view, the test can be beneficial if they interpret the scores as an indication of teaching effectiveness, an evaluation of their strengths and weaknesses, and a way to improve their teaching quality. Unfortunately, the test does not provide any feedback for students, so they will not know where their mistakes lie. They are only informed of their scores after the test, or sometimes only in the report card, and they will not know their strengths and weaknesses in the test. Also, the test tends to be competitive because it encourages the students to aim for the best score, or a higher score than the others’.
In short, the test has not met all the criteria for “testing a test” proposed by Brown (2004) and Genesee & Upshur (1996) which have been summarized in Chapter 2. The result of the analysis confirms the opinion of Okulu (2008), who states:
“There are a number of weaknesses in multiple-choice items:
- The technique tests only recognition knowledge.
- Guessing may have a considerable effect on test scores.
- The technique severely restricts what can be tested.
- It is very difficult to write successful items.
- Washback may be harmful.
- Cheating may be facilitated.”
II. Recommendation
From the conclusion, the writer notes two major problems with the test: validity and washback. To overcome the validity problem, the writer suggests that the test writers refer to the English syllabus used for the level. They might consider the materials taught during the semester and the objectives being assessed in order to enhance content validity; strong content validity will also influence the criterion, construct, and face validity of the test. Then, to overcome the washback problem, the writer recommends that the test provide feedback to the students. Since it is a mid-term test which aims at measuring student competency achievement after 8-9 weeks of learning activities (Permen No. 20/2007 on Standar Penilaian Pendidikan), it is important for students to know their strengths and weaknesses after learning English for about eight weeks. If it is not feasible to give the feedback directly in class, the students can do self-assessment or peer discussion as alternative ways to enhance washback from the test (Brown, 2004: 37).
REFERENCES
Brown, H. D. (2004). Language Assessment: Principles and Classroom Practices (Chapter 2, pp. 19-41). New York: Pearson Education.
Genesee, F., & Upshur, J. A. (1996). Classroom-Based Evaluation in Second Language Education (Chapter 4, pp. 54-73). Cambridge: Cambridge University Press.
Okulu. (2008). Chapter 3: Designing Classroom Language Tests. PDF file.
Permendiknas No. 20/2007 tentang Standar Penilaian Pendidikan.
Silabus SMP Negeri 1 Bubulan Kelas VII.
Susilohadi, G., et al. (2008). Contextual Teaching and Learning Bahasa Inggris: SMP/MTs Kelas IX, Edisi 4. Jakarta: Pusat Perbukuan, Depdiknas.
APPENDIX