
Chapter 6 of the book "Testing Language Skills: From Theory to Practice": Characteristics of a Good Test.


Page 1: Characteristics of a good test

Characteristics of a Good Test

Arash Yazdani

Page 2: Characteristics of a good test

A test may consist of a single item or a combination of items. Regardless of the number of items in a test, every single item should possess certain characteristics.

Having good items, however, does not necessarily lead to a good test, because a test as a whole is more than a mere combination of individual items.

Therefore, in addition to having good items, a test should have certain characteristics.

1. Reliability
2. Validity
3. Practicality

Introduction:

Page 3: Characteristics of a good test

There are different theories to explain the concept of reliability in a scientific way.

First and simplest: a test is reliable if we get the same results repeatedly.

Second: a test is reliable when it gives consistent results.

Third: reliability is the ratio of true score variance to observed score variance.

Reliability

Page 4: Characteristics of a good test

One way to explain the concept of reliability in non-technical terms is to say: imagine the feeling that someone did not do as well on a test as he could have. Now imagine that if he could take the test again, he would do better. This may be quite true. Nevertheless, one should also admit that some factors, such as a good chance of guessing the correct responses, would raise his score higher than it should really be. Seldom does anyone complain about this.

Now, if one could take a test over and over again, he would probably agree that his average score over all the tests is an acceptable estimate of what he really knows or how he really feels about the test. On a "reliable" test, one's scores on its various administrations would not differ greatly; that is, one's score would be quite consistent. On an "unreliable" test, on the other hand, one's score might fluctuate from one administration to another; that is, one's scores on the various administrations will be inconsistent. The notion of consistency of one's score with respect to one's average score over repeated administrations is the central concern of the concept of reliability.

Page 5: Characteristics of a good test

The change in one’s score is inevitable. Some of the changes might represent a steady increase in one’s score. The increase would most likely be due to some sort of learning. This kind of change, which would be predictable, is called systematic variation.

The systematic variation contributes to the reliability of a test, and the unsystematic variation, which is called error variation, contributes to its unreliability.

Page 6: Characteristics of a good test

True Score

Let's assume that someone takes a test. Since all measurement devices are subject to error, the score one gets on a test cannot be a true manifestation of one's ability in that particular trait. In other words, the score contains one's true ability along with some error. If this error part could be eliminated, the resulting score would represent an errorless measure of that ability. By definition, this errorless score is called a "true score".

Page 7: Characteristics of a good test

Observed score

The true score is almost always different from the score one gets, which is called the "observed score". Since the observed score includes the measurement error, i.e., the error score, it can be greater than, equal to, or smaller than the true score. If there is absolutely no error of measurement, the observed score will equal the true score. However, when there is a measurement error, which is often the case, it can lead to an overestimation or an underestimation of the true score. Therefore, if the observed score is represented by X, the true score by T, and the error score by E, the relationship between the observed and true scores can be illustrated as follows:

Page 8: Characteristics of a good test

(1) X = T + E

Depending on the size and direction of the error score:

X = T or X > T or X < T

Page 9: Characteristics of a good test

These relations, however, do not hold true when the scores are changed into their corresponding variance terms. The variance of the observed scores fluctuates with the extent of the error of measurement. Since the error variance is included in the observed variance, the variance of the observed scores is always greater than the variance of the true scores. If the variance of the observed scores is represented by Vx, the variance of the true scores by Vt, and the variance of the error scores by Ve, formula number 1 can be rewritten as:

(2) Vx = Vt + Ve

These three variance components are crucial to understanding the concept of reliability in statistical terms.

From this formula, it can be understood that there is a close relationship between the degree of error in measurement and the exact amount of the true score: the greater the measurement error, the smaller the estimate of the true score. So, by definition, reliability is the ratio of the true score variance to the observed score variance.

(3) r = Vt / Vx

Page 10: Characteristics of a good test

Of course, the true score is not measurable and thus the value of Vt is never known. Therefore, we can solve for the unknown Vt through the following computations:

Vx = Vt + Ve

or:

Vt = Vx - Ve

Substituting this value of Vt in formula 3 will lead to formula 4:

(4) r = (Vx - Ve) / Vx

If the measurement is without error, the error variance will be zero, i.e., Ve = 0. Thus, we have:

r = (Vx - 0) / Vx = Vx / Vx = 1

This means that when there is no error in measurement, the reliability equals unity.

Page 11: Characteristics of a good test

Second, if the error in measurement is so large as to equal the observed score variance, i.e., all of the observed score is error, then Vx = Ve and we will have:

r = (Vx - Vx) / Vx = 0 / Vx = 0

So reliability can range from zero to one. A reliability of zero is the minimum and means the test is completely unreliable; a reliability of one, on the other hand, indicates that there are no errors and the test is completely reliable. Although this does not happen in reality, we can say that the closer the magnitude of the reliability is to unity, the more reliable the test will be.
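The following minimal Python sketch illustrates this variance-ratio definition. The variance figures are made-up illustrative values, not taken from the book:

def reliability(true_var, error_var):
    # Reliability as the ratio of true score variance to observed score variance
    observed_var = true_var + error_var   # Vx = Vt + Ve
    return true_var / observed_var        # r = Vt / Vx

# Illustrative values (assumed):
print(reliability(40, 0))    # no measurement error  -> 1.0
print(reliability(40, 10))   # some error            -> 0.8
print(reliability(0, 50))    # all error (Vx = Ve)    -> 0.0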

Page 12: Characteristics of a good test

Standard Error of Measurement

It is necessary to find an index of error in measurement which could be applied to all occasions of a particular measure. This index of error is called standard error of measurement, abbreviated as SEM.

By definition, SEM is the standard deviation of all the error scores obtained from a given measure in different situations.

Page 13: Characteristics of a good test

To calculate the numerical value of SEM, the formula for reliability can be used through the following procedures:

(1) r = Vt / Vx, and since Vx = Vt + Ve, then Vt = Vx - Ve

(2) r = (Vx - Ve) / Vx

or:

r = Vx / Vx - Ve / Vx

(3) r = 1 - Ve / Vx

Solving (3) for Ve:

(4) Ve = Vx (1 - r)

Page 14: Characteristics of a good test

From chapter 4, it should be recalled that the standard deviation is the square root of the variance. Taking the square root of the variance terms in formula 4, we will have:

√Ve = √(Vx (1 - r))

or:

Se = Sx √(1 - r)

The value of Se, the standard deviation of the errors, as mentioned before, is called the SEM. Thus:

SEM = Sx √(1 - r)

In the formula, Sx refers to the standard deviation of the observed scores and r is the reliability. From the formula it is clear that there is a negative relationship between reliability and SEM: the higher the reliability, the smaller the SEM. For example, if the reliability is perfect, i.e., r = 1, the value of SEM will be zero because the value of (1 - r) will equal zero. By the same token, the lower the reliability, the greater the SEM.

Page 15: Characteristics of a good test

For Example:

X = 20, Sx = 5, r = 0.84. The SEM can be calculated to be:

SEM = Sx √(1 - r) = 5 √(1 - 0.84) = 5 √0.16 = (5)(0.4) = 2

This means that the standard error of measurement of the test is 2. Thus, when one interprets a given score on the test, one should keep in mind that, on the average, the observed score may be lower or higher than the examinee's true score. The degree of this difference can be predicted from the value of the SEM. Usually, a safe estimate is to interpret the true score within the range of plus or minus one SEM from the observed score. In formulaic form:

A more accurate score = observed score ± 1 SEM

If the score of a given examinee were, for example, 25 on the test, his score might fluctuate between 23 and 27:

A more accurate score = 25 ± 1 SEM = 25 + 2 = 27 or 25 - 2 = 23

So the observed score cannot be taken as the most exact estimate of one's ability.
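A minimal Python sketch reproducing the slide's numbers (Sx = 5, r = 0.84, an observed score of 25):

import math

def sem(sd_observed, reliability):
    # Standard error of measurement: SEM = Sx * sqrt(1 - r)
    return sd_observed * math.sqrt(1 - reliability)

def score_band(observed, sd_observed, reliability):
    # Interpret the true score within plus or minus one SEM of the observed score
    e = sem(sd_observed, reliability)
    return observed - e, observed + e

print(sem(5, 0.84))            # approximately 2.0
print(score_band(25, 5, 0.84)) # approximately (23.0, 27.0)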

Page 16: Characteristics of a good test

Methods of Estimating Reliability

Test-Retest Method

Parallel-form Method

Split-Half Method

KR-21 Method

Page 17: Characteristics of a good test

In this method, reliability is obtained by administering a given test to a particular group twice and calculating the correlation between the two sets of scores obtained from the two administrations.

Since there has to be a reasonable amount of time between the two administrations, this kind of reliability is referred to as reliability, or consistency, over time.

Test-Retest
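As a minimal sketch of the computation involved, the test-retest reliability is the Pearson correlation between the scores from the two administrations. The score lists below are made-up illustrative data, not from the book:

from statistics import correlation  # requires Python 3.10+

# Scores of the same group on two administrations (illustrative values)
first_administration  = [12, 15, 18, 20, 22, 25, 27, 30]
second_administration = [13, 14, 19, 21, 21, 26, 28, 29]

r_test_retest = correlation(first_administration, second_administration)
print(round(r_test_retest, 2))  # close to 1.0 when scores are consistent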

Page 18: Characteristics of a good test

Test-Retest

Page 19: Characteristics of a good test

It requires two administrations.

Preparing similar conditions under which the two administrations take place adds to the complications of this method.

There should be a reasonable interval between the two administrations, neither too short nor too long. To keep the balance, a period of two weeks between them is recommended.

Disadvantages of Test-Retest

Page 20: Characteristics of a good test

In the parallel-forms method, two similar, or parallel, forms of the same test are administered to a group of examinees just once.

The problem here is constructing two parallel forms of a test, which is a difficult job to do.

The two forms of the test should be the same; that is, all the elements upon which the test items are constructed should be the same in both forms. For example, if one form measures a particular element of grammar, the other form should also contain the same number of items on the same element of grammar.

The subtests should also be the same, i.e., if one form of the test has three subsections of grammar, vocabulary, and reading comprehension, the other form should also have the same subsections in the same proportions.

Parallel-Forms

Page 21: Characteristics of a good test

In the split-half method, the items comprising a test are homogeneous. That is, all the items in the test attempt to measure elements of a particular trait, e.g., tenses, prepositions, other grammatical points, vocabulary, reading and listening comprehension, which are all subparts of the trait called language ability.

In this method, when a single test with homogeneous items is administered to a group of examinees, the test is split, or divided, into two equal halves. The correlation between the two halves is an estimate of the reliability of the test scores.

Split-Half

Page 22: Characteristics of a good test

In using this method, two main points should be taken into consideration: first, the procedure for dividing the test into two equal halves, and second, the computation of the total test reliability from the reliability of one half of the test.

In this method, easy and difficult items should be equally distributed between the two halves.

Split-Half

Page 23: Characteristics of a good test

As mentioned before, the number of items is one of the important factors in estimating test score reliability, and by dividing the items into two halves, the length of the test is reduced to half of the length of the total test. Thus, the correlation between the two halves will be the reliability of one half of the test, not of the total test. To estimate the reliability of the total test, the following formula, known as the Spearman-Brown prophecy formula, should be used:

r(total) = 2 r(half) / (1 + r(half))

Split-Half
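A minimal Python sketch of this procedure, assuming an odd/even split of the items (the item scores below are made-up illustrative data):

from statistics import correlation  # requires Python 3.10+

def split_half_reliability(item_scores):
    # item_scores: one list of item scores (e.g., 0/1) per examinee.
    # Split the items into odd/even halves, correlate the half totals,
    # then step the half-test reliability up with the Spearman-Brown formula.
    odd_totals  = [sum(row[0::2]) for row in item_scores]
    even_totals = [sum(row[1::2]) for row in item_scores]
    r_half = correlation(odd_totals, even_totals)
    return 2 * r_half / (1 + r_half)   # Spearman-Brown prophecy formula

# Illustrative data: 5 examinees by 6 items
scores = [
    [1, 1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
print(split_half_reliability(scores))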

Page 24: Characteristics of a good test

Advantages: it is more practical than the other methods. In using the split-half method, there is no need to administer the same test twice, nor is it necessary to develop two parallel forms of the same test.

Disadvantages: the main shortcoming of this method is developing a test with homogeneous items, because assuming equality between the two halves is not a safe assumption. Furthermore, different subsections in a test, e.g., grammar, vocabulary, reading, or listening comprehension, will jeopardize test homogeneity and thus reduce test score reliability.

Split-Half Advantages and Disadvantages

Page 25: Characteristics of a good test

Kuder and Richardson, two famous statisticians, have developed a set of mathematical formulas for statistical computation.

This formula is based on the assumption that all the items in a test are designed to measure a single trait:

KR-21 = (K / (K - 1)) (1 - (X̄ (K - X̄)) / (K V))

K = the number of items in the test, X̄ = the mean score, V = the variance

KR-21

Page 26: Characteristics of a good test

The KR-21 method is the most practical, frequently used, and convenient method of estimating reliability.

For example: if a 100-item test is administered to a group of testees and results in a mean of 60 and a variance of 48, the reliability of this test can easily be computed as follows:

KR-21 = (100 / 99) (1 - (60 (100 - 60)) / (100 × 48)) = (1.01) (1 - 2400 / 4800) = (1.01) (0.5) ≈ 0.51

KR-21
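A minimal Python sketch of this computation, using the slide's own figures (100 items, mean 60, variance 48):

def kr21(k, mean, variance):
    # Kuder-Richardson formula 21 reliability estimate
    return (k / (k - 1)) * (1 - (mean * (k - mean)) / (k * variance))

print(round(kr21(100, 60, 48), 2))  # -> 0.51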

Page 27: Characteristics of a good test

It depends on the function of the test. The test-retest method is appropriate when the consistency of scores over a particular time interval (the stability of test scores over time) is important.

The Parallel-forms method is desirable when the consistency of scores over different forms is of importance.

When the go-togetherness of the items of a test (their internal consistency) is of significance, the split-half and KR-21 methods will be the most appropriate.

Which method should we use?

Page 28: Characteristics of a good test

To have a reliability estimate, one or two sets of scores should be obtained from the same group of testees. Thus, two factors contribute to test reliability: the testee and the test itself.

Factors Influencing Reliability

Page 29: Characteristics of a good test

Since human beings are dynamic creatures, the attributes related to human beings are also dynamic. The implication is that the performance of human beings will, by its very nature, fluctuate from time to time or from place to place. Factors such as students misunderstanding or misreading the test directions, the noise level, distractions, and sickness can cause test scores to vary.

Heterogeneity of the group members: the greater the heterogeneity of the group members in the preferences, skills, or behaviors being tested, the greater the chance of high reliability correlation coefficients.

The Effect of Testees

Page 30: Characteristics of a good test

Test length. Generally, the longer a test is, the more reliable it is; however, this holds only up to a point.

Speed. When a test is a speed test, reliability can be problematic. It is inappropriate to estimate reliability using internal consistency, test-retest, or alternate form methods. This is because not every student is able to complete all of the items in a speed test. In contrast, a power test is a test in which every student is able to complete all the items.

Item difficulty. When there is little variability among test scores, the reliability will be low. Thus, reliability will be low if a test is so easy that every student gets most or all of the items correct or so difficult that every student gets most or all of the items wrong.

The Effect of Test Factors

Page 31: Characteristics of a good test

The Effect of Administration Factors

• Poor or unclear directions given during administration or inaccurate scoring can affect reliability.

For example: say you were told that your scores on a measure of how social you are determined your promotion. The results are then more likely to reflect what you think the raters want than what your behavior really is.

Page 32: Characteristics of a good test

In an objectively-scored test, the likes and dislikes of the scorers will not influence the results.

In a subjectively-scored test, the likes and dislikes of the scorers will influence the results and, as a result, the reliability.

Intra-rater errors (errors which are due to fluctuations of the same rater scoring a single test twice)

Inter-rater errors (errors which are due to the fluctuations of different scorers, at least two, scoring a single test)

The Influence of Scoring Factors

Page 33: Characteristics of a good test

Validity

The second major characteristic of a good test is validity.

What does validity mean? A test is valid if it measures what we want it to measure and nothing else.

The extent to which a test measures what it is supposed to measure or can be used for the purposes for which the test is intended.

Validity is a more test-dependent concept, but reliability is a purely statistical parameter.

So, validity refers to the extent to which a test measures what it is supposed to measure.

There are four types of validity.

Page 34: Characteristics of a good test

Content Validity

Criterion-Related Validity

Construct Validity

Types Of Validity

Page 35: Characteristics of a good test

Relevance of the test items to the purpose of the test: does the test measure the objectives of the course? It refers to the correspondence (agreement) between the test content and the content of the materials (subject matter and instructional objectives) that were taught and are to be tested.

The extent to which a test measures a representative sample of the content to be tested at the intended level of learning.

Content Validity

Page 36: Characteristics of a good test

Content validity is called the appropriateness of the test, that is, the appropriateness of the sample and of the learning level.

Content Validity is the most important type of validity which can be achieved through a careful examination of the test content.

It provides the most useful subjective information about the appropriateness of the test.

Content Validity

Page 37: Characteristics of a good test

Criterion-related validity investigates the correspondence between the scores obtained from the newly-developed test and the scores obtained from some independent outside criteria.

The newly-developed test has to be administered along with the criterion measure to the same group.

The extent to which the test scores correlate with a relevant outside criterion.

Criterion-related validity refers to the extent to which different tests intended to measure the same ability are in agreement. Depending on the time of administration of the criterion measure, two types exist:

Concurrent Validity Predictive Validity

Criterion-related Validity

Page 38: Characteristics of a good test

Correlation between the test scores (new test) and a recognized measure taken at the same time.

Concurrent Validity

Comparison (correlation) of students' scores with a criterion taken at a later time (date).

Predictive validity

Page 39: Characteristics of a good test

Construct validity

Refers to measuring certain traits or theoretical constructs.

Refers to the extent to which the psychological reality of a trait or construct can be established.

It is based on the degree to which the items in a test reflect the essential aspects of the theory on which the test is based.

Construct validity also refers to the accuracy with which the test measures certain psychological or theoretical traits.

Examples of such constructs are reading comprehension and oral language ability. Establishing construct validity is done through factor analysis.

Page 40: Characteristics of a good test

a. Directions (clear and simple)

b. Difficulty level of the test (neither too easy nor too difficult)

c. Structure of the items (poorly constructed and/or ambiguous items will contribute to invalidity)

d. Arrangement of items and correct responses (starting with the easiest items and ending with the most difficult ones, and arranging item responses randomly, not based on an identifiable pattern)

Factors Affecting Validity

Page 41: Characteristics of a good test

Reliability is a purely statistical parameter; that is, it can be determined fairly independently of the test. But validity is a test-dependent concept.

We have degrees of validity: very valid, moderately valid, not very valid

A test must be reliable to be valid, but reliability does not guarantee validity.

Validity and Reliability

Page 42: Characteristics of a good test

How reliable and valid should a test be? The more important the decision to be made, the more confidence is needed in the scores, and thus, the more reliable and valid the test must be.

Nevertheless, it is a generally accepted tradition to regard validity and reliability coefficients below 0.50 as low, from 0.50 to 0.75 as moderate, and from 0.75 to 0.90 as high.

Reliability, Validity and Acceptability

Page 43: Characteristics of a good test

Generally speaking, practicality refers to the ease of administration and scoring of a test.

Practicality

Page 44: Characteristics of a good test

It includes:

• Clarity, simplicity, and ease of reading of the instructions
• A smaller number of subtests
• The time required for the test

Ease of administration

Page 45: Characteristics of a good test

A test can be scored subjectively or objectively. Since scoring is difficult and time-consuming, the trend is toward objectivity, simplicity, and machine scoring.

Ease of scoring

Page 46: Characteristics of a good test

This refers to the meaningfulness of the scores obtained from the test. If the test results are misinterpreted or misapplied, they will be of little value and may actually be harmful to some individual or group.

Ease of Interpretation and Application

Page 47: Characteristics of a good test

Thank You for Listening

The End