Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

Page 1: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

Validity/Reliability and a recap of Statistics

RCS 6740, 6/27/05

Page 2: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Reliability 

From the perspective of classical test theory, an examinee's obtained test score (X) is composed of two components, a true score component (T) and an error component (E):

X = T + E

The true score component reflects the examinee's status with regard to the attribute that is measured by the test, while the error component represents measurement error. Measurement error is random error. It is due to factors that are irrelevant to what is being measured by the test and that have an unpredictable (unsystematic) effect on an examinee's test score.
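To make the X = T + E decomposition concrete, here is a minimal simulation sketch (not part of the original slides; all names and values are illustrative) showing that reliability works out to the ratio of true score variance to obtained score variance:

```python
import numpy as np

# A minimal sketch of X = T + E (hypothetical values, not from the slides).
rng = np.random.default_rng(0)
n_examinees = 10_000

true_scores = rng.normal(loc=50, scale=10, size=n_examinees)  # T
error = rng.normal(loc=0, scale=5, size=n_examinees)          # E: random, unsystematic
obtained = true_scores + error                                # X = T + E

# Reliability = proportion of obtained-score variance that is true-score variance.
print(true_scores.var() / obtained.var())  # ~0.80 here: 10^2 / (10^2 + 5^2) = .80
```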

Page 3: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

The score you obtain on a test is likely to be due both to the knowledge you have about the topics addressed by exam items (T) and the effects of random factors (E) such as the way test items are written, any alterations in anxiety, attention, or motivation you experience while taking the test, and the accuracy of your "educated guesses."

 Whenever we administer a test to examinees, we would like to know how much of their scores reflects "truth" and how much reflects error. It is a measure of reliability that provides us with an estimate of the proportion of variability in examinees' obtained scores that is due to true differences among examinees on the attribute(s) measured by the test.

Page 4: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Reliability

When a test is reliable, it provides dependable, consistent results and, for this reason, the term consistency is often given as a synonym for reliability (e.g., Anastasi, 1988).

Page 5: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

The Reliability Coefficient

Ideally, a test's reliability would be calculated by dividing true score variance by the obtained (total) variance to derive a reliability index. This index would indicate the proportion of observed variability in test scores that reflects true score variability. A test's true score variance is not known, however, so reliability must be estimated rather than calculated directly. There are several ways to estimate a test's reliability. Each involves assessing the consistency of an examinee's scores over time, across different content samples, or across different scorers, and each is based on the assumption that variability that is consistent is true score variability, while variability that is inconsistent reflects random error.

Page 6: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Most methods for estimating reliability produce a reliability coefficient, which is a correlation coefficient that ranges in value from 0.0 to +1.0. When a test's reliability coefficient is 0.0, this means that all variability in obtained test scores is due to measurement error. Conversely, when a test's reliability coefficient is +1.0, this indicates that all variability in scores reflects true score variability. The reliability coefficient is symbolized with the letter "r" and a subscript that contains two of the same letters or numbers (e.g., "rxx"). The subscript indicates that the correlation coefficient was calculated by correlating a test with itself rather than with some other measure.

Page 7: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Regardless of the method used to calculate a reliability coefficient, the coefficient is interpreted directly as the proportion of variability in obtained test scores that reflects true score variability. For example, as depicted in Figure 1, a reliability coefficient of .84 indicates that 84% of variability in scores is due to true score differences among examinees, while the remaining 16% (1.00 - .84) is due to measurement error.

 

 

[Figure 1. Proportion of variability in test scores: True Score Variability (84%), Error (16%)]

Page 8: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Note that a reliability coefficient does not provide any information about what is actually being measured by a test. A reliability coefficient only indicates whether the attribute measured by the test— whatever it is—is being assessed in a consistent, precise way. Whether the test is actually assessing what it was designed to measure is addressed by an analysis of the test's validity.

Page 9: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Study Tip: Remember that, in contrast to other correlation coefficients, the reliability coefficient is never squared to interpret it but is interpreted directly as a measure of true score variability. A reliability coefficient of .89 means that 89% of variability in obtained scores is true score variability.

Page 10: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Methods for Estimating Reliability 

The selection of a method for estimating reliability depends on the nature of the test. As noted below, each method not only entails different procedures but is also affected by different sources of error. For many tests, more than one method should be used.

Page 11: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

1. Test-Retest Reliability: The test-retest method for estimating reliability involves administering the same test to the same group of examinees on two different occasions and then correlating the two sets of scores. When using this method, the reliability coefficient indicates the degree of stability (consistency) of examinees' scores over time and is also known as the coefficient of stability.

The primary sources of measurement error for test-retest reliability are any random factors related to the time that passes between the two administrations of the test. These time sampling factors include random fluctuations in examinees over time (e.g., changes in anxiety or motivation) and random variations in the testing situation. Memory and practice also contribute to error when they have random carryover effects; i.e., when they affect many or all examinees but not in the same way.

Page 12: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Test-retest reliability is appropriate for determining the reliability of tests designed to measure attributes that are relatively stable over time and that are not affected by repeated measurement. It would be appropriate for a test of aptitude, which is a stable characteristic, but not for a test of mood, since mood fluctuates over time, or a test of creativity, which might be affected by previous exposure to test items.

Page 13: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

2. Alternate (Equivalent, Parallel) Forms Reliability: To assess a test's alternate forms reliability, two equivalent forms of the test are administered to the same group of examinees and the two sets of scores are correlated. Alternate forms reliability indicates the consistency of responding to different item samples (the two test forms) and, when the forms are administered at different times, the consistency of responding over time. The alternate forms reliability coefficient is also called the coefficient of equivalence when the two forms are administered at about the same time and the coefficient of equivalence and stability when a relatively long period of time separates administration of the two forms.

Page 14: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

The primary source of measurement error for alternate forms reliability is content sampling, or error introduced by an interaction between different examinees' knowledge and the different content assessed by the items included in the two forms: The items in Form A might be a better match of one examinee's knowledge than items in Form B, while the opposite is true for another examinee. In this situation, the two scores obtained by each examinee will differ, which will lower the alternate forms reliability coefficient. When administration of the two forms is separated by a period of time, time sampling factors also contribute to error.

Page 15: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Like test-retest reliability, alternate forms reliability is not appropriate when the attribute measured by the test is likely to fluctuate over time (and the forms will be administered at different times) or when scores are likely to be affected by repeated measurement. If the same strategies required to solve problems on Form A are used to solve problems on Form B, even if the problems on the two forms are not identical, there are likely to be practice effects. When these effects differ for different examinees (i.e., are random), practice will serve as a source of measurement error. Although alternate forms reliability is considered by some experts to be the most rigorous (and best) method for estimating reliability, it is not often assessed due to the difficulty in developing forms that are truly equivalent.

Page 16: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

3. Internal Consistency Reliability: Reliability can also be estimated by measuring the internal consistency of a test. Split-half reliability and coefficient alpha are two methods for evaluating internal consistency. Both involve administering the test once to a single group of examinees, and both yield a reliability coefficient that is also known as the coefficient of internal consistency.

 To determine a test's split-half reliability, the test is split into equal halves so that each examinee has two scores (one for each half of the test). Scores on the two halves are then correlated. Tests can be split in several ways, but probably the most common way is to divide the test on the basis of odd- versus even-numbered items.

Page 17: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

A problem with the split-half method is that it produces a reliability coefficient that is based on test scores that were derived from one-half of the entire length of the test. If a test contains 30 items, each score is based on 15 items. Because reliability tends to decrease as the length of a test decreases, the split-half reliability coefficient usually underestimates a test's true reliability. For this reason, the split-half reliability coefficient is ordinarily corrected using the Spearman-Brown prophecy formula, which provides an estimate of what the reliability coefficient would have been had it been based on the full length of the test.
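A sketch of the odd/even split and the Spearman-Brown correction just described, assuming a hypothetical examinees-by-items score matrix (the function and variable names are illustrative):

```python
import numpy as np

def split_half_reliability(item_scores):
    """Odd/even split-half correlation plus its Spearman-Brown correction.

    item_scores: 2-D array, rows = examinees, columns = items (hypothetical data).
    """
    odd = item_scores[:, 0::2].sum(axis=1)   # score on odd-numbered items
    even = item_scores[:, 1::2].sum(axis=1)  # score on even-numbered items
    r_half = np.corrcoef(odd, even)[0, 1]
    # Spearman-Brown: project the half-length correlation to the full test length.
    r_full = (2 * r_half) / (1 + r_half)
    return r_half, r_full
```

The correction step is what compensates for each half being only half the test's length (e.g., 15 items of a 30-item test).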

 Cronbach's coefficient alpha also involves administering the test once to a single group of examinees. However, rather than splitting the test in half, a special formula is used to determine the average degree of inter-item consistency. One way to interpret coefficient alpha is as the average reliability that would be obtained from all possible splits of the test. Coefficient alpha tends to be conservative and can be considered the lower boundary of a test's reliability (Novick and Lewis, 1967). When test items are scored dichotomously (right or wrong), a variation of coefficient alpha known as the Kuder-Richardson Formula 20 (KR-20) can be used.
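Coefficient alpha can be computed directly from the item variances and the total-score variance. The following is a sketch, assuming the same hypothetical examinees-by-items layout as above; applied to dichotomously scored (0/1) items, the same computation yields KR-20:

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Coefficient alpha for a rows = examinees, columns = items score matrix."""
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1)      # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)  # variance of examinees' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```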

Page 18: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Content sampling is a source of error for both split-half reliability and coefficient alpha. For split-half reliability, content sampling refers to the error resulting from differences between the content of the two halves of the test (i.e., the items included in one half may better fit the knowledge of some examinees than items in the other half); for coefficient alpha, content (item) sampling refers to differences between individual test items rather than between test halves. Coefficient alpha has, as an additional source of error, the heterogeneity of the content domain. A test is heterogeneous with regard to content domain when its items measure several different domains of knowledge or behavior. The greater the heterogeneity of the content domain, the lower the inter-item correlations and the lower the magnitude of coefficient alpha. Coefficient alpha could be expected to be smaller for a 200-item test that contains items assessing knowledge of test construction, statistics, ethics, industrial-organizational psychology, clinical psychology, etc. than for a 200-item test that contains questions on test construction only.

Page 19: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

The methods for assessing internal consistency reliability are useful when a test is designed to measure a single characteristic, when the characteristic measured by the test fluctuates over time, or when scores are likely to be affected by repeated exposure to the test. They are not appropriate for assessing the reliability of speed tests because, for these tests, they tend to produce spuriously high coefficients. (For speed tests, alternate forms reliability is usually the best choice.)

Page 20: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

4. Inter-Rater (Inter-Scorer, Inter-Observer) Reliability: Inter-rater reliability is of concern whenever test scores depend on a rater's judgment. A test constructor would want to make sure that an essay test, a behavioral observation scale, or a projective personality test has adequate inter-rater reliability. This type of reliability is assessed either by calculating a correlation coefficient (e.g., a kappa coefficient or coefficient of concordance) or by determining the percent agreement between two or more raters. Although the latter technique is frequently used, it can lead to erroneous conclusions since it does not take into account the level of agreement that would have occurred by chance alone. This is a particular problem for behavioral observation scales that require raters to record the frequency of a specific behavior. In this situation, the degree of chance agreement is high whenever the behavior has a high rate of occurrence, and percent agreement will provide an inflated estimate of the measure's reliability.
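The contrast between percent agreement and a chance-corrected index can be illustrated with a sketch of Cohen's kappa (the rater data below are hypothetical):

```python
import numpy as np

def percent_agreement(r1, r2):
    r1, r2 = np.asarray(r1), np.asarray(r2)
    return (r1 == r2).mean()

def cohens_kappa(r1, r2):
    """Agreement between two raters, corrected for chance agreement."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    p_obs = (r1 == r2).mean()
    # Expected chance agreement from each rater's marginal proportions per category.
    p_chance = sum((r1 == c).mean() * (r2 == c).mean() for c in np.union1d(r1, r2))
    return (p_obs - p_chance) / (1 - p_chance)

# A high-rate behavior inflates percent agreement but not kappa:
rater1 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 1])
rater2 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])
print(percent_agreement(rater1, rater2))  # 0.80: looks respectable
print(cohens_kappa(rater1, rater2))       # ~ -0.11: no better than chance
```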

Page 21: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Sources of error for inter-rater reliability include factors related to the raters such as lack of motivation and rater biases and characteristics of the measuring device. An inter-rater reliability coefficient is likely to be low, for instance, when rating categories are not exhaustive (i.e., don't include all possible responses or behaviors) and/or are not mutually exclusive.

The inter-rater reliability of a behavioral rating scale can also be affected by consensual observer drift, which occurs when two (or more) observers working together influence each other's ratings so that they both assign ratings in a similarly idiosyncratic way. (Observer drift can also affect a single observer's ratings when he or she assigns ratings in a consistently deviant way.) Unlike other sources of error, consensual observer drift tends to artificially inflate inter-rater reliability.

Page 22: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

The reliability (and validity) of ratings can be improved in several ways. Consensual observer drift can be eliminated by having raters work independently or by alternating raters. Rating accuracy is also improved when raters are told that their ratings will be checked. Overall, the best way to improve both inter- and intra-rater accuracy is to provide raters with training that emphasizes the distinction between observation and interpretation (Aiken, 1985).

Page 23: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Study Tip: Remember that the Spearman-Brown formula is related to split-half reliability and that KR-20 is related to coefficient alpha. Also know that alternate forms reliability is the most thorough method for estimating reliability and that internal consistency reliability is not appropriate for speed tests.

Page 24: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Factors That Affect The Reliability Coefficient 

The magnitude of the reliability coefficient is affected not only by the sources of error discussed above but also by the length of the test, the range of the test scores, and the probability that the correct response to items can be selected by guessing.

1. Test Length
2. Range of Test Scores
3. Guessing

Page 25: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

1. Test Length: The larger the sample of the attribute being measured by a test, the less the relative effects of measurement error and the more likely the sample will provide dependable, consistent information. Consequently, a general rule is that the longer the test, the larger the test's reliability coefficient.

 The Spearman-Brown prophecy formula is most associated with split-half reliability but can actually be used whenever a test developer wants to estimate the effects of lengthening or shortening a test on its reliability coefficient. For instance, if a 100-item test has a reliability coefficient of .84, the Spearman-Brown formula could be used to estimate the effects of increasing the number of items to 150 or reducing the number to 50. A problem with the Spearman-Brown formula is that it does not always yield an accurate estimate of reliability: In general, it tends to overestimate a test's true reliability (Gay, 1992).

Page 26: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

This is most likely to be the case when the added items do not measure the same content domain as the original items and/or are more susceptible to the effects of measurement error. Note that, when used to correct the split-half reliability coefficient, the situation is more complex, and this generalization does not always apply: When the two halves are not equivalent in terms of their means and standard deviations, the Spearman-Brown formula may either over- or underestimate the test's actual reliability.
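The general form of the Spearman-Brown prophecy formula is r_new = (n × rxx) / (1 + (n − 1) × rxx), where n is the ratio of new test length to old test length. Here is a sketch applying it to the slides' 100-item example (the function name is illustrative):

```python
def spearman_brown(r_xx, length_factor):
    """Projected reliability when test length is multiplied by length_factor."""
    return (length_factor * r_xx) / (1 + (length_factor - 1) * r_xx)

# The slides' example: a 100-item test with a reliability coefficient of .84.
print(spearman_brown(0.84, 150 / 100))  # lengthened to 150 items -> ~.89
print(spearman_brown(0.84, 50 / 100))   # shortened to 50 items   -> ~.72
```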

2. Range of Test Scores: Since the reliability coefficient is a correlation coefficient, it is maximized when the range of scores is unrestricted. The range is directly affected by the degree of similarity of examinees with regard to the attribute measured by the test: When examinees are heterogeneous, the range of scores is maximized. The range is also affected by the difficulty level of the test items. When all items are either very difficult or very easy, all examinees will obtain either low or high scores, resulting in a restricted range. Therefore, the best strategy is to choose items so that the average difficulty level is in the mid-range (p = .50).

Page 27: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

3. Guessing: A test's reliability coefficient is also affected by the probability that examinees can guess the correct answers to test items. As the probability of correctly guessing answers increases, the reliability coefficient decreases. All other things being equal, a true/false test will have a lower reliability coefficient than a four-alternative multiple-choice test which, in turn, will have a lower reliability coefficient than a free recall test.

Page 28: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

The Interpretation of Reliability 

The interpretation of a test's reliability entails considering its effects on the scores achieved by a group of examinees as well as the score obtained by a single examinee.

1. The Reliability Coefficient: As discussed above, a reliability coefficient is interpreted directly as the proportion of variability in a set of test scores that is attributable to true score variability. A reliability coefficient of .84 indicates that 84% of variability in test scores is due to true score differences among examinees, while the remaining 16% is due to measurement error. While different types of tests can be expected to have different levels of reliability, for most tests, reliability coefficients of .80 or larger are considered acceptable.

Page 29: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

When interpreting a reliability coefficient, it is important to keep in mind that there is no single index of reliability for a given test. Instead, a test's reliability coefficient can vary from situation to situation and sample to sample. Ability tests, for example, typically have different reliability coefficients for groups of individuals of different ages or ability levels.

Page 30: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

2. The Standard Error of Measurement: While the reliability coefficient is useful for estimating the proportion of true score variability in a set of test scores, it is not particularly helpful for interpreting an individual examinee's obtained test score. When an examinee receives a score of 80 on a 100-item test that has a reliability coefficient of .84, for instance, we can only conclude that, since the test is not perfectly reliable, the examinee's obtained score might or might not be his or her true score.

A common practice when interpreting an examinee's obtained score is to construct a confidence interval around that score. The confidence interval helps a test user estimate the range within which an examinee's true score is likely to fall given his or her obtained score. This range is calculated using the standard error of measurement, which is an index of the amount of error that can be expected in obtained scores due to the unreliability of the test. (When raw scores have been converted to percentile ranks, the confidence interval is referred to as a percentile band.)

Page 31: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

The following formula is used to estimate the standard error of measurement:

 

Formula 1: Standard Error of Measurement

SEmeas = SDx × √(1 − rxx)

where:
SEmeas = standard error of measurement
SDx = standard deviation of test scores
rxx = reliability coefficient

Page 32: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

As shown by the formula, the magnitude of the standard error is affected by two factors: the standard deviation of the test scores and the test's reliability coefficient. The lower the test's standard deviation and the higher its reliability coefficient, the smaller the standard error of measurement (and vice versa).

 Because the standard error is a type of standard deviation, it can be interpreted in terms of the areas under the normal curve. With regard to confidence intervals, this means that a 68% confidence interval is constructed by adding and subtracting one standard error to an examinee's obtained score; a 95% confidence interval is constructed by adding and subtracting two standard errors; and a 99% confidence interval is constructed by adding and subtracting three standard errors.

Page 33: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Example: The psychologist in Study #3 administers the interpersonal assertiveness test to a sales applicant who receives a score of 80. Since the test's reliability is less than 1.0, the psychologist knows that this score might be an imprecise estimate of the applicant's true score and decides to use the standard error of measurement to construct a 95% confidence interval. Assuming that the test's reliability coefficient is .84 and its standard deviation is 10, the standard error of measurement is equal to 4.0:

 

SEmeas = SDx × √(1 − rxx) = 10 × (1 − .84)^1/2 = 10 × .4 = 4.0

The psychologist constructs the 95% confidence interval by adding and subtracting two standard errors from the applicant's obtained score: 80 ± 2(4.0) = 72 to 88. This means that there is a 95% chance that the applicant's true score falls between 72 and 88.
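A sketch of the same computation in code, reproducing the numbers from the example above (function names are illustrative):

```python
import math

def se_meas(sd, r_xx):
    """Standard error of measurement: SDx * sqrt(1 - rxx)."""
    return sd * math.sqrt(1 - r_xx)

def confidence_interval(obtained, sd, r_xx, n_se=2):
    """n_se = 1, 2, or 3 for roughly 68%, 95%, or 99% intervals."""
    half_width = n_se * se_meas(sd, r_xx)
    return obtained - half_width, obtained + half_width

print(se_meas(10, 0.84))                  # 4.0
print(confidence_interval(80, 10, 0.84))  # (72.0, 88.0), the 95% interval
```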

Page 34: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

One problem with the standard error is that measurement error is not usually equally distributed throughout the range of test scores. Use of the same standard error to construct confidence intervals for all scores in a distribution can, therefore, be somewhat misleading. To overcome this problem, some test manuals report different standard errors for different score intervals.

Page 35: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

3. Estimating True Scores from Obtained Scores: As discussed above, because of the effects of measurement error, obtained test scores tend to be biased (inaccurate) estimates of true scores. More specifically, scores above the mean of a distribution tend to overestimate true scores, while scores below the mean tend to underestimate true scores. Moreover, the farther from the mean an obtained score is, the greater this bias. Rather than constructing a confidence interval, an alternative (but less used) method for interpreting an examinee's obtained test score is to estimate his/her true score using a formula that takes into account this bias by adjusting the obtained score using the mean of the distribution and the test's reliability coefficient.

Page 36: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

For example, if an examinee obtains a score of 80 on a test that has a mean of 70 and a reliability coefficient of .84, the formula predicts that the examinee's true score is 78.2.

T′ = a + bX = (1 − rxx) × M + rxx × X

T′ = (1 − .84) × 70 + .84 × 80 = .16 × 70 + .84 × 80 = 11.2 + 67.2 = 78.2
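The same estimate in code, as a sketch (the function name is illustrative):

```python
def estimated_true_score(obtained, mean, r_xx):
    """T' = (1 - rxx) * M + rxx * X: regress the obtained score toward the mean."""
    return (1 - r_xx) * mean + r_xx * obtained

print(estimated_true_score(80, 70, 0.84))  # 78.2, as in the example above
```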

Page 37: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

4. The Reliability of Difference Scores: A test user is sometimes interested in comparing the performance of an examinee on two different tests or subtests and, therefore, computes a difference score. An educational psychologist, for instance, might calculate the difference between a child's WISC-R Verbal and Performance IQ scores. When doing so, it is important to keep in mind that the reliability coefficient for the difference scores can be no larger than the average of the reliabilities of the two tests or subtests: If Test A has a reliability coefficient of .95 and Test B has a reliability coefficient of .85, this means that difference scores calculated from the two tests will have a reliability coefficient of .90 or less. The exact size of the reliability coefficient for difference scores depends on the degree of correlation between the two tests: The more highly correlated the tests, the smaller the reliability coefficient (and the larger the standard error of measurement).
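The slides do not give the formula, but a common expression for the reliability of a difference score, assuming the two tests have equal variances, is r_diff = ((rxx + ryy)/2 − rxy) / (1 − rxy). A sketch using the slides' two reliabilities (the correlation values are hypothetical):

```python
def difference_score_reliability(r_xx, r_yy, r_xy):
    """Reliability of the difference X - Y, under an equal-variances assumption."""
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)

# Tests with reliabilities .95 and .85, as in the slides' example:
print(difference_score_reliability(0.95, 0.85, 0.0))  # 0.90, the ceiling (uncorrelated tests)
print(difference_score_reliability(0.95, 0.85, 0.6))  # 0.75: higher correlation, lower reliability
```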

Page 38: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Validity 

Validity refers to a test's accuracy. A test is valid when it measures what it is intended to measure. The intended uses for most tests fall into one of three categories, and each category is associated with a different method for establishing validity:

The test is used to obtain information about an examinee's familiarity with a particular content or behavior domain: content validity.

The test is administered to determine the extent to which an examinee possesses a particular hypothetical trait: construct validity.

The test is used to estimate or predict an examinee's standing or performance on an external criterion: criterion-related validity.

Page 39: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

For some tests, it is necessary to demonstrate only one type of validity; for others, it is desirable to establish more than one type. For example, if an arithmetic achievement test will be used to assess the classroom learning of 8th grade students, establishing the test's content validity would be sufficient. If the same test will be used to predict the performance of 8th grade students in an advanced high school math class, the test's content and criterion-related validity will both be of concern.

 Note that, even when a test is found valid for a particular purpose, it might not be valid for that purpose for all people. It is quite possible for a test to be a valid measure of intelligence or a valid predictor of job performance for one group of people but not for another group.

Page 40: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Content Validity 

A test has content validity to the extent that it adequately samples the content or behavior domain that it is designed to measure. If test items are not a good sample, results of testing will be misleading. Although content validation is sometimes used to establish the validity of personality, aptitude, and attitude tests, it is most associated with achievement-type tests that measure knowledge of one or more content domains and with tests designed to assess a well-defined behavior domain. Adequate content validity would be important for a statistics test and for a work (job) sample test.

 Content validity is usually "built into" a test as it is constructed through a systematic, logical, and qualitative process that involves clearly identifying the content or behavior domain to be sampled and then writing or selecting items that represent that domain. Once a test has been developed, the establishment of content validity relies primarily on the judgment of subject matter experts. If experts agree that test items are an adequate and representative sample of the target domain, then the test is said to have content validity.

Page 41: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Although content validation depends mainly on the judgment of experts, supplemental quantitative evidence can be obtained. If a test has adequate content validity, a coefficient of internal consistency will be large; the test will correlate highly with other tests that purport to measure the same domain; and pre-/post-test evaluations of a program designed to increase familiarity with the domain will indicate appropriate changes.

 Content validity must not be confused with face validity. Content validity refers to the systematic evaluation of a test by experts who determine whether or not test items adequately sample the relevant domain, while face validity refers simply to whether or not a test "looks like" it measures what it is intended to measure. Although face validity is not an actual type of validity, it is a desirable feature for many tests. If a test lacks face validity, examinees may not be motivated to respond to items in an honest or accurate manner. A high degree of face validity does not, however, indicate that a test has content validity.

Page 42: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Construct Validity 

When a test has been found to measure the hypothetical trait (construct) it is intended to measure, the test is said to have construct validity. A construct is an abstract characteristic that cannot be observed directly but must be inferred by observing its effects. Intelligence, mechanical aptitude, self-esteem, and neuroticism are all constructs.

 There is no single way to establish a test's construct validity. Instead, construct validation entails a systematic accumulation of evidence showing that the test actually measures the construct it was designed to measure. The various methods used to establish this type of validity each answer a slightly different question about the construct and include the following:

Page 43: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Assessing the test's internal consistency: Do scores on individual test items correlate highly with the total test score; i.e., are all of the test items measuring the same construct?

Studying group differences: Do scores on the test accurately distinguish between people who are known to have different levels of the construct?

Conducting research to test hypotheses about the construct: Do test scores change, following an experimental manipulation, in the direction predicted by the theory underlying the construct?

Page 44: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Assessing the test's convergent and discriminant validity: Does the test have high correlations with measures of the same trait (convergent validity) and low correlations with measures of unrelated traits (discriminant validity)?

Assessing the test's factorial validity: Does the test have the factorial composition it would be expected to have?

Page 45: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Construct validity is said to be the most theory-laden of the methods of test validation. The developer of a test designed to measure a construct begins with a theory about the nature of the construct, which then guides the test developer in selecting test items and in choosing the methods for establishing the test's validity. For example, if the developer of a creativity test believes that creativity is unrelated to general intelligence, that creativity is an innate characteristic that cannot be learned, and that creative people can be expected to generate more alternative solutions to certain types of problems than non-creative people, she would want to determine the correlation between scores on the creativity test and a measure of intelligence, see if a course in creativity affects test scores, and find out if test scores distinguish between people who differ in the number of solutions they generate to relevant problems.

Page 46: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Note that some experts consider construct validity to be the most basic form of validity because the techniques involved in establishing construct validity overlap those used to determine if a test has content or criterion-related validity. Indeed, Cronbach argues that "all validation is one, and in a sense all is construct validation."

Page 47: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Construct Validity

1. Convergent and Discriminant Validity: As noted above, one way to assess a test's construct validity is to correlate test scores with scores on measures that do and do not purport to assess the same trait. High correlations with measures of the same trait provide evidence of the test's convergent validity, while low correlations with measures of unrelated characteristics provide evidence of the test's discriminant (divergent) validity.

Page 48: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

The multitrait-multimethod matrix (Campbell & Fiske, 1959) is used to systematically organize the data collected when assessing a test's convergent and discriminant validity. The multitrait-multimethod matrix is a table of correlation coefficients, and, as its name suggests, it provides information about the degree of association between two or more traits that have each been assessed using two or more methods. When the correlations between different methods measuring the same trait are larger than the correlations between the same and different methods measuring different traits, the matrix provides evidence of the test's convergent and discriminant validity.
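A sketch of how such a matrix might be assembled from raw scores with pandas (all data and column names below are hypothetical; two unrelated latent traits are each measured by two methods):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 200
assertiveness = rng.normal(size=n)   # latent trait A
aggressiveness = rng.normal(size=n)  # latent trait B, unrelated to A

scores = pd.DataFrame({
    "A1": assertiveness + rng.normal(scale=0.5, size=n),   # assertiveness test
    "B1": aggressiveness + rng.normal(scale=0.5, size=n),  # aggressiveness test
    "A2": assertiveness + rng.normal(scale=0.7, size=n),   # assertiveness rating
    "B2": aggressiveness + rng.normal(scale=0.7, size=n),  # aggressiveness rating
})

mtmm = scores.corr()  # the multitrait-multimethod matrix
print(mtmm.round(2))
# Expect large monotrait-heteromethod entries (e.g., A1-A2) and small
# heterotrait entries (e.g., A1-B1, A1-B2): the convergent/discriminant pattern.
```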

Page 49: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Example: To assess the construct validity of the interpersonal assertiveness test, the psychologist in Study #3 administers four measures to a group of salespeople: (1) the test of interpersonal assertiveness; (2) a supervisor's rating of interpersonal assertiveness; (3) a test of aggressiveness; and (4) a supervisor's rating of aggressiveness. The psychologist has the minimum data needed to construct a multitrait-multimethod matrix: She has measured two traits that she believes are unrelated (assertiveness and aggressiveness), and each trait has been measured by two different methods (a test and a supervisor's rating). The psychologist calculates correlation coefficients for all possible pairs of scores on the four measures and constructs the following multitrait-multimethod matrix (the upper half of the table has not been filled in because it would simply duplicate the correlations in the lower half):

Page 50: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

                           A1              B1              A2              B2
                           Assertiveness   Aggressiveness  Assertiveness   Aggressiveness
                           Test            Test            Rating          Rating

A1  Assertiveness Test     (.93)
B1  Aggressiveness Test     .13            (.88)
A2  Assertiveness Rating    .71             .03            (.76)
B2  Aggressiveness Rating   .04             .71             .16            (.70)

Page 51: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

All multitrait-multimethod matrices contain four types of correlation coefficients:

Monotrait-monomethod coefficients ("same trait-same method")

Monotrait-heteromethod coefficients ("same trait-different methods")

Heterotrait-monomethod coefficients ("different traits-same method")

Heterotrait-heteromethod coefficients ("different traits-different methods")

Page 52: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

1. Monotrait-monomethod coefficients ("same trait-same method"): The monotrait-monomethod coefficients (coefficients in parentheses in the above matrix) are reliability coefficients: They indicate the correlation between a measure and itself. Although these coefficients are not directly relevant to a test's convergent and discriminant validity, they should be large in order for the matrix to provide useful information.

Page 53: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

2. Monotrait-heteromethod coefficients ("same trait-different methods"): These coefficients (coefficients in rectangles) indicate the correlation between different measures of the same trait. When these coefficients are large, they provide evidence of convergent validity.

Page 54: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

3. Heterotrait-monomethod coefficients ("different traits-same method"): These coefficients (coefficients in ellipses) show the correlation between different traits that have been measured by the same method. When the heterotrait-monomethod coefficients are small, this indicates that a test has discriminant validity.

Page 55: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

4. Heterotrait-heteromethod coefficients ("different traits-different methods"): The heterotrait-heteromethod coefficients (underlined coefficients) indicate the correlation between different traits that have been measured by different methods. These coefficients also provide evidence of discriminant validity when they are small.

Page 56: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Note that, in a multitrait-multimethod matrix, only those correlation coefficients that include the test that is being validated are actually of interest. For the above example, the correlation between the rating of interpersonal assertiveness and the rating of aggressiveness (r = .16) is a heterotrait-monomethod coefficient, but it isn't of interest because it doesn't provide information about the interpersonal assertiveness test. Also, the number of correlation coefficients that can provide evidence of convergent and discriminant validity depends on the number of measures included in the matrix. In the example, only four measures were included (the minimum number), but there could certainly have been more.

Page 57: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Example: Three of the correlations in the above multitrait-multimethod matrix are relevant to the construct validity of the interpersonal assertiveness test. The correlation between the assertiveness test and the assertiveness rating (monotrait-heteromethod coefficient) is .71. Since this is a relatively high correlation, it suggests that the test has convergent validity. The correlation between the assertiveness test and the aggressiveness test (heterotrait-monomethod coefficient) is .13 and the correlation between the assertiveness test and the aggressiveness rating (heterotrait-heteromethod coefficient) is .04. Because these two correlations are low, they confirm that the assertiveness test has discriminant validity. This pattern of correlation coefficients confirms that the assertiveness test has construct validity. Note that the monotrait-monomethod coefficient for the assertiveness test is .93, which indicates that the test also has adequate reliability. (The other correlations in the matrix are not relevant to the psychologist's validation study because they do not include the assertiveness test.)

Page 58: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

                           A1              B1              A2              B2
                           Assertiveness   Aggressiveness  Assertiveness   Aggressiveness
                           Test            Test            Rating          Rating

A1  Assertiveness Test     (.93)
B1  Aggressiveness Test     .13            (.91)
A2  Assertiveness Rating    .71             .09            (.86)
B2  Aggressiveness Rating   .04             .68             .16            (.89)

Page 59: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

RELIABILITY AND VALIDITY

Construct Validity

2. Factor Analysis: Factor analysis is used for several reasons, including identifying the minimum number of common factors required to account for the intercorrelations among a set of tests or test items, evaluating a test's internal consistency, and assessing a test's construct validity. When factor analysis is used for the latter purpose, a test is considered to have construct (factorial) validity when it correlates highly only with the factor(s) it would be expected to correlate with.

Page 60: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

DESCRIPTIVE STATISTICS

Descriptive Statistics 

Descriptive statistics are used to describe or summarize a distribution (set) of data. Descriptive techniques include: tables, graphs, measures of central tendency, and measures of variability.

Page 61: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

DESCRIPTIVE STATISTICS

A set of data can be organized in a table known as a frequency distribution. Frequency distributions are constructed by summarizing the data in terms of the number (frequency) of observations in each category, score, or score interval. In Study #1, the academic achievement test scores of 25 children with ADHD could be summarized as shown in Table 1. The column labeled "Frequency (f)" indicates the number of observations in each score interval: Three of the 25 children received a score between 80 and 100, while five received a score between 60 and 79.

Page 62: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

DESCRIPTIVE STATISTICS

Table 1 also includes a "Cumulative Frequency (cf)" column. The cumulative frequencies indicate the total number of observations that fall at or below each category or score. The information in Table 1 indicates that 2 of the 25 children received scores of 19 or below, 5 received scores of 39 or below, and so on.

Table 1

Score     Frequency (f)   Cumulative Frequency (cf)
80-100    3               25
60-79     5               22
40-59     12              17
20-39     3               5
0-19      2               2
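A sketch of how such a frequency distribution could be tabulated in code (the scores are hypothetical; the bins mirror Table 1):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
scores = rng.integers(0, 101, size=25)  # hypothetical scores for 25 children

bins = [0, 20, 40, 60, 80, 101]
labels = ["0-19", "20-39", "40-59", "60-79", "80-100"]
intervals = pd.cut(scores, bins=bins, labels=labels, right=False)

freq = pd.Series(intervals).value_counts().sort_index()  # frequency (f)
table = pd.DataFrame({"f": freq, "cf": freq.cumsum()})   # cumulative frequency (cf)
print(table)
```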

Page 63: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

DESCRIPTIVE STATISTICS

The information presented in a table can also be presented in a graph. Bar graphs, histograms, and frequency polygons are three types of graphs. The choice of a graph depends on the scale of measurement: Bar graphs are used when the data represent a nominal or ordinal scale, while histograms and frequency polygons are used with interval or ratio data.

Page 64: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

DESCRIPTIVE STATISTICS

Shapes of distributions

Normal curve: the mean, mode, and median fall on the same point.

Leptokurtic distribution: more peaked than the normal curve.

Platykurtic distribution: flatter than the normal curve.

Positively skewed distribution: the tail extends toward the positive side of the distribution (i.e., most of the scores fall on the negative side); mode < median < mean.

Negatively skewed distribution: the opposite of the positively skewed distribution; mean < median < mode.

Page 65: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

DESCRIPTIVE STATISTICS

Measures of central tendency

Mean: the arithmetic average.
Mode: the most frequently occurring score(s).
Median: the middle score.

Page 66: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

DESCRIPTIVE STATISTICS

Measures of variability

Range: maximum score minus minimum score.

Variance (mean square): S² = SS/(N − 1) = Σ(X − M)²/(N − 1). The denominator is N − 1 for the sample variance because the sample variance otherwise tends to underestimate the population variance: once the sample mean is fixed, only N − 1 scores are free to vary.

Standard deviation: the square root of the variance.

Normal distribution: M ± 1 SD contains 68.26% of scores; M ± 2 SD, 95.44%; M ± 3 SD, 99.72%.
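A sketch of these computations on hypothetical data, confirming that the N − 1 denominator matches numpy's ddof=1:

```python
import numpy as np

scores = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical sample

mean = scores.mean()               # 5.0
ss = ((scores - mean) ** 2).sum()  # sum of squares, SS = 32.0
variance = ss / (len(scores) - 1)  # N - 1 denominator: ~4.57
sd = variance ** 0.5               # ~2.14

assert np.isclose(variance, scores.var(ddof=1))   # ddof=1 gives the N - 1 denominator
print(scores.max() - scores.min(), variance, sd)  # range, variance, SD
```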

Page 67: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

DESCRIPTIVE STATISTICS

Effect of mathematical operations on measures of central tendency and variability: adding or subtracting a constant to every score changes the measures of central tendency but not the variability; multiplying or dividing every score by a constant changes both central tendency and variability.
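A quick numerical illustration of both rules (hypothetical scores):

```python
import numpy as np

scores = np.array([10.0, 20.0, 30.0])  # hypothetical scores

print(scores.mean(), scores.std())              # 20.0, ~8.16
print((scores + 5).mean(), (scores + 5).std())  # mean shifts to 25.0; SD unchanged
print((scores * 2).mean(), (scores * 2).std())  # mean and SD both double
```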

Page 68: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

Inferential Statistics 

While descriptive statistics are used to summarize data, inferential statistics are used to make inferences about a population based on data collected from a sample drawn from that population and to do so with a pre-defined degree of confidence. In this section, the concept of statistical inference is explained. In Section IV, specific inferential statistical tests are described.

Page 69: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

The Logic of Statistical Inference 

The techniques of statistical inference allow an investigator to make inferences about the relationships between variables in a population based on relationships observed in a sample.

Page 70: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

For example, the psychologist in Study # 1 will want to determine if there is a relationship between training in the self-control procedure and scores on an academic achievement test for all children who have received a diagnosis of ADHD. Since the psychologist won't have access to the entire population of children with this disorder, he will evaluate the effects of the self-control procedure on a sample of children drawn from the target population. The psychologist will then use an inferential statistical test to analyze the data he collects from the sample, and results of the test will enable him to make an inference about the effects of the procedure on the achievement test scores for the population of children with ADHD. Inferential statistical tests accomplish this task through the use of a sampling distribution.

Page 71: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

Sampling Distributions

1. Population Parameters and Sample Statistics: To understand inferential statistics, it is necessary to first distinguish between sample values and population values. As noted above, when conducting a research study, an investigator does not have access to the entire population of interest but, instead, estimates population values based on obtained sample values. In other words, an investigator uses a sample statistic to estimate a population parameter. Sample statistics and population parameters are designated with different symbols:

Page 72: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

Value                 Population Parameter    Sample Statistic
Mean                  μ                       M
Standard Deviation    σ                       SD
Variance              σ²                      S²

Page 73: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

2. Characteristics of Sampling Distributions: Due to the effects of random (chance) factors, it is unlikely that any sample will perfectly represent the population from which it was drawn. As a result, an estimate of a population parameter from a sample statistic is always subject to some inaccuracy. Because of the effects of sampling error, sample statistics deviate from population parameters and from statistics obtained from other samples drawn from the same population.

Page 74: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

The relationship between sample statistics and a population parameter can be described in terms of a sampling distribution, which is a frequency distribution of the means or other sample values of a very large number of equal-sized samples that have been randomly selected from the population. Keep in mind that a sampling distribution is not a distribution of individual scores but a distribution of sample statistics. A sampling distribution is important in inferential statistics because it allows a researcher to determine the probability that a sample having a particular mean or other value could have been drawn from a population with a known parameter.

Page 75: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

To better understand what a sampling distribution is, assume that the psychologist in Study # 1 defines his population as "all children in the 6th grade who have received a diagnosis of ADHD," and, for that population, an academic achievement test has a mean of 50 and a standard deviation of 10. The psychologist repeatedly selects random samples of 25 children from this population; and, for each sample he administers the achievement test and calculates the mean score. The psychologist has collected a set of sample means and finds that, while some of the sample means are equal to the population mean (50), because of the effects of sampling error, some means are larger than the population mean and some are smaller. In fact, the psychologist finds that his distribution of sample means, or sampling distribution of the mean, resembles the distribution depicted in Figure 7. As shown in that figure, the sampling distribution of the mean is normally shaped and its mean is equal to the population mean of 50.

Page 76: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

Researchers do not actually construct a sampling distribution of the mean by obtaining a large number of samples and calculating each sample's mean. Instead, they depend on probability theory to tell them what a sampling distribution would look like. The sampling distribution defined by probability theory is called a theoretical sampling distribution, and it is based on the assumption that an infinite number of equal-sized samples have been randomly drawn from the same population.

Page 77: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

The characteristics of a sampling distribution of the mean are specified by the Central Limit Theorem, which makes the following predictions: (a) Regardless of the shape of the distribution of individual scores in the population, as the sample size increases, the sampling distribution of the mean approaches a normal distribution; (b) The mean of the sampling distribution of the mean is equal to the population mean; (c) The standard deviation of the sampling distribution of the mean is equal to the population standard deviation divided by the square root of the sample size:

 

Formula 4: Standard Error of the Mean

SEM = σ / √N

Page 78: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

The standard deviation of a sampling distribution of the mean is known as the standard error of the mean. It provides an estimate of the extent to which the mean of any one sample randomly drawn from a population can be expected to vary from the population mean as the result of sampling error. In other words, like other standard deviations, it is a measure of variability, but it is a measure of variability that is due to the effects of random error. The formula for SEM indicates that the size of the standard error of the mean is affected by the population standard deviation and the sample size (N): The larger the population standard deviation and the smaller the sample size, the larger the standard error and vice versa.

Page 79: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

For the above example, the population standard deviation for the achievement test is 10 and the sample size is 25. Using Formula 4, we can determine that the standard error of the mean in this situation is equal to 2:

SEM = 10 / √25 = 10 / 5 = 2

 For Study # 1, the Central Limit Theorem predicts that the sampling distribution of the mean is normally shaped, that its mean is equal to 50, and that its standard deviation is equal to 2.

Page 80: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

Note that, if the sample size had been 9 instead of 25, the standard error would increase to 3.33 (10 divided by the square root of 9 = 10/3 = 3.33). In other words, the smaller the sample size, the larger the standard error of the mean. One implication of this is that the smaller the size of the sample, the greater the probability for error when using a sample statistic to estimate a population parameter. Another implication is that, for any given population, there is a ''family'' of sampling distributions, with a different distribution for each sample size.
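A sketch of the standard-error computation for the two sample sizes discussed above (the function name is illustrative):

```python
import math

def standard_error_of_mean(sigma, n):
    """SEM = population standard deviation / square root of the sample size."""
    return sigma / math.sqrt(n)

print(standard_error_of_mean(10, 25))  # 2.0, as in Study #1
print(standard_error_of_mean(10, 9))   # ~3.33: smaller samples, larger standard errors
```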

Page 81: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

Although this discussion of sampling distributions has focused on the sampling distribution of the mean, a sampling distribution can actually be derived for any sample statistic. A sampling distribution can be obtained for standard deviations, proportions, correlation coefficients, the difference between means, and so on. In each case, the basic characteristics of the sampling distribution are similar to those of the sampling distribution of the mean.

 The sampling distribution is the foundation of inferential statistics. It is the sampling distribution that enables a researcher to make inferences about the relationships between variables in the population based on obtained sample data. How this is done is described in the next section.

Page 82: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

Analyzing the Data and Making a Decision: After stating the null and alternative hypotheses and collecting the sample data, an investigator analyzes the data using an inferential statistical test such as the t-test or analysis of variance. The choice of a statistical test is based on several factors including the scale of measurement of the data to be analyzed. The inferential statistical test yields a t, an F, or other value that indicates where the obtained sample statistic falls in the appropriate sampling distribution. That is, the test indicates whether the statistic is in the rejection region or the retention region of the sampling distribution:

Page 83: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

The rejection region, or "region of unlikely values," lies in one or both tails of the sampling distribution and contains the sample values that are most unlikely to occur simply as the result of sampling error. (The rejection region is also known as the critical region.)

 The retention region, or "region of likely values," lies in the central portion of the sampling distribution and consists of the values that are likely to occur as a consequence of sampling error only.
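To make the decision rule concrete, here is a hypothetical sketch in Python using scipy (the data and the particular test are invented for illustration; the slides do not specify an implementation):

import numpy as np
from scipy import stats

scores = np.array([52, 61, 57, 49, 66, 58, 63, 55, 60, 54])  # hypothetical sample
t_stat, p_value = stats.ttest_1samp(scores, popmean=50)      # H0: mu = 50

alpha = 0.05
# p < alpha is equivalent to the obtained t falling in the rejection region
if p_value < alpha:
    print(f"t = {t_stat:.2f}, p = {p_value:.4f} -> rejection region: reject H0")
else:
    print(f"t = {t_stat:.2f}, p = {p_value:.4f} -> retention region: retain H0")

For these made-up data the obtained t is large (about 4.6), so the statistic falls in the rejection region and the null hypothesis would be rejected.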

Page 84: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

When the results of the statistical test indicate that the obtained sample statistic is in the rejection region of the sampling distribution, the null hypothesis is rejected and the alternative hypothesis is retained. The investigator concludes that the sample statistic is not likely to have been obtained by chance alone and that the independent variable has had an effect on the dependent variable. Conversely, if the statistical test indicates that the sample statistic lies in the retention region of the sampling distribution, the null hypothesis is retained and the alternative hypothesis is rejected. In this case, the investigator concludes that the independent variable has not had an effect and that any observed effect is due to error.

Page 85: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

Example 

In Study # 1, if the children who receive training in the self-control procedure obtain a mean of 60 on the achievement test following training, the psychologist would use an inferential statistical test to determine whether the mean of 60 is due to error or to the procedure. If the results of the test indicate that a mean of 60 is in the retention region of the appropriate sampling distribution, the psychologist will conclude that the procedure does not have an effect and that the observed effect simply reflects error. Conversely, if the statistical test indicates that a mean of 60 is in the rejection region, the psychologist will conclude that the self-control procedure does, in fact, have a beneficial effect on achievement test scores.
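Because the population standard deviation is known in this example, the logic can be shown with a simple z computation (a reconstruction; this step is not shown on the original slides):

z = (M − μ)/σM = (60 − 50)/2 = 5.0

A z of 5.0 lies far beyond the two-tailed .05 cutoffs of ±1.96, so for these numbers a mean of 60 falls in the rejection region.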

Page 86: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

Alpha: The size of the rejection region is defined by alpha (α), or the level of significance. If alpha is .05, then 5% of the sampling distribution represents the rejection region and the remaining 95% represents the retention region. The rejection region is always placed in one or both tails of the sampling distribution; that is, in the portion of the sampling distribution that contains the values least likely to occur as the result of sampling error alone. The value of alpha is set by the experimenter prior to collecting and/or analyzing the data. In other words, it is the experimenter who decides what proportion of the sampling distribution will represent the region of unlikely values. In psychological research, alpha is commonly set at .05 or .01.

Page 87: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

When the results of an inferential statistical test indicate that the obtained sample statistic lies in the rejection region of the sampling distribution, the study's results are said to be statistically significant. For example, when alpha has been set at .05 and the statistical test indicates that the sample value is in the rejection region, the results of the study are "significant at the .05 level."

Page 88: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

One- versus Two-Tailed Tests: Some inferential statistical tests can be conducted as either a one- or a two-tailed test. When a two-tailed test is used, the rejection region is equally divided between the two tails of the sampling distribution. If alpha is set at .05, 2.5% of the rejection region lies in the positive tail of the distribution and 2.5% lies in the negative tail. With a one-tailed test, the entire rejection region is placed in only one of the tails. The original slide illustrates this division for alpha = .05 in a figure; the sketch below reproduces the key cutoff values.
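A minimal scipy sketch (assuming a normal sampling distribution; not from the original slides) recovers the cutoffs the figure would have shown:

from scipy import stats

alpha = 0.05
lower, upper = stats.norm.ppf([alpha / 2, 1 - alpha / 2])  # two-tailed cutoffs
one_tailed = stats.norm.ppf(1 - alpha)                     # upper one-tailed cutoff

print(f"two-tailed: reject if z < {lower:.3f} or z > {upper:.3f}")
print(f"one-tailed (upper): reject if z > {one_tailed:.3f}")

With alpha = .05, the two-tailed cutoffs are ±1.96, while the single one-tailed cutoff is 1.645.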

Page 89: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

It is the alternative hypothesis that determines whether a one- or a two-tailed test should be conducted. A two-tailed test is used when the alternative hypothesis is nondirectional, while a one-tailed test is used when the alternative hypothesis is directional. If a directional alternative hypothesis predicts that the sample statistic will be greater than the value specified in the null hypothesis, the entire rejection region lies in the positive tail of the sampling distribution. If a directional alternative hypothesis predicts that the sample statistic will be less than the value specified in the null hypothesis, the rejection region is located in the negative tail.

Page 90: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

Decide, on the basis of the results of the statistical test, whether to retain or reject the statistical hypotheses.

Page 91: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

Decision Outcomes: Regardless of whether an experimenter decides to retain or reject the null hypothesis, there are two possible outcomes of his or her decision: The decision can be either correct or in error, and an experimenter can never be entirely certain which type of decision has been made.

Page 92: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

Decision Errors: There are two decision errors, a Type I error and a Type II error. A Type I error occurs when an investigator rejects a true null hypothesis. For example, if the psychologist in Study # 1 concludes that the self-control procedure increases achievement test scores but the apparent improvement in scores is actually a consequence of sampling error, the psychologist has made a Type I error. (Keep in mind that it is unlikely -- but not impossible -- to obtain a sample value in the rejection region of the sampling distribution as the result of chance alone.)

Page 93: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

The probability of making a Type I error is equal to alpha. As the value of alpha increases, the probability of rejecting a true null hypothesis also increases. Increasing the value of alpha from .01 to .05, for example, increases the probability of making a Type I error from 1 chance in 100 to 1 chance in 20. Because an investigator sets the value of alpha, he or she has control over the probability of making a Type I error. Note that the chance of making a Type I error is affected by other factors. It may be increased, for example, when the sample size is small or when observations are dependent. The other decision error, a Type II error, occurs when an investigator retains a false null hypothesis. In Study # 1, if the psychologist concludes that the self-control procedure does not improve achievement test scores when it actually does, the psychologist has made a Type II error.

Page 94: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

The probability of making a Type II error is equal to beta (β). Although the value of beta is not set by an investigator and cannot be directly calculated for a particular study, the probability of making a Type II error can be indirectly influenced: A Type II error is more likely when the value of alpha is low, when the sample size is small, and when the independent variable is not administered in sufficient intensity.

Page 95: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

There is an inverse relationship between Type I and Type II errors: As the probability of making a Type I error increases, the probability of making a Type II error decreases and vice versa. Consequently, the selection of a level of significance depends, in part, on the seriousness of making these two errors. For some research, a Type I error (rejecting a true null hypothesis) is considered more problematic. In these situations, the experimenter will choose a level of significance that minimizes the probability of making a Type I error (e.g., .01 rather than .05). In other situations, it is more important to avoid making a Type II error (retaining a false null hypothesis). When this is the case, a larger level of significance is preferred (.10 or .05 rather than .01).

Page 96: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

Correct decisions: There are also two possible correct decisions. An investigator can make a correct decision by retaining the null hypothesis when it is true or by rejecting the null hypothesis when it is false.

 It is the second type of correct decision that an experimenter ordinarily wants to make. When a statistical test enables an experimenter to reject a false null hypothesis, the test is said to have statistical power. Obviously, researchers want to maximize power whenever they conduct a research study.
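In symbols (standard notation; not shown on the original slides):

Power = P(reject H0 | H0 is false) = 1 − β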

Page 97: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

Methods to Maximize Power

Increasing alpha: A null hypothesis (true or false) is more likely to be rejected when alpha is .05 than when it is .01.

Increasing sample size: A correct decision is more likely to be made when the sample size is 50 than when it is 25. The effects of increasing sample size on power are greatest when the sample is small. (When there are 100 or more subjects in each group, adding more subjects does not have a substantial impact on power.)
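The effect of sample size on power can be checked by simulation. The following Python sketch is purely illustrative: the two-group scenario, the group means, and the effect size are invented, and nothing like it appears on the original slides:

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

def estimated_power(n_per_group, alpha, effect=5.0, sd=10.0, n_sims=2000):
    # Fraction of simulated two-group studies that reject H0 when the
    # treatment truly raises the mean by `effect` points.
    rejections = 0
    for _ in range(n_sims):
        control = rng.normal(50.0, sd, n_per_group)
        treated = rng.normal(50.0 + effect, sd, n_per_group)
        if stats.ttest_ind(treated, control).pvalue < alpha:
            rejections += 1
    return rejections / n_sims

for n in (25, 50, 100):
    print(f"n per group = {n:3d}, alpha = .05 -> power ~ {estimated_power(n, 0.05):.2f}")

Estimated power climbs from roughly .4 with 25 subjects per group to over .9 with 100 per group and, as noted above, further increases beyond 100 per group buy relatively little.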

Page 98: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

Increasing the effect size: Maximizing the effects of the IV increases the likelihood that the effects will be detected. The effects of the IV are maximized by administering the IV for a long enough period of time or in sufficient intensity.

Minimizing error: When potential sources of systematic and random error are controlled, it is easier to detect the effects of the independent variable. One way to reduce error is to make sure the DV measure is reliable. Another way is to reduce within-group variability by controlling extraneous variables or by using a within-subjects design.

Page 99: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

Using a one-tailed test when appropriate: A one-tailed test is more powerful than a two-tailed test as long as it is appropriately used.

Using a parametric test: Parametric statistical tests, such as the t-test or ANOVA, are generally more powerful than their nonparametric alternatives when their assumptions are met.

Page 100: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

Note that "power" is not the same as "confidence." Power refers to the ability to reject a false null hypothesis and, as noted above, is affected by the size of alpha: Power increases as alpha increases and vice versa. Statistical power is something a researcher is concerned about before a decision about the null hypothesis is made. Confidence refers to the certainty a researcher has about the decision he or she has already made about the null hypothesis. An experimenter has more confidence that his or her decision to reject the null hypothesis was correct when alpha is small (e.g., .01 rather than .05).

Page 101: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

Exercise A 

When a researcher makes the decision to retain or reject the (1) ________ hypothesis, there is no way to know with certainty if the decision is correct or in error. There are two kinds of decision errors. Type I error is made when a true null hypothesis is (2) ________ . This occurs when a researcher concludes that an independent variable has had an effect on the dependent variable, but the observed effect was actually due to (3) ________ . The probability of making a Type I error is equal to (4) ________ . For example, when alpha is set at .05 and the researcher has rejected the null hypothesis, there is a (6) ________ % chance that a Type I error has been made.

Page 102: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

A Type II error is made when a false null hypothesis is (7) ________ . This occurs when the researcher decides that an independent variable has no effect on the dependent variable when it actually does. A Type II error might occur when the (8) ________ variable was not administered in sufficient intensity or for a long enough period of time, when the sample size was too (9) ________ , or when alpha is too (10) ________ .

Page 103: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

A researcher can, of course, make a correct decision. One kind of correct decision is to (11) ________ a true null hypothesis. In this situation, the researcher correctly concludes that any observed effect of an IV is actually due to (12) ________ . The other correct decision is to (13) ________ a false null hypothesis. In this situation, the researcher correctly concludes that the (14) ________ has an effect on the DV. When a statistical test enables a researcher to make this kind of correct decision, the test is said to have power.

Page 104: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

Power is increased as alpha (15) ________ , as the sample size (16) ________ , and as the magnitude of effect of the (17) ________ increases. Power is also maximized when a (18) ________ –tailed test is used (if appropriate) and when the data are analyzed using a t-test, ANOVA, or other (19) ________ statistical test.

Page 105: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

INFERENTIAL STATISTICS

Exercise B 

A researcher compared the number of cavities of children who had used either Toothpaste brand X or Toothpaste brand Y for a year. At the end of the year, the researcher found that the children who had used brand X had significantly fewer cavities than the children who had used brand Y. The difference was significant at the .05 level.

1. What is the null hypothesis?
2. What is the research hypothesis?
3. What would be the Type I error?
4. What would be the Type II error?
5. What is the probability of a Type I error?


Page 106: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

Non-Experimental Research

Non-experimental Research 

Non-experimental (descriptive) research is conducted primarily to collect data about variables rather than to test hypotheses about the relationships between them. In other words, a non-experimental study is conducted to describe "how things are." Observational studies, archival research, correlational research, case histories and case studies, and surveys are ordinarily non-experimental.

Page 107: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

Non-Experimental Research

Non-experimental Research 

Observational studies
Archival research
Correlational research
Case histories and case studies
Surveys

Page 108: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

Non-Experimental Research

Observational Studies

Observational studies involve observing behavior in a systematic way, often in a naturalistic setting. Naturalistic field studies and participant observation are examples of observational studies. An important decision that must be made before conducting an observational study is how to record the behavior of interest. One method is to obtain a narrative record of the behavior as it actually occurred, with the record taking the form of a detailed written description or an audio and/or visual recording.

Page 109: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

Non-Experimental Research

Content analysis, which involves organizing the data into categories, can then be used to summarize and interpret the information contained in the narrative record.

Protocol analysis can be viewed as a type of content analysis. It is used by psychologists interested in the cognitive processes ("heeded cognitions") that underlie problem-solving and other complex tasks and involves asking a subject to "think aloud" while solving a problem. The subject's verbalizations are recorded, and the protocol (record) is later coded in terms of relevant categories such as intentions, cognitions, planning, and evaluations. To obtain valid information when conducting a content analysis, the coding (behavioral) categories must be clearly defined, exhaustive, and mutually exclusive.

Page 110: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

Non-Experimental Research

An alternative to obtaining a complete record of a behavior is to look at specific aspects of it by employing a systematic method for sampling and recording the frequency or duration of the behavior and/or rating the behavior in terms of its qualitative characteristics. Methods of behavioral sampling include interval recording and event sampling.


Page 111: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

Non-Experimental Research

Interval recording, a type of time sampling, involves observing a behavior for a period of time that has been divided into equal intervals (e.g., a 30-minute period that has been divided into 15-second intervals) and recording whether or not the behavior occurred during each interval. Interval recording is especially useful for studying complex interactions and behaviors that have no clear beginning or end, such as laughing, talking, or playing.

Event sampling (recording) entails observing a behavior each time it occurs. This technique is good for studying behaviors that occur infrequently, that have a long duration, or that leave a permanent record or other product (e.g., a completed worksheet or test).
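As a toy sketch of the two recording schemes (hypothetical data structures; not from the original slides):

# Interval recording: a 30-minute observation split into 15-second
# intervals; each entry marks whether the behavior occurred at all
# during that interval.
n_intervals = (30 * 60) // 15    # 120 intervals
occurred = [False] * n_intervals
occurred[3] = True               # behavior observed in the 4th interval

# Event recording: log each occurrence as it happens,
# e.g., as (start_second, duration_in_seconds) pairs.
events = [(42, 5), (310, 12)]    # two hypothetical occurrences

pct = 100 * sum(occurred) / n_intervals
print(f"behavior occurred in {pct:.1f}% of intervals")
print(f"{len(events)} events, {sum(d for _, d in events)} s total duration")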


Page 112: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

Non-Experimental Research

 Situational sampling is an alternative to behavioral sampling and is used when the goal of the study is to observe a behavior in a number of settings. Situational sampling helps increase the generalizability of a study's findings.

Sequential analysis entails coding behavioral sequences rather than isolated behavioral events and is used to study complex social behaviors (Bakeman and Gottman, 1989).

Page 113: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

Non-Experimental Research

  Case Studies

Although a case study is most often associated with an in-depth description and analysis of a single person, it can also entail an intensive investigation of a single institution, agency, community, or other social unit. Ethnographic research, which focuses on a single culture, is an example of a case study that includes more than one person. Two shortcomings of case studies are that (a) their results usually do not allow an investigator to draw conclusions about the exact nature of the relationships between variables (e.g., to determine if status on one variable causes status on another variable) and (b) the information derived from a case study might not be generalizable to other cases. Because of these limitations, case studies are most useful for investigating rare phenomena and for serving as exploratory studies that identify independent and dependent variables and generate hypotheses about the relationships between them that can be more systematically investigated in the future.

Page 114: Validity/Reliability and a recap of Statistics RCS 6740 6/27/05

Non-Experimental Research

Surveys

A survey involves administering a questionnaire in person, by phone, or through the mail. A serious problem with mail surveys is their susceptibility to nonresponse biases, which occur when the people who fail to send back the survey differ in important ways from those who return it. In general, the lower the overall response rate, the more likely it is that the survey's results will be affected by nonresponse biases. Several techniques are useful for increasing the number of responses to mail surveys and for reducing biases. Probably the most effective method is to follow up the initial contact with one or more mailings (three follow-ups seem optimal). Other useful methods include a cover letter that provides relevant information (e.g., information about the sponsoring agency and the purpose of the study); a small reward (larger rewards have not been found to be more effective than smaller ones); and pre-contacting individuals by phone about the questionnaire.