Reliability and Validity

Measurement Error
  Whatever measurement we make of a psychological construct, we make it with some amount of error
  Any observed score for an individual is their true score with error added in
  There are different types of error, but here we are concerned with a measure's inability to capture the true response for an individual
  Observed score = True score + Error of measurement
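The observed-score equation can be made concrete with a small simulation. This is only a sketch under assumed values: the score scale (mean 100, SD 15), the error SD of 7.5, and the function name simulate_scores are illustrative choices, not part of the original notes. It also uses the classical test theory definition of reliability as the share of observed-score variance that is true-score variance, which the slide implies but does not state.

```python
import random
import statistics


def simulate_scores(n, true_sd, error_sd, seed=0):
    """Return (true, observed) score lists under Observed = True + Error."""
    rng = random.Random(seed)
    true = [rng.gauss(100, true_sd) for _ in range(n)]
    # Error is drawn independently of the true score and averages zero
    observed = [t + rng.gauss(0, error_sd) for t in true]
    return true, observed


# Assumed values for illustration only
true, observed = simulate_scores(n=20000, true_sd=15, error_sd=7.5)

# Share of observed-score variance that is true-score variance
reliability = statistics.variance(true) / statistics.variance(observed)
print(round(reliability, 2))
```

With these assumed SDs the theoretical value is 225 / (225 + 56.25) = 0.8, so the printed estimate should land close to that. In real data the true scores are never seen, which is why reliability has to be estimated indirectly (test-retest, internal consistency).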

Reliability
  Reliability refers to a measure's ability to capture an individual's true score, i.e. to distinguish accurately one person from another
  While a reliable measure will be consistent, consistency can actually be seen as a by-product of reliability
  In a case where we had perfect consistency (everyone scores the same and gets the same score repeatedly), reliability coefficients could not be calculated
    No variance/covariance to give a correlation
  The error in our analyses is due to individual differences but also to the measure not being perfectly reliable

Reliability
  Criteria of reliability
    Test-retest
    Test components (internal consistency)
  Test-retest reliability
    Consistency of measurement for individuals over time
    Individuals score similarly e.g. today and 6 months from now
  Issues
    Memory
      If the two administrations are too close in time, the correlation between scores is due to memory of item responses rather than true score captured
    Chance covariation
      In a sample, any two variables will almost always have a non-zero correlation
    Reliability is not constant across subsets of a population
      General IQ scores: good reliability
      IQ scores for college students: less reliable
      Restriction of range, fewer individual differences
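The restriction-of-range point can be sketched with simulated test-retest data. Everything numeric here is an assumption for illustration: the IQ-like scale, the error SD, and the cutoff of 110 standing in for a "college student" subset. Selecting directly on true scores is also a simplification, since real selection happens on observed characteristics.

```python
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)
true = [rng.gauss(100, 15) for _ in range(20000)]
# Two administrations: same true score, fresh error each time
time1 = [t + rng.gauss(0, 7.5) for t in true]
time2 = [t + rng.gauss(0, 7.5) for t in true]

# Test-retest correlation in the full population
r_full = pearson_r(time1, time2)

# Hypothetical restricted subset: only people with true scores of 110 or more
idx = [i for i, t in enumerate(true) if t >= 110]
r_restricted = pearson_r([time1[i] for i in idx], [time2[i] for i in idx])

print(round(r_full, 2), round(r_restricted, 2))
```

The measurement error is identical in both groups; the restricted group simply has fewer individual differences for the measure to pick up, so the same instrument yields a noticeably lower test-retest correlation.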

Internal Consistency
  We can get a sort of average correlation among items to assess the reliability of some measure [1]
  As one would most likely intuitively assume, having more measures of something is better than having few
  Having more items which correlate with one another will increase the test's reliability
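The notes for this slide say Cronbach's alpha is a function of the number of items and the average correlation among them. A minimal sketch of that relationship follows, using the standardized form of alpha, plus the usual raw-score formula computed from item variances. The function names and the average correlation of 0.3 are illustrative assumptions.

```python
import statistics


def standardized_alpha(k, mean_r):
    """Standardized alpha for k items with average inter-item correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)


def cronbach_alpha(items):
    """Raw alpha. items is a list of per-item score lists,
    with the same respondents in the same order in each list."""
    k = len(items)
    sum_item_vars = sum(statistics.variance(scores) for scores in items)
    # Total score for each respondent across all items
    totals = [sum(responses) for responses in zip(*items)]
    return k / (k - 1) * (1 - sum_item_vars / statistics.variance(totals))


# Same average correlation, more items
for k in (5, 10, 20):
    print(k, round(standardized_alpha(k, 0.3), 2))
```

Holding the average inter-item correlation at 0.3, alpha rises from about 0.68 with 5 items to about 0.81 with 10 and about 0.90 with 20, which is the "more correlated items, more reliability" point in numbers.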

What's good reliability?
  While we have conventions, it really depends
  As mentioned, reliability of a measure may be different for different groups of people
  What we may need to do is compare reliability to that of measures which are already in place and deemed good, and get interval estimates to assess the uncertainty in our reliability estimate
  Note also that reliability estimates are biased upward and so are a bit optimistic
  Also, many of our techniques do not take into account the reliability of our measures, and poor reliability can result in lower statistical power, i.e. an increase in type II error
    Though technically increasing reliability can potentially also lower power [1]
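One way to see why poor reliability costs power is attenuation: in classical test theory, the correlation we expect to observe between two measures is the true correlation shrunk by the square root of the product of their reliabilities. That formula (Spearman's) is not stated in these notes, so treat the sketch below as a supplement under that assumption; the function names and input values are illustrative.

```python
import math


def attenuated_r(true_r, rel_x, rel_y):
    """Expected observed correlation given the true correlation
    and the reliabilities of the two measures."""
    return true_r * math.sqrt(rel_x * rel_y)


def disattenuated_r(observed_r, rel_x, rel_y):
    """Correction for attenuation: estimate of the true correlation
    from an observed one."""
    return observed_r / math.sqrt(rel_x * rel_y)


# A true correlation of .50 measured with two instruments of reliability .70
print(round(attenuated_r(0.5, 0.7, 0.7), 2))
```

With both reliabilities at .70, a true correlation of .50 is expected to show up as roughly .35. A smaller observed effect is harder to detect at a given sample size, hence the increase in type II error. This covers only the familiar direction of the relationship, not the caveat that increasing reliability can sometimes lower power.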

Replication and Reliability
  While reliability implies replicability, assessing reliability does not provide a probability of replication
  Note also that statistical significance is not a measure of reliability or replicability [1]
  Replication is perhaps not conducted as much as it should be in psychology, for a number of reasons
    Practical concerns, lack of publishing outlets etc.
  Furthermore, knowing our estimates are biased and variable themselves, we might even think that in many cases we would not expect consistent research findings
  In psychology, many people spend a lot of time debating back and forth about the merits of some theory, citing cases where it did or did not replicate
  However, the lack of replication could be due to low power, low reliability, problem data, incorrectly carrying out the experiment etc.
  In other words, the finding didn't repeat because of methodology, not because the theory was wrong

Factors affecting the utility of replications
  "You can't step in the same river twice!" (Heraclitus) [1]
  When
    Later replications do not provide as much information; however, they can contribute greatly to the overall assessment of an effect
      Meta-analysis
  How
    There is no perfect replication (different people involved, time it takes to conduct etc.)
    Doing exact replication gives us more confidence in the original finding (should it hold), but may not offer much in the way of generalization
    Example: doing a gender difference study at UNT over and over. Does it work for non-college folk? People outside of Texas?

Factors affecting the utility of replications
  By whom
    It is well known that those with a vested interest in some idea tend to find confirming evidence more than those that don't
    Replications by others are still being done by those with an interest in that research topic and so may have a precorrelation inherent in their attempt
      Direct: correlation of attributes of persons involved
      Indirect: correlation of data to be obtained
    Gist: we can't have truly independent replication attempts, but must strive to minimize bias
    The more independent replication attempts are, the more informative they will be

Validity
  Validity refers to the question of whether our measurements are actually hitting on the construct we think they are
  While we can obtain specific statistics for reliability (even different types), validity is more of a global assessment based on the evidence available
  We can have reliable measurements that are invalid
    Classic example: the scale which is consistent and able to distinguish one person from the next, but is actually off by 5 pounds

Validity Criteria in Psychological Testing
  Content validity
  Criterion validity
    Concurrent
    Predictive
  Construct-related validity
    Convergent
    Discriminant

Content validity
  Items represent the kinds of material (or content areas) they are supposed to represent
  Are the questions worth a flip, in the sense that they cover all domains of a given construct?
  E.g. job satisfaction = salary, relationship w/ boss, relationship w/ coworkers etc.

Validity Criteria in Psychological Testing
  Criterion validity
    The degree to which the measure correlates with various outcomes
    Does some new personality measure correlate with the Big 5?
    Concurrent
      Criterion is in the present
      Measure of ADHD and current scholastic behavioral problems
    Predictive
      Criterion is in the future
      SAT and college GPA

Validity Criteria in Psychological Testing
  Construct-related validity
    How much is it an actual measure of the construct of interest?
    Convergent
      Correlates well with other measures of the construct
      Depression scale correlates well with other depression scales
    Discriminant
      Is distinguished from related but distinct constructs
      Depression scale != Stress scale

Validity Criteria in Experimentation
  Statistical conclusion validity
    Is there a causal relationship between X and Y?
    Correlation is our starting point (i.e. correlation isn't causation, but it is where the case for causation starts)
    Related to this is the question of whether the study was sufficiently sensitive to pick up on the correlation
  Internal validity
    Has the study been conducted so as to rule out other effects which were controllable?
    Poor instruments, experimenter bias
  External validity
    Will the relationship be seen in other settings?
  Construct validity
    Same concerns as before
    Ex. Is reaction time an appropriate measure of learning?

Summary
  Reliability and validity are key concerns in psychological research
  Part of the problem in psychology is the lack of reliable measures of the things we are interested in [1]
  Assuming that they are valid to begin with, we must always press for more reliable measures if we are to progress scientifically
  This means letting go of supposed standards when they are no longer as useful and looking for ways to improve current ones

Notes
  [1] Internal Consistency: Cronbach's alpha is a function of the number of items and the average correlation among them.
  [1] What's good reliability?: We don't really want to go there, but you can see the related notes on the 5710 page. It's at the end of the Experimental design notes.
  [1] Replication and Reliability: Often you may see people use the term "statistically reliable" to mean "statistically significant". Please don't. It is grossly misleading terminology and, taken literally, usually wrong.
  [1] Factors affecting the utility of replications: His student Cratylus said that you couldn't step in the river even once.
  [1] Summary: Always, always, always find the reliability estimates for the measure you are about to use. The information is readily available in original and related articles (always go to the original though), Mental Measurement Yearbook etc. At least give yourself a chance to do a decent study by using reliable measures.
