
Assessing inconsistent responding in E and N measures: An application of person-fit analysis in personality




Personality and Individual Differences 52 (2012) 718–722



Pere J. Ferrando *
Research Centre for Behavioural Assessment (CRAMC), Universidad ‘Rovira i Virgili’, Facultad de Psicologia, Carretera Valls s/n, 43007 Tarragona, Spain

Article info

Article history:
Received 3 October 2011
Received in revised form 16 December 2011
Accepted 29 December 2011
Available online 26 January 2012

Keywords:
Person fit
Personality measurement
Item response theory
Eysenck personality questionnaires

0191-8869/$ - see front matter © 2012 Elsevier Ltd. All rights reserved.
doi:10.1016/j.paid.2011.12.036

* Tel./fax: +34977558079.
E-mail address: [email protected]

Abstract

Person-fit analysis assesses the fit of an IRT model at the level of each respondent. This type of assessment allows potentially inconsistent respondents to be detected, and is very relevant in both clinical assessment and validity studies. However, person fit is rarely assessed in IRT-based personality applications. This study assessed person fit in two datasets in which a normal-range personality measure – Neuroticism in one case and Extraversion in the other – had been administered under standard conditions. At the illustrative level, the study shows how person-fit analysis can be performed and how it can be useful in personality measurement. At the substantive level it assesses the frequency and sources of individual misfit. Idiosyncratic interpretation of groups of items, low person reliability, and deliberate sabotaging were identified as sources of misfit. Person-fit measures showed a moderate degree of temporal stability and a significant correlation with a Conscientiousness measure.


1. Introduction

In the framework of Item Response Theory (IRT) the process of test analysis is model-based: a falsifiable model for the item responses is selected and its appropriateness assessed. The results derived from its application can only be interpreted correctly when there is a close match between the chosen model and the test data.

In recent decades IRT models have been increasingly used in personality measurement (e.g. Reise & Waller, 2009) and virtually all the reported applications have assessed the degree of model-data fit, either at the global level or at the item level. However, the fit of an IRT model can also be assessed at the level of each respondent. This is usually known as person-fit analysis (Meijer & Sijtsma, 2001). In the parametric context adopted here, person-fit analysis assesses the extent to which the response pattern of an individual is consistent with the pattern that would be expected given the model and his/her estimated trait level. This assessment is highly relevant, for at least two reasons. First, in individual assessment valid interpretations and inferences based on an estimated trait level are only warranted if the response pattern of the individual is consistent. Second, in external validity analysis, the presence of a proportion of inconsistent respondents might distort the relations with relevant criteria (Schmitt, Chan, Sacco, McFarland, & Jennings, 1999). In this respect, it might be thought that if the IRT model fits well at the global level, then the proportion of non-fitting respondents cannot be high. While this is true, Levine and Drasgow (1983) showed that acceptable global fits can be obtained with up to 10% of inconsistent respondents.

Person-fit indicators are not intended to be used as strict statistical indices that automatically put aside inconsistent respondents. Rather, they are screening tools that aim to flag problematic response patterns (Meijer, 2003). Once a pattern has been detected as potentially inconsistent, further information must be obtained regarding the possible causes of the inconsistency. With this information, the practitioner must decide what to do with the pattern in each case.

From the discussion above, it appears that person-fit analysis should be widely used in personality applications of IRT. However, this is far from the case. There are some useful didactic and informative applications (e.g. Meijer, Egberink, Emons, & Sijtsma, 2008; Reise & Flannery, 1996; Reise & Waller, 1993) but not many. As Meijer et al. (2008) noted, the problem is that, so far, the person-fit literature has been mainly technical, and little attention has been paid to substantive issues. For normal-range personality traits measured under standard instructions, there are three main issues that clearly require far more research: (a) the frequency of person misfit, (b) its main sources, and (c) its status as a variable.

The sources of person misfit in personality have been the subject of more theoretical than empirical studies. Four broad categories have been described: (a) idiosyncratic interpretation of the item content, (b) unmotivated or unsympathetic test responding (sabotaging), (c) person unreliability, and (d) response biases (mainly acquiescence and faking/socially desirable responding) (Meijer et al., 2008; Reise & Flannery, 1996; Reise & Waller, 1993; Zickar & Drasgow, 1996). Below, I discuss sources (c) and (d) in more detail.



Person reliability (see Ferrando, 2004) refers to the clarity and sensitivity with which individuals perceive their own trait level, and is supposed to depend on the relevance of the trait and the degree to which it is internally organized in these individuals. Operatively, it is defined as the extent to which the responses are sensitive to the location of the items on the trait continuum. A highly reliable respondent agrees with those items located below his/her trait level, and rejects those located above: the pattern of this respondent is therefore highly scalable and fits the model very well (see Lanning, 1991). At the other extreme, the responses of a highly unreliable individual are largely insensitive to the item ordering. The resulting pattern is, therefore, almost random and likely to be detected as misfitting. Person unreliability is viewed as an individual-differences continuum, so all individuals are unreliable to some degree.

As for response biases, acquiescence can be detected as a source of person misfit if a balanced scale is used and if the impact is strong enough (see Ferrando & Lorenzo-Seva, 2010). The role of faking/social desirability is less clear. The position adopted here (see e.g. Reise & Waller, 2009) is that this type of responding consists of a consistent elevation of the scores that is unlikely to be detected in the person-fit analyses. This issue requires further research. However, in the conditions we shall consider here (anonymous and voluntary administration, no pressure to falsify the truth, and standard instructions), socially desirable responding is not expected to be a relevant source of misfit.

We turn finally to the status of person misfit as a variable. In most applications person misfit is considered to be purely situational. However, some authors have hypothesized that it might behave (at least in part) as an individual-differences variable (Reise & Waller, 1993; Schmitt et al., 1999). If so, two basic requisites must be fulfilled. First, the person-fit measures must possess a certain degree of reliability. Second, they must show meaningful relations with other potentially related individual-differences variables. As for the first point, Reise and Waller (1993) used the same person-fit measure that I shall use here, and assessed its reliability using a split-half estimate. They analyzed a multidimensional questionnaire made up of scales of about 24 items, and obtained an average person-fit reliability estimate of only 0.32. As for the second point, Schmitt et al. (1999) considered Conscientiousness (C) as a variable potentially related to person misfit. They hypothesized that high C levels generally imply more careful responding, and predicted that C scores and the person-fit measure would be positively correlated. Their results (r = 0.34) agreed with this prediction.

1.1. Scope and aims of the study

In this study I have chosen to assess two broad personality traits: Neuroticism (N) and Extraversion (E). As discussed above, person-fit research is particularly necessary for normal-range personality traits, and the two that have been chosen can be regarded as among the most representative of the group. Overall, the study is a non-technically-oriented application of IRT-based person-fit analysis whose aims are illustrative, substantive and theoretical. At the illustrative level it shows (a) how person-fit assessment can be carried out, and (b) the extent to which this type of assessment is practically useful and relevant. At the substantive level it empirically assesses the frequency and sources of individual misfit when the N and E measures are administered under standard conditions. Finally, at a more theoretical level, a small validity study is performed in an attempt to obtain further information about person fit as an individual-differences variable. On the one hand, the study aims to replicate the results by Schmitt et al. (1999) on the positive relation between person fit and C levels. On the other hand, it extends the initial reliability analysis by Reise and Waller (1993) from internal consistency to temporal stability by using a test–retest correlation.

1.2. The person-fit assessment procedure and rationale

The present proposal can be viewed as a two-stage procedure in which each stage is developed in two successive steps. The first stage is the approach that is most commonly used in IRT applications. The second stage is the person-fit assessment proper.

In the first stage, the chosen IRT model is fitted to the data in two steps: calibration and scoring. In the calibration step, the item parameter estimates are obtained, and model-data fit is assessed at the global and item levels. If the fit is appropriate, then the item estimates are taken as fixed and known in the second (scoring) step, and trait level estimates (i.e. scores) are obtained for each respondent.

We turn now to the second, person-fit stage. In the first step, a person-fit statistic is computed for each respondent in order to identify the potentially inconsistent patterns. In the second step, the flagged patterns are visually inspected using a graphical procedure to gain insight into the possible sources of misfit.

In this study, the chosen IRT model is the two-parameter model (2PM), one of the models most widely used with binary personality items. More specifically, the 2PM has been shown to provide good fits for the normal-range measures considered here (Ferrando, 1994). It is a dominance model, governed by two item parameters, difficulty and discrimination, in which the probability of item endorsement increases with trait level. The 2PM is relatively simple and can be fitted using a variety of widely available free and commercial programs.
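As an illustration of the model just described, the 2PM endorsement probability can be sketched as a short function. This is a hypothetical snippet for the reader's orientation, not the calibration software used in the study:

```python
import math

def p_endorse(theta, a, b):
    """2PM (two-parameter logistic) endorsement probability for a
    respondent at trait level theta, given item discrimination a
    and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# The probability of endorsement increases with trait level, which
# is what makes the 2PM a dominance model.
```

Under this parameterization, a respondent whose trait level equals the item difficulty endorses the item with probability 0.5, and larger values of a make the curve steeper around that point.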

The person-fit statistic we shall use in the second stage is the lz index developed by Drasgow, Levine, and Williams (1985). At present, lz is possibly the most widely used person-fit measure, particularly in personality: it performs better than other measures and is easy to interpret. Its basic rationale is that the likelihood of an individual pattern given the item parameters and the estimated trait level is high when the pattern is consistent with the model and low when the pattern is inconsistent. The lz index is the standardized value of the pattern’s likelihood, and, given the right conditions, has a unit normal distribution under the null hypothesis that there are no inconsistent respondents. Large negative values of lz indicate person misfit. So, the usual procedure is to set a cut-off point on the left tail of its distribution, and consider as potentially inconsistent the patterns whose values are below this point. In the present study we shall use the most accurate estimation of lz, which uses Snijders’s (2001) correction.

The lz value can be viewed as a measure of the relative decrease in the pattern’s likelihood due to responses that are unlikely for the model (e.g. Reise & Due, 1991). So, the responses that have most impact on it are those that have a very low probability: i.e. when the participant endorses an item that is very ‘difficult’ for him/her or rejects an item that is very ‘easy’. It then follows that, to attain good power and accuracy for most of the respondents, a long test with a wide spread of item locations will be required. It is hard to provide more specific guidance, because power also depends on the items’ accuracy and the type of inconsistency that is to be detected. However, simulation studies based on a good spread of difficulties suggest that: (a) it is difficult to detect person misfit using tests of fewer than 20 items, and (b) almost full detection rates are attained with tests of 80 items or more (e.g. Reise & Due, 1991).
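Given a calibrated 2PM, the standardization just described can be made concrete with a sketch of the uncorrected lz of Drasgow, Levine, and Williams (1985): the pattern log-likelihood minus its model-implied mean, divided by its model-implied standard deviation. This illustrative version omits Snijders’s (2001) correction that was used in the actual analyses:

```python
import math

def lz_index(responses, theta, a, b):
    """Uncorrected lz person-fit statistic under the 2PM, evaluated
    at the (estimated) trait level theta for binary responses (0/1)
    with item discriminations a and difficulties b."""
    loglik = mean = var = 0.0
    for u, ai, bi in zip(responses, a, b):
        p = 1.0 / (1.0 + math.exp(-ai * (theta - bi)))
        q = 1.0 - p
        loglik += u * math.log(p) + (1 - u) * math.log(q)  # observed pattern log-likelihood
        mean += p * math.log(p) + q * math.log(q)          # its expectation under the model
        var += p * q * math.log(p / q) ** 2                # its variance under the model
    return (loglik - mean) / math.sqrt(var)
```

A scalable pattern (endorsing the easy items and rejecting the difficult ones) yields an lz near or above zero, whereas a reversed, sabotage-like pattern yields a large negative value.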

The graphical procedure we shall use in the second person-fit step is Person Response Curve (PRC) analysis (e.g. Emons, Sijtsma, & Meijer, 2005). The PRC, which is computed for each individual, plots the probability of endorsement as a function of the item location/difficulty. The model-based PRC is the curve which would be expected given the estimated trait level of this respondent. The empirical PRC is the curve directly fitted to the observed responses of this individual. Graphically, an inconsistent pattern leads to discrepancies between the two curves that can provide insights into the sources of misfit: for example, general differences in the trends of the two curves, or identification of subsets of items for which the responses are unusual. The inspection of the PRC is visual but can be improved using two auxiliary statistical procedures: variability bands and standardized residuals. Variability bands express the uncertainty associated with the estimated empirical PRC, and allow a clearer assessment to be made of the regions in which the two curves diverge most (Emons et al., 2005). The standardized residuals (Wright, 1977) are scaled discrepancies between the observed and the model-expected item response values that are interpreted with reference to the standard normal distribution. Because they are computed for each response, the standardized residuals make it possible to identify the items on which the most unexpected responses occurred.
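As a sketch of the residuals just described, a standardized residual in the sense of Wright (1977) for a single binary response under the 2PM could be computed as follows. This is illustrative code, not the WPERFIT implementation:

```python
import math

def std_residual(u, theta, a, b):
    """Scaled discrepancy between an observed binary response u (0/1)
    and the 2PM-expected probability; interpreted with reference to
    the standard normal distribution."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return (u - p) / math.sqrt(p * (1.0 - p))

# A respondent at theta = 0 who endorses a very 'difficult' item
# (b = 2) produces a large positive residual, flagging that response.
```

Responses consistent with the model produce residuals well within the standard normal range, while endorsing a far-too-difficult item (or rejecting a far-too-easy one) pushes the residual beyond the usual |2.0| screening value.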

2. Method

2.1. Participants

The N and E measures were administered in different samples. There were 436 respondents in the N sample and 531 in the E sample. All respondents were undergraduate students from the faculties of Psychology and Social Sciences of a Spanish university. Their mean age was about 21, and about 80% were female.

In both samples, the measures were administered in paper-and-pencil format, and completed voluntarily and anonymously in classroom groups of 25–60 students. In the N sample the measure was administered twice under the same conditions with a retest interval of four weeks.

2.2. Measures

Both the N and E measures were made up of non-redundant items taken from the various Eysenck questionnaires: MMQ, MPI, EPI, and EPQ (see Eysenck & Eysenck, 1976). The N measure consisted of 60 items, all of which were worded in the same direction. In the E dataset, the questionnaire that was administered contained 55 E items interspersed with 23 C items. Of the 55 E items, 14 measured in the direction of introversion and the rest in the direction of extraversion. The C items were taken from the International Personality Item Pool (Goldberg, 1999). For the sake of clarity, relatively long measures were used so as to attain good power and accuracy in this illustrative study.

Fig. 1. Distribution of the lz values in both datasets and cut-off point.

3. Results

3.1. First stage

The 2PM was fitted to the data using the free program NOHARM (Fraser & McDonald, 1988) to assess global fit, and the commercial program BILOG MG-3 (Zimowski, Muraki, Mislevy, & Bock, 2003) to make a more detailed assessment at the item level. In both datasets the 2PM fitted the data well, both globally and on an item-by-item basis. Overall goodness-of-fit results were: Root Mean Squared Residual (RMSR) covariances of 0.012 (E) and 0.013 (N) and Gamma goodness-of-fit index (GFI) values of 0.91 (E) and 0.88 (N). Details at the item level can be obtained from the author. In the second step, and in both datasets, trait estimates were obtained for each respondent.

In the partially-balanced E measure, the potential impact of acquiescence was assessed using the procedure proposed by Ferrando, Lorenzo-Seva, and Chico (2003), in which acquiescence is modeled as a secondary factor. Three main results were obtained. First, the disattenuated correlation between the scores based on the positive and the negative item sets was r = −0.99. Second, the improvement in fit when going from one factor (content) to two factors (content and acquiescence) was meager: GFI = 0.91 to GFI = 0.93. Third, most of the loadings on the second factor were about the size of their standard errors, and the criterion that at least three loadings above 0.30 are needed to define a factor (McDonald, 1985) was not met. These results suggest that, in the present study, the impact of acquiescence at the item level is almost negligible for most of the items, so acquiescence would not be a substantial source of person misfit.

3.2. Second stage: Person-fit analyses

The lz values were obtained with the program WPERFIT (Ferrando & Lorenzo-Seva, 2000), a free, non-commercial Windows program that computes a variety of person-fit indices and displays person response curves. WPERFIT can be obtained at no charge by writing to the present author.

The distribution of the lz values in both datasets had a slight negative skew, with a heavier left tail. This is the usual result obtained in applications (Reise & Flannery, 1996). I chose a value of −2 as the cut-off for considering a respondent as potentially inconsistent. Under the null hypothesis of no misfit, the percentage of individuals expected to fall below the chosen cut-off point is 2.5%. The observed percentages were 4% in both cases (18 respondents in N and 21 respondents in E), only slightly above the nominal level, and far below the 10% ceiling discussed above. The distribution of the lz values in both datasets and the cut-off point are shown in Fig. 1.
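The flagging rule just described reduces to a simple threshold test on the left tail of the lz distribution. A minimal sketch (the function name is hypothetical):

```python
def flag_inconsistent(lz_values, cutoff=-2.0):
    """Return the indices of respondents whose lz value falls below
    the chosen cut-off and who are therefore flagged as potentially
    inconsistent."""
    return [i for i, v in enumerate(lz_values) if v < cutoff]
```

The flagged patterns are then submitted to graphical PRC inspection rather than being automatically discarded, in line with the screening rationale discussed in the introduction.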

In the second step, PRC analyses were carried out. To classify the sources of misfit, each graph was visually inspected, 90% variability bands were obtained for the observed PRC, and standardized residuals were computed for each response. For the N measure two sources of misfit were identified: local deviations in some subsets of items (12 patterns) and very flat PRCs (6 patterns). Conceptually, the first source implies an idiosyncratic interpretation of certain items, which leads to individuals endorsing ‘difficult’ items that are far above their estimated trait level or rejecting ‘easy’ items far below their estimated trait level. The second source implies very low person reliability, so the respondent is largely insensitive to the normative ordering of the items, and the resulting pattern is almost random. The same two main sources were also identified in the E measure: local deviations (13 patterns) and almost-flat PRCs (6 patterns). However, a third source was also detected here: deliberate sabotaging. Two participants systematically responded opposite to the normative item ordering, disagreeing with the most frequently endorsed items, and agreeing with the less frequently endorsed ones.

For illustrative purposes, Fig. 2 shows the PRC results of two potentially inconsistent respondents. Panel (a) corresponds to participant no. 138 in the N dataset, who obtained an lz value of −2.80. The expected PRC is decreasing, as it should be, indicating that the probability of endorsement decreases as the difficulty of the items increases. The observed PRC, however, is different, particularly on the right-hand side of the graph. The analysis of the standardized residuals showed that 4 responses had residual values above 2.0 in absolute value. These residuals are the extreme points on the upper right-hand side of the figure. So, this is an example of local discrepancy: the respondent endorsed a group of items which were rather difficult given his/her estimated trait level.

Panel (b) is the analysis of participant no. 431 in the E dataset, who was detected as inconsistent with an lz value of −3.46. Again, the expected PRC is decreasing. However, the observed PRC is virtually flat. In this case there were 8 standardized residuals larger than 2.0, but they were distributed between the two ends of the curve. This is an example of person unreliability: the responses of this individual are almost insensitive to the item locations.

Fig. 2. PRC analysis of two potentially inconsistent respondents.

3.3. Validity analyses

In the E dataset the lz values correlated 0.24 with the C scores, a value lower than the 0.34 obtained by Schmitt et al. (1999). However, they did not use a measure based on a single scale (as here) but a multitest extension of lz obtained using five NEO scales.

The test–retest correlation of the lz values in the N dataset was 0.68, which suggests that the measure of person misfit has some degree of temporal stability. This retest estimate is higher than the split-half estimates obtained by Reise and Waller (1993). However, they used shorter scales, which means that their lz values contained much more error than in the present case.

4. Discussion

Person-fit analysis is an important part of evaluating an IRT model and provides information that is highly relevant for individual assessment. The position adopted here is that this analysis should be routinely carried out as an additional stage after the calibration and scoring stages, and before the individual scores are interpreted and used for further inferences (i.e. validity). However, if this analysis is to be appropriate and accurate, two basic conditions must be met. First, an appropriate IRT model that fits the data well at the global level must be used. Second, a relatively long test with a wide spread of item locations is needed. Normal-range personality scales of the type considered here are usually broad-bandwidth measures with an appropriate dispersion of difficulties. However, some scales might be too short to provide accurate person-fit measurement. To overcome this limitation, the multitest extension of the lz statistic proposed by Drasgow, Levine, and McLaughlin (1991) can be used. This extension, which combines the person-fit information obtained over individual scales into a single value, seems particularly useful in personality, in which individual scales are usually part of a multidimensional questionnaire.

If person-fit analysis is to be widely used in personality it must be implemented in user-friendly and widely available software. As discussed above, free software already exists for the procedures considered here. However, more work is needed to include further developments, such as the multitest extension mentioned above or the existing procedures for graded-response items.

We turn now to the present study. To start with, given the measures, conditions and samples used, the generalizability of the results is necessarily limited. Thus, only the E scale was partially balanced, and in this measure the impact of acquiescence was minimal. However, there may be measures, situations or samples in which acquiescence is an important source of misfit. It would also be of interest to consider conditions other than the present ones. Thus, in real situations of selection, or situations that generate pressure of some sort, the frequency and sources of misfit could be quite different from the ones obtained here. Finally, the study has been based on two different samples. A single-sample repeated-measures design would have made it possible to study the convergent validity of person fit over different measures.

In spite of these limitations, the study has obtained interesting results in terms of the three purposes described above. At the illustrative level, it demonstrates that the proposal is both feasible and useful, because the person-fit analysis worked quite well in both datasets.



New results were obtained at the substantive level. In both datasets the results suggest that most of the participants answered the measures consistently. At first sight this result could be expected given the conditions of the study: anonymous participation and no pressure of any sort. However, it should also be noted that, in these conditions, the motivation to answer a relatively long test carefully is rather low (Reise & Flannery, 1996). In spite of this generalized consistency, in both the N and E datasets a small percentage of participants provided inconsistent responses. The identified sources of misfit were local discrepancies (probably due to the idiosyncratic interpretation of some items), low person reliability, and deliberate sabotaging. It is precisely these results that show the need for person-fit analysis. Given the small proportion of inconsistent respondents, the external validity would probably not be affected a great deal. However, had inferences been drawn or decisions taken at the individual level, they would have been incorrect for the inconsistent individuals, because the trait estimate for these respondents is meaningless.

Finally, the results are also of interest at the theoretical level. The positive correlation between the C scores and person-fit values suggests that respondents high in Conscientiousness tend to respond to the test more carefully. This result reinforces the usefulness of C as a predictor. On the other hand, the moderate temporal stability obtained in the N dataset suggests that when person-fit measures are obtained in favorable conditions they are reliable to some extent. Together, both results imply that at least some of the variation in the lz values is attributable to systematic factors.
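The lz index discussed above can be made concrete with a short sketch. The following Python fragment is a minimal illustration (not the code used in the study) of the standardized log-likelihood person-fit statistic lz of Drasgow, Levine, and Williams (1985) for dichotomous items under a 2PL model; the item parameters and response patterns are invented for demonstration only.

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL probabilities of endorsing each item for one respondent."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def lz_statistic(responses, theta, a, b):
    """Standardized log-likelihood person-fit index lz
    (Drasgow, Levine, & Williams, 1985).

    responses : 0/1 vector of item scores
    theta     : (estimated) trait level of the respondent
    a, b      : item discrimination and location parameter vectors
    """
    p = p_2pl(theta, a, b)
    q = 1.0 - p
    # Log-likelihood of the observed pattern
    l0 = np.sum(responses * np.log(p) + (1 - responses) * np.log(q))
    # Its expectation and variance under the model
    mean = np.sum(p * np.log(p) + q * np.log(q))
    var = np.sum(p * q * np.log(p / q) ** 2)
    return (l0 - mean) / np.sqrt(var)

# Hypothetical 10-item scale: equal discriminations, spread locations.
a = np.ones(10)
b = np.linspace(-2.0, 2.0, 10)

# A consistent pattern (endorses only the "easy" items) yields a positive lz;
# the reversed pattern yields a large negative lz, flagging possible misfit.
consistent = (b < 0).astype(int)
reversed_ = 1 - consistent
print(lz_statistic(consistent, 0.0, a, b))  # positive
print(lz_statistic(reversed_, 0.0, a, b))   # large negative
```

In screening use, large negative lz values flag patterns for closer inspection rather than automatic exclusion; note also that when theta is estimated rather than known, the null distribution of lz departs from N(0, 1), and corrections such as the one discussed by Snijders (2001) may be needed.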

Acknowledgment

This research was supported by a grant from the Spanish Ministry of Education and Science (PSI2008-00236/PSIC).

References

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15, 171–191.

Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67–86.

Emons, W. H. M., Sijtsma, K., & Meijer, R. R. (2005). Global, local and graphical person-fit analysis using person-response functions. Psychological Methods, 10, 101–119.

Eysenck, H. J., & Eysenck, S. B. G. (1976). Psychoticism as a dimension of personality. New York: Crane-Russak.

Ferrando, P. J. (1994). Fitting item response models to the EPI-A impulsivity subscale. Educational and Psychological Measurement, 54, 118–127.

Ferrando, P. J. (2004). Person reliability in personality measurement: An item response theory analysis. Applied Psychological Measurement, 28, 126–140.

Ferrando, P. J., & Lorenzo-Seva, U. (2000). WPERFIT: A program for computing parametric person-fit statistics and plotting person response curves. Educational and Psychological Measurement, 60, 479–487.

Ferrando, P. J., & Lorenzo-Seva, U. (2010). Acquiescence as a source of bias and model and person misfit: A theoretical and empirical analysis. British Journal of Mathematical and Statistical Psychology, 63, 427–448.

Ferrando, P. J., Lorenzo-Seva, U., & Chico, E. (2003). Unrestricted factor analytic procedures for assessing acquiescent responding in balanced, theoretically unidimensional personality scales. Multivariate Behavioral Research, 38, 353–374.

Fraser, C., & McDonald, R. P. (1988). NOHARM: Least squares item factor analysis. Multivariate Behavioral Research, 23, 267–269.

Goldberg, L. R. (1999). A broad-bandwidth, public domain, personality inventory measuring the lower-level facets of several five-factor models. In I. Mervielde, I. Deary, F. De Fruyt, & F. Ostendorf (Eds.), Personality psychology in Europe (Vol. 7, pp. 7–28). Tilburg, The Netherlands: Tilburg University Press.

Lanning, K. (1991). Consistency, scalability and personality measurement. New York: Springer-Verlag.

Levine, M. V., & Drasgow, F. (1983). Appropriateness measurement: Validating studies and variable ability models. In D. J. Weiss (Ed.), New horizons in testing (pp. 109–131). New York: Academic Press.

McDonald, R. P. (1985). Factor analysis and related methods. Hillsdale: LEA.

Meijer, R. R. (2003). Diagnosing item score patterns on a test using item response theory-based person-fit statistics. Psychological Methods, 8, 72–87.

Meijer, R. R., Egberink, I. J. K., Emons, W. H. M., & Sijtsma, K. (2008). Detection and validation of unscalable item score patterns using item response theory: An illustration with Harter's self-perception profile for children. Journal of Personality Assessment, 90, 1–14.

Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107–135.

Reise, S. P., & Due, A. M. (1991). The influence of test characteristics on the detection of aberrant response patterns. Applied Psychological Measurement, 15, 217–226.

Reise, S. P., & Flannery, W. P. (1996). Assessing person-fit on measures of typical performance. Applied Measurement in Education, 9, 9–26.

Reise, S. P., & Waller, N. G. (1993). Traitedness and the assessment of response pattern scalability. Journal of Personality and Social Psychology, 65, 143–151.

Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27–48.

Schmitt, N., Chan, D., Sacco, J. M., McFarland, L. A., & Jennings, D. (1999). Correlates of person-fit and effect of person-fit on test validity. Applied Psychological Measurement, 23, 41–53.

Snijders, T. A. B. (2001). Asymptotic null distribution of person fit statistics with estimated person parameter. Psychometrika, 66, 331–342.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14, 97–116.

Zickar, M. J., & Drasgow, F. (1996). Detecting faking on a personality instrument using appropriateness measurement. Applied Psychological Measurement, 20, 71–87.

Zimowski, M., Muraki, E., Mislevy, R. J., & Bock, R. D. (2003). BILOG-MG 3: Item analysis and test scoring with binary logistic models. Chicago: Scientific Software.