
Identifying the Sources of Person Misfit: Combining Quantitative and Qualitative Approaches

Alexandra Petridou

Learning, Teaching and Assessment Research & Teaching Group

University of Manchester

Seminar paper presented at a seminar of the University of Manchester, School of Education, Learning, Teaching & Assessment Research and Teaching Group, November 2004

Abstract

Person-fit statistics aim to detect aberrant response behaviour, as this may have detrimental effects on the quality and validity of measurement. Although a large number of indices have been developed to identify aberrant examinees, the reasons that lead an examinee to provide aberrant responses remain largely unknown. This is because finding an unexpected or unusual pattern does not in itself explain the aberrance. Many authors have suggested possible reasons for aberrant response behaviour, but these have not been systematically investigated. This study combines quantitative and qualitative approaches in order to identify potential reasons for person misfit using real data. The data come from a mathematics test containing 45 constructed-response and multiple-choice items and from two questionnaires used to gather background information about the examinees. The quantitative part adopts two methods to analyse the performance of the individuals on the test. The first examines aberrant behaviour under the Rasch model using the Infit and Outfit person-fit statistics. On the basis of the Infit and Outfit values a new variable is constructed that becomes the response variable in a two-level model. The second method adopts the multilevel logistic regression approach proposed by Reise (2000). The qualitative part follows up with interviews of two examinees who provided misfitting response patterns. The main outcomes of the study to date are: (i) that the multilevel methodology is promising but will need a larger data set to yield substantive results and (ii) that the case studies yielded a number of explanations of misfit in this context that are worthy of further investigation.

The author is grateful for the financial support of the ESRC (PTA-030-2004-00072). The usual disclaimer applies.

1 Introduction

When examinees take a test their responses are expected to conform to some standard of reasonableness (Smith, 1986): for instance, a response pattern that involves a significant number of wrong answers to 'easy' questions but right answers to 'hard' questions would be regarded as 'aberrant'. Such an aberrant response pattern would be signalled by a high 'misfit' statistic computed from the deviations of these responses from those 'expected'. Many researchers have investigated the detection of such unexpected response patterns in recent decades. The rationale behind this effort is the claim that the test scores of examinees with unexpected response patterns may fail to provide a useful and valid measure of ability. Unexpected response patterns may result from student misconceptions, cultural differences, atypical schooling, language deficiencies, anxiety, lack of motivation, faulty test construction, ethnicity, external distractions, fatigue and many other factors. A large number of indices have been developed in order to identify aberrant response patterns. However, statistical misfit is merely an indication of problematic examinee performance. Person-fit statistics generally indicate whether someone has an unusual response pattern, not why such a pattern has occurred. The statistical model cannot tell us about the reasons that led an examinee to generate these responses: even if a pattern is statistically identified as aberrant, the researcher cannot always be sure of the kind of aberrance underlying test performance. Most researchers seem to agree that the factors causing aberrance can be very complex and that their identification needs something more than statistics alone. This is one of the main reasons why the factors that lead some examinees to generate aberrant responses remain largely unknown.
Hulin, Drasgow & Parsons (1983) argued that "since the underlying causes of aberrance are usually unknown little meaning can be given to an individual's test score determined to be aberrant by an appropriateness index" (p.149).


The purpose of my study is to examine the validity of test scores from the point of view of 'misfitting' or 'aberrant' students and classes. Validity refers to the “degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests”. This definition of validity is given by the Standards for Educational and Psychological Testing (AERA et al., 1999). The main emphasis is clearly on scores rather than on tests, as tests do not themselves have reliabilities and validities. Test responses have these properties and are a function “not only of the items, tasks or stimulus conditions but of the persons responding and the context of measurement” (Messick, 1993, p.14). Many different sources of evidence can be used in evaluating a proposed interpretation of test scores, and these sources can shed light on different aspects of validity. This study will examine the validity of test scores from the point of view of misfitting response patterns, as the test scores of these examinees may be invalidly low (because of language deficiencies, alignment error etc.) or invalidly high (because of cheating, copying etc.). In order to give meaning to these scores and make valid inferences from them, we need to know the reasons behind the unexpected responding. Otherwise no trust can be put in the inferences based on these scores.

In particular, this study will address the following question: what factors lead examinees to provide unexpected response patterns and, more specifically, to what extent is misfit attributable to class-level variables or to individual characteristics? The design of this study involves hierarchical statistical models to identify such classes and individual students, followed by case studies of particular students and classes to elicit insights into the causes of their statistical 'aberrance'. This study is expected to be completed in September 2006.

This paper reports some preliminary findings from a pilot study that took place between May and July 2004. This pilot study was conducted in order to run a small scale version of the main study but also to pilot test the research instruments that will be used in the main study. Specifically the objectives of this pilot study were to:

Pilot test the research instruments, i.e. the two questionnaires and the semi-structured interview plan

Assess the proposed data analysis techniques (both qualitative and quantitative) in order to examine whether these are feasible and

Uncover any potential problems in the research process


In general this study was a feasibility study and aimed to provide some feedback on the methodology and methods employed before the main study but also to provide some preliminary results.

2 Study Sample

The sample of this study comprised 469 Year 6 pupils nested within 22 classes from various primary schools in the Manchester area. However, background information was available for only 233 of these pupils. The schools from which pupils were selected differed in terms of socio-economic background, ethnic composition and national curriculum levels in Mathematics. This study used a convenience sample, chosen for spatial proximity and ease of access.

3 Design

A two-phase methodology was used, as this pilot study combined quantitative and qualitative methods (Figure 1). Phase A was a quantitative study that aimed to identify examinees with misfitting response patterns and then to study the statistical relationships between misfit and various individual and class background variables. Phase B started one month after the first phase and comprised the study of specific cases of individuals. The purpose of the case studies was to build a profile of these pupils, to identify any existing relationships, patterns and mechanisms and, most importantly, to try to explain the causes of their statistical misfit.

Figure 1: The design of the methodology and methods


4 Phase A: Quantitative Study

4.1 The Research Instruments

The quantitative part of this study included the following instruments: a Mathematics test (the MALT test), a student questionnaire and a teacher questionnaire.

4.1.1 The Test

The test used to collect the data was obtained from the Mathematics for Learning and Teaching (MALT) project of the University of Manchester, which collects diagnostic information and standardises mathematics tests for Reception to Year 9. The test used in the pilot study was the Year 6 test, designed to cover the full range of levels and content of the mathematics programme of study for Year 6. The test was composed of 45 one-mark questions and was divided into two parts, a calculator and a non-calculator section. The pupils had approximately 45 minutes to an hour to complete the test.

4.1.2 The Questionnaires

This study involved two questionnaires, a student and a teacher questionnaire. The purpose of the questionnaires was to collect background information that would be entered as explanatory variables in a multilevel model. The questionnaires tried to elicit background variables believed to be of significance because they are often cited in the literature as possible sources of aberrance. The student questionnaire was administered immediately after the completion of the test, while the teacher questionnaire was completed while pupils were taking the test. Pilot testing the questionnaires was considered of paramount importance in order to check the clarity and wording of the questionnaire items, their difficulty and the time needed to complete the questionnaire.

The student questionnaire aimed to elicit information about pupils' sex, ethnicity, socio-economic background in terms of parents' education, language background (i.e. whether a language other than English was spoken at home), anxiety before taking the test, motivation and effort in taking the test, and the perceived difficulty and speededness of the test. Motivation and anxiety were measured with existing scales found in the literature. The teacher questionnaire elicited information on ethnicity, gender, years of experience, teachers' education and training, and the instructional methods used in Mathematics. In both questionnaires respondents were asked to indicate whether they were interested in participating in follow-up interviews.

In the following section the data analysis methods will be described and some preliminary findings will be presented.

4.2 Analysis

Two methods were piloted for the statistical study of person-fit. The first involved identifying examinees with misfitting response patterns using two general-purpose fit statistics, i.e. Infit Mean Square and Outfit Mean Square. These two person-fit statistics were used in order to examine whether the data, and more importantly the pupils, were consistent with the Rasch model. In order to distinguish inconsistent performances (so that they could be examined further) a criterion had to be set. To identify this criterion, a simulation method proposed by Linacre (personal communication) was used, while bearing in mind the limitations of setting a mathematical criterion.

On the basis of the Infit and Outfit values, two new binary variables were created, one for each statistic. Examinees with an Infit (respectively Outfit) value above 1.3, the criterion obtained from the simulation method, were coded as 1 and all others as 0. These new variables became the response variables in two-level logistic models in which individuals were nested within classes. Explanatory variables could then be entered in the models in order to examine the degree to which specific variables at different levels could account for the generation of aberrant response patterns.
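As an illustration of how such binary variables could be derived, the following minimal sketch computes Infit and Outfit mean squares from Rasch residuals using their standard definitions and dichotomises them at 1.3. It is not the Quest implementation used in this study; the function names and the toy responses are hypothetical, and person abilities and item difficulties are assumed to be already estimated.

```python
import numpy as np

def rasch_prob(theta, b):
    """P(correct) under the Rasch model for abilities theta (n,) and difficulties b (m,)."""
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))

def person_fit(X, theta, b):
    """Return (infit, outfit) mean squares per person for a 0/1 response matrix X (n, m)."""
    P = rasch_prob(theta, b)
    W = P * (1.0 - P)                                    # model variance of each response
    sq_resid = (X - P) ** 2                              # squared raw residuals
    outfit = (sq_resid / W).mean(axis=1)                 # unweighted mean square
    infit = sq_resid.sum(axis=1) / W.sum(axis=1)         # information-weighted mean square
    return infit, outfit

def flag_misfit(msq, cutoff=1.3):
    """Code a person 1 if the mean square exceeds the cut-off, otherwise 0."""
    return (msq > cutoff).astype(int)

# Toy data (hypothetical): five items, two persons of average ability.
b = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
theta = np.array([0.0, 0.0])
X = np.array([
    [1, 1, 1, 0, 0],   # easy items right, hard items wrong
    [0, 0, 1, 1, 1],   # easy items wrong, hard items right
])
infit, outfit = person_fit(X, theta, b)
infit_flag = flag_misfit(infit)
outfit_flag = flag_misfit(outfit)
print(infit.round(2), outfit.round(2), infit_flag, outfit_flag)
```

In this toy example only the second person, who fails the easy items and passes the hard ones, exceeds the cut-off on both statistics.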

The second method follows an approach proposed by Reise (2000), which applies multilevel modelling to IRT person-misfit detection. Specifically, this method uses a multilevel logistic model to evaluate the fit of an IRT model to an individual’s item response pattern after item and person parameters have been estimated using standard IRT software. Item responses are treated as nested within individuals, and a multilevel logistic regression is used to estimate a person-response curve that relates examinee response probability to item difficulty. This person-response curve models how an individual’s item endorsement rate diminishes as a function of item difficulty (Reise, 2000). The slope of the individual’s person-response curve reflects the consistency of the individual’s response pattern and is an indicator of person-fit.
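The idea of a person-response curve can be sketched for a single examinee with an ordinary logistic regression of the item responses on item difficulty. This is only an illustration: Reise's method estimates the slopes jointly in a multilevel model and uses empirical Bayes estimates, whereas the sketch below fits each person separately. The function name, the small ridge penalty and the two response patterns are assumptions made for the example.

```python
import numpy as np

def person_response_curve(x, b, ridge=0.1, n_iter=50):
    """Fit logit P(x_i = 1) = b0 + b1 * b_i for one person; return (b0, b1).

    Newton-Raphson on a lightly penalised log-likelihood. The ridge penalty
    is an assumption of this sketch; it keeps estimates finite when a
    response pattern is perfectly separable.
    """
    D = np.column_stack([np.ones_like(b), b])            # intercept and item difficulty
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(D @ beta)))
        grad = D.T @ (x - p) - ridge * beta
        hess = D.T @ (D * (p * (1.0 - p))[:, None]) + ridge * np.eye(2)
        beta = beta + np.linalg.solve(hess, grad)
    return beta[0], beta[1]

# Hypothetical item difficulties (logits) and two response patterns.
b = np.linspace(-2.0, 2.0, 9)
consistent = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0], dtype=float)  # success falls with difficulty
erratic = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1], dtype=float)     # unrelated to difficulty

_, slope_consistent = person_response_curve(consistent, b)
_, slope_erratic = person_response_curve(erratic, b)
print(round(slope_consistent, 2), round(slope_erratic, 2))
```

The consistent pattern yields a clearly negative slope, while the erratic pattern yields a slope of essentially zero, which is the signature of poor person-fit in this framework.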

Before presenting the two methods in more detail with some preliminary results, it is important to report some descriptive statistics about the sample. Due to missing data the final sample size was reduced from 233 to 170 pupils. Of the 45 items in the test only 44 were included in the analysis: one item had a mistake in its instructions and, in the interest of fairness, was excluded. Table 1 presents the sample’s characteristics.

Table 1: Demographic data of the sample

Background variable                            Percentage
Gender
  Male                                         51.8
  Female                                       48.2
Ethnicity
  White                                        57.1
  Black                                        30
  Asian                                        12.9
Other language than English spoken at home
  Yes                                          32.8
  No                                           68.2

Mean Estimate of Ability = -0.0867 (logits)

4.2.1 Method 1

Although the first method seemed promising, its application had one very important limitation, namely sample size. Multilevel modelling with small samples raises questions about the accuracy of the parameter estimates. Kreft (1996) suggested the “30/30 rule” as a rule of thumb: to be on the safe side, researchers should try to obtain a sample of at least 30 groups with at least 30 individuals per group. For certain applications this rule of thumb can be modified. Research has shown that, for accuracy and high power, it is more important to have a large number of groups than a large number of individuals per group.

When applying Method 1 the whole sample was used, i.e. 469 pupils nested within 22 classes. The analysis output showed that parameter estimation was very unreliable (i.e. the standard errors were very large). As a result model-building procedures could not be pursued further. However, the intra-class correlation could be estimated from the null model. Table 2 reports the intra-class correlations calculated from the Infit and Outfit two-level binary models. The intra-class correlation indicates the proportion of the overall variation in misfit that is attributable to higher-level units, in this case classes. The intra-class correlation (ρ) is not really a correlation but a measure of strength of association.

Table 2: Intra-class correlations of two binary multilevel models

                           Model 1: Response variable    Model 2: Response variable
                           INFIT MSQR                    OUTFIT MSQR
Intra-class correlation    0.086                         0.044

Table 2 shows that roughly 8.6% of the total variance in misfit as indicated by the Infit person-fit statistic can be explained by class membership, while 4.4% of the total variance in misfit as indicated by the Outfit statistic can be explained by class. These results indicate that most of the variance in misfit, as indicated by both statistics, lies at the individual level. The variation in misfit attributable to class membership as indicated by the Infit statistic is considered quite high, and explanatory variables at the class level could have been entered in order to explain it. The variation attributable to class membership as indicated by the Outfit statistic, however, is quite small. Consequently, there is little point in entering class-level predictors for Outfit, although variables at the individual level can still be entered. As this study’s sample was quite small, it did not allow this kind of pursuit. Moreover, background information was not available for all the pupils and classes in the sample. This method will be attempted in the main study, where the dataset will be much larger and background information will be available at the individual and class level for the whole sample.
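The paper does not state which formula was used to obtain ρ from the null models. One common convention for two-level logistic models, assumed here purely for illustration, treats the level-one residual as standard logistic with variance π²/3, so that ρ = σ²_class / (σ²_class + π²/3). Under that assumption the reported values can be translated back into implied class-level variances; the function names are hypothetical.

```python
import math

# Variance of the standard logistic distribution (latent-variable convention).
LEVEL1_VAR = math.pi ** 2 / 3

def icc_from_class_variance(var_class):
    """Intra-class correlation implied by a class-level intercept variance."""
    return var_class / (var_class + LEVEL1_VAR)

def class_variance_from_icc(icc):
    """Class-level intercept variance implied by an intra-class correlation."""
    return icc * LEVEL1_VAR / (1.0 - icc)

implied = {}
for label, icc in [("Infit", 0.086), ("Outfit", 0.044)]:
    implied[label] = class_variance_from_icc(icc)
    print(label, round(implied[label], 3))
```

If a different definition of ρ was used in the analysis, the implied variances would differ accordingly.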


4.2.2 Method 2

Method 2 treated items as nested within pupils, and an additional level was added, i.e. pupils nested within classes. Pupils and classes thus became the level-two and level-three units respectively. In Method 2, only the data from the 233 pupils with background information were used.

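The basic multilevel logistic model used in Method 2 is the following, where i indexes items, j pupils and k classes.

Basic Model

Level-One

$$\log\left(\frac{p_{ijk}}{1-p_{ijk}}\right) = \pi_{0jk} + \pi_{1jk}\, b_{i} \tag{1}$$

Level-Two

$$\pi_{0jk} = \beta_{00k} + \beta_{01k}\,\theta_{jk} + r_{0jk} \tag{2}$$

$$\pi_{1jk} = \beta_{10k} + \beta_{11k}\,X_{1jk} + \beta_{12k}\,X_{2jk} + \dots + r_{1jk} \tag{3}$$

Level-Three

$$\beta_{00k} = \gamma_{000} + \gamma_{001}\,W_{k} + u_{00k} \tag{4}$$

$$\beta_{01k} = \gamma_{010} + \gamma_{011}\,W_{k} \tag{5}$$

$$\beta_{10k} = \gamma_{100} + \gamma_{101}\,W_{k} + u_{10k} \tag{6}$$

$$\beta_{11k} = \gamma_{110} + \gamma_{111}\,W_{k} \tag{7}$$

$$\beta_{12k} = \gamma_{120} + \gamma_{121}\,W_{k} \tag{8}$$

$$\vdots$$

In the level-one model the response variable is the natural log odds of item endorsement (success), which is regressed on item difficulty ($b_{i}$). The intercept term ($\pi_{0jk}$) is the log odds of item endorsement when item difficulty equals zero. The slope term ($\pi_{1jk}$) indicates how the item endorsement rate decreases as item difficulty increases. It is the empirical Bayes estimates of the person slope parameters that are used as indicators of person fit ($\pi_{1jk}$ values near zero indicate poor person-fit). Intercept and slope parameters for each individual are treated as random coefficients. Thus the level-one regression coefficients (i.e. $\pi_{0jk}$ and $\pi_{1jk}$ for the jth individual) become the level-two response variables, which are regressed on the latent trait ($\theta_{jk}$) and on level-two variables ($X_{jk}$) respectively. The terms $r_{0jk}$ and $r_{1jk}$ represent level-two residuals. Each of the coefficients defined at level two becomes a response variable in the level-three model. The terms $W_{k}$ are level-three variables, while the terms $u_{00k}$ and $u_{10k}$ are level-three residual terms.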

This basic multilevel model is viewed by Reise as a three-step process and in this way was used for analysis purposes. Quest (Adams & Khoo, 1996) and HLM 5.0 (Raudenbush et al., 2001) were used to perform the analyses. Particularly Quest was used to estimate person and item parameters using the Rasch model and HLM 5.0 was used to perform the multilevel modeling analyses.

One significant feature of this method for this study is that explanatory variables can be entered into the model, so that the causes of person misfit can be investigated at two levels, i.e. the individual and the group level (in this study, the class). Another important feature of this method is that goodness of model fit can be investigated at item and person level simultaneously. In other words, item and person misfit can be investigated in a single multilevel analysis.

4.2.2.1 Results

Analysis results will be presented as a three-step process, which is described in much more detail by Reise (2000). According to Reise (2000) the first step is to estimate a multilevel logistic model with no second- and third-level variables. In this multilevel model items are nested within individuals and individuals within classes. This model is presented below.

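Model 1

Level-One

$$\log\left(\frac{p_{ijk}}{1-p_{ijk}}\right) = \pi_{0jk} + \pi_{1jk}\, b_{i}$$

Level-Two

$$\pi_{0jk} = \beta_{00k} + r_{0jk} \tag{9}$$

$$\pi_{1jk} = \beta_{10k} + r_{1jk} \tag{10}$$

Level-Three

$$\beta_{00k} = \gamma_{000} + u_{00k} \tag{11}$$

$$\beta_{10k} = \gamma_{100} + u_{10k} \tag{12}$$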

Before examining the results obtained from HLM, it is important to note that in this first step the main interest lies in inspecting the reliability of the random intercept and random slope parameters and in testing the slope variability for statistical significance.

Table 3: Reliability estimates for level-1 coefficients-Model 1

Random level-1 coefficient        Reliability estimate
Intercept term, $\pi_{0}$         0.875
Slope term, $\pi_{1}$             0.181

Table 3 shows that intercept variation is highly reliable while slope variation is not. These results were expected, as intercepts are analogous to examinee trait-level variation, whereas the item parameters were estimated under the Rasch model, which treats item slope parameters as fixed. Table 4 reports the values of the $\gamma_{000}$ and $\gamma_{100}$ parameters. The $\gamma_{000}$ parameter is the grand mean of the $\pi_{0jk}$ estimates, where the $\pi_{0jk}$ coefficient represents the log-odds of an item endorsement when the item difficulty equals zero. The $\gamma_{100}$ parameter represents the average of the person-slopes ($\pi_{1jk}$). In this study $\gamma_{000}$ = -0.06, i.e. the average log-odds of an item endorsement when item difficulty equals zero is -0.06, which corresponds to a probability of 0.485. The average person slope, $\gamma_{100}$, equals -0.97.
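The conversion from log-odds to probability used in this interpretation is the inverse logit. The following sketch applies it to the two fixed effects reported in Table 4; the variable names are labels chosen for this example, and the value for an item of difficulty one logit is an additional illustration not reported in the paper.

```python
import math

def inv_logit(x):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

avg_intercept = -0.058661   # average intercept reported in Table 4
avg_slope = -0.968755       # average person slope reported in Table 4

# Average endorsement probability for an item of difficulty 0 logits.
p_at_zero = inv_logit(avg_intercept)

# Average endorsement probability for an item of difficulty 1 logit.
p_at_one = inv_logit(avg_intercept + avg_slope * 1.0)

print(round(p_at_zero, 3), round(p_at_one, 3))
```

The first value is approximately 0.485; the second illustrates how the negative average slope lowers the expected success rate as items become harder.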

Table 4: Estimates of the fixed effects in Model 1

Fixed level-2 coefficient    Coefficient    Standard Error    Approx. T-ratio    d.f.    P-value
$\gamma_{000}$               -0.058661      0.166164          -0.353             7       0.734
$\gamma_{100}$               -0.968755      0.032261          -30.029            7       0.000

In this first step we are most interested in the variation of person slopes. Table 5 shows that there is a significant variation in person slopes. According to Reise (2000) this is an indication that individuals differ in the person-response curve slopes. This means that IRCs are not applicable to everyone in the sample and deviations from the average person slope do not reflect chance error.


Table 5: Estimates of the random effects in Model 1

Random level-2 coefficients          Standard Deviation    Variance Component    d.f.    Chi-square    P-value
Intercept variation ($r_{0jk}$)      1.12994               1.27675               162     712.97065     0.000
Slope variation ($r_{1jk}$)          0.16015               0.02565               162     211.17609     0.006

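The second step in this process is to introduce examinee trait level (i.e. $\theta_{jk}$) as a second-level variable in Equation 9. In this step the model takes the following form.

Model 2

Level-One

$$\log\left(\frac{p_{ijk}}{1-p_{ijk}}\right) = \pi_{0jk} + \pi_{1jk}\, b_{i}$$

Level-Two

$$\pi_{0jk} = \beta_{00k} + \beta_{01k}\,\theta_{jk} + r_{0jk} \tag{13}$$

$$\pi_{1jk} = \beta_{10k} + r_{1jk}$$

Level-Three

$$\begin{aligned} \beta_{00k} &= \gamma_{000} + u_{00k} \\ \beta_{01k} &= \gamma_{010} \\ \beta_{10k} &= \gamma_{100} + u_{10k} \end{aligned} \tag{14}$$

In the level-one model $\pi_{0jk}$ and $\pi_{1jk}$ represent the intercept and slope coefficients respectively. The term $\gamma_{000}$ represents the grand mean of the first-level intercepts when $\theta_{jk}$ equals zero. The term $\gamma_{010}$ is the second-level slope with which trait level predicts the first-level intercepts. The term $\gamma_{100}$ represents the average person-slope parameter.

The purpose of the second step is to establish that no reliable variation exists in the intercept coefficient. The Rasch model is a unidimensional model, i.e. it assumes that the probability of item endorsement depends on an examinee's standing on only one latent trait. If this is true, then no significant and reliable residual intercept variation should remain after introducing examinee trait level ($\theta_{jk}$) in Equation 9.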

Table 6: Estimates of the random effects in Model 2

Random level-2 coefficients          Standard Deviation    Variance Component    d.f.    Chi-square    P-value
Intercept variation ($r_{0jk}$)      0.01101               0.00012               161     50.65181      >.500
Slope variation ($r_{1jk}$)          0.07332               0.00538               162     227.72228     0.001




Tables 6 and 7 show that there is no significant or reliable residual variation in the intercepts. This means that the variation in the intercepts is completely accounted for by examinee trait level. This lends credibility to the assumptions made in using the Rasch model (a unidimensional model), since the above results show that endorsement probabilities are not affected by any factor beyond examinee trait level ($\theta_{jk}$).

In the third and final step in this process the slopes are treated as random while the intercepts are treated as non-randomly varying. Since in the previous step the residual intercept variation was found not to be significantly different from zero, there is no reason to treat the intercepts as random in this step. So in Equation 15 there is no error term. This means that, in the intercept equation, all the reliable variation in the intercepts is assumed to be accounted for by the latent trait ($\theta_{jk}$). Model 3 is presented below.

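Model 3

Level-One

$$\log\left(\frac{p_{ijk}}{1-p_{ijk}}\right) = \pi_{0jk} + \pi_{1jk}\, b_{i}$$

Level-Two

$$\pi_{0jk} = \beta_{00k} + \beta_{01k}\,\theta_{jk} \tag{15}$$

$$\pi_{1jk} = \beta_{10k} + r_{1jk}$$

Level-Three

$$\beta_{00k} = \gamma_{000} + u_{00k}$$

$$\beta_{01k} = \gamma_{010}$$

$$\beta_{10k} = \gamma_{100} + u_{10k}$$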

In this model we are most interested in: (i) the variation of the random slopes, (ii) the reliability of this variation and (iii) the empirical Bayes estimates of the person-slope parameters (Reise, 2000). It is the person-slope estimate for each individual that is used as an indicator of person-fit: individuals with person-slopes near zero are regarded as misfitting.

Before moving on to the interpretation of some preliminary results, I report an empirical comparison between the empirical Bayes slope estimates calculated from the above model and two more traditional person-fit indices, i.e. Infit and Outfit Mean Square. The person-fit literature contains many comparisons between different approaches, aimed at finding an index that outperforms others in detecting examinees with aberrant response patterns. The purpose of the comparison here was to investigate how well this new method correlates with more traditional person-fit indices. A correlation of 0.7 was found between the empirical Bayes slope estimates and Infit values, while the correlation between the empirical Bayes slope estimates and Outfit values was 0.526.

Reise (2000) also reported a comparison between empirical Bayes slope estimates and standardized log-likelihood values (lz) for simulated data. He found a correlation of -0.65, which indicates that these two methods produce similar but not identical results. A comparability study is also underway by Yang (2004), who will compare the standardized log-likelihood index lz, the standardized extended caution index ECI4z and the chi-square statistic of Trabin & Weiss with the multilevel method proposed by Reise (2000) using real empirical data. The purpose of Yang's study is to investigate the effectiveness of, and the differences between, these approaches in detecting person misfit. Whether this method has better detection rates than more traditional ones remains to be explored with simulated and real data. This, however, is beyond the scope of this research study.

Tables 8, 9, 10 and 11 present the estimates for the fixed and random coefficients obtained from HLM for Model 3.

Table 8: Estimates of the fixed-effects in Model 3

Fixed level-3 coefficient    Coefficient    Standard Error    Approx. T-ratio    d.f.    P-value
$\gamma_{000}$               -0.001666      0.029992          -0.056             7       0.958
$\gamma_{010}$               1.010894       0.029247          34.564             7477    0.000
$\gamma_{100}$               -1.024010      0.030032          -34.098            7       0.000

Table 9: Estimates of the level-2 random effects-Model 3


Random level-2 coefficient         Standard Deviation    Variance Component    d.f.    Chi-square    P-value
Slope variation ($r_{1jk}$)        0.07907               0.00625               162     177.12902     0.197

Table 10: Estimates of the level-3 random effects-Model 3


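Random level-3 coefficient             Standard Deviation    Variance Component    d.f.    Chi-square    P-value
Intercept variation ($u_{00k}$)        0.00507               0.00003               7       0.15078       >.500
Slope variation ($u_{10k}$)            0.03100               0.00096               7       8.60737       0.281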




Tables 9 and 11 show that there is no significant or reliable variation in person-slopes (p=0.197, reliability=0.155). This means that the variance of the person-slopes is not significantly different from zero and any observed variation is not reliable. Consequently this lends credibility to the fixed-slope assumption of the Rasch model. The average person-slope ($\gamma_{100}$) is -1.02, and each $r_{1jk}$ and $u_{10k}$ term represents an individual’s and a classroom’s deviation from this average respectively.

The terms $\pi_{0jk}$ and $\pi_{1jk}$ represent the empirical Bayes estimates of each examinee’s logistic regression intercept and slope respectively. In Model 3 we are more interested in the empirical Bayes estimates of person slopes, which can be used to identify individuals with aberrant response behaviour, i.e. persons with person-slopes near zero. A person-slope near zero indicates no discernible relation between item difficulty and the examinee’s responses.

Since no significant or reliable variation was found to be present in person slopes, this means that in this sample all pupils have displayed similar patterns of response. Since no significant or reliable variation exists there is no point in adding explanatory variables. In other words, there are not any non-modelled variables that are influencing test performance.

4.3 Conclusions from the Quantitative Study

In the first phase two methods were piloted for the analysis. Although neither method was very informative about the effect of individual- and class-level variables on misfit, both remain promising and will be used in the main study. However, some issues need to be resolved.


In Method 1, the cut-off value used for the Infit and Outfit statistics was set according to a simulation study. Cut-off values are a common problem in the area of person-fit statistics. For most indices it is unclear how extreme a fit value must be to indicate a lack of fit to the model, in other words how improbable a given response pattern is under the null hypothesis that it was produced according to the item response model (Molenaar & Hoijtink, 1990). This decision must rely on statistical decision-making or hypothesis testing, which requires the construction of sampling distributions for the fit statistics (Liou & Sinica, 1993). Calculating sampling distributions, however, is not easy and involves many computational problems. This is why researchers usually use large-sample approximations such as a normal distribution (Drasgow, Levine, & Williams, 1985) or a chi-square distribution.

Another issue, again in relation to Method 1, is the use of a two-level logistic regression model. In the main study an ordinal two-level model will also be tried in order to examine the effect of different variables at various levels of misfit. This was not pursued in the pilot study because multilevel model-building with an ordinal model could not be run in HLM 5; it is now available in HLM 6.

Method 2 still needs to be tried with a larger sample. Although promising, this method is still quite new in the area of person-fit, so whether it has better detection rates than more traditional methods remains to be explored. An exploratory investigation in this area is currently under way by Professor LihShing Wang and her research team (personal communication).

5 Phase B: Qualitative Study

The quantitative part of this study was followed by two case studies. The qualitative study focused on pupils whom the quantitative analysis had identified as providing misfitting response patterns in the test. The unit of analysis was not, however, the specific pupils but the phenomenon of test misfit; the cases provided the opportunity to study this phenomenon. The purpose of the case studies in this pilot study was to provide some insight into both substantive and methodological issues. As no qualitative study had been attempted before in this area of person-fit, the pilot case studies were needed in order to inform the research design with empirical observations (and not just with a review of the relevant literature) and to explore what categories and issues might emerge in the main study.

As pupils with misfitting response patterns were going to take part in the second phase of this study, this meant that qualitative data collection had to start after statistical analyses had been completed. As a result time-limits had become very tight, as it was already July and the end of school year was close. Consequently the number of pupils that could be involved in this phase was restricted by time limits. The pupils that participated in the pilot study came from the same school (and in fact from the same class) and they were selected on the basis of their Infit and Outfit values, the geographical proximity of the school and the easy access obtained by the relevant gatekeepers (i.e. head teacher, teacher and parents).

As already mentioned the data for the case studies came from semi-structured interviews with the two pupils and their teacher. There were two different interview guides, one for the teacher interview and one for the pupil interviews. The interview guide included different topic headings to be explored and associated questions under each heading. However the sequencing of questions, the amount of time and attention given to different topics varied in each interview. All the interviews were tape-recorded after obtaining permission from the interviewees and immediately transcribed after the completion of the interviews. As the data collected is strictly anonymous the names of the pupils used for the purposes of reporting are not the real ones.

5.1 Case-studies

5.1.1 Introducing the “cases”

John and Larry were selected for further study as they generated aberrant response patterns on the MALT test. John and Larry are Year 6 pupils and they are both British. They come from a community school in the City of Manchester and in fact they are in the same class. According to the latest Ofsted report, the number of pupils eligible for free school meals in this school is above the national average and most of the families in the area are affected by high unemployment. Very few pupils in this school come from homes where English is an additional language. According to the latest published league tables (i.e. 2003), the Maths performance of Year 6 pupils in this school was slightly below the national average but above the LEA average.


5.1.2 Analysis

The case studies were analysed using cross-case synthesis, a technique that applies specifically to multiple cases. In this technique each individual case study is treated as a separate study and findings are aggregated across the series of individual studies (Yin, 2003). For this purpose word tables were created that display the data from the two individual cases according to a uniform framework. This framework was provided in one of the studies in the area of person-fit. It was, however, modified slightly so that the distinction between the factors at different levels that contribute to test misfit becomes more obvious. Figure 2 presents this framework. This study tried to eliminate factors at the item level so that the study of test misfit could focus on the pupil and class levels.

Figure 2: Framework for case-studies

As already mentioned, word tables were created for each individual case, i.e. two in this pilot study (Table 12). As both pupils came from the same class, the upper part of the table, i.e. the class-level factors, remained the same. The purpose of constructing word tables is to identify cross-case patterns that allow the development of strong arguments supported by the data, which would eventually answer our research questions.


[Figure 2 diagram: factors contributing to TEST MISFIT, grouped by level.
CLASS LEVEL: demographic class characteristics; demographic teacher characteristics; instruction effect; curriculum effect.
PUPIL LEVEL: demographic characteristics; educational characteristics; personal characteristics; test-taking strategies; external factors.
ITEM LEVEL: test bias; faulty items; clerical errors.]

Table 12: Word tables for the identification of cross-patterns

Row headings of the word table:
CLASS LEVEL: class demographic characteristics; teacher demographic characteristics; instruction effect; curriculum effect.
PUPIL LEVEL: pupil demographic characteristics; educational characteristics; personal characteristics; test-taking strategies.

Although the two cases in this pilot study shared common demographic characteristics and high misfit values on the test, they differed on the remaining pupil-level dimensions. Although cross-case patterns were difficult to identify as only two pupils were interviewed, the tables showed that the misfitting response behaviour of the two pupils on the test was the result of various interactions among pupil-level factors. However, the reasons that led each pupil to provide misfitting responses were quite different. For John, a high-ability pupil in Maths, the main reasons behind his misfitting behaviour were misinterpretation of very easy items and carelessness. For Larry, the main reason was his low motivation in taking the test, which led him to make minimal effort and to copy from material displayed on the classroom walls.

The analysis nevertheless highlighted new directions for analysis that could be pursued in the main study. Specifically, it will be interesting to pursue comparisons based on various key characteristics, for example ‘high ability misfitting pupils vs. low ability misfitting pupils’, ‘high motivation vs. low motivation’ etc., in order to identify different types of misfit.

The case studies also brought to the surface several important issues that will be taken into account in the main study. Specifically, the test took place at a very unsuitable time of the school year. This had an impact on the willingness of schools to participate in the study and also on the motivation of pupils in taking the test. In the main study the test will be administered much earlier in the school year, i.e. around late January. This will also facilitate the case studies, as more time will be available for on-site visits and interviews.

Misfit at the individual level could be explored satisfactorily through the interviews. However, misfit at the class level was more difficult to explore in this way. Issues at the class level, such as opportunity-to-learn, curriculum and instructional practices, were difficult to explore through teacher interviews. This was partly due to time limits, as the teacher interview takes place during school time. Another approach to these issues has to be attempted in the main study; possible alternatives are school documents, interviews with the head teacher etc.

Moreover, during the case studies pupils were asked to solve again the test items to which they had given unexpected answers. The conditions under which the items are approached during the interview and during the actual test differ in many ways, for example in the amount of time devoted to specific items, in the psychological and physical state of the pupils, in the amount of attention given to items, etc. As a result, we depend on the explanations given by the pupils to explore what happened. However, factors such as misconceptions, language deficiencies, curriculum effect etc. can still be explored.

Finally, one has to take account of so-called reactivity. According to Robson (2002), reactivity refers to the “way in which the researcher’s presence may interfere in some way with the setting which forms the focus of the study, and in particular with the behaviour of the people involved” (p. 172). This is unavoidable when the research involves human beings, but it will be kept in mind when interpreting results.

6 Concluding Remarks

The pilot study provided useful feedback on the methods employed to collect and analyse data. Specifically, it showed how the questionnaires could be improved so as to serve their purposes better. Moreover, problems associated with data access and time constraints were revealed that will be taken into account in the main study.

The pilot study also made it possible to check the feasibility of, and the problems associated with, the statistical and qualitative methods employed. Specifically, two statistical methods for identifying misfitting response patterns and for examining the effect of explanatory variables at different levels on person misfit were piloted in this study. The first method could not be fully explored due to sample size problems. The second method showed that the variation in person-slopes was not significantly different from zero. Person-slopes in this method provide a vehicle, according to Reise (2000), for identifying examinees with uninterpretable patterns of response. The absence of significant variation in person-slopes indicated that no non-modelled variable other than item difficulty affected test performance.
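The role of person-slopes can be made concrete with a generic two-level logistic model of the kind this approach relies on. The formulation below is a sketch only: the symbols are illustrative assumptions rather than the notation of Reise (2000), and further explanatory variables could enter either level.

```latex
% Illustrative notation (assumed, not quoted from Reise, 2000):
% p_{ij}   probability that person j answers item i correctly
% \delta_i difficulty of item i
% b_{0j}, b_{1j}  person-specific intercept and slope
\begin{aligned}
\text{Level 1 (items within persons):}\quad
  & \log\frac{p_{ij}}{1 - p_{ij}} = b_{0j} + b_{1j}\,\delta_i \\
\text{Level 2 (persons):}\quad
  & b_{0j} = \gamma_{00} + u_{0j}, \qquad
    b_{1j} = \gamma_{10} + u_{1j}, \qquad
    \tau_{11} = \operatorname{Var}(u_{1j})
\end{aligned}
```

Under this sketch, the finding that variation in person-slopes was not significantly different from zero corresponds to failing to reject $\tau_{11} = 0$, that is, the relationship between item difficulty and the probability of success does not differ detectably across pupils.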

Although the quantitative study was not very informative, the qualitative part that followed provided informative empirical observations. The two case studies revealed interesting directions for analysis to be explored further in the main study, together with any new ones that might emerge.

Although there are still problems that need to be resolved for the main study, this pilot study showed that the multilevel methodology in the area of person-fit is promising and from the case-studies we can anticipate some interesting results.

7 References

Adams, R. J., & Khoo, S.-T. (1996). QUEST: The Interactive Test Analysis System. Australia: The Australian Council for Educational Research Ltd.
