February 2016

Appendices to

“Developing Instruments to Assess and Compare the Quality of Engineering

Education: the Case of China and Russia”1

E. Kardanova, P. Loyalka, I. Chirikov, L. Liu, G. Li, H. Wang, E. Enchikova, H. Shi,

N. Johnson

Appendix A. Analytical approach: technical details

One of the intentions of the pilot study and subsequent analysis was to create shortened final

tests that can be used in future studies by selecting only the items with the best psychometric

properties from our pilot tests. In particular, while the pilot tests were 55 minutes each, the

research team sought to cut the length of each subject test to 40 minutes for the final versions.

We therefore included more items in the pilot tests than we needed for the final tests. We also

gave more time to the students during the pilot study so that we would be able to delete some of

the items from the tests due to poor psychometric quality, that is, if they did not fit the IRT

model or had low discrimination.

As a discrimination index we used the correlation between examinees’ responses to the

item and their ability levels. As for the threshold to detect the items with low discrimination, we

used a value of 0.2, which is usually used for this purpose in similar studies (Crocker and Algina

1986). We expected that the final number of items for the main study would be 35–40 items for

each subject test.
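
The discrimination screen described above can be sketched as follows. This is a minimal illustration, assuming responses are scored 0/1 and ability estimates are already available; the function names and the data layout are hypothetical, and the 0.2 threshold is the one stated in the text.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def low_discrimination_items(responses, abilities, threshold=0.2):
    """Return indices of items whose response-ability correlation is below threshold.

    responses: one list of 0/1 item scores per examinee (hypothetical layout).
    abilities: one ability estimate per examinee.
    """
    flagged = []
    for j in range(len(responses[0])):
        item_scores = [row[j] for row in responses]
        if pearson(item_scores, abilities) < threshold:
            flagged.append(j)
    return flagged

# Toy data: item 0 tracks ability, item 1 does not.
responses = [[0, 1], [0, 0], [1, 1], [1, 0]]
abilities = [-1.5, -0.5, 0.5, 1.5]
print(low_discrimination_items(responses, abilities))
```

An item that every examinee answers the same way has zero variance and would need separate handling, since the correlation is undefined in that case.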

1 Forthcoming in Assessment and Evaluation in Higher Education

To measure the extent to which the data fit the Rasch model, we used the unweighted and

weighted mean square statistics provided by Winsteps (in terms of Winsteps output: OUTFIT

MNSQ and INFIT MNSQ, respectively). These statistics rely on standardized residuals, which

represent the differences between the observed response and the response expected under the

model (Wright and Stone 1979). Generally, values above 1.2 for these statistics are used to flag

potential problems with misfit.
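
To make the two fit statistics concrete, the sketch below computes them for a single dichotomous item from the textbook Rasch definitions. It is an illustration only: Winsteps computes these values internally, and the function names, ability values and difficulty value here are hypothetical.

```python
from math import exp

def rasch_probability(theta, b):
    """Probability of a correct response for ability theta and item difficulty b."""
    return 1.0 / (1.0 + exp(-(theta - b)))

def item_fit(item_responses, abilities, difficulty):
    """Return (outfit, infit) mean squares for one item.

    outfit: unweighted mean of squared standardized residuals.
    infit:  sum of squared residuals divided by the sum of model variances.
    """
    squared_std_residuals = []
    sum_squared_residuals = 0.0
    sum_variances = 0.0
    for x, theta in zip(item_responses, abilities):
        p = rasch_probability(theta, difficulty)
        variance = p * (1.0 - p)
        residual = x - p
        squared_std_residuals.append(residual ** 2 / variance)
        sum_squared_residuals += residual ** 2
        sum_variances += variance
    outfit = sum(squared_std_residuals) / len(squared_std_residuals)
    infit = sum_squared_residuals / sum_variances
    return outfit, infit

# Toy case: a high-ability examinee (theta = 3) unexpectedly fails an item of difficulty 0.
outfit, infit = item_fit([1, 0, 1, 0, 0], [0.0, 0.0, 0.0, 0.0, 3.0], 0.0)
print(outfit > 1.2, infit > 1.2)
```

The single unexpected response inflates the unweighted statistic more than the weighted one, which is why the two are usually inspected together.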

To test for DIF across countries and across grades we used the ETS approach for DIF

classification (Zwick et al. 1999), which designates items as A (negligible or nonsignificant

DIF), B (slight DIF), or C (large DIF) items depending on the magnitude of the Mantel-Haenszel

statistic (Dorans 1989) and its statistical significance. An item was considered a C item if two

conditions were satisfied: (1) the difference in item difficulty between different groups of

students was more than 0.64 logits, and (2) the Mantel-Haenszel statistic had a significance level

of p < .05 (Linacre 2011). Only C items were considered as items with DIF in this study.
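
The two-part rule for C items can be written as a small predicate. This sketch covers only the rule stated above; the cutoffs separating A from B items are not given in this appendix and are therefore not modeled. The function name and the example values are hypothetical.

```python
def is_c_item(difficulty_group_1, difficulty_group_2, mh_p_value,
              min_contrast=0.64, alpha=0.05):
    """Return True when an item meets both conditions for large (C-level) DIF.

    difficulty_group_1, difficulty_group_2: item difficulty in logits, estimated
    separately for the two groups being compared (e.g. two countries).
    mh_p_value: significance level of the Mantel-Haenszel statistic.
    """
    contrast = abs(difficulty_group_1 - difficulty_group_2)
    return contrast > min_contrast and mh_p_value < alpha

# Hypothetical items: (difficulty in group 1, difficulty in group 2, MH p-value)
items = {
    "item_a": (0.10, 0.95, 0.01),   # large and significant contrast
    "item_b": (0.10, 0.95, 0.20),   # large contrast, not significant
    "item_c": (0.10, 0.40, 0.01),   # significant, but contrast too small
}
dif_items = [name for name, values in items.items() if is_c_item(*values)]
print(dif_items)
```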

To examine the dimensionality of each scale we conducted a principal components

analysis (PCA) of the standardized residuals (Linacre 1998; Smith 2002). Theoretically, if all the

information in the data is explained by one latent variable, the residuals would represent random

noise and would be independent of each other. As a consequence, correlations between the

residuals would be near zero. If there is no second dimension within the data, then a PCA of the

standardized residuals should generate eigenvalues all near one and the percentage of variance

across the components should be uniform (Ludlow 1985).
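
The logic of this check can be illustrated with a small pure-Python sketch: correlate the standardized residuals across items, then look at the largest eigenvalue of that correlation matrix. The helper names are hypothetical, power iteration stands in for a full PCA routine, and the residual values are toy data chosen so the result is easy to verify by hand.

```python
from math import sqrt

def correlation_matrix(columns):
    """Pearson correlations between columns (one list of residuals per item)."""
    def corr(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sqrt(sum((a - mx) ** 2 for a in x))
        sy = sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy)
    return [[corr(ci, cj) for cj in columns] for ci in columns]

def largest_eigenvalue(matrix, iterations=200):
    """Approximate the largest eigenvalue of a symmetric matrix by power iteration."""
    size = len(matrix)
    vector = [float(i + 1) for i in range(size)]  # arbitrary non-zero start
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(size))
                   for i in range(size)]
        norm = sqrt(sum(v * v for v in product))
        if norm == 0.0:
            return 0.0
        vector = [v / norm for v in product]
    product = [sum(matrix[i][j] * vector[j] for j in range(size))
               for i in range(size)]
    return sum(v * p for v, p in zip(vector, product))  # Rayleigh quotient

# Mutually uncorrelated residuals: every eigenvalue equals one.
independent = [[1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
# Two items share a residual pattern: a second dimension appears.
dependent = [[1, -1, 1, -1], [1, 1, -1, -1], [1, -1, 1, -1]]
print(largest_eigenvalue(correlation_matrix(independent)))
print(largest_eigenvalue(correlation_matrix(dependent)))
```

In the second case the largest eigenvalue rises to about two, the strength of two items, which is the kind of departure from uniformity the residual analysis is meant to detect.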

To analyze reliability, we used the person reliability index and the person separation index

provided by the Rasch analysis (Stone 2004). The separation index compares the distribution of

student measures (estimates of ability) with their measurement errors and indicates the spread of

student measures in standard error units. The index can be used to calculate the number of

distinct levels, or strata separated by at least three errors of measurement, in the distribution

(Wright and Stone 1979; Smith 2001). The number of strata is calculated as

Strata = (4G + 1)/3, where G is the separation index.

At least three different strata are recommended (for example, low, middle and high ability

levels).
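
As a worked example of the strata formula, the sketch below applies it to the person separation indices reported for the physics tests in Appendix B (2.03 for grade 1 and 1.84 for grade 3). The function name is hypothetical.

```python
def strata(separation_index):
    """Number of statistically distinct ability levels: (4G + 1) / 3."""
    return (4 * separation_index + 1) / 3

for label, g in [("grade 1 physics", 2.03), ("grade 3 physics", 1.84)]:
    print(label, round(strata(g), 2))
```

Both values come out at roughly three strata (about 3.04 and 2.79), consistent with the three distinct groups of students described in Appendix B.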

In order to show the relative distribution of item difficulty and students’ scores in a

common metric, we constructed the variable map (Wright and Stone 1979). For equating

between grades, we used a separate calibration design with anchoring items from one test when

calibrating the other test (Wolfe 2004). After equating, we evaluated the quality of the link

between the grade 1 and grade 3 tests by calculating the item-within-link statistic (Wright and

Bell 1984). Under the null hypothesis of the items exhibiting perfect fit within the link, this

statistic has an expected value of 1. Link adequacy was also evaluated by determining the

stability of the item difficulty estimates across the grade 3 test with and without anchoring. To do

that, we calculated the correlation between the item difficulty estimates.
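
The stability check in the last two sentences amounts to correlating two vectors of item difficulties. The sketch below is a minimal illustration; the function name and the difficulty values are hypothetical and are not taken from the study.

```python
from math import sqrt

def difficulty_correlation(anchored, unanchored):
    """Pearson correlation between two sets of difficulty estimates for the same items."""
    n = len(anchored)
    mean_a = sum(anchored) / n
    mean_u = sum(unanchored) / n
    cov = sum((a - mean_a) * (u - mean_u) for a, u in zip(anchored, unanchored))
    sd_a = sqrt(sum((a - mean_a) ** 2 for a in anchored))
    sd_u = sqrt(sum((u - mean_u) ** 2 for u in unanchored))
    return cov / (sd_a * sd_u)

# Hypothetical difficulties (logits) for the same four items, with and without anchoring.
with_anchoring = [-1.0, 0.0, 1.0, 2.0]
without_anchoring = [-0.8, 0.2, 1.2, 2.2]
print(difficulty_correlation(with_anchoring, without_anchoring))
```

A uniform shift between the two calibrations leaves the correlation at one, so a high correlation indicates a stable ordering and spacing of items rather than identical values.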

Appendix B. The results of the psychometric analysis for grades 1 and 3 physics tests

The results for the physics tests are substantively similar to those for the mathematics tests and will be

presented briefly because of space limitations. Five and six items were deleted from the grade 1

and grade 3 physics tests respectively because of poor psychometric quality. For further analysis

we thus considered the sets of 40 and of 39 items for the grade 1 and grade 3 tests respectively.

Of the 40 items for the grade 1 physics test, 17 demonstrated country-related DIF: 9 items in

favour of China and 8 in favour of Russia. The other 23 items were DIF free and could be used

for linking between the two countries. As for the grade 3 physics test, 16 demonstrated country-

related DIF: 8 items in favour of China and 8 in favour of Russia, while the remaining 23 items

were DIF free. As with the mathematics tests, we used DIF-free items in each test for linking

between the two countries. Items that were not considered good for linking were deleted at

earlier stages, either because they did not have good psychometric qualities when they were

analyzed for inclusion in a particular test or because they exhibited DIF for at least one test.

Further analysis showed that all items of the grade 1 and grade 3 physics tests had good

psychometric characteristics, fit the model, and could be considered as essentially

unidimensional. The person reliability and the person separation index were 0.83 and 2.03 for the

grade 1 physics test and 0.77 and 1.84 for the grade 3 physics test, indicating three statistically

distinct groups of students along the continuum for each test. The variable maps for the physics

tests are presented in Figures B.1 and B.2.

Figure B.1. The Physics Grade 1 Test Variable Map

Figure B.2. The Physics Grade 3 Test Variable Map

Appendix C. The Content Areas of the Math and Physics Tests

The experts were asked to rate the items according to four criteria: (1) comprehensibility of

wording, (2) appropriateness in measuring the content area of interest, (3) difficulty, and (4)

expected time required to answer. The experts found 13.4% of the physics items and 7.5% of the

math items to have problems in clarity of wording (meaning 25% or more of the experts marked the

item for clarity issues). We reviewed all the items that experts marked as having clarity issues

and rectified the items that had simple and obvious wording problems and deleted the other

unclear items from our test item bank. The experts rated criteria (2), (3) and (4) on continuous scales

(e.g. 1-4). We gave priority to the items that experts deemed to be most appropriate in measuring

the content of interest. We also selected items spread across a range of difficulties and avoided

items that experts thought would take students too long to complete. Consequently, we selected

fewer than half of the items in the original pool for use in the clinical pilot: of the 179 physics

items and 174 math items we initially collected, we selected 80 physics items and 80 math items.

In selecting these items, we also took care to keep a balance across weighted content areas.
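
The clarity rule used in the expert review (an item is flagged when 25% or more of the experts mark it) can be sketched as below. The function name and the example marks are hypothetical.

```python
def has_clarity_problem(expert_marks, cutoff=0.25):
    """Return True when the share of experts marking a clarity issue reaches the cutoff.

    expert_marks: one boolean per expert, True if that expert marked the item.
    """
    share = sum(expert_marks) / len(expert_marks)
    return share >= cutoff

# Hypothetical marks from panels of four and five experts.
print(has_clarity_problem([True, False, False, False]))         # 1 of 4 experts
print(has_clarity_problem([True, False, False, False, False]))  # 1 of 5 experts
```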

Table C.1. The Content Areas of the Math Grade 1 Test

Number  Topic                                   Frequency  %
1       Derivatives and their application       7          18.9
2       Equations                               7          18.9
3       Functions and domains                   5          13.5
4       Inequalities                            3          8.1
5       Mathematical reasoning and logic        5          13.5
6       Single Variable Differentiation         4          10.9
7       Trigonometric functions and equations   6          16.2
Total                                           37         100

Table C.2. The Content Areas of the Math Grade 3 Test

Number  Topic                                   Frequency  %
1       Derivatives and their application       3          7.7
2       Equations                               1          2.6
3       Functions and domains                   1          2.6
4       Inequalities                            1          2.6
5       Linear Algebra                          5          12.8
6       Mathematical reasoning and logic        2          5.1
7       Multivariate Differentiation            6          15.4
8       Ordinary differential Equations         1          2.6
9       Probability and statistics              3          7.7
10      Series                                  2          5.1
11      Single Variable Differentiation         7          17.9
12      Single Variable Integration             5          12.8
13      Trigonometric functions and equations   2          5.1
Total                                           39         100

Table C.3. The Content Areas of the Physics Grade 1 Test

Number  Topic                               Frequency  %
1       Circuits                            5          13%
2       Electromagnetic fields              6          15%
3       Electromagnetic induction           7          18%
4       Mechanical energy                   1          3%
5       Motion and forces                   3          8%
6       Optics                              6          15%
7       Oscillation and mechanical waves    6          15%
8       Dynamics - Mechanics                1          3%
9       Electricity and Electric Fields     2          5%
10      Magnetism and Magnetic Fields       2          5%
11      Waves and Oscillation               1          3%
Total                                       40         100%

Table C.4. The Content Areas of the Physics Grade 3 Test

Number  Topic                               Frequency  %
1       Circuits                            2          5%
2       Electromagnetic fields              2          5%
3       Electromagnetic induction           6          15%
4       Motion and forces                   1          3%
5       Optics                              7          18%
6       Oscillation and mechanical waves    2          5%
7       Dynamics - Mechanics                2          5%
8       Electricity and Electric Fields     5          13%
9       Magnetism and Magnetic Fields       6          15%
10      Relativity and Quantum Physics      2          5%
11      Waves and Oscillation               4          10%
Total                                       39         100%

Appendix D. Selection of Majors

In order to meaningfully compare learning gains across institutions and across countries in this

field, it is necessary to limit the sample of students to those that have overlapping course

requirements. In both China and Russia, the major-categories selected for inclusion in this study

(Electrical Engineering and Computer Science) do not correspond to discrete majors but rather to

“categories” of majors that have varying course requirements. We therefore took the following steps to

further limit our sample to students who are enrolled in similar majors within the categories of

EE and CS, i.e. majors that are similar in terms of course requirements and curriculum. Through

finding overlapping courses between majors and countries, our goal was to limit our sample to

EE and CS majors whose students experience a common set of curricular experiences that are

most relevant to the EE and CS categories.

Unlike China and Russia, US doctoral-research institutions (the institutional equivalent of

our Chinese and Russian sample) typically have only one major called EE and one major called

CS. We collected curricula information from ten doctoral-research institutions (Stanford,

Cornell, University of Washington, Vanderbilt, Virginia Tech, Boston University, Kansas State

University, University of Kentucky, Wayne State University, Marquette University) and

constructed a list of required courses common to all the institutions in EE and CS respectively.

While it is true that this does not fully account for the heterogeneity in the American higher

education system, we believe that looking for overlap with these American majors allows us

to state with much greater confidence that we have developed our assessments to be of relevance

to EE and CS students across international contexts.

In China, we also collected curricular information on all EE and CS majors from 10

Chinese universities (both elite and non-elite) and constructed a list of common required courses

for each major within country. In Russia, we obtained the national curriculum for each EE and

CS major.

Finally, we compared the required course lists for the U.S., China and Russia, and

dropped the majors that did not require the full list of required courses used in American CS and

EE departments. Through this process, we selected EE and CS majors whose students have

similar curricular experiences within country and across countries.
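
The selection procedure above is essentially a set intersection followed by a subset test. The sketch below illustrates it under stated assumptions: the institution names, major names and course titles are hypothetical placeholders, not the curricula actually collected.

```python
def common_required_courses(curricula):
    """Courses required at every institution (intersection of the course lists)."""
    course_sets = [set(courses) for courses in curricula.values()]
    return set.intersection(*course_sets)

def majors_covering(reference_courses, majors):
    """Names of majors whose requirements include every reference course."""
    return sorted(name for name, courses in majors.items()
                  if reference_courses <= set(courses))

# Hypothetical required-course lists for two US institutions.
us_curricula = {
    "Institution A": ["Circuits", "Signals and Systems", "Programming"],
    "Institution B": ["Circuits", "Signals and Systems", "Electromagnetics"],
}
# Hypothetical majors within one country's EE category.
candidate_majors = {
    "Major X": ["Circuits", "Signals and Systems", "Control Theory"],
    "Major Y": ["Circuits", "Optics"],
}

reference = common_required_courses(us_curricula)
print(sorted(reference))
print(majors_covering(reference, candidate_majors))
```

Majors that lack any course on the reference list are dropped, which mirrors the final comparison step described above.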

Since the common curriculum overlaps with that of the US, we can also have reasonable

confidence that the curricula we used for test development bear relevance to the field of EE and

CS in general and are not just reflections of the peculiarities of the higher education systems in

China and Russia.

11

Appendix E. Chinese Curricular Standards

Although China’s Ministry of Education does not publish official national curriculum standards

for engineering education, it approves a finite list of textbooks that reflect the math and physics

content that should be taught in engineering programs in Chinese universities. We compared the

MOE-approved textbooks against one another and found that the main content areas were

essentially the same across all of these approved textbooks. We then based our content map for

China on the (almost entirely) overlapping content areas between the textbooks, as this

constitutes the de facto national curriculum for engineering students in China. The full list of

textbooks is included immediately below.

Zhong, X. 2013. Eleventh Five-Year National Plan General Higher Education Text Book—General Physics Curriculum: Mechanics (2nd Edition). PutongGaodengJiaoyuShiyiwuGuojiaGuihuaJiaocaiDaxueWuliTongyongJiaocheng: Lixue (2nd Edition). Peking University Press.

Liu, Y. 2013. Eleventh Five-Year National Plan General Higher Education Text Book—General Physics Curriculum: Thermodynamics (2nd Edition). PutongGaodengJiaoyuShiyiwuGuojiaGuihuaJiaocaiDaxueWuliTongyongJiaocheng: Rexue (2nd Edition). Peking University Press.

Chen, B., and J. Wang. 2012. Eleventh Five-Year National Plan General Higher Education Text Book—General Physics Curriculum: Electro-magnetism (2nd Edition). PutongGaodengJiaoyuShiyiwuGuojiaGuihuaJiaocaiDaxueWuliTongyongJiaocheng: Lixue (2nd Edition). Peking University Press.

Chen, X., and X. Zhong. 2011. Eleventh Five-Year National Plan General Higher Education Text Book—General Physics Curriculum: Optics (2nd Edition). PutongGaodengJiaoyuShiyiwuGuojiaGuihuaJiaocaiDaxueWuliTongyongJiaocheng: Guangxue (2nd Edition). Peking University Press.

Chen, X., and X. Zhong. 2011. Eleventh Five-Year National Plan General Higher Education Text Book—General Physics Curriculum: Modern Physics (2nd Edition). PutongGaodengJiaoyuShiyiwuGuojiaGuihuaJiaocaiDaxueWuliTongyongJiaocheng: JindaiWuli Peking University Press.

Tongji University Department of Mathematics. 2007. Twelfth Five-Year National Plan General Higher Education Text Book: Advanced Mathematics (part 1) (6th edition). ShierwuPutongGaodengJiaoyuBenkeGuojiajiGuihuaJiaocai: GaodengShuxue. Higher Education Press.

Tongji University Department of Mathematics. 2007. Twelfth Five-Year National Plan General Higher Education Text Book: Advanced Mathematics (part 2) (6th edition). ShierwuPutongGaodengJiaoyuBenkeGuojiajiGuihuaJiaocai: GaodengShuxue. Higher Education Press.

Tongji University Department of Mathematics. 2007. Twelfth Five-Year National Plan General Higher Education Text Book: Engineering Mathematics and Linear Algebra (5th edition). ShierwuPutongGaodengJiaoyuBenkeGuojiajiGuihuaJiaocai: GongchengShuxueXianxingDaishu. Higher Education Press.

Sheng, J., S. Xie, and C. Pan. 2010. Eleventh Five-Year National Plan General Higher Education Text Book: Probability and Mathematical Statistics (4th edition). PutongGaodengJiaoyuShiyiwuGuojiaGuihuaJiaocai: Gaolvlun Yu Shuli Tongji. Higher Education Press.

13

Appendix F: Test Item Selection

Test items were selected from each country’s past university entrance exams (China’s Gaokao

and Russia’s Unified State Exam), other standardized exams, and widely-used exercise books in

both countries. All of these items were multiple choice items and all were taken from sources

that are used widely in each country and targeted at a national population of students similar to

the students in our sampling frame.

A number of items were taken from Russian materials. In Russia, we took test items from

standardized exams in math and physics that were provided by the Institute for Monitoring the

Quality in Education – the country’s primary quality assessment agency for higher education.

The items for the grade 1 tests were very similar to those used on the Russian Unified State

Exam, which is the mandatory college entrance exam that all students must take if they seek

entry to higher education institutions. The items for the grade 3 tests were based on the Russian

Federal State Standard in math and physics (mandatory part of the curriculum for most Russian

higher education institutions).

Additional test items were taken from Chinese materials. In China, grade 1 test items

were taken from the college entrance examination (gaokao), a nation-wide standardized

examination of high school learning that determines college entry for the vast majority of

students. Test items for grade 3 math and physics came from official Chinese exercise books that

are on the list of approved curricular materials provided by the Ministry of Education for

university use (see Appendix E for details).

14

References

Crocker, L., and J. Algina. 1986. Introduction to Classical and Modern Test Theory. New York: Holt, Rinehart, and Winston.

Dorans, N.J. 1989. “Two New Approaches to Assessing Differential Item Functioning: Standardization and the Mantel-Haenszel Method.” Applied Measurement in Education 2 (3): 217-233.

Linacre, J.M. 1998. “Detecting multidimensionality: Which residual data-type works best?” Journal of Outcome Measurement 2: 266-283.

Ludlow, L.H. 1985. “A strategy for the graphical representation of Rasch model residuals.” Educational and Psychological Measurement 45 (4): 851-859.

Smith, E.V. 2001. “Evidence for the reliability of measures and validity of measure interpretation: A Rasch measurement perspective.” Journal of Applied Measurement 2: 281-311.

Smith, E.V. 2002. “Understanding Rasch measurement: Detecting and evaluating the impact of multidimensionality using item fit statistics and principal component analysis of residuals.” Journal of Applied Measurement 3 (2): 205-231.

Stone, M.H. 2004. “Substantive scale construction.” In Introduction to Rasch measurement, edited by E.V. Smith and R.M. Smith, 201-225. Maple Grove, MN: JAM Press.

Wolfe, E.W. 2004. “Equating and Item Banking with the Rasch Model.” In Introduction to Rasch measurement, edited by E.V. Smith and R.M. Smith, 366-390. Maple Grove, MN: JAM Press.

Zwick, R., D.T. Thayer, and C. Lewis. 1999. “An Empirical Bayes Approach to Mantel-Haenszel DIF Analysis.” Journal of Educational Measurement 36 (1): 1-28.
