February 2016
Appendices to
“Developing Instruments to Assess and Compare the Quality of Engineering
Education: the Case of China and Russia”1
E. Kardanova, P. Loyalka, I. Chirikov, L. Liu, G. Li, H. Wang, E. Enchikova, H. Shi,
N. Johnson
Appendix A. Analytical approach: technical details
One of the intentions of the pilot study and subsequent analysis was to create shortened final
tests that can be used in future studies by selecting only the items with the best psychometric
properties from our pilot tests. In particular, while the pilot tests were 55 minutes each, the
research team sought to cut the length of each subject test to 40 minutes for the final versions.
We therefore included more items in the pilot tests than we needed for the final tests. We also
gave more time to the students during the pilot study so that we would be able to delete some of
the items from the tests due to poor psychometric quality, that is, if they did not fit the IRT
model or had low discrimination.
As a discrimination index we used the correlation between examinees’ responses to the
item and their ability levels. As for the threshold to detect the items with low discrimination, we
used a value of 0.2, which is usually used for this purpose in similar studies (Crocker and Algina
1986). We expected that the final number of items for the main study would be 35–40 items for
each subject test.
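The discrimination screen described above can be sketched as follows. This is only an illustration, not the authors' actual procedure: the function name `flag_low_discrimination`, the toy data, and the use of a plain Pearson correlation between item scores and ability estimates are assumptions. Only the 0.2 threshold comes from the text.

```python
import numpy as np

def flag_low_discrimination(responses, ability, threshold=0.2):
    """Correlate each item's 0/1 scores with ability and flag weak items.

    responses: 2-D array (examinees x items) of 0/1 scores.
    ability:   1-D array of ability estimates, one per examinee.
    Returns (correlations, flags); flags[j] is True when item j falls
    below the threshold.
    """
    responses = np.asarray(responses, dtype=float)
    ability = np.asarray(ability, dtype=float)
    corrs = np.array([
        np.corrcoef(responses[:, j], ability)[0, 1]
        for j in range(responses.shape[1])
    ])
    return corrs, corrs < threshold

# Hypothetical toy data: 6 examinees, 2 items.
ability = np.array([-1.5, -0.8, -0.2, 0.3, 0.9, 1.6])
responses = np.array([
    [0, 1],
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1],
    [1, 0],
])
corrs, flags = flag_low_discrimination(responses, ability)
print(corrs, flags)  # item 1 tracks ability closely; item 2 does not
```

In this toy example the second item is answered correctly about as often by weak as by strong examinees, so its correlation falls below 0.2 and it would be a candidate for deletion.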
To measure the extent to which the data fit the Rasch model, we used the unweighted and
weighted mean square statistics provided by Winsteps (in terms of Winsteps output: OUTFIT
MNSQ and INFIT MNSQ, respectively). These statistics rely on standardized residuals, which
represent the differences between the observed response and the response expected under the
model (Wright and Stone 1979). Generally, values above 1.2 for these statistics are used to flag
potential problems with misfit.
1 Forthcoming in Assessment and Evaluation in Higher Education
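As a rough illustration of how these mean square statistics are built from residuals, the sketch below computes them for dichotomous items under the Rasch model. It is not Winsteps code. The function names, the simulated data, and the assumption that person abilities (`theta`) and item difficulties (`b`) are already estimated are all hypothetical.

```python
import numpy as np

def rasch_prob(theta, b):
    """Probability of a correct response (examinees x items) under the Rasch model."""
    theta = np.asarray(theta, dtype=float)[:, None]
    b = np.asarray(b, dtype=float)[None, :]
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def item_fit(responses, theta, b):
    """Return (outfit, infit) mean squares for each item."""
    x = np.asarray(responses, dtype=float)
    p = rasch_prob(theta, b)
    var = p * (1.0 - p)                  # model variance of each response
    sq_resid = (x - p) ** 2              # squared raw residuals
    # Unweighted: mean of squared standardized residuals.
    outfit = (sq_resid / var).mean(axis=0)
    # Weighted: squared residuals summed, divided by summed model variance.
    infit = sq_resid.sum(axis=0) / var.sum(axis=0)
    return outfit, infit

# Hypothetical data simulated from the model itself, so fit should be near 1.
rng = np.random.default_rng(0)
theta = rng.normal(0.0, 1.0, size=1000)
b = np.linspace(-1.0, 1.0, 5)
responses = (rng.random((1000, 5)) < rasch_prob(theta, b)).astype(int)

outfit, infit = item_fit(responses, theta, b)
misfit = (outfit > 1.2) | (infit > 1.2)  # flag items above the 1.2 criterion
print(outfit, infit, misfit)
```

Both statistics have an expected value of 1 when the data follow the model. The unweighted version is more sensitive to unexpected responses from examinees far from the item's difficulty.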
To test for DIF across countries and across grades we used the ETS approach for DIF
classification (Zwick et al. 1999), which designates items as A (negligible or nonsignificant
DIF), B (slight DIF), or C (large DIF) items depending on the magnitude of the Mantel-Haenszel
statistic (Dorans 1989) and its statistical significance. An item was considered a C item if two
conditions were satisfied: (1) the difference in item difficulty between different groups of
students was more than 0.64 logits, and (2) the Mantel-Haenszel statistic had a significance level
of p < .05 (Linacre 2011). Only C items were considered as items with DIF in this study.
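The two-condition rule for C items can be written as a small check. This is a sketch under stated assumptions: the function name and argument names are hypothetical, and the group difficulties and the Mantel-Haenszel p-value are taken as already computed by the analysis software.

```python
def is_c_item(difficulty_group1, difficulty_group2, mh_p_value,
              size_threshold=0.64, alpha=0.05):
    """Return True if an item meets both conditions for large (C) DIF.

    difficulty_group1, difficulty_group2: item difficulty in logits,
        estimated separately for the two groups being compared.
    mh_p_value: significance level of the Mantel-Haenszel statistic.
    """
    dif_contrast = abs(difficulty_group1 - difficulty_group2)
    return dif_contrast > size_threshold and mh_p_value < alpha

# Hypothetical items: (difficulty in group 1, difficulty in group 2, p-value).
print(is_c_item(0.9, 0.1, 0.01))   # large and significant
print(is_c_item(0.9, 0.1, 0.20))   # large but not significant
print(is_c_item(0.5, 0.1, 0.001))  # significant but too small
```

Only the first hypothetical item satisfies both conditions, so only it would be treated as a DIF item under the rule described above.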
To examine the dimensionality of each scale we conducted a principal components
analysis (PCA) of the standardized residuals (Linacre 1998; Smith 2002). Theoretically, if all the
information in the data is explained by one latent variable, the residuals would represent random
noise and would be independent of each other. As a consequence, correlations between the
residuals would be near zero. If there is no second dimension within the data, then a PCA of the
standardized residuals should generate eigenvalues all near one and the percentage of variance
across the components should be uniform (Ludlow 1985).
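A minimal sketch of this residual-based dimensionality check follows, assuming dichotomous items and already-estimated person abilities and item difficulties. The function name, the simulated data, and the choice to take eigenvalues of the residual correlation matrix are illustrative assumptions, not a description of the Winsteps implementation.

```python
import numpy as np

def residual_eigenvalues(responses, theta, b):
    """PCA of standardized Rasch residuals.

    Returns the eigenvalues of the item-by-item residual correlation
    matrix (largest first) and each component's share of the variance.
    """
    x = np.asarray(responses, dtype=float)
    theta = np.asarray(theta, dtype=float)[:, None]
    b = np.asarray(b, dtype=float)[None, :]
    p = 1.0 / (1.0 + np.exp(-(theta - b)))     # expected responses
    z = (x - p) / np.sqrt(p * (1.0 - p))       # standardized residuals
    corr = np.corrcoef(z, rowvar=False)        # residual correlations
    eigvals = np.linalg.eigvalsh(corr)[::-1]   # sort in descending order
    return eigvals, eigvals / eigvals.sum()

# Hypothetical unidimensional data simulated from the Rasch model.
rng = np.random.default_rng(1)
theta = rng.normal(0.0, 1.0, size=500)
b = np.linspace(-1.0, 1.0, 6)
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
responses = (rng.random(p.shape) < p).astype(int)

eigvals, share = residual_eigenvalues(responses, theta, b)
print(eigvals)  # for unidimensional data these should all lie near one
print(share)    # and the variance shares should be roughly uniform
```

Because the eigenvalues of a correlation matrix sum to the number of items, a first eigenvalue well above one would suggest a cluster of items sharing variance beyond the main latent variable.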
To analyze reliability, we used the person reliability index provided by the Rasch analysis
(Stone 2004). The separation index compares the distribution of student measures (estimates of
ability) with their measurement errors and indicates the spread of student measures in standard
error units. The index can be used to calculate the number of distinct levels, or strata separated
by at least three errors of measurement, in the distributions (Wright and Stone 1979; Smith
2001). The number of strata is calculated as Strata = (4G + 1)/3, where G is the separation index.
At least three different strata are recommended (for example, low, middle and high ability
levels).
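As a worked illustration of the strata formula, the short sketch below applies it to the person separation indices reported for the physics tests in Appendix B (2.03 and 1.84). The function name is hypothetical.

```python
def strata(separation):
    """Number of statistically distinct ability levels: (4G + 1) / 3."""
    return (4.0 * separation + 1.0) / 3.0

print(round(strata(2.03), 2))  # 3.04 (grade 1 physics test)
print(round(strata(1.84), 2))  # 2.79 (grade 3 physics test)
```

Both values are close to three, the recommended minimum number of strata.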
In order to show the relative distribution of item difficulty and students’ scores in a
common metric, we constructed the variable map (Wright and Stone 1979). For equating
between grades, we used a separate calibration design with anchoring items from one test when
calibrating the other test (Wolfe 2004). After equating, we evaluated the quality of the link
between the grade 1 and grade 3 tests by calculating the item-within-link statistic (Wright and
Bell 1984). Under the null hypothesis of the items exhibiting perfect fit within the link, this
statistic has an expected value of 1. Link adequacy was also evaluated by determining the
stability of the item difficulty estimates across the grade 3 test with and without anchoring. To do
that, we calculated the correlation between the item difficulty estimates.
Appendix B. The results of the psychometric analysis for grades 1 and 3 physics tests
The results for the physics tests are substantively similar to those for the mathematics tests and will be
presented briefly because of space limitations. Five and six items were deleted from the grade 1
and grade 3 physics tests respectively because of poor psychometric quality. For further analysis
we thus considered the sets of 40 and of 39 items for the grade 1 and grade 3 tests respectively.
Of the 40 items for the grade 1 physics test, 17 demonstrated country-related DIF: 9 items in
favour of China and 8 in favour of Russia. The other 23 items were DIF free and could be used
for linking between the two countries. As for the grade 3 physics test, 16 demonstrated country-
related DIF: 8 items in favour of China and 8 in favour of Russia, while the remaining 23 items
were DIF free. As with the mathematics tests, we used DIF-free items in each test for linking
between the two countries. Items that were not considered good for linking were deleted at
earlier stages, either because they did not have good psychometric qualities when they were
analyzed for inclusion in a particular test or because they exhibited DIF for at least one test.
Further analysis showed that all items of the grade 1 and grade 3 physics tests had good
psychometric characteristics and fit the model, and that each test could be considered essentially
unidimensional. The person reliability and the person separation index were 0.83 and 2.03 for the
grade 1 physics test and 0.77 and 1.84 for the grade 3 physics test, indicating three statistically
distinct groups of students along the continuum for each test. The variable maps for the physics
tests are presented in Figures B.1 and B.2.
Figure B.1. The Physics Grade 1 Test Variable Map
Figure B.2. The Physics Grade 3 Test Variable Map
Appendix C. The Content Areas of the Math and Physics Tests
The experts were asked to rate the items according to four criteria: (1) comprehensibility of
wording, (2) appropriateness in measuring the content area of interest, (3) difficulty, and (4)
expected time required to answer. The experts found 13.4% of the physics items and 7.5% of the
math items to have problems in clarity of wording (meaning 25% or more of the experts marked the
item for clarity issues). We reviewed all the items that experts marked as having clarity issues,
rectified the items that had simple and obvious wording problems, and deleted the other
unclear items from our test item bank. The experts rated criteria (2), (3) and (4) on continuous scales
(e.g. 1-4). We gave priority to the items that experts deemed to be most appropriate in measuring
the content of interest. We also selected items spread across a range of difficulties and avoided
items that experts thought would take students too long to complete. Consequently, we selected
fewer than half of the items in the original pool for use in the clinical pilot. Out of the pool of 179 physics
items and 174 math items we initially collected, we selected 80 physics items and 80 math items
to be used in the clinical pilot. In selecting these items, we also took care to keep
balance across weighted content areas.
Table C.1. The Content Areas of the Math Grade 1 Test
Number  Topic                                   Frequency  %
1       Derivatives and their application       7          18.9
2       Equations                               7          18.9
3       Functions and domains                   5          13.5
4       Inequalities                            3          8.1
5       Mathematical reasoning and logic        5          13.5
6       Single Variable Differentiation         4          10.9
7       Trigonometric functions and equations   6          16.2
Total                                           37         100
Table C.2. The Content Areas of the Math Grade 3 Test
Number  Topic                                   Frequency  %
1       Derivatives and their application       3          7.7
2       Equations                               1          2.6
3       Functions and domains                   1          2.6
4       Inequalities                            1          2.6
5       Linear Algebra                          5          12.8
6       Mathematical reasoning and logic        2          5.1
7       Multivariate Differentiation            6          15.4
8       Ordinary differential Equations         1          2.6
9       Probability and statistics              3          7.7
10      Series                                  2          5.1
11      Single Variable Differentiation         7          17.9
12      Single Variable Integration             5          12.8
13      Trigonometric functions and equations   2          5.1
Total                                           39         100
Table C.3. The Content Areas of the Physics Grade 1 Test
Number  Topic                                   Frequency  %
1       Circuits                                5          13%
2       Electromagnetic fields                  6          15%
3       Electromagnetic induction               7          18%
4       Mechanical energy                       1          3%
5       Motion and forces                       3          8%
6       Optics                                  6          15%
7       Oscillation and mechanical waves        6          15%
8       Dynamics – Mechanics                    1          3%
9       Electricity and Electric Fields         2          5%
10      Magnetism and Magnetic Fields           2          5%
11      Waves and Oscillation                   1          3%
Total                                           40         100%
Table C.4. The Content Areas of the Physics Grade 3 Test
Number  Topic                                   Frequency  %
1       Circuits                                2          5%
2       Electromagnetic fields                  2          5%
3       Electromagnetic induction               6          15%
4       Motion and forces                       1          3%
5       Optics                                  7          18%
6       Oscillation and mechanical waves        2          5%
7       Dynamics - Mechanics                    2          5%
8       Electricity and Electric Fields         5          13%
9       Magnetism and Magnetic Fields           6          15%
10      Relativity and Quantum Physics          2          5%
11      Waves and Oscillation                   4          10%
Total                                           39         100%
Appendix D. Selection of Majors
In order to meaningfully compare learning gains across institutions and across countries in this
field, it is necessary to limit the sample of students to those that have overlapping course
requirements. In both China and Russia, the major categories selected for inclusion in this study
(Electrical Engineering, EE, and Computer Science, CS) do not correspond to discrete majors but rather to
“categories” of majors that have varying course requirements. We therefore took the following steps to
further limit our sample to students who are enrolled in similar majors within the categories of
EE and CS, i.e. majors that are similar in terms of course requirements and curriculum. Through
finding overlapping courses between majors and countries, our goal was to limit our sample to
EE and CS majors whose students experience a common set of curricular experiences that are
most relevant to the EE and CS categories.
Unlike China and Russia, US doctoral-research institutions (the institutional equivalent of
our Chinese and Russian samples) typically have only one major called EE and one major called
CS. We collected curricula information from ten doctoral-research institutions (Stanford,
Cornell, University of Washington, Vanderbilt, Virginia Tech, Boston University, Kansas State
University, University of Kentucky, Wayne State University, Marquette University) and
constructed a list of required courses common to all the institutions in EE and CS respectively.
While it is true that this does not fully account for the heterogeneity in the American higher
education system, we believe that looking for overlap with these American majors allows us
to state with much greater confidence that we have developed our assessments to be of relevance
to EE and CS students across international contexts.
In China, we also collected curricular information on all EE and CS majors from 10
Chinese universities (both elite and non-elite) and constructed a list of common required courses
for each major within country. In Russia, we obtained the national curriculum for each EE and
CS major.
Finally, we compared the required course lists for the US, China and Russia, and
dropped the majors that did not require the full list of required courses used in American CS and
EE departments. Through this process, we selected EE and CS majors whose students have
similar curricular experiences within country and across countries.
Since the common curriculum overlaps with that of the US, we can also have reasonable
confidence that the curricula we used for test development bear relevance to the field of EE and
CS in general and are not just reflections of the peculiarities of the higher education systems in
China and Russia.
Appendix E. Chinese Curricular Standards
Although China’s Ministry of Education does not publish official national curriculum standards
for engineering education, it approves a finite list of textbooks that reflect the math and physics
content that should be taught in engineering programs in Chinese universities. We compared the
MOE-approved textbooks against one another and found that the main content areas were
essentially the same across all of these approved textbooks. We then based our content map for
China on the (almost entirely) overlapping content areas between the textbooks, as this
constitutes the de facto national curriculum for engineering students in China. The full list of
textbooks is included immediately below.
Zhong, X. 2013. Eleventh Five-Year National Plan General Higher Education Text Book—General Physics Curriculum: Mechanics (2nd Edition). PutongGaodengJiaoyuShiyiwuGuojiaGuihuaJiaocaiDaxueWuliTongyongJiaocheng: Lixue (2nd Edition). Peking University Press.
Liu, Y. 2013. Eleventh Five-Year National Plan General Higher Education Text Book—General Physics Curriculum: Thermodynamics (2nd Edition). PutongGaodengJiaoyuShiyiwuGuojiaGuihuaJiaocaiDaxueWuliTongyongJiaocheng: Rexue (2nd Edition). Peking University Press.
Chen, B., and J. Wang. 2012. Eleventh Five-Year National Plan General Higher Education Text Book—General Physics Curriculum: Electro-magnetism (2nd Edition). PutongGaodengJiaoyuShiyiwuGuojiaGuihuaJiaocaiDaxueWuliTongyongJiaocheng: Lixue (2nd Edition). Peking University Press.
Chen, X., and X. Zhong. 2011. Eleventh Five-Year National Plan General Higher Education Text Book—General Physics Curriculum: Optics (2nd Edition). PutongGaodengJiaoyuShiyiwuGuojiaGuihuaJiaocaiDaxueWuliTongyongJiaocheng: Guangxue (2nd Edition). Peking University Press.
Chen, X., and X. Zhong. 2011. Eleventh Five-Year National Plan General Higher Education Text Book—General Physics Curriculum: Modern Physics (2nd Edition). PutongGaodengJiaoyuShiyiwuGuojiaGuihuaJiaocaiDaxueWuliTongyongJiaocheng: JindaiWuli Peking University Press.
Tongji University Department of Mathematics. 2007. Twelfth Five-Year National Plan General Higher Education Text Book: Advanced Mathematics (part 1) (6th edition). ShierwuPutongGaodengJiaoyuBenkeGuojiajiGuihuaJiaocai: GaodengShuxue. Higher Education Press.
Tongji University Department of Mathematics. 2007. Twelfth Five-Year National Plan General Higher Education Text Book: Advanced Mathematics (part 2) (6th edition). ShierwuPutongGaodengJiaoyuBenkeGuojiajiGuihuaJiaocai: GaodengShuxue. Higher Education Press.
Tongji University Department of Mathematics. 2007. Twelfth Five-Year National Plan General Higher Education Text Book: Engineering Mathematics and Linear Algebra (5th edition). ShierwuPutongGaodengJiaoyuBenkeGuojiajiGuihuaJiaocai: GongchengShuxueXianxingDaishu. Higher Education Press.
Sheng, J., S. Xie, and C. Pan. 2010. Eleventh Five-Year National Plan General Higher Education Text Book: Probability and Mathematical Statistics (4th edition). PutongGaodengJiaoyuShiyiwuGuojiaGuihuaJiaocai: Gailvlun Yu Shuli Tongji. Higher Education Press.
Appendix F: Test Item Selection
Test items were selected from each country’s past university entrance exams (China’s Gaokao
and Russia’s Unified State Exam), other standardized exams, and widely-used exercise books in
both countries. All of these items were multiple choice items and all were taken from sources
that are used widely in each country and targeted at a national population of students similar to
the students in our sampling frame.
A number of items were taken from Russian materials. In Russia, we took test items from
standardized exams in math and physics that were provided by the Institute for Monitoring the
Quality in Education – the country’s primary quality assessment agency for higher education.
The items for the grade 1 tests were very similar to those used on the Russian Unified State
Exam, which is the mandatory college entrance exam that all students must take if they seek
entry to higher education institutions. The items for the grade 3 tests were based on the Russian
Federal State Standard in math and physics (mandatory part of the curriculum for most Russian
higher education institutions).
Additional test items were taken from Chinese materials. In China, grade 1 test items
were taken from the college entrance examination (gaokao), a nation-wide standardized
examination of high school learning that determines college entry for the vast majority of
students. Test items for grade 3 math and physics came from official Chinese exercise books that
are on the list of approved curricular materials provided by the Ministry of Education for
university use (see Appendix E for details).
References
Crocker, L., and J. Algina. 1986. Introduction to Classical and Modern Test Theory. New York: Holt, Rinehart, and Winston.
Dorans, N.J. 1989. “Two New Approaches to Assessing Differential Item Functioning: Standardization and the Mantel-Haenszel Method.” Applied Measurement in Education 2 (3): 217-233.
Linacre, J.M. 1998. “Detecting multidimensionality: Which residual data-type works best?” Journal of Outcome Measurement 2: 266-283.
Ludlow, L.H. 1985. “A strategy for the graphical representation of Rasch model residuals.” Educational and Psychological Measurement 45 (4): 851-859.
Smith, E.V. 2001. “Evidence for the reliability of measures and validity of measure interpretation: A Rasch measurement perspective.” Journal of Applied Measurement 2: 281-311.
Smith, E.V. 2002. “Understanding Rasch measurement: Detecting and evaluating the impact of multidimensionality using item fit statistics and principal component analysis of residuals.” Journal of Applied Measurement 3 (2): 205-231.
Stone, M.H. 2004. “Substantive scale construction.” In Introduction to Rasch measurement, edited by E.V. Smith and R.M. Smith, 201-225. Maple Grove, MN: JAM Press.
Wolfe, E.W. 2004. “Equating and Item Banking with the Rasch Model.” In Introduction to Rasch measurement, edited by E.V. Smith and R.M. Smith, 366-390. Maple Grove, MN: JAM Press.
Zwick, R., D.T. Thayer, and C. Lewis. 1999. “An Empirical Bayes Approach to Mantel-Haenszel DIF Analysis.” Journal of Educational Measurement 36 (1): 1-28.