

Chapter 12

Scaling outcomes

This chapter reports the outcomes of applying the item response theory (IRT) scaling and

population modelling to generate the plausible values for the PISA 2018 main survey

assessment data.

RESULTS OF THE IRT SCALING AND POPULATION MODELLING

Results of the IRT scaling and population modelling include the proportions of item

parameters that were common (i.e. invariant) across countries and PISA cycles and the

reliability of the assessments for each country. Large proportions of invariant parameters

across countries and cycles ensured the comparability of the proficiency estimates.

Assessing the invariance of item parameters

The item parameters for all items used in the computer-based assessment (CBA) and paper-based assessment (PBA) were obtained through IRT scaling. Typically, items received international model parameters that fitted the data for a large majority of the country-by-language groups. Otherwise, items received unique or group-specific parameters or, if no parameters could be found that fitted the data in a given country-by-language group or groups, the item was dropped for those groups. One reading item (DR563Q12C) was identified as problematic on the basis of classical item analyses, and its IRT parameters did not fit the data for the majority of the country-by-language groups. This item was found to be flawed and was dropped from all groups.

To assess the invariance of item parameters across country-by-language groups and cycles,

for each country-by-language group, items were categorized as: invariant when their

parameters were the same as the international item parameters in 2015 and 2018 (in the case

of trend items), or the same as the international item parameters in 2018 (in the case of new

items); group-specific invariant when their parameters were not the same as the international

parameters in 2015 and 2018, but the group-specific parameters did not change between the

2015 and 2018 cycles (in the case of trend items); dropped if the item was dropped from the

group’s scaling; and noninvariant for all other cases (trend items with different unique parameters in 2015 and

2018, or new items with unique item parameters). For countries with multiple language groups, the

results were averaged for the country, using the population weights to represent the language

group’s proportion in the country’s sample.
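The categorization rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the boolean inputs, and the way language groups are weighted are hypothetical and are not taken from the operational PISA procedures.

```python
from typing import Optional


def classify_item(is_trend: bool,
                  dropped: bool,
                  intl_2018: bool,
                  intl_2015: Optional[bool] = None,
                  group_params_unchanged: Optional[bool] = None) -> str:
    """Classify one item for one country-by-language group.

    intl_2015 / intl_2018: the group used the international parameters
    in that cycle. group_params_unchanged: the group-specific parameters
    were the same in 2015 and 2018 (trend items only).
    """
    if dropped:
        return "dropped"
    if not is_trend:
        # New items: invariant only if the 2018 international parameters apply.
        return "invariant" if intl_2018 else "noninvariant"
    if intl_2015 and intl_2018:
        return "invariant"
    if group_params_unchanged:
        return "group-specific invariant"
    return "noninvariant"


def country_proportions(groups):
    """Weighted category proportions for a country.

    groups: list of (weight, [category, ...]) pairs, one per language
    group; the weights are assumed to reflect each group's share.
    """
    total_weight = sum(weight for weight, _ in groups)
    proportions = {}
    for weight, categories in groups:
        share = (weight / total_weight) / len(categories)
        for category in categories:
            proportions[category] = proportions.get(category, 0.0) + share
    return proportions
```

For example, `country_proportions([(3.0, ["invariant", "invariant"]), (1.0, ["invariant", "noninvariant"])])` weights the first language group three times as heavily as the second.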

Table 12.1 shows the proportion of items categorized as invariant, noninvariant, and dropped,

averaged across countries participating in the 2018 CBA. The proportion of invariant items

(with parameters equal to the international parameters), which is critical for ensuring the

comparability of scores across countries and for the stability of trends, was large for all

domains, ranging from 77.35% for the reading trend items to 95.17% for the reading fluency

items. When taking into account the group-specific invariant items, which also contribute to

the stability of the trends, the total proportion of invariant items was near or above 90% for

all domains. Regarding the dropped category, the proportions were very small for all domains

(0.64% or less).


Table 12.1 Proportion of invariant, noninvariant, and dropped CBA items averaged across countries, for each domain

| | Mathematics | Reading - Trend | Reading - New | Reading fluency | Science | Financial literacy - Trend | Financial literacy - New | Global competence |
|---|---|---|---|---|---|---|---|---|
| Total items | 70 | 72 | 172 | 65 | 115 | 29 | 14 | 69 |
| Invariant | 91.81% | 77.35% | 88.39% | 95.17% | 86.73% | 84.85% | 93.04% | 90.05% |
| Group-specific invariant | 4.50% | 10.35% | | | 5.85% | 7.11% | | |
| Invariant total¹ | 96.32% | 87.70% | 88.39% | 95.17% | 92.59% | 91.97% | 93.04% | 90.05% |
| Noninvariant | 3.55% | 12.30% | 11.40% | 4.19% | 7.25% | 7.90% | 6.96% | 9.65% |
| Dropped | 0.13% | 0.00% | 0.22% | 0.64% | 0.17% | 0.13% | 0.00% | 0.30% |

Note: Viet Nam is not included in the analysis due to adjudication issues.

1. Invariant total is the sum of invariant and group-specific invariant. The percentages of the invariant total, noninvariant, and dropped items add up to 100%.

Table 12.2 shows the proportion of items categorized as invariant, noninvariant, and dropped, averaged across countries participating in the 2018 PBA. The results are similar to those for CBA: the combined proportion of invariant and group-specific invariant items is greater than 90% for mathematics and science and slightly below 90% for reading, and the proportions of dropped items are very small.

Table 12.2 Proportion of invariant, noninvariant, and dropped PBA items averaged across countries, for each domain

| | Mathematics | Reading | Science |
|---|---|---|---|
| Total items | 71 | 87 | 85 |
| Invariant | 89.82% | 77.83% | 80.08% |
| Group-specific invariant | 5.50% | 9.38% | 10.46% |
| Invariant total¹ | 95.32% | 87.21% | 90.54% |
| Noninvariant | 4.51% | 11.98% | 9.10% |
| Dropped | 0.18% | 0.81% | 0.36% |

Note: Viet Nam is not included in the analysis due to adjudication issues.

1. Invariant total is the sum of invariant and group-specific invariant. The percentages of the invariant total, noninvariant, and dropped items add up to 100%.

An overview of the frequencies of invariant, noninvariant, and dropped items for each domain, separated by CBA and PBA, is presented in Figures 12.1 to 12.5. Each country is represented by a vertical bar, in which dark green represents the number of items classified as invariant (when applicable, a vertical bar is used to separate the trend items, closer to the x-axis, from the new items); light green represents the number of invariant group-specific items; yellow indicates noninvariant group-specific items (when applicable, a vertical bar is used to separate the trend items, closer to the x-axis, from the new items); and red indicates items dropped from scaling. The countries are ordered from left to right by increasing number of invariant items. These plots show that while most countries have large numbers of invariant items, a few countries show noticeably lower invariance.

Figure 12.1 Frequency of invariant, noninvariant, and dropped items for mathematics, by

country

Figure 12.2 Frequency of invariant, noninvariant, and dropped items for reading, by country


Figure 12.3 Frequency of invariant, noninvariant, and dropped items for science, by country

Figure 12.4 Frequency of invariant, noninvariant, and dropped items for financial literacy, by

country

Figure 12.5 Frequency of invariant, noninvariant, and dropped items for global competence,

by country


After the IRT scaling was finalised, item parameter estimates were delivered to each

country, with an indication of which items had received international item parameters and

which had received group-specific item parameters. Table 12.3 gives an example of the

information provided to countries: the first column shows the domain; the second column

flags items that had received group-specific parameters or had been excluded from the IRT

scaling; and the remaining columns show the final item parameter estimates (the slope and

difficulty parameters are listed for all items, while the threshold parameters are listed for

the polytomous items). Note that some item parameters that had been estimated before

PISA 2015 using the Rasch or PCM models and still fit the 2018 data retained their slope

value of 1. As indicated earlier, all items in the 2018 main survey were modelled using the

two-parameter logistic model (2PLM) or the generalized partial credit model (GPCM),

with the Rasch model being a special case of these models.

Table 12.3 Example of item parameter estimates provided to countries

| Domain | Flag | Item | Slope | Difficulty | Step 1 | Step 2 |
|---|---|---|---|---|---|---|
| Mathematics | Excluded from scaling | CM998Q04 | | | | |
| Mathematics | Unique item parameters | PM155Q01 | 1.42972 | -0.35538 | | |
| Mathematics | | PM00GQ01 | 1 | 1.62226 | | |
| Mathematics | | PM155Q03 | 1.08678 | 0.73497 | -0.20119 | 0.20119 |
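To make the roles of the slope, difficulty, and step columns concrete, the following sketch evaluates category probabilities under one common parameterisation of the 2PLM and GPCM. Several things here are assumptions rather than statements from this chapter: the scaling constant D = 1.7, the sign convention for the step terms, and the function name `gpcm_probs`. The exact model specification should be taken from the scaling chapter of the technical report.

```python
import math

D = 1.7  # scaling constant; an assumption, not stated in this chapter


def gpcm_probs(theta, slope, difficulty, steps=()):
    """Category probabilities for one item at proficiency theta.

    With no step parameters the item is treated as dichotomous, and the
    result reduces to the two-parameter logistic model.
    Assumed form: the logit of category k is the sum over r = 1..k of
    D * slope * (theta - difficulty + step_r); category 0 has logit 0.
    """
    if not steps:
        steps = (0.0,)  # one step of zero gives two categories
    logits = [0.0]
    for step in steps:
        logits.append(logits[-1] + D * slope * (theta - difficulty + step))
    largest = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(z - largest) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Polytomous item PM155Q03 from Table 12.3, evaluated at theta = 0
probs_polytomous = gpcm_probs(0.0, 1.08678, 0.73497, (-0.20119, 0.20119))

# Dichotomous item PM00GQ01 (slope fixed at 1), evaluated at its difficulty
probs_dichotomous = gpcm_probs(1.62226, 1.0, 1.62226)
```

Under this parameterisation, a dichotomous item evaluated at a proficiency equal to its difficulty has a 0.5 probability of a correct response, whatever its slope.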

Reliability of the PISA scales

Plausible values were generated for all students by setting all the item parameters to the

values obtained from the final IRT scaling and by applying the population modelling

approach described in Chapter 9.


Given the rotated and incomplete assessment design, it was not possible to calculate classical reliability values for each cognitive domain. Nevertheless, test reliability could be estimated using the commonly used formula: 1 – (expected error variance/total variance). The expected error variance is the weighted average of the students' posterior variances (i.e. the variance across the 10 plausible values, which is an expression of the posterior measurement error). The total variance was estimated using a resampling approach (Efron, 1982) and was estimated for each country depending on the country-specific proficiency distributions for each cognitive domain.
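The reliability formula can be illustrated with a short NumPy sketch. It is a simplification under stated assumptions: the function name and inputs are hypothetical, and the total variance is computed here as the weighted variance of the pooled plausible values rather than with the resampling approach used in the chapter.

```python
import numpy as np


def pv_reliability(pvs, weights):
    """Reliability as 1 - (expected error variance / total variance).

    pvs: array of shape (n_students, n_pvs) with plausible values.
    weights: array of shape (n_students,) with student weights.
    """
    pvs = np.asarray(pvs, dtype=float)
    w = np.asarray(weights, dtype=float)

    # Posterior (error) variance per student: variance across that student's PVs
    error_var = pvs.var(axis=1, ddof=1)
    expected_error_var = np.average(error_var, weights=w)

    # Total variance: weighted variance of all PVs pooled together.
    # Assumption: this replaces the resampling-based estimate in the text.
    w_all = np.repeat(w[:, None], pvs.shape[1], axis=1)
    grand_mean = np.average(pvs, weights=w_all)
    total_var = np.average((pvs - grand_mean) ** 2, weights=w_all)

    return 1.0 - expected_error_var / total_var
```

If every student's plausible values were identical, the expected error variance would be zero and the function would return 1; the wider the spread across a student's plausible values relative to the spread between students, the lower the value.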

Table 12.4 presents the distribution of the national reliabilities for the generated scale scores based on all 10 plausible values. The reliabilities for each country are presented in Table 12.5. These tables show that the variance explained by the combined IRT model and population model is at a comparable level across countries. While the values are above 0.80 in all the domains assessed in CBA and PBA, it is important to keep in mind that this is not to be confused with a classical reliability coefficient, as it is based on more than the item responses. Comparisons among individual students are not appropriate because the apparent accuracy of the measures is obtained by statistically adjusting the estimates based on background data. This approach does provide improved behaviour of subgroup estimates, even if the plausible values obtained using this methodology are not suitable for comparisons of individuals (Mislevy & Sheehan, 1987; von Davier et al., 2006).

Table 12.4 Distribution of the national reliabilities of the cognitive domains and reading subscales

| Mode | Domain | Median | S.D. | Min | Max |
|---|---|---|---|---|---|
| CBA | Mathematics | 0.85 | 0.03 | 0.77 | 0.90 |
| CBA | Reading | 0.93 | 0.01 | 0.91 | 0.95 |
| CBA | Science | 0.88 | 0.02 | 0.83 | 0.92 |
| CBA | Financial literacy¹ | 0.89 | 0.01 | 0.86 | 0.93 |
| CBA | Global competence | 0.88 | 0.02 | 0.83 | 0.91 |
| CBA | Reading subscale - Evaluate and reflect | 0.90 | 0.02 | 0.84 | 0.93 |
| CBA | Reading subscale - Locate information | 0.88 | 0.02 | 0.82 | 0.92 |
| CBA | Reading subscale - Understand | 0.92 | 0.01 | 0.89 | 0.94 |
| CBA | Reading subscale - Multiple | 0.91 | 0.01 | 0.88 | 0.94 |
| CBA | Reading subscale - Single | 0.91 | 0.01 | 0.88 | 0.94 |
| PBA | Mathematics | 0.81 | 0.02 | 0.77 | 0.85 |
| PBA | Reading | 0.91 | 0.01 | 0.89 | 0.92 |
| PBA | Science | 0.82 | 0.02 | 0.80 | 0.87 |

Note: Viet Nam was not included in this analysis due to adjudication issues.

1. The financial literacy sample was separate from the main sample.


Table 12.5 National reliability values of the cognitive domains

Mode Country Mathematics Reading Science Financial literacy¹ Global competence

CBA Albania 0.77 0.92 0.84 0.85

CBA Australia 0.85 0.93 0.89 0.88

CBA Austria 0.88 0.94 0.90

CBA Baku (Azerbaijan) 0.81 0.91 0.84

CBA Belarus 0.87 0.93 0.88

CBA Belgium 0.89 0.93 0.90

CBA Bosnia and Herzegovina 0.83 0.93 0.86

CBA Brazil 0.85 0.94 0.89 0.89

CBA Brunei Darussalam 0.90 0.95 0.92 0.91

CBA B-S-J-Z (China) 2 0.84 0.91 0.87

CBA Bulgaria 0.83 0.94 0.88 0.87

CBA Canada 0.81 0.92 0.86 0.91 0.83

CBA Chile 0.85 0.93 0.87 0.89 0.87

CBA Chinese Taipei 0.87 0.93 0.90 0.90

CBA Colombia 0.85 0.93 0.88 0.89

CBA Costa Rica 0.83 0.91 0.86 0.86

CBA Croatia 0.84 0.93 0.87 0.87

CBA Cyprus 3 0.83 0.94 0.87

CBA Czech Republic 0.87 0.94 0.90

CBA Denmark 0.85 0.93 0.89

CBA Dominican Republic 0.83 0.92 0.85

CBA Estonia 0.84 0.92 0.88 0.89

CBA Finland 0.84 0.93 0.88 0.88

CBA France 0.89 0.94 0.90

CBA Georgia 0.83 0.93 0.86 0.88

CBA Germany 0.88 0.94 0.90

CBA Greece 0.82 0.93 0.86 0.88

CBA Hong Kong (China) 0.84 0.92 0.86 0.85

CBA Hungary 0.87 0.94 0.90

CBA Iceland 0.84 0.94 0.89

CBA Indonesia 0.88 0.94 0.89 0.93 0.87

CBA Ireland 0.85 0.93 0.89

CBA Israel 0.85 0.94 0.90

CBA Italy 0.87 0.93 0.89 0.90

CBA Japan 0.85 0.93 0.89

CBA Kazakhstan 0.78 0.94 0.87 0.83

CBA Korea 0.86 0.93 0.88 0.88

CBA Kosovo 0.82 0.91 0.86

CBA Latvia 0.83 0.93 0.86 0.87 0.88


CBA Lithuania 0.85 0.94 0.89 0.89 0.89

CBA Luxembourg 0.87 0.95 0.91

CBA Macao (China) 0.80 0.91 0.86

CBA Malaysia 0.86 0.93 0.89

CBA Malta 0.87 0.95 0.90 0.90

CBA Mexico 0.82 0.92 0.87

CBA Montenegro 0.82 0.93 0.87

CBA Morocco 0.80 0.91 0.85 0.85

CBA Netherlands 0.90 0.94 0.91 0.90

CBA New Zealand 0.85 0.94 0.90

CBA Norway 0.86 0.93 0.89

CBA Panama 0.86 0.93 0.90 0.89

CBA Peru 0.84 0.92 0.88 0.89

CBA Philippines 0.85 0.94 0.88 0.88

CBA Poland 0.84 0.93 0.88 0.87

CBA Portugal 0.88 0.93 0.90 0.89

CBA Qatar 0.85 0.95 0.88

CBA Russian Federation 0.81 0.93 0.86 0.86 0.85

CBA Serbia 0.83 0.94 0.86 0.87 0.87

CBA Singapore 0.84 0.93 0.89 0.88

CBA Slovak Republic 0.86 0.94 0.88 0.88 0.89

CBA Slovenia 0.87 0.93 0.90

CBA Spain 0.80 0.92 0.83 0.88 0.83

CBA Sweden 0.86 0.93 0.89

CBA Switzerland 0.87 0.94 0.90

CBA Thailand 0.86 0.94 0.90 0.89

CBA Turkey 0.86 0.93 0.89

CBA United Arab Emirates 0.85 0.95 0.89

CBA United Kingdom 0.84 0.93 0.88

CBA United States 0.88 0.94 0.90 0.90

CBA Uruguay 0.86 0.93 0.89

PBA Argentina 0.82 0.89 0.82

PBA Jordan 0.77 0.89 0.80

PBA Lebanon 0.81 0.91 0.82

PBA North Macedonia 0.81 0.90 0.82

PBA Republic of Moldova 0.80 0.91 0.84

PBA Romania 0.85 0.92 0.86

PBA Saudi Arabia 0.80 0.90 0.82

PBA Ukraine 0.85 0.92 0.87


Note: Viet Nam was not included in this analysis due to adjudication issues.

1. The financial literacy sample was separate from the main sample.

2. B-S-J-Z (China) data represent the regions of Beijing, Shanghai, Jiangsu, and Zhejiang.

3. Note by Turkey: The information in this document with reference to “Cyprus” relates to the southern part of

the Island. There is no single authority representing both Turkish and Greek Cypriot people on the Island.

Turkey recognises the Turkish Republic of Northern Cyprus (TRNC). Until a lasting and equitable solution is

found within the context of the United Nations, Turkey shall preserve its position concerning the “Cyprus

issue.”

Note by all the European Union Member States of the OECD and the European Union: The Republic of Cyprus

is recognised by all members of the United Nations with the exception of Turkey. The information in this

document relates to the area under the effective control of the Government of the Republic of Cyprus.

Reading MSAT measurement error

As indicated earlier, the main goal of the new reading multistage adaptive testing (MSAT)

design was to improve measurement accuracy over what would have been obtained with a

linear (nonadaptive) design used in past PISA cycles.

The efficiency of an MSAT design depends, in large part, on the resources available as well

as the constraints placed on the assembly of the MSAT components (i.e., the core, stage 1 and

stage 2 testlets). For the 2018 reading MSAT, the testlets were assembled to provide strong

links between the MSAT forms, to meet sets of content, timing, and other test blueprints, as

well as to create a differentiation between the high (difficult) and low (easy) stage 1 and stage

2 testlets. Using the international item parameters from the IRT scaling of the PISA 2018

main survey data, the average standard error of measurement for the difficult (HH), easy

(LL), difficult/easy (HL), and easy/difficult (LH) forms was computed across the full range

of PISA reported proficiencies for the reading MSAT. The results are displayed in

Figure 12.6. The lowest standard error of measurement that can be achieved, if students are

routed according to their true proficiency, is shown by the lowest curve at any point on the

proficiency scale. For example, for a student with a true proficiency of 350, the easy (LL)

form provides the lowest standard error of measurement (approximately 30 points on the

PISA scale), and for a student with a true proficiency of 700, the difficult (HH) form provides

the lowest standard error of measurement (approximately 50 points). However, in reality, the

assignment of the MSAT forms is not ideal because of the inaccuracy of the routing

proficiency estimates and because a proportion of the students are randomly routed by design.

The average standard error obtained for the 2018 main survey, taking into account the

proportion of students that were actually assigned to each form at each proficiency level, is

shown as a dashed line in Figure 12.6. Despite the imperfect routing, it can be observed that

across all points on the proficiency scale, the dashed line is only slightly above the lowest

standard error of measurement that can be achieved across the different MSAT forms.

Figure 12.6 Conditional standard error of measurement for the different forms for the reading

MSAT and the weighted average across the actual assigned forms
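The idea behind the curves in Figure 12.6 can be sketched as follows, with heavy simplifications. The sketch assumes dichotomous 2PLM items only, a scaling constant D = 1.7, two hypothetical 20-item forms, and hypothetical assignment proportions; it works on the IRT scale rather than the PISA reporting scale, and averaging the form-level standard errors directly is itself an assumption about how the weighted average was formed.

```python
import math

D = 1.7  # assumed scaling constant


def sem(theta, items):
    """Conditional standard error of measurement at theta.

    items: list of (slope, difficulty) pairs for a form of 2PLM items.
    The SEM is the inverse square root of the test information.
    """
    information = 0.0
    for slope, difficulty in items:
        p = 1.0 / (1.0 + math.exp(-D * slope * (theta - difficulty)))
        information += (D * slope) ** 2 * p * (1.0 - p)
    return 1.0 / math.sqrt(information)


def average_sem(theta, forms, proportions):
    """Average SEM at theta, weighted by the share of students on each form."""
    total = sum(proportions[name] for name in forms)
    return sum(proportions[name] * sem(theta, forms[name]) for name in forms) / total


# Hypothetical forms: an easy (LL) and a difficult (HH) set of 20 items each
forms = {
    "LL": [(1.0, -1.0)] * 20,
    "HH": [(1.0, 1.0)] * 20,
}
# Hypothetical routing at theta = -1: most, but not all, students get the easy form
proportions = {"LL": 0.8, "HH": 0.2}

sem_easy = sem(-1.0, forms["LL"])
sem_hard = sem(-1.0, forms["HH"])
sem_avg = average_sem(-1.0, forms, proportions)
```

At a low proficiency the easy form gives the smaller standard error, and the weighted average sits between the two forms, closer to the easy form because most students are routed to it.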


The standard error of measurement that can be expected for a traditional nonadaptive PISA

design was also estimated using the same item pool and the same test length (i.e. average

number of items in the forms delivered to students) as the reading MSAT. Figure 12.7 shows

the ratio of the conditional standard error of measurement for the reading MSAT design that

was implemented in the 2018 main survey to the traditional nonadaptive design that could

have been implemented. A ratio of less than 1 indicates that the standard error of

measurement for the MSAT design is lower than that of the traditional nonadaptive design.

As expected, the standard errors of measurement were reduced with the MSAT design by as

much as 10% at the lower and higher proficiency levels.

Figure 12.7 Ratio of the conditional standard error of measurement for the MSAT design to

the standard error of measurement for a traditional nonadaptive PISA design

TRANSFORMING THE PLAUSIBLE VALUES TO PISA SCALES

The plausible values generated from the population modelling on the IRT scale were transformed, using a linear transformation, to be reported on a scale that is linked to the historic PISA scale. This scale can be used to compare the overall performance of countries or subgroups within a country.

Mathematics, reading, and science

For mathematics, reading, and science, the transformation coefficients established for the

PISA 2015 cycle were applicable to the 2018 cycle. Note that in 2015, the transformation

coefficients were computed for each domain separately, based on the 2006, 2009, 2012, and

2015 scaled proficiencies from only the OECD countries. The country means and variances

used to compute the transformation coefficients included only the values from the cycle in

which a given content domain was the major domain. Hence, the transformation coefficients

for science are based on the 2006 reported results, the reading coefficients are based on the

2009 results, and the mathematics coefficients are based on the 2012 results. Computational

details are provided in the PISA 2015 technical report (OECD, PISA 2015 Technical report,

chapter 12).

Financial literacy

For financial literacy, results from the 2012 PISA cycle were used to compute the

transformation coefficients. The method for computing the transformation coefficients for

financial literacy was similar to that used for mathematics, reading, and science. However,

the key distinction was that for financial literacy, all available country data were used to

compute the coefficients, whereas for mathematics, reading, and science, only the data from

OECD countries were used. This decision was made because there were too few OECD

countries that had participated in the financial literacy assessment to provide defensible

transformation coefficients.

Global competence

Global competence was a newly established domain in PISA 2018. Consistent with the new

domains that had been introduced in previous PISA cycles, the transformation coefficients for

global competence were computed so that the plausible values for the OECD countries would

have a mean of 500 and a standard deviation of 100. To take into account the 10 sets of

plausible values, all sets were stacked together and the weighted (using the senate weights)

mean and variance were computed. Stated differently, the full set of transformed plausible

values for global competence had a weighted mean of 500 and a weighted standard deviation

of 100 for the OECD countries.

Specifically, the equations used to compute the transformation coefficients for global competence are presented below. $X_{kv}$ is the $v$th plausible value ($v = 1, 2, \ldots, 10$) for examinee $k$. The grand mean of the plausible values is $\bar{X}_{kv}$, which is computed by compiling all 10 sets of plausible values into a single vector (with the corresponding senate weights compiled in a separate vector) and finding the weighted mean of these values. The weighted variance of the plausible values is $\tau_{PV}^2$, which is computed using the vector of plausible values described above. The square root of $\tau_{PV}^2$ is the weighted standard deviation, $\tau_{PV}$.

$$\tau_{PV} = \sqrt{\tau_{PV}^2} = \sqrt{\frac{\sum_{v=1}^{10}\sum_{k=1}^{n} W_{kv}\left(X_{kv}-\bar{X}_{kv}\right)^2}{\left[(10n-1)\sum_{k=1}^{n} W_{kv}\right]/n}} \qquad (12.1)$$


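The transformation coefficients for global competence were computed using the following equations:

$$A = \frac{100}{\tau_{PV}} \qquad (12.2)$$

$$B = 500 - A\left[\bar{X}_{kv}\right] = 500 - A\left[\frac{\sum_{v=1}^{10}\sum_{k=1}^{n} X_{kv}W_{kv}}{10\sum_{k=1}^{n} W_{kv}}\right] \qquad (12.3)$$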

The plausible values for global competence were transformed to the PISA scale using a similar approach to that used for mathematics, reading, science, and financial literacy. However, one difference is that for global competence, the transformation was based on the plausible values, because global competence had been introduced for the first time in 2018 (whereas for mathematics, reading, science, and financial literacy, the transformations used the model-based results from the concurrent calibration in order to align the results with previously established scales).
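As a rough illustration of equations (12.1) to (12.3), the sketch below computes A and B from an array of plausible values and a vector of senate weights. The function name, argument names, and array layout are hypothetical, and each examinee's weight is assumed to be the same for all 10 plausible values.

```python
import numpy as np


def transformation_coefficients(pvs, weights, target_mean=500.0, target_sd=100.0):
    """Compute the linear transformation coefficients A and B.

    pvs: array of shape (n, n_pv) with plausible values on the IRT scale.
    weights: array of shape (n,) with senate weights, assumed identical
    across the plausible values of one examinee.
    """
    x = np.asarray(pvs, dtype=float)
    w = np.asarray(weights, dtype=float)
    n, n_pv = x.shape
    w_kv = np.repeat(w[:, None], n_pv, axis=1)

    # Weighted grand mean over all stacked plausible values, as in (12.3)
    grand_mean = (x * w_kv).sum() / (n_pv * w.sum())

    # Weighted standard deviation, as in (12.1)
    denominator = (n_pv * n - 1) * w.sum() / n
    tau = np.sqrt((w_kv * (x - grand_mean) ** 2).sum() / denominator)

    a = target_sd / tau                 # (12.2)
    b = target_mean - a * grand_mean    # (12.3)
    return a, b
```

Applying `a * pvs + b` to the same data yields stacked plausible values whose weighted mean is 500 and whose weighted standard deviation, computed with the same formula, is 100.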

Transformation coefficients for all domains

The transformation coefficients for all content domains are presented in Table 12.6. The A

coefficient adjusts the variability (standard deviation) of the resulting scale, while the B

coefficient adjusts the scale location (mean).

Table 12.6 Transformation coefficients for PISA 2018

Domain A B

Mathematics 135.9030 514.1848

Reading 131.5806 437.9583

Science 168.3189 494.5360

Financial literacy 140.0807 490.7259

Global competence 166.1760 530.9083
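A minimal sketch of applying these coefficients follows, assuming the linear form score = A × θ + B implied by the description of A as the scale factor and B as the location shift. The dictionary and function names are hypothetical.

```python
# Coefficients (A, B) copied from Table 12.6
COEFFICIENTS = {
    "Mathematics": (135.9030, 514.1848),
    "Reading": (131.5806, 437.9583),
    "Science": (168.3189, 494.5360),
    "Financial literacy": (140.0807, 490.7259),
    "Global competence": (166.1760, 530.9083),
}


def to_pisa_scale(domain, theta):
    """Transform a plausible value on the IRT scale to the PISA scale."""
    a, b = COEFFICIENTS[domain]
    return a * theta + b
```

Under this form, a plausible value of 0 on the IRT scale maps to B, and each unit on the IRT scale corresponds to A points on the PISA scale.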

Table 12.7 shows the average transformed plausible values as well as the resampling-based

standard errors for each country and domain.


Table 12.7 Average plausible values (PVs) and resampling-based standard errors (SE) by country and domain

Country; Mathematics (Average PV, SE); Reading (Average PV, SE); Science (Average PV, SE); Financial literacy¹ (Average PV, SE); Global competence (Average PV, SE)

International average 458.67 0.29 453.40 0.29 457.92 0.28 481.42 0.59 474.06 0.55

Albania 437.22 2.42 405.43 1.92 416.73 1.99 426.93 2.47

Argentina 379.45 2.77 401.50 2.98 404.07 2.87

Australia 491.36 1.94 502.63 1.63 502.96 1.80 510.88 2.07

Austria 498.94 2.97 484.39 2.70 489.78 2.78

Baku (Azerbaijan) 419.64 2.82 389.39 2.51 397.65 2.36

Belarus 471.87 2.67 473.79 2.44 471.26 2.45

Belgium 508.07 2.26 492.86 2.32 498.77 2.23

Bosnia and Herzegovina 406.38 3.06 402.98 2.93 398.50 2.74

Brazil 383.57 2.03 412.87 2.11 403.62 2.06 420.41 2.34

Brunei Darussalam 430.11 1.16 408.07 0.90 430.98 1.21 428.94 1.29

B-S-J-Z (China) 2 591.39 2.52 555.24 2.75 590.45 2.67

Bulgaria 436.04 3.82 419.84 3.91 424.07 3.63 432.24 4.14

Canada 512.02 2.36 520.09 1.80 518.00 2.15 532.29 3.22 553.77 2.32

Chile 417.41 2.42 452.27 2.64 443.58 2.42 450.88 2.90 466.08 2.94

Chinese Taipei 531.14 2.89 502.60 2.84 515.75 2.87 527.27 2.90

Colombia 390.93 2.99 412.30 3.25 413.32 3.05 457.43 3.26

Costa Rica 402.33 3.29 426.50 3.42 415.62 3.27 455.60 3.71

Croatia 464.20 2.55 478.99 2.67 472.36 2.79 506.34 2.77

Cyprus 3 450.68 1.41 424.36 1.37 439.01 1.39

Czech Republic 499.47 2.46 490.22 2.55 496.79 2.55

Denmark 509.40 1.74 501.13 1.80 492.64 1.94

Dominican Republic 325.10 2.62 341.63 2.86 335.63 2.50

Estonia 523.41 1.74 523.02 1.84 530.11 1.88 547.49 2.05

Finland 507.30 1.97 520.08 2.31 521.88 2.51 536.86 2.42

France 495.41 2.32 492.61 2.32 492.98 2.22

Georgia 397.59 2.60 379.75 2.16 382.66 2.31 402.94 2.56

Germany 500.04 2.65 498.28 3.03 502.99 2.91

Greece 451.37 3.09 457.41 3.62 451.63 3.14 487.92 3.59

Hong Kong 551.15 3.00 524.28 2.73 516.69 2.54 542.14 2.81

Hungary 481.08 2.32 475.99 2.25 480.91 2.33

Iceland 495.19 1.95 473.97 1.74 475.02 1.80

Indonesia 378.67 3.12 370.97 2.56 396.07 2.39 388.38 3.21 407.96 2.39

Ireland 499.63 2.20 518.08 2.24 496.11 2.21

Israel 463.03 3.50 470.42 3.67 462.20 3.62 496.37 3.85

Italy 486.59 2.78 476.28 2.44 468.01 2.43 476.49 2.49

Japan 526.97 2.47 503.86 2.67 529.14 2.59

Jordan 399.76 3.31 419.06 2.94 429.25 2.93

Kazakhstan 423.15 1.91 386.91 1.46 397.10 1.66 407.65 1.62

Korea 525.93 3.12 514.05 2.94 519.01 2.80 508.66 2.96


Kosovo 365.88 1.49 353.07 1.14 364.88 1.17

Latvia 496.13 1.96 478.70 1.62 487.25 1.76 501.31 1.77 496.54 1.99

Lebanon 393.45 4.05 353.36 4.32 383.72 3.54

Lithuania 481.19 1.95 475.87 1.52 482.07 1.63 498.29 1.81 489.43 1.89

Luxembourg 483.42 1.10 469.99 1.13 476.77 1.22

Macao 557.67 1.53 525.12 1.23 543.59 1.46

Malaysia 440.21 2.88 414.98 2.87 437.62 2.71

Malta 471.72 1.90 448.23 1.73 456.59 1.87 478.78 2.15

Mexico 408.80 2.49 420.47 2.75 419.20 2.58

Moldova 420.60 2.43 423.99 2.44 428.49 2.26

Montenegro 429.61 1.24 421.06 1.05 415.17 1.31

Morocco 367.73 3.33 359.39 3.13 376.60 3.00 402.30 3.45

Netherlands 519.23 2.63 484.78 2.65 503.38 2.84 557.90 2.64

New Zealand 494.49 1.71 505.73 2.04 508.49 2.10

North Macedonia 394.45 1.56 392.67 1.10 413.04 1.42

Norway 500.96 2.22 499.45 2.17 490.41 2.28

Panama 352.85 2.72 376.97 2.95 364.62 2.89 412.52 2.94

Peru 399.84 2.61 400.51 2.96 404.22 2.67 410.63 3.15

Philippines 352.57 3.47 339.69 3.29 356.93 3.18 371.14 3.41

Poland 515.65 2.60 511.86 2.70 511.04 2.61 519.61 2.55

Portugal 492.49 2.68 491.80 2.43 491.68 2.77 505.36 2.42

Qatar 414.23 1.23 407.09 0.77 419.13 0.92

Romania 429.92 4.90 427.70 5.14 425.76 4.60

Russian Federation 487.79 2.96 478.50 3.08 477.72 2.87 495.14 2.94 479.95 2.84

Saudi Arabia 373.24 2.99 399.15 2.96 386.25 2.84

Serbia 448.28 3.16 439.47 3.27 439.87 3.05 443.62 2.87 463.45 3.20

Singapore 569.01 1.60 549.46 1.59 550.94 1.48 576.48 1.81

Slovak Republic 486.16 2.56 457.98 2.23 464.05 2.28 481.26 2.29 486.42 2.30

Slovenia 508.90 1.36 495.35 1.23 507.01 1.25

Spain 481.39 1.46 476.54 1.58 483.25 1.55 492.25 2.21 512.12 1.61

Sweden 502.39 2.65 505.79 3.02 499.44 3.07

Switzerland 515.31 2.91 483.93 3.12 495.28 3.00

Thailand 418.56 3.45 392.89 3.23 425.81 3.18 423.42 3.01

Turkey 453.51 2.26 465.63 2.17 468.30 2.01

Ukraine 453.12 3.65 465.95 3.50 468.99 3.30

United Arab Emirates 434.95 2.14 431.78 2.30 433.64 2.01

United Kingdom 501.77 2.56 503.93 2.58 504.67 2.56 534.06 4.86

United States 478.24 3.24 505.35 3.57 502.38 3.32 505.68 3.35

Uruguay 417.66 2.63 427.12 2.76 425.81 2.47


Note: Viet Nam was not included in this analysis due to adjudication issues.

1. The financial literacy sample was separate from the main sample.

2. B-S-J-Z (China) data represent the regions of Beijing, Shanghai, Jiangsu, and Zhejiang.

3. Note by Turkey: The information in this document with reference to “Cyprus” relates to the southern part of

the Island. There is no single authority representing both Turkish and Greek Cypriot people on the Island.

Turkey recognises the Turkish Republic of Northern Cyprus (TRNC). Until a lasting and equitable solution is

found within the context of the United Nations, Turkey shall preserve its position concerning the “Cyprus

issue.”

Note by all the European Union Member States of the OECD and the European Union: The Republic of Cyprus

is recognised by all members of the United Nations with the exception of Turkey. The information in this

document relates to the area under the effective control of the Government of the Republic of Cyprus.

LINKING ERROR

The estimation of the linking error between two PISA cycles was accomplished by

considering the differences between the reported country means from the previous PISA

cycles and new estimates of these country means based on the new PISA cycle item

parameters (see Chapter 9 for more information on the estimation process). To estimate the

linking error for trend comparisons between PISA 2018 and a previous PISA cycle, the

subset of countries that had participated in both cycles being compared was used. In the case

of financial literacy, since the number of participating countries was relatively small, all

countries were used.

The 2018 linking errors are reported in Table 12.8. These values can be used to evaluate whether a change in a country's or subgroup's performance between PISA 2018 and a previous PISA cycle is statistically significant.

For each domain, the change in a country or subgroup’s performance between PISA 2018 and

a previous PISA cycle can be calculated using the following formula:

$$\Delta_{2018-t} = \mathrm{PISA}_{2018} - \mathrm{PISA}_{t} \qquad (12.4)$$

where $\Delta_{2018-t}$ is the difference in performance between PISA 2018 and a previous PISA cycle, where $t$ can take any of the following values: 2000, 2003, 2006, 2009, 2012, or 2015. $\mathrm{PISA}_{2018}$ is the observed score in 2018, and $\mathrm{PISA}_{t}$ is the observed score in a previous cycle.

The standard error of the change in performance, $\sigma(\Delta_{2018-t})$, can be calculated as:

$$\sigma(\Delta_{2018-t}) = \sqrt{\sigma_{2018}^{2} + \sigma_{t}^{2} + \mathrm{error}_{2018,t}^{2}} \qquad (12.5)$$

where $\sigma_{2018}$ is the standard error observed in PISA 2018, $\sigma_{t}$ is the standard error observed in a previous PISA cycle $t$, and $\mathrm{error}_{2018,t}$ is the linking error for the comparisons of the scores between PISA 2018 and a previous PISA cycle $t$. The values for $\mathrm{error}_{2018,t}$ are presented in Table 12.8.
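As an illustration of equations 12.4 and 12.5, the following minimal Python sketch computes a change in performance and its standard error. The reading linking error for the 2015 to 2018 comparison (3.93) is taken from Table 12.8; the country means and standard errors are invented for the example, and the 1.96 cut-off for a two-sided test at the 5% level is an assumed convention rather than something stated in this chapter.

```python
import math

# Linking error for reading, PISA 2015 to 2018 (Table 12.8).
LINKING_ERROR_READING_2015 = 3.93


def change_with_se(mean_2018, se_2018, mean_t, se_t, linking_error):
    """Return (delta, se_delta) following equations 12.4 and 12.5."""
    delta = mean_2018 - mean_t
    se_delta = math.sqrt(se_2018 ** 2 + se_t ** 2 + linking_error ** 2)
    return delta, se_delta


# Hypothetical country: these means and standard errors are invented.
delta, se_delta = change_with_se(
    mean_2018=500.0, se_2018=2.5,
    mean_t=495.0, se_t=2.4,
    linking_error=LINKING_ERROR_READING_2015,
)

# Assumed convention: two-sided test at the 5% level.
z = delta / se_delta
significant = abs(z) > 1.96

print(f"change = {delta:.2f}, SE = {se_delta:.2f}, significant = {significant}")
```

In this hypothetical case the linking error dominates the standard error of the change, so a 5-point difference would not be flagged as significant.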

Note that for each domain, the earliest cycle for which comparisons can be made between

PISA 2018 and a previous PISA cycle is the cycle in which the domain first became a major

domain. Thus, the comparison of mathematics scores between PISA 2018 and PISA 2000 is

not possible, nor is the comparison of science scores between PISA 2018 and PISA 2000 or

between PISA 2018 and PISA 2003.


Table 12.8 Linking error for score comparisons between PISA 2018 and previous PISA cycles

Comparison          Mathematics   Reading   Science   Financial literacy
PISA 2000 to 2018   -             4.04      -         -
PISA 2003 to 2018   2.80          7.77      -         -
PISA 2006 to 2018   3.18          5.24      3.47      -
PISA 2009 to 2018   3.54          3.52      3.59      -
PISA 2012 to 2018   3.34          3.74      4.01      5.55
PISA 2015 to 2018   2.33          3.93      1.51      9.37

INTERNATIONAL CHARACTERISTICS OF THE ITEM POOL

This section provides an overview of the test targeting, the domain inter-correlations, and the

correlations among the reading scale and subscales.

Test targeting

Similar to assigning a specific score on a scale to students according to their performance on

an assessment (OECD, 2002), each item in PISA 2018 was assigned a specific value on a

scale – the response probability (RP) – according to the item’s discrimination and difficulty

parameters that were estimated in the calibration stage. Chapter 15 describes how items can

be placed along a scale based on their RP values and how these values can be used to classify

items into proficiency levels. The different item levels provide information about the

underlying characteristics of an item as it relates to the domain (such as item difficulty), with

higher difficulty indicating a higher level.

In PISA, RP62 values were used to classify items into levels. Respondents with a proficiency

located below this point have less than a 62 percent probability of getting the item correct,

while respondents with a proficiency above this point have more than a 62 percent probability

of getting the item correct. The RP62 values for all items are presented in Annex A, together

with the final item parameters obtained from the IRT scaling.
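To make the RP62 definition concrete, the sketch below solves for the proficiency at which the probability of a correct response equals 0.62. It assumes a two-parameter logistic (2PL) item with a scaling constant D = 1.7 and works on the latent (logit) scale; the item parameters are hypothetical, polytomous items are not covered, and the transformation to the PISA reporting scale is not shown because its constants are not given in this section.

```python
import math

D = 1.7  # scaling constant; an assumption of this sketch


def p_correct(theta, a, b):
    """2PL probability of a correct response at proficiency theta."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def rp_theta(a, b, rp=0.62):
    """Proficiency at which the probability of a correct response equals rp."""
    return b + math.log(rp / (1.0 - rp)) / (D * a)


# Hypothetical item: discrimination a and difficulty b are invented.
a_item, b_item = 1.2, 0.5
theta_62 = rp_theta(a_item, b_item)

# Below theta_62 the probability is under 0.62, above it is over 0.62.
print(p_correct(theta_62 - 0.1, a_item, b_item) < 0.62)
print(p_correct(theta_62 + 0.1, a_item, b_item) > 0.62)
```

Because 0.62 is above 0.5, the RP62 point always lies above the item's difficulty parameter under this model, by an amount that shrinks as discrimination increases.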

Similar to the process above, respondents were also classified into proficiency levels using

PISA scale scores transformed from the plausible values. The purpose of classifying

respondents into levels was to provide more descriptive information about group

proficiencies.

For each cognitive domain, the levels were defined by certain score boundaries which were

determined based on the previous PISA cycles. Tables 12.9 to 12.13 show the score

boundaries used for each cognitive domain, along with the percentage of items and

respondents classified at each level of proficiency.
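The classification of a score into a proficiency level can be sketched as a lookup against the score boundaries. The example below uses the mathematics boundaries from Table 12.9 and follows the table's "higher than the lower bound and less than or equal to the upper bound" rule. The function name and the test scores are hypothetical, and treating a score exactly equal to 357.77 as "Below 1" is an assumption, since the table does not state which level that exact value falls into.

```python
from bisect import bisect_left

# Mathematics score boundaries from Table 12.9, in ascending order.
MATH_CUTS = [357.77, 420.07, 482.38, 544.68, 606.99, 669.30]
MATH_LEVELS = ["Below 1", "1", "2", "3", "4", "5", "6"]


def math_level(score):
    """Return the mathematics proficiency level for a PISA scale score.

    bisect_left counts the boundaries strictly below the score, so a score
    equal to a boundary stays in the lower level ("less than or equal to").
    """
    return MATH_LEVELS[bisect_left(MATH_CUTS, score)]


# Hypothetical scores, chosen to exercise the boundary rule.
for s in (350.0, 420.07, 420.08, 500.0, 700.0):
    print(s, math_level(s))
```

The same function applies to the other domains after swapping in the boundaries and level labels from Tables 12.10 to 12.13.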


Table 12.9 Score boundaries for each level of proficiency for mathematics and the classification of items and respondents

Level     Score points on the PISA scale                        Number of items   Percentage of items   Percentage of respondents
6         Higher than 669.30                                    18                10.91                 18.04
5         Higher than 606.99 and less than or equal to 669.30   22                13.33                 18.82
4         Higher than 544.68 and less than or equal to 606.99   40                24.24                 21.47
3         Higher than 482.38 and less than or equal to 544.68   32                19.39                 19.78
2         Higher than 420.07 and less than or equal to 482.38   37                22.42                 13.52
1         Higher than 357.77 and less than or equal to 420.07   8                 4.85                  6.33
Below 1   Less than 357.77                                      8                 4.85                  2.04

Table 12.10 Score boundaries for each level of proficiency for reading and the classification of items and respondents

Level      Score points on the PISA scale                        Number of items   Percentage of items   Percentage of respondents
6          Higher than 698.32                                    11                3.08                  0.32
5          Higher than 625.61 and less than or equal to 698.32   22                6.16                  3.22
4          Higher than 552.89 and less than or equal to 625.61   50                14.01                 11.33
3          Higher than 480.18 and less than or equal to 552.89   79                22.13                 20.13
2          Higher than 407.47 and less than or equal to 480.18   106               29.69                 24.23
1a         Higher than 334.75 and less than or equal to 407.47   72                20.17                 21.44
1b         Higher than 262.04 and less than or equal to 334.75   14                3.92                  13.47
1c         Higher than 189.33 and less than or equal to 262.04   3                 0.84                  4.96
Below 1c   Less than 189.33                                      0                 0.00                  0.91

Table 12.11 Score boundaries for each level of proficiency for science and the classification of items and respondents

Level      Score points on the PISA scale                        Number of items   Percentage of items   Percentage of respondents
6          Higher than 707.93                                    6                 3.00                  1.96
5          Higher than 633.33 and less than or equal to 707.93   19                9.50                  10.16
4          Higher than 558.73 and less than or equal to 633.33   53                26.50                 21.99
3          Higher than 484.14 and less than or equal to 558.73   65                32.50                 25.96
2          Higher than 409.54 and less than or equal to 484.14   45                22.50                 22.15
1a         Higher than 334.94 and less than or equal to 409.54   10                5.00                  13.00
1b         Higher than 260.54 and less than or equal to 334.94   2                 1.00                  4.17
Below 1b   Less than 260.54                                      0                 0.00                  0.61

Table 12.12 Score boundaries for each level of proficiency for financial literacy and the classification of items and respondents

Level     Score points on the PISA scale                        Number of items   Percentage of items   Percentage of respondents
5         Higher than 626.00                                    9                 20.93                 7.69
4         Higher than 551.00 and less than or equal to 626.00   7                 16.28                 15.48
3         Higher than 476.00 and less than or equal to 551.00   16                37.21                 23.83
2         Higher than 401.00 and less than or equal to 476.00   6                 13.95                 26.45
1         Higher than 326.00 and less than or equal to 401.00   4                 9.30                  18.17
Below 1   Less than 326.00                                      1                 2.33                  8.39

Table 12.13 Score boundaries for each level of proficiency for global competence and the classification of items and respondents

Level     Score points on the PISA scale                        Number of items   Percentage of items   Percentage of respondents
5         Higher than 660.00                                    12                17.39                 26.14
4         Higher than 595.00 and less than or equal to 660.00   19                27.54                 22.51
3         Higher than 530.00 and less than or equal to 595.00   16                23.19                 21.29
2         Higher than 465.00 and less than or equal to 530.00   17                24.64                 16.34
1         Higher than 400.00 and less than or equal to 465.00   5                 7.25                  9.37
Below 1   Less than 400.00                                      0                 0.00                  4.35


Since RP62 values and the transformed plausible values are on the same PISA scale, the

distribution of respondents’ latent ability and the items’ RP62 values can be placed on the

same scale. In Figures 12.8 to 12.12, the left side of the figures illustrates the distribution of

the first plausible values (PV1) across countries. In each figure, the blue line indicates the

empirical density of the first plausible values across all countries, and the red line indicates

the theoretical normal distribution with the mean of the distribution equal to the mean of the

plausible values and the variance of the distribution equal to the variance of plausible values

across all countries in each domain. The figures show that the distribution of the plausible

values for each domain is approximately normal. On the right side of the figures, each

item’s international RP62 value is plotted on the PISA scale. Note that for polytomous items,

only the lowest category’s RP62 value is plotted.

Figure 12.8 Distribution of the first plausible values and item RP62 values in mathematics

Figure 12.9 Distribution of the first plausible values and item RP62 values in reading


Figure 12.10 Distribution of the first plausible values and item RP62 values in science

Figure 12.11 Distribution of the first plausible values and item RP62 values in financial

literacy


Figure 12.12 Distribution of the first plausible values and item RP62 values in global

competence

Figures 12.13 to 12.15 show the percentage of respondents in each country at each level of

proficiency, sorted in descending order of the average score for the domain.


Figure 12.13 Percentage of respondents in each country at each level of proficiency for

mathematics

Note 1: Viet Nam was not included in this analysis due to adjudication issues.

Note 2: B-S-J-Z (China) data represent the regions of Beijing, Shanghai, Jiangsu, and Zhejiang.

Note 3: Note by Turkey: The information in this document with reference to “Cyprus” relates to the southern

part of the Island. There is no single authority representing both Turkish and Greek Cypriot people on the

Island. Turkey recognises the Turkish Republic of Northern Cyprus (TRNC). Until a lasting and equitable

solution is found within the context of the United Nations, Turkey shall preserve its position concerning the

“Cyprus issue.”

Note by all the European Union Member States of the OECD and the European Union: The Republic of Cyprus

is recognised by all members of the United Nations with the exception of Turkey. The information in this

document relates to the area under the effective control of the Government of the Republic of Cyprus.

[Chart: 2018 PISA main study - Mathematics. Average scores and proficiency-level percentages by country, including the international average. Legend: Below Level 1, Level 1, Level 2, Level 3, Level 4, Level 5, Level 6.]


Figure 12.14 Percentage of respondents in each country at each level of proficiency for

reading

Note 1: Viet Nam was not included in this analysis due to adjudication issues.

Note 2: B-S-J-Z (China) data represent the regions of Beijing, Shanghai, Jiangsu, and Zhejiang.

Note 3: Note by Turkey: The information in this document with reference to “Cyprus” relates to the southern

part of the Island. There is no single authority representing both Turkish and Greek Cypriot people on the

Island. Turkey recognises the Turkish Republic of Northern Cyprus (TRNC). Until a lasting and equitable

solution is found within the context of the United Nations, Turkey shall preserve its position concerning the

“Cyprus issue.”

Note by all the European Union Member States of the OECD and the European Union: The Republic of Cyprus

is recognised by all members of the United Nations with the exception of Turkey. The information in this

document relates to the area under the effective control of the Government of the Republic of Cyprus.

[Chart: 2018 PISA main study - Reading. Average scores and proficiency-level percentages by country, including the international average. Legend: Below Level 1c, Level 1c, Level 1b, Level 1a, Level 2, Level 3, Level 4, Level 5, Level 6.]


Figure 12.15 Percentage of respondents in each country at each level of proficiency for

science

Note 1: Viet Nam was not included in this analysis due to adjudication issues.

Note 2: B-S-J-Z (China) data represent the regions of Beijing, Shanghai, Jiangsu, and Zhejiang.

Note 3: Note by Turkey: The information in this document with reference to “Cyprus” relates to the southern

part of the Island. There is no single authority representing both Turkish and Greek Cypriot people on the

Island. Turkey recognises the Turkish Republic of Northern Cyprus (TRNC). Until a lasting and equitable

solution is found within the context of the United Nations, Turkey shall preserve its position concerning the

“Cyprus issue.”

Note by all the European Union Member States of the OECD and the European Union: The Republic of Cyprus

is recognised by all members of the United Nations with the exception of Turkey. The information in this

document relates to the area under the effective control of the Government of the Republic of Cyprus.

[Chart: 2018 PISA main study - Science. Average scores and proficiency-level percentages by country, including the international average. Legend: Below Level 1b, Level 1b, Level 1a, Level 2, Level 3, Level 4, Level 5, Level 6.]


Domain inter-correlations

Estimated correlations between the domains, based on the 10 plausible values and averaged

across all countries and assessment modes, are presented in Table 12.14 for the main sample

and in Table 12.15 for the financial literacy sample. Overall, the correlations are quite high, as

expected, yet there are some differences among the domains. The estimated correlations for

each country are presented in Table 12.16.
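One common way to estimate a correlation from plausible values is to compute it separately for each set of plausible values and then average the results. The sketch below illustrates that idea for two domains. It is a simplified assumption, not the report's exact procedure: it ignores sampling weights and the averaging across countries and modes, and the data and function names are invented.

```python
import math


def pearson(x, y):
    """Unweighted Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pv_correlation(pvs_a, pvs_b):
    """Average the correlation over matching plausible values.

    pvs_a[k] and pvs_b[k] hold the k-th plausible value of every student
    in domain A and domain B respectively.
    """
    rs = [pearson(a, b) for a, b in zip(pvs_a, pvs_b)]
    return sum(rs) / len(rs)


# Toy data: 2 plausible values for 4 students (PISA 2018 provides 10).
math_pvs = [[400.0, 450.0, 500.0, 550.0], [410.0, 440.0, 510.0, 540.0]]
read_pvs = [[420.0, 430.0, 520.0, 560.0], [400.0, 460.0, 490.0, 570.0]]

r = pv_correlation(math_pvs, read_pvs)
print(round(r, 2))
```

A production analysis would apply the final student weights within each country before averaging.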

Table 12.14 Domain inter-correlations for the main sample

Domain                        Reading       Science       Global competence
Mathematics   Average         0.80          0.80          0.73
              Average (CBA)   0.80          0.81          0.73
              Average (PBA)   0.77          0.78          -
              Range           0.66 ~ 0.89   0.65 ~ 0.88   0.55 ~ 0.83
Reading       Average         -             0.85          0.84
              Average (CBA)   -             0.86          0.84
              Average (PBA)   -             0.81          -
              Range           -             0.78 ~ 0.92   0.75 ~ 0.90
Science       Average         -             -             0.79
              Average (CBA)   -             -             0.79
              Average (PBA)   -             -             -
              Range           -             -             0.68 ~ 0.87

Note: Viet Nam was not included in this analysis due to adjudication issues.

Table 12.15 Domain inter-correlations for the financial literacy sample

Domain                  Reading       Financial literacy
Mathematics   Average   0.81          0.87
              Range     0.78 ~ 0.85   0.84 ~ 0.90
Reading       Average   -             0.83
              Range     -             0.77 ~ 0.86

Note: The financial literacy sample was separate from the main sample.


Table 12.16 Domain inter-correlations by country

Country   Mathematics & Reading   Mathematics & Science   Mathematics & Financial literacy 1   Mathematics & Global competence   Reading & Science   Reading & Financial literacy 1   Reading & Global competence   Science & Global competence

Albania 0.72 0.72 0.67 0.81 0.83 0.73

Argentina 0.77 0.76 0.79

Australia 0.79 0.85 0.87 0.85 0.82

Austria 0.85 0.85 0.89

Baku (Azerbaijan) 0.74 0.76 0.78

Belarus 0.85 0.84 0.88

Belgium 0.84 0.87 0.88

Bosnia and Herzegovina 0.79 0.79 0.82

Brazil 0.81 0.82 0.86 0.86 0.86

Brunei Darussalam 0.89 0.87 0.83 0.92 0.90 0.87

B-S-J-Z (China) 2 0.81 0.82 0.88

Bulgaria 0.79 0.77 0.85 0.87 0.84

Canada 0.75 0.76 0.85 0.69 0.84 0.81 0.84 0.78

Chile 0.78 0.76 0.89 0.73 0.84 0.85 0.85 0.79

Chinese Taipei 0.83 0.88 0.82 0.88 0.88 0.86

Colombia 0.80 0.77 0.74 0.86 0.87 0.81

Costa Rica 0.77 0.78 0.65 0.84 0.78 0.72

Croatia 0.81 0.78 0.76 0.84 0.87 0.81

Cyprus 3 0.76 0.79 0.82

Czech Republic 0.80 0.83 0.85

Denmark 0.80 0.82 0.86

Dominican Republic 0.79 0.75 0.84

Estonia 0.79 0.80 0.85 0.87 0.83

Finland 0.81 0.82 0.87 0.88 0.84

France 0.84 0.84 0.88

Georgia 0.77 0.76 0.85 0.82 0.81

Germany 0.85 0.86 0.89

Greece 0.77 0.76 0.73 0.85 0.86 0.78

Hong Kong 0.80 0.83 0.78 0.84 0.85 0.82

Hungary 0.84 0.85 0.87

Iceland 0.78 0.85 0.87

Indonesia 0.78 0.71 0.84 0.65 0.82 0.84 0.76 0.69

Ireland 0.82 0.83 0.88

Israel 0.82 0.83 0.81 0.89 0.88 0.85

Italy 0.78 0.82 0.84 0.84 0.77

Japan 0.80 0.85 0.88

Jordan 0.72 0.73 0.78

Kazakhstan 0.66 0.65 0.55 0.82 0.75 0.68

Korea 0.78 0.84 0.81 0.84 0.84 0.85

Kosovo 0.79 0.76 0.84


Latvia 0.78 0.77 0.88 0.76 0.84 0.80 0.86 0.80

Lebanon 0.78 0.79 0.79

Lithuania 0.83 0.81 0.87 0.76 0.87 0.86 0.88 0.81

Luxembourg 0.83 0.83 0.89

Macao 0.73 0.73 0.86

Malaysia 0.81 0.81 0.90

Malta 0.83 0.86 0.81 0.88 0.88 0.85

Mexico 0.81 0.78 0.88

Moldova 0.76 0.77 0.82

Montenegro 0.77 0.78 0.85

Morocco 0.75 0.73 0.69 0.84 0.84 0.77

Netherlands 0.84 0.87 0.90 0.87 0.85

New Zealand 0.80 0.82 0.88

North Macedonia 0.77 0.78 0.79

Norway 0.82 0.87 0.84

Panama 0.82 0.78 0.72 0.89 0.86 0.80

Peru 0.82 0.80 0.88 0.86 0.86

Philippines 0.85 0.80 0.76 0.89 0.87 0.81

Poland 0.80 0.81 0.86 0.86 0.80

Portugal 0.82 0.85 0.88 0.86 0.85

Qatar 0.82 0.81 0.85

Romania 0.80 0.81 0.83

Russian Federation 0.78 0.77 0.86 0.71 0.84 0.80 0.82 0.78

Saudi Arabia 0.77 0.77 0.81

Serbia 0.79 0.77 0.87 0.73 0.83 0.82 0.85 0.77

Singapore 0.81 0.80 0.76 0.89 0.88 0.83

Slovak Republic 0.80 0.81 0.87 0.72 0.85 0.83 0.82 0.77

Slovenia 0.78 0.84 0.85

Spain 0.76 0.77 0.84 0.73 0.81 0.80 0.83 0.80

Sweden 0.81 0.86 0.85

Switzerland 0.81 0.82 0.87

Thailand 0.75 0.73 0.68 0.83 0.81 0.76

Turkey 0.81 0.84 0.87

Ukraine 0.81 0.84 0.84

United Arab Emirates 0.79 0.81 0.85

United Kingdom 0.79 0.81 0.64 0.85 0.78 0.68

United States 0.85 0.86 0.90 0.89 0.85

Uruguay 0.80 0.84 0.86


Note: Viet Nam was not included in this analysis due to adjudication issues.

1. The financial literacy sample was separate from the main sample.

2. B-S-J-Z (China) data represent the regions of Beijing, Shanghai, Jiangsu, and Zhejiang.

3. Note by Turkey: The information in this document with reference to “Cyprus” relates to the southern part of

the Island. There is no single authority representing both Turkish and Greek Cypriot people on the Island.

Turkey recognises the Turkish Republic of Northern Cyprus (TRNC). Until a lasting and equitable solution is

found within the context of the United Nations, Turkey shall preserve its position concerning the “Cyprus

issue.”

Note by all the European Union Member States of the OECD and the European Union: The Republic of Cyprus

is recognised by all members of the United Nations with the exception of Turkey. The information in this

document relates to the area under the effective control of the Government of the Republic of Cyprus.

Reading subscales

The reading subscales were divided into two groups. The first group, measuring cognitive

processes, was composed of the following subscales: evaluate and reflect (RCER), locate

information (RCLI), and understand (RCUN). The second group, based on the text structure,

comprised the subscales multiple (RTML) and single (RTSN). Due to the way in which the

proficiency data were generated, correlations among the cognitive processes reading

subscales and the text structure reading subscales cannot be calculated. Therefore, the

correlations between the cognitive domains and the cognitive processes reading subscales are

presented in Table 12.17, while the correlations between the cognitive domains and the text

structure reading subscales are presented in Table 12.18.

Table 12.17 Estimated correlations between the cognitive domains and the cognitive processes reading subscales

                    RCER 1   RCLI 2   RCUN 3
Mathematics         0.76     0.74     0.76
Science             0.81     0.79     0.81
Global competence   0.78     0.76     0.79
RCER 1              -        0.90     0.93
RCLI 2              -        -        0.93

Note: Viet Nam was not included in this analysis due to adjudication issues.

1. RCER: Evaluate and reflect

2. RCLI: Locate information

3. RCUN: Understand

Table 12.18 Estimated correlations between the cognitive domains and the text structure reading subscales

                    RTML 1   RTSN 2
Mathematics         0.76     0.76
Science             0.81     0.81
Global competence   0.79     0.79
RTML 1              -        0.94

Note: Viet Nam was not included in this analysis due to adjudication issues.

1. RTML: Multiple

2. RTSN: Single

REFERENCES

Efron, B. (1982), “The jackknife, the bootstrap, and other resampling plans”, Society for Industrial and Applied Mathematics CBMS-NSF Monographs, Vol. 38.

Mislevy, R. J. and K. M. Sheehan (1987), “Marginal estimation procedures”, in A. E. Beaton (Ed.), Implementing the New Design: The NAEP 1983-84 Technical Report (Report No. 15-TR-20), Educational Testing Service, Princeton, NJ.

OECD (2002), Reading for Change: Performance and Engagement across Countries (Results from PISA 2000), OECD Publishing, Paris, http://dx.doi.org/10.1787/9789264099289-en.

Rousseeuw, P. J. and C. Croux (1993), “Alternatives to the median absolute deviation”, Journal of the American Statistical Association, Vol. 88, pp. 1273-1283.

von Davier, M., S. Sinharay, A. Oranje and A. Beaton (2006), “The statistical procedures used in National Assessment of Educational Progress: Recent developments and future directions”, in C. R. Rao and S. Sinharay (Eds.), Handbook of Statistics, Vol. 26, pp. 1039-1055, Elsevier.