

Applied Economics

ISSN: 0003-6846 (Print) 1466-4283 (Online) Journal homepage: https://www.tandfonline.com/loi/raec20

Using regression tree ensembles to model interaction effects: a graphical approach

Fritz Schiltz, Chiara Masci, Tommaso Agasisti & Daniel Horn

To cite this article: Fritz Schiltz, Chiara Masci, Tommaso Agasisti & Daniel Horn (2018) Using regression tree ensembles to model interaction effects: a graphical approach, Applied Economics, 50:58, 6341-6354, DOI: 10.1080/00036846.2018.1489520

To link to this article: https://doi.org/10.1080/00036846.2018.1489520

© 2018 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group.


Published online: 05 Jul 2018.


Using regression tree ensembles to model interaction effects: a graphical approach

Fritz Schiltz (a), Chiara Masci (b), Tommaso Agasisti (c) and Daniel Horn (d)

(a) Leuven Economics of Education Research, University of Leuven, Leuven, Belgium; (b) Department of Mathematics, Modelling and Scientific Computing, Politecnico di Milano, Italy; (c) Department of Management, Economics and Industrial Engineering, Politecnico di Milano, Italy; (d) Centre for Economic and Regional Studies, Hungarian Academy of Sciences, Hungary

ABSTRACT
Multiplicative interaction terms are widely used in economics to identify heterogeneous effects and to tailor policy recommendations. The execution of these models is often flawed due to specification and interpretation errors. This article introduces regression trees and regression tree ensembles to model and visualize interaction effects. Tree-based methods include interactions by construction and in a nonlinear manner. Visualizing nonlinear interaction effects in a way that can be easily read overcomes common interpretation errors. We apply the proposed approach to two different datasets to illustrate its usefulness.

KEYWORDS
Heterogeneous effects; regression trees; interaction effects; machine learning; education economics

JEL CLASSIFICATION
C50; I21

I. Introduction

The estimation of interaction effects has received considerable attention as they are frequently used by economists to identify heterogeneous treatment effects (Ai and Norton 2003; Karaca-Mandic, Norton, and Dowd 2012). A common approach is to include multiplicative interactions. However, the execution of these models is often flawed due to the lack of conditional hypotheses (specification errors) and interpretation errors (Brambor, Clark, and Golder 2006). A conditional hypothesis such as ‘an increase in X is associated with an increase in Y when Z = 1’ implies the need for an a priori specification of this interaction effect. If the interaction term is not specified, it will not be estimated in a standard regression approach. In many empirical studies, when interaction effects are added, constitutive terms are not included, biasing the estimation and interpretation of coefficients (Brambor, Clark, and Golder 2006). In addition, interaction effects are generally included linearly (Hainmueller, Mummolo, and Xu 2017).

This article introduces regression trees and regression tree ensembles, rooted in the machine learning (ML) literature, and illustrates how these methods can be useful to model interaction effects. ML methods are increasingly being used in different fields. Applications include, among others, predicting worker productivity (Burns and Köster 2016), poverty alleviation (Blumenstock 2016), classifying economics journal content (Angrist et al. 2017), textual analysis in real estate (Nowak and Smith 2017), and even modelling judges’ jail-or-release decisions (Kleinberg et al. 2017). The motivation to use regression trees and ensembles to model interaction effects is threefold. First, tree-based methods include interactions by construction, without requiring the researcher to have any preconception on this matter (Su et al. 2011). As a result, common specification errors can be overcome. Second, discontinuous relationships and nonlinear interaction effects are more naturally accommodated by tree-based methods, as opposed to multiplicative interactions. Third, joint plots visualize interactions between variables, providing applied economists with a useful tool to explore interaction effects in an intuitive way.

We illustrate the approach using two datasets on schools in Hungary and Italy. Our data for Hungary covers the 2008–2010 period and includes variables with respect to all organizational levels in primary education (student, class, school and education provider). Our data for Italy covers the 2013–2016 period and contains extensive information on student characteristics (e.g. socio-economic status (SES), immigration). Both datasets allow a value-added (VA) approach since the same students are followed over time. Student achievement can be seen as the outcome of a production process characterized by many interactions between different stakeholders. Analogous to the production function in social policy applications, Y = f(L,K), this process has been modelled as the education production function (EPF). Interaction effects are particularly interesting in the context of education where many decisions are taken by different actors at different levels of operations (class, school, provider) (Burns and Köster 2016). As a direct consequence, researchers and policymakers that do not acknowledge these interactions will over- or underestimate the impact of education policies. Our findings indicate that classical regression approaches with multiplicative interactions fail to identify interesting nonlinear interactions. Also, visualizing our results considerably improves interpretability compared to commonly misinterpreted interaction effects (Berry, Golder, and Milton 2012).

CONTACT Fritz Schiltz [email protected] University of Leuven, Naamsestraat 69, Leuven 3000, Belgium
http://orcid.org/0000-0001-9920-5353
Supplemental material for this article can be accessed at https://www.tandfonline.com/doi/suppl/10.1080/00036846.2018.1489520.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives License (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited, and is not altered, transformed, or built upon in any way.

We contribute to the literature in at least two ways. First, we introduce regression trees and regression tree ensembles in the newly developing literature that proposes to use ML algorithms to explore heterogeneous effects. Second, we contribute to the broader economic literature by applying the proposed approach to two education datasets, for two different countries. In doing so, we illustrate the wider applicability of the proposed approach. Despite the growing interest in ML methods, applications in education have received only little attention (Vanthienen and De Witte 2017). The remainder of the article is organized as follows. Section II reviews multiplicative interactions and introduces regression trees. A brief overview of the data is presented in Section III, followed by the discussion of our results in Section IV. Section V concludes.

II. Methodology

Multiplicative interactions

When modelling heterogeneous effects, most empirical studies include multiplicative interaction terms:

$$Y = \beta_0 + \beta_1 X + \beta_2 Z + \beta_3 XZ + \epsilon, \tag{1}$$

In the simplest model, X and Y are both continuous variables and Z is a dummy variable. When Z is a continuous variable, interpreting β1 or β2 is often meaningless. A common interpretation error is to report β1 as the unconditional effect of X on Y, whereas the unconditional effect depends on the distribution of Z.¹ In the trivial case where Z takes on 0 or 1, (1) simplifies to:

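$$Y = \begin{cases} \beta_0 + \beta_1 X + \epsilon & \text{when } Z = 0, \\ (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X + \epsilon & \text{when } Z = 1. \end{cases} \tag{1a}$$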

Hence, when Z = 0, the effect of a unit increase in X on Y equals β1. When Z = 1, we obtain the estimated effect of a unit increase in X on Y by adding up β1 and β3. Note that the coefficient β3 indicates the extent to which the slopes between X and Y differ for different values of Z.
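As an illustration of (1a), the following minimal sketch fits one least-squares line per value of Z, which for a dummy Z yields the same point estimates as the fully interacted model (1). The coefficient values and the noise-free data are hypothetical and chosen only to make the arithmetic transparent.

```python
def ols_line(xs, ys):
    """Return (intercept, slope) of a simple least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical coefficients for equation (1); the error term is omitted for clarity.
b0, b1, b2, b3 = 1.0, 2.0, 3.0, 0.5
data = [(x, z, b0 + b1 * x + b2 * z + b3 * x * z)
        for z in (0, 1) for x in range(10)]

# One regression line per group, as in equation (1a).
fit = {}
for z in (0, 1):
    xs = [x for x, zz, _ in data if zz == z]
    ys = [y for _, zz, y in data if zz == z]
    fit[z] = ols_line(xs, ys)

(a0, s0), (a1, s1) = fit[0], fit[1]
print("Z = 0: intercept", a0, "slope", s0)   # estimates beta0 and beta1
print("Z = 1: intercept", a1, "slope", s1)   # estimates beta0 + beta2 and beta1 + beta3
print("implied beta3:", s1 - s0)             # difference in slopes between the groups
```

The difference between the two slopes recovers β3, which is why β1 alone describes the effect of X only for the group with Z = 0.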

The intercept of the estimated regression line equals β0 + β2 when Z = 1. However, in many studies some constitutive terms (here: X and Z) are not specified in the estimated model. For example, Brambor, Clark, and Golder (2006, 77) review articles published in the top three political science journals from 1998 to 2002 and find that in 31% of the articles, constitutive terms were not included. The remaining 69%, which did include those terms, misinterpreted their coefficients in 62% of the cases. When leaving out Z, (1) reduces to:

$$Y = \gamma_0 + \gamma_1 X + \gamma_2 XZ + v \tag{2}$$

Comparing (2) to (1) reveals that not including Z in the specification essentially corresponds to assuming that β2 = 0. This forces the intercepts to coincide for both values of Z. Instead of estimating two intercepts (β0 and β0 + β2), (2) estimates only one (γ0). In other words, omitted variable (Z) bias occurs whenever β2 is not zero. In this case, all estimated coefficients will be biased due to this specification error. Moreover, when the number of covariates significantly increases, introducing all the interaction terms and selecting only the significant ones is not trivial and often leads to a misinterpretation of the results.

¹ In many empirical applications (where Z is not distributed uniformly in the data), averaging coefficients estimated for X for different values of Z does not yield the unconditional effect of X on Y. Also, the insignificance of the coefficient for X in (1) and the insignificance of the interaction effect do not imply that the unconditional effect of X on Y is insignificant. To assess the significance of the unconditional effect of X, obtain $\sigma_{\partial Y / \partial X} = \sqrt{\mathrm{var}(\beta_1) + Z^2\,\mathrm{var}(\beta_3) + 2Z\,\mathrm{cov}(\beta_1, \beta_3)}$. More generally, if the covariance term is negative, as is often the case, then it is entirely possible for β1 + β3Z to be significant for substantively relevant values of Z even if all of the model parameters are insignificant (Brambor, Clark, and Golder 2006, 70). Hence, the straightforward way to infer the unconditional effect of X on Y is to consider coefficients in a specification where interactions are not included.
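The consequence of dropping the constitutive term Z can be checked numerically. The sketch below is an illustration under assumptions, not the authors' code: it uses hypothetical coefficients, noise-free data and a small hand-written least-squares solver, and compares the fit of specification (1) with that of specification (2).

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical true model following equation (1), without an error term.
b0, b1, b2, b3 = 1.0, 2.0, 3.0, 0.5
pts = [(float(x), float(z)) for z in (0, 1) for x in range(10)]
y = [b0 + b1 * x + b2 * z + b3 * x * z for x, z in pts]

full = ols([[1.0, x, z, x * z] for x, z in pts], y)       # equation (1)
restricted = ols([[1.0, x, x * z] for x, z in pts], y)    # equation (2), Z left out

print("full model       (b0, b1, b2, b3):", [round(c, 3) for c in full])
print("restricted model (g0, g1, g2):    ", [round(c, 3) for c in restricted])
```

In this setup the full model recovers the assumed coefficients, whereas the restricted model is forced to use a single intercept for both groups and compensates by shifting the coefficients on X and XZ away from β1 and β3.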

Regression trees and ensembles

This section introduces single regression trees and the ensemble method boosting. We outline how these methods, rooted in the ML literature, can be used to model nonlinear relations and interaction effects. Also, we illustrate how they can help to overcome the common specification and interpretation errors introduced above.

Single regression trees
When modelling a production function in social policy applications, assumptions need to be made on its functional form. As opposed to a commonly imposed linear functional form, regression trees have a very different flavour. When a regression tree is used to model a production function, no functional form is imposed and interactions are allowed between variables. This is done by construction, since building a tree from variables implies interacting them.

Consider the example in Figure 1 to illustrate the idea of regression trees. Imagine we want to regress student test scores (response variable) on previous scores, SES and gender (covariates). The regression tree divides the covariates space into a number of regions, where the predicted value of the response variable within each region is obtained as the mean of all the observations that belong to that region. The regions are identified by the model in order to minimize the residual sum of squares (RSS). In our example, the regression tree that we obtain is displayed in Figure 1. The threshold values that are identified at each split (previous score = 50 in the first split, SES = 0 in the second split) divide the sample into subgroups, minimizing the RSS. The tree continues to divide the covariates space into subregions until a certain criterion is reached (e.g. minimum number of observations within each region or maximum RSS within each region). It can be the case that not all the covariates are used in the identification of regions. The covariates that are not involved are the ones that turn out not to be predictive of the response variable. From the tree in Figure 1, we can conclude that, in this hypothetical example, the only variables that matter are previous test scores and SES. This implies that gender is not able to capture any variability in student test scores; otherwise it would have been included in the regression tree. When estimating student test scores, we read the tree in the following way: if the previous score of the student is less than 50, then the estimated student score is 48; if the previous score of the student is greater than 50, it depends on the student's SES: if the student's SES is higher than 0, the expected score is 80, while if it is less than 0, the expected score is 60. Note how, in this example tree, SES only matters when the previous score was greater than 50, indicating an interaction between previous test scores and SES. This interaction was detected by recursively partitioning the observations into nodes, without the need to specify the interaction ex ante.

Figure 1. Example of a regression tree (panel a), with the partition of the covariates space (panel b). The outcome variable is student test score (continuous [0,100]), and the two predictors selected by the regression tree are previous score (continuous [0,100]) and socio-economic status (continuous [−3,3]). The partition of the covariates space is done considering only the two covariates involved in the splits.
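The hypothetical tree of Figure 1 can be written as a short piecewise-constant prediction rule. This is only a sketch of how such a tree is read; the function name is illustrative, and the treatment of observations lying exactly on a threshold (previous score = 50 or SES = 0) is an assumption, since the text does not specify it.

```python
def predict_score(previous_score, ses):
    """Piecewise-constant prediction of the hypothetical tree in Figure 1."""
    if previous_score < 50:      # first split: previous score = 50
        return 48
    if ses > 0:                  # second split, reached only for higher previous scores: SES = 0
        return 80
    return 60

# The effect of moving SES from -1 to 1 depends on the previous score,
# which is exactly what an interaction between the two covariates means.
for prev in (40, 70):
    effect = predict_score(prev, 1.0) - predict_score(prev, -1.0)
    print(f"previous score {prev}: change in prediction when SES rises = {effect}")
```

Gender never enters the rule, mirroring the statement that covariates not used in any split are not predictive in this example.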

More formally, considering the regression model $Y = f(X) + \epsilon$, generalizing (1), the classic linear functional form is the following:

$$f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j, \tag{3}$$

where X is the (n × p)-matrix of predictors, n is the number of observations and p is the number of predictors. In contrast, regression trees assume a model of the form:

$$f(X) = \sum_{m=1}^{M} c_m \, 1(X \in R_m), \tag{4}$$

where $R_1, \ldots, R_M$ represent a partition of the covariates space and $c_m$ is the mean of all the observations that belong to region $R_m$. Algorithm 1 summarizes the two steps involved for building a regression tree.

Regression trees have several advantages: they do not impose any functional form, they can easily model interactions among the covariates, and they can easily handle categorical covariates and missing data. When modelling an EPF, regression trees are ideally suited to accommodate the hierarchical structure of education systems, characterized by interactions between (and within) different levels (e.g. see Online Figure A1 for Hungary). Despite the advantages in terms of added flexibility, regression trees also have some disadvantages: they generally suffer from high variance and are sensitive to outliers. However, there are methods that, by aggregating information from many trees into ensembles (e.g. boosting), substantially improve the predictive performance (James et al. 2013).
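The core of recursive partitioning is the search for the single binary split that minimizes the RSS, applied repeatedly within the resulting regions. The following sketch illustrates that search on toy data constructed to mimic the hypothetical example of previous scores and SES; the data values, function names and the use of midpoints as candidate thresholds are assumptions made for illustration.

```python
def rss(values):
    """Residual sum of squares around the mean of a region."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(X, y):
    """Return (total RSS, covariate index, threshold) of the RSS-minimizing binary split."""
    best = None
    for j in range(len(X[0])):
        levels = sorted({row[j] for row in X})
        # Candidate thresholds: midpoints between consecutive observed values.
        for lo, hi in zip(levels, levels[1:]):
            t = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] < t]
            right = [yi for row, yi in zip(X, y) if row[j] >= t]
            total = rss(left) + rss(right)
            if best is None or total < best[0]:
                best = (total, j, t)
    return best

# Toy data: columns are (previous score, SES); y is the current test score.
X = [(20, -1.0), (30, 1.0), (40, -1.0), (45, 1.0),
     (55, 1.0), (60, -1.0), (70, 1.0), (80, -1.0)]
y = [48, 48, 48, 48, 80, 60, 80, 60]

first = best_split(X, y)
print("first split:", first)

# Recurse into the region to the right of the first threshold.
_, j, t = first
region = [(row, yi) for row, yi in zip(X, y) if row[j] >= t]
second = best_split([row for row, _ in region], [yi for _, yi in region])
print("second split within that region:", second)
```

On this toy data the first split falls on the previous score and the second, within the high-score region, on SES, so the interaction emerges from the partitioning without being specified in advance.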

Algorithm 1. Single regression trees.
1. Use recursive partitioning to divide the predictor space, i.e. the set of possible values for the components $X_1, X_2, \ldots, X_p$, into M distinct and non-overlapping regions $R_1, R_2, \ldots, R_M$. With $y_{im}$ the ith observation within the mth region and $\hat{y}_{R_m}$ the mean of the observations within the mth region, the regions are chosen in order to minimize the residual sum of squares $RSS = \sum_{m=1}^{M} \sum_{i \in R_m} (r_i)^2$, where the residual is $r_i = y_{im} - \hat{y}_{R_m}$.
2. For every observation that falls into region $R_m$, make a prediction equal to $\hat{y}_{R_m}$.

Regression tree ensemble: boosting
Boosting is a stagewise procedure that aggregates information from many trees (an ensemble) by growing them sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling. Instead, each tree is fit on a modified version of the original data. The idea behind this procedure is that, unlike fitting a single large tree to the data, which amounts to fitting the data hard and potentially overfitting (Varian 2014), the boosting approach instead learns slowly. The algorithm starts by fitting a regression tree on the original data and continuously updates it by fitting regression trees on the residuals of the previous model. Given a current model, a tree is fitted on the residuals of the model, rather than on the outcome variable. This new tree is then added into the fitted function in order to update the residuals. Note that in boosting, the construction of each tree depends strongly on the trees that have already been grown. Algorithm 2 describes this procedure.

Figure 2. Joint partial plot of Location (categorical variable) and School size (continuous variable), estimated by the boosting model applied to NABC data. The outcome variable is the school added value in mathematics.

Instead of imposing functional form assumptions on the production function, the boosting algorithm requires the specification of three parameters. First, the number of trees B. Boosting can result in overfitting if B is too large, although this overfitting tends to occur slowly, if at all. In our application, we use cross-validation to select B.² Second, the shrinkage parameter λ, a small positive number. It is also known as the learning rate, as it controls how much each tree contributes to the model, and is typically equal to 0.01 or 0.001 (we set λ = 0.001). Third, the number of splits in each tree, d, which controls the complexity, or interaction depth, as d splits can involve at most d variables (we set d = 4).
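To make the roles of B and λ concrete, the sketch below implements the residual-fitting loop of boosting in a deliberately simplified setting: a single predictor and trees with one split (d = 1). These simplifications, the toy data and the value λ = 0.1 are assumptions chosen to keep the example short and fast; they are not the settings used in the application (λ = 0.001, d = 4).

```python
def fit_stump(x, r):
    """Fit a one-split tree (d = 1) to the residuals r on a single predictor x."""
    best = None
    levels = sorted(set(x))
    for lo, hi in zip(levels, levels[1:]):
        t = (lo + hi) / 2
        left = [ri for xi, ri in zip(x, r) if xi < t]
        right = [ri for xi, ri in zip(x, r) if xi >= t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((ri - ml) ** 2 for ri in left) + sum((ri - mr) ** 2 for ri in right)
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    _, t, ml, mr = best
    return lambda v: ml if v < t else mr

def boost(x, y, B, lam):
    """Stagewise boosting restricted to one-split trees on one predictor."""
    trees, r = [], list(y)                 # start from f(x) = 0, so residuals equal y
    for _ in range(B):
        tree = fit_stump(x, r)             # fit a tree to the current residuals
        trees.append(tree)                 # add a shrunken version of the tree to the model
        r = [ri - lam * tree(xi) for xi, ri in zip(x, r)]   # update the residuals
    return lambda v: sum(lam * tree(v) for tree in trees)

def mse(f, x, y):
    """Mean squared error of predictor f on the data (x, y)."""
    return sum((yi - f(xi)) ** 2 for xi, yi in zip(x, y)) / len(x)

x = [i / 10 for i in range(30)]
y = [xi ** 2 for xi in x]                  # a smooth nonlinear target
for B in (10, 100, 500):
    model = boost(x, y, B=B, lam=0.1)
    print("B =", B, "training MSE =", round(mse(model, x, y), 4))
```

The training error falls as B grows, which is why B has to be chosen on held-out data (for example by cross-validation) rather than on the training fit itself.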

Despite improvements in robustness by aggregating many trees, a price needs to be paid in terms of interpretability, as it is no longer possible to graphically display the final tree (as in Figure 1). Nonetheless, the output of these methods can still be informative about (i) the percentage of variability explained by the model, (ii) the variable importance (VI) ranking and (iii) partial dependence plots. The percentage of variability explained is indicated by the pseudo R². The VI ranking reveals the ranking of the covariates, based on the ‘importance’ of each covariate in explaining the response (different measures can be used to compute importance; see, e.g., James et al. 2013). Focusing on the influence of each single covariate and on the interaction effects, partial dependence plots (Friedman 2001) are especially interesting as they allow us to infer the (both single and joint) relation of specific variables with the outcome. Given N observations $y_k$, for $k = 1, \ldots, N$, and p predictors, boosting generates predictions of the form:

$$y_k = F(x_{1,k}, \ldots, x_{p,k}), \tag{5}$$

for some mathematical function $F(\ldots)$. The partial dependence plot, or partial plot, of the jth covariate is defined as:

$$\phi_j(x) = \frac{1}{N} \sum_{k=1}^{N} F(x_{1,k}, \ldots, x_{j-1,k}, x, x_{j+1,k}, \ldots, x_{p,k}). \tag{6}$$

$\phi_j(x)$ indicates how the jth covariate is related to y after averaging out the influence of all other variables. In other words, $\phi_j(x)$ is the net effect of the jth covariate. After the identification of the important variables by means of the VI ranking, $\phi_j(x)$ allows us to investigate which range of values of the jth covariate is associated with changes in the response variable and what form this association takes. Moreover, the generalization of this marginal effect to the multivariate case is analytically straightforward. The joint partial dependence plot, or joint plot, of the jth and ith covariates,

$$\phi_{i,j}(x, y) = \frac{1}{N} \sum_{k=1}^{N} F(x_{1,k}, \ldots, x_{i-1,k}, x, x_{i+1,k}, \ldots, x_{j-1,k}, y, x_{j+1,k}, \ldots, x_{p,k}) \tag{7}$$

represents the joint effect of two covariates on y. $\phi_{i,j}(x, y)$ indicates how the jth and ith covariates are related to y after averaging out the influence of all other variables. The advantage is that it is possible to display the joint relation of the two covariates with the outcome, showing how the two variables interact in their joint association with the response variable. As a result, it is still possible to obtain a visual interpretation of the relationship between variables in an intuitive way. We will use these joint plots to visualize and explore interaction effects in Section IV, as an alternative approach to estimating (2).

Algorithm 2. Boosting.
1. Set $\hat{f}(x) = 0$ and residuals $r_i = y_i$ for all i in the training set.
2. For $b = 1, 2, \ldots, B$, repeat:
   (a) Fit a tree $\hat{f}^b$ with d splits (d + 1 terminal nodes) to the training data using $r_i$ as response variable.
   (b) Update $\hat{f}$ by adding a shrunken version of the new tree: $\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)$.
   (c) Update the residuals: $r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)$.
3. Output the boosted model: $\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x)$.

² We use 70% of the data as a training set and set B = 3000. Online Figure C4 plots cross-validation errors against the number of trees used in the boosting model.
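Equations (6) and (7) translate directly into averages of model predictions over the data, with one or two covariates held fixed. The sketch below assumes a generic prediction function F; as a stand-in for a fitted boosting model it reuses the hypothetical tree of Figure 1, and the four data rows are invented for illustration.

```python
def partial_dependence(F, data, j, x):
    """Equation (6): average prediction with covariate j fixed at the value x."""
    total = 0.0
    for row in data:
        modified = list(row)
        modified[j] = x
        total += F(modified)
    return total / len(data)

def joint_partial_dependence(F, data, i, j, x, y):
    """Equation (7): average prediction with covariates i and j fixed at x and y."""
    total = 0.0
    for row in data:
        modified = list(row)
        modified[i], modified[j] = x, y
        total += F(modified)
    return total / len(data)

def F(row):
    """Hypothetical fitted model: the tree of Figure 1 (gender is ignored)."""
    previous_score, ses, gender = row
    if previous_score < 50:
        return 48.0
    return 80.0 if ses > 0 else 60.0

# Invented observations: (previous score, SES, gender).
data = [(30, -1.0, 0), (45, 0.5, 1), (60, -0.5, 0), (75, 1.0, 1)]

# Single partial plot of SES (covariate 1), evaluated at two values.
print([partial_dependence(F, data, 1, s) for s in (-1.0, 1.0)])

# Joint plot of previous score (covariate 0) and SES (covariate 1) on a small grid.
for prev in (40, 70):
    print(prev, [joint_partial_dependence(F, data, 0, 1, prev, s) for s in (-1.0, 1.0)])
```

Evaluated on a grid, the joint function shows that the prediction responds to SES only when the previous score is high, which is the pattern a joint plot is meant to reveal.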

III. Application

A trend towards more accountability in the education sector has led to the emergence of instruments to benchmark schools. The most prominent example is the VA estimate, which is considered a best practice to rank schools and has been adopted in the United Kingdom, the Netherlands and the USA. Although VA estimates are generally preferred over raw test scores, estimating VA is a high-stakes statistical exercise, as VA estimates often determine personnel decisions or school closure, and remains under debate (Koedel, Mihaly, and Rockoff 2015). Another closely related discussion is the identification of variables that are able to explain quality (VA) differences between schools. Here, interaction effects are particularly interesting as many decisions are taken by different actors at different levels of operations (class, school, provider) (Burns and Köster 2016). As a direct consequence, not acknowledging these interactions will lead to a misrepresentation of variables and their relationship with school quality.

In this section, we illustrate the usefulness of the approach proposed above using Hungarian and Italian datasets. In doing so, we explore the determinants of school added value, i.e. components of the EPF and interactions that determine its shape. The boosting algorithm can be presented graphically, which facilitates interpretation. We constructed an indicator of school added value using data envelopment analysis (DEA).³ Note that this nonparametric specification overcomes the assumption that the functional form of achievement is linear and additively separable (Todd and Wolpin 2003).⁴ The indicator of school added value ranges between 0 and 1 and can be interpreted as the relative ability of a school to transform its inputs (prior achievement and socio-economic background) into outputs (current achievement).⁵

Data

Hungary: NABC
Our dataset for Hungary was constructed by integrating information on student, class and school characteristics from the National Assessment of Basic Competencies (NABC). The NABC covers all students in primary schools in Hungary (see Online Appendix A for a brief introduction to the Hungarian education system). It is a standard-based assessment for mathematics and reading that follows the model of the Programme for International Student Assessment (PISA), but is conducted every year in May. Students are tested at grade 6 (age 12) and before graduation from primary school, in grade 8 (age 14). In addition to mathematics and literacy test scores, which are common in education datasets, NABC contains extensive information on school principal characteristics and the socio-economic composition of schools. This study uses data on students from the sixth grade 2008 cohort, graduating from their primary schools in grade 8 in 2010. Our main analysis is performed on school level NABC data. After the removal of statistical units with missing values, the final dataset contains 2122 schools. All variables used in subsequent analyses are presented in Table 1. A further description of the data used here can be found in Kertesi and Kezdi (2011) and OECD (2010). As can be seen in Table 1, the average added value of schools is 0.71 for reading and 0.53 for mathematics. For reading, this can be interpreted as follows: on average, a school could improve its reading achievement by 29% if it were to perform as well as its reference school.

³ DEA generates a frontier based on the data and allows comparison of schools with their reference school, given inputs, without any assumption on the functional form. This reference school is not the average school, but a school that is situated on the ‘efficiency frontier’. The frontier for a given school consists of schools attaching the same weights to the inputs (prior achievement and socio-economic background); see Online Appendix B.

⁴ As we are mainly interested in studying the importance of discretionary inputs (e.g. class and school size, and principal characteristics), we choose to include the socio-economic background and prior achievement in the DEA specification (see Online Appendix B).

⁵ In order to obtain a measure of the added value of schools, input and output variables were averaged for every school. Test scores and the SES indicator are z-standardized indices, with mean 0 and standard deviation 1. The generation of this index follows the logic of the economic, social and cultural status (ESCS) index of the OECD PISA studies.

Using information from the NABC data, we include class, school and provider size. Since our analysis is at school level, ‘class size’ is measured as the average class size in a school. Provider size is measured as the number of schools under supervision of the same provider of education. In the decentralized Hungarian system, the type and size of education providers vary widely, from very small local government providers with only one school to large centralized networks of church schools. Also, we include the number of computers in the dedicated computer class, as a proxy for school resources, and the percentage of teachers receiving additional training.

Complementing the administrative data, we also include several variables from a questionnaire in NABC completed by the school principal. These additional variables can be used to describe the organizational setting of the schools. The school level questionnaire includes variables such as principal experience, age and satisfaction. Also, the perceived ratio of Roma students is indicated by the principal of each school.⁶ Finally, to capture geographical discrepancies in schools’ performance, we include both regional and school location categorical variables (e.g. village or city). ‘School location’ is a categorical variable indicating the geographical area where a school operates. Categories include Budapest, cities (not Budapest), county centres and villages (grouped by size). Summary statistics are presented in Table 1.

Italy: INVALSI
For Italy, we use data from the National Institute for the Evaluation of the Educational System of Education and Training (INVALSI). The INVALSI data closely resembles the NABC structure, although students are observed in grades 5 and 8 (see Online Appendix A for a brief introduction to the Italian education system). In the application at hand, we use data for the 2013 cohort. Hence, the added value measured here reflects the added value of middle schools (‘scuola media’) as students enter school in the fifth grade and graduate from middle schools in the eighth. In contrast to the Hungarian dataset, we do not have access to comprehensive managerial and organizational variables. Nonetheless, the dataset allows inclusion of regional variables and school characteristics. Furthermore, the INVALSI dataset contains extensive information regarding the immigration status of students. All subsequent analyses are performed on school level data. After the removal of statistical units with missing values, the final dataset contains 5751 schools. All variables are presented in Table 2. A more comprehensive description can be found in De Simone (2013, 14) or Bertoni, Brunello, and Rocco (2013, 66–67). From Table 2, it can be seen that the average added value of schools is 0.83 for mathematics and 0.78 for reading. For mathematics, these findings suggest that schools could improve their mathematics test scores by 17% if they were to perform as well as the schools in their reference sets.

Table 1. Descriptive statistics for the continuous and the discrete variables in the NABC dataset, respectively.

Continuous variables (N = 2122)   Mean     SD       Min    Max
School size                       292      187      0      2195
Class size (school average)       21       6        3      37
% Roma students                   16.46    22.43    0      100
Number of computers (a)           17.47    6.617    0      80
Teacher with training (%)         32.17    28.49    0      100
Experience principal (years)      8.238    6.532    0      55
Age principal (years)             56.4     6.507    31     75
Principal satisfaction (%) (b)    71.1     20.87    0      100
Provider size                     7.48     12.06    1      97
School added value, Math          0.53     0.11     0.13   0.89
School added value, Reading       0.71     0.11     0.18   0.99

Discrete variables (N = 2122)

Location   Budapest   City   County   V < 2k   V (2k–5k)   V > 5k
           11%        28%    14%      29%      16%         2%

Region     C H   C T   N GP   N H   S GP   S T   W T
           22%   12%   16%    16%   13%    9%    12%

Provider   SD gov.   Ecc.   Private   Other
           85%       7%     2%        6%

Note for the continuous variables: (a) Number of computers measures the total number of computers as counted in the dedicated computer class. (b) Principal satisfaction answers the question ‘If you were to be assigned to another school, what percentage of the current teaching staff would you take with you to your new place?’. Note for the discrete variables: V: Village; N: Northern; W: Western; S: Southern; H: Hungary; T: Transdanubia; GP: Great Plain; SD gov.: settlement or district government; Ecc.: Ecclesiastical.

⁶ Even though it seems that the individual performance of Roma students does not differ significantly from other (non-Roma) students, once socio-economic background is accounted for (see Kertesi and Kezdi 2011), the inherent discriminatory tendencies in the Hungarian society might cause some families (or even teachers) to refrain from entering schools where large Roma ratios are present. On average, 16% of students are considered to be of Roma origin. Because of an increasingly segregated education system where so-called ‘Roma schools’ are being ghettoized (Kertesi and Kezdi 2011), the median value is much lower at 8%.

Summary statistics are presented in Table 2. We included school size in terms of physical locations – grade size and class size measured as the average class size in a school. In addition, we included the share of immigrants (first and second generation combined), and the gender balance of the school, calculated as the share of girls. To capture well-documented geographical discrepancies in educational outcomes in Italy (e.g. Agasisti, Ieva, and Paganoni 2017), we also consider regional dummies. Finally, we included a dummy indicating whether schools are privately or publicly organized.

Results

In this section, we graphically display the results of the application of boosting to both the Hungarian and Italian data, when school added value in mathematics is the outcome variable and the school level variables shown in the previous section are the predictors.7 In order to allow a comparison of models, we included the output of ordinary least squares (OLS) regressions in Tables 3 and 4. Using the boosting approach, the importance of variables in explaining the added value of schools can be ranked (see Section ‘Regression tree ensemble: boosting’).8

Finally, we present joint plots to visualize interactions between explanatory variables and the added value of schools. Joint plots display a net effect, accounting for all other control variables in the model. In order to allow an intuitive interpretation, boosting requires choosing two variables and displays their joint plot.9 It is important to note that this choice does not affect the model outcome. In fact, the structure of the model (see Figure 1) ensures that interactions within and between levels are included by construction. As a result, no assumptions are needed on the existence and functional form of interaction effects. In this vein, we can interpret the results in the joint plots as a data-driven estimation of possibly nonlinear interactions between components of the EPF. The outcome variable for all joint plots is the added value of schools in mathematics, where the mean is set to 0 in order to ease the interpretation.
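The construction of a joint plot can be sketched as a two-variable partial dependence computation: fix two predictors at each point of a grid, average the model's predictions over the sample for all other predictors, and centre the resulting surface at 0. The sketch below is a minimal illustration in Python and not the authors' R code; `toy_predict` is a hypothetical stand-in for a fitted boosting model, and the data rows and grid values are invented for illustration.

```python
from itertools import product
from statistics import mean


def joint_partial_dependence(predict, rows, j, k, grid_j, grid_k):
    """Average prediction over the sample with features j and k fixed on a grid.

    Returns a dict mapping (value_j, value_k) to the centred average prediction.
    """
    surface = {}
    for vj, vk in product(grid_j, grid_k):
        preds = []
        for row in rows:
            modified = list(row)
            modified[j] = vj  # overwrite the two variables of interest
            modified[k] = vk
            preds.append(predict(modified))
        surface[(vj, vk)] = mean(preds)  # averages out all other variables
    # Centre the surface so that its mean is 0, as in the joint plots.
    centre = mean(list(surface.values()))
    return {key: value - centre for key, value in surface.items()}


# Hypothetical stand-in for a fitted model; the features are
# (school size, share of Roma students, class size).
def toy_predict(x):
    size, roma, class_size = x
    return 0.8 + 0.001 * size * (1 - roma) + 0.002 * class_size


# Invented sample rows, for illustration only.
rows = [(100, 0.1, 20), (300, 0.5, 25), (500, 0.0, 18), (200, 0.8, 22)]
surface = joint_partial_dependence(toy_predict, rows, 0, 1, [100, 500], [0.0, 1.0])
for point, value in sorted(surface.items()):
    print(point, round(value, 3))
```

Because the toy model contains a size × Roma interaction, the size effect in the resulting surface differs across Roma shares; with a purely additive model the surface would be the sum of two one-variable partial plots.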

Hungarian schools

In Figures 2 and 3, we present three examples of variable combinations to illustrate how boosting can be used to explore heterogeneous effects in an intuitive way.10 Corresponding OLS coefficients are presented in Table 3 (A, B, C and D). Regression A presents the baseline model, without interactions. It includes size variables (class, school and provider), the share of Roma students, the number of computers, principal characteristics (age, experience, satisfaction), in addition to dummies for school location, the type of education provider and region. All subsequent models (B, C and D) extend the baseline model by adding an interaction effect. This way, the regression results can be easily compared to the

Table 2. Descriptive statistics for the continuous and the discrete variables in the INVALSI dataset, respectively.

| Continuous variables (N = 5751) | Mean | SD | Min | Max |
|---|---|---|---|---|
| Grade size | 71 | 44 | 1 | 370 |
| Class size (school average) | 20 | 4 | 1 | 32 |
| % Immigrant students | 9.76 | 9.13 | 0 | 87.23 |
| Gender balance (% girls) | 50.50 | 9.05 | 0 | 100 |
| School added value, Math | 0.83 | 0.05 | 0.33 | 0.99 |
| School added value, Reading | 0.78 | 0.04 | 0.40 | 0.97 |

| Discrete variables (N = 2122) | Category | Share |
|---|---|---|
| Region | Centro | 19% |
| | Isole | 12% |
| | Nord est | 18% |
| | Nord ovest | 26% |
| | Sud | 25% |
| Locations per school | 1 | 75% |
| | 2 | 17% |
| | 3 or more | 8% |
| Type of education | Private | 11% |
| | Public | 89% |

Note for the discrete variables: Locations per school denotes the number of physical locations (or ‘plessos’) supervised by one administrative unit.

7 Results for reading are analogous and hence left out for the sake of brevity.

8 The relative importance of variables in the EPF, as modelled by boosting, is displayed in Online Figure C1. Online Figure C2 presents partial plots for the share of Roma students, school size, school location and class size, identified as important determinants of school added value in Hungary. Online Figure C3 presents partial plots for the share of immigrants, grade size, school region and class size, identified as important determinants of school added value in Italy. Partial plots display the partial influence of a variable on the outcome, averaging out the influence of all other variables included in the model. This graphical approach can be compared to plotting the OLS coefficients obtained in Tables 3 and 4, without assuming this coefficient is constant (or varies at a given rate) across values of a specified variable.

9 Selecting three variables would be possible if a 3D plot were used.

10 Additional joint plots and alternative variable combinations, together with the R code, are available upon request.


boosting approach. In terms of model fit, the boosting approach outperforms the linear regression model (pseudo R² of 24.9% compared to an R² of 21.9% for OLS). This model improvement, as measured by R², might be due to the actual relationships not being linear (Varian 2014).
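As a rough illustration of this kind of fit comparison, the sketch below computes R² as one minus the ratio of the residual to the total sum of squares for two sets of predictions. How the pseudo R² reported above was computed is not stated in this section, so this definition is an assumption, and all numbers in the example are invented.

```python
def r_squared(actual, predicted):
    """R-squared as 1 - SSE / SST (assumed definition, see lead-in)."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - sse / sst


# Invented outcome values and predictions, for illustration only.
added_value = [0.80, 0.85, 0.78, 0.90]
linear_pred = [0.82, 0.83, 0.81, 0.87]
boosted_pred = [0.81, 0.84, 0.79, 0.89]

print("linear :", round(r_squared(added_value, linear_pred), 3))
print("boosted:", round(r_squared(added_value, boosted_pred), 3))
```

A fair comparison would evaluate both models on the same observations, ideally on data not used for fitting, since a flexible model can fit the estimation sample more closely by construction.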

In Figure 2, we interact a categorical and a continuous variable, school location and school size, with the aim of showing how our variable of interest (school VA) is jointly affected by the two variables. Although the shape of the relationship looks similar in all school locations, its strength, which can be read off the vertical axis, differs strongly across locations. For example, the variation in added value related to school size differences appears to be much smaller in cities than in county centres and Budapest, where the line measuring the outcome of interest has a steeper slope. This might indicate that school size reform could have a differential impact across school locations, confirming the need to account for heterogeneous effects. OLS results indicate a similar pattern but complicate the interpretation, as is often the case in empirical studies (Hainmueller, Mummolo, and Xu 2017). When schools in cities are set as the reference category, we can see from B that the school size slope is significantly steeper for schools in Budapest and county centres. The coefficient on school size is now close to 0 and no longer significant. This does not imply that the unconditional effect is no longer significant after including interactions in the specification; it simply captures the school size effect in cities (i.e. the reference category, or Z = 0), i.e. averaging effects that instead are heterogeneous across the school size

Table 3. Results of the OLS regression model applied to NABC data.

| Variables | A | B | C | D |
|---|---|---|---|---|
| School size | 0.001*** | 0.000 | 0.001*** | 0.001*** |
| Average class size | 0.011** | 0.0100** | 0.011** | 0.011** |
| Roma students | −0.014*** | −0.014*** | −0.012*** | −0.014*** |
| Number of computers | 0.003 | 0.003 | 0.002 | 0.003 |
| P age | −0.006* | −0.006* | −0.006* | −0.006 |
| P experience | 0.012*** | 0.012*** | 0.001*** | 0.013 |
| P satisfaction | 0.001 | 0.001 | 0.001 | 0.001 |
| Provider size | 0.004 | 0.004 | 0.004 | 0.004 |
| Interactions (Reference = Cities) | | | | |
| Budapest × School size | | 0.001*** | | |
| County centre × School size | | 0.001*** | | |
| Village (<2k) × School size | | 0.000 | | |
| Village (2k–5k) × School size | | −0.000 | | |
| Village (>5k) × School size | | 0.000 | | |
| Roma students × School size | | | −0.000*** | |
| P experience × Age | | | | 0.000 |
| Constant | 0.844*** | 0.778*** | 0.423 | 0.837** |
| Controls | | | | |
| School location | X | X | X | X |
| Education provider | X | X | X | X |
| Region | X | X | X | X |
| Observations | 2122 | 2122 | 2122 | 2122 |
| R² | 0.213 | 0.221 | 0.221 | 0.219 |

Note: P: principal. Provider size indicates the number of schools affiliated to a district, as in Table 1. The outcome variable is the added value of schools in terms of mathematics.

*, **, *** indicate significance at the 10%, 5% and 1% level, respectively. For conciseness, standard errors are not displayed here.

Table 4. Results of the OLS regression model applied to INVALSI data.

| Variables | A | B | C |
|---|---|---|---|
| Grade size | 0.000 | 0.000 | 0.000 |
| Average class size | 0.001*** | 0.001*** | 0.001*** |
| Immigrant students | −0.048*** | 0.047 | −0.047 |
| Gender balance | −0.008 | −0.007 | −0.007 |
| Interactions (Reference = Sud) | | | |
| Centro × Immigrant % | | −0.077* | |
| Isole × Immigrant % | | −0.049 | |
| Nord est × Immigrant % | | −0.098** | |
| Nord ovest × Immigrant % | | −0.128*** | |
| Immigrant % × Gender balance | | | −0.001 |
| Constant | 0.806*** | 0.785*** | 0.789*** |
| Controls | | | |
| Locations per school | X | X | X |
| Type of education | X | X | X |
| Region | X | X | X |
| Observations | 5751 | 5751 | 5751 |
| R² | 0.137 | 0.140 | 0.137 |

Note: The outcome variable is the added value of schools for mathematics.

*, **, *** indicate significance at the 10%, 5% and 1% level, respectively. For conciseness, standard errors are not displayed here.


distribution. In order to obtain an estimate of the unconditional school size effect, we need to add up coefficients for all categories, weighted by their relative frequency. Although the coefficients for each location are not included in Table 3, interpreting them in B would be nonsensical as there are no schools without students. Also, the insignificance of the coefficient for school size in this specification, and the insignificance of some interaction effects, do not imply that the unconditional school size effect is insignificant, but only that ‘average’ effects do not reveal the complex heterogeneous effect of the two joint variables on the outcome of interest.
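The frequency-weighted sum described above can be written out as a short calculation. The sketch below uses the rounded school size coefficients from column B of Table 3; the location shares are hypothetical placeholders (the actual frequencies are reported elsewhere in the article), so the resulting number is purely illustrative.

```python
# Coefficient on school size in Table 3, column B: the slope for the
# reference category (cities), rounded to three decimals in the source.
reference_slope = 0.000

# Interaction coefficients (location x school size) from Table 3, column B.
interaction = {
    "Cities": 0.000,  # reference category, no additional slope
    "Budapest": 0.001,
    "County centre": 0.001,
    "Village (<2k)": 0.000,
    "Village (2k-5k)": -0.000,
    "Village (>5k)": 0.000,
}

# HYPOTHETICAL relative frequencies of school locations (must sum to 1).
shares = {
    "Cities": 0.30,
    "Budapest": 0.15,
    "County centre": 0.15,
    "Village (<2k)": 0.20,
    "Village (2k-5k)": 0.12,
    "Village (>5k)": 0.08,
}


def unconditional_slope(reference_slope, interaction, shares):
    """Frequency-weighted average of the category-specific slopes."""
    return sum(
        shares[loc] * (reference_slope + interaction[loc]) for loc in shares
    )


slope = unconditional_slope(reference_slope, interaction, shares)
print(round(slope, 4))
```

Under these placeholder shares the weighted slope is positive even though the reference-category coefficient is 0, which is the point made in the text: the coefficient on school size in B describes cities only.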

As another example, the joint plots in Figure 3 interact two continuous variables: % Roma students and school size (top), and principal age and experience (bottom). The added value of schools is now indicated using a colour scale, ranging from blue (high) to purple (low). The colours correspond to schools characterized by the interaction of the two variables indicated on the axes. In this perspective, the areas of the graph where the colour is more intense (blue) are those where schools report higher VA, and reading their partitioning reveals the joint levels of the percentage of Roma students and school size that characterize these schools. Clearly, the highest added value can be found

Figure 3. Joint partial plots, estimated by the boosting model applied to NABC data, of School size and Roma students (%) (a) and Principal Age and Principal Experience (b). In both plots, the outcome variable is the school added value in mathematics. The joint plot can be interpreted as follows: the lighter the colour, the more pronounced the added value of the school in mathematics at this specific point. Each point in every graph corresponds to an intersection of two values of the variables denoted on the horizontal and vertical axes.


in relatively large schools with a low share of Roma students. The joint plots allow us to explore possibly nonlinear interaction effects. For example, in schools where the share of Roma students is rather low, a clear positive relationship between school size and added value can be seen in Figure 3, while schools with a majority of Roma students seem to be adding more value in relatively small and large schools. Middle-sized schools with many Roma students appear to perform the worst in terms of added value. Perhaps highly segregated schools benefit from either a tailored approach facilitated by a smaller school size, or from scale economies experienced by larger schools. From Table 3C, we find that the interaction effect Roma students × School size is significantly negative, indicating that the positive school size effect (see Table 3A) is decreasing with the share of Roma students. Although we do not claim to present a causal relationship, it is clear from this comparison that joint plots reveal patterns that cannot be obtained using multiplicative interactions, which are only able to represent average effects, i.e. in the middle of the distribution of the variable of interest.
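The limitation of the multiplicative specification can be made explicit with a short derivation. The notation below is an assumption introduced for illustration (the symbols are not taken from the article): VA denotes school added value, and the remaining controls of Table 3C are collected in the term x'γ.

```latex
\begin{align}
VA_i &= \beta_0 + \beta_1\,\mathrm{Size}_i + \beta_2\,\mathrm{Roma}_i
        + \beta_3\,(\mathrm{Roma}_i \times \mathrm{Size}_i)
        + x_i'\gamma + \varepsilon_i \\
\frac{\partial VA_i}{\partial\,\mathrm{Size}_i} &= \beta_1 + \beta_3\,\mathrm{Roma}_i
\end{align}
```

With a positive β₁ and a negative β₃, as the signs in Table 3C suggest, the school size effect declines linearly in the share of Roma students. For any given Roma share, added value remains linear in school size, so this specification cannot reproduce the pattern in Figure 3(a) in which small and large schools with many Roma students both outperform middle-sized ones.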

The bottom panel of Figure 3 interacts the age and experience of school principals. In contrast to the top panel, and Figure 2, no clear interaction effect can be observed between age and experience. Nonetheless, interacting two continuous variables in joint plots allows us to identify an ‘optimum’: schools led by a relatively young (50 years) and experienced (around 17 years) principal. In this vein, the figure can be used to obtain information about the complexity of the interactions between two continuous variables, whose relationship is not linear along the whole distribution of each of them. The difference between these schools and schools led by relatively old (60 years) and inexperienced principals amounts to 15 percentage points in school added value. In addition, and in contrast to coefficients from Table 3D, the graphical approach reveals a cutoff around 10 years, where the added value of schools ‘jumps’ to higher values. Consistent with this finding, it has been argued before that school principals ‘take time to realize their full effect at schools’ (Coelli and Green 2012, 92). Again, the graphical approach proposed here reveals interesting patterns, which cannot be obtained from Table 3, where the coefficient for the variable Principal experience × Principal age actually measures a statistically non-significant effect on school value-added, which is easily observable in the white cells in the middle of the graph reported in Figure 3(b).
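A cutoff of this kind is something tree-based methods can pick up because each tree splits a predictor at the threshold that most reduces squared error. The following minimal sketch shows a single such split (a regression 'stump') on invented data with a jump near 10 years of experience; it illustrates the splitting principle only and is not the boosting model estimated in the article.

```python
def best_split(x, y):
    """Return (threshold, sse) of the single split minimising squared error."""
    pairs = sorted(zip(x, y))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical predictor values
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((v - left_mean) ** 2 for v in left) + sum(
            (v - right_mean) ** 2 for v in right
        )
        if sse < best[1]:
            # Place the threshold halfway between the neighbouring values.
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, sse)
    return best


# Invented data: principal experience (years) and school added value.
experience = [2, 4, 6, 8, 9, 11, 13, 15, 17, 20]
added_value = [0.78, 0.79, 0.78, 0.80, 0.79, 0.86, 0.87, 0.86, 0.88, 0.87]

threshold, sse = best_split(experience, added_value)
print(threshold)
```

A linear term in experience, with or without a multiplicative interaction, would spread this jump over the whole range of the variable, which is one reason why the cutoff is not visible in the OLS coefficients.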

Italian schools

In order to illustrate the wider applicability of the proposed approach to visualize interaction effects, we repeat the methodological approach of our analysis using Italian data described in ‘Results’ section. Regression A presents the baseline model, without interactions. It includes size variables (grade and class), the share of immigrant students, the gender balance, and dummies for the number of locations per school (1, 2 and 3 or more), the type of education (public or private), and the geographical region. Subsequent models B and C extend the baseline model by adding an interaction effect. This way, the regression results can be easily compared to the boosting approach. In terms of model fit, the boosting approach again outperforms the linear regression model (pseudo R² of 15.9% compared to an R² of 14.0% for OLS). As noted before, this model improvement might be due to the actual association not being linear (Varian 2014). Figure 4 presents two examples of variable combinations and their joint relationship with the added value of Italian schools. As in Figure 2 for Hungary, we interact a categorical and a continuous variable in Figure 4(a): region and immigrant students (%). Both the shape and the position of the relationship between school added value and the share of students from an immigrant background appear to depend on the region where a school is located, as is evident from the different shapes of the plots in the different quadrants. On the one hand, the discrepancies in Figure 4(a) capture regional differences in added value between schools. On the other hand, this figure suggests differential effects of immigrant students on the added value of schools, depending on the region, corroborating the need for a more nuanced view of marginal effects. Looking at Table 4, the OLS results (Table 4B) suggest a similar pattern, i.e. all slopes are more negative compared to the reference region (Sud), although this difference is not significant for Isole. Again, complications arise in interpreting the constitutive terms, because they capture ‘average effects’ and not the whole distribution of the effects of the variable of interest on the outcome, a limitation that is overcome by the intuitive graphical approach.

Figure 4(b) presents the joint plot of the interaction between two continuous variables in influencing the outcome of interest (school VA): the gender balance (% girls) and the share of students from an immigrant background. As before, the colour scale represents the added value of schools. In contrast to Figure 3, no clear interaction effect can be observed here. However, as in Figure 4(a)

Figure 4. Joint partial plots, estimated by the boosting model applied to INVALSI data, of Geographical location (categorical variable) and Immigrant students (%) (continuous variable) (a) and Gender balance (% girls) and Immigrant students (%) (both continuous variables) (b). In both plots, the outcome variable is the school added value in mathematics. The joint plot in panel (b) can be interpreted as follows: the lighter the colour, the more pronounced the added value of the school in mathematics at this particular point.


the graphical approach reveals interesting patterns, which cannot be obtained from Table 4. For example, schools with the highest added value in mathematics are those schools whose students are mainly boys and do not come from an immigrant background. On the contrary, schools consisting of mainly girls with an immigrant background appear to perform the worst. Finally, schools with mostly immigrant boys (top left) are adding value in mathematics to their students, roughly equivalent to schools that consist of mostly girls without an immigrant background (bottom right). Although we do not claim to provide any causal evidence on this matter, this example once again illustrates the exploratory benefits of visualizing how variables interact to determine an outcome. In particular, the graph allows us to visualize those cases where the joint effect of two variables is associated with higher school VA (darker colour), avoiding the risk of capturing a ‘zero effect’ in the middle of the distribution of the two interacted variables where the colour of the relationship is lighter.

IV. Conclusion

This article demonstrates how regression trees and ensembles can be used to model and visualize interaction effects. Multiplicative interactions are commonly used to identify heterogeneous treatment effects and to tailor policy recommendations. The method proposed in this article provides applied economists with an innovative tool to explore interaction effects in a way that overcomes common specification and interpretation errors. Our empirical application illustrates the usefulness of joint plots generated by the boosting algorithm to model interaction effects in education. For example, we find that school size has different effects on the added value of schools, depending on the socio-economic composition and location of schools. This means that the effect of each predictor might depend also on the values of the other predictors, suggesting that, when making considerations about each single aspect of schools, its effect needs to be related to the setting in which it acts. Visualizing results in joint plots allows an intuitive interpretation of interaction effects. Despite the complexity of the model, results can be easily read, while at the same time, flexible interactions provide a more realistic insight into the (education) production function. As illustrated using two datasets, a potential data-driven approach can be derived. Practitioners could explore relationships between variables using the boosting algorithm and visualize them, before formally testing interactions in a regression framework. In many applications, boosting and regression could prove complementary to uncover and test complex relationships.

Acknowledgements

We are grateful to Kristof De Witte, Geraint Johnes, Daniel Santin, and other seminar participants in Milan, Lisbon, Leuven, Sankt Gallen, and Budapest for many useful comments and remarks and to Anna Maria Paganoni for statistical discussions during the analysis.

Disclosure statement

The authors declare that they have no relevant or material financial interests that relate to the research described in this article.

Funding

The project leading to this article has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement no. 691676 (EdEN).

ORCID

Fritz Schiltz http://orcid.org/0000-0001-9920-5353

References

Agasisti, T., F. Ieva, and A. M. Paganoni. 2017. “Heterogeneity, School-Effects and the North/South Achievement Gap in Italian Secondary Education: Evidence from a Three-Level Mixed Model.” Statistical Methods & Applications 26 (1): 157–180.

Ai, C., and E. C. Norton. 2003. “Interaction Terms in Logit and Probit Models.” Economics Letters 80: 123–129.

Angrist, J., P. Azoulay, G. Ellison, R. Hill, and S. F. Lu. 2017. “Economic Research Evolves: Fields and Styles.” American Economic Review 107 (5): 293–297.

Berry, W. D., M. Golder, and D. Milton. 2012. “Improving Tests of Theories Positing Interaction.” The Journal of Politics 74 (3): 653–671.

Bertoni, M., G. Brunello, and L. Rocco. 2013. “When the Cat Is Near, the Mice Won’t Play: The Effect of External Examiners in Italian Schools.” Journal of Public Economics 104: 65–77.

Blumenstock, J. E. 2016. “Fighting Poverty with Data.” Science 353 (6301): 753–754.

Brambor, T., W. R. Clark, and M. Golder. 2006. “Understanding Interaction Models: Improving Empirical Analyses.” Political Analysis 14 (1): 63–82.

Burns, T., and F. Köster. 2016. Governing Education in a Complex World. OECD Publishing.

Chalfin, A., O. Danielli, A. Hillis, Z. Jelveh, M. Luca, J. Ludwig, and S. Mullainathan. 2016. “Productivity and Selection of Human Capital with Machine Learning.” American Economic Review 106 (5): 124–127.

Coelli, M., and D. A. Green. 2012. “Leadership Effects: School Principals and Student Outcomes.” Economics of Education Review 31 (1): 92–109.

De Simone, G. 2013. “Render Unto Primary the Things Which are Primary’s: Inherited and Fresh Learning Divides in Italian Lower Secondary Education.” Economics of Education Review 35: 12–23.

Friedman, J. H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics 29 (5): 1189–1232.

Hainmueller, J., J. Mummolo, and Y. Xu. 2017. “How Much Should We Trust Estimates from Multiplicative Interaction Models? Simple Tools to Improve Empirical Practice.” Working Paper.

James, G., D. Witten, T. Hastie, and R. Tibshirani. 2013. An Introduction to Statistical Learning. New York: Springer. 6 edition.

Karaca-Mandic, P., E. C. Norton, and B. Dowd. 2012. “Interaction Terms in Nonlinear Models.” Health Services Research 47: 255–274.

Kertesi, G., and G. Kezdi. 2011. “The Roma/Non-Roma Test Score Gap in Hungary.” American Economic Review 101 (3): 519–525.

Kleinberg, J., H. Lakkaraju, J. Leskovec, J. Ludwig, and S. Mullainathan. 2017. “Human Decisions and Machine Predictions.” Quarterly Journal of Economics 133 (1): 237–293.

Koedel, C., K. Mihaly, and J. E. Rockoff. 2015. “Value-Added Modeling: A Review.” Economics of Education Review 47: 180–195.

Nowak, A., and P. Smith. 2017. “Textual Analysis in Real Estate.” Journal of Applied Econometrics 32 (4): 896–918.

OECD. 2010. OECD Review on Evaluation and Assessment Frameworks for Improving School Outcomes: Hungary Country Background Report. Technical report. Paris: OECD.

Su, X., K. Meneses, P. McNees, and O. Johnson. 2011. “Interaction Trees: Exploring the Differential Effects of an Intervention Programme for Breast Cancer Survivors.” Journal of the Royal Statistical Society. Series C (Applied Statistics) 60 (3): 457–474.

Todd, P. E., and K. I. Wolpin. 2003. “On the Specification and Estimation of the Production Function for Cognitive Achievement.” Economic Journal 113 (485): F3–33.

Vanthienen, J., and K. De Witte. 2017. Data Analytics Applications in Education. Boca Raton, FL: Taylor & Francis.

Varian, H. R. 2014. “Big Data: New Tricks for Econometrics.” The Journal of Economic Perspectives 28 (2): 3–28.

