arXiv:2002.04276v1 [stat.ML] 11 Feb 2020

TOWARDS BETTER UNDERSTANDING OF META-FEATURES CONTRIBUTIONS

A PREPRINT

Katarzyna Woźnica
Faculty of Mathematics and Information Science
Warsaw University of Technology
[email protected]

Przemysław Biecek
Faculty of Mathematics, Informatics and Mechanics, University of Warsaw
Faculty of Mathematics and Information Science, Warsaw University of Technology
[email protected]
https://orcid.org/0000-0001-8423-1823

February 12, 2020

ABSTRACT

Meta-learning is a difficult problem, as the expected performance of a model is affected by various aspects such as selected hyperparameters, dataset properties and landmarkers. Existing approaches are focused on searching for the best model but do not explain how these different aspects contribute to the performance of a model. To build a new generation of meta-models we need a deeper understanding of the relative importance of meta-features and of how to construct better meta-features. In this paper we (1) introduce a method that can be used to understand how meta-features influence model performance, (2) discuss the relative importance of different groups of meta-features, (3) analyse in detail the most informative hyperparameters, which may yield insights for the selection of empirical priors for hyperparameters. To our knowledge this is the first paper that uses techniques developed for eXplainable Artificial Intelligence (XAI) to examine the behaviour of a meta-model.

1 Introduction

Meta-learning, or learning to learn, is a wide area of machine learning focused on extracting knowledge about building pipelines of predictive models. Transferring this experience is closely related to the model selection problem and to the optimisation of a performance measure. Investigating the structure of the dependency between a given dataset and the performance of a selected machine learning algorithm presumably enables choosing a better model configuration, resulting in a higher performance measure. This process involves advanced techniques (Vanschoren, 2018) and requires meta-data about the analysed problems to gain insight into algorithm-task synergy. The effectiveness of meta-learning is largely dependent on the type and quality of meta-features, so it is crucial to consider a broad range of variables.

The primary sources of meta-features that feasibly predict model performance are dataset characteristics. They are important both for a better understanding of the structure of the data and for their potential application in assessing the similarity of tasks. A comprehensive summary of commonly used meta-features is provided in (Vanschoren, 2018). These characteristics are available for datasets uploaded to OpenML. In this overview, depending on the definitions, features are grouped in six clusters: simple, statistical, information-theoretic, complexity, model-based, and landmarkers. A detailed description of these groups may be found in (Rivolli et al., 2018). In (Bilalli et al., 2017) it is shown that the selection of the most relevant data characteristics depends on the definition of the meta-model, especially on the meta-response, which is the preferred performance measure. That study was limited in that it concentrated only on statistical and information-theoretic properties. Due to the high computational cost of extraction, model-based and landmarker meta-data were neglected.

The idea of landmarking is to represent a dataset in the meta-space by a vector of performance measures of simple and effective models. Rather than relying on statistical characterisations, landmarker models help to uncover a more sophisticated structure of the task. This method was introduced by (Pfahringer et al., 2000). Various models have been proposed as landmarkers (Balte, 2014), but one requirement is that the landmarker algorithms have diverse architectures, capturing different kinds of variable interrelationships.

Figure 1: Process of meta-model exploration. First we gather meta-features about selected datasets from the OpenML repository. Then we calculate model performance on these datasets for selected configurations of hyperparameters. Second we assemble a surrogate meta-model, here the gradient boosting model. Third we use XAI techniques to extract information about the relative importance of meta-features and their marginal responses.

Apart from dataset properties, the detailed model configuration, i.e. the settings of hyperparameters, may be highly informative for a meta-model. It has been observed that some algorithms are sensitive to the tuning of specific parameters (Probst et al., 2018). For that reason, automatic optimisation of model hyperparameters is a popular research topic. One approach to this automation is building a surrogate model on historical datasets, their properties and the model hyperparameter space (Wistuba et al., 2015, 2018). These papers are focused on extracting from such a meta-model the best hyperparameter configuration for a new task rather than on analysing the impact of individual meta-features.

1.1 Our contributions

This paper is divided into three parts (see Figure 1). Section 2 describes the collection of meta-features for the meta-learning problem. The next part concentrates on the selection of a meta-model. In Section 4 we present a set of eXplainable AI (XAI) tools to examine the structure of the built surrogate model. We draw conclusions about the importance of three types of meta-features: data properties, landmarkers and hyperparameters.

The methods proposed in Section 4 are universal and can be applied to diverse definitions of the meta-learning problem with different definitions of the meta-feature space. This work may therefore be seen as a case study of the application of these methods. For the case study we used the OpenML100 database.

2 Meta-learning

Meta-learning is a multi-stage process. In our study we distinguish two main steps:

Step 1: establishment of the meta-space for the meta-learning problem:

  • meta-features:
    – for every dataset, 61 statistical properties computed in OpenML,
    – a space of 5 considered hyperparameters of the gradient boosting algorithm,
    – 4 landmarkers from baseline models;

  • meta-response: the predictive score for every pair of selected machine learning model and dataset.
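To make the meta-space concrete, one meta-example can be thought of as a single row joining the three meta-feature groups with the meta-response. The paper's experiments use R; the sketch below uses Python, and all names and values are illustrative placeholders, not taken from the paper's data.

```python
# Sketch of one meta-example: dataset characteristics (61 in the paper, two
# shown), gbm hyperparameters (5 per configuration), landmarkers (4 in the
# paper, one shown), and the meta-response. All values are placeholders.
meta_example = {
    # dataset characteristics
    "NumberOfInstances": 768,
    "NumberOfFeatures": 8,
    # gbm hyperparameters
    "n.trees": 1480,
    "interaction.depth": 4,
    "n.minobsinnode": 7,
    "shrinkage": 0.003,
    "bag.fraction": 0.69,
    # landmarker: rating of a baseline model relative to default gbm
    "kknn_def_to_gbm_def": 1.10,
    # meta-response: scaled ranking of this configuration on this dataset
    "meta_response": 0.72,
}

def split_xy(row):
    """Separate meta-features (X) from the meta-response (y)."""
    y = row["meta_response"]
    x = {k: v for k, v in row.items() if k != "meta_response"}
    return x, y

x, y = split_xy(meta_example)
```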


Step 2: training and selection of the meta-learning surrogate model using one-dataset-out cross-validation.

Table 1: Meta-features for selected datasets from the OpenML repository. Four landmarks (performance relative to the default gbm model) and five hyperparameters of the gbm model are presented. Dataset characteristics are omitted for brevity. Only the optimal hyperparameters are presented for each individual dataset.

                                          HYPERPARAMETERS FOR GBM MODEL                      LANDMARKS
DATASET (ID)                              SHRINK.  INTER.DEPTH  N.TREES  BAG FRACT.  MIN.NODE   KNN  GLMNET  RANGER  RANDOMFOREST
DIABETES (37)                                0.00            4     1480        0.69         7  1.10    2.25    2.36          2.30
SPAMBASE (44)                                0.04            5     1414        0.98        16  2.97    4.78    7.57          7.74
ADA_AGNOSTIC (1043)                          0.05            5      333        0.90         7  2.31    5.41    6.28          5.37
MOZILLA4 (1046)                              0.09            4     1567        0.54         7  1.71    0.39    2.62          2.84
PC4 (1049)                                   0.01            5     2367        0.75        12  1.11    2.13    3.46          3.57
PC3 (1050)                                   0.04            1      949        0.90        11  0.86    1.34    1.85          1.88
KC2 (1063)                                   0.00            2      273        1.00        21  0.81    0.73    0.90          0.98
KC (1067)                                    0.00            3     5630        0.26         3  0.23    1.16    1.49          1.66
PC1 (1068)                                   0.00            4     6058        0.21        14  1.12    0.27    1.92          1.84
BANKNOTE AUTHENTICATION (1462)               0.03            1     8429        0.52        12  6.47    4.99    5.56          6.02
BLOOD TRANSFUSION SERVICE CENTER (1464)      0.01            1      394        0.21        10  0.64    1.23    0.86          0.71
CLIMATE MODEL SIMULATION CRASHES (1467)      0.00            2      654        0.26        15  0.35    1.36    1.12          1.15
EEG-EYE-STATE (1471)                         0.08            5     2604        0.28        14  2.48    0.93    3.34          4.16
HILL-VALLEY (1479)                           0.08            5     2604        0.28        14  1.78    0.43    2.04          2.24
MADELON (1485)                               0.05            5      333        0.90         7  0.34    0.48    1.61          1.58
OZONE-LEVEL-8HR (1487)                       0.00            3     8868        0.33        11  1.41    3.40    4.43          4.36
PHONEME (1489)                               0.02            5     5107        0.60        18  4.89    2.90    7.12          8.13
QSAR-BIODEG (1494)                           0.05            5      333        0.90         7  4.16    5.34    6.74          6.81
WDBC (1510)                                  0.04            1      949        0.90        11  1.56    0.63    1.88          1.91
WILT (1570)                                  0.00            4     6058        0.21        14  2.73    4.80    7.15          7.28

2.1 Datasets and their characterizations

In this study we have focused on the OpenML100 database (Bischl et al., 2017), so the considered datasets are real-world tasks. On this platform, 61 characteristics are computed automatically for every uploaded tabular dataset. A summary of the available dataset properties is included in (Vanschoren, 2018). As noted in (Bilalli et al., 2017), some of the statistics can be generated only for datasets that contain continuous variables, some only for datasets with at least one categorical column, and a third group can be calculated for any type of data. We decided to analyse binary classification problems and selected 20 datasets in which all predictive variables are continuous. For these datasets, 38 statistical and information-theoretic properties are available. See Table 1 for the list of all selected datasets.

2.2 Algorithms and hyperparameters space

We decided to explore the sensitivity of gradient boosting classifiers (gbm) to tuning, so the same algorithm was fitted on each dataset with various parameters. 100 random configurations of hyperparameters were drawn to examine their influence on model predictive power. The region from which hyperparameters are sampled follows (Probst et al., 2018). Additionally, one of the considered configurations is the set of default values in the R software (R Core Team, 2019). In this study we consider the following hyperparameters of the gbm model:

  • n.trees,
  • interaction.depth,
  • n.minobsinnode,
  • shrinkage,
  • bag.fraction.
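The random sampling of configurations can be sketched as below. The paper's experiments use R and follow the sampling region of (Probst et al., 2018); this Python sketch uses illustrative bounds that are our assumptions, not the paper's.

```python
import random

# Illustrative sampling of gbm hyperparameter configurations.
# The ranges below are assumptions for this sketch; the paper samples
# from the region defined in Probst et al. (2018).
def sample_configuration(rng):
    return {
        "n.trees": int(2 ** rng.uniform(8, 13)),    # ~256 .. 8192, log scale
        "interaction.depth": rng.randint(1, 5),
        "n.minobsinnode": rng.randint(3, 25),
        "shrinkage": 2 ** rng.uniform(-10, -3),     # ~0.001 .. 0.125, log scale
        "bag.fraction": rng.uniform(0.2, 1.0),
    }

rng = random.Random(42)
configurations = [sample_configuration(rng) for _ in range(100)]
```

Sampling shrinkage and n.trees on a log scale matches the logarithmic axes used later in the Ceteris Paribus plots.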


2.3 Model and dataset predictive power

For every pair of algorithm-with-hyperparameter-configuration and dataset we specified 20 train/test data splits. The model was fitted on each train subset and the AUC was then computed on the corresponding test subset.

On the basis of the AUC results, for each individual dataset we created a ranking of models: the higher the AUC, the higher the position in the ranking. Ratings are scaled to the [0, 1] interval. Every configuration of each algorithm appears in the list 20 times because of the train/test splits. To aggregate this to one value per model, we computed the average rating.
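The rating procedure above can be sketched as follows (a minimal Python illustration, not the paper's code): within each split, configurations are ranked by AUC, ranks are scaled to [0, 1], and the per-split ratings are averaged per configuration.

```python
from statistics import mean

def scaled_ratings(aucs):
    """Rank configurations by AUC within one split (higher AUC -> higher
    rating) and scale the ratings to the [0, 1] interval. Ties not handled."""
    order = sorted(range(len(aucs)), key=lambda i: aucs[i])
    n = len(aucs)
    ratings = [0.0] * n
    for rank, i in enumerate(order):
        ratings[i] = rank / (n - 1) if n > 1 else 1.0
    return ratings

def average_rating(auc_per_split):
    """auc_per_split: one list of AUCs per split (one entry per configuration).
    Returns the averaged rating for each configuration."""
    per_split = [scaled_ratings(aucs) for aucs in auc_per_split]
    n_conf = len(auc_per_split[0])
    return [mean(r[i] for r in per_split) for i in range(n_conf)]

# Example: 2 splits, 3 configurations
ratings = average_rating([[0.7, 0.9, 0.8], [0.65, 0.85, 0.9]])
```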

2.4 Landmarkers

We used the landmarker technique as a distinct approach to characterise the learning problem of a given dataset. As landmarker models we considered five machine learning algorithms with default hyperparameter configurations:

  • generalized linear regression with regularization,
  • gradient boosting,
  • k nearest neighbours,
  • two random forest implementations: randomForest and ranger.

To evaluate their predictive power we also applied the 20 train/test split method and computed the AUC score. These five algorithms were ranked according to the methodology in Section 2.3. Because we mainly concentrate on analysing gradient boosting models, we computed the landmarkers as the ratio of the model rankings (knn, glmnet, ranger and randomForest) to the ranking of the default gbm configuration. As a result, we have four landmarker meta-features.
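Given the averaged ratings of Section 2.3, the four landmarker meta-features can be sketched as the ratio of each baseline model's rating to the rating of the default gbm (a simplified Python illustration; the feature names mirror the paper's, the ratings are made up):

```python
def landmarker_features(avg_ratings):
    """avg_ratings: averaged ratings of the default-configuration models,
    keyed by algorithm name, including 'gbm' for the default gbm baseline.
    Returns one ratio meta-feature per non-gbm baseline."""
    gbm = avg_ratings["gbm"]
    return {
        f"{name}_def_to_gbm_def": rating / gbm
        for name, rating in avg_ratings.items()
        if name != "gbm"
    }

# Illustrative ratings for the five baseline models
features = landmarker_features(
    {"gbm": 0.4, "kknn": 0.44, "glmnet": 0.9, "ranger": 0.94, "randomForest": 0.92}
)
```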

The meta-dataset with all meta-features can be found in a github repository (anonymous URL).

3 Selection of the surrogate meta-model

We consider three types of surrogate model: gradient boosting with interaction depth equal to 2 (named gbm shallow), gradient boosting with interaction depth equal to 10 (named gbm deep), and glmnet with the LASSO penalty (named glm) (Tibshirani, 1994). These algorithms have different structures and exploit different aspects of the variable relationships in this meta-learning problem. Glmnet with the LASSO penalty finds linear dependencies and simultaneously prevents overfitting. Gradient boosting captures different levels of interactions depending on the interaction depth.

To compare the quality of the meta-models we apply one-dataset-out cross-validation: every model is trained on 19 datasets and then tested on the remaining data frame by calculating the mean square error (MSE) and the MSE of the model relative to the MSE of a baseline that predicts the mean ranking (RATIO MSE). For the test dataset we also calculate the Spearman correlation between the predicted rankings and the actual meta-response. In Table 2 we summarise the performance metrics of the leave-one-dataset-out validation. The presented values are averages across all validation datasets.
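The per-fold evaluation can be sketched as below. The table reports negative RATIO MSE values, which suggests a relative change rather than a plain ratio, so this sketch assumes RATIO MSE = MSE_model / MSE_baseline − 1; that interpretation is our assumption. `y_pred` stands in for the meta-model's predictions on the held-out dataset.

```python
from statistics import mean

def mse(y_true, y_pred):
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))

def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(y_true, y_pred):
    """Spearman correlation as Pearson correlation of ranks (no tie handling)."""
    rt, rp = ranks(y_true), ranks(y_pred)
    mt, mp = mean(rt), mean(rp)
    cov = sum((a - mt) * (b - mp) for a, b in zip(rt, rp))
    st = sum((a - mt) ** 2 for a in rt) ** 0.5
    sp = sum((b - mp) ** 2 for b in rp) ** 0.5
    return cov / (st * sp)

def evaluate_fold(y_train, y_test, y_pred):
    """Metrics for one one-dataset-out fold. The baseline predicts the mean
    ranking seen in the training meta-data. RATIO_MSE < 0 means the model
    beats the baseline (assumed definition, see lead-in)."""
    baseline = [mean(y_train)] * len(y_test)
    model_mse = mse(y_test, y_pred)
    return {
        "MSE": model_mse,
        "RATIO_MSE": model_mse / mse(y_test, baseline) - 1,
        "SPEARMAN": spearman(y_test, y_pred),
    }
```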

Table 2: Summary of meta-model performance measures. MSE stands for mean square error (the lower the better); RATIO MSE stands for the MSE of the model relative to the MSE of the mean-ranking baseline (the lower the better).

META-MODEL     MSE     RATIO MSE   SPEARMAN COR
GBM DEEP       0.017   -0.488      0.662
GBM SHALLOW    0.024   -0.246      0.523
GLM            0.031    0.130      0.435

Gradient boosting deep has a mean MSE similar to gradient boosting shallow, but its relative improvement over the mean-ranking baseline is significantly higher. The Spearman correlation between predictions and actual values is also the highest for gbm deep. According to the above results, we select as the best meta-model the gradient boosting with interaction depth equal to 10, i.e. a model that may be rich in interactions between meta-features.

During the process of surrogate model selection and validation we built 20 different meta-learning gbm models; in each of them one of the datasets was left out of the training data frame. For further analysis we pick the model in which the dataset eeg-eye-state (1471) was used as the test dataset. In the following sections we refer to this meta-model as meta-gbm.


4 Explanatory meta-analysis of gbm meta-model

In this part we present the main contribution of this paper. We focus on understanding the structure of the meta-gbm model and on insight into the informative power of meta-features and their interactions. For a comprehensive analysis we use techniques developed for eXplainable Artificial Intelligence. For more details see (Biecek and Burzykowski, 2020; Molnar, 2019).

4.1 Meta-features importance

The first aspect is the examination of the influence of individual meta-variables and their ordering accordingly. This investigation helps to identify presumably noisy aspects and may be significant when deciding whether to exclude such meta-features from new generations of meta-models. We used permutation-based measures of variable importance (Fisher et al., 2018). Figure 2 presents the most influential meta-features. The x-axis shows the drop-out loss, i.e. the increase of the root mean square error caused by the permutation of the corresponding variable.
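Permutation importance (Fisher et al., 2018) can be sketched as: shuffle one meta-feature column, re-score the meta-model, and report the increase in RMSE over the unpermuted baseline. In this Python illustration `predict` is a stand-in for the fitted meta-model; the toy predictor and data are ours.

```python
import random
from statistics import mean

def rmse(y_true, y_pred):
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred)) ** 0.5

def permutation_importance(predict, X, y, feature, rng, n_repeats=10):
    """Average increase in RMSE after permuting one feature column.
    X is a list of dict rows; `predict` maps a list of rows to predictions."""
    baseline = rmse(y, predict(X))
    drops = []
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)
        permuted = [dict(row, **{feature: v}) for row, v in zip(X, column)]
        drops.append(rmse(y, predict(permuted)) - baseline)
    return mean(drops)

# Toy example: y depends on 'a' but not on 'b', so permuting 'a' hurts more.
data_rng = random.Random(0)
X = [{"a": i / 20, "b": data_rng.random()} for i in range(20)]
y = [row["a"] for row in X]
predict = lambda rows: [row["a"] for row in rows]  # "perfect" toy meta-model
imp_a = permutation_importance(predict, X, y, "a", random.Random(1))
imp_b = permutation_importance(predict, X, y, "b", random.Random(1))
```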

Figure 2: Top 15 most important meta-features in the GBM meta-model (validation dataset: 1471). Lengths of the bars correspond to the permutational feature importance.

Figure 3: Permutational feature importance for the three groups of meta-features: hyperparameters, landmarkers and dataset properties.

The most important meta-features are shrinkage, interaction depth and the number of trees. These hyperparameters are followed by the landmarkers for k nearest neighbours and ranger. Dataset properties play a minor role in this ranking; noticeable permutational feature importance is obtained only by MajorityClassSize, NumberOfInstances and MinorityClassSize. These meta-features are related to the size and class balance of a dataset.

The considered meta-features form three groups corresponding to the different approaches to creating them. Figure 3 presents the assessment of group influence. As we can see, the most important class of meta-features is the hyperparameters, and this conclusion is consistent with the importance measures for individual variables. Landmarkers and dataset characteristics have similar drop-out values.

4.2 Grouping meta-features

In the previous section we analysed the importance of three groups defined by the origin of the meta-features. We also group meta-features automatically, based on their correlations. The resulting dendrogram is presented in Figure 4.

Some of the dataset statistical characteristics are closely related and have high pairwise correlations. For example, the quartiles of means and standard deviations form a correlated cluster. Hyperparameters are uncorrelated with landmarkers and dataset properties; moreover, they have close to zero correlation within their own group. This observation agrees with the experimental design, since hyperparameters were sampled independently. The landmarkers for ranger and randomForest are strongly correlated because they are two implementations of the same algorithm.

Figure 4: Groups of meta-features based on their Spearman correlations. Colors denote the group of meta-features.
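The grouping in Figure 4 can be sketched as clustering on the Spearman correlation matrix. The paper does not specify the linkage; the Python sketch below uses the simplest variant, connected components of features whose |ρ| exceeds a threshold (equivalent to a single-linkage cut), which is our assumption.

```python
from statistics import mean

def spearman(x, y):
    """Spearman correlation via Pearson correlation of ranks (no tie handling)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def correlation_groups(columns, threshold=0.9):
    """Group features whose |Spearman rho| exceeds `threshold`, via
    connected components (union-find), i.e. a single-linkage cut."""
    names = list(columns)
    parent = {n: n for n in names}
    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(spearman(columns[a], columns[b])) > threshold:
                parent[find(b)] = find(a)
    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())

# Toy columns: two monotonically related statistics and one independent one
cols = {
    "MeanMeans": [1, 2, 3, 4, 5],
    "MaxMeans": [2, 3, 4, 5, 7],   # monotone with MeanMeans -> rho = 1
    "shrinkage": [5, 1, 4, 2, 3],  # roughly independent
}
groups = correlation_groups(cols)
```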

4.3 Hyperparameters informativeness

So far we have investigated the overall impact of explanatory variables, which can be estimated as a single value. Now we focus on the type of effect of a selected meta-predictor and on how the meta-model response is sensitive to changes in this feature. In this analysis we apply Individual Conditional Expectations (ICE) (Goldstein et al., 2015) and its R implementation called Ceteris Paribus (CP) (Biecek, 2018).
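An ICE/CP profile fixes all meta-features of a single observation and varies one of them over a grid, recording the meta-model's prediction at each grid point. A minimal Python sketch, where `predict` is a toy stand-in for the fitted meta-gbm:

```python
def ceteris_paribus(predict, observation, feature, grid):
    """ICE/CP profile: vary `feature` over `grid` while all other
    meta-features stay fixed at the observation's values."""
    profile = []
    for value in grid:
        row = dict(observation, **{feature: value})
        profile.append((value, predict(row)))
    return profile

# Toy meta-model: predicted rating rises with n.trees, saturating at 4096.
predict = lambda row: min(row["n.trees"], 4096) / 4096
obs = {"n.trees": 1024, "shrinkage": 0.01, "interaction.depth": 4}
profile = ceteris_paribus(predict, obs, "n.trees", [256, 1024, 4096, 8192])
```

Plotting such profiles for every observation gives the grey lines of Figure 5; averaging them gives a Partial Dependence Plot.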

The ICE/CP analysis for selected hyperparameters is presented in Figure 5. In view of the distribution of the hyperparameters, we apply a logarithmic scale on the x-axis. On each plot we present ceteris paribus profiles for all datasets; grey lines correspond to datasets in the train subset of the meta-data in this particular fold of the one-dataset-out cross-validation, and the black dotted line corresponds to the eeg-eye-state test dataset.

For every dataset in the meta-data we have 100 observations because of the varying gbm hyperparameters, but the profiles for the instances of a selected dataset have similar shapes and are nearly parallel. We are especially interested in the outline of the curve, since the translation may be caused by the values of other meta-features. Therefore we present profiles for meta-features averaged for each dataset. In fact, we averaged only the hyperparameters, because the dataset characteristics and landmarkers have constant values across all observations of a dataset.

To recall, the higher the ranking, the better the predictive power of the given gradient boosting model. Thus increasing lines in the Ceteris Paribus profiles indicate a better rating of the considered gbm models.

We start with the most important variables according to Figure 2. For the shrinkage, interaction depth and number of trees hyperparameters we detect three patterns of profiles in the train dataset subset. We apply hierarchical clustering to the profiles, and for each of the three resulting groups the aggregated profiles are shown in Figure 5. These groups are indicated with distinctive colors and termed A, B and C. It is worth emphasizing that the groups labelled A for two different hyperparameters may consist of different datasets, because the clustering was executed independently.


Figure 5: Ceteris paribus profiles for the hyperparameters shrinkage, interaction.depth and n.trees. Grey solid lines correspond to profiles for datasets in the train meta-data; the black dotted line is the profile for the test dataset. Thick colored lines are aggregated profiles for the dataset clusters A, B and C.

For each of these three hyperparameters, group A has an increasing CP profile with a very strong trend for interior values of the variable. The profiles stabilise at a high prediction for larger meta-feature values. Similar behaviour can be observed in group B, with the difference that the maximum predicted rating is achieved for smaller values. Group C is strictly different: the highest predicted rankings are obtained by gbm models with a rather low number of trees and shallow trees. Because of the different types of profiles for these meta-features, we infer that the meta-model predicts different hyperparameters as optimal for different datasets.

The next meta-variable in the importance ranking is the relative landmark for k nearest neighbours. The corresponding CP profile is presented in Figure 6. In this case all profiles have a similar shape and it is difficult to distinguish any clusters.

Figure 6: Ceteris paribus profiles for the kknn landmark (logarithmic scale). Grey solid lines correspond to profiles for datasets in the train meta-data; the thick red line is the Partial Dependence Plot (PDP), i.e. the aggregated profile over all datasets.

On every plot we observe that the Ceteris Paribus profile for the test dataset has a shape similar to the curves for the training datasets. Together with the low MSE score on the test dataset, this lets us conclude that meta-gbm is accurate and generalises over the meta-problem.

5 Conclusions

Meta-modelling is a promising approach to AutoML. In this work we have shown how to use XAI techniques not only to build an effective meta-model, but also to extract knowledge about the importance of particular meta-features in the meta-model.


These contributions can be seen on two levels. The first is a universal and generic approach to explainable analysis of any meta-learning model, presented in Figure 1. In case of employing different types of meta-features, for instance different landmarker models, we provide a comprehensive method to investigate the meta-model. This approach can be reproduced on a repository of datasets from a specific domain, or on datasets of a selected size or complexity.

The second level is the application of this process to a specific investigation of the predictive structure of the meta-gbm model trained on OpenML datasets. The proposed meta-gbm model is generally well fitted to the meta-problem, which is confirmed by the small value of the MSE (see Table 2). The meta-model learns to propose different hyperparameter configurations as optimal: Figure 5 shows that the individual CP profiles have their maxima for different configurations of hyperparameters. For the selected OpenML datasets and the considered meta-features we conclude that the statistical properties of datasets are relatively less important for the meta-gbm model in the presence of hyperparameters (see Figure 2). Only the characteristics related to the size of the dataset may be treated as significant. This conclusion may change when a researcher defines the meta-problem in a different way. It does not mean that no other dataset characteristic is important; this technique can certainly be used to assess the predictive meta-importance of new attempts at extracting dataset characteristics.

References

A. Balte. Meta-Learning With Landmarking: A Survey. Technical Report 8, 2014.

P. Biecek. DALEX: Explainers for Complex Predictive Models in R. Journal of Machine Learning Research, 19(84):1–5, 2018. URL http://jmlr.org/papers/v19/18-416.html.

P. Biecek and T. Burzykowski. Explanatory Model Analysis, 2020. URL https://pbiecek.github.io/ema/.

B. Bilalli, A. Abelló, and T. Aluja-Banet. On the predictive power of meta-features in OpenML. International Journal of Applied Mathematics and Computer Science, 27(4):697–712, 2017. doi: 10.1515/amcs-2017-0048.

B. Bischl, G. Casalicchio, M. Feurer, F. Hutter, M. Lang, R. G. Mantovani, J. N. van Rijn, and J. Vanschoren. OpenML Benchmarking Suites. 2017. URL http://arxiv.org/abs/1708.03731.

A. Fisher, C. Rudin, and F. Dominici. All Models are Wrong, but Many are Useful: Learning a Variable's Importance by Studying an Entire Class of Prediction Models Simultaneously. 2018. URL http://arxiv.org/abs/1801.01489.

A. Goldstein, A. Kapelner, J. Bleich, and E. Pitkin. Peeking Inside the Black Box: Visualizing Statistical Learning With Plots of Individual Conditional Expectation. Journal of Computational and Graphical Statistics, 24(1):44–65, 2015. doi: 10.1080/10618600.2014.907095.

C. Molnar. Interpretable Machine Learning, 2019. URL https://christophm.github.io/interpretable-ml-book/.

B. Pfahringer, H. Bensusan, and C. Giraud-Carrier. Meta-Learning by Landmarking Various Learning Algorithms. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), pages 743–750, 2000.

P. Probst, A.-L. Boulesteix, and B. Bischl. Tunability: Importance of Hyperparameters of Machine Learning Algorithms. Technical report, 2018.

R Core Team. R: A Language and Environment for Statistical Computing, 2019. URL https://www.r-project.org/.

A. Rivolli, L. P. F. Garcia, C. Soares, and J. Vanschoren. Towards Reproducible Empirical Research in Meta-learning. Technical report, 2018. URL https://CRAN.R-project.org/package=mfe.

R. Tibshirani. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.

J. Vanschoren. Meta-Learning: A Survey. pages 1–29, 2018. URL http://arxiv.org/abs/1810.03548.

M. Wistuba, N. Schilling, and L. Schmidt-Thieme. Learning hyperparameter optimization initializations. In Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA 2015), 2015. doi: 10.1109/DSAA.2015.7344817.

M. Wistuba, N. Schilling, and L. Schmidt-Thieme. Scalable Gaussian process-based transfer surrogates for hyperparameter optimization. Machine Learning, 107(1):43–78, 2018. doi: 10.1007/s10994-017-5684-y.