Transfer Learning for Improving Model Predictions in Highly Configurable Software
Pooyan Jamshidi, Miguel Velez, Christian Kästner, Norbert Siegmund, Prasad Kawthekar


Page 1: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Transfer Learning for Improving Model Predictions in Highly Configurable Software

Pooyan Jamshidi, Miguel Velez, Christian Kästner, Norbert Siegmund, Prasad Kawthekar

Page 2: Transfer Learning for Improving Model Predictions in Highly Configurable Software

My background and objectives

• PhD (2010-13): Fuzzy control for auto-scaling in cloud
• Postdoc 1 (2014): Fuzzy Q-Learning for online rule update, MAPE-KE
• Postdoc 2 (2015-16): Configuration optimization and performance modeling for big data systems (stream processing, Map-Reduce, NoSQL)
• Now: Transfer learning for making performance model predictions more accurate and more reliable for highly configurable software

Objectives:
• Opportunities for applying model learning at runtime
• Collaborations

2

Page 3: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Motivation

1- Many different parameters =>
- large state space
- interactions

2- Defaults are typically used =>
- poor performance

Page 4: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Motivation

[Figure: histograms of observations over average read latency (µs) for (a) cass-20 and (b) cass-10, with the best and worst configurations highlighted.]

Experiments on Apache Cassandra:
- 6 parameters, 1024 configurations
- Average read latency
- 10 million records (cass-10)
- 20 million records (cass-20)

Look at these outliers! Large statistical dispersion.

Page 5: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Problem-solution overview

• The robotic software is considered a highly configurable system.
• The configuration of the robot influences its performance (as well as its energy usage).
• Problem: there are many different parameters, making the configuration space high-dimensional and the influence of configuration on system performance difficult to understand.
• Solution: learn a black-box performance model from measurements on the robot: performance_value = f(configuration) (a minimal sketch of this step follows below).
• Challenge: measurements on the real robot are expensive (time-consuming, require human resources, risky on failure).
• Our contribution: performing most measurements on a simulator (Gazebo) and taking only a few samples from the real robot to learn a reliable and accurate performance model within a given experimental budget.

5
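The solution bullet above boils down to fitting a regression model over (configuration, performance) samples. Below is a minimal sketch of that idea in Python with scikit-learn; `measure_performance` and the two-parameter configuration space are made-up placeholders, not the robot setup from the talk.

```python
# Minimal sketch (not the talk's implementation): learn a black-box
# performance model performance_value = f(configuration) from measured samples.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def measure_performance(config):
    # Placeholder for an expensive measurement on the robot or simulator.
    return 5.0 + 2.0 * config[0] - 1.5 * config[1] + np.random.normal(0, 0.1)

# A tiny configuration space: two numeric parameters in [0, 1].
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(20, 2))            # sampled configurations
y_train = np.array([measure_performance(x) for x in X_train])

model = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
model.fit(X_train, y_train)                          # f_hat: configuration -> performance

# Predict the performance of an unmeasured configuration.
print(model.predict(np.array([[0.3, 0.7]])))
```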

Page 6: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Overview of our approach

[Diagram: Measure data from the Source (Simulator, Gazebo) and the Target (Robot, TurtleBot); Learn Model produces an ML Model used to Predict Performance; the Predictions feed Reasoning and Adaptation.]

6

Page 7: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Overview of our approach

[Overview diagram as on the previous slide, now highlighting the Data measured from the source and transferred to the target.]

7

Page 8: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Overview of our approach

We get more, cheaper samples from the simulator and only a few expensive ones from the real robot to learn an accurate model that predicts the performance of the robot. Here, performance may denote "battery usage", "localization error", or "mission time".

[Overview diagram repeated from the previous slides.]

8

Page 9: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Goal

9

information from the previous versions, the acquired data on the current version, and apply a variety of kernel estimators [27] to locate regions where optimal configurations may lie. The key benefit of MTGPs over GPs is that the similarity between the response data helps the model to converge to more accurate predictions much earlier. We experimentally show that TL4CO outperforms state-of-the-art configuration tuning methods. Our real configuration datasets are collected for three stream processing systems (SPS) implemented with Apache Storm, a NoSQL benchmark system with Apache Cassandra, and a dataset on 6 cloud-based clusters obtained in [21], worth 3 months of experimental time. The rest of this paper is organized as follows. Section 2 overviews the problem and motivates the work via an example. TL4CO is introduced in Section 3 and then validated in Section 4. Section 5 provides a look behind the scenes and Section 6 concludes the paper.

2. PROBLEM AND MOTIVATION

2.1 Problem statement

In this paper, we focus on the problem of optimal system configuration defined as follows. Let X_i indicate the i-th configuration parameter, which ranges in a finite domain Dom(X_i). In general, X_i may indicate either (i) an integer variable such as the level of parallelism, or (ii) a categorical variable such as the messaging framework, or a Boolean variable such as enabling a timeout. We use the terms parameter and factor interchangeably; also, with the term option we refer to the possible values that can be assigned to a parameter.

We assume that each configuration x ∈ X in the configuration space X = Dom(X_1) × ... × Dom(X_d) is valid, i.e., the system accepts this configuration and the corresponding test results in stable performance behavior. The response with configuration x is denoted by f(x). Throughout, we assume that f(·) is latency; however, other response metrics may be used. We here consider the problem of finding an optimal configuration x* that globally minimizes f(·) over X:

x* = argmin_{x ∈ X} f(x)    (1)

In fact, the response function f(·) is usually unknown or partially known, i.e., y_i = f(x_i), x_i ⊂ X. In practice, such measurements may contain noise, i.e., y_i = f(x_i) + ε. The determination of the optimal configuration is thus a black-box optimization program subject to noise [27, 33], which is considerably harder than deterministic optimization. A popular solution is based on sampling and starts with a number of sampled configurations. The performance of the experiments associated with these initial samples can deliver a better understanding of f(·) and guide the generation of the next round of samples. If properly guided, the process of sample generation-evaluation-feedback-regeneration will eventually converge and the optimal configuration will be located. However, a sampling-based approach of this kind can be prohibitively expensive in terms of time or cost (e.g., rental of cloud resources), considering that a function evaluation is costly and the optimization process may require several hundred samples to converge.
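To make the sample generation-evaluation-feedback-regeneration loop above concrete, here is an illustrative Python sketch of such a loop with a GP surrogate. It is not BO4CO or TL4CO; `evaluate`, the candidate pool, and the acquisition rule are simplifying assumptions.

```python
# Illustrative sampling-based search loop (not the TL4CO/BO4CO algorithms).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def evaluate(x):
    return float(np.sum((x - 0.3) ** 2))              # stand-in for a costly latency measurement

rng = np.random.default_rng(1)
pool = rng.uniform(0, 1, size=(200, 3))                # candidate configurations
idx0 = rng.choice(len(pool), 5, replace=False)         # initial random sample
X, y = pool[idx0], np.array([evaluate(x) for x in pool[idx0]])
pool = np.delete(pool, idx0, axis=0)

for _ in range(20):                                    # fixed measurement budget
    gp = GaussianProcessRegressor(alpha=1e-6, normalize_y=True).fit(X, y)
    mu, sd = gp.predict(pool, return_std=True)
    idx = int(np.argmin(mu - sd))                      # crude lower-confidence-bound rule
    X = np.vstack([X, pool[idx]])
    y = np.append(y, evaluate(pool[idx]))
    pool = np.delete(pool, idx, axis=0)                # do not re-measure the same point

print("best configuration found:", X[np.argmin(y)], "latency:", y.min())
```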

2.2 Related work

Several approaches have attempted to address the above problem. Recursive Random Sampling (RRS) [43] integrates a restarting mechanism into the random sampling to achieve high search efficiency. Smart Hill Climbing (SHC) [42] integrates importance sampling with Latin Hypercube Design (lhd). SHC estimates the local regression at each potential region, then searches toward the steepest descent direction. An approach based on direct search [45] forms a simplex in the parameter space from a number of samples, and iteratively updates the simplex through a number of well-defined operations including reflection, expansion, and contraction to guide the sample generation. Quick Optimization via Guessing (QOG) [31] speeds up the optimization process by exploiting heuristics to filter out sub-optimal configurations. The statistical approach in [33] approximates the joint distribution of parameters with a Gaussian in order to guide sample generation towards the distribution peak. A model-based approach [35] iteratively constructs a regression model representing performance influences. Some approaches, such as [8], enable dynamic detection of optimal configurations in dynamic environments. Finally, our earlier work BO4CO [21] uses Bayesian Optimization based on GPs to accelerate the search process; more details are given later in Section 3.

2.3 Solution

All the previous efforts attempt to improve the sampling process by exploiting the information that has been gained in the current task. We define a task as an individual tuning cycle that optimizes a given version of the system under test. As a result, the learning is limited to the current observations and still requires hundreds of sample evaluations. In this paper, we propose to adopt a transfer learning method to deal with search efficiency in configuration tuning. Rather than starting the search from scratch, the approach transfers the learned knowledge coming from similar versions of the software to accelerate the sampling process in the current version. This idea is inspired by several observations arising in real software engineering practice [2, 3]. For instance, (i) in DevOps, different versions of a system are delivered continuously, (ii) Big Data systems are developed using similar frameworks (e.g., Apache Hadoop, Spark, Kafka) and run on similar platforms (e.g., cloud clusters), and (iii) different versions of a system often share a similar business logic.

To the best of our knowledge, only one study [9] explores the possibility of transfer learning in system configuration. The authors learn a Bayesian network in the tuning process of a system and reuse this model for tuning other similar systems. However, the learning is limited to the structure of the Bayesian network. In this paper, we introduce a method that reuses not only a model that has been learned previously but also the valuable raw data. Therefore, we are not limited to the accuracy of the learned model. Moreover, we do not consider Bayesian networks and instead focus on MTGPs.

2.4 Motivation

A motivating example. We now illustrate the previous points on an example. WordCount (cf. Figure 1) is a popular benchmark [12]. WordCount features a three-layer architecture that counts the number of words in the incoming stream. A Processing Element (PE) of type Spout reads the input messages from a data source and pushes them into the system. A PE of type Bolt named Splitter is responsible for splitting sentences, which are then counted by the Counter.

[Figure 1: WordCount architecture. A file is streamed to a Kafka topic; the Kafka Spout reads sentences, the Splitter Bolt splits them into words, and the Counter Bolt emits word counts, e.g., [paintings, 3], [poems, 60], [letter, 75].]


Partially known

Taking measurements

Configuration space

A black-box response function f : X → R is used to build a performance model given some observations of the system performance under different settings. In practice, though, the observation data may contain noise, i.e., y_i = f(x_i) + ε_i, x_i ∈ X, where ε_i ∼ N(0, σ_i), and we only partially know the response function through the observations D = {(x_i, y_i)}, i = 1, ..., d, with |D| ≪ |X|. In other words, a response function is simply a mapping from the configuration space to a measurable performance metric that produces interval-scaled data (here we assume it produces real numbers). Note that we can learn a model for all measurable attributes (including accuracy, and safety if suitably operationalized), but here we mostly train predictive models on performance attributes (e.g., response time, throughput, CPU utilization).

Our main goal is to learn a reliable regression model, f̂(·), that can predict the performance of the system, f(·), given a limited number of observations D. More specifically, we aim to minimize the prediction error over the configuration space:

argmin_{x ∈ X} pe = |f̂(x) − f(x)|    (1)

In order to solve the problem above, we assume f(·) is reasonably smooth, but otherwise little is known about the response function. Intuitively, we expect that for "near-by" input points x and x′, their corresponding output points y and y′ are "near-by" as well.
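As a concrete reading of Eq. (1), the sketch below estimates the prediction error |f̂(x) − f(x)| of a learned model on held-out configurations. The response function and sample sizes are synthetic assumptions, not data from the paper.

```python
# Sketch: empirically estimating the prediction error pe = |f_hat(x) - f(x)|
# from Eq. (1) on held-out configurations (synthetic data).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def f(X):                                              # "true" noisy response
    return np.sin(3 * X[:, 0]) + X[:, 1] + np.random.normal(0, 0.05, len(X))

rng = np.random.default_rng(2)
X_train = rng.uniform(0, 1, (30, 2))                   # limited observations D
X_test = rng.uniform(0, 1, (200, 2))                   # held-out configurations

f_hat = GaussianProcessRegressor(alpha=1e-3, normalize_y=True).fit(X_train, f(X_train))

pe = np.abs(f_hat.predict(X_test) - f(X_test))         # per-configuration error
print("mean / max absolute prediction error:", pe.mean(), pe.max())
```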

D. State of the art

In the literature, model learning has been approached from two standpoints: (i) sampling strategies and (ii) learning methods.

1) Model learning: Sampling. Random sampling has been used to collect unbiased observations in computer-based experiments. However, random sampling may require a large number of samples to build an accurate model [31]. More intelligent sampling approaches (such as Latin Hypercube, Box-Behnken, and Plackett-Burman) have been developed in the statistics and machine learning communities, under the umbrella of experimental designs, to ensure certain statistical properties [25]. The aim of these different experimental designs is to ensure that we gain a high level of information from sparse sampling (partial design) in high-dimensional spaces. Relevant to the software engineering community, several approaches tried different experimental designs for highly configurable software [16], and some even consider cost as an explicit factor to determine optimal sampling [30].

Recently, researchers have tried novel ways of sampling with feedback embedded inside the process, where new samples are derived based on the information gained from the previous set of samples. A recent solution [8] uses active learning based on a random forest predictor to find good designs based on a multi-objective goal. Recursive Random Sampling (RRS) [37] integrates a restarting mechanism into the random sampling to achieve high search efficiency. Smart Hill Climbing (SHC) [36] integrates importance sampling with Latin Hypercube Design (lhd). SHC estimates the local regression at each potential region, then searches toward the steepest descent direction. An approach based on direct search [40] forms a simplex in the parameter space from a number of samples, and iteratively updates the simplex through a number of well-defined operations including reflection, expansion, and contraction to guide the sample generation. Quick Optimization via Guessing (QOG) [26] speeds up the optimization process by exploiting heuristics to filter out sub-optimal configurations.

Machine learning. Standard machine-learning techniques, such as support-vector machines, decision trees, and evolutionary algorithms, have also been tried [20], [38]. These approaches trade simplicity and understandability of the learned models for predictive power. For example, some recent work [39] exploited a characteristic of the response surface of configurable software to learn Fourier sparse functions from only a small sample size. Another approach [31] also exploited this fact, but iteratively constructs a regression model representing performance influences in an active learning process.

2) Positioning in the self-adaptive community: Performance reasoning is a key activity for decision making at runtime. Time series techniques [10] have been shown to be effective in predicting response time and uncovering performance anomalies ahead of time. FUSION [12], [11] exploited inter-feature relationships (e.g., feature dependencies) to reduce the dimensions of the configuration space, making runtime performance reasoning feasible. Different classification models have been evaluated for the purpose of time series predictions in [3]. Performance predictions have also been applied for resource allocation at runtime [19], [21]. Note that the approaches above are referred to as black-box models. However, another category of models, known as white-box models, are built early in the life cycle by studying the underlying architecture of the system [15], [17] using Queuing Networks, Petri Nets, and Stochastic Process Algebras [4]. Performance prediction and reasoning have also been used extensively in other communities such as component-based systems [5] and control theory [13].

3) Novelty: Previous work attempted to improve the prediction power of the model by exploiting the information that has been gained from the target system, either by improving the sampling process or by adopting a sophisticated learning method. Our approach is orthogonal to both sampling and learning, proposing a new way to enhance model accuracy by exploiting the knowledge we can gain from other relevant and possibly cheaper sources to accelerate the learning of an accurate performance model through transfer learning.

Transfer learning has been applied previously for regression and classification problems in machine learning [27] and software engineering (mainly for defect prediction [24]). However, to the best of our knowledge, our approach is the first attempt towards learning performance models using cost-aware transfer learning for configurable software.

III. COST-AWARE TRANSFER LEARNING

In this section, we explain our cost-aware transfer learning solution to improve learning an accurate performance model.

A. Solution overview

An overview of our model learning methodology is depicted in Figure 1. We use the observations that we can obtain cheaply, at a cost of c_s per sample, from an alternative measurable response function, g(·), that yields a different but related response to the target response, D_s = {(x_s, y_s) | y_s = g(x_s) + ε_s, x_s ∈ X}.


Response function to learn


Learn model

Page 10: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Cost model

10

[Fig. 1: Overview of the cost-aware transfer learning methodology: configuration parameters and samples feed (1) Model Learning, which produces (2) a Predictive Model; (3) a Cost Model and (4) Transfer Learning combine to yield an Improved Predictive Model. The inset plot shows average write latency (µs) versus throughput (ops/sec) for TL4CO and BO4CO.]

D_s = {(x_s, y_s) | y_s = g(x_s) + ε_s, x_s ∈ X}, and we transfer these observations to learn a performance model for the real system using only a few observations from that system, D_t = {(x_t, y_t) | y_t = f(x_t) + ε_t, x_t ∈ X}. The cost of obtaining each observation on the target system is c_t, and we assume that c_s ≪ c_t and that the source response function, g(·), is related (correlated) to the target function; this relatedness contributes to the learning of the target function using only sparse samples from the real system, i.e., |D_t| ≪ |D_s|. Intuitively, rather than starting from scratch, we transfer the learned knowledge from the data we have collected using cheap methods such as simulators or previous system versions, to learn a more accurate model of the target configurable system and use this model to reason about its performance at runtime.
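The following sketch illustrates the source/target setup described above in Python: many cheap samples of g(·) plus a few expensive samples of f(·) are combined into a transferred model f̂. It uses a simple "source model plus learned correction" scheme for illustration only; it is not the multi-task GP method the paper proposes, and g, f, and the sample sizes are synthetic.

```python
# Simplified transfer-learning sketch (NOT the paper's multi-task GP):
# fit a GP on many cheap source samples g(x), then model the target f(x)
# as the source prediction plus a correction learned from a few target samples.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def g(x): return np.sin(x) * 10 + 20                   # cheap source (e.g., simulator)
def f(x): return np.sin(x) * 12 + 25                   # costly target (real system)

rng = np.random.default_rng(3)
Xs = rng.uniform(0, 10, (50, 1)); ys = g(Xs).ravel()   # |D_s| large, cost c_s small
Xt = rng.uniform(0, 10, (4, 1));  yt = f(Xt).ravel()   # |D_t| small, cost c_t large

source_model = GaussianProcessRegressor(normalize_y=True).fit(Xs, ys)
residual = yt - source_model.predict(Xt)               # what the source misses
delta_model = GaussianProcessRegressor(normalize_y=True).fit(Xt, residual)

def f_hat(x):                                          # transferred target model
    return source_model.predict(x) + delta_model.predict(x)

Xq = np.linspace(0, 10, 5).reshape(-1, 1)
print("predicted:", f_hat(Xq), "true:", f(Xq).ravel())
```

The design intent mirrors the text: the bulk of the response shape comes from the plentiful source data, while the few target samples only have to correct the offset between source and target.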

In Figure 2, we demonstrate the idea of transfer learning with a synthetic example. Only three observations are taken from a synthetic target function, f(·), and we try to learn a model, f̂(x), which provides predictions for unobserved response values at some input locations. Figure 2(b) depicts the predictions provided by the model trained using only the 3 samples, without transfer learning from any source. The predictions are accurate around the three observations, but highly inaccurate elsewhere. Figure 2(c) depicts the model trained over the 3 observations plus 9 other observations from a different but related source. Thanks to transfer learning, the predictions here closely follow the true function with a narrow confidence interval (i.e., high confidence in the predictions provided by the model). Interestingly, the confidence intervals around the points x = 2, 4 in the model with transfer learning have disappeared, demonstrating more confident predictions with transfer learning.

Given a target environment, the effectiveness of any transfer learning depends on the source environment and how it is related to the target [33]. If the relationship is strong and the transfer learning method can take advantage of it, the performance of the target predictions can improve significantly via transfer. However, if the source response is not sufficiently related, or if the transfer method does not exploit the relationship, the performance may not improve or may even decrease. We show this case via the synthetic example in Figure 2(d), where the prediction model uses 9 samples from a misleading function that is unrelated to the target function (its response remains constant as we increase x); as a consequence, the predictions are very inaccurate and do not follow the growing trend of the true function.
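A lightweight way to guard against the negative transfer described above is to measure a handful of probe configurations on both source and target and check how correlated the responses are before transferring. The sketch below illustrates this; the probe set, the response functions, and the 0.5 threshold are illustrative assumptions, not part of the paper.

```python
# Sketch of a sanity check against negative transfer: compare source and target
# responses on a few common probe configurations before transferring.
import numpy as np

def target_response(x): return np.sin(x) * 12 + 25
def simulator_source(x): return np.sin(x) * 10 + 20            # related source
def misleading_source(x):                                      # (nearly) constant, as in Fig. 2(d)
    return 7.0 + 1e-3 * np.random.default_rng(4).standard_normal(x.shape)

probe = np.linspace(0, 10, 6)                                  # probe configurations
for name, g in [("simulator", simulator_source), ("misleading", misleading_source)]:
    r = np.corrcoef(g(probe), target_response(probe))[0, 1]
    related = np.isfinite(r) and r > 0.5                       # illustrative threshold
    print(name, "correlation with target:", round(float(r), 3),
          "-> transfer" if related else "-> skip transfer")
```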

[Fig. 2 plots: response values (0-200) over input locations (1-14); legend: source samples, target samples; panels (a)-(d).]

Fig. 2: (a) 9 samples have been taken from a source (respectively, 3 from a target); (b) [without transfer learning] a regression model is constructed using only the target samples; (c) [with transfer learning] another regression model is constructed using the target and the source samples; the new model is based on only 3 target measurements while providing accurate predictions; (d) [negative transfer] a regression model is constructed with transfer from a misleading response function, and therefore the model's predictions are inaccurate.

For the choice of the number of samples from the source and target, we use a simple cost model (cf. Figure 1) that is based on the number of source and target samples as follows:

C(D_s, D_t) = c_s · |D_s| + c_t · |D_t| + C_Tr(|D_s|, |D_t|),    (2)
C(D_s, D_t) ≤ C_max,    (3)

where C_max is the experimental budget and C_Tr(|D_s|, |D_t|) is the cost of model training, which is a function of the source and target sample sizes. Note that this cost model turns the problem we formulated in Eq. (1) into a multi-objective problem, where not only accuracy but also cost is considered in the transfer learning process.
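To show how Eqs. (2) and (3) could be used in practice, the sketch below enumerates source/target sample-size combinations that fit within a budget C_max. The per-sample costs, the training-cost function, and the selection rule are made-up values for illustration.

```python
# Illustrative use of the cost model in Eqs. (2)-(3): enumerate sample-size
# mixes whose total cost stays within the experimental budget C_max.
c_s, c_t, C_max = 0.1, 5.0, 100.0                      # assumed per-sample costs and budget

def training_cost(n_source, n_target):                 # stand-in for C_Tr(|D_s|, |D_t|)
    return 0.001 * (n_source + n_target) ** 2

feasible = [(ns, nt)
            for ns in range(0, 501, 50)                # candidate |D_s|
            for nt in range(0, 21)                     # candidate |D_t|
            if c_s * ns + c_t * nt + training_cost(ns, nt) <= C_max]

# e.g., prefer the feasible mix with the most target samples, then most source samples.
print(max(feasible, key=lambda p: (p[1], p[0])))
```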

B. Model learning: Technical details

We use Gaussian Process (GP) models to learn a reliable performance model f̂(·) that can predict unobserved response values. The main motivation for choosing GPs here is that they offer a framework in which performance reasoning can be done using mean estimates as well as a confidence interval for each estimate. In other words, where the model is confident in its estimate, it provides a small confidence interval; where it is uncertain, it gives a large confidence interval, meaning that the estimate should be used with caution. The other reason is that all the computations are based on linear algebra, which is cheap to compute. This is especially useful in the domain of self-adaptive systems, where automated performance reasoning in the feedback loop is typically time constrained and should be robust against any sort of uncertainty, including prediction errors.
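The point about mean estimates plus confidence intervals can be seen directly with a GP regressor; a minimal sketch on synthetic data with scikit-learn is shown below.

```python
# Sketch: a GP returns a mean prediction and a standard deviation, giving a
# confidence interval that is narrow near observed configurations and wide elsewhere.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, (8, 1))                         # a few observed configurations
y = np.sin(X).ravel()

gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
X_query = np.linspace(0, 10, 5).reshape(-1, 1)
mean, std = gp.predict(X_query, return_std=True)

for x, m, s in zip(X_query.ravel(), mean, std):
    print(f"x={x:4.1f}  prediction={m:6.2f}  95% CI=[{m - 1.96*s:6.2f}, {m + 1.96*s:6.2f}]")
```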


A black-box response function f : X ! R is usedto build a performance model given some observations ofthe system performance under different settings. In practice,though, the observation data may contain noise, i.e., y

i

=f(x

i

) + ✏

i

,x

i

2 X where ✏

i

⇠ N (0,�i

) and we onlypartially know the response function through the observationsD = {(x

i

, y

i

)}di=1, |D| ⌧ |X|. In other words, a response

function is simply a mapping from the configuration space toa measurable performance metric that produces interval-scaleddata (here we assume it produces real numbers). Note thatwe can learn a model for all measurable attributes (includingaccuracy, safety if suitably operationalized), but here wemostly train predictive models on performance attributes (e.g.,response time, throughput, CPU utilization).

Our main goal is to learn a reliable regression model, f̂(·),that can predict the performance of the system, f(·), given alimited number of observations D. More specifically, we aimto minimize the prediction error over the configuration space:

argminx2X

pe = |f̂(x)� f(x)| (1)

In order to solve the problem above, we assume f(·) isreasonably smooth, but otherwise little is known about theresponse function. Intuitively, we expect that for “near-by”input points x and x

0 their corresponding output points y andy

0 to be “near-by” as well.

D. State-of-the-art

In the literature, model learning has been approached from two standpoints: (i) sampling strategies and (ii) learning methods.

1) Model learning: Sampling. Random sampling has been used to collect unbiased observations in computer-based experiments. However, random sampling may require a large number of samples to build an accurate model [31]. More intelligent sampling approaches (such as Latin Hypercube, Box-Behnken, and Plackett-Burman) have been developed in the statistics and machine learning communities, under the umbrella of experimental design, to ensure certain statistical properties [25]. The aim of these experimental designs is to ensure that we gain a high level of information from sparse sampling (partial designs) in high-dimensional spaces. Relevant to the software engineering community, several approaches have tried different experimental designs for highly configurable software [16], and some even consider cost as an explicit factor to determine optimal sampling [30].

Recently, researchers have tried novel ways of sampling with feedback embedded in the process, where new samples are derived based on the information gained from the previous set of samples. A recent solution [8] uses active learning based on a random forest predictor to find good designs with respect to a multi-objective goal. Recursive Random Sampling (RRS) [37] integrates a restarting mechanism into random sampling to achieve high search efficiency. Smart Hill Climbing (SHC) [36] integrates importance sampling with Latin Hypercube Design (lhd). SHC estimates a local regression at each potential region, then searches toward the steepest descent direction. An approach based on direct search [40] forms a simplex in the parameter space from a number of samples, and iteratively updates the simplex through a number of well-defined operations, including reflection, expansion, and contraction, to guide the sample generation. Quick Optimization via Guessing (QOG) [26] speeds up the optimization process by exploiting heuristics to filter out sub-optimal configurations.

Machine learning. Standard machine-learning techniques, such as support-vector machines, decision trees, and evolutionary algorithms, have also been tried [20], [38]. These approaches trade simplicity and understandability of the learned models for predictive power. For example, some recent work [39] exploited a characteristic of the response surface of configurable software to learn Fourier-sparse functions from only a small sample size. Another approach [31] also exploited this fact, but iteratively constructs a regression model representing performance influences in an active learning process.

2) Positioning in the self-adaptive community: Performance reasoning is a key activity for decision making at runtime. Time series techniques [10] have been shown to be effective in predicting response time and uncovering performance anomalies ahead of time. FUSION [12], [11] exploited inter-feature relationships (e.g., feature dependencies) to reduce the dimensionality of the configuration space, making runtime performance reasoning feasible. Different classification models have been evaluated for the purpose of time series prediction in [3]. Performance predictions have also been applied for resource allocation at runtime [19], [21]. Note that the approaches above are referred to as black-box models. Another category of models, known as white-box, are built early in the life cycle by studying the underlying architecture of the system [15], [17] using Queueing Networks, Petri Nets, and Stochastic Process Algebras [4]. Performance prediction and reasoning have also been used extensively in other communities, such as component-based software engineering [5] and control theory [13].

3) Novelty: Previous work attempted to improve the prediction power of the model by exploiting information gained from the target system, either by improving the sampling process or by adopting a sophisticated learning method. Our approach is orthogonal to both sampling and learning: it enhances model accuracy by exploiting the knowledge we can gain from other relevant, and possibly cheaper, sources to accelerate the learning of an accurate performance model through transfer learning.

Transfer learning has been applied previously to regression and classification problems in machine learning [27] and software engineering (mainly for defect prediction [24]). However, to the best of our knowledge, our approach is the first attempt towards learning performance models using cost-aware transfer learning for configurable software.

III. COST-AWARE TRANSFER LEARNING

In this section, we explain our cost-aware transfer learning solution for learning an accurate performance model.

A. Solution overview

We take cheap observations from the source (e.g., the simulator), D_s = {(x_s, y_s) | y_s = g(x_s) + ε_s, x_s ∈ X}, and transfer these observations to learn a performance model for the real system using only a few observations from that system, D_t = {(x_t, y_t) | y_t = f(x_t) + ε_t, x_t ∈ X}. The cost of obtaining each observation on the target system is c_t, and we assume that c_s ≪ c_t and that the source response function, g(·), is related (correlated) to the target function; this relatedness contributes to the learning of the target function.


(Annotations: source observations, target observations, assumption, budget, total cost)
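The following minimal Python sketch shows how source and target observations might be assembled under a fixed experimental budget; the response functions g and f, the costs, and the sample counts are made-up assumptions for illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(7)

def g(x):                      # cheap source (e.g., simulator) response
    return 100 + 20 * np.sin(x)

def f(x):                      # expensive target (real system) response, correlated with g
    return 110 + 22 * np.sin(x) + rng.normal(0, 1.0, np.size(x))

c_s, c_t, budget = 1.0, 50.0, 500.0          # c_s << c_t, total experimental budget

# Keep a few expensive target samples, spend the remaining budget on cheap source samples.
n_t = 5
n_s = int((budget - n_t * c_t) / c_s)

X = np.linspace(0, 10, 1000)
X_s = rng.choice(X, size=n_s, replace=False)
X_t = rng.choice(X, size=n_t, replace=False)
D_s = list(zip(X_s, g(X_s)))                  # D_s = {(x_s, y_s)}
D_t = list(zip(X_t, f(X_t)))                  # D_t = {(x_t, y_t)}

print(len(D_s), "source samples,", len(D_t), "target samples,",
      "total cost:", n_s * c_s + n_t * c_t)
```

Spending most of the budget on cheap source samples leaves room for only a handful of expensive target measurements, which is exactly the regime the approach targets.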

Page 11: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Synthetic example

(a) Source and target samples
(b) Learning without transfer
(c) Learning with transfer
(d) Negative transfer

(Plots omitted; legend: source samples, target samples)

11

Page 12: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Assumption: Source and target are correlated


gain from the relationship between the simulation samples and the few samples from the real system to train a performance model that is more accurate and less costly than one relying only on observations from the real system.

B. Challenges

Although we can take relatively cheap samples from simulation, it is impractical and naive to exhaustively run simulation environments for all possible combinations of configuration options:

• The configuration space of just 20 parameters with binary options comprises 2^20 ≈ 1 million possible configurations. Note that real systems typically have many more configuration parameters. But even for this small configuration space, if we spend just one minute collecting each sample, it will take roughly 694 days (≈ 2 years) to perform exhaustive sampling (a quick back-of-the-envelope calculation is sketched after this list). Therefore, we can sample only a subset of this space, and we need to do so purposefully.

• Not all samples from a simulator reflect the real behavior and, as a result, not all would be useful for learning a performance model. More specifically, measurements from simulators typically contain noise, and for some configurations the data may not have any relationship with the data we would observe on the real system. If we learn a model based on such data, it becomes inaccurate and far from the real system behavior, and any reasoning based on the model predictions becomes misleading or ineffective. This is the case of negative transfer, and we will demonstrate specific cases where negative transfer happens.

• Often, there exist different sources of data (e.g., different simulators, different versions of a system) that we can learn from. However, too many training samples cause long training times. Since the learned model may be used in time-constrained environments (e.g., runtime decision making in a feedback loop for robots), it is important to select the sources purposefully.
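A back-of-the-envelope check of the exhaustive-sampling figure quoted in the first bullet (plain Python, just arithmetic):

```python
# Exhaustive sampling cost for 20 binary options at one measurement per minute.
configs = 2 ** 20                 # 1,048,576 configurations
days = configs * 1 / 60 / 24      # one minute per measurement
print(round(days))                # ~728 days; with the rounded "1 million" figure, ~694 days
```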

C. Problem formulation: Model learning

In order to introduce the concepts in our approach precisely and concisely, we define the model learning problem using mathematical notation. Let X_i indicate the i-th configuration parameter, which ranges in a finite domain Dom(X_i). In general, X_i may either indicate (i) an integer variable, such as the number of iterative refinements in a localization algorithm, or (ii) a categorical variable, such as sensor names or binary options (e.g., local vs. global localization method). Therefore, the configuration space is mathematically a Cartesian product of all of the domains of the parameters of interest, X = Dom(X_1) × ··· × Dom(X_d).

A configuration x resides in the design parameter space, x ∈ X. A black-box response function f : X → R is used to build a performance model given some observations of the system performance under different settings, D ⊆ X. In practice, though, such measurements may contain noise, i.e., y_i = f(x_i) + ε_i where ε_i ∼ N(0, σ_i). In other words, a performance model is simply a function (mapping) from the configuration space to a measurable performance metric that produces interval-scaled data (here we assume it produces real numbers). Note that we can learn a model for all measurable attributes (including accuracy and safety), but here we mostly train predictive models on performance attributes (e.g., response time, throughput, CPU utilization).
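As a toy illustration of the configuration space as a Cartesian product of parameter domains, the following Python sketch enumerates a small hypothetical space; the parameter names and domains are invented for illustration and are not taken from the paper.

```python
from itertools import product

# Hypothetical configuration parameters and their finite domains:
# one integer option, one categorical option, one binary option.
domains = {
    "num_refinements": [1, 2, 4, 8],          # integer parameter
    "localization":    ["local", "global"],   # categorical parameter
    "use_lidar":       [False, True],         # binary parameter
}

# The configuration space X is the Cartesian product of the domains.
space = [dict(zip(domains, values)) for values in product(*domains.values())]

print(len(space))   # 4 * 2 * 2 = 16 configurations
print(space[0])     # {'num_refinements': 1, 'localization': 'local', 'use_lidar': False}
```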

The goal here is to learn a reliable regression model, f̂(·), that can predict the performance of the system, f(·), given a limited number of measurements taken from the real system, so as to minimize the prediction error:

    arg min_{x ∈ D} pe = |f̂(x) − f(x)|    (1)

In order to solve the problem above, we assume f(·) is reasonably smooth, but otherwise little is known about the response function. Intuitively, we expect that for "near-by" input points x and x′, their corresponding output points y and y′ are "near-by" as well. This allows us to use black-box models that can predict unobserved values.

D. State-of-the-art

In the literature, this problem has been approached from two different standpoints: (i) sampling strategies and (ii) learning methods.

1) Model learning: Sampling. Random sampling has been used to collect unbiased observations in computer-based experiments. However, based on our experience, random sampling may require a large number of samples to build an accurate model, and since the overall tuning time is an important consideration in an auto-tuner, a long training time can reduce its effectiveness. More intelligent sampling approaches (such as Latin Hypercube, Box-Behnken, and Plackett-Burman) have been developed in the statistics and machine learning communities, in which experimental designs have been developed to ensure certain statistical properties [22]. The aim of these experimental designs is to ensure that we gain a high level of information from sparse sampling (partial designs) in high-dimensional spaces. Relevant to the software engineering community, several approaches have tried different experimental designs for highly configurable software [14], and some even consider cost as an explicit factor to determine optimal sampling [26].

Recently, researchers have tried novel ways of sampling with feedback embedded in the process, where new samples are derived based on the information gained from the previous set of samples. Recursive Random Sampling (RRS) [33] integrates a restarting mechanism into random sampling to achieve high search efficiency. Smart Hill Climbing (SHC) [32] integrates importance sampling with Latin Hypercube Design (lhd). SHC estimates a local regression at each potential region, then searches toward the steepest descent direction. An approach based on direct search [35] forms a simplex in the parameter space from a number of samples, and iteratively updates the simplex through a number of well-defined operations, including reflection, expansion, and contraction, to guide the sample generation. Quick Optimization via Guessing (QOG) [23] speeds up the optimization process by exploiting heuristics to filter out sub-optimal configurations. Some recent work [34] exploited a characteristic of the response surface of configurable software to learn Fourier-sparse functions from only a small sample size. Another approach also exploited this fact, but iteratively constructs a regression model representing performance influences in an active learning process.


a) Source may be a shift to the target
b) Source may be a noisy version of the target
c) Source may be different to the target in some extreme positions in the configuration space
d) Source may be irrelevant to the target -> negative transfer will happen!

12
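A toy NumPy sketch of these four source/target relationships (the curves and noise levels are invented for illustration, not the paper's data); the printed correlations indicate how informative each source would be about the target:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
target = 1e4 * (1 + 0.5 * np.sin(x))                        # hypothetical target response

source_shift      = target + 3e3                            # (a) shifted version of the target
source_noisy      = target + rng.normal(0, 1e3, x.size)     # (b) noisy version of the target
source_extreme    = target.copy()                           # (c) differs only at extreme configurations
source_extreme[x > 8] *= 0.3
source_irrelevant = 1e4 * rng.random(x.size)                 # (d) unrelated: risk of negative transfer

for name, src in [("shift", source_shift), ("noisy", source_noisy),
                  ("extreme", source_extreme), ("irrelevant", source_irrelevant)]:
    print(name, np.round(np.corrcoef(target, src)[0, 1], 2))
```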

Page 13: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Correlations

(a) WordCount v1   (b) WordCount v2   (c) WordCount v3   (d) WordCount v4
(Response surfaces: latency (ms) over number of splitters and number of counters)

(e) Pearson correlation coefficients   (f) Spearman correlation coefficients
(g) Measurement noise across WordCount versions (latency (ms); hardware change vs. software change)

Pearson correlation coefficients (upper triangle) and p-values (lower triangle):
      v1         v2      v3         v4
v1    1          0.41    -0.46      -0.50
v2    7.36E-06   1       -0.20      -0.18
v3    6.92E-07   0.04    1          0.94
v4    2.54E-08   0.07    1.16E-52   1

Spearman correlation coefficients (upper triangle) and p-values (lower triangle):
      v1         v2      v3         v4
v1    1          0.49    -0.51      -0.51
v2    5.50E-08   1       -0.2793    -0.24
v3    1.30E-08   0.003   1          0.88
v4    1.40E-08   0.01    8.30E-36   1

Measurement noise across WordCount versions:
ver.   µ         σ       µ/σ
v1     516.59    7.96    64.88
v2     584.94    2.58    226.32
v3     654.89    13.56   48.30
v4     1125.81   16.92   66.56

- Different correlations
- Different optimum configurations
- Different noise level

Page 14: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Traditional machine learning vs. transfer learning

(Diagram: in traditional machine learning, a separate learning algorithm is trained in each domain; in transfer learning, knowledge extracted from the source domain (cheap) is passed to the learning algorithm in the target domain (expensive).)

14

Page 15: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Traditional machine learning vs. transfer learning

(Same diagram as the previous slide.)

The goal of transfer learning is to improve learning in the target domain by leveraging knowledge from the source domain.

15

Page 16: Transfer Learning for Improving Model Predictions in Highly Configurable Software

GP for modeling black-box response function

(Figure legend: true function, GP mean, GP variance, observation, selected point, true minimum)

Table 1: Pearson (Spearman) correlation coefficients (upper triangle) and p-values (lower triangle).
      v1                    v2               v3                    v4
v1    1                     0.41 (0.49)      -0.46 (-0.51)         -0.50 (-0.51)
v2    7.36E-06 (5.5E-08)    1                -0.20 (-0.2793)       -0.18 (-0.24)
v3    6.92E-07 (1.3E-08)    0.04 (0.003)     1                     0.94 (0.88)
v4    2.54E-08 (1.4E-08)    0.07 (0.01)      1.16E-52 (8.3E-36)    1

Table 2: Signal-to-noise ratios for WordCount.
Top.      µ         σ       µ CI                σ CI             µ/σ
wc(v1)    516.59    7.96    [515.27, 517.90]    [7.13, 9.01]     64.88
wc(v2)    584.94    2.58    [584.51, 585.36]    [2.32, 2.92]     226.32
wc(v3)    654.89    13.56   [652.65, 657.13]    [12.15, 15.34]   48.30
wc(v4)    1125.81   16.92   [1123, 1128.6]      [15.16, 19.14]   66.56

Figure 2 (a,b,c,d) shows the response surfaces for 4 different versions of WordCount when splitters and counters are varied in [1, 6] and [1, 18]. WordCount v1, v2 (also v3, v4) are identical in terms of source code, but the environments in which they are deployed are different (we have deployed several other systems that compete for capacity in the same cluster). WordCount v1, v3 (also v2, v4) are deployed in a similar environment, but they have undergone multiple software changes (we artificially injected delays in the source code of their components). A number of interesting observations can be made from the experimental results in Figure 2 and Tables 1 and 2, which we describe in the following subsections.

Correlation across different versions. We have measured the correlation coefficients between the four versions of WordCount in Table 1 (the upper triangle shows the coefficients while the lower triangle shows the p-values). The correlations between the response functions are significant (p-values are less than 0.05). However, the correlation differs from version to version. Also, more interestingly, different versions of the system have different optimal configurations: x*_v1 = (5, 1), x*_v2 = (6, 2), x*_v3 = (2, 13), x*_v4 = (2, 16). In DevOps, different versions of a system are delivered continuously, on a daily basis [3]. Current DevOps practices do not systematically use the knowledge from previous versions for performance tuning of the current version under test, despite such significant correlations [3]. There are two reasons behind this: (i) the techniques that are used for performance tuning cannot exploit historical data belonging to a different version; (ii) they assume different versions have the same optimum configuration. However, based on our experimental observations above, this is not true. As a result, existing practice treats the experimental data as one-time-use.
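For reference, such Pearson and Spearman coefficients can be computed with SciPy; the latency vectors below are made up for illustration and are not the WordCount measurements.

```python
import numpy as np
from scipy import stats

# Hypothetical latency measurements (ms) for the same configurations under two versions.
latency_v1 = np.array([210.0, 180.5, 330.2, 150.9, 275.4, 198.7])
latency_v3 = np.array([3.1, 2.9, 3.6, 2.8, 3.4, 3.0]) * 1000

pearson_r, pearson_p = stats.pearsonr(latency_v1, latency_v3)
spearman_r, spearman_p = stats.spearmanr(latency_v1, latency_v3)

print(f"Pearson  r={pearson_r:.2f} (p={pearson_p:.3g})")
print(f"Spearman r={spearman_r:.2f} (p={spearman_p:.3g})")
```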

Nonlinear interactions. The response functions f(·) in Figure 2 are strongly non-linear, non-convex and multi-modal. The performance difference between the best and worst settings is substantial, e.g., 65% in v4, providing a case for optimal tuning. Moreover, the non-linear relations among the parameters imply that the optimal number of counters depends on the number of splitters, and vice versa. In other words, if one tries to minimize latency by acting on just one of these parameters, this may not lead to a global optimum [21].

Measurement uncertainty. We have taken samples of the latency for the same configuration (splitters = counters = 1) of the 4 versions of WordCount. The experiments were conducted on Amazon EC2 (m3.large: 2 CPUs, 7.5 GB). After filtering the initial burn-in, we computed the average and variance of the measurements. The results in Table 2 illustrate that the variability of measurements across different versions can be of different scales. In traditional techniques, such as design of experiments, the variability is typically disregarded by repeating experiments and taking the mean. Here, however, we pursue an alternative approach that relies on MTGP models, which are able to explicitly take variability into account.

3. TL4CO: TRANSFER LEARNING FOR CONFIGURATION OPTIMIZATION

3.1 Single-task GP Bayesian optimization

Bayesian optimization [34] is a sequential design strategy that allows us to perform global optimization of black-box functions. Figure 3 illustrates the GP-based Bayesian optimization approach using a 1-dimensional response. The curve in blue is the unknown true response, whereas the mean is shown in yellow and the 95% confidence interval at each point in the shaded red area. The stars indicate experimental measurements (or, interchangeably, observations). Some points x ∈ X have a large confidence interval due to a lack of observations in their neighborhood, while others have a narrow confidence interval. The main motivation behind the choice of Bayesian optimization here is that it offers a framework in which reasoning can be based not only on mean estimates but also on the variance, providing more informative decision making. The other reason is that all the computations in this framework are based on tractable linear algebra.

In our previous work [21], we proposed BO4CO, which exploits single-task GPs (no transfer learning) for prediction of the posterior distribution of response functions. A GP model is composed of its prior mean (µ(·) : X → R) and a covariance function (k(·, ·) : X × X → R) [41]:

    y = f(x) ∼ GP(µ(x), k(x, x′)),    (2)

where the covariance k(x, x′) defines the distance between x and x′. Let us assume S_{1:t} = {(x_{1:t}, y_{1:t}) | y_i := f(x_i)} is the collection of t experimental data points (observations). In this framework, we treat f(x) as a random variable, conditioned on the observations S_{1:t}, which is normally distributed with the following posterior mean and variance functions [41]:

    µ_t(x) = µ(x) + k(x)^T (K + σ^2 I)^{-1} (y − µ),    (3)
    σ_t^2(x) = k(x, x) + σ^2 I − k(x)^T (K + σ^2 I)^{-1} k(x),    (4)

where y := y_{1:t}, k(x)^T = [k(x, x_1) k(x, x_2) . . . k(x, x_t)], µ := µ(x_{1:t}), K := k(x_i, x_j), and I is the identity matrix. The shortcoming of BO4CO is that it cannot exploit observations regarding other versions of the system and therefore cannot be applied in DevOps.
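To make Eqs. (3)-(4) concrete, here is a minimal NumPy sketch of single-task GP posterior prediction; the squared-exponential kernel, the length scale, the noise level, and the toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0):
    """Squared-exponential covariance between the row vectors in A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior(X_train, y_train, X_test, noise_var=0.1, length_scale=1.0):
    """Posterior mean and variance of a single-task GP (cf. Eqs. 3-4), zero prior mean."""
    K = sq_exp_kernel(X_train, X_train, length_scale)             # K = k(x_i, x_j)
    K_s = sq_exp_kernel(X_test, X_train, length_scale)            # rows are k(x)^T
    K_inv = np.linalg.inv(K + noise_var * np.eye(len(X_train)))   # (K + sigma^2 I)^-1
    mean = K_s @ K_inv @ y_train                                  # Eq. (3) with mu(x) = 0
    var = (sq_exp_kernel(X_test, X_test, length_scale).diagonal()
           + noise_var
           - np.einsum('ij,jk,ik->i', K_s, K_inv, K_s))           # Eq. (4), scalar noise term
    return mean, var

# Toy usage: a few noisy observations of an unknown 1-D response function.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(8, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(0, 0.1, size=8)
X_test = np.linspace(0, 10, 5).reshape(-1, 1)
mu, var = gp_posterior(X_train, y_train, X_test)
print(np.round(mu, 2), np.round(var, 2))
```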

3.2 TL4CO: an extension to multi-tasks

TL4CO¹ uses MTGPs that exploit observations from previous versions of the system under test. Algorithm 1 defines the internal details of TL4CO. As Figure 4 shows, TL4CO is an iterative algorithm that uses the learning from other system versions. In a high-level overview, TL4CO: (i) selects the most informative past observations (details in Section 3.3); (ii) fits a model to existing data based on kernel learning (details in Section 3.4); and (iii) selects the next configuration based on the model (details in Section 3.5).

In the multi-task framework, we use historical data to fit a better GP, providing more accurate predictions. Before that, we measure a few sample points based on Latin Hypercube Design (lhd), D = {x_1, . . . , x_n} (cf. step 1 in Algorithm 1). We have chosen lhd because: (i) it ensures that the configuration samples in D are representative of the configuration space X, whereas traditional random sampling [26, 17] (called brute-force) does not guarantee this [29]; (ii) another advantage is that lhd samples can be taken one at a time, making it efficient in high-dimensional spaces. We define a new notation for

¹ Code+Data will be released (due in July 2016), as this is funded under the EU project DICE: https://github.com/dice-project/DICE-Configuration-BO4CO
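As an aside, an lhd sample over a (hypothetical) numeric configuration space can be drawn with SciPy's quasi-Monte Carlo module; this is only a sketch of the idea, not TL4CO's implementation, and the parameter ranges are invented.

```python
import numpy as np
from scipy.stats import qmc

# Latin Hypercube sample of n configurations over d numeric parameters.
d, n = 3, 10
sampler = qmc.LatinHypercube(d=d, seed=1)
unit_sample = sampler.random(n=n)                     # points in [0, 1)^d

# Scale to hypothetical parameter ranges (e.g., splitters, counters, buffer size).
lower, upper = [1, 1, 64], [6, 18, 1024]
configs = qmc.scale(unit_sample, lower, upper)
print(np.round(configs, 1))
```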


Table 1: Pearson (Spearman) correlation coe�cients.v1 v2 v3 v4

v1 1 0.41 (0.49) -0.46 (-0.51) -0.50 (-0.51)v2 7.36E-06 (5.5E-08) 1 -0.20 (-0.2793) -0.18 (-0.24)v3 6.92E-07 (1.3E-08) 0.04 (0.003) 1 0.94 (0.88)v4 2.54E-08 (1.4E-08) 0.07 (0.01) 1.16E-52 (8.3E-36) 1

Table 2: Signal to noise ratios for WordCount.

Top. µ � µci �ciµ�

wc(v1) 516.59 7.96 [515.27, 517.90] [7.13, 9.01] 64.88wc(v2) 584.94 2.58 [584.51, 585.36] [2.32, 2.92] 226.32wc(v3) 654.89 13.56 [652.65, 657.13] [12.15, 15.34] 48.30wc(v4) 1125.81 16.92 [1123, 1128.6] [15.16, 19.14] 66.56

Figure 2 (a,b,c,d) shows the response surfaces for 4 di↵er-ent versions of the WordCount when splitters and countersare varied in [1, 6] and [1, 18]. WordCount v1, v2 (also v3, v4)are identical in terms of source code, but the environmentin which they are deployed on is di↵erent (we have deployedseveral other systems that compete for capacity in the samecluster). WordCount v1, v3 (also v2, v4) are deployed on a sim-ilar environment, but they have undergone multiple softwarechanges (we artificially injected delays in the source codeof its components). A number of interesting observationscan be made from the experimental results in Figure 2 andTables 1, 2 that we describe in the following subsections.

Correlation across di↵erent versions. We have measuredthe correlation coe�cients between the four versions of Word-Count in Table 1 (upper triangle shows the coe�cients whilelower triangle shows the p-values). The correlations be-tween the response functions are significant (p-values areless than 0.05). However, the correlation di↵ers betweenversions to versions. Also, more interestingly, di↵erent ver-sions of the system have di↵erent optimal configurations:x⇤v1 = (5, 1), x⇤

v2 = (6, 2), x⇤v3 = (2, 13), x⇤

v4 = (2, 16). InDevOps, di↵erent versions of a system will be delivered con-tinuously on daily basis [3]. Current DevOps practices donot systematically use the knowledge from previous versionsfor performance tuning of the current version under test de-spite such significant correlations [3]. There are two reasonsbehind this: (i) the techniques that are used for performancetuning cannot exploit the historical data belong to a di↵erentversion. (ii) they assume di↵erent versions have the sameoptimum configuration. However, based on our experimentalobservations above, this is not true. As a result, the existingpractice treat the experimental data as one-time-use.

Nonlinear interactions. The response functions f(·) in Fig-ure 2 are strongly non-linear, non-convex and multi-modal.The performance di↵erence between the best and worst set-tings is substantial, e.g., 65% in v4, providing a case foroptimal tuning. Moreover, the non-linear relations amongthe parameters imply that the optimal number of countersdepends on splitters, and vice-versa. In other words, if onetries to minimize latency by acting just on one of theseparameters, this may not lead to a global optimum [21].Measurement uncertainty. We have taken samples of the

latency for the same configuration (splitters=counters=1) ofthe 4 versions of WordCount. The experiment were conductedon Amazon EC2 (m3.large (2 CPU, 7.5 GB)). After filteringthe initial burn-in, we have computed average and variance ofthe measurements. The results in Table 2 illustrate that thevariability of measurements across di↵erent versions can beof di↵erent scales. In traditional techniques, such as designof experiments, the variability is typically disregarded byrepeating experiments and obtaining the mean. However, wehere pursue an alternative approach that relies on MTGPmodels that are able to explicitly take into account variability.

3. TL4CO: TRANSFER LEARNING FOR CON-FIGURATION OPTIMIZATION

3.1 Single-task GP Bayesian optimizationBayesian optimization [34] is a sequential design strategy

that allows us to perform global optimization of black-boxfunctions. Figure 3 illustrates the GP-based Bayesian Op-timization approach using a 1-dimensional response. Thecurve in blue is the unknown true response, whereas themean is shown in yellow and the 95% confidence interval ateach point in the shaded red area. The stars indicate ex-perimental measurements (or observation interchangeably).Some points x 2 X have a large confidence interval due tolack of observations in their neighborhood, while others havea narrow confidence. The main motivation behind the choiceof Bayesian Optimization here is that it o↵ers a frameworkin which reasoning can be not only based on mean estimatesbut also the variance, providing more informative decisionmaking. The other reason is that all the computations inthis framework are based on tractable linear algebra.In our previous work [21], we proposed BO4CO that ex-

In our previous work [21], we proposed BO4CO, which exploits single-task GPs (no transfer learning) for prediction of the posterior distribution of response functions. A GP model is composed of its prior mean (µ(·) : X → R) and a covariance function (k(·, ·) : X × X → R) [41]:

y = f(x) ∼ GP(µ(x), k(x, x')),   (2)

where the covariance k(x, x') defines the distance between x and x'. Let us assume S_{1:t} = {(x_{1:t}, y_{1:t}) | y_i := f(x_i)} is the collection of t experimental data points (observations). In this framework, we treat f(x) as a random variable, conditioned on the observations S_{1:t}, which is normally distributed with the following posterior mean and variance functions [41]:

µ_t(x) = µ(x) + k(x)ᵀ(K + σ²I)⁻¹(y − µ),   (3)

σ²_t(x) = k(x, x) + σ²I − k(x)ᵀ(K + σ²I)⁻¹ k(x),   (4)

where y := y_{1:t}, k(x)ᵀ = [k(x, x_1)  k(x, x_2)  . . .  k(x, x_t)], µ := µ(x_{1:t}), K := [k(x_i, x_j)] and I is the identity matrix. The shortcoming of BO4CO is that it cannot exploit observations regarding other versions of the system and therefore cannot be applied in DevOps.
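To make the posterior computation concrete, the following is a small self-contained sketch of Eqs. (3)–(4) in Python, assuming a squared-exponential kernel and a zero prior mean; it only illustrates the linear-algebra steps and is not the BO4CO implementation.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel k(x, x') between two sets of configurations."""
    d = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-0.5 * d / length_scale**2)

def gp_posterior(X_obs, y_obs, X_new, noise=0.1):
    """Posterior mean and variance at X_new given observations (Eqs. 3-4), zero prior mean."""
    K = sq_exp_kernel(X_obs, X_obs)                       # K := k(x_i, x_j)
    k_star = sq_exp_kernel(X_obs, X_new)                  # k(x) for each new point
    A = np.linalg.solve(K + noise**2 * np.eye(len(X_obs)), np.eye(len(X_obs)))
    mu = k_star.T @ A @ y_obs                             # Eq. (3) with µ(x) = 0
    var = sq_exp_kernel(X_new, X_new).diagonal() + noise**2 \
          - np.einsum('ij,jk,ki->i', k_star.T, A, k_star)  # Eq. (4)
    return mu, var

# Usage: 5 observed configurations (2 parameters each) and 3 unobserved ones.
X = np.random.rand(5, 2); y = np.sin(X.sum(1))
mu, var = gp_posterior(X, y, np.random.rand(3, 2))
```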

3.2 TL4CO: an extension to multi-tasks

TL4CO¹ uses MTGPs that exploit observations from other previous versions of the system under test.

Algorithm 1 defines the internal details of TL4CO. As Figure 4 shows, TL4CO is an iterative algorithm that uses the learning from other system versions. At a high level, TL4CO: (i) selects the most informative past observations (details in Section 3.3); (ii) fits a model to the existing data based on kernel learning (details in Section 3.4); and (iii) selects the next configuration based on the model (details in Section 3.5).

In the multi-task framework, we use historical data to fit a better GP that provides more accurate predictions. Before that, we measure a few sample points based on a Latin Hypercube Design (lhd), D = {x1, . . . , xn} (cf. step 1 in Algorithm 1). We have chosen lhd because: (i) it ensures that the configuration samples in D are representative of the configuration space X, whereas traditional random sampling [26, 17] (called brute-force) does not guarantee this [29]; (ii) another advantage is that the lhd samples can be taken one at a time, making it efficient in high-dimensional spaces. We define a new notation for

¹ Code+Data will be released (due in July 2016), as this is funded under the EU project DICE: https://github.com/dice-project/DICE-Configuration-BO4CO
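As an illustration of step 1, a Latin hypercube design over a configuration space can be drawn with SciPy's qmc module; the six parameter ranges below are hypothetical placeholders, not the actual parameter ranges used in the paper.

```python
import numpy as np
from scipy.stats import qmc

# Hypothetical configuration space: 6 parameters, each with its own integer range.
lower = np.array([1, 1, 1, 1, 1, 1])
upper = np.array([6, 18, 10, 100, 8, 4])

# Draw an initial Latin hypercube design D = {x1, ..., xn} of n = 20 configurations.
sampler = qmc.LatinHypercube(d=len(lower), seed=0)
unit_samples = sampler.random(n=20)                        # points in [0, 1)^d
D = np.floor(qmc.scale(unit_samples, lower, upper + 1)).astype(int)
print(D[:3])                                               # first three sampled configurations
```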


Motivations:
1- mean estimates + variance
2- all computations are linear algebra
3- good estimations when few data

errors. Also, for building GP models, we do not need to know any internal details about the system; the learning process can be applied in a black-box fashion using the sampled performance measurements. In the GP framework, it is also possible to incorporate domain knowledge as a prior, if available, which can enhance the model accuracy [20].

In order to describe the technical details of our transfer learning methodology, let us briefly give an overview of GP model regression; a more detailed description can be found elsewhere [35]. GP models assume that the function f̂(x) can be interpreted as a probability distribution over functions:

y = f̂(x) ∼ GP(µ(x), k(x, x')),   (4)

where µ : X → R is the mean function and k : X × X → R is the covariance function (kernel function), which describes the relationship between response values, y, according to the distance of the input values x, x'. The mean and variance of the GP model predictions can be derived analytically [35]:

µ_t(x) = µ(x) + k(x)ᵀ(K + σ²I)⁻¹(y − µ),   (5)

σ²_t(x) = k(x, x) + σ²I − k(x)ᵀ(K + σ²I)⁻¹ k(x),   (6)

where k(x)ᵀ = [k(x, x_1)  k(x, x_2)  . . .  k(x, x_t)], I is the identity matrix and

K := | k(x_1, x_1)  . . .  k(x_1, x_t) |
     |      ...      . . .      ...     |
     | k(x_t, x_1)  . . .  k(x_t, x_t) |   (7)

GP models have been shown to be effective for performance predictions in data-scarce domains [20]. However, as we have demonstrated in Figure 2, they may become inaccurate when the samples do not cover the space uniformly. For highly configurable systems, we require a large number of observations to cover the space uniformly, making GP models ineffective in such situations.

C. Model prediction using transfer learning

In transfer learning, the key question is how to make accurate predictions for the target environment using observations from other sources, D_s. We need a measure of relatedness not only between input configurations but between the sources as well. The relationship between input configurations was captured in the GP models using the covariance matrix that was defined based on the kernel function in Eq. (7). More specifically, a kernel is a function that computes a dot product (a measure of "similarity") between two input configurations. So, the kernel helps to get accurate predictions for similar configurations. We now need to exploit the relationship between the source and target functions, g, f, using the current observations D_s, D_t to build the predictive model f̂. To capture the relationship, we define the following kernel function:

k(f, g, x, x') = k_t(f, g) × k_xx(x, x'),   (8)

where the kernel k_t represents the correlation between source and target function, while k_xx is the covariance function for inputs. Typically, k_xx is parameterized and its parameters are learnt by maximizing the marginal likelihood of the model given the observations from source and target, D = D_s ∪ D_t. Note that the process of maximizing the marginal likelihood is a standard method [35]. After learning the parameters of k_xx, we construct the covariance matrix exactly the same way as in Eq. (7) and derive the mean and variance of predictions using Eqs. (5), (6) with the new K. The main essence of transfer learning is, therefore, the kernel that captures the source and target relationship and provides more accurate predictions using the additional knowledge we can gain via the relationship between source and target.
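A minimal sketch of Eq. (8): the joint covariance over source and target observations is a task-correlation factor multiplied by an input kernel. The 2x2 task matrix B and the squared-exponential input kernel below are illustrative assumptions, not the kernels learned in the paper.

```python
import numpy as np

def input_kernel(a, b, length_scale=1.0):
    """k_xx: squared-exponential covariance between configurations."""
    d = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-0.5 * d / length_scale**2)

def multitask_kernel(X_a, task_a, X_b, task_b, B):
    """k(f, g, x, x') = k_t(f, g) * k_xx(x, x')   (Eq. 8).
    task_a/task_b are 0 for source, 1 for target; B is the 2x2 task-correlation matrix."""
    K_task = B[np.asarray(task_a)[:, None], np.asarray(task_b)[None, :]]
    return K_task * input_kernel(X_a, X_b)

# Usage: 9 cheap source samples, 3 expensive target samples, assumed task correlation 0.9.
X_s, X_t = np.random.rand(9, 2), np.random.rand(3, 2)
X = np.vstack([X_s, X_t]); tasks = [0] * 9 + [1] * 3
B = np.array([[1.0, 0.9], [0.9, 1.0]])
K = multitask_kernel(X, tasks, X, tasks, B)   # 12x12 joint covariance matrix
```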

D. Transfer learning in a self-adaptation loop

Now that we have described the idea of transfer learning for providing more accurate predictions, the question is whether such an idea can be applied at runtime and how self-adaptive systems can benefit from it. More specifically, we now describe the idea of model learning and transfer learning in the context of self-optimization, where the system adapts its configuration to meet performance requirements at runtime. The difference from traditional configurable systems is that we learn the performance model online, in a feedback loop, under time and resource constraints. Such performance reasoning is done more frequently for self-adaptation purposes.

An overview of a self-optimization solution is depicted in Figure 3, following the well-known MAPE-K framework [9], [23]. We consider the GP model as the K (knowledge) component of this framework, which acts as an interface that other components can query for the performance under specific configurations or update with a new observation. We use transfer learning to make the knowledge more accurate using observations that are taken from a simulator or any other cheap source. For deciding how many observations to transfer and from what source, we use the cost model that we introduced earlier. At runtime, the managed system is Monitored by pulling the end-to-end performance metrics (e.g., latency, throughput) from the corresponding sensors. Then, the retrieved performance data is Analysed and the mean performance associated with a specific setting of the system is stored in a data repository. Next, the GP model is updated, taking into account the new performance observation. Having updated the GP model, a new configuration may be Planned to replace the current configuration. Finally, the new configuration will be enacted by Executing appropriate platform-specific operations. This enables model-based knowledge evolution using machine learning [2], [21]. The underlying GP model can now be updated not only when a new observation is available but also by transferring the learning from other related sources. So at each adaptation cycle, we can update our belief about the correct response given data from the managed system and other related sources, accelerating the learning process.
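The loop described above can be sketched in heavily simplified Python. The GPKnowledge class, the latency sensor, and the planning rule are placeholders standing in for the actual GP model, monitoring infrastructure, and acquisition strategy; this is only meant to show how the Monitor, Analyse, Knowledge, and Plan steps connect in one cycle.

```python
import random

class GPKnowledge:
    """Stand-in for the GP model in the K (knowledge) role: it simply stores
    (configuration, mean latency) observations and averages them per configuration."""
    def __init__(self):
        self.obs = {}

    def update(self, config, value):
        self.obs.setdefault(config, []).append(value)

    def best_config(self, configs):
        def mean_or_inf(c):
            vals = self.obs.get(c)
            return sum(vals) / len(vals) if vals else float("inf")
        return min(configs, key=mean_or_inf)

def measure_latency(config):
    """Monitor: placeholder sensor returning noisy latency samples for a configuration."""
    return [random.gauss(100 + 5 * config, 3) for _ in range(10)]

# Knowledge seeded with cheap source (simulator) observations, i.e., the transferred data.
knowledge = GPKnowledge()
for c in range(5):
    knowledge.update(c, 100 + 5 * c)

config = 4
for _ in range(3):                                         # repeated MAPE-K cycles
    samples = measure_latency(config)                      # Monitor
    knowledge.update(config, sum(samples) / len(samples))  # Analyse + update Knowledge
    config = knowledge.best_config(range(5))               # Plan (Execute would reconfigure the system)
```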

IV. EXPERIMENTAL RESULTS

We evaluate the effectiveness and applicability of our transfer learning approach for learning models for highly configurable systems, in particular compared to conventional non-transfer learning. Specifically, we aim to answer the following three research questions:
RQ1: How much does transfer learning improve the prediction accuracy?


Page 17: Transfer Learning for Improving Model Predictions in Highly Configurable Software

The case where we learn from correlated responses

[Figure: synthetic example of learning from correlated responses. Axes: configuration domain vs. response value. Panels: (a) 3 sample response functions (1), (2), (3) with observations; (b) GP fit for (1) ignoring observations for (2),(3) — LCB not informative; (c) multi-task GP fit for (1) by transfer learning from (2),(3) — highly informative. Legend: GP prediction mean, GP prediction variance, probability distribution of the minimizers.]

Page 18: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Results

Page 19: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Evaluation

• Case study (robotics)
• Controlled experiments (highly configurable software)

19

Page 20: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Performance prediction for CoBot

[Figure: CPU usage [%] response surfaces over the configuration space. Panels: (a) source response function (cheap); (b) target response function (expensive); prediction without transfer learning; prediction with transfer learning.]

20

Page 21: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Performance prediction for CoBot

[Figure: CPU usage [%] response surfaces over the configuration space.
(a) Source response function (cheap)
(b) Target response function (expensive)
(c) Prediction without transfer learning
(d) Prediction with transfer learning]

21

Page 22: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Prediction accuracy and model reliability

[Figure: contour plots (a)-(d) over the number of samples taken from the source and from the target.]

(a) Prediction error
• Trade-off among using more source or target samples
(b) Model reliability
• Transfer learning contributes to lowering the prediction uncertainty

22

Page 23: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Prediction accuracy and model reliability

[Figure: contour plots (a)-(d) over the number of samples taken from the source and from the target.]

(a) Prediction error
• Trade-off among using more source or target samples
(b) Model reliability
• Transfer learning contributes to lowering the prediction uncertainty
(c) Training overhead
(d) Evaluation overhead
• Appropriate for runtime usage

23

Page 24: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Level of correlation between source and target is important

[Figure 6: box plots of Absolute Percentage Error [%] for models learned from sources of decreasing relatedness to the target.]

Source        s      s1     s2     s3     s4     s5     s6
noise-level   0      5      10     15     20     25     30
corr. coeff.  0.98   0.95   0.89   0.75   0.54   0.34   0.19
µ(pe)         15.34  14.14  17.09  18.71  33.06  40.93  46.75

Fig. 6: Prediction accuracy of the model learned with samples from different sources of different relatedness to the target. GP is the model without transfer learning.

the incoming stream, and it is essentially a CPU-intensive application. RollingSort is a memory-intensive system that performs rolling counts of incoming messages for identifying trending topics. SOL is a network-intensive system, where the incoming messages are routed through a multi-layer network. These are standard benchmarks that are widely used in the community, e.g., in research papers [14] as well as industry benchmarks [18]. For more details about the internal architecture of these systems, we refer to the appendix [1].

The notion of source and target set depends on the subject system. In CoBot, we again simulate the same navigation mission in the default environment (source) and in a more difficult, noisy environment (target). For the three stream processing applications, source and target represent different workloads, such that we transfer measurements from one workload for learning a model for another workload. More specifically, we control the workload using the maximum number of messages which we allow to enter the stream processing architecture. For the NoSQL application, we analyze two different transfers: First, we use as source a query on a database with 10 million records and as target the same query on a database with 20 million records, representing a more expensive environment to sample from. Second, we use as source a query on 20 million records on one cluster and as target a query on the same dataset run on a different cluster, representing hardware changes. Overall, our subjects cover different kinds of applications and different kinds of transfer scenarios (changes in the environment, changes in the workload, changes in the dataset, and changes in the hardware).

Experimental setup: As independent variables, we systematically vary the size of the learning sets from both the source and target environment in each subject system. We sample between 0 and 100% of all configurations in the source environment and between 1 and 10% of all configurations in the target environment.

As dependent variables, we measure learning time and

TABLE I: Overview of our experimental datasets. The "Size" column indicates the number of measurements in the datasets and "Testbed" refers to the infrastructure where the measurements are taken; their details are in the appendix.

   Dataset      Parameters                                                       Size   Testbed
1  CoBot (4D)   1-odom miscalibration, 2-odom noise, 3-num particles,            56585  C9
                4-num refinement
2  wc (6D)      1-spouts, 2-max spout, 3-spout wait, 4-splitters,                2880   C1
                5-counters, 6-netty min wait
3  sol (6D)     1-spouts, 2-max spout, 3-top level, 4-netty min wait,            2866   C2
                5-message size, 6-bolts
4  rs (6D)      1-spouts, 2-max spout, 3-sorters, 4-emit freq,                   3840   C3
                5-chunk size, 6-message size
5  cass-10      1-trickle fsync, 2-auto snapshot, 3-con. reads, 4-con. writes,   1024   C6x, C6y
6  cass-20      5-file cache size in mb, 6-con. compactors

prediction accuracy of the learned model. For each subject system, we measure a large number of random configurations as the evaluation set, independently from the configurations sampled for learning, and compare the predictions of the learned model f̂ to the actual measurements of the configurations in the evaluation set D_o. We compute the absolute percentage error (APE) for each configuration x in the evaluation set, |f̂(x) − f(x)| / f(x) × 100, and report the average to characterize the accuracy of the prediction model. Ideally, we would use the whole configuration space as the evaluation set (D_o = X), but the measurement effort would be prohibitively high for most real-world systems [20], [31]; hence we use large random samples (cf. the size column in Table I).
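A small sketch of the APE computation described above; the `predicted` and `measured` values below are placeholder numbers standing in for the model predictions and the actual measurements of an evaluation set.

```python
import numpy as np

def mean_ape(predicted, measured):
    """Mean absolute percentage error: average of |f_hat(x) - f(x)| / f(x) * 100."""
    predicted, measured = np.asarray(predicted, float), np.asarray(measured, float)
    return np.mean(np.abs(predicted - measured) / measured) * 100

# Usage with placeholder latency values (ms) for an evaluation set of 5 configurations.
print(mean_ape([105, 98, 250, 310, 120], [100, 100, 240, 300, 130]))
```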

The measured and predicted metric depends on the subject system: For the CoBot system, we measure average CPU usage during the same mission of navigating along a corridor as in our case study; we use the average of three simulation runs for each configuration. For the stream processing and NoSQL experiments, we measure average response time (latency) over a window of 8 and 10 minutes, respectively. Also, after each sample collection, the experimental testbed was cleaned, which required several minutes for the Storm measurements and around 1 hour (for offloading and cleaning the database) for the Cassandra measurements. We sample the given number of configurations in source and target randomly and report average results and standard deviations of accuracy and learning time over 3 repetitions.

Results: We show the results of our experiments in Figure 7. The 2D plot shows average errors across all subject systems. The results in which the set of source samples D_s is empty represent the baseline case without transfer learning. In Figure 8, we additionally show a specific slice through our accuracy results, in which we only vary the number of samples from the source (and only for 4 subject systems, to produce a reasonably clear plot), but keep the number of samples from the target at a constant 1%. Although the results differ significantly among subject systems (not surprising, given the different relatedness of source and target), the overall trends are consistent.

First, our results show that transfer learning can achieve high prediction accuracy with only a few samples from the target

• Model becomes more accurate when the source is more related to the target
• Even learning from a source with a small correlation is better than no transfer

24

Page 25: Transfer Learning for Improving Model Predictions in Highly Configurable Software

25

[Figure: CPU usage [%] response surfaces (a)-(f) over number of particles (x-axis) and number of refinements (y-axis).]

Page 26: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Prediction accuracy
• The model provides more accurate predictions as we exploit more data
• Transfer learning may (i) boost initial performance, (ii) increase the learning speed, (iii) lead to a more accurate final performance

[Figure: given Data and Source-Task Knowledge, Learn the Target Task.]

Fig. 1. Transfer learning is machine learning with an additional source of information apart from the standard training data: knowledge from one or more related tasks.

The goal of transfer learning is to improve learning in the target task by leveraging knowledge from the source task. There are three common measures by which transfer might improve learning. First is the initial performance achievable in the target task using only the transferred knowledge, before any further learning is done, compared to the initial performance of an ignorant agent. Second is the amount of time it takes to fully learn the target task given the transferred knowledge compared to the amount of time to learn it from scratch. Third is the final performance level achievable in the target task compared to the final level without transfer. Figure 2 illustrates these three measures.

If a transfer method actually decreases performance, then negative transfer has occurred. One of the major challenges in developing transfer methods is to produce positive transfer between appropriately related tasks while avoiding negative transfer between tasks that are less related. A section of this chapter discusses approaches for avoiding negative transfer.

When an agent applies knowledge from one task in another, it is often necessary to map the characteristics of one task onto those of the other to specify correspondences. In much of the work on transfer learning, a human provides this mapping, but some methods provide ways to perform the mapping automatically. Another section of the chapter discusses work in this area.

[Figure: performance vs. training, with transfer and without transfer; higher start, higher slope, higher asymptote.]

Fig. 2. Three ways in which transfer might improve learning.


[Lisa Torrey and Jude Shavlik]

26

Page 27: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Experimental evaluations

• RQ1: How much does transfer learning improve the prediction accuracy?
• RQ2: What are the trade-offs between learning with different numbers of samples from source and target in terms of prediction accuracy?
• RQ3: Is our transfer learning approach applicable in the context of self-adaptive systems in terms of model training and evaluation time?

27

Page 28: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Subject systems

• Robotic software (CoBot)
• Stream processing systems (WordCount, RollingSort, SOL) on Apache Storm
• NoSQL benchmark on Apache Cassandra

28

Page 29: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Stream Processing Systems

[Figure: Storm architecture, logical and physical views (spouts, bolts, workers, queues, sockets); Word Count topology (Kafka Spout -> Splitter Bolt -> Counter Bolt); Rolling Sort topology (Kafka Spout -> RollingCount Bolt -> Intermediate Ranking Bolt -> Ranking Bolt, producing trending topics).]

Storm Architecture

Word Count Architecture
• CPU intensive
Rolling Sort Architecture
• Memory intensive
Applications:
• Fraud detection
• Trending topics

Page 30: Transfer Learning for Improving Model Predictions in Highly Configurable Software

30

[Figure: prediction error contours (a)-(f), one panel per subject system, over the number of samples taken from the source and from the target.]

[Figure: overview of the cost-aware transfer learning methodology: configuration parameters and samples feed 1. Model Learning, which produces 2. Predictive Model; 3. Cost Model and 4. Transfer Learning yield an Improved Predictive Model. Inset: average write latency (µs) vs. throughput (ops/sec) for TL4CO and BO4CO.]

Fig. 1: Overview of cost-aware transfer learning methodology.

{(x_s, y_s) | y_s = g(x_s) + ε_s, x_s ∈ X}, and transfer these observations to learn a performance model for the real system using only a few observations from that system, D_t = {(x_t, y_t) | y_t = f(x_t) + ε_t, x_t ∈ X}. The cost of obtaining each observation on the target system is c_t, and we assume that c_s ≪ c_t and that the source response function, g(·), is related (correlated) to the target function, and this relatedness contributes to the learning of the target function, using only sparse samples from the real system, i.e., |D_t| ≪ |D_s|. Intuitively, rather than starting from scratch, we transfer the learned knowledge from the data we have collected using cheap methods such as simulators or previous system versions, to learn a more accurate model about the target configurable system and use this model to reason about its performance at runtime.

In Figure 2, we demonstrate the idea of transfer learning with a synthetic example. Only three observations are taken from a synthetic target function, f(·), and we try to learn a model, f̂(x), which provides predictions for unobserved response values at some input locations. Figure 2(b) depicts the predictions that are provided by the model trained using only the 3 samples, without considering transfer learning from any source. The predictions are accurate around the three observations, while highly inaccurate elsewhere. Figure 2(c) depicts the model trained over the 3 observations and 9 other observations from a different but related source. The predictions, thanks to transfer learning, here closely follow the true function with a narrow confidence interval (i.e., high confidence in the predictions provided by the model). Interestingly, the confidence interval around the points x = 2, 4 in the model with transfer learning has disappeared, demonstrating more confident predictions with transfer learning.

Given a target environment, the effectiveness of any transfer learning depends on the source environment and how it is related to the target [33]. If the relationship is strong and the transfer learning method can take advantage of it, the performance of the target predictions can significantly improve via transfer. However, if the source response is not sufficiently related or if the transfer method does not exploit the relationship, the performance may not improve or may even decrease. We show this case via the synthetic example in Figure 2(d), where the prediction model uses 9 samples from a misleading function that is unrelated to the target function (the response remains constant as we increase x); as a consequence, the predictions are very inaccurate and do not follow the growing trend of the true function.

[Figure 2: panels (a)-(d), showing source samples and target samples over the configuration domain.]

Fig. 2: (a) 9 samples have been taken from a source (respectively 3 from a target); (b) [without transfer learning] a regression model is constructed using only the target samples; (c) [with transfer learning] another regression model is constructed using the target and the source samples; the new model is based on only 3 target measurements while providing accurate predictions; (d) [negative transfer] a regression model is constructed with transfer from a misleading response function and therefore the model prediction is inaccurate.

For the choice of the number of samples from the source and target, we use a simple cost model (cf. Figure 1) that is based on the number of source and target samples as follows:

C(D_s, D_t) = c_s · |D_s| + c_t · |D_t| + C_Tr(|D_s|, |D_t|),   (2)

C(D_s, D_t) ≤ C_max,   (3)

where C_max is the experimental budget and C_Tr(|D_s|, |D_t|) is the cost of model training, which is a function of the source and target sample sizes. Note that this cost model turns the problem we have formulated in Eq. (1) into a multi-objective problem, where not only accuracy is considered, but cost can also be considered in the transfer learning process.
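A hedged sketch of this cost model: the per-sample costs, the training-cost term, and the budget below are made-up placeholders used only to illustrate Eqs. (2)–(3), not the costs used in the paper.

```python
def sampling_cost(n_source, n_target, c_source=1.0, c_target=50.0,
                  training_cost=lambda ns, nt: 0.01 * (ns + nt) ** 2):
    """C(Ds, Dt) = c_s*|Ds| + c_t*|Dt| + C_Tr(|Ds|, |Dt|)   (Eq. 2)."""
    return c_source * n_source + c_target * n_target + training_cost(n_source, n_target)

def within_budget(n_source, n_target, budget=1200.0):
    """Budget constraint C(Ds, Dt) <= C_max   (Eq. 3)."""
    return sampling_cost(n_source, n_target) <= budget

# Enumerate sample-size combinations that respect a hypothetical budget of 1200.
feasible = [(ns, nt) for ns in range(0, 101, 20) for nt in range(0, 11, 2)
            if within_budget(ns, nt)]
print(feasible[:5])
```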

B. Model learning: Technical details

We use Gaussian Process (GP) models to learn a reliable performance model f̂(·) that can predict unobserved response values. The main motivation to choose GP here is that it offers a framework in which performance reasoning can be done using mean estimates as well as a confidence interval for each estimation. In other words, where the model is confident in its estimation, it provides a small confidence interval; on the other hand, where it is uncertain about its estimations, it gives a large confidence interval, meaning that the estimations should be used with caution. The other reason is that all the computations are based on linear algebra, which is cheap to compute. This is especially useful in the domain of self-adaptive systems, where automated performance reasoning in the feedback loop is typically time-constrained and should be robust against any sort of uncertainty, including prediction

(Panels: CoBot, WC, SOL, Cass (hdw), RS, Cass (db size))

Page 31: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Trade-off between measurement cost and accuracy

[Figure: accuracy vs. measurement cost ($), with accuracy threshold, budget threshold, and the sweet spot between them.]

31

Page 32: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Model training and evaluation overhead

[Figure: (a) model training time and (b) model evaluation time (log scale) vs. percentage of training data from source, for WC, RS, SOL, cass.]

32

Page 33: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Future work, insights and ideas

Page 34: Transfer Learning for Improving Model Predictions in Highly Configurable Software

What if multiple related sources?

[Figure: (a) a target with multiple related sources (source 1, source 2, source 3) with correlations corr_1, corr_2, corr_3; (b) selecting among the sources for the target.]

34

Page 35: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Example of multiple sources

[Figure: transfer paths (1)-(5) between Source Simulator, Target Simulator, Source Robot, and Target Robot.]

35

Page 36: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Current priority

We have looked into transfer learning from simulator to real robot, but now we consider the following scenarios:
• Workload change (new tasks or missions, new environmental conditions)
• Infrastructure change (new Intel NUC, new camera, new sensors)
• Code change (new versions of ROS, new localization algorithm)

36

Page 37: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Learned function with bad samples -> overfitting

Goal: find the best sample points iteratively by gaining knowledge from the source and target domains

Learned function with good samples

37

Active learning + transfer learning

Page 38: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Future work: Integrating transfer learning with MAPE-K

38

[Figure: MAPE-K loop of the Autonomic Manager for a Highly Configurable System in its Environment. Runtime: R.1 Performance Monitoring (performance repository), R.2 Data Analysis, R.3 Model Update (with R.3.1 Transfer Learning into the GP model as Knowledge), R.4 Select the most promising configuration to test, R.5 Update System Configuration on the real system. Design-time: D.1 Change config., D.2 Perform sim., D.3 Extract data from the Simulator into the simulation data repository.]

Page 39: Transfer Learning for Improving Model Predictions in Highly Configurable Software

Conclusion: Our cost-aware transfer learning

• Improves the model accuracy up to several orders of magnitude
• Is able to trade off between different numbers of samples from source and target, enabling a cost-aware model
• Imposes an acceptable model building and evaluation cost, making it appropriate for application in the robotics domain

39