    A Comparison of two Mixed-Integer Linear Programs for Piecewise Linear Function Fitting

    John Alasdair Warwicker¹ and Steffen Rebennack¹

    ¹ Institute of Operations Research (IOR), Karlsruhe Institute of Technology, 76185 Karlsruhe, Baden-Württemberg, Germany

    [email protected], [email protected]

    Abstract

    The problem of fitting continuous piecewise linear (PWL) functions to discrete data has applications in pattern recognition and engineering, amongst many others. To find an optimal PWL function, the positions of the breakpoints connecting adjacent linear segments must not be constrained, so that the breakpoints can be placed freely. While the PWL fitting problem has often been approached from a global optimisation perspective, recently two mixed-integer linear programming (MILP) approaches have been presented which solve for optimal PWL functions. In this paper, we compare the two approaches: the first was presented by [Rebennack and Krasko, IJOC, 2020], the second by [Kong and Maravelias, IJOC, 2020]. Both formulations are similar in that they use binary variables and logical implications modelled by big-M constructs to ensure the continuity of the PWL function, yet the former model uses fewer binary variables. We present experimental results comparing the time taken to find optimal PWL functions with differing numbers of breakpoints across five data sets for three different objective functions. While neither of the two formulations is superior on all data sets, the presented computational results suggest that the formulation presented by Rebennack and Krasko is faster. This might be explained by the fact that it contains fewer complicating binary variables and constraints.

    1 Introduction

    Fitting univariate discrete data points with a continuous function allows new data points to be estimated through interpolation or extrapolation. However, the non-linearity of a continuous function often makes its equation difficult to compute. Linear regression is used to model data points with a single linear function. This is advantageous when there is an apparent linear relation between the data points. In cases where there is no apparent relation, a piecewise linear (PWL) function can be used to model the data points. By approximating data points with a (piecewise) linear function, complicating mixed-integer and non-convex programming problems can be (approximately) solved quickly using standard programming techniques (Geißler et al. 2012, Feijoo and Meyer 1988, Rebennack 2016, Rebennack and Kallrath 2015).

    A PWL function comprises connected linear functions (known as linear segments), each defined over a given range. The intersection points of the linear segments are known as breakpoints. The problem of fitting PWL functions to data is also known as linear spline regression in Statistics, and polyhedral function fitting in Mathematics. As well as numerous applications in these fields, PWL function fitting has also been used to solve complex problems in engineering (Gunnerud and Foss 2010), biomedical studies (Berman et al. 1996), pattern recognition (Chang 1973), and healthcare (Wagner et al. 2002).

    There are many different ways to approach the PWL function fitting problem. A recent trend is approaching the problem of PWL function fitting from a global optimisation perspective. Various methods, ranging from dynamic programming (Bellman and Roth 1969) and numerical approaches (Jupp 1978), to heuristic approaches (Ertel and Fowlkes 1976) and the R package segmented (Muggeo 2003), have been implemented. However, these approaches often have difficulty in ensuring optimal breakpoint location, or ensuring the continuity of the linear segments (Chen and Wang 2009).


    Many mixed-integer linear programming (MILP) approaches for PWL fitting have also been presented in the literature. However, most MILP approaches disregard the continuity requirement of the affine functions in order to simplify the optimisation problem. Amongst these is the approach of Bertsimas and Shioda (2007), which used MILP models for classification and fit discontinuous PWL functions to each class. Further approaches by Bertsimas and Mazumder (2014), Bertsimas and King (2016) and Bertsimas et al. (2016) similarly dismiss the continuity problem. Amongst those approaches which include the continuity requirement are the non-convex, dynamic programming approach by Goldberg et al. (2014), and the non-convex, quadratically constrained approach by Toriello and Vielma (2012). Toriello and Vielma (2012) also introduced a mixed-integer linear model to fit convex PWL functions to discrete data; the convexity assumption simplifies the problem significantly. The first exact MINLP approach for continuous univariate function fitting was presented by Rebennack and Kallrath (2015) and this approach was extended to fit area-minimizing tubes by Kallrath and Rebennack (2014).

    Recently, two MILP approaches to optimally solve the PWL function fitting problem have been presented. The approach by Rebennack and Krasko (2020) was shortly followed by the approach by Kong and Maravelias (2020). Both MILP formulations are based on the same idea: the non-linear constraint required to model the continuity of the constructed PWL function is avoided by the indirect modelling of the breakpoint location. Given the information on the two data points enclosing the breakpoint location, continuity of the PWL function is ensured by restricting the intercepts and gradients of the two adjacent linear segments. The two proposed MILP approaches share many further similarities, including the use of big-M constructs to model logical implications, and using binary variables to assign data points to linear segments. However, there are differences in the two approaches which we analyse in detail in this article.

    The contribution of this paper is to present a detailed comparison of the MILP models of Rebennack and Krasko (2020) and Kong and Maravelias (2020). As well as comparing the two formulations from a theoretical perspective, we present a series of comparative results for five data sets often used to assess the effectiveness of PWL function fitting models.

    The rest of the paper is structured as follows. In Section 2, we present the two MILP formulations. In Section 3, we compare the two approaches from a theoretical perspective. We present an experimental comparison in Section 4, and we present conclusions in Section 5.

    2 Mixed Integer Linear Programming Formulations

    We begin this section by defining the important concepts related to piecewise linear univariate function fitting.

    Definition 1 (Rebennack and Krasko (2020)) A continuous univariate function $p(x) : [\underline{X}, \overline{X}] \to \mathbb{R}$ with compact interval $[\underline{X}, \overline{X}]$ is called a continuous piecewise linear (PWL) function, if there exists a finite number $B$ with

    $$\underline{X} = r_1 < \dots < r_b < r_{b+1} < \dots < r_B = \overline{X},$$

    such that $p(x)$ is an affine function on $[r_b, r_{b+1}]$ for all $b = 1, \dots, B-1$. The $r_b$ are called breakpoints, with $B$ the number of breakpoints. For each $b = 1, \dots, B-1$, the function $p(x) : [r_b, r_{b+1}] \to \mathbb{R}$ is called a linear segment.

    For the problem of PWL function fitting of bivariate data, a set of $I$ ordered tuples $(X_i, Y_i) \in \mathbb{R}^2$, $i = 1, \dots, I$, is given where

    $$-\infty < \underline{X} = X_1 < \dots < X_i < X_{i+1} < \dots < X_I = \overline{X}$$

    solving a MILP whereby the breakpoints are not explicitly calculated, while the equations of the linear segments are, will give information leading to the breakpoint locations. Neither MILP formulation explicitly provides the breakpoint locations (and hence the range for each of the $B-1$ linear segments), yet they are implicitly given by the intersections of consecutive linear segments.

    Suppose the optimal solution of the MILP gives a series of linear segments of the form $y = c_b^* x + d_b^*$, for $b = 1, \dots, B-1$, where $B$ is the number of breakpoints¹. We assume w.l.o.g. that there is at least one data point associated to each linear segment. The location of the breakpoints $r_b^*$ ($b = 1, \dots, B$) can be calculated as such:

    $$r_1^* = X_1,$$

    $$r_b^* = \frac{d_{b+1}^* - d_b^*}{c_b^* - c_{b+1}^*} \quad \forall b = 2, \dots, B-1 \ \text{ if } c_b^* \neq c_{b+1}^*,$$

    $$r_B^* = X_I.$$

    Hence, the PWL function comprises $B-1$ linear segments, where linear segment $b \in \{1, \dots, B-1\}$ has equation $p_b(x) := c_b^* x + d_b^*$ on the range $x \in [r_b, r_{b+1}]$. Note that $c_b - c_{b+1} = 0$ may be allowed by the formulations. In this instance, $d_{b+1} - d_b = 0$ is also enforced, so the two linear segments form a single affine function and there is no discontinuity in the PWL function on $[X_i, X_{i+1}]$. The breakpoint is placed arbitrarily between the two data points.
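The recovery of breakpoints from fitted segments can be sketched in a few lines of Python. This is an illustration only, not code from either paper: the function name `recover_breakpoints`, the 0-based lists of slopes and intercepts, and the use of `None` for a breakpoint between two segments with equal slopes (where the location is arbitrary) are assumptions made here. Interior breakpoints are taken as the intersections of consecutive segments.

```python
from typing import List, Optional


def recover_breakpoints(
    slopes: List[float],
    intercepts: List[float],
    x_first: float,
    x_last: float,
    tol: float = 1e-12,
) -> List[Optional[float]]:
    """Return the B breakpoints of a PWL function made of B-1 segments.

    Segment k (0-based) is y = slopes[k] * x + intercepts[k].
    The first and last breakpoints are the first and last data abscissae.
    """
    if not slopes or len(slopes) != len(intercepts):
        raise ValueError("need one slope and one intercept per segment")

    breakpoints: List[Optional[float]] = [x_first]
    for k in range(len(slopes) - 1):
        slope_gap = slopes[k] - slopes[k + 1]
        if abs(slope_gap) <= tol:
            # Equal slopes: the two segments form one affine function,
            # so the breakpoint location is arbitrary (hypothetical marker).
            breakpoints.append(None)
        else:
            # Intersection of segments k and k+1.
            breakpoints.append((intercepts[k + 1] - intercepts[k]) / slope_gap)
    breakpoints.append(x_last)
    return breakpoints


# Two segments y = x and y = -x + 4 meet at x = 2.
print(recover_breakpoints([1.0, -1.0], [0.0, 4.0], 0.0, 4.0))  # [0.0, 2.0, 4.0]
```

A caller that needs a concrete location for a `None` entry could, for example, pick the midpoint of the two enclosing data points; that choice is not prescribed by either formulation.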

    Distance Metrics

    In the next subsections, we present a comparison of two MILP formulations for optimal PWL function fitting. The first MILP was presented by Rebennack and Krasko (2020), while the second was presented by Kong and Maravelias (2020). Both formulations use the absolute vertical distance between each data point and the corresponding linear segment to calculate the optimal objective value, which can be calculated for three different distance metrics. The first metric, Maximum Absolute Difference, calculates the maximum absolute difference between the data points and the PWL function. The other two metrics, Sum of Absolute Differences and Sum of Squared Differences, calculate the sum of the absolute differences and the sum of the squares of the absolute differences, respectively. Using the sum of squared differences metric causes the two formulations to become mixed-integer quadratic programs.
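As a small illustration of the three metrics, the sketch below evaluates them for a given set of observed values and fitted PWL values. The function name `fit_errors` and the dictionary keys are hypothetical names chosen for this example; the fitted values are assumed to have been computed already from the linear segment associated with each data point.

```python
from typing import Dict, Sequence


def fit_errors(observed: Sequence[float], fitted: Sequence[float]) -> Dict[str, float]:
    """Compute the three distance metrics from absolute vertical distances."""
    if len(observed) != len(fitted) or not observed:
        raise ValueError("observed and fitted must be non-empty and equally long")

    # Absolute vertical distance between each data point and the PWL function.
    residuals = [abs(y - f) for y, f in zip(observed, fitted)]
    return {
        "max_abs": max(residuals),                  # Maximum Absolute Difference
        "sum_abs": sum(residuals),                  # Sum of Absolute Differences
        "sum_sq": sum(r * r for r in residuals),    # Sum of Squared Differences
    }


print(fit_errors([1.0, 2.0, 4.0], [1.0, 3.0, 2.0]))
# {'max_abs': 2.0, 'sum_abs': 3.0, 'sum_sq': 5.0}
```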

    2.2 The First Mixed Integer Linear Program

    We first present the formulation of Rebennack and Krasko (2020). This formulation contains the following variables.

    Continuous Variables

    • $c_b$, the slope of segment $b$,

    • $d_b$, the intercept of segment $b$,

    • $\xi_i$, absolute error at point $i$,

    • $\delta^{+/-}_{i,b}$, continuous variables taking values in $[0, 1]$; either $\delta^+_{i,b}$ or $\delta^-_{i,b}$ is set to 1 if a breakpoint exists between $X_i$ and $X_{i+1}$.

    Binary Variables

    • $\delta_{i,b}$, set to 1 if data point $X_i$ is associated with segment $b$,

    • $\gamma_b$, set to 0 or 1 depending on the change in gradient between adjacent linear segments.

    ¹ For this discussion, we use the variables and notation as presented by Rebennack and Krasko (2020), but we note that Kong and Maravelias (2020) use different labels to represent similar variables.

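    This formulation also contains the following big-M constants: $M_i^a$ and $M_i^2$ for $i = 1, \dots, I$.

    $$
    \begin{aligned}
    \min\ & \sum_{i=1}^{I} \xi_i^q && && \text{(1a)} \\
    \text{s.t.}\ & Y_i - (c_b X_i + d_b) \le \xi_i + M_i^a (1 - \delta_{i,b}) && \forall i = 1, \dots, I;\ \forall b = 1, \dots, B-1 && \text{(1b)} \\
    & (c_b X_i + d_b) - Y_i \le \xi_i + M_i^a (1 - \delta_{i,b}) && \forall i = 1, \dots, I;\ \forall b = 1, \dots, B-1 && \text{(1c)} \\
    & \sum_{b=1}^{B-1} \delta_{i,b} = 1 && \forall i = 1, \dots, I && \text{(1d)} \\
    & \delta_{i+1,b+1} \le \delta_{i,b} + \delta_{i,b+1} && \forall i = 1, \dots, I-1;\ b = 1, \dots, B-2 && \text{(1e)} \\
    & \delta_{i+1,1} \le \delta_{i,1} && \forall i = 1, \dots, I-1 && \text{(1f)} \\
    & \delta_{i,B-1} \le \delta_{i+1,B-1} && \forall i = 1, \dots, I-1 && \text{(1g)} \\
    & \delta_{i,b} + \delta_{i+1,b+1} + \gamma_b - 2 \le \delta^+_{i,b} && \forall i = 1, \dots, I-1;\ b = 1, \dots, B-2 && \text{(1h)} \\
    & \delta_{i,b} + \delta_{i+1,b+1} + (1 - \gamma_b) - 2 \le \delta^-_{i,b} && \forall i = 1, \dots, I-1;\ b = 1, \dots, B-2 && \text{(1i)} \\
    & d_{b+1} - d_b \ge X_i (c_b - c_{b+1}) - M_i^2 (1 - \delta^+_{i,b}) && \forall i = 1, \dots, I-1;\ b = 1, \dots, B-2 && \text{(1j)} \\
    & d_{b+1} - d_b \le X_{i+1} (c_b - c_{b+1}) + M_{i+1}^2 (1 - \delta^+_{i,b}) && \forall i = 1, \dots, I-1;\ b = 1, \dots, B-2 && \text{(1k)} \\
    & d_{b+1} - d_b \le X_i (c_b - c_{b+1}) + M_i^2 (1 - \delta^-_{i,b}) && \forall i = 1, \dots, I-1;\ b = 1, \dots, B-2 && \text{(1l)} \\
    & d_{b+1} - d_b \ge X_{i+1} (c_b - c_{b+1}) - M_{i+1}^2 (1 - \delta^-_{i,b}) && \forall i = 1, \dots, I-1;\ b = 1, \dots, B-2 && \text{(1m)} \\
    & \xi_i \in [0, M_i^a] && \forall i = 1, \dots, I && \text{(1n)} \\
    & c_b \in [\underline{C}_b, \overline{C}_b] && \forall b = 1, \dots, B && \text{(1o)} \\
    & d_b \in [\underline{D}_b, \overline{D}_b] && \forall b = 1, \dots, B && \text{(1p)} \\
    & \delta_{i,b} \ \text{binary} && \forall i = 1, \dots, I;\ b = 1, \dots, B-1 && \text{(1q)} \\
    & \gamma_b \ \text{binary} && \forall b = 1, \dots, B-2 && \text{(1r)} \\
    & \delta^+_{i,b}, \delta^-_{i,b} \in [0, 1] && \forall i = 1, \dots, I-1;\ b = 1, \dots, B-2 && \text{(1s)}
    \end{aligned}
    $$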

    Each linear segment ($b = 1, \dots, B-1$) is defined by a gradient $c_b$ and an intercept $d_b$. The breakpoints are then given by the intersection of consecutive linear segments.

    The objective function (1a) minimises the chosen distance metric. If the Maximum Absolute Difference metric is chosen, the continuous variables $\xi_i$ are replaced by a single continuous variable $\xi$, and the objective (1a) is replaced by “min ξ”. The other two metrics, Sum of Absolute Differences and Sum of Squared Differences, are achieved by setting $q = 1$ and $q = 2$ respectively in the objective (1a). Note that the objective function does not contain any of the binary variables, with its value implicitly given by constraints (1b) and (1c), which evaluate the objective value for the given PWL function.

    If the binary variable $\delta_{i,b} = 1$, then the data point $(X_i, Y_i)$ is associated with segment $b$ of the PWL function (for $i = 1, \dots, I$ and $b = 1, \dots, B-1$). The variables $c_b$ and $d_b$ are respectively the gradient and intercept of affine function $b$ (i.e., the affine function $b \in \{1, \dots, B-1\}$ has equation $y = c_b x + d_b$). Constraint (1d) ensures each data point is associated with exactly one linear segment. Constraints (1e)-(1g) ensure the ordering of all data points, such that if a point is associated with a certain segment, the next point must be associated with either the same segment or the next one ((1f) and (1g) ensure this for the first and last segment).
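To make the ordering logic of constraints (1d)-(1g) concrete, the following sketch checks a given 0/1 assignment matrix against them. It is an illustration only: the function name `satisfies_ordering` and the nested-list layout (rows are data points, columns are segments, both 0-based) are assumptions made here and are not part of the formulation by Rebennack and Krasko (2020).

```python
from typing import List


def satisfies_ordering(delta: List[List[int]]) -> bool:
    """Check constraints (1d)-(1g) for a 0/1 matrix delta[i][b]."""
    n_points = len(delta)
    n_segments = len(delta[0])

    # (1d): every data point belongs to exactly one segment.
    if any(sum(row) != 1 for row in delta):
        return False

    for i in range(n_points - 1):
        # (1e): point i+1 may use segment b+1 only if point i uses b or b+1.
        for b in range(n_segments - 1):
            if delta[i + 1][b + 1] > delta[i][b] + delta[i][b + 1]:
                return False
        # (1f): once the first segment has been left, it cannot be re-entered.
        if delta[i + 1][0] > delta[i][0]:
            return False
        # (1g): once the last segment has been entered, it cannot be left.
        if delta[i][n_segments - 1] > delta[i + 1][n_segments - 1]:
            return False
    return True


# Points 0 and 1 on the first segment, point 2 on the second: valid.
print(satisfies_ordering([[1, 0], [1, 0], [0, 1]]))  # True
# Moving back from the second segment to the first violates (1f).
print(satisfies_ordering([[0, 1], [1, 0]]))  # False
```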

    Constraints (1h)-(1m) ensure continuity of the PWL functions. Fig. 1 explains how the continuity requirement holds in constraints (1j)-(1m).

    Figure 1: Example data set showing breakpoint range for enforcing continuity, adapted from (Rebennack and Krasko 2020, Figure 1). [Plot not reproduced; axes $x$ and $y$, labels $X_i$, $X_{i+1}$ and $r$, legend: Data point, Breakpoint.]

    Suppose there is a breakpoint between data points $X_i$ and $X_{i+1}$, connecting the linear segments $b$ and $b+1$ (with equations $y = c_b x + d_b$ and $y = c_{b+1} x + d_{b+1}$, respectively). In this case, we have $\delta_{i,b} = \delta_{i+1,b+1} = 1$. In order for the two adjacent linear segments to be continuous, they must attain the same value at the breakpoint location $r$, where $X_i \le r \le X_{i+1}$. That is, for $c_b \neq c_{b+1}$,

    $$c_b r + d_b = c_{b+1} r + d_{b+1} \implies r = \frac{d_{b+1} - d_b}{c_b - c_{b+1}} \implies X_i \le \frac{d_{b+1} - d_b}{c_b - c_{b+1}} \le X_{i+1}.$$

    When multiplying through by the denominator, the direction of the inequalities changes depending on its sign. If $c_b - c_{b+1} > 0$ then $\gamma_b = 1$; otherwise $c_b - c_{b+1} < 0$ and $\gamma_b = 0$. The denominator is then distributed accordingly in constraints (1j)-(1m). In particular, depending on the value of the binary variable $\gamma_b$, either $\delta^+_{i,b}$ or $\delta^-_{i,b}$ is set to 1. If $\gamma_b = 1$, then $\delta^+_{i,b} = 1$ and constraints (1j) and (1k) are activated, implying that the gradient decreases between the two consecutive linear segments. Alternatively, if $\gamma_b = 0$, then $\delta^-_{i,b} = 1$ and constraints (1l) and (1m) are activated, implying the gradient increases. Note that if all of the $\gamma_b$ variables take the same value, then the PWL function is either convex or concave.
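The sign split described above can be checked numerically. The sketch below evaluates the active pair of inequalities, (1j)-(1k) or (1l)-(1m), with the corresponding $\delta^{+/-}_{i,b}$ set to 1, so that the big-M terms vanish. The function name `breakpoint_in_interval` and its argument names are hypothetical; the choice $\gamma_b = 0$ for equal slopes is an assumption of this sketch, under which the test reduces to $d_{b+1} - d_b = 0$.

```python
def breakpoint_in_interval(
    c_b: float, d_b: float, c_next: float, d_next: float,
    x_i: float, x_next: float,
) -> bool:
    """Test X_i <= r <= X_{i+1} without dividing by c_b - c_{b+1}."""
    slope_gap = c_b - c_next          # c_b - c_{b+1}
    intercept_gap = d_next - d_b      # d_{b+1} - d_b
    gamma = 1 if slope_gap > 0 else 0

    if gamma == 1:
        # (1j) and (1k) with delta^+ = 1: gradient decreases.
        return x_i * slope_gap <= intercept_gap <= x_next * slope_gap
    # (1l) and (1m) with delta^- = 1: gradient increases (or stays equal).
    return x_next * slope_gap <= intercept_gap <= x_i * slope_gap


# Segments y = x and y = -x + 4 intersect at r = 2.
print(breakpoint_in_interval(1.0, 0.0, -1.0, 4.0, 1.0, 3.0))  # True
print(breakpoint_in_interval(1.0, 0.0, -1.0, 4.0, 3.0, 5.0))  # False
```

Working with the multiplied-out inequalities rather than the ratio mirrors how the formulation stays linear in $c_b$ and $d_b$.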

    Finally, constraints (1n)-(1s) give the domains of the variables.

    2.3 The Second Mixed-Integer Linear Program

    We now present the formulation by Kong and Maravelias (2020). For ease of comparison, we again refer to the number of breakpoints as $B$ and refer to any similar variables from formulation (1) with the same notation. Hence, the model provides an optimal PWL function consisting of $B-1$ linear segments, $b = 1, \dots, B-1$. Each linear segment is defined by a gradient $c_b$ and an intercept $d_b$. The breakpoints are then given by the intersection of consecutive linear segments. This formulation contains the following new variables.

    Continuous Variables

    • $p^{+/-}_{i,b}$ and $q^{+/-}_{i,b}$, non-negative slack variables.

    Binary Variables

    • $u_{i,b}$, set to 1 if $p^+_{i,b} = q^+_{i+1,b+1} = 0$,

    • $v_{i,b}$, set to 1 if $p^-_{i,b} = q^-_{i+1,b+1} = 0$,

    • $\delta^F_{i,b}$, set to 1 if point $i$ is the first point in segment $b$,

    • $\delta^L_{i,b}$, set to 1 if point $i$ is the last point in segment $b$.

    Note that both formulations use continuous variables ($c_b$ and $d_b$) to represent the gradient and intercept of each linear segment, and a binary variable ($\delta_{i,b}$) to assign data points to a linear segment. Furthermore, for both formulations the error of the PWL function ($\xi_i$) is calculated for each data point using the PWL approximation at this point ($y_i$).

    This formulation also contains the following big-M constants: $M_i^a$ for $i = 1, \dots, I$.

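    $$
    \begin{aligned}
    \min\ & \sum_{i=1}^{I} \xi_i^q && && \text{(2a)} \\
    \text{s.t.}\ & \text{(1b)-(1d)} && && \text{(2b)} \\
    & \delta_{i,b} = \delta_{i-1,b} + \delta^F_{i,b} - \delta^L_{i,b} && \forall i = 1, \dots, I;\ \forall b = 1, \dots, B-1 && \text{(2c)} \\
    & \sum_{i=1}^{I} \delta^F_{i,b} = 1 && \forall b = 1, \dots, B-1 && \text{(2d)} \\
    & \sum_{i=1}^{I} \delta^L_{i,b} = 1 && \forall b = 1, \dots, B-1 && \text{(2e)} \\
    & \sum_{i'=1}^{i} \delta^F_{i',b} \ge \sum_{i'=1}^{i} \delta^F_{i',b+1} && \forall i = 1, \dots, I;\ \forall b = 1, \dots, B-2 && \text{(2f)} \\
    & \sum_{i'=1}^{i} \delta^L_{i',b} \ge \sum_{i'=1}^{i} \delta^L_{i',b+1} && \forall i = 1, \dots, I;\ \forall b = 1, \dots, B-2 && \text{(2g)} \\
    & \delta^F_{i,b} \le \delta_{i,b} && \forall i = 1, \dots, I;\ \forall b = 1, \dots, B-1 && \text{(2h)} \\
    & \delta^L_{i,b} \le \delta_{i,b} && \forall i = 1, \dots, I;\ \forall b = 1, \dots, B-1 && \text{(2i)} \\
    & X_i c_{b+1} + d_{b+1} - (X_i c_b + d_b) = p^+_{i,b} - p^-_{i,b} && \forall i = 1, \dots, I-1;\ \forall b = 1, \dots, B-2 && \text{(2j)} \\
    & X_{i+1} c_b + d_b - (X_{i+1} c_{b+1} + d_{b+1}) = q^+_{i+1,b+1} - q^-_{i+1,b+1} && \forall i = 1, \dots, I-1;\ \forall b = 1, \dots, B-2 && \text{(2k)} \\
    & p^+_{i,b} \le M_i^a (1 - u_{i,b}) && \forall i = 1, \dots, I-1;\ \forall b = 1, \dots, B-2 && \text{(2l)} \\
    & q^+_{i+1,b+1} \le M_i^a (1 - u_{i,b}) && \forall i = 1, \dots, I-1;\ \forall b = 1, \dots, B-2 && \text{(2m)} \\
    & p^-_{i,b} \le M_i^a (1 - v_{i,b}) && \forall i = 1, \dots, I-1;\ \forall b = 1, \dots, B-2 && \text{(2n)} \\
    & q^-_{i+1,b+1} \le M_i^a (1 - v_{i,b}) && \forall i = 1, \dots, I-1;\ \forall b = 1, \dots, B-2 && \text{(2o)} \\
    & u_{i,b} + v_{i,b} = \delta^L_{i,b} && \forall i = 1, \dots, I-1;\ \forall b = 1, \dots, B-2 && \text{(2p)} \\
    & \text{(1n)-(1q)} && && \text{(2q)} \\
    & u_{i,b}, v_{i,b}, \delta^F_{i,b}, \delta^L_{i,b} \ \text{binary} && \forall i = 1, \dots, I;\ b = 1, \dots, B-1 && \text{(2r)} \\
    & p^{+/-}_{i,b}, q^{+/-}_{i,b} \in [0, M_i^a] && \forall i = 1, \dots, I;\ b = 1, \dots, B-1 && \text{(2s)}
    \end{aligned}
    $$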

    The objective function (2a) minimises the chosen distance metric (see the discussion in Sect. 2.2). Note again that the objective function does not contain any of the binary variables. The next constraints are identical to constraints (1b)-(1d).

    Constraints (2c)-(2i) ensure the ordering of the data points across the linear segments, and assign one data point in each linear segment to be the first appearing in the segment, and one to be the last. The first data point of a given segment must come immediately after the last data point of the previous segment. Constraints (2f) and (2g) ensure the ordering of the first and last data point in each segment.

    Constraints (2j)-(2p) ensure the continuity in adjacent linear segments. Fig. 2 explains how the continuity requirement holds in constraints (2j)-(2o).

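    Figure 2: Example showing calculation for enforcing continuity, adapted from (Kong and Maravelias 2020, Figure 2). [Plot not reproduced; axes $x$ and $y$, segments $f_b$ and $f_{b+1}$, points $X_i$ and $X_{i+1}$, values $y_i$, $y_{i+1}$, $y'_i$, $y'_{i+1}$, gaps $p^+_{i,b} - p^-_{i,b}$ and $q^+_{i+1,b+1} - q^-_{i+1,b+1}$, legend: Actual Approximation, Adjacent Approximation.]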

    Consider two adjacent linear segments $f_b$ and $f_{b+1}$, as seen in Fig. 2. Let us define $y_i := f_b(X_i)$ and $y_{i+1} := f_{b+1}(X_{i+1})$ as the approximated values of these data points in their respective linear segments. Furthermore, we define $y'_i := f_{b+1}(X_i)$ and $y'_{i+1} := f_b(X_{i+1})$ as the approximated values of the data points on the adjacent linear segment. If there exists a breakpoint between $X_i$ and $X_{i+1}$ (for some $i \in \{1, \dots, I-1\}$), then the PWL function is continuous if $(y'_i - y_i)(y'_{i+1} - y_{i+1}) \ge 0$. Hence, it is required that $y'_i - y_i$ and $y'_{i+1} - y_{i+1}$ are either both non-negative or both non-positive.

    If a given data point is the last in its segment (i.e., $\delta^L_{i,b} = 1$) then either $u_{i,b}$ or $v_{i,b}$ is set to 1 by constraint (2p). This then forces either $p^+_{i,b}$ and $q^+_{i+1,b+1}$ or $p^-_{i,b}$ and $q^-_{i+1,b+1}$ to be zero, by constraints (2l)-(2o). Hence, the right-hand sides of constraints (2j)-(2k), which give the values of $y'_i - y_i$ and $y'_{i+1} - y_{i+1}$, are both non-negative or both non-positive, ensuring continuity. Otherwise, there is no restriction on constraints (2j)-(2k) and the slack variables $p^{+/-}_{i,b}$ and $q^{+/-}_{i,b}$.
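The sign condition can be illustrated with a short sketch that evaluates the left-hand sides of (2j) and (2k) and tests whether one of the two settings $u_{i,b} = 1$ or $v_{i,b} = 1$ is feasible. The function name `km_continuity_holds` and its argument names are hypothetical and chosen for this example only.

```python
def km_continuity_holds(
    c_b: float, d_b: float, c_next: float, d_next: float,
    x_i: float, x_next: float,
) -> bool:
    """Check that the gaps in (2j) and (2k) share a sign."""
    # (2j): f_{b+1}(X_i) - f_b(X_i) = p^+ - p^-
    gap_left = (c_next * x_i + d_next) - (c_b * x_i + d_b)
    # (2k): f_b(X_{i+1}) - f_{b+1}(X_{i+1}) = q^+ - q^-
    gap_right = (c_b * x_next + d_b) - (c_next * x_next + d_next)

    # u = 1 forces p^+ = q^+ = 0, so both gaps must be non-positive.
    u_feasible = gap_left <= 0 and gap_right <= 0
    # v = 1 forces p^- = q^- = 0, so both gaps must be non-negative.
    v_feasible = gap_left >= 0 and gap_right >= 0
    return u_feasible or v_feasible


# Segments y = x and y = -x + 4 intersect at x = 2.
print(km_continuity_holds(1.0, 0.0, -1.0, 4.0, 1.0, 3.0))  # True
print(km_continuity_holds(1.0, 0.0, -1.0, 4.0, 3.0, 5.0))  # False
```

Both gaps having the same sign means that the difference $f_{b+1} - f_b$ changes sign (or vanishes) between $X_i$ and $X_{i+1}$, so the two segments intersect within that interval.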

    Finally, constraints (2q)-(2s) give the domains of the variables. The $p^{+/-}_{i,b}$ and $q^{+/-}_{i,b}$ are non-negative slack variables; however, we note that they cannot exceed the maximum absolute difference, so we constrain them from above by $M_i^a$.

    3 Comparison

    | | Rebennack and Krasko (2020) | Kong and Maravelias (2020) |
    |---|---|---|
    | Constraints | (1a)-(1s) | (2a)-(2s) |
    | Continuous Variables | $c_b$, $d_b$, $\delta^+_{i,b}$, $\delta^-_{i,b}$, $\xi_i$ | $c_b$, $d_b$, $p^+_{i,b}$, $p^-_{i,b}$, $q^+_{i,b}$, $q^-_{i,b}$, $\xi_i$ |
    | # Continuous Variables | $2BI + 2 - 3I$ | $2B(2I - 1) + 6 - 7I$ |
    | Binary Variables | $\delta_{i,b}$, $\gamma_b$ | $\delta_{i,b}$, $\delta^F_{i,b}$, $\delta^L_{i,b}$, $u_{i,b}$, $v_{i,b}$ |
    | # Binary Variables | $B(I + 1) - 2 - I$ | $B(5I - 2) + 4 - 7I$ |
    | # Functional Constraints | $B(9I - 7) + 12 - 13I$ | $B(14I - 5) + 12 - 22I$ |
    | # Big-M constraints | $B(6I - 4) + 8 - 10I$ | $B(6I - 4) + 8 - 10I$ |

    Table 1: A comparison of the two MILP formulations. For the Maximum Absolute Difference metric, there are $I - 1$ fewer continuous variables in both formulations.

    Table 1 presents a comparison of the two formulations. It shows that the formulation of Kong and Maravelias (2020) contains more continuous variables, binary variables and functional constraints than the formulation of Rebennack and Krasko (2020) for $B \ge 3$. As the number of breakpoints increases, the difficulty of the problem increases. However, the ratio of the number of variables and constraints between the two formulations stays roughly constant, since both formulations contain $O(BI)$ variables and $O(BI)$ functional constraints.
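The size expressions in Table 1 are easy to evaluate for a concrete instance. The sketch below simply codes the closed-form counts from the table; the function name `model_sizes` and the dictionary keys are hypothetical, and the example instance ($B = 4$, $I = 50$) is chosen only for illustration.

```python
from typing import Dict


def model_sizes(B: int, I: int) -> Dict[str, Dict[str, int]]:
    """Evaluate the counts of Table 1 for B breakpoints and I data points."""
    big_m = B * (6 * I - 4) + 8 - 10 * I  # identical for both formulations
    return {
        "rebennack_krasko": {
            "continuous": 2 * B * I + 2 - 3 * I,
            "binary": B * (I + 1) - 2 - I,
            "constraints": B * (9 * I - 7) + 12 - 13 * I,
            "big_m": big_m,
        },
        "kong_maravelias": {
            "continuous": 2 * B * (2 * I - 1) + 6 - 7 * I,
            "binary": B * (5 * I - 2) + 4 - 7 * I,
            "constraints": B * (14 * I - 5) + 12 - 22 * I,
            "big_m": big_m,
        },
    }


sizes = model_sizes(B=4, I=50)
print(sizes["rebennack_krasko"]["binary"])  # 152
print(sizes["kong_maravelias"]["binary"])   # 646
```

For this instance the second formulation has roughly four times as many binary variables as the first.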

    We firstly note that the two formulations have many similar constraints. Both evaluate the PWL function at each data point using a continuous variable, and measure the distance to the given data point for the calculation of the objective value (constraints (1b)-(1d), which reappear as (2b)). Furthermore, both formulations use a binary variable to assign data points to linear segments ($\delta_{i,b}$).

    There are, however, some differences between the formulations. In particular, formulation (1) does not use a binary variable to assign the first and last data point in each segment. While both formulations contain a binary variable to assign data points to linear segments, formulation (1) achieves the ordering of these binary variables with constraints (1e)-(1g) only affecting the $\delta_{i,b}$ variables. However, formulation (2) requires the ordering of the binary variables $\delta_{i,b}$, $\delta^F_{i,b}$ and $\delta^L_{i,b}$ with constraints (2f)-(2i). These constraints are much more dense than the corresponding constraints in formulation (1) (constraints (1e)-(1g)). It is well known that dense constraints can lead to slower solve times (see e.g., (?)), so it is expected that formulation (2) will be slower to find optimal solutions.

    Formulation (1) is able to achieve the continuity requirement using only one more binary variable, $\gamma_b$, which is used to represent the change in gradient between adjacent linear segments. While the variables $\delta^{+/-}_{i,b}$ are constrained within $[0, 1]$, they are not binary and thus ease the formulation. Constraints (1h)-(1m) ensure that adjacent linear segments attain the same value at the breakpoints, known to be within data points $X_i$ and $X_{i+1}$ when $\delta_{i,b} + \delta_{i+1,b+1} = 2$. However, formulation (2) uses the binary variables $u_{i,b}$ and $v_{i,b}$ when ensuring continuity, with the requirement that $u_{i,b} + v_{i,b} = \delta^L_{i,b}$ (constraint (2p)). This requirement states that when a given data point is the last in a linear segment (i.e., $\delta^L_{i,b} = 1$), then either $u_{i,b} = 1$ or $v_{i,b} = 1$, activating either constraints (2l)-(2m) or (2n)-(2o).

    3.1 Bounds on the Big-M Values

    Both formulations use “big-M” constructs to model logical implications. Values for these big-M constants are required in order to find an optimal PWL function.

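    The formulations contain the constants $M_i^a$ (both formulations) and $M_i^2$ (formulation (1) only), featuring in, for example, constraints (1b)-(1c)/(2l)-(2o) and (1j)-(1m) respectively. Let $\underline{C}$, $\overline{C}$, $\underline{D}$ and $\overline{D}$ be the minimal and maximal possible values for the slope and intercept of each linear segment. The following bounds on these constants are suggested by Rebennack and Krasko (2020):

    $$M_i^a = \max\{|Y_i - \underline{C} X_i - \underline{D}|,\ |Y_i - \underline{C} X_i - \overline{D}|,\ |Y_i - \overline{C} X_i - \underline{D}|,\ |Y_i - \overline{C} X_i - \overline{D}|\} \quad \forall i = 1, \dots, I,$$

    $$M_i^2 = \overline{D} - \underline{D} - X_i (\underline{C} - \overline{C}) \quad \forall i = 1, \dots, I.$$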
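A minimal sketch of how these bounds could be computed for a data set is given below. It assumes the reading of the formulas in which $M_i^a$ is the largest absolute deviation over the four corner combinations of slope and intercept bounds, and $M_i^2 = \overline{D} - \underline{D} - X_i(\underline{C} - \overline{C})$; the function name `big_m_bounds` and the argument names are hypothetical.

```python
from typing import List, Sequence, Tuple


def big_m_bounds(
    xs: Sequence[float], ys: Sequence[float],
    c_lo: float, c_hi: float, d_lo: float, d_hi: float,
) -> Tuple[List[float], List[float]]:
    """Return the lists (M^a_i) and (M^2_i) for all data points."""
    # M^a_i: worst absolute error over the corners of the (slope, intercept) box.
    m_a = [
        max(abs(y - c * x - d) for c in (c_lo, c_hi) for d in (d_lo, d_hi))
        for x, y in zip(xs, ys)
    ]
    # M^2_i: bound used in the continuity constraints (1j)-(1m).
    m_2 = [d_hi - d_lo - x * (c_lo - c_hi) for x in xs]
    return m_a, m_2


m_a, m_2 = big_m_bounds([0.0, 2.0], [1.0, 3.0], c_lo=-1.0, c_hi=1.0, d_lo=0.0, d_hi=2.0)
print(m_a)  # [1.0, 5.0]
print(m_2)  # [2.0, 6.0]
```

Tighter slope and intercept bounds give smaller big-M values, which generally strengthens the linear relaxation of a big-M model.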


    4 Computational Experiments

    In order to compare the effectiveness of the two formulations for the piecewise linear function fitting problem, we implemented the models in C++ and embedded them within IBM ILOG-Cplex version 12.9.1, using standard solver settings and the big-M values presented in Sect. 3.1. The experiments in this section were run on an Intel 3.00 GHz machine with 16 GB of RAM.

    We present comparative results across five univariate data sets taken from real-world instances commonly used to assess the performance of programs for PWL function fitting. The MpStorage50 data set estimates the volume of the Morrow Point reservoir as a function of the water elevation (Goldberg et al. 2014). The DebrisFlow data set shows the expected economic damage of a real-world location as a function of the volume of debris flow (McCoy et al. 2016, ?). The Titanium data set shows a thermal property of Titanium (de Boor and Rice 1968, Jupp 1978). The RmHeight data set shows the height of plastic pellets in a tall, narrow container over time². The Paperweight data set is a large data set that shows a measure of paper density over time (Macgregor and Harris 1993).

    We present results for three different distance metrics: the maximum absolute difference (Table 2), the sum of absolute differences (Table 3) and the sum of squared differences (Table 4). The first metric involves a slight reformulation. Rather than having $I$ different variables $\xi_i$ in constraints (1b)-(1c), they are replaced by the single variable $\xi$ in these constraints. Hence, the objective functions (1a) and (2a) reduce simply to minimising the maximum absolute difference, given by $\xi$. For the other two metrics, we keep the original formulations. We set $q = 1$ in the objectives (1a) and (2a) for the sum of absolute differences, and set $q = 2$ for the sum of squared differences. For the sum of squared differences metric, both formulations result in a convex mixed-integer quadratically constrained program.

    4.1 Running Times

    We first present the running times to solve each of the data sets for the three different distance metrics. For each data set and number of breakpoints, we present the speedup achieved by formulation (1), i.e., the ratio of the running time of formulation (2) to that of formulation (1), and the optimal objective function value (OF Value).

    | Data Set | Method | Breakpoints: 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|
    | MpStorage50 (I = 50) | Formulation (1) | 0.1 | 0.2 | 0.4 | 0.7 | 2.3 | 2.3 | 3.9 | 12.7 | 51.5 | 59.8 | 89.3 |
    | | Formulation (2) | 0.3 | 0.7 | 2.5 | 5.1 | 16.4 | 15.3 | 24.1 | 37.3 | 133.1 | 149.5 | 480.0 |
    | | Speedup | 3.00 | 3.50 | 6.25 | 7.29 | 7.13 | 6.65 | 6.20 | 3.08 | 2.58 | 2.50 | 5.38 |
    | | Optimal OF Value | 1.79 | 1.00 | 0.67 | 0.50 | 0.50 | 0.40 | 0.38 | 0.38 | 0.38 | 0.33 | 0.33 |
    | DebrisFlow (I = 44) | Formulation (1) | 0.1 | 0.2 | 0.3 | 1.5 | 6.8 | 13.3 | 16.1 | 149.0 | 149.3 | 354.3 | 537.0 |
    | | Formulation (2) | 0.2 | 0.7 | 1.5 | 5.5 | 12.3 | 20.8 | 47.6 | 141.3 | 286.4 | 477.5 | 570.9 |
    | | Speedup | 2.00 | 3.50 | 5.00 | 3.67 | 1.81 | 1.56 | 2.96 | 0.95 | 1.92 | 1.35 | 1.06 |
    | | Optimal OF Value | 2.04 | 0.78 | 0.65 | 0.50 | 0.42 | 0.39 | 0.30 | 0.26 | 0.19 | 0.16 | 0.16 |
    | Titanium (I = 49) | Formulation (1) | 0.2 | 0.3 | 0.9 | 2.6 | 6.5 | 4.7 | 10.3 | 50.7 | 91.1 | 71.3 | 207.2 |
    | | Formulation (2) | 0.4 | 1.0 | 2.8 | 4.6 | 9.1 | 16.5 | 23.7 | 22.0 | 25.6 | 44.2 | 180.3 |
    | | Speedup | 2.00 | 3.33 | 3.11 | 1.77 | 1.40 | 3.51 | 2.30 | 0.43 | 0.28 | 0.62 | 0.87 |
    | | Optimal OF Value | 0.49 | 0.079 | 0.063 | 0.050 | 0.024 | 0.022 | 0.012 | 0.010 | 0.0079 | 0.0075 | 0.0073 |
    | RmHeight (I = 84) | Formulation (1) | 0.3 | 0.7 | 2.1 | 3.2 | 8.7 | 47.8 | 110.8 | 117.2 | 106.3 | 578.4 | 17654.2 |
    | | Formulation (2) | 0.4 | 3.3 | 14.9 | 31.5 | 41.5 | 245.4 | 1069.1 | 1301.2 | 1361.3 | 18064.8 | ? |
    | | Speedup | 1.33 | 4.71 | 7.10 | 9.84 | 4.77 | 5.13 | 9.65 | 11.10 | 12.81 | 31.23 | >4.89 |
    | | Optimal OF Value | 13.36 | 13.35 | 12.93 | 12.38 | 11.20 | 11.09 | 10.73 | 9.05 | 8.56 | 8.55 | 8.54 |
    | Paperweight (I = 231) | Formulation (1) | 1.5 | 37.2 | 132.4 | 475.4 | 97.6 | 15208.2 | ? | | | | |
    | | Formulation (2) | 12.9 | 105.6 | 146.5 | 1105.3 | 1537.7 | ? | ? | | | | |
    | | Speedup | 8.60 | 2.84 | 1.11 | 2.32 | 15.76 | ≥5.68 | | | | | |
    | | Optimal OF Value | 1.86 | 1.85 | 1.73 | 1.67 | | | | | | | |

    Table 2: Maximum Absolute Difference: Running time (in seconds) until optimality. Results denoted with ? exceeded the time limit of 86,400 seconds.
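The speedup entries, including the bounds reported when formulation (2) hits the time limit, can be reproduced as follows. This is a sketch under stated assumptions: the function name `speedup` is hypothetical, a timeout is represented by `None`, and no value is returned when formulation (1) itself times out, since the tables report no speedup in that case.

```python
from typing import Optional, Tuple

TIME_LIMIT = 86_400.0  # seconds, as stated in the table captions


def speedup(
    time_f1: Optional[float], time_f2: Optional[float]
) -> Optional[Tuple[float, bool]]:
    """Return (ratio of (2) to (1), is_lower_bound); None marks a timeout."""
    if time_f1 is None:
        return None  # formulation (1) timed out: no speedup is reported
    if time_f2 is None:
        # Formulation (2) needed at least TIME_LIMIT seconds.
        return TIME_LIMIT / time_f1, True
    return time_f2 / time_f1, False


ratio, is_bound = speedup(17654.2, None)  # RmHeight, 14 breakpoints
print(round(ratio, 2), is_bound)  # 4.89 True
```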

    2 RmHeight data set available from www.openmv.net.

    In Table 2, there are only five occasions out of the 51 tested instances for which formulation (2) is faster than formulation (1) (these are the instances with a speedup below 1). Notably, four of these occasions occur within the Titanium data set, where the optimal objective function value is small. Formulation (1) is faster in 45 instances. There are also at least four instances for which formulation (1) is more than 10 times faster. In particular, for the RmHeight data set, formulation (1) becomes significantly faster as the number of breakpoints increases; for finding a PWL function with 13 breakpoints, it is 31 times faster.
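
The counts quoted for Table 2 can be checked directly from its Speedup rows. The snippet below is a small bookkeeping sketch; the variable names are hypothetical, and the two entries reported only as lower bounds (4.89 and 5.68) are entered at their bound, which is why the text says "at least four" instances exceed a factor of 10.

```python
# Speedup entries of Table 2 (lower bounds entered at their bound).
speedups = {
    "MpStorage50": [3.00, 3.50, 6.25, 7.29, 7.13, 6.65, 6.20, 3.08, 2.58, 2.50, 5.38],
    "DebrisFlow":  [2.00, 3.50, 5.00, 3.67, 1.81, 1.56, 2.96, 0.95, 1.92, 1.35, 1.06],
    "Titanium":    [2.00, 3.33, 3.11, 1.77, 1.40, 3.51, 2.30, 0.43, 0.28, 0.62, 0.87],
    "RmHeight":    [1.33, 4.71, 7.10, 9.84, 4.77, 5.13, 9.65, 11.10, 12.81, 31.23, 4.89],
    "Paperweight": [8.60, 2.84, 1.11, 2.32, 15.76, 5.68],
}
unsolved_by_both = 1  # Paperweight with 10 breakpoints: both runs hit the limit

values = [s for row in speedups.values() for s in row]
total = len(values) + unsolved_by_both
f2_faster = sum(s < 1 for s in values)    # formulation (2) faster
f1_faster = sum(s > 1 for s in values)    # formulation (1) faster
over_10x = sum(s > 10 for s in values)    # formulation (1) more than 10x faster
titanium_f2 = sum(s < 1 for s in speedups["Titanium"])

print(total, f2_faster, f1_faster, over_10x, titanium_f2)  # 51 5 45 4 4
```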

                                  Breakpoints
    Data Set     Method                   4        5        6        7        8        9       10       11       12       13       14

    MpStorage50  Formulation (1)        0.1      0.7      2.7      7.8     42.0    165.1    650.5   3507.3  15322.8  30714.7        ?
    (I = 50)     Formulation (2)        0.8      2.5      5.6     21.2     55.7    939.3   4101.6  12741.0  12428.7  53685.2        ?
                 Speedup               8.00     3.57     2.07     2.72     1.33     5.69     6.31     3.63     0.81     1.75
                 Optimal OF Value     43.18    24.51    16.48    12.34    10.40     9.00     7.80     7.00     6.13     5.33

    DebrisFlow   Formulation (1)        0.1      0.3      5.0     17.5    180.1    234.4    895.4   1266.7   4045.9  32918.5  21798.3
    (I = 44)     Formulation (2)        0.4      0.7      4.5     29.6     58.8    273.3   2142.1   1318.8   5579.7  12768.9  33931.0
                 Speedup               4.00     2.33     0.90     1.69     0.33     1.17     2.39     1.04     1.38     0.39     1.56
                 Optimal OF Value     27.15    10.75     8.85     7.20     6.20     4.55     3.89     2.71     2.05     1.72     1.40

    Titanium     Formulation (1)        0.4      1.2      2.7      6.3     24.0     47.8   1264.3  62707.9        ?
    (I = 49)     Formulation (2)        0.4      1.9      4.4     12.0     26.0   1936.9    404.1        ?        ?
                 Speedup               1.00     1.58     1.63     1.90     1.08    40.52     0.32    ≥1.38
                 Optimal OF Value      5.74     1.08     0.74     0.49     0.37     0.27     0.18

    RmHeight     Formulation (1)        1.5     25.2    254.0  20613.7        ?
    (I = 84)     Formulation (2)        3.2     30.3    364.4  11963.9        ?
                 Speedup               2.13     1.20     1.43     0.58
                 Optimal OF Value    479.54   449.44   434.16   405.19

    Paperweight  Formulation (1)       18.0    784.2  37887.8        ?
    (I = 231)    Formulation (2)       65.7   1217.7        ?        ?
                 Speedup               3.65     1.55    ≥2.28
                 Optimal OF Value    113.55   110.77

    Table 3: Sum of Absolute Differences: Running time (in seconds) until optimality. Results denoted with ? exceeded the time limit of 86,400 seconds.

    In Table 3, there are only six occasions out of the 40 tested instances for which formulation (2) is faster than formulation (1) (these are the instances with a speedup below 1). Three of these occasions occur within the DebrisFlow data set. Formulation (1) is faster in 29 instances. There is also one instance for which formulation (1) is more than 10 times faster; for the Titanium data set with 9 breakpoints, it is more than 40 times faster.

                                  Breakpoints
    Data Set     Method                   4        5        6        7        8        9       10       11       12       13       14

    MpStorage50  Formulation (1)        0.5      1.1      4.6     19.2     77.3    599.9    655.9  16562.7  59212.9        ?
    (I = 50)     Formulation (2)        1.6      5.2     32.1     92.2    314.8    616.8  45531.1        ?        ?        ?
                 Speedup               3.20     4.73     6.98     4.80     4.07     1.03     69.4    ≥5.22    ≥1.46
                 Optimal OF Value     55.79    18.61     8.41     5.46     4.18     3.24     2.63     2.28     1.90

    DebrisFlow   Formulation (1)        0.3      0.8      3.4     25.5     59.2     63.2    587.7        ?
    (I = 44)     Formulation (2)        0.6      3.4     22.6     62.4    133.4    347.9   1157.7        ?
                 Speedup               2.00     4.25     6.65     2.45     2.25     5.50     1.97
                 Optimal OF Value     37.96     5.06     3.90     2.87     1.85     1.04     0.73

    Titanium     Formulation (1)        0.5      1.8      2.8     31.4     50.4     20.5   1462.3   1132.0   1273.3  10220.8        ?
    (I = 49)     Formulation (2)        1.3      4.0     12.1     52.5     78.3    248.2    247.9    554.5   6741.8        ?        ?
                 Speedup               2.60     2.22     4.32     1.67     1.55    12.11     0.17     0.49     5.29    ≥8.45
                 Optimal OF Value      2.13    0.069    0.035    0.018   0.0069   0.0039   0.0013   0.0011  0.00086

    RmHeight     Formulation (1)        2.2     31.0    159.5   1149.6   5679.0  57298.0        ?
    (I = 84)     Formulation (2)       13.3    128.0    663.8  11013.1  44588.2        ?        ?
                 Speedup               6.05     4.13     4.16     9.58     7.85    ≥1.49
                 Optimal OF Value   3923.14  3627.75  3294.75  3051.63

    Paperweight  Formulation (1)      111.4    679.1   4802.1        ?
    (I = 231)    Formulation (2)      382.7  17434.1        ?        ?
                 Speedup               3.44    25.67   ≥17.99
                 Optimal OF Value    102.62    97.00    89.18

    Table 4: Sum of Squared Differences: Running time (in seconds) until optimality. Results denoted with ? exceeded the time limit of 86,400 seconds.

    In Table 4, there are only two occasions out of the 40 tested instances for which formulation (2) is faster than formulation (1) (these are the instances with a speedup below 1). Both of these occasions occur within the Titanium data set, where the optimal objective function value is small. Formulation (1) is faster in 33 instances.

    Overall, out of the 131 tested instances, formulation (1) was faster on 107 occasions (82%), while formulation (2) was faster on 13 occasions (10%). Fig. 3 shows the fraction of instances for each objective function that were solved within a given time for both formulations. For all three objective functions, the blue curve (formulation (1)) stays above the red curve (formulation (2)) almost everywhere, demonstrating the superiority of formulation (1) on the tested instances.

    [Figure 3, three panels: Maximum Absolute Difference, Sum of Absolute Differences, Sum of Squared Differences. Horizontal axis: Time (s), logarithmic scale; vertical axis: η, from 0 to 1; curves: Formulation (1) and Formulation (2).]

    Figure 3: η refers to the fraction of instances solved to optimality within the given time.
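
The quantity η plotted in Fig. 3 is a simple empirical fraction. As a hedged illustration, the sketch below computes it for the MpStorage50 rows of Table 2 only, which is a small subset of the instances behind the figure. The function name `fraction_solved` and the convention that `None` marks a run exceeding the time limit are assumptions.

```python
def fraction_solved(times, t):
    """Fraction of instances solved to optimality within t seconds.
    None marks an instance that exceeded the time limit."""
    solved = sum(1 for s in times if s is not None and s <= t)
    return solved / len(times)

# Running times (seconds) for MpStorage50, maximum absolute difference (Table 2)
f1 = [0.1, 0.2, 0.4, 0.7, 2.3, 2.3, 3.9, 12.7, 51.5, 59.8, 89.3]
f2 = [0.3, 0.7, 2.5, 5.1, 16.4, 15.3, 24.1, 37.3, 133.1, 149.5, 480.0]

for t in (1.0, 10.0, 100.0):
    print(t, fraction_solved(f1, t), fraction_solved(f2, t))
```

On this subset the curve for formulation (1) lies above that of formulation (2) at each of the three sampled times, which matches the qualitative picture described for Fig. 3.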

    4.2 Branch-and-bound Trees

    As well as analysing the running time to find the globally optimal solution, it is also interesting to consider the branch-and-bound tree which is formed during the solve. Fig. 4 shows the number of nodes explored and the size of the branch-and-bound tree (in MB) at the end of the search for the three distance metrics on the MpStorage50 data set.

    Fig. 4 suggests that overall, the number of nodes explored and the size of the branch-and-bound trees are similar for both formulations. As expected, as the problem difficulty increases with the number of breakpoints, so do the number of explored nodes and the size of the branch-and-bound tree. The tree grows exponentially with the number of breakpoints. We know from Tables 2-4 that (aside from one possible outlier result) formulation (1) is faster than formulation (2) for this data set. This suggests that the branch-and-bound trees created during the optimisation process are similar for both formulations, yet formulation (1) is able to process the tree more quickly. As an example, consider the sum of squared differences metric with 10 breakpoints. Formulation (1) takes 655.9 seconds to find an optimal solution, while formulation (2) takes 45,531.1 seconds. However, the respective numbers of nodes explored are 581,198 and 570,730, and the respective tree sizes are 2.73 and 2.78 MB. There are fewer binary variables and constraints (and, in particular, fewer dense constraints) in formulation (1), which may explain the difference in solve times.
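
The example above can be restated as an average node throughput. The following sketch only rearranges the four numbers quoted in the text; the dictionary layout and variable names are hypothetical, and an average over the whole solve says nothing about where in the search the time is spent.

```python
# MpStorage50, sum of squared differences, 10 breakpoints (values quoted above)
runs = {
    "Formulation (1)": {"seconds": 655.9, "nodes": 581_198},
    "Formulation (2)": {"seconds": 45_531.1, "nodes": 570_730},
}

# Average number of branch-and-bound nodes processed per second
rate = {name: r["nodes"] / r["seconds"] for name, r in runs.items()}
ratio = rate["Formulation (1)"] / rate["Formulation (2)"]

for name, v in rate.items():
    print(f"{name}: {v:.1f} nodes per second")
print(f"throughput ratio (1) : (2) = {ratio:.1f}")
```

With nearly identical node counts, the throughput ratio is close to the speedup of 69.4 reported in Table 4, consistent with the time difference arising from the cost per node rather than from the size of the tree.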

    Fig. 5 shows the same results for the DebrisFlow data set. Again, we can see that in general, the number of nodes explored and the sizes of the branch-and-bound trees are similar for both formulations. Formulation (1) is generally quicker to find the globally optimal solution for this data set, so it is again interesting to see the similarity in the tree size and number of nodes explored.

    5 Conclusions

    We have presented a theoretical and experimental comparison between two recently presented mixed-integer linear programming (MILP) formulations for fitting optimal piecewise linear (PWL) functions to univariate, discrete data. The two formulations, presented by Rebennack and Krasko (2020) and Kong and Maravelias (2020), use binary variables and big-M constructs to model the optimal breakpoint locations and the continuity requirement necessary for fitting optimal PWL functions.

    The formulation presented by Rebennack and Krasko (2020) contains fewer continuous variables, binary variables and functional constraints than the formulation presented by Kong and Maravelias (2020). Hence, the first formulation is able to model the assignment of data points to linear segments, the ordering of data points and the continuity requirement in fewer constraints, with less reliance on complicating binary variables. Experimental results suggest this formulation is superior for most tested instances, outperforming the second formulation in approximately 82% of the 131 presented results across five different data sets with three different distance metrics and different numbers of breakpoints.

    For future work, we note that the large number of big-M constructs and binary variables in both formulations can lead to difficulties. Due to the prevalence of research into PWL function fitting, the incorporation of problem-specific knowledge into the formulations could lead to speedups.
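
To illustrate why big-M constructs can be troublesome, the sketch below uses a generic implication of the form "if δ = 1 then r ≤ ξ", modelled as r ≤ ξ + M(1 − δ). This is a textbook construct and is not copied from either formulation; the names `big_m_ok`, `r`, `xi`, `delta` and `M` are hypothetical.

```python
def big_m_ok(r, xi, delta, M):
    """Check the big-M constraint r <= xi + M * (1 - delta)."""
    return r <= xi + M * (1 - delta)

r, xi = 3.0, 0.5  # hypothetical residual of a point and its allowed error

# delta = 1: the implication is enforced, so a residual above xi is rejected.
enforced = big_m_ok(r, xi, delta=1, M=100.0)
# delta = 0: the constraint is switched off, provided M is large enough.
relaxed = big_m_ok(r, xi, delta=0, M=100.0)
# If M is too small, a valid assignment with delta = 0 is wrongly cut off.
too_small = big_m_ok(r, xi, delta=0, M=2.0)
print(enforced, relaxed, too_small)  # False True False

# In the LP relaxation delta may be fractional. The slack M * (1 - delta)
# then scales with M, so loose values of M weaken the relaxation.
for M in (5.0, 100.0):
    print(M, M * (1 - 0.9))
```

The tension shown here, where M must be large enough to be valid yet as small as possible for a tight relaxation, is one way problem-specific bounds could help.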

    References

    Bellman R, Roth R (1969) Curve fitting by segmented straight lines. Journal of the American Statistical Association 64(327):1079–1084.

    Berman NG, Wong WK, Bhasin S, Ipp E (1996) Applications of segmented regression models for biomedical studies. American Journal of Physiology-Endocrinology and Metabolism 270(4):E723–E732.

    Bertsimas D, King A (2016) OR forum—an algorithmic approach to linear regression. Operations Research 64(1):2–16.

    Bertsimas D, King A, Mazumder R (2016) Best subset selection via a modern optimization lens. The Annals of Statistics 44(2):813–852.

    Bertsimas D, Mazumder R (2014) Least quantile regression via modern optimization. The Annals of Statistics 42(6):2494–2525.

    Bertsimas D, Shioda R (2007) Classification and regression via integer optimization. Operations Research 55(2):252–271.

    Chang C (1973) Pattern recognition by piecewise linear discriminant functions. IEEE Transactions on Computers C-22(9):859–862.

    Chen DZ, Wang H (2009) Approximating points by a piecewise linear function: I. Dong Y, Du DZ, Ibarra O, eds., Algorithms and Computation, 224–233 (Springer).

    de Boor C, Rice J (1968) Least squares cubic spline approximation II - variable knots. Technical report, Computer Science Technical Reports, Purdue University.

    Ertel JE, Fowlkes EB (1976) Some algorithms for linear spline and piecewise multiple linear regression. Journal of the American Statistical Association 71(355):640–648.

    Feijoo B, Meyer RR (1988) Piecewise-linear approximation methods for nonseparable convex optimization. Management Science 34(3):411–419.

    Geißler B, Martin A, Morsi A, Schewe L (2012) Using piecewise linear functions for solving MINLPs. Lee J, Leyffer S, eds., Mixed Integer Nonlinear Programming, 287–314 (Springer).

    Goldberg N, Kim Y, Leyffer S, Veselka TD (2014) Adaptively refined dynamic program for linear spline regression. Computational Optimization and Applications 58(3):523–541.

    Gunnerud V, Foss B (2010) Oil production optimization—a piecewise linear model, solved with two decomposition strategies. Computers & Chemical Engineering 34(11):1803–1812.

    Jupp DLB (1978) Approximation to data by splines with free knots. SIAM Journal on Numerical Analysis 15(2):328–343.

    Kallrath J, Rebennack S (2014) Computing Area-Tight Piecewise Linear Overestimators, Underestimators and Tubes for Univariate Functions, 273–292 (New York, NY: Springer New York), ISBN 978-1-4939-0808-0, URL http://dx.doi.org/10.1007/978-1-4939-0808-0_14.

    Kong L, Maravelias CT (2020) On the derivation of continuous piecewise linear approximating functions. INFORMS Journal on Computing 0(0):null, URL http://dx.doi.org/10.1287/ijoc.2019.0949.

    Macgregor JF, Harris TJ (1993) The exponentially weighted moving variance. Journal of Quality Technology 25(2):106–118.

    McCoy K, Krasko V, Santi P, Kaffine D, Rebennack S (2016) Minimizing economic impacts from post-fire debris flows in the western United States. Natural Hazards 83(1):149–176.

    Muggeo VMR (2003) Estimating regression models with unknown break-points. Statistics in Medicine 22(19):3055–3071.

    Rebennack S (2016) Computing tight bounds via piecewise linear functions through the example of circle cutting problems. Mathematical Methods of Operations Research 84(1):3–57.

    Rebennack S, Kallrath J (2015) Continuous piecewise linear delta-approximations for bivariate and multivariate functions. Journal of Optimization Theory and Applications 167(1):102–117.

    Rebennack S, Krasko V (2020) Piecewise linear function fitting via mixed-integer linear programming. INFORMS Journal on Computing, to appear.

    Toriello A, Vielma JP (2012) Fitting piecewise linear continuous functions. European Journal of Operational Research 219(1):86–95.

    Wagner AK, Soumerai SB, Zhang F, Ross-Degnan D (2002) Segmented regression analysis of interrupted time series studies in medication use research. Journal of Clinical Pharmacy and Therapeutics 27(4):299–309.

    [Figure 4, three panels: (a) Maximum Absolute Difference, (b) Sum of Absolute Differences, (c) Sum of Squared Differences. Each panel plots # Nodes Explored and Tree Size (MB) against the number of breakpoints, on logarithmic vertical axes, for Formulation (1) and Formulation (2).]

    Figure 4: Number of nodes explored and branch-and-bound tree size for the MpStorage50 data set with three distance metrics.

    [Figure 5, three panels: (a) Maximum Absolute Difference, (b) Sum of Absolute Differences, (c) Sum of Squared Differences. Each panel plots # Nodes Explored and Tree Size (MB) against the number of breakpoints, on logarithmic vertical axes, for Formulation (1) and Formulation (2).]

    Figure 5: Number of nodes explored and branch-and-bound tree size for the DebrisFlow data set with three distance metrics.
