
Input Invex Neural Network

Suman Sapkota∗
NAAMII, Nepal
suman.sapkota@naamii.org.np

Binod Bhattarai
Imperial College London, UK
b.bhattarai@imperial.ac.uk

Abstract

In this paper, we present a novel method to constrain invexity on Neural Networks (NN). Invex functions ensure that every stationary point is a global minimum. Hence, gradient descent commenced from any point will lead to the global minimum. Another advantage of invexity on NN is the ability to divide the data space locally into two connected sets with a highly non-linear decision boundary by simply thresholding the output. To this end, we formulate a universal invex function approximator and employ it to enforce invexity in NN. We call it the Input Invex Neural Network (II-NN). We first fit the data with a known invex function, then modify it with a NN, compare the direction of the gradient, and penalize the direction of the gradient on the NN if it contradicts the direction of the reference invex function. To penalize the direction of the gradient we perform Gradient Clipped Gradient Penalty (GC-GP). We applied our method to existing NNs for both image classification and regression tasks. From extensive empirical and qualitative experiments, we observe that our method gives performance similar to an ordinary NN while retaining invexity. Our method outperforms a linear NN and the Input Convex Neural Network (ICNN) by a large margin. We publish our code and implementation details at github2.

1 Introduction

Neural Networks (NN) have been a go-to Machine Learning algorithm for various tasks such as Image Classification [23, 32, 16], Natural Language Processing [33, 9], Audio Synthesis [26], Generative Modelling [12, 30, 22], Object Detection [29] and Image Segmentation [17]. It has been well studied that a single layer Neural Network can approximate any function given enough neurons [8]. However, it is difficult to constrain a Neural Network to have desirable properties such as being a) K-Lipschitz bounded, b) invertible, c) convex, d) quasi-convex or e) invex. All modern deep neural networks use some form of regularization [18, 25, 13] to achieve the required properties during training.

Multiple previous works constrain a Neural Network to have a K-Lipschitz constant [14, 25, 28, 13]. Invertible Neural Networks, which are used widely in Normalizing Flows [30], are made possible by methods such as Invertible Residual Networks [3] and i-RevNet [20]. The Input Convex Neural Network (ICNN) [1] provides an elegant solution for constraining a Neural Network to output a convex function. There has been no previous work on constraining a Neural Network to output a quasi-convex or invex function. Checking invexity of a Neural Network is also not straightforward, mainly because of the difficulty of defining an invex function in terms of a constraint that a function must satisfy. In this paper, we propose a novel method to construct invex functions with Neural Networks. Our method relies heavily on projected gradient regularization, which could not be modelled by previous K-Lipschitz constraint methods. We discuss this in Section 3.

∗ https://tsumansapkota.github.io/about/
2 https://github.com/tsumansapkota/Input-Invex-Neural-Network

Preprint. Under review.

arXiv:2106.08748v1 [cs.LG] 16 Jun 2021


Figure 1: Different types of set according to the decision boundary in continuous space. (a) Convex sets have all points inside a convex decision boundary; a straight line connecting any two points in the set also lies inside the set. Here, A is a convex set and B is a non-convex set. (b) Connected sets have continuous space between any two points within the set; any two points in a connected set can be connected by a curve that also lies inside the set. Here, both A and B are connected sets. (c) Disconnected sets are the opposite of connected sets; not every pair of points in a disconnected set can be connected by a curve that lies inside the set. Here, A is a disconnected set and B is a connected set. (d) The same decision boundary as in (c), represented by multiple connected sets. The disconnected set A in (c) is a union of the connected sets A and C in (d). Here, A, B and C are all connected sets. However, A ∪ C is a disconnected set, while B ∪ C is still a connected set.

Motivation: Without loss of generality, we tend to form local clusters in the data space while working with decision boundaries (clustering, classification or sets). This is due to the relative ease of understanding a single connected set as a single entity or class. The concept of locality is not new. One of the classical works [5] highlights the benefit of local learning over global learning in Machine Learning. The idea is to learn local networks from neighbourhood data. The usefulness of local learning [5, 6, 31] was explored much earlier using classical kernel methods. It can be inferred that a local decision boundary is significant for local learning.

In Figure 1, we compare different types of decision boundaries of sets. From the figure, we can observe that the convex set is inferior to the connected set and that a union of connected sets can represent any disconnected set. This naturally raises the question of how a Neural Network can approximate such connected sets or connected decision boundaries.

The quest for the Input Invex Neural Network is highly motivated by the inability of existing methods to create a highly non-linear local decision boundary (or connected set) for splitting the data. Either the solution is too simple (linear classifier), unsatisfactory (convex function classifier), or too complex, giving a decision boundary that creates disconnected sets (Neural Networks). This is shown graphically in Figure 2. Although a convex function gives a non-linear yet local decision boundary, it can only create a convex set (its lower contour set), and this is not the general decision boundary of a connected set.

Our goal is to find a binary connected-set classifier, for which a generalized convex set is a must. In this study, we identify that a general local decision boundary is obtained by thresholding a generalised convex function called the invex function [15, 4]. Referring to Figure 2, if we carefully observe the lower and upper contour sets of invex functions, we find that both are connected sets.

Properties of Generalised Convex Functions: The invex function and the quasi-convex function are generalizations of the convex function. These functions inherit a few common characteristics of the convex function, as summarized in Table 1. We can see that convex, quasi-convex and invex functions all have only global minima. Although the quasi-convex function does not have an increasing first derivative, it still has a convex set as its lower contour set. As mentioned before, thresholding an invex function gives two connected sets, which is the general form of a connected binary decision boundary. Such a decision boundary is more interpretable, as it has only two sets, and a more complex disconnected set can be formed by a union of multiple connected sets. We present examples of invex, quasi-convex and ordinary functions in Figure 2; their class decision boundaries/sets are compared in Table 1.

Recognising the importance of the invex function for connected/local decision boundaries, we aim to implement it in existing Neural Networks. Hence, to make a general local decision boundary with Neural Networks, the Input Invex Neural Network (II-NN) is needed. To achieve this, we begin with a simple invex function, such as a cone, and modify it with a Neural Network while preserving invexity. To regularize the gradient so as to retain invexity during training of the Neural Network, we apply Gradient Clipped Gradient Penalty (GC-GP). The details of this method are elaborated in Section 3.


Function Type     | Only Global Minima | Increasing First Derivative | Set (−∞, θ)
Convex            | ✓                  | ✓                           | convex
Quasi-Convex      | ✓                  | ✗                           | convex
Invex             | ✓                  | ✗                           | connected
Ordinary function | ✗                  | ✗                           | disconnected

Table 1: Comparison of property of Convex, Quasi-Convex, Invex and Ordinary Function.

[Figure 2 panels: (a) Quasi-Convex Function, (b) Invex Function, (c) Ordinary Function (3D plots); (d) Quasi-Convex Function, (e) Invex Function, (f) Ordinary Function (contour plots).]

Figure 2: 3D plot (top row) and contour plot (bottom row) of Quasi-Convex, Invex and Ordinary functions. The global minimum (red star) is plotted in the Convex and Invex functions. Contour plots at different levels show the decision boundary made by each class of functions. Zoom in the diagram for details.

Overall, we summarize the contributions of this paper as follows:

1. We introduce the invex function in Neural Networks. To the best of our knowledge, this is the first work introducing the invex function in a Neural Network.

2. We develop a theoretical foundation to construct an invex function by constraining the gradient.

3. We formulate Gradient Clipped Gradient Penalty (GC-GP) for constraining the input gradient, which overcomes the drawbacks of previous methods.

4. We experiment on both classification and regression tasks. Empirical and qualitative insights show the comparable performance of our method with ordinary neural networks and its superiority over convex neural networks.

2 Background

Our construction of the invex function is highly inspired by the projected input gradient constraint and partially by convex and quasi-convex functions as well. In this section, we discuss some representative previous works on convex and K-Lipschitz constraints which are closely related to our work.

One of the recent works, the Input Convex Neural Network (ICNN) [1], presents a novel way to construct a convex function with a Neural Network. It uses only non-decreasing activation functions along with all-positive weights, except for the weights connected to the input. The drawbacks of this method are those inherited from the convex function, which we discussed in the Introduction. Other classical convex functions include the convex cone and quadratic functions.


The Lipschitz constraint of Neural Networks is another important constraint, which has received much attention since one of the seminal works on the Wasserstein Generative Adversarial Network (WGAN) [2]. WGAN-GP [14] regularizes the Lipschitz constant at the data points to be some value (e.g. K = 1). This method has two drawbacks. Firstly, it cannot constrain the gradient to be precisely K-Lipschitz, as the constraint is added to the loss term and enforced via gradient descent; this is shown in the experiment section in Table 2. The reason is that, for many training examples, the gradient from the criterion opposes the gradient from the gradient penalty, which prevents the desired Lipschitz constant from being reached. Secondly, the constraint adds a loss even when the local Lipschitz constant at some points is less than desired, although by the definition of the Lipschitz constant the local Lipschitz value can be anything below the specified maximum K. Similarly, WGAN-LP [28] regularizes only if the local Lipschitz constant at a point is greater than the specified K. This solves the second problem of WGAN-GP, yet it shares its first drawback. Spectral Normalization [25] is another robust method for constraining the Lipschitz constant of a Neural Network. It constrains the Neural Network at the functional level, i.e. it constrains the weights. The problem with this method is that it constrains the upper bound of the Lipschitz constant globally.

Although WGAN-GP and WGAN-LP constrain the function globally, they can be modified to constrain it locally as well. Since these methods constrain the gradient of the output w.r.t. the input, they can constrain each input point to a different magnitude of the gradient, i.e. to a specific local Lipschitz constant. Spectral Normalization [25] cannot constrain the local K-Lipschitz constant or the input gradient. To construct an invex function, we require a guaranteed local K-Lipschitz constraint or input gradient constraint, as shown in Proposition 1.
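As a concrete illustration of the data-point-wise penalties discussed above, the following is a minimal sketch (our own, not the exact implementation of [14] or [28]) of a one-sided Lipschitz penalty in PyTorch: the input-gradient norm is computed per sample and penalized only where it exceeds the target constant K, in the spirit of WGAN-LP.

import torch

def lipschitz_penalty(model, x, K=1.0):
    # One-sided (LP-style) penalty: only gradient norms above K are penalized.
    x = x.clone().requires_grad_(True)
    y = model(x)
    grad_x, = torch.autograd.grad(y.sum(), x, create_graph=True)
    grad_norm = grad_x.flatten(1).norm(dim=1)          # per-sample input-gradient norm
    return torch.clamp(grad_norm - K, min=0.0).pow(2).mean()

# Hypothetical usage: loss = criterion(model(x), t) + lam * lipschitz_penalty(model, x, K=1.0)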

Since our method for constructing invex functions depends on a projected-input-gradient-constrained neural network, we also engineer a method to impose such a constraint on Neural Networks. Our method (GC-GP) improves on the drawbacks of previous gradient constraint methods. The details are discussed in Section 3.

3 Methodology

We start this section with a formal definition of an invex function.

Invex Function: A differentiable function f : X → R on a vector space X ⊆ Rn is invex if there exists η : X × X → Rn such that, for all x1, x2 ∈ X,

f(x1) − f(x2) ≥ η(x1, x2) · ∇f(x2)    (1)

All convex and strongly quasi-convex functions are invex functions, but the converse is not true.
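As a small numerical illustration (our own example): for a differentiable convex function, the choice η(x1, x2) = x1 − x2 satisfies inequality (1), so invexity reduces to ordinary convexity.

import numpy as np

def f(x):                 # a convex function: f(x) = ||x||^2
    return float(np.sum(x ** 2))

def grad_f(x):            # its gradient: 2x
    return 2.0 * x

rng = np.random.default_rng(0)
for _ in range(1000):
    x1, x2 = rng.normal(size=3), rng.normal(size=3)
    eta = x1 - x2                                      # eta(x1, x2) = x1 - x2
    assert f(x1) - f(x2) >= eta @ grad_f(x2) - 1e-9    # inequality (1) holds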

Property of the invex function: If f : Rn → R is invex and g : R → R is always increasing, then h = g ◦ f is also invex.

Proposition 1: Let f : X → R and g : X → R be two functions on a vector space X. Let x ∈ X be any point, let x∗ be the minimum of f, and let x ≠ x∗. If f is an invex function (convex or strongly quasi-convex) and if

(∇g(x) · ∇f(x)/‖∇f(x)‖) + ‖∇f(x)‖ > 0,

then h(x) = f(x) + g(x) is an invex function.

Proof: Let us consider a 1D invex function f(x) as shown in Figure 3. If we add some function gi(x) to f(x), we obtain a new function hi(x). The modified function might be invex or not. If it does not change the direction of the gradient at any point with respect to the gradient of the original function f(x), i.e.

∇hi(x) · ∇f(x) > 0    (2)

then hi(x) is an invex function. If the modified function changes the direction of the gradient, it means that there exists a new minimum/maximum that is not the minimum of f(x). If we preserve the above inequality, we keep the position of the minimum intact and allow only those modifications which preserve invexity. Modifications which do not follow the above rule might still yield an invex function: for example, if the modification shifts the function in Figure 3 from left to right, it is still invex, but the direction of the gradient would be opposing at many input points.


Figure 3: A 1D function f(x) with modification by various gi(x), forming modified functions hi(x).

In Figure 3, we use x1 and x2 as example points to show the gradient and the inequality. The modified functions h1(x), h2(x) and h3(x) have the same direction of gradient at all points; hence, they are invex functions. The modified functions h0(x) and h4(x) have a different gradient direction at some points; hence, these functions are not invex. In 1D, the dot product between the original gradient and the gradient of the modified function at each point is positive if the directions are the same. We want all the dot products between gradients to be positive in order to preserve invexity while changing the non-linearity of the function. We ignore the inequality at the minimum (x∗) because the gradient there is zero and does not follow the inequality.

This argument holds for 2D and N-D functions as well; the visual proof for 2D is shown in Appendix A. The main goal is to change the existing invex function such that the modified gradient still leads to the same global minimum. If the projected gradient of the modified function ∇hi(x) in the direction of ∇f(x) is positive at all points, no new maximum/minimum is created during the modification. This is what preserves invexity.

This statement can be written mathematically as

∇h(x) · ∇f(x)/‖∇f(x)‖ > 0    (3)

Using h(x) = f(x) + g(x) and solving, we get

(∇g(x) · ∇f(x)/‖∇f(x)‖) + ‖∇f(x)‖ > 0    (4)

Proposition 2: Let g : X → R be a function on a vector space X, let x ∈ X be any point, let x∗ be its minimum, and let x ≠ x∗. If

∇g(x) · (x − x∗)/‖x − x∗‖ > 0,

then g is an invex function.

Proof: Let us consider a cone as the initial invex function in Proposition 1. Take the cone of the form f(x) = a‖x − x∗‖, where x∗ is the center/tip of the cone and a is a scale factor. The unit vector of the gradient of the cone is given by

∇̂f = ∇f(x)/‖∇f(x)‖ = (x − x∗)/‖x − x∗‖    (5)

Moreover, let us consider the cone as a generalized function: when a → 0, f(x) → 0 but ∇̂f remains the same. The magnitude of this generalized function is zero at all points, but the direction of the gradient points away from the minimum (the tip of the cone). Using this in Proposition 1, we get

∇g(x) · (x − x∗)/‖x − x∗‖ > 0    (6)
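The condition of Proposition 2 can be checked empirically at sampled points; this is also how we later verify invexity on data and random points (Appendix F). Below is a minimal sketch with PyTorch autograd, assuming a hypothetical scalar-output network g and a center x_star.

import torch

def invexity_violation_rate(g, x, x_star):
    # Fraction of points where grad g(x) . (x - x*) / ||x - x*|| <= 0,
    # i.e. where the Proposition 2 condition is violated.
    x = x.clone().requires_grad_(True)
    y = g(x)
    grad_x, = torch.autograd.grad(y.sum(), x)
    direction = x - x_star
    direction = direction / direction.norm(dim=1, keepdim=True)
    pg = (grad_x * direction).sum(dim=1)               # projected gradient
    return (pg <= 0).float().mean().item()

# Hypothetical usage:
# g = torch.nn.Sequential(torch.nn.Linear(2, 10), torch.nn.LeakyReLU(), torch.nn.Linear(10, 1))
# print(invexity_violation_rate(g, torch.randn(1000, 2), torch.zeros(1, 2)))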

The method of constructing an invex function described above cannot construct every invex function. Hence, there is a need for a universal invex function constructor.


Figure 4: Left: Pipeline for the Basic II-NN; the corresponding pseudo-code is in Algorithm 1. Right: Functions used for output gradient-clipping (GC) and projected gradient-penalty (GP), where the x-axis is the projected-gradient value. The equations of these functions are in Appendix B.

Proposition 3: Modifying an invex function as in Proposition 1 for N iterations with non-linear g(x) can approximate any invex function.

Proof: The invex function according to Proposition 2 is very simple and cannot approximate all invex functions, whereas Proposition 1 requires an invex function to start from and modify. We can build a basic invex function using Proposition 2 and modify it using Proposition 1 for multiple iterations, making an increasingly complex function that approximates the required invex function. Hence, iteratively modifying the invex function according to our method can approximate any invex function.

Gradient Clipped Gradient Penalty (GC-GP): This is our method to constrain the input gradient, which is later used to construct the invex neural network using Propositions 1, 2 and 3.

Gradient Penalty: To penalize points that violate the local Lipschitz constraint or the projected gradient constraint, we modify the Lipschitz penalty [28]. We create a smooth penalty based on the function shown in Figure 4 and apply the smooth-L1 loss function on top of it. This gives the optimizer a smooth gradient and helps optimization. In our II-NN, it is modified to regularize the projected gradient as well.

Gradient Clipping: To make training stable and to prevent the criterion gradient from opposing the projected-gradient or local K-Lipschitz penalty, we construct a smooth gradient clipping function, also shown in Figure 4. The clipping is applied at the output layer before back-propagating the gradients. We clip the gradient of the function to near zero when the K-Lipschitz or projected-gradient constraint is being violated at that point. This helps to avoid criterion gradients opposing our constraint.

Combining these two methods allows us to achieve a very accurate gradient constraint on the neural network. Our method cannot guarantee that the learned function has the desired gradient property, such as a local K-Lipschitz constant or projected gradient, exactly as specified. However, it enforces the desired property with near-perfect accuracy, as shown experimentally in Table 2. Moreover, we can easily verify the constraint at any given point.

Input Invex Neural Network (II-NN): A Neural Network y = f(x|θ) can be constrained to be an invex function w.r.t. its input using the constraints in Propositions 1, 2 and 3. Constraining the input gradient is made possible by our GC-GP method.

The simplest Input Invex Neural Network (Basic II-NN) is given by our Proposition 2. It can be written as y = fII−NN(x|θ, x∗), where θ and the center x∗ are the learnable parameters. The pipeline for the Basic II-NN is shown in Figure 4 and the pseudo-code in Algorithm 1. An II-NN can also be constructed using Propositions 2 and 3: a simple II-NN is modified by another constrained NN to modify the invexity (towards a more non-linear surface). The algorithms for modifying an II-NN are in Appendices C and D.

Functions constructed by the II-NN in 1D and 2D are shown in Figure 3 and Figure 2, respectively. We compare the functions learned by the invex neural network with those of convex and ordinary neural networks in the Experiments section.

Connected Set Decision Boundary: Although any function can be used for clustering and regression, we choose the invex function for clustering.


Algorithm 1: Basic II-NN (Proposition 2) - PyTorch like pseudocode

# X, t is the dataset with m elements.
# f_iinn is a model with neural network parameters and a center parameter.
# lamda is the scaling parameter for the projected gradient penalty.
# f_pg_scale and f_out_clip are the gradient penalty and
# output_gradient clipper respectively, as shown in Figure 4.
for step in range(STEPS):
    y = f_iinn(X)                                               #1
    grad_X = torch.autograd.grad(y, X)                          #2
    grad_center = (X - center)
    grad_center = grad_center / torch.norm(grad_center, dim=1)
    pg = torch.bmm(grad_X.reshape(m, 1, -1),                    # bmm: batch matrix multiply
                   grad_center.reshape(m, -1, 1)).reshape(-1, 1)  #3
    pgp = f_smoothl1(f_pg_scale(pg)).mean() * lamda             #4
    pgp.backward_to(f_iinn.parameters())                        #5
    del_y = criterion(y, t).backward_to(y)                      #6
    clip_val = f_out_clip(pg)
    del_y_clip = del_y.clip(-clip_val, clip_val)                #7
    del_y_clip.backward_to(f_iinn.parameters())                 #8
    optimizer.step()

This is because of the locally connected decision boundary of the cluster formed from it, which is easy to interpret as well as highly non-linear. Let x be the input; then we can construct a binary cluster/classifier/set decision boundary as follows.

ycluster = σ(−fII−NN (x))

Here, σ(x) is a threshold function or the sigmoid function. The decision boundary created by our method compared with other methods is shown in Figure 2. The order of complexity of the learned decision boundary is: Logistic Regression < ICNN < II-NN < (Ordinary) Neural Network.
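For illustration, a minimal sketch of this thresholding, assuming f_iinn is an already trained II-NN that outputs one scalar per input:

import torch

def connected_set_membership(f_iinn, x, threshold=0.5):
    # sigma(-f(x)) is high inside the (connected) lower contour set of the invex function.
    with torch.no_grad():
        y_cluster = torch.sigmoid(-f_iinn(x))
    return y_cluster, (y_cluster > threshold)          # soft score and hard set membership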

4 Experiments

Gradient Constraint: Firstly, we experiment on three toy datasets to compare the Lipschitz constraint of various methods on regression and classification: two 2D regression datasets and one 2D classification dataset. The Regression 1 and 2 datasets consist of 2500 and 5625 grid points respectively, whereas the Classification 1 dataset is a 2D spiral dataset consisting of 400 data points. For regression we report Mean Squared Error (MSE), and for classification we report Binary Cross Entropy (BCE) and accuracy on the training data. Details of the toy datasets and detailed experimental results are in Appendix E. We compare the constraint methods Gradient-Penalty (GP) [14], Lipschitz-Penalty (LP) [28], Spectral Normalization (SN) [25] and our method (Gradient Clipped Gradient Penalty, GC-GP) for a 1-Lipschitz function. The metrics and the Lipschitz constant of the learned function are shown in Table 2.

We use a scaled sigmoid function for the final layer of the Spectral Normalized Neural Network during classification. The sigmoid is scaled by 4 so that its Lipschitz constant is 1; thus the Lipschitz constant of the overall function is unchanged by the sigmoid activation. The experiments in Table 2 are conducted on a Neural Network with configuration (2, 10, 10, 1), where 2 and 1 are the input and output dimensions respectively. We use the ELU [7] activation for regression and the LeakyReLU [24] activation for classification in the intermediate layers, with a sigmoid in the final layer for classification. We train each model for 7500 epochs with full batch, using the soft-L1 loss for penalizing the gradient norm (or K-Lipschitz constant).
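A minimal sketch of the (2, 10, 10, 1) network described above (ELU for regression, LeakyReLU with a sigmoid output for classification); the training loop and penalty settings are those of Appendix E, and the helper name is ours.

import torch.nn as nn

def lipschitz_experiment_net(task="regression"):
    # (2, 10, 10, 1): 2 inputs, two hidden layers of 10 units, 1 output.
    act = nn.ELU if task == "regression" else nn.LeakyReLU
    layers = [nn.Linear(2, 10), act(), nn.Linear(10, 10), act(), nn.Linear(10, 1)]
    if task == "classification":
        layers.append(nn.Sigmoid())                    # final sigmoid for BCE
    return nn.Sequential(*layers)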

Invex Neural Network: We experiment on toy datasets (see Appendix F) to test the capacity of the Invex Neural Network: first on a regression dataset and then on a classification dataset. We use a 2D spiral dataset for classification, which has binary connected sets as classes and is suitable for testing the invex function. On bigger datasets, we experiment on MNIST using both MLP and CNN architectures, and on the Fashion-MNIST [34] dataset using a CNN architecture. We compare the performance of the invex function with Linear, Convex and Ordinary Neural Networks in Table 3.


Dataset          | Method               | Loss (Accuracy %) | Lipschitz Constant (K) | Minimum Gradient Norm
Regression 1     | GP (λ = 1)           | 0.06391           | 1.45713                | 0.28021
Regression 1     | GP (λ = 3)           | 0.10442           | 1.18923                | 0.62169
Regression 1     | LP (λ = 1)           | 0.05524           | 1.25629                | 0.02150
Regression 1     | LP (λ = 3)           | 0.06310           | 1.18670                | 0.01166
Regression 1     | SN                   | 0.09121           | 0.99785                | 0.12192
Regression 1     | GC-GP (λ = 1) (Ours) | 0.08185           | 0.92162                | 0.02632
Regression 1     | GC-GP (λ = 3) (Ours) | 0.08708           | 0.87693                | 0.02011
Regression 2     | GP (λ = 1)           | 0.01715           | 1.35935                | 0.25872
Regression 2     | GP (λ = 3)           | 0.02131           | 1.13570                | 0.47666
Regression 2     | LP (λ = 1)           | 0.00524           | 1.13546                | 0.01011
Regression 2     | LP (λ = 3)           | 0.00546           | 1.07079                | 0.01466
Regression 2     | SN                   | 0.01774           | 0.45701                | 0.006839
Regression 2     | GC-GP (λ = 1) (Ours) | 0.00720           | 0.89899                | 0.01888
Regression 2     | GC-GP (λ = 3) (Ours) | 0.00757           | 0.86108                | 0.00964
Classification 1 | GP (λ = 1)           | 0.11239 (100.0)   | 1.74431                | 0.37148
Classification 1 | GP (λ = 3)           | 0.19764 (99.25)   | 1.51147                | 0.03693
Classification 1 | LP (λ = 1)           | 0.01828 (100.0)   | 1.37528                | 0.0
Classification 1 | LP (λ = 3)           | 0.05433 (100.0)   | 1.80411                | 1.72 × 10−7
Classification 1 | SN                   | 0.42814 (76.5)    | 0.85739                | 0.22571
Classification 1 | GC-GP (λ = 1) (Ours) | 0.29447 (98.0)    | 1.04589                | 0.13088
Classification 1 | GC-GP (λ = 3) (Ours) | 0.32096 (96.0)    | 0.97788                | 0.14575

Table 2: Comparison between various K-Lipschitz / gradient constraint methods. The bold text represents the methods satisfying 1-Lipschitz, and the blue highlight the best method that satisfies 1-Lipschitz. We can see that SN constrains the K-Lipschitz constant very well but falls far behind in loss/accuracy compared to other methods. GP and LP do not exactly constrain the K-Lipschitz constant to the required value, but achieve low loss and high accuracy. Our method does better in both scenarios at once: it achieves a near-perfect Lipschitz constant while having low loss. For each method, the best among 3 seeds is reported. Details are in Appendix E.

The models for the toy datasets are trained for 5000 epochs with full batch, whereas the models for MNIST and F-MNIST are trained for 20 epochs with a batch size of 50.

For toy regression and classification we use networks with configurations (2, 10, 10, 1) and (2, 100, 100, 1) respectively. For the MLP on MNIST, our configuration is (784, 200, 100, 1). In these configurations, the first and last numbers represent the input and output dimensions, and the rest represent the dimensions of the hidden layers. For the CNNs on MNIST and F-MNIST we use configurations (1C, 16C, 32C, GAP, 1) and (1C, 32C, 32C, GAP, 1), read as: input channels, hidden channels, Global Average Pooling (GAP) and output unit. We train a binary classification model for each class of MNIST and F-MNIST and apply argmax over the per-class probabilities for multi-class classification. The per-class datasets are balanced during sampling. All convolutional layers have kernel 5, stride 2 and padding 1. We use the ELU activation for regression and LeakyReLU for the rest, as sketched in the example below.
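A minimal sketch of the per-class binary CNN described above for MNIST, i.e. channels (1C, 16C, 32C, GAP, 1) with kernel 5, stride 2 and padding 1; the linear head and the F-MNIST variant (1C, 32C, 32C, GAP, 1) are our reading of that configuration string.

import torch.nn as nn

class BinaryCNN(nn.Module):
    def __init__(self, channels=(1, 16, 32)):
        super().__init__()
        c0, c1, c2 = channels
        self.features = nn.Sequential(
            nn.Conv2d(c0, c1, kernel_size=5, stride=2, padding=1), nn.LeakyReLU(),
            nn.Conv2d(c1, c2, kernel_size=5, stride=2, padding=1), nn.LeakyReLU(),
            nn.AdaptiveAvgPool2d(1),                   # Global Average Pooling (GAP)
        )
        self.head = nn.Linear(c2, 1)                   # single output unit per class

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

# One binary model per class; multi-class prediction is argmax over the 10 per-class probabilities.
# mnist_model = BinaryCNN()                            # (1C, 16C, 32C, GAP, 1)
# fmnist_model = BinaryCNN(channels=(1, 32, 32))       # (1C, 32C, 32C, GAP, 1)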

On the Classification 1 dataset, MNIST and Fashion-MNIST, since the data are not evenly spaced, we use random input points to check and penalize the function where the constraint is not followed. This helps to constrain the function to be invex in regions where data points are unavailable.

We cannot be certain whether a function is invex in a high-dimensional space like MNIST. To verify that our function is invex, we test the constraint on all training and test data points as well as on 1 million random points. We find that on F-MNIST our classifiers follow our invexity rule on > 99% of data points and > 99% of random points. On the MNIST dataset, for both the MLP and CNN architectures, some classifiers have a lower percentage of points that follow our invexity rule (see Appendix F).

From Table 3 we can see that the performance of the II-NN is higher than that of the Convex network (ICNN) and lower than or similar to that of the Ordinary Neural Network. On the Classification 1 dataset, composing the invex function helps produce better results. This is mainly due to the simplicity of the Basic Invex NN, which is implemented from Proposition 2; the complexity is increased by using a Composed Invex function on top of the Basic Invex function, according to Proposition 1. This result supports our Proposition 3 as well.


Architecture            | Regression 1 (MLP) | Classification 1 (MLP) | MNIST (MLP)   | MNIST (CNN) | F-MNIST (CNN)
Linear/Logistic         | 0.13971            | 72.0                   | 90.79         | -           | -
Convex                  | 0.01887            | 82.5                   | 96.83         | 94.68       | 81.27
Ordinary                | 0.00035            | 100.0                  | 97.09         | 98.08       | 87.76
Basic Invex (Ours)      | 0.01404            | 96.25 (λ = 0.5)        | 97.61 (λ = 2) | 97.85       | 87.8
Invex (composed) (Ours) | -                  | 100.0 (λ = 2)          | -             | -           | -

Table 3: Accuracy / loss on various datasets. For visualization on the synthetic datasets Regression 1 and Classification 1, see Appendix F. The bold numbers represent the best results and the blue numbers the second-best results. The Regression 1 column reports MSE; the remaining columns report accuracy in percentage.

For all experiments we use the Adam [21] optimizer with a learning rate of 0.005. The experiments are conducted in the PyTorch [27] framework on an Intel i7-5500U CPU.

5 Conclusion, Limitations, and Future Work

We introduced a method for constructing a new type of Neural Network called the Input Invex Neural Network (II-NN). We formulated GC-GP for constraining the gradient of the Neural Network to make it invex. Experiments show that our constraining method is better than previous methods. We also show useful properties of the II-NN and compare its performance with the Input Convex Neural Network (ICNN) [1] and an Ordinary Neural Network on classification and regression problems. We show its superiority over ICNN for creating local decision boundaries.

Using Invex Neural Networks we can construct any connected decision boundary in N-dimensional space. We can also approximate any function as an invex function, which is the best assumption for global optimization problems.

Our method (GC-GP), however, does not guarantee the Lipschitz constant or invexity due to its data-point-dependent regularization, unlike Spectral Normalization which constrains the weights directly. Our method of constructing an II-NN can approximate any invex function only after multiple iterations of composition.

Although we use GC-GP for constructing invex functions, it can be used for other tasks, such as in WGAN [2], or modified for Invertible Residual Networks [3]. In future work, II-NN will be used to incorporate local learning in Neural Networks.

6 Broader Impact

Our work introduces the concept of invexity to Neural Networks. We also show that a connected set on X ⊆ RN can be constructed by thresholding an invex function. This property is useful for local explainability and local clustering of data points. This type of decision boundary can be used to improve the decision-making capacity of decision nodes in Soft Decision Trees [19, 11] and of individual neurons of a Neural Network.

We believe that local clustering will help avoid problems of global end-to-end frameworks, such as global data-space models. We can simplify a solution by using local models, which are simpler. It will enable researchers to improve upon a small part of the global data space without affecting or retraining on the larger data space. This will help researchers scale up models as well as save compute by focusing on local models.

An invex function has a single global minimum. Hence, an invex function as a neuron will represent a single mode of the data (the one that maximizes the activation). The invex function has a large scope in the fields of optimization and local decision boundaries. Having a more robust method of constructing invex functions with Neural Networks will certainly be useful in future works.

Since our work is methodological and relies solely on synthetic and simple data for experimental purposes, we believe it does not raise ethical concerns or negative future societal consequences.


References

[1] B. Amos, L. Xu, and J. Z. Kolter. Input convex neural networks. In International Conference on Machine Learning, pages 146–155. PMLR, 2017.
[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223. PMLR, 2017.
[3] J. Behrmann, W. Grathwohl, R. T. Chen, D. Duvenaud, and J.-H. Jacobsen. Invertible residual networks. In International Conference on Machine Learning, pages 573–582. PMLR, 2019.
[4] A. Ben-Israel and B. Mond. What is invexity? The ANZIAM Journal, 28(1):1–9, 1986.
[5] L. Bottou and V. Vapnik. Local learning algorithms. Neural Computation, 4(6):888–900, 1992.
[6] W. S. Cleveland and S. J. Devlin. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association, 83(403):596–610, 1988.
[7] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
[8] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
[9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[10] C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia. Incorporating second-order functional knowledge for better option pricing. Advances in Neural Information Processing Systems, pages 472–478, 2001.
[11] N. Frosst and G. Hinton. Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784, 2017.
[12] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.
[13] H. Gouk, E. Frank, B. Pfahringer, and M. J. Cree. Regularisation of neural networks by enforcing Lipschitz continuity, 2020.
[14] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/892c3b1c6dccd52936e27cbd0ff683d6-Paper.pdf.
[15] M. A. Hanson. On sufficiency of the Kuhn-Tucker conditions. Journal of Mathematical Analysis and Applications, 80(2):545–550, 1981.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[17] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[18] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.
[19] O. Irsoy, O. T. Yıldız, and E. Alpaydın. Soft decision trees. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR 2012), pages 1819–1822. IEEE, 2012.
[20] J.-H. Jacobsen, A. Smeulders, and E. Oyallon. i-RevNet: Deep invertible networks. arXiv preprint arXiv:1802.07088, 2018.
[21] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[22] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097–1105, 2012.
[24] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, page 3. Citeseer, 2013.
[25] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks, 2018.
[26] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.


[27] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf.
[28] H. Petzka, A. Fischer, and D. Lukovnicov. On the regularization of Wasserstein GANs, 2018.
[29] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[30] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538. PMLR, 2015.
[31] D. Ruppert and M. P. Wand. Multivariate locally weighted least squares regression. The Annals of Statistics, pages 1346–1370, 1994.
[32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
[34] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.


Appendices

A Invex function proof on 2D

Proof: Consider a simple invex function f(x) as shown in Figure 5; it is a simple modification of the sigmoid function. We can make it more non-linear by adding another function g(x). If adding g(x) does not change the direction of the projected gradient at any point, then the modified function h(x) is an invex function.

∇h(x) · ∇f(x) > 0 (7)

Figure 5: Initial invex function.

We present 4 modifications to the function in Figure 5, which show the various cases needed to prove our proposition. We choose 5 points (a, b, c, d and e) to show whether the condition is satisfied or not by each modification.

Modification 1: Figure 6 shows h(x) = f(x) + g(x). The modification has the same global minimum, i.e. no new minimum/maximum has been created. The modified function satisfies the condition in Equation 7 at all points: the projected gradients all have positive values. It can also be seen from the contour plot that there are no new minima or maxima. Hence, the modification is still an invex function.

Figure 6: Modified invex function 1, also an invex function.


Modification 2: The modification shown in Figure 7 is similar to Modification 1. The modified function is still an invex function, for the same reason as in Modification 1.

Figure 7: Modified invex function 2, also an invex function.

Modification 3: The modification shown in Figure 8 is not an invex function. It has a new maximum compared to the original function, as can be seen in the contour plot. This is reflected by the violation of Equation 7: at point a, the projected gradient of the modified function h(x) on f(x) is negative. Since it violates our condition, we cannot be sure that it is an invex function. Hence, our condition is violated when there is a new minimum/maximum, which is detected by the projected gradient.

Figure 8: Modified invex function 3, not an invex function.


Modification 4: The modification shown in Figure 9 is also not an invex function. It has a new minimum compared to the original function, which can be seen in the contour plot as well. The modified function violates the Equation 7 constraint at point d (and points around it); hence it may not be an invex function, and in this case it is not. However, we cannot conclude that a function is not invex solely because it violates Equation 7.

Figure 9: Modified invex function 4, not an invex function.


B Gradient Clipped Gradient Penalty (GC-GP)

While clipping the output gradient using the projected gradient pg, we use the following equation:

fout_clip(pg) = fsoftplus(20 · pg) / 20        if pg < 0.14845
                3 · pg − 0.0844560006          otherwise

While penalizing a projected gradient that violates the condition, we use the following equation:

fpg_scale(pg) = −fsoftplus(−20 · (pg − 0.1)) / 4

The smooth-L1 function is given by:

fsmooth−l1(x) = 0.5 x² / β      if |x| < β
                |x| − 0.5 β     otherwise

By default, we use β = 1.0. We use the softplus [10] function for a smooth transition of the clipping value.
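A direct PyTorch transcription of the three functions above (a sketch; the constants are taken verbatim from the equations, and the function names match the pseudocode in Algorithms 1-3):

import torch
import torch.nn.functional as F

def f_out_clip(pg):
    # Smooth clipping value for the output gradient, as a function of the projected gradient pg.
    return torch.where(pg < 0.14845,
                       F.softplus(20.0 * pg) / 20.0,
                       3.0 * pg - 0.0844560006)

def f_pg_scale(pg):
    # Smooth scaling of projected gradients that violate the constraint (fed to smooth-L1).
    return -F.softplus(-20.0 * (pg - 0.1)) / 4.0

def f_smoothl1(x, beta=1.0):
    # Smooth-L1 function applied on top of f_pg_scale for the penalty.
    return torch.where(x.abs() < beta, 0.5 * x ** 2 / beta, x.abs() - 0.5 * beta)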

C Algorithm for Modifying II-NN

An II-NN can be constructed according to Proposition 1: the parameters of an existing invex function are frozen and the function is modified by adding a new Neural Network to the existing invex function or II-NN. The corresponding pseudocode is in Algorithm 2.

Algorithm 2: Modified II-NN - PyTorch like pseudocode

# X, t is the dataset with m elements.
# g_iinn is a model with neural network parameters and a center parameter.
# f_invex is an invex function (maybe a Neural Network) that is not trained.
# lamda is the scaling parameter for the projected gradient penalty.
# f_pg_scale and f_out_clip are the gradient penalty and
# output_gradient clipper respectively, as shown in Figure 4.
for step in range(STEPS):
    ## 1. Forward pass
    y = g_iinn(X) + f_invex(X)
    ## 2. Gradient of the function
    g_gradX, f_gradX = torch.autograd.grad(y, X)
    ## 3. Projected gradient
    pg = torch.bmm(g_gradX.reshape(m, 1, -1),
                   f_gradX.reshape(m, -1, 1)).reshape(-1, 1)
    ## 4. Compute projected gradient penalty
    pgp = f_smoothl1(f_pg_scale(pg)).mean() * lamda
    ## 5. Compute gradient of parameters
    pgp.backward_to(g_iinn.parameters())
    ## 6. Compute del_y from loss function
    del_y = criterion(y, t).backward_to(y)
    ## 7. Clip del_y using the projected gradient
    clip_val = f_out_clip(pg)
    del_y_clip = del_y.clip(-clip_val, clip_val)
    ## 8. Compute gradient of parameters from the clipped del_y
    del_y_clip.backward_to(g_iinn.parameters())
    ## 9. Update the parameters
    optimizer.step()


D II-NN by guiding with invex function or II-NN

An II-NN can also be trained by scaling the output of an existing invex function to zero. This is similar to Proposition 2, but instead of a generalized cone it uses an II-NN or another invex function. The corresponding pseudocode is in Algorithm 3.

Algorithm 3: Guided II-NN - PyTorch like pseudocode

# X, t is the dataset with m elements.
# g_iinn is a model with neural network parameters and a center parameter.
# f_invex is an invex function (maybe a Neural Network) that guides the
# projected gradient of g_iinn.
# lamda is the scaling parameter for the projected gradient penalty.
# f_pg_scale and f_out_clip are the gradient penalty and
# output_gradient clipper respectively, as shown in Figure 4.
for step in range(STEPS):
    ## 1. Forward pass
    y = g_iinn(X)
    z = f_invex(X)
    ## 2. Gradients of the functions
    g_gradX = torch.autograd.grad(y, X)
    f_gradX = torch.autograd.grad(z, X)
    ## 3. Projected gradient
    pg = torch.bmm(g_gradX.reshape(m, 1, -1),
                   f_gradX.reshape(m, -1, 1)).reshape(-1, 1)
    ## 4. Compute projected gradient penalty
    pgp = f_smoothl1(f_pg_scale(pg)).mean() * lamda
    ## 5. Compute gradient of parameters
    pgp.backward_to(g_iinn.parameters())
    ## 6. Compute del_y from loss function
    del_y = criterion(y, t).backward_to(y)
    ## 7. Clip del_y using the projected gradient
    clip_val = f_out_clip(pg)
    del_y_clip = del_y.clip(-clip_val, clip_val)
    ## 8. Compute gradient of parameters from the clipped del_y
    del_y_clip.backward_to(g_iinn.parameters())
    ## 9. Update the parameters
    optimizer.step()

E Lipschitz Constraint: Experiment

The datasets used for Regression 1, Regression 2 and Classification 1 are shown in Figure 10. We conduct detailed experiments comparing the Lipschitz constraint of various methods in Table 4. It can be observed that our method (GC-GP) achieves relatively high performance while maintaining the Lipschitz constant close to the target (K = 1). On the Classification 1 dataset, our method has a Lipschitz constant close to one and does better with λ = 3. Although GC-GP cannot guarantee the target Lipschitz constant, the experiments show that it achieves a near-perfect constraint.

F II-NN Details: Experiment

The Regression 1 dataset used is the same as in Figure 10. The Classification 1 dataset, along with predictions by the Linear, Convex, Ordinary, Basic Invex and Composed Invex Neural Networks, is shown in Figure 11. We conduct per-class experiments on MNIST using an MLP (Table 5) and a CNN (Table 6), as well as on F-MNIST using a CNN (Table 7). The experiments compare the performance of the Convex, Invex and Ordinary classifiers in all cases, and we also compare with logistic regression on the MNIST dataset.


We also test the percentage of points that follow our invexity rule among all training and test points as well as 1 million random points. It is observed that most of the classifiers follow our invexity rule on > 99% of training, test and random points.

Figure 10: (a) and (c) are the 3D plot and contour plot of the Regression 1 dataset respectively. Similarly, (b) and (d) are the 3D plot and contour plot of the Regression 2 dataset. (e) is the Classification 1 dataset (spiral) used in the Lipschitz-constraint comparison experiment. The source code for the datasets is available in the supplementary material.


Dataset | Method | Seed | Loss (Accuracy %) | Lipschitz Constant (K) | Minimum Gradient Norm | Time (ms)
Regression 1 | GP (λ = 1) | A | 0.06793 | 1.37824 | 0.12656 | 4.12±0.58
Regression 1 | GP (λ = 1) | B | 0.06391 | 1.45713 | 0.28021 |
Regression 1 | GP (λ = 1) | C | 0.07396 | 1.23643 | 0.30802 |
Regression 1 | GP (λ = 3) | A | 0.11041 | 1.22805 | 0.59818 | 4.12±0.60
Regression 1 | GP (λ = 3) | B | 0.11135 | 1.21324 | 0.55712 |
Regression 1 | GP (λ = 3) | C | 0.10442 | 1.18923 | 0.62169 |
Regression 1 | LP (λ = 1) | A | 0.05598 | 1.21801 | 0.01146 | 4.19±0.56
Regression 1 | LP (λ = 1) | B | 0.05524 | 1.25629 | 0.02150 |
Regression 1 | LP (λ = 1) | C | 0.05597 | 1.20679 | 0.01799 |
Regression 1 | LP (λ = 3) | A | 0.06335 | 1.10782 | 0.02662 | 4.17±0.56
Regression 1 | LP (λ = 3) | B | 0.06344 | 1.12563 | 0.00471 |
Regression 1 | LP (λ = 3) | C | 0.06310 | 1.18670 | 0.01166 |
Regression 1 | SN | A | 0.09156 | 1.00014 | 0.12324 | 2.88±0.47
Regression 1 | SN | B | 0.09139 | 0.99901 | 0.12381 |
Regression 1 | SN | C | 0.09121 | 0.99785 | 0.12192 |
Regression 1 | GC-GP (λ = 1) (Ours) | A | 0.08365 | 0.92083 | 0.02074 | 5.60±0.96
Regression 1 | GC-GP (λ = 1) (Ours) | B | 0.08426 | 0.99231 | 0.01201 |
Regression 1 | GC-GP (λ = 1) (Ours) | C | 0.08185 | 0.92162 | 0.02632 |
Regression 1 | GC-GP (λ = 3) (Ours) | A | 0.08874 | 0.88536 | 0.03419 | 6.05±0.77
Regression 1 | GC-GP (λ = 3) (Ours) | B | 0.08708 | 0.87693 | 0.02011 |
Regression 1 | GC-GP (λ = 3) (Ours) | C | 0.08819 | 0.88356 | 0.01996 |
Regression 2 | GP (λ = 1) | A | 0.02060 | 1.19123 | 0.32234 | 6.06±0.81
Regression 2 | GP (λ = 1) | B | 0.01715 | 1.35935 | 0.25872 |
Regression 2 | GP (λ = 1) | C | 0.01786 | 1.31234 | 0.24638 |
Regression 2 | GP (λ = 3) | A | 0.02641 | 1.14505 | 0.540290 | 7.80±0.75
Regression 2 | GP (λ = 3) | B | 0.02131 | 1.13570 | 0.47666 |
Regression 2 | GP (λ = 3) | C | 0.03304 | 1.20034 | 0.57999 |
Regression 2 | LP (λ = 1) | A | 0.00530 | 1.12591 | 0.00758 | 6.13±0.78
Regression 2 | LP (λ = 1) | B | 0.00524 | 1.13546 | 0.01011 |
Regression 2 | LP (λ = 1) | C | 0.00529 | 1.17633 | 0.00312 |
Regression 2 | LP (λ = 3) | A | 0.00571 | 1.14781 | 0.00490 | 6.15±0.84
Regression 2 | LP (λ = 3) | B | 0.00550 | 1.07732 | 0.02155 |
Regression 2 | LP (λ = 3) | C | 0.00546 | 1.07079 | 0.01466 |
Regression 2 | SN | A | 0.01774 | 0.45701 | 0.006839 | 3.27±0.54
Regression 2 | SN | B | 0.01923 | 0.47381 | 0.00768 |
Regression 2 | SN | C | 0.01871 | 0.48544 | 0.00780 |
Regression 2 | GC-GP (λ = 1) (Ours) | A | 0.00759 | 0.89684 | 0.00537 | 7.75±0.75
Regression 2 | GC-GP (λ = 1) (Ours) | B | 0.00720 | 0.89899 | 0.01888 |
Regression 2 | GC-GP (λ = 1) (Ours) | C | 0.00729 | 0.89305 | 0.01133 |
Regression 2 | GC-GP (λ = 3) (Ours) | A | 0.00757 | 0.86108 | 0.00964 | 6.05±0.77
Regression 2 | GC-GP (λ = 3) (Ours) | B | 0.00757 | 0.85756 | 0.01369 |
Regression 2 | GC-GP (λ = 3) (Ours) | C | 0.00826 | 0.85425 | 0.00193 |
Classification 1 | GP (λ = 1) | A | 0.12006 (100.0) | 1.67273 | 0.28320 | 2.26±0.45
Classification 1 | GP (λ = 1) | B | 0.11239 (100.0) | 1.74431 | 0.37148 |
Classification 1 | GP (λ = 1) | C | 0.11637 (100.0) | 1.71859 | 0.26768 |
Classification 1 | GP (λ = 3) | A | 0.20353 (98.0) | 1.66700 | 0.35362 | 2.60±0.46
Classification 1 | GP (λ = 3) | B | 0.23160 (97.25) | 1.49510 | 0.51360 |
Classification 1 | GP (λ = 3) | C | 0.19764 (99.25) | 1.51147 | 0.03693 |
Classification 1 | LP (λ = 1) | A | 0.01828 (100.0) | 1.37528 | 0.0 | 2.66±0.47
Classification 1 | LP (λ = 1) | B | 0.06048 (97.25) | 2.25560 | 0.0 |
Classification 1 | LP (λ = 1) | C | 0.01934 (100.0) | 1.60750 | 0.0 |
Classification 1 | LP (λ = 3) | A | 0.17167 (96.5) | 1.58051 | 0.00492 | 2.66±0.46
Classification 1 | LP (λ = 3) | B | 0.054330 (100.0) | 1.80411 | 1.72 × 10−7 |
Classification 1 | LP (λ = 3) | C | 0.050757 (97.25) | 2.06773 | 0.0 |
Classification 1 | SN | A | 0.43505 (76.0) | 0.79094 | 0.17857 | 2.51±0.45
Classification 1 | SN | B | 0.42814 (76.5) | 0.85739 | 0.22571 |
Classification 1 | SN | C | 0.43819 (76.25) | 0.95832 | 0.19253 |
Classification 1 | GC-GP (λ = 1) (Ours) | A | 0.30689 (98.0) | 1.19644 | 0.18211 | 3.55±0.40
Classification 1 | GC-GP (λ = 1) (Ours) | B | 0.35706 (84.75) | 1.03542 | 0.07420 |
Classification 1 | GC-GP (λ = 1) (Ours) | C | 0.29447 (98.0) | 1.04589 | 0.130878 |
Classification 1 | GC-GP (λ = 3) (Ours) | A | 0.32096 (96.0) | 0.97788 | 0.14575 | 3.55±0.40
Classification 1 | GC-GP (λ = 3) (Ours) | B | 0.35798 (85.25) | 0.96265 | 0.07568 |
Classification 1 | GC-GP (λ = 3) (Ours) | C | 0.33111 (94.75) | 1.07161 | 0.14411 |

Table 4: Comparison between various K-Lipschitz constraint methods. Seeds A, B and C are 147, 258 and 369 respectively. The time taken is measured over all seeds. Our method takes the most time, as it needs to scale the penalty and clip the gradient. Spectral Normalization (SN) takes the least time, as it performs layer-wise iterative normalization and does not depend on the data.


Class  | Logistic | Convex | Invex  | Ordinary | Invexity % (Train+Test) | Invexity % (Train+Test+1M random)
0      | 98.724   | 99.490 | 99.541 | 99.541   | 100.0000 | 100.0000
1      | 98.590   | 99.427 | 99.471 | 99.604   | 99.9971  | 99.9998
2      | 95.155   | 98.401 | 99.079 | 98.983   | 100.0000 | 89.5211
3      | 94.802   | 98.366 | 98.861 | 98.861   | 99.9843  | 73.8646
4      | 96.894   | 98.727 | 99.032 | 99.236   | 99.9971  | 99.9998
5      | 93.330   | 98.430 | 98.711 | 98.599   | 100.0000 | 100.0000
6      | 97.338   | 99.165 | 99.322 | 99.322   | 99.9986  | 97.2588
7      | 96.449   | 98.492 | 98.687 | 98.589   | 100.0000 | 33.2285
8      | 91.273   | 96.150 | 98.665 | 98.871   | 99.9986  | 99.9999
9      | 93.162   | 96.283 | 98.167 | 98.464   | 99.9871  | 99.9985
Argmax | 90.790   | 96.830 | 97.090 | 97.670   | -        | -

Table 5: MNIST MLP: Accuracy with various architectures and Invexity %

Class  | Logistic | Convex | Invex  | Ordinary | Invexity % (Train+Test) | Invexity % (Train+Test+1M random)
0      | 98.724   | 97.755 | 99.388 | 99.235   | 100.0000 | 100.0000
1      | 98.590   | 99.163 | 99.559 | 99.471   | 100.0000 | 96.5372
2      | 95.155   | 97.335 | 98.837 | 98.643   | 100.0000 | 100.0000
3      | 94.802   | 97.574 | 98.713 | 99.257   | 100.0000 | 99.9951
4      | 96.894   | 97.301 | 99.695 | 99.796   | 99.9857  | 10.0613
5      | 93.330   | 96.973 | 98.823 | 98.935   | 100.0000 | 100.0000
6      | 97.338   | 97.390 | 99.530 | 99.478   | 100.0000 | 99.9995
7      | 96.449   | 96.012 | 99.124 | 99.319   | 100.0000 | 99.9503
8      | 91.273   | 96.920 | 98.614 | 98.922   | 100.0000 | 6.5421
9      | 93.162   | 92.815 | 97.869 | 98.117   | 100.0000 | 100.0000
Argmax | 90.790   | 94.680 | 97.850 | 98.080   | -        | -

Table 6: MNIST CNN: Accuracy with various architectures and Invexity %

Class  | Convex | Invex | Ordinary | Invexity % (Train+Test) | Invexity % (Train+Test+1M random)
0      | 92.10  | 95.25 | 95.30    | 100.0000 | 99.9739
1      | 97.75  | 99.10 | 99.10    | 100.0000 | 99.9064
2      | 90.10  | 93.75 | 94.0     | 100.0000 | 98.4567
3      | 92.60  | 95.60 | 96.20    | 99.9971  | 99.9158
4      | 89.05  | 93.65 | 93.85    | 99.9929  | 99.8463
5      | 98.35  | 99.15 | 98.95    | 100.0000 | 100.0000
6      | 81.65  | 88.00 | 87.95    | 99.9986  | 99.8244
7      | 93.75  | 98.50 | 98.80    | 100.0000 | 100.0000
8      | 97.75  | 98.70 | 98.60    | 99.9971  | 99.9998
9      | 95.90  | 98.60 | 98.30    | 99.9986  | 99.9999
Argmax | 81.27  | 87.76 | 87.80    | -        | -

Table 7: F-MNIST CNN: Accuracy with various architectures and Invexity %


Figure 11: 2D plots of the decision boundaries made by (b) Logistic Regression, (c) Input Convex Neural Network, (d) Ordinary Neural Network, (e) Basic Invex and (f) Invex composed over the Basic Invex in (e). The dataset used for the Classification 1 task is shown in (a).
