CBMM Memo No. 058    April 29, 2019

Why and When Can Deep – but Not Shallow – Networks Avoid the Curse of Dimensionality: a Review

by

Tomaso Poggio1, Hrushikesh Mhaskar2, Lorenzo Rosasco1, Brando Miranda1, Qianli Liao1

1 Center for Brains, Minds, and Machines, McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA 02139.
2 Department of Mathematics, California Institute of Technology, Pasadena, CA 91125; Institute of Mathematical Sciences, Claremont Graduate University, Claremont, CA 91711

Abstract: The paper reviews and extends an emerging body of theoretical results on deep learning including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represent an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Implications of a few key theorems are discussed, together with new results, open problems and conjectures.

This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. H.M. is supported in part by ARO Grant W911NF-15-1-0385.

arXiv:1611.00740v2 [cs.LG] 22 Nov 2016



Why and When Can Deep – but Not Shallow – Networks Avoid the Curse of Dimensionality: a Review

Tomaso Poggio1, Hrushikesh Mhaskar2, Lorenzo Rosasco1, Brando Miranda1, Qianli Liao1

1Center for Brains, Minds, and Machines, McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, 02139.

    2Department of Mathematics, California Institute of Technology, Pasadena, CA 91125; Institute of Mathematical Sciences, Claremont Graduate University, Claremont, CA 91711

Abstract: The paper reviews and extends an emerging body of theoretical results on deep learning including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represent an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Implications of a few key theorems are discussed, together with new results, open problems and conjectures.

    Keywords: Deep and Shallow Networks, Convolutional Neural Networks, Function Approximation, Deep Learning

    1 A theory of deep learning

    1.1 Introduction

There are at least three main sets of theory questions about Deep Neural Networks. The first set of questions is about the power of the architecture – which classes of functions can it approximate and learn well? The second set of questions is about the learning process: why is SGD (Stochastic Gradient Descent) so unreasonably efficient, at least in appearance? The third, more important question is about generalization. Overparametrization may explain why minima are easy to find during training, but then why does overfitting seem to be less of a problem than for classical shallow networks? Is this because deep networks are very efficient algorithms for Hierarchical Vector Quantization?

In this paper we focus especially on the first set of questions, summarizing several theorems that have appeared online in 2015 [Anselmi et al., 2015b, Poggio et al., 2015c, Poggio et al., 2015a] and in 2016 [Mhaskar et al., 2016, Mhaskar and Poggio, 2016]. We then describe additional results as well as a few conjectures and open questions. The main message is that deep networks have the theoretical guarantee, which shallow networks do not have, that they can avoid the curse of dimensionality for an important class of problems, corresponding to compositional functions, that is, functions of functions. An especially interesting subset of such compositional functions are hierarchically local compositional functions, where all the constituent functions are local in the sense of bounded small dimensionality. The deep networks that can approximate them without the curse of dimensionality are of the deep convolutional type (though weight sharing is not necessary).

    Implications of the theorems likely to be relevant in practice are:

1. Certain deep convolutional architectures have a theoretical guarantee that they can be much better than one-layer architectures such as kernel machines;

2. the problems for which certain deep networks are guaranteed to avoid the curse of dimensionality correspond to input-output mappings that are compositional. The most interesting set of problems consists of compositional functions composed of a hierarchy of constituent functions that are local: an example is f(x1, · · · , x8) = h3(h21(h11(x1, x2), h12(x3, x4)), h22(h13(x5, x6), h14(x7, x8))). The compositional function f requires only “local” computations (here with just dimension 2) in each of its constituent functions h (see the sketch after this list);

3. the key aspect of convolutional networks that can give them an exponential advantage is not weight sharing but locality at each level of the hierarchy.
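As an illustration of point 2, a minimal Python sketch of such a hierarchically local function is given below; the particular constituent functions h are arbitrary placeholders (not from the paper), chosen only to make explicit that every computation involves just two inputs.

```python
# Minimal sketch: a hierarchically local compositional function of 8 variables
# built from bivariate constituents, mirroring
# f(x1,...,x8) = h3(h21(h11(x1,x2), h12(x3,x4)), h22(h13(x5,x6), h14(x7,x8))).
# The particular choice of constituent functions below is arbitrary.

def h11(a, b): return a * b          # each constituent depends on only 2 inputs
def h12(a, b): return a + b
def h13(a, b): return max(a, b)
def h14(a, b): return a - b
def h21(a, b): return a * a + b
def h22(a, b): return a * b
def h3(a, b):  return a + b

def f(x1, x2, x3, x4, x5, x6, x7, x8):
    return h3(h21(h11(x1, x2), h12(x3, x4)),
              h22(h13(x5, x6), h14(x7, x8)))

print(f(1, 2, 3, 4, 5, 6, 7, 8))     # every computation is "local" (d = 2)
```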

    2 Previous theoretical work

Deep Learning references start with Hinton's backpropagation and with LeCun's convolutional networks (see [LeCun et al., 2015] for a nice review). Of course, multilayer convolutional networks have been around at least as far back as the optical processing era of the 70s. The Neocognitron [Fukushima, 1980] was a convolutional neural network that was trained to recognize characters. The property of compositionality was a main motivation for hierarchical models of visual cortex such as HMAX, which can be regarded as a pyramid of AND and OR layers [Riesenhuber and Poggio, 1999], that is, a sequence of conjunctions and disjunctions. Several papers in the '80s focused on the approximation power and learning properties of one-hidden-layer networks (called shallow networks here). Very little appeared on multilayer networks (but see [Mhaskar, 1993a, Chui et al., 1994, Chui et al., 1996]), mainly because one-hidden-layer nets performed empirically as well as deeper networks. On the theory side, a review by Pinkus in 1999 [Pinkus, 1999] concludes that “...there seems to be reason to conjecture that the two hidden layer model may be significantly more promising than the single hidden layer model...”. A version of the questions about the importance of hierarchies was asked in [Poggio and Smale, 2003] as follows: “A comparison with real brains offers another, and probably related, challenge to learning theory. The ‘learning algorithms’ we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory? It seems that the learning theory of the type we have outlined does not offer any general argument in favor of hierarchical learning machines for regression or classification. This is somewhat of a puzzle since the organization of cortex – for instance visual cortex – is strongly hierarchical. At the same time, hierarchical learning systems show superior performance in several engineering applications.” Because of the great empirical success of deep learning over the last three years, several papers have appeared that address the question of why hierarchies work. Sum-Product networks, which are equivalent to polynomial networks (see [B. Moore and Poggio, 1998, Livni et al., 2013]), are a simple case of a hierarchy that was analyzed [Delalleau and Bengio, 2011] but did not provide particularly useful insights. Montufar and Bengio [Montufar et al., 2014] showed that the number of linear regions that can be synthesized by a deep network with ReLU nonlinearities is much larger than by a shallow network. However, they did not study the conditions under which this property yields better learning performance. In fact, we will show later that the power of a deep network cannot be exploited in general, but only for certain specific classes of functions. Relevant to the present review is the work on hierarchical quadratic networks [Livni et al., 2013], together with function approximation results [Mhaskar, 1993b, Pinkus, 1999]. Also relevant is the conjecture by Shashua (see [Cohen et al., 2015]) on a connection between deep learning networks and the hierarchical Tucker representations of tensors. In fact, our theorems describe formally the class of functions for which the conjecture holds.

This paper describes and extends results presented in [Anselmi et al., 2014, Anselmi et al., 2015a, Poggio et al., 2015b] and in [Liao and Poggio, 2016, Mhaskar et al., 2016], which derive new upper bounds for the approximation by deep networks of certain important classes of functions which avoid the curse of dimensionality. The upper bound for the approximation by shallow networks of general functions was well known to be exponential. It seems natural to assume that, since there is no general way for shallow networks to exploit a compositional prior, lower bounds for the approximation by shallow networks of compositional functions should also be exponential. In fact, examples of specific functions that cannot be represented efficiently by shallow networks have been given very recently by [Telgarsky, 2015] and [Safran and Shamir, 2016]. We provide in Theorem 5 another example of a class of compositional functions for which there is a gap between shallow and deep networks.

    3 Function approximation by deep networks

In this section, we state theorems about the approximation properties of shallow and deep networks.

    3.1 Degree of approximation

The general paradigm is as follows. We are interested in determining how complex a network ought to be to theoretically guarantee approximation of an unknown target function f up to a given accuracy ε > 0. To measure the accuracy, we need a norm ‖ · ‖ on some normed linear space X. As we will see, the norm used in the results of this paper is the sup norm, in keeping with the standard choice in approximation theory. Notice, however, that from the point of view of machine learning, the relevant norm is the L2 norm. In this sense, several of our results are stronger than needed. On the other hand, our main results on compositionality require the sup norm in order to be independent from the unknown distribution of the input data. This is important for machine learning.

Let V_N be the set of all networks of a given kind with complexity N, which we take here to be the total number of units in the network (e.g., all shallow networks with N units in the hidden layer). It is assumed that the class of networks with a higher complexity includes those with a lower complexity; i.e., V_N ⊆ V_{N+1}. The degree of approximation is defined by

dist(f, V_N) = inf_{P∈V_N} ‖f − P‖.   (1)

For example, if dist(f, V_N) = O(N^{−γ}) for some γ > 0, then a network with complexity N = O(ε^{−1/γ}) will be sufficient to guarantee an approximation with accuracy at least ε. Since f is unknown, in order to obtain theoretically proved upper bounds, we need to make some assumptions on the class of functions from which the unknown target function is chosen. This a priori information is codified by the statement that f ∈ W for some subspace W ⊆ X. This subspace is usually a smoothness class characterized by a smoothness parameter m. Here it will be generalized to a smoothness and compositional class, characterized by the parameters m and d (d = 2 in the example of Figure 1; in general d is the size of the kernel in a convolutional network).
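As a small illustration (with assumed numbers, not taken from the paper), inverting dist(f, V_N) = O(N^{−γ}) gives the complexity N ≈ ε^{−1/γ} required for accuracy ε:

```python
# Sketch (assumed numbers): inverting dist(f, V_N) = O(N**-gamma)
# gives the network complexity needed for accuracy eps, i.e. N ~ eps**(-1/gamma).
def required_complexity(eps, gamma):
    return eps ** (-1.0 / gamma)

for gamma in (0.5, 1.0, 2.0):
    print(gamma, required_complexity(0.01, gamma))
# a faster decay rate gamma (e.g. from a smoother target f) means far fewer units
```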

    3.2 Shallow and deep networks

This section characterizes conditions under which deep networks are “better” than shallow networks in approximating functions. Thus we compare shallow (one-hidden-layer) networks with deep networks, as shown in Figure 1. Both types of networks use the same small set of operations – dot products, linear combinations, a fixed nonlinear function of one variable, possibly convolution and pooling. Each node in the networks we consider usually corresponds to a node in the graph of the function to be approximated, as shown in the Figure. In particular, each node in the network contains a certain number of units. A unit is a neuron which computes

    (〈x,w〉+ b)+, (2)

where w is the vector of weights on the vector input x. Both w and the real number b are parameters tuned by learning. We assume here that each node in the networks computes the linear combination of r such units

∑_{i=1}^{r} c_i (〈x, t_i〉 + b_i)_+.   (3)

Notice that for our main example of a deep network corresponding to a binary tree graph, the resulting architecture is an idealized version of the plethora of deep convolutional neural networks described in the literature. In particular, it has only one output at the top, unlike most of the deep architectures with many channels and many top-level outputs. Correspondingly, each node computes a single value instead of multiple channels, using the combination of several units (see Equation 3). Our approach and basic results apply rather directly to more complex networks (see the third note in Section 7). A careful analysis and comparison with simulations will be described in future work.
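For concreteness, a minimal NumPy sketch of Equations (2) and (3) follows; the weights are random placeholders, and a shallow network is simply one such node acting on the full n-dimensional input.

```python
import numpy as np

def relu_unit(x, w, b):
    # Equation (2): a single unit computes (<x, w> + b)_+
    return max(np.dot(x, w) + b, 0.0)

def node(x, C, W, B):
    # Equation (3): a node combines r units, sum_i c_i (<x, t_i> + b_i)_+
    return sum(c * relu_unit(x, w, b) for c, w, b in zip(C, W, B))

# toy node with r = 3 units on a 2-dimensional input (as in the binary-tree case)
rng = np.random.default_rng(0)
x = np.array([0.3, -0.7])
r = 3
print(node(x, rng.normal(size=r), rng.normal(size=(r, 2)), rng.normal(size=r)))
```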

    The logic of our theorems is as follows.

• Both shallow (a) and deep (b) networks are universal, that is, they can approximate arbitrarily well any continuous function of n variables on a compact domain. The result for shallow networks is classical. Since shallow networks can be viewed as a special case of deep networks, it is clear that for any continuous function of n variables, there exists also a deep network that approximates the function arbitrarily well on a compact domain.

• We consider a special class of functions of n variables on a compact domain that are hierarchical compositions of local functions, such as

f(x1, · · · , x8) = h3(h21(h11(x1, x2), h12(x3, x4)), h22(h13(x5, x6), h14(x7, x8)))   (4)

The structure of the function in Equation 4 is represented by a graph of the binary tree type. This is the simplest example of compositional functions, reflecting dimensionality d = 2 for the constituent functions h. In general, d is arbitrary but fixed and independent of the dimensionality n of the compositional function f. In our results we will often think of n increasing while d is fixed. In Section 4 we will consider the more general compositional case.

• The approximation of functions with a compositional structure can be achieved with the same degree of accuracy by deep and shallow networks, but the number of parameters is much smaller for the deep networks than for the shallow network with equivalent approximation accuracy. It is intuitive that a hierarchical network matching the structure of a compositional function should be “better” at approximating it than a generic shallow network, but the universality of shallow networks asks for a non-obvious characterization of “better”. Our result makes clear that the intuition is indeed correct.

Figure 1 The top graphs are associated to functions; each of the bottom diagrams depicts the network approximating the function above. In a) a shallow universal network in 8 variables and N units approximates a generic function of 8 variables f(x1, · · · , x8). Inset b) shows a binary tree hierarchical network at the bottom in n = 8 variables, which approximates well functions of the form f(x1, · · · , x8) = h3(h21(h11(x1, x2), h12(x3, x4)), h22(h13(x5, x6), h14(x7, x8))) as represented by the binary graph above. In the approximating network each of the n − 1 nodes in the graph of the function corresponds to a set of Q = N/(n − 1) ReLU units computing the ridge function ∑_{i=1}^{Q} a_i(〈v_i, x〉 + t_i)_+, with v_i, x ∈ R^2, a_i, t_i ∈ R. Each term in the ridge function corresponds to a unit in the node (this is somewhat different from today's deep networks, see text and note in Section 7). In a binary tree with n inputs, there are log_2 n levels and a total of n − 1 nodes. Similar to the shallow network, a hierarchical network is universal, that is, it can approximate any continuous function; the text proves that it can approximate compositional functions exponentially better than a shallow network. No invariance – that is, weight sharing – is assumed here. Notice that the key property that makes convolutional deep nets exponentially better than shallow ones for compositional functions is the locality of the constituent functions – that is, their low dimensionality. Weight sharing corresponds to all constituent functions at one level being the same (h11 = h12 etc.). Inset c) shows a different mechanism that can be exploited by the deep network at the bottom to reduce the curse of dimensionality in the compositional function at the top: leveraging different degrees of smoothness of the constituent functions, see Theorem 6 in the text. Notice that in c) the input dimensionality must be ≥ 2 in order for deep nets to have an advantage over shallow nets. The simplest examples of functions to be considered for a), b) and c) are polynomials with a structure corresponding to the graph at the top.

We assume that the shallow networks do not have any structural information on the function to be learned (here its compositional structure), because they cannot represent it directly. In any case, we will describe lower bounds of approximation by shallow networks for the class of compositional functions to formalize that assumption. Deep networks with standard architectures, on the other hand, do represent compositionality in their architecture and can be adapted to the details of such prior information.

We approximate functions of n variables of the form of Equation (4) with networks in which the activation nonlinearity is a smoothed version of the so-called ReLU, originally called ramp by Breiman and given by σ(x) = x+ = max(0, x). The architecture of the deep networks reflects Equation (4), with each node hi being a ridge function, comprising one or more neurons.

Let I^n = [−1, 1]^n, X = C(I^n) be the space of all continuous functions on I^n, with ‖f‖ = max_{x∈I^n} |f(x)|. Let S_{N,n} denote the class of all shallow networks with N units of the form

x ↦ ∑_{k=1}^{N} a_k σ(〈w_k, x〉 + b_k),

where w_k ∈ R^n, b_k, a_k ∈ R. The number of trainable parameters here is (n + 2)N. Let m ≥ 1 be an integer, and W_m^n be the set of all functions of n variables with continuous partial derivatives of orders up to m

ε = cN^{−m/2}. Our assumption that f ∈ W_m^{n,2} implies that each of these constituent functions is Lipschitz continuous. Hence, it is easy to deduce that, for example, if P, P1, P2 are approximations to the constituent functions h, h1, h2, respectively, within an accuracy of ε, then since ‖h − P‖ ≤ ε, ‖h1 − P1‖ ≤ ε and ‖h2 − P2‖ ≤ ε, we have

‖h(h1, h2) − P(P1, P2)‖ = ‖h(h1, h2) − h(P1, P2) + h(P1, P2) − P(P1, P2)‖ ≤ ‖h(h1, h2) − h(P1, P2)‖ + ‖h(P1, P2) − P(P1, P2)‖ ≤ cε

by the Minkowski inequality. Thus

‖h(h1, h2) − P(P1, P2)‖ ≤ cε,

for some constant c > 0 independent of the functions involved. This, together with the fact that there are (n − 1) nodes, leads to (6). □

Also in this case the proof provides the following corollary about the subset T_k^n of the space P_k^n, which consists of compositional polynomials with a binary tree graph and constituent polynomial functions of degree k (in 2 variables):

Corollary 2. Let σ : R → R be infinitely differentiable, and not a polynomial. Let n = 2^l. Then f ∈ T_k^n can be realized by a deep network with a binary tree graph and a total of r units with r = (n − 1) (2+k choose 2) ≈ (n − 1) k^2.

It is important to emphasize that the assumptions on σ in the theorems are not satisfied by the ReLU function x ↦ x+, but they are satisfied by smoothing the function in an arbitrarily small interval around the origin. This suggests that the result of the theorem should be valid also for the non-smooth ReLU. Section 4.1 provides formal results. Stronger results than the theorems of this section (see [Mhaskar and Poggio, 2016]) hold for networks where each unit evaluates a Gaussian non-linearity; i.e., Gaussian networks of the form

G(x) = ∑_{k=1}^{N} a_k exp(−|x − w_k|^2),   x ∈ R^d   (7)

where the approximation is on the entire Euclidean space.

In summary, when the only a priori assumption on the target function is about the number of derivatives, then to guarantee an accuracy of ε, we need a shallow network with O(ε^{−n/m}) trainable parameters. If we assume a hierarchical structure on the target function as in Theorem 2, then the corresponding deep network yields a guaranteed accuracy of ε with O(ε^{−2/m}) trainable parameters. Note that Theorem 2 applies to all f with a compositional architecture given by a graph which corresponds to, or is a subgraph of, the graph associated with the deep network – in this case the graph corresponding to W_m^{n,d}. Theorem 2 leads naturally to the notion of effective dimensionality that we formalize in the next section.

Definition 1. The effective dimension of a class W of functions (for a given norm) is said to be d if for every ε > 0, any function in W can be recovered within an accuracy of ε (as measured by the norm) using an appropriate network (either shallow or deep) with ε^{−d} parameters.

Thus, the effective dimension for the class W_m^n is n/m, and that of W_m^{n,2} is 2/m.
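A small numeric illustration of these effective dimensions (constants and logarithmic factors omitted, so the numbers are only indicative, not taken from the theorems) is:

```python
# Illustration only (constants omitted): parameter counts suggested by the
# effective dimensions n/m (shallow, generic f) versus 2/m (deep binary-tree
# network, compositional f); the (n - 1) factor counts the nodes of the tree.
def shallow_params(eps, n, m): return eps ** (-n / m)
def deep_params(eps, n, m):    return (n - 1) * eps ** (-2 / m)

eps, m = 0.1, 2
for n in (4, 8, 16, 32):
    print(n, shallow_params(eps, n, m), deep_params(eps, n, m))
# the shallow count grows exponentially in n, the deep count only linearly
```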

4 General compositionality results: functions composed by a hierarchy of functions with bounded effective dimensionality

The main class of functions we considered in previous papers consists of functions as in Figure 1 b) that we called compositional functions. The term “compositionality” was used with the meaning it has in language and vision, where higher level concepts are composed of a small number of lower level ones: objects are composed of parts, sentences are composed of words and words are composed of syllables. Notice that this meaning of compositionality is narrower than the mathematical meaning of composition of functions. The compositional functions we have described in previous papers may be more precisely called functions composed of hierarchically local functions.

Here we generalize formally our previous results to the broader class of compositional functions (beyond the hierarchical locality of Figure 1b, to Figure 1c and Figure 2) by restating formally a few comments of previous papers. Let us begin with one of the previous examples. Consider

Q(x, y) = (Ax^2y^2 + Bx^2y + Cxy^2 + Dx^2 + 2Exy + Fy^2 + 2Gx + 2Hy + I)^{2^{10}}.

Since Q is nominally a polynomial of coordinatewise degree 2^{11}, [Mhaskar, 1996, Lemma 3.2] shows that a shallow network with 2^{11} + 1 units is able to approximate Q arbitrarily well on I^2. However, because of the hierarchical structure of Q, [Mhaskar, 1996, Lemma 3.2] shows also that a hierarchical network with 9 units can approximate the quadratic expression, and 10 further layers, each with 3 units, can approximate the successive powers. Thus, a hierarchical network with 11 layers and 39 units can approximate Q arbitrarily well. We note that even if Q is nominally of degree 2^{11}, each of the monomial coefficients in Q is a function of only 9 variables, A, · · · , I.
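A minimal sketch of this hierarchical evaluation (with arbitrary coefficients A, · · · , I; not the construction used in the lemma) is:

```python
# Sketch (not the paper's construction): evaluating Q hierarchically -- one
# quadratic layer followed by 10 successive squarings -- instead of expanding
# the nominal coordinatewise-degree-2^11 polynomial. Coefficients are arbitrary.
A = B = C = D = E = F = G = H = I = 1.0

def Q(x, y):
    q = (A*x*x*y*y + B*x*x*y + C*x*y*y + D*x*x
         + 2*E*x*y + F*y*y + 2*G*x + 2*H*y + I)   # the 9-parameter quadratic
    for _ in range(10):                            # ten layers, each a squaring
        q = q * q
    return q                                       # equals (quadratic)**(2**10)

print(Q(0.5, -0.5))
```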

    A different example is

Q(x, y) = |x^2 − y^2|.   (8)

This is obviously a Lipschitz continuous function of 2 variables. The effective dimension of this class is 2, and hence, a shallow network would require at least cε^{−2} parameters to approximate it within ε. However, the effective dimension of the class of univariate Lipschitz continuous functions is 1. Hence, if we take into account the fact that Q is a composition of a polynomial of degree 2 in 2 variables and the univariate Lipschitz continuous function t ↦ |t|, then it is easy to see that the same approximation can be achieved by using a two-layered network with O(ε^{−1}) parameters.

To formulate our most general result, which includes the examples above as well as the constraint of hierarchical locality, we first define formally a compositional function in terms of a directed acyclic graph. Let G be a directed acyclic graph (DAG), with the set of nodes V. A G-function is defined as follows. Each of the source nodes obtains an input from R. Each in-edge of every other node represents an input real variable, and the node itself represents a function of these input real variables, called a constituent function. The out-edges fan out the result of this evaluation. We assume that there is only one sink node, whose output is the G-function. Thus, ignoring the compositionality of this function, it is a function of n variables, where n is the number of source nodes in G.
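A minimal sketch of a G-function and its evaluation over a DAG follows; the graph, constituent functions and dictionary representation are illustrative choices, not notation from the paper.

```python
# Sketch (hypothetical representation): a G-function on a DAG. Each non-source
# node has a constituent function of the values on its in-edges; the single
# sink's value is the G-function of the n source inputs.
dag = {                                   # node -> (list of parent nodes, function)
    "h11": (["x1", "x2"], lambda a, b: a * b),
    "h12": (["x3", "x4"], lambda a, b: a + b),
    "h2":  (["h11", "h12"], lambda a, b: max(a, b)),   # sink node
}

def evaluate(dag, sink, inputs):
    def value(v):
        if v in inputs:                   # source nodes read their input in R
            return inputs[v]
        parents, fn = dag[v]
        return fn(*(value(p) for p in parents))
    return value(sink)

print(evaluate(dag, "h2", {"x1": 1.0, "x2": 2.0, "x3": 3.0, "x4": 4.0}))
```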

Figure 2 The figure shows the graphs of functions that may have small effective dimensionality, depending on the number of units per node required for good approximation.

Theorem 3. Let G be a DAG, n be the number of source nodes, and for each v ∈ V, let d_v be the number of in-edges of v. Let f : R^n ↦ R be a compositional G-function, where each of the constituent functions is in W_{m_v}^{d_v}. Consider shallow and deep networks with infinitely smooth activation functions as in Theorem 1. Then deep networks – with an associated graph that corresponds to the graph of f – avoid the curse of dimensionality in approximating f for increasing n, whereas shallow networks cannot directly avoid the curse. In particular, the complexity of the best approximating shallow network is exponential in n,

N_s = O(ε^{−n/m}),   (9)

where m = min_{v∈V} m_v, while the complexity of the deep network is

N_d = O(∑_{v∈V} ε^{−d_v/m_v}).   (10)

Following Definition 1, we call d_v/m_v the effective dimension of function v. Then, deep networks can avoid the curse of dimensionality if the constituent functions of a compositional function have a small effective dimension, i.e., have fixed, “small” dimensionality or fixed, “small” “roughness”. A different interpretation of Theorem 3 is the following.

Proposition 1. If a family of functions f : R^n ↦ R of smoothness m has an effective dimension < n/m, then the functions are compositional in a manner consistent with the estimates in Theorem 3.

Notice that the functions included in this theorem are functions that are either local or the composition of simpler functions or both. Figure 2 shows some examples in addition to the examples at the top of Figure 1.

4.1 Approximation results for shallow and deep networks with (non-smooth) ReLUs

The results we described so far use smooth activation functions. We already mentioned why relaxing the smoothness assumption should not change our results in a fundamental way. While studies on the properties of neural networks with smooth activation abound, the results on non-smooth activation functions are much more sparse. Here we briefly recall some of them.

In the case of shallow networks, the condition of a smooth activation function can be relaxed to prove density (see [Pinkus, 1999], Proposition 3.7):

Proposition 2. Let σ : R → R be in C0, and not a polynomial. Then shallow networks are dense in C0.

In particular, ridge functions using ReLUs of the form ∑_{i=1}^{r} c_i(〈w_i, x〉 + b_i)_+, with w_i, x ∈ R^2, c_i, b_i ∈ R, are dense in C.

Networks with non-smooth activation functions are expected to do relatively poorly in approximating smooth functions such as polynomials in the sup norm. “Good” degree-of-approximation rates (modulo a constant) have been proved in the L2 norm. Define B^n the unit ball in R^n. Call C^m(B^n) the set of all continuous functions with continuous derivatives up to degree m defined on the unit ball. We define the Sobolev space W_p^m as the completion of C^m(B^n) with respect to the Sobolev norm p (see [Pinkus, 1999], page 168, for details). We define the space B_p^m = {f : f ∈ W_p^m, ‖f‖_{m,p} ≤ 1} and the approximation error E(B_2^m; H; L2) = sup_{f∈B_2^m} E(f; H; L2), where E(f; H; L2) = inf_{g∈H} ‖f − g‖_{L2}. It is shown in [Pinkus, 1999, Corollary 6.10] that

Proposition 3. For M_r : f(x) = ∑_{i=1}^{r} c_i(〈w_i, x〉 + b_i)_+ it holds

E(B_2^m; M_r; L2) ≤ C r^{−m/n}   for m = 1, · · · , (n+3)/2.

These approximation results with respect to the L2 norm cannot be applied to derive bounds for compositional networks. Indeed, in the latter case, as we remarked already, estimates in the uniform norm are needed to control the propagation of the errors from one layer to the next; see Theorem 2. Results in this direction are given in [Mhaskar, 2004], and more recently in [Bach, 2014] and [Mhaskar and Poggio, 2016] (see Theorem 3.1). In particular, using a result in [Bach, 2014] and following the proof strategy of Theorem 2, it is possible to derive the following result on the approximation of Lipschitz continuous functions with deep and shallow ReLU networks, which mimics our Theorem 2:

Theorem 4. Let f be an L-Lipschitz continuous function of n variables. Then, the complexity of a network which is a linear combination of ReLUs providing an approximation with accuracy at least ε is

N_s = O((ε/L)^{−n}),

whereas that of a deep compositional architecture is

N_d = O((n − 1)(ε/L)^{−2}).

Our general Theorem 3 can be extended in a similar way. Theorem 4 is an example of how the analysis for smooth activation functions can be adapted to ReLUs. Indeed, it shows how deep compositional networks with standard ReLUs can avoid the curse of dimensionality. In the above results, the regularity of the function class is quantified by the magnitude of the Lipschitz constant. Whether the latter is the best notion of smoothness for ReLU-based networks, and whether the above estimates can be improved, are interesting questions that we defer to future work. A result that is more intuitive and may reflect what networks actually do is described in Appendix 4.3. The construction described there, however, provides approximation in the L2 norm but not in the sup norm, though this may not matter for digital implementations.

    Figures 3, 4, 5, 6 provide a sanity check and empirical support forour main results and for the claims in the introduction.

    4.2 Lower bounds and gaps

So far we have shown that there are deep networks – for instance of the convolutional type – that can avoid the curse of dimensionality when learning functions blessed with compositionality. There are no similar guarantees for shallow networks: in fact, for shallow networks approximating generic continuous functions the lower and the upper bound are both exponential [Pinkus, 1999].

Figure 4 An empirical comparison of shallow vs. 2-layer binary tree networks in the approximation of compositional functions. The loss function is the standard mean square error (MSE). There are several units per node of the tree. In our setup here the network with an associated binary tree graph was set up so that each layer had the same number of units and shared parameters. The number of units was chosen so that the shallow and the binary tree neural network had the same number of parameters. On the left the function is composed of a single ReLU per node and is approximated by a network using ReLU activations. On the right the compositional function is f(x1, x2, x3, x4) = h2(h11(x1, x2), h12(x3, x4)) and is approximated by a network with a smooth ReLU activation (also called softplus). The functions h1, h2, h3 are as described in Figure 3. In order to be close to the function approximation case, a large data set of 60K training examples was used for both training sets. We used for SGD the Adam [Kingma and Ba, 2014] optimizer. In order to get the best possible solution, we ran 200 independent hyperparameter searches using random search [Bergstra and Bengio, 2012] and reported the one with the lowest training error. The hyperparameter search was over the step size, the decay rate, the frequency of decay and the mini-batch size. The exponential decay hyperparameters for Adam were fixed to the recommended values according to the original paper [Kingma and Ba, 2014]. The implementations were based on TensorFlow [Abadi et al., 2015].

Figure 5 Another comparison of shallow vs. 2-layer binary tree networks in the learning of compositional functions. The setup of the experiment was the same as the one in Figure 4, except that the compositional function had two ReLU units per node instead of only one. The right part of the figure shows a cross section of the function f(x1, x2, 0.5, 0.25) in a bounded interval x1 ∈ [−1, 1], x2 ∈ [−1, 1]. The shape of the function is piecewise linear, as is always the case for ReLU networks.

[Figure 6: training error (left panel) and validation error (right panel) on CIFAR-10 versus epoch, for ShallowFC, 2 Layers, #Params 1577984; DeepFC, 5 Layers, #Params 2364416; DeepConv, No Sharing, #Params 563888; DeepConv, Sharing, #Params 98480.]

Figure 6 We show that the main advantage of deep Convolutional Networks (ConvNets) comes from "hierarchical locality" instead of weight sharing. We train two 5-layer ConvNets with and without weight sharing on CIFAR-10. The ConvNet without weight sharing has different filter parameters at each spatial location. There are 4 convolutional layers (filter size 3x3, stride 2) in each network. The numbers of feature maps (i.e., channels) are 16, 32, 64 and 128 respectively. There is an additional fully-connected layer as a classifier. The performances of a 2-layer and a 5-layer fully-connected network are also shown for comparison. Each hidden layer of the fully-connected network has 512 units. The models are all trained for 60 epochs with cross-entropy loss and standard shift and mirror flip data augmentation (during training). The training errors are higher than the validation errors because of data augmentation. The learning rates are 0.1 for epochs 1 to 40, 0.01 for epochs 41 to 50 and 0.001 for the remaining epochs. The number of parameters for each model is indicated in the legend. Models with hierarchical locality significantly outperform shallow and hierarchical non-local networks.

Figure 3 The figure shows at the top the graph of the function to be approximated, while the bottom part of the figure shows a deep neural network with the same graph structure. The left and right nodes in the first layer each have n units, giving a total of 2n units in the first layer. The second layer has a total of 2n units. The first layer has a convolution of size n to mirror the structure of the function to be learned. The compositional function we approximate has the form f(x1, x2, x3, x4) = h2(h11(x1, x2), h12(x3, x4)), with h11, h12 and h2 as indicated in the figure.

It seems obvious that shallow networks, unlike deep ones, cannot exploit in their architecture compositional priors corresponding to compositional functions. In past papers we listed a few examples:

• Consider the polynomial

Q(x1, x2, x3, x4) = (Q1(Q2(x1, x2), Q3(x3, x4)))^{1024},

where Q1, Q2, Q3 are bivariate polynomials of total degree ≤ 2. Nominally, Q is a polynomial of total degree 4096 in 4 variables, and hence requires (4100 choose 4) ≈ 1.17 × 10^13 parameters without any prior knowledge of the compositional structure. However, the compositional structure implies that each of these coefficients is a function of only 18 parameters. In this case, the representation which makes deep networks approximate the function with a smaller number of parameters than shallow networks is based on polynomial approximation of functions of the type g(g(g())).

• As a different example, we consider a function which is a linear combination of n tensor product Chui–Wang spline wavelets, where each wavelet is a tensor product cubic spline. It is shown in [Chui et al., 1994, Chui et al., 1996] that it is impossible to implement such a function using a shallow neural network with a sigmoidal activation function using O(n) neurons, but a deep network with the activation function (x+)^2 can do so. In this case there is a formal proof of a gap between deep and shallow networks. Similarly, Eldan and Shamir (2015) show separations which are exponential in the input dimension.

• As we mentioned earlier, Telgarsky proves an exponential gap between certain functions produced by deep networks and their approximation by shallow networks. The theorem [Telgarsky, 2015] can be summarized as saying that a certain family of classification problems with real-valued inputs cannot be approximated well by shallow networks with fewer than exponentially many nodes, whereas a deep network achieves zero error. This corresponds to high-frequency, sparse trigonometric polynomials in our case. His upper bound can be proved directly from our main theorem by considering the real-valued polynomials x1x2...xd defined on the cube (−1, 1)^d, which are obviously compositional functions with a binary tree graph.

• We exhibit here another example of a compositional function that can be approximated well by deep networks but not by shallow networks. Let n ≥ 2 be an integer, and B ⊂ R^n be the unit ball of R^n. We consider the class W of all compositional functions f = f2 ◦ f1, where f1 : R^n → R, with ∑_{|k|≤4} ‖D^k f1‖_∞ ≤ 1, and f2 : R → R with ‖D^4 f2‖_∞ ≤ 1. We consider

∆(A_N) := sup_{f∈W} inf_{P∈A_N} ‖f − P‖_{∞,B},

where A_N is either the class S_N of all shallow networks with N units or D_N of deep networks with two layers, the first with n inputs and the next with one input. In both cases, the activation function is a C^∞ function σ : R → R that is not a polynomial.

Theorem 5. There exist constants c1, c2 > 0 such that for N ≥ c1,

∆(S_N) ≥ c2.   (11)

In contrast, there exists c3 > 0 such that

∆(D_N) ≤ c3 N^{−4/n}.   (12)

The constants c1, c2, c3 may depend upon n.

PROOF. The estimate (12) follows from the estimates already given for deep networks. In the remainder of this proof, c will denote a generic positive constant depending upon n alone, but its value may be different at different occurrences. To prove (11), we use Lemma 3.2 in [Chui et al., 1996]. Let φ be a C^∞ function supported on [0, 1], and consider fN(x) = φ(|4^N x|^2). Then it is clear that each fN ∈ W, and ‖fN‖_1 ≥ c, with c independent of N. Clearly,

∆(S_N) ≥ c inf_{P∈S_N} ∫_B |fN(x) − P(x)| dx.   (13)

We choose P*(x) = ∑_{k=1}^{N} σ(〈w_k*, x〉 + b_k*) such that

inf_{P∈S_N} ∫_B |fN(x) − P(x)| dx ≥ (1/2) ∫_B |fN(x) − P*(x)| dx.   (14)

Since fN is supported on {x ∈ R^n : |x| ≤ 4^{−N}}, we may use Lemma 3.2 in [Chui et al., 1996] with g_k*(t) = σ(t + b_k*) to conclude that

∫_B |fN(x) − P*(x)| dx ≥ inf_{g_k∈L1_loc, w_k∈R^n, a_k∈R} ∫_B |fN(x) − ∑_k a_k g_k(〈w_k, x〉)| dx ≥ c.

Together with (13) and (14), this implies (11).

So plenty of examples of lower bounds exist showing a gap between shallow and deep networks. A particularly interesting case is the product function, that is, the monomial f(x1, · · · , xn) = x1 × x2 × · · · × xn, which is, from our point of view, the prototypical compositional function. Keeping in mind the question of lower bounds, the question here is: what is the minimum integer r(n) such that the function f is in the closure of the span of σ(〈w_k, x〉 + b_k), with k = 1, · · · , r(n), and w_k, b_k ranging over their whole domains? Such a proof has been published for the case of smooth ReLUs, using unusual group techniques, and is sketched in the Appendix of [Lin, 2016], yielding:

Proposition 4. For shallow networks approximating the product monomial f(x1, · · · , xn) = x1 × x2 × · · · × xn, the minimum integer r(n) is r(n) = O(2^n).

Notice, in support of the claim, that assuming that a shallow network with (non-smooth) ReLUs has a lower bound of r(q) = O(q) would lead to a contradiction with Hastad's theorem by restricting xi from xi ∈ (−1, 1) to xi ∈ {−1, +1}. Hastad's theorem [Hastad, 1987] establishes the inapproximability of the parity function by shallow circuits of non-exponential size. We conjecture that Hastad's theorem can be used directly to prove Proposition 4 by expressing the product function in terms of the binary representation of the xi and then representing the resulting function in terms of the parity basis. The upper bound for approximation by deep networks of ReLUs can be obtained following the arguments of Proposition 7 in the Appendix.
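As a small illustration, the product monomial can be evaluated as a balanced binary tree of bivariate products, and restricting the inputs to {−1, +1} turns it into the parity function discussed above (sketch, not from the paper):

```python
# Sketch: the product monomial x1*...*xn computed as a balanced binary tree of
# bivariate products (depth log2 n); restricting xi to {-1, +1} gives the parity
# of the number of -1 entries, the function in Hastad's theorem.
def tree_product(xs):
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return tree_product(xs[:mid]) * tree_product(xs[mid:])

xs = [-1, 1, -1, -1, 1, 1, -1, 1]
print(tree_product(xs))                   # +1 iff the number of -1 entries is even
```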

4.3 Messy graphs and densely connected deep networks

As mentioned already, the approximating deep network does not need to exactly match the architecture of the compositional function, as long as the graph or tree associated with the function is contained in the graph associated with the network. This is of course good news: the compositionality prior embedded in the architecture of the network does not need to reflect exactly the graph of a new function to be learned. We have shown that for a given class of compositional functions characterized by an associated graph there exists a deep network that approximates such a function better than a shallow network. The same network approximates well functions characterized by subgraphs of the original class.

The proofs of our theorems show that linear combinations of compositional functions are universal in the sense that they can approximate any function, and that deep networks with a number of units that increases exponentially with layers can approximate any function. Notice that deep compositional networks can interpolate if they are overparametrized with respect to the data, even if the data reflect a non-compositional function (see Proposition 8 in the Appendix).

As an aside, note that the simplest compositional function – addition – is trivial in the sense that it offers no approximation advantage to deep networks. The key function is multiplication, which is for us the prototypical compositional function. As a consequence, polynomial functions are compositional – they are linear combinations of monomials, which are compositional.

As we mentioned earlier, networks corresponding to graphs that include the graph of the function to be learned can exploit compositionality. The bounds, however, will depend on the number of parameters r in the network used and not on the parameters r* (r* < r) of the optimal deep network with a graph exactly matched to the graph of the function to be learned. As an aside, the price to be paid in using a non-optimal prior depends on the learning algorithm. For instance, under sparsity constraints it may be possible to pay a smaller price than r (but higher than r*).

In this sense, some of the densely connected deep networks used in practice – which contain sparse graphs that may be relevant for the function to be learned and are still “smaller” than the exponential number of units required to represent a generic function of n variables – are capable of exploiting some of the compositionality structure.

5 Connections with the theory of Boolean functions

The approach followed in our main theorems suggests the following considerations (see Appendix 1 for a brief introduction). The structure of a deep network is reflected in the polynomials that are best approximated by it – for instance, generic polynomials or sparse polynomials (in the coefficients) in d variables of order k. The tree structure of the nodes of a deep network reflects the structure of a specific sparse polynomial. Generic polynomials of degree k in d variables are difficult to learn because the number of terms, trainable parameters and associated VC-dimension are all exponential in d. On the other hand, functions approximated well by sparse polynomials can be learned efficiently by deep networks with a tree structure that matches the polynomial. We recall that in a similar way several properties of certain Boolean functions can be “read out” from the terms of their Fourier expansion corresponding to “large” coefficients, that is, from a polynomial that approximates the function well.

Classical results [Hastad, 1987] about the depth-breadth tradeoff in circuit design show that deep circuits are more efficient in representing certain Boolean functions than shallow circuits. Hastad proved that highly-variable functions (in the sense of having high frequencies in their Fourier spectrum), in particular the parity function, cannot even be decently approximated by small constant-depth circuits (see also [Linial et al., 1993]). A closely related result follows immediately from our main theorem, since functions of real variables of the form x1x2...xd have the compositional form of the binary tree (for d even). Restricting the values of the variables to −1, +1 yields an upper bound:

Proposition 5. The family of parity functions x1x2...xd with xi ∈ {−1, +1} and i = 1, · · · , d can be represented with exponentially fewer units by a deep than a shallow network.

Notice that Hastad's results on Boolean functions have often been quoted in support of the claim that deep neural networks can represent functions that shallow networks cannot. For instance, Bengio and LeCun [Bengio and LeCun, 2007] write “We claim that most functions that can be represented compactly by deep architectures cannot be represented by a compact shallow architecture”.

Finally, we want to mention a few other observations on Boolean functions that show an interesting connection with our approach. It is known that within Boolean functions the AC0 class of polynomial-size constant-depth circuits is characterized by Fourier transforms where most of the power spectrum is in the low-order coefficients. Such functions can be approximated well by a polynomial of low degree and can be learned well by considering only such coefficients. There are two algorithms [Mansour, 1994] that allow learning of certain Boolean function classes:

1. the low-order algorithm, which approximates functions by considering their low-order Fourier coefficients, and

2. the sparse algorithm, which learns a function by approximating its significant coefficients.

Decision lists and decision trees can be learned by the first algorithm. Functions with small L1 norm can be approximated well by the second algorithm. Boolean circuits expressing DNFs can be approximated by the first one but even better by the second. In fact, in many cases a function can be approximated by a small set of coefficients, but these coefficients do not correspond to low-order terms. All these cases are consistent with the notes about sparse functions in Section 7.
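A minimal sketch of the Fourier (Walsh) expansion underlying these algorithms, using the parity function as the extreme non-low-order case, follows; the code is illustrative and not taken from [Mansour, 1994].

```python
from itertools import product as cartesian

# Sketch: Fourier (Walsh) coefficients f_hat(S) = E_x[ f(x) * prod_{i in S} x_i ]
# over the uniform distribution on {-1,+1}^n. The parity function puts all of its
# spectral mass on the single top-order set S = {0,...,n-1}, so no low-order
# polynomial approximates it well; AC0 functions concentrate mass on low-order S.
def prod(vals):
    out = 1
    for v in vals:
        out *= v
    return out

def fourier_coefficient(f, S, n):
    points = list(cartesian((-1, 1), repeat=n))
    return sum(f(x) * prod(x[i] for i in S) for x in points) / len(points)

n = 3
parity = lambda x: prod(x)
for S in [(), (0,), (0, 1), (0, 1, 2)]:
    print(S, fourier_coefficient(parity, S, n))   # only S = (0, 1, 2) is nonzero
```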

    6 Generalization bounds

Our estimate of the number of units and parameters needed for a deep network to approximate compositional functions with an error ε_G allows the use of one of several available bounds for the generalization error of the network to derive sample complexity bounds. As an example, consider Theorem 16.2 in [Anthony and Bartlett, 2002], which provides the following sample bound for a generalization error ε_G with probability at least 1 − δ in a network in which the W parameters (weights and biases), which are supposed to minimize the empirical error (the theorem is stated in the standard ERM setup), are expressed in terms of k bits:

M(ε_G, δ) ≤ (2/ε_G^2)(kW log 2 + log(2/δ))   (15)

This suggests the following comparison between shallow and deep compositional (here binary tree-like) networks. Assume a network size that ensures the same approximation error ε.

Then, in order to achieve the same generalization error ε_G, the sample size M_shallow of the shallow network must be much larger than the sample size M_deep of the deep network:

M_deep / M_shallow ≈ ε^n.   (16)

This implies that for largish n there is a (large) range of training set sizes between M_deep and M_shallow for which deep networks will not overfit (corresponding to small ε_G) but shallow networks will (for dimensionality n ≈ 10^4 and ε ≈ 0.1, Equation 16 yields M_shallow ≈ 10^{10^4} M_deep).
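As an illustration, the bound of Equation (15) can be evaluated directly; the parameter counts and bit precision below are assumed, hypothetical values, not the ones used in the text:

```python
import math

# Sketch (assumed numbers): the sample bound of Equation (15),
# M(eps_G, delta) <= (2 / eps_G**2) * (k * W * log(2) + log(2 / delta)),
# evaluated for a deep and a shallow network with the same approximation error.
def sample_bound(eps_g, delta, k, W):
    return (2.0 / eps_g**2) * (k * W * math.log(2) + math.log(2.0 / delta))

eps_g, delta, k = 0.05, 0.01, 32           # k bits per parameter (assumed)
W_deep, W_shallow = 10_000, 1_000_000      # hypothetical parameter counts
print(sample_bound(eps_g, delta, k, W_deep), sample_bound(eps_g, delta, k, W_shallow))
```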

A similar comparison is derived if one considers the best possible expected error obtained by a deep and a shallow network. Such an error is obtained by finding the architecture with the best trade-off between the approximation and the estimation error. The latter is essentially of the same order as the generalization bound implied by inequality (15), and is essentially the same for deep and shallow networks, that is

rn/√M,

where we denoted by M the number of samples. For shallow networks, the number of parameters corresponds to r units of n-dimensional vectors (plus offsets), whereas for deep compositional networks the number of parameters corresponds to r units of 2-dimensional vectors (plus offsets) in each of the n − 1 nodes. The number of units giving the best approximation/estimation trade-off is

r_shallow ≈ (√M / n)^{n/(m+n)}   and   r_deep ≈ (√M)^{2/(m+2)}

for shallow and deep networks, respectively. The corresponding (excess) expected errors E are

E_shallow ≈ (n / √M)^{m/(m+n)}

for shallow networks and

E_deep ≈ n (1 / √M)^{m/(m+2)}

for deep networks. For the expected error, as for the generalization error, deep networks appear to achieve an exponential gain. The above observations hold under the assumption that the optimization process during training finds the optimum parameter values for both deep and shallow networks. Taking into account optimization, e.g., by stochastic gradient descent, requires considering a further error term, but we expect that the overall conclusions about generalization properties for deep vs. shallow networks should still hold true.

Notice that, independently of considerations of generalization, deep compositional networks are expected to be very efficient memories – in the spirit of hierarchical vector quantization – for associative memories reflecting compositional rules (see Appendix 5 and [Anselmi et al., 2015c]). Notice that the advantage with respect to shallow networks from the point of view of memory capacity can be exponential (as in the example after Equation 16 showing M_shallow ≈ 10^{10^4} M_deep).

    6.1 Generalization in multi-class deep networks

Most of the successful neural networks exploit compositionality for better generalization in an additional important way (see [Lake et al., 2015]). Suppose that the mappings to be learned in a family of classification tasks (for instance, classification of different object classes in Imagenet) may be approximated by compositional functions such as f(x1, · · · , xn) = hl(· · · (h21(h11(x1, x2), h12(x3, x4)), h22(h13(x5, x6), h14(x7, x8)) · · · ) · · · ), where hl depends on the task (for instance, on which object class) but all the other constituent functions h are common across the tasks. Under such an assumption, multi-task learning, that is, training simultaneously for the different tasks, forces the deep network to “find” common constituent functions. Multi-task learning has theoretical advantages that depend on compositionality: the sample complexity of the problem can be significantly lower (see [Maurer, 2015]). Maurer's approach is in fact to consider the overall function as the composition of a preprocessor function common to all tasks followed by a task-specific function.

7 Notes on a theory of compositional computation

The key property of the theory of compositional functions sketched here is that certain deep networks can learn them avoiding the curse of dimensionality, because of the blessing of compositionality via a small effective dimension.

    We state here informally several comments and conjectures.

    1. General comments

• The estimates on the n-width imply that there is some function in either W_m^n (Theorem 1) or W_m^{n,2} (Theorem 2) for which the approximation cannot be better than that suggested by the theorems. This is of course the guarantee we want.

• Function compositionality can be expressed in terms of DAGs, with binary trees providing the simplest example. There is an analogy with n-dimensional tables decomposable into hierarchical 2-dimensional tables: this has in turn obvious connections with HVQ (Hierarchical Vector Quantization), discussed in Appendix 5.

• Our results extend beyond binary tree network architectures to other graph structures. The main issue in relating the theoretical results of this paper to networks used in practice has to do with the many “channels” used in the latter and with our assumption that each node in the networks computes the linear combination of r units (Equation 3). The following obvious but interesting extension of Theorem 1 to vector-valued functions says that the number of hidden units required for a given accuracy in each component of the function is the same as in the scalar case (but of course the number of weights is larger):

Corollary 3. Let σ : R → R be infinitely differentiable, and not a polynomial. For a vector-valued function f : R^n → R^q with components fi ∈ W_m^n, i = 1, · · · , q, the number of hidden units in shallow networks with n inputs and q outputs that provide accuracy at least ε in each of the components of f is

N = O(ε^{−n/m}).   (17)

The corollary above suggests a simple argument that generalizes our binary tree results to standard, multi-channel deep convolutional networks.

• We have used polynomials (but see Appendix 4.3) to prove results about complexity of approximation in the case of neural networks. Neural network learning with SGD may or may not synthesize polynomials, depending on the activation function and on the target. This is not a problem for theoretically establishing upper bounds on the degree of convergence, because results using the framework of nonlinear width guarantee that the “polynomial” bounds are optimal.

• Both shallow and deep representations may or may not reflect invariance to group transformations of the inputs of the function ([Soatto, 2011, Anselmi et al., 2015a]). Invariance – also called weight sharing – decreases the complexity of the network. Since we are interested in the comparison of shallow vs. deep architectures, we have considered the generic case of networks (and functions) for which invariance is not assumed. In fact, the key advantage of deep vs. shallow networks – as shown by the proof of the theorem – is the associated hierarchical locality (the constituent functions in each node are local, that is, have a small dimensionality) and not invariance (which designates shared weights, that is, nodes at the same level sharing the same function). One may then ask about the relation of these results with i-theory [Anselmi and Poggio, 2016]. The original core of i-theory describes how pooling can provide either shallow or deep networks with invariance and selectivity properties. Invariance of course helps, but not exponentially as hierarchical locality does.

• There are several properties that follow from the theory here which are attractive from the point of view of neuroscience. A main one is the robustness of the results with respect to the choice of nonlinearities (linear rectifiers, sigmoids, Gaussians, etc.) and pooling.

    • In a machine learning context, minimization over a training set of a loss function such as the square loss yields an empirical approximation of the regression function p(y|x). Our hypothesis of compositionality becomes a hypothesis about the structure of the conditional probability function.
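    As an illustration of the scale of the bounds discussed in these comments, the sketch below compares the shallow estimate N = O(ε^{−n/m}) of Theorem 1 and Corollary 3 with the bound for deep binary-tree networks, which we take here to be O((n − 1)ε^{−2/m}) as in Theorem 2; constants are ignored, the function names are ours, and the numbers are purely illustrative.

        # Order-of-magnitude unit counts implied by the bounds, ignoring constants.
        # Shallow: N ~ eps**(-n/m) (Theorem 1 / Corollary 3).
        # Deep binary tree: N ~ (n - 1) * eps**(-2/m) (assumed form of Theorem 2).

        def shallow_units(eps: float, n: int, m: int) -> float:
            """Units needed by a shallow network, up to constants."""
            return eps ** (-n / m)

        def deep_units(eps: float, n: int, m: int) -> float:
            """Units needed by a deep binary-tree network, up to constants."""
            return (n - 1) * eps ** (-2 / m)

        if __name__ == "__main__":
            eps, m = 0.1, 2
            for n in (4, 8, 16):
                print(f"n={n:2d}  shallow ~ {shallow_units(eps, n, m):.1e}  "
                      f"deep ~ {deep_units(eps, n, m):.1e}")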

    2. Spline approximations and Boolean functions

    • Consider again the case of Section 4 of a multivariate function f : [0, 1]^d → R. Suppose we discretize it by a set of piecewise constant splines and their tensor products. Each coordinate is effectively replaced by n Boolean variables. This results in a d-dimensional table with N = n^d entries. This in turn corresponds to a Boolean function f : {0, 1}^N → R. Here, the assumption of compositionality corresponds to compressibility of a d-dimensional table in terms of a hierarchy of d − 1 two-dimensional tables: instead of n^d entries there are (d − 1)n^2 entries (a counting sketch appears after this list). It seems that assumptions of the compositionality type may be at least as effective as assumptions about function smoothness in countering the curse of dimensionality in learning and approximation.

    • As Appendix 4.3 shows, every function f can be approximated by an ε-close binary function f_B. Binarization of f : R^n → R is done by using k partitions for each variable x_i and indicator functions. Thus f ↦ f_B : {0, 1}^{kn} → R and sup |f − f_B| ≤ ε, with ε depending on k and on the bound on Df.

    • If the function f is compositional, the associated Boolean function f_B is sparse; the converse is not true.

    • f_B can be written as a polynomial (a Walsh decomposition) f_B ≈ p_B. It is always possible to associate a p_B to any f, given ε.

    • The binarization argument suggests a direct way to connect results on function approximation by neural nets with older results on Boolean functions. The latter are special cases of the former results.
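    The counting sketch referred to in the first bullet above: a minimal comparison, with hypothetical values of n and d, of the n^d entries of a full d-dimensional table against the (d − 1)n^2 entries of a binary-tree hierarchy of two-dimensional tables.

        # Entry counts for a discretized function of d variables with n levels per
        # variable: full d-dimensional table vs. a hierarchy of (d - 1) 2-D tables.

        def full_table_entries(n: int, d: int) -> int:
            return n ** d

        def hierarchical_entries(n: int, d: int) -> int:
            return (d - 1) * n ** 2

        for n, d in [(10, 4), (10, 8), (10, 16)]:
            print(f"n={n}, d={d}: full={full_table_entries(n, d):.3e}, "
                  f"hierarchical={hierarchical_entries(n, d)}")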

    3. Sparsity

    • We suggest defining the binary sparsity of f in terms of the sparsity of the Boolean function p_B; binary sparsity implies that an approximation to f can be learned by non-exponential deep networks via binarization.

    • Sparsity of the Fourier coefficients is a more general constraint for learning than Tikhonov-type smoothing priors. The latter are usually equivalent to cutting high-order Fourier coefficients. The two priors can be regarded as two different implementations of sparsity, one more general than the other. Both reduce the number of terms, that is, of trainable parameters, in the approximating trigonometric polynomial – that is, the number of Fourier coefficients.

    • Sparsity in a specific basis. A set of functions may be defined to be sparse in a specific basis when the number of parameters necessary for its ε-approximation increases less than exponentially with the dimensionality. An open question is the appropriate definition of sparsity. The notion of sparsity we suggest here is the effective r in Equation 27. For a general function r ≈ k^n; we may define as sparse those functions for which r ≪ k^n

    Figure 7: A shift-invariant, scalable operator. Processing is from the bottom (input) to the top (output). Each layer consists of identical blocks; each block has two inputs and one output; each block is an operator H_2 : R^2 → R. The step of combining the two inputs of a block into one output corresponds to coarse-graining of an Ising model.

    (notice that the connections in the trees of the figures may reflect linear combinations of the input units).

    • Low order may be a key constraint for cortex. If it captures what is possible in terms of connectivity between neurons, it may by itself determine the hierarchical architecture of cortex, which in turn may impose compositionality on language and speech.

    • The idea of functions that are compositions of “simpler” functions extends in a natural way to recurrent computations and recursive functions. For instance h(f^{(t)}(g(x))) represents t iterations of the algorithm f (h and g match the input and output dimensions of f).
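    A minimal sketch of the last point, with hypothetical maps f, g and h: the compositional expression h(f^{(t)}(g(x))) is just t-fold iteration of f bracketed by input and output maps.

        # t-fold iteration of f, bracketed by (hypothetical) maps g and h that
        # match input and output dimensions: h(f^(t)(g(x))).

        from typing import Callable

        def iterate(f: Callable[[float], float], t: int) -> Callable[[float], float]:
            """Return the t-fold composition f o f o ... o f."""
            def f_t(x: float) -> float:
                for _ in range(t):
                    x = f(x)
                return x
            return f_t

        g = lambda x: x / 2.0             # input map
        f = lambda x: 3.8 * x * (1 - x)   # example iterated map
        h = lambda x: 2.0 * x             # output map

        print(h(iterate(f, t=5)(g(0.7))))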

    8 Why are compositional functions so common or important?

    The first step in formalizing the requirements of local compositionality on algorithms is to define scalable computations as a subclass of nonlinear discrete operators, mapping vectors from R^n into R^d (for simplicity we set d = 1 in what follows). Informally, we call an algorithm K_n : R^n → R scalable if it maintains the same “form” when the input vectors increase in dimensionality; that is, the same kind of computation takes place when the size of the input vector changes. This motivates the following construction. Consider a “layer” operator H_{2m} : R^{2m} → R^{2m−2} for m ≥ 1 with a special structure that we call “shift invariance”.

    Definition 2. For integer m ≥ 2, an operator H_{2m} is shift-invariant if H_{2m} = H′_m ⊕ H″_m, where R^{2m} = R^m ⊕ R^m, H′ = H″ and H′ : R^m → R^{m−1}. An operator K_{2M} : R^{2M} → R is called scalable and shift-invariant if K_{2M} = H_2 ◦ · · · ◦ H_{2M} where each H_{2k}, 1 ≤ k ≤ M, is shift-invariant.

    We observe that scalable shift-invariant operators K : R^{2m} → R have the structure K = H_2 ◦ H_4 ◦ H_6 ◦ · · · ◦ H_{2m}, with H_4 = H′_2 ⊕ H′_2, H_6 = H″_2 ⊕ H″_2 ⊕ H″_2, etc.

    Thus the structure of a shift-invariant, scalable operator consists of several layers; each layer consists of identical blocks; each block is an operator H : R^2 → R: see Figure 7. We note also that an alternative, weaker constraint on H_{2m} in Definition 2, instead of shift invariance, is mirror symmetry, that is H″ = R ◦ H′, where R is a reflection. Obviously, shift-invariant scalable operators are equivalent to shift-invariant compositional functions. Obviously the definition can be changed in several of its details. For instance, for two-dimensional images the blocks could be operators H : R^5 → R mapping a neighborhood around each pixel into a real number.

    The final step in the argument uses the results of the previous sections to claim that a nonlinear node with two inputs and enough units can approximate arbitrarily well each of the H_2 blocks. This leads to the conclusion that deep convolutional neural networks are natural approximators of scalable, shift-invariant operators.
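    The following sketch makes the construction concrete, following the block picture of Figure 7 rather than the formal R^{2m} → R^{2m−2} layers of Definition 2: every layer applies one and the same two-input block h to adjacent pairs of inputs, and composing layers yields a scalar output. The block h chosen below (a max) is only a stand-in; in a deep network each H_2 block would be approximated by a small set of units.

        # A scalable, shift-invariant operator built from a single block h: R^2 -> R,
        # applied identically within each layer (binary tree as in Figure 7).

        from typing import Callable, List

        def layer(h: Callable[[float, float], float], v: List[float]) -> List[float]:
            """One layer: the identical block h acts on each adjacent pair."""
            assert len(v) % 2 == 0, "layer input length must be even"
            return [h(v[i], v[i + 1]) for i in range(0, len(v), 2)]

        def scalable_operator(h: Callable[[float, float], float], v: List[float]) -> float:
            """Compose layers until a single output remains (length a power of 2)."""
            while len(v) > 1:
                v = layer(h, v)
            return v[0]

        h = lambda a, b: max(a, b)   # stand-in block, e.g. max-pooling
        print(scalable_operator(h, [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]))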

    Let us provide a couple of very simple examples of compositional functions. Addition is compositional, but the degree of approximation does not improve by decomposing addition into different layers of a network; all linear operators are compositional, with no advantage for deep networks. Multiplication, as well as the AND operation (for Boolean variables), is the prototypical compositional function that provides an advantage to deep networks.

    This line of argument defines a class of algorithms that is universal and can be optimal for a large set of problems. It does not, however, explain why problems encountered in practice should match this class of algorithms. Though we and others have argued that the explanation may lie in either the physics or the neuroscience of the brain, these arguments (see Appendix 2) are not (yet) rigorous. Our conjecture is that compositionality is imposed by the wiring of our cortex and is reflected in language. Thus the compositionality of many computations on images may reflect the way we describe and think about them.

    Acknowledgment

    This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. H.N.M. was supported in part by ARO Grant W911NF-15-1-0385. We thank O. Shamir for useful emails on lower bounds.

    References

    [Abadi et al., 2015] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

    [Anselmi et al., 2014] Anselmi, F., Leibo, J., Rosasco, L., Mutch, J., Tacchetti, A., and Poggio, T. (2014). Unsupervised learning of invariant representations with low sample complexity: the magic of sensory cortex or a new framework for machine learning? Center for Brains, Minds and Machines (CBMM) Memo No. 1, arXiv:1311.4158v5.

    [Anselmi et al., 2015a] Anselmi, F., Leibo, J. Z., Rosasco, L., Mutch, J., Tacchetti, A., and Poggio, T. (2015a). Unsupervised learning of invariant representations. Theoretical Computer Science.

    [Anselmi and Poggio, 2016] Anselmi, F. and Poggio, T. (2016). Visual Cortex and Deep Networks. MIT Press.

    [Anselmi et al., 2015b] Anselmi, F., Rosasco, L., Tan, C., and Poggio, T. (2015b). Deep convolutional networks are hierarchical kernel machines. Center for Brains, Minds and Machines (CBMM) Memo No. 35, also in arXiv.

    [Anselmi et al., 2015c] Anselmi, F., Rosasco, L., Tan, C., and Poggio, T. (2015c). Deep Convolutional Networks are Hierarchical Kernel Machines. Center for Brains, Minds and Machines (CBMM) Memo No. 35, also in arXiv.

    [Anthony and Bartlett, 2002] Anthony, M. and Bartlett, P. (2002). Neural Network Learning: Theoretical Foundations. Cambridge University Press.

    [B. Moore and Poggio, 1998] Moore, B. and Poggio, T. (1998). Representation properties of multilayer feedforward networks. Abstracts of the First Annual INNS Meeting, 320:502.

    [Bach, 2014] Bach, F. (2014). Breaking the curse of dimensionality with convex neural networks. arXiv:1412.8690.

    [Bengio and LeCun, 2007] Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In Bottou, L., Chapelle, O., DeCoste, D., and Weston, J., editors, Large-Scale Kernel Machines. MIT Press.

    [Bergstra and Bengio, 2012] Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13:281–305.

    [Chui et al., 1994] Chui, C., Li, X., and Mhaskar, H. (1994). Neural networks for localized approximation. Mathematics of Computation, 63(208):607–623.

    [Chui et al., 1996] Chui, C. K., Li, X., and Mhaskar, H. N. (1996). Limitations of the approximation capabilities of neural networks with one hidden layer. Advances in Computational Mathematics, 5(1):233–243.

    [Cohen et al., 2015] Cohen, N., Sharir, O., and Shashua, A. (2015). On the expressive power of deep learning: a tensor analysis. CoRR, abs/1509.0500.

    [Corominas and Balaguer, 1954] Corominas, E. and Balaguer, F. S. (1954). Condiciones para que una funcion infinitamente derivable sea un polinomio. Revista Matemática Hispanoamericana, 14(1):26–43.

    [Cramer and Wold, 1936] Cramer, H. and Wold, H. (1936). Some theorems on distribution functions. J. London Math. Soc., 4:290–294.

    [Delalleau and Bengio, 2011] Delalleau, O. and Bengio, Y. (2011). Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011, Granada, Spain, pages 666–674.

    [DeVore et al., 1989] DeVore, R. A., Howard, R., and Micchelli, C. A. (1989). Optimal nonlinear approximation. Manuscripta Mathematica, 63(4):469–478.

    [Fukushima, 1980] Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202.

    [Girosi et al., 1995] Girosi, F., Jones, M., and Poggio, T. (1995). Regularization theory and neural networks architectures. Neural Computation, 7:219–269.

    [Girosi and Poggio, 1989] Girosi, F. and Poggio, T. (1989). Representation properties of networks: Kolmogorov's theorem is irrelevant. Neural Computation, 1(4):465–469.

    [Girosi and Poggio, 1990] Girosi, F. and Poggio, T. (1990). Networks and the best approximation property. Biological Cybernetics, 63:169–176.

    [Grasedyck, 2010] Grasedyck, L. (2010). Hierarchical singular value decomposition of tensors. SIAM J. Matrix Anal. Appl., 31(4):2029–2054.

    [Hastad, 1987] Hastad, J. T. (1987). Computational Limitations for Small Depth Circuits. MIT Press.

    [Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

    [Lake et al., 2015] Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.

    [LeCun et al., 2015] LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521:436–444.

    [Liao and Poggio, 2016] Liao, Q. and Poggio, T. (2016). Bridging the gap between residual learning, recurrent neural networks and visual cortex. Center for Brains, Minds and Machines (CBMM) Memo No. 47, also in arXiv.

    [Lin, 2016] Lin, H. and Tegmark, M. (2016). Why does deep and cheap learning work so well? arXiv:1608.08225, pages 1–14.

    [Linial et al., 1993] Linial, N., Mansour, Y., and Nisan, N. (1993). Constant depth circuits, Fourier transform, and learnability. Journal of the ACM, 40(3):607–620.

    [Livni et al., 2013] Livni, R., Shalev-Shwartz, S., and Shamir, O. (2013). A provably efficient algorithm for training deep networks. CoRR, abs/1304.7045.

    [Mansour, 1994] Mansour, Y. (1994). Learning Boolean functions via the Fourier transform. In Roychowdhury, V., Siu, K., and Orlitsky, A., editors, Theoretical Advances in Neural Computation and Learning, pages 391–424. Springer US.

    [Maurer, 2015] Maurer, A. (2015). Bounds for linear multi-task learning. Journal of Machine Learning Research.

    [Mhaskar, 1993a] Mhaskar, H. (1993a). Approximation properties of a multilayered feedforward artificial neural network. Advances in Computational Mathematics, pages 61–80.

    [Mhaskar et al., 2016] Mhaskar, H., Liao, Q., and Poggio, T. (2016). Learning real and Boolean functions: When is deep better than shallow? Center for Brains, Minds and Machines (CBMM) Memo No. 45, also in arXiv.

    [Mhaskar and Poggio, 2016] Mhaskar, H. and Poggio, T. (2016). Deep versus shallow networks: an approximation theory perspective. Center for Brains, Minds and Machines (CBMM) Memo No. 54, also in arXiv.

    [Mhaskar, 1993b] Mhaskar, H. N. (1993b). Neural networks for localized approximation of real functions. In Neural Networks for Processing [1993] III. Proceedings of the 1993 IEEE-SP Workshop, pages 190–196. IEEE.

    [Mhaskar, 1996] Mhaskar, H. N. (1996). Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8(1):164–177.

    [Mhaskar, 2004] Mhaskar, H. N. (2004). On the tractability of multivariate integration and approximation by neural networks. J. Complex., 20(4):561–590.

    [Mihalik, 1992] Mihalik, J. (1992). Hierarchical vector quantization of images in transform domain. Elektrotechn. Cas., 43(3):92–94.

    [Minsky and Papert, 1972] Minsky, M. and Papert, S. (1972). Perceptrons: An Introduction to Computational Geometry. The MIT Press, Cambridge, MA. ISBN 0-262-63022-2.

    [Montufar et al., 2014] Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. (2014). On the number of linear regions of deep neural networks. Advances in Neural Information Processing Systems, 27:2924–2932.

    [Niyogi and Girosi, 1996] Niyogi, P. and Girosi, F. (1996). On the relationship between generalization error, hypothesis complexity and sample complexity in regularization networks. Neural Computation, 8(4).

    [Pinkus, 1999] Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143–195.

    [Poggio, 1975] Poggio, T. (1975). On optimal nonlinear associative recall. Biological Cybernetics, 19:201–209.

    [Poggio et al., 2015a] Poggio, T., Anselmi, F., and Rosasco, L. (2015a). I-theory on depth vs width: hierarchical function composition. CBMM Memo No. 041.

    [Poggio and Girosi, 1989] Poggio, T. and Girosi, F. (1989). A theory of networks for approximation and learning. Artificial Intelligence Laboratory, Massachusetts Institute of Technology, A.I. Memo No. 1140.

    [Poggio and Reichardt, 1980] Poggio, T. and Reichardt, W. (1980). On the representation of multi-input systems: Computational properties of polynomial algorithms. Biological Cybernetics, 37(3):167–186.

    [Poggio et al., 2015b] Poggio, T., Rosasco, L., Shashua, A., Cohen, N., and Anselmi, F. (2015b). Notes on hierarchical splines, DCLNs and i-theory. CBMM Memo No. 037.

    [Poggio et al., 2015c] Poggio, T., Rosasco, L., Shashua, A., Cohen, N., and Anselmi, F. (2015c). Notes on hierarchical splines, DCLNs and i-theory. Technical report, MIT Computer Science and Artificial Intelligence Laboratory.

    [Poggio and Smale, 2003] Poggio, T. and Smale, S. (2003). The mathematics of learning: Dealing with data. Notices of the American Mathematical Society (AMS), 50(5):537–544.

    [Riesenhuber and Poggio, 1999] Riesenhuber, M. and Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019–1025.

    [Ruderman, 1997] Ruderman, D. (1997). Origins of scaling in natural images. Vision Research, pages 3385–3398.

    [Safran and Shamir, 2016] Safran, I. and Shamir, O. (2016). Depth separation in ReLU networks for approximating smooth non-linear functions. arXiv:1610.09887v1.

    [Shalev-Shwartz and Ben-David, 2014] Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.

    [Soatto, 2011] Soatto, S. (2011). Steps towards a theory of visual information: active perception, signal-to-symbol conversion and the interplay between sensing and control. arXiv:1110.2053, pages 0–151.

    [Telgarsky, 2015] Telgarsky, M. (2015). Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101v2 [cs.LG].

    Appendix: definitions and observations, theorems and theories

    1 Boolean Functions

    One of the most important tools for theoretical computer scientists for the study of functions of n Boolean variables, their related circuit design and several associated learning problems, is the Fourier transform over the Abelian group Z_2^n. This is known as Fourier analysis over the Boolean cube {−1, 1}^n. The Fourier expansion of a Boolean function f : {−1, 1}^n → {−1, 1}, or even of a real-valued Boolean function f : {−1, 1}^n → [−1, 1], is its representation as a real polynomial, which is multilinear because of the Boolean nature of its variables. Thus for Boolean functions the Fourier representation is identical to the polynomial representation, and in this paper we use the two terms interchangeably. Unlike for functions of real variables, the full finite Fourier expansion is exact rather than an approximation, and there is no need to distinguish between trigonometric and real polynomials. Most of the properties of standard harmonic analysis are otherwise preserved, including Parseval's theorem. The terms in the expansion correspond to the various monomials; the low-order ones are parity functions over small subsets of the variables and correspond to low degrees and low frequencies in the case of polynomial and Fourier approximations, respectively, for functions of real variables.
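    As a concrete complement to this description, the sketch below computes the Fourier (Walsh) coefficients of a small Boolean function by brute force; the example function (3-variable majority) and the helper names are ours, not taken from the paper.

        # Fourier (Walsh) expansion over the Boolean cube {-1, 1}^n:
        # hat_f(S) = E_x[ f(x) * prod_{i in S} x_i ], computed by enumeration.

        from itertools import product, combinations
        from math import prod

        def fourier_coefficients(f, n):
            """Return {S: hat_f(S)} for all subsets S of {0, ..., n-1}."""
            cube = list(product([-1, 1], repeat=n))
            coeffs = {}
            for k in range(n + 1):
                for S in combinations(range(n), k):
                    coeffs[S] = sum(f(x) * prod(x[i] for i in S) for x in cube) / len(cube)
            return coeffs

        # Example: 3-variable majority; only low-order (odd) parities appear.
        maj3 = lambda x: 1 if sum(x) > 0 else -1
        for S, c in fourier_coefficients(maj3, 3).items():
            if abs(c) > 1e-12:
                print(S, c)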

    2 Does Physics or Neuroscience imply compositionality?

    It has often been argued that not only text and speech are compositional but so are images. There are many phenomena in nature that have descriptions along a range of rather different scales. An extreme case consists of fractals, which are infinitely self-similar, iterated mathematical constructs. As a reminder, a self-similar object is similar to a part of itself (i.e. the whole is similar to one or more of the parts). Many objects in the real world are statistically self-similar, showing the same statistical properties at many scales: clouds, river networks, snow flakes, crystals and neurons branching. A relevant point is that the shift-invariant scalability of image statistics follows from the fact that objects contain smaller clusters of similar surfaces in a self-similar, fractal way. Ruderman's analysis [Ruderman, 1997] shows that image statistics reflect what has been known as the property of compositionality of objects and parts: parts are themselves objects, that is, self-similar clusters of similar surfaces in the physical world. Notice however that, from the point of view of this paper, it is misleading to say that an image is compositional: in our terminology a function on an image may be compositional but not its argument. In fact, functions to be learned may or may not be compositional even if their input is an image, since they depend not only on the input but also on the task (in the supervised case of deep learning networks all weights depend on x and y). Conversely, a network may be given a function which can be written in a compositional form, independently of the nature of the input vector, such as the function “multiplication of all scalar input components”. Thus a more reasonable statement is that “many natural questions on images correspond to algorithms which are compositional”. Why this is the case is an interesting open question. An answer inspired by the condition of “locality” of the constituent functions in our theorems and by the empirical success of deep convolutional networks has attracted some attention. The starting observation is that in the natural sciences – physics, chemistry, biology – many phenomena seem to be described well by processes that take place at a sequence of increasing scales and are local at each scale, in the sense that they can be described well by neighbor-to-neighbor interactions.

    Notice that this is a much less stringent requirement than renormalizable physical processes [Lin, 2016], where the same Hamiltonian (apart from a scale factor) is required to describe the process at each scale (in our observation above, the renormalized Hamiltonian only needs to remain local at each renormalization step)¹. As discussed previously [Poggio et al., 2015a], hierarchical locality may be related to properties of basic physics that imply local interactions at each level in a sequence of scales, possibly different at each level. To complete the argument one would then have to assume that several different questions on sets of natural images may share some of the initial inference steps (first layers in the associated deep network) and thus share some of the features computed by intermediate layers of a deep network. In any case, at least two open questions remain that require formal theoretical results in order to explain the connection between hierarchical, local functions and physics:

    • can hierarchical locality be derived from the Hamiltonians of physics? In other words, under which conditions does coarse graining lead to local Hamiltonians?

    • is it possible to formalize how and when the local hierarchical structure of computations on images is related to the hierarchy of local physical processes that describe the physical world represented in the image?

    It seems to us that the above set of arguments is unsatisfactory and unlikely to provide the answer. One of the arguments is that iterated local functions (from R^n to R^n with n increasing without bound) can be Turing universal and thus can simulate any physical phenomenon (as shown by the Game of Life, which is local and Turing universal). Of course, this does not imply that the simulation will be efficient, but it weakens the physics-based argument. An alternative hypothesis, in fact, is that locality across levels of explanation originates from the structure of the brain – wired, say, similarly to convolutional deep networks – which is then forced to use local algorithms of the type shown in Figure 7. Such local algorithms allow the organism to survive because enough of the key problems encountered during evolution can be solved well enough by them. So instead of claiming that all questions on the physical world are local because of its physics, we believe that local algorithms are good enough over the distribution of evolutionarily relevant problems. From this point of view, locality of algorithms follows from the need to optimize local connections and to reuse computational elements. Despite the high number of synapses on each neuron, it would be impossible for a complex cell to pool information across all the simple cells needed to cover an entire image, as needed by a single-hidden-layer network.

    3 Splines: some observations

    3.1 Additive and Tensor Product Splines

    Additive and tensor product splines are two alternatives to radial kernels for multidimensional function approximation. It is well known that the three techniques follow from classical Tikhonov regularization and correspond to one-hidden-layer networks with either the square loss or the SVM loss.

    ¹Tegmark and Lin [Lin, 2016] have also suggested that a sequence of generative processes can be regarded as a Markov sequence that can be inverted to provide an inference problem with a similar compositional structure. The resulting compositionality they describe does not, however, correspond to our notion of hierarchical locality, and thus our theorems cannot be used to support their claims.

    We recall the extension of classical spline approximation techniques to multidimensional functions. The setup is due to Jones et al. (1995) [Girosi et al., 1995].

    3.1.1 Tensor product splines

    The best-known multivariate extension of one-dimensional splines is based on the use of radial kernels such as the Gaussian or the multiquadric radial basis function. An alternative to choosing a radial function is a tensor product type of basis function, that is, a function of the form

    K(x) = ∏_{j=1}^{d} k(x_j)

    where x_j is the j-th coordinate of the vector x and k(x) is the inverse Fourier transform associated with a Tikhonov stabilizer (see [Girosi et al., 1995]).

    We notice that the choice of the Gaussian basis function for k(x) leads to a Gaussian radial approximation scheme with K(x) = e^{−‖x‖²}.
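    A quick numerical check of this remark (the helper names are ours): with k(x_j) = e^{−x_j²}, the tensor product basis function coincides with the radial Gaussian e^{−‖x‖²}.

        # Tensor product of 1-D Gaussians equals the radial Gaussian exp(-||x||^2).

        import math
        import random

        def tensor_product_K(x):
            return math.prod(math.exp(-t * t) for t in x)

        def radial_gaussian(x):
            return math.exp(-sum(t * t for t in x))

        random.seed(0)
        x = [random.uniform(-1, 1) for _ in range(5)]
        print(tensor_product_K(x), radial_gaussian(x))   # equal up to rounding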

    3.1.2 Additive splines

    Additive approximation schemes can also be derived in the framework of regularization theory. With additive approximation we mean an approximation of the form

    f(x) = ∑_{μ=1}^{d} f_μ(x^μ)   (19)

    where x^μ is the μ-th component of the input vector x and the f_μ are one-dimensional functions that will be defined as the additive components of f (from now on, Greek letter indices will be used in association with components of the input vectors). Additive models are well known in statistics (at least since Stone, 1985) and can be considered as a generalization of linear models. They are appealing because, being essentially a superposition of one-dimensional functions, they have low complexity, and they share with linear models the feature that the effects of the different variables can be examined separately. The resulting scheme is very similar to Projection Pursuit Regression. We refer to [Girosi et al., 1995] for references and a discussion of how such approximations follow from regularization.

    Girosi et al. [Girosi et al., 1995] derive an approximation scheme of the form (with i corresponding to spline knots, which are free parameters found during learning as in free-knot splines, and μ corresponding to new variables obtained as linear combinations of the original components of x):

    f(x) = ∑_{μ=1}^{d'} ∑_{i=1}^{n} c_{μi} K(⟨t_μ, x⟩ − b_{μi}).   (20)

    Note that the above can be called a spline only with a stretch of the imagination: not only the w_μ but also the t_μ depend on the data in a very nonlinear way. In particular, the t_μ may not correspond at all to actual data points. The approximation could be called ridge approximation and is related to projection pursuit. When the basis function K is the absolute value, that is K(x − y) = |x − y|, the network implements piecewise linear splines.

    3.2 Hierarchical Splines

    Consider an additive approximation scheme (see the subsection above) in which a function of d variables is approximated by an expression such as

    f(x) = ∑_{i=1}^{d} φ_i(x^i)   (21)

    where x^i is the i-th component of the input vector x and the φ_i are one-dimensional spline approximations. For linear piecewise splines φ_i(x^i) = ∑_j c_{ij} |x^i − b^j_i|. Obviously such an approximation is not universal: for instance it cannot approximate the function f(x, y) = xy. The classical way to deal with the problem is to use tensor product splines. The new alternative that we propose here is hierarchical additive splines, which in the case of a 2-layer hierarchy has the form

    f(x) = ∑_{j=1}^{K} φ_j( ∑_{i=1}^{d} φ_i(x^i) ).   (22)

    and which can clearly be extended to an arbitrary depth. The intuition is that in this way it is possible to obtain an approximation of a function of several variables from functions of one variable, because interaction terms such as xy in a polynomial approximation of a function f(x, y) can be obtained from terms such as e^{log(x)+log(y)}.
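    The exp/log intuition can be written out directly as a two-layer additive composition (a toy sketch on positive inputs; in the networks of Eq. (22) the logarithm and exponential would themselves be one-dimensional splines):

        # On positive inputs, the interaction term xy equals exp(log x + log y),
        # i.e. a composition of one-dimensional functions with a sum in between.

        import math

        def inner(x: float, y: float) -> float:
            # first additive layer: sum of one-dimensional functions
            return math.log(x) + math.log(y)

        def outer(s: float) -> float:
            # second layer: a single one-dimensional function
            return math.exp(s)

        x, y = 1.7, 2.3
        print(outer(inner(x, y)), x * y)   # both approximately 3.91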

    We start with a lemma about the relation between linear rectifiers, which do not correspond to a kernel, and the absolute value, which is a kernel.

    Lemma 1. Any given superposition of linear rectifiers ∑_i c′_i (x − b′_i)_+ with c′_i, b′_i given, can be represented over a finite interval in terms of the absolute value kernel with appropriate weights. Thus there exist c_i, b_i such that ∑_i c′_i (x − b′_i)_+ = ∑_i c_i |x − b_i|.

    The proof follows from the facts that a) a superposition of ramps is a piecewise linear function, b) piecewise linear functions can be represented in terms of linear splines, and c) the kernel corresponding to linear splines in one dimension is the absolute value K(x, y) = |x − y|.
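    A numerical check of Lemma 1 on [0, 1] (the particular coefficients are ours, chosen from the identity (x − b)_+ = |x − b|/2 + (x − b)/2 and from the fact that, on a finite interval, a linear term can be rewritten using an absolute-value knot placed outside the interval):

        # Verify that a given sum of ramps equals a sum of absolute values on [0, 1].

        def ramp(x: float, b: float) -> float:
            return max(x - b, 0.0)

        def ramp_via_abs(x: float, b: float, c: float = 2.0) -> float:
            # (x - b)_+ = 0.5*|x - b| + 0.5*(x - b); on [0, 1] with c >= 1,
            # x - b = (c - b) - |x - c|, so only absolute values (plus a constant) appear.
            return 0.5 * abs(x - b) + 0.5 * ((c - b) - abs(x - c))

        bs, cs = [0.2, 0.5, 0.8], [1.0, -2.0, 0.5]   # arbitrary given b'_i, c'_i
        for x in [0.0, 0.3, 0.6, 0.95]:
            lhs = sum(ci * ramp(x, bi) for ci, bi in zip(cs, bs))
            rhs = sum(ci * ramp_via_abs(x, bi) for ci, bi in zip(cs, bs))
            print(round(lhs, 6), round(rhs, 6))   # identical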

    Now consider two layers in a network in which we assume degenerate pooling for simplicity of the argument. Because of Lemma 1, and because weights and biases are arbitrary, we assume that the nonlinearity on each edge is the absolute value. Under this assumption, unit j in the first layer, before the nonlinearity, computes

    f^j(x) = ∑_i c^j_i |⟨t_i, x⟩ − b_i|,   (23)

    where x and the t_i are vectors and the b_i are real numbers. Then the second-layer output can be calculated with the nonlinearity | · | instead of (·)².

    In the case of a network with two inputs x, y, the effective output after pooling at the first layer may be φ⁽¹⁾(x, y) = t_1|x + b_1| + t_2|y + b_2|, that is, the linear combination of two “absolute value” functions. At the second layer, terms like φ⁽²⁾(x, y) = |t_1|x + b_1| + t_2|y + b_2| + b_3| may appear. The output of a second layer still consists of hyperplanes, since the layer is a kernel machine with an output which is always a piecewise linear spline.

    Networks implementing tensor product splines are universal, in the sense that they approximate any continuous function on an interval given enough units. Additive splines on linear combinations of the input variables, of the form in Eq. (23), are also universal (use Theorem 3.1 in [Pinkus, 1999]). However, additive splines on the individual variables are not universal, while hierarchical additive splines are:

    Theorem. Hierarchical additive spline networks are universal.

    4 On multivariate function approximation

    Consider a multivariate function f : [0, 1]^d → R discretized by tensor basis functions

    φ_{(i_1,...,i_d)}(x_1, ..., x_d) := ∏_{μ=1}^{d} φ_{i_μ}(x_μ),   (24)

    with φ_{i_μ} : [0, 1] → R, 1 ≤ i_μ ≤ n_μ, 1 ≤ μ ≤ d, to provide

    f(x_1, ..., x_d) = ∑_{i_1=1}^{n_1} · · · ∑_{i_d=1}^{n_d} c(i_1, ..., i_d) φ_{(i_1,...,i_d)}(x_1, ..., x_d).   (25)

    The one-dimensional basis functions could be polynomials (as above), indicator functions, wavelets, or other sets of basis functions. The total number N of basis functions scales exponentially in d as N = ∏_{μ=1}^{d} n_μ; for a fixed smoothness class m and target accuracy ε it scales as ε^{−d/m}.
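    For completeness, a one-line count of the size of the tensor-product dictionary in Eq. (25) (illustrative numbers, helper name ours):

        # Number of tensor-product basis functions (and coefficients c(i_1,...,i_d)):
        # the product of the n_mu, i.e. n**d when every n_mu = n.

        from math import prod

        def num_tensor_basis(n_per_dim):
            return prod(n_per_dim)

        for d in (2, 4, 8, 16):
            print(d, num_tensor_basis([10] * d))   # 10**d grows exponentially in d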

    We can regard neural networks as implementing some form of this general approximation scheme. The problem is that the type of operations available in the networks is limited. In particular, most of the networks do not include the product operation (apart from “sum-product” networks, also called “algebraic circuits”), which is needed for a straightforward implementation of the tensor product approximation described above. Equivalent implementations can be achieved