

Adaptive Contextual Processing of Structured Data by Recursive Neural Networks: A Survey of Computational Properties

    Barbara Hammer1, Alessio Micheli2, and Alessandro Sperduti3

1 Institute of Computer Science, Clausthal University of Technology, Julius-Albert-Straße 4, Germany. [email protected]

2 Dipartimento di Informatica, Università di Pisa, Largo Bruno Pontecorvo 3, 56127 Pisa, Italy. [email protected]

3 Dipartimento di Matematica Pura ed Applicata, Università di Padova, Via Trieste 63, 35121 Padova, Italy. [email protected]

    1 Introduction

In many real world application domains, entities are compound objects and the main computational task consists in discovering relations among the different parts of the objects or with respect to some other entity of interest. Very often, the problem formulation includes both discrete (e.g. symbols) and numerical entities and/or functions. For example, in Chemistry and Biochemistry, chemical compounds are naturally represented by their molecular graph, where the information attached to each vertex is a symbol denoting the name of the involved atom/atoms, and where a major task is to correlate the chemical structures with their properties (e.g. boiling point, chemical reactivity, biological and pharmaceutical properties, etc.). The acronyms QSPR and QSAR are used in these fields to denote the analysis of the quantitative relationships between structure and properties or biological activities.

Learning methods able to deal with both discrete information, such as the structure of a compound object, and numerical relationships and functions, such as Recursive Neural Networks (RNNs) [34], offer an opportunity to directly approach QSPR/QSAR analysis when the knowledge of the mathematical relationships among data is poor or absent and when data are affected by noise and errors, as is typical when they are experimentally acquired.

In particular, the RNN approach to QSPR/QSAR is characterized by the direct treatment of variable-size structured data (in the form of hierarchical data), which is a vehicle of information much richer than the flat vectors of descriptors used in traditional QSPR/QSAR approaches. Moreover, RNNs provide an adaptive encoding of such input structures, i.e. the encoding of the structures is a function learned according to the task and the training data. The use of an adaptive model for structures avoids both the need for an a priori definition of a flat molecular representation (by vectors of descriptors) and the need for a prior definition of a similarity measure for the structured data.

The success of the approach has been proven on some datasets both by comparison with state-of-the-art QSPR/QSAR approaches and, recently, by empirical comparison against a Support Vector Machine (SVM) with a tree kernel [23]. The flexibility of the structure-based approach allows for the treatment of different problems and data sets: recent applications showing the feasibility of the approach, exploiting RNNs in a varied set of QSPR and QSAR analysis problems, can be found in [4, 2, 27, 28, 1, 6, 7]. The RNN approach can be further exploited to deal with tasks for which the lack of prior knowledge on the problem makes the direct use of structural information for the prediction task appealing.

These features of RNNs can also be exploited in other structured domains, such as XML document categorization or Natural Language Processing tasks, just to name a few.

In this chapter, we present some of the major RNN models with simple recursive dynamics and with extensions by contextual connections. The aim of the chapter is to review the computational properties of these models and to give a comprehensive and unitary view of both the classes of data that can be used for learning, e.g. sequences, trees, acyclic graphs, and various modifications (such as positional graphs), and the classes of transductions that can be computed and approximated.

In Section 2 we present the different types of structural transductions which are of interest. In particular, causal and contextual transductions are considered. Mathematical models for implementing these types of transductions, amenable to neural realizations, are discussed in Section 3. In Section 4, we discuss the computational properties of the presented models, relating them to each other with respect to both the class of data they are able to process and the class of transductions they can realize with respect to their functional dependencies, disregarding computational issues. In Section 5, the universal approximation ability of the models is discussed based on these results on the functional dependency of recursive mechanisms.

    2 Structural Transductions

A transduction T : G → O is a mapping between a structured domain (SD) G and a discrete or continuous output space O. Various classes of data can be considered in a SD. Vectors constitute the simplest data structures, corresponding to a flat domain. In fact, each vector can be seen as a single vertex with label l(v) composed of a tuple of attributes, which can represent both numerical and categorical information (see footnote 4).

In a general discrete structure, a graph g ∈ G, we have a set of labeled vertexes Vert(g) connected by a set of edges Edg(g). Discrete structures of interest are hierarchical structures, such as sequences (where a total order on the vertexes is defined), rooted positional trees (where a distinguished vertex is the root, a partial topological order holds among the vertexes, and a position is assigned to each edge leaving each vertex) and Directed Positional Acyclic Graphs (DPAGs), the generalization of this tree structure to acyclic graphs. Positional trees with bounded out-degree K are also called K-ary trees. In the following we use the term tree to refer to the class of labeled K-ary trees.

A supersource for DPAGs is defined as a vertex s, with zero in-degree, such that every vertex in the graph can be reached by a directed path starting from s. In the case of trees, the supersource is always defined by the root node. Rooted positional trees are the subclass of DPAGs, with supersource, formed by prohibiting cycles in the undirected version of the graph. In a DPAG the position is specified by a positional index assigned to each edge entering or leaving a node v. Formally, in a DPAG we assume that for each vertex v ∈ Vert(g), with set of edges denoted by edg(v), two injective functions P_v : edg(v) → [1, 2, ..., in] and S_v : edg(v) → [1, 2, ..., out] are defined on the edges entering and leaving v, respectively. For a DPAG, if an edge (u, v) is present in Edg(g), then u is a parent (predecessor) of v (denoted by pa_j[v] if P_v((u, v)) = j) and v is a child (or successor) of u (denoted by ch_j[u] if S_u((u, v)) = j). If there exists a path from u to v then u is an ancestor of v and v is a descendant of u. As before, we assume a restriction k on the number of children and parents per node. Often, we assume k = 2 for simplicity.

Other subclasses of DPAGs can be considered for special purposes, e.g. the class of Directed Bipositional Acyclic Graphs (DBAGs), where a constraint is added on the positions such that P_u((v, u)) = S_v((v, u)) for each edge (v, u) in Edg(g), i.e. the enumeration of children and parents is the same.
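To make the positional notation concrete, the following minimal Python sketch (illustrative, not from the paper) represents a labeled DPAG through explicit child and parent position lists and checks the DBAG compatibility constraint P_u((v, u)) = S_v((v, u)); the class and function names are hypothetical.

```python
# Minimal sketch (illustrative, not from the paper): a labeled DPAG with
# positional children and parents, plus a check of the DBAG constraint
# P_u((v, u)) = S_v((v, u)) for every edge.
class Vertex:
    def __init__(self, label, k=2):
        self.label = label
        self.children = [None] * k   # position j -> ch_j[v]  (S_v)
        self.parents = [None] * k    # position j -> pa_j[v]  (P_v)

def add_edge(parent, child, child_pos, parent_pos):
    """Insert an edge: child_pos enumerates the child within the parent (S),
    parent_pos enumerates the parent within the child (P)."""
    parent.children[child_pos] = child
    child.parents[parent_pos] = parent

def is_dbag(vertices):
    """True iff children and parents are enumerated compatibly (DBAG)."""
    return all(c is None or c.parents[j] is v
               for v in vertices for j, c in enumerate(v.children))

# small example with a shared vertex g
a, f, g = Vertex("a"), Vertex("f"), Vertex("g")
add_edge(a, f, 0, 0)   # f is child 0 of a, a is parent 0 of f
add_edge(a, g, 1, 1)   # g is child 1 of a, a is parent 1 of g
add_edge(f, g, 0, 0)   # g is shared: child 0 of f, f is parent 0 of g
print(is_dbag([a, f, g]))   # True: child and parent positions agree
```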

Over a SD various classes of transductions can be defined. Of specific interest to learning applications are the following classes:

Adaptive transductions: the similarity measure on structures is learned for the task at hand, e.g. using a trainable RNN.

Contextual transductions: for each vertex the response depends on the whole information represented in the structured data, according to its topology.

Causal transductions (over hierarchical data structures): for each vertex v the response only depends on v and its descendant vertexes (i.e. the vertexes in Vert(g) which can be reached from v by following a directed path).

4 In the following the attribute vectors are assumed to be numeric: symbolic labels from a finite alphabet can be represented by a coding function as numerical vectors; for the sake of presentation, symbols are retained in the graphical representation of structures to label the vertexes in a concise way.

I/O-isomorphic and supersource transductions: in an Input/Output (I/O) isomorphic transduction, the system emits an output for each vertex of the input graph, i.e. the system produces a structured output with a topology isomorphic to that of the original input graph. In a supersource transduction the response of the system is given in the form of a scalar value only at the supersource of the input graph.

We always assume stationarity of the transduction, i.e. the transfer function is fixed over the whole structure. In Figure 1 we show an example of a supersource transduction and an I/O-isomorphic transduction. These examples provide instances of contextual transductions that cannot be computed by purely causal and stationary transductions. We will show throughout the paper the effects that the causality assumption and its relaxation induce on the computational capabilities of the learning models and on the classes of data that can be treated.

Fig. 1. Example of supersource (a) and I/O-isomorphic (b) contextual transductions.

Notice that a causal I/O-isomorphic transduction T1 can be implemented via a suitable supersource transduction T2. In fact, for each vertex v in the input structure g, the output label generated by T1 at that vertex can be reproduced by a transduction T2 that takes as input the subgraph of g rooted in v (including all and only its descendants) and returns the output label generated by T1 for v. This implementation scheme, however, does not work for contextual I/O-isomorphic transductions, since the subgraph rooted in v does not constitute a complete context for v.

Causality is a natural property of e.g. transductions for time series, where future events depend on past events but not vice versa. For the processing of chemical data, the assumption of causality is usually not valid, since chemical structures do not possess a distinguished direction of causality of the constituents.

    3 Models

Here we present formal models that can be used to implement causal and contextual transductions. The main issue consists in the way in which the constituents of a recursive structure are processed within the context posed by the involved structural elements. Different kinds of dependencies, i.e. causalities of the constituents, can be accounted for.

    3.1 Causal Models

A (stationary) causal I/O-isomorphic transduction, and thus a supersource transduction, can be implemented by resorting to a recursive state representation, i.e. by defining a state (vector) x(v) ∈ X associated with each vertex v of a DPAG g and two functions f_s and f_o such that for v ∈ Vert(g)

x(v) = f_s(l(v), x(ch_1[v]), . . . , x(ch_K[v]))    (1)

y(v) = f_o(x(v))    (2)

where x(ch_1[v]), . . . , x(ch_K[v]) is the K(= out)-dimensional vector of states of the children of v. In the above definition, f_s is referred to as the state transition function and f_o is referred to as the output function. The base case of the recursion is defined by x(nil) = x_0. The output variable y(v) provides the value of T(g) for both the class of supersource and I/O-isomorphic transductions. The stationarity assumption is expressed by the invariance of f_s and f_o over the vertexes of the structure.

The above equations define a very general recursive processing system. For example, Mealy and Moore machines are finite-state machines that can be described as instances of this framework, where the input domain is limited to sequences and X to a finite set. Extension of the input domain to trees leads to the definition of Frontier-to-Root tree Automata (FRA) (or bottom-up tree automata).

Compact and graphical representations of the symbolic transformation of the state variables can be obtained by the shift operator. Specifically, given a state vector x(v) = [x_1(v), . . . , x_m(v)]^t, we define structural shift operators q^{-1}_j x_i(v) = x_i(ch_j[v]) and q^{+1}_j x_i(v) = x_i(pa_j[v]). If ch_j(v) = nil then q^{-1}_j x_i(v) = x_0, the null state. Similarly, if pa_j(v) = nil then q^{+1}_j x_i(v) = x_0. Moreover, we consider a vector extension q^e x_i(v) whose components are q^e_j x_i(v) as defined above, with e ∈ {-1, +1}. We also denote by q^e_j x the componentwise application of q^e_j to each x_i(v) for i = 1, . . . , m, and by q^e x the componentwise application of q^e to each x_i(v) for i = 1, . . . , m [26].

The recursive processing system framework provides an elegant approach for the realization of adaptive processing systems for learning in SD, whose main ingredients are:

an input domain of structured data, composed in our case by a set of labeled DPAGs with supersource;

a parametric transduction function T, i.e. in the recursive processing system used to compute T, the functions f_s and f_o are made dependent on tunable (free) parameters;

a learning algorithm that is used to estimate the parameters of T according to the given data and task.

A specific realization of such approaches can be found in the family of RNN models [34].

The original recursive approach is based on the causality and stationarity assumptions. Since RNNs implement an adaptive transduction, they provide a flexible tool to deal with hierarchical structured data. In fact, recursive models have been proved to be universal approximators for trees [16]. However, the causality assumption puts a constraint that affects the computational power for some classes of transductions and when considering the extension of the input domain to DPAGs. In particular, the context of the state variables computed by an RNN for a vertex is limited to the descendants of that vertex. For instance, for sequential data the model computes a value that only depends on the left part of the sequence. The concept of context introduced by Elman [8] for recurrent neural networks (applied to time series) is to be understood in the restrictive sense of past information. Of course this assumption might not be justified for spatial data. For structures the implications are even more complex, as we will see in the following of the paper.

    RNN Architectures

Recursive neural networks (RNNs) are based on a neural network realization of f_s and f_o, where the state space X is realized by R^m. For instance, in a fully connected Recursive Neural Network (RNN-fc) with one hidden layer of m hidden neurons, the output x(v) ∈ R^m of the hidden units for the current vertex v is computed as follows:

x(v) = f_s(l(v), q^{-1} x(v)) = σ(W l(v) + Σ_{j=1}^{K} W^j q^{-1}_j x(v)),    (3)

where σ_i(u) = σ(u_i) (sigmoidal function), l(v) ∈ R^n is the vertex label, W ∈ R^{m×n} is the free-parameter (weight) matrix associated with the label space and W^j ∈ R^{m×m} is the free-parameter (weight) matrix associated with the state information of the j-th child of v. Note that the bias vector is included in the weight matrix W. The stationarity assumption allows for the processing of variable-size structures by a model with a fixed number of parameters.
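As an illustration of Equations (1)-(3), the following minimal Python sketch (assumptions: NumPy, K = 2, untrained random weights, a linear readout standing in for f_o) evaluates the RNN-fc state transition bottom-up over a small labeled binary tree; all names are illustrative.

```python
# Minimal sketch (illustrative) of the RNN-fc state transition of Equation (3):
#   x(v) = sigmoid(W l(v) + sum_j W^j x(ch_j[v])),   x(nil) = x0,
# evaluated frontier-to-root over a binary tree with untrained random weights.
import numpy as np

K, n, m = 2, 3, 4                    # fan-out, label size, state size
rng = np.random.default_rng(0)
W = rng.normal(size=(m, n))          # label weight matrix (bias omitted here)
Wc = [rng.normal(size=(m, m)) for _ in range(K)]   # one matrix per child position
x0 = np.zeros(m)                     # base case for missing children

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

class Node:
    def __init__(self, label, children=()):
        self.label = np.asarray(label, float)
        self.children = list(children)

def f_s(v):
    """Recursive evaluation of the state x(v)."""
    xs = [f_s(c) for c in v.children] + [x0] * (K - len(v.children))
    return sigmoid(W @ v.label + sum(Wc[j] @ xs[j] for j in range(K)))

def f_o(x):
    """Output function: a fixed linear readout (hypothetical)."""
    return float(x.sum())

# supersource transduction on a small labeled binary tree
t = Node([1, 0, 0], [Node([0, 1, 0]), Node([0, 0, 1])])
print(f_o(f_s(t)))
```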

Concerning the output function f_o, it can be defined as a map f_o : R^m → R^z implemented, for example, by a standard feed-forward network.

Learning algorithms can be based on the typical gradient descent approach used for neural networks, adapted to recursive processing (Back-Propagation Through Structure and Real Time Recurrent Learning algorithms [34]).

The problem with this approach is the a priori definition of a proper network topology. Constructive methods make it possible to start with a minimal network configuration and to add units and connections progressively, allowing automatic adaptation of the architecture to the computational task. Constructive learning algorithms are particularly suited to deal with structured inputs, where the complexity of learning is so high that it is better to use an incremental approach.

Recursive Cascade Correlation (RCC) [34] is an extension of Cascade Correlation algorithms [10, 9] to deal with structured data. Specifically, in RCC, recursive hidden units are added incrementally to the network, so that their functional dependency can be described as follows:

x_1(v) = f_{s_1}( l(v), q^{-1} x_1(v) )
x_2(v) = f_{s_2}( l(v), q^{-1}[x_1(v), x_2(v)] )
...
x_m(v) = f_{s_m}( l(v), q^{-1}[x_1(v), x_2(v), . . . , x_m(v)] )    (4)

The training of a new hidden unit is based on the already frozen units. The number of hidden units realizing the f_s() function depends on the training process.

The computational power of RNNs has been studied in [33] by using hard threshold units and frontier-to-root tree automata. In [5], several strategies for encoding finite-state tree automata in high-order and first-order sigmoidal RNNs have been proposed. Complexity results on the amount of resources needed to implement frontier-to-root tree automata in RNNs are presented in [14]. Finally, results on function approximation and a theoretical analysis of learnability and generalization of RNNs (referred to as folding networks) can be found in [15, 16, 17].

    3.2 Contextual Models

Adaptive recursive processing systems can be extended to contextual transductions by a proper construction of contextual representations. The main representative of this class of models is the Contextual Recursive Cascade Correlation (CRCC), which appeared in [24] applied to the sequence domain, and was subsequently extended to structures in [25, 26]. A theoretical study of the universal approximation capability of the model appears in [18].

The aim of CRCC is to overcome the limitations of causal models.

In fact, due to the causality assumption, a standard RNN cannot represent contextual transductions in its hypothesis space. In particular, for an RNN the context for a given vertex in a DPAG is restricted to its descendant vertexes. In contrast, in a contextual transduction the context is extended to all the vertexes represented in the structured data.

As for RCC, CRCC is a constructive algorithm. Thus, when training hidden unit i, the state variables x_1, . . . , x_{i-1} for all the vertexes of all the DPAGs in the training set are already available, and can be used in the definition of x_i. Consequently, the equations of causal RNNs, and of RCC in particular (Equation 4), can be expanded in a contextual fashion by using, where possible, the variables q^{+1} x_i(v). The equations for the state transition function of a CRCC model are defined as:

x_1(v) = f_{s_1}( l(v), q^{-1} x_1(v) )
x_2(v) = f_{s_2}( l(v), q^{-1}[x_1(v), x_2(v)], q^{+1}[x_1(v)] )
...
x_m(v) = f_{s_m}( l(v), q^{-1}[x_1(v), x_2(v), . . . , x_m(v)], q^{+1}[x_1(v), x_2(v), . . . , x_{m-1}(v)] )    (5)

Hence, in the CRCC model, the frozen state values of both the children and parents of each vertex can be used as input for the new units without introducing cycles in the dynamics of the state computing system.

Following the line developed in Section 3.1, it is easy to define neural network models that realize the CRCC approach. In fact, it is possible to realize each f_{s_j} of Equation 5 by a neural unit, associating each input argument of f_{s_j} with free parameters W of the neural unit. In particular, given a vertex v, in the CRCC model each neural unit has:

connections associated with the input label of v, and connections associated with the state variables (both the output of already frozen hidden units and the current j-th hidden unit) of the children of the current vertex v (as for the RNN/RCC model presented in Section 3.1);

connections associated with the state variables of the parents of v which are already frozen (i.e., the output of already frozen hidden units computed for the parents of v). The connections associated with parents add new parameters to the model that are fundamental for contextual processing.
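The following minimal Python sketch (illustrative, not the authors' implementation) shows one way to organize the CRCC computation of Equation 5: hidden units are evaluated layer by layer, each layer reading the children's values of layers 1..i (with vertexes processed children-first) and the parents' values of the already frozen layers 1..i-1; the function f_i stands in for a hypothetical trained unit.

```python
# Minimal sketch (illustrative) of the CRCC state computation of Equation 5.
# Hidden units are computed layer by layer: layer i uses the children's values
# of layers 1..i (vertexes processed children-first) and the parents' values of
# the already frozen layers 1..i-1.  f_i is a placeholder for a trained unit.
def crcc_states(vertices, labels, children, parents, num_units, f_i, x0=0.0):
    """vertices: ids in topological order (children before parents);
    labels/children/parents: dicts from vertex id to label / lists of ids (None = nil);
    f_i(i, label, child_states, parent_states) -> float.
    Returns a dict vertex -> [x_1(v), ..., x_num_units(v)]."""
    x = {v: [] for v in vertices}
    for i in range(1, num_units + 1):
        for v in vertices:
            ch = [x[c][:i] if c is not None else [x0] * i for c in children[v]]
            pa = [x[p][:i - 1] if p is not None else [x0] * (i - 1) for p in parents[v]]
            x[v].append(f_i(i, labels[v], ch, pa))
    return x
```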

The major aspect of this computation is that increasing context is taken into account for large m. In fact, the hidden unit of each layer directly takes a local context (parents and children) of each vertex as input and progressively, by composition of the context developed in the previous steps, extends the context to other vertexes, up to the vertexes of the whole structure. The context window follows a nested development; as an example, let us consider a vertex v′ = pa[v] and the presence of 3 hidden units. Following the definition given in eq. 5, x_2(v) depends on x_1(pa[v]) and x_3(v) depends on x_2(pa[v]) = x_2(v′). Thus, we obtain that x_3(v) also depends on x_1(pa[v′]), which means that x_3(v) includes information from pa[v′] = pa[pa[v]] in its context window.

In particular, adding new hidden units to the CRCC network leads to an increase of the context window associated with each vertex v. Hence, the size of the context window can grow during model training and we do not need to fix it prior to learning. Given a sufficient number of hidden units, the information included in x_i grows to cover all the context of the structured data according to the structure topology. A formal analysis of the context development of the CRCC model is presented in [26], showing the extension of the computational power allowed by CRCC to tasks that cannot be computed by causal models. In particular, such results include contextual I/O-isomorphic transductions and supersource transductions on DPAGs.

Besides these results, there are other interesting cases that deserve to be discussed for supersource transductions that both causal and contextual models can compute. In fact, relaxing the causality assumption can be useful also for supersource transductions whenever the meaning of a substructure depends on the context in which it is found. In such a way it is possible to consider in which position within a larger structure the given substructure occurs. In order to show the relevance of contextual processing for this issue, let us consider an example of substructure encoding, as shown in Figure 2. In this figure, the internal encoding of some fragments (subgraphs) is

reported as represented in a 2-dimensional feature space F. Each point in the F space represents the numerical code developed by the CRCC model for each fragment in the input structured domain. This can be practically obtained by

Fig. 2. Different representations of the g(a) structural fragment in the projection of the X space (denoted by F) by a contextual model.


v_1 ≡_M v_2 ⟺ term(x(v_1)) = term(x(v_2))

This relation is very useful to characterize the class of input structures that can be discriminated by the given model. In fact, if two different vertexes are in relation, this means that they cannot be discriminated, since they are mapped onto the same representation regardless of the specific implementation of f_s. We can thus exploit the collision relation to define equivalence classes over vertexes.

Definition 2 (Collision Equivalence Classes). Given a model M, with Vert_{≡M} we denote the partition of the set of vertexes into equivalence classes according to the corresponding collision relation ≡_M.

Definition 3 (Transduction Support). Given a model M and a transduction T defined over a set of structures S, with set of vertexes Vert(S), we say that T is supported by M if each equivalence class belonging to Vert(S)_{≡M} is constituted by vertexes for which T returns the same value (or no value).

Definition 4 (Model Completeness). A model M is said to be complete with respect to a family of transductions 𝒯 defined over a set of structures S if every T ∈ 𝒯 is supported by M.

Theorem 1 (RNN/RCC Completeness for Supersource Transductions on Trees). RNN/RCC supports all the supersource transductions defined on the set of trees (and sequences).

Proof. First of all, note that RNN/RCC is stationary, so it always returns the same value if the same subtree is presented. Moreover, it is causal in such a way that the term generated for a vertex v, together with the label attached to it, contains a subterm for each descendant of v. Descendants which correspond to different subtrees will get a different term. A model is incomplete if there exist two trees t1 and t2, t1 ≠ t2, and a supersource transduction T for which T(t1) ≠ T(t2), while term(supersource(t1)) = term(supersource(t2)). Let us assume this happens. Since t1 ≠ t2 there exists at least one path from the root of the two trees which ends in a vertex v with associated labels that are different. But this means that also term(supersource(t1)) and term(supersource(t2)) will differ in the position corresponding to v, contradicting the hypothesis that term(supersource(t1)) = term(supersource(t2)).

Theorem 2 (RNN/RCC Incompleteness for I/O-isomorphic Transductions on Trees). RNN/RCC cannot support all the I/O-isomorphic transductions defined on the set of trees.

Proof. This can be proved quite easily by observing that any I/O-isomorphic transduction which assigns a different value to different occurrences of the same subtree within a given tree (see, for example, Figure 1(b)) cannot be supported. This is due to the fact that the term representations for the roots of the two identical subtrees will belong to the same equivalence class, while the transduction assigns different values to these vertexes.


Corollary 1 (Sequences). RNN/RCC can support only I/O-isomorphic transductions defined on prefixes of sequences.

Proof. As for subtrees, the roots of all the identical prefixes of a set of sequences belong to the same equivalence class. I/O-isomorphic transductions that depend on subsequences not included in the prefix can assign different values to such prefixes.

Theorem 3 (Characterization of Causal Model Collision Classes on DPAGs). Let us consider any causal and stationary model M and DPAGs defined on the set of vertexes S. Let t denote any tree defined on S. Let visit(g) be the tree obtained by a complete recursive visit of the DPAG g starting from its supersource. Let us consider the DPAG g with the smallest number of vertexes for which visit(g) = t. Then, each vertex v ∈ g will belong to a different equivalence class of S_{≡M}. Moreover, the set of equivalence classes to which the vertexes of g belong is exactly the same set of equivalence classes to which the vertexes of any graph g′ such that visit(g′) = t belong.

Proof. Basic idea: for causal models it is possible to rewrite a DPAG into an associated, equivalent tree, where shared vertexes are replicated according to all the possible positions that they can have in the DPAG (unsharing process, see Figure 3). Since this construction is possible for any DPAG, the class of DPAGs is absorbed into the class of trees for causal models.

Fig. 3. Example of the unsharing process for a DPAG: causal models allow a DPAG to be rewritten as an equivalent tree.
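A minimal sketch of the unsharing construction, under the assumption that the DPAG is given by per-vertex child lists; the example graph is only in the spirit of Figure 3.

```python
# Minimal sketch (illustrative) of the unsharing construction visit(g): a
# complete recursive visit from the supersource replicates every shared vertex,
# yielding the tree a causal model effectively processes.
def unshare(label, children, v):
    """label: dict vertex -> symbol; children: dict vertex -> list of vertexes.
    Returns the tree rooted at v as a nested tuple (symbol, subtree_1, ...)."""
    return (label[v],) + tuple(unshare(label, children, c) for c in children[v])

# a small DPAG with a shared vertex b (in the spirit of Figure 3)
label = {"a": "a", "f": "f", "g": "g", "b": "b"}
children = {"a": ["f", "g"], "f": ["b"], "g": ["b"], "b": []}
print(unshare(label, children, "a"))
# ('a', ('f', ('b',)), ('g', ('b',)))  -- the shared vertex b is replicated
```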

Corollary 2 (RNN/RCC Incompleteness for Transductions on DPAGs). RNN/RCC cannot support all the supersource transductions defined on DPAGs. The same is true for I/O-isomorphic transductions.

Interestingly, we can also derive the following result.

Corollary 3 (Classes of structures). The set of transductions on DPAGs supported by causal and stationary models (e.g. RNN/RCC) can equivalently be defined on the class of trees.


Proof. Since the unsharing construction is possible for any DPAG, for each vertex of a DPAG g it is possible to find in its equivalence class a vertex belonging to a tree t obtained by applying the unsharing construction to g, i.e. visit(g) = t. Thus, any transduction defined on a DPAG g can be computed with equal results on the equivalent tree t.

The above result stresses the fact that causal and stationary models look at input structures as if they were trees, i.e. for causal and stationary models the class of DPAGs is absorbed into the class of trees.

From a computational point of view the unsharing construction helps in understanding that it is better to share all the vertexes that can be shared. In fact, in this way there are fewer vertexes to process while the computational power of the model stays the same. This observation is at the basis of the optimization of the training set suggested in [34].

The problem with causal (and stationary) models is that they only exploit information concerning a vertex and its descendants. In this way it is not possible for these models to discriminate between occurrences of the same substructure in different contexts, i.e. different sets of parents for the supersource of the substructure.

The CRCC model tries to embed this idea in its state transition function (eq. 5), where information about the context of a vertex v is gathered through a factorization of the state space into a sequence of state variables that progressively increase the visibility over the ancestors of v.

In [26] the concept of context window has been introduced to study the computational properties of CRCC. More precisely, the context window of a state variable for a vertex is defined as the set of all the state variables (defined over all possible vertexes) that (directly or indirectly) contribute to its determination. Here we use a more precise definition of the context, since the term associated with a vertex state vector not only involves the terms generated by all the relevant state variables mentioned above, but also relates these terms to each other through the state transition function(s).

Theorem 4 (CRCC More Powerful than Causal Models). The set of supersource transductions defined on DPAGs supported by CRCC is strictly larger than the set of transductions supported by causal (and stationary) models (e.g., RCC/RNN). The same is true for I/O-isomorphic transductions.

Proof. Let us consider a tree t which does not contain two or more vertexes with the same set of children, the same set of siblings and the same parent. Then, for causal models the supersources of t and of any DPAG g such that visit(g) = t will belong to the same equivalence class. This will not be the case for CRCC. In fact, for any such DPAG g1, given any other such DPAG g2 ≠ g1, there will exist at least one vertex v1 belonging to g1 for which either the set of children or the set of siblings or the set of parents will be different from those of any vertex v2 belonging to g2. This means that, when considering CRCC's eq. 5 for vertexes v1 and v2, the corresponding terms will be different, as will the terms corresponding to the supersources. Thus all the supersources of the involved DPAGs will belong to different equivalence classes. Moreover, vertexes that are in different equivalence classes for causal models will remain in different equivalence classes for CRCC. Thus there will be additional supersource and I/O-isomorphic transductions that CRCC can support with respect to causal models.

Fig. 4. Example of two graphs which are mapped to identical values with CRCC networks for a non-compatible enumeration of parents. Thereby, the enumeration of parents and children is indicated by boxes at the vertexes, the left box being number one, the right box being number two.

Unfortunately, it has been demonstrated in [18] that CRCC is not able to support all the supersource transductions on DPAGs. This is due to the fact that there exist highly symmetric DPAGs for which the term generated by CRCC is the same (see Figure 4).

    Thus, we have

Theorem 5 (Incompleteness of CRCC for transductions defined on DPAGs). CRCC cannot support all the supersource transductions defined on DPAGs. The same is true for I/O-isomorphic transductions.

There is, however, a restricted class of DPAGs, called DBAGs, where completeness can be achieved.

Definition 5 (DBAGs). Assume D is a DPAG. We say that the ordering of the parents and children is compatible if for each edge (v, u) of D the equality

P_u((v, u)) = S_v((v, u))

holds. This property states that if vertex v is enumerated as parent number i of u then u is enumerated as child number i of v, and vice versa. We recall that this subset of DPAGs with compatible positional ordering is called rooted directed bipositional acyclic graphs (DBAGs).

First, causal models retain their incompleteness limitations on DBAGs.


Theorem 6 (RNN/RCC Incompleteness for Transductions Defined on DBAGs). RNN/RCC cannot support all the supersource transductions defined on DBAGs. The same is true for I/O-isomorphic transductions.

Proof. Follows directly from Theorem 3 applied to the subclass of DBAGs.

On the contrary, CRCC is complete for supersource and I/O-isomorphic transductions defined on DBAGs.

Theorem 7 (CRCC Completeness for Supersource Transductions on DBAGs). The set of supersource and I/O-isomorphic transductions defined on DBAGs is supported by CRCC.

Proof. The property that the ordering of parents and children is compatible has the consequence that a unique representation of DBAGs is given by an enumeration of all paths in the DBAG starting from the supersource. While this fact is simple for tree structures, it is not trivial for graph structures which include nodes with more than one parent. For simplicity, we restrict the argumentation to the case of at most two children and parents, the generalization to more children and parents being straightforward. Now the proof proceeds in several steps.

Step 1: Assume v is a vertex of the DBAG. Consider a path from the supersource to the vertex v. Assume this path visits the vertices v_0, v_1, . . . , v_n = v, v_0 being the supersource. Then a linear path representation of the path consists of the string s = s_1 . . . s_n ∈ {0, 1}^n with s_i = 0 ⟺ v_i is the left child of v_{i-1}. Note that the absolute position of a vertex in the DBAG can be recovered from any given linear path representation. The dynamics of a CRCC as given by equation 5 allows all paths leading to v starting from the supersource to be recovered, for a large enough number of context units. More precisely, assume the maximum length of a path leading to v is h. Then all linear path representations to v can be recovered from the functional representation given by x_{h+2}(v) as defined in equation 5.

This can easily be seen by induction: for the supersource, the only linear path representation is empty. This can be recovered from the fact that q^{+1}[x_1(v)] corresponds to an initial context for the root. For some vertex v with maximum length h of a path leading to v, we can recover the linear path representations of the parents from q^{+1}[x_1(v), . . . , x_{h+1}(v)] by induction. Further, it is clear from this term whether v is the left or right child of the considered parent, because of the compatibility of the ordering. Thus we obtain the linear path representations of all paths from the supersource to v by extending the linear path representations to the parents of v by 0 or 1, respectively.

Step 2: Assume h is the maximum length of a path in a given DBAG. Assume v is a vertex. Then one can recover the following set from x_{2h+2}(v): {(l(v′), B(v′)) | v′ is a vertex in the DBAG, B(v′) is the lexicographically ordered set of all linear path representations of paths from the supersource to v′}.


This fact is obvious for an empty DBAG. Otherwise, the term x_{h′}(v_0), v_0 being the supersource and h′ ≥ h + 2, can easily be recovered from x_{2h+2}(v). Because of Step 1, we can extract (l(v_0), B(v_0)) from this term, and we obtain the representations x_{h′}(v′) of all children v′ of v_0. Thus, we can continue this way, adding the terms (l(v′), B(v′)) for the children. Note that a vertex might be visited more than once, which can easily be detected due to the lexicographic ordering of B(v).

Step 3: For any given DBAG, the set {(l(v), B(v)) | v is a vertex in the DBAG, B(v) is the lexicographically ordered set of all linear path representations of paths from the supersource to v} is unique. This can be seen as follows: we can introduce a unique identifier for every vertex v in the DBAG, and assign it its label l(v). The supersource is uniquely characterized because the only path to the supersource is the empty one. Starting from the supersource we can add all connections (together with a positioning) to the DBAG in the following way: we consider the shortest linear path representation for which not all connections are yet contained in the constructed graph. Assume the final vertex of this path is v. Obviously, all connections of this path but the last one are already contained in the graph, since we considered the shortest path. Thus, we can uniquely locate the parent of v in the graph and insert the edge together with the positioning.

Thus, the functional dependence as defined by CRCC is unique for every DBAG and every vertex v contained in the DBAG if the number of context units is large enough. Therefore, the set of supersource and I/O-isomorphic transductions defined on DBAGs is supported by CRCC.
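The following minimal Python sketch (illustrative) computes, for each vertex of a fan-out-2 DBAG, the set B(v) of linear path representations used in Steps 1-3; vertex identifiers and the dictionary-based encoding are assumptions of the sketch.

```python
# Minimal sketch (illustrative): enumerate, for each vertex of a fan-out-2 DBAG,
# the lexicographically ordered set B(v) of linear path representations
# (strings over {0, 1}, 0 = left child) of all paths from the supersource.
def path_representations(children, supersource):
    """children: dict vertex -> (left child or None, right child or None)."""
    B = {v: [] for v in children}
    def visit(v, s):
        B[v].append(s)
        for bit, c in enumerate(children[v]):
            if c is not None:
                visit(c, s + str(bit))
    visit(supersource, "")
    return {v: sorted(paths) for v, paths in B.items()}

# vertex 0 is the supersource; vertex 3 is reachable along two different paths
children = {0: (1, 2), 1: (None, 3), 2: (3, None), 3: (None, None)}
print(path_representations(children, 0))
# {0: [''], 1: ['0'], 2: ['1'], 3: ['01', '10']}
```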

Note that these arguments also hold if the functional form of CRCC is restricted to the function

x_m(v) = f_{s_m}( l(v), q^{-1}[x_m(v)], q^{+1}[x_{m-1}(v)] )

since all relevant information used in the proof is contained in this representation.

It should be noticed that the condition imposed by DBAGs is only a sufficient condition to assure CRCC completeness for transductions defined on the class of DPAGs, which includes DBAGs. In fact, as shown in [18], the limitation occurs only for highly symmetric cases.

Thus, the overall picture shown in Fig. 5 results if the functional form of the definitions is considered.

    5 Approximation capability

An important property of a function class used for learning functions based on examples is its universal approximation ability. This refers to the fact that, given an unknown regularity, there exists a function in the class which can represent this regularity up to a small distance ε. Obviously, the universal approximation capability is a condition for a function class to be successful as a learning tool in general situations. There exist various possibilities to make this term precise. We will be interested in the following two settings:

Fig. 5. Classes of functions supported by RNN, RCC, and CRCC because of the functional form.

Definition 6 (Approximation completeness). Assume X is an input set. Assume Y is an output set. Assume l : Y × Y → R is a loss function. Assume F and G are classes of functions from X to Y. F is approximation complete for G in the maximum norm if for every g ∈ G and ε > 0 there exists a function f ∈ F such that l(f(x), g(x)) ≤ ε for all x ∈ X.

Assume X is measurable with probability measure P. Assume all compositions of functions in F and G with l are (Borel-)measurable. F is approximation complete for G in probability if for every g ∈ G and ε > 0 there exists a function f ∈ F such that P(x | l(f(x), g(x)) > ε) ≤ ε.

We are interested in the cases where X consists of sequences, trees, DPAGs, or DBAGs, and Y is either a real-vector space for supersource transductions or the same structural space for I/O-isomorphic transductions. We equip X with the σ-algebra induced by the sum of structures of a fixed form, whereby structures of a fixed form are identified with a real-vector space by concatenating all labels. We choose l as the Euclidean distance if Y is a real-vector space, and l(s, s′) = max{|l(v) − l(v′)| : v and v′ are vertices of s and s′, respectively, in the same position} for structures s and s′ of the same form. Since we only consider I/O-isomorphic transductions, it is not necessary to define a loss function for structures of a different form.

So far, the functional dependence of the various models which process structured data has been considered, neglecting the form of the transition function f_s. Obviously, the fact that the functional form supports a certain class of functions is a necessary condition for the universal approximation ability of the function class, but it is not sufficient. Due to the limitation of the respective functional forms, we can immediately infer the following results:

Theorem 8 (Functional incompleteness). RNN/RCC is not approximation complete in the maximum norm or in probability for I/O-isomorphic transductions on trees.

RNN/RCC is not approximation complete in the maximum norm or in probability for I/O-isomorphic or supersource transductions on DPAGs or DBAGs.

CRCC is not approximation complete in the maximum norm or in probability for I/O-isomorphic or supersource transductions on DPAGs.

For all other cases, the functional form as defined in the models above provides sufficient information to distinguish all different input structures.

However, it is not clear whether this information can be stored in the finite dimensional context space provided by neural networks, and whether these stored values can be mapped to any given desired outputs up to ε.

First, we consider approximation in the maximum norm. This requires that input structures up to arbitrary depth can be stored in the context neurons and mapped to desired values. This situation includes the question of the computational capability of recurrent neural networks, i.e. the mapping of binary input sequences to desired binary output values. The capacity of RNNs with respect to classical mechanisms has been exactly characterized in the work of Siegelmann and coworkers [21, 30, 31, 32]: RNNs with semi-linear activation function are equivalent to so-called non-uniform Boolean circuits. Although this capacity subsumes Turing capability, it does not include all functions, hence RNNs and their alternatives cannot approximate every (arbitrary) function in the maximum norm.

The publication [16] provides a general direct diagonalization argument which can immediately be transferred to our setting and which shows the principled limitation of these systems:

Theorem 9 (Approximation incompleteness for general functions in max-norm). Assume the transition function f_s of an RNN/RCC or CRCC network is given as the composition of a finite number of fixed continuous nonlinear functions and affine transformations. Consider input structures with label 1, i.e. only structural differences are important. Then, there exist functions in the class of supersource transductions from sequences, trees, and DBAGs which cannot be approximated by an RNN/RCC or CRCC network in the maximum norm.


Proof. For any number i there exists at most a finite number of neural networks with at most i neurons and the specified structure. For any fixed architecture f with i neurons and a fixed input structure s_i with i vertices, the set {y | y is the output of a network with architecture f and input structure s_i with weights absolutely bounded by i} is compact. Therefore we can find a value y_i which is more than ε away from the union of these sets over all architectures with i neurons. Obviously, the mapping s_i → y_i cannot be approximated in the maximum norm by any network with a finite number of neurons and finite weights.

For restricted function classes, approximation results can be found: it has been shown e.g. in [29] that RNNs with sequential input can approximate finite state automata up to arbitrary depth. Similarly, it has been shown in [12] that RNNs with tree structured inputs can approximate tree automata up to arbitrary depth. Thus, the following holds:

Theorem 10 (Completeness of RNN for finite automata in max-norm). The class of RNN networks (with sigmoidal activation function or similar) is approximation complete in the maximum norm for the class of supersource transductions computed by finite state automata or tree automata, respectively.

Note that RCC restricts the recurrence of RNNs, since recurrent connections cannot go back to already frozen units due to the specific training mode of CC. The effects of this limitation of recurrence on the capacity of the models have been discussed in various approaches including [11, 13]. In particular, [13] shows the following:

Theorem 11 (Incompleteness of RCC for finite automata in max-norm). The class of RCC networks cannot approximate every finite state automaton in the maximum norm.

Thus, considering approximation in the maximum norm, RCCs are strictly weaker than RNNs on sequences, and the same transfers to tree structured inputs, DPAGs, and also CRCC, as depicted in Figure 6.

In practice, however, approximation of functions up to some reasonable depth is usually sufficient. Thus, approximation completeness in probability is the more interesting notion in practice. For such investigations, a number of properties of the transition functions are relevant:

We assume that the function f_s is implemented by a neuron which constitutes a composition of an affine transformation and the sigmoidal function σ, as shown in Equation 3.

σ is the logistic function, which has the following properties: it is a squashing function, i.e. lim_{x→∞} σ(x) = 1 and lim_{x→−∞} σ(x) = 0. σ is continuously differentiable at 0 with nonvanishing derivative σ′(0); therefore, one can uniformly approximate the identity by x ≈ (σ(ε·x) − σ(0))/(ε·σ′(0)) for ε → 0 (a numerical check is sketched after this list).


Fig. 6. Approximation capability of RCC and RNN networks when considering approximation in the maximum norm.

We only consider continuous supersource transductions from structures to [0, 1], the codomain of σ, and I/O-isomorphic transductions where labels are restricted to [0, 1], such that the restriction of the function to one label is continuous.
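A quick numerical check (illustrative) of the identity approximation mentioned in the list above, using the known values σ(0) = 1/2 and σ′(0) = 1/4 of the logistic function:

```python
# Quick check (illustrative) of the identity approximation
# x ~ (sigma(eps * x) - sigma(0)) / (eps * sigma'(0)) for the logistic function,
# using sigma(0) = 0.5 and sigma'(0) = 0.25.
import math

def sigma(u):
    return 1.0 / (1.0 + math.exp(-u))

eps = 1e-3
for x in (-2.0, 0.5, 3.0):
    print(x, (sigma(eps * x) - 0.5) / (eps * 0.25))   # close to x for small eps
```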

The following result has been shown in [16], which extends well known approximation results of recurrent networks for sequences, such as [20], to tree structures:

Theorem 12 (Completeness of RNNs in probability). Consider RNNs with the logistic activation function σ. Then every continuous supersource transduction from tree structures can be approximated arbitrarily well in probability by an RNN network.

Obviously, at most causal I/O-isomorphic transductions can be approximated by RNNs. Note that approximation of a causal transduction reduces to the approximation of all substructures of a given structure; thus, it follows immediately that Theorem 12 also holds for causal I/O-isomorphic transductions.

We now consider RCC and CRCC models. A general technique to prove approximation completeness in probability has been provided in [18]:

1. Design a unique linear code in a real-vector space for the considered input structures, assuming a finite label alphabet. This needs to be done for every finite but fixed maximum size.

2. Show that this code can be computed by the recursive transition function provided by the model.

3. Add a universal neural network in the sense of [19] as readout; this can be integrated in the specific structure of CC.

4. Transfer the result to arbitrary labels by integrating a sufficiently fine discretization of the labels.

5. Approximate specific functions of the construction, such as the identity or multiplication, by the considered activation function σ.


Steps 3-5 are generic. They require some general mathematical argumentation which can be transferred immediately to our situation, together with the property that σ is squashing and continuously differentiable (such that the identity can be approximated). Therefore, we do not go into details concerning these parts but refer to [18] instead.

Step 1 has already been considered in the previous section: we have investigated whether the functional form of the transition function, i.e. the sequence term(x(v)), v being the supersource resp. root vertex, allows all inputs to be differentiated. If so, it is sufficient to construct a representation of the functional form within a real-vector space and to show that this representation can be computed by the respective recursive mechanism. We briefly explain the main ideas for RCC and CRCC. Technical details can be found in [18].

Theorem 13 (Completeness of RCC for sequences in probability). Every continuous supersource transduction from sequences to a real-vector space can be approximated arbitrarily well by an RCC network in probability.

Proof. We only consider steps 1 and 2: Assume a finite label alphabet L = {l_1, . . . , l_n} is given. Assume the symbols are encoded as natural numbers with at most d digits. Adding leading 0 entries, if necessary, we can assume that every number has exactly d digits. Given a sequence (l_{i_1}, . . . , l_{i_t}), the real number 0.l_{i_t} . . . l_{i_1} constitutes a unique encoding. This encoding can be computed by a single RCC unit because it holds that

0.1^d · l_{i_t} + 0.1^d · 0.l_{i_{t-1}} . . . l_{i_1} = 0.l_{i_t} l_{i_{t-1}} . . . l_{i_1}

This is a simple linear expression depending on the input l_{i_t} at time step t and the context 0.l_{i_{t-1}} . . . l_{i_1}, which can be computed by a single RCC unit with linear activation function resp. approximated arbitrarily well by a single neuron with activation function σ.
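The encoding used in this proof can be traced with a few lines of Python (illustrative; d and the label values are arbitrary choices):

```python
# Minimal sketch (illustrative) of the sequence encoding in the proof: labels are
# d-digit integers and the state after reading l_{i_1}, ..., l_{i_t} is the real
# number 0.l_{i_t} ... l_{i_1}, obtained by the linear update of a single unit.
d = 2                                # digits per label
labels = [12, 7, 34]                 # the sequence l_{i_1}, l_{i_2}, l_{i_3} (7 = "07")

x = 0.0
for lab in labels:
    x = 0.1 ** d * lab + 0.1 ** d * x     # one RCC unit with linear activation
print(x)                             # ~0.340712, i.e. 0."34" "07" "12"
```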

As beforehand, this result transfers immediately to continuous causal I/O-isomorphic transductions, since this relates to the approximation of prefixes.

For tree structures, we consider more general multiplicative neurons which also include a multiplication of input values.

Definition 7 (Multiplicative neuron). A multiplicative neuron of degree d with n inputs computes a function of the form

x ↦ Σ_{i=1}^{d} Σ_{j_1,...,j_i ∈ {1,...,n}} w_{j_1...j_i} · x_{j_1} · . . . · x_{j_i} + θ

(including an explicit bias θ).
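A minimal Python sketch (illustrative) of such a neuron with an identity activation, where the weights are supplied as a dictionary keyed by index tuples:

```python
# Minimal sketch (illustrative) of a multiplicative neuron of degree `degree`
# with identity activation: a weighted sum of products of up to `degree` inputs
# plus a bias theta; weights are given as a dict keyed by index tuples.
from itertools import product
from math import prod

def multiplicative_neuron(x, weights, theta=0.0, degree=2):
    total = theta
    for i in range(1, degree + 1):
        for js in product(range(len(x)), repeat=i):
            total += weights.get(js, 0.0) * prod(x[j] for j in js)
    return total

# degree-2 neuron computing 0.5*x0 + x0*x1 on inputs (3, 4): 1.5 + 12 = 13.5
print(multiplicative_neuron([3.0, 4.0], {(0,): 0.5, (0, 1): 1.0}))
```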

It is possible to encode trees and more general graphs using recursive multiplicative neurons, as shown in the following.


Theorem 14 (Completeness of RCC with multiplicative neurons for trees in probability). Every continuous supersource transduction from tree structures to a real-vector space can be approximated arbitrarily well by an RCC network with multiplicative neurons in probability.

Proof. As beforehand, we only consider steps 1 and 2, the representation of tree structures and its computation in an RCC network. As beforehand, we restrict to two children per vertex, the generalization to general fan-out being straightforward. Assume a finite label alphabet L = {l_1, . . . , l_n} is given. Assume these labels are encoded as numbers with exactly d digits each. Assume l_f and l_⊥ are numbers with d digits not contained in L. Then a tree t can uniquely be represented by r(t) = l_⊥ if t is empty, and r(t) = l_f l(v) r(t_1) r(t_2) (concatenation of these digit strings) otherwise, where l(v) is the root label and t_1 and t_2 are the two subtrees. Note that this number represents the functional sequence term(x(v)), v being the root vertex, which arises from a recursive function and uniquely characterizes a tree structure.

It is possible to compute the real numbers 0.r(t) and (0.1)^{|r(t)|}, |r(t)| being the length of the digit string r(t), by two multiplicative neurons of an RCC network as follows:

(0.1)^{|r(t)|} = 0.1^{2d} · (0.1)^{|r(t_1)|} · (0.1)^{|r(t_2)|}

for the neuron computing the length (included in the RCC network first) and

0.r(t) = 0.1^d · l_f + 0.1^{2d} · l(v) + 0.1^{2d} · 0.r(t_1) + 0.1^{2d} · (0.1)^{|r(t_1)|} · 0.r(t_2)

for the neuron computing the representation. Note that the second neuron depends on the output of the first, but not vice versa, so that the restricted recurrence of RCC is accounted for. The neurons have a linear activation function which can be approximated arbitrarily well by σ.
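The two recurrences can be traced with the following minimal Python sketch (illustrative; the codes chosen for l_f and l_⊥ are arbitrary d-digit numbers):

```python
# Minimal sketch (illustrative) of the tree encoding of Theorem 14: labels are
# d-digit integers, L_F and L_NIL are two extra d-digit codes, and the pair
# (0.r(t), (0.1)**|r(t)|) is computed bottom-up by the two recurrences above.
d = 2
L_F, L_NIL = 99, 88                  # arbitrary codes for "f" and the empty tree

def encode(t):
    """t is None or a tuple (label, left subtree, right subtree)."""
    if t is None:
        return 0.1 ** d * L_NIL, 0.1 ** d        # r(t) = l_nil, |r(t)| = d
    lab, t1, t2 = t
    r1, s1 = encode(t1)
    r2, s2 = encode(t2)
    s = 0.1 ** (2 * d) * s1 * s2                 # (0.1)**|r(t)|
    r = (0.1 ** d * L_F + 0.1 ** (2 * d) * lab   # 0.r(t)
         + 0.1 ** (2 * d) * r1
         + 0.1 ** (2 * d) * s1 * r2)
    return r, s

# tree with root label 12, left child labelled 34, empty right child
print(encode((12, (34, None, None), None)))      # (~0.99129934888888, 1e-14)
```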

As beforehand, this result can immediately be transferred to continuous causal I/O-isomorphic transductions, since these are equivalent to supersource transductions on subtrees.

Contextual processing adds the possibility to include information about the parents of vertices. This extends the class of functions which can be computed by such networks. We obtain the following completeness result:

Theorem 15 (Completeness of CRCC with multiplicative neurons for DBAGs in probability). Every continuous I/O-isomorphic or supersource transduction on DBAGs can be approximated by a CRCC network with multiplicative neurons in probability.

    Proof. We restrict to fan-in and fan-out 2, the generalization to larger fan-in and fan-out being straightforward. Note that, according to Theorem 7, the functional form given by CRCC allows a unique representation of DBAGs, provided the number of hidden neurons is large enough (depending on the maximum length of the paths of the BAG). Thus, it is sufficient to show that a unique numeric representation of the functional form of CRCC can be computed by a CRCC network. Thereby, as remarked in the proof of Theorem 7, we can restrict the functional form to

    x_m(v) = f_{s_m}( l(v), q^{-1}[x_m(v)], q^{+1}[x_{m-1}(v)] )

    Assume the label alphabet is given by L = {l_1, …, l_n}, represented by numbers with exactly d digits. Assume l_{(}, l_{)}, l_{,}, l_f, and l_⊥ are numbers with exactly d digits not contained in L. Denote an empty child/parent by ⊥. Define a representation r_i(v) of a vertex v of a DBAG with children ch_1(v) and ch_2(v) and parents pa_1(v) and pa_2(v) recursively over i and over the structure in the following way:

    r_1(v) = f(l(v), r_1(ch_1(v)), r_1(ch_2(v))),

    r_i(v) = f(l(v), r_i(ch_1(v)), r_i(ch_2(v)), r_{i-1}(pa_1(v)), r_{i-1}(pa_2(v)))

    for i > 1. Note that r_i(v) coincides with the functional form term(x(v)), x being the activation of a CRCC network with i hidden neurons and reduced dependence as specified in Theorem 7. Since this mirrors the functional dependence of CRCC, it yields unique values for large enough i. Denote by R_i(v) the sequence of digits which is obtained by substituting the symbol f by l_f, ( by l_{(}, the comma by l_{,}, ) by l_{)}, and ⊥ by l_⊥. Then we obtain unique numbers for every structure and every vertex in the structure for large enough i. This proves step 1.
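    The effect of the parent terms in r_i can be illustrated by a small sketch (Python; the dictionary-based graph encoding and the vertex names are assumptions of the example): with i = 1 the representation is purely causal and collides for identical substructures placed in different contexts, while for i ≥ 2 the parents enter and separate them.

```python
# Sketch of the contextual representation r_i(v) (symbolic, mirroring the
# functional form). Vertices are dictionaries; names and labels are illustrative.

def r(v, i):
    """Nested-tuple version of r_i(v); 'empty' marks a missing child/parent."""
    if v is None:
        return 'empty'
    children = (r(v.get('ch1'), i), r(v.get('ch2'), i))
    if i == 1:
        return ('f', v['label']) + children            # purely causal part
    parents = (r(v.get('pa1'), i - 1), r(v.get('pa2'), i - 1))
    return ('f', v['label']) + children + parents      # context up to depth i-1

# The same substructure (a vertex labelled 'c') hanging below two different parents.
s1 = {'label': 'c'}
u = {'label': 'a', 'ch1': s1}
s1['pa1'] = u
s2 = {'label': 'c'}
w = {'label': 'b', 'ch1': s2}
s2['pa1'] = w

print(r(s1, 1) == r(s2, 1))   # True:  the causal representations collide
print(r(s1, 2) == r(s2, 2))   # False: one level of context separates them
```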

    In the second step we show that the numbers 0.R_i(v) and 0.1^{|R_i(v)|}, |R_i(v)| being the length of R_i(v), can be computed by a CRCC network. The general recurrence is of the form

    0.1^{|R_i(v)|} = 0.1^{8d} · 0.1^{|R_i(ch_1(v))|} · 0.1^{|R_i(ch_2(v))|} · 0.1^{|R_{i-1}(pa_1(v))|} · 0.1^{|R_{i-1}(pa_2(v))|}

    for the length (the factor 0.1^{8d} accounts for the eight fixed d-digit blocks of R_i(v), namely l_f, l_{(}, l(v), the four separating symbols l_{,}, and l_{)}) and

    0.R_i(v) = 0.1^d · l_f + 0.1^{2d} · l_{(} + 0.1^{3d} · l(v) + 0.1^{4d} · l_{,}
      + 0.1^{4d} · 0.R_i(ch_1(v))
      + 0.1^{5d} · 0.1^{|R_i(ch_1(v))|} · l_{,}
      + 0.1^{5d} · 0.1^{|R_i(ch_1(v))|} · 0.R_i(ch_2(v))
      + 0.1^{6d} · 0.1^{|R_i(ch_1(v))|} · 0.1^{|R_i(ch_2(v))|} · l_{,}
      + 0.1^{6d} · 0.1^{|R_i(ch_1(v))|} · 0.1^{|R_i(ch_2(v))|} · 0.R_{i-1}(pa_1(v))
      + 0.1^{7d} · 0.1^{|R_i(ch_1(v))|} · 0.1^{|R_i(ch_2(v))|} · 0.1^{|R_{i-1}(pa_1(v))|} · l_{,}
      + 0.1^{7d} · 0.1^{|R_i(ch_1(v))|} · 0.1^{|R_i(ch_2(v))|} · 0.1^{|R_{i-1}(pa_1(v))|} · 0.R_{i-1}(pa_2(v))
      + 0.1^{8d} · 0.1^{|R_i(ch_1(v))|} · 0.1^{|R_i(ch_2(v))|} · 0.1^{|R_{i-1}(pa_1(v))|} · 0.1^{|R_{i-1}(pa_2(v))|} · l_{)}


    for the representation. This can be computed by a CRCC network with multiplicative neurons since the length does not depend on the representation, hence the restricted recurrence of CRCC is accounted for. As beforehand, the identity as activation function can be approximated arbitrarily well by σ.

    Fig. 7. Approximation capability of RCC and CRCC networks when considering approximation in probability (the figure relates RNN, RCC, and CRCC to supersource transductions on sequences, trees, and DBAGs, and to I/O-isomorphic transductions on trees and DBAGs).
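    Analogously to the tree case, the digit bookkeeping of the recurrence for 0.R_i(v) can be checked by assembling a single step of it, assuming the codes of the children and parents are already available (a Python sketch with illustrative one-digit codes; all names and values are assumptions of the example).

```python
# Sketch: one application of the recurrence for 0.R_i(v), checked against the
# directly concatenated digit string. All codes are illustrative (d = 1).
d = 1
l_f, l_open, l_comma, l_close = 9, 8, 7, 6   # codes for f, '(', ',', ')'
label = 5                                     # l(v)

def as_number(digits):
    return float('0.' + digits)

# Codes of the two children (level i) and the two parents (level i-1),
# assumed to be already computed.
R_ch1, R_ch2, R_pa1, R_pa2 = '4', '3', '2', '1'
s1, s2, s3, s4 = (0.1 ** len(R) for R in (R_ch1, R_ch2, R_pa1, R_pa2))

value = (0.1 ** d * l_f + 0.1 ** (2 * d) * l_open + 0.1 ** (3 * d) * label
         + 0.1 ** (4 * d) * l_comma
         + 0.1 ** (4 * d) * as_number(R_ch1)
         + 0.1 ** (5 * d) * s1 * l_comma
         + 0.1 ** (5 * d) * s1 * as_number(R_ch2)
         + 0.1 ** (6 * d) * s1 * s2 * l_comma
         + 0.1 ** (6 * d) * s1 * s2 * as_number(R_pa1)
         + 0.1 ** (7 * d) * s1 * s2 * s3 * l_comma
         + 0.1 ** (7 * d) * s1 * s2 * s3 * as_number(R_pa2)
         + 0.1 ** (8 * d) * s1 * s2 * s3 * s4 * l_close)

# R_i(v) = l_f l_( l(v) l_, R(ch1) l_, R(ch2) l_, R(pa1) l_, R(pa2) l_)
digits = '985' + '7' + R_ch1 + '7' + R_ch2 + '7' + R_pa1 + '7' + R_pa2 + '6'
print(value, as_number(digits))                       # agree up to float rounding
# Length neuron: 0.1^{8d} * s1 * s2 * s3 * s4 equals 0.1^{|R_i(v)|}
print(0.1 ** (8 * d) * s1 * s2 * s3 * s4, 0.1 ** len(digits))
```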

    Hence, CRCC extends causal models in the sense that it allows for general I/O-isomorphic transductions and supersource transductions on DBAGs, which extend trees to an important subclass of acyclic graphs. Note that this result also includes the fact that non-causal I/O-isomorphic transductions on trees can be approximated by CRCC models; thus, CRCC also extends the capacity of RCC on tree structures. Every (general) DPAG can be transformed to a DBAG by re-enumeration of the children and extension of the fan-in/fan-out, as discussed in [18]. Hence, the overall picture as depicted in Figure 7 arises.

    Actually, the capacity of CRCC is even larger: as discussed in Theorem 15, the functional form term(x(v)) for CRCC neurons x and vertices v of DPAGs can uniquely be encoded by multiplicative neurons up to finite depth and for a finite label alphabet. Therefore, all I/O-isomorphic or supersource transductions of structures which are supported by CRCC networks, i.e. for which the expression term(x(v)) yields unique representations, can be approximated by CRCC networks. We can identify general DPAGs with graphs with directed and enumerated connections from vertices pointing to their children and to their parents. As demonstrated in Theorem 7, all (labeled) paths within the graph starting from any vertex can be recovered from term(x(v)) if the number of hidden neurons of the CRCC network is large enough. Thus, any two structures which differ in at least one such path can be differentiated by CRCC networks, i.e. approximation completeness holds for this larger class of structures which are pairwise distinguishable by at least one path.


    6 Conclusion

    In this chapter we have analysed, mainly from a computational point of view, Recursive Neural Network models for the adaptive processing of structured data. These models are able to directly deal with the discrete nature of structured data, allowing a unified treatment of both the discrete topologies and the real-valued information attached to each vertex of the structures. Specifically, RNNs allow for adaptive transductions on sequences and trees assuming a causal recursive processing system. However, causality affects both the computational capabilities and the classes of data that can be treated. Some of these capabilities can be recovered by contextual approaches, which provide an elegant way to extend the classes of data that RNNs can deal with for learning.

    We have reported and compared in a common framework the computational properties already presented in the literature for these models. Moreover, we have presented new results which allowed us to investigate the intimate relationships among models and classes of data. Specifically, we discussed the role of contextual information for adaptive processing of structured domains, evaluating the computational capability of the models with respect to the different classes of data. The use of contextual models allows us to properly extend the class of transductions from sequences and trees, for which causal models are complete, to classes of DPAGs. Moreover, the characterization of completeness introduced throughout this chapter allowed us to show that, even if causal models can treat DPAG structures, a proper extension to classes of DPAGs requires the use of contextual models. In fact, causal models induce collisions in the state representations of replicated substructures inserted into different contexts. This feature of causal models can be characterized through the definition of a constraint of invariance with respect to the unsharing construction we have defined for DPAGs. Contextual models, besides the extension to contextual transductions over sequences and trees, allow the relaxation of this constraint over DPAGs. This general observation allowed us to formally characterize the classes of structures and the transductions that the models can support.

    This initial picture, which relies on the functional form of the models, has been completed by results on the approximation capability using neural networks as transfer functions. Essentially, approximation completeness in probability can be proved wherever the functional representation supports the respective transduction. Note that, unlike [13], we discussed the practically relevant setting of approximation up to a set of small probability, such that the maximum recursive depth is limited in all concrete settings. For these situations, when considering multiplicative neurons, we obtain approximation completeness of RecCC for supersource transductions on tree structures, and approximation completeness of CRecCC for I/O-isomorphic and supersource transductions on tree structures and on acyclic graphs which are distinguishable by at least one path.


    Interestingly, all cascade correlation models can be trained in a quite effective way with an extension of cascade correlation, determining in the same run the weights and the structure of the neural networks. These models have already been successfully applied to QSAR and QSPR problems in chemistry [4, 2, 27, 28, 3, 1, 6, 7]. Although these tasks consist only of supersource transductions, since the activity resp. relevant quantities of chemical molecules have to be predicted, the possibility to incorporate an expressive context proved beneficial for these tasks, as can be seen by an inspection of the context representation within the networks [22, 27].

    References

    1. L. Bernazzani, C. Duce, A. Micheli, V. Mollica, A. Sperduti, A. Starita, and M. R. Tine. Predicting physical-chemical properties of compounds from molecular structures by recursive neural networks. J. Chem. Inf. Model., 46(5):2030–2042, 2006.
    2. A. M. Bianucci, A. Micheli, A. Sperduti, and A. Starita. Application of cascade correlation networks for structures to chemistry. Journal of Applied Intelligence (Kluwer Academic Publishers), 12:117–146, 2000.
    3. A. M. Bianucci, A. Micheli, A. Sperduti, and A. Starita. A Novel Approach to QSPR/QSAR Based on Neural Networks for Structures, pages 265–296. Springer-Verlag, Heidelberg, March 2003.
    4. A. M. Bianucci, A. Micheli, A. Sperduti, and A. Starita. Quantitative structure-activity relationships of benzodiazepines by Recursive Cascade Correlation. In IEEE, editor, Proceedings of IJCNN '98 - IEEE World Congress on Computational Intelligence, pages 117–122, Anchorage, Alaska, May 1998.
    5. R. C. Carrasco and M. L. Forcada. Simple strategies to encode tree automata in sigmoid recursive neural networks. IEEE TKDE, 13(2):148–156, 2001.
    6. C. Duce, A. Micheli, R. Solaro, A. Starita, and M. R. Tine. Prediction of chemical-physical properties by neural networks for structures. Macromolecular Symposia, 234(1):13–19, 2006.
    7. C. Duce, A. Micheli, A. Starita, M. R. Tine, and R. Solaro. Prediction of polymer properties from their structure by recursive neural networks. Macromolecular Rapid Communications, 27(9):711–715, 2006.
    8. J. L. Elman. Finding structure in time. Cognitive Science, 14:179–211, 1990.
    9. S. E. Fahlman. The recurrent cascade-correlation architecture. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 190–196, San Mateo, CA, 1991. Morgan Kaufmann Publishers.
    10. S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 524–532. San Mateo, CA: Morgan Kaufmann, 1990.
    11. P. Frasconi and M. Gori. Computational capabilities of local-feedback recurrent networks acting as finite-state machines. IEEE Transactions on Neural Networks, 7(6):1521–1524, 1996.
    12. P. Frasconi, M. Gori, A. Kuechler, and A. Sperduti. From sequences to data structures: Theory and applications. In J. Kolen and S. Kremer, editors, A Field Guide to Dynamic Recurrent Networks, pages 351–374. IEEE Press, 2001.


    13. C. L. Giles, D. Chen, G. Z. Sun, H. H. Chen, Y. C. Lee, and M. W. Goudreau. Constructive learning of recurrent neural networks: limitations of recurrent cascade correlation and a simple solution. IEEE Transactions on Neural Networks, 6(4):829–836, 1995.
    14. M. Gori, A. Kuchler, and A. Sperduti. On the implementation of frontier-to-root tree automata in recursive neural networks. IEEE Transactions on Neural Networks, 10(6):1305–1314, 1999.
    15. B. Hammer. On the learnability of recursive data. Mathematics of Control Signals and Systems, 12:62–79, 1999.
    16. B. Hammer. Learning with Recurrent Neural Networks, volume 254 of Springer Lecture Notes in Control and Information Sciences. Springer-Verlag, 2000.
    17. B. Hammer. Generalization ability of folding networks. IEEE TKDE, 13(2):196–206, 2001.
    18. B. Hammer, A. Micheli, and A. Sperduti. Universal approximation capability of cascade correlation for structures. Neural Computation, 17(5):1109–1159, 2005.
    19. K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, pages 359–366, 1989.
    20. K. Funahashi and Y. Nakamura. Approximation of dynamical systems by continuous time recurrent neural networks. Neural Networks, 6(6):801–806, 1993.
    21. J. Kilian and H. T. Siegelmann. The dynamic universality of sigmoidal neural networks. Information and Computation, 128, 1996.
    22. A. Micheli. Recursive Processing of Structured Domains in Machine Learning. Ph.D. thesis: TD-13/03, Department of Computer Science, University of Pisa, Pisa, Italy, December 2003.
    23. A. Micheli, F. Portera, and A. Sperduti. A preliminary empirical comparison of recursive neural networks and tree kernel methods on regression tasks for tree structured domains. Neurocomputing, 64:73–92, March 2005.
    24. A. Micheli, D. Sona, and A. Sperduti. Bi-causal recurrent cascade correlation. In IJCNN'2000 - Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, volume 3, pages 3–8, 2000.
    25. A. Micheli, D. Sona, and A. Sperduti. Recursive cascade correlation for contextual processing of structured data. In Proc. of the Int. Joint Conf. on Neural Networks - WCCI-IJCNN'2002, volume 1, pages 268–273, 2002.
    26. A. Micheli, D. Sona, and A. Sperduti. Contextual processing of structured data by recursive cascade correlation. IEEE Trans. Neural Networks, 15(6):1396–1410, November 2004.
    27. A. Micheli, A. Sperduti, A. Starita, and A. M. Bianucci. Analysis of the internal representations developed by neural networks for structures applied to quantitative structure-activity relationship studies of benzodiazepines. Journal of Chem. Inf. and Comp. Sci., 41(1):202–218, January 2001.
    28. A. Micheli, A. Sperduti, A. Starita, and A. M. Bianucci. Design of new biologically active molecules by recursive neural networks. In IJCNN'2001 - Proceedings of the INNS-IEEE International Joint Conference on Neural Networks, pages 2732–2737, Washington, DC, July 2001.
    29. C. W. Omlin and C. L. Giles. Stable encoding of large finite-state automata in recurrent neural networks with sigmoid discriminants. Neural Computation, 8, 1996.
    30. H. T. Siegelmann. The simple dynamics of super Turing theories. Theoretical Computer Science, 168, 1996.


    31. H. T. Siegelmann and E. D. Sontag. Analog computation, neural networks, and circuits. Theoretical Computer Science, 131, 1994.
    32. H. T. Siegelmann and E. D. Sontag. On the computational power of neural networks. Journal of Computer and System Sciences, 50, 1995.
    33. A. Sperduti. On the computational power of recurrent neural networks for structures. Neural Networks, 10(3):395–400, 1997.
    34. A. Sperduti and A. Starita. Supervised neural networks for the classification of structures. IEEE Trans. Neural Networks, 8(3):714–735, May 1997.