Discrete Signal Processing on Graphs

Aliaksei Sandryhaila, Member, IEEE, and José M. F. Moura, Fellow, IEEE

arXiv:1210.4752v2 [cs.SI] 28 Dec 2012

Abstract: In social settings, individuals interact through webs of relationships. Each individual is a node in a complex network (or graph) of interdependencies and generates data, lots of data. We label the data by its source, or formally stated, we index the data by the nodes of the graph. The resulting signals (data indexed by the nodes) are far removed from time or image signals indexed by well ordered time samples or pixels. DSP, discrete signal processing, provides a comprehensive, elegant, and efficient methodology to describe, represent, transform, analyze, process, or synthesize these well ordered time or image signals. This paper extends to signals on graphs DSP and its basic tenets, including filters, convolution, z-transform, impulse response, spectral representation, Fourier transform, frequency response, and illustrates DSP on graphs by classifying blogs, linear predicting and compressing data from irregularly located weather stations, or predicting behavior of customers of a mobile service provider.

Keywords: Network science, signal processing, graphical models, Markov random fields, graph Fourier transform.

I. INTRODUCTION

There is an explosion of interest in processing and analyzing large datasets collected in very different settings, including social and economic networks, information networks, internet and the world wide web, immunization and epidemiology networks, molecular and gene regulatory networks, citation and coauthorship studies, friendship networks, as well as physical infrastructure networks like sensor networks, power grids, transportation networks, and other networked critical infrastructures. We briefly overview some of the existing work.

Many authors focus on the underlying relational structure of the data by: 1) inferring the structure from community relations and friendships, or from perceived alliances between agents as abstracted through game theoretic models [1], [2]; 2) quantifying the connectedness of the world; and 3) determining the relevance of particular agents, or studying the strength of their interactions. Other authors are interested in the network function by quantifying the impact of the network structure on the diffusion of disease, spread of news and information, voting trends, imitation and social influence, crowd behavior, failure propagation, global behaviors developing from seemingly random local interactions [2], [3], [4]. Many of these works either develop or assume network models that capture the interdependencies among the data and then analyze the structural properties of these networks. Models often considered may be deterministic like complete or regular graphs, or random like the Erdős-Rényi and Poisson graphs, the configuration and expected degree models, small world or scale free networks [2], [4], to mention a few. These models are used to quantify network characteristics, such as connectedness, existence and size of the giant component, distribution of component sizes, degree and clique distributions, and node or edge specific parameters including clustering coefficients, path length, diameter, betweenness and closeness centralities.

Copyright (c) 2012 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

This work was supported in part by AFOSR grant FA95501210087. A. Sandryhaila and J. M. F. Moura are with the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213-3890. Ph: (412)268-6341; fax: (412)268-3890. Email: [email protected], [email protected].

Another body of literature is concerned with inference and learning from such large datasets. Much work falls under the generic label of graphical models [5], [6], [7], [8], [9], [10]. In graphical models, data is viewed as a family of random variables indexed by the nodes of a graph, where the graph captures probabilistic dependencies among data elements. The random variables are described by a family of joint probability distributions. For example, directed (acyclic) graphs [11], [12] represent Bayesian networks where each random variable is independent of others given the variables defined on its parent nodes. Undirected graphical models, also referred to as Markov random fields [13], [14], describe data where the variables defined on two sets of nodes separated by a boundary set of nodes are statistically independent given the variables on the boundary set. A key tool in graphical models is the Hammersley-Clifford theorem [13], [15], [16], and the Markov-Gibbs equivalence that, under appropriate positivity conditions, factors the joint distribution of the graphical model as a product of potentials defined on the cliques of the graph. Graphical models exploit this factorization and the structure of the indexing graph to develop efficient algorithms for inference by controlling their computational cost. Inference in graphical models is generally defined as finding from the joint distributions lower order marginal distributions, likelihoods, modes, and other moments of individual variables or their subsets. Common inference algorithms include belief propagation and its generalizations, as well as other message passing algorithms. A recent block-graph algorithm for fast approximate inference, in which the nodes are non-overlapping clusters of nodes from the original graph, is in [17]. Graphical models are employed in many areas; for sample applications, see [18] and references therein.

Extensive work is dedicated to discovering efficient data representations for large high-dimensional data [19], [20], [21], [22]. Many of these works use spectral graph theory and the graph Laplacian [23] to derive low-dimensional representations by projecting the data on a low-dimensional subspace generated by a small subset of the Laplacian eigenbasis. The graph Laplacian approximates the Laplace-Beltrami operator on a compact manifold [24], [21], in the sense that if the dataset is large and samples uniformly randomly a low-dimensional manifold then the (empirical) graph Laplacian acting on a smooth function on this manifold is a good discrete approximation that converges pointwise and uniformly to the elliptic Laplace-Beltrami operator applied to this function as the number of points goes to infinity [25], [26], [27]. One can go beyond the choice of the graph Laplacian by choosing discrete approximations to other continuous operators and obtaining possibly more desirable spectral bases for the characterization of the geometry of the manifold underlying the data. For example, if the data represents a non-uniform sampling of a continuous manifold, a conjugate to an elliptic Schrödinger-type operator can be used [28], [29], [30].

More in line with our paper, several works have proposed multiple transforms for data indexed by graphs. Examples include regression algorithms [31], wavelet decompositions [32], [33], [34], [30], [35], filter banks on graphs [36], [37], denoising [38], and compression [39]. Some of these transforms focus on distributed processing of data from sensor fields while addressing sampling irregularities due to random sensor placement. Others consider localized processing of signals on graphs in multiresolution fashion by representing data using wavelet-like bases with varying "smoothness" or defining transforms based on node neighborhoods. In the latter case, the graph Laplacian and its eigenbasis are sometimes used to define a spectrum and a Fourier transform of a signal on a graph. This definition of a Fourier transform was also proposed for use in uncertainty analysis on graphs [40], [41]. This graph Fourier transform is derived from the graph Laplacian and restricted to undirected graphs with real, non-negative edge weights, not extending to data indexed by directed graphs or graphs with negative or complex weights.

The algebraic signal processing (ASP) theory [42], [43], [44], [45] is a formal, algebraic approach to analyze data indexed by special types of line graphs and lattices. The theory uses an algebraic representation of signals and filters as polynomials to derive fundamental signal processing concepts. This framework has been used for discovery of fast computational algorithms for discrete signal transforms [42], [46], [47]. It was extended to multidimensional signals and nearest neighbor graphs [48], [49] and applied in signal compression [50], [51]. The framework proposed in this paper generalizes and extends the ASP to signals on arbitrary graphs.

Contribution

Our goal is to develop a linear discrete signal processing (DSP) framework and corresponding tools for datasets arising from social, biological, and physical networks. DSP has been very successful in processing time signals (such as speech, communications, radar, or econometric time series), space-dependent signals (images and other multidimensional signals like seismic and hyperspectral data), and time-space signals (video). We refer to data indexed by nodes of a graph as a graph signal or simply signal and to our approach as DSP on graphs (DSPG)¹. We introduce the basics of linear² DSPG, including the notion of a shift on a graph, filter structure, filtering and convolution, signal and filter spaces and their algebraic structure, the graph Fourier transform, frequency, spectrum, spectral decomposition, and impulse and frequency responses. With respect to other works, ours is a deterministic framework to signal processing on graphs rather than a statistical approach like graphical models. Our work is an extension and generalization of the traditional DSP, and generalizes the ASP theory [42], [43], [44], [45] and its extensions and applications [49], [50], [51]. We emphasize the contrast between the DSPG and the approach to the graph Fourier transform that takes the graph Laplacian as a point of departure [32], [38], [36], [35], [39], [41]. In the latter case, the Fourier transform on graphs is given by the eigenbasis of the graph Laplacian. However, this definition is not applicable to directed graphs, which often arise in real-world problems, as demonstrated by examples in Section VI, and graphs with negative weights. In general, the graph Laplacian is a second-order operator for signals on a graph, whereas an adjacency matrix is a first-order operator. Deriving a graph Fourier transform from the graph Laplacian is analogous in traditional DSP to restricting signals to be even (like correlation sequences) and Fourier transforms to represent power spectral densities of signals. Instead, we demonstrate that the graph Fourier transform is properly defined through the Jordan normal form and generalized eigenbasis of the adjacency matrix³. Finally, we illustrate the DSPG with applications like classification, compression, and linear prediction for datasets that include blogs, customers of a mobile operator, or data collected by a network of irregularly placed weather stations.

¹ The term "signal processing for graphs" has been used in [52], [53] in reference to graph structure analysis and subgraph detection. It should not be confused with our proposed DSP framework, which aims at the analysis and processing of data indexed by the nodes of a graph.

² We are concerned with linear operations; in the sequel, we refer only to DSPG but have in mind that we are restricted to linear DSPG.

II. SIGNALS ON GRAPHS

Consider a dataset with $N$ elements, for which some relational information about its data elements is known. Examples include preferences of individuals in a social network and their friendship connections, the number of papers published by authors and their coauthorship relations, or topics of online documents in the World Wide Web and their hyperlink references. This information can be represented by a graph $G = (\mathcal{V}, \mathbf{A})$, where $\mathcal{V} = \{v_0, \ldots, v_{N-1}\}$ is the set of nodes and $\mathbf{A}$ is the weighted⁴ adjacency matrix of the graph. Each dataset element corresponds to node $v_n$, and each weight $\mathbf{A}_{n,m}$ of a directed edge from $v_m$ to $v_n$ reflects the degree of relation of the $n$th element to the $m$th one. Since data elements can be related to each other differently, in general, $G$ is a directed, weighted graph. Its edge weights $\mathbf{A}_{n,m}$ are not restricted to being nonnegative reals; they can take arbitrary real or complex values (for example, if data elements are negatively correlated). The set of indices of nodes connected to $v_n$ is called the neighborhood of $v_n$ and denoted by $\mathcal{N}_n = \{m \mid \mathbf{A}_{n,m} \neq 0\}$.

³ Parts of this material also appeared in [54], [55]. In this paper, we present a complete theory with all derivations and proofs.

⁴ Some literature defines the adjacency matrix $\mathbf{A}$ of a graph $G = (\mathcal{V}, \mathbf{A})$ so that $\mathbf{A}_{n,m}$ only takes values of 0 or 1, depending on whether there is an edge from $v_m$ to $v_n$, and specifies edge weights as a function on pairs of nodes. In this paper, we incorporate edge weights into $\mathbf{A}$.

Fig. 1. Graph representations for different datasets (graph signals): (a) time series; (b) digital image; (c) sensor field; (d) hyperlinked documents.

Assuming, without loss of generality, that dataset elements are complex scalars, we define a graph signal as a map from the set $\mathcal{V}$ of nodes into the set of complex numbers $\mathbb{C}$:

$$ s : \mathcal{V} \to \mathbb{C}, \quad v_n \mapsto s_n. \tag{1} $$

Notice that each signal is isomorphic to a complex-valued vector with $N$ elements. Hence, for simplicity of discussion, we write graph signals as vectors $\mathbf{s} = (s_0 \; s_1 \; \ldots \; s_{N-1})^T$, but remember that each element $s_n$ is indexed by node $v_n$ of a given representation graph $G = (\mathcal{V}, \mathbf{A})$, as defined by (1). The space $\mathcal{S}$ of graph signals (1) then is identical to $\mathbb{C}^N$.

We illustrate representation graphs with examples shown in Fig. 1. The directed cyclic graph in Fig. 1(a) represents a finite, periodic discrete time series [44]. All edges are directed and have the same weight 1, reflecting the causality of a time series; and the edge from $v_{N-1}$ to $v_0$ reflects its periodicity. The two-dimensional rectangular lattice in Fig. 1(b) represents a general digital image. Each node corresponds to a pixel, and each pixel value (intensity) is related to the values of the four adjacent pixels. This relation is symmetric, hence all edges are undirected and have the same weight, with possible exceptions of boundary nodes that may have directed edges and/or different edge weights, depending on boundary conditions [45]. Other lattice models can be used for images as well [48]. The graph in Fig. 1(c) represents temperature measurements from 150 weather stations (sensors) across the United States. We represent the relations of temperature measurements by geodesic distances between sensors, so each node is connected to its closest neighbors. The graph in Fig. 1(d) represents a set of 50 political blogs in the World Wide Web connected by hyperlink references. By their nature, the edges are directed and have the same weights. We discuss the two latter examples in Section VI, where we also consider a network of customers of a mobile service provider. Clearly, representation graphs depend on prior knowledge and assumptions about datasets. For example, the graph in Fig. 1(d) is obtained by following the hyperlinks networking the blogs, while the graph in Fig. 1(c) is constructed from known locations of sensors under the assumption that temperature measurements at nearby sensors are highly correlated.

III. FILTERS ON GRAPHS

In classical DSP, signals are processed by filters, i.e., systems that take a signal as input and produce another signal as output. We now develop the equivalent concept of graph filters for graph signals in DSPG. We consider only linear, shift-invariant filters, which are a generalization of linear time-invariant filters used in DSP for time series. This section uses Jordan normal form and characteristic and minimal polynomials of matrices; these concepts are reviewed in Appendix A. The use of Jordan decomposition is required since for many real-world datasets the adjacency matrix $\mathbf{A}$ is not diagonalizable. One example is the blog dataset, considered in Section VI.

Graph Shift

In classical DSP, the basic building block of filters is a special filter $x = z^{-1}$ called the time shift or delay [56]. This is the simplest non-trivial filter that delays the input signal $\mathbf{s}$ by one sample, so that the $n$th sample of the output is $\tilde{s}_n = s_{n-1 \bmod N}$. Using the graph representation of finite, periodic time series in Fig. 1(a), for which the adjacency matrix is the $N \times N$ circulant matrix $\mathbf{A} = \mathbf{C}_N$, with weights [43], [44]

$$ \mathbf{A}_{n,m} = \begin{cases} 1, & \text{if } n - m = 1 \bmod N \\ 0, & \text{otherwise,} \end{cases} \tag{2} $$

we can write the time shift operation as

$$ \tilde{\mathbf{s}} = \mathbf{C}_N \mathbf{s} = \mathbf{A}\mathbf{s}. \tag{3} $$

In DSPG, we extend the notion of the shift (3) to general graph signals $\mathbf{s}$ where the relational dependencies among the data are represented by an arbitrary graph $G = (\mathcal{V}, \mathbf{A})$. We call the operation (3) the graph shift. It is realized by replacing the sample $s_n$ at node $v_n$ with the weighted linear combination of the signal samples at its neighbors:

$$ \tilde{s}_n = \sum_{m=0}^{N-1} \mathbf{A}_{n,m} s_m = \sum_{m \in \mathcal{N}_n} \mathbf{A}_{n,m} s_m. \tag{4} $$

Note that, in classical DSP, shifting a finite signal requires one to consider boundary conditions. In DSPG, this problem is implicitly resolved, since the graph $G = (\mathcal{V}, \mathbf{A})$ explicitly captures the boundary conditions.
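The graph shift (4) is simply a matrix-vector product, which a short NumPy sketch can make concrete. The helper names `cyclic_shift_matrix` and `graph_shift` and the small weighted graph below are assumptions made for illustration; they do not come from the paper.

```python
import numpy as np

def cyclic_shift_matrix(n: int) -> np.ndarray:
    """Adjacency matrix C_N of the directed cycle in (2): A[i, j] = 1 iff i - j = 1 mod n."""
    a = np.zeros((n, n))
    for j in range(n):
        a[(j + 1) % n, j] = 1.0
    return a

def graph_shift(a: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Graph shift (4): each output sample is a weighted sum of the neighbors' samples."""
    return a @ s

s = np.array([1.0, 2.0, 3.0, 4.0])

# Time series: the graph shift reduces to the circular delay of (3).
delayed = graph_shift(cyclic_shift_matrix(4), s)

# Hypothetical directed graph with arbitrary (including negative) weights.
A = np.array([[0.0, 0.5, 0.0, 0.0],
              [0.0, 0.0, -1.0, 0.0],
              [2.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 0.0]])
shifted = graph_shift(A, s)
```

For the cyclic graph the result is the input delayed by one sample with wrap-around; for the weighted graph, node 3 receives the sum of the samples at nodes 1 and 2, as prescribed by its neighborhood.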

Graph Filters

Similarly to traditional DSP, we can represent filtering on a graph using matrix-vector multiplication. Any system $\mathbf{H} \in \mathbb{C}^{N \times N}$, or graph filter, that for input $\mathbf{s} \in \mathcal{S}$ produces output $\mathbf{H}\mathbf{s}$ represents a linear system, since the filter's output for a linear combination of input signals equals the linear combination of outputs to each signal:

$$ \mathbf{H}(\alpha \mathbf{s}_1 + \beta \mathbf{s}_2) = \alpha \mathbf{H}\mathbf{s}_1 + \beta \mathbf{H}\mathbf{s}_2. $$

Furthermore, we focus on shift-invariant graph filters, for which applying the graph shift to the output is equivalent to applying the graph shift to the input prior to filtering:

$$ \mathbf{A}(\mathbf{H}\mathbf{s}) = \mathbf{H}(\mathbf{A}\mathbf{s}). \tag{5} $$

The next theorem establishes that all linear, shift-invariant graph filters are given by polynomials in the shift $\mathbf{A}$.

Theorem 1: Let $\mathbf{A}$ be the graph adjacency matrix and assume that its characteristic and minimal polynomials are equal: $p_{\mathbf{A}}(x) = m_{\mathbf{A}}(x)$. Then, a graph filter $\mathbf{H}$ is linear and shift invariant if and only if (iff) $\mathbf{H}$ is a polynomial in the graph shift $\mathbf{A}$, i.e., iff there exists a polynomial

$$ h(x) = h_0 + h_1 x + \ldots + h_L x^L \tag{6} $$

with possibly complex coefficients $h_\ell \in \mathbb{C}$, such that:

$$ \mathbf{H} = h(\mathbf{A}) = h_0 \mathbf{I} + h_1 \mathbf{A} + \ldots + h_L \mathbf{A}^L. \tag{7} $$

Proof: Since the shift-invariance condition (5) holds for all graph signals $\mathbf{s} \in \mathcal{S} = \mathbb{C}^N$, the matrices $\mathbf{A}$ and $\mathbf{H}$ commute: $\mathbf{A}\mathbf{H} = \mathbf{H}\mathbf{A}$. As $p_{\mathbf{A}}(x) = m_{\mathbf{A}}(x)$, each eigenvalue of $\mathbf{A}$ has exactly one eigenvector associated with it [57], [58]. Then, the graph matrix $\mathbf{H}$ commutes with the shift $\mathbf{A}$ iff it is a polynomial in $\mathbf{A}$ (see Proposition 12.4.1 in [58]).

Analogous to the classical DSP, we call the coefficients $h_\ell$ of the polynomial $h(x)$ in (6) the graph filter taps.
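To make (5) and (7) concrete, the sketch below builds a polynomial filter from its taps and checks numerically that it commutes with the shift. The function name `graph_filter`, the random matrix, and the tap values are illustrative assumptions, not part of the paper.

```python
import numpy as np

def graph_filter(a: np.ndarray, taps) -> np.ndarray:
    """Evaluate H = h0 I + h1 A + ... + hL A^L as in (7), using Horner's rule."""
    n = a.shape[0]
    h = np.zeros((n, n), dtype=complex)
    for tap in taps[::-1]:          # highest-degree tap first
        h = h @ a + tap * np.eye(n)
    return h

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))     # hypothetical directed, weighted graph
H = graph_filter(A, [1.0, -0.5, 0.25])
s = rng.standard_normal(5)

# Shift invariance (5): shifting then filtering equals filtering then shifting.
shift_then_filter = H @ (A @ s)
filter_then_shift = A @ (H @ s)
```

Horner's rule is used here only to avoid forming explicit matrix powers; any other way of evaluating the polynomial in $\mathbf{A}$ would serve equally well.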

Properties of Graph Filters

Theorem 1 requires the equality of the characteristic and minimal polynomials $p_{\mathbf{A}}(x)$ and $m_{\mathbf{A}}(x)$. This condition does not always hold, but can be successfully addressed through the concept of equivalent graph filters, as defined next.

Definition 1: Given any shift matrices $\mathbf{A}$ and $\tilde{\mathbf{A}}$, filters $h(\mathbf{A})$ and $g(\tilde{\mathbf{A}})$ are called equivalent if for all input signals $\mathbf{s} \in \mathcal{S}$ they produce equal outputs: $h(\mathbf{A})\mathbf{s} = g(\tilde{\mathbf{A}})\mathbf{s}$.

Note that, when no restrictions are placed on the signals, so that $\mathcal{S} = \mathbb{C}^N$, Definition 1 is equivalent to requiring $h(\mathbf{A}) = g(\tilde{\mathbf{A}})$ as matrices. However, if additional restrictions exist, filters may not necessarily be equal as matrices and still produce the same output for the considered set of signals.

It follows that, given an arbitrary $G = (\mathcal{V}, \mathbf{A})$ with $p_{\mathbf{A}}(x) \neq m_{\mathbf{A}}(x)$, we can consider another graph $\tilde{G} = (\mathcal{V}, \tilde{\mathbf{A}})$ with the same set of nodes $\mathcal{V}$ but potentially different edges and edge weights, for which $p_{\tilde{\mathbf{A}}}(x) = m_{\tilde{\mathbf{A}}}(x)$ holds true. Then graph filters on $G$ can be expressed as equivalent filters on $\tilde{G}$, as described by the following theorem (proven in Appendix B).

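Theorem 2: For any matrix $\mathbf{A}$ there exists a matrix $\tilde{\mathbf{A}}$ and polynomial $r(x)$, such that $\mathbf{A} = r(\tilde{\mathbf{A}})$ and $p_{\tilde{\mathbf{A}}}(x) = m_{\tilde{\mathbf{A}}}(x)$.

As a consequence of Theorem 2, any filter on the graph $G = (\mathcal{V}, \mathbf{A})$ is equivalent to a filter on the graph $\tilde{G} = (\mathcal{V}, \tilde{\mathbf{A}})$, since $h(\mathbf{A}) = h(r(\tilde{\mathbf{A}})) = (h \circ r)(\tilde{\mathbf{A}})$, where $h \circ r$ is the composition of polynomials $h$ and $r$ and thus is a polynomial. Thus, the condition $p_{\mathbf{A}}(x) = m_{\mathbf{A}}(x)$ in Theorem 1 can be assumed to hold for any graph $G = (\mathcal{V}, \mathbf{A})$. Otherwise, by Theorem 2, we can replace the graph by another $\tilde{G} = (\mathcal{V}, \tilde{\mathbf{A}})$ for which the condition holds and assign $\tilde{\mathbf{A}}$ to $\mathbf{A}$.

The next result demonstrates that we can limit the number of taps in any graph filter.

Theorem 3: Any graph filter (7) has a unique equivalent filter on the same graph with at most $\deg m_{\mathbf{A}}(x) = N_{\mathbf{A}}$ taps.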

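Proof: Consider the polynomial $h(x)$ in (6). By polynomial division, there exist unique polynomials $q(x)$ and $r(x)$:

$$ h(x) = q(x)\, m_{\mathbf{A}}(x) + r(x), \tag{8} $$

where $\deg r(x) < N_{\mathbf{A}}$. Hence, we can express (7) as

$$ h(\mathbf{A}) = q(\mathbf{A})\, m_{\mathbf{A}}(\mathbf{A}) + r(\mathbf{A}) = q(\mathbf{A})\, \mathbf{0}_N + r(\mathbf{A}) = r(\mathbf{A}). $$

Thus, $h(\mathbf{A}) = r(\mathbf{A})$ and $\deg r(x) < \deg m_{\mathbf{A}}(x)$.

As follows from Theorem 3, all linear, shift-invariant filters (7) on a graph $G = (\mathcal{V}, \mathbf{A})$ form a vector space

$$ \mathcal{F} = \left\{ \mathbf{H} : \mathbf{H} = \sum_{\ell=0}^{N_{\mathbf{A}}-1} h_\ell \mathbf{A}^\ell \;\middle|\; h_\ell \in \mathbb{C} \right\}. \tag{9} $$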

Moreover, addition and multiplication of filters in $\mathcal{F}$ produce new filters that are equivalent to filters in $\mathcal{F}$. Thus, $\mathcal{F}$ is closed under these operations, and has the structure of an algebra [43]. We discuss it in detail in Section IV.
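The polynomial division in the proof of Theorem 3 can be tried out numerically. The following sketch is an illustration under stated assumptions: it divides by the characteristic polynomial returned by `np.poly`, which coincides with the minimal polynomial only when $p_{\mathbf{A}}(x) = m_{\mathbf{A}}(x)$ (true for a generic matrix such as the random one used here). The names `graph_filter`, `long_taps`, and `short_taps` are hypothetical.

```python
import numpy as np

def graph_filter(a: np.ndarray, taps) -> np.ndarray:
    """Evaluate h(A) = sum_l taps[l] * A^l with Horner's rule."""
    n = a.shape[0]
    h = np.zeros((n, n), dtype=complex)
    for tap in taps[::-1]:
        h = h @ a + tap * np.eye(n)
    return h

rng = np.random.default_rng(1)
A = 0.5 * rng.standard_normal((4, 4))    # generic 4-node graph (assumed p_A = m_A)
long_taps = rng.standard_normal(7)        # h0..h6, lowest degree first

# np.poly and np.polydiv use highest-degree-first coefficient order.
p_A = np.poly(A)                          # characteristic polynomial of A
_, remainder = np.polydiv(long_taps[::-1], p_A)
short_taps = remainder[::-1]              # at most N = 4 taps, lowest degree first

H_long = graph_filter(A, long_taps)
H_short = graph_filter(A, short_taps)
```

Both tap vectors produce the same matrix, so a filter with seven taps on this four-node graph has an equivalent filter with at most four taps.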

Another consequence of Theorem 3 is that the inverse of a filter on a graph, if it exists, is also a filter on the same graph, i.e., it is a polynomial in (9).

Theorem 4: A graph filter $\mathbf{H} = h(\mathbf{A}) \in \mathcal{F}$ is invertible iff the polynomial $h(x)$ satisfies $h(\lambda_m) \neq 0$ for all distinct eigenvalues $\lambda_0, \ldots, \lambda_{M-1}$ of $\mathbf{A}$. Then, there is a unique polynomial $g(x)$ of degree $\deg g(x) < N_{\mathbf{A}}$ that satisfies

$$ h(\mathbf{A})^{-1} = g(\mathbf{A}) \in \mathcal{F}. \tag{10} $$

Appendix C contains the proof and the procedure for the construction of $g(x)$. Theorem 4 implies that instead of inverting the $N \times N$ matrix $h(\mathbf{A})$ directly we only need to construct a polynomial $g(x)$ specified by at most $N_{\mathbf{A}}$ taps.
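The general construction of $g(x)$ is given in Appendix C and is not reproduced here. As a minimal sketch for the special case where $\mathbf{A}$ has $N$ distinct eigenvalues, one can instead interpolate the values $g(\lambda_m) = 1/h(\lambda_m)$ through a Vandermonde system. This interpolation shortcut, the example graph, and the variable names are assumptions for illustration only.

```python
import numpy as np

def graph_filter(a: np.ndarray, taps) -> np.ndarray:
    """Evaluate h(A) = sum_l taps[l] * A^l with Horner's rule."""
    n = a.shape[0]
    h = np.zeros((n, n), dtype=complex)
    for tap in taps[::-1]:
        h = h @ a + tap * np.eye(n)
    return h

# Directed 3-cycle: eigenvalues are the three cube roots of unity (all distinct).
A = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
h = np.array([2.0, 1.0])                 # h(x) = 2 + x, nonzero on the unit circle

lam = np.linalg.eigvals(A)
vandermonde = np.vander(lam, increasing=True)       # rows (1, lam_m, lam_m^2)
target = 1.0 / np.polyval(h[::-1], lam)             # required values g(lam_m)
g = np.linalg.solve(vandermonde, target)            # taps g0, g1, g2

H = graph_filter(A, h)
G = graph_filter(A, g)
```

Here `g` comes out as approximately (4/9, -2/9, 1/9), which can be confirmed by hand from $(2 + x)\,g(x) = 1 \bmod (x^3 - 1)$. When eigenvalues repeat, derivative conditions are needed as well and this sketch no longer applies.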

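Finally, it follows from Theorem 3 and (9) that any graph filter $h(\mathbf{A}) \in \mathcal{F}$ is completely specified by its taps $h_0, \ldots, h_{N_{\mathbf{A}}-1}$. As we prove next, in DSPG, as in traditional DSP, filter taps uniquely determine the impulse response of the filter, i.e., its output $\mathbf{u} = (g_0, \ldots, g_{N-1})^T$ for unit impulse input $\boldsymbol{\delta} = (1, 0, \ldots, 0)^T$, and vice versa.

Theorem 5: The filter taps $h_0, \ldots, h_{N_{\mathbf{A}}-1}$ of the filter $h(\mathbf{A})$ uniquely determine its impulse response $\mathbf{u}$. Conversely, the impulse response $\mathbf{u}$ uniquely determines the filter taps, provided $\operatorname{rank} \widehat{\mathbf{A}} = N_{\mathbf{A}}$, where $\widehat{\mathbf{A}} = \left( \mathbf{A}^0 \boldsymbol{\delta}, \ldots, \mathbf{A}^{N_{\mathbf{A}}-1} \boldsymbol{\delta} \right)$.

Proof: The first part follows from the definition of filtering: $\mathbf{u} = h(\mathbf{A})\boldsymbol{\delta} = \widehat{\mathbf{A}}\mathbf{h}$ yields the first column of $h(\mathbf{A})$, which is uniquely determined by the taps $\mathbf{h} = (h_0, \ldots, h_{N_{\mathbf{A}}-1})^T$. Since we assume $p_{\mathbf{A}}(x) = m_{\mathbf{A}}(x)$, then $N = N_{\mathbf{A}}$, and the second part holds if $\widehat{\mathbf{A}}$ is invertible, i.e., $\operatorname{rank} \widehat{\mathbf{A}} = N_{\mathbf{A}}$.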

Notice that a relabeling of the nodes $v_0, \ldots, v_{N-1}$ does not change the impulse response. If $\mathbf{P}$ is the corresponding permutation matrix, then the unit impulse is $\mathbf{P}\boldsymbol{\delta}$, the adjacency matrix is $\mathbf{P}\mathbf{A}\mathbf{P}^T$, and the filter becomes $h(\mathbf{P}\mathbf{A}\mathbf{P}^T) = \mathbf{P}\, h(\mathbf{A})\, \mathbf{P}^T$. Hence, the impulse response is simply reordered according to the same permutation: $\mathbf{P}\, h(\mathbf{A})\, \mathbf{P}^T \mathbf{P}\boldsymbol{\delta} = \mathbf{P}\mathbf{u}$.

IV. ALGEBRAIC MODEL

So far, we presented signals and filters on graphs as vectors and matrices. An alternative representation exists for filters and signals as polynomials. We call this representation the graph $z$-transform, since, as we show, it generalizes the traditional $z$-transform for discrete time signals that maps signals and filters to polynomials or series in $z^{-1}$. The graph $z$-transform is defined separately for graph filters and signals.
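Theorem 5 can be illustrated on a small example. The sketch below assumes a hypothetical weighted directed 3-cycle (three distinct eigenvalues, so $N_{\mathbf{A}} = N = 3$), forms the matrix of shifted impulses from the theorem (called `A_hat` here), and recovers the taps from the impulse response by solving a linear system.

```python
import numpy as np

# Hypothetical weighted directed 3-cycle; A[n, m] is the weight of edge v_m -> v_n.
A = np.array([[0.0, 0.0, 2.0],
              [1.0, 0.0, 0.0],
              [0.0, 3.0, 0.0]])
h = np.array([1.0, 2.0, 3.0])            # taps h0, h1, h2
n = A.shape[0]

delta = np.zeros(n)
delta[0] = 1.0                           # unit impulse at node v_0

# Columns A^0 delta, A^1 delta, ..., A^{N-1} delta.
columns = [delta]
for _ in range(n - 1):
    columns.append(A @ columns[-1])
A_hat = np.column_stack(columns)

u = A_hat @ h                            # impulse response u = h(A) delta
h_recovered = np.linalg.solve(A_hat, u)  # valid because A_hat has full rank here
```

For this graph the shifted impulses are scaled unit vectors, so the impulse response is the tap vector with each entry scaled by the accumulated edge weights. If the matrix of shifted impulses were rank deficient, the taps could not be recovered uniquely from the response.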

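Consider a graph $G = (\mathcal{V}, \mathbf{A})$, for which the characteristic and minimal polynomials of the adjacency matrix coincide: $p_{\mathbf{A}}(x) = m_{\mathbf{A}}(x)$. The mapping $\mathbf{A} \mapsto x$ of the adjacency matrix $\mathbf{A}$ to the indeterminate $x$ maps the graph filters $\mathbf{H} = h(\mathbf{A})$ in $\mathcal{F}$ to polynomials $h(x)$. By Theorem 3, the filter space $\mathcal{F}$ in (9) becomes a polynomial algebra [43]

$$ \mathcal{A} = \mathbb{C}[x] / m_{\mathbf{A}}(x). \tag{11} $$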

This is a space of polynomials of degree less than $\deg m_{\mathbf{A}}(x)$ with complex coefficients that is closed under addition and multiplication of polynomials modulo $m_{\mathbf{A}}(x)$. The mapping $\mathcal{F} \to \mathcal{A}$, $h(\mathbf{A}) \mapsto h(x)$, is an isomorphism of $\mathbb{C}$-algebras [43], which we denote as $\mathcal{F} \cong \mathcal{A}$. We call it the graph $z$-transform of filters on graph $G = (\mathcal{V}, \mathbf{A})$.

The signal space $\mathcal{S}$ is a vector space that is also closed under filtering, i.e., under multiplication by graph filters from $\mathcal{F}$: for any signal $\mathbf{s} \in \mathcal{S}$ and filter $h(\mathbf{A})$, the output is a signal in the same space: $h(\mathbf{A})\mathbf{s} \in \mathcal{S}$. Thus, $\mathcal{S}$ is an $\mathcal{F}$-module [43]. As we show next, the graph $z$-transform of signals is defined as an isomorphism (13) from $\mathcal{S}$ to an $\mathcal{A}$-module.
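The isomorphism $\mathcal{F} \cong \mathcal{A}$ says that multiplying two filter matrices corresponds to multiplying their polynomials modulo $m_{\mathbf{A}}(x)$. The sketch below checks this on a hypothetical weighted directed 3-cycle, for which $p_{\mathbf{A}}(x) = m_{\mathbf{A}}(x) = x^3 - 6$; the helper `multiply_mod` and all numeric values are assumptions chosen for illustration.

```python
import numpy as np

def graph_filter(a: np.ndarray, taps) -> np.ndarray:
    """Evaluate h(A) = sum_l taps[l] * A^l with Horner's rule."""
    n = a.shape[0]
    h = np.zeros((n, n), dtype=complex)
    for tap in taps[::-1]:
        h = h @ a + tap * np.eye(n)
    return h

def multiply_mod(h, g, modulus):
    """Taps of h(x) g(x) mod modulus(x); taps are lowest degree first,
    `modulus` is highest degree first (the NumPy convention)."""
    product = np.polymul(h[::-1], g[::-1])
    _, remainder = np.polydiv(product, modulus)
    return remainder[::-1]

A = np.array([[0.0, 0.0, 2.0],
              [1.0, 0.0, 0.0],
              [0.0, 3.0, 0.0]])
h = np.array([1.0, 2.0, 3.0])            # h(x) = 1 + 2x + 3x^2
g = np.array([0.0, 1.0, 1.0])            # g(x) = x + x^2

taps = multiply_mod(h, g, np.poly(A))    # np.poly(A) is x^3 - 6 up to rounding
matrix_product = graph_filter(A, h) @ graph_filter(A, g)
```

By hand, $h(x)g(x) = x + 3x^2 + 5x^3 + 3x^4$, and substituting $x^3 = 6$ gives $30 + 19x + 3x^2$, which matches the computed taps.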

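Theorem 6: Under the above conditions, the signal space $\mathcal{S}$ is isomorphic to an $\mathcal{A}$-module

$$ \mathcal{M} = \mathbb{C}[x] / p_{\mathbf{A}}(x) = \left\{ s(x) = \sum_{n=0}^{N-1} s_n b_n(x) \right\} \tag{12} $$

under the mapping

$$ \mathbf{s} = (s_0, \ldots, s_{N-1})^T \mapsto s(x) = \sum_{n=0}^{N-1} s_n b_n(x). \tag{13} $$

The polynomials $b_0(x), \ldots, b_{N-1}(x)$ are linearly independent polynomials of degree at most $N - 1$. If we write

$$ \mathbf{b}(x) = (b_0(x), \ldots, b_{N-1}(x))^T, \tag{14} $$

then the polynomials satisfy

$$ \mathbf{b}^{(r)}(\lambda_m) = \left( b_0^{(r)}(\lambda_m) \; \ldots \; b_{N-1}^{(r)}(\lambda_m) \right)^T = r!\, \mathbf{v}_{m,0,r} \tag{15} $$

for $0 \le r < R_{m,0}$ and $0 \le m < M$, where $\lambda_m$ and $\mathbf{v}_{m,0,r}$ are the eigenvalues and generalized eigenvectors of $\mathbf{A}^T$, and $b_n^{(r)}(\lambda_m)$ denotes the $r$th derivative of $b_n(x)$ evaluated at $x = \lambda_m$. Filtering in $\mathcal{M}$ is performed as multiplication modulo $p_{\mathbf{A}}(x)$: if $\tilde{\mathbf{s}} = h(\mathbf{A})\mathbf{s}$, then

$$ \tilde{\mathbf{s}} \mapsto \tilde{s}(x) = \sum_{n=0}^{N-1} \tilde{s}_n b_n(x) = h(x)\, s(x) \bmod p_{\mathbf{A}}(x). \tag{16} $$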

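Proof: Due to the linearity and shift-invariance of graph filters, we only need to prove (16) for $h(\mathbf{A}) = \mathbf{A}$. Let us write $s(x) = \mathbf{b}(x)^T \mathbf{s}$ and $\tilde{s}(x) = \mathbf{b}(x)^T \tilde{\mathbf{s}}$, where $\mathbf{b}(x)$ is given by (14). Since (16) must hold for all $\mathbf{s} \in \mathcal{S}$, for $h(\mathbf{A}) = \mathbf{A}$ it is equivalent to

$$ \mathbf{b}(x)^T \tilde{\mathbf{s}} = \mathbf{b}(x)^T (\mathbf{A}\mathbf{s}) = \mathbf{b}(x)^T (x\mathbf{s}) \bmod p_{\mathbf{A}}(x) \;\Leftrightarrow\; \left( \mathbf{A}^T - x\mathbf{I} \right) \mathbf{b}(x) = \mathbf{c}\, p_{\mathbf{A}}(x), \tag{17} $$

where $\mathbf{c} \in \mathbb{C}^N$ is a vector of constants, since $\deg p_{\mathbf{A}}(x) = N$ and $\deg(x b_n(x)) \le N$ for $0 \le n < N$.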

It follows from the factorization (43) of $p_{\mathbf{A}}(x)$ that, for each eigenvalue $\lambda_m$ and $0 \le k < A_m$, the characteristic polynomial satisfies $p_{\mathbf{A}}^{(k)}(\lambda_m) = 0$. By taking derivatives of both sides of (17) and evaluating at $x = \lambda_m$, $0 \le m < M$, we construct $A_0 + \ldots + A_{M-1} = N$ linear equations

$$ \left( \mathbf{A}^T - \lambda_m \mathbf{I} \right) \mathbf{b}(\lambda_m) = \mathbf{0}, $$

$$ \left( \mathbf{A}^T - \lambda_m \mathbf{I} \right) \mathbf{b}^{(r)}(\lambda_m) = r\, \mathbf{b}^{(r-1)}(\lambda_m), \quad 1 \le r < A_m. $$

Comparing these equations with (35), we obtain (15). Since $N$ polynomials $b_n(x) = b_{n,0} + \ldots + b_{n,N-1} x^{N-1}$ are characterized by $N^2$ coefficients $b_{n,k}$, $0 \le n, k < N$, (15) is a system of $N$ linear equations with $N^2$ unknowns that can always be solved using inverse polynomial interpolation [58].
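As a sanity check on Theorem 6, consider the time-series graph of Fig. 1(a) with adjacency matrix $\mathbf{C}_N$ from (2). The worked example below is an illustration, not part of the original proof; it assumes the natural basis choice $b_n(x) = x^n$ and verifies (17) and (16) directly.

```latex
% Time model: A = C_N, so p_A(x) = m_A(x) = x^N - 1. Assumed basis: b_n(x) = x^n.
\begin{align*}
\left(\mathbf{C}_N^T \mathbf{b}(x)\right)_n
  &= b_{(n+1) \bmod N}(x) = x^{(n+1) \bmod N},\\
\left(\left(\mathbf{C}_N^T - x\mathbf{I}\right)\mathbf{b}(x)\right)_n
  &= \begin{cases}
       0, & 0 \le n < N-1,\\
       1 - x^N = -p_{\mathbf{A}}(x), & n = N-1,
     \end{cases}
\end{align*}
% so (17) holds with c = (0, ..., 0, -1)^T. For the shift h(A) = C_N itself:
\begin{align*}
\tilde{s}(x) = \sum_{n=0}^{N-1} s_{(n-1) \bmod N}\, x^n
             = x\, s(x) \bmod \left(x^N - 1\right).
\end{align*}
```

In this special case the graph $z$-transform of a signal reduces to the familiar finite $z$-transform with $x$ in place of $z^{-1}$, and shifting corresponds to multiplication by $x$ with wrap-around.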

Theorem 6 extends to the general case $p_{\mathbf{A}}(x) \neq m_{\mathbf{A}}(x)$. By Theorem 2, there exists a graph $\tilde{G} = (\mathcal{V}, \tilde{\mathbf{A}})$ with $p_{\tilde{\mathbf{A}}}(x) = m_{\tilde{\mathbf{A}}}(x)$, such that $\mathbf{A} = r(\tilde{\mathbf{A}})$. By mapping $\tilde{\mathbf{A}}$ to $x$, the filter space (9) has the structure of the polynomial algebra $\mathcal{A} = \mathbb{C}[x] / m_{\mathbf{A}}(r(x)) = \mathbb{C}[x] / (m_{\mathbf{A}} \circ r)(x)$ and the signal space has the structure of the $\mathcal{A}$-module $\mathcal{M} = \mathbb{C}[x] / p_{\tilde{\mathbf{A}}}(x)$. Multiplication of filters and signals is performed modulo $p_{\tilde{\mathbf{A}}}(x)$. The basis of $\mathcal{M}$ satisfies (15), where $\lambda_m$ and $\mathbf{v}_{m,d,r}$ are eigenvalues and generalized eigenvectors of $\tilde{\mathbf{A}}$.

V. FOURIER TRANSFORM ON GRAPHS

After establishing the structure of filter and signal spaces in DSPG, we define other fundamental DSP concepts, including spectral decomposition, signal spectrum, Fourier transform, and frequency response. They are related to the Jordan normal form of the adjacency matrix $\mathbf{A}$, reviewed in Appendix A.

Spectral Decomposition

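In DSP, spectral decomposition refers to the identification of subspaces $\mathcal{S}_0, \ldots, \mathcal{S}_{K-1}$ of the signal space $\mathcal{S}$ that are invariant to filtering, so that, for any signal $\mathbf{s}_k \in \mathcal{S}_k$ and filter $h(\mathbf{A}) \in \mathcal{F}$, the output $\tilde{\mathbf{s}}_k = h(\mathbf{A})\mathbf{s}_k$ lies in the same subspace $\mathcal{S}_k$. A signal $\mathbf{s} \in \mathcal{S}$ can then be represented as

$$ \mathbf{s} = \mathbf{s}_0 + \mathbf{s}_1 + \ldots + \mathbf{s}_{K-1}, \tag{18} $$

with each component $\mathbf{s}_k \in \mathcal{S}_k$. Decomposition (18) is uniquely determined for every signal $\mathbf{s} \in \mathcal{S}$ if and only if: 1) invariant subspaces $\mathcal{S}_k$ have zero intersection, i.e., $\mathcal{S}_k \cap \mathcal{S}_m = \{0\}$ for $k \neq m$; 2) $\dim \mathcal{S}_0 + \ldots + \dim \mathcal{S}_{K-1} = \dim \mathcal{S} = N$; and 3) each $\mathcal{S}_k$ is irreducible, i.e., it cannot be decomposed into smaller invariant subspaces. In this case, $\mathcal{S}$ is written as a direct sum of vector subspaces

$$ \mathcal{S} = \mathcal{S}_0 \oplus \mathcal{S}_1 \oplus \ldots \oplus \mathcal{S}_{K-1}. \tag{19} $$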

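As mentioned, since the graph may have arbitrary structure, the adjacency matrix $\mathbf{A}$ may not be diagonalizable; in fact, $\mathbf{A}$ for the blog dataset (see Section VI) is not diagonalizable. Hence, we consider the Jordan decomposition (39) $\mathbf{A} = \mathbf{V}\mathbf{J}\mathbf{V}^{-1}$, which is reviewed in Appendix A. Here, $\mathbf{J}$ is the Jordan normal form (40), and $\mathbf{V}$ is the matrix of generalized eigenvectors (38). Let $\mathcal{S}_{m,d} = \operatorname{span}\{\mathbf{v}_{m,d,0}, \ldots, \mathbf{v}_{m,d,R_{m,d}-1}\}$ be a vector subspace of $\mathcal{S}$ spanned by the $d$th Jordan chain of $\lambda_m$. Any signal $\mathbf{s}_{m,d} \in \mathcal{S}_{m,d}$ has a unique expansion

$$ \mathbf{s}_{m,d} = s_{m,d,0}\,\mathbf{v}_{m,d,0} + \ldots + s_{m,d,R_{m,d}-1}\,\mathbf{v}_{m,d,R_{m,d}-1} = \mathbf{V}_{m,d} \left( s_{m,d,0} \; \ldots \; s_{m,d,R_{m,d}-1} \right)^T, $$

where $\mathbf{V}_{m,d}$ is the block of generalized eigenvectors (37). As follows from the Jordan decomposition (39), shifting the signal $\mathbf{s}_{m,d}$ produces the output $\tilde{\mathbf{s}}_{m,d} \in \mathcal{S}_{m,d}$ from the same subspace, since

$$
\begin{aligned}
\tilde{\mathbf{s}}_{m,d} = \mathbf{A}\mathbf{s}_{m,d}
&= \mathbf{A}\mathbf{V}_{m,d} \left( s_{m,d,0} \; \ldots \; s_{m,d,R_{m,d}-1} \right)^T \\
&= \mathbf{V}_{m,d}\, \mathbf{J}_{R_{m,d}}(\lambda_m) \left( s_{m,d,0} \; \ldots \; s_{m,d,R_{m,d}-1} \right)^T \\
&= \mathbf{V}_{m,d}
\begin{pmatrix}
\lambda_m s_{m,d,0} + s_{m,d,1} \\
\vdots \\
\lambda_m s_{m,d,R_{m,d}-2} + s_{m,d,R_{m,d}-1} \\
\lambda_m s_{m,d,R_{m,d}-1}
\end{pmatrix}.
\end{aligned}
\tag{20}
$$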

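Hence, each subspace $\mathcal{S}_{m,d} \le \mathcal{S}$ is invariant to shifting. Using (39) and Theorem 1, we write the graph filter (7) as

$$
h(\mathbf{A}) = \sum_{\ell=0}^{L} h_\ell \left( \mathbf{V}\mathbf{J}\mathbf{V}^{-1} \right)^\ell
= \sum_{\ell=0}^{L} h_\ell \mathbf{V}\mathbf{J}^\ell \mathbf{V}^{-1}
= \mathbf{V} \left( \sum_{\ell=0}^{L} h_\ell \mathbf{J}^\ell \right) \mathbf{V}^{-1}
= \mathbf{V}\, h(\mathbf{J})\, \mathbf{V}^{-1}.
\tag{21}
$$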

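Similarly to (20), we observe that filtering a signal $\mathbf{s}_{m,d} \in \mathcal{S}_{m,d}$ produces an output $\tilde{\mathbf{s}}_{m,d} \in \mathcal{S}_{m,d}$ from the same subspace:

$$
\tilde{\mathbf{s}}_{m,d} = h(\mathbf{A})\mathbf{s}_{m,d}
= h(\mathbf{A})\mathbf{V}_{m,d}
\begin{pmatrix} s_{m,d,0} \\ \vdots \\ s_{m,d,R_{m,d}-1} \end{pmatrix}
= \mathbf{V}_{m,d}\, h\!\left( \mathbf{J}_{R_{m,d}}(\lambda_m) \right)
\begin{pmatrix} s_{m,d,0} \\ \vdots \\ s_{m,d,R_{m,d}-1} \end{pmatrix}.
\tag{22}
$$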

Since all $N$ generalized eigenvectors of $\mathbf{A}$ are linearly independent, all subspaces $\mathcal{S}_{m,d}$ have zero intersections, and their dimensions add to $N$. Thus, the spectral decomposition (19) of the signal space $\mathcal{S}$ is

$$ \mathcal{S} = \bigoplus_{m=0}^{M-1} \bigoplus_{d=0}^{D_m-1} \mathcal{S}_{m,d}. \tag{23} $$

Graph Fourier Transform

The spectral decomposition (23) expands each signal $\mathbf{s} \in \mathcal{S}$ on the basis of the invariant subspaces of $\mathcal{S}$. Since we chose the generalized eigenvectors as bases of the subspaces $\mathcal{S}_{m,d}$, the expansion is given by

$$ \mathbf{s} = \mathbf{V}\hat{\mathbf{s}}, \tag{24} $$

where $\mathbf{V}$ is the generalized eigenvector matrix (38). The vector of expansion coefficients is given by

$$ \hat{\mathbf{s}} = \mathbf{V}^{-1}\mathbf{s}. \tag{25} $$

The union of the bases of all spectral components $\mathcal{S}_{m,d}$, i.e., the basis of generalized eigenvectors, is called the graph Fourier basis. We call the expansion (25) of a signal $\mathbf{s}$ into the graph Fourier basis the graph Fourier transform and denote the graph Fourier transform matrix as

$$ \mathbf{F} = \mathbf{V}^{-1}. \tag{26} $$

Following the conventions of classical DSP, we call the coefficients $\hat{s}_n$ in (25) the spectrum of a signal $\mathbf{s}$. The inverse graph Fourier transform is given by (24); it reconstructs the signal from its spectrum.
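For diagonalizable adjacency matrices the graph Fourier transform (26) can be computed with a standard eigendecomposition. The sketch below is a minimal illustration under that assumption: `np.linalg.eig` does not compute Jordan chains, so it does not cover the non-diagonalizable case treated in the text. The function name and the example graphs are hypothetical.

```python
import numpy as np

def graph_fourier_matrix(a: np.ndarray):
    """Return (eigenvalues, F) with F = V^{-1}, assuming `a` is diagonalizable."""
    eigenvalues, v = np.linalg.eig(a)
    return eigenvalues, np.linalg.inv(v)

# Hypothetical weighted directed 3-cycle (three distinct eigenvalues).
A = np.array([[0.0, 0.0, 2.0],
              [1.0, 0.0, 0.0],
              [0.0, 3.0, 0.0]])
s = np.array([1.0, -2.0, 0.5])

lam, F = graph_fourier_matrix(A)
s_hat = F @ s                            # spectrum, as in (25)
s_rec = np.linalg.inv(F) @ s_hat         # inverse transform, as in (24)

# Undirected 3-node path: symmetric A, so eigh yields an orthogonal basis.
P = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
_, V_sym = np.linalg.eigh(P)
F_sym = V_sym.T                          # F = V^{-1} = V^T for real symmetric A
```

The spectrum of a real signal on a directed graph is generally complex, since the eigenvectors of a non-symmetric matrix are complex. The ordering and scaling of the eigenvectors returned by the library are a numerical convention, not something fixed by the theory.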

Frequency Response of Graph Filters

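The frequency response of a filter characterizes its effect on the frequency content of the input signal. Let us rewrite the filtering of $\mathbf{s}$ by $h(\mathbf{A})$ using (21) and (24) as

$$ \tilde{\mathbf{s}} = h(\mathbf{A})\mathbf{s} = \mathbf{F}^{-1} h(\mathbf{J})\, \mathbf{F}\mathbf{s} = \mathbf{F}^{-1} h(\mathbf{J})\, \hat{\mathbf{s}} \;\Rightarrow\; \mathbf{F}\tilde{\mathbf{s}} = h(\mathbf{J})\, \hat{\mathbf{s}}. \tag{27} $$

Hence, the spectrum of the output signal is the spectrum of the input signal modified by the block-diagonal matrix

$$
h(\mathbf{J}) =
\begin{pmatrix}
h(\mathbf{J}_{R_{0,0}}(\lambda_0)) & & \\
& \ddots & \\
& & h(\mathbf{J}_{R_{M-1,D_{M-1}-1}}(\lambda_{M-1}))
\end{pmatrix},
\tag{28}
$$

so that the part of the spectrum corresponding to the invariant subspace $\mathcal{S}_{m,d}$ is multiplied by $h(\mathbf{J}_{R_{m,d}}(\lambda_m))$. Hence, $h(\mathbf{J})$ in (28) represents the frequency response of the filter $h(\mathbf{A})$.

Notice that (27) also generalizes the convolution theorem from classical DSP [56] to arbitrary graphs.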

Theorem 7: Filtering a signal is equivalent, in the frequency domain, to multiplying its spectrum by the frequency response of the filter.
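For a diagonalizable shift, Theorem 7 says that the spectrum of the output is the elementwise product of the input spectrum with the values $h(\lambda_m)$. The following sketch checks this numerically; the random graph (assumed diagonalizable, as a generic matrix is), the taps, and the variable names are illustrative assumptions.

```python
import numpy as np

def graph_filter(a: np.ndarray, taps) -> np.ndarray:
    """Evaluate h(A) = sum_l taps[l] * A^l with Horner's rule."""
    n = a.shape[0]
    h = np.zeros((n, n), dtype=complex)
    for tap in taps[::-1]:
        h = h @ a + tap * np.eye(n)
    return h

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))          # generic directed graph (assumed diagonalizable)
taps = np.array([1.0, 0.5, -0.25])
s = rng.standard_normal(4)

lam, V = np.linalg.eig(A)
F = np.linalg.inv(V)                     # graph Fourier transform matrix (26)

frequency_response = np.polyval(taps[::-1], lam)   # h(lambda_m) for each eigenvalue
output_spectrum = F @ (graph_filter(A, taps) @ s)  # filter in the vertex domain, then transform
product_spectrum = frequency_response * (F @ s)    # transform, then multiply spectra
```

The two spectra agree up to rounding. In the non-diagonalizable case the multiplication is by the blocks of (28) rather than by scalars, which this sketch does not cover.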

Discussion

The connection (25) between the graph Fourier transform and the Jordan decomposition (39) highlights some desirable properties of representation graphs. For graphs with diagonalizable adjacency matrices $\mathbf{A}$, which have $N$ linearly independent eigenvectors, the frequency response (28) of filters $h(\mathbf{A})$ reduces to a diagonal matrix with the main diagonal containing values $h(\lambda_m)$, where $\lambda_m$ are the eigenvalues of $\mathbf{A}$. Moreover, for these graphs, Theorem 6 provides the closed-form expression (15) for the inverse graph Fourier transform $\mathbf{F}^{-1} = \mathbf{V}$. Graphs with symmetric (or Hermitian) matrices, such as undirected graphs, are always diagonalizable and, moreover, have orthogonal graph Fourier transforms: $\mathbf{F}^{-1} = \mathbf{F}^H$. This property has significant practical importance, since it yields a closed-form expression (15) for $\mathbf{F}$ and $\mathbf{F}^{-1}$. Moreover, orthogonal transforms are well-suited for efficient signal representation, as we demonstrate in Section VI.

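DSPG is consistent with the classical DSP theory. As mentioned in Section II, finite discrete periodic time series are represented by the directed graph in Fig. 1(a). The corresponding adjacency matrix is the $N \times N$ circulant matrix (2). Its eigendecomposition (and hence, Jordan decomposition) is

$$\mathbf{C}_N = \frac{1}{N}\,\mathrm{DFT}_N^{H} \begin{pmatrix} e^{-j\frac{2\pi\cdot 0}{N}} & & \\ & \ddots & \\ & & e^{-j\frac{2\pi\cdot(N-1)}{N}} \end{pmatrix} \mathrm{DFT}_N,$$

where $\mathrm{DFT}_N$ is the discrete Fourier transform matrix, for which $\frac{1}{N}\mathrm{DFT}_N^{H} = \mathrm{DFT}_N^{-1}$. Thus, as expected, the graph Fourier transform is $\mathbf{F} = \mathrm{DFT}_N$. Furthermore, for a general filter $h(\mathbf{C}_N) = \sum_{\ell=0}^{N-1} h_\ell \mathbf{C}_N^{\ell}$, coefficients of the output $\tilde{\mathbf{s}} = h(\mathbf{C}_N)\mathbf{s}$ are calculated as

$$\tilde{s}_n = h_n s_0 + \ldots + h_0 s_n + h_{N-1} s_{n+1} + \ldots + h_{n+1} s_{N-1} = \sum_{k=0}^{N-1} s_k h_{(n-k \bmod N)}.$$

This is the standard circular convolution. Theorem 5 holds as well, with impulse response identical to filter taps: $\mathbf{u} = \mathbf{h}$.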
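The time-series special case can be checked numerically. The sketch below builds the cyclic shift matrix, applies a polynomial filter in it, and compares the result with a direct circular convolution and with the DFT-domain product. It assumes the unnormalized DFT matrix, as produced by `np.fft.fft`, for which the inverse equals the conjugate transpose divided by $N$; the size `N = 6` and the random taps and signal are illustrative.

```python
import numpy as np

N = 6
# Cyclic shift matrix: (C s)_n = s_{(n-1) mod N}.
C = np.roll(np.eye(N), 1, axis=0)

rng = np.random.default_rng(0)
h = rng.standard_normal(N)   # filter taps h_0, ..., h_{N-1}
s = rng.standard_normal(N)   # input signal

# Filter h(C) = sum_l h_l C^l applied to s.
hC = sum(h[l] * np.linalg.matrix_power(C, l) for l in range(N))
out = hC @ s

# Direct circular convolution: out_n = sum_k s_k h_{(n-k) mod N}.
circ = np.array([sum(s[k] * h[(n - k) % N] for k in range(N)) for n in range(N)])

# The DFT diagonalizes C with eigenvalues exp(-2*pi*j*k/N).
DFT = np.fft.fft(np.eye(N))
eigs = np.exp(-2j * np.pi * np.arange(N) / N)
C_rebuilt = DFT.conj().T @ np.diag(eigs) @ DFT / N

# Impulse response: filtering a unit impulse at node 0 returns the taps.
delta = np.zeros(N)
delta[0] = 1.0
impulse_response = hC @ delta

print(np.allclose(out, circ), np.allclose(C_rebuilt, C))
```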

Similarly, it has been shown in [45], [43] that unweighted line graphs similar to Fig. 1(a), but with undirected edges and different, non-periodic boundary conditions, give rise to all 16 types of discrete cosine and sine transforms as their graph Fourier transform matrices. Combined with [59], it can be shown that graph Fourier transforms for images on the lattice in Fig. 1(b) are different types of two-dimensional discrete cosine and sine transforms, depending on boundary conditions. This result serves as additional motivation for the use of these transforms in image representation and coding [60].

In discrete-time DSP, the concepts of filtering, spectrum, and Fourier transform have natural, physical interpretations. In DSPG, when instantiated for various datasets, the interpretation of these concepts may be drastically different and not immediately obvious. For example, if a representation graph reflects the proximity of sensors in some metric (such as time, space, or geodesic distance), and the dataset contains sensor measurements, then filtering corresponds to linear recombination of related measurements and can be viewed as a graph form of regression analysis with constant coefficients. The graph Fourier transform then decomposes signals over equilibrium points of this regression. On the other hand, if a graph represents a social network of individuals and their communication patterns, and the signal is a social characteristic, such as an opinion or a preference, then filtering can be viewed as diffusion of information along established communication channels, and the graph Fourier transform characterizes signals in terms of stable, unchangeable opinions or preferences.

VI. APPLICATIONS

We consider several applications of DSPG to data processing. These examples illustrate the effectiveness of the framework in standard DSP tasks, such as predictive filtering and efficient data representation, as well as demonstrate that the framework can tackle problems less common in DSP, such as data classification and customer behavior prediction.

Linear Prediction

Linear prediction (LP) is an efficient technique for representation, transmission, and generation of time series [61]. It is used in many applications, including power spectral estimation and direction of arrival analysis. Two main steps of LP are the construction of a prediction filter and the generation of an (approximated) signal, implemented, respectively, with forward and backward filters, shown in Fig. 2. The forward (prediction) filter converts the signal into a residual, which is then closely approximated, for example, by a white noise (flat power spectrum) signal or by efficient quantization with few bits. The backward (synthesis) filter constructs an approximation of the original signal from the approximated residual.

Fig. 2. Components of linear prediction. (a) Forward (prediction) filter: the signal $\mathbf{s}$ is passed through $\mathbf{I} - h(\mathbf{A})$ to produce the residual $\mathbf{r}$. (b) Backward (synthesis) filter: the residual $\mathbf{r}$ is passed through $(\mathbf{I} - h(\mathbf{A}))^{-1}$ to produce the signal $\mathbf{s}$.

Using DSPG, we can extend LP to graph signals. We illustrate it with the dataset [62] of daily temperature measurements from sensors located near 150 major US cities. Data from each sensor is a separate time series, but encoding it requires buffering measurements from multiple days before they can be encoded for storage or transmission. Instead, we build an LP filter on a graph to encode daily snapshots of all 150 sensor measurements.

We construct a representation graph $G = (\mathcal{V}, \mathbf{A})$ for the sensor measurements using geographical distances between sensors. Each sensor corresponds to a node $v_n$, $0 \le n < 150$, and is connected to $K$ nearest sensors with undirected edges weighted by the normalized inverse exponents of the squared distances: if $d_{nm}$ denotes the distance between the $n$th and $m$th sensors⁵ and $m \in \mathcal{N}_n$, then

$$\mathbf{A}_{n,m} = \frac{e^{-d_{nm}^2}}{\sqrt{\sum_{k\in\mathcal{N}_n} e^{-d_{nk}^2} \sum_{\ell\in\mathcal{N}_m} e^{-d_{m\ell}^2}}}. \tag{29}$$
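A minimal sketch of this graph construction is given below. It makes two assumptions that the text does not spell out: the neighbor relation is symmetrized by union (an edge exists if either node is among the other's $K$ nearest), and the coordinates are scaled so that $e^{-d^2}$ does not underflow. The function name `knn_similarity_graph` and the random coordinates are hypothetical.

```python
import numpy as np

def knn_similarity_graph(coords, K):
    """Symmetric K-nearest-neighbor graph with weights in the style of (29)."""
    N = coords.shape[0]
    diff = coords[:, None, :] - coords[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)            # squared pairwise distances
    np.fill_diagonal(d2, np.inf)               # exclude self-loops from the neighbor search
    nearest = np.argsort(d2, axis=1)[:, :K]    # K nearest nodes for every node
    mask = np.zeros((N, N), dtype=bool)
    mask[np.repeat(np.arange(N), K), nearest.ravel()] = True
    mask |= mask.T                             # assumption: union makes the edges undirected
    W = np.where(mask, np.exp(-d2), 0.0)       # e^{-d_nm^2} on edges, 0 elsewhere
    row = W.sum(axis=1)                        # sum over k in N_n of e^{-d_nk^2}
    return W / np.sqrt(np.outer(row, row))

# Hypothetical sensor locations in the unit square (not the dataset of the paper).
rng = np.random.default_rng(0)
coords = rng.random((150, 2))
A = knn_similarity_graph(coords, K=8)
print(A.shape, np.allclose(A, A.T))
```

With the union rule some nodes end up with more than $K$ neighbors; an intersection rule would be the other common choice and would yield sparser graphs.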

For each snapshot $\mathbf{s}$ of $N = 150$ measurements, we construct a prediction filter $h(\mathbf{A})$ with $L$ taps by minimizing the energy of the residual $\mathbf{r} = \mathbf{s} - h(\mathbf{A})\mathbf{s} = (\mathbf{I}_N - h(\mathbf{A}))\mathbf{s}$. We set $h_0 = 0$ to avoid the trivial solution $h(\mathbf{A}) = \mathbf{I}$, and obtain

$$\begin{pmatrix} h_1 & \ldots & h_{L-1} \end{pmatrix}^{T} = (\mathbf{B}^{T}\mathbf{B})^{-1}\mathbf{B}^{T}\mathbf{s}.$$

Here, $\mathbf{B} = \begin{pmatrix} \mathbf{A}\mathbf{s} & \ldots & \mathbf{A}^{L-1}\mathbf{s} \end{pmatrix}$ is an $N \times (L-1)$ matrix. The residual energy $\|\mathbf{r}\|_2^2$ is relatively small compared to the energy of the signal $\mathbf{s}$, since shifted signals are close approximations of $\mathbf{s}$, as illustrated in Fig. 3. This phenomenon provides the intuition for the graph shift: if the graph represents a similarity relation, as in this example, then the shift replaces each signal sample with a sum of related samples, with more similar samples weighted heavier than less similar ones.
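The least-squares fit of the prediction filter can be sketched as follows. The ring graph and the noisy cosine signal are stand-ins chosen for this example, not the temperature data, and `np.linalg.lstsq` is used in place of the explicit normal equations because it yields the same minimizer when $\mathbf{B}$ has full column rank and remains stable otherwise.

```python
import numpy as np

def fit_prediction_filter(A, s, L):
    """Least-squares taps h_1..h_{L-1} (with h_0 = 0) minimizing ||s - h(A) s||_2."""
    cols, v = [], s
    for _ in range(L - 1):
        v = A @ v
        cols.append(v)
    B = np.column_stack(cols)                      # B = (A s ... A^{L-1} s), N x (L-1)
    taps, *_ = np.linalg.lstsq(B, s, rcond=None)   # least-squares solution of B h = s
    residual = s - B @ taps
    return taps, residual

# Illustrative setup: averaging shift on a ring and a smooth, slightly noisy signal.
N = 20
C = np.roll(np.eye(N), 1, axis=0)
A = 0.5 * (C + C.T)
rng = np.random.default_rng(0)
s = np.cos(2 * np.pi * np.arange(N) / N) + 0.05 * rng.standard_normal(N)

taps, residual = fit_prediction_filter(A, s, L=3)
print(taps, np.sum(residual ** 2) / np.sum(s ** 2))
```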

The residual $\mathbf{r}$ is then quantized using $B$ bits, and the quantized residual $\hat{\mathbf{r}}$ is processed with the inverse filter to synthesize an approximated signal $\tilde{\mathbf{s}} = (\mathbf{I}_N - h(\mathbf{A}))^{-1}\hat{\mathbf{r}}$.

We considered graphs with $1 \le K \le 15$ nearest neighbors, and for each $K$ constructed optimal prediction filters with $2 \le L \le 10$ taps. As shown in Fig. 4, the lowest and highest errors $\|\mathbf{s} - \tilde{\mathbf{s}}\|_2/\|\mathbf{s}\|_2$ were obtained for $K = 11$ and $L = 3$, and for $K = 8$ and $L = 9$, respectively. During the experiments, we observed that graphs with few neighbors (approximately, $3 \le K \le 7$) lead to lower errors when prediction filters have impulse responses of medium length ($4 \le L \le 6$), while graphs with $7 \le K \le 11$ neighbors yield lower errors for $3 \le L \le 5$. Using larger values of $K$ and $L$ leads to large errors. This tendency may be due to overfitting filters to signals, and demonstrates that there exists a trade-off between graph and filter parameters.

⁵The construction of representation graphs for datasets is an important research problem and deserves a separate discussion that is beyond the scope of this paper. The procedure we use here is a popular choice for construction of similarity graphs based on distances between nodes [21], [30], [35].

Fig. 3. A signal representing a snapshot of temperature measurements from $N = 150$ sensors. Shifting the signal produces signals similar to the original. [Plot: temperature (degrees Celsius) versus sensor index for the signal, the shifted signal, and the twice shifted signal.]

Fig. 4. Average approximation errors $\|\mathbf{s} - \tilde{\mathbf{s}}\|_2/\|\mathbf{s}\|_2$ for LP coding of 365 signals $\mathbf{s}$ representing daily temperature snapshots. Graphs with $1 \le K \le 15$ nearest neighbors for each sensor were analyzed, and filters with $2 \le L \le 10$ taps were constructed. The residual was quantized using $1 \le B \le 16$ bits. The lowest, second lowest, and highest errors were obtained, respectively, for $K = 11$ and $L = 3$, $K = 10$ and $L = 3$, and $K = 8$ and $L = 9$. [Plot: error (%) versus bits used for quantization, with one curve for each of these three settings.]
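The quantize-and-synthesize step of LP coding can be sketched as below. The text does not specify the quantizer, so a uniform quantizer over the range of the residual is assumed here. The ring graph, the signal, and the single fixed tap `0.5` are illustrative; the tap is deliberately not the least-squares optimum, so that $\mathbf{I}_N - h(\mathbf{A})$ stays well conditioned in this toy setting.

```python
import numpy as np

def uniform_quantize(r, bits):
    """Assumed quantizer: 2**bits uniform levels spanning [min(r), max(r)]."""
    lo, hi = r.min(), r.max()
    if hi == lo:
        return r.copy()
    step = (hi - lo) / (2 ** bits - 1)
    return lo + step * np.round((r - lo) / step)

def lp_code(A, s, taps, bits):
    """Forward filter, quantize the residual, synthesize with the inverse filter."""
    N = len(s)
    # taps holds h_1, h_2, ...; h_0 = 0, so the k-th entry multiplies A^(k+1).
    hA = sum(h * np.linalg.matrix_power(A, k + 1) for k, h in enumerate(taps))
    forward = np.eye(N) - hA
    r = forward @ s
    r_q = uniform_quantize(r, bits)
    s_approx = np.linalg.solve(forward, r_q)   # (I - h(A))^{-1} applied to the quantized residual
    return s_approx, np.linalg.norm(s - s_approx) / np.linalg.norm(s)

# Illustrative setup (not the temperature dataset).
N = 20
C = np.roll(np.eye(N), 1, axis=0)
A = 0.5 * (C + C.T)
n = np.arange(N)
s = np.cos(2 * np.pi * n / N) + 0.3 * np.sin(6 * np.pi * n / N)

_, err_2bit = lp_code(A, s, taps=[0.5], bits=2)
_, err_16bit = lp_code(A, s, taps=[0.5], bits=16)
print(err_2bit, err_16bit)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse filter matrix explicitly.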

Signal Compression

Efficient signal representation is required in multiple DSP areas, such as storage, compression, and transmission. Some widely-used techniques are based on expanding signals into orthonormal bases with the expectation that most information is captured with few basis functions. The expansion coefficients are calculated using an orthogonal transform. If the transform represents a Fourier transform in some model, it means that signals are sparse in the frequency domain in this model, i.e., they contain only few frequencies. Some widely-used image compression standards, such as JPEG and JPEG 2000, use orthogonal expansions implemented, respectively, by discrete cosine and wavelet transforms [60].

As discussed in the previous example, given a signal $\mathbf{s}$ on a graph $G = (\mathcal{V}, \mathbf{A})$, where $\mathbf{A}$ reflects similarities between data elements, the shifted signal $\mathbf{A}\mathbf{s}$ can be a close approximation of $\mathbf{s}$, up to a scalar factor: $\mathbf{A}\mathbf{s} \approx \rho\,\mathbf{s}$. This is illustrated in Fig. 3, where $\rho \approx 1$. Hence, $\mathbf{s}$ can be effectively expressed as a linear combination of a few [generalized] eigenvectors of $\mathbf{A}$.

Fig. 5. Average reconstruction error $\|\mathbf{s} - \tilde{\mathbf{s}}\|_2/\|\mathbf{s}\|_2$ for the compression of 365 daily temperature snapshots based on the graph Fourier transform using $1 \le C \le N$ coefficients. [Plot: error (%) versus number of used coefficients.]

Fig. 6. The Fourier basis vector that captures most energy of temperature measurements reflects the relative distribution of temperature across the mainland United States. The coefficients are normalized to the interval [0, 1].

Consider the above dataset of temperature measurements. The matrix $\mathbf{A}$ in (29) is symmetric by construction, hence its eigenvectors form an orthonormal basis, and the graph Fourier transform matrix $\mathbf{F}$ is orthogonal. In this case, we can compress each daily update $\mathbf{s}$ of $N = 150$ measurements by keeping only the $C$ spectrum coefficients (25) $\hat{s}_n$ with largest magnitudes. Assuming that $|\hat{s}_0| \ge |\hat{s}_1| \ge \ldots \ge |\hat{s}_{N-1}|$, the signal reconstructed after compression is

$$\tilde{\mathbf{s}} = \mathbf{F}^{-1}\,(\hat{s}_0, \ldots, \hat{s}_{C-1}, 0, \ldots, 0)^{T}. \tag{30}$$

Fig. 5 shows the average reconstruction errors obtained by retaining $1 \le C \le N$ spectrum coefficients.
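A minimal sketch of this compression scheme for a symmetric adjacency matrix is shown below. The random symmetric matrix and random signal are placeholders for the temperature graph and snapshots, and `gft_compress` is a hypothetical helper name.

```python
import numpy as np

def gft_compress(A, s, C):
    """Keep the C largest-magnitude graph Fourier coefficients of s (A symmetric)."""
    lam, V = np.linalg.eigh(A)                   # orthonormal eigenvectors of A
    s_hat = V.T @ s                              # forward transform, F = V^T
    keep = np.argsort(np.abs(s_hat))[::-1][:C]   # indices of the C largest magnitudes
    s_hat_c = np.zeros_like(s_hat)
    s_hat_c[keep] = s_hat[keep]
    s_rec = V @ s_hat_c                          # inverse transform, F^{-1} = V
    return s_rec, np.linalg.norm(s - s_rec) / np.linalg.norm(s)

# Placeholder data: a random symmetric matrix and a random signal.
rng = np.random.default_rng(0)
N = 30
M = rng.random((N, N))
A = 0.5 * (M + M.T)
s = rng.standard_normal(N)

errs = [gft_compress(A, s, C)[1] for C in range(1, N + 1)]
print(errs[0], errs[-1])
```

Because the transform is orthogonal, the squared error equals the energy of the discarded coefficients, so the error cannot increase as $C$ grows.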

This example also provides interesting insights into the temperature distribution pattern in the United States. Consider the Fourier basis vector that most frequently (for 217 days out of 365) captures most energy of the snapshots, i.e., yields the spectrum coefficient $\hat{s}_0$ in (30). Fig. 6 shows the vector coefficients plotted on the representation graph according to the sensors' geographical coordinates, so the graph naturally takes the shape of the mainland US. It can be observed that this basis vector reflects the relative temperature distribution across the US: the south-eastern region is the hottest one, and the Great Lakes region is the coldest one [63].

Data Classification

Classification and labeling are important problems in data analysis. These problems have traditionally been studied in machine learning [64], [65]. Here, we propose a novel data classification algorithm by demonstrating that a classifier system can be interpreted as a filter on a graph. Thus, the construction of an optimal classifier can be viewed and studied as the design of an adaptive graph filter. Our algorithm scales linearly with the data size $N$, which makes it an attractive alternative to existing classification methods based on neural networks and support vector machines.

Our approach is based on label propagation [66], [67], which is a simple, yet efficient technique that advances known labels from labeled graph nodes along edges to unlabeled nodes. Usually this propagation is modeled as a stationary discrete-time Markov process [68], and the graph adjacency matrix is constructed as a probability transition matrix, i.e., $\mathbf{A}_{n,m} \ge 0$ for all $n, m$, and $\mathbf{A}\mathbf{1}_N = \mathbf{1}_N$, where $\mathbf{1}_N$ is a column vector of $N$ ones. Initially known labels determine the initial probability distribution $\mathbf{s}$. For a binary classification problem with only two labels, the resulting labels are determined by the distribution $\tilde{\mathbf{s}} = \mathbf{A}^{P}\mathbf{s}$. If $\tilde{s}_n \le 1/2$, node $v_n$ is assigned one label, and otherwise the other. The number $P$ of propagations is determined heuristically.
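For reference, the baseline label propagation described above can be sketched as follows. The six-node graph, the self-loops added before row normalization, and the initial value 0.5 for unlabeled nodes are assumptions made for this example.

```python
import numpy as np

def propagate_labels(A, s0, P):
    """Baseline label propagation: s_tilde = A^P s0 for row-stochastic A, threshold 1/2."""
    assert np.all(A >= 0) and np.allclose(A.sum(axis=1), 1.0)
    s = np.asarray(s0, dtype=float)
    for _ in range(P):
        s = A @ s
    return s, (s > 0.5).astype(int)   # class 1 if s_n > 1/2, class 0 if s_n <= 1/2

# Two triangles {0,1,2} and {3,4,5} joined by the edge 2-3 (illustrative graph).
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
W += np.eye(6)                          # assumption: self-loops keep a node's own value in play
A = W / W.sum(axis=1, keepdims=True)    # row-stochastic transition matrix

# Node 0 is known to be in class 1, node 5 in class 0; the rest start undecided.
s0 = [1.0, 0.5, 0.5, 0.5, 0.5, 0.0]
s_tilde, labels = propagate_labels(A, s0, P=5)
print(labels)
```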

Our DSPG approach has two major distinctions from the original label propagation. First, we do not require $\mathbf{A}$ to be a stochastic matrix. We only assume that edge weights $\mathbf{A}_{k,m} \ge 0$ are non-negative and indicate similarity or dependency between nodes. In this case, nodes with positive labels $\tilde{s}_n > 0$ are assigned to one class, and with negative labels to another. Second, instead of propagating labels as in a Markov chain, we construct a filter $h(\mathbf{A})$ that produces labels

$$\tilde{\mathbf{s}} = h(\mathbf{A})\mathbf{s}. \tag{31}$$

The following example illustrates our approach. Consider a set of $N = 1224$ political blogs on the Web that we wish to classify as "conservative" or "liberal" based on their context [69]. Reading and labeling each blog is very time-consuming. Instead, we read and label only a few blogs, and use these labels to adaptively build a filter $h(\mathbf{A})$ in (31).

Let signal $\mathbf{s}$ contain initially known labels, where "conservative," "liberal," and unclassified blogs are represented by values $s_n = +1$, $-1$, and $0$, respectively. Also, let signal $\mathbf{t}$ contain training labels, a subset of known labels from $\mathbf{s}$. Both $\mathbf{s}$ and $\mathbf{t}$ are represented by a graph $G = (\mathcal{V}, \mathbf{A})$, where node $v_n$ contains the label of the $n$th blog, and edge $\mathbf{A}_{n,m} = 1$ if and only if there is a hyperlink reference from the $n$th to the $m$th blog; hence the graph is directed. Observe that the discovery of hyperlink references is a fast, easily automated task, unlike reading the blogs. An example subgraph for 50 blogs is shown in Fig. 1(d).

Recall that the graph shift $\mathbf{A}$ replaces each signal coefficient with a weighted combination of its neighbors. In this case, processing training labels $\mathbf{t}$ with the filter

$$\mathbf{I}_N + h_1\mathbf{A} \tag{32}$$

produces new labels $\tilde{\mathbf{t}} = \mathbf{t} + h_1\mathbf{A}\mathbf{t}$. Here, every node label is adjusted by a scaled sum of its neighbors' labels. The parameter $h_1$ can be interpreted as the "confidence" in our knowledge of current labels: the higher the confidence $h_1$, the more neighbors' labels should affect the current labels. We restrict the value of $h_1$ to be positive.

TABLE I. Accuracy of blog classification using adaptive filters.

Blog selection method | Fraction of initially known labels
                      | 2%  | 5%  | 10%
Random                | 87% | 93% | 95%
Most hyperlinks       | 93% | 94% | 95%

Since the sign of each label indicates its class, label $\tilde{t}_n$ is incorrect if its sign differs from $s_n$, or $\tilde{t}_n s_n \le 0$ for $s_n \ne 0$. We determine the optimal value of $h_1$ by minimizing the total error, given by the number of incorrect and undecided labels. This is done in linear time proportional to the number of initially known labels $s_n \ne 0$, since each constraint

$$\tilde{t}_n s_n = \Big(t_n + h_1 \sum_{k\in\mathcal{N}_n} t_k\Big) s_n \le 0 \tag{33}$$

is a linear inequality constraint on $h_1$.

To propagate labels to all nodes, we repeatedly feed them through $P$ filters (32) of the form $h^{(p)}(\mathbf{A}) = \mathbf{I}_N + h_p\mathbf{A}$, each time optimizing the value of $h_p$ using the greedy approach discussed above. The obtained adaptive classification filter is

$$h(\mathbf{A}) = (\mathbf{I}_N + h_P\mathbf{A})(\mathbf{I}_N + h_{P-1}\mathbf{A}) \cdots (\mathbf{I}_N + h_1\mathbf{A}). \tag{34}$$

In experiments, we set $P = 10$, since we observed that filter (34) converges quickly, and in many cases, $h_p = 0$ for $p > 10$, which is similar to the actual graph's diameter of 8. After the filter (34) is constructed, we apply it to all known labels $\mathbf{s}$, and classify all $N$ nodes based on the signs of resulting labels $\tilde{\mathbf{s}} = h(\mathbf{A})\mathbf{s}$.
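The greedy construction of the cascade (34) can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation: each constraint (33) changes state at a single value of $h$, so the error count is piecewise constant and one candidate per interval suffices. For simplicity the sketch scans all candidates, which is quadratic rather than linear in the number of known labels. The function names and the toy graph of two 4-node cliques are hypothetical.

```python
import numpy as np

def best_confidence(A, t, s):
    """Pick h > 0 minimizing the number of known labels with (t_n + h (A t)_n) s_n <= 0."""
    known = s != 0
    a = (t * s)[known]          # t_n s_n
    b = ((A @ t) * s)[known]    # s_n times the sum of neighbor labels
    with np.errstate(divide="ignore", invalid="ignore"):
        brk = -a / b            # value of h at which constraint n changes state
    brk = np.sort(brk[np.isfinite(brk) & (brk > 0)])
    last = brk[-1] + 1.0 if brk.size else 1.0
    edges = np.concatenate(([0.0], brk, [last]))
    candidates = 0.5 * (edges[:-1] + edges[1:])   # one h inside every interval
    errors = [np.sum(a + h * b <= 0) for h in candidates]
    return candidates[int(np.argmin(errors))]

def adaptive_classifier(A, t, s, P):
    """Cascade P stages (I + h_p A), fitting each h_p greedily on the training labels."""
    N = len(s)
    H = np.eye(N)
    cur = np.asarray(t, dtype=float)
    for _ in range(P):
        stage = np.eye(N) + best_confidence(A, cur, s) * A
        cur = stage @ cur
        H = stage @ H
    return H

# Toy graph: two 4-node cliques joined by the edge 3-4.
A = np.zeros((8, 8))
A[:4, :4] = 1.0
A[4:, 4:] = 1.0
np.fill_diagonal(A, 0.0)
A[3, 4] = A[4, 3] = 1.0

s = np.array([1., 1., 0., 0., 0., 0., -1., -1.])   # known labels
t = np.array([1., 0., 0., 0., 0., 0., 0., -1.])    # training subset of the known labels
H = adaptive_classifier(A, t, s, P=3)
labels = np.sign(H @ s)
print(labels)
```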

In our experiments, we considered two methods for selecting nodes to be labeled initially: random selection, and selection of blogs with most hyperlinks. As Table I shows, our algorithm achieves high accuracy for both methods. In particular, assigning initial labels $\mathbf{s}$ to only 2% of blogs with most hyperlinks leads to the correct classification of 93% of unlabeled blogs.

Customer Behavior Prediction

The adaptive filter design discussed in the previous example can be applied to other problems as well. Moreover, the linear computational cost of the filter design makes the approach easily scalable for the analysis of large datasets. Consider an example of a mobile service provider that is interested in keeping its customers. The company wants to predict which users will stop using their services in the near future, and offer them incentives for staying with the provider (improved call plan, discounted phones, etc.). In particular, based on their past behavior, such as number and length of calls within the network, the company wants to predict whether customers will stop using the services in the next month.

This problem can be formulated similarly to the previous example. In this case, the value at node $v_n$ of the representation graph $G = (\mathcal{V}, \mathbf{A})$ indicates the probability that the $n$th customer will not use the provider services in the next 30 days. The weight of a directed edge from node $v_n$ to $v_m$ is the fraction of time the $n$th customer called and talked to the $m$th customer; i.e., if $T_{n,m}$ indicates the total time the $n$th customer called and talked to the $m$th customer in the past until the present moment, then

$$\mathbf{A}_{n,m} = \frac{T_{n,m}}{\sum_{k\in\mathcal{N}_n} T_{n,k}}.$$

Fig. 7. The accuracy of behavior prediction for customers of a mobile provider. Predictions for customers who stopped using the provider and those who continued are evaluated separately, and then combined into the overall accuracy. [Plot: accuracy (%) versus month (3 to 10), with curves for Stopped, Continued, and Overall.]

The initial input signal $\mathbf{s}$ has $s_n = 1$ if the customer has already stopped using the provider, and $s_n = 0$ otherwise. As in the previous example, we design a classifier filter (34); we set $P = 10$. We then process the entire signal $\mathbf{s}$ with the designed filter, obtaining the output signal $\tilde{\mathbf{s}}$ of the predicted probabilities; we conclude that the $n$th customer will stop using the provider if $\tilde{s}_n \ge 1/2$, and will continue if $\tilde{s}_n < 1/2$.

In our preliminary experiments, we used a ten-month-long call log for approximately 3.5 million customers of a European mobile service provider, approximately 10% of whom stopped using the provider during this period⁶. Fig. 7 shows the accuracy of predicting customer behavior for months 3-10 using filters with $L \le 10$ taps. The accuracy reflects the ratio of correct predictions for all customers, the ones who stop using the service and the ones who continue; it is important to correctly identify both classes, so the provider can focus on the proper set of customers. As can be seen from the results, the designed filters achieve high accuracy in the prediction of customer behavior. Unsurprisingly, the prediction accuracy increases as more information becomes available, since we optimize the filter for month $K$ using cumulative information from the preceding $K - 1$ months.

VII. CONCLUSIONS

We have proposed DSPG, a novel DSP theory for datasets whose underlying similarity or dependency relations are represented by arbitrary graphs. Our framework extends fundamental DSP structures and concepts, including shift, filters, signal and filter spaces, spectral decomposition, spectrum, Fourier transform, and frequency response, to such datasets by viewing them as signals indexed by graph nodes. We demonstrated that DSPG is a natural extension of the classical time-series DSP theory, and traditional definitions of the above DSP concepts and structures can be obtained using a graph representing discrete time series. We also provided example applications of DSPG to various social science problems, and our experimental results demonstrated the effectiveness of using the DSPG framework for datasets of different nature.

⁶We use a large dataset on Call Detailed Records (CDRs) from a large mobile operator in one European country, which we call EURMO for short.

Acknowledgment

We thank EURMO, CMU Prof. Pedro Ferreira, and the iLab at CMU Heinz College for granting us access to the EURMO CDR database and for related discussions.

APPENDIX A: MATRIX DECOMPOSITION AND PROPERTIES

We review relevant properties of the Jordan normal form, and the characteristic and minimal polynomial of a matrix $\mathbf{A} \in \mathbb{C}^{N\times N}$; for a thorough review, see [57], [58].

Jordan Normal Form

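Let $\lambda_0, \ldots, \lambda_{M-1}$ denote $M \le N$ distinct eigenvalues of $\mathbf{A}$. Let each eigenvalue $\lambda_m$ have $D_m$ linearly independent eigenvectors $\mathbf{v}_{m,0}, \ldots, \mathbf{v}_{m,D_m-1}$. Here, $D_m$ is the geometric multiplicity of $\lambda_m$. Each eigenvector $\mathbf{v}_{m,d}$ generates a Jordan chain of $R_{m,d} \ge 1$ linearly independent generalized eigenvectors $\mathbf{v}_{m,d,r}$, $0 \le r < R_{m,d}$, where $\mathbf{v}_{m,d,0} = \mathbf{v}_{m,d}$, that satisfy

$$(\mathbf{A} - \lambda_m\mathbf{I})\,\mathbf{v}_{m,d,r} = \mathbf{v}_{m,d,r-1}. \tag{35}$$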

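For each eigenvector $\mathbf{v}_{m,d}$ and its Jordan chain of length $R_{m,d}$, we define a Jordan block matrix of dimension $R_{m,d}$ as

$$\mathbf{J}_{R_{m,d}}(\lambda_m) = \begin{pmatrix} \lambda_m & 1 & & \\ & \lambda_m & \ddots & \\ & & \ddots & 1 \\ & & & \lambda_m \end{pmatrix} \in \mathbb{C}^{R_{m,d}\times R_{m,d}}. \tag{36}$$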

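Thus, each eigenvalue $\lambda_m$ is associated with $D_m$ Jordan blocks, each with dimension $R_{m,d}$, $0 \le d < D_m$. Next, for each eigenvector $\mathbf{v}_{m,d}$, we collect its Jordan chain into an $N \times R_{m,d}$ matrix

$$\mathbf{V}_{m,d} = \begin{pmatrix} \mathbf{v}_{m,d,0} & \ldots & \mathbf{v}_{m,d,R_{m,d}-1} \end{pmatrix}. \tag{37}$$

We concatenate all blocks $\mathbf{V}_{m,d}$, $0 \le d < D_m$ and $0 \le m < M$, into one block matrix

$$\mathbf{V} = \begin{pmatrix} \mathbf{V}_{0,0} & \ldots & \mathbf{V}_{M-1,D_{M-1}-1} \end{pmatrix}, \tag{38}$$

so that $\mathbf{V}_{m,d}$ is at position $\sum_{k=0}^{m-1} D_k + d$ in this matrix. Then, matrix $\mathbf{A}$ has the Jordan decomposition

$$\mathbf{A} = \mathbf{V}\mathbf{J}\mathbf{V}^{-1}, \tag{39}$$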

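where the block-diagonal matrix

$$\mathbf{J} = \begin{pmatrix} \mathbf{J}_{R_{0,0}}(\lambda_0) & & \\ & \ddots & \\ & & \mathbf{J}_{R_{M-1,D_{M-1}-1}}(\lambda_{M-1}) \end{pmatrix} \tag{40}$$

is called the Jordan normal form of $\mathbf{A}$.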

Minimal and Characteristic Polynomials

The minimal polynomial of matrix $\mathbf{A}$ is the monic polynomial $m_{\mathbf{A}}(x)$ of smallest possible degree that satisfies $m_{\mathbf{A}}(\mathbf{A}) = \mathbf{0}_N$. Let $R_m = \max\{R_{m,0}, \ldots, R_{m,D_m-1}\}$ denote the maximum length of Jordan chains corresponding to eigenvalue $\lambda_m$. Then the minimal polynomial $m_{\mathbf{A}}(x)$ is given by

$$m_{\mathbf{A}}(x) = (x - \lambda_0)^{R_0} \cdots (x - \lambda_{M-1})^{R_{M-1}}. \tag{41}$$

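The index of $\lambda_m$ is $R_m$, $0 \le m < M$. Any polynomial $p(x)$ that satisfies $p(\mathbf{A}) = \mathbf{0}_N$ is a polynomial multiple of $m_{\mathbf{A}}(x)$, i.e., $p(x) = q(x)\,m_{\mathbf{A}}(x)$. The degree of the minimal polynomial satisfies

$$\deg m_{\mathbf{A}}(x) = N_{\mathbf{A}} = \sum_{m=0}^{M-1} R_m \le N. \tag{42}$$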

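The characteristic polynomial of the matrix $\mathbf{A}$ is defined as

$$p_{\mathbf{A}}(x) = \det(x\,\mathbf{I} - \mathbf{A}) = (x - \lambda_0)^{A_0} \cdots (x - \lambda_{M-1})^{A_{M-1}}. \tag{43}$$

Here, $A_m = R_{m,0} + \ldots + R_{m,D_m-1}$, for $0 \le m < M$, is the algebraic multiplicity of $\lambda_m$; $\deg p_{\mathbf{A}}(x) = N$; $p_{\mathbf{A}}(x)$ is a multiple of $m_{\mathbf{A}}(x)$; and $p_{\mathbf{A}}(x) = m_{\mathbf{A}}(x)$ if and only if the geometric multiplicity of each $\lambda_m$ is $D_m = 1$, i.e., each eigenvalue $\lambda_m$ has exactly one eigenvector.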
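The relation between Jordan block sizes and the two polynomials can be checked on a small example. The sketch below builds a $4 \times 4$ matrix with eigenvalue 2 (Jordan blocks of sizes 2 and 1) and eigenvalue 3 (one block of size 1); the eigenvalues and the invertible matrix `V` are arbitrary choices made for this illustration.

```python
import numpy as np

def jordan_block(lam, size):
    """Jordan block with lam on the diagonal and ones on the superdiagonal, as in (36)."""
    return lam * np.eye(size) + np.eye(size, k=1)

# Jordan normal form: eigenvalue 2 with blocks of sizes 2 and 1, eigenvalue 3 with one block.
J = np.zeros((4, 4))
J[:2, :2] = jordan_block(2.0, 2)
J[2, 2] = 2.0
J[3, 3] = 3.0

# Any invertible V yields a matrix A = V J V^{-1} with this Jordan form.
V = np.array([[1., 1., 0., 0.],
              [0., 1., 1., 0.],
              [0., 0., 1., 1.],
              [0., 0., 0., 1.]])
A = V @ J @ np.linalg.inv(V)
I = np.eye(4)

# R_0 = max(2, 1) = 2 and R_1 = 1, so m_A(x) = (x - 2)^2 (x - 3) has degree 3 < N = 4.
m_A = (A - 2 * I) @ (A - 2 * I) @ (A - 3 * I)
# Lowering the exponent of (x - 2) gives a polynomial that no longer annihilates A.
too_small = (A - 2 * I) @ (A - 3 * I)
# Algebraic multiplicities 2 + 1 = 3 and 1 give p_A(x) = (x - 2)^3 (x - 3).
p_A = np.linalg.matrix_power(A - 2 * I, 3) @ (A - 3 * I)

print(np.allclose(m_A, 0), np.allclose(too_small, 0), np.allclose(p_A, 0))
```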

APPENDIX B: PROOF OF THEOREM 2

We will use the following lemma to prove Theorem 2.

Lemma 1: For polynomials $h(x)$, $g(x)$, and $p(x) = h(x)g(x)$, and a Jordan block $\mathbf{J}_r(\lambda)$ as in (36) of arbitrary dimension $r$ and eigenvalue $\lambda$, the following equality holds:

$$h(\mathbf{J}_r(\lambda))\,g(\mathbf{J}_r(\lambda)) = p(\mathbf{J}_r(\lambda)). \tag{44}$$

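Proof: The $(i, j)$th element of $h(\mathbf{J}_r(\lambda))$ is

$$h(\mathbf{J}_r(\lambda))_{i,j} = \frac{1}{(j-i)!}\,h^{(j-i)}(\lambda) \tag{45}$$

for $j \ge i$ and 0 otherwise, where $h^{(j-i)}(\lambda)$ is the $(j-i)$th derivative of $h(\lambda)$ [58]. Hence, the $(i, j)$th element of $h(\mathbf{J}_r(\lambda))\,g(\mathbf{J}_r(\lambda))$ for $j < i$ is zero and for $j \ge i$ is

$$\begin{aligned} \sum_{k=i}^{j} h(\mathbf{J}_r(\lambda))_{i,k}\, g(\mathbf{J}_r(\lambda))_{k,j} &= \sum_{k=i}^{j} \frac{1}{(k-i)!}\,h^{(k-i)}(\lambda)\,\frac{1}{(j-k)!}\,g^{(j-k)}(\lambda) \\ &= \frac{1}{(j-i)!} \sum_{k=i}^{j} \binom{j-i}{k-i} h^{(k-i)}(\lambda)\, g^{(j-k)}(\lambda) \\ &= \frac{1}{(j-i)!} \sum_{m=0}^{j-i} \binom{j-i}{m} h^{(m)}(\lambda)\, g^{(j-i-m)}(\lambda) \\ &= \frac{1}{(j-i)!}\,\big(h(\lambda)g(\lambda)\big)^{(j-i)}. \end{aligned} \tag{46}$$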

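Matrix equality (44) follows by comparing (46) with (45).

As before, let $\lambda_0, \ldots, \lambda_{M-1}$ denote distinct eigenvalues of $\mathbf{A}$. Consider the Jordan decomposition (39) of $\mathbf{A}$. For each $0 \le m < M$, select distinct numbers $\lambda_{m,0}, \ldots, \lambda_{m,D_m-1}$, so that all $\lambda_{m,d}$ for $0 \le d < D_m$ and $0 \le m < M$ are distinct. Construct the block-diagonal matrix

$$\tilde{\mathbf{J}} = \begin{pmatrix} \mathbf{J}_{R_{0,0}}(\lambda_{0,0}) & & \\ & \ddots & \\ & & \mathbf{J}_{R_{M-1,D_{M-1}-1}}(\lambda_{M-1,D_{M-1}-1}) \end{pmatrix}.$$

The Jordan blocks on the diagonal of $\tilde{\mathbf{J}}$ match the sizes of the Jordan blocks of $\mathbf{J}$ in (40), but their elements are different.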
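Lemma 1 and the entry formula (45) used in its proof can be checked numerically on a single Jordan block. In the sketch below the polynomials, the eigenvalue 1.5, and the block size 4 are arbitrary choices made for this illustration.

```python
import numpy as np
from math import factorial

def jordan_block(lam, size):
    """Jordan block with lam on the diagonal and ones on the superdiagonal."""
    return lam * np.eye(size) + np.eye(size, k=1)

def poly_of_matrix(coeffs, M):
    """Evaluate c_0 I + c_1 M + c_2 M^2 + ... by Horner's rule (ascending coefficients)."""
    out = np.zeros_like(M)
    for c in coeffs[::-1]:
        out = out @ M + c * np.eye(M.shape[0])
    return out

h = np.array([1.0, -2.0, 0.5])   # h(x) = 1 - 2x + 0.5x^2
g = np.array([3.0, 1.0])         # g(x) = 3 + x
p = np.convolve(h, g)            # coefficients of p(x) = h(x) g(x)

lam, r = 1.5, 4
J = jordan_block(lam, r)
lhs = poly_of_matrix(h, J) @ poly_of_matrix(g, J)
rhs = poly_of_matrix(p, J)

# Entry formula (45): h(J)_{i,j} = h^{(j-i)}(lam) / (j-i)! for j >= i, and 0 otherwise.
h_poly = np.polynomial.Polynomial(h)
expected = np.zeros((r, r))
for i in range(r):
    for j in range(i, r):
        expected[i, j] = h_poly.deriv(j - i)(lam) / factorial(j - i)

print(np.allclose(lhs, rhs), np.allclose(poly_of_matrix(h, J), expected))
```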

Consider a polynomial $r(x) = r_0 + r_1 x + \ldots + r_{N-1}x^{N-1}$, and assume that $r(\tilde{\mathbf{J}}) = \mathbf{J}$. By Lemma 1, this is equivalent to

$$\begin{cases} r(\lambda_{m,d}) = \lambda_m, \\ r^{(1)}(\lambda_{m,d}) = 1, \\ r^{(i)}(\lambda_{m,d}) = 0, & \text{for } 2 \le i < R_{m,d}, \end{cases}$$

for all $0 \le d < D_m$ and $0 \le m < M$. This is a system of $N$ linear equations with $N$ unknowns $r_0, \ldots, r_{N-1}$ that can be uniquely solved using inverse polynomial interpolation [58].

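Using (39), we obtain $\mathbf{A} = \mathbf{V}\mathbf{J}\mathbf{V}^{-1} = \mathbf{V}\,r(\tilde{\mathbf{J}})\,\mathbf{V}^{-1} = r(\mathbf{V}\tilde{\mathbf{J}}\mathbf{V}^{-1}) = r(\tilde{\mathbf{A}})$. Furthermore, since all $\lambda_{m,d}$ are distinct numbers, their geometric multiplicities are equal to 1. As discussed in Appendix A, this is equivalent to $p_{\tilde{\mathbf{A}}}(x) = m_{\tilde{\mathbf{A}}}(x)$.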

APPENDIX C: PROOF OF THEOREM 4

Lemma 1 leads to the construction procedure of the inverse polynomial $g(x)$ of $h(x)$, when it exists, whose matrix representation satisfies $g(\mathbf{A})h(\mathbf{A}) = \mathbf{I}_N$. Observe that this condition, together with (44), is equivalent to

$$\begin{cases} h(\lambda_m)\,g(\lambda_m) = 1, & \text{for } 0 \le m \le M-1, \\ \big(h(\lambda_m)\,g(\lambda_m)\big)^{(i)} = 0, & \text{for } 1 \le i < R_m. \end{cases} \tag{47}$$

Here, $R_m$ is the degree of the factor $(x - \lambda_m)^{R_m}$ in the minimal polynomial $m_{\mathbf{A}}(x)$ in (41). Since values of $h(x)$ and its derivatives at $\lambda_m$ are known, (47) amounts to $N_{\mathbf{A}}$ linear equations with $N_{\mathbf{A}}$ unknowns. They have a unique solution if and only if $h(\lambda_m) \ne 0$ for all $\lambda_m$, and the coefficients $g_0, \ldots, g_{N_{\mathbf{A}}-1}$ are then uniquely determined using inverse polynomial interpolation [58].
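In the simplest case, where all eigenvalues are distinct ($R_m = 1$ for every $m$), the conditions (47) reduce to $g(\lambda_m) = 1/h(\lambda_m)$ and the inverse filter follows from ordinary polynomial interpolation. The sketch below illustrates this for an undirected path graph with five nodes; the graph and the filter $h(x) = 1 + 0.3x$ are assumptions chosen for the example, and repeated eigenvalues with longer Jordan chains would also require the derivative conditions.

```python
import numpy as np

def poly_of_matrix(coeffs, M):
    """Evaluate c_0 I + c_1 M + c_2 M^2 + ... by Horner's rule (ascending coefficients)."""
    out = np.zeros_like(M)
    for c in coeffs[::-1]:
        out = out @ M + c * np.eye(M.shape[0])
    return out

N = 5
A = np.eye(N, k=1) + np.eye(N, k=-1)   # undirected path graph; its eigenvalues are distinct
lam = np.linalg.eigvalsh(A)

h = np.array([1.0, 0.3])               # h(x) = 1 + 0.3x, illustrative taps
h_lam = np.polynomial.polynomial.polyval(lam, h)
assert np.all(np.abs(h_lam) > 1e-12)   # invertibility condition: h(lam_m) != 0

# N interpolation conditions g(lam_m) = 1 / h(lam_m) for the N unknowns g_0..g_{N-1}.
vandermonde = np.vander(lam, N, increasing=True)
g = np.linalg.solve(vandermonde, 1.0 / h_lam)

hA = poly_of_matrix(h, A)
gA = poly_of_matrix(g, A)
print(np.allclose(gA @ hA, np.eye(N)))
```

The resulting $g(\mathbf{A})$ is again a polynomial in $\mathbf{A}$, so the inverse of a graph filter is itself a graph filter with at most $N_{\mathbf{A}}$ taps.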

REFERENCES

[1] C. Chamley, Rational Herds: Economic Models of Social Learning, Cambridge Univ. Press, 2004.

[2] M. Jackson, Social and Economic Networks, Princeton Univ., 2008.

[3] D. Easley and J. Kleinberg, Networks, Crowds, and Markets: Reasoning About a Highly Connected World, Cambridge Univ. Press, 2010.

[4] M. Newman, Networks: An Introduction, Oxford Univ. Press, 2010.

[5] J. Whittaker, Graphical Models in Applied Multivariate Statistics, Wiley, 1990.

[6] S. L. Lauritzen, Graphical Models, Oxford Univ. Press, 1996.

[7] F. V. Jensen, Bayesian Networks and Decision Graphs, IEEE Comp. Soc. Press, 2001.

[8] M. I. Jordan, “Graphical models,” Statistical Science (Special Issue on Bayesian Statistics), vol. 19, no. 1, pp. 140–155, 2004.

[9] M. J. Wainwright and M. I. Jordan, Graphical Models, Exponential Families, and Variational Inference, Now Publishers Inc., 2008.

[10] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT Press, 2009.

[11] D. Edwards, Introduction to Graphical Modelling, Springer, 2000.

[12] J. Bang-Jensen and G. Gutin, Digraphs: Theory, Algorithms and Applications, Springer, 2nd edition, 2009.

[13] R. Kindermann and J. L. Snell, Markov Random Fields and Their Applications, American Mathematical Society, 1980.

[14] A. S. Willsky, “Multiresolution Markov models for signal and image processing,” Proc. IEEE, vol. 90, no. 8, pp. 1396–1458, 2002.

[15] J. Besag, “Spatial interaction and the statistical analysis of lattice systems,” J. Royal Stat. Soc., vol. 36, no. 2, pp. 192–236, 1974.

[16] J. M. Hammersley and D. C. Handscomb, Monte Carlo Methods, Chapman & Hall, 1964.

[17] D. Vats and J. M. F. Moura, “Finding non-overlapping clusters for generalized inference over graphical models,” IEEE Trans. Signal Proc., vol. 60, no. 12, pp. 6368–6381, 2012.

[18] M. I. Jordan, E. B. Sudderth, M. Wainwright, and A. S. Willsky, “Major advances and emerging developments of graphical models,” IEEE Signal Proc. Mag., vol. 27, no. 6, pp. 17–138, 2010.

[19] J. B. Tenenbaum, V. Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, pp. 2319–2323, 2000.

[20] S. Roweis and L. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, pp. 2323–2326, 2000.

[21] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Comp., vol. 15, no. 6, pp. 1373–1396, 2003.

[22] D. L. Donoho and C. Grimes, “Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data,” Proc. Nat. Acad. Sci., vol. 100, no. 10, pp. 5591–5596, 2003.

[23] F. R. K. Chung, Spectral Graph Theory, AMS, 1996.

[24] M. Belkin and P. Niyogi, “Using manifold structure for partially labeled classification,” 2002.

[25] M. Hein, J. Audibert, and U. von Luxburg, “From graphs to manifolds - weak and strong pointwise consistency of graph Laplacians,” in COLT, 2005, pp. 470–485.

[26] E. Giné and V. Koltchinskii, “Empirical graph Laplacian approximation of Laplace–Beltrami operators: Large sample results,” IMS Lecture Notes Monograph Series, vol. 51, pp. 238–259, 2006.

[27] M. Hein, J. Audibert, and U. von Luxburg, “Graph Laplacians and their convergence on random neighborhood graphs,” J. Machine Learn., vol. 8, pp. 1325–1370, June 2007.

[28] R. R. Coifman, S. Lafon, A. Lee, M. Maggioni, B. Nadler, F. J. Warner, and S. W. Zucker, “Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps,” Proc. Nat. Acad. Sci., vol. 102, no. 21, pp. 7426–7431, 2005.

[29] R. R. Coifman, S. Lafon, A. Lee, M. Maggioni, B. Nadler, F. J. Warner, and S. W. Zucker, “Geometric diffusions as a tool for harmonic analysis and structure definition of data: Multiscale methods,” Proc. Nat. Acad. Sci., vol. 102, no. 21, pp. 7432–7437, 2005.

[30] R. R. Coifman and M. Maggioni, “Diffusion wavelets,” Appl. Comp. Harm. Anal., vol. 21, no. 1, pp. 53–94, 2006.

[31] C. Guestrin, P. Bodik, R. Thibaux, M. Paskin, and S. Madden, “Distributed regression: an efficient framework for modeling sensor network data,” in IPSN, 2004, pp. 1–10.

[32] D. Ganesan, B. Greenstein, D. Estrin, J. Heidemann, and R. Govindan, “Multiresolution storage and search in sensor networks,” ACM Trans. Storage, vol. 1, pp. 277–315, 2005.

[33] R. Wagner, H. Choi, R. G. Baraniuk, and V. Delouille, “Distributed wavelet transform for irregular sensor network grids,” in IEEE SSP Workshop, 2005, pp. 1196–1201.

[34] R. Wagner, A. Cohen, R. G. Baraniuk, S. Du, and D. B. Johnson, “An architecture for distributed wavelet analysis and processing in sensor networks,” in IPSN, 2006, pp. 243–250.

[35] D. K. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,” J. Appl. Comp. Harm. Anal., vol. 30, no. 2, pp. 129–150, 2011.

[36] S. K. Narang and A. Ortega, “Local two-channel critically sampled filter-banks on graphs,” in ICIP, 2010, pp. 333–336.

[37] S. K. Narang and A. Ortega, “Perfect reconstruction two-channel wavelet filter banks for graph structured data,” IEEE Trans. Signal Proc., vol. 60, no. 6, pp. 2786–2799, 2012.

[38] R. Wagner, V. Delouille, and R. G. Baraniuk, “Distributed wavelet de-noising for sensor networks,” in Proc. CDC, 2006, pp. 373–379.

[39] X. Zhu and M. Rabbat, “Approximating signals supported on graphs,” in Proc. ICASSP, 2012, pp. 3921–3924.

[40] A. Agaskar and Y. M. Lu, “Uncertainty principles for signals defined on graphs: Bounds and characterizations,” in Proc. ICASSP, 2012.

[41] A. Agaskar and Y. Lu, “A spectral graph uncertainty principle,” submitted for publication, June 2012.

[42] M. Püschel and J. M. F. Moura, “The algebraic approach to the discrete cosine and sine transforms and their fast algorithms,” SIAM J. Comp., vol. 32, no. 5, pp. 1280–1316, 2003.

[43] M. Püschel and J. M. F. Moura, “Algebraic signal processing theory,” http://arxiv.org/abs/cs.IT/0612077.

[44] M. Püschel and J. M. F. Moura, “Algebraic signal processing theory: Foundation and 1-D time,” IEEE Trans. Signal Proc., vol. 56, no. 8, pp. 3572–3585, 2008.

[45] M. Püschel and J. M. F. Moura, “Algebraic signal processing theory: 1-D space,” IEEE Trans. Signal Proc., vol. 56, no. 8, pp. 3586–3599, 2008.

[46] M. Püschel and J. M. F. Moura, “Algebraic signal processing theory: Cooley-Tukey type algorithms for DCTs and DSTs,” IEEE Trans. Signal Proc., vol. 56, no. 4, pp. 1502–1521, 2008.

[47] A. Sandryhaila, J. Kovačević, and M. Püschel, “Algebraic signal processing theory: Cooley-Tukey type algorithms for polynomial transforms based on induction,” SIAM J. Matrix Analysis and Appl., vol. 32, no. 2, pp. 364–384, 2011.

[48] M. Püschel and M. Rötteler, “Algebraic signal processing theory: 2-D hexagonal spatial lattice,” IEEE Trans. on Image Proc., vol. 16, no. 6, pp. 1506–1521, 2007.

[49] A. Sandryhaila, J. Kovačević, and M. Püschel, “Algebraic signal processing theory: 1-D nearest-neighbor models,” IEEE Trans. on Signal Proc., vol. 60, no. 5, pp. 2247–2259, 2012.

[50] A. Sandryhaila, S. Saba, M. Püschel, and J. Kovačević, “Efficient compression of QRS complexes using Hermite expansion,” IEEE Trans. on Signal Proc., vol. 60, no. 2, pp. 947–955, 2012.

[51] A. Sandryhaila and J. M. F. Moura, “Nearest-neighbor image model,” in Proc. ICIP, 2012, to appear.

[52] B. A. Miller, N. T. Bliss, and P. J. Wolfe, “Toward signal processing theory for graphs and non-Euclidean data,” in Proc. ICASSP, 2010, pp. 5414–5417.

[53] B. A. Miller, M. S. Beard, and N. T. Bliss, “Matched filtering for subgraph detection in dynamic networks,” in Proc. SSP, 2011, pp. 509–512.

[54] A. Sandryhaila and J. M. F. Moura, “Discrete signal processing on graphs: Graph Fourier transform,” submitted for publication.

[55] A. Sandryhaila and J. M. F. Moura, “Discrete signal processing on graphs: Graph filters,” submitted for publication.

[56] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing, Prentice Hall, 2nd edition, 1999.

[57] F. R. Gantmacher, Matrix Theory, vol. I, Chelsea, 1959.

[58] P. Lancaster and M. Tismenetsky, The Theory of Matrices, Academic Press, 2nd edition, 1985.

[59] D. E. Dudgeon and R. M. Mersereau, Multidimensional Digital Signal Processing, Prentice Hall, 1983.

[60] A. Bovik, Handbook of Image and Video Processing, Academic Press, 2nd edition, 2005.

[61] P. P. Vaidyanathan, The Theory of Linear Prediction, Morgan and Claypool, 2008.

[62] “National climatic data center,” 2011, ftp://ftp.ncdc.noaa.gov/pub/data/gsod.

[63] “NCDC NOAA 1981-2010 climate normals,” 2011, ncdc.noaa.gov/oa/climate/normals/usnormals.html.

[64] T. Mitchell, Machine Learning, McGraw-Hill, 1997.

[65] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Wiley, 2nd edition, 2000.

[66] X. Zhu, J. Lafferty, and Z. Ghahramani, “Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions,” in Proc. ICML, 2003, pp. 58–65.

[67] F. Wang and C. Zhang, “Label propagation through linear neighborhoods,” in Proc. ICML, 2006, pp. 985–992.

[68] A. Papoulis and S. U. Pillai, Probability, Random Variables and Stochastic Processes, McGraw-Hill, 4th edition, 2002.

[69] L. A. Adamic and N. Glance, “The political blogosphere and the 2004 U.S. election: Divided they blog,” in LinkKDD, 2005.
