
    FuncICA for Time Series Pattern Discovery

    Nishant Mehta

Alexander Gray

Georgia Institute of Technology

    Abstract

We introduce FuncICA, a new independent component analysis method for pattern discovery in inherently functional data, such as time series data. We show how applying the dual of temporal ICA to temporal data, and likewise applying the dual of spatiotemporal ICA to spatiotemporal data, enables independent component regularization not afforded by the primal forms applied to their original domains. We call this family of regularized dual ICA algorithms FuncICA. FuncICA can be considered an analog to functional principal component analysis, where instead of extracting components to minimize L2 reconstruction error, we maximize independence of the components over the functional observations. In this work, we develop an algorithm for extracting independent component curves, derive a method for optimally smoothing the curves, and validate this method on both synthetic and real datasets. Results for synthetic, gene expression, and electroencephalographic event-related potential data indicate that FuncICA can recover well-known scientific phenomena and improve classification accuracy, highlighting its utility for unsupervised learning in continuous data. We conclude this work with a forward-looking, novel framework for fMRI data analysis by making use of the functional dual of spatiotemporal ICA.

    1 Introduction

We propose a novel solution for unsupervised pattern discovery in time series and other structured data. In particular, we are searching for linearly varying patterns of activation. A simple example of such a pattern is an uptrend sustained for a fixed period of time, such that linear variation affects the slope of the trend. Two powerful applications are automatic identification of event-related potentials in electroencephalography (EEG) and

discovery of time-varying signaling mechanisms for gene expression. Existing temporal methods are unable to find such time series patterns: functional principal components analysis (FPCA) only finds patterns that have Gaussian variation over the observations, while existing temporal independent component analysis (ICA) methods are concerned with recovering statistically independent signals sampled over time rather than statistically


independent patterns sampled over entire time series observations (e.g., each observation is a 2-second-long univariate time series). The former type of source recovery, the primal form of ICA, is not sufficient for discovery of many types of patterns.

1.1 ICA in the primal  Algorithmic treatments of ICA abound, but our explorations of this space have led to at least one conclusion: all currently existing ICA algorithms treat the problem of ICA in the primal with respect to their domain. By ICA in the primal we mean that for a linear model

(1.1) X = AS,

where the ith row of X specifies the observations of a random variable X_i, and the jth column of X represents the jth joint observation of the joint variable X, ICA seeks a linear unmixing of the random variables X_1, ..., X_n such that the resulting variables S_1, ..., S_n are statistically independent over their observations.

This formulation of ICA naturally lends itself to solving the blind source separation (BSS) problem. In this problem, we observe n signals that are independently and identically distributed (IID)^1, and the observed signals have been mixed (each observation at a time index is the same linear combination of the values of the source signals at that time index). The goal in BSS is to recover the statistically independent source signals.
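To make the primal setup concrete, the following Python sketch instantiates the mixing model (1.1); the Laplace sources and the particular mixing matrix are arbitrary illustrative choices, not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

# n = 2 source signals, each observed at T = 1000 time indices.
# Each source's observations are IID draws from a non-Gaussian
# (here Laplace) distribution.
S = rng.laplace(size=(2, 1000))

# An arbitrary invertible mixing matrix A; X = AS applies the same
# linear combination at every time index (instantaneous mixing).
A = np.array([[0.8, 0.3],
              [0.4, 0.9]])
X = A @ S

# BSS: given only X, estimate W close to A^{-1} so that WX recovers S
# up to scaling and permutation. Here we verify with the true inverse.
W = np.linalg.inv(A)
print(np.allclose(W @ X, S))  # True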

1.2 ICA in the dual  While the primal form can solve highly interesting problems like blind source separation [1] and compression, a dual form of ICA opens the door to a world of different, equally important problems. In this work, we consider ICA in a dual representation. We still have precisely the same observed data X, but we frame the problem as the mixing model X^T = A S^T, which implies that A^{-1} X^T = S^T.^2 In the primal, the rows of A^{-1} are the independent components, which by themselves are of little value except for

^1 By IID, we mean that each signal's observations at a given time are assumed to be independent of observations of that signal at other times, and each observation of a given signal is assumed to be drawn from an identical distribution.

^2 For now, assume that A is positive definite (A ≻ 0) and hence invertible.


Figure 1: The data for temporal ICA in the primal. Random variables are signals sampled over time indices.

knowing the inverse mixing coefficients; however, when they are left-multiplied with the data, they yield source signals which likely are of value by themselves, such as a particular speaker's speech over some time period. The concept of the dual of ICA has been visited before, by Stone et al. [2] and more recently by Choi et al. [3].

In the dual, the rows of A^{-1} are time series, which in themselves are of particular value, while the sources S^T are not of value without the data X^T. The implication of the independent components being time series in the dual is that the independent components represent time series whose inner product over the observations (i.e., their activation, or coefficient) is statistically independent. For instance, an independent component

in the dual may be some time series pattern of stock value change that occurs in financial market data that is synchronized to a merger event, with a linear activation coefficient that varies depending on the market capitalization of the new company produced. Further, this temporal progression is statistically independent of the other independent components over the time series observations, and hence it appears to, in isolation, capture some underlying phenomenon of some system (in this case, a financial market).

The dual of ICA just described is actually the dual of temporal ICA. For clarity, see the primal and dual forms in Figures 1 and 2 respectively. Researchers in various communities, including neuroscience, psychology, and computer science, have explored another form of ICA, known as spatial ICA [4], which enables them to analyze functional magnetic resonance imaging (fMRI) data to understand the brain. In spatial ICA, each fMRI image is a random variable, and the voxel indices index the observations. This already appears to be quite similar to our dual representation of temporal ICA, and in fact a naive implementation of the dual of spatial ICA for this data is equivalent to temporal ICA. For the dual

Figure 2: The data for temporal ICA in the dual. Random variables are time indices sampled over time series.

of both temporal and spatial ICA, we will show how, using functional^3 representations of data, we can exploit the smoothness properties of temporally and spatially sampled data respectively to perform noise reduction on our data that improves the utility of our recovered temporal and spatial patterns respectively.

Our algorithm, FuncICA, finds precisely these smooth patterns, with the level of smoothness optimized by a novel, information-theory-inspired objective that is empirically validated. At this point in time, the brain is rapidly being imaged via fMRI under a number of stimuli, and it is critical that there exist a method suitable for detecting underlying patterns of activation that can help us better understand neural processes. After introducing FuncICA, our main result, and analyzing its performance, we will argue that spatial ICA, temporal ICA, and a method known as spatiotemporal ICA are all not quite the right ICA methods to use for fMRI data. We will then show how, with small changes, FuncICA is a more natural method to use for fMRI data analysis and can be a powerful tool for spatiotemporal pattern discovery in this exciting domain.

We begin by providing an overview of ICA in its primal form and some necessary functional notation, as well as a brief summary of FPCA, the closest method to FuncICA. Our contribution follows: we introduce FuncICA and a regularization parameter for smoothing, develop an objective function for optimization of this parameter, validate this objective function for synthetic and real EEG data, and demonstrate superior classification results using FuncICA features for microarray gene expression data and ERP data from the brain, as compared to FPCA and the standard (non-functional) dual of temporal ICA. Our results indicate that the algorithm produces components that can recover sources,

^3 This use of the word functional refers to the function-spaces notion, not to be confused with the functional in fMRI.


improve classification accuracy by sparsely representing information, and effectively estimate the P300 event-related potential in the brain. Finally, we conclude with a forward-looking, novel framework for fMRI data analysis.

    2 ICA and FPCA

Prior to introducing FuncICA, we formally introduce ICA and some functional analysis notation. We also describe functional PCA (FPCA) because it is the first step in performing FuncICA, and it is also an analog to FuncICA that finds uncorrelated but not necessarily statistically independent components.

2.1 ICA  ICA seeks a linear transformation from statistically dependent to statistically independent components. The model can be expressed as an instantaneous linear mixing model

(2.2) X(t) = A S(t),

where^4 t ∈ [T], X(t) ∈ R^n, A ∈ R^{n×n}, and S(t) ∈ R^n. The source signal distributions of S_1, ..., S_n realized at indices [T] are statistically independent, but the mixing matrix A clearly erases the independence of the distributions of the observed signals X_1, ..., X_n. The task is then to estimate the unmixing matrix W = A^{-1} such that

(2.3) Y = W X = W A S.

It is known that we can recover S only up to a scaling and permutation.

It can be shown that maximizing independence of the source distributions is equivalent to minimizing the Kullback-Leibler (KL) divergence between the joint density of the sources and the product of the marginal densities:

(2.4) I(Y) = D_KL( p(Y) ‖ ∏_{i=1}^n p(Y_i) ) = ∑_{i=1}^n H(Y_i) − H(Y),

where H(Y) is the entropy of Y. Notably, the KL divergence between two distributions P and Q is 0 iff P = Q, so this objective is 0 iff all of the Y_i are statistically independent, and minimizing it is firmly grounded.

Before describing functional PCA, some notation and basic functional analysis is required.

^4 Here, we adopt the notation [k] = {1, 2, ..., k}.

2.2 Functional representation  Let t ∈ [0, T] and X be a set of n functionals {X_1(t), ..., X_n(t)}, where X_i(t) ∈ L². L² is a Hilbert space with inner product defined as the integral of the product of two functionals:

(2.5) ⟨f, g⟩ = ∫ f(t) g(t) dt.

If we let φ(t) be a set of m basis functions {φ_1(t), ..., φ_m(t)}, then we can represent each functional X_i(t) as a linear combination of the basis functions

(2.6) X_i(t) = ∑_{j=1}^m β_{i,j} φ_j(t).

The choice of basis is a non-trivial task. For our analysis we have used a flexible cubic B-spline basis, so that each φ_j(t) ∈ C²_0(R).^5 Using a nonparametric basis may offer a representation that captures functional variability in the right places, but the simplicity of our choice of basis makes the analysis cleaner and also is very computationally efficient.

^5 C²_0(R) is the class of functions on R that have compact support and are continuous up to the second derivative.
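As a concrete illustration of this representation, the following Python sketch builds a cubic B-spline basis with scipy and fits the coefficients β_{i,·} of one sampled curve by least squares. The knot layout, basis size, and test curve are illustrative choices, not the paper's exact settings.

import numpy as np
from scipy.interpolate import BSpline

k = 3                                # cubic B-splines
t_grid = np.linspace(0.0, 1.0, 200)  # sampling grid for one observed curve
n_basis = 12                         # m, the number of basis functions

# Open uniform knot vector: endpoints repeated so that
# len(knots) - k - 1 == n_basis.
interior = np.linspace(0.0, 1.0, n_basis - k + 1)
knots = np.concatenate(([0.0] * k, interior, [1.0] * k))

# Evaluate each basis function phi_j by handing BSpline a unit
# coefficient vector e_j; columns of Phi are phi_j sampled on t_grid.
Phi = np.column_stack([
    BSpline(knots, np.eye(n_basis)[j], k)(t_grid) for j in range(n_basis)
])

# A noisy sampled functional observation X_i(t).
rng = np.random.default_rng(1)
x = np.sin(2 * np.pi * t_grid) + 0.1 * rng.standard_normal(t_grid.size)

# Least-squares coefficients: X_i(t) ~ sum_j beta_j phi_j(t), as in (2.6).
beta, *_ = np.linalg.lstsq(Phi, x, rcond=None)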

Having established the necessary functional notation, we describe functional PCA; we then show FPCA to be a close analog and the first step of FuncICA.

2.3 Functional PCA  FPCA, introduced by Ramsay and Silverman [5, 6], is in its unregularized form equivalent to the Karhunen-Loeve expansion. Ramsay and Silverman developed an optimally smoothed FPCA to minimize reconstruction error; we discuss one method in Section 4. Via the derivation of FPCA that follows, we hope to convey that FPCA is conceptually equivalent to PCA in a basis function space. This interpretation will ease understanding of FuncICA as well.

For a functional random variable X(t), consider the covariance function

(2.7) σ(s, t) = E[X(s) X(t)].

We decompose σ(s, t) in terms of eigenfunctions ξ_j:

(2.8) σ(s, t) = ∑_{j=1}^∞ λ_j ξ_j(s) ξ_j(t),

where λ_j is the jth eigenvalue of the covariance function.

Estimating the sample eigenfunctions ξ_j(s) is equivalent


    to the following optimization problem:

maximize ∫∫ ξ_j(s) σ(s, t) ξ_j(t) ds dt,

subject to ∫ ξ_j(t)² dt = 1 and ∫ ξ_i(t) ξ_j(t) dt = 0 for i < j.

If we let f be a vector of coefficients for the basis expansion of ξ(t) in terms of the basis functions φ_j(t), then

(2.9) ξ(t) = ∑_{j=1}^m f_j φ_j(t).

Further, for A a coefficients matrix for the basis functions over the data, we have

(2.10) X_i(t) = ∑_{j=1}^m A_{i,j} φ_j(t).

Let L be the basis function inner product matrix such that

(2.11) L_{i,j} = ∫ φ_i(t) φ_j(t) dt.

Then the principal component score for X_i(t) is

(2.12) ∫ ξ(t) X_i(t) dt = A_{(i)} L^T f,

where A_{(i)} denotes the ith row of A. The PC scores for all of the functional data are then A L^T f.

Let V_{j,k} = (1/(n−1)) ∑_{i=1}^n (A_{i,j} − Ā_j)(A_{i,k} − Ā_k) induce the sample variance matrix of the basis coefficients. Our objective is to maximize with respect to f the objective function f^T L V L^T f.

In this form, f is the leading eigenvector. We can estimate the coefficients for the ith eigenfunction by using the Gram-Schmidt process and formulating the objective function using the projected data.
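In the basis coefficients, this is a small generalized symmetric eigenproblem: maximizing f^T L V L^T f under the unit-norm constraint ∫ ξ(t)² dt = f^T L f = 1 gives L V L^T f = λ L f. A minimal Python sketch, assuming A and L have been computed as above:

import numpy as np
from scipy.linalg import eigh

def fpca_coefficients(A, L, n_components):
    # A: (n, m) basis coefficients of the n observed functionals.
    # L: (m, m) Gram matrix, L[i, j] = <phi_i, phi_j>.
    V = np.cov(A, rowvar=False)    # (m, m) sample covariance of coefficients
    M = L @ V @ L.T                # objective matrix: maximize f^T M f
    # eigh(M, L) solves M f = lam * L f with normalization f^T L f = 1,
    # which is exactly the constrained maximization above.
    lam, F = eigh(M, L)
    order = np.argsort(lam)[::-1]  # eigh returns eigenvalues in ascending order
    return F[:, order[:n_components]], lam[order[:n_components]]

Successive columns of the returned matrix give the coefficient vectors of successive eigenfunctions, equivalent to the Gram-Schmidt deflation described above.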

3 FuncICA

3.1 FuncICA derivation.  A functional version of ICA is equivalent to the original ICA formulation with A now being the linear operator

(3.13) A : L² → L²,

because both the functional observations X_i(t) and the independent functionals S_j are in L². This linear operator is somewhat like the analog to the A matrix in ICA, if A were permitted to take on an infinite number of rows and columns. As in the case of FPCA, to gain tractability we consider each functional observation X_i(t) by its expansion in terms of basis functions φ_j(t), such that

(3.14) X_i(t) = ∑_{j=1}^m β_{i,j} φ_j(t).

To compactly represent this information, let Φ be a linear map

(3.15) Φ : R^m → L².

To ease the understanding of this operator, we treat it as an analog to a matrix with an infinite number of columns. The jth row is then the functional φ_j(t). If A ∈ R^n × L² and B ∈ R^m × L², then each row of the matrices corresponds to a functional, and hence A B^T = C such that

(3.16) C_{i,j} = ∫ A_i(t) B_j(t) dt,

where A_i(t) is the ith row of A and B_j(t) is the jth row of B.

Using this notation, X = βΦ, where each row of X is a functional observation and β ∈ R^{n×m}. The principal component matrix is

(3.17) E = U^T X = U^T βΦ = ΓΦ, for Γ ∈ R^{m×m}.

For some linear operator Z : L² → L², let Z̃ denote the score matrix of Z over the data X. The principal component score matrix is then

(3.18) Ẽ = E X^T = (ΓΦ)(βΦ)^T = Γ Φ Φ^T β^T.

We can then apply a left linear transformation W to the principal component matrix to obtain some

(3.19) Y = W E = ΨΦ,

where Ψ = W Γ. The corresponding score matrix for the functionals in Y is then

(3.20) Ỹ = Y X^T = W E X^T = W Ẽ = W Γ Φ Φ^T β^T.

If we let Ỹ denote the target score matrix, where independence of the marginal distributions of Ỹ_i over the data X is maximized, then the problem of ICA reduces to the familiar setting of optimizing an independence objective function with respect to a finite matrix of parameters W (applied to Ẽ).


3.2 Algorithm  To optimize W, recall our objective function (2.4). After performing FPCA, we are left with

(3.21) I(Ẽ) = ∑_{i=1}^n H(Ẽ_i) − H(Ẽ).

Considering rotations of the functional data using the dual orthogonal decomposition provided by E, we consider left rotations W (such that Y = W E) that maintain uncorrelatedness of the distributions of the Ẽ_i while minimizing I(Ỹ). Reformulating I(Ỹ) in terms of Ỹ, W, and Ẽ, we have

(3.22) I(Ỹ) = ∑_{i=1}^n H(Ỹ_i) − H(Ỹ) = ∑_{i=1}^n H(Ỹ_i) − H(W Ẽ),

which by Theorem 9.6.4 of Cover and Thomas [7] is equal to

(3.23) ∑_{i=1}^n H(Ỹ_i) − H(Ẽ) − log(|W|).

To maintain the uncorrelatedness of the functionals, we restrict W to the set of orthogonal (rotation) matrices. Because rotation matrices have unity determinant, it is true that log(|W|) = 0. Discarding the fixed H(Ẽ) term, we are left with the new objective function

(3.24) ∑_{i=1}^n H(Ỹ_i).

Employing this objective is equivalent to direct entropy minimization, for which many existing ICA algorithms can be applied. Because of our unique domain of finding an optimal rotation of basis function coefficients, and also due to convincing source separation results for a variety of mixtures, we use the RADICAL ICA algorithm for optimizing W.

The core idea of RADICAL is to consider pairwise rotations of (Y_i, Y_j) that minimize H_i + H_j; hence, the optimization directly minimizes the sum of marginal entropies of pairwise source estimates. In order to estimate the entropy of some Y_i, one could use a nonparametric method that exploits the order statistics of univariate data to estimate entropy consistently [8]. RADICAL employs a modified, more robust version of this estimator. It is beyond the scope of the paper to give a thorough treatment of RADICAL. The interested reader is encouraged to read [9] for details^6.

^6 RADICAL is competitive with other recent algorithms [10], so our analysis is restricted to RADICAL.
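To make the rotation search concrete, the following Python sketch combines a one-sided m-spacings entropy estimate in the spirit of [8] with a brute-force scan over Jacobi rotation angles for one pair of score vectors. It illustrates the idea only; RADICAL itself uses a smoothed, more robust estimator and a more careful search [9].

import numpy as np

def spacing_entropy(y, m=None):
    # One-sided m-spacings entropy estimate of a 1-D sample.
    n = y.size
    if m is None:
        m = int(round(np.sqrt(n)))      # standard spacing choice
    ys = np.sort(y)
    gaps = ys[m:] - ys[:-m]             # order-statistic spacings
    gaps = np.maximum(gaps, 1e-12)      # guard against ties
    return np.mean(np.log((n + 1) / m * gaps))

def best_rotation(y1, y2, n_angles=150):
    # Scan rotation angles in [0, pi/2), minimizing H(Y1') + H(Y2').
    best_h, best_theta = np.inf, 0.0
    for theta in np.linspace(0.0, np.pi / 2, n_angles, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        h = spacing_entropy(c * y1 - s * y2) + spacing_entropy(s * y1 + c * y2)
        if h < best_h:
            best_h, best_theta = h, theta
    return best_theta

A full pass would sweep such pairwise rotations over all rows of the score matrix Ẽ, repeating until the objective (3.24) stops decreasing.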

This work would be far from complete without special considerations for the functional data. A caveat of source separation in the functional case is that the distributions estimated are heavily dependent upon the choice of functionals obtained by applying FPCA. Varying the level of smoothing appropriately can provide independent functionals that both have lower marginal entropies and better approximate the source distributions of interest.

    4 Optimal Smoothing

Individual functional observations often exhibit a large amount of spiky behavior due to observation error and small variations in the underlying stochastic processes, but the underlying processes of interest often evolve smoothly. This nuisance roughness motivates incorporating a smoothing step into the function extraction. Numerous methods exist for controlling roughness in extracted curves; the method employed here follows a quite general framework used by Ramsay and Silverman [6]. Instead of choosing eigenfunctions with the constraint

(4.25) ∫ ξ(t)² dt = 1,

we impose the alternate, roughness-penalizing constraint

(4.26) ∫ ξ(t)² dt + α ∫ (L ξ(t))² dt = 1,

where α ≥ 0 and L is a linear combination of differential operators such that

(4.27) L = ∑_{i=0}^m a_i D^i.

Note that the goodness of a particular penalty operator L is likely domain-dependent. One simple choice of L that we adopt for our experiments is L = D², a penalty on the norm of the second derivative. This choice yields the new constraint

(4.28) ∫ ξ(t)² dt + α ∫ (D² ξ(t))² dt = 1,

for α ≥ 0. Optimal selection of L is also a worthy problem to tackle; however, our results suggest that D² is sufficient for the data tested here.
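In the basis representation, the penalized constraint (4.28) becomes f^T (L + α R) f = 1 with R_{j,l} = ∫ D²φ_j(t) D²φ_l(t) dt, so the smoothed FPCA step solves L V L^T f = η (L + α R) f in place of the unpenalized eigenproblem. Below is a sketch of computing R for the B-spline basis from the earlier sketch; the trapezoidal grid size is an arbitrary choice.

import numpy as np
from scipy.interpolate import BSpline

def second_derivative_penalty(knots, k, n_basis, n_quad=2000):
    # R[j, l] = integral over [0, 1] of phi_j''(t) * phi_l''(t) dt.
    t = np.linspace(0.0, 1.0, n_quad)
    # Second derivative of each basis function, sampled on the grid.
    D2 = np.column_stack([
        BSpline(knots, np.eye(n_basis)[j], k).derivative(2)(t)
        for j in range(n_basis)
    ])
    # Trapezoidal quadrature of all pairwise products at once.
    w = np.full(n_quad, t[1] - t[0])
    w[0] *= 0.5
    w[-1] *= 0.5
    return D2.T @ (w[:, None] * D2)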

4.1 Objective for optimization  A key inquiry in this work is to identify cases where it is beneficial to choose a smoothing parameter other than the L²-optimal reconstruction error parameter α_P. α_P can be selected by minimizing reconstruction error via leave-one-out cross-validation (LOOCV), where in each epoch


one curve is the test set and the remaining curves are the training set [5]. Because our goal is to extract independent functions that maximally depart from Gaussianity (minimal marginal entropy), we conjecture that smoothing is beneficial only when Gaussian sources exist in the data and those sources are sufficiently rougher than the non-Gaussian sources. An example of this would be a high-frequency harmonic source with amplitude normally distributed over the observed functionals. The essence of our approach is to assign a FuncICA result (a set of independent functionals) a score that penalizes the result increasingly as the sources become more Gaussian.

The negentropy of a unit-variance random variable Y_i is defined as

(4.29) J(Y_i) = H(N(0, 1)) − H(Y_i).

Note that J(Y_i) is non-negative because for a fixed variance the Gaussian distribution has maximal entropy [11].

    Our objective function is then

(4.30) Q(Y) = ∑_{i=1}^p 1 / J(Y_i).

This objective heavily penalizes nearly-Gaussian sources and mostly ignores sources with high negentropy.

The algorithm for optimizing α is shown in Algorithm 1. The idea is to optimize Q starting at α_P, the optimal α chosen by FPCA, and to slowly increase α as long as Q decreases. Choice of the step parameter γ (for γ > 1) can be decided by an optimizer.

Algorithm 1

1: Minimize FPCA LOOCV error to find α_P.
2: Q(0) = ∞
3: α(1) = α_P
4: τ = 0
5: repeat
6:   τ = τ + 1
7:   Y = FuncICA(X, α(τ))
8:   Q(τ) = ∑_{i=1}^p 1 / J(Y_i)
9:   α(τ+1) = γ α(τ)
10: until Q(τ) > Q(τ−1)
11: return α(τ−1)
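A direct Python transcription of Algorithm 1 might look as follows. Here extract_components stands in for the FuncICA extraction at a given α (a placeholder, not an API from the paper), entropy_estimate is any consistent estimator such as the spacings estimator sketched earlier, negentropy uses H(N(0,1)) = (1/2) log(2πe), and the multiplicative step and the small seed used when α_P = 0 are illustrative choices.

import numpy as np

def negentropy(y, entropy_estimate):
    # J(Y) = H(N(0,1)) - H(Y) for a unit-variance sample y, as in (4.29).
    return 0.5 * np.log(2.0 * np.pi * np.e) - entropy_estimate(y)

def optimize_alpha(X, alpha_p, extract_components, entropy_estimate, gamma=2.0):
    # Grow alpha from the FPCA-optimal alpha_P while Q keeps decreasing.
    alpha = alpha_p if alpha_p > 0 else 1e-10   # arbitrary seed when alpha_P = 0
    q_prev = np.inf
    while True:
        Y = extract_components(X, alpha)        # rows: IC score vectors
        q = sum(1.0 / negentropy(y, entropy_estimate) for y in Y)
        if q > q_prev:
            return alpha / gamma                # alpha at the last decrease of Q
        q_prev, alpha = q, alpha * gamma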

    5 Previous Temporal ICA Methods

In this section, we frame FuncICA in the context of other temporal ICA methods that also exploit the temporal structure of the data. Two important points to note are:

1. These methods solve a problem different from the one solved by FuncICA.

2. The methods make use of temporal dependence in a way that is inconsistent with the problem solved by FuncICA.

Convolutive mixing is a variation of the instantaneous linear mixing model, where the goal is to isolate an audio source signal from echoey microphone recordings. A closely related problem is source separation where the sources exhibit temporal structure [12, 13]; a simple example would be if a source signal had significant autocorrelation and the cross-correlation among the source signals, by virtue of their independence, is zero. The family of methods for solving these problems attempts to recover independent components whose activations vary independently over time, and hence the temporal structure of the problem can be used by minimizing the cross-correlation between the observed signals. In the dual, the independent components are time series and the goal is to attain statistical independence of their activation over the functional observations. In this setting, the cross-correlation between signals is not of interest, because the independent components themselves are unconstrained; it is only their activation over the functional data that is optimized to be statistically independent.

Several ICA methods attempt to recover dominant independent components [14, 15], with a goal similar to our goal of discovering time series patterns. A common strategy here is to first do ICA on whitened data that has not been dimensionality-reduced, and then to use some properties of the components to select the most dominant components. For n signals, this becomes intractable in the primal formulation, due to O(n³) scaling properties for ICA algorithms. Additionally, by using the primal interpretation of statistical independence, the recovered time series patterns are not guaranteed to have statistically independent activation over the time series observations.

In contrast, FuncICA's computational complexity is limited by the number of basis functions it uses to represent each observed time series, rather than the number of observed time series. Further, by using FPCA prior to optimizing the independence objective, the dimensionality of the problem is further reduced. As a result, the algorithm's computational complexity is linear in n, and we are able to show results of FuncICA for datasets containing a very high number of signals; our synthetic dataset consists of 10⁴ time series in which we mine for independent component curves. Hence, from a practical standpoint, we can flexibly control the computation by reducing the number of


basis functions used in our representation or by being more aggressive in the FPCA dimensionality reduction step^7. We now present results of FuncICA on synthetic and real-world brain data.

6 Experiments and Results

We tested FuncICA's performance on synthetic data and two real-world data sets. Our goal is to demonstrate that FuncICA is capable of extracting highly interesting components that sparsely encode the most relevant information for functional data sets, and further to contrast the performance of FuncICA with FPCA and the non-functional dual of temporal ICA. We first discuss results on synthetic data as validation of the independent curve extraction and smoothing methods. We then present scientifically intriguing results for feature extraction and classification in EEG and gene expression data. Note that in this section, when we refer to ICA, we mean the standard (non-functional) dual of temporal ICA.

6.1 Synthetic data  For our experiments with synthetic data, we created 2 independent functions by using orthogonal functions with unit norm. These consisted of

(6.31) S_1(t) = √2 sin(10πt) and S_2(t) = √2 cos(10πt),

sampled at 1000 points with uniform spacing. We generated 10⁴ realizations of 2 IID Laplace random variables Z_1 and Z_2 with zero mean and unit variance (Z_i ∼ Laplace(0, 1/√2)). The sources were mixed to form 10⁴ observations X_i for i ∈ [10⁴]:

(6.32) X_i = Z_{1,i} S_1 + Z_{2,i} S_2,

where Z_{i,j} represents the jth realization of Z_i.
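The synthetic data are straightforward to regenerate; a minimal numpy sketch (the Laplace scale 1/√2 gives unit variance):

import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 1000)              # 1000 uniformly spaced samples

# Orthonormal sources on [0, 1] under the L2 inner product (6.31).
S1 = np.sqrt(2.0) * np.sin(10 * np.pi * t)
S2 = np.sqrt(2.0) * np.cos(10 * np.pi * t)

# 10^4 IID Laplace activations: zero mean, unit variance (scale 1/sqrt(2)).
Z = rng.laplace(loc=0.0, scale=1.0 / np.sqrt(2.0), size=(2, 10_000))

# Mixing (6.32): each row of X is one functional observation X_i.
X = Z[0][:, None] * S1 + Z[1][:, None] * S2  # shape (10000, 1000)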

Source recovery accuracy  We ran FuncICA on multiple data sets generated as described above; 120 cubic B-spline basis functions were used, and α was set to zero since no observation noise or other roughness was added to the data.

In order to rigorously evaluate source recovery performance, we define a measure of accuracy to be the minimum inner product among the source functionals' inner products with their best-fit independent functionals; more formally, accuracy is

(6.33) min{ ⟨S_1, argmax_{Y_i} ⟨S_1, Y_i⟩⟩, ⟨S_2, argmax_{Y_j} ⟨S_2, Y_j⟩⟩ }.

^7 Using FPCA for dimensionality reduction is preferred. Choice of the number of basis functions appears to be subjective, while FPCA minimizes a clear objective.
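On the sampling grid, (6.33) reduces to discrete inner products; a sketch follows (trapezoidal approximation of ⟨·,·⟩ on a uniform grid; the sign-insensitive match is a small illustrative variation, since ICA recovers sources only up to sign and scale):

import numpy as np

def l2_inner(f, g, t):
    # Trapezoidal approximation of <f, g> = integral of f(t) g(t) dt.
    dt = t[1] - t[0]
    prod = f * g
    return dt * (prod.sum() - 0.5 * (prod[0] + prod[-1]))

def recovery_accuracy(sources, recovered, t):
    # Min over sources of the inner product with the best-fit recovered curve.
    scores = [max(abs(l2_inner(s, y, t)) for y in recovered) for s in sources]
    return min(scores)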


Figure 3: S_1(t) and S_2(t) are shown in red and green respectively. IC_1(t) and IC_2(t) are the black dotted lines. PC_1(t) and PC_2(t) are the blue and magenta lines respectively. The independent functionals are near-perfect replicas of the sources, while the principal component functions are phase-shifted from the sources.

As shown by a typical recovery result in Figure 3, FuncICA achieved near-perfect source recovery, while FPCA nearly always recovered a phase-shifted version of the sources. The mean accuracy of FuncICA over 5 realizations of the synthetic data was 0.9999, while FPCA's mean accuracy over the same 5 realizations was only 0.9051. The reason for the discrepancy is that FPCA tends to select more Gaussian source distributions as a result of minimizing reconstruction error. This is evidenced from the entropy estimates (using Vasicek's entropy estimator [8]) for the 2 PC distributions being 1.3963 and 1.3610 respectively, while entropy estimates for the two IC distributions were lower at 1.3647 and 1.3296 respectively, yielding a difference of the sums of the marginal entropies of 0.0629.

Robustness with respect to α  Under two Laplace distributions mixed with an additive high-frequency Gaussian component, that is, a functional form of sin(40πt) with linear activation through the data varying as N(0, σ²), FuncICA and FPCA perform very differently when α = 0. While FuncICA separates the Laplace-distributed sources well, FPCA extracts 4 functionals that all have significant high-frequency harmonics. As α was increased, the high-frequency harmonic component was driven into the 3rd principal component (PC) curve, with negligible change to the extracted IC curves. FPCA therefore seems to be more sensitive to



Figure 4: All plots are for the synthetic dataset. Top, the regularization parameter α plotted versus Q. Note the minimum at α = 1×10⁻⁹ and the spurious maximum at α = 3×10⁻⁹. Bottom, accuracy is high in close proximity to α = 1×10⁻⁹, with sharp drops as α increases above 2×10⁻⁹ or decreases below 5×10⁻¹⁰.

regularization in cases where the components exhibit variable amounts of smoothness that can inform the separation process, while FuncICA requires no regularization here.

Effect of α on Q and accuracy  Additionally, FuncICA was able to extract the Laplacian-distributed components from mixtures of two high-frequency Gaussian sources and two Laplacian sources. The two Laplacian sources are as described before, and the two Gaussian sources are sin(40πt) and cos(40πt), again distributed according to N(0, σ²). We applied Algorithm 1 to automatically find an optimal value for the regularization parameter α by increasing α, starting from α_P = 0 in this case, until Q reaches a minimum. Figure 4 shows the effect of smoothing on Q and accuracy.

The left plot in Figure 4 shows that Q reaches a minimum at α_I = 1×10⁻⁹. Looking at the right plot, the accuracy measure is quite close to the optimal value for this choice of α, and for a small window around α_I, accuracy remains high. Significant departure from α_I results in large drops in accuracy. The intuition behind this result is that higher values of Q indicate independent functionals that are more Gaussian distributed, and the rest follows from the well-known result that Gaussian sources impede source recovery of non-Gaussian sources. Minimizing Q serves to dampen the high-frequency Gaussian components and impede their recovery, but we are careful to stop the search for the optimal α once Q hits a local minimum, because we otherwise risk finding another local minimum at a higher α where the non-Gaussian sources are also dampened. Although not shown in the plots, another minimum of Q occurs at α = 1×10⁻⁵, but the accuracy there is substantially lower (close to 0.92) because the non-Gaussian sources S_1(t) and S_2(t) have been dampened. Having validated Algorithm 1 for optimization of α_I on synthetic data, we progress to the real-world results.

6.2 Microarray gene expression data.  The microarray data obtained by [16] is of α-factor^8 synchronized temporal gene expression for the yeast cell cycle, with observations of 6178 genes at 18 times in 7-minute increments. Of interest in this domain is identification of co-regulated genes related to specific cell cycle phases. Among the phases G1, S, S/G2, G2/M, and M/G1, we focus on discriminating between genes related to G1 versus non-G1 phase regulation. The motivation for this comparison is in part due to [17], who present classification results using FPCA.

Notably, Liebermeister [18] explored gene expression with this data set using the primal form of temporal ICA. Because Liebermeister did not use a functional representation, the analysis was able to consider each gene's expression (at discrete time indices) over multiple cell cycles, with each cell cycle being synchronized by a different method. By contrast, our functional representation currently prohibits use of multiple curves per observation, and as a result we (in agreement with [17]) restrict our analysis to cell cycle gene expression synchronized by α-factor mating pheromone. Though this outcome may seem less preferable, a functional representation facilitates natural methods for incorporating genes that have missing observations at a few time indices. Use of a nonzero α is not necessary here, because the sparse temporal sampling of the gene expression process causes the spline-fitting stage to eliminate most of the roughness Algorithm 1 would otherwise remove.

    Results For the discrimination task, we applied

^8 An unfortunate intersection of notation has occurred. Note that α-factor refers to a mating pheromone within biology, which is completely separate from our α, which refers to a regularization parameter.


FuncICA to data from a 5672-gene subset corresponding to all genes where the α-factor synchronized data is missing at most one value. We tested the relevance of the extracted IC curves to a classification task involving 48 genes related to G1 phase and 50 genes related to non-G1 phases (identified by traditional methods by [16]) that have data available and are missing no more than one time-indexed observation. Missing values are handled automatically by the cubic B-spline fitting process. We measure the quality of a set of features according to performance of a support vector machine (SVM) [19] with Gaussian kernel, and we use leave-one-out cross-validation to select both a suitable SVM regularization parameter and kernel bandwidth.
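Our experiments used the SVM implementation of [19]; an equivalent evaluation protocol is easy to reproduce with present-day tooling. A sketch using scikit-learn as a stand-in, where scores is the hypothetical genes-by-components IC score matrix and labels the G1/non-G1 indicators:

import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.svm import SVC

def loocv_error(scores, labels):
    # LOOCV error of a Gaussian-kernel SVM, tuning the regularization
    # parameter C and the kernel bandwidth (gamma) as described above.
    grid = {"C": np.logspace(-2, 3, 6), "gamma": np.logspace(-3, 2, 6)}
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=LeaveOneOut())
    search.fit(scores, labels)
    return 1.0 - search.best_score_   # best_score_ is mean LOOCV accuracy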

SVM experiments were run using all 11 PC curve score features versus all 11 IC curve features. The PC features resulted in 9.2% error, while the IC features yielded 7.1% error. Removing 2 features determined to be detrimental from the IC feature set (IC curves 5 and 10) reduced error to 6.1%, whereas removing the last 3 PC curves yielded error of 8.2%. Unfortunately, ICA does not have a natural solution for missing data imputation, and so we cannot provide results comparing FuncICA to its non-functional version (RADICAL in this case) for this problem. Our conjecture for these results is that FuncICA extracts more meaningful features that are more easily exploited by the SVM classifier.

6.3 P300 event-related potential data.  The EEG data set obtained from [20] includes 15300 1-second observations, where for each time index in a given observation there are 64 voltage values corresponding to scalp electrical activity recorded during a brain-computer interface (BCI) experimental paradigm. During the experiment, a subject is attending to a particular letter on a screen containing a 6 by 6 grid of letters and numbers. Each row and column is flashed once in a random sequence, and the 1-second signal observations correspond to the data recorded after each flash. The BCI community is interested in identifying the row and column (the letter) to which a subject is attending.

Much previous work has characterized an event-related potential (ERP) known as the P300, due to its signature appearance as a positivity in the EEG approximately 300 ms following the presentation of an attended stimulus. In Figure 6, the empirical P300 waveform for this data set is shown; it was calculated by taking the mean of the 2550 trials in which the P300 is known to activate. While the P300 ERP exists in most if not all subjects, the actual timing and waveform exhibited vary to some degree by subject and various physiological factors affecting attention. Therefore, an

        FuncICA  ICA    FPCA   Empirical P300
column  64.3%    35.7%  37.5%  35.7%
row     55.4%    48.2%  37.5%  46.4%
letter  30.4%    17.9%  16.1%  17.9%

Figure 5: P300 classification accuracy for FuncICA, ICA, FPCA, and the empirical P300 waveform. The best accuracy attained in each row belongs to FuncICA.

automated process for estimating the waveform from an unlabeled set of trials will prove highly useful for classification problems where we wish to pick out the trials that flash an attended letter.

The problem is to decide which letter was attended to, given 15 trials of each row and column flashing. Classification ideally takes place from 1 trial of each row and column flashing, but in practice the low SNR makes a larger number of trials necessary. All of our experiments are done from recordings at the electrode Pz, with a common average reference (CAR) filter which instantaneously adjusts the value of the signal at an electrode by the average of all 64 channels. We did not apply any further spatial or frequency filtering. The result is a data set that is raw data with the exception of the CAR filter; poor results are obtained from EEG data without considerable filtering.
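The CAR filter itself is a one-line operation; a sketch for an array of trials (the trials × channels × samples layout is an illustrative convention):

import numpy as np

def common_average_reference(eeg):
    # eeg: (n_trials, n_channels, n_samples).
    # Subtract the instantaneous mean over all channels from every channel.
    return eeg - eeg.mean(axis=1, keepdims=True)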

Results  Our classification accuracy rates demonstrate recovery of the P300 waveform under different choices for the parameter α. The classifier was simply to pick the row/column that had the highest mean activation for the used feature. The results for the best component from FuncICA, FPCA, and ICA (for ICA, we use 256-dimensional data reduced by PCA to 12 dimensions) are summarized in Figure 5. The best results for a single IC curve were obtained with α = 10⁻⁷, with column, row, and letter classification accuracy of 64.3%, 55.4%, and 30.4% respectively. Note that letter classification requires correct row and column classification for 15 trials where the attended letter is constant. Figure 6 illustrates the IC curve most similar to the P300 for 2 values of α.
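The classification rule is equally simple; a sketch, where row_scores and col_scores are hypothetical arrays of IC-curve activations indexed by flashed row/column and trial:

import numpy as np

def predict_letter(row_scores, col_scores):
    # row_scores, col_scores: (6, n_trials) activations per flashed row/column.
    # Pick the row and the column whose mean activation over trials is largest.
    return row_scores.mean(axis=1).argmax(), col_scores.mean(axis=1).argmax()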

In Figure 8 we demonstrate the success of Algorithm 1 for this data set. The value α_I = 7.5×10⁻⁸ returned by the algorithm corresponds to a neighborhood of α values that provide high accuracy. The accuracy at α_I itself is quite close to better accuracy values in its neighborhood. We interpret these results as validation of our algorithm. To contrast FuncICA with FPCA, the best results for a single PC curve were substantially lower than the results for a single IC curve. As shown by Figure 7, this result is hardly surprising, as the curves extracted by FPCA are uninteresting harmonics. The extracted curves being harmonics also makes sense due



Figure 6: Top to bottom. (1) Empirical P300 waveform calculated from 2550 trials. (2) Closest IC to P300 for α = 5×10⁻⁸. (3) Closest IC to P300 for α = 1×10⁻⁷.


Figure 7: PC curves 1, 4, 7, and 10 plotted as a dashed red line, blue dotted line, green dashed line, and solid black line respectively. Note that the components are almost entirely described by harmonic behavior.

to the dominance of harmonics within EEG; hence an L²-loss-minimizing representation is less useful for identifying specific ERPs.

Particularly interesting is the performance of ICA for this problem. As shown in Figure 5, ICA performed quite poorly compared to FuncICA. We infer that the reason for this performance is a lack of smoothing used by ICA to extract its components; this inference is partially validated by the fact that FuncICA performs optimally on this problem with a nonzero choice of α. The most surprising result is that using the empirical P300 waveform as the component for classification results in substantially lower classification accuracy than using the best IC curve. A possible explanation for this result is the use of the empirical mean without employing any functional smoothing techniques. Nevertheless, the implication of the result is that FuncICA better captures the underlying event-related potential of interest when humans are presented with unexpected stimuli.

    7 Extension to fMRI data

We have shown promising results of the dual of temporal ICA for time series data; also of interest is spatial data, and beyond that, spatiotemporal data. Functional magnetic resonance imaging (fMRI) data consists of three-dimensional spatial recordings of neural activations, with as many as 5 such recordings a second when the number of acquired slices is limited (i.e., one of the spatial dimensions is limited). With the advent of fMRI


regularization not afforded by the primal forms applied to their original domains. As a consequence of this duality result, we have proposed a new method for mining fMRI data for spatiotemporal patterns whose activation is statistically independent over a large set of trials.

More pragmatically, our algorithm efficiently encodes information via IC curves that provide superior classification performance for identifying covarying sets of genes. For EEG data, FuncICA can extract the P300 event-related potential using unlabeled trials; previous work requires labeled trials to identify the P300. The method of ICA introduced here is applicable to a dual class of variation over sets of time series rather than variation within individual time series.

This work is a novel addition to the set of ICA tools that analyze data with temporal structure. An inherent issue is the unsolved problem of balancing reconstruction error and source separation. Without a priori knowledge of the nature of the sources, it is difficult to design a robust smoothing method. In general, claims can be made about reduction of high-frequency noise. We hope that researchers working with functional data, such as time series and spatial data, use FuncICA to gain new insights into countless domains.

    References

[1] A.J. Bell and T.J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129-1159, 1995.

[2] J.V. Stone, J. Porrill, C. Buchel, and K. Friston. Spatial, temporal, and spatiotemporal independent component analysis of fMRI data. In Proceedings of the 18th Leeds Statistical Research Workshop on Spatial-Temporal Modelling and Its Applications, pages 23-28. Leeds University Press, 1999.

[3] S. Choi, A. Cichocki, H.M. Park, and S.Y. Lee. Blind source separation and independent component analysis: A review. Neural Information Processing - Letters and Reviews, 6(1):1-57, 2005.

[4] M.J. McKeown, S. Makeig, G.G. Brown, T.P. Jung, S.S. Kindermann, A.J. Bell, and T.J. Sejnowski. Analysis of fMRI data by blind separation into independent spatial components. Human Brain Mapping, 6(3):160-188, 1998.

[5] J. Ramsay and B. Silverman. Functional Data Analysis. Springer, 2005.

[6] J. Ramsay and B. Silverman. Applied Functional Data Analysis: Methods and Case Studies. Springer, 2002.

[7] T.M. Cover and J.A. Thomas. Elements of Information Theory. Wiley, New York, 1991.

[8] O. Vasicek. A test for normality based on sample entropy. Journal of the Royal Statistical Society, Series B, 38(1):54-59, 1976.

[9] E.G. Learned-Miller et al. ICA using spacings estimates of entropy. JMLR, 4:1271-1295, 2003.

[10] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In ALT, volume 3734, pages 63-77. Springer, 2005.

[11] A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411-430, 2000.

[12] J.V. Stone. Blind source separation using temporal predictability. Neural Computation, 13(7):1559-1574, 2001.

[13] B.A. Pearlmutter, L.C. Parra, et al. Maximum likelihood blind source separation: a context-sensitive generalization of ICA. NIPS, 9:613, 1997.

[14] Y.M. Cheung and L. Xu. An empirical method to select dominant independent components in ICA for time series analysis. In Neural Networks, 1999. IJCNN'99. International Joint Conference on, volume 6, 1999.

[15] J.M. Gorriz, C.G. Puntonet, M. Salmeron, and E.W. Lang. Time series prediction using ICA algorithms. In Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, 2003. Proceedings of the Second IEEE International Workshop on, pages 226-230, 2003.

[16] P.T. Spellman et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9(12):3273-3297, 1998.

[17] X. Leng and H.G. Müller. Classification using functional data analysis for temporal gene expression data. Bioinformatics, 22(1):68-76, 2006.

[18] W. Liebermeister. Linear modes of gene expression determined by independent component analysis. Bioinformatics, 18(1):51-60, 2002.

[19] T. Joachims. Making large-scale support vector machine learning practical. Advances in Kernel Methods: Support Vector Learning, 1999.

[20] B. Blankertz, K.R. Müller, et al. The BCI competition III: validating alternative approaches to actual BCI problems. IEEE Trans. Neural Sys. Rehab. Eng., 14(2):153-159, 2006.

[21] T.E. Nichols, J. Qi, E. Asma, and R.M. Leahy. Spatiotemporal reconstruction of list-mode PET data. IEEE Transactions on Medical Imaging, 21(4):396-404, 2002.
