



(ICNN&B'05 Invited Paper)

Sparse Representation and Associative Learning in Multisensory Integration

Liqing Zhang
Department of Computer Science and Engineering,

Shanghai Jiaotong University, Shanghai 200030, China
Email: zhang-lq@cs.sjtu.edu.cn

Abstract- In this paper, we discuss multisensory information representation and associative learning in multimodal integration. First, we provide a brief overview of the anatomical structure and neural information pathways of the multimodal cortices. Then we formulate the multimodal integration problem in a framework of statistical learning. An associative learning algorithm is proposed to adapt the representation matrix, and the sparseness of the representation is also adapted within the statistical learning model.

I. INTRODUCTION

Current theories of multisensory representation do not fully resolve the issues of neural representation and statistical inference involved in multisensory integration. In the multisensory brain cortices, signals that come from different sensory systems, and that may represent distinct aspects of the same physical object, are integrated. For example, people can locate an object using both the visual and auditory modalities, or recognize a word on the basis of its sound and the speaker's lip movements.

There are two basic issues in multimodal information integration: sensory information representation and multimodal information binding. Sensory information representation adapts sensory neural networks so that sensory information is represented efficiently as a spatial pattern. A number of theories deal with sensory information representation, such as sparse representation, redundancy reduction, and efficient coding. These theories mostly investigate the unimodal neural coding problem. Unimodal neural information representation is the first stage towards multimodal information processing. For example, the sound of a word and the image of a person's moving lips must be represented in their respective primary cortices before they can be integrated, because the initial sensory modalities do not use the same representations. The second issue is to integrate internal neural patterns from

different modalities, to make statistical inferences, and to perform associative learning to represent multimodal information. Statistical inference is necessary because sensory modalities are not equally reliable, and their reliability can vary depending on the context.

In this paper we focus on the multisensory representation and associative learning problem. First, we overview the state of the art of the anatomical structures of the brain in multimodal information processing. We also restrict ourselves to multisensory integration in the context of associative learning, specifically

to the computational issue of early concept learning using the visual and auditory modalities.

II. MULTIMODAL ANATOMICAL STRUCTURES

In this section, we briefly introduce the basic knowledge

on the multimodal anatomical structures of the brain. We mainly discuss the neural information processing pathways of the visual and auditory systems. Furthermore, we discuss some neurophysiological evidence on the organization and structure of multimodal information integration.

A. Visual Pathways

Anatomical and physiological studies in monkeys on the

organization and structure of the visual cortex have shown that the monkey cortex contains more than 30 separate visual areas [5], organized into two functionally specialized processing pathways [13], [11]: a ventral 'what' pathway that identifies objects, and a dorsal 'where' pathway that locates objects and perceives their spatial positions.

Visual information enters the primary visual cortex via the lateral geniculate nucleus (LGN). Visual information then progresses along two parallel hierarchical streams. Cortical areas along the 'dorsal stream' (including the posterior parietal cortex; PPC) are primarily concerned with spatial localization and with directing attention and gaze towards objects of interest in the scene. The control of attentional deployment is consequently believed to take place mostly in the dorsal stream.

Cortical areas along the 'ventral stream' (including the inferotemporal cortex; IT) are mainly concerned with the recognition and identification of visual stimuli. Neurophysiological experiments suggest that neurons in area V4 and the inferior temporal cortex of the ventral stream show response selectivities for stimulus attributes that are important for object vision, such as shape, color, and texture [13].

Within the ventral stream, or object recognition pathway, the processing of information is largely hierarchical. For example, the processing of object features begins with simple spatial filtering by cells in V1, but by the time the inferior temporal cortex is activated, the cells respond selectively to global object features, such as shape, and some cells are even specialized for the analysis of faces [13]. Furthermore, the average receptive field (RF) size increases as one moves along the pathway toward the temporal lobe. It thus appears that



large RFs in later areas are built up from smaller ones inearlier areas.

For further details, refer to [10], [6].

B. Auditory Pathways

Neurophysiological experiments [7] suggest that a similar

organization exists in the primate auditory system: it resembles the visual system in having ventral and dorsal streams for temporal- and parietal-lobe processing of 'what' and 'where' for sounds, and in having two streams that contribute to functionally distinct regions of the frontal lobe. The auditory system processes both the identity and the location of the stimuli it receives. The initial signals from the two cochleas are conveyed to a complex network of pathways and nuclei in the brainstem and thalamus. Spectral and temporal information extracted by these structures is used to determine the identity and location of the sound source. Although the identification of 'what' can be made on the basis of input to one ear alone, the precise estimation of 'where' in three-dimensional space depends on comparison of the inputs to the two ears by specialized neural structures in the brainstem. Thus, the processing of 'what' and 'where' involves different structures and pathways even before the signals reach the auditory cortex. Yet cortical processing is important in both of these tasks.

C. Multimodal Integration

There also exist multimodal cortices that enhance

object location and object recognition. The spatial register of the different receptive fields of multisensory neurons in the superior colliculus (SC) plays a significant role in determining the responses of these neurons to cross-modal stimulus combinations [8]. Spatially coincident visual-auditory stimuli fall within these overlapping receptive fields and generally produce response enhancements that exceed the individual modality-specific responses and can exceed their sum.

It has not been clear how spatial coincidence is operationally defined. Given the large size of SC receptive fields, visual and auditory stimuli could be within their respective receptive fields even when there are substantial spatial disparities between them. Specifically, multisensory response enhancements become progressively weaker as the within-field visual and auditory stimuli become increasingly disparate.

Recent neuroscience experiments [9] have shed new light on the function of the superior temporal cortex, suggesting that the rostral part of the superior temporal cortex acts as an interface between the dorsal and ventral streams of visual input processing, allowing the exploration of both object-related and space-related information. The superior temporal cortex is also involved in processing species-specific vocalizations.

III. STATISTICAL MODELS

Now we discuss the statistical issues involved in multisensory integration. The statistical issue arises because the sensory signals are typically corrupted by noise, either because of the stimulus itself or because of neural noise within the central

nervous system. Given this uncertainty, one can only estimate the location and identity of an object from the evoked sensory activity using a statistical approach.

The challenge is how to obtain the best possible estimate,

given the available data. When multiple sources of information are available, the redundancy between them can be used to refine the estimate. In particular, the most likely location of an object can be computed given all the available multimodal observations. This is known as a maximum-likelihood estimate and is optimal for the problem we are considering, in the sense that it leads to an unbiased and efficient estimate.
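As a concrete illustration (ours, not part of the paper's model), consider two conditionally independent Gaussian cues for the same location. The maximum-likelihood estimate then reduces to an inverse-variance-weighted average, so the more reliable modality dominates and the fused variance never exceeds that of either cue. A minimal Python sketch, with illustrative variances:

def ml_fuse(x_visual, var_visual, x_auditory, var_auditory):
    """Inverse-variance-weighted (maximum-likelihood) fusion of two cues."""
    w_v, w_a = 1.0 / var_visual, 1.0 / var_auditory
    estimate = (w_v * x_visual + w_a * x_auditory) / (w_v + w_a)
    fused_var = 1.0 / (w_v + w_a)  # always <= min(var_visual, var_auditory)
    return estimate, fused_var

# A reliable visual cue (variance 1.0) pulls the estimate toward itself
# when combined with a noisy auditory cue (variance 4.0).
x_hat, var_hat = ml_fuse(0.2, 1.0, 1.0, 4.0)
print(x_hat, var_hat)  # 0.36, 0.8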

Early concept learning is a good example of multimodal neural information integration. Previous work on early language acquisition has shown that word meanings can be acquired by an associative procedure that maps perceptual experience onto linguistic labels based on cross-modal observation. To establish a computational model for this idea, we introduce a statistical learning mechanism that provides a formal account of cross-modal observation. The main strategy used in this paper is to represent multimodal neural information efficiently in the statistical learning framework.

What kinds of learning mechanisms underlie language ac-

quisition? One of the central problems concerns whether the innate or the environmental contribution plays the vital role in language development. Learning-based theories hold that language is learned and that the child's environment plays a crucial role [15]. There is growing evidence that babies do possess powerful statistical learning mechanisms [14].

The theory of statistical learning suggests that language

acquisition is a statistically driven process in which young language learners utilize the lexical content and syntactic structure of speech, as well as non-linguistic contextual information, as input to compute distributional statistics.

The statistical learning mechanism is more general than the

one dedicated solely to processing linguistic data. Furthermore, statistical language learning includes not only statistical computations to identify words in speech but also algebraic-like computations to learn grammatical structures. In the study of early concept learning, associationism holds that concept acquisition is based on statistical learning of co-occurring data from the linguistic modality and the nonlinguistic context [12]. Concept learning is initially a process in which children's attention is captured by the objects or actions that are the most salient in their environment, and then used to associate those objects or actions with acoustic patterns. Plunkett [12] developed a connectionist model of vocabulary development to associate preprocessed images with linguistic labels. The linguistic behavior of the network can mimic the well-known vocabulary spurt based on small continuous changes in the connection strengths within and across the different processing modalities in the network.

The statistical and associative theory suggests that the

child's sensitivity to spatio-temporal contiguity is sufficient for concept learning, as postulated by associationist models of language acquisition and supported by computational implementation.



Although no one doubts this process, little research has addressed the computational mechanism and modelling of cross-modal binding. We introduce a theoretical model of statistical concept learning that provides a probabilistic framework for encoding multiple sources of information. Given multiple scenes paired with spoken words collected from natural interactions, the model is able to compute the association probabilities of all the possible word-meaning pairs. The purpose of this study is to show quantitatively the effects of statistical cross-modal observation through computational modelling. In early concept learning, children need to start by pairing spoken words with the co-occurring possible referents, collecting multiple such pairs, and then figuring out the common elements.
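To make this pairing-and-comparison step concrete, the following toy Python sketch (our illustration; the paper's actual model is the neural one of Section IV) estimates word-referent association probabilities from co-occurrence counts over paired scenes and spoken words. All names and the credit-spreading rule are hypothetical:

from collections import defaultdict

def association_probabilities(episodes):
    """episodes: list of (words, referents) pairs observed together."""
    pair_counts = defaultdict(float)
    word_counts = defaultdict(float)
    for words, referents in episodes:
        for w in words:
            word_counts[w] += 1.0
            for r in referents:
                # spread credit over the candidate referents in the scene
                pair_counts[(w, r)] += 1.0 / len(referents)
    # association probability P(referent | word) by normalization
    return {(w, r): c / word_counts[w] for (w, r), c in pair_counts.items()}

episodes = [({"dog", "look"}, {"DOG", "BALL"}),
            ({"dog", "run"}, {"DOG"}),
            ({"ball"}, {"BALL"})]
probs = association_probabilities(episodes)
print(probs[("dog", "DOG")])  # 0.75, versus 0.25 for ("dog", "BALL")

After only two pairings, the common element DOG dominates the ambiguous first scene, which is the figuring out of common elements in miniature.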

IV. LEARNING ALGORITHM

This section discusses the computational principle of multimodal neural representation and the field-mapping association between different modalities.

Associative learning mechanisms provide a perceptual model for binding neural information from different modalities. Following repeated presentation of visual information together with spatiotemporally coincident auditory information, we attempt to increase the probability of association between the multimodal sources of information. In this paper, we do not directly associate neural information from different modalities; rather, we attempt to represent the multimodal neural information efficiently in the multimodal cortex.

The multimodal associative learning is based on the internal

representations of the different modalities. In this paper, we are not concerned with the learning and representation of the unimodal systems. Assume that the inputs to the multimodal model are the outputs of the internal representations of the different unimodal models. We focus on a computational model of the object recognition pathway to deal with the early concept learning paradigm. Suppose that the state field of the multimodal area is denoted by z(r,t) at position r, a two-dimensional variable. The input from the visual field is denoted by X_v(r,t) and its connection weights are denoted by W_v(r,\xi); the input from the auditory field is denoted by X_a(r,t) and its connection weights are denoted by W_a(r,\xi). The multimodal neural information representation is described by the following equation

\tau \frac{d z(r,t)}{dt} = -z(r,t) + \int W_a(r,\xi) X_a(\xi,t)\, d\xi + \int W_v(r,\xi) X_v(\xi,t)\, d\xi - \int W_z(r,\xi) z(\xi,t)\, d\xi,     (1)

where the integrals are performed over the whole neural field. The last term represents the lateral inhibitory connections in the multimodal cortex; W_z(r,\xi) is the connection weight between the neuron at position r and the neuron at position \xi. The purpose of the learning process is to establish an associative map between the neural representations of the visual and auditory fields.
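As a sanity check of the dynamics (1), a minimal Python sketch follows; the discretization, grid size, time step, and random kernels are our illustrative assumptions, not the paper's. The field is flattened so that each integral becomes a matrix-vector product:

import numpy as np

n = 64                               # number of field positions after flattening
tau, dt = 10.0, 1.0                  # time constant and Euler step (arbitrary units)
rng = np.random.default_rng(0)

W_a = rng.normal(0.0, 0.1, (n, n))   # auditory feedforward weights W_a(r, xi)
W_v = rng.normal(0.0, 0.1, (n, n))   # visual feedforward weights W_v(r, xi)
W_z = 0.05 * np.ones((n, n))         # lateral inhibition W_z(r, xi), uniform here

z = np.zeros(n)
for t in range(200):
    x_a = rng.normal(size=n)         # stand-in for the auditory input X_a
    x_v = rng.normal(size=n)         # stand-in for the visual input X_v
    dz = -z + W_a @ x_a + W_v @ x_v - W_z @ z
    z += (dt / tau) * dz             # Euler integration of Eq. (1)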

Fig. 1. The framework of multimodal associative learning

As with the sparse representations in the primary cortices, multimodal associative learning can also be formulated in the framework of statistical learning. The key issue is how to define the objective function for the associative learning. In this paper, we use the mutual information of the multimodal neural field z(r,t) as the objective function. The learning process is to make the multimodal neural representation as efficient as possible. As discussed for the primary visual cortex, sparseness is one of the computational criteria found in neurophysiological experiments.

To simplify the learning model, we first neglect the lateral inhibitory term and the dynamical term; the multimodal representation model then simplifies to

z(r,t) = \int W_a(r,\xi) X_a(\xi,t)\, d\xi + \int W_v(r,\xi) X_v(\xi,t)\, d\xi.     (2)

In principle, an independent component analysis (ICA) algorithm can now be applied to train the above model. It should be noted, however, that the standard ICA algorithm cannot be applied to the model as it stands: since the inputs of the model form a two-dimensional matrix series, we have to reformulate the representation model as a local ICA model.

Denote X(t) = (vec(X_a(r,t))^T, vec(X_v(r,t))^T)^T and Z(t) = vec(z(r,t)), where the operator vec stacks a matrix into a column vector. Then the multimodal representation model is transformed into an over-determined ICA model

Z(t) = W X(t),     (3)

where W is a sparse matrix, which is referred to as the representation matrix. Assume that q(Z) is the joint probability density function of Z and q_i(Z_i) is the marginal probability density function of Z_i, for i = 1, \ldots, N.
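A small Python sketch of the vectorization behind Eq. (3) follows; the dimensions are illustrative assumptions, and the rectangular shape of W matches the over-determined setting of [16]:

import numpy as np

def vec(m):
    """Stack a matrix into a column vector, column by column."""
    return m.flatten(order="F")

Xa = np.random.randn(8, 8)               # auditory field at time t
Xv = np.random.randn(8, 8)               # visual field at time t
X = np.concatenate([vec(Xa), vec(Xv)])   # X(t), here of length 128

N = 64                                   # output dimension below input dimension
W = np.random.randn(N, X.size)           # rectangular representation matrix
Z = W @ X                                # Eq. (3): Z(t) = W X(t)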



If the random variables Z_1, \ldots, Z_N are mutually independent, we have

q(Z) = \prod_{i=1}^{N} q_i(Z_i).     (4)

Now we introduce the mutual information rate I(Z) among a set of stochastic processes Z_1, \ldots, Z_N as

I(Z) = -H(Z) + \sum_{i=1}^{N} H(Z_i),     (5)

which measures the mutual independence of the stochastic processes Z_1, \ldots, Z_N, where H(Z) and H(Z_i) are the entropies of Z and Z_i, respectively. Since the probability density functions q(Z) and q_i(Z_i) are unknown in the multimodal representation model, we need to simplify the cost function in order to obtain a learning algorithm from minimizing it. The remaining problem becomes how to evaluate the first term of the cost function (5).
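The standard step, sketched here for completeness (following [3], [4]; for rectangular W the log-determinant form is the approximation used in [16]), is that for Z = WX the joint entropy satisfies H(Z) = H(X) + \tfrac{1}{2}\log|\det(WW^T)|, while H(Z_i) = -E[\log q_i(Z_i)]. Substituting into (5) and dropping the W-independent term H(X) gives

I(Z) = -\tfrac{1}{2}\log\left|\det(WW^T)\right| - \sum_{i=1}^{N} E[\log q_i(Z_i)] + \mathrm{const},

whose W-dependent part is the cost function (6) below.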

Following the same procedure used in deriving the cost function for ICA learning [4], [3], [17], we obtain a simplified cost function for multimodal sparse representation:

L(W) = -\frac{1}{2} \log\left|\det(WW^T)\right| - \sum_{i=1}^{N} E[\log q_i(Z_i)].     (6)

The ICA model can be formulated in the framework of semiparametric statistical models. A semiparametric statistical model usually contains two types of parameters, which are not equally important in finding the solution. The parameters that directly influence the solution are referred to as the parameters of interest; the remaining parameters are referred to as nuisance parameters.

In the framework of ICA, the representation matrix W is regarded as the parameter of interest, while the probability density functions q_i are treated as the nuisance parameters of the semiparametric model. In our neural representation model, the probability density functions q_i represent the statistical neural activities. Usually these probability density functions are very sparse, owing to the sparseness of the action potentials of neurons.

The statistical neural activities depend on the sensory

information. Generally, we have to estimate the probability density functions q_i while training the connection weights. In this paper, we develop not only a learning algorithm for the weight adaptation but also an algorithm for adapting the sparseness of the neural representation.

If we choose the probability density functions to be sparse, the neural internal representation will become sparse after training the representation model. When the sparseness of the neural activities is misspecified, the likelihood equation in general gives a biased solution. One of the fundamental problems is whether L(W) is minimized at the representation solution.

A. Learning Algorithm for the Representation Matrix

In this section, we derive a learning algorithm based on

the gradient descent approach for the representation matrix.

We assume for the moment that the probability density functions are known during the derivation of the learning algorithm for the representation matrix. We will discuss below how to estimate the sparseness in order to make the learning algorithm stable. The cost function for on-line statistical learning can be simplified as [1]

l(W) = -\frac{1}{2} \log\left|\det(WW^T)\right| - \sum_{i=1}^{N} \log q_i(Z_i),     (7)

where q_i(Z_i) is the probability density function of Z_i, and Z = (Z_1, \ldots, Z_N)^T = W x(k). It should be mentioned that the representation matrix is rectangular. Using the natural gradient descent approach, we obtain the following learning algorithm:

\Delta W = \eta \left( I - \varphi(Z(k)) Z(k)^T \right) W,     (8)

where \eta is a learning rate and \varphi(\cdot) is the activation function, applied componentwise. For the derivation of the algorithm for rectangular matrices, refer to [16]. Usually it is not easy to implement the learning algorithm directly, because the dimension of the representation matrix W is huge. One solution is to implement the algorithm by approximate local learning.
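A minimal Python sketch of the update (8) follows; the dimensions, the Laplacian stand-in for the input, and the fixed tanh score function (a common choice for sparse, super-Gaussian sources, used here in place of the adaptive score of Section IV-B) are our assumptions:

import numpy as np

rng = np.random.default_rng(0)
m, n = 128, 64                       # input and output dimensions
W = 0.1 * rng.normal(size=(n, m))    # rectangular representation matrix
eta = 0.005                          # learning rate

def phi(z):
    """Score (activation) function; tanh suits sparse, super-Gaussian sources."""
    return np.tanh(z)

for k in range(1000):
    x = rng.laplace(size=m)          # stand-in for the multimodal input X(k)
    z = W @ x                        # Eq. (3)
    # Natural gradient update, Eq. (8): Delta W = eta (I - phi(Z) Z^T) W
    W += eta * (np.eye(n) - np.outer(phi(z), z)) @ W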

B. Learning Algorithm for Sparseness

In this section, we employ the generalized Gaussian family

for modelling the sparseness of the internal neural activities. The generalized Gaussian distribution family includes two free parameters, the variance \sigma^2 and the sharpness \theta [2]. To simplify our model, we impose that the internal random variables have unit variance. The generalized Gaussian distribution then simplifies to

q(y; \theta) = \exp\left( \mathcal{A}(\theta) - \left| y/A(\theta) \right|^{\theta} \right),     (9)

where \mathcal{A}(\theta) = -\log\left( 2 A(\theta) \Gamma(1 + 1/\theta) \right), A(\theta) = \sqrt{\Gamma(1/\theta)/\Gamma(3/\theta)}, and \Gamma(x) = \int_0^{\infty} \tau^{x-1} e^{-\tau}\, d\tau is the standard Gamma function.
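The density (9) can be checked numerically; a short Python sketch (using scipy.special.gamma; \theta = 2 should recover the standard normal and \theta = 1 the unit-variance Laplace density):

import numpy as np
from scipy.special import gamma

def A(theta):
    """Scale giving unit variance: sqrt(Gamma(1/theta) / Gamma(3/theta))."""
    return np.sqrt(gamma(1.0 / theta) / gamma(3.0 / theta))

def log_q(y, theta):
    """log q(y; theta) = cal_A(theta) - |y / A(theta)|**theta, Eq. (9)."""
    cal_A = -np.log(2.0 * A(theta) * gamma(1.0 + 1.0 / theta))
    return cal_A - np.abs(y / A(theta)) ** theta

y = np.linspace(-3.0, 3.0, 7)
print(np.exp(log_q(y, 2.0)))   # matches the N(0, 1) density
print(np.exp(log_q(y, 1.0)))   # Laplace: sharper peak, heavier tails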

The ordinary gradient of the cost function (7) with respect to the sharpness parameters is given by

\frac{\partial l(y, \theta, W)}{\partial \theta_i} = \frac{\partial}{\partial \theta_i} \left| \frac{Z_i}{A(\theta_i)} \right|^{\theta_i} - \mathcal{A}'(\theta_i),     (10)

where

\frac{\partial}{\partial \theta_i} \left| \frac{Z_i}{A(\theta_i)} \right|^{\theta_i} = \left| \frac{Z_i}{A(\theta_i)} \right|^{\theta_i} \left( \log\left| \frac{Z_i}{A(\theta_i)} \right| - \theta_i \frac{A'(\theta_i)}{A(\theta_i)} \right).     (11)

The ordinary gradient descent learning algorithm for estimating the activation function of the i-th component of the demixing model is described by

\Delta \theta_i = -\eta \frac{\partial l(y, \theta, W)}{\partial \theta_i}.     (12)

Correspondingly, the natural gradient algorithm is given by

\Delta \theta_i = -\eta\, g_i^{-1} \frac{\partial l(y, \theta, W)}{\partial \theta_i},     (13)

where g_i is the Fisher information, defined by

g_i = E\left[ \left( \frac{\partial l(y, \theta, W)}{\partial \theta_i} \right)^{2} \right].     (14)
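A Python sketch of the sparseness adaptation (12)-(14) follows; for brevity it uses a finite-difference stand-in for the closed-form gradient (10)-(11) and a running empirical estimate of the Fisher information (14), both our simplifications:

import numpy as np
from scipy.special import gamma

def A(theta):
    return np.sqrt(gamma(1.0 / theta) / gamma(3.0 / theta))

def neg_log_q(y, theta):
    """-log q(y; theta) for the unit-variance generalized Gaussian (9)."""
    cal_A = -np.log(2.0 * A(theta) * gamma(1.0 + 1.0 / theta))
    return -cal_A + np.abs(y / A(theta)) ** theta

def dl_dtheta(y, theta, eps=1e-5):
    """Finite-difference stand-in for the gradient (10)-(11)."""
    return (neg_log_q(y, theta + eps) - neg_log_q(y, theta - eps)) / (2.0 * eps)

rng = np.random.default_rng(0)
theta, eta, fisher = 2.0, 0.05, 1.0
for k in range(2000):
    z = rng.laplace(scale=1.0 / np.sqrt(2.0))   # unit-variance Laplace sample
    g = dl_dtheta(z, theta)
    fisher = 0.9 * fisher + 0.1 * g * g         # running estimate of Eq. (14)
    theta -= eta * g / (fisher + 1e-8)          # natural-gradient step, Eq. (13)
    theta = float(np.clip(theta, 0.3, 4.0))     # keep theta in a sensible range
# On average theta drifts from 2 (Gaussian) toward 1 (Laplace),
# tracking the sparseness of the source.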



The ordinary gradient algorithm (12) and the natural gradient algorithm (13) have the same set of equilibria, but they have different learning dynamics.

By using the above two on-line learning algorithms, we can obtain not only the representation matrix but also the sparseness, so as to adapt to the statistical properties of the stimulus.

V. CONCLUSIONS

In this paper, we first review some fundamental knowledge

of the anatomical structures and neural information pathways of the multimodal cortices. We propose a statistical parametric model for multimodal neural information integration. An associative learning algorithm is then proposed for the adaptation of the representation matrix, and the adaptation of the sparseness of the internal neural activities is also discussed. Detailed theoretical analysis and computer simulations will be provided in a separate full paper.

VI. ACKNOWLEDGEMENTS

The work was supported by the National Natural Science

Foundation of China (Grant No. 60375015) and the 973 Program (2005CB724301) of the Ministry of Science and Technology of China.

REFERENCES

[1] S. Amari and A. Cichocki. Adaptive blind signal processing: neural network approaches. Proceedings of the IEEE, 86(10):2026-2048, 1998.
[2] S. Amari, A. Cichocki, and H. Yang. Blind signal separation and extraction: neural and information theoretic approaches. In S. Haykin, editor, Unsupervised Adaptive Filtering, volume I, pages 63-138. John Wiley & Sons, 2000.
[3] S. Amari, A. Cichocki, and H.H. Yang. A new learning algorithm for blind signal separation. In G. Tesauro, D.S. Touretzky, and T.K. Leen, editors, Advances in Neural Information Processing Systems 8 (NIPS*95), pages 757-763, 1996.
[4] A. Cichocki and S. Amari. Adaptive Blind Signal and Image Processing. John Wiley, Chichester, UK, 2003.
[5] D.J. Felleman and D.C. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1:1-47, 1991.
[6] L. Itti and C. Koch. Computational modelling of visual attention. Nature Reviews Neuroscience, 2:194-203, 2001.
[7] J.H. Kaas and T.A. Hackett. 'What' and 'where' processing in auditory cortex. Nature Neuroscience, 2:1045-1047, 1999.
[8] D.C. Kadunce, J.W. Vaughan, M.T. Wallace, and B.E. Stein. The influence of visual and auditory receptive field organization on multisensory integration in the superior colliculus. Experimental Brain Research, 139:303-310, 2001.
[9] H.-O. Karnath. New insights into the functions of the superior temporal cortex. Nature Reviews Neuroscience, 2:568-576, 2001.
[10] S. Kastner and L.G. Ungerleider. Mechanisms of visual attention in the human cortex. Annual Review of Neuroscience, 23:315-341, 2000.
[11] L.G. Ungerleider. Functional brain imaging studies of cortical mechanisms for memory. Science, 270:769-775, 1995.
[12] K. Plunkett. Theories of early language acquisition. Trends in Cognitive Sciences, 1:146-153, 1997.
[13] R. Desimone and L.G. Ungerleider. Neural mechanisms of visual processing in monkeys. In Handbook of Neuropsychology, volume 2, pages 267-299. Elsevier, Amsterdam, 1989.
[14] J.R. Saffran, E.L. Newport, and R.N. Aslin. Word segmentation: the role of distributional cues. Journal of Memory and Language, 35:606-621, 1996.
[15] M.S. Seidenberg. Language acquisition and use: learning and applying probabilistic constraints. Science, 275:1599-1603, 1997.
[16] L. Zhang, A. Cichocki, and S. Amari. Natural gradient algorithm for blind separation of overdetermined mixture with additive noise. IEEE Signal Processing Letters, 6(11):293-295, 1999.
[17] L. Zhang, A. Cichocki, and S. Amari. Self-adaptive blind source separation based on activation function adaptation. IEEE Transactions on Neural Networks, 15(2):233-244, 2004.
