
LETTER Communicated by Erkki Oja

Variational Bayesian Learning of ICA with Missing Data

Kwokleung Chan
kwchan@salk.edu
Computational Neurobiology Laboratory, Salk Institute, La Jolla, CA 92037, U.S.A.

Te-Won Lee
tewon@salk.edu
Institute for Neural Computation, University of California at San Diego, La Jolla, CA 92093, U.S.A.

Terrence J. Sejnowski
terry@salk.edu
Computational Neurobiology Laboratory, Salk Institute, La Jolla, CA 92037, U.S.A., and Department of Biology, University of California at San Diego, La Jolla, CA 92093, U.S.A.

Neural Computation 15, 1991-2011 (2003). © 2003 Massachusetts Institute of Technology

Missing data are common in real-world data sets and are a problem for many estimation techniques. We have developed a variational Bayesian method to perform independent component analysis (ICA) on high-dimensional data containing missing entries. Missing data are handled naturally in the Bayesian framework by integrating the generative density model. Modeling the distributions of the independent sources with mixture of gaussians allows sources to be estimated with different kurtosis and skewness. Unlike the maximum likelihood approach, the variational Bayesian method automatically determines the dimensionality of the data and yields an accurate density model for the observed data without overfitting problems. The technique is also extended to the clusters of ICA and supervised classification framework.

1 Introduction

Data density estimation is an important step in many machine learning problems. Often we are faced with data containing incomplete entries. The data may be missing due to measurement or recording failure. Another frequent cause is difficulty in collecting complete data. For example, it could be expensive and time-consuming to perform some biomedical tests. Data scarcity is not uncommon, and it would be very undesirable to discard those data points with missing entries when we already have a small data set. Traditionally, missing data are filled in by mean imputation or regression imputation during preprocessing. This could introduce biases into the data cloud density and adversely affect subsequent analysis. A more principled way would be to use probability density estimates of the missing entries instead of point estimates. A well-known example of this approach is the use of the expectation-maximization (EM) algorithm in fitting incomplete data with a single gaussian density (Little & Rubin, 1987).

Independent component analysis (ICA; Hyvarinen, Karhunen, & Oja, 2001) assumes the observed data x are generated from a linear combination of independent sources s,

$$x = A s + \nu, \tag{1.1}$$

where A is the mixing matrix, which can be nonsquare. The sources s have nongaussian density, such as P(s_l) ∝ exp(−|s_l|^q). The noise term ν can have nonzero mean. ICA tries to locate independent axes within the data cloud and was developed for blind source separation. It has been applied to speech separation and analyzing fMRI and EEG data (Jung et al., 2001). ICA is also used to model data density, describing data as linear mixtures of independent features and finding projections that may uncover interesting structure in the data. Maximum likelihood learning of ICA with incomplete data has been studied by Welling and Weber (1999) in the limited case of a square mixing matrix and predefined source densities.

Many real-world data sets have intrinsic dimensionality smaller than that of the observed data. With missing data, principal component analysis cannot be used to perform dimension reduction as preprocessing for ICA. Instead, the variational Bayesian method applied to ICA can handle small data sets with high observed dimension (Chan, Lee, & Sejnowski, 2002; Choudrey & Roberts, 2001; Miskin, 2000). The Bayesian method prevents overfitting and performs automatic dimension reduction. In this article, we extend the variational Bayesian ICA method to problems with missing data. More important, the probability density estimate of the missing entries can be used to fill in the missing values. This allows the density model to be refined and made more accurate.

2 Model and Theory

2.1 ICA Generative Model with Missing Data. Consider a data set of T data points in an N-dimensional space, X = {x_t ∈ R^N}, t ∈ {1, ..., T}. Assume a noisy ICA generative model for the data:

$$P(x_t \mid \theta) = \int \mathcal{N}(x_t \mid A s_t + \nu, \Psi)\, P(s_t \mid \theta_s)\, ds_t, \tag{2.1}$$

where A is the mixing matrix, and ν and [Ψ]^{-1} are the observation mean and diagonal noise variance, respectively. The hidden source s_t is assumed to have L dimensions. Similar to the independent factor analysis of Attias (1999), each component of s_t will be modeled by a mixture of K gaussians to allow for source densities of various kurtosis and skewness:

$$P(s_t \mid \theta_s) = \prod_{l=1}^{L} \left( \sum_{k_l=1}^{K} \pi_{l k_l}\, \mathcal{N}(s_{tl} \mid \phi_{l k_l}, \beta_{l k_l}) \right). \tag{2.2}$$
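For concreteness, the generative model of equations 2.1 and 2.2 can be simulated directly. The following sketch is illustrative only; the function name, the hyperparameter values, and the missing-at-random masking step are our own choices, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ica_data(T=200, N=7, L=4, K=2, p_missing=0.3):
    """Sample T points from the noisy ICA model of equations 2.1-2.2, then mask entries."""
    A = rng.normal(size=(N, L))                 # mixing matrix
    nu = rng.normal(size=N)                     # observation mean
    psi = np.full(N, 100.0)                     # diagonal noise precisions (illustrative)

    pi = rng.dirichlet(np.ones(K), size=L)      # pi[l, k]: per-source mixture weights
    phi = rng.normal(size=(L, K))               # phi[l, k]: component means
    beta = np.full((L, K), 4.0)                 # beta[l, k]: component precisions

    # Draw the component label of each source dimension, then the sources themselves.
    k = np.stack([rng.choice(K, size=T, p=pi[l]) for l in range(L)])
    s = rng.normal(np.take_along_axis(phi, k, axis=1),
                   np.take_along_axis(beta, k, axis=1) ** -0.5)

    x = A @ s + nu[:, None] + rng.normal(scale=psi[:, None] ** -0.5, size=(N, T))

    observed = rng.random((N, T)) > p_missing   # indicator of observed entries
    return np.where(observed, x, np.nan), observed, s

x_incomplete, o, s_true = sample_ica_data()
```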

Split each data point into a missing part and an observed part, x_t^⊤ = (x_t^{o⊤}, x_t^{m⊤}). In this article, we consider only the random missing case (Ghahramani & Jordan, 1994); that is, the probability for the missing entries x_t^m is independent of the value of x_t^m but could depend on the value of x_t^o. The likelihood of the data set is then defined to be

$$L(\theta; X) = \prod_t P(x_t^o \mid \theta), \tag{2.3}$$

where

$$\begin{aligned}
P(x_t^o \mid \theta) &= \int P(x_t \mid \theta)\, dx_t^m \\
&= \int \left[ \int \mathcal{N}(x_t \mid A s_t + \nu, \Psi)\, dx_t^m \right] P(s_t \mid \theta_s)\, ds_t \\
&= \int \mathcal{N}\!\left(x_t^o \mid [A s_t + \nu]_t^o, [\Psi]_t^o\right) P(s_t \mid \theta_s)\, ds_t.
\end{aligned} \tag{2.4}$$

Here we have introduced the notation [·]_t^o, which means taking only the observed dimensions (corresponding to the tth data point) of whatever is inside the square brackets. Since equation 2.4 is similar to equation 2.1, the variational Bayesian ICA (Chan et al., 2002; Choudrey & Roberts, 2001; Miskin, 2000) can be extended naturally to handle missing data, but only if care is taken in discounting missing entries in the learning rules.

2.2 Variational Bayesian Method. In a full Bayesian treatment, the posterior distribution of the parameters θ is obtained by

$$P(\theta \mid X) = \frac{P(X \mid \theta)\, P(\theta)}{P(X)} = \frac{\prod_t P(x_t^o \mid \theta)\, P(\theta)}{P(X)}, \tag{2.5}$$

where P(X) is the marginal likelihood, given as

$$P(X) = \int \prod_t P(x_t^o \mid \theta)\, P(\theta)\, d\theta. \tag{2.6}$$

The ICA model for P(X) is defined with the following priors on the parameters, P(θ):

$$\begin{aligned}
P(A_{nl}) &= \mathcal{N}(A_{nl} \mid 0, \alpha_l), & P(\pi_l) &= \mathcal{D}(\pi_l \mid d_{o\pi_l}), \\
P(\alpha_l) &= \mathcal{G}(\alpha_l \mid a_{o\alpha_l}, b_{o\alpha_l}), & P(\phi_{l k_l}) &= \mathcal{N}(\phi_{l k_l} \mid \mu_{o\phi_{l k_l}}, \Lambda_{o\phi_{l k_l}}),
\end{aligned} \tag{2.7}$$

$$\begin{aligned}
P(\beta_{l k_l}) &= \mathcal{G}(\beta_{l k_l} \mid a_{o\beta_{l k_l}}, b_{o\beta_{l k_l}}), \\
P(\nu_n) &= \mathcal{N}(\nu_n \mid \mu_{o\nu_n}, \Lambda_{o\nu_n}), & P(\psi_n) &= \mathcal{G}(\psi_n \mid a_{o\psi_n}, b_{o\psi_n}),
\end{aligned} \tag{2.8}$$

where N(·), G(·), and D(·) are the normal, gamma, and Dirichlet distributions, respectively:

$$\mathcal{N}(x \mid \mu, \Lambda) = \sqrt{\frac{|\Lambda|}{(2\pi)^N}}\; e^{-\frac{1}{2}(x-\mu)^\top \Lambda (x-\mu)}, \tag{2.9}$$

$$\mathcal{G}(x \mid a, b) = \frac{b^a}{\Gamma(a)}\, x^{a-1} e^{-b x}, \tag{2.10}$$

$$\mathcal{D}(\pi \mid d) = \frac{\Gamma(\sum_k d_k)}{\prod_k \Gamma(d_k)}\, \pi_1^{d_1 - 1} \times \cdots \times \pi_K^{d_K - 1}. \tag{2.11}$$

Here a_o(·), b_o(·), d_o(·), µ_o(·), and Λ_o(·) are prechosen hyperparameters for the priors. Notice that Λ in the normal distribution is an inverse covariance parameter.
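As a reading aid only (not from the paper), these parameterizations can be mirrored with standard library distributions; note that Λ enters as an inverse covariance and b as a rate, so the scipy calls below invert them:

```python
import numpy as np
from scipy.stats import multivariate_normal, gamma, dirichlet

def normal_pdf(x, mu, Lam):
    """Equation 2.9: gaussian with mean mu and inverse covariance Lam."""
    return multivariate_normal(mean=mu, cov=np.linalg.inv(Lam)).pdf(x)

def gamma_pdf(x, a, b):
    """Equation 2.10: gamma with shape a and rate b (scipy expects scale = 1/rate)."""
    return gamma(a, scale=1.0 / b).pdf(x)

def dirichlet_pdf(p, d):
    """Equation 2.11: Dirichlet with pseudo-count vector d."""
    return dirichlet(d).pdf(p)

# Example: a standard 2D gaussian evaluated at the origin gives 1 / (2*pi).
print(normal_pdf(np.zeros(2), np.zeros(2), np.eye(2)))
```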

Under the variational Bayesian treatment, instead of performing the integration in equation 2.6 to solve for P(θ | X) directly, we approximate it by Q(θ) and opt to minimize the Kullback-Leibler distance between them (Mackay, 1995; Jordan, Ghahramani, Jaakkola, & Saul, 1999):

$$\begin{aligned}
-\mathrm{KL}(Q(\theta) \parallel P(\theta \mid X)) &= \int Q(\theta) \log \frac{P(\theta \mid X)}{Q(\theta)}\, d\theta \\
&= \int Q(\theta) \left[ \sum_t \log P(x_t^o \mid \theta) + \log \frac{P(\theta)}{Q(\theta)} \right] d\theta - \log P(X).
\end{aligned} \tag{2.12}$$

Since −KL(Q(θ) ∥ P(θ | X)) ≤ 0, we get a lower bound for the log marginal likelihood,

$$\log P(X) \ge \int Q(\theta) \sum_t \log P(x_t^o \mid \theta)\, d\theta + \int Q(\theta) \log \frac{P(\theta)}{Q(\theta)}\, d\theta, \tag{2.13}$$

which can also be obtained by applying Jensen's inequality to equation 2.6. Q(θ) is then solved by functional maximization of the lower bound. A separable approximate posterior Q(θ) will be assumed:

$$Q(\theta) = Q(\nu)\, Q(\Psi) \times Q(A)\, Q(\alpha) \times \prod_l \left( Q(\pi_l) \prod_{k_l} Q(\phi_{l k_l})\, Q(\beta_{l k_l}) \right). \tag{2.14}$$

The second term in equation 2.13, which is the negative Kullback-Leibler divergence between the approximate posterior Q(θ) and the prior P(θ), is then expanded as

$$\begin{aligned}
\int Q(\theta) \log \frac{P(\theta)}{Q(\theta)}\, d\theta
&= \sum_l \int Q(\pi_l) \log \frac{P(\pi_l)}{Q(\pi_l)}\, d\pi_l \\
&\quad + \sum_{l,k_l} \int Q(\phi_{l k_l}) \log \frac{P(\phi_{l k_l})}{Q(\phi_{l k_l})}\, d\phi_{l k_l}
 + \sum_{l,k_l} \int Q(\beta_{l k_l}) \log \frac{P(\beta_{l k_l})}{Q(\beta_{l k_l})}\, d\beta_{l k_l} \\
&\quad + \int\!\!\int Q(A)\, Q(\alpha) \log \frac{P(A \mid \alpha)}{Q(A)}\, dA\, d\alpha
 + \int Q(\alpha) \log \frac{P(\alpha)}{Q(\alpha)}\, d\alpha \\
&\quad + \int Q(\nu) \log \frac{P(\nu)}{Q(\nu)}\, d\nu
 + \int Q(\Psi) \log \frac{P(\Psi)}{Q(\Psi)}\, d\Psi.
\end{aligned} \tag{2.15}$$

2.3 Special Treatment for Missing Data. Thus far, the analysis follows almost exactly that of the variational Bayesian ICA on complete data, except that P(x_t | θ) is replaced by P(x_t^o | θ) in equation 2.6, and consequently the missing entries are discounted in the learning rules. However, it would be useful to obtain Q(x_t^m | x_t^o), that is, the approximate distribution on the missing entries, which is given by

$$Q(x_t^m \mid x_t^o) = \int Q(\theta) \int \mathcal{N}\!\left(x_t^m \mid [A s_t + \nu]_t^m, [\Psi]_t^m\right) Q(s_t)\, ds_t\, d\theta. \tag{2.16}$$

As noted by Welling and Weber (1999), elements of s_t given x_t^o are dependent. More important, under the ICA model, Q(s_t) is unlikely to be a single gaussian. This is evident from Figure 1, which shows the probability density functions of the data x and hidden variable s. The inserts show the sample data in the two spaces. Here the hidden sources assume density of P(s_l) ∝ exp(−|s_l|^{0.7}). They are mixed noiselessly to give P(x) in the upper graph. The cut in the upper graph represents P(x_1 | x_2 = −0.5), which transforms into a highly correlated and nongaussian P(s | x_2 = −0.5).

Unless we are interested in only the first- and second-order statistics of Q(x_t^m | x_t^o), we should try to capture as much structure as possible of P(s_t | x_t^o) in Q(s_t).

Figure 1: Probability density functions for the data x (top) and hidden sources s (bottom). Inserts show the sample data in the two spaces. The "cuts" show P(x_1 | x_2 = −0.5) and P(s | x_2 = −0.5).

In this article, we take a slightly different route from Chan et al. (2002) or Choudrey and Roberts (2001) when performing variational Bayesian learning. First, we break down P(s_t) into a mixture of K^L gaussians in the L-dimensional s space:

$$\begin{aligned}
P(s_t) &= \prod_{l=1}^{L} \left( \sum_{k_l} \pi_{l k_l}\, \mathcal{N}(s_{tl} \mid \phi_{l k_l}, \beta_{l k_l}) \right) \\
&= \sum_{k_1} \cdots \sum_{k_L} \left[ \pi_{1 k_1} \times \cdots \times \pi_{L k_L} \times \mathcal{N}(s_{t1} \mid \phi_{1 k_1}, \beta_{1 k_1}) \times \cdots \times \mathcal{N}(s_{tL} \mid \phi_{L k_L}, \beta_{L k_L}) \right] \\
&= \sum_{\mathbf{k}} \pi_{\mathbf{k}}\, \mathcal{N}(s_t \mid \phi_{\mathbf{k}}, \beta_{\mathbf{k}}).
\end{aligned} \tag{2.17}$$

Here we have defined k to be a vector index. The "kth" gaussian is centered at φ_k, of inverse covariance β_k, in the source s space:

$$\begin{aligned}
\mathbf{k} &= (k_1, \ldots, k_l, \ldots, k_L)^\top, \qquad k_l = 1, \ldots, K, \\
\phi_{\mathbf{k}} &= (\phi_{1 k_1}, \ldots, \phi_{l k_l}, \ldots, \phi_{L k_L})^\top, \\
\beta_{\mathbf{k}} &= \begin{pmatrix} \beta_{1 k_1} & & \\ & \ddots & \\ & & \beta_{L k_L} \end{pmatrix}, \\
\pi_{\mathbf{k}} &= \pi_{1 k_1} \times \cdots \times \pi_{L k_L}.
\end{aligned} \tag{2.18}$$
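The expansion in equations 2.17 and 2.18 can be written out mechanically. This sketch (our own helper, with illustrative names) flattens the L independent K-component mixtures into the K^L joint gaussians indexed by the vector k:

```python
import itertools
import numpy as np

def expand_source_mixture(pi, phi, beta):
    """Enumerate the K**L expanded gaussians of equation 2.17.

    pi, phi, beta have shape (L, K). For every vector index k = (k_1, ..., k_L),
    return the weight pi_k, mean phi_k, and diagonal precision beta_k of equation 2.18.
    """
    L, K = pi.shape
    components = []
    for k in itertools.product(range(K), repeat=L):
        idx = (np.arange(L), np.array(k))
        components.append({
            "k": k,
            "pi_k": float(np.prod(pi[idx])),   # product of per-source weights
            "phi_k": phi[idx],                 # stacked component means
            "beta_k": np.diag(beta[idx]),      # diagonal precision matrix
        })
    return components

# With K = 2 and L = 4 there are 16 expanded gaussians, and their weights sum to 1.
comps = expand_source_mixture(np.full((4, 2), 0.5), np.zeros((4, 2)), np.ones((4, 2)))
assert len(comps) == 16
assert abs(sum(c["pi_k"] for c in comps) - 1.0) < 1e-12
```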

The log likelihood for x_t^o is then expanded using Jensen's inequality:

$$\begin{aligned}
\log P(x_t^o \mid \theta) &= \log \int P(x_t^o \mid s_t, \theta) \sum_{\mathbf{k}} \pi_{\mathbf{k}}\, \mathcal{N}(s_t \mid \phi_{\mathbf{k}}, \beta_{\mathbf{k}})\, ds_t \\
&= \log \sum_{\mathbf{k}} \pi_{\mathbf{k}} \int P(x_t^o \mid s_t, \theta)\, \mathcal{N}(s_t \mid \phi_{\mathbf{k}}, \beta_{\mathbf{k}})\, ds_t \\
&\ge \sum_{\mathbf{k}} Q(\mathbf{k}_t) \log \int P(x_t^o \mid s_t, \theta)\, \mathcal{N}(s_t \mid \phi_{\mathbf{k}}, \beta_{\mathbf{k}})\, ds_t
 + \sum_{\mathbf{k}} Q(\mathbf{k}_t) \log \frac{\pi_{\mathbf{k}}}{Q(\mathbf{k}_t)}.
\end{aligned} \tag{2.19}$$

Figure 2: A simplified directed graph for the generative model of variational ICA. x_t is the observed variable, k_t and s_t are hidden variables, and the rest are model parameters. The k_t indicates which of the K^L expanded gaussians generated s_t.

Here, Q(k_t) is a short form for Q(k_t = k); k_t is a discrete hidden variable, and Q(k_t = k) is the probability that the tth data point belongs to the kth gaussian. Recognizing that s_t is just a dummy variable, we introduce Q(s_kt), apply Jensen's inequality again, and get

$$\begin{aligned}
\log P(x_t^o \mid \theta) \ge \sum_{\mathbf{k}} Q(\mathbf{k}_t) &\left[ \int Q(s_{\mathbf{k}t}) \log P(x_t^o \mid s_{\mathbf{k}t}, \theta)\, ds_{\mathbf{k}t}
 + \int Q(s_{\mathbf{k}t}) \log \frac{\mathcal{N}(s_{\mathbf{k}t} \mid \phi_{\mathbf{k}}, \beta_{\mathbf{k}})}{Q(s_{\mathbf{k}t})}\, ds_{\mathbf{k}t} \right] \\
&+ \sum_{\mathbf{k}} Q(\mathbf{k}_t) \log \frac{\pi_{\mathbf{k}}}{Q(\mathbf{k}_t)}.
\end{aligned} \tag{2.20}$$

Substituting log P(x_t^o | θ) back into equation 2.13, the variational Bayesian method can be continued as usual. We have drawn in Figure 2 a simplified graphical representation for the generative model of variational ICA. x_t is the observed variable, k_t and s_t are hidden variables, and the rest are model parameters, where k_t indicates which of the K^L expanded gaussians generated s_t.

3 Learning Rules

Combining equations 2.13, 2.15, and 2.20, we perform functional maximization on the lower bound of the log marginal likelihood log P(X) with regard to Q(θ) (see equation 2.14), Q(k_t), and Q(s_kt) (see equation 2.20); for example,

$$\log Q(\nu) = \log P(\nu) + \int Q(\theta_{\setminus \nu}) \sum_t \log P(x_t^o \mid \theta)\, d\theta_{\setminus \nu} + \text{const.}, \tag{3.1}$$

where θ\ν is the set of parameters excluding ν. This gives

$$\begin{aligned}
Q(\nu) &= \prod_n \mathcal{N}(\nu_n \mid \mu_{\nu_n}, \Lambda_{\nu_n}), \\
\Lambda_{\nu_n} &= \Lambda_{o\nu_n} + \langle \psi_n \rangle \sum_t o_{nt}, \\
\mu_{\nu_n} &= \frac{\Lambda_{o\nu_n} \mu_{o\nu_n} + \langle \psi_n \rangle \sum_t o_{nt} \sum_{\mathbf{k}} Q(\mathbf{k}_t) \langle x_{nt} - A_{n\cdot} s_{\mathbf{k}t} \rangle}{\Lambda_{\nu_n}}.
\end{aligned} \tag{3.2}$$

Similarly,

$$\begin{aligned}
Q(\Psi) &= \prod_n \mathcal{G}(\psi_n \mid a_{\psi_n}, b_{\psi_n}), \\
a_{\psi_n} &= a_{o\psi_n} + \frac{1}{2} \sum_t o_{nt}, \\
b_{\psi_n} &= b_{o\psi_n} + \frac{1}{2} \sum_t o_{nt} \sum_{\mathbf{k}} Q(\mathbf{k}_t) \langle (x_{nt} - A_{n\cdot} s_{\mathbf{k}t} - \nu_n)^2 \rangle.
\end{aligned} \tag{3.3}$$

$$\begin{aligned}
Q(A) &= \prod_n \mathcal{N}(A_{n\cdot} \mid \mu_{A_{n\cdot}}, \Lambda_{A_{n\cdot}}), \\
\Lambda_{A_{n\cdot}} &= \begin{pmatrix} \langle \alpha_1 \rangle & & \\ & \ddots & \\ & & \langle \alpha_L \rangle \end{pmatrix}
 + \langle \psi_n \rangle \sum_t o_{nt} \sum_{\mathbf{k}} Q(\mathbf{k}_t) \langle s_{\mathbf{k}t} s_{\mathbf{k}t}^\top \rangle, \\
\mu_{A_{n\cdot}} &= \left( \langle \psi_n \rangle \sum_t o_{nt} (x_{nt} - \langle \nu_n \rangle) \sum_{\mathbf{k}} Q(\mathbf{k}_t) \langle s_{\mathbf{k}t}^\top \rangle \right) \Lambda_{A_{n\cdot}}^{-1}.
\end{aligned} \tag{3.4}$$

$$\begin{aligned}
Q(\alpha) &= \prod_l \mathcal{G}(\alpha_l \mid a_{\alpha_l}, b_{\alpha_l}), \\
a_{\alpha_l} &= a_{o\alpha_l} + \frac{N}{2}, \\
b_{\alpha_l} &= b_{o\alpha_l} + \frac{1}{2} \sum_n \langle A_{nl}^2 \rangle.
\end{aligned} \tag{3.5}$$

$$\begin{aligned}
Q(\pi_l) &= \mathcal{D}(\pi_l \mid d_{\pi_l}), \\
d_{\pi_l k} &= d_{o\pi_l k} + \sum_t \sum_{k_l = k} Q(\mathbf{k}_t).
\end{aligned} \tag{3.6}$$

$$\begin{aligned}
Q(\phi_{l k_l}) &= \mathcal{N}(\phi_{l k_l} \mid \mu_{\phi_{l k_l}}, \Lambda_{\phi_{l k_l}}), \\
\Lambda_{\phi_{l k_l}} &= \Lambda_{o\phi_{l k_l}} + \langle \beta_{l k_l} \rangle \sum_t \sum_{k_l = k} Q(\mathbf{k}_t), \\
\mu_{\phi_{l k_l}} &= \frac{\Lambda_{o\phi_{l k_l}} \mu_{o\phi_{l k_l}} + \langle \beta_{l k_l} \rangle \sum_t \sum_{k_l = k} Q(\mathbf{k}_t) \langle s_{\mathbf{k}tl} \rangle}{\Lambda_{\phi_{l k_l}}}.
\end{aligned} \tag{3.7}$$

$$\begin{aligned}
Q(\beta_{l k_l}) &= \mathcal{G}(\beta_{l k_l} \mid a_{\beta_{l k_l}}, b_{\beta_{l k_l}}), \\
a_{\beta_{l k_l}} &= a_{o\beta_{l k_l}} + \frac{1}{2} \sum_t \sum_{k_l = k} Q(\mathbf{k}_t), \\
b_{\beta_{l k_l}} &= b_{o\beta_{l k_l}} + \frac{1}{2} \sum_t \sum_{k_l = k} Q(\mathbf{k}_t) \langle (s_{\mathbf{k}tl} - \phi_{l k_l})^2 \rangle.
\end{aligned} \tag{3.8}$$

$$\begin{aligned}
Q(s_{\mathbf{k}t}) &= \mathcal{N}(s_{\mathbf{k}t} \mid \mu_{s_{\mathbf{k}t}}, \Lambda_{s_{\mathbf{k}t}}), \\
\Lambda_{s_{\mathbf{k}t}} &= \begin{pmatrix} \langle \beta_{1 k_1} \rangle & & \\ & \ddots & \\ & & \langle \beta_{L k_L} \rangle \end{pmatrix}
 + \left\langle A^\top \begin{pmatrix} o_{1t}\psi_1 & & \\ & \ddots & \\ & & o_{Nt}\psi_N \end{pmatrix} A \right\rangle, \\
\Lambda_{s_{\mathbf{k}t}} \mu_{s_{\mathbf{k}t}} &= \begin{pmatrix} \langle \beta_{1 k_1} \phi_{1 k_1} \rangle \\ \vdots \\ \langle \beta_{L k_L} \phi_{L k_L} \rangle \end{pmatrix}
 + \left\langle A^\top \begin{pmatrix} o_{1t}\psi_1 & & \\ & \ddots & \\ & & o_{Nt}\psi_N \end{pmatrix} (x_t - \nu) \right\rangle.
\end{aligned} \tag{3.9}$$

In the above equations, ⟨·⟩ denotes the expectation over the posterior distributions Q(·). A_n· is the nth row of the mixing matrix A. The sum over k_l = k means picking out those gaussians such that the lth element of their indices k has the value of k, and o_t is an indicator variable for observed entries in x_t:

$$o_{nt} = \begin{cases} 1 & \text{if } x_{nt} \text{ is observed,} \\ 0 & \text{if } x_{nt} \text{ is missing.} \end{cases} \tag{3.10}$$

For a model of equal noise variance among all the observation dimensions, the summation in the learning rules for Q(Ψ) would be over both t and n. Note that there exists scale and translational degeneracy in the model, as given by equations 2.1 and 2.2. After each update of Q(π_l), Q(φ_lk_l), and Q(β_lk_l), it is better to rescale P(s_tl) to have zero mean and unit variance. Q(s_kt), Q(A), Q(α), Q(ν), and Q(Ψ) have to be adjusted correspondingly. Finally, Q(k_t) is given by

$$\log Q(\mathbf{k}_t) = \langle \log P(x_t^o \mid s_{\mathbf{k}t}, \theta) \rangle + \langle \log \mathcal{N}(s_{\mathbf{k}t} \mid \phi_{\mathbf{k}}, \beta_{\mathbf{k}}) \rangle
 - \langle \log Q(s_{\mathbf{k}t}) \rangle + \langle \log \pi_{\mathbf{k}} \rangle - \log z_t, \tag{3.11}$$

where z_t is a normalization constant. The lower bound E(X, Q(θ)) for the log marginal likelihood, computed using equations 2.13, 2.15, and 2.20, can be monitored during learning and used for comparison of different solutions or models. After some manipulation, E(X, Q(θ)) can be expressed as

$$E(X, Q(\theta)) = \sum_t \log z_t + \int Q(\theta) \log \frac{P(\theta)}{Q(\theta)}\, d\theta. \tag{3.12}$$
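To illustrate how the indicator o_nt of equation 3.10 discounts missing entries, here is a hedged sketch of the Q(ν) update in equation 3.2 written with an observation mask; the variable names are ours, and the expectations ⟨ψ_n⟩, Q(k_t), and µ_{s_kt} are assumed to be supplied by the other update steps:

```python
import numpy as np

def update_Q_nu(x, o, A_mean, psi_mean, Qk, s_mean, Lam_o_nu, mu_o_nu):
    """Sketch of equation 3.2: posterior precision and mean of the observation bias nu.

    x:        (N, T) data, arbitrary values at missing entries
    o:        (N, T) 0/1 observation mask, the o_nt of equation 3.10
    A_mean:   (N, L) posterior mean of the mixing matrix
    psi_mean: (N,)   expected noise precisions <psi_n>
    Qk:       (Kc, T) responsibilities Q(k_t) over the Kc expanded gaussians
    s_mean:   (Kc, L, T) posterior source means mu_{s_kt}
    """
    Lam_nu = Lam_o_nu + psi_mean * o.sum(axis=1)            # precision counts observed entries only
    resid = x[None, :, :] - np.einsum('nl,klt->knt', A_mean, s_mean)
    resid_avg = np.einsum('kt,knt->nt', Qk, resid)          # average residual under Q(k_t)
    mu_nu = (Lam_o_nu * mu_o_nu + psi_mean * (o * resid_avg).sum(axis=1)) / Lam_nu
    return mu_nu, Lam_nu
```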

4 Missing Data

4.1 Filling in Missing Entries. Recovering missing values while performing demixing is possible if we have N > L. More specifically, if the number of observed dimensions in x_t is greater than L, the equation

$$x_t^o = [A]_t^o \cdot s_t \tag{4.1}$$

would be overdetermined in s_t, unless [A]_t^o has a rank smaller than L. In this case, Q(s_t) is likely to be unimodal and peaked; point estimates of s_t would be sufficient and reliable, and the learning rules of Chan et al. (2002), with small modification to account for missing entries, would give a reasonable approximation. When Q(s_t) is a single gaussian, the exponential growth in complexity is avoided. However, if the number of observed dimensions in x_t is less than L, equation 4.1 is now underdetermined in s_t, and Q(s_t) would have a broad multimodal structure. This corresponds to overcomplete ICA, where a single gaussian approximation of Q(s_t) is undesirable, and the formalism discussed in this article is needed to capture the higher-order statistics of Q(s_t) and produce a more faithful Q(x_t^m | x_t^o). The approximate distribution Q(x_t^m | x_t^o) can be obtained by

$$Q(x_t^m \mid x_t^o) = \sum_{\mathbf{k}} Q(\mathbf{k}_t) \int \delta(x_t^m - x_{\mathbf{k}t}^m)\, Q(x_{\mathbf{k}t}^m \mid x_t^o, \mathbf{k})\, dx_{\mathbf{k}t}^m, \tag{4.2}$$

where δ(·) is the delta function and

$$\begin{aligned}
Q(x_{\mathbf{k}t}^m \mid x_t^o, \mathbf{k}) &= \int Q(\theta) \int \mathcal{N}\!\left(x_{\mathbf{k}t}^m \mid [A s_{\mathbf{k}t} + \nu]_t^m, [\Psi]_t^m\right) Q(s_{\mathbf{k}t})\, ds_{\mathbf{k}t}\, d\theta \\
&= \int\!\!\int Q(A)\, Q(\Psi)\, \mathcal{N}(x_{\mathbf{k}t}^m \mid \mu_{x_{\mathbf{k}t}^m}, \Lambda_{x_{\mathbf{k}t}^m})\, dA\, d\Psi,
\end{aligned} \tag{4.3}$$

$$\mu_{x_{\mathbf{k}t}^m} = [A \mu_{s_{\mathbf{k}t}} + \mu_\nu]_t^m, \tag{4.4}$$

$$\left(\Lambda_{x_{\mathbf{k}t}^m}\right)^{-1} = \left[ A \Lambda_{s_{\mathbf{k}t}}^{-1} A^\top + \Lambda_\nu^{-1} + \mathrm{diag}(\Psi)^{-1} \right]_t^m. \tag{4.5}$$

Unfortunately, the integration over Q(A) and Q(Ψ) cannot be carried out analytically, but we can substitute ⟨A⟩ and ⟨Ψ⟩ as an approximation. Estimation of Q(x_t^m | x_t^o) using the above equations is demonstrated in Figure 3. The shaded area is the exact posterior P(x_t^m | x_t^o) for the noiseless mixing in Figure 1, with observed x_2 = −2, and the solid line is the approximation by equations 4.2 through 4.5. We have modified the variational ICA of Chan et al. (2002) by discounting missing entries. This is done by replacing Σ_t with Σ_t o_nt, and ψ_n with o_nt ψ_n, in their learning rules.

Figure 3: The approximation of Q(x_t^m | x_t^o) from the full missing ICA (solid line) and the polynomial missing ICA (dashed line). The shaded area is the exact posterior P(x_t^m | x_t^o) corresponding to the noiseless mixture in Figure 1, with observed x_2 = −2. Dotted lines are the contribution from the individual Q(x_kt^m | x_t^o, k).

The dashed line is the approximation Q(x_t^m | x_t^o) from this modified method, which we refer to as polynomial missing ICA. The treatment of fully expanding the K^L hidden source gaussians discussed in section 2.3 is named full missing ICA. The full missing ICA gives a more accurate fit for P(x_t^m | x_t^o) and a better estimate for ⟨x_t^m | x_t^o⟩. From equation 2.16,

$$Q(x_t^m \mid x_t^o) = \int Q(\theta) \int \mathcal{N}\!\left(x_t^m \mid [A s_t + \nu]_t^m, [\Psi]_t^m\right) Q(s_t)\, ds_t\, d\theta, \tag{4.6}$$

and the above formalism, Q(s_t) becomes

$$Q(s_t) = \sum_{\mathbf{k}} Q(\mathbf{k}_t) \int \delta(s_t - s_{\mathbf{k}t})\, Q(s_{\mathbf{k}t})\, ds_{\mathbf{k}t}, \tag{4.7}$$

which is a mixture of K^L gaussians. The missing values can then be filled in by

$$\langle s_t \mid x_t^o \rangle = \int s_t\, Q(s_t)\, ds_t = \sum_{\mathbf{k}} Q(\mathbf{k}_t)\, \mu_{s_{\mathbf{k}t}}, \tag{4.8}$$

$$\langle x_t^m \mid x_t^o \rangle = \int x_t^m\, Q(x_t^m \mid x_t^o)\, dx_t^m = \sum_{\mathbf{k}} Q(\mathbf{k}_t)\, \mu_{x_{\mathbf{k}t}^m} = [A]_t^m \langle s_t \mid x_t^o \rangle + [\mu_\nu]_t^m, \tag{4.9}$$

where µ_{s_kt} and µ_{x_kt^m} are given in equations 3.9 and 4.4. Alternatively, a maximum a posteriori (MAP) estimate on Q(s_t) and Q(x_t^m | x_t^o) may be obtained, but then numerical methods are needed.
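A minimal sketch (our notation, not the authors' code) of equations 4.8 and 4.9 for a single data point, assuming the responsibilities Q(k_t), the posterior source means µ_{s_kt}, and the posterior means of A and ν have already been computed:

```python
import numpy as np

def fill_missing(x_t, observed_t, Qk_t, s_mean_t, A_mean, nu_mean):
    """Fill in missing entries of one data point via equations 4.8 and 4.9.

    x_t:        (N,) data point, arbitrary values where missing
    observed_t: (N,) boolean observation mask
    Qk_t:       (Kc,) responsibilities Q(k_t) over the expanded gaussians
    s_mean_t:   (Kc, L) posterior means mu_{s_kt}
    """
    s_exp = Qk_t @ s_mean_t                      # <s_t | x_t^o>, equation 4.8
    x_exp = A_mean @ s_exp + nu_mean             # predictive mean for every dimension
    x_filled = np.where(observed_t, x_t, x_exp)  # keep observed entries, fill the rest (eq. 4.9)
    return x_filled, s_exp
```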

4.2 The "Full" and "Polynomial" Missing ICA. The complexity of the full variational Bayesian ICA method is proportional to T × K^L, where T is the number of data points, L is the number of hidden sources assumed, and K is the number of gaussians used to model the density of each source. If we set K = 2, the five parameters in the source density model P(s_tl) are already enough to model the mean, variance, skewness, and kurtosis of the source distribution. The full missing ICA should always be preferred if memory and computational time permit. The "polynomial missing ICA" converges more slowly per epoch of learning rules and suffers from many more local maxima. It has an inferior marginal likelihood lower bound. The problems are more serious at high missing data rates, and a local maximum solution is usually found instead. In the full missing ICA, Q(s_t) is a mixture of gaussians. In the extreme case, when all entries of a data point are missing, that is, empty x_t^o, Q(s_t) is the same as P(s_t | θ) and would not interfere with the learning of P(s_t | θ) from other data points. On the other hand, the single gaussian Q(s_t) in the polynomial missing ICA would drive P(s_t | θ) to become gaussian too. This is very undesirable when learning ICA structure.

5 Clusters of ICA

The variational Bayesian ICA for missing data described above can be easily extended to model data density with C clusters of ICA. First, all parameters θ and hidden variables k_t, s_kt for each cluster are given a superscript index c. Parameter ρ = {ρ_1, ..., ρ_C} is introduced to represent the weights on the clusters. ρ has a Dirichlet prior (see equation 2.11). Θ = {ρ, θ^1, ..., θ^C} is now the collection of all parameters. Our density model in equation 2.1 becomes

$$\begin{aligned}
P(x_t \mid \Theta) &= \sum_c P(c_t = c \mid \rho)\, P(x_t \mid \theta^c) \\
&= \sum_c P(c_t = c \mid \rho) \int \mathcal{N}(x_t \mid A^c s_t^c + \nu^c, \Psi^c)\, P(s_t^c \mid \theta_s^c)\, ds_t^c.
\end{aligned} \tag{5.1}$$

The objective function in equation 2.13 remains the same but with θ replaced by Θ. The separable posterior Q(Θ) is given by

$$Q(\Theta) = Q(\rho) \prod_c Q(\theta^c), \tag{5.2}$$

and similar to equation 2.15,

$$\int Q(\Theta) \log \frac{P(\Theta)}{Q(\Theta)}\, d\Theta = \int Q(\rho) \log \frac{P(\rho)}{Q(\rho)}\, d\rho + \sum_c \int Q(\theta^c) \log \frac{P(\theta^c)}{Q(\theta^c)}\, d\theta^c. \tag{5.3}$$

Equation 2.20 now becomes

$$\begin{aligned}
\log P(x_t^o \mid \Theta) \ge \sum_c Q(c_t) \log \frac{P(c_t)}{Q(c_t)}
 &+ \sum_{c,\mathbf{k}} Q(c_t)\, Q(\mathbf{k}_t^c) \left[ \int Q(s_{\mathbf{k}t}^c) \log P(x_t^o \mid s_{\mathbf{k}t}^c, \theta^c)\, ds_{\mathbf{k}t}^c
 + \int Q(s_{\mathbf{k}t}^c) \log \frac{\mathcal{N}(s_{\mathbf{k}t}^c \mid \phi_{\mathbf{k}}^c, \beta_{\mathbf{k}}^c)}{Q(s_{\mathbf{k}t}^c)}\, ds_{\mathbf{k}t}^c \right] \\
 &+ \sum_{c,\mathbf{k}} Q(c_t)\, Q(\mathbf{k}_t^c) \log \frac{\pi_{\mathbf{k}}^c}{Q(\mathbf{k}_t^c)}.
\end{aligned} \tag{5.4}$$

We have introduced one more hidden variable, c_t, and Q(c_t) is to be interpreted in the same fashion as Q(k_t^c). All learning rules in section 3 remain the same, only with Σ_t replaced by Σ_t Q(c_t). Finally, we need two more learning rules:

$$d_{\rho_c} = d_{o\rho_c} + \sum_t Q(c_t), \tag{5.5}$$

$$\log Q(c_t) = \langle \log \rho_c \rangle + \log z_t^c - \log Z_t, \tag{5.6}$$

where z_t^c is the normalization constant for Q(k_t^c) (see equation 3.11) and Z_t is for normalizing Q(c_t).
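Equation 5.6 says the cluster responsibilities are a normalized exponential of ⟨log ρ_c⟩ + log z_t^c; a small sketch (names are ours) with the usual max-subtraction for numerical stability:

```python
import numpy as np

def cluster_responsibilities(log_rho_mean, log_z):
    """Equation 5.6: Q(c_t) proportional to exp(<log rho_c> + log z_t^c).

    log_rho_mean: (C,)   expected log cluster weights <log rho_c>
    log_z:        (C, T) per-cluster normalization constants from equation 3.11
    """
    logits = log_rho_mean[:, None] + log_z
    logits -= logits.max(axis=0, keepdims=True)      # stabilize before exponentiating
    Qc = np.exp(logits)
    return Qc / Qc.sum(axis=0, keepdims=True)        # log Z_t is absorbed by this normalization
```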

6 Supervised Classification

It is generally difficult for discriminative classifiers such as multilayer perceptron (Bishop, 1995) or support vector machine (Vapnik, 1998) to handle missing data. In this section, we extend the variational Bayesian technique to supervised classification.

Consider a data set (X_T, Y_T) = {x_t, y_t; t = 1, ..., T}. Here x_t contains the input attributes and may have missing entries; y_t ∈ {1, ..., y, ..., Y} indicates which of the Y classes x_t is associated with. When given a new data point x_{T+1}, we would like to compute P(y_{T+1} | x_{T+1}, X_T, Y_T, M):

$$P(y_{T+1} \mid x_{T+1}, X_T, Y_T, M) = \frac{P(x_{T+1} \mid y_{T+1}, X_T, Y_T, M)\, P(y_{T+1} \mid X_T, Y_T, M)}{P(x_{T+1} \mid X_T, Y_T, M)}. \tag{6.1}$$

Here M denotes our generative model for observation {x_t, y_t}:

$$P(x_t, y_t \mid M) = P(x_t \mid y_t, M)\, P(y_t \mid M). \tag{6.2}$$

P(x_t | y_t, M) could be a mixture model as given by equation 5.1.

6.1 Learning of Model Parameters. Let P(x_t | y_t, M) be parameterized by Θ_y and P(y_t | M) be parameterized by ω = (ω_1, ..., ω_Y):

$$P(x_t \mid y_t = y, M) = P(x_t \mid \Theta_y), \tag{6.3}$$

$$P(y_t \mid M) = P(y_t = y \mid \omega) = \omega_y. \tag{6.4}$$

If ω is given a Dirichlet prior, P(ω | M) = D(ω | d_o), its posterior also has a Dirichlet distribution:

$$P(\omega \mid Y_T, M) = \mathcal{D}(\omega \mid d), \tag{6.5}$$

$$d_y = d_{oy} + \sum_t I(y_t = y). \tag{6.6}$$

I(·) is an indicator function that equals 1 if its argument is true and 0 otherwise.

Under the generative model of equation 6.2, it can be shown that

$$P(\Theta_y \mid X_T, Y_T, M) = P(\Theta_y \mid X_y), \tag{6.7}$$

where X_y is a subset of X_T that contains only those x_t whose training labels y_t have value y. Hence, P(Θ_y | X_T, Y_T, M) can be approximated with Q(Θ_y) by applying the learning rules in sections 3 and 5 on the subset X_y.

6.2 Classification. First, P(y_{T+1} | X_T, Y_T, M) in equation 6.1 can be computed by

$$P(y_{T+1} = y \mid X_T, Y_T, M) = \int P(y_{T+1} = y \mid \omega_y)\, P(\omega \mid X_T, Y_T)\, d\omega = \frac{d_y}{\sum_{y'} d_{y'}}. \tag{6.8}$$

The other term, P(x_{T+1} | y_{T+1}, X_T, Y_T, M), can be computed as

$$\begin{aligned}
\log P(x_{T+1} \mid y_{T+1} = y, X_T, Y_T, M) &= \log P(x_{T+1} \mid X_y, M) \\
&= \log P(x_{T+1}, X_y \mid M) - \log P(X_y \mid M) \qquad (6.9) \\
&\approx E(\{x_{T+1}, X_y\}, Q'(\Theta_y)) - E(X_y, Q(\Theta_y)). \qquad (6.10)
\end{aligned}$$

The above requires adding x_{T+1} to X_y and iterating the learning rules to obtain Q'(Θ_y) and E({x_{T+1}, X_y}, Q'(Θ_y)). The error in the approximation is the difference KL(Q'(Θ_y), P(Θ_y | {x_{T+1}, X_y})) − KL(Q(Θ_y), P(Θ_y | X_y)). If we assume further that Q'(Θ_y) ≈ Q(Θ_y),

$$\log P(x_{T+1} \mid X_y, M) \approx \int Q(\Theta_y) \log P(x_{T+1} \mid \Theta_y)\, d\Theta_y = \log Z_{T+1}, \tag{6.11}$$

where Z_{T+1} is the normalization constant in equation 5.6.
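Combining equations 6.8 through 6.11, the classification step can be sketched as below; `lower_bound_with` is an assumed helper (not named in the paper) that returns the approximation to log P(x_new | X_y, M), either by rerunning the learning rules or, under the further approximation, by evaluating log Z_{T+1}:

```python
import numpy as np

def classify(x_new, class_models, d_counts, lower_bound_with):
    """Approximate P(y | x_new, X_T, Y_T, M) for each class y (equations 6.1 and 6.8-6.11).

    class_models: per-class variational posteriors Q(Theta_y) trained on X_y
    d_counts:     (Y,) Dirichlet counts d_y from equation 6.6
    lower_bound_with(x, model): assumed callable returning log P(x | X_y, M)
    """
    log_prior = np.log(d_counts) - np.log(d_counts.sum())          # equation 6.8
    log_like = np.array([lower_bound_with(x_new, m) for m in class_models])
    log_post = log_prior + log_like                                 # numerator of equation 6.1
    log_post -= log_post.max()                                      # stabilize
    post = np.exp(log_post)
    return post / post.sum()       # the denominator P(x_new | X_T, Y_T, M) cancels here
```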

7 Experiment

7.1 Synthetic Data. In the first experiment, 200 data points were generated by mixing four sources randomly in a seven-dimensional space. The generalized gaussian, gamma, and beta distributions were used to represent source densities of various skewness and kurtosis (see Figure 5). Noise at −26 dB level was added to the data, and missing entries were created with a probability of 0.3. The data matrix for the first 100 data points is plotted in Figure 4. Dark pixels represent missing entries. Notice that some data points have fewer than four observed dimensions. In Figure 5, we plotted the histograms of the recovered sources and the probability density functions (pdf) of the four sources. The dashed line is the exact pdf used to generate the data, and the solid line is the modeled pdf by a mixture of two one-dimensional gaussians (see equation 2.2). This shows that the two gaussians gave an adequate fit to the source histograms and densities.

Figure 4: In the first experiment, 30% of the entries in the seven-dimensional data set are missing, as indicated by the black entries. (The first 100 data points are shown.)

Figure 5: Source density modeling by variational missing ICA of the synthetic data. Histograms: recovered sources distribution; dashed lines: original probability densities; solid line: mixture of gaussians modeled probability densities; dotted lines: individual gaussian contribution.

Figure 6: E(X, Q(θ)) as a function of hidden source dimensions. Full missing ICA refers to the full expansion of gaussians discussed in section 2.3, and polynomial missing ICA refers to the Chan et al. (2002) method with minor modification.

Figure 6 plots the lower bound of the log marginal likelihood (see equation 3.12) for models assuming different numbers of intrinsic dimensions. As expected, the Bayesian treatment allows us to infer the intrinsic dimension of the data cloud. In the figure, we also plot the E(X, Q(θ)) from the polynomial missing ICA. Since a less negative lower bound represents a smaller Kullback-Leibler divergence between Q(θ) and P(X | θ), it is clear from the figure that the full missing ICA gave a better fit to the data density.

7.2 Mixing Images. This experiment demonstrates the ability of the proposed method to fill in missing values while performing demixing. This is made possible if we have more mixtures than hidden sources, or N > L. The top row in Figure 7 shows the two original 380 × 380 pixel images. They were linearly mixed into three images, and −20 dB noise was added. Missing entries were introduced randomly with probability 0.2. The denoised mixtures are shown in the third row of Figure 7, and the recovered sources are in the bottom row. Only 0.8% of the pixels were missing from all three mixed images and could not be recovered; 38.4% of the pixels were missing from only one mixed image, and their values could be filled in with low uncertainty; and 9.6% of the pixels were missing from any two of the mixed images. Estimation of their values is possible but would have high uncertainty. From Figure 7, we can see that the source images were well separated and the mixed images were nicely denoised. The signal-to-noise ratio (SNR) in the separated images was 14 dB. We have also tried filling in the missing pixels by EM with a gaussian model. Variational Bayesian ICA was then applied on the "completed" data. The SNR achieved in the unmixed images was 5 dB. This supports that it is crucial to have the correct density model when filling in missing values and important to learn the density model and missing values concurrently. The denoised mixed images in this example were meant only to illustrate the method visually. However, if x_1, x_2, and x_3 represent cholesterol, blood sugar, and uric acid levels, for example, it would be possible to fill in the third when only two are available.

Figure 7: A demonstration of recovering missing values when N > L. The original images are in the top row. Twenty percent of the pixels in the mixed images (second row) are missing at random. Only 0.8% are missing from the denoised mixed images (third row) and separated images (bottom).
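The quoted percentages follow from masking each of the three mixtures independently with probability 0.2; a quick check of the arithmetic:

```python
from math import comb

p = 0.2  # per-image missing probability
for m in (1, 2, 3):
    prob = comb(3, m) * p**m * (1 - p)**(3 - m)
    print(f"missing from exactly {m} of the 3 mixtures: {prob:.1%}")
# exactly 1: 38.4%, exactly 2: 9.6%, all 3: 0.8%
```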

7.3 Survival Prediction. We demonstrate the supervised classification discussed in section 6 with an echocardiogram data set downloaded from the UCI Machine Learning Repository (Blake & Merz, 1998). Input variables are age-at-heart-attack, fractional-shortening, epss, lvdd, and wall-motion-index. The goal is to predict survival of the patient one year after heart attack. There are 24 positive and 50 negative examples. The data matrix has a missing rate of 5.4%. We performed leave-one-out cross-validation to evaluate our classifier. Thresholding the output P(y_{T+1} | X_T, Y_T, M), computed using equation 6.10, at 0.5, we got a true positive rate of 16/24 and a true negative rate of 42/50.

8 Conclusion

In this article, we derived the learning rules for variational Bayesian ICA with missing data. The complexity of the method is proportional to T × K^L, where T is the number of data points, L is the number of hidden sources assumed, and K is the number of 1D gaussians used to model the density of each source. However, this exponential growth in complexity is manageable and worthwhile for small data sets containing missing entries in a high-dimensional space. The proposed method shows promise in analyzing and identifying projections of data sets that have a very limited number of expensive data points yet contain missing entries due to data scarcity. The extension to model data density with clusters of ICA was discussed. The application of the technique in a supervised classification setting was also covered. We have applied the variational Bayesian missing ICA to a primates' brain volumetric data set containing 44 examples in 57 dimensions. Very encouraging results were obtained and will be reported in another article.


References

Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803-851.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.

Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Irvine, CA: University of California.

Chan, K., Lee, T.-W., & Sejnowski, T. J. (2002). Variational learning of clusters of undercomplete nonsymmetric independent components. Journal of Machine Learning Research, 3, 99-114.

Choudrey, R. A., & Roberts, S. J. (2001). Flexible Bayesian independent component analysis for blind source separation. In 3rd International Conference on Independent Component Analysis and Blind Signal Separation (pp. 90-95). San Diego, CA: Institute for Neural Computation.

Ghahramani, Z., & Jordan, M. (1994). Learning from incomplete data (Tech. Rep. CBCL Paper No. 108). Cambridge, MA: Center for Biological and Computational Learning, MIT.

Hyvarinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Jordan, M. I., Ghahramani, Z., Jaakkola, T., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183-233.

Jung, T.-P., Makeig, S., McKeown, M. J., Bell, A., Lee, T.-W., & Sejnowski, T. J. (2001). Imaging brain dynamics using independent component analysis. Proceedings of the IEEE, 89(7), 1107-1122.

Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.

Mackay, D. J. (1995). Ensemble learning and evidence maximization (Tech. Rep.). Cambridge: Cavendish Laboratory, University of Cambridge.

Miskin, J. (2000). Ensemble learning for independent component analysis. Unpublished doctoral dissertation, University of Cambridge.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Welling, M., & Weber, M. (1999). Independent component analysis of incomplete data. In 1999 6th Joint Symposium on Neural Computation Proceedings (Vol. 9, pp. 162-168). San Diego, CA: Institute for Neural Computation.

Received July 18, 2002; accepted January 30, 2003.

Page 2: Variational Bayesian Learning of ICA with Missing Datakwchan/publications/nc03-mICA.pdfLa Jolla, CA 92093, U.S.A. Missing data are common in real-world data sets and are a problem

1992 K Chan T Lee and T Sejnowski

cloud density and adversely affect subsequent analysis A more principledway would be to use probability density estimates of the missing entriesinstead of point estimates A well-known example of this approach is theuse of the expectation-maximization (EM) algorithm in tting incompletedata with a single gaussian density (Little amp Rubin 1987)

Independent component analysis (ICA Hyvarinen Karhunen amp Oja2001) assumes the observed data x are generated from a linear combinationof independent sources s

x D A s C ordm (11)

where A is the mixing matrix which can be nonsquare The sources s havenongaussian density such as psl expiexcljsljq The noise term ordm can havenonzero mean ICA tries to locate independent axes within the data cloudand was developed forblind source separation It has been applied to speechseparation and analyzing fMRI and EEG data (Jung et al 2001) ICA is alsoused to model data density describing data as linear mixtures of indepen-dent features and nding projections that may uncover interesting structurein the data Maximum likelihood learning of ICA with incomplete data hasbeen studied by Welling and Weber (1999) in the limited case of a squaremixing matrix and predened source densities

Many real-world data sets have intrinsic dimensionality smaller thanthat of the observed data With missing data principal component analysiscannot be used to perform dimension reduction as preprocessing for ICAInstead the variational Bayesian method applied to ICA can handle smalldata sets with high observed dimension (Chan Lee amp Sejnowski 2002Choudrey amp Roberts 2001 Miskin 2000) The Bayesian method preventsovertting and performs automatic dimension reduction In this article weextend the variational Bayesian ICA method to problemswith missing dataMore important the probability density estimate of the missing entries canbe used to ll in the missing values This allows the density model to berened and made more accurate

2 Model and Theory

21 ICA Generative Model with Missing Data Consider a data set ofT data points in an N-dimensional space X D fxt 2 RNg t in f1 TgAssume a noisy ICA generative model for the data

Pxt j micro DZ

N xt j Ast C ordm ordfPst j micros dst (21)

where A is the mixing matrix and ordm and [ordf]iexcl1 are the observation meanand diagonal noise variance respectively The hidden source st is assumed

Variational Bayesian Learning of ICA with Missing Data 1993

to have L dimensions Similar to the independent factor analysis of Attias(1999) each component of st will be modeled by a mixture of K gaussiansto allow for source densities of various kurtosis and skewness

Pst j micros DLY

l

AacuteKX

kl

frac14lklN

iexclstl j Aacutelkl macrlkl

cent

(22)

Split each data point into a missing part and an observed part xgtt D xogt

t

xmgtt In this article we consider only the random missing case (Ghahra-

mani amp Jordan 1994) that is the probability for the missing entries xmt is

independent of the value of xmt but could depend on the value of xo

t Thelikelihood of the data set is then dened to be

L micro I X DY

tPxo

t j micro (23)

where

Pxot j micro D

ZPxt j micro dxm

t

DZ microZ

N xt j Ast C ordm ordf dxmt

paraPst j micros dst

DZ

N xot j [Ast C ordm]o

t [ordf]ot Pst j micros dst (24)

Here we have introduced the notation [cent]ot which means taking only the

observed dimensions (corresponding to the tth data point) of whatever isinside the square brackets Since equation 24 is similar to equation 21the variational Bayesian ICA (Chan et al 2002 Choudrey amp Roberts 2001Miskin 2000) can be extended naturally to handle missing data but only ifcare is taken in discounting missing entries in the learning rules

22 Variational Bayesian Method In a full Bayesian treatment the pos-terior distribution of the parameters micro is obtained by

Pmicro j X D PX j microPmicro

PXD

Qt Pxo

t j micro Pmicro

PX (25)

where PX is the marginal likelihood and given as

PX DZ Y

tPxo

t j microPmicro dmicro (26)

1994 K Chan T Lee and T Sejnowski

The ICA model for PX is dened with the following priors on the param-eters Pmicro

PAnl D N Anl j 0 regl Pfrac14l D Dfrac14l j dofrac14l

Pregl D G regl j aoregl boregl PAacutelkl D N Aacutelklj sup1oAacutelkl 3oAacutelkl (27)

Pmacrlkl D G macrlklj aomacrlkl bomacrlkl

Pordmn D N ordmn j sup1oordmn 3oordmn P9n D G 9n j ao9n bo9n (28)

where N cent G cent and Dcent are the normal gamma and Dirichlet distribu-tions respectively

N x j sup1 curren D

sjcurrenj

2frac14N eiexcl 12 xiexclsup1gtcurrenxiexclsup1I (29)

G x j a b D ba

0axaiexcl1eiexclbxI (210)

Dfrac14 j d D 0P

dkQ0dk

frac14d1iexcl11 pound cent cent cent pound frac14

dKiexcl1K (211)

Here aocent bocent docent sup1ocent and 3ocent are prechosen hyperparameters forthe priors Notice that curren in the normal distribution is an inverse covarianceparameter

Under the variational Bayesian treatment instead of performing the in-tegration in equation 26 to solve for Pmicro j X directly we approximate itby Qmicro and opt to minimize the Kullback-Leibler distance between them(Mackay 1995 Jordan Ghahramani Jaakkola amp Saul 1999)

iexclKLQmicro j Pmicro j X DZ

Qmicro logPmicro j X

Qmicrodmicro

DZ

Qmicro

X

t

log Pxot j micro C log

Pmicro

Qmicro

dmicro

iexcl log PX (212)

Since iexclKLQmicro j Pmicro j X middot 0 we get a lower bound for the log marginallikelihood

log PX cedilZ

Qmicro X

t

log Pxot j micro dmicro C

ZQmicro log

Pmicro

Qmicrodmicro (213)

which can also be obtained by applying Jensenrsquos inequality to equation 26Qmicro is then solved by functional maximization of the lower bound A sep-

Variational Bayesian Learning of ICA with Missing Data 1995

arable approximate posterior Qmicro will be assumed

Qmicro D QordmQordf pound QAQreg

poundY

l

Qfrac14lY

kl

QAacutelkl Qmacrlkl

(214)

The second term in equation 213 which is the negative Kullback-Leiblerdivergence between approximate posterior Qmicro and prior Pmicro is then ex-panded as

ZQmicro log

Pmicro

Qmicrodmicro

DX

l

ZQfrac14l log

Pfrac14l

Qfrac14ldfrac14l

CX

l kl

ZQAacutelkl

logPAacutelkl

QAacutelkl

dAacutelklC

X

l kl

ZQmacrlkl

logPmacrlkl

Qmacrlkl

dmacrlkl

CZ Z

QAQreg logPA j reg

QAdA dreg C

ZQreg log

Preg

Qregdreg

CZ

Qordm logPordm

Qordmdordm C

ZQordf log

Pordf

Qordfdordf (215)

23 Special Treatment for Missing Data Thus far the analysis followsalmost exactly that of the variational Bayesian ICA on complete data exceptthat Pxt j micro is replaced by Pxo

t j micro in equation 26 and consequently themissing entries are discounted in the learning rules However it would beuseful to obtain Qxm

t j xot that is the approximate distribution on the

missing entries which is given by

Qxmt j xo

t DZ

Qmicro

ZN xm

t j [Ast C ordm]mt [ordf]m

t Qst dst dmicro (216)

As noted by Welling and Weber (1999) elements of st given xot are depen-

dent More important under the ICA model Qst is unlikely to be a singlegaussian This is evident from Figure 1 which shows the probability den-sity functions of the data x and hidden variable s The inserts show thesample data in the two spaces Here the hidden sources assume density ofPsl expiexcljslj07 They are mixed noiselessly to give Px in the uppergraph The cut in the upper graph represents Px1 j x2 D iexcl05 whichtransforms into a highly correlated and nongaussian Ps j x2 D iexcl05

Unless we are interested in only the rst- and second-order statisticsof Qxm

t j xot we should try to capture as much structure as possible of

1996 K Chan T Lee and T Sejnowski

shy 1

shy 05

005

1

shy 1

shy 05

0

05

10

02

04

06

08

1

12

14

x1

x2

shy 1

shy 05

0

05

1

shy 1

shy 05

0

05

10

02

04

06

08

1

12

14

16

s1s2

Figure 1 Probability density functions for the data x (top) and hidden sourcess (bottom) Inserts show the sample data in the two spaces The ldquocutsrdquo showPx1 j x2 D iexcl05 and Ps j x2 D iexcl05

Variational Bayesian Learning of ICA with Missing Data 1997

Pst j xot in Qst In this article we take a slightly different route from Chan

et al (2002) or Choudrey and Roberts (2001) when performing variationalBayesian learning First we break down Pst into a mixture of KL gaussiansin the L-dimensional s space

Pst DLY

l

AacuteX

kl

frac14lklN stl j Aacutelkl macrlkl

DX

k1

cent cent centX

kL

[frac141k1 pound cent cent cent pound frac14LkL

pound N st1 j Aacute1k1 macr1k1 pound cent cent cent pound N stL j AacuteLkL macrLkL ]

DX

k

frac14k N st j Aacutek macrk (217)

Here we have dened k to be a vector index The ldquokthrdquo gaussian is centeredat Aacutek of inverse covariance macrk in the source s space

k D k1 kl kLgt kl D 1 K

Aacutek D Aacute1k1 Aacutelkl AacuteLkL

gt

macrk D

0

Bmacr1k1

macrLkL

1

CA

frac14k D frac141k1 pound cent cent cent pound frac14LkL (218)

Log likelihood for xot is then expanded using Jensenrsquos inequality

log Pxot j micro D log

ZPxo

t j st microX

k

frac14k N st j Aacutek macrk dst

D logX

k

frac14k

ZPxo

t j st micro N st j Aacutek macrk dst

cedilX

k

Qkt logZ

Pxot j st micro N st j Aacutek macrk dst

CX

k

Qkt logfrac14k

Qkt (219)

Here Qkt is a short form for Qkt D k kt is a discrete hidden variableand Qkt D k is the probability that the tth data point belongs to the kthgaussian Recognizing that st is just a dummy variable we introduce Qskt

1998 K Chan T Lee and T Sejnowski

xt

Y

n

Aa

st

b

f

kt p

Figure 2 A simplied directed graph for the generative model of variationalICA xt is the observed variable kt and st are hidden variables and the restare model parameters The kt indicates which of the KL expanded gaussiansgenerated st

apply Jensenrsquos inequality again and get

log Pxot j micro cedil

X

k

Qkt

microZQskt log Pxo

t j skt micro dskt

CZ

Qskt logN skt j Aacutek macrk

Qsktdskt

para

CX

k

Qkt logfrac14k

Qkt (220)

Substituting log Pxot j micro back into equation 213 the variational Bayesian

method can be continued as usual We have drawn in Figure 2 a simpliedgraphical representation for the generative model of variational ICA xtis the observed variable kt and st are hidden variables and the rest aremodel parameters where kt indicates which of the KL expanded gaussiansgenerated st

3 Learning Rules

Combining equations 213 215 and 220 we perform functional maximiza-tion on the lower bound of the log marginal likelihood log PX with re-gard to Qmicro (see equation 214) Qkt and Qskt (see equation 220)mdashforexample

log Qordm D log Pordm CZ

Qmicro nordmX

t

log Pxot j micro dmicronordm C const (31)

Variational Bayesian Learning of ICA with Missing Data 1999

where micro nordm is the set of parameters excluding ordm This gives

Qordm DY

nN ordmn j sup1ordmn 3ordmn

3ordmn D 3oordmn C h9niX

t

ont

sup1ordmn D 3oordmnsup1oordmn C h9niP

t ontP

k Qkthxnt iexcl Ancentskti3ordmn

(32)

Similarly

Qordf DY

nG 9n j a9n b9n

a9n D ao9n C12

X

t

ont

b9n D bo9n C 12

X

t

ont

X

k

Qkthxnt iexcl Ancentskt iexcl ordmn2i (33)

QA DY

nN Ancent j sup1Ancent currenAncent

currenAncent D

0

Bhreg1i

hregLi

1

CA C h9niX

t

ont

X

k

Qkthsktsgtkti

sup1Ancent DAacute

h9niX

t

ontxnt iexcl hordmniX

k

Qkthsgtkti

currenAncent

iexcl1 (34)

Qreg DY

l

G regl j aregl bregl

aregl D aoregl C N2

bregl D boregl C12

X

nhA2

nli (35)

Qfrac14l D Dfrac14 j dfrac14l

dfrac14lk D dofrac14lk CX

t

X

klDk

Qkt (36)

2000 K Chan T Lee and T Sejnowski

QAacutelkl D N Aacutelklj sup1Aacutelkl 3Aacutelkl

3Aacutelkl D 3oAacutelkl C hmacrlkliX

t

X

klDk

Qkt

sup1Aacutelkl D3oAacutelkl sup1oAacutelkl C hmacrlkl

iP

tP

klDk Qkthsktli3Aacutelkl

(37)

Qmacrlkl D G macrlkl

j amacrlkl bmacrlkl

amacrlkl D aomacrlkl C12

X

t

X

klDk

Qkt

bmacrlkl D bomacrlkl C12

X

t

X

klDk

Qkthsktl iexcl Aacutelkl 2i (38)

Qskt D N skt j sup1skt currenskt

currenskt D

0

Bhmacr1k1 i

hmacrLkLi

1

CA

C

Agt

0

Bo1t91

oNt9N

1

CA A

+

currensktsup1skt D

0

Bhmacr1k1 Aacute1k1i

hmacrLkL AacuteLkLi

1

CA

C

Agt

0

Bo1t91

oNt9N

1

CA xt iexcl ordm

+ (39)

In the above equations hcenti denotes the expectation over the posterior distri-butions Qcent Ancent is the nth row of the mixing matrix A

PklDk means picking

out those gaussians such that the lth element of their indices k has the valueof k and ot is an indicator variable for observed entries in xt

ont Draquo

1 if xnt is observed0 if xnt is missing (310)

For a model of equal noise variance among all the observation dimensionsthe summation in the learning rules for Qordf would be over both t and

Variational Bayesian Learning of ICA with Missing Data 2001

n Note that there exists scale and translational degeneracy in the modelas given by equation 21 and 22 After each update of Qfrac14l QAacutelkl andQmacrlkl

it is better to rescale Pstl to have zero mean and unit varianceQskt QA Qreg Qordm and Qordf have to be adjusted correspondinglyFinally Qkt is given by

log Qkt D hlog Pxot j skt microi C hlog N skt j Aacutek macrki

iexcl hlog Qskti C hlog frac14ki iexcl log zt (311)

where zt is a normalization constant The lower bound EX Qmicro for the logmarginal likelihood computed using equations 213 215 and 220 can bemonitored during learning and used for comparison of different solutionsor models After some manipulation EX Qmicro can be expressed as

EX Qmicro DX

t

log zt CZ

Qmicro logPmicro

Qmicro dmicro (312)

4 Missing Data

41 Filling in Missing Entries Recovering missing values while per-forming demixing is possible if we have N gt L More specically if thenumber of observed dimensions in xt is greater than L the equation

xot D [A]o

t cent st (41)

would be overdetermined in st unless [A]ot has a rank smaller than L In

this case Qst is likely to be unimodal and peaked point estimates of stwould be sufcient and reliable and the learning rules of Chan et al (2002)with small modication to account for missing entries would give a rea-sonable approximation When Qst is a single gaussian the exponentialgrowth in complexity is avoided However if the number of observed di-mensions in xt is less than L equation 41 is now underdetermined in stand Qst would have a broad multimodal structure This corresponds toovercomplete ICA where single gaussian approximation of Qst is unde-sirable and the formalism discussed in this article is needed to capture thehigher-order statistics of Qst and produce a more faithful Qxm

t j xot The

approximate distribution Qxmt j xo

t can be obtained by

Qxmt j xo

t DX

k

Qkt

Zplusmnxm

t iexcl xmktQxm

kt j xot k dxm

kt (42)

2002 K Chan T Lee and T Sejnowski

where plusmncent is the delta function and

Qxmkt j xo

t k DZ

Qmicro

ZN xm

kt j [Askt C ordm]mt [ordf]m

t Qskt dskt dmicro

DZ Z

QAQordfN xmkt j sup1xm

kt currenxmkt dA dordf (43)

sup1xmkt D [Asup1skt C sup1ordm]m

t (44)

currenxmkt

iexcl1 D [Acurrensktiexcl1Agt C currenordmiexcl1 C diagordfiexcl1]m

t (45)

Unfortunately the integration over QA and Qordf cannot be carried out an-alyticallybut we can substitute hAiand hordfi as an approximationEstimationof Qxm

t j xot using the above equations is demonstrated in Figure 3 The

shaded area is the exact posterior Pxmt j xo

t for the noiseless mixing inFigure 1 with observed x2 D iexcl2 and the solid line is the approximation byequations 42 through 45 We have modied the variational ICA of Chanet al (2002) by discounting missing entries This is done by replacing

Pt

shy 4 shy 3 shy 2 shy 1 0 1 20

01

02

03

04

05

06

07

08

09

Figure 3 The approximation of Qxmt j xo

t from the full missing ICA (solidline) and the polynomial missing ICA (dashed line) The shaded area is theexact posterior Pxm

t j xot corresponding to the noiseless mixture in Figure 1

with observed x2 D iexcl2 Dotted lines are the contribution from the individualQxm

kt j xot k

Variational Bayesian Learning of ICA with Missing Data 2003

withP

t ont and 9n with ont9n in their learning rules The dashed line is theapproximation Qxm

t j xot from this modied method which we refer to

as polynomial missing ICA The treatment of fully expanding the KL hiddensource gaussians discussed in section 23 is named full missing ICA The fullmissing ICA gives a more accurate t for Pxm

t j xot and a better estimate

for hxmt j xo

t i From equation 216

Qxmt j xo

t DZ

Qmicro

ZN xm

t j [Ast C ordm]mt [ordf]m

t Qst dst dmicro (46)

and the above formalism Qst becomes

Qst DX

k

Qkt

Zplusmnst iexcl sktQskt dskt (47)

which is a mixture of KL gaussians The missing values can then be lled inby

hst j xot i D

ZstQst dst D

X

k

Qktsup1skt (48)

hxmt j xo

t i DZ

xmt Qxm

t j xot dxm

t

DX

k

Qktsup1xmkt D [A]m

t hst j xot i C [sup1ordm]m

t (49)

where sup1skt and sup1xmkt are given in equations 39 and 44 Alternatively

a maximum a posterior (MAP) estimate on Qst and Qxmt j xo

t may beobtained but then numerical methods are needed

42 The ldquoFullrdquo and ldquoPolynomialrdquo Missing ICA The complexity of thefull variational Bayesian ICA method is proportional to T pound KL where Tis the number of data points L is the number of hidden sources assumedand K is the number of gaussians used to model the density of each sourceIf we set K D 2 the ve parameters in the source density model Pstlare already enough to model the mean variance skewness and kurtosis ofthe source distribution The full missing ICA should always be preferredif memory and computational time permit The ldquopolynomial missing ICArdquoconverges more slowly per epoch of learning rules and suffers from manymore local maxima It has an inferior marginal likelihood lower bound Theproblems are more serious at high missing data rates and a local maximumsolution is usually found instead In the full missing ICA Qst is a mixtureof gaussians In the extreme case when all entries of a data point are missingthat is empty xo

t Qst is the same as Pst j micro and would not interferewith the learning of Pst j micro from other data point On the other hand thesingle gaussian Qst in the polynomial missing ICA would drive Pst j micro tobecome gaussian too This is very undesirable when learning ICA structure

2004 K Chan T Lee and T Sejnowski

5 Clusters of ICA

The variational Bayesian ICA for missing data described above can be easilyextended to model data density with C clusters of ICA First all parametersmicro and hidden variables kt skt for each cluster are given a superscript indexc Parameter frac12 D ffrac121 frac12Cg is introduced to represent the weights onthe clusters frac12 has a Dirichlet prior (see equation 211) 2 D ffrac12 micro1 microCgis now the collection of all parameters Our density model in equation 21becomes

Pxt j 2 DX

cPct D c j frac12Pxt j micro c

DX

cPct D c j frac12

ZN xt j Acsc

t C ordmc ordfcPsct j micro c

s dsct (51)

The objective function in equation 213 remains the same but with micro replacedby 2 The separable posterior Q2 is given by

Q2 D Qfrac12Y

cQmicro c (52)

and similar to equation 215

ZQ2 log

P2

Q2d2 D

ZQfrac12 log

Pfrac12

Qfrac12dfrac12

CX

c

ZQmicro c log

Pmicro c

Qmicro cdmicro c (53)

Equation 220 now becomes

log Pxot j 2 cedil

X

cQct log

Pct

QctC

X

ck

QctQkct

poundmicroZ

Qsckt log Pxo

t j sckt micro c dsc

kt

CZ

Qsckt log

N sckt j Aacutec

k macrck

Qsckt

dsckt

para

CX

ck

QctQkct log

frac14 ck

Qkct

(54)

We have introduced one more hidden variable ct and Qct is to be inter-preted in the same fashion as Qkc

t All learning rules in section 3 remain

Variational Bayesian Learning of ICA with Missing Data 2005

the same only withP

t replaced byP

t Qct Finally we need two morelearning rules

dfrac12c D dofrac12c C

X

tQct (55)

log Qct D hlog frac12ci C log zct iexcl log Zt (56)

where zct is the normalization constant for Qkc

t (see equation 311) and Ztis for normalizing Qct

6 Supervised Classication

It is generally difcult for discriminative classiers such as multilayer per-ceptron (Bishop 1995) or support vector machine (Vapnik 1998) to handlemissing data In this section we extend the variational Bayesian techniqueto supervised classication

Consider a data set XT YT D fxt yt t in 1 Tg Here xt containsthe input attributes and may have missing entries yt 2 f1 y Ygindicates which of the Y classes xt is associated with When given a newdata point xTC1 we would like to compute PyTC1 j xTC1 XT YT M

PyTC1 j xTC1 XT YT M

D PxTC1 j yTC1 XT YT MPyTC1 j XT YT M

PxTC1 j XT YT M (61)

Here M denotes our generative model for observation fxt ytg

Pxt yt j M D Pxt j yt MPyt j M (62)

Pxt j yt M could be a mixture model as given by equation 51

61 Learning of Model Parameters Let Pxt j yt M be parameterizedby 2y and Pyt j M be parameterized by D 1 Y

Pxt j yt D y M D Pxt j 2y (63)

Pyt j M D Pyt D y j D y (64)

If is given a Dirichlet prior P j M D D j do its posterior hasalso a Dirichlet distribution

P j YT M D D j d (65)

dy D doy CX

tIyt D y (66)

2006 K Chan T Lee and T Sejnowski

Icent is an indicator function that equals 1 if its argument is true and 0 other-wise

Under the generative model of equation 62 it can be shown that

P2y j XT YT M D P2y j Xy (67)

where Xy is a subset of XT but contains only those xt whose training labelsyt have value y Hence P2y j XT YT M can be approximated with Q2y

by applying the learning rules in sections 3 and 5 on subset Xy

62 Classication First PyTC1 j XT YT M in equation 61 can be com-puted by

PyTC1 D y j XT YT M DZ

PyTC1 D y j yPy j XT YT dy

DdyPy dy

(68)

The other term, P(x_{T+1} | y_{T+1}, X_T, Y_T, M), can be computed as

\log P(x_{T+1} | y_{T+1} = y, X_T, Y_T, M) = \log P(x_{T+1} | X_y, M)
    = \log P(\{x_{T+1}, X_y\} | M) - \log P(X_y | M)    (6.9)
    \approx E(\{x_{T+1}, X_y\}, Q'(\Theta^y)) - E(X_y, Q(\Theta^y)) .    (6.10)

The above requires adding x_{T+1} to X_y and iterating the learning rules to obtain Q'(Θ^y) and E({x_{T+1}, X_y}, Q'(Θ^y)). The error in the approximation is the difference KL(Q'(Θ^y), P(Θ^y | {x_{T+1}, X_y})) − KL(Q(Θ^y), P(Θ^y | X_y)). If we assume further that Q'(Θ^y) ≈ Q(Θ^y),

\log P(x_{T+1} | X_y, M) \approx \int Q(\Theta^y) \log P(x_{T+1} | \Theta^y) \, d\Theta^y = \log Z_{T+1},    (6.11)

where Z_{T+1} is the normalization constant in equation 5.6.
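Putting equations 6.1, 6.8, and 6.11 together, classification of a new point could look like the sketch below. The helper log_marginal_new stands in for the approximate log P(x_{T+1} | X_y, M) = log Z_{T+1}; it and the other names are illustrative, not an API defined in the paper.

```python
import numpy as np

def predict_class_posterior(x_new, models, d, log_marginal_new):
    """Approximate P(y_{T+1} = y | x_{T+1}, X_T, Y_T, M) for every class y.

    models : per-class posteriors Q(Theta^y) (e.g., from the sketch above).
    d      : (Y,) Dirichlet counts for the class proportions (equation 6.6).
    log_marginal_new(x, model) -> approximate log P(x | X_y, M), equation 6.11.
    """
    log_prior = np.log(d) - np.log(d.sum())          # equation 6.8
    log_lik = np.array([log_marginal_new(x_new, m) for m in models])
    log_post = log_prior + log_lik                   # numerator of equation 6.1
    log_post -= log_post.max()                       # the denominator is constant in y
    post = np.exp(log_post)
    return post / post.sum()
```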

7 Experiment

7.1 Synthetic Data. In the first experiment, 200 data points were generated by mixing four sources randomly in a seven-dimensional space. The generalized gaussian, gamma, and beta distributions were used to represent source densities of various skewness and kurtosis (see Figure 5). Noise at a −26 dB level was added to the data, and missing entries were created with a probability of 0.3. The data matrix for the first 100 data points is plotted in Figure 4. Dark pixels represent missing entries. Notice that some data points have fewer than four observed dimensions. In Figure 5, we plotted the histograms of the recovered sources and the probability density functions (pdf) of the four sources. The dashed line is the exact pdf used to generate the data, and the solid line is the pdf modeled by a mixture of two one-dimensional gaussians (see equation 2.2). This shows that the two gaussians gave an adequate fit to the source histograms and densities.

Figure 4: In the first experiment, 30% of the entries in the seven-dimensional data set are missing, as indicated by the black entries. (The first 100 data points are shown.)

Figure 5: Source density modeling by variational missing ICA of the synthetic data. Histograms: recovered source distributions; dashed lines: original probability densities; solid lines: mixture-of-gaussians modeled probability densities; dotted lines: individual gaussian contributions.
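The data-generation recipe of this experiment can be sketched as follows. The specific source distributions, the interpretation of the −26 dB noise level as a noise-to-signal power ratio, and the random seed are our illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, L, N = 200, 4, 7                         # data points, sources, observed dimensions

# Sources with different skewness and kurtosis (illustrative choices).
s = np.stack([
    rng.gamma(2.0, 1.0, T),                 # positively skewed
    rng.beta(0.5, 0.5, T),                  # bimodal, negative kurtosis
    rng.laplace(0.0, 1.0, T),               # heavy tailed
    rng.standard_normal(T),
])
s = (s - s.mean(axis=1, keepdims=True)) / s.std(axis=1, keepdims=True)

A = rng.standard_normal((N, L))             # random mixing matrix
x = A @ s                                   # (N, T) noiseless mixtures

# Additive sensor noise at a -26 dB level relative to the mixed signal power.
noise_std = np.sqrt(np.mean(x ** 2) * 10 ** (-26 / 10))
x = x + noise_std * rng.standard_normal(x.shape)

# Each entry is missing independently with probability 0.3 (cf. Figure 4).
missing = rng.random(x.shape) < 0.3
x_observed = np.where(missing, np.nan, x)
```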

Figure 6: E(X, Q(θ)), the log marginal likelihood lower bound, as a function of the number of hidden source dimensions. Full missing ICA refers to the full expansion of gaussians discussed in section 2.3, and polynomial missing ICA refers to the Chan et al. (2002) method with minor modification.

Figure 6 plots the lower bound of the log marginal likelihood (see equation 3.12) for models assuming different numbers of intrinsic dimensions. As expected, the Bayesian treatment allows us to infer the intrinsic dimension of the data cloud. In the figure, we also plot E(X, Q(θ)) from the polynomial missing ICA. Since a less negative lower bound represents a smaller Kullback-Leibler divergence between Q(θ) and P(θ | X), it is clear from the figure that the full missing ICA gave a better fit to the data density.

7.2 Mixing Images. This experiment demonstrates the ability of the proposed method to fill in missing values while performing demixing. This is made possible if we have more mixtures than hidden sources, or N > L. The top row in Figure 7 shows the two original 380 × 380 pixel images. They were linearly mixed into three images, and −20 dB noise was added. Missing entries were introduced randomly with probability 0.2. The denoised mixtures are shown in the third row of Figure 7, and the recovered sources are in the bottom row. Only 0.8% of the pixels were missing from all three mixed images and could not be recovered; 38.4% of the pixels were missing from only one mixed image, and their values could be filled in with low uncertainty; 9.6% of the pixels were missing from two of the mixed images, and estimation of their values is possible but would have high uncertainty. From Figure 7, we can see that the source images were well separated and the mixed images were nicely denoised. The signal-to-noise ratio (SNR) in the separated images was 14 dB. We also tried filling in the missing pixels by EM with a gaussian model; variational Bayesian ICA was then applied on the "completed" data. The SNR achieved in the unmixed images was 5 dB. This supports the claim that it is crucial to have the correct density model when filling in missing values, and that it is important to learn the density model and the missing values concurrently. The denoised mixed images in this example were meant only to illustrate the method visually. However, if x_1, x_2, and x_3 represent cholesterol, blood sugar, and uric acid levels, for example, it would be possible to fill in the third when only two are available.

Figure 7: A demonstration of recovering missing values when N > L. The original images are in the top row. Twenty percent of the pixels in the mixed images (second row) are missing at random. Only 0.8% are missing from the denoised mixed images (third row) and separated images (bottom row).
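The quoted fractions follow directly from dropping each pixel of each of the three mixtures independently with probability 0.2; the short check below reproduces them.

```python
from math import comb

p = 0.2                                    # per-mixture missing probability
for k in range(4):
    frac = comb(3, k) * p ** k * (1 - p) ** (3 - k)
    print(f"missing from exactly {k} of the 3 mixtures: {frac:.1%}")
# exactly 1 -> 38.4%, exactly 2 -> 9.6%, all 3 -> 0.8%
```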

7.3 Survival Prediction. We demonstrate the supervised classification discussed in section 6 with an echocardiogram data set downloaded from the UCI Machine Learning Repository (Blake & Merz, 1998). The input variables are age-at-heart-attack, fractional-shortening, epss, lvdd, and wall-motion-index. The goal is to predict survival of the patient one year after the heart attack. There are 24 positive and 50 negative examples. The data matrix has a missing rate of 5.4%. We performed leave-one-out cross-validation to evaluate our classifier. Thresholding the output P(y_{T+1} | X_T, Y_T, M), computed using equation 6.10, at 0.5, we got a true positive rate of 16/24 and a true negative rate of 42/50.
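A sketch of this leave-one-out evaluation is given below. It assumes integer labels with 1 marking the positive (survived) class and reuses the illustrative helpers sketched in section 6; none of these names come from the paper.

```python
import numpy as np

def loo_true_rates(X, y, fit_fn, predict_fn, threshold=0.5):
    """Leave-one-out estimate of the true negative and true positive rates.

    fit_fn(X_train, y_train) -> (models, d); predict_fn(x, models, d) -> class
    posterior over {0, 1}. Both are stand-ins for the procedures of section 6.
    """
    hits = np.zeros(2)
    totals = np.zeros(2)
    for t in range(len(y)):
        keep = np.arange(len(y)) != t              # hold out example t
        models, d = fit_fn(X[keep], y[keep])
        p_positive = predict_fn(X[t], models, d)[1]
        predicted = int(p_positive >= threshold)   # threshold the posterior at 0.5
        totals[y[t]] += 1
        hits[y[t]] += int(predicted == y[t])
    return hits / totals      # [true negative rate, true positive rate]
```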

8 Conclusion

In this article, we derived the learning rules for variational Bayesian ICA with missing data. The complexity of the method is proportional to T × K^L, where T is the number of data points, L is the number of hidden sources assumed, and K is the number of one-dimensional gaussians used to model the density of each source. However, this exponential growth in complexity is manageable and worthwhile for small data sets containing missing entries in a high-dimensional space. The proposed method shows promise in analyzing and identifying projections of data sets that have a very limited number of expensive data points yet contain missing entries due to data scarcity. The extension to model data density with clusters of ICA was discussed, and the application of the technique in a supervised classification setting was also covered. We have applied the variational Bayesian missing ICA to a primates' brain volumetric data set containing 44 examples in 57 dimensions. Very encouraging results were obtained and will be reported in another article.


References

Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.

Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Irvine, CA: University of California.

Chan, K., Lee, T.-W., & Sejnowski, T. J. (2002). Variational learning of clusters of undercomplete nonsymmetric independent components. Journal of Machine Learning Research, 3, 99–114.

Choudrey, R. A., & Roberts, S. J. (2001). Flexible Bayesian independent component analysis for blind source separation. In 3rd International Conference on Independent Component Analysis and Blind Signal Separation (pp. 90–95). San Diego, CA: Institute for Neural Computation.

Ghahramani, Z., & Jordan, M. (1994). Learning from incomplete data (Tech. Rep. CBCL Paper No. 108). Cambridge, MA: Center for Biological and Computational Learning, MIT.

Hyvarinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Jordan, M. I., Ghahramani, Z., Jaakkola, T., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.

Jung, T.-P., Makeig, S., McKeown, M. J., Bell, A., Lee, T.-W., & Sejnowski, T. J. (2001). Imaging brain dynamics using independent component analysis. Proceedings of the IEEE, 89(7), 1107–1122.

Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.

Mackay, D. J. (1995). Ensemble learning and evidence maximization (Tech. Rep.). Cambridge: Cavendish Laboratory, University of Cambridge.

Miskin, J. (2000). Ensemble learning for independent component analysis. Unpublished doctoral dissertation, University of Cambridge.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Welling, M., & Weber, M. (1999). Independent component analysis of incomplete data. In 1999 6th Joint Symposium on Neural Computation Proceedings (Vol. 9, pp. 162–168). San Diego, CA: Institute for Neural Computation.

Received July 18, 2002; accepted January 30, 2003.

Page 4: Variational Bayesian Learning of ICA with Missing Datakwchan/publications/nc03-mICA.pdfLa Jolla, CA 92093, U.S.A. Missing data are common in real-world data sets and are a problem

1994 K Chan T Lee and T Sejnowski

The ICA model for PX is dened with the following priors on the param-eters Pmicro

PAnl D N Anl j 0 regl Pfrac14l D Dfrac14l j dofrac14l

Pregl D G regl j aoregl boregl PAacutelkl D N Aacutelklj sup1oAacutelkl 3oAacutelkl (27)

Pmacrlkl D G macrlklj aomacrlkl bomacrlkl

Pordmn D N ordmn j sup1oordmn 3oordmn P9n D G 9n j ao9n bo9n (28)

where N cent G cent and Dcent are the normal gamma and Dirichlet distribu-tions respectively

N x j sup1 curren D

sjcurrenj

2frac14N eiexcl 12 xiexclsup1gtcurrenxiexclsup1I (29)

G x j a b D ba

0axaiexcl1eiexclbxI (210)

Dfrac14 j d D 0P

dkQ0dk

frac14d1iexcl11 pound cent cent cent pound frac14

dKiexcl1K (211)

Here aocent bocent docent sup1ocent and 3ocent are prechosen hyperparameters forthe priors Notice that curren in the normal distribution is an inverse covarianceparameter

Under the variational Bayesian treatment instead of performing the in-tegration in equation 26 to solve for Pmicro j X directly we approximate itby Qmicro and opt to minimize the Kullback-Leibler distance between them(Mackay 1995 Jordan Ghahramani Jaakkola amp Saul 1999)

iexclKLQmicro j Pmicro j X DZ

Qmicro logPmicro j X

Qmicrodmicro

DZ

Qmicro

X

t

log Pxot j micro C log

Pmicro

Qmicro

dmicro

iexcl log PX (212)

Since iexclKLQmicro j Pmicro j X middot 0 we get a lower bound for the log marginallikelihood

log PX cedilZ

Qmicro X

t

log Pxot j micro dmicro C

ZQmicro log

Pmicro

Qmicrodmicro (213)

which can also be obtained by applying Jensenrsquos inequality to equation 26Qmicro is then solved by functional maximization of the lower bound A sep-

Variational Bayesian Learning of ICA with Missing Data 1995

arable approximate posterior Qmicro will be assumed

Qmicro D QordmQordf pound QAQreg

poundY

l

Qfrac14lY

kl

QAacutelkl Qmacrlkl

(214)

The second term in equation 213 which is the negative Kullback-Leiblerdivergence between approximate posterior Qmicro and prior Pmicro is then ex-panded as

ZQmicro log

Pmicro

Qmicrodmicro

DX

l

ZQfrac14l log

Pfrac14l

Qfrac14ldfrac14l

CX

l kl

ZQAacutelkl

logPAacutelkl

QAacutelkl

dAacutelklC

X

l kl

ZQmacrlkl

logPmacrlkl

Qmacrlkl

dmacrlkl

CZ Z

QAQreg logPA j reg

QAdA dreg C

ZQreg log

Preg

Qregdreg

CZ

Qordm logPordm

Qordmdordm C

ZQordf log

Pordf

Qordfdordf (215)

23 Special Treatment for Missing Data Thus far the analysis followsalmost exactly that of the variational Bayesian ICA on complete data exceptthat Pxt j micro is replaced by Pxo

t j micro in equation 26 and consequently themissing entries are discounted in the learning rules However it would beuseful to obtain Qxm

t j xot that is the approximate distribution on the

missing entries which is given by

Qxmt j xo

t DZ

Qmicro

ZN xm

t j [Ast C ordm]mt [ordf]m

t Qst dst dmicro (216)

As noted by Welling and Weber (1999) elements of st given xot are depen-

dent More important under the ICA model Qst is unlikely to be a singlegaussian This is evident from Figure 1 which shows the probability den-sity functions of the data x and hidden variable s The inserts show thesample data in the two spaces Here the hidden sources assume density ofPsl expiexcljslj07 They are mixed noiselessly to give Px in the uppergraph The cut in the upper graph represents Px1 j x2 D iexcl05 whichtransforms into a highly correlated and nongaussian Ps j x2 D iexcl05

Unless we are interested in only the rst- and second-order statisticsof Qxm

t j xot we should try to capture as much structure as possible of

1996 K Chan T Lee and T Sejnowski

shy 1

shy 05

005

1

shy 1

shy 05

0

05

10

02

04

06

08

1

12

14

x1

x2

shy 1

shy 05

0

05

1

shy 1

shy 05

0

05

10

02

04

06

08

1

12

14

16

s1s2

Figure 1 Probability density functions for the data x (top) and hidden sourcess (bottom) Inserts show the sample data in the two spaces The ldquocutsrdquo showPx1 j x2 D iexcl05 and Ps j x2 D iexcl05

Variational Bayesian Learning of ICA with Missing Data 1997

Pst j xot in Qst In this article we take a slightly different route from Chan

et al (2002) or Choudrey and Roberts (2001) when performing variationalBayesian learning First we break down Pst into a mixture of KL gaussiansin the L-dimensional s space

Pst DLY

l

AacuteX

kl

frac14lklN stl j Aacutelkl macrlkl

DX

k1

cent cent centX

kL

[frac141k1 pound cent cent cent pound frac14LkL

pound N st1 j Aacute1k1 macr1k1 pound cent cent cent pound N stL j AacuteLkL macrLkL ]

DX

k

frac14k N st j Aacutek macrk (217)

Here we have dened k to be a vector index The ldquokthrdquo gaussian is centeredat Aacutek of inverse covariance macrk in the source s space

k D k1 kl kLgt kl D 1 K

Aacutek D Aacute1k1 Aacutelkl AacuteLkL

gt

macrk D

0

Bmacr1k1

macrLkL

1

CA

frac14k D frac141k1 pound cent cent cent pound frac14LkL (218)

Log likelihood for xot is then expanded using Jensenrsquos inequality

log Pxot j micro D log

ZPxo

t j st microX

k

frac14k N st j Aacutek macrk dst

D logX

k

frac14k

ZPxo

t j st micro N st j Aacutek macrk dst

cedilX

k

Qkt logZ

Pxot j st micro N st j Aacutek macrk dst

CX

k

Qkt logfrac14k

Qkt (219)

Here Qkt is a short form for Qkt D k kt is a discrete hidden variableand Qkt D k is the probability that the tth data point belongs to the kthgaussian Recognizing that st is just a dummy variable we introduce Qskt

1998 K Chan T Lee and T Sejnowski

xt

Y

n

Aa

st

b

f

kt p

Figure 2 A simplied directed graph for the generative model of variationalICA xt is the observed variable kt and st are hidden variables and the restare model parameters The kt indicates which of the KL expanded gaussiansgenerated st

apply Jensenrsquos inequality again and get

log Pxot j micro cedil

X

k

Qkt

microZQskt log Pxo

t j skt micro dskt

CZ

Qskt logN skt j Aacutek macrk

Qsktdskt

para

CX

k

Qkt logfrac14k

Qkt (220)

Substituting log Pxot j micro back into equation 213 the variational Bayesian

method can be continued as usual We have drawn in Figure 2 a simpliedgraphical representation for the generative model of variational ICA xtis the observed variable kt and st are hidden variables and the rest aremodel parameters where kt indicates which of the KL expanded gaussiansgenerated st

3 Learning Rules

Combining equations 213 215 and 220 we perform functional maximiza-tion on the lower bound of the log marginal likelihood log PX with re-gard to Qmicro (see equation 214) Qkt and Qskt (see equation 220)mdashforexample

log Qordm D log Pordm CZ

Qmicro nordmX

t

log Pxot j micro dmicronordm C const (31)

Variational Bayesian Learning of ICA with Missing Data 1999

where micro nordm is the set of parameters excluding ordm This gives

Qordm DY

nN ordmn j sup1ordmn 3ordmn

3ordmn D 3oordmn C h9niX

t

ont

sup1ordmn D 3oordmnsup1oordmn C h9niP

t ontP

k Qkthxnt iexcl Ancentskti3ordmn

(32)

Similarly

Qordf DY

nG 9n j a9n b9n

a9n D ao9n C12

X

t

ont

b9n D bo9n C 12

X

t

ont

X

k

Qkthxnt iexcl Ancentskt iexcl ordmn2i (33)

QA DY

nN Ancent j sup1Ancent currenAncent

currenAncent D

0

Bhreg1i

hregLi

1

CA C h9niX

t

ont

X

k

Qkthsktsgtkti

sup1Ancent DAacute

h9niX

t

ontxnt iexcl hordmniX

k

Qkthsgtkti

currenAncent

iexcl1 (34)

Qreg DY

l

G regl j aregl bregl

aregl D aoregl C N2

bregl D boregl C12

X

nhA2

nli (35)

Qfrac14l D Dfrac14 j dfrac14l

dfrac14lk D dofrac14lk CX

t

X

klDk

Qkt (36)

2000 K Chan T Lee and T Sejnowski

QAacutelkl D N Aacutelklj sup1Aacutelkl 3Aacutelkl

3Aacutelkl D 3oAacutelkl C hmacrlkliX

t

X

klDk

Qkt

sup1Aacutelkl D3oAacutelkl sup1oAacutelkl C hmacrlkl

iP

tP

klDk Qkthsktli3Aacutelkl

(37)

Qmacrlkl D G macrlkl

j amacrlkl bmacrlkl

amacrlkl D aomacrlkl C12

X

t

X

klDk

Qkt

bmacrlkl D bomacrlkl C12

X

t

X

klDk

Qkthsktl iexcl Aacutelkl 2i (38)

Qskt D N skt j sup1skt currenskt

currenskt D

0

Bhmacr1k1 i

hmacrLkLi

1

CA

C

Agt

0

Bo1t91

oNt9N

1

CA A

+

currensktsup1skt D

0

Bhmacr1k1 Aacute1k1i

hmacrLkL AacuteLkLi

1

CA

C

Agt

0

Bo1t91

oNt9N

1

CA xt iexcl ordm

+ (39)

In the above equations hcenti denotes the expectation over the posterior distri-butions Qcent Ancent is the nth row of the mixing matrix A

PklDk means picking

out those gaussians such that the lth element of their indices k has the valueof k and ot is an indicator variable for observed entries in xt

ont Draquo

1 if xnt is observed0 if xnt is missing (310)

For a model of equal noise variance among all the observation dimensionsthe summation in the learning rules for Qordf would be over both t and

Variational Bayesian Learning of ICA with Missing Data 2001

n Note that there exists scale and translational degeneracy in the modelas given by equation 21 and 22 After each update of Qfrac14l QAacutelkl andQmacrlkl

it is better to rescale Pstl to have zero mean and unit varianceQskt QA Qreg Qordm and Qordf have to be adjusted correspondinglyFinally Qkt is given by

log Qkt D hlog Pxot j skt microi C hlog N skt j Aacutek macrki

iexcl hlog Qskti C hlog frac14ki iexcl log zt (311)

where zt is a normalization constant The lower bound EX Qmicro for the logmarginal likelihood computed using equations 213 215 and 220 can bemonitored during learning and used for comparison of different solutionsor models After some manipulation EX Qmicro can be expressed as

EX Qmicro DX

t

log zt CZ

Qmicro logPmicro

Qmicro dmicro (312)

4 Missing Data

41 Filling in Missing Entries Recovering missing values while per-forming demixing is possible if we have N gt L More specically if thenumber of observed dimensions in xt is greater than L the equation

xot D [A]o

t cent st (41)

would be overdetermined in st unless [A]ot has a rank smaller than L In

this case Qst is likely to be unimodal and peaked point estimates of stwould be sufcient and reliable and the learning rules of Chan et al (2002)with small modication to account for missing entries would give a rea-sonable approximation When Qst is a single gaussian the exponentialgrowth in complexity is avoided However if the number of observed di-mensions in xt is less than L equation 41 is now underdetermined in stand Qst would have a broad multimodal structure This corresponds toovercomplete ICA where single gaussian approximation of Qst is unde-sirable and the formalism discussed in this article is needed to capture thehigher-order statistics of Qst and produce a more faithful Qxm

t j xot The

approximate distribution Qxmt j xo

t can be obtained by

Qxmt j xo

t DX

k

Qkt

Zplusmnxm

t iexcl xmktQxm

kt j xot k dxm

kt (42)

2002 K Chan T Lee and T Sejnowski

where plusmncent is the delta function and

Qxmkt j xo

t k DZ

Qmicro

ZN xm

kt j [Askt C ordm]mt [ordf]m

t Qskt dskt dmicro

DZ Z

QAQordfN xmkt j sup1xm

kt currenxmkt dA dordf (43)

sup1xmkt D [Asup1skt C sup1ordm]m

t (44)

currenxmkt

iexcl1 D [Acurrensktiexcl1Agt C currenordmiexcl1 C diagordfiexcl1]m

t (45)

Unfortunately the integration over QA and Qordf cannot be carried out an-alyticallybut we can substitute hAiand hordfi as an approximationEstimationof Qxm

t j xot using the above equations is demonstrated in Figure 3 The

shaded area is the exact posterior Pxmt j xo

t for the noiseless mixing inFigure 1 with observed x2 D iexcl2 and the solid line is the approximation byequations 42 through 45 We have modied the variational ICA of Chanet al (2002) by discounting missing entries This is done by replacing

Pt

shy 4 shy 3 shy 2 shy 1 0 1 20

01

02

03

04

05

06

07

08

09

Figure 3 The approximation of Qxmt j xo

t from the full missing ICA (solidline) and the polynomial missing ICA (dashed line) The shaded area is theexact posterior Pxm

t j xot corresponding to the noiseless mixture in Figure 1

with observed x2 D iexcl2 Dotted lines are the contribution from the individualQxm

kt j xot k

Variational Bayesian Learning of ICA with Missing Data 2003

withP

t ont and 9n with ont9n in their learning rules The dashed line is theapproximation Qxm

t j xot from this modied method which we refer to

as polynomial missing ICA The treatment of fully expanding the KL hiddensource gaussians discussed in section 23 is named full missing ICA The fullmissing ICA gives a more accurate t for Pxm

t j xot and a better estimate

for hxmt j xo

t i From equation 216

Qxmt j xo

t DZ

Qmicro

ZN xm

t j [Ast C ordm]mt [ordf]m

t Qst dst dmicro (46)

and the above formalism Qst becomes

Qst DX

k

Qkt

Zplusmnst iexcl sktQskt dskt (47)

which is a mixture of KL gaussians The missing values can then be lled inby

hst j xot i D

ZstQst dst D

X

k

Qktsup1skt (48)

hxmt j xo

t i DZ

xmt Qxm

t j xot dxm

t

DX

k

Qktsup1xmkt D [A]m

t hst j xot i C [sup1ordm]m

t (49)

where sup1skt and sup1xmkt are given in equations 39 and 44 Alternatively

a maximum a posterior (MAP) estimate on Qst and Qxmt j xo

t may beobtained but then numerical methods are needed

42 The ldquoFullrdquo and ldquoPolynomialrdquo Missing ICA The complexity of thefull variational Bayesian ICA method is proportional to T pound KL where Tis the number of data points L is the number of hidden sources assumedand K is the number of gaussians used to model the density of each sourceIf we set K D 2 the ve parameters in the source density model Pstlare already enough to model the mean variance skewness and kurtosis ofthe source distribution The full missing ICA should always be preferredif memory and computational time permit The ldquopolynomial missing ICArdquoconverges more slowly per epoch of learning rules and suffers from manymore local maxima It has an inferior marginal likelihood lower bound Theproblems are more serious at high missing data rates and a local maximumsolution is usually found instead In the full missing ICA Qst is a mixtureof gaussians In the extreme case when all entries of a data point are missingthat is empty xo

t Qst is the same as Pst j micro and would not interferewith the learning of Pst j micro from other data point On the other hand thesingle gaussian Qst in the polynomial missing ICA would drive Pst j micro tobecome gaussian too This is very undesirable when learning ICA structure

2004 K Chan T Lee and T Sejnowski

5 Clusters of ICA

The variational Bayesian ICA for missing data described above can be easilyextended to model data density with C clusters of ICA First all parametersmicro and hidden variables kt skt for each cluster are given a superscript indexc Parameter frac12 D ffrac121 frac12Cg is introduced to represent the weights onthe clusters frac12 has a Dirichlet prior (see equation 211) 2 D ffrac12 micro1 microCgis now the collection of all parameters Our density model in equation 21becomes

Pxt j 2 DX

cPct D c j frac12Pxt j micro c

DX

cPct D c j frac12

ZN xt j Acsc

t C ordmc ordfcPsct j micro c

s dsct (51)

The objective function in equation 213 remains the same but with micro replacedby 2 The separable posterior Q2 is given by

Q2 D Qfrac12Y

cQmicro c (52)

and similar to equation 215

ZQ2 log

P2

Q2d2 D

ZQfrac12 log

Pfrac12

Qfrac12dfrac12

CX

c

ZQmicro c log

Pmicro c

Qmicro cdmicro c (53)

Equation 220 now becomes

log Pxot j 2 cedil

X

cQct log

Pct

QctC

X

ck

QctQkct

poundmicroZ

Qsckt log Pxo

t j sckt micro c dsc

kt

CZ

Qsckt log

N sckt j Aacutec

k macrck

Qsckt

dsckt

para

CX

ck

QctQkct log

frac14 ck

Qkct

(54)

We have introduced one more hidden variable ct and Qct is to be inter-preted in the same fashion as Qkc

t All learning rules in section 3 remain

Variational Bayesian Learning of ICA with Missing Data 2005

the same only withP

t replaced byP

t Qct Finally we need two morelearning rules

dfrac12c D dofrac12c C

X

tQct (55)

log Qct D hlog frac12ci C log zct iexcl log Zt (56)

where zct is the normalization constant for Qkc

t (see equation 311) and Ztis for normalizing Qct

6 Supervised Classication

It is generally difcult for discriminative classiers such as multilayer per-ceptron (Bishop 1995) or support vector machine (Vapnik 1998) to handlemissing data In this section we extend the variational Bayesian techniqueto supervised classication

Consider a data set XT YT D fxt yt t in 1 Tg Here xt containsthe input attributes and may have missing entries yt 2 f1 y Ygindicates which of the Y classes xt is associated with When given a newdata point xTC1 we would like to compute PyTC1 j xTC1 XT YT M

PyTC1 j xTC1 XT YT M

D PxTC1 j yTC1 XT YT MPyTC1 j XT YT M

PxTC1 j XT YT M (61)

Here M denotes our generative model for observation fxt ytg

Pxt yt j M D Pxt j yt MPyt j M (62)

Pxt j yt M could be a mixture model as given by equation 51

61 Learning of Model Parameters Let Pxt j yt M be parameterizedby 2y and Pyt j M be parameterized by D 1 Y

Pxt j yt D y M D Pxt j 2y (63)

Pyt j M D Pyt D y j D y (64)

If is given a Dirichlet prior P j M D D j do its posterior hasalso a Dirichlet distribution

P j YT M D D j d (65)

dy D doy CX

tIyt D y (66)

2006 K Chan T Lee and T Sejnowski

Icent is an indicator function that equals 1 if its argument is true and 0 other-wise

Under the generative model of equation 62 it can be shown that

P2y j XT YT M D P2y j Xy (67)

where Xy is a subset of XT but contains only those xt whose training labelsyt have value y Hence P2y j XT YT M can be approximated with Q2y

by applying the learning rules in sections 3 and 5 on subset Xy

62 Classication First PyTC1 j XT YT M in equation 61 can be com-puted by

PyTC1 D y j XT YT M DZ

PyTC1 D y j yPy j XT YT dy

DdyPy dy

(68)

The other term PxTC1 j yTC1 XT YT M can be computed as

log PxTC1 j yTC1 D y XT YT M

D log PxTC1 j Xy M

D log PxTC1 Xy j M iexcl log PXy j M (69)

frac14 EfxTC1 Xyg Q02y iexcl EXy Q2y (610)

The above requires adding xTC1 to Xy and iterating the learning rules toobtain Q02y and EfxTC1 Xyg Q02y The error in the approximation isthe difference KLQ02y P2y j fxTC1 Xyg iexcl KLQ2y P2y j Xy If weassume further that Q02y frac14 Q2y

log PxTC1 j Xy M frac14Z

Q2y log PxTC1 j 2y d2y

D log ZTC1 (611)

where ZTC1 is the normalization constant in equation 56

7 Experiment

71 Synthetic Data In the rst experiment 200 data points were gener-ated by mixing four sources randomly in a seven-dimensional space Thegeneralized gaussian gamma and beta distributions were used to repre-sent source densities of various skewness and kurtosis (see Figure 5) Noise

Variational Bayesian Learning of ICA with Missing Data 2007

Figure 4 In the rst experiment 30 of the entries in the seven-dimensionaldata set are missing as indicated by the black entries (The rst 100 data pointsare shown)

Figure 5 Source density modeling by variational missing ICA of the syntheticdata Histograms recovered sources distribution dashed lines original proba-bility densities solid line mixture of gaussians modeled probability densitiesdotted lines individual gaussian contribution

at iexcl26 dB level was added to the data and missing entries were createdwith a probability of 03 The data matrix for the rst 100 data points isplotted in Figure 4 Dark pixels represent missing entries Notice that somedata points have fewer than four observed dimensions In Figure 5 weplotted the histograms of the recovered sources and the probability densityfunctions (pdf) of the four sources The dashed line is the exact pdf usedto generate the data and the solid line is the modeled pdf by mixture oftwo one-dimensional gaussians (see equation 22) This shows that the twogaussians gave adequate t to the source histograms and densities

2008 K Chan T Lee and T Sejnowski

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

Number of dimensions

log

mar

gina

l lik

elih

ood

low

er b

ound

full missing ICA polynomial missing ICA

Figure 6 E X Qmicro as a function of hidden source dimensions Full missing ICArefers to the full expansions of gaussians discussed in section 23 and polynomialmissing ICA refers to the Chan et al (2002) method with minor modication

Figure 6 plots the lower bound of log marginal likelihood (see equa-tion 312) for models assuming different numbers of intrinsic dimensionsAs expected the Bayesian treatment allows us to the infer the intrinsic di-mension of the data cloud In the gure we also plot the EX Qmicro fromthe polynomial missing ICA Since a less negative lower bound representsa smaller Kullback-Leibler divergence between Qmicro and PX j micro it isclear from the gure that the full missing ICA gave a better t to the datadensity

72 Mixing Images This experiment demonstrates the ability of the pro-posed method to ll in missing values while performing demixing This ismade possible if we have more mixtures than hidden sources or N gt L Thetop row in Figure 7 shows the two original 380 pound 380 pixel images Theywere linearly mixed into three images and iexcl20 dB noise was added Miss-ing entries were introduced randomly with probability 02 The denoisedmixtures are shown in the third row of Figure 7 and the recovered sourcesare in the bottom row Only 08 of the pixels were missing from all threemixed images and could not be recovered 384 of the pixels were missingfrom only one mixed image and their values could be lled in with low


Figure 7: A demonstration of recovering missing values when N > L. The original images are in the top row. Twenty percent of the pixels in the mixed images (second row) are missing at random. Only 0.8% are missing from the denoised mixed images (third row) and separated images (bottom).


A further 9.6% of the pixels were missing from two of the three mixed images; estimation of their values is possible but would have high uncertainty. From Figure 7 we can see that the source images were well separated and the mixed images were nicely denoised. The signal-to-noise ratio (SNR) in the separated images was 14 dB. We also tried filling in the missing pixels by EM with a gaussian model; variational Bayesian ICA was then applied on the "completed" data. The SNR achieved in the unmixed images was 5 dB. This supports that it is crucial to have the correct density model when filling in missing values, and important to learn the density model and the missing values concurrently. The denoised mixed images in this example were meant only to illustrate the method visually. However, if x1, x2, and x3 represent cholesterol, blood sugar, and uric acid levels, for example, it would be possible to fill in the third when only two are available.
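The percentages above follow directly from the independent 0.2 missing rate per mixture (a Binomial(3, 0.2) count of missing mixtures per pixel), and the reported SNR is the usual ratio of signal power to residual power. The short check below, with helper names of our own choosing, is only an illustration.

import numpy as np
from math import comb

p_miss = 0.2                               # per-pixel missing probability in each mixture
# P(exactly k of the 3 mixtures are missing at a pixel), for k = 0..3
pattern = [comb(3, k) * p_miss**k * (1 - p_miss)**(3 - k) for k in range(4)]
print(pattern)                             # [0.512, 0.384, 0.096, 0.008] -> 38.4%, 9.6%, 0.8%

def snr_db(reference, estimate):
    """SNR of an estimated image against its reference, in dB."""
    reference = np.asarray(reference, dtype=float)
    err = np.asarray(estimate, dtype=float) - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(err ** 2))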

7.3 Survival Prediction. We demonstrate the supervised classification discussed in section 6 with an echocardiogram data set downloaded from the UCI Machine Learning Repository (Blake & Merz, 1998). Input variables are age-at-heart-attack, fractional-shortening, epss, lvdd, and wall-motion-index. The goal is to predict survival of the patient one year after the heart attack. There are 24 positive and 50 negative examples. The data matrix has a missing rate of 5.4%. We performed leave-one-out cross-validation to evaluate our classifier. Thresholding the output P(y_{T+1} | x_{T+1}, X_T, Y_T, M), computed using equation 6.10, at 0.5, we got a true positive rate of 16/24 and a true negative rate of 42/50.
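For concreteness, the leave-one-out protocol can be written as a short routine. The scoring function, which should return P(y_{T+1} = 1 | x_{T+1}, X_T, Y_T, M) via equations 6.1, 6.8, and 6.10, is assumed to be supplied by the variational machinery of section 6 and is named here only for illustration.

import numpy as np

def loo_rates(X, y, score_positive, threshold=0.5):
    """Leave-one-out true positive and true negative rates.
    score_positive(X_train, y_train, x_test) -> P(y_test = 1 | x_test, training data, M)."""
    X, y = np.asarray(X), np.asarray(y)
    T = len(y)
    pred = np.zeros(T, dtype=int)
    for t in range(T):
        keep = np.arange(T) != t                      # hold out the t-th example
        pred[t] = int(score_positive(X[keep], y[keep], X[t]) >= threshold)
    tp_rate = np.sum((pred == 1) & (y == 1)) / np.sum(y == 1)
    tn_rate = np.sum((pred == 0) & (y == 0)) / np.sum(y == 0)
    return tp_rate, tn_rate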

8 Conclusion

In this article, we derived the learning rules for variational Bayesian ICA with missing data. The complexity of the method is proportional to T × K^L, where T is the number of data points, L is the number of hidden sources assumed, and K is the number of 1D gaussians used to model the density of each source. However, this exponential growth in complexity is manageable and worthwhile for small data sets containing missing entries in a high-dimensional space. The proposed method shows promise in analyzing and identifying projections of data sets that have a very limited number of expensive data points yet contain missing entries due to data scarcity. The extension to model data density with clusters of ICA was discussed. The application of the technique in a supervised classification setting was also covered. We have applied the variational Bayesian missing ICA to a primates' brain volumetric data set containing 44 examples in 57 dimensions. Very encouraging results were obtained and will be reported in another article.


References

Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Irvine, CA: University of California.
Chan, K., Lee, T.-W., & Sejnowski, T. J. (2002). Variational learning of clusters of undercomplete nonsymmetric independent components. Journal of Machine Learning Research, 3, 99–114.
Choudrey, R. A., & Roberts, S. J. (2001). Flexible Bayesian independent component analysis for blind source separation. In 3rd International Conference on Independent Component Analysis and Blind Signal Separation (pp. 90–95). San Diego, CA: Institute for Neural Computation.
Ghahramani, Z., & Jordan, M. (1994). Learning from incomplete data (Tech. Rep. CBCL Paper No. 108). Cambridge, MA: Center for Biological and Computational Learning, MIT.
Hyvarinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Jordan, M. I., Ghahramani, Z., Jaakkola, T., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.
Jung, T.-P., Makeig, S., McKeown, M. J., Bell, A., Lee, T.-W., & Sejnowski, T. J. (2001). Imaging brain dynamics using independent component analysis. Proceedings of the IEEE, 89(7), 1107–1122.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.
Mackay, D. J. (1995). Ensemble learning and evidence maximization (Tech. Rep.). Cambridge: Cavendish Laboratory, University of Cambridge.
Miskin, J. (2000). Ensemble learning for independent component analysis. Unpublished doctoral dissertation, University of Cambridge.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Welling, M., & Weber, M. (1999). Independent component analysis of incomplete data. In 1999 6th Joint Symposium on Neural Computation Proceedings (Vol. 9, pp. 162–168). San Diego, CA: Institute for Neural Computation.

Received July 18, 2002; accepted January 30, 2003.

Page 6: Variational Bayesian Learning of ICA with Missing Datakwchan/publications/nc03-mICA.pdfLa Jolla, CA 92093, U.S.A. Missing data are common in real-world data sets and are a problem

1996 K Chan T Lee and T Sejnowski

shy 1

shy 05

005

1

shy 1

shy 05

0

05

10

02

04

06

08

1

12

14

x1

x2

shy 1

shy 05

0

05

1

shy 1

shy 05

0

05

10

02

04

06

08

1

12

14

16

s1s2

Figure 1 Probability density functions for the data x (top) and hidden sourcess (bottom) Inserts show the sample data in the two spaces The ldquocutsrdquo showPx1 j x2 D iexcl05 and Ps j x2 D iexcl05

Variational Bayesian Learning of ICA with Missing Data 1997

Pst j xot in Qst In this article we take a slightly different route from Chan

et al (2002) or Choudrey and Roberts (2001) when performing variationalBayesian learning First we break down Pst into a mixture of KL gaussiansin the L-dimensional s space

Pst DLY

l

AacuteX

kl

frac14lklN stl j Aacutelkl macrlkl

DX

k1

cent cent centX

kL

[frac141k1 pound cent cent cent pound frac14LkL

pound N st1 j Aacute1k1 macr1k1 pound cent cent cent pound N stL j AacuteLkL macrLkL ]

DX

k

frac14k N st j Aacutek macrk (217)

Here we have dened k to be a vector index The ldquokthrdquo gaussian is centeredat Aacutek of inverse covariance macrk in the source s space

k D k1 kl kLgt kl D 1 K

Aacutek D Aacute1k1 Aacutelkl AacuteLkL

gt

macrk D

0

Bmacr1k1

macrLkL

1

CA

frac14k D frac141k1 pound cent cent cent pound frac14LkL (218)

Log likelihood for xot is then expanded using Jensenrsquos inequality

log Pxot j micro D log

ZPxo

t j st microX

k

frac14k N st j Aacutek macrk dst

D logX

k

frac14k

ZPxo

t j st micro N st j Aacutek macrk dst

cedilX

k

Qkt logZ

Pxot j st micro N st j Aacutek macrk dst

CX

k

Qkt logfrac14k

Qkt (219)

Here Qkt is a short form for Qkt D k kt is a discrete hidden variableand Qkt D k is the probability that the tth data point belongs to the kthgaussian Recognizing that st is just a dummy variable we introduce Qskt

1998 K Chan T Lee and T Sejnowski

xt

Y

n

Aa

st

b

f

kt p

Figure 2 A simplied directed graph for the generative model of variationalICA xt is the observed variable kt and st are hidden variables and the restare model parameters The kt indicates which of the KL expanded gaussiansgenerated st

apply Jensenrsquos inequality again and get

log Pxot j micro cedil

X

k

Qkt

microZQskt log Pxo

t j skt micro dskt

CZ

Qskt logN skt j Aacutek macrk

Qsktdskt

para

CX

k

Qkt logfrac14k

Qkt (220)

Substituting log Pxot j micro back into equation 213 the variational Bayesian

method can be continued as usual We have drawn in Figure 2 a simpliedgraphical representation for the generative model of variational ICA xtis the observed variable kt and st are hidden variables and the rest aremodel parameters where kt indicates which of the KL expanded gaussiansgenerated st

3 Learning Rules

Combining equations 213 215 and 220 we perform functional maximiza-tion on the lower bound of the log marginal likelihood log PX with re-gard to Qmicro (see equation 214) Qkt and Qskt (see equation 220)mdashforexample

log Qordm D log Pordm CZ

Qmicro nordmX

t

log Pxot j micro dmicronordm C const (31)

Variational Bayesian Learning of ICA with Missing Data 1999

where micro nordm is the set of parameters excluding ordm This gives

Qordm DY

nN ordmn j sup1ordmn 3ordmn

3ordmn D 3oordmn C h9niX

t

ont

sup1ordmn D 3oordmnsup1oordmn C h9niP

t ontP

k Qkthxnt iexcl Ancentskti3ordmn

(32)

Similarly

Qordf DY

nG 9n j a9n b9n

a9n D ao9n C12

X

t

ont

b9n D bo9n C 12

X

t

ont

X

k

Qkthxnt iexcl Ancentskt iexcl ordmn2i (33)

QA DY

nN Ancent j sup1Ancent currenAncent

currenAncent D

0

Bhreg1i

hregLi

1

CA C h9niX

t

ont

X

k

Qkthsktsgtkti

sup1Ancent DAacute

h9niX

t

ontxnt iexcl hordmniX

k

Qkthsgtkti

currenAncent

iexcl1 (34)

Qreg DY

l

G regl j aregl bregl

aregl D aoregl C N2

bregl D boregl C12

X

nhA2

nli (35)

Qfrac14l D Dfrac14 j dfrac14l

dfrac14lk D dofrac14lk CX

t

X

klDk

Qkt (36)

2000 K Chan T Lee and T Sejnowski

QAacutelkl D N Aacutelklj sup1Aacutelkl 3Aacutelkl

3Aacutelkl D 3oAacutelkl C hmacrlkliX

t

X

klDk

Qkt

sup1Aacutelkl D3oAacutelkl sup1oAacutelkl C hmacrlkl

iP

tP

klDk Qkthsktli3Aacutelkl

(37)

Qmacrlkl D G macrlkl

j amacrlkl bmacrlkl

amacrlkl D aomacrlkl C12

X

t

X

klDk

Qkt

bmacrlkl D bomacrlkl C12

X

t

X

klDk

Qkthsktl iexcl Aacutelkl 2i (38)

Qskt D N skt j sup1skt currenskt

currenskt D

0

Bhmacr1k1 i

hmacrLkLi

1

CA

C

Agt

0

Bo1t91

oNt9N

1

CA A

+

currensktsup1skt D

0

Bhmacr1k1 Aacute1k1i

hmacrLkL AacuteLkLi

1

CA

C

Agt

0

Bo1t91

oNt9N

1

CA xt iexcl ordm

+ (39)

In the above equations hcenti denotes the expectation over the posterior distri-butions Qcent Ancent is the nth row of the mixing matrix A

PklDk means picking

out those gaussians such that the lth element of their indices k has the valueof k and ot is an indicator variable for observed entries in xt

ont Draquo

1 if xnt is observed0 if xnt is missing (310)

For a model of equal noise variance among all the observation dimensionsthe summation in the learning rules for Qordf would be over both t and

Variational Bayesian Learning of ICA with Missing Data 2001

n Note that there exists scale and translational degeneracy in the modelas given by equation 21 and 22 After each update of Qfrac14l QAacutelkl andQmacrlkl

it is better to rescale Pstl to have zero mean and unit varianceQskt QA Qreg Qordm and Qordf have to be adjusted correspondinglyFinally Qkt is given by

log Qkt D hlog Pxot j skt microi C hlog N skt j Aacutek macrki

iexcl hlog Qskti C hlog frac14ki iexcl log zt (311)

where zt is a normalization constant The lower bound EX Qmicro for the logmarginal likelihood computed using equations 213 215 and 220 can bemonitored during learning and used for comparison of different solutionsor models After some manipulation EX Qmicro can be expressed as

EX Qmicro DX

t

log zt CZ

Qmicro logPmicro

Qmicro dmicro (312)

4 Missing Data

41 Filling in Missing Entries Recovering missing values while per-forming demixing is possible if we have N gt L More specically if thenumber of observed dimensions in xt is greater than L the equation

xot D [A]o

t cent st (41)

would be overdetermined in st unless [A]ot has a rank smaller than L In

this case Qst is likely to be unimodal and peaked point estimates of stwould be sufcient and reliable and the learning rules of Chan et al (2002)with small modication to account for missing entries would give a rea-sonable approximation When Qst is a single gaussian the exponentialgrowth in complexity is avoided However if the number of observed di-mensions in xt is less than L equation 41 is now underdetermined in stand Qst would have a broad multimodal structure This corresponds toovercomplete ICA where single gaussian approximation of Qst is unde-sirable and the formalism discussed in this article is needed to capture thehigher-order statistics of Qst and produce a more faithful Qxm

t j xot The

approximate distribution Qxmt j xo

t can be obtained by

Qxmt j xo

t DX

k

Qkt

Zplusmnxm

t iexcl xmktQxm

kt j xot k dxm

kt (42)

2002 K Chan T Lee and T Sejnowski

where plusmncent is the delta function and

Qxmkt j xo

t k DZ

Qmicro

ZN xm

kt j [Askt C ordm]mt [ordf]m

t Qskt dskt dmicro

DZ Z

QAQordfN xmkt j sup1xm

kt currenxmkt dA dordf (43)

sup1xmkt D [Asup1skt C sup1ordm]m

t (44)

currenxmkt

iexcl1 D [Acurrensktiexcl1Agt C currenordmiexcl1 C diagordfiexcl1]m

t (45)

Unfortunately the integration over QA and Qordf cannot be carried out an-alyticallybut we can substitute hAiand hordfi as an approximationEstimationof Qxm

t j xot using the above equations is demonstrated in Figure 3 The

shaded area is the exact posterior Pxmt j xo

t for the noiseless mixing inFigure 1 with observed x2 D iexcl2 and the solid line is the approximation byequations 42 through 45 We have modied the variational ICA of Chanet al (2002) by discounting missing entries This is done by replacing

Pt

shy 4 shy 3 shy 2 shy 1 0 1 20

01

02

03

04

05

06

07

08

09

Figure 3 The approximation of Qxmt j xo

t from the full missing ICA (solidline) and the polynomial missing ICA (dashed line) The shaded area is theexact posterior Pxm

t j xot corresponding to the noiseless mixture in Figure 1

with observed x2 D iexcl2 Dotted lines are the contribution from the individualQxm

kt j xot k

Variational Bayesian Learning of ICA with Missing Data 2003

withP

t ont and 9n with ont9n in their learning rules The dashed line is theapproximation Qxm

t j xot from this modied method which we refer to

as polynomial missing ICA The treatment of fully expanding the KL hiddensource gaussians discussed in section 23 is named full missing ICA The fullmissing ICA gives a more accurate t for Pxm

t j xot and a better estimate

for hxmt j xo

t i From equation 216

Qxmt j xo

t DZ

Qmicro

ZN xm

t j [Ast C ordm]mt [ordf]m

t Qst dst dmicro (46)

and the above formalism Qst becomes

Qst DX

k

Qkt

Zplusmnst iexcl sktQskt dskt (47)

which is a mixture of KL gaussians The missing values can then be lled inby

hst j xot i D

ZstQst dst D

X

k

Qktsup1skt (48)

hxmt j xo

t i DZ

xmt Qxm

t j xot dxm

t

DX

k

Qktsup1xmkt D [A]m

t hst j xot i C [sup1ordm]m

t (49)

where sup1skt and sup1xmkt are given in equations 39 and 44 Alternatively

a maximum a posterior (MAP) estimate on Qst and Qxmt j xo

t may beobtained but then numerical methods are needed

42 The ldquoFullrdquo and ldquoPolynomialrdquo Missing ICA The complexity of thefull variational Bayesian ICA method is proportional to T pound KL where Tis the number of data points L is the number of hidden sources assumedand K is the number of gaussians used to model the density of each sourceIf we set K D 2 the ve parameters in the source density model Pstlare already enough to model the mean variance skewness and kurtosis ofthe source distribution The full missing ICA should always be preferredif memory and computational time permit The ldquopolynomial missing ICArdquoconverges more slowly per epoch of learning rules and suffers from manymore local maxima It has an inferior marginal likelihood lower bound Theproblems are more serious at high missing data rates and a local maximumsolution is usually found instead In the full missing ICA Qst is a mixtureof gaussians In the extreme case when all entries of a data point are missingthat is empty xo

t Qst is the same as Pst j micro and would not interferewith the learning of Pst j micro from other data point On the other hand thesingle gaussian Qst in the polynomial missing ICA would drive Pst j micro tobecome gaussian too This is very undesirable when learning ICA structure

2004 K Chan T Lee and T Sejnowski

5 Clusters of ICA

The variational Bayesian ICA for missing data described above can be easilyextended to model data density with C clusters of ICA First all parametersmicro and hidden variables kt skt for each cluster are given a superscript indexc Parameter frac12 D ffrac121 frac12Cg is introduced to represent the weights onthe clusters frac12 has a Dirichlet prior (see equation 211) 2 D ffrac12 micro1 microCgis now the collection of all parameters Our density model in equation 21becomes

Pxt j 2 DX

cPct D c j frac12Pxt j micro c

DX

cPct D c j frac12

ZN xt j Acsc

t C ordmc ordfcPsct j micro c

s dsct (51)

The objective function in equation 213 remains the same but with micro replacedby 2 The separable posterior Q2 is given by

Q2 D Qfrac12Y

cQmicro c (52)

and similar to equation 215

ZQ2 log

P2

Q2d2 D

ZQfrac12 log

Pfrac12

Qfrac12dfrac12

CX

c

ZQmicro c log

Pmicro c

Qmicro cdmicro c (53)

Equation 220 now becomes

log Pxot j 2 cedil

X

cQct log

Pct

QctC

X

ck

QctQkct

poundmicroZ

Qsckt log Pxo

t j sckt micro c dsc

kt

CZ

Qsckt log

N sckt j Aacutec

k macrck

Qsckt

dsckt

para

CX

ck

QctQkct log

frac14 ck

Qkct

(54)

We have introduced one more hidden variable ct and Qct is to be inter-preted in the same fashion as Qkc

t All learning rules in section 3 remain

Variational Bayesian Learning of ICA with Missing Data 2005

the same only withP

t replaced byP

t Qct Finally we need two morelearning rules

dfrac12c D dofrac12c C

X

tQct (55)

log Qct D hlog frac12ci C log zct iexcl log Zt (56)

where zct is the normalization constant for Qkc

t (see equation 311) and Ztis for normalizing Qct

6 Supervised Classication

It is generally difcult for discriminative classiers such as multilayer per-ceptron (Bishop 1995) or support vector machine (Vapnik 1998) to handlemissing data In this section we extend the variational Bayesian techniqueto supervised classication

Consider a data set XT YT D fxt yt t in 1 Tg Here xt containsthe input attributes and may have missing entries yt 2 f1 y Ygindicates which of the Y classes xt is associated with When given a newdata point xTC1 we would like to compute PyTC1 j xTC1 XT YT M

PyTC1 j xTC1 XT YT M

D PxTC1 j yTC1 XT YT MPyTC1 j XT YT M

PxTC1 j XT YT M (61)

Here M denotes our generative model for observation fxt ytg

Pxt yt j M D Pxt j yt MPyt j M (62)

Pxt j yt M could be a mixture model as given by equation 51

61 Learning of Model Parameters Let Pxt j yt M be parameterizedby 2y and Pyt j M be parameterized by D 1 Y

Pxt j yt D y M D Pxt j 2y (63)

Pyt j M D Pyt D y j D y (64)

If is given a Dirichlet prior P j M D D j do its posterior hasalso a Dirichlet distribution

P j YT M D D j d (65)

dy D doy CX

tIyt D y (66)

2006 K Chan T Lee and T Sejnowski

Icent is an indicator function that equals 1 if its argument is true and 0 other-wise

Under the generative model of equation 62 it can be shown that

P2y j XT YT M D P2y j Xy (67)

where Xy is a subset of XT but contains only those xt whose training labelsyt have value y Hence P2y j XT YT M can be approximated with Q2y

by applying the learning rules in sections 3 and 5 on subset Xy

62 Classication First PyTC1 j XT YT M in equation 61 can be com-puted by

PyTC1 D y j XT YT M DZ

PyTC1 D y j yPy j XT YT dy

DdyPy dy

(68)

The other term PxTC1 j yTC1 XT YT M can be computed as

log PxTC1 j yTC1 D y XT YT M

D log PxTC1 j Xy M

D log PxTC1 Xy j M iexcl log PXy j M (69)

frac14 EfxTC1 Xyg Q02y iexcl EXy Q2y (610)

The above requires adding xTC1 to Xy and iterating the learning rules toobtain Q02y and EfxTC1 Xyg Q02y The error in the approximation isthe difference KLQ02y P2y j fxTC1 Xyg iexcl KLQ2y P2y j Xy If weassume further that Q02y frac14 Q2y

log PxTC1 j Xy M frac14Z

Q2y log PxTC1 j 2y d2y

D log ZTC1 (611)

where ZTC1 is the normalization constant in equation 56

7 Experiment

71 Synthetic Data In the rst experiment 200 data points were gener-ated by mixing four sources randomly in a seven-dimensional space Thegeneralized gaussian gamma and beta distributions were used to repre-sent source densities of various skewness and kurtosis (see Figure 5) Noise

Variational Bayesian Learning of ICA with Missing Data 2007

Figure 4 In the rst experiment 30 of the entries in the seven-dimensionaldata set are missing as indicated by the black entries (The rst 100 data pointsare shown)

Figure 5 Source density modeling by variational missing ICA of the syntheticdata Histograms recovered sources distribution dashed lines original proba-bility densities solid line mixture of gaussians modeled probability densitiesdotted lines individual gaussian contribution

at iexcl26 dB level was added to the data and missing entries were createdwith a probability of 03 The data matrix for the rst 100 data points isplotted in Figure 4 Dark pixels represent missing entries Notice that somedata points have fewer than four observed dimensions In Figure 5 weplotted the histograms of the recovered sources and the probability densityfunctions (pdf) of the four sources The dashed line is the exact pdf usedto generate the data and the solid line is the modeled pdf by mixture oftwo one-dimensional gaussians (see equation 22) This shows that the twogaussians gave adequate t to the source histograms and densities

2008 K Chan T Lee and T Sejnowski

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

Number of dimensions

log

mar

gina

l lik

elih

ood

low

er b

ound

full missing ICA polynomial missing ICA

Figure 6 E X Qmicro as a function of hidden source dimensions Full missing ICArefers to the full expansions of gaussians discussed in section 23 and polynomialmissing ICA refers to the Chan et al (2002) method with minor modication

Figure 6 plots the lower bound of log marginal likelihood (see equa-tion 312) for models assuming different numbers of intrinsic dimensionsAs expected the Bayesian treatment allows us to the infer the intrinsic di-mension of the data cloud In the gure we also plot the EX Qmicro fromthe polynomial missing ICA Since a less negative lower bound representsa smaller Kullback-Leibler divergence between Qmicro and PX j micro it isclear from the gure that the full missing ICA gave a better t to the datadensity

72 Mixing Images This experiment demonstrates the ability of the pro-posed method to ll in missing values while performing demixing This ismade possible if we have more mixtures than hidden sources or N gt L Thetop row in Figure 7 shows the two original 380 pound 380 pixel images Theywere linearly mixed into three images and iexcl20 dB noise was added Miss-ing entries were introduced randomly with probability 02 The denoisedmixtures are shown in the third row of Figure 7 and the recovered sourcesare in the bottom row Only 08 of the pixels were missing from all threemixed images and could not be recovered 384 of the pixels were missingfrom only one mixed image and their values could be lled in with low

Variational Bayesian Learning of ICA with Missing Data 2009

Figure 7 A demonstration of recovering missing values when N gt L Theoriginal images are in the top row Twenty percent of the pixels in the mixedimages (second row) are missing at random Only 08 are missing from thedenoised mixed images (third row) and separated images (bottom)

2010 K Chan T Lee and T Sejnowski

uncertainty and 96 of the pixels were missing from any two of the mixedimages Estimation of their values is possible but would have high uncer-tainty From Figure 7 we can see that the source images were well separatedand the mixed images were nicely denoised The signal-to-noise ratio (SNR)in the separated images was 14 dB We have also tried lling in the missingpixels by EM with a gaussian model Variational Bayesian ICA was then ap-plied on the ldquocompletedrdquo data The SNR achieved in the unmixed imageswas 5 dB This supports that it is crucial to have the correct density modelwhen lling in missing values and important to learn the density model andmissing values concurrently The denoised mixed images in this examplewere meant only to illustrate the method visually However if x1 x2 x3

represent cholesterol blood sugar and uric acid level for example it wouldbe possible to ll in the third when only two are available

73 Survival Prediction We demonstrate the supervised classicationdiscussed in section 6 with an echocardiogram data set downloaded fromthe UCI Machine Learning Repository (Blake amp Merz 1998) Input variablesare age-at-heart-attack fractional-shortening epss lvdd and wall-motion-indexThe goal is to predict survival of the patient one year after heart attack Thereare 24 positive and 50 negative examples The data matrix has a missingrate of 54 We performed leave-one-out cross-validation to evaluate ourclassier Thresholding the output PyTC1 j XT YT M computed usingequation 610 at 05 we got a true positive rate of 1624 and a true negativerate of 4250

8 Conclusion

In this article we derived the learning rules for variational Bayesian ICAwith missing data The complexity of the method is proportional to T pound KLwhere T is the number of data points L is the number of hidden sourcesassumed and K is the number of 1D gaussians used to model the densityof each source However this exponential growth in complexity is man-ageable and worthwhile for small data sets containing missing entries in ahigh-dimensional space The proposed method shows promise in analyzingand identifying projections of data sets that have a very limited number ofexpensive data points yet contain missing entries due to data scarcity Theextension to model data density with clusters of ICA was discussed Theapplication of the technique in a supervised classication setting was alsocovered We have applied the variational Bayesian missing ICA to a pri-matesrsquo brain volumetric data set containing 44 examples in 57 dimensionsVery encouraging results were obtained and will be reported in anotherarticle

Variational Bayesian Learning of ICA with Missing Data 2011

References

Attias H (1999) Independent factor analysis Neural Computation 11(4) 803ndash851

Bishop C M (1995) Neural networks for pattern recognition Oxford ClarendonPress

Blake C amp Merz C (1998) UCI repository of machine learning databases IrvineCA University of California

Chan K Lee T-W amp Sejnowski T J (2002) Variational learning of clusters ofundercomplete nonsymmetric independent components Journal of MachineLearning Research 3 99ndash114

Choudrey R A amp Roberts S J (2001) Flexible Bayesian independent compo-nent analysis for blind source separation In 3rd International Conference onIndependent Component Analysis and Blind Signal Separation (pp 90ndash95) SanDiego CA Institute for Neural Computation

Ghahramani Z amp Jordan M (1994) Learning from incomplete data (Tech RepCBCL Paper No 108) Cambridge MA Center for Biological and Computa-tional Learning MIT

Hyvarinen A Karhunen J amp Oja E (2001) Independent component analysisNew York Wiley

Jordan M I Ghahramani Z Jaakkola T amp Saul L K (1999) An introductionto variational methods for graphical models Machine Learning 37(2) 183ndash233

Jung T-P Makeig S McKeown M J Bell A Lee T-W amp Sejnowski T J(2001) Imaging brain dynamics using independent component analysisProceedings of the IEEE 89(7) 1107ndash1122

Little R J A amp Rubin D B (1987) Statistical analysis with missing data NewYork Wiley

Mackay D J (1995) Ensemble learning and evidence maximization (Tech Rep)Cambridge Cavendish Laboratory University of Cambridge

Miskin J (2000) Ensemble learning for independent component analysis Unpub-lished doctoral dissertation University of Cambridge

Vapnik V (1998) Statistical learning theory New York WileyWelling M amp Weber M (1999) Independent component analysis of incomplete

data In 1999 6th Joint Symposium on Neural Compuatation Proceedings (Vol 9pp 162ndash168) San Diego CA Institute for Neural Computation

Received July 18 2002 accepted January 30 2003

Page 7: Variational Bayesian Learning of ICA with Missing Datakwchan/publications/nc03-mICA.pdfLa Jolla, CA 92093, U.S.A. Missing data are common in real-world data sets and are a problem

Variational Bayesian Learning of ICA with Missing Data 1997

Pst j xot in Qst In this article we take a slightly different route from Chan

et al (2002) or Choudrey and Roberts (2001) when performing variationalBayesian learning First we break down Pst into a mixture of KL gaussiansin the L-dimensional s space

Pst DLY

l

AacuteX

kl

frac14lklN stl j Aacutelkl macrlkl

DX

k1

cent cent centX

kL

[frac141k1 pound cent cent cent pound frac14LkL

pound N st1 j Aacute1k1 macr1k1 pound cent cent cent pound N stL j AacuteLkL macrLkL ]

DX

k

frac14k N st j Aacutek macrk (217)

Here we have dened k to be a vector index The ldquokthrdquo gaussian is centeredat Aacutek of inverse covariance macrk in the source s space

k D k1 kl kLgt kl D 1 K

Aacutek D Aacute1k1 Aacutelkl AacuteLkL

gt

macrk D

0

Bmacr1k1

macrLkL

1

CA

frac14k D frac141k1 pound cent cent cent pound frac14LkL (218)

Log likelihood for xot is then expanded using Jensenrsquos inequality

log Pxot j micro D log

ZPxo

t j st microX

k

frac14k N st j Aacutek macrk dst

D logX

k

frac14k

ZPxo

t j st micro N st j Aacutek macrk dst

cedilX

k

Qkt logZ

Pxot j st micro N st j Aacutek macrk dst

CX

k

Qkt logfrac14k

Qkt (219)

Here Qkt is a short form for Qkt D k kt is a discrete hidden variableand Qkt D k is the probability that the tth data point belongs to the kthgaussian Recognizing that st is just a dummy variable we introduce Qskt

1998 K Chan T Lee and T Sejnowski

xt

Y

n

Aa

st

b

f

kt p

Figure 2 A simplied directed graph for the generative model of variationalICA xt is the observed variable kt and st are hidden variables and the restare model parameters The kt indicates which of the KL expanded gaussiansgenerated st

apply Jensenrsquos inequality again and get

log Pxot j micro cedil

X

k

Qkt

microZQskt log Pxo

t j skt micro dskt

CZ

Qskt logN skt j Aacutek macrk

Qsktdskt

para

CX

k

Qkt logfrac14k

Qkt (220)

Substituting log Pxot j micro back into equation 213 the variational Bayesian

method can be continued as usual We have drawn in Figure 2 a simpliedgraphical representation for the generative model of variational ICA xtis the observed variable kt and st are hidden variables and the rest aremodel parameters where kt indicates which of the KL expanded gaussiansgenerated st

3 Learning Rules

Combining equations 213 215 and 220 we perform functional maximiza-tion on the lower bound of the log marginal likelihood log PX with re-gard to Qmicro (see equation 214) Qkt and Qskt (see equation 220)mdashforexample

log Qordm D log Pordm CZ

Qmicro nordmX

t

log Pxot j micro dmicronordm C const (31)

Variational Bayesian Learning of ICA with Missing Data 1999

where micro nordm is the set of parameters excluding ordm This gives

Qordm DY

nN ordmn j sup1ordmn 3ordmn

3ordmn D 3oordmn C h9niX

t

ont

sup1ordmn D 3oordmnsup1oordmn C h9niP

t ontP

k Qkthxnt iexcl Ancentskti3ordmn

(32)

Similarly

Qordf DY

nG 9n j a9n b9n

a9n D ao9n C12

X

t

ont

b9n D bo9n C 12

X

t

ont

X

k

Qkthxnt iexcl Ancentskt iexcl ordmn2i (33)

QA DY

nN Ancent j sup1Ancent currenAncent

currenAncent D

0

Bhreg1i

hregLi

1

CA C h9niX

t

ont

X

k

Qkthsktsgtkti

sup1Ancent DAacute

h9niX

t

ontxnt iexcl hordmniX

k

Qkthsgtkti

currenAncent

iexcl1 (34)

Qreg DY

l

G regl j aregl bregl

aregl D aoregl C N2

bregl D boregl C12

X

nhA2

nli (35)

Qfrac14l D Dfrac14 j dfrac14l

dfrac14lk D dofrac14lk CX

t

X

klDk

Qkt (36)

2000 K Chan T Lee and T Sejnowski

QAacutelkl D N Aacutelklj sup1Aacutelkl 3Aacutelkl

3Aacutelkl D 3oAacutelkl C hmacrlkliX

t

X

klDk

Qkt

sup1Aacutelkl D3oAacutelkl sup1oAacutelkl C hmacrlkl

iP

tP

klDk Qkthsktli3Aacutelkl

(37)

Qmacrlkl D G macrlkl

j amacrlkl bmacrlkl

amacrlkl D aomacrlkl C12

X

t

X

klDk

Qkt

bmacrlkl D bomacrlkl C12

X

t

X

klDk

Qkthsktl iexcl Aacutelkl 2i (38)

Qskt D N skt j sup1skt currenskt

currenskt D

0

Bhmacr1k1 i

hmacrLkLi

1

CA

C

Agt

0

Bo1t91

oNt9N

1

CA A

+

currensktsup1skt D

0

Bhmacr1k1 Aacute1k1i

hmacrLkL AacuteLkLi

1

CA

C

Agt

0

Bo1t91

oNt9N

1

CA xt iexcl ordm

+ (39)

In the above equations, \langle \cdot \rangle denotes the expectation over the posterior distributions Q(\cdot), A_{n\cdot} is the nth row of the mixing matrix A, \sum_{k_l = k} means picking out those gaussians such that the lth element of their indices k has the value of k, and o_t is an indicator variable for observed entries in x_t:

o_{nt} = \begin{cases} 1 & \text{if } x_{nt} \text{ is observed,} \\ 0 & \text{if } x_{nt} \text{ is missing.} \end{cases}   (3.10)
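In practice the indicator can be built directly from a data matrix whose missing entries are coded as NaN. The minimal sketch below (the NaN coding is an assumption, not something specified in the paper) shows the mask and the masked counts used in the learning rules above.

import numpy as np

X = np.array([[ 1.2, np.nan, 0.3],
              [np.nan, -0.7, 2.1]])      # N x T data matrix with missing entries

O = (~np.isnan(X)).astype(float)         # o_nt: 1 if observed, 0 if missing
X_filled = np.nan_to_num(X)              # zeros in missing slots; they are always
                                         # multiplied by o_nt = 0, so they never
                                         # contribute to the masked sums
counts_per_dim = O.sum(axis=1)           # sum_t o_nt, as used in eqs. 3.2 and 3.3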

For a model of equal noise variance among all the observation dimensions, the summation in the learning rules for Q(\Psi) would be over both t and n. Note that there exists scale and translational degeneracy in the model, as given by equations 2.1 and 2.2. After each update of Q(\pi_l), Q(\phi_{lk_l}), and Q(\beta_{lk_l}), it is better to rescale P(s_{tl}) to have zero mean and unit variance; Q(s_{kt}), Q(A), Q(\alpha), Q(\nu), and Q(\Psi) have to be adjusted correspondingly. Finally, Q(k,t) is given by

\log Q(k,t) = \langle \log P(x_t^o \mid s_{kt}, \theta) \rangle + \langle \log N(s_{kt} \mid \phi_k, \beta_k) \rangle - \langle \log Q(s_{kt}) \rangle + \langle \log \pi_k \rangle - \log z_t,   (3.11)

where z_t is a normalization constant. The lower bound E(X, Q(\theta)) for the log marginal likelihood, computed using equations 2.13, 2.15, and 2.20, can be monitored during learning and used for comparison of different solutions or models. After some manipulation, E(X, Q(\theta)) can be expressed as

E(X, Q(\theta)) = \sum_t \log z_t + \int Q(\theta) \log \frac{P(\theta)}{Q(\theta)} \, d\theta.   (3.12)
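As a concrete illustration of how the observation indicator of equation 3.10 enters the source posterior of equation 3.9, the following sketch computes \mu_{s_{kt}} and \Lambda_{s_{kt}} for one data point and one expanded gaussian. It uses posterior means in place of the full expectations, an approximation made only for this example; it is not the authors' implementation, and all input shapes are assumptions.

import numpy as np

def source_posterior(A, Psi, nu, beta_k, phi_k, x_t, o_t):
    # A: (N, L) posterior mean mixing matrix; Psi: (N,) mean noise precisions;
    # nu: (N,) mean bias; beta_k, phi_k: (L,) <beta_{l k_l}>, <phi_{l k_l}>;
    # x_t: (N,) data with missing entries set to 0; o_t: (N,) mask of eq. 3.10.
    masked_prec = o_t * Psi                          # diag(o_1t Psi_1, ..., o_Nt Psi_N)
    Lambda_s = np.diag(beta_k) + A.T @ (masked_prec[:, None] * A)
    rhs = beta_k * phi_k + A.T @ (masked_prec * (x_t - nu))
    mu_s = np.linalg.solve(Lambda_s, rhs)            # solves Lambda_s mu_s = rhs
    return mu_s, Lambda_s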

4 Missing Data

4.1 Filling in Missing Entries. Recovering missing values while performing demixing is possible if we have N > L. More specifically, if the number of observed dimensions in x_t is greater than L, the equation

x_t^o = [A]_t^o \cdot s_t   (4.1)

would be overdetermined in s_t, unless [A]_t^o has a rank smaller than L. In this case, Q(s_t) is likely to be unimodal and peaked; point estimates of s_t would be sufficient and reliable, and the learning rules of Chan et al. (2002), with small modification to account for missing entries, would give a reasonable approximation. When Q(s_t) is a single gaussian, the exponential growth in complexity is avoided. However, if the number of observed dimensions in x_t is less than L, equation 4.1 is now underdetermined in s_t, and Q(s_t) would have a broad multimodal structure. This corresponds to overcomplete ICA, where a single gaussian approximation of Q(s_t) is undesirable and the formalism discussed in this article is needed to capture the higher-order statistics of Q(s_t) and produce a more faithful Q(x_t^m \mid x_t^o).

t j xot The

approximate distribution Qxmt j xo

t can be obtained by

Qxmt j xo

t DX

k

Qkt

Zplusmnxm

t iexcl xmktQxm

kt j xot k dxm

kt (42)


where \delta(\cdot) is the delta function and

Q(x_{kt}^m \mid x_t^o, k) = \int Q(\theta) \int N(x_{kt}^m \mid [A s_{kt} + \nu]_t^m, [\Psi]_t^m) \, Q(s_{kt}) \, ds_{kt} \, d\theta
  = \int\!\!\int Q(A) \, Q(\Psi) \, N(x_{kt}^m \mid \mu_{x_{kt}^m}, \Lambda_{x_{kt}^m}) \, dA \, d\Psi,   (4.3)

\mu_{x_{kt}^m} = [A \mu_{s_{kt}} + \mu_\nu]_t^m,   (4.4)

\Lambda_{x_{kt}^m}^{-1} = [A \Lambda_{s_{kt}}^{-1} A^\top + \Lambda_\nu^{-1} + \mathrm{diag}(\Psi)^{-1}]_t^m.   (4.5)

Unfortunately, the integration over Q(A) and Q(\Psi) cannot be carried out analytically, but we can substitute \langle A \rangle and \langle \Psi \rangle as an approximation. Estimation of Q(x_t^m \mid x_t^o) using the above equations is demonstrated in Figure 3. The shaded area is the exact posterior P(x_t^m \mid x_t^o) for the noiseless mixing in Figure 1 with observed x_2 = -2, and the solid line is the approximation by equations 4.2 through 4.5. We have modified the variational ICA of Chan et al. (2002) by discounting missing entries. This is done by replacing \sum_t with \sum_t o_{nt}, and \Psi_n with o_{nt}\Psi_n, in their learning rules. The dashed line is the approximation Q(x_t^m \mid x_t^o) from this modified method, which we refer to as polynomial missing ICA. The treatment of fully expanding the K^L hidden source gaussians discussed in section 2.3 is named full missing ICA. The full missing ICA gives a more accurate fit for P(x_t^m \mid x_t^o) and a better estimate for \langle x_t^m \mid x_t^o \rangle.

Figure 3: The approximation of Q(x_t^m \mid x_t^o) from the full missing ICA (solid line) and the polynomial missing ICA (dashed line). The shaded area is the exact posterior P(x_t^m \mid x_t^o) corresponding to the noiseless mixture in Figure 1 with observed x_2 = -2. Dotted lines are the contribution from the individual Q(x_{kt}^m \mid x_t^o, k).

From equation 2.16,

Q(x_t^m \mid x_t^o) = \int Q(\theta) \int N(x_t^m \mid [A s_t + \nu]_t^m, [\Psi]_t^m) \, Q(s_t) \, ds_t \, d\theta,   (4.6)

and the above formalism, Q(s_t) becomes

Q(s_t) = \sum_k Q(k,t) \int \delta(s_t - s_{kt}) \, Q(s_{kt}) \, ds_{kt},   (4.7)

which is a mixture of K^L gaussians. The missing values can then be filled in by

\langle s_t \mid x_t^o \rangle = \int s_t \, Q(s_t) \, ds_t = \sum_k Q(k,t) \, \mu_{s_{kt}},   (4.8)

\langle x_t^m \mid x_t^o \rangle = \int x_t^m \, Q(x_t^m \mid x_t^o) \, dx_t^m = \sum_k Q(k,t) \, \mu_{x_{kt}^m} = [A]_t^m \langle s_t \mid x_t^o \rangle + [\mu_\nu]_t^m,   (4.9)

where \mu_{s_{kt}} and \mu_{x_{kt}^m} are given in equations 3.9 and 4.4. Alternatively, a maximum a posteriori (MAP) estimate on Q(s_t) and Q(x_t^m \mid x_t^o) may be obtained, but then numerical methods are needed.
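The fill-in rule of equations 4.8 and 4.9 can be sketched as follows (illustration only; mu_s_per_k and Q_k are assumed to come from the per-gaussian posteriors and responsibilities computed with the learning rules of section 3, for example via the hypothetical source_posterior helper sketched earlier).

import numpy as np

def fill_in_missing(A, nu, x_t, o_t, mu_s_per_k, Q_k):
    # mu_s_per_k: (num_gaussians, L) array of mu_{s_kt}; Q_k: responsibilities Q(k,t).
    s_mean = Q_k @ mu_s_per_k                       # <s_t | x_t^o>, equation 4.8
    x_mean = A @ s_mean + nu                        # [A]<s_t> + mu_nu, equation 4.9
    x_filled = np.where(o_t == 1, x_t, x_mean)      # keep observed entries, fill the rest
    return s_mean, x_filled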

4.2 The "Full" and "Polynomial" Missing ICA. The complexity of the full variational Bayesian ICA method is proportional to T \times K^L, where T is the number of data points, L is the number of hidden sources assumed, and K is the number of gaussians used to model the density of each source. If we set K = 2, the five parameters in the source density model P(s_{tl}) are already enough to model the mean, variance, skewness, and kurtosis of the source distribution. The full missing ICA should always be preferred if memory and computational time permit. The "polynomial missing ICA" converges more slowly per epoch of learning rules and suffers from many more local maxima; it has an inferior marginal likelihood lower bound. The problems are more serious at high missing data rates, and a local maximum solution is usually found instead. In the full missing ICA, Q(s_t) is a mixture of gaussians. In the extreme case when all entries of a data point are missing, that is, empty x_t^o, Q(s_t) is the same as P(s_t \mid \theta) and would not interfere with the learning of P(s_t \mid \theta) from other data points. On the other hand, the single gaussian Q(s_t) in the polynomial missing ICA would drive P(s_t \mid \theta) to become gaussian too. This is very undesirable when learning ICA structure.


5 Clusters of ICA

The variational Bayesian ICA for missing data described above can be easily extended to model data density with C clusters of ICA. First, all parameters \theta and hidden variables k_t, s_{kt} for each cluster are given a superscript index c. A parameter \rho = \{\rho_1, \ldots, \rho_C\} is introduced to represent the weights on the clusters; \rho has a Dirichlet prior (see equation 2.11). \Theta = \{\rho, \theta^1, \ldots, \theta^C\} is now the collection of all parameters. Our density model in equation 2.1 becomes

P(x_t \mid \Theta) = \sum_c P(c_t = c \mid \rho) \, P(x_t \mid \theta^c)
  = \sum_c P(c_t = c \mid \rho) \int N(x_t \mid A^c s_t^c + \nu^c, \Psi^c) \, P(s_t^c \mid \theta_s^c) \, ds_t^c.   (5.1)

The objective function in equation 2.13 remains the same but with \theta replaced by \Theta. The separable posterior Q(\Theta) is given by

Q(\Theta) = Q(\rho) \prod_c Q(\theta^c),   (5.2)

and similar to equation 2.15,

\int Q(\Theta) \log \frac{P(\Theta)}{Q(\Theta)} \, d\Theta = \int Q(\rho) \log \frac{P(\rho)}{Q(\rho)} \, d\rho + \sum_c \int Q(\theta^c) \log \frac{P(\theta^c)}{Q(\theta^c)} \, d\theta^c.   (5.3)

Equation 2.20 now becomes

\log P(x_t^o \mid \Theta) \geq \sum_c Q(c,t) \log \frac{P(c_t)}{Q(c,t)} + \sum_{c,k} Q(c,t) \, Q(k^c,t) \times \left[ \int Q(s_{kt}^c) \log P(x_t^o \mid s_{kt}^c, \theta^c) \, ds_{kt}^c + \int Q(s_{kt}^c) \log \frac{N(s_{kt}^c \mid \phi_k^c, \beta_k^c)}{Q(s_{kt}^c)} \, ds_{kt}^c \right] + \sum_{c,k} Q(c,t) \, Q(k^c,t) \log \frac{\pi_k^c}{Q(k^c,t)}.   (5.4)

We have introduced one more hidden variable c_t, and Q(c,t) is to be interpreted in the same fashion as Q(k^c,t). All learning rules in section 3 remain the same, only with \sum_t replaced by \sum_t Q(c,t). Finally, we need two more learning rules:

d_{\rho_c} = d_{o\rho_c} + \sum_t Q(c,t),   (5.5)

\log Q(c,t) = \langle \log \rho_c \rangle + \log z_t^c - \log Z_t,   (5.6)

where z_t^c is the normalization constant for Q(k^c,t) (see equation 3.11) and Z_t is for normalizing Q(c,t).
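For illustration, the cluster responsibility of equation 5.6 can be computed in log space as in the sketch below. This is not the authors' code; the inputs are assumed, and the Dirichlet expectation \langle \log \rho_c \rangle = \psi(d_{\rho_c}) - \psi(\sum_c d_{\rho_c}) used here is the standard result for a Dirichlet posterior.

import numpy as np
from scipy.special import digamma

def cluster_responsibilities(log_z_t, d_rho):
    # log_z_t[c]: log z_t^c from each cluster's gaussian normalizer (eq. 3.11)
    # d_rho[c]:   Dirichlet posterior counts of equation 5.5
    exp_log_rho = digamma(d_rho) - digamma(np.sum(d_rho))   # <log rho_c>
    log_unnorm = exp_log_rho + log_z_t                      # <log rho_c> + log z_t^c
    log_Zt = np.logaddexp.reduce(log_unnorm)                # log Z_t
    return np.exp(log_unnorm - log_Zt)                      # Q(c,t), sums to one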

6 Supervised Classification

It is generally difficult for discriminative classifiers, such as the multilayer perceptron (Bishop, 1995) or support vector machine (Vapnik, 1998), to handle missing data. In this section, we extend the variational Bayesian technique to supervised classification.

Consider a data set (X_T, Y_T) = \{(x_t, y_t), t \in 1, \ldots, T\}. Here x_t contains the input attributes and may have missing entries, and y_t \in \{1, \ldots, y, \ldots, Y\} indicates which of the Y classes x_t is associated with. When given a new data point x_{T+1}, we would like to compute P(y_{T+1} \mid x_{T+1}, X_T, Y_T, M):

P(y_{T+1} \mid x_{T+1}, X_T, Y_T, M) = \frac{ P(x_{T+1} \mid y_{T+1}, X_T, Y_T, M) \, P(y_{T+1} \mid X_T, Y_T, M) }{ P(x_{T+1} \mid X_T, Y_T, M) }.   (6.1)

Here M denotes our generative model for observation \{x_t, y_t\}:

P(x_t, y_t \mid M) = P(x_t \mid y_t, M) \, P(y_t \mid M).   (6.2)

P(x_t \mid y_t, M) could be a mixture model as given by equation 5.1.

6.1 Learning of Model Parameters. Let P(x_t \mid y_t, M) be parameterized by \Theta_y and P(y_t \mid M) be parameterized by \gamma = (\gamma_1, \ldots, \gamma_Y):

P(x_t \mid y_t = y, M) = P(x_t \mid \Theta_y),   (6.3)

P(y_t \mid M) = P(y_t = y \mid \gamma) = \gamma_y.   (6.4)

If \gamma is given a Dirichlet prior, P(\gamma \mid M) = D(\gamma \mid d_{o\gamma}), its posterior also has a Dirichlet distribution:

P(\gamma \mid Y_T, M) = D(\gamma \mid d_\gamma),   (6.5)

d_{\gamma y} = d_{o\gamma y} + \sum_t I(y_t = y).   (6.6)


I(\cdot) is an indicator function that equals 1 if its argument is true and 0 otherwise.

Under the generative model of equation 6.2, it can be shown that

P(\Theta_y \mid X_T, Y_T, M) = P(\Theta_y \mid X_y),   (6.7)

where X_y is a subset of X_T but contains only those x_t whose training labels y_t have value y. Hence, P(\Theta_y \mid X_T, Y_T, M) can be approximated with Q(\Theta_y) by applying the learning rules in sections 3 and 5 on the subset X_y.

6.2 Classification. First, P(y_{T+1} \mid X_T, Y_T, M) in equation 6.1 can be computed by

P(y_{T+1} = y \mid X_T, Y_T, M) = \int P(y_{T+1} = y \mid \gamma) \, P(\gamma \mid X_T, Y_T) \, d\gamma = \frac{ d_{\gamma y} }{ \sum_{y'} d_{\gamma y'} }.   (6.8)

The other term, P(x_{T+1} \mid y_{T+1}, X_T, Y_T, M), can be computed as

\log P(x_{T+1} \mid y_{T+1} = y, X_T, Y_T, M) = \log P(x_{T+1} \mid X_y, M)
  = \log P(x_{T+1}, X_y \mid M) - \log P(X_y \mid M)   (6.9)
  \approx E(\{x_{T+1}, X_y\}, Q'(\Theta_y)) - E(X_y, Q(\Theta_y)).   (6.10)

The above requires adding x_{T+1} to X_y and iterating the learning rules to obtain Q'(\Theta_y) and E(\{x_{T+1}, X_y\}, Q'(\Theta_y)). The error in the approximation is the difference KL(Q'(\Theta_y), P(\Theta_y \mid \{x_{T+1}, X_y\})) - KL(Q(\Theta_y), P(\Theta_y \mid X_y)). If we assume further that Q'(\Theta_y) \approx Q(\Theta_y),

\log P(x_{T+1} \mid X_y, M) \approx \int Q(\Theta_y) \log P(x_{T+1} \mid \Theta_y) \, d\Theta_y = \log Z_{T+1},   (6.11)

where Z_{T+1} is the normalization constant in equation 5.6.
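Putting equations 6.1, 6.8, and 6.11 together, classification of a new point reduces to combining each class model's normalization constant with the Dirichlet counts. The following sketch is an illustration only, under the assumption that \log Z_{T+1} has already been computed for every class model by the learning rules above.

import numpy as np

def classify(log_Z_new, d_gamma):
    # log_Z_new[y]: log Z_{T+1} from the class-y model evaluated on x_{T+1}
    # d_gamma[y]:   Dirichlet posterior counts of equation 6.6
    log_prior = np.log(d_gamma) - np.log(np.sum(d_gamma))   # equation 6.8
    log_post = log_Z_new + log_prior                        # numerator of equation 6.1
    log_post -= np.logaddexp.reduce(log_post)               # normalize over classes
    return np.exp(log_post)                                 # P(y_{T+1} | x_{T+1}, ...)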

7 Experiment

7.1 Synthetic Data. In the first experiment, 200 data points were generated by mixing four sources randomly in a seven-dimensional space. The generalized gaussian, gamma, and beta distributions were used to represent source densities of various skewness and kurtosis (see Figure 5). Noise at -26 dB level was added to the data, and missing entries were created with a probability of 0.3. The data matrix for the first 100 data points is plotted in Figure 4. Dark pixels represent missing entries. Notice that some data points have fewer than four observed dimensions. In Figure 5, we plotted the histograms of the recovered sources and the probability density functions (pdf) of the four sources. The dashed line is the exact pdf used to generate the data, and the solid line is the modeled pdf by mixture of two one-dimensional gaussians (see equation 2.2). This shows that the two gaussians gave adequate fit to the source histograms and densities.

Figure 4: In the first experiment, 30% of the entries in the seven-dimensional data set are missing, as indicated by the black entries. (The first 100 data points are shown.)

Figure 5: Source density modeling by variational missing ICA of the synthetic data. Histograms: recovered sources distribution; dashed lines: original probability densities; solid line: mixture of gaussians modeled probability densities; dotted lines: individual gaussian contribution.
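A data set of this kind can be simulated as in the sketch below. It only mirrors the description in the text; the exact source parameters, noise scale, and random seed used by the authors are not given, so the values here are assumptions, with Laplace and Student-t draws standing in for generalized gaussian sources.

import numpy as np

rng = np.random.default_rng(0)
T, N, L = 200, 7, 4

sources = np.vstack([
    rng.laplace(size=T),                 # stand-in for a generalized gaussian source
    rng.gamma(shape=2.0, size=T),        # skewed source
    rng.beta(0.5, 0.5, size=T),          # bimodal, negative-kurtosis source
    rng.standard_t(df=5, size=T),        # heavy-tailed source
])
sources = (sources - sources.mean(axis=1, keepdims=True)) / sources.std(axis=1, keepdims=True)

A = rng.normal(size=(N, L))
# -26 dB corresponds roughly to an amplitude factor of 10**(-26/20) ~ 0.05
X = A @ sources + 0.05 * rng.normal(size=(N, T))

mask = rng.random((N, T)) < 0.3          # entries missing with probability 0.3
X_missing = X.copy()
X_missing[mask] = np.nan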


Figure 6: E(X, Q(\theta)) as a function of hidden source dimensions (log marginal likelihood lower bound versus number of dimensions) for the full missing ICA and the polynomial missing ICA. Full missing ICA refers to the full expansion of gaussians discussed in section 2.3, and polynomial missing ICA refers to the Chan et al. (2002) method with minor modification.

Figure 6 plots the lower bound of log marginal likelihood (see equation 3.12) for models assuming different numbers of intrinsic dimensions. As expected, the Bayesian treatment allows us to infer the intrinsic dimension of the data cloud. In the figure, we also plot the E(X, Q(\theta)) from the polynomial missing ICA. Since a less negative lower bound represents a smaller Kullback-Leibler divergence between Q(\theta) and P(\theta \mid X), it is clear from the figure that the full missing ICA gave a better fit to the data density.

7.2 Mixing Images. This experiment demonstrates the ability of the proposed method to fill in missing values while performing demixing. This is made possible if we have more mixtures than hidden sources, or N > L. The top row in Figure 7 shows the two original 380 x 380 pixel images. They were linearly mixed into three images, and -20 dB noise was added. Missing entries were introduced randomly with probability 0.2. The denoised mixtures are shown in the third row of Figure 7, and the recovered sources are in the bottom row. Only 0.8% of the pixels were missing from all three mixed images and could not be recovered; 38.4% of the pixels were missing from only one mixed image, and their values could be filled in with low uncertainty; and 9.6% of the pixels were missing from any two of the mixed images, where estimation of their values is possible but would have high uncertainty. From Figure 7, we can see that the source images were well separated and the mixed images were nicely denoised. The signal-to-noise ratio (SNR) in the separated images was 14 dB. We have also tried filling in the missing pixels by EM with a gaussian model; variational Bayesian ICA was then applied on the "completed" data. The SNR achieved in the unmixed images was 5 dB. This supports that it is crucial to have the correct density model when filling in missing values and important to learn the density model and missing values concurrently. The denoised mixed images in this example were meant only to illustrate the method visually. However, if x_1, x_2, x_3 represent cholesterol, blood sugar, and uric acid levels, for example, it would be possible to fill in the third when only two are available.

Figure 7: A demonstration of recovering missing values when N > L. The original images are in the top row. Twenty percent of the pixels in the mixed images (second row) are missing at random. Only 0.8% are missing from the denoised mixed images (third row) and separated images (bottom).

7.3 Survival Prediction. We demonstrate the supervised classification discussed in section 6 with an echocardiogram data set downloaded from the UCI Machine Learning Repository (Blake & Merz, 1998). Input variables are age-at-heart-attack, fractional-shortening, epss, lvdd, and wall-motion-index. The goal is to predict survival of the patient one year after heart attack. There are 24 positive and 50 negative examples. The data matrix has a missing rate of 5.4%. We performed leave-one-out cross-validation to evaluate our classifier. Thresholding the output P(y_{T+1} \mid x_{T+1}, X_T, Y_T, M), computed using equation 6.10, at 0.5, we got a true positive rate of 16/24 and a true negative rate of 42/50.
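The evaluation loop can be sketched as follows; the train and predict callables are placeholders for the learning and classification rules of sections 3, 5, and 6, not a real API from the paper.

import numpy as np

def leave_one_out(X, y, train, predict, threshold=0.5):
    # X: (T, N) inputs with NaNs for missing entries; y: (T,) binary labels
    hits_pos = hits_neg = 0
    for t in range(len(y)):
        keep = np.arange(len(y)) != t
        model = train(X[keep], y[keep])       # VB learning per class (placeholder)
        p_pos = predict(model, X[t])          # P(y = 1 | x_t, remaining data)
        pred = 1 if p_pos >= threshold else 0
        if y[t] == 1 and pred == 1:
            hits_pos += 1
        if y[t] == 0 and pred == 0:
            hits_neg += 1
    return hits_pos, hits_neg                 # true positives, true negatives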

8 Conclusion

In this article, we derived the learning rules for variational Bayesian ICA with missing data. The complexity of the method is proportional to T \times K^L, where T is the number of data points, L is the number of hidden sources assumed, and K is the number of 1D gaussians used to model the density of each source. However, this exponential growth in complexity is manageable and worthwhile for small data sets containing missing entries in a high-dimensional space. The proposed method shows promise in analyzing and identifying projections of data sets that have a very limited number of expensive data points yet contain missing entries due to data scarcity. The extension to model data density with clusters of ICA was discussed. The application of the technique in a supervised classification setting was also covered. We have applied the variational Bayesian missing ICA to a primates' brain volumetric data set containing 44 examples in 57 dimensions. Very encouraging results were obtained and will be reported in another article.


References

Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.

Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Irvine, CA: University of California.

Chan, K., Lee, T.-W., & Sejnowski, T. J. (2002). Variational learning of clusters of undercomplete nonsymmetric independent components. Journal of Machine Learning Research, 3, 99–114.

Choudrey, R. A., & Roberts, S. J. (2001). Flexible Bayesian independent component analysis for blind source separation. In 3rd International Conference on Independent Component Analysis and Blind Signal Separation (pp. 90–95). San Diego, CA: Institute for Neural Computation.

Ghahramani, Z., & Jordan, M. (1994). Learning from incomplete data (Tech. Rep. CBCL Paper No. 108). Cambridge, MA: Center for Biological and Computational Learning, MIT.

Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Jordan, M. I., Ghahramani, Z., Jaakkola, T., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.

Jung, T.-P., Makeig, S., McKeown, M. J., Bell, A., Lee, T.-W., & Sejnowski, T. J. (2001). Imaging brain dynamics using independent component analysis. Proceedings of the IEEE, 89(7), 1107–1122.

Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.

Mackay, D. J. (1995). Ensemble learning and evidence maximization (Tech. Rep.). Cambridge: Cavendish Laboratory, University of Cambridge.

Miskin, J. (2000). Ensemble learning for independent component analysis. Unpublished doctoral dissertation, University of Cambridge.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Welling, M., & Weber, M. (1999). Independent component analysis of incomplete data. In 1999 6th Joint Symposium on Neural Computation Proceedings (Vol. 9, pp. 162–168). San Diego, CA: Institute for Neural Computation.

Received July 18, 2002; accepted January 30, 2003.

Page 8: Variational Bayesian Learning of ICA with Missing Datakwchan/publications/nc03-mICA.pdfLa Jolla, CA 92093, U.S.A. Missing data are common in real-world data sets and are a problem

1998 K Chan T Lee and T Sejnowski

xt

Y

n

Aa

st

b

f

kt p

Figure 2 A simplied directed graph for the generative model of variationalICA xt is the observed variable kt and st are hidden variables and the restare model parameters The kt indicates which of the KL expanded gaussiansgenerated st

apply Jensenrsquos inequality again and get

log Pxot j micro cedil

X

k

Qkt

microZQskt log Pxo

t j skt micro dskt

CZ

Qskt logN skt j Aacutek macrk

Qsktdskt

para

CX

k

Qkt logfrac14k

Qkt (220)

Substituting log Pxot j micro back into equation 213 the variational Bayesian

method can be continued as usual We have drawn in Figure 2 a simpliedgraphical representation for the generative model of variational ICA xtis the observed variable kt and st are hidden variables and the rest aremodel parameters where kt indicates which of the KL expanded gaussiansgenerated st

3 Learning Rules

Combining equations 213 215 and 220 we perform functional maximiza-tion on the lower bound of the log marginal likelihood log PX with re-gard to Qmicro (see equation 214) Qkt and Qskt (see equation 220)mdashforexample

log Qordm D log Pordm CZ

Qmicro nordmX

t

log Pxot j micro dmicronordm C const (31)

Variational Bayesian Learning of ICA with Missing Data 1999

where micro nordm is the set of parameters excluding ordm This gives

Qordm DY

nN ordmn j sup1ordmn 3ordmn

3ordmn D 3oordmn C h9niX

t

ont

sup1ordmn D 3oordmnsup1oordmn C h9niP

t ontP

k Qkthxnt iexcl Ancentskti3ordmn

(32)

Similarly

Qordf DY

nG 9n j a9n b9n

a9n D ao9n C12

X

t

ont

b9n D bo9n C 12

X

t

ont

X

k

Qkthxnt iexcl Ancentskt iexcl ordmn2i (33)

QA DY

nN Ancent j sup1Ancent currenAncent

currenAncent D

0

Bhreg1i

hregLi

1

CA C h9niX

t

ont

X

k

Qkthsktsgtkti

sup1Ancent DAacute

h9niX

t

ontxnt iexcl hordmniX

k

Qkthsgtkti

currenAncent

iexcl1 (34)

Qreg DY

l

G regl j aregl bregl

aregl D aoregl C N2

bregl D boregl C12

X

nhA2

nli (35)

Qfrac14l D Dfrac14 j dfrac14l

dfrac14lk D dofrac14lk CX

t

X

klDk

Qkt (36)

2000 K Chan T Lee and T Sejnowski

QAacutelkl D N Aacutelklj sup1Aacutelkl 3Aacutelkl

3Aacutelkl D 3oAacutelkl C hmacrlkliX

t

X

klDk

Qkt

sup1Aacutelkl D3oAacutelkl sup1oAacutelkl C hmacrlkl

iP

tP

klDk Qkthsktli3Aacutelkl

(37)

Qmacrlkl D G macrlkl

j amacrlkl bmacrlkl

amacrlkl D aomacrlkl C12

X

t

X

klDk

Qkt

bmacrlkl D bomacrlkl C12

X

t

X

klDk

Qkthsktl iexcl Aacutelkl 2i (38)

Qskt D N skt j sup1skt currenskt

currenskt D

0

Bhmacr1k1 i

hmacrLkLi

1

CA

C

Agt

0

Bo1t91

oNt9N

1

CA A

+

currensktsup1skt D

0

Bhmacr1k1 Aacute1k1i

hmacrLkL AacuteLkLi

1

CA

C

Agt

0

Bo1t91

oNt9N

1

CA xt iexcl ordm

+ (39)

In the above equations hcenti denotes the expectation over the posterior distri-butions Qcent Ancent is the nth row of the mixing matrix A

PklDk means picking

out those gaussians such that the lth element of their indices k has the valueof k and ot is an indicator variable for observed entries in xt

ont Draquo

1 if xnt is observed0 if xnt is missing (310)

For a model of equal noise variance among all the observation dimensionsthe summation in the learning rules for Qordf would be over both t and

Variational Bayesian Learning of ICA with Missing Data 2001

n Note that there exists scale and translational degeneracy in the modelas given by equation 21 and 22 After each update of Qfrac14l QAacutelkl andQmacrlkl

it is better to rescale Pstl to have zero mean and unit varianceQskt QA Qreg Qordm and Qordf have to be adjusted correspondinglyFinally Qkt is given by

log Qkt D hlog Pxot j skt microi C hlog N skt j Aacutek macrki

iexcl hlog Qskti C hlog frac14ki iexcl log zt (311)

where zt is a normalization constant The lower bound EX Qmicro for the logmarginal likelihood computed using equations 213 215 and 220 can bemonitored during learning and used for comparison of different solutionsor models After some manipulation EX Qmicro can be expressed as

EX Qmicro DX

t

log zt CZ

Qmicro logPmicro

Qmicro dmicro (312)

4 Missing Data

41 Filling in Missing Entries Recovering missing values while per-forming demixing is possible if we have N gt L More specically if thenumber of observed dimensions in xt is greater than L the equation

xot D [A]o

t cent st (41)

would be overdetermined in st unless [A]ot has a rank smaller than L In

this case Qst is likely to be unimodal and peaked point estimates of stwould be sufcient and reliable and the learning rules of Chan et al (2002)with small modication to account for missing entries would give a rea-sonable approximation When Qst is a single gaussian the exponentialgrowth in complexity is avoided However if the number of observed di-mensions in xt is less than L equation 41 is now underdetermined in stand Qst would have a broad multimodal structure This corresponds toovercomplete ICA where single gaussian approximation of Qst is unde-sirable and the formalism discussed in this article is needed to capture thehigher-order statistics of Qst and produce a more faithful Qxm

t j xot The

approximate distribution Qxmt j xo

t can be obtained by

Qxmt j xo

t DX

k

Qkt

Zplusmnxm

t iexcl xmktQxm

kt j xot k dxm

kt (42)

2002 K Chan T Lee and T Sejnowski

where plusmncent is the delta function and

Qxmkt j xo

t k DZ

Qmicro

ZN xm

kt j [Askt C ordm]mt [ordf]m

t Qskt dskt dmicro

DZ Z

QAQordfN xmkt j sup1xm

kt currenxmkt dA dordf (43)

sup1xmkt D [Asup1skt C sup1ordm]m

t (44)

currenxmkt

iexcl1 D [Acurrensktiexcl1Agt C currenordmiexcl1 C diagordfiexcl1]m

t (45)

Unfortunately the integration over QA and Qordf cannot be carried out an-alyticallybut we can substitute hAiand hordfi as an approximationEstimationof Qxm

t j xot using the above equations is demonstrated in Figure 3 The

shaded area is the exact posterior Pxmt j xo

t for the noiseless mixing inFigure 1 with observed x2 D iexcl2 and the solid line is the approximation byequations 42 through 45 We have modied the variational ICA of Chanet al (2002) by discounting missing entries This is done by replacing

Pt

shy 4 shy 3 shy 2 shy 1 0 1 20

01

02

03

04

05

06

07

08

09

Figure 3 The approximation of Qxmt j xo

t from the full missing ICA (solidline) and the polynomial missing ICA (dashed line) The shaded area is theexact posterior Pxm

t j xot corresponding to the noiseless mixture in Figure 1

with observed x2 D iexcl2 Dotted lines are the contribution from the individualQxm

kt j xot k

Variational Bayesian Learning of ICA with Missing Data 2003

withP

t ont and 9n with ont9n in their learning rules The dashed line is theapproximation Qxm

t j xot from this modied method which we refer to

as polynomial missing ICA The treatment of fully expanding the KL hiddensource gaussians discussed in section 23 is named full missing ICA The fullmissing ICA gives a more accurate t for Pxm

t j xot and a better estimate

for hxmt j xo

t i From equation 216

Qxmt j xo

t DZ

Qmicro

ZN xm

t j [Ast C ordm]mt [ordf]m

t Qst dst dmicro (46)

and the above formalism Qst becomes

Qst DX

k

Qkt

Zplusmnst iexcl sktQskt dskt (47)

which is a mixture of KL gaussians The missing values can then be lled inby

hst j xot i D

ZstQst dst D

X

k

Qktsup1skt (48)

hxmt j xo

t i DZ

xmt Qxm

t j xot dxm

t

DX

k

Qktsup1xmkt D [A]m

t hst j xot i C [sup1ordm]m

t (49)

where sup1skt and sup1xmkt are given in equations 39 and 44 Alternatively

a maximum a posterior (MAP) estimate on Qst and Qxmt j xo

t may beobtained but then numerical methods are needed

42 The ldquoFullrdquo and ldquoPolynomialrdquo Missing ICA The complexity of thefull variational Bayesian ICA method is proportional to T pound KL where Tis the number of data points L is the number of hidden sources assumedand K is the number of gaussians used to model the density of each sourceIf we set K D 2 the ve parameters in the source density model Pstlare already enough to model the mean variance skewness and kurtosis ofthe source distribution The full missing ICA should always be preferredif memory and computational time permit The ldquopolynomial missing ICArdquoconverges more slowly per epoch of learning rules and suffers from manymore local maxima It has an inferior marginal likelihood lower bound Theproblems are more serious at high missing data rates and a local maximumsolution is usually found instead In the full missing ICA Qst is a mixtureof gaussians In the extreme case when all entries of a data point are missingthat is empty xo

t Qst is the same as Pst j micro and would not interferewith the learning of Pst j micro from other data point On the other hand thesingle gaussian Qst in the polynomial missing ICA would drive Pst j micro tobecome gaussian too This is very undesirable when learning ICA structure

2004 K Chan T Lee and T Sejnowski

5 Clusters of ICA

The variational Bayesian ICA for missing data described above can be easilyextended to model data density with C clusters of ICA First all parametersmicro and hidden variables kt skt for each cluster are given a superscript indexc Parameter frac12 D ffrac121 frac12Cg is introduced to represent the weights onthe clusters frac12 has a Dirichlet prior (see equation 211) 2 D ffrac12 micro1 microCgis now the collection of all parameters Our density model in equation 21becomes

Pxt j 2 DX

cPct D c j frac12Pxt j micro c

DX

cPct D c j frac12

ZN xt j Acsc

t C ordmc ordfcPsct j micro c

s dsct (51)

The objective function in equation 213 remains the same but with micro replacedby 2 The separable posterior Q2 is given by

Q2 D Qfrac12Y

cQmicro c (52)

and similar to equation 215

ZQ2 log

P2

Q2d2 D

ZQfrac12 log

Pfrac12

Qfrac12dfrac12

CX

c

ZQmicro c log

Pmicro c

Qmicro cdmicro c (53)

Equation 220 now becomes

log Pxot j 2 cedil

X

cQct log

Pct

QctC

X

ck

QctQkct

poundmicroZ

Qsckt log Pxo

t j sckt micro c dsc

kt

CZ

Qsckt log

N sckt j Aacutec

k macrck

Qsckt

dsckt

para

CX

ck

QctQkct log

frac14 ck

Qkct

(54)

We have introduced one more hidden variable ct and Qct is to be inter-preted in the same fashion as Qkc

t All learning rules in section 3 remain

Variational Bayesian Learning of ICA with Missing Data 2005

the same only withP

t replaced byP

t Qct Finally we need two morelearning rules

dfrac12c D dofrac12c C

X

tQct (55)

log Qct D hlog frac12ci C log zct iexcl log Zt (56)

where zct is the normalization constant for Qkc

t (see equation 311) and Ztis for normalizing Qct

6 Supervised Classication

It is generally difcult for discriminative classiers such as multilayer per-ceptron (Bishop 1995) or support vector machine (Vapnik 1998) to handlemissing data In this section we extend the variational Bayesian techniqueto supervised classication

Consider a data set XT YT D fxt yt t in 1 Tg Here xt containsthe input attributes and may have missing entries yt 2 f1 y Ygindicates which of the Y classes xt is associated with When given a newdata point xTC1 we would like to compute PyTC1 j xTC1 XT YT M

PyTC1 j xTC1 XT YT M

D PxTC1 j yTC1 XT YT MPyTC1 j XT YT M

PxTC1 j XT YT M (61)

Here M denotes our generative model for observation fxt ytg

Pxt yt j M D Pxt j yt MPyt j M (62)

Pxt j yt M could be a mixture model as given by equation 51

61 Learning of Model Parameters Let Pxt j yt M be parameterizedby 2y and Pyt j M be parameterized by D 1 Y

Pxt j yt D y M D Pxt j 2y (63)

Pyt j M D Pyt D y j D y (64)

If is given a Dirichlet prior P j M D D j do its posterior hasalso a Dirichlet distribution

P j YT M D D j d (65)

dy D doy CX

tIyt D y (66)

2006 K Chan T Lee and T Sejnowski

Icent is an indicator function that equals 1 if its argument is true and 0 other-wise

Under the generative model of equation 62 it can be shown that

P2y j XT YT M D P2y j Xy (67)

where Xy is a subset of XT but contains only those xt whose training labelsyt have value y Hence P2y j XT YT M can be approximated with Q2y

by applying the learning rules in sections 3 and 5 on subset Xy

62 Classication First PyTC1 j XT YT M in equation 61 can be com-puted by

PyTC1 D y j XT YT M DZ

PyTC1 D y j yPy j XT YT dy

DdyPy dy

(68)

The other term PxTC1 j yTC1 XT YT M can be computed as

log PxTC1 j yTC1 D y XT YT M

D log PxTC1 j Xy M

D log PxTC1 Xy j M iexcl log PXy j M (69)

frac14 EfxTC1 Xyg Q02y iexcl EXy Q2y (610)

The above requires adding xTC1 to Xy and iterating the learning rules toobtain Q02y and EfxTC1 Xyg Q02y The error in the approximation isthe difference KLQ02y P2y j fxTC1 Xyg iexcl KLQ2y P2y j Xy If weassume further that Q02y frac14 Q2y

log PxTC1 j Xy M frac14Z

Q2y log PxTC1 j 2y d2y

D log ZTC1 (611)

where ZTC1 is the normalization constant in equation 56

7 Experiment

71 Synthetic Data In the rst experiment 200 data points were gener-ated by mixing four sources randomly in a seven-dimensional space Thegeneralized gaussian gamma and beta distributions were used to repre-sent source densities of various skewness and kurtosis (see Figure 5) Noise

Variational Bayesian Learning of ICA with Missing Data 2007

Figure 4 In the rst experiment 30 of the entries in the seven-dimensionaldata set are missing as indicated by the black entries (The rst 100 data pointsare shown)

Figure 5 Source density modeling by variational missing ICA of the syntheticdata Histograms recovered sources distribution dashed lines original proba-bility densities solid line mixture of gaussians modeled probability densitiesdotted lines individual gaussian contribution

at iexcl26 dB level was added to the data and missing entries were createdwith a probability of 03 The data matrix for the rst 100 data points isplotted in Figure 4 Dark pixels represent missing entries Notice that somedata points have fewer than four observed dimensions In Figure 5 weplotted the histograms of the recovered sources and the probability densityfunctions (pdf) of the four sources The dashed line is the exact pdf usedto generate the data and the solid line is the modeled pdf by mixture oftwo one-dimensional gaussians (see equation 22) This shows that the twogaussians gave adequate t to the source histograms and densities

2008 K Chan T Lee and T Sejnowski

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

Number of dimensions

log

mar

gina

l lik

elih

ood

low

er b

ound

full missing ICA polynomial missing ICA

Figure 6 E X Qmicro as a function of hidden source dimensions Full missing ICArefers to the full expansions of gaussians discussed in section 23 and polynomialmissing ICA refers to the Chan et al (2002) method with minor modication

Figure 6 plots the lower bound of log marginal likelihood (see equa-tion 312) for models assuming different numbers of intrinsic dimensionsAs expected the Bayesian treatment allows us to the infer the intrinsic di-mension of the data cloud In the gure we also plot the EX Qmicro fromthe polynomial missing ICA Since a less negative lower bound representsa smaller Kullback-Leibler divergence between Qmicro and PX j micro it isclear from the gure that the full missing ICA gave a better t to the datadensity

72 Mixing Images This experiment demonstrates the ability of the pro-posed method to ll in missing values while performing demixing This ismade possible if we have more mixtures than hidden sources or N gt L Thetop row in Figure 7 shows the two original 380 pound 380 pixel images Theywere linearly mixed into three images and iexcl20 dB noise was added Miss-ing entries were introduced randomly with probability 02 The denoisedmixtures are shown in the third row of Figure 7 and the recovered sourcesare in the bottom row Only 08 of the pixels were missing from all threemixed images and could not be recovered 384 of the pixels were missingfrom only one mixed image and their values could be lled in with low

Variational Bayesian Learning of ICA with Missing Data 2009

Figure 7 A demonstration of recovering missing values when N gt L Theoriginal images are in the top row Twenty percent of the pixels in the mixedimages (second row) are missing at random Only 08 are missing from thedenoised mixed images (third row) and separated images (bottom)

2010 K Chan T Lee and T Sejnowski

uncertainty and 96 of the pixels were missing from any two of the mixedimages Estimation of their values is possible but would have high uncer-tainty From Figure 7 we can see that the source images were well separatedand the mixed images were nicely denoised The signal-to-noise ratio (SNR)in the separated images was 14 dB We have also tried lling in the missingpixels by EM with a gaussian model Variational Bayesian ICA was then ap-plied on the ldquocompletedrdquo data The SNR achieved in the unmixed imageswas 5 dB This supports that it is crucial to have the correct density modelwhen lling in missing values and important to learn the density model andmissing values concurrently The denoised mixed images in this examplewere meant only to illustrate the method visually However if x1 x2 x3

represent cholesterol blood sugar and uric acid level for example it wouldbe possible to ll in the third when only two are available

73 Survival Prediction We demonstrate the supervised classicationdiscussed in section 6 with an echocardiogram data set downloaded fromthe UCI Machine Learning Repository (Blake amp Merz 1998) Input variablesare age-at-heart-attack fractional-shortening epss lvdd and wall-motion-indexThe goal is to predict survival of the patient one year after heart attack Thereare 24 positive and 50 negative examples The data matrix has a missingrate of 54 We performed leave-one-out cross-validation to evaluate ourclassier Thresholding the output PyTC1 j XT YT M computed usingequation 610 at 05 we got a true positive rate of 1624 and a true negativerate of 4250

8 Conclusion

In this article we derived the learning rules for variational Bayesian ICAwith missing data The complexity of the method is proportional to T pound KLwhere T is the number of data points L is the number of hidden sourcesassumed and K is the number of 1D gaussians used to model the densityof each source However this exponential growth in complexity is man-ageable and worthwhile for small data sets containing missing entries in ahigh-dimensional space The proposed method shows promise in analyzingand identifying projections of data sets that have a very limited number ofexpensive data points yet contain missing entries due to data scarcity Theextension to model data density with clusters of ICA was discussed Theapplication of the technique in a supervised classication setting was alsocovered We have applied the variational Bayesian missing ICA to a pri-matesrsquo brain volumetric data set containing 44 examples in 57 dimensionsVery encouraging results were obtained and will be reported in anotherarticle

Variational Bayesian Learning of ICA with Missing Data 2011

References

Attias H (1999) Independent factor analysis Neural Computation 11(4) 803ndash851

Bishop C M (1995) Neural networks for pattern recognition Oxford ClarendonPress

Blake C amp Merz C (1998) UCI repository of machine learning databases IrvineCA University of California

Chan K Lee T-W amp Sejnowski T J (2002) Variational learning of clusters ofundercomplete nonsymmetric independent components Journal of MachineLearning Research 3 99ndash114

Choudrey R A amp Roberts S J (2001) Flexible Bayesian independent compo-nent analysis for blind source separation In 3rd International Conference onIndependent Component Analysis and Blind Signal Separation (pp 90ndash95) SanDiego CA Institute for Neural Computation

Ghahramani Z amp Jordan M (1994) Learning from incomplete data (Tech RepCBCL Paper No 108) Cambridge MA Center for Biological and Computa-tional Learning MIT

Hyvarinen A Karhunen J amp Oja E (2001) Independent component analysisNew York Wiley

Jordan M I Ghahramani Z Jaakkola T amp Saul L K (1999) An introductionto variational methods for graphical models Machine Learning 37(2) 183ndash233

Jung T-P Makeig S McKeown M J Bell A Lee T-W amp Sejnowski T J(2001) Imaging brain dynamics using independent component analysisProceedings of the IEEE 89(7) 1107ndash1122

Little R J A amp Rubin D B (1987) Statistical analysis with missing data NewYork Wiley

Mackay D J (1995) Ensemble learning and evidence maximization (Tech Rep)Cambridge Cavendish Laboratory University of Cambridge

Miskin J (2000) Ensemble learning for independent component analysis Unpub-lished doctoral dissertation University of Cambridge

Vapnik V (1998) Statistical learning theory New York WileyWelling M amp Weber M (1999) Independent component analysis of incomplete

data In 1999 6th Joint Symposium on Neural Compuatation Proceedings (Vol 9pp 162ndash168) San Diego CA Institute for Neural Computation

Received July 18 2002 accepted January 30 2003

Page 9: Variational Bayesian Learning of ICA with Missing Datakwchan/publications/nc03-mICA.pdfLa Jolla, CA 92093, U.S.A. Missing data are common in real-world data sets and are a problem

Variational Bayesian Learning of ICA with Missing Data 1999

where micro nordm is the set of parameters excluding ordm This gives

Qordm DY

nN ordmn j sup1ordmn 3ordmn

3ordmn D 3oordmn C h9niX

t

ont

sup1ordmn D 3oordmnsup1oordmn C h9niP

t ontP

k Qkthxnt iexcl Ancentskti3ordmn

(32)

Similarly

Qordf DY

nG 9n j a9n b9n

a9n D ao9n C12

X

t

ont

b9n D bo9n C 12

X

t

ont

X

k

Qkthxnt iexcl Ancentskt iexcl ordmn2i (33)

QA DY

nN Ancent j sup1Ancent currenAncent

currenAncent D

0

Bhreg1i

hregLi

1

CA C h9niX

t

ont

X

k

Qkthsktsgtkti

sup1Ancent DAacute

h9niX

t

ontxnt iexcl hordmniX

k

Qkthsgtkti

currenAncent

iexcl1 (34)

Qreg DY

l

G regl j aregl bregl

aregl D aoregl C N2

bregl D boregl C12

X

nhA2

nli (35)

Qfrac14l D Dfrac14 j dfrac14l

dfrac14lk D dofrac14lk CX

t

X

klDk

Qkt (36)

2000 K Chan T Lee and T Sejnowski

QAacutelkl D N Aacutelklj sup1Aacutelkl 3Aacutelkl

3Aacutelkl D 3oAacutelkl C hmacrlkliX

t

X

klDk

Qkt

sup1Aacutelkl D3oAacutelkl sup1oAacutelkl C hmacrlkl

iP

tP

klDk Qkthsktli3Aacutelkl

(37)

Qmacrlkl D G macrlkl

j amacrlkl bmacrlkl

amacrlkl D aomacrlkl C12

X

t

X

klDk

Qkt

bmacrlkl D bomacrlkl C12

X

t

X

klDk

Qkthsktl iexcl Aacutelkl 2i (38)

Qskt D N skt j sup1skt currenskt

currenskt D

0

Bhmacr1k1 i

hmacrLkLi

1

CA

C

Agt

0

Bo1t91

oNt9N

1

CA A

+

currensktsup1skt D

0

Bhmacr1k1 Aacute1k1i

hmacrLkL AacuteLkLi

1

CA

C

Agt

0

Bo1t91

oNt9N

1

CA xt iexcl ordm

+ (39)

In the above equations hcenti denotes the expectation over the posterior distri-butions Qcent Ancent is the nth row of the mixing matrix A

PklDk means picking

out those gaussians such that the lth element of their indices k has the valueof k and ot is an indicator variable for observed entries in xt

ont Draquo

1 if xnt is observed0 if xnt is missing (310)

For a model of equal noise variance among all the observation dimensionsthe summation in the learning rules for Qordf would be over both t and

Variational Bayesian Learning of ICA with Missing Data 2001

n Note that there exists scale and translational degeneracy in the modelas given by equation 21 and 22 After each update of Qfrac14l QAacutelkl andQmacrlkl

it is better to rescale Pstl to have zero mean and unit varianceQskt QA Qreg Qordm and Qordf have to be adjusted correspondinglyFinally Qkt is given by

log Qkt D hlog Pxot j skt microi C hlog N skt j Aacutek macrki

iexcl hlog Qskti C hlog frac14ki iexcl log zt (311)

where zt is a normalization constant The lower bound EX Qmicro for the logmarginal likelihood computed using equations 213 215 and 220 can bemonitored during learning and used for comparison of different solutionsor models After some manipulation EX Qmicro can be expressed as

EX Qmicro DX

t

log zt CZ

Qmicro logPmicro

Qmicro dmicro (312)

4 Missing Data

41 Filling in Missing Entries Recovering missing values while per-forming demixing is possible if we have N gt L More specically if thenumber of observed dimensions in xt is greater than L the equation

xot D [A]o

t cent st (41)

would be overdetermined in st unless [A]ot has a rank smaller than L In

this case Qst is likely to be unimodal and peaked point estimates of stwould be sufcient and reliable and the learning rules of Chan et al (2002)with small modication to account for missing entries would give a rea-sonable approximation When Qst is a single gaussian the exponentialgrowth in complexity is avoided However if the number of observed di-mensions in xt is less than L equation 41 is now underdetermined in stand Qst would have a broad multimodal structure This corresponds toovercomplete ICA where single gaussian approximation of Qst is unde-sirable and the formalism discussed in this article is needed to capture thehigher-order statistics of Qst and produce a more faithful Qxm

t j xot The

approximate distribution Qxmt j xo

t can be obtained by

Qxmt j xo

t DX

k

Qkt

Zplusmnxm

t iexcl xmktQxm

kt j xot k dxm

kt (42)

2002 K Chan T Lee and T Sejnowski

where plusmncent is the delta function and

Qxmkt j xo

t k DZ

Qmicro

ZN xm

kt j [Askt C ordm]mt [ordf]m

t Qskt dskt dmicro

DZ Z

QAQordfN xmkt j sup1xm

kt currenxmkt dA dordf (43)

sup1xmkt D [Asup1skt C sup1ordm]m

t (44)

currenxmkt

iexcl1 D [Acurrensktiexcl1Agt C currenordmiexcl1 C diagordfiexcl1]m

t (45)

Unfortunately the integration over QA and Qordf cannot be carried out an-alyticallybut we can substitute hAiand hordfi as an approximationEstimationof Qxm

t j xot using the above equations is demonstrated in Figure 3 The

shaded area is the exact posterior Pxmt j xo

t for the noiseless mixing inFigure 1 with observed x2 D iexcl2 and the solid line is the approximation byequations 42 through 45 We have modied the variational ICA of Chanet al (2002) by discounting missing entries This is done by replacing

Pt

shy 4 shy 3 shy 2 shy 1 0 1 20

01

02

03

04

05

06

07

08

09

Figure 3 The approximation of Qxmt j xo

t from the full missing ICA (solidline) and the polynomial missing ICA (dashed line) The shaded area is theexact posterior Pxm

t j xot corresponding to the noiseless mixture in Figure 1

with observed x2 D iexcl2 Dotted lines are the contribution from the individualQxm

kt j xot k

Variational Bayesian Learning of ICA with Missing Data 2003

withP

t ont and 9n with ont9n in their learning rules The dashed line is theapproximation Qxm

t j xot from this modied method which we refer to

as polynomial missing ICA The treatment of fully expanding the KL hiddensource gaussians discussed in section 23 is named full missing ICA The fullmissing ICA gives a more accurate t for Pxm

t j xot and a better estimate

for hxmt j xo

t i From equation 216

Qxmt j xo

t DZ

Qmicro

ZN xm

t j [Ast C ordm]mt [ordf]m

t Qst dst dmicro (46)

and the above formalism Qst becomes

Qst DX

k

Qkt

Zplusmnst iexcl sktQskt dskt (47)

which is a mixture of KL gaussians The missing values can then be lled inby

hst j xot i D

ZstQst dst D

X

k

Qktsup1skt (48)

hxmt j xo

t i DZ

xmt Qxm

t j xot dxm

t

DX

k

Qktsup1xmkt D [A]m

t hst j xot i C [sup1ordm]m

t (49)

where sup1skt and sup1xmkt are given in equations 39 and 44 Alternatively

a maximum a posterior (MAP) estimate on Qst and Qxmt j xo

t may beobtained but then numerical methods are needed

42 The ldquoFullrdquo and ldquoPolynomialrdquo Missing ICA The complexity of thefull variational Bayesian ICA method is proportional to T pound KL where Tis the number of data points L is the number of hidden sources assumedand K is the number of gaussians used to model the density of each sourceIf we set K D 2 the ve parameters in the source density model Pstlare already enough to model the mean variance skewness and kurtosis ofthe source distribution The full missing ICA should always be preferredif memory and computational time permit The ldquopolynomial missing ICArdquoconverges more slowly per epoch of learning rules and suffers from manymore local maxima It has an inferior marginal likelihood lower bound Theproblems are more serious at high missing data rates and a local maximumsolution is usually found instead In the full missing ICA Qst is a mixtureof gaussians In the extreme case when all entries of a data point are missingthat is empty xo

t Qst is the same as Pst j micro and would not interferewith the learning of Pst j micro from other data point On the other hand thesingle gaussian Qst in the polynomial missing ICA would drive Pst j micro tobecome gaussian too This is very undesirable when learning ICA structure

2004 K Chan T Lee and T Sejnowski

5 Clusters of ICA

The variational Bayesian ICA for missing data described above can be easilyextended to model data density with C clusters of ICA First all parametersmicro and hidden variables kt skt for each cluster are given a superscript indexc Parameter frac12 D ffrac121 frac12Cg is introduced to represent the weights onthe clusters frac12 has a Dirichlet prior (see equation 211) 2 D ffrac12 micro1 microCgis now the collection of all parameters Our density model in equation 21becomes

Pxt j 2 DX

cPct D c j frac12Pxt j micro c

DX

cPct D c j frac12

ZN xt j Acsc

t C ordmc ordfcPsct j micro c

s dsct (51)

The objective function in equation 213 remains the same but with micro replacedby 2 The separable posterior Q2 is given by

Q2 D Qfrac12Y

cQmicro c (52)

and similar to equation 215

ZQ2 log

P2

Q2d2 D

ZQfrac12 log

Pfrac12

Qfrac12dfrac12

CX

c

ZQmicro c log

Pmicro c

Qmicro cdmicro c (53)

Equation 220 now becomes

log Pxot j 2 cedil

X

cQct log

Pct

QctC

X

ck

QctQkct

poundmicroZ

Qsckt log Pxo

t j sckt micro c dsc

kt

CZ

Qsckt log

N sckt j Aacutec

k macrck

Qsckt

dsckt

para

CX

ck

QctQkct log

frac14 ck

Qkct

(54)

We have introduced one more hidden variable ct and Qct is to be inter-preted in the same fashion as Qkc

t All learning rules in section 3 remain

Variational Bayesian Learning of ICA with Missing Data 2005

the same only withP

t replaced byP

t Qct Finally we need two morelearning rules

dfrac12c D dofrac12c C

X

tQct (55)

log Qct D hlog frac12ci C log zct iexcl log Zt (56)

where zct is the normalization constant for Qkc

t (see equation 311) and Ztis for normalizing Qct

6 Supervised Classication

It is generally difcult for discriminative classiers such as multilayer per-ceptron (Bishop 1995) or support vector machine (Vapnik 1998) to handlemissing data In this section we extend the variational Bayesian techniqueto supervised classication

Consider a data set XT YT D fxt yt t in 1 Tg Here xt containsthe input attributes and may have missing entries yt 2 f1 y Ygindicates which of the Y classes xt is associated with When given a newdata point xTC1 we would like to compute PyTC1 j xTC1 XT YT M

PyTC1 j xTC1 XT YT M

D PxTC1 j yTC1 XT YT MPyTC1 j XT YT M

PxTC1 j XT YT M (61)

Here M denotes our generative model for observation fxt ytg

Pxt yt j M D Pxt j yt MPyt j M (62)

Pxt j yt M could be a mixture model as given by equation 51

61 Learning of Model Parameters Let Pxt j yt M be parameterizedby 2y and Pyt j M be parameterized by D 1 Y

Pxt j yt D y M D Pxt j 2y (63)

Pyt j M D Pyt D y j D y (64)

If is given a Dirichlet prior P j M D D j do its posterior hasalso a Dirichlet distribution

P j YT M D D j d (65)

dy D doy CX

tIyt D y (66)

2006 K Chan T Lee and T Sejnowski

Icent is an indicator function that equals 1 if its argument is true and 0 other-wise

Under the generative model of equation 62 it can be shown that

P2y j XT YT M D P2y j Xy (67)

where Xy is a subset of XT but contains only those xt whose training labelsyt have value y Hence P2y j XT YT M can be approximated with Q2y

by applying the learning rules in sections 3 and 5 on subset Xy

62 Classication First PyTC1 j XT YT M in equation 61 can be com-puted by

PyTC1 D y j XT YT M DZ

PyTC1 D y j yPy j XT YT dy

DdyPy dy

(68)

The other term PxTC1 j yTC1 XT YT M can be computed as

log PxTC1 j yTC1 D y XT YT M

D log PxTC1 j Xy M

D log PxTC1 Xy j M iexcl log PXy j M (69)

frac14 EfxTC1 Xyg Q02y iexcl EXy Q2y (610)

The above requires adding xTC1 to Xy and iterating the learning rules toobtain Q02y and EfxTC1 Xyg Q02y The error in the approximation isthe difference KLQ02y P2y j fxTC1 Xyg iexcl KLQ2y P2y j Xy If weassume further that Q02y frac14 Q2y

log PxTC1 j Xy M frac14Z

Q2y log PxTC1 j 2y d2y

D log ZTC1 (611)

where ZTC1 is the normalization constant in equation 56

7 Experiment

71 Synthetic Data In the rst experiment 200 data points were gener-ated by mixing four sources randomly in a seven-dimensional space Thegeneralized gaussian gamma and beta distributions were used to repre-sent source densities of various skewness and kurtosis (see Figure 5) Noise

Variational Bayesian Learning of ICA with Missing Data 2007

Figure 4 In the rst experiment 30 of the entries in the seven-dimensionaldata set are missing as indicated by the black entries (The rst 100 data pointsare shown)

Figure 5 Source density modeling by variational missing ICA of the syntheticdata Histograms recovered sources distribution dashed lines original proba-bility densities solid line mixture of gaussians modeled probability densitiesdotted lines individual gaussian contribution

at iexcl26 dB level was added to the data and missing entries were createdwith a probability of 03 The data matrix for the rst 100 data points isplotted in Figure 4 Dark pixels represent missing entries Notice that somedata points have fewer than four observed dimensions In Figure 5 weplotted the histograms of the recovered sources and the probability densityfunctions (pdf) of the four sources The dashed line is the exact pdf usedto generate the data and the solid line is the modeled pdf by mixture oftwo one-dimensional gaussians (see equation 22) This shows that the twogaussians gave adequate t to the source histograms and densities

2008 K Chan T Lee and T Sejnowski

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

Number of dimensions

log

mar

gina

l lik

elih

ood

low

er b

ound

full missing ICA polynomial missing ICA

Figure 6 E X Qmicro as a function of hidden source dimensions Full missing ICArefers to the full expansions of gaussians discussed in section 23 and polynomialmissing ICA refers to the Chan et al (2002) method with minor modication

Figure 6 plots the lower bound of log marginal likelihood (see equa-tion 312) for models assuming different numbers of intrinsic dimensionsAs expected the Bayesian treatment allows us to the infer the intrinsic di-mension of the data cloud In the gure we also plot the EX Qmicro fromthe polynomial missing ICA Since a less negative lower bound representsa smaller Kullback-Leibler divergence between Qmicro and PX j micro it isclear from the gure that the full missing ICA gave a better t to the datadensity

72 Mixing Images This experiment demonstrates the ability of the pro-posed method to ll in missing values while performing demixing This ismade possible if we have more mixtures than hidden sources or N gt L Thetop row in Figure 7 shows the two original 380 pound 380 pixel images Theywere linearly mixed into three images and iexcl20 dB noise was added Miss-ing entries were introduced randomly with probability 02 The denoisedmixtures are shown in the third row of Figure 7 and the recovered sourcesare in the bottom row Only 08 of the pixels were missing from all threemixed images and could not be recovered 384 of the pixels were missingfrom only one mixed image and their values could be lled in with low

Variational Bayesian Learning of ICA with Missing Data 2009

Figure 7 A demonstration of recovering missing values when N gt L Theoriginal images are in the top row Twenty percent of the pixels in the mixedimages (second row) are missing at random Only 08 are missing from thedenoised mixed images (third row) and separated images (bottom)

2010 K Chan T Lee and T Sejnowski

uncertainty and 96 of the pixels were missing from any two of the mixedimages Estimation of their values is possible but would have high uncer-tainty From Figure 7 we can see that the source images were well separatedand the mixed images were nicely denoised The signal-to-noise ratio (SNR)in the separated images was 14 dB We have also tried lling in the missingpixels by EM with a gaussian model Variational Bayesian ICA was then ap-plied on the ldquocompletedrdquo data The SNR achieved in the unmixed imageswas 5 dB This supports that it is crucial to have the correct density modelwhen lling in missing values and important to learn the density model andmissing values concurrently The denoised mixed images in this examplewere meant only to illustrate the method visually However if x1 x2 x3

represent cholesterol blood sugar and uric acid level for example it wouldbe possible to ll in the third when only two are available

73 Survival Prediction We demonstrate the supervised classicationdiscussed in section 6 with an echocardiogram data set downloaded fromthe UCI Machine Learning Repository (Blake amp Merz 1998) Input variablesare age-at-heart-attack fractional-shortening epss lvdd and wall-motion-indexThe goal is to predict survival of the patient one year after heart attack Thereare 24 positive and 50 negative examples The data matrix has a missingrate of 54 We performed leave-one-out cross-validation to evaluate ourclassier Thresholding the output PyTC1 j XT YT M computed usingequation 610 at 05 we got a true positive rate of 1624 and a true negativerate of 4250

8 Conclusion

In this article we derived the learning rules for variational Bayesian ICAwith missing data The complexity of the method is proportional to T pound KLwhere T is the number of data points L is the number of hidden sourcesassumed and K is the number of 1D gaussians used to model the densityof each source However this exponential growth in complexity is man-ageable and worthwhile for small data sets containing missing entries in ahigh-dimensional space The proposed method shows promise in analyzingand identifying projections of data sets that have a very limited number ofexpensive data points yet contain missing entries due to data scarcity Theextension to model data density with clusters of ICA was discussed Theapplication of the technique in a supervised classication setting was alsocovered We have applied the variational Bayesian missing ICA to a pri-matesrsquo brain volumetric data set containing 44 examples in 57 dimensionsVery encouraging results were obtained and will be reported in anotherarticle

Variational Bayesian Learning of ICA with Missing Data 2011

References

Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.

Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Irvine, CA: University of California.

Chan, K., Lee, T.-W., & Sejnowski, T. J. (2002). Variational learning of clusters of undercomplete nonsymmetric independent components. Journal of Machine Learning Research, 3, 99–114.

Choudrey, R. A., & Roberts, S. J. (2001). Flexible Bayesian independent component analysis for blind source separation. In 3rd International Conference on Independent Component Analysis and Blind Signal Separation (pp. 90–95). San Diego, CA: Institute for Neural Computation.

Ghahramani, Z., & Jordan, M. (1994). Learning from incomplete data (Tech. Rep. CBCL Paper No. 108). Cambridge, MA: Center for Biological and Computational Learning, MIT.

Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Jordan, M. I., Ghahramani, Z., Jaakkola, T., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.

Jung, T.-P., Makeig, S., McKeown, M. J., Bell, A., Lee, T.-W., & Sejnowski, T. J. (2001). Imaging brain dynamics using independent component analysis. Proceedings of the IEEE, 89(7), 1107–1122.

Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.

Mackay, D. J. (1995). Ensemble learning and evidence maximization (Tech. Rep.). Cambridge: Cavendish Laboratory, University of Cambridge.

Miskin, J. (2000). Ensemble learning for independent component analysis. Unpublished doctoral dissertation, University of Cambridge.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Welling, M., & Weber, M. (1999). Independent component analysis of incomplete data. In 1999 6th Joint Symposium on Neural Computation Proceedings (Vol. 9, pp. 162–168). San Diego, CA: Institute for Neural Computation.

Received July 18, 2002; accepted January 30, 2003.

Page 13: Variational Bayesian Learning of ICA with Missing Datakwchan/publications/nc03-mICA.pdfLa Jolla, CA 92093, U.S.A. Missing data are common in real-world data sets and are a problem

Variational Bayesian Learning of ICA with Missing Data 2003

withP

t ont and 9n with ont9n in their learning rules The dashed line is theapproximation Qxm

t j xot from this modied method which we refer to

as polynomial missing ICA The treatment of fully expanding the KL hiddensource gaussians discussed in section 23 is named full missing ICA The fullmissing ICA gives a more accurate t for Pxm

t j xot and a better estimate

for hxmt j xo

t i From equation 216

Qxmt j xo

t DZ

Qmicro

ZN xm

t j [Ast C ordm]mt [ordf]m

t Qst dst dmicro (46)

and the above formalism Qst becomes

Qst DX

k

Qkt

Zplusmnst iexcl sktQskt dskt (47)

which is a mixture of KL gaussians The missing values can then be lled inby

hst j xot i D

ZstQst dst D

X

k

Qktsup1skt (48)

hxmt j xo

t i DZ

xmt Qxm

t j xot dxm

t

DX

k

Qktsup1xmkt D [A]m

t hst j xot i C [sup1ordm]m

t (49)

where sup1skt and sup1xmkt are given in equations 39 and 44 Alternatively

a maximum a posterior (MAP) estimate on Qst and Qxmt j xo

t may beobtained but then numerical methods are needed

42 The ldquoFullrdquo and ldquoPolynomialrdquo Missing ICA The complexity of thefull variational Bayesian ICA method is proportional to T pound KL where Tis the number of data points L is the number of hidden sources assumedand K is the number of gaussians used to model the density of each sourceIf we set K D 2 the ve parameters in the source density model Pstlare already enough to model the mean variance skewness and kurtosis ofthe source distribution The full missing ICA should always be preferredif memory and computational time permit The ldquopolynomial missing ICArdquoconverges more slowly per epoch of learning rules and suffers from manymore local maxima It has an inferior marginal likelihood lower bound Theproblems are more serious at high missing data rates and a local maximumsolution is usually found instead In the full missing ICA Qst is a mixtureof gaussians In the extreme case when all entries of a data point are missingthat is empty xo

t Qst is the same as Pst j micro and would not interferewith the learning of Pst j micro from other data point On the other hand thesingle gaussian Qst in the polynomial missing ICA would drive Pst j micro tobecome gaussian too This is very undesirable when learning ICA structure

2004 K Chan T Lee and T Sejnowski

5 Clusters of ICA

The variational Bayesian ICA for missing data described above can be easilyextended to model data density with C clusters of ICA First all parametersmicro and hidden variables kt skt for each cluster are given a superscript indexc Parameter frac12 D ffrac121 frac12Cg is introduced to represent the weights onthe clusters frac12 has a Dirichlet prior (see equation 211) 2 D ffrac12 micro1 microCgis now the collection of all parameters Our density model in equation 21becomes

$$P(x_t \mid \Theta) = \sum_c P(c_t = c \mid \rho)\, P(x_t \mid \theta^c) = \sum_c P(c_t = c \mid \rho) \int \mathcal{N}(x_t \mid A^c s^c_t + \nu^c, \Psi^c)\, P(s^c_t \mid \theta^c_s)\, ds^c_t. \qquad (5.1)$$

The objective function in equation 2.13 remains the same, but with $\theta$ replaced by $\Theta$. The separable posterior $Q(\Theta)$ is given by

$$Q(\Theta) = Q(\rho) \prod_c Q(\theta^c), \qquad (5.2)$$

and, similar to equation 2.15,

$$\int Q(\Theta) \log \frac{P(\Theta)}{Q(\Theta)}\, d\Theta = \int Q(\rho) \log \frac{P(\rho)}{Q(\rho)}\, d\rho + \sum_c \int Q(\theta^c) \log \frac{P(\theta^c)}{Q(\theta^c)}\, d\theta^c. \qquad (5.3)$$

Equation 2.20 now becomes

$$\log P(x^o_t \mid \Theta) \geq \sum_c Q(c_t) \log \frac{P(c_t)}{Q(c_t)} + \sum_{c,k} Q(c_t)\, Q(k^c_t) \times \left[ \int Q(s^c_{k_t}) \log P(x^o_t \mid s^c_{k_t}, \theta^c)\, ds^c_{k_t} + \int Q(s^c_{k_t}) \log \frac{\mathcal{N}(s^c_{k_t} \mid \phi^c_k, \beta^c_k)}{Q(s^c_{k_t})}\, ds^c_{k_t} \right] + \sum_{c,k} Q(c_t)\, Q(k^c_t) \log \frac{\pi^c_k}{Q(k^c_t)}. \qquad (5.4)$$

We have introduced one more hidden variable, $c_t$, and $Q(c_t)$ is to be interpreted in the same fashion as $Q(k^c_t)$. All learning rules in section 3 remain the same, only with $\sum_t$ replaced by $\sum_t Q(c_t)$. Finally, we need two more learning rules:

$$d_{\rho^c} = d_{o\rho^c} + \sum_t Q(c_t), \qquad (5.5)$$

$$\log Q(c_t) = \langle \log \rho^c \rangle + \log z^c_t - \log Z_t, \qquad (5.6)$$

where $z^c_t$ is the normalization constant for $Q(k^c_t)$ (see equation 3.11) and $Z_t$ is for normalizing $Q(c_t)$.
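Equation 5.6 is a log-domain softmax over clusters. A small sketch of how $Q(c_t)$ could be normalized stably (Python; `log_rho_mean` and `log_z` stand in for $\langle\log\rho^c\rangle$ and $\log z^c_t$, which the learning rules of section 3 would provide):

```python
import numpy as np

def cluster_responsibilities(log_rho_mean, log_z):
    """Equation 5.6: log Q(c_t) = <log rho_c> + log z_t^c - log Z_t.

    log_rho_mean : (C,)   expected log cluster weights <log rho_c>
    log_z        : (T, C) per-point, per-cluster normalizers log z_t^c
    returns      : (T, C) responsibilities Q(c_t), each row summing to 1
    """
    logits = log_z + log_rho_mean                   # unnormalized log Q(c_t)
    logits -= logits.max(axis=1, keepdims=True)     # guard against overflow
    Q_c = np.exp(logits)
    return Q_c / Q_c.sum(axis=1, keepdims=True)     # divide by Z_t

# Toy usage: 4 data points, 3 clusters.
rng = np.random.default_rng(1)
print(cluster_responsibilities(np.log([0.5, 0.3, 0.2]), rng.normal(size=(4, 3))))
```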

6 Supervised Classification

It is generally difficult for discriminative classifiers such as the multilayer perceptron (Bishop, 1995) or the support vector machine (Vapnik, 1998) to handle missing data. In this section, we extend the variational Bayesian technique to supervised classification.

Consider a data set $(X_T, Y_T) = \{x_t, y_t;\ t \in 1, \ldots, T\}$. Here $x_t$ contains the input attributes and may have missing entries; $y_t \in \{1, \ldots, y, \ldots, Y\}$ indicates which of the $Y$ classes $x_t$ is associated with. When given a new data point $x_{T+1}$, we would like to compute $P(y_{T+1} \mid x_{T+1}, X_T, Y_T, M)$:

$$P(y_{T+1} \mid x_{T+1}, X_T, Y_T, M) = \frac{P(x_{T+1} \mid y_{T+1}, X_T, Y_T, M)\, P(y_{T+1} \mid X_T, Y_T, M)}{P(x_{T+1} \mid X_T, Y_T, M)}. \qquad (6.1)$$

Here $M$ denotes our generative model for the observations $\{x_t, y_t\}$,

$$P(x_t, y_t \mid M) = P(x_t \mid y_t, M)\, P(y_t \mid M), \qquad (6.2)$$

where $P(x_t \mid y_t, M)$ could be a mixture model as given by equation 5.1.

6.1 Learning of Model Parameters. Let $P(x_t \mid y_t, M)$ be parameterized by $\Theta_y$ and $P(y_t \mid M)$ be parameterized by $\varpi = \{\varpi_1, \ldots, \varpi_Y\}$:

$$P(x_t \mid y_t = y, M) = P(x_t \mid \Theta_y), \qquad (6.3)$$

$$P(y_t \mid M) = P(y_t = y \mid \varpi) = \varpi_y. \qquad (6.4)$$

If $\varpi$ is given a Dirichlet prior, $P(\varpi \mid M) = \mathcal{D}(\varpi \mid d_o)$, its posterior also has a Dirichlet distribution:

$$P(\varpi \mid Y_T, M) = \mathcal{D}(\varpi \mid d), \qquad (6.5)$$

$$d_y = d_{oy} + \sum_t I(y_t = y). \qquad (6.6)$$

$I(\cdot)$ is an indicator function that equals 1 if its argument is true and 0 otherwise.

Under the generative model of equation 6.2, it can be shown that

$$P(\Theta_y \mid X_T, Y_T, M) = P(\Theta_y \mid X_y), \qquad (6.7)$$

where $X_y$ is the subset of $X_T$ containing only those $x_t$ whose training labels $y_t$ have value $y$. Hence, $P(\Theta_y \mid X_T, Y_T, M)$ can be approximated with $Q(\Theta_y)$ by applying the learning rules in sections 3 and 5 on the subset $X_y$.
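Concretely, the class prior update of equation 6.6 and the per-class partitioning behind equation 6.7 amount to counting labels and fitting one density model per class. A brief sketch (Python; `fit_vb_ica` is a hypothetical stand-in for the learning rules of sections 3 and 5, not a function defined in the article):

```python
import numpy as np

def fit_class_models(X, y, n_classes, d_o=1.0, fit_vb_ica=None):
    """Split the training data by label and accumulate Dirichlet counts.

    X   : (T, N) inputs, NaN marking missing entries
    y   : (T,)   integer labels in {0, ..., n_classes - 1}
    d_o : symmetric Dirichlet prior count d_oy
    """
    d = np.full(n_classes, d_o)
    class_models = []
    for c in range(n_classes):
        X_c = X[y == c]                   # the subset X_y of equation 6.7
        d[c] += len(X_c)                  # equation 6.6: d_y = d_oy + sum_t I(y_t = y)
        # Q(Theta_y) would be obtained by running the variational learning rules on X_c.
        class_models.append(fit_vb_ica(X_c) if fit_vb_ica else None)
    return d, class_models

# Toy usage: 6 points, 2 classes, no density model fitted (fit_vb_ica=None).
X = np.random.default_rng(2).normal(size=(6, 5))
y = np.array([0, 1, 1, 0, 1, 0])
print(fit_class_models(X, y, n_classes=2)[0])   # posterior Dirichlet counts
```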

6.2 Classification. First, $P(y_{T+1} \mid X_T, Y_T, M)$ in equation 6.1 can be computed by

$$P(y_{T+1} = y \mid X_T, Y_T, M) = \int P(y_{T+1} = y \mid \varpi)\, P(\varpi \mid X_T, Y_T)\, d\varpi = \frac{d_y}{\sum_{y'} d_{y'}}. \qquad (6.8)$$

The other term, $P(x_{T+1} \mid y_{T+1}, X_T, Y_T, M)$, can be computed as

$$\log P(x_{T+1} \mid y_{T+1} = y, X_T, Y_T, M) = \log P(x_{T+1} \mid X_y, M) = \log P(x_{T+1}, X_y \mid M) - \log P(X_y \mid M) \qquad (6.9)$$
$$\approx E(\{x_{T+1}, X_y\}, Q'(\Theta_y)) - E(X_y, Q(\Theta_y)). \qquad (6.10)$$

The above requires adding $x_{T+1}$ to $X_y$ and iterating the learning rules to obtain $Q'(\Theta_y)$ and $E(\{x_{T+1}, X_y\}, Q'(\Theta_y))$. The error in the approximation is the difference $KL(Q'(\Theta_y), P(\Theta_y \mid \{x_{T+1}, X_y\})) - KL(Q(\Theta_y), P(\Theta_y \mid X_y))$. If we assume further that $Q'(\Theta_y) \approx Q(\Theta_y)$,

$$\log P(x_{T+1} \mid X_y, M) \approx \int Q(\Theta_y) \log P(x_{T+1} \mid \Theta_y)\, d\Theta_y = \log Z_{T+1}, \qquad (6.11)$$

where $Z_{T+1}$ is the normalization constant in equation 5.6.
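Putting equations 6.1, 6.8, and 6.10 together, prediction reduces to combining a Dirichlet class prior with per-class bound differences. A schematic sketch (Python; `bound` is a hypothetical callable returning the lower bound $E(X, Q(\Theta_y))$ for a given data set, which the learning rules would compute):

```python
import numpy as np

def predict_class(x_new, class_data, d, bound):
    """Approximate P(y_{T+1} | x_{T+1}, X_T, Y_T, M) via equations 6.1, 6.8, 6.10.

    x_new      : new input (may contain NaN for missing attributes)
    class_data : list of per-class training arrays X_y
    d          : posterior Dirichlet counts, equation 6.6
    bound      : callable, bound(X) -> lower bound E(X, Q(Theta_y))
    """
    log_prior = np.log(d) - np.log(d.sum())             # equation 6.8
    log_lik = np.array([
        bound(np.vstack([X_y, x_new])) - bound(X_y)     # equation 6.10
        for X_y in class_data
    ])
    log_post = log_prior + log_lik                       # numerator of equation 6.1
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()                             # normalize over classes

# Toy usage with a fake bound (a gaussian log-likelihood stand-in).
fake_bound = lambda X: -0.5 * np.nansum(X ** 2)
data = [np.random.default_rng(3).normal(size=(5, 4)) + m for m in (0.0, 2.0)]
print(predict_class(np.full(4, 2.0), data, d=np.array([6.0, 8.0]), bound=fake_bound))
```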

7 Experiment

7.1 Synthetic Data. In the first experiment, 200 data points were generated by mixing four sources randomly in a seven-dimensional space. The generalized gaussian, gamma, and beta distributions were used to represent source densities of various skewness and kurtosis (see Figure 5). Noise at $-26$ dB level was added to the data, and missing entries were created with a probability of 0.3. The data matrix for the first 100 data points is plotted in Figure 4. Dark pixels represent missing entries. Notice that some data points have fewer than four observed dimensions. In Figure 5, we plotted the histograms of the recovered sources and the probability density functions (pdf) of the four sources. The dashed line is the exact pdf used to generate the data, and the solid line is the modeled pdf by the mixture of two one-dimensional gaussians (see equation 2.2). This shows that the two gaussians gave an adequate fit to the source histograms and densities.

Figure 4: In the first experiment, 30% of the entries in the seven-dimensional data set are missing, as indicated by the black entries. (The first 100 data points are shown.)

Figure 5: Source density modeling by variational missing ICA of the synthetic data. Histograms: recovered sources distribution; dashed lines: original probability densities; solid line: mixture of gaussians modeled probability densities; dotted lines: individual gaussian contribution.
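For readers who want to reproduce a data set of this shape, a rough generator sketch follows (Python; the exact source distributions, mixing matrix, and noise scaling used in the article are not specified beyond the description above, so the choices below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, L, N = 200, 4, 7                       # data points, sources, mixtures

# Sources with various skewness and kurtosis (gamma, beta, laplace-like, gaussian).
s = np.column_stack([
    rng.gamma(2.0, 1.0, T),
    rng.beta(0.5, 0.5, T),
    rng.laplace(0.0, 1.0, T),
    rng.normal(0.0, 1.0, T),
])
s = (s - s.mean(0)) / s.std(0)            # zero mean, unit variance sources

A = rng.normal(size=(N, L))               # random 7x4 mixing matrix
x = s @ A.T                               # mix into seven dimensions

# Add low-level sensor noise (scale chosen for roughly -26 dB relative power).
noise_scale = np.sqrt(10 ** (-26 / 10) * x.var())
x += rng.normal(0.0, noise_scale, size=x.shape)

# Delete entries independently with probability 0.3.
mask_missing = rng.random(x.shape) < 0.3
x_obs = np.where(mask_missing, np.nan, x)
print(np.isnan(x_obs).mean())             # observed missing rate, close to 0.3
```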

Figure 6: $E(X, Q(\theta))$ as a function of hidden source dimensions. Full missing ICA refers to the full expansions of gaussians discussed in section 2.3, and polynomial missing ICA refers to the Chan et al. (2002) method with minor modification. (The plot shows the log marginal likelihood lower bound, roughly $-2000$ to $-1500$, against the number of dimensions, 1 through 7, for both methods.)

Figure 6 plots the lower bound of the log marginal likelihood (see equation 3.12) for models assuming different numbers of intrinsic dimensions. As expected, the Bayesian treatment allows us to infer the intrinsic dimension of the data cloud. In the figure, we also plot $E(X, Q(\theta))$ from the polynomial missing ICA. Since a less negative lower bound represents a smaller Kullback-Leibler divergence between $Q(\theta)$ and the true posterior $P(\theta \mid X)$, it is clear from the figure that the full missing ICA gave a better fit to the data density.
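The dimension is then read off as the argmax of the bound. A two-line sketch (Python; the `bounds` numbers are made up for illustration, not the values from Figure 6):

```python
import numpy as np

# Hypothetical lower bounds E(X, Q(theta)) for models with 1..7 assumed sources.
bounds = np.array([-1950.0, -1840.0, -1710.0, -1582.0, -1588.0, -1590.0, -1594.0])
best_L = int(np.argmax(bounds)) + 1   # dimensions are 1-indexed
print(f"inferred intrinsic dimension: {best_L}")
```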

7.2 Mixing Images. This experiment demonstrates the ability of the proposed method to fill in missing values while performing demixing. This is made possible if we have more mixtures than hidden sources, or $N > L$. The top row in Figure 7 shows the two original 380 × 380 pixel images. They were linearly mixed into three images, and $-20$ dB noise was added. Missing entries were introduced randomly with probability 0.2. The denoised mixtures are shown in the third row of Figure 7, and the recovered sources are in the bottom row. Only 0.8% of the pixels were missing from all three mixed images and could not be recovered; 38.4% of the pixels were missing from only one mixed image, and their values could be filled in with low uncertainty; and 9.6% of the pixels were missing from any two of the mixed images, for which estimation of their values is possible but would have high uncertainty. From Figure 7, we can see that the source images were well separated and the mixed images were nicely denoised. The signal-to-noise ratio (SNR) in the separated images was 14 dB. We also tried filling in the missing pixels by EM with a gaussian model; variational Bayesian ICA was then applied on the "completed" data. The SNR achieved in the unmixed images was 5 dB. This supports the conclusion that it is crucial to have the correct density model when filling in missing values and important to learn the density model and the missing values concurrently. The denoised mixed images in this example were meant only to illustrate the method visually. However, if $x_1$, $x_2$, $x_3$ represent cholesterol, blood sugar, and uric acid levels, for example, it would be possible to fill in the third when only two are available.

Figure 7: A demonstration of recovering missing values when $N > L$. The original images are in the top row. Twenty percent of the pixels in the mixed images (second row) are missing at random. Only 0.8% are missing from the denoised mixed images (third row) and separated images (bottom).
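The missing-pattern percentages quoted above follow directly from the independent 0.2 dropout per mixture, and the SNR figure is a standard power ratio. A quick check (Python; purely illustrative arithmetic and synthetic data, not the article's images):

```python
from math import comb

import numpy as np

# Probability that a pixel is missing from exactly m of the 3 mixed images,
# when each image drops pixels independently with probability 0.2.
p = 0.2
for m in range(4):
    print(m, comb(3, m) * p**m * (1 - p)**(3 - m))   # 0.512, 0.384, 0.096, 0.008

def snr_db(clean, estimate):
    """Signal-to-noise ratio in dB between a clean signal and its estimate."""
    noise = estimate - clean
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

# Toy usage on a synthetic 'image'.
clean = np.random.default_rng(4).normal(size=(380, 380))
print(snr_db(clean, clean + 0.2 * np.random.default_rng(5).normal(size=clean.shape)))
```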

7.3 Survival Prediction. We demonstrate the supervised classification discussed in section 6 with an echocardiogram data set downloaded from the UCI Machine Learning Repository (Blake & Merz, 1998). Input variables are age-at-heart-attack, fractional-shortening, epss, lvdd, and wall-motion-index. The goal is to predict survival of the patient one year after the heart attack. There are 24 positive and 50 negative examples. The data matrix has a missing rate of 5.4%. We performed leave-one-out cross-validation to evaluate our classifier. Thresholding the output $P(y_{T+1} \mid x_{T+1}, X_T, Y_T, M)$, computed using equation 6.10, at 0.5, we got a true positive rate of 16/24 and a true negative rate of 42/50.
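The evaluation loop itself is straightforward. A sketch of leave-one-out thresholding (Python; `predict_posterior` is a hypothetical stand-in for the classifier of section 6, and the data below are synthetic, not the echocardiogram set):

```python
import numpy as np

def leave_one_out_rates(X, y, predict_posterior, threshold=0.5):
    """Leave-one-out evaluation of a binary missing-data classifier.

    X                 : (T, N) inputs with NaN for missing entries
    y                 : (T,)   binary labels (1 = survived, 0 = did not)
    predict_posterior : callable (X_train, y_train, x_test) -> P(y_test = 1 | ...)
    """
    T = len(y)
    pred = np.zeros(T, dtype=int)
    for t in range(T):
        keep = np.arange(T) != t                        # hold out one example
        p1 = predict_posterior(X[keep], y[keep], X[t])
        pred[t] = int(p1 >= threshold)                  # threshold the posterior at 0.5
    tp = np.sum((pred == 1) & (y == 1)) / np.sum(y == 1)    # true positive rate
    tn = np.sum((pred == 0) & (y == 0)) / np.sum(y == 0)    # true negative rate
    return tp, tn

# Toy usage with a dummy predictor that just returns the training base rate.
rng = np.random.default_rng(6)
X = rng.normal(size=(20, 5))
X[rng.random(X.shape) < 0.05] = np.nan
y = np.array([1] * 6 + [0] * 14)
dummy = lambda Xtr, ytr, xte: ytr.mean()
print(leave_one_out_rates(X, y, dummy))
```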

8 Conclusion

In this article, we derived the learning rules for variational Bayesian ICA with missing data. The complexity of the method is proportional to $T \times K^L$, where $T$ is the number of data points, $L$ is the number of hidden sources assumed, and $K$ is the number of one-dimensional gaussians used to model the density of each source. However, this exponential growth in complexity is manageable and worthwhile for small data sets containing missing entries in a high-dimensional space. The proposed method shows promise in analyzing and identifying projections of data sets that have a very limited number of expensive data points yet contain missing entries due to data scarcity. The extension to model data density with clusters of ICA was discussed. The application of the technique in a supervised classification setting was also covered. We have applied the variational Bayesian missing ICA to a primates' brain volumetric data set containing 44 examples in 57 dimensions. Very encouraging results were obtained and will be reported in another article.

References

Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.

Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Irvine, CA: University of California.

Chan, K., Lee, T.-W., & Sejnowski, T. J. (2002). Variational learning of clusters of undercomplete nonsymmetric independent components. Journal of Machine Learning Research, 3, 99–114.

Choudrey, R. A., & Roberts, S. J. (2001). Flexible Bayesian independent component analysis for blind source separation. In 3rd International Conference on Independent Component Analysis and Blind Signal Separation (pp. 90–95). San Diego, CA: Institute for Neural Computation.

Ghahramani, Z., & Jordan, M. (1994). Learning from incomplete data (Tech. Rep. CBCL Paper No. 108). Cambridge, MA: Center for Biological and Computational Learning, MIT.

Hyvarinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Jordan, M. I., Ghahramani, Z., Jaakkola, T., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.

Jung, T.-P., Makeig, S., McKeown, M. J., Bell, A., Lee, T.-W., & Sejnowski, T. J. (2001). Imaging brain dynamics using independent component analysis. Proceedings of the IEEE, 89(7), 1107–1122.

Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.

Mackay, D. J. (1995). Ensemble learning and evidence maximization (Tech. Rep.). Cambridge: Cavendish Laboratory, University of Cambridge.

Miskin, J. (2000). Ensemble learning for independent component analysis. Unpublished doctoral dissertation, University of Cambridge.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Welling, M., & Weber, M. (1999). Independent component analysis of incomplete data. In 1999 6th Joint Symposium on Neural Computation Proceedings (Vol. 9, pp. 162–168). San Diego, CA: Institute for Neural Computation.

Received July 18, 2002; accepted January 30, 2003.

Page 14: Variational Bayesian Learning of ICA with Missing Datakwchan/publications/nc03-mICA.pdfLa Jolla, CA 92093, U.S.A. Missing data are common in real-world data sets and are a problem

2004 K Chan T Lee and T Sejnowski

5 Clusters of ICA

The variational Bayesian ICA for missing data described above can be easilyextended to model data density with C clusters of ICA First all parametersmicro and hidden variables kt skt for each cluster are given a superscript indexc Parameter frac12 D ffrac121 frac12Cg is introduced to represent the weights onthe clusters frac12 has a Dirichlet prior (see equation 211) 2 D ffrac12 micro1 microCgis now the collection of all parameters Our density model in equation 21becomes

Pxt j 2 DX

cPct D c j frac12Pxt j micro c

DX

cPct D c j frac12

ZN xt j Acsc

t C ordmc ordfcPsct j micro c

s dsct (51)

The objective function in equation 213 remains the same but with micro replacedby 2 The separable posterior Q2 is given by

Q2 D Qfrac12Y

cQmicro c (52)

and similar to equation 215

ZQ2 log

P2

Q2d2 D

ZQfrac12 log

Pfrac12

Qfrac12dfrac12

CX

c

ZQmicro c log

Pmicro c

Qmicro cdmicro c (53)

Equation 220 now becomes

log Pxot j 2 cedil

X

cQct log

Pct

QctC

X

ck

QctQkct

poundmicroZ

Qsckt log Pxo

t j sckt micro c dsc

kt

CZ

Qsckt log

N sckt j Aacutec

k macrck

Qsckt

dsckt

para

CX

ck

QctQkct log

frac14 ck

Qkct

(54)

We have introduced one more hidden variable ct and Qct is to be inter-preted in the same fashion as Qkc

t All learning rules in section 3 remain

Variational Bayesian Learning of ICA with Missing Data 2005

the same only withP

t replaced byP

t Qct Finally we need two morelearning rules

dfrac12c D dofrac12c C

X

tQct (55)

log Qct D hlog frac12ci C log zct iexcl log Zt (56)

where zct is the normalization constant for Qkc

t (see equation 311) and Ztis for normalizing Qct

6 Supervised Classication

It is generally difcult for discriminative classiers such as multilayer per-ceptron (Bishop 1995) or support vector machine (Vapnik 1998) to handlemissing data In this section we extend the variational Bayesian techniqueto supervised classication

Consider a data set XT YT D fxt yt t in 1 Tg Here xt containsthe input attributes and may have missing entries yt 2 f1 y Ygindicates which of the Y classes xt is associated with When given a newdata point xTC1 we would like to compute PyTC1 j xTC1 XT YT M

PyTC1 j xTC1 XT YT M

D PxTC1 j yTC1 XT YT MPyTC1 j XT YT M

PxTC1 j XT YT M (61)

Here M denotes our generative model for observation fxt ytg

Pxt yt j M D Pxt j yt MPyt j M (62)

Pxt j yt M could be a mixture model as given by equation 51

61 Learning of Model Parameters Let Pxt j yt M be parameterizedby 2y and Pyt j M be parameterized by D 1 Y

Pxt j yt D y M D Pxt j 2y (63)

Pyt j M D Pyt D y j D y (64)

If is given a Dirichlet prior P j M D D j do its posterior hasalso a Dirichlet distribution

P j YT M D D j d (65)

dy D doy CX

tIyt D y (66)

2006 K Chan T Lee and T Sejnowski

Icent is an indicator function that equals 1 if its argument is true and 0 other-wise

Under the generative model of equation 62 it can be shown that

P2y j XT YT M D P2y j Xy (67)

where Xy is a subset of XT but contains only those xt whose training labelsyt have value y Hence P2y j XT YT M can be approximated with Q2y

by applying the learning rules in sections 3 and 5 on subset Xy

62 Classication First PyTC1 j XT YT M in equation 61 can be com-puted by

PyTC1 D y j XT YT M DZ

PyTC1 D y j yPy j XT YT dy

DdyPy dy

(68)

The other term PxTC1 j yTC1 XT YT M can be computed as

log PxTC1 j yTC1 D y XT YT M

D log PxTC1 j Xy M

D log PxTC1 Xy j M iexcl log PXy j M (69)

frac14 EfxTC1 Xyg Q02y iexcl EXy Q2y (610)

The above requires adding xTC1 to Xy and iterating the learning rules toobtain Q02y and EfxTC1 Xyg Q02y The error in the approximation isthe difference KLQ02y P2y j fxTC1 Xyg iexcl KLQ2y P2y j Xy If weassume further that Q02y frac14 Q2y

log PxTC1 j Xy M frac14Z

Q2y log PxTC1 j 2y d2y

D log ZTC1 (611)

where ZTC1 is the normalization constant in equation 56

7 Experiment

71 Synthetic Data In the rst experiment 200 data points were gener-ated by mixing four sources randomly in a seven-dimensional space Thegeneralized gaussian gamma and beta distributions were used to repre-sent source densities of various skewness and kurtosis (see Figure 5) Noise

Variational Bayesian Learning of ICA with Missing Data 2007

Figure 4 In the rst experiment 30 of the entries in the seven-dimensionaldata set are missing as indicated by the black entries (The rst 100 data pointsare shown)

Figure 5 Source density modeling by variational missing ICA of the syntheticdata Histograms recovered sources distribution dashed lines original proba-bility densities solid line mixture of gaussians modeled probability densitiesdotted lines individual gaussian contribution

at iexcl26 dB level was added to the data and missing entries were createdwith a probability of 03 The data matrix for the rst 100 data points isplotted in Figure 4 Dark pixels represent missing entries Notice that somedata points have fewer than four observed dimensions In Figure 5 weplotted the histograms of the recovered sources and the probability densityfunctions (pdf) of the four sources The dashed line is the exact pdf usedto generate the data and the solid line is the modeled pdf by mixture oftwo one-dimensional gaussians (see equation 22) This shows that the twogaussians gave adequate t to the source histograms and densities

2008 K Chan T Lee and T Sejnowski

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

Number of dimensions

log

mar

gina

l lik

elih

ood

low

er b

ound

full missing ICA polynomial missing ICA

Figure 6 E X Qmicro as a function of hidden source dimensions Full missing ICArefers to the full expansions of gaussians discussed in section 23 and polynomialmissing ICA refers to the Chan et al (2002) method with minor modication

Figure 6 plots the lower bound of log marginal likelihood (see equa-tion 312) for models assuming different numbers of intrinsic dimensionsAs expected the Bayesian treatment allows us to the infer the intrinsic di-mension of the data cloud In the gure we also plot the EX Qmicro fromthe polynomial missing ICA Since a less negative lower bound representsa smaller Kullback-Leibler divergence between Qmicro and PX j micro it isclear from the gure that the full missing ICA gave a better t to the datadensity

72 Mixing Images This experiment demonstrates the ability of the pro-posed method to ll in missing values while performing demixing This ismade possible if we have more mixtures than hidden sources or N gt L Thetop row in Figure 7 shows the two original 380 pound 380 pixel images Theywere linearly mixed into three images and iexcl20 dB noise was added Miss-ing entries were introduced randomly with probability 02 The denoisedmixtures are shown in the third row of Figure 7 and the recovered sourcesare in the bottom row Only 08 of the pixels were missing from all threemixed images and could not be recovered 384 of the pixels were missingfrom only one mixed image and their values could be lled in with low

Variational Bayesian Learning of ICA with Missing Data 2009

Figure 7 A demonstration of recovering missing values when N gt L Theoriginal images are in the top row Twenty percent of the pixels in the mixedimages (second row) are missing at random Only 08 are missing from thedenoised mixed images (third row) and separated images (bottom)

2010 K Chan T Lee and T Sejnowski

uncertainty and 96 of the pixels were missing from any two of the mixedimages Estimation of their values is possible but would have high uncer-tainty From Figure 7 we can see that the source images were well separatedand the mixed images were nicely denoised The signal-to-noise ratio (SNR)in the separated images was 14 dB We have also tried lling in the missingpixels by EM with a gaussian model Variational Bayesian ICA was then ap-plied on the ldquocompletedrdquo data The SNR achieved in the unmixed imageswas 5 dB This supports that it is crucial to have the correct density modelwhen lling in missing values and important to learn the density model andmissing values concurrently The denoised mixed images in this examplewere meant only to illustrate the method visually However if x1 x2 x3

represent cholesterol blood sugar and uric acid level for example it wouldbe possible to ll in the third when only two are available

73 Survival Prediction We demonstrate the supervised classicationdiscussed in section 6 with an echocardiogram data set downloaded fromthe UCI Machine Learning Repository (Blake amp Merz 1998) Input variablesare age-at-heart-attack fractional-shortening epss lvdd and wall-motion-indexThe goal is to predict survival of the patient one year after heart attack Thereare 24 positive and 50 negative examples The data matrix has a missingrate of 54 We performed leave-one-out cross-validation to evaluate ourclassier Thresholding the output PyTC1 j XT YT M computed usingequation 610 at 05 we got a true positive rate of 1624 and a true negativerate of 4250

8 Conclusion

In this article we derived the learning rules for variational Bayesian ICAwith missing data The complexity of the method is proportional to T pound KLwhere T is the number of data points L is the number of hidden sourcesassumed and K is the number of 1D gaussians used to model the densityof each source However this exponential growth in complexity is man-ageable and worthwhile for small data sets containing missing entries in ahigh-dimensional space The proposed method shows promise in analyzingand identifying projections of data sets that have a very limited number ofexpensive data points yet contain missing entries due to data scarcity Theextension to model data density with clusters of ICA was discussed Theapplication of the technique in a supervised classication setting was alsocovered We have applied the variational Bayesian missing ICA to a pri-matesrsquo brain volumetric data set containing 44 examples in 57 dimensionsVery encouraging results were obtained and will be reported in anotherarticle

Variational Bayesian Learning of ICA with Missing Data 2011

References

Attias H (1999) Independent factor analysis Neural Computation 11(4) 803ndash851

Bishop C M (1995) Neural networks for pattern recognition Oxford ClarendonPress

Blake C amp Merz C (1998) UCI repository of machine learning databases IrvineCA University of California

Chan K Lee T-W amp Sejnowski T J (2002) Variational learning of clusters ofundercomplete nonsymmetric independent components Journal of MachineLearning Research 3 99ndash114

Choudrey R A amp Roberts S J (2001) Flexible Bayesian independent compo-nent analysis for blind source separation In 3rd International Conference onIndependent Component Analysis and Blind Signal Separation (pp 90ndash95) SanDiego CA Institute for Neural Computation

Ghahramani Z amp Jordan M (1994) Learning from incomplete data (Tech RepCBCL Paper No 108) Cambridge MA Center for Biological and Computa-tional Learning MIT

Hyvarinen A Karhunen J amp Oja E (2001) Independent component analysisNew York Wiley

Jordan M I Ghahramani Z Jaakkola T amp Saul L K (1999) An introductionto variational methods for graphical models Machine Learning 37(2) 183ndash233

Jung T-P Makeig S McKeown M J Bell A Lee T-W amp Sejnowski T J(2001) Imaging brain dynamics using independent component analysisProceedings of the IEEE 89(7) 1107ndash1122

Little R J A amp Rubin D B (1987) Statistical analysis with missing data NewYork Wiley

Mackay D J (1995) Ensemble learning and evidence maximization (Tech Rep)Cambridge Cavendish Laboratory University of Cambridge

Miskin J (2000) Ensemble learning for independent component analysis Unpub-lished doctoral dissertation University of Cambridge

Vapnik V (1998) Statistical learning theory New York WileyWelling M amp Weber M (1999) Independent component analysis of incomplete

data In 1999 6th Joint Symposium on Neural Compuatation Proceedings (Vol 9pp 162ndash168) San Diego CA Institute for Neural Computation

Received July 18 2002 accepted January 30 2003

Page 15: Variational Bayesian Learning of ICA with Missing Datakwchan/publications/nc03-mICA.pdfLa Jolla, CA 92093, U.S.A. Missing data are common in real-world data sets and are a problem

Variational Bayesian Learning of ICA with Missing Data 2005

the same only withP

t replaced byP

t Qct Finally we need two morelearning rules

dfrac12c D dofrac12c C

X

tQct (55)

log Qct D hlog frac12ci C log zct iexcl log Zt (56)

where zct is the normalization constant for Qkc

t (see equation 311) and Ztis for normalizing Qct

6 Supervised Classication

It is generally difcult for discriminative classiers such as multilayer per-ceptron (Bishop 1995) or support vector machine (Vapnik 1998) to handlemissing data In this section we extend the variational Bayesian techniqueto supervised classication

Consider a data set XT YT D fxt yt t in 1 Tg Here xt containsthe input attributes and may have missing entries yt 2 f1 y Ygindicates which of the Y classes xt is associated with When given a newdata point xTC1 we would like to compute PyTC1 j xTC1 XT YT M

PyTC1 j xTC1 XT YT M

D PxTC1 j yTC1 XT YT MPyTC1 j XT YT M

PxTC1 j XT YT M (61)

Here M denotes our generative model for observation fxt ytg

Pxt yt j M D Pxt j yt MPyt j M (62)

Pxt j yt M could be a mixture model as given by equation 51

61 Learning of Model Parameters Let Pxt j yt M be parameterizedby 2y and Pyt j M be parameterized by D 1 Y

Pxt j yt D y M D Pxt j 2y (63)

Pyt j M D Pyt D y j D y (64)

If is given a Dirichlet prior P j M D D j do its posterior hasalso a Dirichlet distribution

P j YT M D D j d (65)

dy D doy CX

tIyt D y (66)

2006 K Chan T Lee and T Sejnowski

Icent is an indicator function that equals 1 if its argument is true and 0 other-wise

Under the generative model of equation 62 it can be shown that

P2y j XT YT M D P2y j Xy (67)

where Xy is a subset of XT but contains only those xt whose training labelsyt have value y Hence P2y j XT YT M can be approximated with Q2y

by applying the learning rules in sections 3 and 5 on subset Xy

62 Classication First PyTC1 j XT YT M in equation 61 can be com-puted by

PyTC1 D y j XT YT M DZ

PyTC1 D y j yPy j XT YT dy

DdyPy dy

(68)

The other term PxTC1 j yTC1 XT YT M can be computed as

log PxTC1 j yTC1 D y XT YT M

D log PxTC1 j Xy M

D log PxTC1 Xy j M iexcl log PXy j M (69)

frac14 EfxTC1 Xyg Q02y iexcl EXy Q2y (610)

The above requires adding xTC1 to Xy and iterating the learning rules toobtain Q02y and EfxTC1 Xyg Q02y The error in the approximation isthe difference KLQ02y P2y j fxTC1 Xyg iexcl KLQ2y P2y j Xy If weassume further that Q02y frac14 Q2y

log PxTC1 j Xy M frac14Z

Q2y log PxTC1 j 2y d2y

D log ZTC1 (611)

where ZTC1 is the normalization constant in equation 56

7 Experiment

71 Synthetic Data In the rst experiment 200 data points were gener-ated by mixing four sources randomly in a seven-dimensional space Thegeneralized gaussian gamma and beta distributions were used to repre-sent source densities of various skewness and kurtosis (see Figure 5) Noise

Variational Bayesian Learning of ICA with Missing Data 2007

Figure 4 In the rst experiment 30 of the entries in the seven-dimensionaldata set are missing as indicated by the black entries (The rst 100 data pointsare shown)

Figure 5 Source density modeling by variational missing ICA of the syntheticdata Histograms recovered sources distribution dashed lines original proba-bility densities solid line mixture of gaussians modeled probability densitiesdotted lines individual gaussian contribution

at iexcl26 dB level was added to the data and missing entries were createdwith a probability of 03 The data matrix for the rst 100 data points isplotted in Figure 4 Dark pixels represent missing entries Notice that somedata points have fewer than four observed dimensions In Figure 5 weplotted the histograms of the recovered sources and the probability densityfunctions (pdf) of the four sources The dashed line is the exact pdf usedto generate the data and the solid line is the modeled pdf by mixture oftwo one-dimensional gaussians (see equation 22) This shows that the twogaussians gave adequate t to the source histograms and densities

2008 K Chan T Lee and T Sejnowski

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

Number of dimensions

log

mar

gina

l lik

elih

ood

low

er b

ound

full missing ICA polynomial missing ICA

Figure 6 E X Qmicro as a function of hidden source dimensions Full missing ICArefers to the full expansions of gaussians discussed in section 23 and polynomialmissing ICA refers to the Chan et al (2002) method with minor modication

Figure 6 plots the lower bound of log marginal likelihood (see equa-tion 312) for models assuming different numbers of intrinsic dimensionsAs expected the Bayesian treatment allows us to the infer the intrinsic di-mension of the data cloud In the gure we also plot the EX Qmicro fromthe polynomial missing ICA Since a less negative lower bound representsa smaller Kullback-Leibler divergence between Qmicro and PX j micro it isclear from the gure that the full missing ICA gave a better t to the datadensity

72 Mixing Images This experiment demonstrates the ability of the pro-posed method to ll in missing values while performing demixing This ismade possible if we have more mixtures than hidden sources or N gt L Thetop row in Figure 7 shows the two original 380 pound 380 pixel images Theywere linearly mixed into three images and iexcl20 dB noise was added Miss-ing entries were introduced randomly with probability 02 The denoisedmixtures are shown in the third row of Figure 7 and the recovered sourcesare in the bottom row Only 08 of the pixels were missing from all threemixed images and could not be recovered 384 of the pixels were missingfrom only one mixed image and their values could be lled in with low

Variational Bayesian Learning of ICA with Missing Data 2009

Figure 7 A demonstration of recovering missing values when N gt L Theoriginal images are in the top row Twenty percent of the pixels in the mixedimages (second row) are missing at random Only 08 are missing from thedenoised mixed images (third row) and separated images (bottom)

2010 K Chan T Lee and T Sejnowski

uncertainty and 96 of the pixels were missing from any two of the mixedimages Estimation of their values is possible but would have high uncer-tainty From Figure 7 we can see that the source images were well separatedand the mixed images were nicely denoised The signal-to-noise ratio (SNR)in the separated images was 14 dB We have also tried lling in the missingpixels by EM with a gaussian model Variational Bayesian ICA was then ap-plied on the ldquocompletedrdquo data The SNR achieved in the unmixed imageswas 5 dB This supports that it is crucial to have the correct density modelwhen lling in missing values and important to learn the density model andmissing values concurrently The denoised mixed images in this examplewere meant only to illustrate the method visually However if x1 x2 x3

represent cholesterol blood sugar and uric acid level for example it wouldbe possible to ll in the third when only two are available

73 Survival Prediction We demonstrate the supervised classicationdiscussed in section 6 with an echocardiogram data set downloaded fromthe UCI Machine Learning Repository (Blake amp Merz 1998) Input variablesare age-at-heart-attack fractional-shortening epss lvdd and wall-motion-indexThe goal is to predict survival of the patient one year after heart attack Thereare 24 positive and 50 negative examples The data matrix has a missingrate of 54 We performed leave-one-out cross-validation to evaluate ourclassier Thresholding the output PyTC1 j XT YT M computed usingequation 610 at 05 we got a true positive rate of 1624 and a true negativerate of 4250

8 Conclusion

In this article we derived the learning rules for variational Bayesian ICAwith missing data The complexity of the method is proportional to T pound KLwhere T is the number of data points L is the number of hidden sourcesassumed and K is the number of 1D gaussians used to model the densityof each source However this exponential growth in complexity is man-ageable and worthwhile for small data sets containing missing entries in ahigh-dimensional space The proposed method shows promise in analyzingand identifying projections of data sets that have a very limited number ofexpensive data points yet contain missing entries due to data scarcity Theextension to model data density with clusters of ICA was discussed Theapplication of the technique in a supervised classication setting was alsocovered We have applied the variational Bayesian missing ICA to a pri-matesrsquo brain volumetric data set containing 44 examples in 57 dimensionsVery encouraging results were obtained and will be reported in anotherarticle

Variational Bayesian Learning of ICA with Missing Data 2011

References

Attias H (1999) Independent factor analysis Neural Computation 11(4) 803ndash851

Bishop C M (1995) Neural networks for pattern recognition Oxford ClarendonPress

Blake C amp Merz C (1998) UCI repository of machine learning databases IrvineCA University of California

Chan K Lee T-W amp Sejnowski T J (2002) Variational learning of clusters ofundercomplete nonsymmetric independent components Journal of MachineLearning Research 3 99ndash114

Choudrey R A amp Roberts S J (2001) Flexible Bayesian independent compo-nent analysis for blind source separation In 3rd International Conference onIndependent Component Analysis and Blind Signal Separation (pp 90ndash95) SanDiego CA Institute for Neural Computation

Ghahramani Z amp Jordan M (1994) Learning from incomplete data (Tech RepCBCL Paper No 108) Cambridge MA Center for Biological and Computa-tional Learning MIT

Hyvarinen A Karhunen J amp Oja E (2001) Independent component analysisNew York Wiley

Jordan M I Ghahramani Z Jaakkola T amp Saul L K (1999) An introductionto variational methods for graphical models Machine Learning 37(2) 183ndash233

Jung T-P Makeig S McKeown M J Bell A Lee T-W amp Sejnowski T J(2001) Imaging brain dynamics using independent component analysisProceedings of the IEEE 89(7) 1107ndash1122

Little R J A amp Rubin D B (1987) Statistical analysis with missing data NewYork Wiley

Mackay D J (1995) Ensemble learning and evidence maximization (Tech Rep)Cambridge Cavendish Laboratory University of Cambridge

Miskin J (2000) Ensemble learning for independent component analysis Unpub-lished doctoral dissertation University of Cambridge

Vapnik V (1998) Statistical learning theory New York WileyWelling M amp Weber M (1999) Independent component analysis of incomplete

data In 1999 6th Joint Symposium on Neural Compuatation Proceedings (Vol 9pp 162ndash168) San Diego CA Institute for Neural Computation

Received July 18 2002 accepted January 30 2003

Page 16: Variational Bayesian Learning of ICA with Missing Datakwchan/publications/nc03-mICA.pdfLa Jolla, CA 92093, U.S.A. Missing data are common in real-world data sets and are a problem

2006 K Chan T Lee and T Sejnowski

Icent is an indicator function that equals 1 if its argument is true and 0 other-wise

Under the generative model of equation 62 it can be shown that

P2y j XT YT M D P2y j Xy (67)

where Xy is a subset of XT but contains only those xt whose training labelsyt have value y Hence P2y j XT YT M can be approximated with Q2y

by applying the learning rules in sections 3 and 5 on subset Xy

62 Classication First PyTC1 j XT YT M in equation 61 can be com-puted by

PyTC1 D y j XT YT M DZ

PyTC1 D y j yPy j XT YT dy

DdyPy dy

(68)

The other term PxTC1 j yTC1 XT YT M can be computed as

log PxTC1 j yTC1 D y XT YT M

D log PxTC1 j Xy M

D log PxTC1 Xy j M iexcl log PXy j M (69)

frac14 EfxTC1 Xyg Q02y iexcl EXy Q2y (610)

The above requires adding xTC1 to Xy and iterating the learning rules toobtain Q02y and EfxTC1 Xyg Q02y The error in the approximation isthe difference KLQ02y P2y j fxTC1 Xyg iexcl KLQ2y P2y j Xy If weassume further that Q02y frac14 Q2y

log PxTC1 j Xy M frac14Z

Q2y log PxTC1 j 2y d2y

D log ZTC1 (611)

where ZTC1 is the normalization constant in equation 56

7 Experiment

71 Synthetic Data In the rst experiment 200 data points were gener-ated by mixing four sources randomly in a seven-dimensional space Thegeneralized gaussian gamma and beta distributions were used to repre-sent source densities of various skewness and kurtosis (see Figure 5) Noise

Variational Bayesian Learning of ICA with Missing Data 2007

Figure 4 In the rst experiment 30 of the entries in the seven-dimensionaldata set are missing as indicated by the black entries (The rst 100 data pointsare shown)

Figure 5 Source density modeling by variational missing ICA of the syntheticdata Histograms recovered sources distribution dashed lines original proba-bility densities solid line mixture of gaussians modeled probability densitiesdotted lines individual gaussian contribution

at iexcl26 dB level was added to the data and missing entries were createdwith a probability of 03 The data matrix for the rst 100 data points isplotted in Figure 4 Dark pixels represent missing entries Notice that somedata points have fewer than four observed dimensions In Figure 5 weplotted the histograms of the recovered sources and the probability densityfunctions (pdf) of the four sources The dashed line is the exact pdf usedto generate the data and the solid line is the modeled pdf by mixture oftwo one-dimensional gaussians (see equation 22) This shows that the twogaussians gave adequate t to the source histograms and densities

2008 K Chan T Lee and T Sejnowski

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

Number of dimensions

log

mar

gina

l lik

elih

ood

low

er b

ound

full missing ICA polynomial missing ICA

Figure 6 E X Qmicro as a function of hidden source dimensions Full missing ICArefers to the full expansions of gaussians discussed in section 23 and polynomialmissing ICA refers to the Chan et al (2002) method with minor modication

Figure 6 plots the lower bound of log marginal likelihood (see equa-tion 312) for models assuming different numbers of intrinsic dimensionsAs expected the Bayesian treatment allows us to the infer the intrinsic di-mension of the data cloud In the gure we also plot the EX Qmicro fromthe polynomial missing ICA Since a less negative lower bound representsa smaller Kullback-Leibler divergence between Qmicro and PX j micro it isclear from the gure that the full missing ICA gave a better t to the datadensity

72 Mixing Images This experiment demonstrates the ability of the pro-posed method to ll in missing values while performing demixing This ismade possible if we have more mixtures than hidden sources or N gt L Thetop row in Figure 7 shows the two original 380 pound 380 pixel images Theywere linearly mixed into three images and iexcl20 dB noise was added Miss-ing entries were introduced randomly with probability 02 The denoisedmixtures are shown in the third row of Figure 7 and the recovered sourcesare in the bottom row Only 08 of the pixels were missing from all threemixed images and could not be recovered 384 of the pixels were missingfrom only one mixed image and their values could be lled in with low

Variational Bayesian Learning of ICA with Missing Data 2009

Figure 7 A demonstration of recovering missing values when N gt L Theoriginal images are in the top row Twenty percent of the pixels in the mixedimages (second row) are missing at random Only 08 are missing from thedenoised mixed images (third row) and separated images (bottom)

2010 K Chan T Lee and T Sejnowski

uncertainty and 96 of the pixels were missing from any two of the mixedimages Estimation of their values is possible but would have high uncer-tainty From Figure 7 we can see that the source images were well separatedand the mixed images were nicely denoised The signal-to-noise ratio (SNR)in the separated images was 14 dB We have also tried lling in the missingpixels by EM with a gaussian model Variational Bayesian ICA was then ap-plied on the ldquocompletedrdquo data The SNR achieved in the unmixed imageswas 5 dB This supports that it is crucial to have the correct density modelwhen lling in missing values and important to learn the density model andmissing values concurrently The denoised mixed images in this examplewere meant only to illustrate the method visually However if x1 x2 x3

represent cholesterol blood sugar and uric acid level for example it wouldbe possible to ll in the third when only two are available

73 Survival Prediction We demonstrate the supervised classicationdiscussed in section 6 with an echocardiogram data set downloaded fromthe UCI Machine Learning Repository (Blake amp Merz 1998) Input variablesare age-at-heart-attack fractional-shortening epss lvdd and wall-motion-indexThe goal is to predict survival of the patient one year after heart attack Thereare 24 positive and 50 negative examples The data matrix has a missingrate of 54 We performed leave-one-out cross-validation to evaluate ourclassier Thresholding the output PyTC1 j XT YT M computed usingequation 610 at 05 we got a true positive rate of 1624 and a true negativerate of 4250

8 Conclusion

In this article we derived the learning rules for variational Bayesian ICAwith missing data The complexity of the method is proportional to T pound KLwhere T is the number of data points L is the number of hidden sourcesassumed and K is the number of 1D gaussians used to model the densityof each source However this exponential growth in complexity is man-ageable and worthwhile for small data sets containing missing entries in ahigh-dimensional space The proposed method shows promise in analyzingand identifying projections of data sets that have a very limited number ofexpensive data points yet contain missing entries due to data scarcity Theextension to model data density with clusters of ICA was discussed Theapplication of the technique in a supervised classication setting was alsocovered We have applied the variational Bayesian missing ICA to a pri-matesrsquo brain volumetric data set containing 44 examples in 57 dimensionsVery encouraging results were obtained and will be reported in anotherarticle

Variational Bayesian Learning of ICA with Missing Data 2011

References

Attias H (1999) Independent factor analysis Neural Computation 11(4) 803ndash851

Bishop C M (1995) Neural networks for pattern recognition Oxford ClarendonPress

Blake C amp Merz C (1998) UCI repository of machine learning databases IrvineCA University of California

Chan K Lee T-W amp Sejnowski T J (2002) Variational learning of clusters ofundercomplete nonsymmetric independent components Journal of MachineLearning Research 3 99ndash114

Choudrey R A amp Roberts S J (2001) Flexible Bayesian independent compo-nent analysis for blind source separation In 3rd International Conference onIndependent Component Analysis and Blind Signal Separation (pp 90ndash95) SanDiego CA Institute for Neural Computation

Ghahramani Z amp Jordan M (1994) Learning from incomplete data (Tech RepCBCL Paper No 108) Cambridge MA Center for Biological and Computa-tional Learning MIT

Hyvarinen A Karhunen J amp Oja E (2001) Independent component analysisNew York Wiley

Jordan M I Ghahramani Z Jaakkola T amp Saul L K (1999) An introductionto variational methods for graphical models Machine Learning 37(2) 183ndash233

Jung T-P Makeig S McKeown M J Bell A Lee T-W amp Sejnowski T J(2001) Imaging brain dynamics using independent component analysisProceedings of the IEEE 89(7) 1107ndash1122

Little R J A amp Rubin D B (1987) Statistical analysis with missing data NewYork Wiley

Mackay D J (1995) Ensemble learning and evidence maximization (Tech Rep)Cambridge Cavendish Laboratory University of Cambridge

Miskin J (2000) Ensemble learning for independent component analysis Unpub-lished doctoral dissertation University of Cambridge

Vapnik V (1998) Statistical learning theory New York WileyWelling M amp Weber M (1999) Independent component analysis of incomplete

data In 1999 6th Joint Symposium on Neural Compuatation Proceedings (Vol 9pp 162ndash168) San Diego CA Institute for Neural Computation

Received July 18 2002 accepted January 30 2003

Page 17: Variational Bayesian Learning of ICA with Missing Datakwchan/publications/nc03-mICA.pdfLa Jolla, CA 92093, U.S.A. Missing data are common in real-world data sets and are a problem

Variational Bayesian Learning of ICA with Missing Data 2007

Figure 4 In the rst experiment 30 of the entries in the seven-dimensionaldata set are missing as indicated by the black entries (The rst 100 data pointsare shown)

Figure 5 Source density modeling by variational missing ICA of the syntheticdata Histograms recovered sources distribution dashed lines original proba-bility densities solid line mixture of gaussians modeled probability densitiesdotted lines individual gaussian contribution

at iexcl26 dB level was added to the data and missing entries were createdwith a probability of 03 The data matrix for the rst 100 data points isplotted in Figure 4 Dark pixels represent missing entries Notice that somedata points have fewer than four observed dimensions In Figure 5 weplotted the histograms of the recovered sources and the probability densityfunctions (pdf) of the four sources The dashed line is the exact pdf usedto generate the data and the solid line is the modeled pdf by mixture oftwo one-dimensional gaussians (see equation 22) This shows that the twogaussians gave adequate t to the source histograms and densities

2008 K Chan T Lee and T Sejnowski

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

Number of dimensions

log

mar

gina

l lik

elih

ood

low

er b

ound

full missing ICA polynomial missing ICA

Figure 6 E X Qmicro as a function of hidden source dimensions Full missing ICArefers to the full expansions of gaussians discussed in section 23 and polynomialmissing ICA refers to the Chan et al (2002) method with minor modication

Figure 6 plots the lower bound of log marginal likelihood (see equa-tion 312) for models assuming different numbers of intrinsic dimensionsAs expected the Bayesian treatment allows us to the infer the intrinsic di-mension of the data cloud In the gure we also plot the EX Qmicro fromthe polynomial missing ICA Since a less negative lower bound representsa smaller Kullback-Leibler divergence between Qmicro and PX j micro it isclear from the gure that the full missing ICA gave a better t to the datadensity

72 Mixing Images This experiment demonstrates the ability of the pro-posed method to ll in missing values while performing demixing This ismade possible if we have more mixtures than hidden sources or N gt L Thetop row in Figure 7 shows the two original 380 pound 380 pixel images Theywere linearly mixed into three images and iexcl20 dB noise was added Miss-ing entries were introduced randomly with probability 02 The denoisedmixtures are shown in the third row of Figure 7 and the recovered sourcesare in the bottom row Only 08 of the pixels were missing from all threemixed images and could not be recovered 384 of the pixels were missingfrom only one mixed image and their values could be lled in with low

Variational Bayesian Learning of ICA with Missing Data 2009

Figure 7 A demonstration of recovering missing values when N gt L Theoriginal images are in the top row Twenty percent of the pixels in the mixedimages (second row) are missing at random Only 08 are missing from thedenoised mixed images (third row) and separated images (bottom)

2010 K Chan T Lee and T Sejnowski

uncertainty and 96 of the pixels were missing from any two of the mixedimages Estimation of their values is possible but would have high uncer-tainty From Figure 7 we can see that the source images were well separatedand the mixed images were nicely denoised The signal-to-noise ratio (SNR)in the separated images was 14 dB We have also tried lling in the missingpixels by EM with a gaussian model Variational Bayesian ICA was then ap-plied on the ldquocompletedrdquo data The SNR achieved in the unmixed imageswas 5 dB This supports that it is crucial to have the correct density modelwhen lling in missing values and important to learn the density model andmissing values concurrently The denoised mixed images in this examplewere meant only to illustrate the method visually However if x1 x2 x3

represent cholesterol blood sugar and uric acid level for example it wouldbe possible to ll in the third when only two are available

73 Survival Prediction We demonstrate the supervised classicationdiscussed in section 6 with an echocardiogram data set downloaded fromthe UCI Machine Learning Repository (Blake amp Merz 1998) Input variablesare age-at-heart-attack fractional-shortening epss lvdd and wall-motion-indexThe goal is to predict survival of the patient one year after heart attack Thereare 24 positive and 50 negative examples The data matrix has a missingrate of 54 We performed leave-one-out cross-validation to evaluate ourclassier Thresholding the output PyTC1 j XT YT M computed usingequation 610 at 05 we got a true positive rate of 1624 and a true negativerate of 4250

8 Conclusion

In this article we derived the learning rules for variational Bayesian ICAwith missing data The complexity of the method is proportional to T pound KLwhere T is the number of data points L is the number of hidden sourcesassumed and K is the number of 1D gaussians used to model the densityof each source However this exponential growth in complexity is man-ageable and worthwhile for small data sets containing missing entries in ahigh-dimensional space The proposed method shows promise in analyzingand identifying projections of data sets that have a very limited number ofexpensive data points yet contain missing entries due to data scarcity Theextension to model data density with clusters of ICA was discussed Theapplication of the technique in a supervised classication setting was alsocovered We have applied the variational Bayesian missing ICA to a pri-matesrsquo brain volumetric data set containing 44 examples in 57 dimensionsVery encouraging results were obtained and will be reported in anotherarticle

Variational Bayesian Learning of ICA with Missing Data 2011

References

Attias H (1999) Independent factor analysis Neural Computation 11(4) 803ndash851

Bishop C M (1995) Neural networks for pattern recognition Oxford ClarendonPress

Blake C amp Merz C (1998) UCI repository of machine learning databases IrvineCA University of California

Chan K Lee T-W amp Sejnowski T J (2002) Variational learning of clusters ofundercomplete nonsymmetric independent components Journal of MachineLearning Research 3 99ndash114

Choudrey R A amp Roberts S J (2001) Flexible Bayesian independent compo-nent analysis for blind source separation In 3rd International Conference onIndependent Component Analysis and Blind Signal Separation (pp 90ndash95) SanDiego CA Institute for Neural Computation

Ghahramani Z amp Jordan M (1994) Learning from incomplete data (Tech RepCBCL Paper No 108) Cambridge MA Center for Biological and Computa-tional Learning MIT

Hyvarinen A Karhunen J amp Oja E (2001) Independent component analysisNew York Wiley

Jordan M I Ghahramani Z Jaakkola T amp Saul L K (1999) An introductionto variational methods for graphical models Machine Learning 37(2) 183ndash233

Jung T-P Makeig S McKeown M J Bell A Lee T-W amp Sejnowski T J(2001) Imaging brain dynamics using independent component analysisProceedings of the IEEE 89(7) 1107ndash1122

Little R J A amp Rubin D B (1987) Statistical analysis with missing data NewYork Wiley

Mackay D J (1995) Ensemble learning and evidence maximization (Tech Rep)Cambridge Cavendish Laboratory University of Cambridge

Miskin J (2000) Ensemble learning for independent component analysis Unpub-lished doctoral dissertation University of Cambridge

Vapnik V (1998) Statistical learning theory New York WileyWelling M amp Weber M (1999) Independent component analysis of incomplete

data In 1999 6th Joint Symposium on Neural Compuatation Proceedings (Vol 9pp 162ndash168) San Diego CA Institute for Neural Computation

Received July 18 2002 accepted January 30 2003

Page 18: Variational Bayesian Learning of ICA with Missing Datakwchan/publications/nc03-mICA.pdfLa Jolla, CA 92093, U.S.A. Missing data are common in real-world data sets and are a problem

2008 K Chan T Lee and T Sejnowski

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

Number of dimensions

log

mar

gina

l lik

elih

ood

low

er b

ound

full missing ICA polynomial missing ICA

Figure 6 E X Qmicro as a function of hidden source dimensions Full missing ICArefers to the full expansions of gaussians discussed in section 23 and polynomialmissing ICA refers to the Chan et al (2002) method with minor modication

Figure 6 plots the lower bound of log marginal likelihood (see equa-tion 312) for models assuming different numbers of intrinsic dimensionsAs expected the Bayesian treatment allows us to the infer the intrinsic di-mension of the data cloud In the gure we also plot the EX Qmicro fromthe polynomial missing ICA Since a less negative lower bound representsa smaller Kullback-Leibler divergence between Qmicro and PX j micro it isclear from the gure that the full missing ICA gave a better t to the datadensity

72 Mixing Images This experiment demonstrates the ability of the pro-posed method to ll in missing values while performing demixing This ismade possible if we have more mixtures than hidden sources or N gt L Thetop row in Figure 7 shows the two original 380 pound 380 pixel images Theywere linearly mixed into three images and iexcl20 dB noise was added Miss-ing entries were introduced randomly with probability 02 The denoisedmixtures are shown in the third row of Figure 7 and the recovered sourcesare in the bottom row Only 08 of the pixels were missing from all threemixed images and could not be recovered 384 of the pixels were missingfrom only one mixed image and their values could be lled in with low

Variational Bayesian Learning of ICA with Missing Data 2009

Figure 7 A demonstration of recovering missing values when N gt L Theoriginal images are in the top row Twenty percent of the pixels in the mixedimages (second row) are missing at random Only 08 are missing from thedenoised mixed images (third row) and separated images (bottom)

2010 K Chan T Lee and T Sejnowski

uncertainty and 96 of the pixels were missing from any two of the mixedimages Estimation of their values is possible but would have high uncer-tainty From Figure 7 we can see that the source images were well separatedand the mixed images were nicely denoised The signal-to-noise ratio (SNR)in the separated images was 14 dB We have also tried lling in the missingpixels by EM with a gaussian model Variational Bayesian ICA was then ap-plied on the ldquocompletedrdquo data The SNR achieved in the unmixed imageswas 5 dB This supports that it is crucial to have the correct density modelwhen lling in missing values and important to learn the density model andmissing values concurrently The denoised mixed images in this examplewere meant only to illustrate the method visually However if x1 x2 x3

represent cholesterol blood sugar and uric acid level for example it wouldbe possible to ll in the third when only two are available

73 Survival Prediction We demonstrate the supervised classicationdiscussed in section 6 with an echocardiogram data set downloaded fromthe UCI Machine Learning Repository (Blake amp Merz 1998) Input variablesare age-at-heart-attack fractional-shortening epss lvdd and wall-motion-indexThe goal is to predict survival of the patient one year after heart attack Thereare 24 positive and 50 negative examples The data matrix has a missingrate of 54 We performed leave-one-out cross-validation to evaluate ourclassier Thresholding the output PyTC1 j XT YT M computed usingequation 610 at 05 we got a true positive rate of 1624 and a true negativerate of 4250

8 Conclusion

In this article we derived the learning rules for variational Bayesian ICAwith missing data The complexity of the method is proportional to T pound KLwhere T is the number of data points L is the number of hidden sourcesassumed and K is the number of 1D gaussians used to model the densityof each source However this exponential growth in complexity is man-ageable and worthwhile for small data sets containing missing entries in ahigh-dimensional space The proposed method shows promise in analyzingand identifying projections of data sets that have a very limited number ofexpensive data points yet contain missing entries due to data scarcity Theextension to model data density with clusters of ICA was discussed Theapplication of the technique in a supervised classication setting was alsocovered We have applied the variational Bayesian missing ICA to a pri-matesrsquo brain volumetric data set containing 44 examples in 57 dimensionsVery encouraging results were obtained and will be reported in anotherarticle


References

Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.

Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Irvine, CA: University of California.

Chan, K., Lee, T.-W., & Sejnowski, T. J. (2002). Variational learning of clusters of undercomplete nonsymmetric independent components. Journal of Machine Learning Research, 3, 99–114.

Choudrey, R. A., & Roberts, S. J. (2001). Flexible Bayesian independent component analysis for blind source separation. In 3rd International Conference on Independent Component Analysis and Blind Signal Separation (pp. 90–95). San Diego, CA: Institute for Neural Computation.

Ghahramani, Z., & Jordan, M. (1994). Learning from incomplete data (Tech. Rep. CBCL Paper No. 108). Cambridge, MA: Center for Biological and Computational Learning, MIT.

Hyvarinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Jordan, M. I., Ghahramani, Z., Jaakkola, T., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.

Jung, T.-P., Makeig, S., McKeown, M. J., Bell, A., Lee, T.-W., & Sejnowski, T. J. (2001). Imaging brain dynamics using independent component analysis. Proceedings of the IEEE, 89(7), 1107–1122.

Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.

Mackay, D. J. (1995). Ensemble learning and evidence maximization (Tech. Rep.). Cambridge: Cavendish Laboratory, University of Cambridge.

Miskin, J. (2000). Ensemble learning for independent component analysis. Unpublished doctoral dissertation, University of Cambridge.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Welling, M., & Weber, M. (1999). Independent component analysis of incomplete data. In 1999 6th Joint Symposium on Neural Computation Proceedings (Vol. 9, pp. 162–168). San Diego, CA: Institute for Neural Computation.

Received July 18, 2002; accepted January 30, 2003.
