
Multimodal Weighted Dictionary Learning

Ali Taalimi, Hesam Shams, Alireza Rahimpour, Rahman Khorsandi, Wei Wang, Rui Guo, Hairong Qi
Electrical Engineering and Computer Science
The University of Tennessee, Knoxville
{ataalimi,hshams,arahimpo,wwang34,rguo1,hqi}@utk.edu

    Abstract

Classical dictionary learning algorithms that rely on a single source of information have been used successfully for discriminative tasks. However, exploiting multiple sources has demonstrated its effectiveness in solving challenging real-world situations. We propose a new framework for feature fusion that achieves better classification performance than using individual sources alone. In the context of multimodal data analysis, the modality configuration induces a strong group/coupling structure. The proposed method models the coupling between different modalities in the space of sparse codes while, within each modality, a discriminative dictionary is learned in an all-vs-all scheme whose class-specific sub-parts are non-correlated. The proposed dictionary learning scheme is referred to as multimodal weighted dictionary learning (MWDL). We demonstrate that MWDL outperforms state-of-the-art dictionary learning approaches in various experiments.

    1. Introduction

Dictionary learning (DL) approaches can be categorized into two groups: unsupervised and supervised. Supervised DL has been demonstrated to yield better results for reconstructive and discriminative tasks [12]. The objective function in unsupervised DL [16] is based on minimizing the reconstruction error, which generates a dictionary adequate for reconstructing the data from noise, but possibly sub-optimal for a discriminative task. Learning the dictionary in a supervised fashion can exploit class discrimination information and lead to a discriminative and compact dictionary that adapts to a specific task and dataset [9, 17].

The discriminative power of supervised DL originates from the decomposition coefficients, the dictionary, or both. In [4], discriminative coefficients are obtained by applying a joint sparsity regularization that makes the coefficients within the same class similar. The discriminative power of the dictionary in supervised methods depends on the relation between the labels of the atoms and the class labels in the data. In [26], discriminative class-specific dictionaries are built in a one-vs-all scheme by concatenating all samples of each class. In practice, one-vs-all DL methods lead to large dictionaries. In the all-vs-all setting, the dictionary is shared between classes. This results in a dictionary with fewer atoms, but its discriminative power suffers from the fact that each atom may represent multiple classes [2, 9, 12, 23].

The majority of existing sparsity-inducing and dictionary learning methods [1, 3, 15, 18] can handle only a single source of data. Fusing information from various sensors increases robustness to the possible failure of a single sensor. For example, in [21] the classification results of face, fingerprint, and hand-signature modalities are fused by majority vote to achieve better identity-verification performance. Fusion can be conducted at either the feature level or the classifier level [20]. In feature fusion, different types of features are combined into a new feature set, while classifier fusion aggregates decisions from several classifiers trained individually on the various features. Although feature-level fusion has been shown to be more efficient, the algorithm design is challenging, especially when the features have different dimensions; consequently, it is a relatively less-studied problem [20]. The simplest form of feature fusion is to concatenate the features into one high-dimensional vector. Besides its higher dimension, the concatenated feature vector also discards the valuable correlation information between feature types. In this paper, we use the terms modality, source, and feature interchangeably.

Our contribution to solving the above limitations is twofold. First, a multimodal supervised dictionary learning method is proposed to obtain a reconstructive and discriminative dictionary with a small number of atoms for each feature in the all-vs-all scheme. A set of multimodal weight matrices enforces each atom to represent a certain class in all modalities, which makes the multimodal dictionaries more discriminative. Second, the multimodal sparse representations of each class are forced to share the same sparsity pattern at the column level of the corresponding dictionary, which is imposed by joint sparsity regularization. If the code for one modality includes certain indices of dictionary atoms, we encourage the codes for the other modalities to also include the atoms with the same indices from their respective dictionaries. As a result, the multimodal sparse codes are optimized for classification tasks. The optimization problem over multimodal dictionaries, multimodal weights, and sparse representations is solved simultaneously. Figure 1 presents an overview of the proposed framework.

Figure 1. Illustration of the proposed simultaneous multimodal dictionary learning and coupling in the space of sparse codes, using the IXMAS dataset [25]. Each action is viewed by $M = 5$ cameras (5 modalities), shown as yellow, magenta, green, blue, and red; a color-coded dictionary corresponds to each modality. A set of dictionaries $\{D^m\}_{m=1}^{M}$ is learned to reconstruct the multimodal action $\{x^m_i\}_{m=1}^{5}$, while coupling across the different features of a signal is enforced in the space of the sparse codes. The entries of the sparse codes have different colors representing different learned values; white entries indicate zero rows and columns. The camera/modality-specific sparse codes (e.g., the red codes) are used to update the corresponding camera/modality-specific dictionary (e.g., the red dictionary).

This paper is organized as follows. In Sec. 2, after a brief introduction, we provide a mathematical formulation of the proposed optimization problem over the multimodal dictionaries, multimodal weights, and multimodal sparse codes. We then present the solution methodology in Sec. 3. In Sec. 4, we show experimental results that demonstrate the performance of the proposed method. Conclusions are discussed in Sec. 5.

    2. Method

Latent dictionary learning (LDL) [29] is a state-of-the-art supervised DL method designed for a single modality. The proposed MWDL generalizes LDL to fuse information from various sources at the feature level in order to produce more discriminative sparse codes suitable for the classification task.

Notation. Consider the classification of $N$ multimodal training samples that belong to $C$ classes, $\{X_{i,c}\}_{i=1}^{N}$ with $c \in \{1, \ldots, C\}$. A training sample from the c-th class is observed through $M$ features: $X_{i,c} = \{x^m_{i,c}\}_{m=1}^{M}$, where $x^m_{i,c}$ is a vector in $\mathbb{R}^{n_m}$ with $n_m$ the dimension of the m-th modality. We denote the training data of the c-th class in the m-th modality as $\{x^m_{i,c}\}$. For a matrix $A \in \mathbb{R}^{q \times p}$, we denote the vector of the i-th row in $\mathbb{R}^p$ by $A^i$, the vector of the j-th column in $\mathbb{R}^q$ by $A_j$, and the element in the i-th row and j-th column by $A_{ij}$. We abbreviate the set of all samples that belong to class $c$ as $\{x_{i,c}\}$.

    2.1. Latent Dictionary Learning

In this section we briefly introduce LDL, which coincides with MWDL in the single-modality case ($M = 1$). LDL models the relation between each atom of the dictionary and the class labels using a weight matrix $W = [w_1, w_2, \ldots, w_C] \in \mathbb{R}^{p \times C}$. The c-th vector $w_c = [w_{1c}, \ldots, w_{pc}]^\top \in \mathbb{R}^p$ indicates the relationship of all atoms to the c-th class. All elements of $W$ are constrained to be non-negative, $\{w_{kc}\}_{k=1}^{p} \ge 0$. When the k-th atom has no contribution to reconstructing the c-th class, then $w_{kc} = 0$. Also, the sum of all weight elements for the c-th class is $\sum_k w_{kc} = \eta$. This ensures that the dictionary has enough representation power for each class. The goal is to learn $D$ and $W$ so that the data can be reconstructed in a sparse coding scheme as $x_{i,c} \approx D\,\mathrm{diag}(w_c)\,\alpha_{i,c}$:

$$\operatorname*{argmin}_{W,\,D,\,\alpha}\; \sum_{i=1}^{N} \Big( \|x_{i,c} - D\,\mathrm{diag}(w_c)\,\alpha_{i,c}\|_2^2 + \lambda_1 \|\alpha_{i,c}\|_1 + \lambda_2 \|\alpha_{i,c} - E_i(\{\alpha_{i,c}\})\|_2^2 + \sum_{c=1}^{C} \sum_{l \neq c} \sum_{k=1}^{p} \sum_{j \neq k} w_{jc}\,(d_j^\top d_k)^2\, w_{kl} \Big)$$
$$\text{s.t. } w_{kc} \ge 0 \;\text{ and }\; \sum_{k=1}^{p} w_{kc} = \eta, \quad \forall c \in \{1, \ldots, C\} \tag{1}$$

    2.2. Multimodal Sparse Representation

Our intention is to generalize LDL with an efficient and effective feature-fusion algorithm so that it achieves better classification performance in the presence of multimodal input data. Our approach is based on a proper formulation of the high-order prior knowledge of the group structures induced by the modality configuration of the multimodal data. Here, sample $i$ from the c-th class is multimodal and observed through $M$ features: $X_{i,c} = \{x^m_{i,c}\}_{m=1}^{M}$. The goal is to learn multimodal dictionaries that can reconstruct $X_{i,c}$ with decomposition coefficients $\{\alpha^m_{i,c}\}_{m=1}^{M}$ that are suitable for the classification task. However, in LDL, sparse coding is implemented (Eq. (1)) using the standard $\ell_1$-norm, which penalizes the cardinality of the coefficients $\{\alpha_{i,c}\}$. This regularization treats each variable individually and is blind to potential group structure between the different features of a sample. Joint sparsity priors can fuse multiple features, which makes them suitable for reconstructing samples originating from different sources [4].

For each modality/feature there is a weight matrix $W^m = [w^m_1, \ldots, w^m_C] \in \mathbb{R}^{p \times C}$ and a dictionary $D^m \in \mathbb{R}^{n_m \times p}$. The vector $w^m_c$ describes how much each of the $p$ atoms in $D^m$ is used to reconstruct the c-th class. Let us denote the weight vectors of the c-th class from all modalities by $W_c = [w^1_c, \ldots, w^M_c] \in \mathbb{R}^{p \times M}$ and the multimodal sparse representation of the data $X_{i,c}$ as $A_{i,c} = [\alpha^1_{i,c}, \ldots, \alpha^M_{i,c}]$. The proposed method is a bilevel optimization. The outer-level objective enforces similarity across the columns of two matrices within each class: $A_{i,c}$ and $W_c$. The outer level is subject to inner-level constraints such that the class-specific dictionary in each modality, $D^m_c \triangleq D^m\,\mathrm{diag}(w^m_c)$ for all $m \in \{1, \ldots, M\}$, is reconstructive while at the same time being incoherent with the dictionaries of the other classes.

We propose to obtain the multimodal sparse representation $A_{i,c} = [\alpha^1_{i,c}, \ldots, \alpha^M_{i,c}]$ and the set of dictionaries and weights $\{D^m, W^m\}_{m=1}^{M}$ simultaneously:

$$\operatorname*{argmin}_{A_{i,c},\,\{D^m, W^m\}}\; \sum_{m=1}^{M} \Big( \frac{1}{2}\|x^m_{i,c} - D^m\,\mathrm{diag}(w^m_c)\,\alpha^m_{i,c}\|_2^2 + \sum_{l \neq c} \sum_{k=1}^{p} \sum_{j \neq k} w^m_{jc}\,(d^{m\top}_j d^m_k)^2\, w^m_{kl} \Big) + \lambda_1\,\Omega(A_{i,c}) + \frac{\lambda_2}{2}\|A_{i,c}\|_F^2 + \lambda_3\,\Omega(W_c)$$
$$\text{s.t. } w^m_{jc} \ge 0 \;\text{ and }\; \sum_{j=1}^{p} w^m_{jc} = \eta, \quad \forall m,\; \forall c \in \{1, \ldots, C\} \tag{2}$$

where the fusion between the $M$ different features of the sample $\{x^m_{i,c}\}_{m=1}^{M}$ is enforced in the space of sparse codes using the $\ell_1/\ell_2$ regularization $\Omega(A) = \sum_{r=1}^{p} \|A^r\|_2$, where $A^r$ is the r-th row of $A$; this promotes a solution with few non-zero rows in $A$. Applied to the multimodal sparse codes, $\Omega(A)$ promotes all modalities to share the same sparsity pattern: if the k-th atom $d^m_k$ is selected to reconstruct the input $x^m_{i,c}$, then all modalities of the k-th atom, $\{d^1_k, \ldots, d^M_k\}$, should contribute to reconstructing $\{x^m_{i,c}\}_{m=1}^{M}$. In the same way, if $w^m_c$, the m-th column of $W_c$, determines a certain subset of atoms in $D^m$ to represent the c-th class, the other columns of $W_c$ should agree. Intuitively, the $\ell_1/\ell_2$ norm promotes a statistical co-occurrence structure: in order to assign a sample to the c-th class, most of its features should vote for the c-th class.

    3. Optimization

The optimization problem (2) contains the product of three optimization variables, $D\,\mathrm{diag}(w_c)\,\alpha_{i,c}$, which implies that the problem is not jointly convex in the space of variables. However, when two of the three optimization variables are fixed, problem (2) is convex with respect to the third [12]. Hence, problem (2) is solved by splitting it into three sub-problems: 1. given the weights $\{W^m\}_{m=1}^{M}$ and dictionaries $\{D^m\}_{m=1}^{M}$, estimate the multimodal sparse codes $\{\alpha^m_i\}_{m=1}^{M}$ for all $i \in \{1, \ldots, N\}$; 2. given $W^m$ and the sparse codes $\{\alpha^m_i\}_{i=1}^{N}$, update the corresponding dictionary $D^m$ of the m-th modality; 3. given $\{\alpha^m_i\}$ and $D^m$, update $W^m$. Step 2 is done for each $m \in \{1, \ldots, M\}$ and $c \in \{1, \ldots, C\}$ separately.

    3.1. Step 1: Find Multimodal Sparse Codes

In this section, we fix $\{W^m\}_{m=1}^{M}$ and $\{D^m\}_{m=1}^{M}$ and treat them as data for problem (2). We initialize the multimodal dictionaries $\{D^m\}_{m=1}^{M}$ with training samples of all classes, as in [9, 12]. Problem (2) is converted to (3) to find an optimal $A^\star_{i,c} = [\alpha^{1\star}_{i,c}, \ldots, \alpha^{M\star}_{i,c}] \in \mathbb{R}^{p \times M}$ for all $i \in \{1, \ldots, N\}$:

$$\min_{A_{i,c}}\; \sum_{m=1}^{M} \frac{1}{2}\|x^m_{i,c} - \tilde{D}^m \alpha^m_{i,c}\|_2^2 + \lambda_1\,\Omega(A_{i,c}) + \frac{\lambda_2}{2}\|A_{i,c}\|_F^2 \tag{3}$$

where $\tilde{D}^m = D^m\,\mathrm{diag}(w^m_c)$ and $\|\cdot\|_F$ is the Frobenius norm. To obtain the optimal multimodal sparse codes $A^\star_{i,c}$ of the i-th sample $\{x^m_i\}_{m=1}^{M}$, we solve problem (3) for a limited number of iterations using the alternating direction method of multipliers (ADMM) [14]. Assume $Z = [z_1, \ldots, z_M] \in \mathbb{R}^{p \times M}$ and $U = [u_1, \ldots, u_M] \in \mathbb{R}^{p \times M}$, both initialized to zero. We denote the proximal operator associated with a function $\varphi$, mapping a vector $p$ in its domain to a vector $q$ (both in $\mathbb{R}^M$), as $\mathrm{prox}_{\varphi}(p) \triangleq \operatorname*{argmin}_q \frac{1}{2}\|p - q\|_2^2 + \varphi(q)$. For simplicity, we drop the subscripts $i,c$ from $A_{i,c}$. At iteration $\tau$ we have:

$$A^{(\tau+1)} = \mathrm{prox}_{\gamma f}\big(Z^{(\tau)} - U^{(\tau)}\big) \tag{4a}$$
$$Z^{(\tau+1)} = \mathrm{prox}_{\gamma \Omega}\big(A^{(\tau+1)} + U^{(\tau)}\big) \tag{4b}$$
$$U^{(\tau+1)} = U^{(\tau)} + A^{(\tau+1)} - Z^{(\tau+1)} \tag{4c}$$

where the function $f \triangleq \sum_{m=1}^{M} \frac{1}{2}\|x^m_{i,c} - \tilde{D}^m \alpha^m_{i,c}\|_2^2 + \frac{\lambda_2}{2}\|\alpha^m_{i,c}\|_2^2$ is the smooth, differentiable part of problem (3). The optimization variables $A^{(\tau)}$ and $Z^{(\tau)}$ are the minimizers of the smooth and non-smooth parts of (3) at iteration $\tau$, respectively, and they eventually converge to each other ($U^{(\tau+1)} - U^{(\tau)} \to 0$). Since $f$ is smooth with gradient $\nabla_{\alpha^m_{i,c}} f = -\tilde{D}^{m\top}(x^m_{i,c} - \tilde{D}^m \alpha^m_{i,c}) + \lambda_2\,\alpha^m_{i,c}$, we compute $\alpha^m_{i,c}$ as the closed-form solution of problem (4a) at iteration $\tau + 1$:

$$\alpha^m_{i,c} = \Theta^{-1}\big(\tilde{D}^{m\top} x^m_{i,c} + \rho\,(z^{(\tau)}_m - u^{(\tau)}_m)\big) \tag{5}$$

where $\gamma = 1/\rho$ and $\Theta = \tilde{D}^{m\top}\tilde{D}^m + \lambda_2 I + \rho I$. We solve (5) for each feature $m \in \{1, \ldots, M\}$ separately and concatenate the results into $A^\star_{i,c} = [\alpha^{1\star}_{i,c}, \ldots, \alpha^{M\star}_{i,c}]$ as the solution of (4a). Note that the method is designed to achieve high classification accuracy while the dictionaries have small numbers of atoms; this may increase the chance of singularity in (5). However, $\lambda_2 > 0$ and $\rho > 0$ make $\Theta$ positive definite.
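As a concrete illustration of Eq. (5), the sketch below (naming is ours: X is the list of feature vectors $x^m_{i,c}$ and D_tilde the list of weighted dictionaries $\tilde{D}^m$) solves the $A$-update one modality at a time; since $\Theta$ is positive definite, a plain linear solve suffices:

```python
import numpy as np

def update_codes(X, D_tilde, Z, U, lam2, rho):
    """Closed-form A-update of Eq. (5), one modality at a time.
    X: list of M vectors x^m; D_tilde: list of M matrices D^m diag(w_c^m)
    of shape (n_m, p); Z, U: (p, M) ADMM variables. Returns A of shape (p, M)."""
    p, M = Z.shape
    A = np.zeros((p, M))
    for m in range(M):
        Dt = D_tilde[m]
        Theta = Dt.T @ Dt + (lam2 + rho) * np.eye(p)   # positive definite
        rhs = Dt.T @ X[m] + rho * (Z[:, m] - U[:, m])
        A[:, m] = np.linalg.solve(Theta, rhs)
    return A
```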

Next, we solve the proximal step (4b) over each row vector $Z^r \in \mathbb{R}^M$, for every row $r \in \{1, \ldots, p\}$ of $A_{i,c}$:

$$\mathrm{prox}_{\gamma \Omega}\big(A^{(\tau+1),r} + U^{(\tau),r}\big) = \operatorname*{argmin}_{Z^r}\; \gamma\,\Omega(Z^r) + \frac{1}{2}\big\|Z^r - (A^{(\tau+1),r} + U^{(\tau),r})\big\|_2^2 \tag{6}$$

The optimization problem (6) decomposes into $p$ independent optimizations, corresponding to the $p$ rows of $A$, each over an $M$-dimensional vector $Z^r$. We solve this proximal step of the $\ell_1/\ell_2$ regularization in Eq. (6) using [8]. After $Z$ is obtained, the iteration is finished by updating $U$ according to (4c).
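Putting steps (4a)-(4c) together: the prox of the row-wise $\ell_1/\ell_2$ norm has the well-known block soft-thresholding form, so the inner solver can be sketched as below (our naming, reusing update_codes from above; the $\lambda_1/\rho$ prox scaling follows standard scaled-form ADMM, whereas the paper folds the constants into $\gamma = 1/\rho$):

```python
import numpy as np

def prox_row_l2(V, t):
    """Block soft-thresholding: prox of t * sum_r ||V^r||_2, applied row-wise."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12)) * V

def admm_sparse_codes(X, D_tilde, lam1, lam2, rho, n_iter=50):
    """ADMM iterations (4a)-(4c) for problem (3); returns the codes (p x M)."""
    p, M = D_tilde[0].shape[1], len(X)
    Z = np.zeros((p, M))
    U = np.zeros((p, M))
    for _ in range(n_iter):
        A = update_codes(X, D_tilde, Z, U, lam2, rho)  # (4a) via Eq. (5)
        Z = prox_row_l2(A + U, lam1 / rho)             # (4b), Eq. (6) row-wise
        U = U + A - Z                                  # (4c), dual update
    return Z
```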

    3.2. Step 2: Multimodal Dictionary Learning

In Sec. 3.1 we obtained the multimodal sparse coefficients of the i-th sample, $A^\star_{i,c} = [\alpha^{1\star}_{i,c}, \ldots, \alpha^{M\star}_{i,c}]$, by solving problem (2) for a given set of dictionaries $\{D^m\}_{m=1}^{M}$. In this section, the obtained coefficients $\{A^\star_i\}_{i=1}^{N}$ are used to update the dictionaries. The dictionary $D^m = [d^m_1, \ldots, d^m_p]$ is updated using the sparse representations of all samples from the m-th modality, $[\alpha^m_1, \ldots, \alpha^m_N]$. Since in this step the dictionary of each modality is obtained independently of the other modalities, we drop the superscript $m$ and solve the following optimization problem over the dictionary using the iterative projection method [19]:

$$\min_{D}\; \sum_{i} \frac{1}{2}\|x_{i,c} - D\beta_i\|_2^2 + \sum_{l \neq c} \sum_{k=1}^{p} \sum_{j \neq k} w_{jc}\,(d_j^\top d_k)^2\, w_{kl} \tag{7}$$

where $\beta_i = \mathrm{diag}(w_c)\,\alpha_{i,c} \in \mathbb{R}^p$. Let us define $B = \sum_i \beta_i \beta_i^\top$ and $G = \sum_i x_i \beta_i^\top$. Also note that, for the k-th atom, the second term of Eq. (7) simplifies to $S \triangleq \sum_{j \neq k} (d_j d_j^\top) \sum_{l \neq c} w_{jc}\, w_{kl}$. We solve problem (7) to update the k-th atom $d_k$ following [12]:

$$d_k \leftarrow d_k + \big(B_{kk} I + \mathrm{diag}(S)\big)^{-1}\big(G_k - D B_k - S d_k\big) \tag{8}$$

where $B_{kk}$ is the k-th diagonal element of $B$, and $G_k$ and $B_k$ are the k-th columns of $G$ and $B$. Finally, the updated atom $d_k$ is projected onto the unit-norm ball. We apply the same update to every atom of every feature, $\{d^m_k\}_{k=1}^{p}$ for $m \in \{1, \ldots, M\}$.
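For illustration, one atom update of Eq. (8) can be sketched as follows (our naming; D, B, G, and S are as defined above for the current class, and since $B_{kk}I + \mathrm{diag}(S)$ is diagonal the solve reduces to element-wise division):

```python
import numpy as np

def update_atom(D, B, G, S, k):
    """One update of atom d_k via Eq. (8), then projection to the unit ball.
    D: (n, p) dictionary; B = sum_i beta_i beta_i^T, shape (p, p);
    G = sum_i x_i beta_i^T, shape (n, p); S: (n, n) incoherence term for atom k."""
    denom = B[k, k] + np.diag(S)                  # diagonal of B_kk*I + diag(S)
    rhs = G[:, k] - D @ B[:, k] - S @ D[:, k]     # G_k - D B_k - S d_k
    d_k = D[:, k] + rhs / denom
    D[:, k] = d_k / max(1.0, np.linalg.norm(d_k))  # project to unit-norm ball
    return D
```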

    3.3. Step 3: Weight Estimation

Given $\{D^m\}_{m=1}^{M}$ and $\{\alpha^m_{i,c}\}$, Eq. (2) becomes a constrained quadratic program that is solved for each class-specific weight matrix $W_c = [w^1_c, \ldots, w^M_c] \in \mathbb{R}^{p \times M}$ separately:

$$\operatorname*{argmin}_{W_c}\; \sum_{m=1}^{M} \Big( \sum_{i} \frac{1}{2}\|x^m_{i,c} - D^m\,\mathrm{diag}(w^m_c)\,\alpha^m_{i,c}\|_2^2 + \sum_{k=1}^{p} w^m_{kc} \sum_{j \neq k} (d^{m\top}_j d^m_k)^2 \sum_{l \neq c} w^m_{jl} \Big) + \lambda_3\,\Omega(W_c) \tag{9}$$

with $w_{kc} \ge 0$. Similar to (3), the optimization problem (9) consists of a smooth part and a non-smooth part ($\Omega(W_c)$); hence the solution methodology is similar to that of Sec. 3.1: the proximal step (4a) over the smooth part of (9) is solved as in [29], and the proximal step (4b) over the non-smooth part is solved as in Eq. (6), which enforces row-sparsity on $W_c$.
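Problem (2) additionally constrains each weight column to be non-negative and to sum to $\eta$. The paper does not spell out how these constraints are restored after a weight update; one standard choice, sketched below purely as an assumption on our part, is the sort-based Euclidean projection onto the scaled simplex (Duchi et al.):

```python
import numpy as np

def project_scaled_simplex(v, eta):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = eta}."""
    u = np.sort(v)[::-1]                         # sort descending
    cssv = np.cumsum(u) - eta
    ind = np.arange(1, v.size + 1)
    rho = np.nonzero(u - cssv / ind > 0)[0][-1]  # last index kept positive
    theta = cssv[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)
```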

    3.4. Classification Approach

Each test sample $X_t$ is observed through the same set of $M$ features, $X_t = \{x^m_t\}_{m=1}^{M}$. Given the learned dictionaries and weight matrices $\{D^m, W^m\}_{m=1}^{M}$ from the training phase, we solve Eq. (3) to obtain the multimodal sparse codes of the test sample, $\{\alpha^m_t\}_{m=1}^{M}$. The query is assigned to the class with the minimum total reconstruction error over all features, $E_t(c) = \sum_{m=1}^{M} \|x^m_t - D^m\,\mathrm{diag}(w^m_c)\,\alpha^m_t\|_2^2$.
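A minimal sketch of this decision rule (our naming: D and W are the lists of learned per-modality dictionaries and weight matrices, codes the per-modality test codes from Eq. (3)):

```python
import numpy as np

def classify(X_t, D, W, codes):
    """Assign the query to the class c minimizing the fused residual E_t(c)."""
    M, C = len(X_t), W[0].shape[1]
    errors = np.zeros(C)
    for c in range(C):
        for m in range(M):
            recon = D[m] @ (W[m][:, c] * codes[m])  # D^m diag(w_c^m) alpha_t^m
            errors[c] += np.linalg.norm(X_t[m] - recon) ** 2
    return int(np.argmin(errors))
```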

    4. Experiments

In this section we evaluate the performance of MWDL in three different applications: multiview gender classification, multimodal face recognition, and multiview action recognition. Samples are normalized to have zero mean and unit $\ell_2$ norm. To compare against unimodal dictionary learning algorithms, we learn independent dictionaries and classifiers for each modality and then combine the individual scores into a fused decision. This is equivalent to replacing the $\ell_1/\ell_2$-norm on $A$ in problem (2) with the $\ell_1/\ell_1$-norm, $\Omega(A) = \sum_{i,j} |A_{ij}|$ [21, 4]. The $\ell_1/\ell_1$ norm does not enforce correlation between the features.

The gender classification and face recognition experiments are done using the AR database (Fig. 2), which consists of faces under different poses, illuminations, and expressions, captured in two sessions [13]. A set of 100 subjects (50 males and 50 females) is used, each with seven images from the first session as training samples and seven images from the second session as test samples.

Figure 2. Samples of male and female subjects with the extracted modalities in the AR dataset. We employ rectangular masks and crop out the corresponding areas.

Table 1. Gender classification accuracy (%) with p = 250.
Method            Accuracy    Method       Accuracy
SRC [26]          93.0        JDL [33]     90.8
Yang et al. [28]  94.5        DLSI [18]    93.2
DKSVD [32]        85.6        FDDL [30]    94.1
LCKSVD [9]        89.5        LDL [29]     94.8
COPAR [11]        93.4        MWDL         97.9

Table 2. Gender classification rates (%) with p = 25 atoms.
DLSI   JDL    FDDL   LDL    COPAR   MWDL
93.7   91.0   92.1   92.4   93.0    97.1

We initialize $\{w^m_c\}_{m=1}^{M}$ for all $C$ classes to have only one non-zero element. $D^m$ is obtained by putting together the class-wise sub-dictionaries, $D^m = [D^m_1, \ldots, D^m_C]$, where each $D^m_c$ is trained on the data of the c-th class. The weight of the k-th atom in modality $m$ is the k-th row of $W^m$. For the k-th atom, we compute the average $\ell_2$-norm of its weights across modalities, $\nu_k = \sum_m \|W^{m,k}\|_2 / M$. If $\nu_k$ is small, $\nu_k < \epsilon$, we remove the k-th atom from all dictionaries $\{D^m\}_{m=1}^{M}$; hence the final number of atoms is less than or equal to the initially chosen number. We choose $\lambda_1 = 0.1$, $\lambda_2 = 0.001$ and $\eta = 0.1$ in Eq. (2) for all experiments. As in LDL [29], we choose $\lambda_3 = 0.1p/(N(N-1))$.
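The pruning rule above amounts to the following sketch (our naming; D_list and W_list hold the per-modality dictionaries and weight matrices):

```python
import numpy as np

def prune_atoms(D_list, W_list, eps):
    """Remove atom k from every modality when nu_k, the average l2 norm of its
    weight rows (1/M) * sum_m ||W^m[k, :]||_2, falls below the threshold eps."""
    M = len(W_list)
    nu = sum(np.linalg.norm(W, axis=1) for W in W_list) / M   # shape (p,)
    keep = nu >= eps
    return [D[:, keep] for D in D_list], [W[keep, :] for W in W_list]
```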

    4.1. Gender Classification

Following [29, 30], we consider the first 25 males and 25 females, 14 images per subject, for training; testing is done on the rest. We extract three features from each sample and treat them as modalities: raw pixels, quantized gradients [5], and fhog with 9 orientations and 8 bins [6]. We compare MWDL with recent dictionary learning methods: SRC [26], DLSI [18], DKSVD [32], LCKSVD [9], FDDL [30], COPAR [11], and JDL [33].

This experiment is a two-class classification problem with large intra-class variation and a large number of training samples. We report performance for a dictionary size of p = 250 in Table 1 and of p = 25 in Table 2. When the number of atoms is large (p = 250), DL methods based on the all-vs-all scheme, such as DKSVD and LCKSVD, achieve lower classification accuracy than class-specific (one-vs-all) DL methods such as LDL, FDDL and DLSI. MWDL outperforms the others by more than 3% in Table 1. To show that class-specific DL methods need a large number of atoms, we reduce the number of atoms from p = 250 to p = 25 and report the one-vs-all DL performances in Table 2. As expected, the one-vs-all methods perform poorly with a small number of atoms. Although the accuracy of all methods drops with small p, MWDL remains more discriminative and outperforms the other methods, including LDL, by more than 4.0%.

Figure 3. Different poses of a subject from the UMIST database. Each row is a view-range, or modality, for the subject.

    4.2. Face Recognition

AR dataset. Our goal is to show that, in the presence of multimodal data, fusing information increases performance; more specifically, we highlight the importance of feature-level fusion over decision-level fusion. The dictionary size is set to p = 400, which corresponds to 4 atoms for each of the 100 subjects in the AR dataset. Similar to [4, 21], we extract four weak modalities from each subject's face in the AR database: the left and right periocular regions, the nose, and the mouth (see Fig. 2). We study the effect of fusing the face, as the strong modality, with the four weak modalities. Intensity values are used for each modality, and we crop out and resize the respective rectangular masks of the weak modalities. To show the importance of data fusion, we report the face recognition performance of sparse representation classification (SRC), unsupervised (UTDL) and supervised (STDL) task-driven dictionary learning [12], LDL, and FDDL using only the whole face in Table 3. We do not report the proposed method in Table 3 because MWDL needs multiple sources; in the presence of only one source (the face alone) it reduces to LDL. For the multimodal case, we report the classification accuracy of the unsupervised and supervised dictionary learning algorithms of [12] and of LDL with $\ell_1/\ell_1$ decision fusion, denoted UTDL$_{\ell_1\ell_1}$, STDL$_{\ell_1\ell_1}$ and LDL$_{\ell_1\ell_1}$, in Tables 4 and 5.

Comparing Tables 3 and 4 shows how effective decision fusion is for the LDL, UTDL and STDL methods. Decision-level fusion in UTDL$_{\ell_1\ell_1}$ and STDL$_{\ell_1\ell_1}$ improves the accuracy of UTDL and STDL by more than 5%, respectively. However, MWDL, which is a feature-fusion method, outperforms the competing decision-fusion methods with the $\ell_1/\ell_1$-norm. The margin is largest for the fusion of the left and right periocular regions (around 3%) in Table 4. The reason is that these modalities are highly correlated, and MWDL learns the multimodal dictionaries jointly, which results in high recognition accuracy.

Table 3. Face recognition accuracy (%) with the whole-face modality.
SRC     LDL     UTDL    STDL    FDDL
88.86   84.56   89.58   90.57   88.90

Table 4. Face recognition accuracy (%). Modalities: 1. left periocular, 2. right periocular, 3. nose, 4. mouth, 5. face.
Modalities      {1,2}   {1,2,3}   {1,2,3,4}   {1,2,3,4,5}
UTDL (l1/l1)    81.90   87.57     90.14       95.57
STDL (l1/l1)    83.86   87.86     92.42       95.86
LDL (l1/l1)     82.52   87.46     91.16       94.29
MWDL            86.36   87.24     93.16       97.63

Table 5. Multimodal face recognition results (%) for the AR dataset.
JSRC    JDSRC   MTSRC   MWDL
95.57   93.14   97.14   97.63

The proposed MWDL is compared with three other feature-fusion algorithms, the joint sparse representation classifier (JSRC) [21], the joint dynamic sparse representation classifier (JDSRC) [31], and multimodal tree-structured sparse representation classification (MTSRC) [4], in Table 5. The dictionaries for JSRC, JDSRC and MTSRC are fixed without training. JSRC [21] applies the $\ell_1/\ell_2$ norm to enforce a similar sparsity pattern among all the different modalities in the space of sparse codes. JDSRC relaxes this requirement, allowing each multimodal input to be reconstructed using different training samples, and applies joint sparsity to the data of each class separately. MTSRC enforces a more general joint sparsity through a hierarchical structured regularization on each multimodal input. The proposed MWDL with 400 atoms achieves better performance than JSRC and JDSRC with 700 atoms. This superior result demonstrates that the dictionary learning in MWDL produces discriminative and reconstructive dictionaries that generate more discriminative sparse codes with fewer atoms. This is due to applying multimodal weights with $\ell_1/\ell_2$ regularization on the multimodal class-specific weight matrices $W_c$ in optimization problem (9).

Multiview Face Recognition on UMIST. The UMIST face database consists of 564 cropped images of 20 persons of mixed race and gender [7]. Each person exhibits poses from profile to frontal views. The setup is unconstrained, and faces may have pose variations within each view-range. We run multiview face recognition on UMIST by segmenting the views of each person into M view-ranges with an equal number of images; intensity values are used. In Fig. 3, the poses of a subject from UMIST are divided into M = 3 view-ranges. We report the performance of MWDL for 2 and 3 views; Table 6 gives the results of 10-fold cross-validation. The dictionary of each view contains one normalized image of each subject in that view, so p = 20. MWDL learns a dictionary with uncorrelated class-specific atoms and outperforms the competing methods by more than 5%.

Table 6. Multiview face recognition results (%) for the UMIST dataset.
          JSRC    JDSRC   MTSRC   MWDL
2 Views   87.77   86.52   88.42   93.59
3 Views   99.51   98.96   99.63   100.0

    4.3. Multi-view Action Recognition

IXMAS [25] includes 11 categories of daily actions, each performed three times by twelve actors. Five cameras, four side views and one top view, are considered the modalities in this experiment. A multimodal sample of the IXMAS dataset is shown in Fig. 1. Following [25, 22, 24], leave-one-actor-out cross-validation is performed, and samples from all five views are used for training and testing. We extract dense trajectories from all samples using the code provided by [24]. Then, using k-means, we build a codebook of 2000 words from all the features. We consider 4 atoms per class, which leads to 44 atoms per view.

We report the performance of the state-of-the-art methods with $\ell_1/\ell_1$ decision fusion and of our method in Table 7. The results show that MWDL outperforms the competing methods by more than 4% and improves on LDL by 7.0%.

Table 7. Multiview action recognition accuracy (%) on IXMAS.
Method           Accuracy    Method               Accuracy
STDL [12]        91.9        Wang et al. [24]     87.8
LCKSVD [9]       87.5        Tran et al. [22]     80.2
Wu et al. [27]   88.2        LDL [29]             88.4
MWDL             96.1        Junejo et al. [10]   79.6

    5. Conclusion

In this paper, we presented a new method for learning multimodal dictionaries in which the multimodal sparse representations are forced to share the same sparsity pattern at the atom level of the modality-specific dictionaries through $\ell_1/\ell_2$ regularization. The imposed joint sparsity model enables the algorithm to fuse information at the feature level, and it can easily be extended to also aggregate the decisions of the modalities. The relation between the atoms of the different modalities and a certain class is determined by a weight matrix $W_c$. We apply $\ell_1/\ell_2$ regularization to $W_c$ to force each atom either to represent a specific class across all modalities or to be removed from all dictionaries. In this way, each modality learns a dictionary that reconstructs class-specific data with sparse coefficients that are distinctive from those of other classes, leading to higher classification accuracy. The experimental results demonstrated that the proposed method outperforms state-of-the-art dictionary learning methods in the most challenging scenarios.

    References

[1] P. Ahmadi and M. Joneidi. A new method for voice activity detection based on sparse representation. In Image and Signal Processing (CISP), 2014 7th International Congress on, pages 878–882, Oct 2014.

[2] B. Babagholami-Mohamadabadi, S. M. Roostaiyan, A. Zarghami, and M. S. Baghshah. Multi-modal distance metric learning: A Bayesian non-parametric approach. In European Conference on Computer Vision, pages 63–77. Springer, 2014.

[3] B. Babagholami-Mohamadabadi, A. Zarghami, M. Zolfaghari, and M. S. Baghshah. PSSDL: Probabilistic semi-supervised dictionary learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 192–207. Springer, 2013.

[4] S. Bahrampour, A. Ray, N. M. Nasrabadi, and K. W. Jenkins. Quality-based multimodal classification using tree-structured sparsity. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.

[5] P. Dollar, Z. Tu, P. Perona, and S. Belongie. Integral channel features. BMVC, 2(3):5, 2009.

[6] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627–1645, Sept 2010.

[7] D. B. Graham and N. M. Allinson. Face recognition using virtual parametric eigenspace signatures. In Image Processing and Its Applications, 1997, Sixth International Conference on, volume 1, pages 106–110, Jul 1997.

[8] R. Jenatton, J. Mairal, F. R. Bach, and G. R. Obozinski. Proximal methods for sparse hierarchical dictionary learning. In International Conference on Machine Learning, pages 487–494, 2010.

[9] Z. Jiang, Z. Lin, and L. Davis. Label consistent K-SVD: Learning a discriminative dictionary for recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(11):2651–2664, Nov 2013.

[10] I. Junejo, E. Dexter, I. Laptev, and P. Perez. View-independent action recognition from temporal self-similarities. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(1):172–185, Jan 2011.

[11] S. Kong and D. Wang. A dictionary learning approach for classification: separating the particularity and the commonality. In Computer Vision – ECCV 2012, pages 186–199. Springer, 2012.

[12] J. Mairal, F. Bach, and J. Ponce. Task-driven dictionary learning. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(4):791–804, April 2012.

[13] A. M. Martinez. The AR face database. CVC Technical Report, 24, 1998.

[14] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, Jan 2014.

[15] A. Rahimpour, J. Luo, A. Taalimi, and H. Qi. Distributed object recognition in smart camera networks. In Image Processing (ICIP), 2016 IEEE International Conference on, Sept 2016.

[16] M. Rahmani and G. Atia. Innovation pursuit: A new approach to subspace clustering. arXiv preprint arXiv:1512.00907, 2015.

[17] M. Rahmani and G. Atia. A subspace learning approach for high dimensional matrix decomposition with efficient column/row sampling. In International Conference on Machine Learning, pages 1206–1214, 2016.

[18] I. Ramirez, P. Sprechmann, and G. Sapiro. Classification and clustering via dictionary learning with structured incoherence and shared features. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3501–3508, June 2010.

[19] L. Rosasco, A. Verri, M. Santoro, S. Mosci, and S. Villa. Iterative projection methods for structured sparsity regularization. 2009.

[20] D. Ruta and B. Gabrys. An overview of classifier fusion methods. Computing and Information Systems, 7(1):1–10, 2000.

[21] S. Shekhar, V. Patel, N. Nasrabadi, and R. Chellappa. Joint sparse representation for robust multimodal biometrics recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(1):113–126, Jan 2014.

[22] D. Tran and A. Sorokin. Human activity recognition with metric learning. In Computer Vision – ECCV 2008, pages 548–561. Springer, 2008.

[23] T. Vu, H. Mousavi, V. Monga, G. Rao, and A. Rao. Histopathological image classification using discriminative feature-oriented dictionary learning. IEEE Transactions on Medical Imaging, 35(3):738–751, March 2016.

[24] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103(1):60–79, 2013.

[25] D. Weinland, E. Boyer, and R. Ronfard. Action recognition from arbitrary views using 3D exemplars. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–7, 2007.

[26] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face recognition via sparse representation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(2):210–227, Feb 2009.

[27] X. Wu, D. Xu, L. Duan, and J. Luo. Action recognition using context and appearance distribution features. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 489–496, June 2011.

[28] J. Yang, K. Yu, and T. Huang. Supervised translation-invariant sparse coding. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3517–3524, June 2010.

[29] M. Yang, D. Dai, L. Shen, and L. Van Gool. Latent dictionary learning for sparse representation based classification. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 4138–4145, June 2014.

[30] M. Yang, D. Zhang, X. Feng, and D. Zhang. Fisher discrimination dictionary learning for sparse representation. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 543–550, Nov 2011.

[31] H. Zhang, N. Nasrabadi, Y. Zhang, and T. Huang. Multi-observation visual recognition via joint dynamic sparse representation. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 595–602, Nov 2011.

[32] Q. Zhang and B. Li. Discriminative K-SVD for dictionary learning in face recognition. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2691–2698, June 2010.

[33] N. Zhou, Y. Shen, J. Peng, and J. Fan. Learning inter-related visual dictionary for object recognition. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3490–3497, June 2012.