

Retrieval Based Cartoon Synthesis via Heterogeneous Features Learning

Zhang Liang
College of Computer Science and Tech., Zhejiang University, Hangzhou, China
[email protected]

Jun Xiao
College of Computer Science and Tech., Zhejiang University, Hangzhou, China
[email protected]

Hong Pan
School of Information Science and Tech., Hangzhou Normal University, Hangzhou, China
[email protected]

Abstract—In this paper, we present a novel gesture recognition method for synthesizing cartoons from existing two-dimensional cartoon data. Drawing inspiration from the cross-media community, human-subject images act as queries to retrieve cartoon images containing similar gestures. Optimal descriptors are assigned to express the features of cartoon and human-subject images according to their characteristics; since the two descriptors have different dimensions, we treat them as heterogeneous features. Inspired by the exploitation of data structure in manifold learning, we integrate heterogeneous dimensionality reduction and a linear discriminant model into a hierarchical framework. Cartoon synthesis is then carried out on the retrieved cartoon keyframes. Experiments and an application demonstrate the effectiveness of the proposed method.

Keywords-character cartoon; gesture recognition; cartoon synthesis

I. INTRODUCTION

2D cartoon has attracted persistent research attention in recent decades [1], [2], [3], [4] and plays an important role in many areas, e.g., the cartoon industry, advertisement, and entertainment. However, production remains labor-intensive and time-consuming, which inhibits 2D cartoon from large-scale use. Many assisted systems have been designed to relieve cartoonists from tedious manual work, e.g., keyframe interpolation, content-based retrieval [4], and frame re-arranging [2]. However, these methods demand a background in cartoon production and are too complicated for non-professional users, who want to synthesize cartoon clips easily. For example, a boy may want to create a personal cartoon clip containing his favorite character Superman just for fun.

Many studies concentrate on synthesizing new cartoons from a library of existing cartoon data [5], [4], [6]: 2D cartoons drawn in recent decades are reused under users' specifications while retaining the original style. These works motivate us to develop a system that synthesizes new cartoon clips by reusing cartoon data. The works in [2], [3], [6] successfully obtain sequential 2D and 3D cartoon motions by re-arranging original frames or by calculating transition probabilities. Their common idea is to minimize the reconstruction error among synthesized frames, which is realized by discovering the data manifold embedded in a lower-dimensional subspace. However, they share two limitations: users cannot control the synthesis process except for its start and end, and performance often depends heavily on the content of the cartoon database. In this paper, we propose a retrieval-based cartoon reuse system to synthesize new cartoons.

In previous works, users are requested to provide initial images. However, how to define these images remains implicit and impractical. Images containing the same cartoon characters as those in the database are preferable, and manually drawn sketches can also serve as alternatives; but both fail when the user wants to retrieve images spanning multiple characters, or when drawing skills are unavailable. Thus, for convenience, we take the human-subject image as the query, since prevalent algorithms exist for extracting the corresponding features [7].

Given a sequence of human-subject images, our system retrieves, for each image, the top q-nearest cartoon images containing similar gestures. An important factor affecting the performance of higher-level analysis is feature selection. To this end, four kinds of feature descriptors are studied extensively to determine the optimal ones for cartoon and human-subject images respectively; the experiments show that the Histogram of oriented gradients (HOG) and Occupancy map (OM) descriptors are optimal. This can be explained by the differences between cartoon and human-subject images, such as color contrast, exaggeration effects, and body proportion, which call for different feature extraction tactics.

Finally, we propose a hierarchical framework that not only handles heterogeneous dimensionality reduction but also preserves class separability. Inspired by work on shared structure learning [8], [9], we claim that there exists a common subspace shared by the heterogeneous features. To learn a reliable lower-dimensional subspace for heterogeneous gesture recognition, we integrate graph-based transductive reduction and Linear discriminant analysis (LDA) into a hierarchical framework. Our framework can also be used for content-based cartoon retrieval and, further, for cartoon synthesis. The system framework is shown in Figure 1.



Figure 1. The system framework.

The rest of this paper is organized as follows. We detail our hierarchical framework in Section 2. Extensive experiments are presented in Section 3, followed by the synthesis realization in Section 4. Section 5 gives the conclusion and future work.

II. THE HIERARCHICAL FRAMEWORK

In this section, we first present our algorithm for heterogeneous dimensionality reduction, and then describe the LDA model applied here.

A. Heterogeneous dimensionality reduction

Let the matrix $X_1 = [x_1^{(1)}, \ldots, x_n^{(1)}]^T \in \mathbb{R}^{n \times d_1}$ be obtained by the feature descriptor on cartoon images and $X_2 = [x_1^{(2)}, \ldots, x_n^{(2)}]^T \in \mathbb{R}^{n \times d_2}$ by the feature descriptor on human-subject images, where $d_1$ and $d_2$ denote the two feature dimensions, respectively. The ground-truth label matrix is $Y = [y_1, \ldots, y_n]^T \in \{0,1\}^{n \times r}$: if both $x_i^{(1)}$ and $x_i^{(2)}$ belong to the $k$-th class, then $Y_{ik} = 1$; otherwise $Y_{ik} = 0$. To satisfy the nonsingularity requirement on $Y$, every vector $y_i \neq 0$, which is guaranteed by manually constructing class-level pairs so that the selected $x_i^{(1)}$ and $x_i^{(2)}$ belong to the same class.

Under the linear assumption, we define projection matrices $W_1 \in \mathbb{R}^{d_1 \times r}$ and $W_2 \in \mathbb{R}^{d_2 \times r}$ and project $X_1$ and $X_2$ onto them separately, obtaining $X'_1 \in \mathbb{R}^{n \times r}$ and $X'_2 \in \mathbb{R}^{n \times r}$.

Our algorithm aims to find the optimal $W_1$ and $W_2$ that maximize the correlations. To prevent overfitting caused by insufficient samples, we impose regularizers [10] on the objective function. The optimization problem over the regularized empirical errors can then be written as:

$$\min_{F, W_1, W_2} \; \mathrm{Reg}(F) + \alpha\|X_1 W_1 - F\|_F^2 + \lambda_1\|W_1\|_F^2 + \beta\|X_2 W_2 - F\|_F^2 + \lambda_2\|W_2\|_F^2 + \delta\|F - Y\|_F^2, \quad (1)$$

where Reg(F ) imposes the manifold structure with idea ofgraph Laplacian, and terms ‖W1‖2 and ‖W2‖2 are Tikhonovregularizers [10] controlling the learning complexity, and the

coefficients compromise between learning complexity andempirical loss.

The regularizer term $\mathrm{Reg}(F)$ gives us the flexibility to incorporate prior knowledge, where the variable $F \in \mathbb{R}^{n \times r}$ denotes the graph-embedded label prediction matrix. For dimensionality reduction, it is assumed that nearby samples are likely to have similar embeddings [4]. Therefore, we define a q-nearest-neighbor graph $G_p$, where $p = 1, 2$ indexes the cartoon and human-subject views, to model the embeddings. The corresponding affinity matrix is $A_p$, and the element $A_{ij}^{(p)}$ reflects the adjacency between $x_i^{(p)}$ and $x_j^{(p)}$ on the graph. A natural regularizer can then be defined as:

$$\mathrm{Reg}(F) = \sum_{k=1}^{r}\sum_{i,j=1}^{n} (F_{ik} - F_{jk})^2 A_{ij}^{(p)} = \sum_{i,j=1}^{n} A_{ij}^{(p)}\left(F_i^T F_i + F_j^T F_j - 2 F_i^T F_j\right) = 2\,\mathrm{tr}\!\left(F^T (D_p - A_p) F\right) = 2\,\mathrm{tr}\!\left(F^T L_p F\right), \quad (2)$$

where $\mathrm{tr}(\cdot)$ denotes the trace operator and $D_p$ is the diagonal matrix with elements $D_{ii}^{(p)} = \sum_{j=1}^{n} A_{ij}^{(p)}$. We obtain $L_1 = D_1 - A_1$ and $L_2 = D_2 - A_2$ as the Laplacian matrices of the two views. To simplify the equation, we derive the synthesized $L_{syn}$ as:

$$L_{syn} = \frac{\left(\frac{L_1 + L_2}{2}\right)^T + \frac{L_1 + L_2}{2}}{2}, \quad (3)$$

so that $L_{syn} = L_{syn}^T$. The objective function can then be rewritten as:

$$\min_{F, W_1, W_2} \; \mathrm{tr}(F^T L_{syn} F) + \alpha\|X_1 W_1 - F\|_F^2 + \lambda_1\|W_1\|_F^2 + \beta\|X_2 W_2 - F\|_F^2 + \lambda_2\|W_2\|_F^2 + \delta\|F - Y\|_F^2, \quad (4)$$

where $\|\cdot\|_F$ denotes the Frobenius norm; note that $\|Z\|_F^2 = \mathrm{tr}(Z^T Z)$ for any matrix $Z$. The weights $\alpha = 10^{-2}$, $\lambda_1 = 10^{-3}$, $\beta = 10^{-1}$, $\lambda_2 = 10^{-2}$, and $\delta = 1$ have been determined experimentally and remain fixed.
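As a concrete illustration (a minimal NumPy/SciPy sketch on random stand-in data, not the authors' released code), the q-nearest-neighbor graphs, the per-view Laplacians $L_p$, and the symmetrized $L_{syn}$ of Eq. 3 can be assembled as follows; the binary affinity and the toy dimensions are our assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_affinity(X, q=5):
    """Binary q-nearest-neighbor affinity matrix, symmetrized."""
    D = cdist(X, X)                      # pairwise Euclidean distances
    np.fill_diagonal(D, np.inf)          # exclude self-matches
    A = np.zeros_like(D)
    idx = np.argsort(D, axis=1)[:, :q]   # q nearest neighbors per sample
    rows = np.repeat(np.arange(X.shape[0]), q)
    A[rows, idx.ravel()] = 1.0
    return np.maximum(A, A.T)            # make the graph undirected

def laplacian(A):
    """Graph Laplacian L = D - A with the degree matrix D."""
    return np.diag(A.sum(axis=1)) - A

# X1: n x d1 cartoon features (e.g., HOG); X2: n x d2 human features (e.g., OM).
n, d1, d2, r = 200, 64, 48, 10           # illustrative sizes only
rng = np.random.default_rng(0)
X1 = rng.standard_normal((n, d1))
X2 = rng.standard_normal((n, d2))

L1, L2 = laplacian(knn_affinity(X1)), laplacian(knn_affinity(X2))
M = (L1 + L2) / 2.0
L_syn = (M.T + M) / 2.0                  # Eq. 3; guarantees L_syn = L_syn^T
```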

By setting the derivatives of Eq. 4 w.r.t. $W_1$, $W_2$, and $F$ to zero sequentially, we have:

$$W_1 = \delta B_1 (\alpha U + \beta V + E)^{-1} Y, \quad (5)$$
$$W_2 = \delta B_2 (\alpha U + \beta V + E)^{-1} Y, \quad (6)$$

where

$$U = B_1^T X_1^T X_1 B_1 - X_1 B_1 - B_1^T X_1^T + I, \quad (7)$$
$$V = B_2^T X_2^T X_2 B_2 - X_2 B_2 - B_2^T X_2^T + I, \quad (8)$$
$$E = L_{syn} + \lambda_1 B_1^T B_1 + \lambda_2 B_2^T B_2 + \delta I, \quad (9)$$
$$B_1 = \alpha (\alpha X_1^T X_1 + \lambda_1 I)^{-1} X_1^T, \quad (10)$$

and

$$B_2 = \beta (\beta X_2^T X_2 + \lambda_2 I)^{-1} X_2^T. \quad (11)$$
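Continuing the sketch above (same stand-in matrices in scope), Eqs. 5-11 map directly onto NumPy; each matrix inverse is realized as a linear solve for numerical stability, and the toy labels are an assumption.

```python
alpha, lam1 = 1e-2, 1e-3                 # fixed weights reported for Eq. 4
beta, lam2, delta = 1e-1, 1e-2, 1.0

Y = np.zeros((n, r))                     # ground-truth label matrix
Y[np.arange(n), rng.integers(0, r, n)] = 1.0   # random toy labels (assumption)

I_n = np.eye(n)
B1 = alpha * np.linalg.solve(alpha * X1.T @ X1 + lam1 * np.eye(d1), X1.T)  # Eq. 10
B2 = beta  * np.linalg.solve(beta  * X2.T @ X2 + lam2 * np.eye(d2), X2.T)  # Eq. 11

U = B1.T @ (X1.T @ X1) @ B1 - X1 @ B1 - B1.T @ X1.T + I_n                  # Eq. 7
V = B2.T @ (X2.T @ X2) @ B2 - X2 @ B2 - B2.T @ X2.T + I_n                  # Eq. 8
E = L_syn + lam1 * B1.T @ B1 + lam2 * B2.T @ B2 + delta * I_n              # Eq. 9

core = np.linalg.solve(alpha * U + beta * V + E, Y)
W1 = delta * B1 @ core                   # Eq. 5: d1 x r projection, cartoon view
W2 = delta * B2 @ core                   # Eq. 6: d2 x r projection, human view
```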


Figure 2. Selective cartoon characters collected in our database.

B. Linear discriminant model

After dimensionality reduction, the cartoon and human-subject samples share a unified dimensionality, but the intra- and inter-class covariances are not yet optimized in the new subspace. We therefore apply LDA to deal with this problem.

Linear discriminant analysis (LDA) [11] seeks directions along which samples from different classes are far from each other while samples within a class stay as close as possible. Suppose we have a set of $2n$ samples $x'_1, x'_2, \ldots, x'_{2n}$ belonging to $r$ classes. Note that all these samples come from $X'_1$ and $X'_2$, since the two views can be considered different expressions of the same semantic meaning. The objective function of LDA is:

$$U_{opt} = \arg\max_U \frac{U^T S_b U}{U^T S_w U}, \quad (12)$$
$$S_b = \sum_{k=1}^{r} n_k \left(\mu^{(k)} - \mu\right)\left(\mu^{(k)} - \mu\right)^T, \quad (13)$$
$$S_w = \sum_{k=1}^{r} \left[ \sum_{i=1}^{n_k} \left(x'^{(k)}_i - \mu^{(k)}\right)\left(x'^{(k)}_i - \mu^{(k)}\right)^T \right], \quad (14)$$

where $\mu$ is the total sample mean vector, $n_k$ and $\mu^{(k)}$ are the number and mean vector of samples in the $k$-th class, respectively, and $x'^{(k)}_i$ is the $i$-th sample in the $k$-th class. $S_b$ is the between-class scatter matrix and $S_w$ the within-class scatter matrix.

The total scatter matrix is defined as $S_t = \sum_{i=1}^{2n} (x'_i - \mu)(x'_i - \mu)^T$, with $S_t = S_b + S_w$. The optimal $U$'s are then the eigenvectors corresponding to the non-zero eigenvalues of the eigen-problem [11]:

$$S_b U = \lambda S_t U. \quad (15)$$
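Eq. 15 is a generalized symmetric eigenproblem; a minimal sketch of solving it with scipy.linalg.eigh, assuming the reduced samples and labels are given and adding a small ridge so that $S_t$ stays positive definite:

```python
import numpy as np
from scipy.linalg import eigh

def lda_directions(Xp, labels, eps=1e-6):
    """Solve S_b U = lambda S_t U (Eq. 15); returns discriminant directions."""
    mu = Xp.mean(axis=0)
    St = (Xp - mu).T @ (Xp - mu)                 # total scatter matrix
    Sb = np.zeros_like(St)
    for k in np.unique(labels):
        Xk = Xp[labels == k]
        diff = (Xk.mean(axis=0) - mu)[:, None]
        Sb += len(Xk) * (diff @ diff.T)          # between-class scatter (Eq. 13)
    # Ridge keeps S_t positive definite for the generalized solver.
    evals, evecs = eigh(Sb, St + eps * np.eye(St.shape[0]))
    order = np.argsort(evals)[::-1]              # largest eigenvalues first
    keep = int((evals[order] > 1e-10).sum())     # non-zero eigenvalues only
    return evecs[:, order[:keep]]
```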

III. EXPERIMENT

In this section, we conduct extensive experiments to test the performance of the proposed framework in terms of retrieval precision.

A. Database construction

Our framework requires a database of cartoon and human-subject data paired at the class level. For the cartoon part, we collected 214 clips featuring 14 characters to provide generality, as shown in Figure 2. Accordingly, the actions in the human-subject videos follow those in the cartoons, with subjects imitating the cartoon actions as closely as possible. However, the similarity is merely tolerable, because cartoon exaggerations cannot be repeated exactly.

To extract characters from the two kinds of video, we apply distinct tactics based on their visual characteristics. The outlines of cartoon characters are painted before coloration and can be detected easily without noise interference [3]. We therefore employ a Laplacian of Gaussian (LoG) filter [12] to detect character sketches and extract them from the rendered images. After that, we apply a two-dimensional convolution filter to refine the edges and fill the surrounded regions using a flood-fill algorithm [3]. When the cartoon background is complex and the LoG filter fails, we fall back on an SVM-based classification method in which a few user-annotated strokes act as interactive guidance for image segmentation. To evaluate the influence of these extraction algorithms under noisy and noise-free conditions, Figure 3 (b) shows the results under four combined configurations for our method and the others, where Real means real cartoon characters (which typically contain more noise), Syn means synthetic characters rendered from MotionBuilder (considered noise-free), LOG denotes the LoG filter, and SVM denotes SVM classification. For human-subject images, the foreground is extracted instead of sketch edges, since outlines are not salient under real illumination. Rather than pixel-wise comparison, a Gaussian mixture model (GMM) [13] is applied to eliminate the negative impact of illumination and viewpoint aberrations.
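The LoG-based extraction step can be prototyped with off-the-shelf SciPy filters; the sketch below is our reading of the pipeline, and the sigma and threshold values are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
from scipy import ndimage

def extract_cartoon_mask(gray, sigma=2.0, rel_thresh=0.02):
    """LoG outline detection, edge refinement, and region filling."""
    log = ndimage.gaussian_laplace(gray.astype(float), sigma=sigma)
    edges = np.abs(log) > rel_thresh * np.abs(log).max()  # strong outline response
    edges = ndimage.binary_closing(edges, iterations=2)   # refine broken outlines
    return ndimage.binary_fill_holes(edges)               # fill surrounded regions
```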

Given the extracted images, we keep only keyframes when constructing the database, to prevent its size from growing unacceptably. Inspired by [14], in which the Hausdorff distance (HD) is used for pose inference, we design a normalized HD matrix and determine keyframes by detecting local maxima of covariance, as shown in Figure 4. All keyframes are assigned class labels as ground truth, then normalized and centralized.
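A sketch of this keyframe step, assuming each frame is reduced to a set of 2D contour points and using SciPy's directed Hausdorff distance; the local-maximum selection rule below is our simplified reading of the covariance criterion, not the paper's exact detector.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def normalized_hd_matrix(contours):
    """Symmetric Hausdorff-distance matrix over (m_i, 2) contour-point arrays."""
    n = len(contours)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = max(directed_hausdorff(contours[i], contours[j])[0],
                    directed_hausdorff(contours[j], contours[i])[0])
            H[i, j] = H[j, i] = d
    return H / H.max()                       # normalize to [0, 1]

def pick_keyframes(H):
    """Keep frames whose average dissimilarity peaks locally along the clip."""
    s = H.mean(axis=1)
    return [i for i in range(1, len(s) - 1) if s[i] > s[i - 1] and s[i] > s[i + 1]]
```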

B. Performance evaluation

After the previous stage, we obtained 3321 pairs of cartoon and human-subject keyframes falling into 75 classes. As a reuse approach, we assume that the database contains sufficient samples with persuasive style variety for extensive experiments. Similar to previous works in the multimedia community [1], [2], we do not take the out-of-sample extension into consideration.


Figure 3. Comparative histograms, comparing our proposed method with the others when cartoon images come from selective characters (a) and under combined configurations for character extraction (b).

Figure 4. Result of keyframe determination on a cartoon sequence of size 51. In the visualization of the normalized matrix, color contrast corresponds to gesture dissimilarity. Keyframes are shown beside the matrix.

Note that each result is the average over 10 random splits. In all figures, the number of training samples n acts as the variable, and the number of testing samples nt is fixed to 500.

Since subjective evaluations are widely used in many areas [15], [4], we invited 20 volunteers with an equal gender split. These persons, aged from 20 to 50, are familiar with 2D cartoons; their judgements are recorded and aggregated as a mean error rate.

Our experiments focus on three aspects: selection of feature descriptors, method comparison, and practicality evaluation. First, we run comparative experiments based on the Least squares regression (LSR) algorithm to determine the optimal feature descriptors. The objective function is defined as:

$$\min_{W_p} \|X_p W_p - Y\|_F^2 + \mu \|W_p\|_F^2, \quad (16)$$

where $p = 1, 2$ denotes the cartoon and human-subject views. After transformation by the optimal $W_p^*$, we compute the top q-nearest neighbors in the lower-dimensional subspace using Euclidean distances. Figure 5 (a) and (b) show the performance of the candidate feature descriptors. The overall performance of the four descriptors is better on cartoon images than on human-subject ones; meanwhile, HOG and OM yield the best precisions and thus act as the optimal descriptors for the two views, respectively. Note that all the curves converge slowly, and the decrease becomes subtle when n is larger than 500.

Figure 6 illustrates the quantitative comparison between our method and the others. We compare the proposed method against two classical supervised approaches to the heterogeneous features problem, SDA-LSR and CCA-SVM. Briefly, Canonical correlation analysis (CCA) [17] is a supervised heterogeneous dimensionality reduction method, and Semi-supervised discriminant analysis (SDA) [10] is a framework derived from LDA by adding a graph-based regularizer. CCA-SVM uses an SVM to classify samples in the reduced-dimensional subspace derived by CCA, while SDA-LSR performs two levels of projection, combining separability preservation and linear regression. For fairness, all methods adopt the HOG-OM collocation determined above. As shown in Figure 6, both SDA-LSR and our method converge synchronously and achieve better results than CCA-SVM; however, the superiority of ours over SDA-LSR is very minor.


Figure 5. Results of the LSR algorithm based on four types of feature descriptors, for cartoon data (a) and human-subject data (b); (c) shows the performance of various descriptor collocations. HD: Hausdorff distance, HOG: Histogram of oriented gradients, LCH: Local color histogram, OM: Occupancy map. All of them were studied extensively in [16].

Figure 6. The comparative performance between our method and others.

The possible reason is that both emphasize manifold learning and the maximization of class separability.

For practicality evaluation, we test our method on various feature collocations, as shown in Figure 5 (c). The HOG-OM collocation achieves the best precision, confirming the assumption above. Interestingly, even the worst collocation, LCH-OM, deviates from the optimal one by only about 10% precision after convergence, which demonstrates our method's insensitivity to descriptor selection. Additionally, compared with Figure 5 (a) and (b), in which the results scatter over loose ranges, our method improves practicality by achieving smooth and dense performance at only about 5% precision cost.

IV. CARTOON SYNTHESIS APPLICATION

We now present the realization of cartoon synthesis based on the proposed method. The overall process is: given a sequence of human-subject keyframes, the system retrieves cartoon keyframes containing similar gestures for the user's further editing.

A. Input specification

In our system, subjects perform actions in front of a monocular camera to produce the query source. A natural problem is then how to adjust the subjects' yaw angles relative to those of the cartoon characters. As in pose inference, the inherent 2D-3D ambiguity is a key factor lowering recognition precision, and aligning the yaw angles dramatically reduces the impact of this ambiguity [16]. However, the yaw angles of cartoon characters can hardly be obtained accurately in advance, which makes the alignment an ill-conditioned problem.

Inspired by the optical-flow-based direction feature [4], we use the motion direction, which can be detected easily in both media, for alignment. Specifically, we apply the Motion history image (MHI) [18], a temporal template formed as a weighted sum of successive binary silhouettes. Figure 7 shows directional MHIs under eight yaw angles, which means the query action should be repeated for higher retrieval precision. The MHI of each action in the training samples is calculated from its centralized keyframes, and all these MHIs are compared with the human subjects' MHIs to determine the optimal query.
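The MHI template itself takes only a few lines of NumPy; the decay constant tau below is an assumption (OpenCV's motempl module offers an equivalent routine).

```python
import numpy as np

def motion_history_image(silhouettes, tau=30.0):
    """Weighted sum of successive binary silhouettes; recent motion is brightest.

    silhouettes: (T, H, W) boolean array of centralized frames (assumed input).
    """
    mhi = np.zeros(silhouettes.shape[1:], dtype=float)
    for frame in silhouettes:
        mhi = np.where(frame, tau, np.maximum(mhi - 1.0, 0.0))  # decay old motion
    return mhi / tau                                            # scale to [0, 1]
```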

B. Cartoon retrieval

An important component of our system is content-based cartoon retrieval. Given the human subjects' features, they are first reduced into the lower-dimensional subspace and then transformed by the projection learned with LDA. After that, the top q-nearest cartoon neighbors are returned as the results.
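End to end, retrieval composes the two learned maps; a sketch under the assumption that $W_2$ (Eq. 6) and an LDA basis U from Section II are available, with the cartoon gallery already mapped through $W_1$ and U:

```python
import numpy as np

def retrieve_cartoons(x_human, W2, U, cartoon_low, q=7):
    """Reduce a human-subject query, apply the LDA transform, then do kNN."""
    z = (x_human @ W2) @ U                        # two-stage projection
    dist = np.linalg.norm(cartoon_low - z, axis=1)
    return np.argsort(dist)[:q]                   # indices of the top-q gestures
```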


Figure 7. Classification of representative yaw angles in the [0°, 360°) range (left) and the corresponding MHIs acted by subjects (right).

Figure 8 shows two examples of content-based retrieval, with unreasonable results marked by red rectangles. Our method yields visually better results in both conditions. Figure 3 (a) confirms this observation statistically through subjective judgements, showing results across four cartoon characters and their average.

Furthermore, we extend our work to automatic labeling for database enrichment. The labeling process is very similar to that of retrieval, but differs from traditional schemes in that the label decision takes both cartoon and human-subject neighbors into consideration, since the subjects' actions can eliminate the gesture distortion caused by artistic exaggeration in cartoons.

C. Cartoon synthesis

Users can interactively edit the retrieved cartoon keyframes to obtain the desired clip. After a background image is specified, we estimate its perspective from detected vanishing and parallel lines to obtain depth information for scaling.

To refine the results, assisted editing tools are provided. A natural facility is rotation and deformation: we employ the 2D deformation technique studied in [19], in which skeletal keypoints act as constraint handles for mesh morphing, and the affine parameters presented in [20] control the global translation and rotation of the character. We also adopt the idea of hierarchical components [21], allowing users to replace any component of the original character with a desired one. Figure 9 shows a cartoon synthesis result in which the user replaces the original head and deforms the right limb in some keyframes; the result achieved by our method is visually smooth. Finally, we apply the scattered interpolation technique detailed in [22] to render the final clip.

V. CONCLUSION AND FUTURE WORK

This paper has presented a novel gesture recognition method for synthesizing cartoons by reuse, in which heterogeneous dimensionality reduction and a linear discriminant model are combined into a hierarchical framework. The key issue lies in extensive learning of the data manifold based on heterogeneous features, which makes our method intrinsically different from previous ones. Experiments and examples demonstrate its effectiveness in terms of precision, convenience, and generality.

Figure 8. Two examples with the top 7 retrieved images. The multi-character example incorporates multiple characters into the test, while the single-character example uses only the character TOM.


Finally, several improvements deserve further exploration, e.g., importing multi-labeling into our learning framework, since a gesture often admits multiple pose interpretations. Bringing active learning into the modeling of feature selection would also increase performance.

ACKNOWLEDGMENT

Project supported by the National Natural Science Foundation of China (No. 60903134), the National Key Technology R & D Program of China (No. 2007BAH11B00), and the Natural Science Foundation of Zhejiang Province (No. Y1101129).

REFERENCES

[1] Y. Li, T. Wang, and H.Y. Shum. Motion texture: a two-level statistical model for character motion synthesis. In Proceedings of the 29th annual conference on Computer graphics and interactive techniques, pages 465–472. ACM, 2002.

[2] C. de Juan and B. Bodenheimer. Cartoon textures. In Proceedings of the 2004 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 267–276. Eurographics Association, 2004.

[3] J. Yu, J. Xiao, C. Chen, and Y. Zhuang. Perspective-aware cartoon clips synthesis. Computer Animation and Virtual Worlds, 19(3-4):355–364, 2008.

[4] Y. Yang, Y. Zhuang, D. Xu, Y. Pan, D. Tao, and S. Maybank. Retrieval based interactive cartoon synthesis via unsupervised bi-distance metric learning. In Proceedings of the seventeenth ACM international conference on Multimedia, pages 311–320. ACM, 2009.

[5] A. Kort. Computer aided inbetweening. In Proceedings of the 2nd international symposium on Non-photorealistic animation and rendering, pages 125–132. ACM, 2002.

[6] J. Yu, D. Liu, and H.S. Seah. Transductive graph based cartoon synthesis. Computer Animation and Virtual Worlds, 21(3-4):277–288, 2010.


Figure 9. Cartoon clip synthesis results. From (a) to (c): query human-subject keyframes, retrieved cartoon keyframes, and synthesized keyframes.

[7] M. Petrou and C. Petrou. Image processing: the fundamentals. Wiley, 2010.

[8] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification. In Proceedings of the 24th international conference on Machine learning, pages 17–24. ACM, 2007.

[9] J. Chen, L. Tang, J. Liu, and J. Ye. A convex formulation for learning shared structures from multiple tasks. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 137–144. ACM, 2009.

[10] D. Cai, X. He, and J. Han. Semi-supervised discriminant analysis. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–7. IEEE, 2007.

[11] K. Fukunaga. Introduction to statistical pattern recognition. Academic Press, 1990.

[12] D. Marr and E. Hildreth. Theory of edge detection. Proceedings of the Royal Society of London. Series B, Biological Sciences, pages 187–217, 1980.

[13] Y. Zhuang and C. Chen. Efficient silhouette extraction with dynamic viewpoint. pages 1–8, 2007.

[14] N.R. Howe. Silhouette lookup for monocular 3d pose tracking. Image and Vision Computing, 25(3):331–341, 2007.

[15] Z. Liang, J. Xiao, Y. Zhuang, and C. Chen. Competitive motion synthesis based on hybrid control. Computer Animation and Virtual Worlds, 20(2-3):225–235, 2009.

[16] C. Chen, Y. Zhuang, and J. Xiao. Silhouette representation and matching for 3D pose discrimination: a comparative study. Image and Vision Computing, 28(4):654–667, 2010.

[17] L. Cao, J. Yu, J. Luo, and T.S. Huang. Enhancing semantic and geographic annotation of web images via logistic canonical correlation regression. In Proceedings of the seventeenth ACM international conference on Multimedia, pages 125–134. ACM, 2009.

[18] T. Ogata, J.K. Tan, and S. Ishikawa. High-speed human motion recognition based on a motion history image and an eigenspace. IEICE Transactions on Information and Systems, 89(1):281, 2006.

[19] T. Igarashi, T. Moscovich, and J.F. Hughes. As-rigid-as-possible shape manipulation. ACM Transactions on Graphics (TOG), 24(3):1134–1141, 2005.

[20] C. Bregler, L. Loeb, E. Chuang, and H. Deshpande. Turning to the masters: motion capturing cartoons. ACM Transactions on Graphics (TOG), 21(3):399–407, 2002.

[21] F. Di Fiore, F. Van Reeth, J. Patterson, and P. Willis. Highly stylised drawn animation. Advances in Computer Graphics, pages 36–53, 2006.

[22] M.J. Park and S.Y. Shin. Example-based motion cloning. Computer Animation and Virtual Worlds, 15(3-4):245–257, 2004.
