

Object Recognition Using Saliency Maps and HTM Learning

Ioannis Kostavelis (1), Lazaros Nalpantidis (2), and Antonios Gasteratos (1)

(1) Laboratory of Robotics and Automation, Department of Production and Management Engineering, Democritus University of Thrace, Vas. Sofias 12, GR-67100, Xanthi, Greece
Email: {gkostave,agaster}@pme.duth.gr

(2) Computer Vision and Active Perception Lab., Centre for Autonomous Systems, Royal Institute of Technology - KTH, SE-100 44 Stockholm, Sweden
Email: [email protected]

Abstract—In this paper a pattern classification and object recognition approach based on bio-inspired techniques is presented. It exploits the Hierarchical Temporal Memory (HTM) topology, which imitates the human neocortex for recognition and categorization tasks. The HTM comprises a hierarchical tree structure that exploits enhanced spatiotemporal modules to memorize objects appearing in various orientations. In accordance with HTM's biological inspiration, human vision mechanisms can be used to preprocess the input images. Therefore, the input images undergo a saliency computation step, revealing the plausible information of the scene, where a human might fixate. The adoption of the saliency detection module releases the HTM network from memorizing redundant information and augments the classification accuracy. The efficiency of the proposed framework has been experimentally evaluated on the ETH-80 dataset, and the classification accuracy has been found to be greater than that of other HTM systems.

Index Terms—object recognition, HTM network, saliency map.

I. INTRODUCTION

In the past years much research has taken place in the field of object recognition and classification [1]. Although conventional computer vision techniques provide adequate solutions, interest has turned to bio-inspired systems, which try to imitate the human capability to recognize a great variety of objects with little effort. The breakthrough came from the theory described in [2], where the authors argued that machine learning techniques should follow a hierarchical structure, similar to that of the brain. A next step was the introduction of the HTM models [3], which are hierarchical networks capable of providing efficient solutions in the field of machine learning [4].

HTM networks have already been exploited in numerous applications comprising supervised learning techniques. In [5] the authors introduced an object recognition method utilizing stereo input at the lowest level of the HTM, taking advantage of the frequency of occurrence of spatiotemporal patterns while a mobile robot moves erratically. Additionally, in [6] a supplementary node was placed on top of the HTM hierarchy in order to provide a solution to the problem of sign language recognition. Moreover, the HTM framework has been utilized for hand pose estimation, hand shape recognition and

Fig. 1. The proposed HTM structure for supervised learning.

even for traffic sign recognition, as described in [7], [8] and [9], respectively. Contrary to the aforementioned techniques, which fed the output of the imaging system directly to the HTM, the work described in [10] exploited the log-polar transformation at the bottom level of the network in order to comply with the human vision system. The log-polar transformation is a mapping from the Cartesian coordinate system to the retinal one and has been applied in several bio-inspired applications [11]. However, visual saliency detection is closer to the human vision system, providing consistent fixation on the important information of the scene [12].

The proposed work constitutes a supervised learning method used to recognize objects in different orientations. It is based on the structure proposed in [10]; however, it exploits a saliency detection preprocessing step on the input images, endowing the system with more biologically inspired attributes. More analytically, it introduces specific alternative rules for the design of each building block of an HTM, i.e., both the spatial and the temporal module. The input level comprises a visual saliency preprocessing step that isolates the region

978-1-4577-1775-8/12/$26.00 ©2012 IEEE


of interest in the input images, releasing the HTM learning procedure from redundant information. The consecutive steps of the proposed method are summarized in Fig. 1.

II. ALGORITHM DESCRIPTION

The contribution of this work is twofold. In the first step, the input images undergo a saliency detection procedure by exploiting the Graph-Based Visual Saliency (GBVS) algorithm. During this step, the coordinates of the image that correspond to information-rich regions are extracted. In the second step, only the detected salient locations are utilized in order to train a specially designed HTM network.

A. Detection of the salient regions

The method adopted in this work is based on the GBVS algorithm, which exploits the image signature technique described in [13]. The image signature is a simple image descriptor that takes advantage of sparse signal mixing to determine the spatial locations of the foreground of an image. More analytically, it is assumed that an image I can be decomposed into a foreground signal f and a background signal b as follows:

I = f + b (1)

The contribution of the foreground signal f can be determined by taking the sign of the mixture signal I in the transformed domain and then inversely transforming it back into the spatial domain. The latter is achieved by computing the reconstructed image Ī = IDCT[sign(DCT(I))], where DCT and IDCT denote the Discrete Cosine Transform and its inverse, respectively. Additionally, the foreground of an image, contrary to its background, is visually prominent. Therefore, the saliency map can be formed by applying a Gaussian kernel g to the reconstructed image Ī. The Gaussian filtering step is necessary because salient objects are not only arbitrarily located in a scene, but should also appear as a continuous region. The GBVS algorithm is decomposed into two consecutive steps. The first one, named the activation map, comprises the formation of a fully connected weighted graph G by exploiting a simple dissimilarity measure over local neighborhoods of the image, as described in [12]. The next step is the normalization of the activation map and concerns the normalization of the weights between the nodes of G. The output of the GBVS algorithm is presented in Fig. 2, where the saliency method has been applied to a sample image of the ETH-80 dataset. More specifically, Fig. 2(b) presents the output of the GBVS procedure, which exhibits great performance, as the foreground is easily distinguished from the background. In Fig. 2(c) the saliency map is overlaid on the reference image. This method can significantly improve the results of the subsequent HTM network. The saliency map can be used to isolate the most prominent area of the input image, removing the peripheral, irrelevant and possibly noisy image areas. Therefore, a bounding box is defined around the region that contains 90% of the saliency map (Fig. 2(d)) and is then utilized for the training of the HTM network.
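The signature-based reconstruction and smoothing described above can be sketched as follows. This is a minimal sketch, assuming a grayscale input and SciPy's DCT routines; the smoothing width sigma is an illustrative choice, and squaring the reconstruction before filtering follows the image signature formulation in [13]:

```python
import numpy as np
from scipy.fftpack import dct, idct
from scipy.ndimage import gaussian_filter

def dct2(x):
    # Separable 2D DCT (orthonormal, so idct2 is its exact inverse)
    return dct(dct(x, axis=0, norm='ortho'), axis=1, norm='ortho')

def idct2(x):
    return idct(idct(x, axis=0, norm='ortho'), axis=1, norm='ortho')

def signature_saliency(image, sigma=3.0):
    """Image-signature saliency of a grayscale image I = f + b:
    reconstruct I_bar = IDCT[sign(DCT(I))], square it to emphasize
    foreground energy, and smooth with a Gaussian kernel g."""
    recon = idct2(np.sign(dct2(image)))
    saliency = gaussian_filter(recon ** 2, sigma=sigma)
    return saliency / saliency.max()  # normalized to [0, 1]
```

A sparse bright region against a flat background is then highlighted as the foreground, which is what makes the subsequent bounding-box step possible.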


Fig. 2. (a) Reference image, (b) the saliency map of the GBVS algorithm, (c) the output of the GBVS algorithm overlaid on the initial image, (d) the image patch enclosed by a bounding box around the salient region.
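The 90% bounding-box rule can be sketched as follows. The text does not spell out the exact selection rule, so this is one plausible reading: keep the brightest pixels whose summed saliency reaches 90% of the total, then box their extent:

```python
import numpy as np

def salient_bbox(saliency, mass=0.90):
    """Bounding box of the region holding `mass` of the total saliency:
    threshold at the smallest saliency value such that the retained
    pixels carry 90% of the map's mass, then box their extent.
    (An assumed reading of the '90% of the saliency map' rule.)"""
    flat = np.sort(saliency.ravel())[::-1]          # brightest first
    csum = np.cumsum(flat)
    thresh = flat[np.searchsorted(csum, mass * csum[-1])]
    ys, xs = np.nonzero(saliency >= thresh)
    return ys.min(), ys.max(), xs.min(), xs.max()   # (top, bottom, left, right)
```

The returned box is what gets cropped from the input image and handed to the HTM's first level.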

B. Hierarchical Temporal Memory Network

In this work, specific optimizations have been considered and applied to the original HTM algorithm [14]. The attributes of an HTM network are summarized as follows:

• Each level consists of adjoint nodes, the number of which decreases ascending the hierarchy.

• The number of nodes comprising level ν is 2^{2(λ−ν)}, where λ is the number of levels in the network.

• Level 0 corresponds to the images presented to the network. The input images are also divided into patches of n by n pixels.

• Nodes receive inputs from spatially specific areas, namely their receptive fields.

• The nodes follow the same algorithmic procedures independent of the level they belong to, i.e., the spatial and the correlation one.
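The layout rule above can be made concrete as follows. The exponent 2^{2(λ−ν)} is read here as a quadtree in which each level holds a quarter of the nodes of the level below and the top level holds a single node; the exact exponent is an interpretation of the original notation:

```python
def nodes_at_level(level, num_levels):
    """Quadtree-style HTM layout: level v holds 2**(2*(num_levels - v))
    nodes, so each level has a quarter of the nodes of the level below
    and the top level (v = num_levels) is a single classification node."""
    return 2 ** (2 * (num_levels - level))

# A 3-level network: 16 nodes at level 1, 4 at level 2, 1 at the top.
print([nodes_at_level(v, 3) for v in (1, 2, 3)])  # → [16, 4, 1]
```

The single top node is consistent with Section III, where the node at the top of the hierarchy is the classification one.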

Every single computational node undergoes two specific operation mechanisms. The first one comprises the training mode of the spatial module and the formation of the correlation one, whilst the second covers the inference mechanism, where the node produces outputs to be fed into the higher nodes.

1) Spatial Module: The input to the nodes of the first level is the region inside the bounding box containing the most salient features, as it has resulted from the GBVS algorithm. The spatial module has to learn a representative subset of these maps as they are presented as input vectors in the receptive field of the network. The stored input vectors are the quantization centers that encode the knowledge of the network.


It is imperative that these centers are carefully selected to ensure that the spatial module will be able to learn a finite space of quantization centers from an infinite number of input vectors.

In the first step of the learning procedure, the initial input vector is considered as a quantization center at the respective node. Assuming that the learned quantization space in the spatial module of a node is L = [l_1, l_2, ..., l_N]^T, where l_i corresponds to a quantization center and N is the number of existing centers, all the Euclidean distances d between these centers are calculated and their sum S is then computed:

S = (1/2) * Σ_{i=1}^{N} Σ_{j=1}^{N} d(l_i, l_j)    (2)

As a new input vector l_c appears in the receptive field of the node, all the distances d between the existing centers L and the new input vector l_c are computed. The sum S_c is then calculated:

S_c = S + Σ_{i=1}^{N} d(l_i, l_c)    (3)

The value of S represents the scatter of the existing quantization centers in the node. A new input vector should be added to the node only when the within scatter S_c is much greater than the previous one. This approximation ensures that input vectors that contain substantial information will be pooled in the spatial module, whereas those that do not contain representative information will be discarded. Therefore, for each new input vector the relative alteration alt = (S_c − S)/S between S and S_c is examined against a threshold T. If alt > T, then the query input vector becomes a new quantization center; otherwise, the next input vector is examined.
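The pooling rule of Eqs. (2) and (3) can be sketched as follows. This is a sketch, not the authors' implementation: seeding the node with the first input vector and the handling of the S = 0 case are assumptions, and the alteration is written as (S_c − S)/S so that it is positive when the candidate increases the scatter:

```python
import numpy as np

def pool_centers(vectors, threshold):
    """Scatter-based quantization-center pooling: a candidate vector
    becomes a new center only if it increases the within-node scatter S
    (Eq. (2)) by a relative amount greater than `threshold`."""
    centers = [np.asarray(vectors[0], dtype=float)]  # first input seeds the node
    S = 0.0  # running pairwise scatter of the stored centers
    for v in vectors[1:]:
        v = np.asarray(v, dtype=float)
        dists = [np.linalg.norm(v - c) for c in centers]
        Sc = S + sum(dists)                      # Eq. (3): scatter with the candidate
        alt = (Sc - S) / S if S > 0 else float('inf')
        if alt > threshold:                      # substantial new information: pool it
            centers.append(v)
            S = Sc
        # otherwise the candidate is discarded as redundant
    return centers
```

With a strict threshold, near-duplicates of stored centers barely change the scatter and are rejected, while vectors far from all existing centers are pooled.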

In the example presented in Fig. 3, the ability of the spatial module to learn a given three-dimensional spherical space is examined. Every point belonging to the 3D sphere is considered as a 3D vector. These vectors are presented randomly in time, and the spatial module should add to its node those quantization centers that characterize the input space in the most representative way. The control frequency was 1 check every 15 examined quantization centers. For the pooling of new quantization centers in the spatial module, strict thresholds were utilized. The resulting learned space is uniform, and the stored quantization centers in Fig. 3(b) sufficiently represent the whole area of the input space. In this example the proposed learning procedure converges in a few iterations, as exhibited in Fig. 3(c), where the center-adding rate of the proposed method is indicated by the continuous line.

2) Correlation Module: Robust object recognition and categorization systems have to be trained from a queue of images that rarely share a large amount of common information and do not always retain time proximity. Therefore, instead of the time adjacency matrix T proposed in [14], this work


Fig. 3. (a) The spherical input space presented in the receptive field of the node, (b) a mapping of the learned quantization centers, (c) the learning rate of the proposed method.


Fig. 4. (a) The correlation matrix between the quantization centers in a node and (b) the thresholded correlation matrix with a threshold of 0.8.

exploits a correlation coefficient matrix R. The adopted measure of correlation is Pearson's coefficient, which obtains values in [-1, 1]. The correlation matrix is an N by N matrix containing the Pearson correlation coefficients between all possible pairs of quantization centers. The larger the absolute value of the correlation, the stronger the association between the two variables and the more accurate the prediction of either variable from the other. R is a symmetric matrix, and its diagonal elements equal 1, as any quantization center is fully correlated with itself. The resulting correlation matrix (Fig. 4(a)) is thresholded and only correlation values greater than 0.8 are kept (Fig. 4(b)). The next step of the proposed HTM topology comprises the partitioning of the correlation matrix into coherent subgroups. Each subgroup includes those quantization centers that share great coherence and, therefore, the resulting subgroups are utilized in the inference mode in order to form input vectors for the nodes that lie in the upper level.
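The correlation module can be sketched as follows. The grouping step here is a simple connected-components pass over the thresholded matrix; the paper does not specify its partitioning algorithm, so that part is an illustrative choice:

```python
import numpy as np

def correlation_groups(centers, threshold=0.8):
    """Pearson correlation matrix R over the stored quantization centers,
    thresholded as in Fig. 4; returns R, the boolean thresholded matrix,
    and a group label per center (connected components of the latter)."""
    X = np.asarray(centers, dtype=float)   # one row per quantization center
    R = np.corrcoef(X)                     # N x N Pearson coefficients in [-1, 1]
    keep = np.abs(R) > threshold           # retain only strong associations
    n = len(X)
    group = [-1] * n
    g = 0
    for i in range(n):                     # flood-fill connected components
        if group[i] == -1:
            stack = [i]
            while stack:
                j = stack.pop()
                if group[j] == -1:
                    group[j] = g
                    stack.extend(k for k in range(n)
                                 if keep[j, k] and group[k] == -1)
            g += 1
    return R, keep, group
```

Each resulting group plays the role of a coherent subgroup whose activations form the input vectors for the nodes of the next level up.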


III. ALGORITHM ASSESSMENT

This section deals with the evaluation of the accuracy of the proposed method. At this point it should be mentioned that the node placed on top of the hierarchy is the classification one. For the experimental validation, the ETH-80 dataset was utilized, which contains 8 different classes, each one represented by 10 similar objects in 41 orientations. The receptive field of level 1 of the network is 32 by 32 pixels. The initial dimensions of the images are 256 by 256 pixels, which are modified by the isolation of their salient region. The output images of this procedure then have to be resized down to 32 by 32 pixels to comply with the requirements of the HTM network. More analytically, Fig. 5 depicts the output of the saliency map detector algorithm for one sample of every class of the utilized dataset. The first column corresponds to the reference images, the middle one corresponds to the output of the saliency detector algorithm, and the last column corresponds to the patch of pixels that will be fed as input to the first level of the HTM network.

The accuracy of the proposed method has been tested utilizing two different classifiers in the node placed at the top of the hierarchy of the HTM. The first one is a simple k-nearest neighbor (k-NN) classifier, where the most suitable value for the parameter k was selected experimentally and set equal to 5 nearest neighbors. Then, the k-NN classifier was replaced by a linear Support Vector Machine (SVM), implemented with the LIBSVM library [15]. Moreover, in the SVM, the model regularization parameter C, which penalizes large errors, was set equal to 100 in order to achieve greater separability of the data. During the learning procedure, the dataset was divided into a training and a test set corresponding to 70% and 30% of the total size of the dataset, respectively. In the validation procedure, the N-fold cross-validation method was adopted and, consequently, the per-class accuracy reported is an average over the N examined folds. In our case the parameter N was set equal to 10.
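The evaluation setup above can be sketched as follows. scikit-learn is used here purely for illustration (the paper used LIBSVM [15]), and the N-fold cross-validation loop is omitted for brevity; the 70/30 split, k = 5 and C = 100 follow the text, while X is assumed to hold one flattened 32 by 32 salient patch per row and y the class labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

def evaluate(X, y, seed=0):
    """70/30 split, then top-node classification with k-NN (k = 5)
    and a linear SVM (C = 100); returns both test accuracies."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.30, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    svm = LinearSVC(C=100).fit(X_tr, y_tr)
    return knn.score(X_te, y_te), svm.score(X_te, y_te)
```

In the actual pipeline, X would be produced by the saliency cropping and HTM feature stages described earlier, not by the raw pixels.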

The results of the proposed algorithm, along with the results of the algorithm proposed in [10], are summarized in Table I, where the classification accuracy for each class of the dataset is presented. The k-NN classifier achieved an 86.29% average classification rate over all classes, while the SVM classifier achieved 96.82%. Additionally, the per-class accuracy is consistently greater than 80% and 90% for the k-NN and the SVM classifiers, respectively, revealing that the saliency detection preprocessing step increases the performance of the HTM network as compared with the results described in [10]. In the latter, the input to the lowest level of the HTM was a log-polar mapping of the initial images, something that introduced redundant information into the learning procedure. However, in the proposed method, only the region of the image delimited by the bounding box is exploited for the training and testing of the HTM network, leading to greater classification accuracy.

Fig. 5. The first column corresponds to the reference images, the middle column presents the respective saliency map as it results from the GBVS algorithm, and the last column depicts the image patch enclosed by the bounding box around the salient region.


TABLE I
SUMMARY OF THE CLASSIFICATION RATE FOR THE DIFFERENT CLASSES

Class Type                Apple    Car      Cow      Cup      Dog      Horse    Pear     Tomato   Average
Proposed Method (k-NN)    85.98%   91.60%   84.72%   90.57%   85.45%   82.66%   86.42%   82.96%   86.29%
Proposed Method (SVM)     94.98%   99.81%   96.18%   99.18%   94.93%   96.67%   96.89%   95.92%   96.82%
Method in [10] (k-NN)     83.32%   89.51%   82.50%   89.96%   84.12%   81.23%   84.31%   80.56%   84.44%
Method in [10] (SVM)      93.35%   98.65%   94.22%   98.65%   93.31%   95.28%   94.67%   93.46%   95.20%

IV. CONCLUSIONS

This work presents a biologically inspired object recognition and classification method. It exploits a saliency map detection module, which is utilized as a preprocessing step on the input space. The main advantage of this part of the proposed algorithm is that it retains only the useful information of an image, while the noisy regions along with the background are discarded. Right after, the reduced input space is fed to an HTM network specially designed for object recognition. This network employs different rules for memorizing the input space, as it does not make any assumptions of time proximity in the input data. The performance of the proposed algorithm has also been examined by utilizing two different classifiers. The results of the experimental procedure reveal that the proposed method achieved great classification accuracy, i.e., 86.29% for the k-NN classifier. However, by exploiting a stronger classifier at the top node of the hierarchy, such as the SVM, the performance was improved, achieving a 96.82% classification rate. The proposed method can provide reliable solutions to object categorization problems, as it utilizes only the substantial information for the training of the HTM network, endowing the HTM with even greater generalization capabilities.

REFERENCES

[1] H. Kjellstrom, J. Romero, and D. Kragic, "Visual object-action recognition: Inferring object affordances from human demonstration," Computer Vision and Image Understanding, 2010.

[2] J. Hawkins and S. Blakeslee, "On intelligence: How a new understanding of the brain will lead to the creation of truly intelligent machines," Henry Holt & Company, New York, NY, 2004.

[3] J. Hawkins and D. George, "Hierarchical temporal memory: Concepts, theory and terminology," Whitepaper, Numenta Inc., 2006.

[4] Numenta, "Problems that fit HTM," Numenta, Tech. Rep., 2006.

[5] M. Bundzel and S. Hashimoto, "Object identification in dynamic images based on the memory-prediction theory of brain function," Journal of Intelligent Learning Systems and Applications, vol. 2, pp. 212–220, 2010.

[6] D. Rozado, F. Rodriguez, and P. Varona, "Optimizing hierarchical temporal memory for multivariable time series," in Artificial Neural Networks – ICANN 2010. Springer, 2010, pp. 506–518.

[7] B. Stenger, A. Thayananthan, P. Torr, and R. Cipolla, "Hand pose estimation using hierarchical detection," in Computer Vision in Human-Computer Interaction. Springer, 2004, pp. 105–116.

[8] T. Kapuscinski, "Using hierarchical temporal memory for vision-based hand shape recognition under large variations in hands rotation," Artificial Intelligence and Soft Computing, pp. 272–279, 2010.

[9] W. Melis and M. Kameyama, "A study of the different uses of colour channels for traffic sign recognition on hierarchical temporal memory," in Fourth International Conference on Innovative Computing, Information and Control (ICICIC), 2009, pp. 111–114.

[10] I. Kostavelis and A. Gasteratos, "On the optimization of hierarchical temporal memory," Pattern Recognition Letters, vol. 33, no. 5, pp. 670–676, 2012.

[11] R. Manzotti, A. Gasteratos, G. Metta, and G. Sandini, "Disparity estimation on log-polar images and vergence control," Computer Vision and Image Understanding, vol. 83, no. 2, pp. 97–117, 2001.

[12] J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," Advances in Neural Information Processing Systems, vol. 19, p. 545, 2007.

[13] X. Hou, J. Harel, and C. Koch, "Image signature: Highlighting sparse salient regions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 1, p. 194, 2012.

[14] D. George and B. Jaros, "The HTM learning algorithms," Whitepaper, Numenta Inc., 2007.

[15] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.