



Spatial Pooling as Feature Selection Method for Object Recognition

Murat Kirtay, Lorenzo Vannucci, Ugo Albanese, Alessandro Ambrosano, Egidio Falotico and Cecilia Laschi ∗

The BioRobotics Institute, Scuola Superiore Sant’Anna, Pontedera (PI), Italy.

Abstract. This paper reports our work on object recognition using the spatial pooler of Hierarchical Temporal Memory (HTM) as a method for feature selection. To perform the recognition task, we employed this pooling method to select features from the COIL-100 dataset. We benchmarked the results against state-of-the-art feature extraction methods while using different amounts of training data (from 5% to 45%). The results indicate that the proposed method is effective for object recognition with small amounts of training data, where state-of-the-art feature extraction methods show limitations.

1 Introduction

Object recognition is a crucial task in many robotic applications. A key component of an object recognition system is a feature extractor, capable of detecting useful pieces of information that can improve the actual recognition. In particular, object recognition is employed in industrial environments, where computational efficiency and storage requirements must be taken into consideration, to perform visually guided manipulation. In such cases, the image may have a very low resolution or a uniform background with no sharp local changes in contrast. This can impair some state-of-the-art feature extraction methods that rely on these image properties, such as the scale-invariant feature transform (SIFT) and speeded-up robust features (SURF) [1]. In this study, we employed a feature selection method that can work without such assumptions: the spatial pooler phase of Hierarchical Temporal Memory, a neocortically inspired machine learning algorithm [2]. This method is employed in a full object recognition pipeline that includes a multi-class support vector machine (M-SVM) [3] to perform the actual recognition starting from the extracted features. The whole process is tested on the Columbia University Image Library (COIL-100), a collection of images taken in a controlled indoor environment with an artificially created uniform background. The employed method is benchmarked against other feature extraction methods capable of dealing with such limitations: raw pixels, 3D color histograms and histograms of oriented gradients (HOG) [1]. The obtained results show that, with this method, recognition accuracy is improved even with a smaller amount of training data, proving the usefulness of the approach.

∗ This project has received funding from the European Union Horizon 2020 Research and Innovation Programme under Grant Agreement No. 720270 (HBP SGA1).

ESANN 2018 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium), 25-27 April 2018, i6doc.com publ., ISBN 978-287587047-6. Available from http://www.i6doc.com/en/.


2 Methods

The recognition pipeline consists of three distinct processing parts: preprocessing, feature selection and recognition. We note that the main purpose is to benchmark the recognition performance of feature extraction methods when using smaller amounts of training data.

2.1 Preprocessing

In this part, the images are transformed and downscaled into preprocessed vectors that serve as inputs either to the feature extraction phase or directly to the M-SVM. These steps are specific to each feature extraction method. For instance, when employing raw pixels or 3D color histograms, only normalization of the image was performed, whereas HOG requires normalization of grayscaled images and the computation of horizontal and vertical gradients over cells of a predefined pixel size. Finally, for the spatial pooler (cortical activations), the images were grayscaled and thresholded to obtain binary vectors. As the last step, the generated input vectors were randomly grouped into training, validation and test sets. The sizes of the training and validation sets were chosen to be the same (from 5% to 45%), while the remaining data were used as the test set.
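As an illustration, the binarization step for the spatial pooler input can be sketched as follows. The threshold value and the synthetic image are illustrative assumptions; the paper does not specify the exact threshold used.

```python
import numpy as np

def binarize_image(img_rgb, threshold=128):
    """Grayscale and threshold an RGB image into a flat binary vector.

    `threshold` is an illustrative value, not the one used in the paper.
    """
    # Luminance-weighted grayscale conversion.
    gray = (0.299 * img_rgb[..., 0]
            + 0.587 * img_rgb[..., 1]
            + 0.114 * img_rgb[..., 2])
    # Threshold to {0, 1} and flatten to a length-p vector (p = 32*32 = 1024).
    return (gray >= threshold).astype(np.uint8).ravel()

# Example: a synthetic 32x32 RGB image with a bright square on a dark background.
img = np.zeros((32, 32, 3), dtype=np.uint8)
img[8:24, 8:24] = 255
u = binarize_image(img)
```

The resulting vector `u` has 1024 binary entries, matching the pattern length p used by the pooler below.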

2.2 Feature Selection: Spatial Pooler Method

Fig. 1: Feature selection flow from binarized inputs to the cortical activations.

To derive features as cortical activations from a given visual pattern, we follow the formalization provided in [4], building on the neuroscientific concepts in [2] and the implementation in [5]. Constructing a spatial pooler region requires predefining a number of parameters: n, the total number of patterns (7200); p, the vectorized length of a pattern (1024); m, the total number of cortical columns (1024); and q, the number of proximal dendrite synapses per column (100). To perform spatial pooling operations, U ∈ {0, 1}^(n×p) represents the binarized input matrix, c ∈ N^(1×m) defines the indices of the constructed cortical columns, and Φ ∈ R^(m×q) holds a set of randomly initialized synaptic permanence values for each column. The connections between cortical columns and the elements of an input pattern are formed by a set of proximal synapses, Λ ∈ {0, . . . , p − 1}^(m×q). In Figure 1, the cortical columns are shown as cylinders, a row vector of U is shown as a binary input vector, and some of the proximal synapses for Λ_0 and Λ_8 are illustrated as solid and dashed lines. Additionally, the
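Under the dimensions stated above, the spatial pooler structures might be initialized as in this minimal NumPy sketch. The random input matrix stands in for the binarized COIL-100 patterns, and the uniform random initialization of the permanences is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the text: n patterns, p input bits, m columns, q synapses.
n, p, m, q = 7200, 1024, 1024, 100

# U: binarized input matrix (random here, standing in for COIL-100 inputs).
U = rng.integers(0, 2, size=(n, p), dtype=np.uint8)

# Lambda: each column's q proximal synapses point at input indices in [0, p).
Lam = rng.integers(0, p, size=(m, q))

# Phi: randomly initialized synaptic permanences in [0, 1) for each column.
Phi = rng.random((m, q))
```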


operations of the spatial pooler require defining several functions, as follows. I(ω) is an indicator function that yields 1 if ω is true and 0 otherwise. kmax(S, k) returns the k-th largest element of S, and max(v) returns the maximum value of the given vector v. A clipping function, denoted clip(M, lb, ub), returns a matrix bounded to [lb, ub], where values smaller than lb are assigned lb and values larger than ub are assigned ub. Finally, ⊙ and ⊕ indicate element-wise multiplication and sum, respectively.
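These helper functions translate directly into NumPy; the following is a non-authoritative sketch of the definitions above.

```python
import numpy as np

def indicator(cond):
    """I(w): elementwise 1 where the condition holds, 0 otherwise."""
    return np.asarray(cond, dtype=np.uint8)

def kmax(s, k):
    """Return the k-th largest element of s (k = 1 gives the maximum)."""
    return np.sort(np.asarray(s))[::-1][k - 1]

def clip(M, lb, ub):
    """Bound every entry of M to the interval [lb, ub]."""
    return np.clip(M, lb, ub)
```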

The spatial pooler computations are performed in three distinct phases: overlap, column inhibition and learning. The purpose of the overlap phase is to derive the total number of active proximal synapses (denoted α ∈ R^(1×m)) for each column, in order to decide whether a column should be active. To derive α, we count the synaptic permanence values that are above the threshold p_s = 0.5. X ∈ {0, 1}^(m×q) represents the set of inputs seen by the columns. Y is a matrix of bitmask values for the synaptic permanences, computed as Y ≡ I(Φ_i ≥ p_s) ∀i. α ∈ N^(1×m) then holds the number of active connected proximal synapses per column, α_i being the dot product of X_i and Y_i. In Figure 1, an active synapse is shown as a solid line with a circular endpoint, while a dashed line with a diamond endpoint indicates an inactive one. To encourage less frequently active columns to compete for activation, boosting values b ∈ R^(1×m) are assigned to the columns (i.e., b_i is the boosting value for column c_i). As a final step of this phase, each element of α is set to α_i b_i if α_i ≥ p_d, where p_d = 10 is the proximal dendrite activation threshold; otherwise it is set to 0. The column inhibition phase computes the set of active columns (the blue cylinders in Figure 1) while achieving a desired column activity level p_c, set to 20% of all columns. The neighborhood mask matrix H ∈ {0, 1}^(m×m) indicates that H_i contains the neighborhood columns of c_i; in particular, if H_{i,j} is 1, column c_j is assumed to be a neighbor of c_i. To determine which columns should be active after inhibition, the lower-bounded k-th largest overlap γ ∈ N^(1×m) was computed by γ_i ≡ max(kmax(H_i ⊙ α, p_c), 1) ∀i. Lastly, c ∈ {0, 1}^(1×m) is an indicator vector constructed for the active columns by c_i ≡ I(α_i ≥ γ_i) ∀i. The learning phase consists of three subphases: synaptic permanence adaptation, boosting operations and updating the inhibition radius.
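A minimal sketch of the overlap and inhibition phases, under the definitions above, might look as follows. Dimensions are scaled down for illustration, the neighborhood is taken to be global (equivalent to H being all ones), and p_d is reduced to fit the smaller q; all of these are assumptions of the sketch, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)
m, q, p = 64, 10, 256            # reduced dimensions for illustration
ps, pd = 0.5, 2                  # permanence threshold; pd scaled down from 10
k_active = max(1, int(0.2 * m))  # desired column activity: 20% of columns

u = rng.integers(0, 2, size=p, dtype=np.uint8)  # one binarized input
Lam = rng.integers(0, p, size=(m, q))           # proximal synapse indices
Phi = rng.random((m, q))                        # synaptic permanences
b = np.ones(m)                                  # boosting values (neutral here)

# Overlap phase: count active, connected synapses per column.
X = u[Lam]                         # inputs seen by each column
Y = (Phi >= ps).astype(np.uint8)   # connected synapses (permanence >= ps)
alpha = (X * Y).sum(axis=1)        # raw overlap per column
alpha = np.where(alpha >= pd, alpha * b, 0)

# Inhibition phase (global neighborhood): keep roughly the top 20% of columns.
gamma = max(np.sort(alpha)[::-1][k_active - 1], 1)
active = (alpha >= gamma).astype(np.uint8)
```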
To begin with the first subphase, the permanence matrix Φ is adjusted by Φ ≡ clip(Φ ⊕ δΦ, 0, 1), where δΦ is calculated as δΦ ≡ cᵀ ⊙ (φ₊X − (φ₋¬X)). Performing this subphase provides the adapted values for the synapses Φ_i that exist on a column c_i. We employed the values φ₊ = 0.001 and φ₋ = −0.001 for this subphase. In the next subphase, columnar- and synaptic-level boosting operations are carried out to determine how frequently a column becomes active and to boost the less frequently active columns so that they can compete. Here, η^(a) ∈ R^(1×m) denotes the set of active duty cycles for all columns, and η^(min) ∈ R^(1×m) the minimum active duty cycles for all columns, derived by η^(min)_i ≡ k_a max(H_i ⊙ η^(a)) ∀i, where k_a = 0.001 is the minimum activity level scaling factor. Based on the comparison of η^(a)_i and η^(min)_i, the boosting value of each column is determined by b_i ≡ β(η^(a)_i, η^(min)_i) ∀i. This boosting function returns 1 if η^(a)_i > η^(min)_i, returns β₀ = 10 (the maximum amount of boosting) if η^(min)_i = 0, and otherwise returns η^(a)_i (1 − β₀)/η^(min)_i + β₀.
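The boosting function β described above can be sketched as follows; the order in which the two special cases are checked is an interpretation, since the text does not fix it explicitly.

```python
def beta(eta_a, eta_min, beta0=10.0):
    """Boosting factor for one column, following the rules in the text.

    eta_a:   the column's active duty cycle
    eta_min: the minimum active duty cycle of its neighborhood
    """
    if eta_a > eta_min:
        return 1.0        # frequently active column: no boost
    if eta_min == 0:
        return beta0      # maximum amount of boosting
    # Linear interpolation between beta0 (at eta_a = 0) and 1 (at eta_a = eta_min).
    return eta_a * (1.0 - beta0) / eta_min + beta0
```

Note that the three branches join continuously: at eta_a = eta_min the last expression evaluates to exactly 1.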


Boosting of the permanence values was applied using a vector η^(o) ∈ R^(1×m), where η^(o)_i is the overlap duty cycle of each column c_i ∈ c. The permanence values were boosted by Φ ≡ clip(Φ ⊕ k_b p_s I(η^(o)_i < η^(min)_i), 0, 1), where k_b = 0.1 is the permanence boosting scaling factor. In the last subphase, the inhibition radius should be updated to adjust the average receptive field size (input space) between columns and their corresponding connected synapses; however, we prefer to use a fixed desired column activation level (p_c) so as to represent a feature vector with a constant number of active bits. To generate a feature vector, the aforementioned spatial pooler operations are performed for each input in U.
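The permanence adaptation and permanence boosting updates might be sketched as follows. The sign convention for φ₋ is interpreted so that synapses on inactive input bits are weakened, and the duty-cycle vectors are illustrative placeholders rather than values computed by the full algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
m, q = 64, 10
phi_plus, phi_minus = 0.001, -0.001  # permanence increments from the text
k_b, p_s = 0.1, 0.5                  # permanence boosting scale, threshold

Phi = rng.random((m, q))
X = rng.integers(0, 2, size=(m, q))  # inputs seen by each column
c = rng.integers(0, 2, size=m)       # active-column indicator vector

# Synaptic permanence adaptation: only active columns (c_i = 1) are updated;
# synapses on active input bits are reinforced, those on inactive bits weakened.
delta = c[:, None] * (phi_plus * X + phi_minus * (1 - X))
Phi = np.clip(Phi + delta, 0.0, 1.0)

# Permanence boosting: columns whose overlap duty cycle fell below the minimum
# duty cycle get all their permanences nudged upward by k_b * p_s.
eta_o = rng.random(m)          # overlap duty cycles (illustrative)
eta_min = np.full(m, 0.5)      # minimum duty cycles (illustrative)
Phi = np.clip(Phi + k_b * p_s * (eta_o < eta_min)[:, None], 0.0, 1.0)
```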

2.3 Recognition: Multi-Class Support Vector Machine

A multi-class support vector machine was trained to optimize the hinge loss via gradient descent [3]. The generated feature vector is denoted x_i; a bias node and a bias vector were concatenated to x_i and to the weight matrix W, respectively. Then, the score function for an object j, f(x_i, W)_j, was computed by taking the j-th column of the product of x_i and W. The loss value for the i-th feature vector is defined as L_i = Σ_{j ≠ y_i} max(0, f(x_i, W)_j − f(x_i, W)_{y_i} + Δ) + λ||W||², computed by adding a soft margin parameter Δ = 1 to the score of the actual object y_i, f(x_i, W)_{y_i}, and summing the losses over all M objects represented in the dataset, except the actual one. To avoid overfitting, we added an L2 regularization term with a hyperparameter λ. The loss calculated for each input is averaged over the dataset to get the full loss, L_full = (1/N) Σ_{i=1}^{N} L_i. To find values for the hyperparameters, such as the learning rate for gradient descent and λ, cross-validation was employed for model selection, with an early stopping criterion. In particular, we stopped the training if the accuracy rate on the validation set remained below a threshold of 0.01 for 20 iterations. The best model (with its hyperparameters) was then used to recognize the objects in the test set images. To further analyze the recognition performance of each feature extraction method, the recognition procedure was performed with different training set sizes, and different recognition metrics, such as average accuracy and F1 scores, were computed.
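The per-sample hinge loss and its dataset average, as defined above, can be sketched as follows; shapes and values are illustrative, and this is not the authors' implementation.

```python
import numpy as np

def msvm_loss(X, y, W, delta=1.0, lam=1e-3):
    """Average multi-class hinge loss with L2 regularization.

    X: (N, D) feature vectors (bias column assumed already appended)
    y: (N,) integer object labels
    W: (D, M) weight matrix, one column of scores per object class
    """
    scores = X @ W                          # (N, M) class scores f(x_i, W)_j
    correct = scores[np.arange(len(y)), y]  # score of the true class
    margins = np.maximum(0.0, scores - correct[:, None] + delta)
    margins[np.arange(len(y)), y] = 0.0     # exclude j = y_i from the sum
    return margins.sum(axis=1).mean() + lam * np.sum(W * W)

# Tiny usage example with illustrative shapes.
rng = np.random.default_rng(3)
X = rng.random((8, 5))
y = rng.integers(0, 3, size=8)
W = rng.random((5, 3))
loss = msvm_loss(X, y, W)
```

When every true class outscores the others by at least the margin, the data term vanishes and only the regularization term remains.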

3 Dataset description and reproducibility of the study

The object recognition task was carried out on the Columbia University Image Library (COIL-100) [6]. The dataset consists of 100 objects with 72 color images each (7200 in total), captured by rotating each object in steps of 5 degrees on a motorized turntable. Note that several preprocessing operations were performed on the images, such as cropping the objects from the background and resizing them to 128 × 128. The images were further downscaled to 32 × 32 for our recognition pipeline. The related source code for this study can be found in the repository named ESANN2018¹.

¹ www.github.com/muratkirtay/ESANN2018


4 Results

Fig. 2: Accuracy rates for different training set sizes.

This section provides the results for the object recognition performed by the proposed pipeline with the four different feature vectors. The results were benchmarked by increasing the training set size in steps of 5% (from 5% to 45%) and using a validation set of the same size as the training set. We then used the remaining data as the test set to obtain recognition performance metrics, including accuracy, precision, recall and F1 scores. The accuracies on the test sets at different training percentages are illustrated in Figure 2, where cortical activations, raw pixels, histograms of oriented gradients and 3D color histograms can be compared. By comparing the trends in these curves, we observe that using cortical activations as input yielded better accuracy rates than the other feature vectors when employing less training data. For instance, with a training percentage of 5%, the recognition accuracy was 82.0454% for cortical activations, 63.3939% for raw pixels, 56.5909% for HOG and 60.3787% for 3D histograms. To further analyze the results we constructed

(a) CA, F1: 0.9060 (b) Pixels, F1: 0.7613 (c) HOGs, F1: 0.6903 (d) 3D Hists, F1: 0.774

Fig. 3: Confusion matrices and average F1 scores for cortical activations (CA), raw pixels (Pixels), histograms of oriented gradients (HOGs) and 3D color histograms (3D Hists).

confusion matrices as heat maps for each feature vector at a 10% training size. These matrices are depicted in Figure 3; the actual object IDs and the recognized object IDs are presented, respectively, as the rows and the columns of the matrices. The figures can be interpreted as follows: a densely populated main diagonal indicates success in recognition, while a sparsely populated matrix denotes lower performance. In these visualizations, it can be observed that the elements in the matrices of Figures 3(b), 3(c) and 3(d) are more sparse. Conversely, in the case of cortical activations in Figure 3(a), the elements tend


to be more concentrated on the main diagonal. To quantify this observation, we report the average F1 scores, computed as the harmonic mean of precision and recall. Consistent with the confusion matrices, the F1 scores obtained for the recognition algorithms follow the same trend (CA_F1 > 3DHist_F1 > Pixel_F1 > HOG_F1) that we previously observed in the accuracy values (CA_accuracy > 3DHist_accuracy > Pixel_accuracy > HOG_accuracy).
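The average F1 score used here can be computed from a confusion matrix as in this sketch; the macro (per-class) averaging scheme is an assumption, since the paper reports only the averaged values.

```python
import numpy as np

def average_f1(conf):
    """Macro-averaged F1 from a confusion matrix (rows: actual, cols: predicted)."""
    tp = np.diag(conf).astype(float)
    precision = tp / np.maximum(conf.sum(axis=0), 1)   # per-class precision
    recall = tp / np.maximum(conf.sum(axis=1), 1)      # per-class recall
    denom = np.maximum(precision + recall, 1e-12)      # guard against 0/0
    f1 = 2 * precision * recall / denom                # harmonic mean
    return f1.mean()

# A perfectly diagonal confusion matrix gives an average F1 of 1.0.
conf = np.diag([10, 10, 10])
```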

5 Conclusions

The use of spatially pooled features (cortical activations) gives rise to a significant improvement in recognition accuracy while employing less training data, compared to the other feature extraction algorithms considered. Moreover, the performed feature selection is capable of working with datasets composed of images with low resolution and few changes in contrast. Our ongoing studies therefore aim to apply this method to a dataset constructed with the iCub humanoid robot for recognition and grasping tasks, where environmental changes are controlled and the state-of-the-art methods show limitations or poor performance [7]. The generality of the computational and representational aspects of the neocortically inspired spatial pooling could be further exploited to develop a hierarchical framework, in a realistic simulation platform [8], which allows for multimodal (e.g., vision and tactile) sensory representations of the objects for recognition.

References

[1] Richard Szeliski. Computer Vision: Algorithms and Applications. Springer-Verlag New York, Inc., New York, NY, USA, 1st edition, 2010.

[2] Jeff Hawkins and Subutai Ahmad. Why neurons have thousands of synapses, a theory of sequence memory in neocortex. Frontiers in Neural Circuits, 10:23, 2016.

[3] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. Proceedings of the 7th European Symposium on Artificial Neural Networks (ESANN-99), pages 219–224, April 1999.

[4] James Mnatzaganian, Ernest Fokoue, and Dhireesha Kudithipudi. A mathematical formalization of hierarchical temporal memory's spatial pooler. Frontiers in Robotics and AI, 3:81, 2017.

[5] Murat Kirtay, Egidio Falotico, Alessandro Ambrosano, Ugo Albanese, Lorenzo Vannucci, and Cecilia Laschi. Visual target sequence prediction via hierarchical temporal memory implemented on the iCub robot, pages 119–130. Springer International Publishing, Cham, 2016.

[6] Sameer A. Nene, Shree K. Nayar, and Hiroshi Murase. Columbia Object Image Library (COIL-100). Technical report, 1996.

[7] M. Kirtay, U. Albanese, L. Vannucci, E. Falotico, and C. Laschi. A graspable object dataset constructed with the iCub humanoid robot's camera. Technical report, 2017.

[8] Egidio Falotico et al. Connecting artificial brains to robots in a comprehensive simulation framework: The Neurorobotics Platform. Frontiers in Neurorobotics, 11:2, 2017.
