
Classification of Individually Pleasant Images Based on Neural Networks with the Bag of Features

Koji Kashihara
Institute of Technology and Science, The University of Tokushima
2-1 Minamijyousanjima, Tokushima 770-8506, Japan
[email protected]

Abstract—It is important to determine the correct semantic categories for the classification of photographic scenes. In particular, it is difficult to categorize emotional pictures that include individually pleasant or unpleasant content. Therefore, a method of searching for individually emotional information in various pictures was investigated using neural networks with the bag of features scheme. The neural network classifier for emotional pictures performed a partially accurate estimation; however, there were some cases in which the bag of features scheme, based on local features, mistakenly selected similar images from a different semantic category. Further robust searching methods for individually emotional categorization must be considered.

Keywords: emotional pictures; neural networks; bag of features; object recognition

I. INTRODUCTION

Computer vision (CV) technology has developed rapidly with the increase of digitized contents and materials. Generic object recognition and image categorization are especially important themes in CV [1]. Whereas specific object recognition identifies strictly defined targets, generic object recognition categorizes visual object classes and labels them. Thus, generic object recognition requires grouping across semantic categories in general scenes and/or objects, that is, a wide-ranging categorization of images. However, it is hard to extract and define image features for all objects.

Analytical methods that express image features and categorize visual images have enabled the recognition of generic objects with high accuracy. For example, the speeded-up robust features (SURF) and the scale-invariant feature transform can efficiently explore local information around object boundaries [2]; they have been applied for feature detection in target objects. The bag of features (BoF) scheme has also been demonstrated to be an effective representation of image feature distributions [1]. To detect individually emotional images with high accuracy, these feature detection techniques may be combined with machine learning methods such as support vector machines, boosting algorithms, and neural networks (NN) [3], [4].

It is important to consider efficient methods for the correct semantic categorization of photographic scenes (e.g., the judgment of social scenes or situations); however, such methods have not been established. This problem also applies to the categorization of individually emotional images. It is hard for CV to recognize emotional situations because of individual differences in visual scenes. Consequently, CV may misread target images and categorize them incorrectly. To overcome such an occurrence, a hybrid method based on learning algorithms with the BoF scheme might be able to accurately classify emotional pictures with individual differences.

The classification of individually emotional scenes is a crucial task in CV. Accordingly, this study aims to estimate the rating of individually emotional pictures, using the NN classifier based on the BoF scheme. The correct classification rate was evaluated in individually emotional images quantified by rating scores.

II. IMAGE FEATURE EXTRACTION

A. Picture dataset

Emotional images (8-bit grayscale bitmaps) were selected from the International Affective Picture System (IAPS) [5]. The mean and distribution of the luminance of the images (360 × 270 pixels) were adjusted among the images. The emotional images were rated by eight volunteers (age 28.8 ± 4.7 years) on a 1–7 scale (1: extremely negative; 7: extremely positive). Paired t-tests showed significant differences among the picture conditions (p < 0.01 for each comparison): the rating scores were 4.0 ± 0.1, 4.8 ± 0.5, and 2.6 ± 0.4 for the neutral, pleasant, and unpleasant images, respectively.

B. Analysis methods

Figure 1 shows the scheme of the BoF and the NN learning algorithm for emotional rating scores. A picture dataset with three emotional categories (neutral, pleasant, and unpleasant; 60 images in each) was set for the assessment of the proposed system. The BoF, based on the idea of the bag of words scheme for the categorization of textual data, was applied for searching emotional images [1]. The following steps were performed to identify individually emotional images:

1) The SURF technique was used for the extraction of the descriptor as an image feature. SURF is a robust feature descriptor indicating the distribution of the pixel intensities within a scale-dependent neighborhood of each interest point. To increase robustness and decrease the computation time, the Haar wavelet was adopted as a simple filter [2].

2) Visual words were created by the k-means algorithm to cluster the feature vectors of SURF (Fig. 2). The k-means clustering was selected as the simplest square-error partitioning method. This algorithm was iterated by the assignment of points to their closest cluster centers and the recomputation of the cluster centers [1].

3) The BoF was constructed by counting the number of patches assigned to each cluster. All images were represented by histograms of visual words, which were used as image features of an NN classifier for the judgment of photographic categories (a code sketch of steps 1–3 is given after this list).
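As a concrete illustration of steps 1) to 3), the following Python sketch shows one way the SURF extraction, the k-means code book, and the BoF histograms could be implemented with OpenCV (contrib build) and scikit-learn. The function names (surf_descriptors, build_vocabulary, bof_histogram), the Hessian threshold, and the default vocabulary size of 500 words (matching the 500 NN input units reported in Section III) are illustrative assumptions, not the implementation used in this study.

# Sketch of the BoF pipeline (steps 1-3). SURF is patented and only
# available in opencv-contrib builds via cv2.xfeatures2d.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def surf_descriptors(image_path, hessian_threshold=400):
    # Step 1: detect interest points and extract SURF descriptors.
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=hessian_threshold)
    _, descriptors = surf.detectAndCompute(img, None)
    return descriptors  # (num_keypoints, 64) array, or None if no keypoints

def build_vocabulary(image_paths, n_words=500, seed=0):
    # Step 2: cluster all training descriptors with k-means; the cluster
    # centers act as the visual words (code book).
    all_desc = []
    for path in image_paths:
        desc = surf_descriptors(path)
        if desc is not None:
            all_desc.append(desc)
    return KMeans(n_clusters=n_words, random_state=seed).fit(np.vstack(all_desc))

def bof_histogram(image_path, vocabulary):
    # Step 3: assign each patch to its closest visual word and count them,
    # giving the normalized histogram used as the NN input.
    desc = surf_descriptors(image_path)
    if desc is None:
        return np.zeros(vocabulary.n_clusters)
    labels = vocabulary.predict(desc)
    hist = np.bincount(labels, minlength=vocabulary.n_clusters).astype(float)
    return hist / hist.sum()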

[Figure 1 shows the processing pipeline: (1) image feature detection and description using SURF; (2) code book generation by k-means clustering (visual words); (3) construction of the BoF (histograms of visual words used as image features); (4) training of the learning algorithm (NN classifier) on training data with the actual rating scores; (5) validation on test data to estimate the emotional rating of unknown images.]

Figure 1. Scheme of the BoF and the NN learning algorithm for emotional rating scores.

Figure 2. Examples of image features extracted with the SURF technique. Red circles show the extracted features.

III. EVALUATION OF NN CLASSIFIER

A. NN classifier

The NN classifier was applied to categorize emotional image scenes. A multilayer feed-forward NN with a hidden layer emulated nonlinear responses, and outputs (Y_NN) were predicted through the NN. The error signal was propagated back through the network, thereby modifying the weights before the next input presentation. The back-propagation algorithm was performed in the following order: output layer, then first hidden layer. All connection weights were adjusted to decrease the error function by the back-propagation learning rule based on the gradient descent method [6].

The error function (E) is expressed as E = (Y_M - Y_NN)^2 / 2, where Y_M is the model response as a supervised signal and Y_NN is the predicted response by the NN before the update of the connection weights. The error between Y_M and Y_NN is propagated back through the network. The connection weight is generally updated by a gradient descent of E as a function of the weights: w* = w + K_n Δw, where w* is the single weight of each connection after updating, w is the single weight of each connection before updating, Δw is the modified weight, and K_n is the learning rate.
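To make the update rule concrete, the following Python sketch implements back-propagation for a single-hidden-layer network with sigmoid units, using the squared-error cost E = (Y_M - Y_NN)^2 / 2 and the weight update w* = w + K_n Δw described above. The class name, the omission of bias terms, and the sigmoid activation are simplifying assumptions for illustration, not the exact network configuration used here.

# Minimal back-propagation sketch (single hidden layer, sigmoid units,
# no bias terms for brevity). Names such as SimpleMLP are illustrative.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SimpleMLP:
    def __init__(self, n_in, n_hidden, n_out, learning_rate=0.2, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))  # input -> hidden
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_out))  # hidden -> output
        self.kn = learning_rate                            # K_n in the text

    def forward(self, x):
        self.h = sigmoid(x @ self.W1)       # hidden activations
        self.y = sigmoid(self.h @ self.W2)  # Y_NN, the network output
        return self.y

    def train_step(self, x, y_m):
        # E = (Y_M - Y_NN)^2 / 2, so dE/dY_NN = -(Y_M - Y_NN).
        self.forward(x)
        err = self.y - y_m
        delta_out = err * self.y * (1.0 - self.y)                      # output layer
        delta_hid = (delta_out @ self.W2.T) * self.h * (1.0 - self.h)  # hidden layer
        # Gradient descent: w* = w + K_n * dw, with dw = -dE/dw.
        self.W2 -= self.kn * np.outer(self.h, delta_out)
        self.W1 -= self.kn * np.outer(x, delta_hid)
        return 0.5 * float(np.sum((y_m - self.y) ** 2))  # current value of E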

B. General categorization of emotional pictures

The validation of the NN classifier was first performed to judge the three general types of categorized data (neutral, pleasant, and unpleasant) from emotional pictures. The inputs to the NN classifier were the values of a histogram of visual words acquired from the BoF scheme in emotional images.

The NN had 500 input units and 3 output units, whose values were normalized numbers (0–1). The accuracy was evaluated by the degree of coincidence between the emotional categories of the previous IAPS studies [5] and the NN outputs. The NN output with the largest value was determined to be the emotional category of the evaluated picture. The number of units (n) in the hidden layer was set to n = 10 by trial and error. The initial weights of the NN were given at random, and K_n (the learning rate) was set at 0.2. Fifteen pictures were used for the learning phase, and another fifteen pictures were applied for the validation of the NN classifier. The absolute error between Y_M and Y_NN from the trained NN (100,000 times) was less than 0.01.
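As a hedged sketch of this three-category experiment, the code below uses scikit-learn's MLPClassifier as a stand-in for the custom back-propagation network (500 BoF inputs, 10 hidden units, 3 output categories, winner-take-all decision on the largest output). The variable names, the train/test arrays, and the use of scikit-learn are assumptions for illustration; the original study used its own NN implementation.

# Sketch of the general-category classification, assuming 500-bin BoF
# histograms as inputs; labels: 0 = neutral, 1 = pleasant, 2 = unpleasant.
import numpy as np
from sklearn.neural_network import MLPClassifier

def run_general_classification(train_hists, train_labels, test_hists, test_labels):
    clf = MLPClassifier(hidden_layer_sizes=(10,),  # n = 10 hidden units
                        activation="logistic",     # sigmoid units
                        learning_rate_init=0.2,    # K_n = 0.2
                        max_iter=100_000,
                        random_state=0)
    clf.fit(train_hists, train_labels)
    # predict() picks the class of the output unit with the largest value,
    # which corresponds to the winner-take-all decision in the text.
    predictions = clf.predict(test_hists)
    accuracy = float(np.mean(predictions == test_labels))
    return predictions, accuracy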

The NN classifier was first applied for the judgment of the general emotional categories from the histogram elements of visual words. The overall accuracy of categorizing neutral (80%), pleasant (10%), and unpleasant (60%) pictures was 53.3%. Compared with the other categories, the classification of the pleasant pictures resulted in the lowest accuracy.

C. Estimation of individual ratings

The number of input units (m = 500) to an NN component with a single output and the number of units (n = 100) in the hidden layer were set for the estimation of individual ratings. The initial weights of the NN were given at random, and K_n (the learning rate) was set at 0.2. To express the NN outputs, the rating scores on a 1–7 scale were temporarily normalized between 0 and 1. The average of the absolute error between Y_M and Y_NN from the trained NN (100,000 times) was less than 0.01. The normalized score during the NN calculation was returned to the actual scale (1–7). If the output estimated by the NN was within ±1 of the individual rating score, it was regarded as a correct answer.
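A minimal sketch of the normalization, rescaling, and ±1 tolerance criterion is given below; the function names and the array-based interface are illustrative assumptions.

# Sketch of the 1-7 rating normalization and the +/-1 tolerance criterion.
import numpy as np

def to_unit_range(score, lo=1.0, hi=7.0):
    # Map a 1-7 rating to the 0-1 range used as the NN training target.
    return (score - lo) / (hi - lo)

def to_rating_scale(output, lo=1.0, hi=7.0):
    # Map a 0-1 NN output back to the actual 1-7 rating scale.
    return lo + output * (hi - lo)

def tolerance_accuracy(nn_outputs, individual_scores, tol=1.0):
    # Fraction of estimates within +/-1 of the individual rating score.
    estimates = to_rating_scale(np.asarray(nn_outputs, dtype=float))
    targets = np.asarray(individual_scores, dtype=float)
    return float(np.mean(np.abs(estimates - targets) <= tol))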

The final accuracy for the individual categorization was 60.8 ± 0.17%. The highest accuracy among all participants was 90%, and the lowest was 40%. Figure 3 shows typical examples of similar pictures detected with high accuracy in the pleasant picture category. In the accurate cases, most of the pictures with similar ratings tended to contain similar objects and situations.

Figure 3. Typical examples detected as having similar pleasant ratings between the individual score and the one estimated by the NN: a horse (left) and babies (right).

IV. DISCUSSION

Although humans can recognize the correct emotional situation quickly and accurately, the NN classifier sometimes produced an incorrect response because of high similarity between image features under the BoF scheme. This result suggests that the understanding of emotional scenes requires the correct judgment of the semantic image contents. In addition, unpleasant images tend to induce common feelings in most people. In contrast, pleasant images may involve individual differences, which makes their extraction difficult.

The parameters and structure of the NN for this experiment were determined by trial and error and were fixed during the evaluation of the NN classifier. However, those factors may strongly affect the final accuracy of the NN classifier. Therefore, the optimal adjustment of variables such as the learning rate and the hidden layers could improve the results of the NN classifier. The analysis of text-based information on emotional images may also refine the estimation of individual rating scores, although it is difficult to apply such analysis to all images in databases. Support vector machines, as well as the NN, may be suited to this problem. Further effective methods for the correct categorization or rating of emotional images will be required as the next step for CV.

V. CONCLUSION

The NN classifier with the BoF scheme was applied for the categorization of individually emotional images, resulting in partially high accuracy. The BoF scheme based on local image features sometimes detected similar objects or humans in images regardless of different emotional ratings. More effective methods of semantic categorization must be developed in future studies. In addition, searching for similar pictures in an emotional database and classifying them might be effective for the construction of a human-computer interface that reflects individual preferences [7].

ACKNOWLEDGMENT

This study was partially funded by a Grant-in-Aid for Young Scientists (B) from the Ministry of Education, Culture, Sports, Science and Technology of Japan (KAKENHI, 22700466).

REFERENCES

[1] G. Csurka, C. R. Dance, L. Fan, J. Williamowski, and C. Bray, “Visual categorization with bags of keypoints,” Proceedings of the IEEE Workshop on Statistical Learning in Computer Vision (SLCV’04), pp. 1-16, 2004.

[2] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool, “Speeded-Up Robust Features (SURF),” Computer Vision and Image Understanding, vol. 110, pp. 346-359, 2008.

[3] N. Cristianini and J. Shawe-Taylor, “Introduction to support vector machines and other kernel-based learning methods,” Cambridge University Press, Cambridge, 2000.

[4] M. Rosenblum, Y. Yacoob, and L. S. Davis, “Human expression recognition from motion using a radial basis function network architecture,” IEEE Transactions on Neural Networks, vol. 7, pp. 1121–1138, 1996.

[5] P. J. Lang, M. M. Bradley, and B. N. Cuthbert, “International Affective Picture System (IAPS): Affective ratings of pictures and instruction manual (Technical Report A-6),” Gainesville: University of Florida, 2005.

[6] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” In: J. L. McClelland, D. E. Rumelhart, and The PDP research group (eds), Parallel Distributed Processing. vol. 1, MIT Press, Cambridge, 1986.

[7] K. Kashihara, M. Ito, and M. Fukumi, “Development of automatic filtering system for individually unpleasant data detected by pupil-size change,” Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pp. 3311-3316, 2011.
