
A MULTI-SUBNET DEEP MODEL FOR PEDESTRIAN ATTRIBUTE RECOGNITION

Hui Hao¹, Guiguang Ding¹, Qiang Liu¹, Sheng Tang²
¹ Software Institute of Tsinghua University
² Institute of Computing Technology, Chinese Academy of Sciences

ABSTRACT

In recent years, pedestrian attribute recognition has played an important role in more and more applications. Unlike recent methods, we first divide pedestrian attributes into groups according to their spatial distribution. Then, based on convolutional neural networks, all groups of attributes share the low-level features, while the high-level features are trained separately for each group. Finally, considering the potential relationships among attributes, we add a fully connected layer of attribute auxiliary nodes that exploits the constraints among attributes. This network structure does not require training an individual model for each attribute. The grouping method makes use of the localization characteristics of attributes, and the attribute auxiliary layer takes advantage of the constraints among them. Experiments on two standard pedestrian attribute datasets show that our model achieves higher recognition accuracy.

Index Terms— pedestrian attribute recognition, multi-subnet model, deep learning

1. INTRODUCTION

With the popularity of camera networks in public places, video surveillance systems serve security defense, criminal investigation, and other fields. Monitoring video manually is not only time-consuming but also error-prone. Pedestrian attribute recognition is an indispensable part of intelligent surveillance video analysis, as it extracts the attribute characteristics of pedestrian images. For pedestrian attribute recognition, the input is an image of a person and the task is to predict a set of attributes, as shown in Figure 1. Unlike other recognition problems, attributes are semantic, binary or multi-valued labels, such as male, wearing a shirt, or wearing blue pants. Such attribute predictions provide useful auxiliary information for image retrieval [1, 2, 3], object tracking [4], person re-identification [5], and robotic applications that require semantic information. Therefore, pedestrian attribute recognition has great potential in practical applications. However, in many surveillance scenarios, it is often difficult to obtain a clear face image or detailed views of the pedestrian's body. In addition, varying illumination conditions and human poses make pedestrian attribute recognition challenging.

There are several main challenges in pedestrian attribute recognition. First, there are large inter-class and intra-class variations: different people may have similar appearance, while the same person may look different under different illumination conditions and camera views. Second, pedestrian attributes have complex localization characteristics, meaning that some attributes are usually recognized in relatively fixed local body areas. For example, wearing a jacket is most relevant to the upper-body area, and the style of shoes is judged mainly from the area around the feet. It is therefore important to capture the relationship between body parts and specific attributes. Third, human beings can accomplish the attribute recognition task quickly and efficiently because they combine information from different body parts and from other attributes. For example, age can be inferred from clothing style, hair color, and body shape, and someone with long hair is more likely to be a woman, so hair length helps to determine gender. In the attribute recognition task, we cannot ignore the potential correlations among human attributes.

Previous pedestrian attribute recognition algorithms usually extract hand-crafted features from the whole pedestrian image and train attribute classifiers with SVMs [6, 7, 8, 9, 10]. However, hand-crafted features cannot address the above challenges well, because they have limited representation ability for large

Fig 1. For the pedestrian attribute recognition task, the input is an image of a person and the goal is to predict a set of attributes.


inter-class and intra-class variations, and independent SVM classifiers cannot capture the potential relationships among attributes. Inspired by the outstanding performance of convolutional neural networks on traditional computer vision tasks, we propose a multi-subnet deep model (MSDM), which divides attributes into groups according to their spatial distribution and learns high-level features separately for each group. The contributions of our work are summarized as follows: 1) instead of simply training one model on all attributes together, we use a grouping method to construct subnets, which makes full use of the localization characteristics of attributes; 2) we add a fully connected layer of attribute auxiliary nodes that exploits the constraints among attributes to obtain more accurate results.

2. RELATED WORK

2.1. Pedestrian Attribute Recognition

Early pedestrian attribute recognition algorithms [11, 12, 13] treat all attributes separately, training an independent classifier for each attribute. These methods usually extract low-level image features, such as color histograms and texture features, and then train an individual SVM classifier for each attribute. As a result, they cannot make good use of the relationships among attributes. In later work, many deep learning based methods have been proposed to address this problem. Sudowe et al. proposed the ACN model [14], which casts pedestrian attribute recognition as a multi-label classification problem: the input is an image and all attribute predictions are obtained together from a single model. This idea is also used in the multi-subnet deep model of our work. Another convolutional neural network based method is the DeepMAR model proposed by Li et al. [15], which also outputs all attributes together. In addition, they propose a weighted sigmoid cross-entropy loss function to deal with the imbalanced distribution of attributes (a sketch is given at the end of this subsection). There is also another popular line of work that uses body-part information for pedestrian attribute recognition. Bourdev et al. proposed an attribute recognition algorithm [13] based on poselets [16], which are typical human body-part configurations. The algorithm uses a multi-layer classifier with three levels: the poselet level, the person level, and the context level. Models based on human parts, such as poselets and DPM [17], have been shown to achieve good results on attribute recognition, but they are still limited by their low-level features. The PANDA model [18] combines body-part information with deep learning, using a pose-normalized convolutional neural network to extract regularized feature representations associated with human posture. In our work, we do not divide pedestrian images into parts; instead, we divide all attributes into groups and learn the relationship between attributes and body parts automatically.
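For reference, a weighted sigmoid cross-entropy of the general kind used in [15] can be sketched as follows; the particular weighting (inverse class frequency per attribute) is an illustrative choice on our part, not necessarily the exact scheme of DeepMAR.

```python
import torch

def weighted_sigmoid_cross_entropy(logits, labels, pos_ratio, eps=1e-7):
    """Weighted sigmoid cross-entropy for multi-label attribute prediction.

    logits, labels: (batch, num_attrs) tensors; labels are 0/1.
    pos_ratio: (num_attrs,) tensor with the fraction of positive training
               samples for each attribute.
    Rarer classes get larger weights; this weighting is one common choice,
    not necessarily the exact formulation of DeepMAR [15].
    """
    probs = torch.sigmoid(logits)
    w_pos = 1.0 / (pos_ratio + eps)        # up-weight positives of rare attributes
    w_neg = 1.0 / (1.0 - pos_ratio + eps)  # up-weight negatives of very common attributes
    loss = -(w_pos * labels * torch.log(probs + eps)
             + w_neg * (1.0 - labels) * torch.log(1.0 - probs + eps))
    return loss.mean()
```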

Fig 2. An illustration of the architecture of our network.


2.2. Convolutional Neural Network

Deep learning has made significant progress toward solving problems in artificial intelligence, and it has proven very effective at discovering the intricate structure of high-dimensional data, making it applicable to many fields such as science, business, and government. The convolutional neural network (CNN) [19], proposed by LeCun et al., is the most widely used deep learning method in computer vision; it was originally applied to optical character recognition [20] and subsequently to generic object recognition tasks [21]. With the advent of widely available labeled data and increasing hardware computing power, convolutional neural networks have become the most accurate approach for generic object classification [22] and pedestrian detection [23]. Although CNNs achieve very high accuracy on large labeled datasets, their generalization ability is limited on smaller datasets; to address this, the authors of [24, 25] use unsupervised learning methods and large amounts of unlabeled data. CNNs were originally designed for single-label image classification, but in recent years they have also been widely and successfully applied to multi-label classification.

3. MULTI-SUBNET DEEP MODEL

The multi-subnet deep model proposed in this paper aims to predict attributes that are distributed in the same body area through knowledge sharing. The basic structure of the model is shown in Figure 2. The algorithm consists of the following three steps.

3.1. Pedestrian Attribute Division

Because most pedestrian attributes are recognized in relatively fixed body areas, we divide all attributes into groups according to the physiological structure of the human body. We take the PETA dataset [26] as an example to introduce the division principle. The dataset has 35 attributes, which we divide into the following 7 groups (also captured as a simple mapping in the sketch after the list):

• Gender: male
• Age: 16-30, 31-45, 46-60, above 60
• Head and shoulders area: long hair, wearing scarf, wearing sunglasses, wearing hat, no accessories
• Upper body area: short sleeve, striped shirt, T-shirt, V-neck, plaid jacket, jacket, other shirt, with logo
• Lower body area: trousers, jeans, shorts, skirts, formal wearing, casual wearing
• Feet area: sports shoes, leather shoes, sandals, other shoes
• Carryings: backpack, messenger bag, plastic bag, other bags, no carryings
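For concreteness, the grouping above can be written down directly as a mapping from group name to attribute list; the identifier names below are our own shorthand and paraphrase the PETA labels rather than quoting them exactly.

```python
# Attribute grouping for PETA, following the 7 groups listed above.
# Group and attribute names are informal paraphrases of the PETA labels.
PETA_GROUPS = {
    "gender":         ["male"],
    "age":            ["age 16-30", "age 31-45", "age 46-60", "age above 60"],
    "head_shoulders": ["long hair", "scarf", "sunglasses", "hat", "no accessories"],
    "upper_body":     ["short sleeve", "striped shirt", "t-shirt", "v-neck",
                       "plaid jacket", "jacket", "other shirt", "logo"],
    "lower_body":     ["trousers", "jeans", "shorts", "skirts",
                       "formal wearing", "casual wearing"],
    "feet":           ["sports shoes", "leather shoes", "sandals", "other shoes"],
    "carryings":      ["backpack", "messenger bag", "plastic bag", "other bags",
                       "no carryings"],
}

# Number of attributes per group, used later to size the per-group subnets.
GROUP_SIZES = [len(attrs) for attrs in PETA_GROUPS.values()]  # [1, 4, 5, 8, 6, 4, 5]
```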

These 7 groups are manually divided according to the body area in which each attribute usually appears. Note that the division places gender, age, and carryings in three separate groups. Gender and age are relatively global attributes that cannot be predicted simply from individual body parts; they require the overall characteristics of the human body. Carryings have no fixed position, because of the different types of bags and the ways in which they are carried, so they appear in different areas of the body, as shown by the red rectangles in Figure 3.

3.2. Multi-subnet Deep Model

After obtaining the attribute groups, we construct a multi-subnet deep model based on the CaffeNet framework, as shown in Figure 2. In this model, all attributes share the low-level features of the convolutional neural network in order to learn common information, while the high-level features are learned separately for each attribute group. We use this grouping to train the parameters of the first part of the network (part one of Figure 2, labeled 1) and obtain the first-stage attribute predictions. The network input is a pedestrian image normalized to 227×227 pixels, followed by four convolutional layers whose parameters are shared by all attributes. Several subnetworks are then attached, one per attribute group, and the structure of each subnet is basically the same. The network is fine-tuned with a sigmoid cross-entropy loss. A sketch of this part of the model is given below.
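The sketch is a minimal PyTorch-style rendering of the part-one network, assuming a CaffeNet-like shared trunk; the layer sizes and subnet widths are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class MSDM(nn.Module):
    """Multi-subnet deep model, part one: a shared convolutional trunk followed
    by one small subnet per attribute group. Layer sizes are illustrative."""

    def __init__(self, group_sizes):
        super().__init__()
        # Shared low-level features: four convolutional layers used by all groups.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(6),
        )
        # High-level features: one subnet per attribute group, identical in structure.
        self.subnets = nn.ModuleList([
            nn.Sequential(
                nn.Flatten(),
                nn.Linear(256 * 6 * 6, 512), nn.ReLU(inplace=True),
                nn.Linear(512, n_attrs),
            )
            for n_attrs in group_sizes
        ])

    def forward(self, x):
        shared = self.trunk(x)                           # shared low-level features
        logits = [subnet(shared) for subnet in self.subnets]
        return torch.cat(logits, dim=1)                  # one logit per attribute

# Usage sketch: a batch of 227x227 pedestrian crops, trained with a
# sigmoid cross-entropy (binary cross-entropy with logits) loss.
model = MSDM(group_sizes=[1, 4, 5, 8, 6, 4, 5])  # group sizes from the PETA list above
images = torch.randn(8, 3, 227, 227)
labels = torch.randint(0, 2, (8, sum([1, 4, 5, 8, 6, 4, 5]))).float()
loss = nn.BCEWithLogitsLoss()(model(images), labels)
```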

Fig 3. Bags may appear in different body areas.


Table 1. Performance on the RAP and PETA datasets.

Methods      RAP                                    PETA
             mA     Acc    Pre    Rec    F1         mA     Acc    Pre    Rec    F1
DeepSAR      71.30  62.12  76.01  74.94  75.47      81.32  73.64  83.05  81.36  82.19
DeepMAR      72.49  62.31  75.89  75.56  75.72      82.39  74.07  82.68  82.24  82.46
MSDM         74.06  64.71  75.92  75.86  75.89      83.54  75.07  83.68  83.14  83.41
MSDM-Attr    75.61  64.12  76.35  76.23  76.29      84.62  76.62  82.89  85.01  83.93

Table 2. Performance on the RAP dataset with different base models.

Methods                 mA     Acc    Pre    Rec    F1
MSDM (CaffeNet)         74.06  64.71  75.92  75.86  75.89
MSDM (VGGNet)           75.32  64.93  75.42  75.91  75.66
MSDM-Attr (CaffeNet)    75.61  64.12  76.35  76.23  76.29
MSDM-Attr (VGGNet)      76.53  65.23  75.94  76.49  76.21

3.3. Attribute Auxiliary Layer

In addition, taking into account the potential correlations among pedestrian attributes, we add a layer of attribute auxiliary nodes after the output of the part-one network and exploit the constraint relations among attributes to further improve recognition accuracy, as shown in part two of Figure 2, labeled 2. As Figure 2 shows, the nodes of the two fully connected layers correspond one to one. During training, the parameters of the first part of the network remain unchanged and only the parameters of the last fully connected layer are fine-tuned. Driven by the training data, this fully connected layer allows the network to automatically refine its predictions according to the constraints among the attributes. Moreover, the added fully connected layer introduces only thousands of extra parameters, so it does not affect the efficiency of recognition. A sketch of this auxiliary layer is given below.
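The sketch treats the auxiliary layer as a single fully connected layer applied to the part-one attribute scores and fine-tuned with part one frozen; whether it consumes probabilities or raw logits, and the optimizer settings, are our assumptions.

```python
import torch
import torch.nn as nn

NUM_ATTRS = sum([1, 4, 5, 8, 6, 4, 5])  # total attributes across the PETA groups above

# Part two: a fully connected layer whose input and output nodes correspond
# one to one with the attributes, so each refined prediction can draw on all
# first-stage predictions (the inter-attribute constraints).
aux_layer = nn.Linear(NUM_ATTRS, NUM_ATTRS)

optimizer = torch.optim.SGD(aux_layer.parameters(), lr=1e-3)  # assumed settings
criterion = nn.BCEWithLogitsLoss()

def refine_step(first_stage_logits, labels):
    """One training step of the auxiliary layer. `first_stage_logits` are the
    outputs of the frozen part-one network (the MSDM sketched earlier); only
    the auxiliary layer's parameters are updated."""
    first_stage = torch.sigmoid(first_stage_logits.detach())  # frozen part-one scores
    refined = aux_layer(first_stage)                          # refined attribute logits
    loss = criterion(refined, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch with random stand-in data:
logits = torch.randn(8, NUM_ATTRS)
labels = torch.randint(0, 2, (8, NUM_ATTRS)).float()
refine_step(logits, labels)
```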

4. EXPERIMENTS

4.1. Datasets and Evaluation Protocols

We conduct experiments on two standard pedestrian attribute datasets. The first is the PETA dataset, which contains 19,000 pedestrian images, each annotated with 35 binary attributes for evaluation. We divide the images into three parts: 9,500 for training, 1,900 for validation, and the remaining 7,600 for testing. The second is the RAP dataset [27], a large pedestrian dataset with 41,585 images. Each image is annotated with 72 binary attributes, which we divide into 10 groups. We use 33,268 images for training and the remaining 8,317 for testing.

In previous studies, mean accuracy (mA) has been used to evaluate attribute recognition algorithms. Because of the unbalanced distribution of attributes in these datasets, mA computes, for each attribute, the classification accuracy on the positive examples and on the negative examples separately and takes their average as the accuracy of that attribute; the final score is the mean over all attributes. In addition, we report four common metrics: accuracy, precision, recall, and the F1 score.
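Written out, the mA metric described above can be expressed as (this is the standard formulation used with these datasets; the notation is ours):

\[
\mathrm{mA} = \frac{1}{2L} \sum_{i=1}^{L} \left( \frac{TP_i}{P_i} + \frac{TN_i}{N_i} \right),
\]

where \(L\) is the number of attributes, \(P_i\) and \(N_i\) are the numbers of positive and negative examples of attribute \(i\), and \(TP_i\) and \(TN_i\) are the numbers of correctly classified positive and negative examples.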

4.2. Results and Discussion

The results of the experiments on PETA and RAP are shown in Table 1. The DeepSAR [14] method trains a separate network model for each attribute, while the DeepMAR method treats all attributes as a multi-label vector. The first part of our multi-subnet deep model, based on CaffeNet, is denoted MSDM, while the combination of the first and second parts is denoted MSDM-Attr. The results show that both MSDM and MSDM-Attr achieve higher attribute recognition accuracy. MSDM obtains more discriminative pedestrian attribute features through the grouping method, and MSDM-Attr, which adds the attribute auxiliary layer on top of MSDM, gains additional information from the potential relationships among attributes. We also conduct two further experiments with a VGG base model. From the results in Table 2, we can see that our multi-subnet deep model is suitable for both shallow and deep base networks, and that using a deeper base network yields better results.

5. CONCLUSION

In this paper, a multi-subnet deep model for pedestrian attribute recognition has been proposed. The model makes use of the localization characteristics of attributes to learn high-level features by groups, and it adds an attribute auxiliary layer that exploits the potential relationships among attributes. Experimental results on two standard attribute datasets, PETA and RAP, have demonstrated the effectiveness of both MSDM and MSDM-Attr.

6. REFERENCES

[1] A. Dantcheva, A. Singh, P. Elia, and J. Dugelay, "Search pruning in video surveillance systems: Efficiency-reliability tradeoff," in IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 1356–1363, 2011.
[2] E. S. Jaha and M. S. Nixon, "Analysing soft clothing biometrics for retrieval," 2014.
[3] W. Liu, T. Mei, Y. Zhang, C. Che, and J. Luo, "Multi-task deep visual-semantic embedding for video thumbnail selection," in CVPR, pp. 3707–3715, 2015.
[4] D. Gray, S. Brennan, and H. Tao, "Evaluating appearance models for recognition, reacquisition, and tracking," in IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, 2007.
[5] R. Layne, T. M. Hospedales, S. Gong, et al., "Person re-identification by attributes," in BMVC, vol. 2, p. 3, 2012.
[6] L. An, X. Chen, M. Kafai, S. Yang, and B. Bhanu, "Improving person re-identification by soft biometrics based reranking," in Seventh International Conference on Distributed Smart Cameras (ICDSC), pp. 1–6, 2013.
[7] Y. Deng, P. Luo, C. C. Loy, and X. Tang, "Pedestrian attribute recognition at far distance," in Proceedings of ACM Multimedia (ACM MM), 2014.
[8] E. S. Jaha and M. S. Nixon, "Soft biometrics for subject identification using clothing attributes," 2014.
[9] R. Layne, T. M. Hospedales, and S. Gong, "Attributes-based re-identification," in Person Re-Identification, pp. 93–117, Springer, 2014.
[10] R. Layne, T. M. Hospedales, S. Gong, et al., "Person re-identification by attributes," in BMVC, vol. 2, p. 3, 2012.
[11] G. Sharma and F. Jurie, "Learning discriminative spatial representation for image classification," in BMVC, pp. 1–11, BMVA Press, 2011.
[12] G. Sharma, F. Jurie, and C. Schmid, "Expanded parts model for human attribute and action recognition in still images," in CVPR, pp. 652–659, 2013.
[13] L. Bourdev, S. Maji, and J. Malik, "Describing people: A poselet-based approach to attribute classification," in ICCV, pp. 1543–1550, 2011.
[14] P. Sudowe, H. Spitzer, and B. Leibe, "Person attribute recognition with a jointly-trained holistic CNN model," in ICCV Workshops, pp. 87–95, 2015.
[15] D. Li, X. Chen, and K. Huang, "Multi-attribute learning for pedestrian attribute recognition in surveillance scenarios," in Pattern Recognition, IEEE, 2016.
[16] L. Bourdev and J. Malik, "Poselets: Body part detectors trained using 3D human pose annotations," in ICCV, pp. 1365–1372, 2009.
[17] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
[18] N. Zhang, M. Paluri, M. Ranzato, et al., "PANDA: Pose aligned networks for deep attribute modeling," pp. 1637–1644, 2014.
[19] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computation, 1989.
[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, pp. 2278–2324, 1998.
[21] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, "What is the best multi-stage architecture for object recognition?" in ICCV, 2009.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
[23] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, "Pedestrian detection with unsupervised multi-stage feature learning," in CVPR, 2013.
[24] M. Ranzato, F. Huang, Y.-L. Boureau, and Y. LeCun, "Unsupervised learning of invariant feature hierarchies with applications to object recognition," in CVPR, 2007.
[25] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng, "Building high-level features using large scale unsupervised learning," in ICML, 2012.
[26] Y. Deng, P. Luo, C. C. Loy, and X. Tang, "Pedestrian attribute recognition at far distance," in Proceedings of ACM Multimedia (ACM MM), pp. 789–792, 2014.
[27] D. Li, Z. Zhang, X. Chen, et al., "A richly annotated dataset for pedestrian attribute recognition," 2016.