American Institute of Aeronautics and Astronautics
1
Accelerating CSRN based face recognition on an NVIDIA
GPGPU
Kenneth Rice1
Clemson University, Clemson, SC 29634, USA
Tarek M. Taha2
University of Dayton, Dayton, OH 45469, USA
Ronald Miller3
Air Force Research Laboratory, Human Performance Wing, Wright-Patterson Air Force Base, OH 45433, USA
and
Khan M. Iftekharuddin4, Keith Anderson5, and Teddy Salan6
University of Memphis, Memphis, TN 38152, USA
Unmanned aerial vehicles (UAVs) are being equipped with high definition cameras to
survey a wide range of low-contrast and diverse environments. From data captured by
UAVs, image analysts can determine adversarial threats proficiently. However, there is
simply too much data and not enough analysts to do this processing efficiently. Enabling
computing systems to mimic the processes in the human brain to process such data would be
of significant benefit.
CSRNs (cellular simultaneous recurrent networks) are capable of solving several spatial
processing tasks that are carried out by humans. In particular, they have been shown to be
capable of pose invariant face recognition. Given the highly recurrent nature of CSRNs (a
property also seen in the human cortex), the computational demands of these algorithms
grow with input size. Therefore the acceleration of CSRNs would be highly beneficial.
In this paper we examine the acceleration of CSRNs applied to face recognition. We
develop optimized implementations of the algorithm on an Intel Xeon 2.67 GHz processor
and an NVIDIA Tesla C2050 GPGPU (general purpose graphical processing unit). Our
results show that the GPGPU is 22.9 times faster than the CPU implementation.
I. Introduction
Unmanned aerial vehicles (UAVs) are being equipped with high definition cameras to survey a wide range of
low-contrast and diverse environments. From data captured by UAVs, image analysts can determine
adversarial threats proficiently. However, there is simply too much data and not enough analysts to do this
processing efficiently. At present, a major initiative in the research community is investigating new ways of
processing data that capture the efficiency of the human brain in hardware and software. This has resulted in
increased interest and development of bio-inspired computing approaches in software and hardware. One such bio-
inspired approach is cellular simultaneous recurrent networks (CSRNs).
1 Graduate Student, Electrical and Computer Engineering, Clemson University, Clemson, SC 29634, USA.
2 Associate Professor, Electrical and Computer Engineering, 300 College Park, University of Dayton, Dayton, OH 45469, USA.
3 Principal Research Physicist, Anticipate and Influence Division, 2455 H Street, WPAFB, Ohio 45433, USA.
4 Professor, Electrical and Computer Engineering Department, University of Memphis, TN 38152, USA.
5 Graduate Student, Electrical and Computer Engineering Department, University of Memphis, TN 38152, USA.
6 Graduate Student, Electrical and Computer Engineering Department, University of Memphis, TN 38152, USA.
Infotech@Aerospace 2011, 29 - 31 March 2011, St. Louis, Missouri. AIAA 2011-1554.
Copyright © 2011 by the American Institute of Aeronautics and Astronautics, Inc. All rights reserved.
CSRNs have been demonstrated to be very useful in solving state transition type problems, such as maze
traversals [1]. These basic problem sets can in turn be utilized for complex image processing problems with time
varying information. CSRNs have been applied to pose invariant face recognition [2], a task where traditional
computer vision methods underperform. Although powerful in image processing capabilities, CSRNs have high
computational demands with increasing input problem size. In order to process UAV image capture data, efficient
processing approaches for implementing CSRNs need to be investigated.
In this paper, we explore various approaches to accelerate pose invariant face recognition using CSRNs. In
particular, we investigate various general purpose graphical processing unit (GPGPU) techniques to take advantage
of CSRN’s inherent parallelism. The results of other studies have demonstrated the effectiveness of GPGPUs for
accelerating neural network applications [3,4,5]. Strigl et al. [3] achieve a speedup of 2 to 24 over a comparable
CPU implementation using a GPGPU for a convolutional neural network. Dolan and DeSouza [4] develop a cellular
neural network image processing library and investigate its performance across three different platforms: single-
core CPU, multi-core CPU, and GPGPU. Dolan and DeSouza observe a speedup of more than an order of magnitude
as a result of using GPGPUs. Lastly, Nageswaran et al. [5] construct a large scale Izhikevich spiking neural network
simulator on a single GPGPU. In a comparison against an equivalent CPU implementation, Nageswaran et al.
observe a 26 times speedup in favor of the GPGPU implementation.
For our study, we examine the acceleration of the CSRN based face recognition problem used by Ren et al. [2]
on an NVIDIA Tesla C2050 GPGPU coupled with a dual quad core Intel Xeon 2.67 GHz processor. Our
preliminary results show that our GPGPU implementations resulted in an acceleration of 22.9 times for the training
phase when compared to optimized C implementations. We explore the parallelization of the code for
implementation on a GPGPU and evaluate optimizations such as reducing memory access.
II. Background
A. CSRN Characteristics
A cellular neural network (CNN) consists of identical elements arranged in a geometric pattern [6]. Due to
symmetry in the network, elements within a CNN are able to share weights. This sharing of weights decreases the
time required to train the CNN because the total number of weights to train is smaller. This symmetry also allows
CNNs to be very useful in solving problems that contain a similar type of inherent geometry. Lastly, each element in
a CNN can be as simple as a single neuron or more complex, e.g., a multi-layered perceptron (MLP). Differences
between the cell elements lie mainly in the input they receive.
Simultaneous recurrent networks (SRNs) are a type of neural network, which has been proven to be more
powerful than MLPs [6][7]. In a recurrent network, outputs are fed back as inputs in subsequent iterations. The
recurrent behavior in SRNs is an attempt to emulate similar activity in the brain. The brain has feedback paths along
with feedforward paths [8].
A CSRN is the combination of a CNN and an SRN. The operation of CSRNs mimics the mammalian neocortex,
which is a fairly uniform structure composed of similar elements performing uniform processing [8]. The
architecture of a CSRN is shown in Fig. 1. The geometry in the input pattern is reflected in the geometry of the
CSRN's cellular structure. Each CSRN cell houses one network (shown as a gray box) for each component in the
input pattern. The outputs of each cell are combined to produce an overall network output.
An application where CSRNs have been shown to perform successfully is the generalized 2D maze traversal
problem [1]. In [6], Pang et al. report that MLPs are unable to solve the maze traversal problem, whereas a CSRN
can do so easily.
Fig. 1. CSRN structure and composition: the input pattern maps onto a grid of CSRN cells, whose outputs pass through an output transformation.
B. CSRN Face Recognition
As Ren et al. discussed in [2], CSRNs were used for pose invariant face recognition. In this problem, the
subject’s face was recorded over a sequence in which the subject performs a head rotation from right to left.
Samples of the face over that sequence are taken. An example of samples of this head rotation is shown in Fig. 2.
Fig.2. Example of face performing rotation.
Once these samples have been obtained, a preprocessing step, also described in [2], is performed to extract the
face from the samples. The extracted face data is reduced into significantly smaller pattern vectors using principal
component analysis (PCA). The temporal signature of these resulting pattern vectors, the distance between
successive pattern vectors as discussed in [9], can then be computed. The idea here is that there is a recognizable
pattern within the temporal signature of a face sequence. This recognizable pattern has been observed to be
sufficient for recognizing different individuals, as shown in [9]. Because of the wide range of poses used during
training, the method has been observed to be tolerant to changes in pose. For learning the recognizable pattern
within the temporal signature of face sequences, Ren et al. employed CSRNs.
C. CSRN Adapted to Face Recognition
For face recognition, Ren et al. created a network of CSRNs used to learn the temporal signature of a face
sequence representing an individual. Since the face sequence had PCA performed on it, a separate CSRN is trained
to learn the temporal signature of a single component in the PCA feature space. Therefore, the number of PCA
components in the feature space is the number of CSRNs required in a group to recognize a single person.
Fig. 3 shows an example of the aforementioned CSRN network. In Fig. 3, to recognize N distinct individuals, N
different groups of CSRNs corresponding to the N people are used. N different PCA transformations are used
because each transformation corresponds to only one individual. Also, within each group of CSRNs, there are M
CSRNs to process face sequence data of M pattern vector components. When the sample face sequence is processed
by the full network of CSRNs, the group of CSRNs which results in an output temporal signature of closest match to
the learned temporal signature is identified as the person.
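The closest-match selection described above can be sketched as follows. This is an illustrative sketch only: the person ids, the signature values, and the sum-of-squared-differences distance are assumptions, since [2] does not fix the exact metric here.

```python
def classify(sample_signature, learned_signatures):
    """Return the person whose learned temporal signature most closely
    matches the network's output signature for the sample."""
    def dist(a, b):
        # Sum-of-squared-differences; the paper does not specify the metric.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(learned_signatures,
               key=lambda p: dist(sample_signature, learned_signatures[p]))

# Hypothetical learned signatures for two people
people = {1: [0.1, 0.2, 0.3], 2: [0.9, 0.8, 0.7]}
match = classify([0.12, 0.19, 0.33], people)  # closest to person 1
```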
Fig. 3. CSRN network for face recognition. In this network, there are N groups of M CSRNs to process the face
sequence. Each CSRN group corresponds to a different person.
In this work, the CSRN cell element structure for face recognition used the generalized multi-layered
perceptron (GMLP) model shown in Fig. 4. This GMLP model works in two layers. The first layer acts as an input
layer. It is composed of a bias node, two external input nodes, four neighbor input nodes corresponding to up,
down, left, and right cell neighbor outputs, and 10 recurrent nodes. The second layer acts as a hidden layer
consisting of only the recurrent nodes.
The nodes are fully connected between the first and second layers. These connections are weighted: the weights
associated with the bias node (gray) are denoted ww, and the weights associated with the remaining first layer
nodes (black) are denoted W. In forming the overall cell element output, each second layer node's output is also
fed as input to all of the succeeding second layer nodes. Ultimately, the last second layer node receives all
preceding second layer node outputs, multiplied by weights from W, along with the weighted outputs from the first
layer nodes as input. The output of the last second layer node is multiplied by a weight scaling value Ws, and that
product is observed as the output of that particular CSRN cell element.
Fig. 4. Two layer GMLP network architecture, with bias, external, neighbor, and recurrent input nodes, weights ww
(bias) and W, and output scaling weight Ws. One GMLP network is used for each cell in the CSRN network. Nodes
are fully connected between layers.
As mentioned earlier, Ren et al. mapped a single CSRN to a single pattern vector component. The CSRN
processes input as a 2D grid of a pattern vector component’s temporal signature values. Each CSRN cell element
receives a different temporal signature value as input from that pattern vector. Therefore, if a face sequence has been
sampled 10 times resulting in 9 temporal signatures, then a 3×3 grid of the temporal signature data corresponding to
one component in the pattern vector is submitted to a CSRN in the network for processing.
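As a concrete sketch of this data preparation step: successive pattern vectors are differenced into temporal signature values, which are then arranged as a square grid for the CSRN. The Euclidean distance used here is an assumption; [9] defines the exact temporal signature.

```python
import math

def temporal_signature(pattern_vectors):
    """Distance between each pair of successive PCA pattern vectors
    (Euclidean distance assumed)."""
    return [math.dist(a, b) for a, b in zip(pattern_vectors, pattern_vectors[1:])]

def component_grid(values, side):
    """Arrange one component's temporal-signature values as a side x side grid."""
    assert len(values) == side * side
    return [values[i * side:(i + 1) * side] for i in range(side)]

# 10 face samples of a hypothetical 1-component pattern vector
samples = [[float(i * i)] for i in range(10)]
sig = temporal_signature(samples)   # 9 signature values
grid = component_grid(sig, 3)       # 3x3 grid submitted to one CSRN
```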
In the case of this application, each CSRN cell element uses a 27 node GMLP model. As previously noted in Fig.
4, the GMLP model has two layers where the first layer consisted of 17 nodes (1 bias, 2 external inputs, 4 neighbor
nodes, and 10 recurrent nodes) and the second layer consisted of 10 nodes (10 recurrent nodes). The bias node feeds
a constant value of 1 into the input layer. The first of the two external inputs becomes the value of the temporal
signature pattern vector. The second external input is set to 1 whenever the pattern vector value equals 0 and is
otherwise set to 0.
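A serial sketch of one cell element's feedforward pass under these conventions follows. Only the node counts and the input encoding come from the text; the tanh activation and the exact ordering of the weights in W are assumptions.

```python
import math

def gmlp_cell_output(value, neighbors, recurrent, W, ww, Ws):
    """Feedforward pass of one 27-node GMLP cell element.

    First layer (17 nodes): bias (constant 1), external input 1 (the
    temporal signature value), external input 2 (1 iff the value is 0),
    4 neighbor cell outputs, and 10 recurrent values. Each of the 10
    second layer nodes also receives the outputs of the preceding second
    layer nodes; the last node's output, scaled by Ws, is the cell output.
    """
    first = [1.0, value, 1.0 if value == 0 else 0.0] + list(neighbors) + list(recurrent)
    assert len(first) == 17
    second = []
    for j in range(10):
        inputs = first[1:] + second          # 16 non-bias inputs + preceding outputs
        s = ww[j] + sum(w * x for w, x in zip(W[j], inputs))
        second.append(math.tanh(s))          # activation is an assumption
    return Ws * second[-1]

# Zero weights except the bias weights, to show the data flow only
W = [[0.0] * (16 + j) for j in range(10)]
ww = [0.5] * 10
out = gmlp_cell_output(0.0, [0.0] * 4, [0.0] * 10, W, ww, Ws=2.0)
```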
D. CSRN Computation
CSRNs can be trained in several ways, but the method which Ilin et al. [1] observed to have the best results was
the multi-stream extended Kalman filter (MSEKF) training technique. MSEKF works mainly using Equations (1) –
(4).
Γt = Ct Kt Ct^T + Rt (1)
Gt = Kt Ct^T Γt^-1 (2)
wt+1 = wt + Gt αt (3)
Kt+1 = Kt - Gt Ct Kt + Qt (4)
This is where the variables for time iteration t represent the following: Γt is the residual covariance, Ct is the state
observation matrix Jacobian, Kt is the predicted estimate covariance, Rt is the observation noise covariance, Gt is the
optimal Kalman gain, αt is the measurement residual, wt is the predicted state, and Qt is the process noise covariance.
For training, wt represents the shared weights for the CSRN (W and ww from Fig. 4). Also, Ct and αt are computed
based upon wt.
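For illustration, the update in Equations (1) – (4) can be written out for a scalar state, where the matrix products and the inverse of Γt collapse to ordinary arithmetic. The actual training uses the full matrix forms, with the Gauss-Jordan inverse described in Section III.

```python
def msekf_update(w, K, C, alpha, R, Q):
    """One MSEKF step, Equations (1)-(4), for a scalar state."""
    Gamma = C * K * C + R       # (1) residual covariance
    G = K * C / Gamma           # (2) Kalman gain
    w_next = w + G * alpha      # (3) weight update
    K_next = K - G * C * K + Q  # (4) covariance update
    return w_next, K_next

w1, K1 = msekf_update(w=0.0, K=1.0, C=1.0, alpha=1.0, R=1.0, Q=0.0)
# Gamma = 2, G = 0.5, so w1 = 0.5 and K1 = 0.5
```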
Training a single CSRN using MSEKF for the face recognition problem can be broken down into four stages, as
outlined in Algorithm 1.
_________________________________________________
Algorithm 1: Pseudocode for CSRN MSEKF training
_________________________________________________
// Over a set number of iterations (or until there
// is no change in weights)
For each iteration{
// Iterate over all samples to compute a collective
// Ct and αt
For each data sample{
CSRN Feedforward Pass (CSRNFF)
CSRN Feedback Pass (CSRNFB)
Calculate Ct and αt (CCA)
}
// Perform CSRN network update
Update wt and Kt (UWK)
}
_________________________________________________
The CSRN feedforward pass (CSRNFF) processes a data sample. During CSRNFF, the data sample is
propagated up through the GMLP contained within the CSRN cells. The output of CSRNFF is used as the overall
output as well as to compute αt. The CSRN feedback pass (CSRNFB) mainly helps in computing Ct. During
CSRNFB, the outputs of CSRNFF are propagated back down through the GMLP contained within the CSRN cells.
CSRNFF and CSRNFB both iterate over a predefined interval.
Once the outputs of both the CSRNFF and CSRNFB stages have been obtained, Ct and αt can be computed (CCA).
After Ct and αt are computed, Kt+1 and wt+1 can be computed (UWK), which consists of performing Equations (1) –
(4). To train a network of CSRNs, this process must be done for all CSRNs within the network. Likewise, for testing,
a network of CSRNs needs only to perform the CSRNFF stage for the respective data samples.
III. Implementation
A. CSRN Computation Issues
Using CSRNs in any application can quickly become a computationally intensive task as the input size increases.
However, in order to tackle really interesting problems, the input size of CSRNs needs to increase. The high
computational cost comes largely during training due to the matrix inversion of Γt as seen in Equation (2) along with
performing the increasing amount of cellular element computations during the CSRNFF and CSRNFB stages.
Therefore, CSRNs need a method to accelerate the computations.
The ideal platform for CSRN acceleration is something that would take advantage of both the inherent task-level
parallelism (the cell element computations of the CSRNFF and CSRNFB stages) and the data-level parallelism (the
vector and matrix operations of the CCA and UWK stages) available. One such device is a GPGPU.
B. GPGPU as an acceleration device
GPGPUs are quickly emerging as a premier acceleration platform, largely because of their low learning curve for
software developers. This leads to a reduced development cycle when compared to other acceleration platforms
such as FPGAs. Fig. 5 shows the general structure of a GPGPU. A GPGPU is composed of a number of scalar
processors (SPs) grouped together to form streaming multiprocessors (SMs). These streaming
multiprocessors contain their own shared memory, cache, multithreaded instruction unit (MTI), and special
functional units (SFUs). All streaming multiprocessors have access to the same global memory. Each streaming
multiprocessor is capable of processing several simultaneous processing blocks, and each processing block is
capable of executing up to 512 (or more, for newer generation GPGPUs) threads at a time.
GPGPUs work on the principle of divide and conquer. In order to achieve the best speedup performance
possible, the processing of an application has to be distributed within the GPGPU among thousands of lightweight
threads. Also, GPGPUs are more geared towards applications which have a higher compute to memory access ratio.
Within GPGPU applications, memory access has been shown to be a bottleneck in the designs.
Fig. 5. General structure of a GPGPU: multiple streaming multiprocessors (SMs), each containing scalar processors
(SPs), special functional units (SFUs), a multithreaded instruction unit (MTI), cache, and shared memory, all
connected to global memory.
C. Mapping CSRN to GPGPU
To perform the CSRN processing, we want to take advantage of both the task-level and data-level parallelism in
the computations. As previously noted, we took advantage of the task-level parallelism within the CSRNFF and
CSRNFB stages of the application and the data-level parallelism in the CCA and UWK stages. For the CSRNFF and
CSRNFB stages, we map the operations of each CSRN cell element (GMLP) to a single processing block.
Additionally, we map the operations of each node within the GMLP to a thread within the processing block. For the
CCA and UWK stage computations, we map the matrix/vector operations to a collection of blocks, such that each
element within the matrix/vector operations would be processed by a thread. The matrix inversion of Γt that occurs
as a part of Equation (2) is performed using a parallel Gauss-Jordan elimination technique on the GPGPU adapted
from [10].
We observed that we can dramatically reduce the number of accesses to global memory in our GPGPU
implementation of Gauss-Jordan elimination. Our observation can be seen from the example of Gauss-Jordan
elimination given in Fig. 6.
Initial matrix (appended with the identity matrix):
[ 0.5    1    2 |   1  0  0 ]
[ 3     10    8 |   0  1  0 ]
[ 9     30   25 |   0  0  1 ]

After 1st iteration:
[ 1    2    4 |   2  0  0 ]
[ 0    4   -4 |  -6  1  0 ]
[ 0   12  -11 | -18  0  1 ]

After 2nd iteration:
[ 1  0   6 |    5  -0.5   0 ]
[ 0  1  -1 | -1.5  0.25   0 ]
[ 0  0   1 |    0    -3   1 ]

After 3rd iteration:
[ 1  0  0 |    5   17.5  -6 ]
[ 0  1  0 | -1.5  -2.75   1 ]
[ 0  0  1 |    0     -3   1 ]

Fig. 6. Gauss-Jordan elimination matrix inversion. Shaded regions represent areas of change. Bold rows represent
the current observed row.
Given an initial matrix (the matrix to be inverted), it is appended with the identity matrix. After an iteration of
Gauss-Jordan elimination, only the shaded columns are modified. This is because the zeros present in the
remaining columns offer no change to the unobserved rows. Using this knowledge, we modified our GPGPU Gauss-
Jordan elimination routine to only access global memory during the times in which there will be modification to the
unobserved rows. By doing this, we were able to decrease the time necessary to compute the UWK stage by
approximately 57%.
We also observed another method to further reduce the number of global memory accesses. In Fig. 6, during the
computation of the Gauss-Jordan elimination routine, the observed row is normalized by the diagonal value every
iteration. This normalization can actually be postponed until after all iterations have been completed. We modified
our GPGPU implementation to normalize only after all iterations of the Gauss-Jordan routine have completed. By
doing this, the number of global memory accesses during the UWK stage was reduced, which resulted in a reduction
in the computation time.
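A minimal serial sketch of Gauss-Jordan inversion with the normalization deferred to a single final pass follows. It illustrates the arithmetic only; the GPGPU kernel, pivoting, and the global-memory access pattern of the optimized routine are outside its scope.

```python
def gauss_jordan_inverse(A):
    """Invert A by Gauss-Jordan elimination, normalizing each pivot row
    only once, after all elimination iterations (no pivoting)."""
    n = len(A)
    # Append the identity matrix to form the augmented matrix
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(A)]
    for i in range(n):                       # eliminate column i from every other row
        for r in range(n):
            if r != i:
                f = M[r][i] / M[i][i]
                for c in range(2 * n):
                    M[r][c] -= f * M[i][c]
    # Deferred normalization: divide each row by its pivot, keep the inverse half
    return [[v / M[i][i] for v in M[i][n:]] for i in range(n)]

A = [[0.5, 1, 2], [3, 10, 8], [9, 30, 25]]   # the Fig. 6 example matrix
inv = gauss_jordan_inverse(A)                 # [[5, 17.5, -6], [-1.5, -2.75, 1], [0, -3, 1]]
```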
Fig. 7 shows a representative flow chart for the GPGPU operations for training one CSRN within the network.
The flow chart shown in Fig. 7 would be used to train all CSRNs within the network for their respective inputs.
Fig. 7. Flow chart for GPGPU CSRN mapping for MSEKF training: starting from sample submission, the CSRNFF,
CSRNFB, and CCA modules run for each sample; once no samples remain, UWK runs, and the whole process
repeats until the maximum iteration count is reached. The shaded regions (CSRNFF, CSRNFB, CCA, and UWK)
represent processing modules placed on the GPGPU.
IV. Results
We implemented two designs: an optimized CPU version using the C programming language and OpenCV
library [11,12] and a GPGPU version using both the C programming language and CUDA [13,14] (a parallel
computing architecture C programming language extension). The CPU implementation was performed on a dual
quad core Intel Xeon 2.67 GHz processor and the GPGPU implementation was performed on a combination of the
dual quad core Intel Xeon 2.67 GHz processor and an NVIDIA Tesla C2050 GPGPU (14 multiprocessors).
Using the CSRN setup designed for face recognition, we chose to accelerate only the CSRN training phase of
the algorithm, as this phase is the slowest. In doing so, we repeated the experiments of Ren et al. utilizing face
sequences from the publicly available VidTIMIT database [15]. In these experiments, we utilized a CSRN network
designed to classify 5 people. Each person class consisted of a group of 10 CSRNs using 10 samples from the face
sequence. We also trained the network with 5 different face sequences per person. We performed the training using
MSEKF. The times are shown in Table I.
Table I. Comparison of timing measurements for CSRN MSEKF training using CPU and GPGPU implementations.

Implementation    Training Time (seconds)
CPU               2,237.65
GPGPU             97.48
As shown in Table I, the GPGPU implementation offers a speedup of 22.9 times over the CPU
implementation. We project that we would see further speedup by using more samples as input. Also,
increasing the number of CSRNs per person class (increasing the size of the PCA reduced pattern vectors) should
result in more speedup. Lastly, dividing the application across multiple processors/GPGPUs should result in even
greater speedup.
V. Conclusion
Pose invariant face recognition is a difficult task in computer vision. As discussed in this paper, CSRNs are a
biologically inspired approach that can be applied to this face recognition task. The computational requirement for
CSRNs grows as the input image size grows. In this paper we examined the acceleration of CSRN based face
recognition on an Intel Xeon 2.67 GHz processor and an NVIDIA C2050 GPGPU. Our results indicate that
speedups of 22.9 times are possible during training with the GPGPU. This indicates that the GPGPU is able to take
advantage of the parallelism within the CSRN based face recognition.
Acknowledgments
This work was supported by an NSF CAREER Award and grants from the US Air Force.
References
[1] R. Ilin, R. Kozma, and P. J.Werbos, “Beyond feedforward models trained by backpropagation: a practical training tool for a
more efficient universal approximator,” in IEEE Trans. Neural Netw., vol. 19, no. 6, pp. 929 - 937, 2008.
[2] Y. Ren, K. Anderson, K. Iftekharuddin, P. Kim, and E. White, “Pose invariant face recognition using cellular simultaneous
recurrent networks”, in Proc. Int. Joint Conf. on Neural Netw., 2009, pp. 2691-2698.
[3] D. Strigl, K. Kofler, and S. Podlipnig, “Performance and scalability of GPU-based convolutional neural networks,” in Proc.
of the 18th Euromicro Conf. on Parallel, Distributed and Network-based Processing, 2010, pp. 317-324.
[4] R. Dolan, and G. DeSouza, “GPU-based simulation of cellular neural networks for image processing,” in Proc. of Int. Joint
Conf. on Neural Netw., 2009, pp. 730–735.
[5] J. M. Nageswaran, N. Dutt, J. L. Krichmar, A. Nicolau, and A. Veidenbaum, “Efficient simulation of large-scale spiking
neural networks using CUDA graphics processors,” in Proc. of Int. Joint Conf. on Neural Networks, 2009, pp. 2145–2152.
[6] X. Pang and P. Werbos, “Neural network design for J function approximation in dynamic programming,” arXiv:adap-
org/9806001v1, 1998.
[7] P. Werbos, The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting, New
York, NY: Wiley-Interscience, 1994.
[8] D. J. Felleman , D. C. Van Essen, “Distributed hierarchical processing in the primate cerebral cortex,” Cerebral Cortex,
vol. 1, pp. 1–47, 1991.
[9] S. Gongy, A. Psarrouz, I. Katsoulisy, and P. Palavouzisy, “Tracking and recognition of face sequences,” in Proc. of
European Workshop on Combined Real and Synthetic Image Processing for Broadcast and Video Production , 1994, pp. 96
– 112.
[10] Thinking in CUDA: Gauss-Jordan elimination. [Online] Available at:
http://cudahacks.com/cms/Blog/tabid/64/EntryId/7/Thinking-in-CUDA-Gauss-Jordan-elimination.aspx.
[11] Open Computer Vision Library. [Online] Available at: http://sourceforge.net/projects/opencvlibrary/.
[12] G. Bradski and A. Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library. Cambridge, MA: O’Reilly
Press, 2008.
[13] NVIDIA. NVIDIA Developer Zone. [Online] Available at: http://developer.nvidia.com/object/cuda_3_2_downloads.html.
[14] NVIDIA. NVIDIA CUDA C Programming Guide, version 3.2, 2010. [Online] Available at:
http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf.
[15] C. Sanderson and K. K. Paliwal, “Polynomial Features for Robust Face Authentication,” in IEEE Int. Conf. on Image
Processing, 2002, pp. 997-1000.