American Institute of Aeronautics and Astronautics
1
Accelerating CSRN based face recognition on an NVIDIA
GPGPU
Kenneth Rice1
Clemson University, Clemson, SC 29634, USA
Tarek M. Taha2
University of Dayton, Dayton, OH 45469, USA
Ronald Miller3
Air Force Research Laboratory, Human Performance Wing, Wright-Patterson Air Force Base, OH 45433, USA
and
Khan M. Iftekharuddin4, Keith Anderson5, and Teddy Salan6
University of Memphis, Memphis, TN 38152, USA
Unmanned aerial vehicles (UAVs) are being equipped with high definition cameras to
survey a wide range of low-contrast and diverse environments. From data captured by
UAVs, image analysts can determine adversarial threats proficiently. However, there is
simply too much data and not enough analysts to do this processing efficiently. Enabling
computing systems to mimic the processes in the human brain to process such data would be
of significant benefit.
CSRNs (cellular simultaneous recurrent networks) are capable of solving several spatial
processing tasks that are carried out by humans. In particular, they have been shown to be
capable of pose invariant face recognition. Given the highly recurrent nature of CSRNs (a
property also seen in the human cortex), the computational demands of these algorithms
grow with input size. Therefore the acceleration of CSRNs would be highly beneficial.
In this paper we examine the acceleration of CSRNs applied to face recognition. We
develop optimized implementations of the algorithm on an Intel Xeon 2.67 GHz processor
and an NVIDIA Tesla C2050 GPGPU (general purpose graphical processing unit). Our
results show that the GPGPU is 22.9 times faster than the CPU implementation.
I. Introduction
Unmanned aerial vehicles (UAVs) are being equipped with high definition cameras to survey a wide range of
low-contrast and diverse environments. From data captured by UAVs, image analysts can determine
adversarial threats proficiently. However, there is simply too much data and not enough analysts to do this
processing efficiently. At present, a major initiative in the research community is investigating new ways of
processing data that capture the efficiency of the human brain in hardware and software. This has resulted in
increased interest and development of bio-inspired computing approaches in software and hardware. One such bio-
inspired approach is cellular simultaneous recurrent networks (CSRNs).
1 Graduate Student, Electrical and Computer Engineering, Clemson University, Clemson, SC 29634, USA.
2 Associate Professor, Electrical and Computer Engineering, 300 College Park, University of Dayton, Dayton, OH 45469, USA.
3 Principal Research Physicist, Anticipate and Influence Division, 2455 H Street, WPAFB, Ohio 45433, USA.
4 Professor, Electrical and Computer Engineering Department, University of Memphis, TN 38152, USA.
5 Graduate Student, Electrical and Computer Engineering Department, University of Memphis, TN 38152, USA.
6 Graduate Student, Electrical and Computer Engineering Department, University of Memphis, TN 38152, USA.
Infotech@Aerospace 2011, 29 - 31 March 2011, St. Louis, Missouri. AIAA 2011-1554.
Copyright © 2011 by the American Institute of Aeronautics and Astronautics, Inc. All rights reserved.
CSRNs have been demonstrated to be very useful in solving state transition type problems, such as maze
traversals [1]. These basic problem sets can in turn be utilized for complex image processing problems with time
varying information. CSRNs have been applied to pose invariant face recognition [2], a task where traditional
computer vision methods underperform. Although powerful in image processing capabilities, CSRNs have high
computational demands with increasing input problem size. In order to process UAV image capture data, efficient
processing approaches for implementing CSRNs need to be investigated.
In this paper, we explore various approaches to accelerate pose invariant face recognition using CSRNs. In
particular, we investigate various general purpose graphical processing unit (GPGPU) techniques to take advantage
of CSRN’s inherent parallelism. The results of other studies have demonstrated the effectiveness of GPGPUs for
accelerating neural network applications [3,4,5]. Strigl et al. [3] achieve a speedup of 2 to 24 over a comparable
CPU implementation using a GPGPU for a convolutional neural network. Dolan and DeSouza [4] develop a cellular
neural network image processing library and investigate its performance across three different platforms: single-
core CPU, multi-core CPU, and GPGPU. Dolan and DeSouza observe a speedup of more than an order of magnitude
as a result of using GPGPUs. Lastly, Nageswaran et al. [5] construct a large scale Izhikevich spiking neural network
simulator on a single GPGPU. In a comparison against an equivalent CPU implementation, Nageswaran et al.
observe a 26 times speedup in favor of the GPGPU implementation.
For our study, we examine the acceleration of the CSRN based face recognition problem used by Ren et al. [2]
on an NVIDIA Tesla C2050 GPGPU coupled with a dual quad core Intel Xeon 2.67 GHz processor. Our
preliminary results show that our GPGPU implementations resulted in an acceleration of 22.9 times for the training
phase when compared to optimized C implementations. We explore the parallelization of the code for
implementation on a GPGPU and evaluate optimizations such as reducing memory access.
II. Background
A. CSRN Characteristics
A cellular neural network (CNN) consists of identical elements arranged in a geometric pattern [6]. Due to
symmetry in the network, elements within a CNN are able to share weights. This sharing of weights decreases the
time required to train the CNN because the total number of weights to train is smaller. This symmetry also allows
CNNs to be very useful in solving problems that contain a similar type of inherent geometry. Lastly, each element in
a CNN can be as simple as a single neuron or more complex, e.g., a multi-layered perceptron (MLP). Differences
between the cell elements lie mainly in the input they receive.
Simultaneous recurrent networks (SRNs) are a type of neural network, which has been proven to be more
powerful than MLPs [6][7]. In a recurrent network, outputs are fed back as inputs in subsequent iterations. The
recurrent behavior in SRNs is an attempt to emulate similar activity in the brain. The brain has feedback paths along
with feedforward paths [8].
A CSRN is the combination of a CNN and an SRN. The operation of CSRNs mimics the mammalian neocortex,
which is a fairly uniform structure composed of similar elements performing uniform processing [8]. The
architecture of a CSRN is shown in Fig. 1. The geometry in the input pattern is reflected in the geometry of the
CSRN's cellular structure. Each CSRN cell houses one network (shown as a gray box) for each component in the
input pattern. The outputs of each cell are combined to produce an overall network output.
An application where CSRNs have been shown to perform successfully is the generalized 2D maze traversal
problem [1]. In [6], Pang et al. report that MLPs are unable to solve the maze traversal problem, whereas a CSRN
can do so easily.
Fig. 1. CSRN structure and composition: the input pattern maps onto a grid of CSRN cells, whose outputs pass through an output transformation.
B. CSRN Face Recognition
As Ren et al. discussed in [2], CSRNs were used for pose invariant face recognition. In this problem, the
subject’s face was recorded over a sequence in which the subject performs a head rotation from right to left.
Samples of the face over that sequence are taken. An example of samples of this head rotation is shown in Fig. 2.
Fig.2. Example of face performing rotation.
Once these samples have been obtained, a preprocessing step, also described in [2], is performed to extract the
face from the samples. The extracted face data is reduced into significantly smaller pattern vectors using principal
component analysis (PCA). The temporal signature of these resulting pattern vectors, the distance between
successive pattern vectors as discussed in [9], can then be computed. The idea here is that there is a recognizable
pattern within the temporal signature of a face sequence. This recognizable pattern has been observed to be
sufficient for recognizing different individuals, as shown in [9]. Because of the wide range of poses used during
training, the method has been observed to be tolerant to changes in pose. For learning the recognizable pattern
within the temporal signature of face sequences, Ren et al. employed CSRNs.
C. CSRN Adapted to Face Recognition
For face recognition, Ren et al. created a network of CSRNs used to learn the temporal signature of a face
sequence representing an individual. Since the face sequence had PCA performed on it, a separate CSRN is trained
to learn the temporal signature of a single component in the PCA feature space. Therefore, the number of PCA
components in the feature space is the number of CSRNs required in a group to recognize a single person.
Fig. 3 shows an example of the aforementioned CSRN network. In Fig. 3, to recognize N distinct individuals, N
different groups of CSRNs corresponding to the N people are used. N different PCA transformations are used
because each transformation corresponds to only one individual. Also, within each group of CSRNs, there are M
CSRNs to process face sequence data of M pattern vector components. When the sample face sequence is processed
by the full network of CSRNs, the group of CSRNs which results in an output temporal signature of closest match to
the learned temporal signature is identified as the person.
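The closest-match selection described above can be sketched as follows. This is an illustrative sketch only: the person ids, the signature values, and the sum-of-squared-differences distance are assumptions, since [2] does not fix the exact metric here.

```python
def classify(sample_signature, learned_signatures):
    """Return the person whose learned temporal signature most closely
    matches the network's output signature for the sample."""
    def dist(a, b):
        # Sum-of-squared-differences; the paper does not specify the metric.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(learned_signatures,
               key=lambda p: dist(sample_signature, learned_signatures[p]))

# Hypothetical learned signatures for two people
people = {1: [0.1, 0.2, 0.3], 2: [0.9, 0.8, 0.7]}
match = classify([0.12, 0.19, 0.33], people)  # closest to person 1
```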
Fig. 3. CSRN network for face recognition. In this network, there are N groups of M CSRNs to process the face
sequence. Each CSRN group corresponds to a different person.
In this work, the CSRN cell element structure for face recognition used the generalized multi-layered
perceptron (GMLP) model shown in Fig. 4. This GMLP model works in two layers. The first layer acts as an input
layer. It is composed of a bias node, two external input nodes, four neighbor input nodes corresponding to up,
down, left, and right cell neighbor outputs, and 10 recurrent nodes. The second layer acts as a hidden layer
consisting of only the recurrent nodes.
The nodes are fully connected between the first and second layers. These connections are weighted: the weights
associated with the bias node (gray) are denoted ww, and the weights associated with the remaining first layer
nodes (black) are denoted W. In forming the overall cell element output, each second layer node's output is also
fed as input to all of the succeeding second layer nodes. Ultimately, the last second layer node receives all
preceding second layer node outputs, multiplied by weights from W, along with the weighted outputs from the first
layer nodes as input. The output of the last second layer node is multiplied by a weight scaling value Ws, and that
product is observed as the output of that particular CSRN cell element.
Fig. 4. Two layer GMLP network architecture, with bias, external, neighbor, and recurrent input nodes, weights ww
(bias) and W, and output scaling weight Ws. One GMLP network is used for each cell in the CSRN network. Nodes
are fully connected between layers.
As mentioned earlier, Ren et al. mapped a single CSRN to a single pattern vector component. The CSRN
processes input as a 2D grid of a pattern vector component’s temporal signature values. Each CSRN cell element
receives a different temporal signature value as input from that pattern vector. Therefore, if a face sequence has been
sampled 10 times resulting in 9 temporal signatures, then a 3×3 grid of the temporal signature data corresponding to
one component in the pattern vector is submitted to a CSRN in the network for processing.
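As a concrete sketch of this data preparation step: successive pattern vectors are differenced into temporal signature values, which are then arranged as a square grid for the CSRN. The Euclidean distance used here is an assumption; [9] defines the exact temporal signature.

```python
import math

def temporal_signature(pattern_vectors):
    """Distance between each pair of successive PCA pattern vectors
    (Euclidean distance assumed)."""
    return [math.dist(a, b) for a, b in zip(pattern_vectors, pattern_vectors[1:])]

def component_grid(values, side):
    """Arrange one component's temporal-signature values as a side x side grid."""
    assert len(values) == side * side
    return [values[i * side:(i + 1) * side] for i in range(side)]

# 10 face samples of a hypothetical 1-component pattern vector
samples = [[float(i * i)] for i in range(10)]
sig = temporal_signature(samples)   # 9 signature values
grid = component_grid(sig, 3)       # 3x3 grid submitted to one CSRN
```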
In the case of this application, each CSRN cell element uses a 27 node GMLP model. As previously noted in Fig.
4, the GMLP model has two layers where the first layer consisted of 17 nodes (1 bias, 2 external inputs, 4 neighbor
nodes, and 10 recurrent nodes) and the second layer consisted of 10 nodes (10 recurrent nodes). The bias node feeds
a constant value of 1 into the input layer. The first of the two external inputs becomes the value of the temporal
signature pattern vector. The second external input is set to 1 whenever the pattern vector value equals 0 and is
otherwise set to 0.
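A serial sketch of one cell element's feedforward pass under these conventions follows. Only the node counts and the input encoding come from the text; the tanh activation and the exact ordering of the weights in W are assumptions.

```python
import math

def gmlp_cell_output(value, neighbors, recurrent, W, ww, Ws):
    """Feedforward pass of one 27-node GMLP cell element.

    First layer (17 nodes): bias (constant 1), external input 1 (the
    temporal signature value), external input 2 (1 iff the value is 0),
    4 neighbor cell outputs, and 10 recurrent values. Each of the 10
    second layer nodes also receives the outputs of the preceding second
    layer nodes; the last node's output, scaled by Ws, is the cell output.
    """
    first = [1.0, value, 1.0 if value == 0 else 0.0] + list(neighbors) + list(recurrent)
    assert len(first) == 17
    second = []
    for j in range(10):
        inputs = first[1:] + second          # 16 non-bias inputs + preceding outputs
        s = ww[j] + sum(w * x for w, x in zip(W[j], inputs))
        second.append(math.tanh(s))          # activation is an assumption
    return Ws * second[-1]

# Zero weights except the bias weights, to show the data flow only
W = [[0.0] * (16 + j) for j in range(10)]
ww = [0.5] * 10
out = gmlp_cell_output(0.0, [0.0] * 4, [0.0] * 10, W, ww, Ws=2.0)
```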
D. CSRN Computation
CSRNs can be trained in several ways, but the method which Ilin et al. [1] observed to have the best results was
the multi-stream extended Kalman filter (MSEKF) training technique. MSEKF works mainly using Equations (1) –
(4).
Γt = Ct Kt Ct^T + Rt (1)
Gt = Kt Ct^T Γt^-1 (2)
wt+1 = wt + Gt αt (3)
Kt+1 = Kt - Gt Ct Kt + Qt (4)
This is where the variables for time iteration t represent the following: Γt is the residual covariance, Ct is the state
observation matrix Jacobian, Kt is the predicted estimate covariance, Rt is the observation noise covariance, Gt is the
optimal Kalman gain, αt is the measurement residual, wt is the predicted state, and Qt is the process noise covariance.
For training, wt represents the shared weights for the CSRN (W and ww from Fig. 4). Also, Ct and αt are computed
based upon wt.
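For illustration, the update in Equations (1) – (4) can be written out for a scalar state, where the matrix products and the inverse of Γt collapse to ordinary arithmetic. The actual training uses the full matrix forms, with the Gauss-Jordan inverse described in Section III.

```python
def msekf_update(w, K, C, alpha, R, Q):
    """One MSEKF step, Equations (1)-(4), for a scalar state."""
    Gamma = C * K * C + R       # (1) residual covariance
    G = K * C / Gamma           # (2) Kalman gain
    w_next = w + G * alpha      # (3) weight update
    K_next = K - G * C * K + Q  # (4) covariance update
    return w_next, K_next

w1, K1 = msekf_update(w=0.0, K=1.0, C=1.0, alpha=1.0, R=1.0, Q=0.0)
# Gamma = 2, G = 0.5, so w1 = 0.5 and K1 = 0.5
```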
Training a single CSRN using MSEKF for the face recognition problem can be broken down into four stages, as
outlined in Algorithm 1.
_________________________________________________
Algorithm 1: Pseudocode for CSRN MSEKF training
_________________________________________________
// Over a set number of iterations (or until there
// is no change in weights)
For each iteration{
// Iterate over all samples to compute a collective
// Ct and αt
For each data sample{
CSRN Feedforward Pass (CSRNFF)
CSRN Feedback Pass (CSRNFB)
Calculate Ct and αt (CCA)
}
// Perform CSRN network update
Update wt and Kt (UWK)
}
_________________________________________________
The CSRN feedforward pass (CSRNFF) processes a data sample. During CSRNFF, the data sample is
propagated up through the GMLP contained within the CSRN cells. The output of CSRNFF is used as the overall
output as well as to compute αt. The CSRN feedback pass (CSRNFB) mainly helps in computing Ct. During
CSRNFB, the outputs of CSRNFF are propagated back down through the GMLP contained within the CSRN cells.
CSRNFF and CSRNFB both iterate over a predefined interval.
Once the outputs of both the CSRNFF and CSRNFB stages have been obtained, Ct and αt can be computed (CCA).
After Ct and αt are computed, Kt+1 and wt+1 can be computed (UWK), which consists of performing Equations (1) –
(4). To train a network of CSRNs, this process must be done for all CSRNs within the network. Likewise, for testing,
a network of CSRNs needs only to perform the CSRNFF stage for the respective data samples.
III. Implementation
A. CSRN Computation Issues
Using CSRNs in any application can quickly become a computationally intensive task as the input size increases.
However, in order to tackle really interesting problems, the input size of CSRNs needs to increase. The high
computational cost comes largely during training due to the matrix inversion of Γt as seen in Equation (2) along with
performing the increasing amount of cellular element computations during the CSRNFF and CSRNFB stages.
Therefore, CSRNs need a method to accelerate the computations.
The ideal platform for CSRN acceleration is something that would take advantage of both the inherent task-level
parallelism (the cell element computations of the CSRNFF and CSRNFB stages) and the data-level parallelism (the
vector and matrix operations of the CCA and UWK stages) available. One such device is a GPGPU.
B. GPGPU as an acceleration device
GPGPUs are quickly emerging as a premier acceleration platform, largely because of their low learning curve for
software developers. This leads to a reduced development cycle when compared to other acceleration platforms
such as FPGAs. Fig. 5 shows the general structure of a GPGPU. A GPGPU is composed of a number of scalar
processors (SPs) grouped together to form streaming multiprocessors (SMs). These streaming
multiprocessors contain their own shared memory, cache, multithreaded instruction unit (MTI), and special
functional units (SFUs). All streaming multiprocessors have access to the same global memory. Each streaming
multiprocessor is capable of processing several simultaneous processing blocks, and each processing block is
capable of executing up to 512 (or more, for newer generation GPGPUs) threads at a time.
GPGPUs work on the principle of divide and conquer. In order to achieve the best speedup performance
possible, the processing of an application has to be distributed within the GPGPU among thousands of lightweight
threads. Also, GPGPUs are more geared towards applications which have a higher compute to memory access ratio.
Within GPGPU applications, memory access has been shown to be a bottleneck in the designs.
Fig. 5. General structure of a GPGPU: multiple streaming multiprocessors (SMs), each containing scalar processors
(SPs), special functional units (SFUs), a multithreaded instruction unit (MTI), cache, and shared memory, all
connected to global memory.
C. Mapping CSRN to GPGPU
To perform the CSRN processing, we want to take advantage of both the task-level and data-level parallelism in
the computations. As previously noted, we took advantage of the task-level parallelism within the CSRNFF and
CSRNFB stages of the application and the data-level parallelism in the CCA and UWK stages. For the CSRNFF and
CSRNFB stages, we map the operations of each CSRN cell element (GMLP) to a single processing block.
Additionally, we map the operations of each node within the GMLP to a thread within the processing block. For the
CCA and UWK stage computations, we map the matrix/vector operations to a collection of blocks, such that each
element within the matrix/vector operations would be processed by a thread. The matrix inversion of Γt that occurs
as a part of Equation (2) is performed using a parallel Gauss-Jordan elimination technique on the GPGPU adapted
from [10].
We observed that we can dramatically reduce the number of accesses to global memory in our GPGPU
implementation of Gauss-Jordan elimination. Our observation can be seen from the example of Gauss-Jordan
elimination given in Fig. 6.
Initial matrix (appended with the identity matrix):
[ 0.5    1    2 |   1  0  0 ]
[ 3     10    8 |   0  1  0 ]
[ 9     30   25 |   0  0  1 ]

After 1st iteration:
[ 1    2    4 |   2  0  0 ]
[ 0    4   -4 |  -6  1  0 ]
[ 0   12  -11 | -18  0  1 ]

After 2nd iteration:
[ 1  0   6 |    5  -0.5   0 ]
[ 0  1  -1 | -1.5  0.25   0 ]
[ 0  0   1 |    0    -3   1 ]

After 3rd iteration:
[ 1  0  0 |    5   17.5  -6 ]
[ 0  1  0 | -1.5  -2.75   1 ]
[ 0  0  1 |    0     -3   1 ]

Fig. 6. Gauss-Jordan elimination matrix inversion. Shaded regions represent areas of change. Bold rows represent
the current observed row.
Given an initial matrix (the matrix to be inverted), it is appended with the identity matrix. After an iteration of
Gauss-Jordan elimination, only the shaded columns are modified. This is because the zeros present in the
remaining columns offer no change to the unobserved rows. Using this knowledge, we modified our GPGPU Gauss-
Jordan elimination routine to only access global memory during the times in which there will be modification to the
unobserved rows. By doing this, we were able to decrease the time necessary to compute the UWK stage by
approximately 57%.
We also observed another method to further reduce the number of global memory accesses. In Fig. 6, during the
computation of the Gauss-Jordan elimination routine, the observed row is normalized by the diagonal value every
iteration. This normalization can actually be postponed until after all iterations have been completed. We modified
our GPGPU implementation to normalize only after all iterations of the Gauss-Jordan routine have completed. By
doing this, the number of global memory accesses during the UWK stage was reduced, which resulted in a reduction
in the computation time.
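A minimal serial sketch of Gauss-Jordan inversion with the normalization deferred to a single final pass follows. It illustrates the arithmetic only; the GPGPU kernel, pivoting, and the global-memory access pattern of the optimized routine are outside its scope.

```python
def gauss_jordan_inverse(A):
    """Invert A by Gauss-Jordan elimination, normalizing each pivot row
    only once, after all elimination iterations (no pivoting)."""
    n = len(A)
    # Append the identity matrix to form the augmented matrix
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(A)]
    for i in range(n):                       # eliminate column i from every other row
        for r in range(n):
            if r != i:
                f = M[r][i] / M[i][i]
                for c in range(2 * n):
                    M[r][c] -= f * M[i][c]
    # Deferred normalization: divide each row by its pivot, keep the inverse half
    return [[v / M[i][i] for v in M[i][n:]] for i in range(n)]

A = [[0.5, 1, 2], [3, 10, 8], [9, 30, 25]]   # the Fig. 6 example matrix
inv = gauss_jordan_inverse(A)                 # [[5, 17.5, -6], [-1.5, -2.75, 1], [0, -3, 1]]
```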
Fig. 7 shows a representative flow chart for the GPGPU operations for training one CSRN within the network.
The flow chart shown in Fig. 7 would be used to train all CSRNs within the network for their respective inputs.
Fig. 7. Flow chart for GPGPU CSRN mapping for MSEKF training: starting from sample submission, the CSRNFF,
CSRNFB, and CCA modules run for each sample; once no samples remain, UWK runs, and the whole process
repeats until the maximum iteration count is reached. The shaded regions (CSRNFF, CSRNFB, CCA, and UWK)
represent processing modules placed on the GPGPU.
IV. Results
We implemented two designs: an optimized CPU version using the C programming language and OpenCV
library [11,12] and a GPGPU version using both the C programming language and CUDA [13,14] (a parallel
computing architecture C programming language extension). The CPU implementation was performed on a dual
quad core Intel Xeon 2.67 GHz processor and the GPGPU implementation was performed on a combination of the
dual quad core Intel Xeon 2.67 GHz processor and an NVIDIA Tesla C2050 GPGPU (14 multiprocessors).
Using the CSRN setup designed for face recognition, we chose to accelerate only the CSRN training phase of
the algorithm, as this phase is the slowest. In doing so, we repeated the experiments of Ren et al. utilizing face
sequences from the publicly available VidTIMIT database [15]. In these experiments, we utilized a CSRN network
designed to classify 5 people. Each person class consisted of a group of 10 CSRNs using 10 samples from the face
sequence. We also trained the network with 5 different face sequences per person. We performed the training using
MSEKF. The times are shown in Table I.
Table I. Comparison of timing measurements for CSRN MSEKF training using CPU and GPGPU implementations.

Implementation    Training Time (seconds)
CPU               2,237.65
GPGPU             97.48
As shown in Table I, the GPGPU implementation offers a speedup of 22.9 times over the CPU
implementation. We project that we would see further speedup by using more samples as input. Also,
increasing the number of CSRNs per person class (increasing the size of the PCA reduced pattern vectors) should
result in more speedup. Lastly, dividing the application across multiple processors/GPGPUs should result in even
greater speedup.
V. Conclusion
Pose invariant face recognition is a difficult task in computer vision. As discussed in this paper, CSRNs are a
biologically inspired approach that can be applied to this face recognition task. The computational requirement for
CSRNs grows as the input image size grows. In this paper we examined the acceleration of CSRN based face
recognition on an Intel Xeon 2.67 GHz processor and an NVIDIA C2050 GPGPU. Our results indicate that
speedups of 22.9 times are possible during training with the GPGPU. This indicates that the GPGPU is able to take
advantage of the parallelism within the CSRN based face recognition.
Acknowledgments
This work was supported by an NSF CAREER Award and grants from the US Air Force.
References
[1] R. Ilin, R. Kozma, and P. J.Werbos, “Beyond feedforward models trained by backpropagation: a practical training tool for a
more efficient universal approximator,” in IEEE Trans. Neural Netw., vol. 19, no. 6, pp. 929 - 937, 2008.
[2] Y. Ren, K. Anderson, K. Iftekharuddin, P. Kim, and E. White, “Pose invariant face recognition using cellular simultaneous
recurrent networks”, in Proc. Int. Joint Conf. on Neural Netw., 2009, pp. 2691-2698.
[3] D. Strigl, K. Kofler, and S. Podlipnig, “Performance and scalability of GPU-based convolutional neural networks,” in Proc.
of the 18th Euromicro Conf. on Parallel, Distributed and Network-based Processing, 2010, pp. 317-324.
[4] R. Dolan, and G. DeSouza, “GPU-based simulation of cellular neural networks for image processing,” in Proc. of Int. Joint
Conf. on Neural Netw., 2009, pp. 730–735.
[5] J. M. Nageswaran, N. Dutt, J. L. Krichmar, A. Nicolau, and A. Veidenbaum, “Efficient simulation of large-scale spiking
neural networks using CUDA graphics processors,” in Proc. of Int. Joint Conf. on Neural Networks, 2009, pp. 2145–2152.
[6] X. Pang and P. Werbos, “Neural network design for J function approximation in dynamic programming,” arXiv:adap-
org/9806001v1, 1998.
[7] P. Werbos, The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting, New
York, NY: Wiley-Interscience, 1994.
[8] D. J. Felleman , D. C. Van Essen, “Distributed hierarchical processing in the primate cerebral cortex,” Cerebral Cortex,
vol. 1, pp. 1–47, 1991.
[9] S. Gongy, A. Psarrouz, I. Katsoulisy, and P. Palavouzisy, “Tracking and recognition of face sequences,” in Proc. of
European Workshop on Combined Real and Synthetic Image Processing for Broadcast and Video Production , 1994, pp. 96
– 112.
[10] Thinking in CUDA: Gauss-Jordan elimination. [Online] Available at:
http://cudahacks.com/cms/Blog/tabid/64/EntryId/7/Thinking-in-CUDA-Gauss-Jordan-elimination.aspx.
[11] Open Computer Vision Library. [Online] Available at: http://sourceforge.net/projects/opencvlibrary/.
[12] G. Bradski and A. Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library. Cambridge, MA: O’Reilly
Press, 2008.
[13] NVIDIA. NVIDIA Developer Zone. [Online] Available at: http://developer.nvidia.com/object/cuda_3_2_downloads.html.
[14] NVIDIA. NVIDIA CUDA C Programming Guide, version 3.2, 2010. [Online] Available at:
http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf.
[15] C. Sanderson and K. K. Paliwal, “Polynomial Features for Robust Face Authentication,” in IEEE Int. Conf. on Image
Processing, 2002, pp. 997-1000.