
Research Article
Face Alignment Algorithm Based on an Improved Cascaded Convolutional Neural Network

Xun Duan, Yuanshun Wang, and Yun Wu

School of Computer Science and Technology, GuiZhou University, GuiYang 550025, China

Correspondence should be addressed to Yuanshun Wang; yswang837@gmail.com

Received 1 November 2020; Revised 12 February 2021; Accepted 30 March 2021; Published 8 April 2021

Academic Editor: Deepu Rajan

Advances in Multimedia, Volume 2021, Article ID 6654561, 9 pages, https://doi.org/10.1155/2021/6654561

Copyright © 2021 Xun Duan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Aiming at the problem of the large number of parameters and high time complexity of current deep convolutional neural network models, an improved face alignment algorithm based on a cascaded convolutional neural network (CCNN) is proposed from the perspectives of network structure, random perturbation factor (shake), and data scale. The algorithm steps are as follows: 3 groups of lightweight CNNs are designed; the first group takes facial images with a face frame as input, trains 3 CNNs in parallel, and outputs, by weighted averaging, facial images with 5 facial key points (anchor points). Then, the anchor points and 2 different windows with a shake mechanism are used to crop out 10 partial images of human faces. The second group trains 10 CNNs in parallel, and every 2 networks' outputs are weighted-averaged to co-locate one key point. Based on the second group of networks, the third group uses a smaller shake mechanism and smaller windows to achieve further fine-tuning. When training the networks, the idea of parallel within groups and serial between groups is adopted. Experiments show that, on the LFPW face dataset, the improved CCNN in this paper is superior to other algorithms of the same type in positioning speed, number of algorithm parameters, and test error.

1. Introduction

The task of face alignment [1-3] is also called facial landmark localization. Given a facial image, the algorithm locates the key points of the face, including the left eye, the right eye, the tip of the nose, and the left and right corners of the mouth. The difficulty lies in quickly and accurately predicting the coordinate values of the 5 key points of a given facial image. Facial landmark localization [4, 5] is a core algorithmic task that plays a crucial role in many scientific research and application topics. It has long been a research issue in the fields of image processing, pattern recognition, and computer vision.

Current facial landmark localization methods are mainly divided into 3 categories: the active appearance model (AAM) [6] based on traditional methods, the active shape model (ASM) [7] based on models, and methods based on deep learning. The AAM method and the model-based ASM

extract the semantic features of a given facial image by iteratively solving an optimization problem [8] under various constraints. In addition to their high computational complexity, they also easily fall into local minima. With the rise of deep learning methods, face alignment tasks have achieved highly accurate results. However, current high-precision detection algorithms mostly use deep networks (such as VGG [9] and ResNet-34 [10]), which introduce a substantial number of parameters, so the time and space complexity increase dramatically; moreover, the chain rule can easily lead to vanishing and exploding gradients in deep networks [11], which can prevent the network from learning useful rules. For this reason, this paper improves the cascaded convolutional neural network [12, 13] in terms of network structure, the random perturbation factor (shake), and the data scale. Under the condition of ensuring accuracy, lightweight [14] convolutional neural networks are


designed to reduce the number of parameters that the networks need to learn and to accelerate the speed of locating key points on the face.

2. Related Theories

Convolutional neural networks (CNNs) [15] are a type of artificial neural network proposed by combining traditional artificial neural networks with deep learning. Since fully connected neural networks have a large number of parameters when processing image data, CNNs optimize the structure of the traditional fully connected neural network by introducing weight sharing [16] and local perception, which greatly reduces the number of learnable parameters. Since the feature maps output by the CNN pooling layers have the advantages of rotation invariance and translation invariance, CNNs have natural robustness [17] for processing image data. In common CNN architectures, convolutional layers and pooling layers are used alternately to extract the semantic information of the facial image from low level to high level. After the number of channels and the size of the feature maps reach a certain dimension, the feature maps are arranged in order and converted into a one-dimensional feature vector, which is connected to the fully connected layers for dimensional transformation. The operation of a convolutional layer can be expressed as follows:

X^{(l,k)} = f\Big( \sum_{p=1}^{n_{l-1}} W^{(l,k,p)} \otimes X^{(l-1,p)} + b^{(l,k)} \Big)    (1)

In formula (1), X^{(l,k)} denotes the k-th feature map output by the l-th hidden layer, n_l denotes the number of channels of the feature maps in the l-th layer (so the sum runs over the n_{l-1} input maps), and W^{(l,k,p)} denotes the convolution kernel used when mapping the p-th feature map of the (l-1)-th layer to the k-th feature map of the l-th layer. The algorithm in this paper uses max-pooling; after the pooling operation, the spatial size of the feature map is reduced to 1/step of the original. Max-pooling can be expressed as follows:

X^{(l+1,k)}(m, n) = \max_{0 \le a, b < s} \{ X^{(l,k)}(m \cdot \text{step} + a, \; n \cdot \text{step} + b) \}    (2)

In formula (2), X^{(l+1,k)}(m, n) denotes the value at coordinate (m, n) of the k-th feature map output by the (l+1)-th layer.
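As a concrete illustration of formulas (1) and (2), the following minimal PyTorch sketch (PyTorch is the framework used in Section 4) applies one convolutional layer followed by ReLU and 2×2 max-pooling; the channel counts and input size are borrowed from the first-stage network described later and are otherwise illustrative.

import torch
import torch.nn as nn

# Formula (1): each output map is a bias plus a sum over input channels of
# kernel (*) input map, passed through a nonlinearity f (ReLU here).
conv = nn.Conv2d(in_channels=3, out_channels=20, kernel_size=4, stride=1)
f = nn.ReLU()

# Formula (2): max-pooling keeps the maximum inside each step x step window,
# shrinking the spatial size of the feature map by the pooling stride.
pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.randn(1, 3, 39, 39)   # dummy 39x39x3 face crop
y = pool(f(conv(x)))            # shape: (1, 20, 18, 18)
print(y.shape)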

When optimizing the CNN model, back propagation is used to update all the connection weights between neurons so as to minimize the loss function. Since the face alignment task is a regression task [18], the mean square error (MSE) is used to define the loss function, which can be expressed as follows:

\text{MSE} = \frac{1}{2N} \sum_{i=1}^{N} \| O_i - P_i \|_2^2    (3)

In formula (3), N is the number of output nodes of the neural network, O is the manually labeled value, and P is the value predicted by the neural network.
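Formula (3) translates directly into code; a small sketch follows, in which the batch dimension and the choice of N as the 10 output coordinates of 5 key points are assumptions made for illustration.

import torch

def mse_loss(pred, target):
    # Formula (3): MSE = 1/(2N) * sum_i ||O_i - P_i||_2^2, with N output values
    # (10 coordinates for 5 key points); averaged over the batch here.
    n = target.shape[-1]
    return ((target - pred) ** 2).sum(dim=-1).mean() / (2 * n)

pred = torch.rand(8, 10)     # predicted (x, y) values for 5 key points
target = torch.rand(8, 10)   # manually labeled coordinates
print(mse_loss(pred, target))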

3. Improved CCNN

To improve the detection efficiency and accuracy of the algorithm in this paper, the CCNN is improved in the following 4 aspects.

Improvement 1: designing shallower networks. Deep networks create a large number of parameters, resulting in a sharp increase in time and space complexity. Additionally, the chain rule can easily lead to vanishing and exploding gradients in deep networks, which can prevent the networks from learning useful rules. The CCNN algorithm has a total of 23 CNNs, and each CNN contains at most 8 layers and at least 6 layers. Unnecessary convolution and pooling layers are removed, and the number of algorithm parameters is greatly reduced to ensure the running speed of the algorithm.

Improvement 2: a shake mechanism is proposed to solve the problem that the rules learned by the networks tend toward the geometric center of the windows, which improves the generalization ability (robustness) of the model. The positioning accuracy and speed are increased by designing the size of the windows.

Improvement 3: adopting multilevel cascaded recursive CNNs. It is verified by experiments that the 3-stage cascaded structure of the CCNN selected in this paper can guarantee high positioning accuracy within an acceptable time range.

Improvement 4: designing a smaller input image size. A smaller input image size can speed up the training of the networks, and the number of channels of the convolution kernels can be increased to compensate for the information loss caused by the smaller input image.

3.1. CCNN Framework Design. The framework of the improved cascaded convolutional neural network in this paper is divided into 3 stages, which realizes the process from global coarse positioning to local precise positioning.

The overall design of the CCNN algorithm adopts a recursive idea, and the recursion termination condition is "the number of network stages that balances positioning accuracy and positioning time complexity." Based on this, a cascaded convolutional neural network with 3 stages is designed: in addition to guaranteeing a high positioning accuracy rate, it is also necessary to ensure that the positioning results are obtained within an acceptable time frame. As shown in Table 1, in the design of the CCNN algorithm, each stage recursively generates the key points of the face.


3.1.1. The First Stage. As shown in Figure 1, the original image is marked with the initial face frame, and Figure 1(a) is cropped into 3 areas according to the face distribution of the LFPW dataset (respectively, the key point area contained in the initial face frame; the left eye, right eye, and nose tip key point area; and the nose tip, left mouth corner, and right mouth corner key point area). These 3 areas are used as the input of the 3 CNNs in the first stage, and the coordinate values of the 5 key points are output as a weighted average to obtain Figure 1(b).

The rough positioning of the global key points is thus realized. The weighted average method is shown as follows:

(x, y)_p = \frac{1}{N} \sum_{i=1}^{N} (\hat{x}_i, \hat{y}_i)    (4)

In formula (4), p belongs to the set {left eye, right eye, nose tip, left mouth corner, right mouth corner}, and N is the number of repeated positionings.

3.1.2. The Second Stage. The 5 key points output by the first stage are used as anchor points and expanded to the top, bottom, left, and right of the anchor points to obtain the current 10 windows. The expansion method is based on the following formula:

\text{windows}_i = (a_1 \cdot h, \; b_1 \cdot w), \quad i = 1, 2, \ldots, 5
\text{windows}_j = (a_2 \cdot h, \; b_2 \cdot w), \quad j = 6, 7, \ldots, 10    (5)

In formula (5), windows_i represents the 5 windows of the first group, windows_j represents the 5 windows of the second group, a and b are hyperparameters (a_1 and b_1 take the value 0.16, a_2 and b_2 take the value 0.18), and h and w are the height and width of the initial face frame. Then, according to the size of the windows and the shake (shake in this paper follows a normal distribution, namely shake ~ N(0, 1e-4), as arrow ① in Figure 1), 10 partial human face images are cut out with the random perturbation factor to obtain Figure 1(c). Shake can be regarded as a kind of data augmentation operation; its essence is a translation strategy that shifts the window randomly up, down, left, and right by shake units. These 10 partial facial images are taken as the input of the 10 CNNs in the second stage, and every 2 CNNs use formula (4) to jointly weight and average a key point to obtain Figure 1(d), thereby realizing local key point positioning.
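A small sketch of how such a shaken window crop might be computed is given below. The treatment of (a·h, b·w) as the half-extent of a window centred on the anchor, the expression of the shift as a fraction of the face frame size, and the helper name are assumptions made for illustration; shake ~ N(0, 1e-4) corresponds to a standard deviation of 1e-2.

import numpy as np

def shaken_window(anchor, face_h, face_w, a, b, rng=np.random.default_rng()):
    # Window of formula (5): extent (a*h, b*w) around the anchor key point.
    # Shake: a small random translation drawn from N(0, 1e-4), i.e. std 1e-2.
    x, y = anchor
    dx, dy = rng.normal(0.0, 1e-2, size=2)
    x += dx * face_w
    y += dy * face_h
    half_w, half_h = b * face_w, a * face_h
    return (x - half_w, y - half_h, x + half_w, y + half_h)  # crop box

# Second-stage hyperparameters from the text: a1 = b1 = 0.16, a2 = b2 = 0.18.
left_eye = (120.0, 95.0)   # an anchor point from the first stage (made up)
print(shaken_window(left_eye, face_h=180.0, face_w=160.0, a=0.16, b=0.16))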

3.1.3. The Third Stage. Taking the 5 key points output by the second stage as anchor points, and by designing smaller windows and shake (a_1 and b_1 take the value 0.08, a_2 and b_2 take the value 0.09; shake is the same as in the second stage, as arrow ② in Figure 1), the 10 smaller partial images around the facial key points are cut out to obtain Figure 1(e). These 10 partial images are used as the input of the 10 CNNs in the third stage to obtain Figure 1(f), and every 2 CNNs are jointly weighted-averaged by formula (4) to locate a key point, obtaining the accurate localization of the partial image of the face, which is the final output of the algorithm in this paper.
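Putting the three stages together, the overall flow of Table 1 and Figure 1 can be sketched as below. The cropping helpers and the CNN lists are passed in as placeholders rather than identifiers from the paper, and for simplicity each first-stage CNN is assumed to predict all 5 key points (in the paper, F1 predicts 5 while EN1 and NM1 predict 3 each before fusion).

import torch

def average(estimates):
    # Formula (4): plain average of repeated estimates of the same key point(s).
    return torch.stack(estimates).mean(dim=0)

def fuse_pairs(estimates):
    # Every 2 CNNs co-locate one key point: average consecutive pairs.
    return torch.stack([average(estimates[i:i + 2])
                        for i in range(0, len(estimates), 2)])

def ccnn_predict(image, face_frame, stage1_cnns, stage2_cnns, stage3_cnns,
                 crop_stage1_regions, crop_windows):
    # Stage 1: 3 CNNs on 3 face regions -> 5 coarse key points (y1).
    regions = crop_stage1_regions(image, face_frame)
    y1 = average([cnn(r) for cnn, r in zip(stage1_cnns, regions)])
    # Stage 2: 10 shaken windows around y1 (a1 = b1 = 0.16, a2 = b2 = 0.18);
    # every 2 of the 10 CNNs co-locate one refined key point (y2).
    y2 = fuse_pairs([cnn(p) for cnn, p in
                     zip(stage2_cnns, crop_windows(image, y1, 0.16, 0.18))])
    # Stage 3: the same procedure with smaller windows (0.08 / 0.09) -> y3.
    y3 = fuse_pairs([cnn(p) for cnn, p in
                     zip(stage3_cnns, crop_windows(image, y2, 0.08, 0.09))])
    return y3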

3.2. Network Structure Design. Since the algorithm in this paper is based on the recursive idea, we take the second stage (level 2) and the LE21 CNN and LE22 CNN, which locate the key point of the left eye (LE), as examples to describe the network structure design of the CCNN (F is the full face, L is left, R is right, E is the eye, N is the nose, M is the mouth corner, and LE21 CNN denotes the first CNN that locates the left eye in the second stage; the same naming applies below).

3.2.1. Level 2 Network Structure. Figure 2 shows the level 2 network structure. This stage contains 10 lightweight CNNs, namely LE21, LE22, RE21, RE22, N21, N22, LM21, LM22, RM21, and RM22; every 2 CNNs are jointly weighted-averaged by formula (4) to locate a key point, combined with the positioning results of the other 8 CNNs, and finally the level 2 output is obtained.

3.2.2. Level2-LE21 CNN Network Structure. To improve the positioning speed and efficiency of the algorithm in this article, we design a new cascaded convolutional neural network in this section, which accelerates the training and testing of the network by reducing the number of layers of each CNN and reducing the size of the input image. Compared with reference [13], this paper reduces the number of network layers from 12 to 6, and the input size of the second stage is reduced from 32×32×3 to 15×15×3.

Figure 3 shows the network structure of the level2-LE21 CNN. We input a 15×15×3 left-eye local area map; after 2 sets of convolution-pooling layers and 2 fully connected layers, we output the left-eye predicted coordinate value of the

Table 1: Design of the CCNN algorithm.

Stage 1
  Input: X (face images with face frames)
  Step 1: resize X to 39×39×3
  Step 2: after 2 sets of convolution-pooling layers and 2 fully connected layers, generate a candidate set of key points
  Step 3: same as step 2, generate candidate face key point coordinates
  Output: y1 (face image with 5 weighted key points)

Stage 2
  Input: use 2 different windows with shake to crop y1 and obtain 10 partial face images
  Step 1: similar to stage 1, step 2, generate a candidate set of key points
  Step 2: every 2 CNNs co-locate a key point
  Output: y2 (face image with 5 weighted key points)

Stage 3
  Input: use 2 smaller different windows with shake to crop y2 and obtain 10 partial face images
  Step 1: similar to stage 2, step 1, generate a candidate set of key points
  Step 2: every 2 CNNs co-locate a key point
  Output: y3 (face image with 5 weighted key points)


[Figure 2: Schematic diagram of the level 2 network structure. Each key point is handled by a pair of CNNs (LE21/LE22, ..., RM21/RM22) fed by window 1/shake 1 and window 2/shake 2 crops, whose outputs are averaged to form the level 2 output.]

[Figure 3: Network structure of the level2-LE21 CNN: 15×15×3 input → 4×4 convolution (20 maps, 12×12) → 2×2 max pooling (6×6×20) → 3×3 convolution (40 maps, 4×4) → 2×2 max pooling (2×2×40) → fully connected layer (60) → fully connected layer (2).]

[Figure 1: CCNN algorithm framework, showing the input with its initial face frame, the rough localization of the first stage, the localization of the second stage, the precise localization of the third stage, and the final output; arrows ① and ② mark the shake-based cropping steps, and panels (a)-(f) correspond to the intermediate results described in Section 3.1.]


LE21 CNN, combined with the left-eye coordinate value output by the LE22 CNN; finally, we take the weighted average of the left-eye coordinate values of level 2 in the manner of formula (4).
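A minimal PyTorch sketch of such a second-stage sub-network, following the 15×15×3 → conv(4×4, 20) → pool → conv(3×3, 40) → pool → fc(60) → fc(2) layout of Figure 3; the placement of ReLU and dropout is an assumption based on Section 3.3.

import torch
import torch.nn as nn

class LE21(nn.Module):
    # Second-stage sub-network: 15x15x3 left-eye patch -> (x, y) coordinate.
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 20, kernel_size=4), nn.ReLU(),   # 15x15x3 -> 12x12x20
            nn.MaxPool2d(2, 2),                           #         -> 6x6x20
            nn.Conv2d(20, 40, kernel_size=3), nn.ReLU(),  #         -> 4x4x40
            nn.MaxPool2d(2, 2),                           #         -> 2x2x40
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(2 * 2 * 40, 60), nn.ReLU(),         # fully connected, 60
            nn.Dropout(0.25),                             # dropout value from Section 3.3
            nn.Linear(60, 2),                             # predicted (x, y)
        )

    def forward(self, x):
        return self.regressor(self.features(x))

print(LE21()(torch.randn(1, 3, 15, 15)).shape)            # torch.Size([1, 2])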

3.2.3. Proposing the Shake Mechanism and Windows. The entire algorithm flow is carried out recursively in stages. As shown in Figure 1, the original red frame of the input picture in the first stage can be regarded as a larger window, and each small green frame of the input picture in the second stage is a smaller window. Since the positioning method of the algorithm in this paper uses relative positioning with respect to the windows, the more accurate the windows, the more accurate the model positioning, and the smaller the windows, the faster the model positioning. To avoid the immutability of the windows, the shake mechanism is proposed to slightly shake the windows, that is, to translate them (explained in detail in Section 3.1.2); it can also be regarded as a method of data augmentation, thereby improving the generalization ability of the model.

3.2.4. Model Parameters. The network structure parameters of the first stage are shown in Table 2, where the fully connected output layers of F1, EN1, and NM1 are 10d-fc2, 6d-fc2, and 6d-fc2, respectively. The network structure of the third stage is the same as that of the second stage. The total number of network parameters is only 643,920, which is approximately 2.46 MB.
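The quoted size follows from counting 32-bit parameters; a small sketch of the check (the helper function is ours, not from the paper):

import torch.nn as nn

def parameter_budget(model: nn.Module):
    # Number of learnable parameters and their size in MB, assuming float32.
    n = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return n, n * 4 / 2 ** 20

# 643,920 float32 parameters occupy 643920 * 4 / 2**20 ~= 2.46 MB,
# matching the figure quoted for the whole 23-CNN cascade.
print(643920 * 4 / 2 ** 20)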

3.3. CCNN Model Training. The model training method in this paper adopts the idea of parallel within groups and serial between groups. For each lightweight CNN, a rectified linear unit (ReLU) is used after each convolutional layer to increase the nonlinear expression ability of the model, max-pooling layers are used to downsample the feature maps of the convolutional layers, and the fully connected layers use dropout to improve the robustness of the model (the dropout value is 0.25). The initial learning rate is 5e-3, and every 100 iterations the learning rate decays to 90% of its previous value (the other hyperparameters, such as batch size and number of iterations, are selected according to the actual situation). The stochastic gradient descent (SGD) algorithm is used to update the weights of all connections between neurons during network training. To further increase the robustness of the model, L2 regularization is introduced to penalize the learnable parameters W, and the final expression of the newly designed loss function is as follows:

E = \frac{1}{m} \sum_{j=1}^{m} \left( \frac{1}{2N} \sum_{i=1}^{N} \| y_i^j - d_i^j \|_2^2 \right) + \frac{1}{2} \eta \| W \|_2^2    (6)

In formula (6), m is the batch size and W is the weight matrix of the network. The weight matrix W is updated during error back propagation. Before the network starts training, the Xavier [19] method is used to initialize the weight matrix W_0. The weight matrix W_{t+1} obtained after t+1 iterations can be expressed as follows:

W_{t+1} = W_t - \lambda \cdot \frac{\partial E}{\partial W_t}    (7)

4. Experiments and Results

The operating system used in the experiments is CentOS 7 (64-bit), running on a Dell server equipped with 2 RTX 2080 Ti graphics cards and 32 GB of memory, and the code runs in the PyTorch framework.

4.1. Dataset. These experiments use the LFPW facial dataset, a total of 13,466 facial images. Each facial image has 3 RGB channels and relevant coordinate annotation information (including key point coordinates and initial face frame coordinates). The original data use 10,000 images as the training set and 3,466 images as the test set. Data augmentation [20, 21] can be used to increase the number of training samples to effectively improve the performance of the convolutional neural network. To stay consistent with realistic conditions, we apply center rotations of 15 degrees to the left and 12 degrees to the right (the label coordinates are rotated accordingly); the rotation method is given in formula (8), where θ is the rotation angle.

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}    (8)

Then the training dataset is flipped horizontally (the label coordinates are flipped accordingly), and 60,000 training samples are finally obtained. Each training sample requires data standardization (subtracting the mean and dividing by the variance). In addition, for each training sample, the influence of the length and width of the face frame (the windows at each stage) needs to be eliminated. The elimination method is shown as follows:

x_{\text{direction}} = \frac{| x' - x |}{w}, \qquad y_{\text{direction}} = \frac{| y' - y |}{h}    (9)

In formula (9), x and y are the horizontal and vertical coordinates of the upper-left corner of the window, x' and y' are the horizontal and vertical coordinates of the key points predicted within the current window, and w and h are the width and height of the window.
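A sketch of the label handling: rotating the landmark coordinates together with the image (formula (8), here about a chosen center) and normalizing them to the window (formula (9)). Rotating the pixels themselves with an image library is omitted, and the sample coordinates are made up.

import numpy as np

def rotate_points(points, theta_deg, center=(0.0, 0.0)):
    # Formula (8): rotate landmark coordinates by theta; the labels must be
    # rotated together with the image (here about an arbitrary center).
    t = np.deg2rad(theta_deg)
    rot = np.array([[np.cos(t), -np.sin(t)],
                    [np.sin(t),  np.cos(t)]])
    c = np.asarray(center, dtype=float)
    return (np.asarray(points, dtype=float) - c) @ rot.T + c

def normalize_to_window(points, x0, y0, w, h):
    # Formula (9): express each key point as |x' - x| / w and |y' - y| / h
    # relative to the window's upper-left corner (x, y) and its width/height.
    pts = np.asarray(points, dtype=float)
    return np.abs(pts - np.array([x0, y0])) / np.array([w, h])

keypoints = np.array([[120.0, 95.0], [160.0, 96.0]])     # e.g. left and right eye
rotated = rotate_points(keypoints, 15.0, center=(140.0, 120.0))
print(normalize_to_window(rotated, x0=80.0, y0=60.0, w=160.0, h=180.0))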

4.2. Experiment Results

Experiment 1: verifying the necessity of the 3-stage cascaded structure.

As shown in Figure 4, the vertical axis is the test error, the horizontal axis is the facial key points, and the blue, red, and green broken lines are the test errors of the first stage, the second stage, and the third stage, respectively.


The test error of the second stage is lower than that of the first stage, and the test error of the third stage decreases further compared with the second stage: the average test error decreases from 2.21% to 1.37% and then to 1.03%. To balance positioning accuracy and positioning speed, this article does not cascade a fourth stage of CNNs.

Experiment 2: verifying the necessity of windows.

Figure 5 shows the impact of the presence or absence of window frames on the test error of each key point in the first stage. The vertical axis is the test error, the horizontal axis is the facial key points, the blue broken line is windows-, which uses absolute positioning, and the red broken line (windows+) indicates relative positioning with respect to the windows.

The relative positioning method is more stable, and the test error at each key point is lower than that of the absolute positioning method; the average test error decreases from 3.38% to 2.35%. Because the first stage only performs coarse positioning of the global key points, the test errors of both the blue and red broken lines are high.

Experiment 3: verifying the necessity of the shake mechanism.

Figure 6 shows the impact of the presence or absence of the shake factor on the test error of each key point in the first stage.

The vertical axis is the test error and the horizontal axis is the facial key points. The blue broken line is shake-, which means no data augmentation operation; the red broken line is shake+, which denotes data augmentation on the partial images. The test error of the red broken line after data augmentation is lower than that of the blue broken line, which shows that the shake mechanism improves the generalization ability of the model; the average test error drops from 2.35% to 1.90%. Because the first stage only performs coarse positioning of the global key points, the test errors of both broken lines are high.

Experiment 4: comparison with other algorithms of the same type in LE error, RE error, N error, LM error, RM error, and average test error.

Figure 7 shows the test error of each key point for each algorithm. The vertical axis is the test error and the horizontal axis is the facial key points.

The blue and red broken lines are the algorithms of Sun [22] and Chen Rui [13], respectively, and the green broken line is the algorithm in this paper. It can be seen that the algorithm in this paper achieves lower LE, RE, N, and LM errors and a lower average test error than the other algorithms of the same type, with the average test error reaching 1.03%. The test error at the RM key point is slightly higher than that of Chen Rui's [13] algorithm, which indirectly

Table 2: Network structure of the first stage.

Name      | Input     | Convolution kernels/steps | Output    | Parameters
conv1     | 39×39×3   | 4×4×20 / 1                | 36×36×20  | 960
pool1     | 36×36×20  | 2×2 / 2                   | 18×18×20  | -
conv2     | 18×18×20  | 3×3×40 / 1                | 16×16×40  | 7200
pool2     | 16×16×40  | 2×2 / 2                   | 8×8×40    | -
conv3     | 8×8×40    | 3×3×60 / 1                | 6×6×60    | 21600
pool3     | 6×6×60    | 2×2 / 2                   | 3×3×60    | -
120d-fc1  | 3×3×60    | -                         | 120       | 64800
10d-fc2 (F1), 6d-fc2 (EN1), 6d-fc2 (NM1)

[Figure 4: Test error of each key point (LE, RE, N, LM, RM, and average test error) at the first, second, and third stages.]

[Figure 5: Test error of each key point with (windows+) and without (windows-) windows.]


indicates that each network needs separate hyperparameters to train.

In addition, the learnable parameters of this algorithm number only 643,920, approximately 2.46 MB, which is approximately 58% of the parameters of Chen Rui [13] and 5.6% of those of ResNet18. It requires only 14.7 milliseconds to process a facial image on the GPU. As shown in Table 3, the algorithm greatly reduces the consumption of hardware resources and can be used as an embedded algorithm for cameras or apps on mobile devices such as mobile phones.

Figure 8 shows example results of the facial landmark localization of the algorithm in this paper. The first row of facial images is the output of the first CNN stage, and so on. It can be seen that the output of the first CNN stage (global key point coarse positioning) has flaws, the positioning accuracy of the second CNN stage (local key point positioning) improves significantly, and the positioning accuracy of the third CNN stage (accurate positioning of local key points) improves further, to the point where the differences are no longer distinguishable by the naked eye.

Under the influence of distorted expressions and diverse postures, the algorithm still achieves accurate positioning. For tasks with very high real-time requirements, the first 2 stages of the cascaded convolutional neural networks alone are sufficient.

[Figure 8: Effect picture of facial landmark localization.]

[Figure 6: Test error of each key point with (shake+) and without (shake-) the shake mechanism.]

[Figure 7: Test error of each key point for each algorithm (Sun [22], Chen Rui [13], and ours).]

Table 3: Parameters and speed of each algorithm.

Methods/indicators      | Parameters (MB) | Speed (ms)
Algorithm of this paper | 2.46            | 14.7
ResNet18                | 44.59           | -
ResNet34                | 85.15           | -
Chen Rui [13]           | 4.24            | 15.9


5. Conclusions

The improved face alignment algorithm based on a cascaded convolutional neural network in this paper realizes the process from coarse positioning to precise positioning by designing a 3-stage network structure. The improvement of the windows accelerates the training of the network and improves the positioning accuracy of the key points. The proposed shake factor can be regarded as a data augmentation operation, which improves the generalization ability of the model. The test on the LFPW face dataset shows that the average test error of the algorithm in this paper reaches 1.03%, which is approximately 1.16% lower than that of algorithms of the same type; in addition, it takes only 14.7 milliseconds to process a facial image on the GPU, which is 1.2 milliseconds faster than Chen Rui's [13] algorithm, so a small-scale and efficient facial landmark localization network is realized. The algorithm shows good anti-interference ability with respect to poses and expressions. Future work can investigate data augmentation further and add detection technology to provide a more accurate initial face frame.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China ([2018] 61741124) and the Guizhou Science and Technology Plan Project ([2018] 5781).

References

[1] G. Q. Qi, J. M. Yao, H. L. Hu et al., "Face landmark detection network with multi-scale feature map fusion," Application Research of Computers, vol. 1-6, pp. 1001–3695, 2019.

[2] J. Xu, W. J. Tian, and Y. Y. Fan, "Simulation of face key point recognition and location method based on deep learning," Computer Simulation, vol. 37, no. 06, pp. 434–438, 2020.

[3] J. Li, The Research and Implement of Face Detection and Face Alignment in the Wild, NanJing University, Nanjing, China, 2019.

[4] X. F. Yang, Research on Facial Landmark Localization Based on Deep Learning, Xiamen University, Xiamen, China, 2019.

[5] C. X. Jing, Research on Face Detection and Face Alignment Based on Deep Learning, China Jiliang University, Hangzhou, China, 2018.

[6] T. F. Cootes, C. J. Taylor, D. H. Cooper et al., "Active shape models - their training and application," Computer Vision and Image Understanding, vol. 61, no. 1, 1995.

[7] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," in Proceedings of the 5th European Conference on Computer Vision - Volume II, Springer, Berlin, Heidelberg, June 1998.

[8] X. Yu, C. Ma, Y. Hu et al., "New neural network method for solving nonsmooth pseudoconvex optimization problems," Computer Science, vol. 46, no. 11, pp. 228–234, 2019.

[9] A. Sengupta, Y. Ye, R. Wang et al., "Going deeper in spiking neural networks: VGG and residual architectures," Frontiers in Neuroscience, vol. 13, 2018.

[10] C. Szegedy, S. Ioffe, V. Vanhoucke et al., "Inception-v4, Inception-ResNet and the impact of residual connections on learning," 2016, https://arxiv.org/abs/1602.07261.

[11] K. He, X. Zhang, S. Ren et al., "Deep residual learning for image recognition," 2015, https://arxiv.org/abs/1512.03385.

[12] J. H. Xie and Y. S. Zhang, "CASPN: cascaded spatial pyramid network for face landmark localization," Application Research of Computers, vol. 1-6, 2020.

[13] R. Chen and D. Lin, "Face key point location based on cascaded convolutional neural network," Journal of SiChuan Technology University (Natural Edition), vol. 30, pp. 32–37, 2017.

[14] L. H. Xu, Z. Li, J. J. Jiang et al., "A high-precision and lightweight facial landmark detection algorithm," Laser & Optoelectronics Progress, vol. 1-12, 2020.

[15] J. D. Lin, X. Y. Wu, Y. Chai et al., "Structure optimization of convolutional neural networks: a survey," Journal of Automatica Sinica, vol. 46, no. 01, pp. 24–37, 2020.

[16] F. Y. Hu, L. Y. Li, X. R. Shang et al., "A review of object detection algorithms based on convolutional neural networks," Journal of SuZhou University of Science and Technology (Natural Science Edition), vol. 37, no. 02, pp. 1–10+25, 2020.

[17] S. Wu, Research on Occlusion and Pose Robust Facial Landmark Localization, Zhejiang University, Hangzhou, China, 2019.

[18] Y. H. Wu, W. Z. Cha, J. J. Ge et al., "Adaptive task sorting model based on improved linear regression method," Journal of Huazhong University of Science and Technology (Natural Science Edition), vol. 48, no. 01, pp. 93–97+107, 2020.

[19] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," Journal of Machine Learning Research, vol. 9, pp. 249–256, 2010.

[20] S. Park, S.-B. Lee, and J. Park, "Data augmentation method for improving the accuracy of human pose estimation with cropped images," Pattern Recognition Letters, vol. 136, 2020.

[21] Y. Z. Zhou, X. Y. Cha, and J. Lan, "Power system transient stability prediction based on data augmentation and deep residual network," Journal of China Electric Power, vol. 53, no. 01, pp. 22–31, 2020.

[22] Y. Sun, X. Wang, and X. Tang, "Deep convolutional network cascade for facial point detection," Computer Vision and Pattern Recognition, vol. 9, no. 4, pp. 3476–3483, 2013.


Page 2: Face Alignment Algorithm Based on an Improved Cascaded

designed to reduce the number of parameters that thenetworks need to learn and accelerate the speed at locatingkey points on the face

2 Related Theories

Convolutional neural networks (CNNs) [15] are a new typeof artificial neural network proposed by combining tradi-tional artificial neural networks and deep learning Sincefully connected neural networks have a large number ofparameters when processing image data CNNs optimize thestructure of the traditional fully connected neural networksby introducing weight sharing [16] and local perceptionmethods which greatly reduces the number of learnableparameters of the fully connected neural networks Since thefeature maps output by the CNN pooling layers have theadvantages of rotation invariance and translation invarianceCNNs have natural robustness [17] for processing imagedata In the common CNN model architecture most CNNsuse the convolutional layer and the pooling layer alternatelyto extract the semantic information of the facial image fromlow to high After the channel numbers and size of thefeature maps reach a certain dimension the feature maps arearranged in order and converted into a one-dimensionalfeature vector which is connected with the fully connectedlayer for dimensional transformation e operation processof the convolutional layer can be expressed as follows

X(lk)

f 1113944

nlminus1

p1W

(lkp) otimesX(lminus 1p)

1113872 1113873 + b(lk)⎛⎝ ⎞⎠ (1)

In formula (1) X(lk) represents the k-th feature mapoutput by the l-th hidden layer nl represents the number ofchannels of the l-th layer feature map andW(lkp) representsthe convolution kernel used whenmapping the p-th group offeature maps in the (lminus 1)-th layer to the k-th feature maps inthe l-th layer e algorithm in this paper uses max-poolingAfter the pooling operation the size of the feature map isreduced to one-fold of the original step size Max-poolingcan be expressed as follows

X(l+1k)

(mn) max0ltablts X(lk)

(m middot step + a n middot step + b)1113966 1113967

(2)

In formula (2) X(l+1k) (m n) represents the value of thek-th group of feature map coordinates output from the(l + 1)-th layer at (m n)

When optimizing the CNNmodel the back propagationmethod is used to update all the connection weights betweenneurons to minimize the loss function Since the facealignment task is a regression task [18] the mean squareerror (MSE) is used to define its loss function which can beexpressed as follows

MSE 12N

1113944

N

i1Oi minus Pi

22 (3)

In formula (3) N is the number of nodes in the inputlayer of the neural network O is the artificially labeled valueand P is the predicted value of the neural networks

3 Improved CCNN

To improve the efficiency and accuracy of the algorithmdetection in this paper the CCNN network is improved inthe following 4 aspects

Improvement 1 designing shallower networks deepnetworks create a large number of parameters resulting in asharp increase in time and space complexity Additionallythe chain rule can easily lead to gradient vanishing andgradient explosion in deep networks and it easily causes thenetworks to fail to learn useful rules e CCNN algorithmhas a total of 23 CNNs and each CNN contains a maximumof 8 layers and a minimum of 6 layers e unnecessaryconvolution and pooling layers are removed and thenumber of algorithm parameters is greatly reduced to ensurethe running speed of the algorithm

Improvement 2 a shake mechanism is proposed to solvethe problem that the rules learned by the networks tend tothe geometric center of the windows which improves thegeneralization ability (robustness) of the model e posi-tioning accuracy and speed are increased by designing thesize of windows

Improvement 3 adopting multilevel cascaded recursiveCNNs it is verified by experiments that the 3-stage cascadedstructure of CCNN is selected in this paper which canguarantee a high positioning accuracy within an acceptabletime range

Improvement 4 designing a smaller input image size asmaller input image size can speed up the training of thenetworks and the number of channels of the convolutionkernel can be increased to compensate for the informationloss caused by a smaller input image

31 CCNN Framework Design e framework of the im-proved cascaded convolutional neural network in this paperis divided into 3 stages which realizes the process fromglobal coarse positioning to local precise positioning

e overall design idea of the CCNN algorithm adoptsthe recursive idea and the recursive termination condition isldquothe number of network stages that balance positioningaccuracy and positioning time complexityrdquo Based on this acascaded convolutional neural network with 3 stages isdesigned in addition to guaranteeing a high positioningaccuracy rate it is also necessary to ensure that the posi-tioning results are obtained within an acceptable time frameAs shown in Table 1 in the design of the CCNN algorithmeach stage recursively generates the key points of the face

2 Advances in Multimedia

311 e First Stage As shown in Figure 1 the originalimage is marked with the initial face frame and Figure 1(a) iscropped into 3 areas according to the face distribution of theLFPW dataset (respectively the key point area contained inthe initial face frame left right eye and nose tip key pointsarea and nose tip left and right mouth corner key pointsarea) these 3 areas are used as the input of the 3 CNNs in thefirst stage and the coordinate values of 5 key points areweighted average output to obtain Figure 1(b)

e rough positioning of global key points is realizede weighted average method is shown as follows

(x y)p 1N

1113944

N

i11113954xi 1113954yi( 1113857 (4)

In formula (4) P belongs to the set of left eye right eyenose tip left mouth corner right mouth corner and N is thenumber of repeated positions

312 e Second Stage e 5 key points output in the firststage are used as anchor points and expanded to the topbottom left and right of the anchor points to obtain thecurrent 10 windows e expansion method is based on thefollowing formula

windowsi a1lowasth b1lowastw( 1113857 i 1 2 5

windowsj a2lowasth b2lowastw( 1113857 j 6 7 10

⎧⎨

⎩ (5)

In formula (5) windowsi represents the 5 windows of thefirst group windowj represents the 5 windows of the secondgroup a and b are hyperparameters a1 and b1 take the valueof 016 a2 and b2 take the value of 018 and h and w are theheight and width of the initial face frameen according tothe size of the windows and shake (shake in the paper followsa normal distribution namely shakesimN (0 1eminus 4) as arrow①) 10 partial human face images are to cut out with arandom perturbation factor to obtain Figure 1(c) Shake canbe regarded as a kind of data augmentation operation Itsessence is a type of translation strategy It can shift thewindow randomly up down left and right by shake unitsese 10 partial facial images are taken as the input of the 10CNNs in the second stage and every 2 CNNs use formula (4)to jointly weight and average a key point to obtainFigure 1(d) thereby realizing local key point positioning

313 e ird Stage Taking the 5 key points output in thesecond stage as anchor points and by designing smallerwindows and shake (a1 and b1 take the value of 008 a2 andb2 take the value of 009 shake is the same as the secondstage as arrow②) to cut out the smaller 10 partial images ofthe facial key points we obtain Figure 1(e) ese 10 partialimages are used as the input of the 10 CNNs in the thirdstage to obtain Figure 1(f) and every 2 CNNs are jointlyweighted averaged by formula (4) to locate a key point toobtain the accurate localization of the partial image of theface which is the final output of the algorithm in this paper

32 Network Structure Design Since the algorithm in thispaper is based on the recursive idea now we take the secondstage (level 2) the LE21 CNN and LE22 CNN which locatethe key point of the left eye (LE) as examples to describe thenetwork structure design of the CCNN (F is the full face L isleft R is right E is the eye N is the nose M is the cornermouth and LE21 CNN represents the first CNN that locatesthe left eye in the second stage the same as follows)

321 Level 2 Network Structure Figure 2 shows the level 2network structure is stage contains 10 lightweight CNNsnamely LE21 LE22 RE21 RE22 N21 N22 and LM21

LM22 RM21 and RM22 every 2 CNNs are jointlyweighted average by formula (4) to locate a key point andcombined with the positioning results of the other 8 CNNsand finally the level 2 output is obtained

322 Level2-LE21 CNN Network Structure To improve thepositioning speed and efficiency of the algorithm in thisarticle we design a new cascaded convolutional neuralnetwork in this section which accelerates the training andtesting of the network by reducing the number of layers ofeach CNN and reducing the size of the input imageCompared with reference [13] this paper reduces thenumber of network layers from 12 to 6 layers and the inputsize of the second stage is reduced from 32lowast32lowast 3 to 15lowast15lowast3

Figure 3 shows the network structure of level2-LE21CNN We input 15lowast15lowast3 left eye local area map after 2 setsof convolutional pooling layers and 2 fully connected layerswe output the left eye prediction coordinates value of the

Table 1 Design of the CCNN algorithm

Stage 1

Input X (face images with face frames)Step 1 resize X to (39lowast39lowast3)

Step 2 after 2 sets of convolutional pooling layer and 2 fully connected layers generate a candidate set of key pointsStep 3 same as step 2 generate candidate face key point coordinates

Output y1 (face image with 5 weighted key points)

Stage 2

Input use 2 different windows with shake to crop y1 to obtain 10 partial face imagesStep 1 similar to stage1-step2 generate a candidate set of key points

Step 2 every 2 CNNs colocate a key pointOutput y2 (face image with 5 weighted key points)

Stage 3

Input use 2 smaller different windows with shake to crop y2 to obtain 10 partial face imagesStep 1 similar to stage2-step1 generate a candidate set of key points

Step 2 every 2 CNNs colocate a key pointOutput y3 (face image with 5 weighted key points)

Advances in Multimedia 3

Le1

Le1

Rm1

Rm1

LE21

Le2Average

Rm2

Average

Level 2

middotmiddotmiddot middotmiddotmiddot middotmiddotmiddot

LE22

RM21

RM22

Window 2

Shake 2

Window 1

Shake 1

Window 1

Shake 1

Window 2

Shake 2

Input Output

Figure 2 Schematic diagram of the level 2 network structure

15

15

3

4

4

12

12

20

Input

Convolution

2

2

6

6

20

Max pooling

33

4

4

40

2

2

Convolution

2

240

Max pooling

Full connection60

2Full connection

Figure 3 Network structure of level2-le21 CNN

Input

Output

Precise localization

Roughlocalization

Localization

First stage

Second stage

Third stage

(a) (b)

(d) (c)

(e) (f)

2

1

Figure 1 CCNN algorithm framework

4 Advances in Multimedia

LE21 CNN combined with the left eye coordinate valueoutput by the LE22 CNN and finally we weighted averagethe left eye coordinate value of level 2 in the manner offormula (4)

323 Proposing the Shake Mechanism and Windowse entire algorithm flow is carried out recursively in stagesAs shown in Figure 1 the original red frame of the inputpicture in the first stage can be regarded as a larger windowand each small green frame of the input picture in the secondstage is a smaller window Since the positioning method ofthe algorithm in this paper uses the relative positioning ofwindows the more accurate the windows the more accuratethe model positioning the smaller the windows the fasterthe model positioning speed To avoid the immutability ofwindows the shake mechanism is proposed to slightly shakewindows that is translation (which has been explained indetail in the second stage of Section 312) it can also beregarded as a method of data augmentation thereby im-proving the generalization ability of the model

324 Model Parameters e network structure parametersof the first stage are shown in Table 2 where the fullyconnected layers of F1 EN1 and NM1 are 10d-fc2 6d-fc2and 6d-fc2 respectively e network structure of the thirdstage is the same as that of the second stage e totalnumber of network parameters is only 643920 which isapproximately 246MB

33 CCNN Model Training e model training method inthis paper adopts the idea of parallel within groups and serialbetween groups For each lightweight CNN a rectified linearunit (ReLU) is used after the convolutional layer to increasethe nonlinear expression ability of the model max-poolinglayers are used to downsample the feature map of theconvolutional layers and the fully connected layers usedropout technology to improve the robustness of the model(the value is 025) e initial learning rate is 5eminus 3 andevery 100 iterations the learning rate decays to 90 of theoriginal (the other hyperparameters are selected accordingto the actual situation such as batch size and iterationtimes) e stochastic gradient descent (SGD) algorithm isused to update the weights of all connections betweenneurons during network training To additionally increasethe robustness of the model L2 regularization is introducedto penalize the learnable parameter W and then the finalexpression of the newly designed loss function is as follows

E 1m

1113944

m

j1

12N

1113944

N

i1y

ji minus d

ji

2

2⎛⎝ ⎞⎠ +

12ηW

22 (6)

In formula (6) m is the batch size and W is the weightmatrix in the network is weight matrix W is updatedduring the process of error back propagation Before thenetwork starts training the Xavier [19] method is used toinitialize the weight matrix W0 e weight matrix Wt + 1updated after t+ 1 iterations can be expressed as follows

Wt+1 Wt minus λ middotzE

zWt

(7)

4 Experiments and Results

e operating system used in the experiment is Centos764 bit a Dell server equipped with 2 RTX2080TI graphicscards and 32GB memory and the code running environ-ment is the PyTorch framework

41Dataset ese experiments use the facial dataset LFPWa total of 13466 facial images Each facial image has 3channels with RGB and has relevant coordinate annotationinformation (including key point coordinates and initial faceframe coordinates) e original data use 10000 images asthe training set and 3466 images as the test set Dataaugmentation [20 21] can be used to increase the number oftraining samples to effectively improve the performance ofthe convolutional neural network To be consistent with theactual facts we set the left 15-degree and right 12-degreecenter rotation (the label coordinates should also be rotated)and the rotation method is given in formula (8) where θ isthe rotation angle

xprime

yprime

1

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦

cos θ minussin θ 0

sin θ cos θ 0

0 0 1

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦

x

y

1

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦ (8)

en the training dataset is flipped horizontally (thelabel coordinates should also be flipped accordingly) and60000 training data are finally obtained Each training datarequires data standardization (minus the mean and dividingthe variance) In addition for each training data it is neededto eliminate the impact of the length and width of the faceframe (windows at each stage) e elimination method isshown as follows

x directionxprime minus x

11138681113868111386811138681113868111386811138681113868

w

y directionyprime minus y

11138681113868111386811138681113868111386811138681113868

h

⎧⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎩

(9)

In formula (9) x and y are the horizontal and verticalcoordinates of the upper left corner of the windows xprime and yprimeare the horizontal and vertical coordinates of the key pointspredicted by the current windows and w and h are the widthand height of the windows

42 Experiment Results

Experiment 1 Verifying the necessity of the 3-stage cas-caded structure

As shown in Figure 4 the vertical axis is the test errorthe horizontal axis is the facial key points and the blue redand green broken lines are the test errors of the first stagethe second stage and the third stage respectively

Advances in Multimedia 5

e test error of the second stage exceeded less than thatof the first stage e test error of the third stage also de-creased compared with the second stage the average testerror decreased from 221 to 137 and then to 103 Tobalance the positioning accuracy and positioning speedcomprehensively this article does not cascade the fourthstage of CNNs

Experiment 2 Verifying the necessity of under windowsFigure 5 shows the impact of the presence or absence of

window frames on the test error of each key point in the firststage e vertical axis is the test error the horizontal axis isthe facial key points and the blue broken line is windows-which uses absolute positioning the red broken line indi-cates the relative positioning of windows+

e relative positioning method is more stable and thetest error at each key point is lower than that of the absolutepositioning method e average test error decreased from338 to 235 Because the first stage uses global key pointcoarse positioning of the first stage the test errors of the blueand red broken lines are both high

Experiment 3 Verifying the necessity of the under shakemechanism

Figure 6 shows the impact of the presence or absence ofthe shake factor on the test error of each key point in the firststage

e vertical axis is the test error and the horizontal axisis the facial key points e blue broken line is shake- whichmeans no data augmentation operation the red broken lineis shake+ which denotes data augmentation on the partialimage e test error of the red broken line after dataaugmentation is lower than that of the blue broken linewhich shows that the shake mechanism has improved thegeneralization ability of the model e average test errordropped from 235 to 190 Because the first stage usesglobal key point coarse positioning the test errors of the blueand red broken lines are both high

Experiment 4 Comparison with other algorithms of thesame type in LE error RE error N error LM error RMerror and average test error

Figure 7 shows the test error of each key point of eachalgorithm e vertical axis is the test error and the hori-zontal axis is the facial key points

e blue and red broken lines are the algorithms ofSun [22] and Chen Rui [13] respectively and the green

broken line is the algorithm in this paper it can be seenthat the algorithm in this paper is lower than any otheralgorithm of the same type in LE RE N LM and averagetest error and the average test error reaches 103 etest error at the key points of RM is slightly higher thanthat of Chen Ruirsquos [13] algorithm which indirectly

Table 2 Network structure table of the first stage

Name Input Convolution kernelssteps Output Parametersconv1 39lowast39lowast3 4lowast4lowast 201 36lowast3 6lowast 20 960pool1 36lowast36lowast20 2lowast 22 18lowast18lowast20 -conv2 18lowast18lowast20 3lowast3lowast 401 16lowast16lowast40 7200pool2 16lowast16lowast40 2lowast 22 8lowast8lowast40 -conv3 8lowast8lowast40 3lowast3lowast 601 6lowast6lowast60 21600pool3 6lowast6lowast60 2lowast 22 3lowast3lowast60 -120d-fc1 6480010d-fc2 (F1) 6d-fc2 (EN1) 6d-fc2 (NM1)

RE N LM RM Averagetest error

LE

Facial key points

Test error at different stage of each key point

075

100

125

150

175

200

225

250

275

Test

erro

r

158

188

254272

231221

137144

133125

198

148

113107

078 084 073103

Third_stageSecond_stageFirst_stage

Figure 4 Test error graph of key points in different stages

RE N LM RM Averagetest error

LE

Facial key points

Test error of key points under windows or not

20

25

30

35

40

Test

erro

r

373

312

394

293319

338

235244

277261

203188

Our windows+Our windowsndash

Figure 5 Test error graph of key points under windows or not

6 Advances in Multimedia

indicates that each network needs separate hyper-parameters to train

In addition the learnable parameters in this algorithmare only 643920 approximately 246MB which is approx-imately 58 of the parameters of Chen Rui [13] and 56 ofthe ResNet18 It only requires 147milliseconds to process afacial image on the GPU As shown in Table 3 it greatlyreduces the consumption of hardware resources and can beused as an embedded algorithm for cameras or apps used inmobile devices such as mobile phones

Figure 8 shows the effect picture of the facial land-mark localization of the algorithm in this paper e firstline of facial images is output 1 of the first stage of CNNand so on it can be seen that the output of the first CNNstage (global key points coarse positioning stage) hasflaws and the positioning accuracy of the second CNNstage (local key points positioning stage) has improved

significantly e positioning accuracy of the third CNNstage (accurate positioning of local key points) alsoimproved and was no longer distinguishable by thenaked eye

Under the influence of distorted expressions and diversepostures the algorithm still achieves accurate positioning Ifwe face tasks with very high real-time requirements thenonly the first 2 layers of cascaded convolutional neuralnetworks are sufficient

188203 193

217 213

19

235244

277261

146137

RE N LM RM Averagetest error

LE

Facial key points

Test error of key points under shake or not

14

16

18

20

22

24

26

28

Test

erro

r

Our shake+Our shakendash

Figure 6 Test error graph of key points under shake or not

204194

21318

099 101

078 084

148

126133

073

103

127127

251234

219

RE N LM RM Averagetest error

LE

Facial key points

Test error of each key point of each algorithm

075

100

125

150

175

200

225

250

Test

erro

r

OursChen RuiSun

Figure 7 Test error graph of key points of each algorithm

Table 3 Parameters and speed of each algorithm

Methodsindicators Parameters (MB) Speed (ms)Algorithm of this paper 246 147ResNet18 4459 -ResNet34 8515 -Chen Rui [13] 424 159

Advances in Multimedia 7

5. Conclusions

The improved face alignment algorithm based on a cascaded convolutional neural network in this paper realizes the process from coarse positioning to precise positioning by designing a 3-stage network structure. The improvement of the windows accelerates the training of the network and improves the positioning accuracy of the key points. The proposed shake factor can be regarded as a data augmentation operation, which improves the generalization ability of the model. The test on the LFPW face dataset shows that the average test error of the algorithm in this paper reaches 1.03, approximately 1.16 lower than that of algorithms of the same type; in addition, it takes only 14.7 milliseconds to process a facial image on the GPU, 1.2 milliseconds faster than Chen Rui's [13], so a small-scale and efficient facial landmark localization network is realized. The algorithm shows good anti-interference ability against poses and expressions. Future work can pursue further research on data augmentation and add detection technology to give a more accurate initial face frame.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China ([2018] 61741124) and the Guizhou Science and Technology Plan Project ([2018] 5781).

References

[1] G. Q. Qi, J. M. Yao, H. L. Hu et al., "Face landmark detection network with multi-scale feature map fusion," Application Research of Computers, vol. 1-6, pp. 1001-3695, 2019.

[2] J. Xu, W. J. Tian, and Y. Y. Fan, "Simulation of face key point recognition and location method based on deep learning," Computer Simulation, vol. 37, no. 6, pp. 434-438, 2020.

[3] J. Li, The Research and Implement of Face Detection and Face Alignment in the Wild, NanJing University, Nanjing, China, 2019.

[4] X. F. Yang, Research on Facial Landmark Localization Based on Deep Learning, Xiamen University, Xiamen, China, 2019.

[5] C. X. Jing, Research on Face Detection and Face Alignment Based on Deep Learning, China Jiliang University, Hangzhou, China, 2018.

[6] T. F. Cootes, C. J. Taylor, D. H. Cooper et al., "Active shape models - their training and application," Computer Vision and Image Understanding, vol. 61, no. 1, 1995.

[7] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," in Proceedings of the 5th European Conference on Computer Vision, Volume II, Springer, Berlin, Heidelberg, June 1998.

[8] X. Yu, C. Ma, Y. Hu et al., "New neural network method for solving nonsmooth pseudoconvex optimization problems," Computer Science, vol. 46, no. 11, pp. 228-234, 2019.

[9] A. Sengupta, Y. Ye, R. Wang et al., "Going deeper in spiking neural networks: VGG and residual architectures," Frontiers in Neuroscience, vol. 13, 2018.

[10] C. Szegedy, S. Ioffe, V. Vanhoucke et al., "Inception-v4, Inception-ResNet and the impact of residual connections on learning," 2016, https://arxiv.org/abs/1602.07261.

[11] K. He, X. Zhang, S. Ren et al., "Deep residual learning for image recognition," 2015, https://arxiv.org/abs/1512.03385.

[12] J. H. Xie and Y. S. Zhang, "Caspn: cascaded spatial pyramid network for face landmark localization," Application Research of Computers, vol. 1-6, 2020.

[13] R. Chen and D. Lin, "Face key point location based on cascaded convolutional neural network," Journal of SiChuan Technology University (Natural Edition), vol. 30, pp. 32-37, 2017.

[14] L. H. Xu, Z. Li, J. J. Jiang et al., "A high-precision and lightweight facial landmark detection algorithm," Laser & Optoelectronics Progress, vol. 1-12, 2020.

[15] J. D. Lin, X. Y. Wu, Y. Chai et al., "Structure optimization of convolutional neural networks: a survey," Journal of Automatica Sinica, vol. 46, no. 1, pp. 24-37, 2020.

[16] F. Y. Hu, L. Y. Li, X. R. Shang et al., "A review of object detection algorithms based on convolutional neural networks," Journal of SuZhou University of Science and Technology (Natural Science Edition), vol. 37, no. 2, pp. 1-10+25, 2020.

[17] S. Wu, Research on Occlusion and Pose Robust Facial Landmark Localization, Zhejiang University, Hangzhou, China, 2019.

[18] Y. H. Wu, W. Z. Cha, J. J. Ge et al., "Adaptive task sorting model based on improved linear regression method," Journal of Huazhong University of Science and Technology (Natural Science Edition), vol. 48, no. 1, pp. 93-97+107, 2020.

[19] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," Journal of Machine Learning Research, vol. 9, pp. 249-256, 2010.

[20] S. Park, S.-B. Lee, and J. Park, "Data augmentation method for improving the accuracy of human pose estimation with cropped images," Pattern Recognition Letters, vol. 136, 2020.

[21] Y. Z. Zhou, X. Y. Cha, and J. Lan, "Power system transient stability prediction based on data augmentation and deep residual network," Journal of China Electric Power, vol. 53, no. 1, pp. 22-31, 2020.

[22] Y. Sun, X. Wang, and X. Tang, "Deep convolutional network cascade for facial point detection," Computer Vision and Pattern Recognition, vol. 9, no. 4, pp. 3476-3483, 2013.



broken line is the algorithm in this paper it can be seenthat the algorithm in this paper is lower than any otheralgorithm of the same type in LE RE N LM and averagetest error and the average test error reaches 103 etest error at the key points of RM is slightly higher thanthat of Chen Ruirsquos [13] algorithm which indirectly

Table 2 Network structure table of the first stage

Name Input Convolution kernelssteps Output Parametersconv1 39lowast39lowast3 4lowast4lowast 201 36lowast3 6lowast 20 960pool1 36lowast36lowast20 2lowast 22 18lowast18lowast20 -conv2 18lowast18lowast20 3lowast3lowast 401 16lowast16lowast40 7200pool2 16lowast16lowast40 2lowast 22 8lowast8lowast40 -conv3 8lowast8lowast40 3lowast3lowast 601 6lowast6lowast60 21600pool3 6lowast6lowast60 2lowast 22 3lowast3lowast60 -120d-fc1 6480010d-fc2 (F1) 6d-fc2 (EN1) 6d-fc2 (NM1)

RE N LM RM Averagetest error

LE

Facial key points

Test error at different stage of each key point

075

100

125

150

175

200

225

250

275

Test

erro

r

158

188

254272

231221

137144

133125

198

148

113107

078 084 073103

Third_stageSecond_stageFirst_stage

Figure 4 Test error graph of key points in different stages

RE N LM RM Averagetest error

LE

Facial key points

Test error of key points under windows or not

20

25

30

35

40

Test

erro

r

373

312

394

293319

338

235244

277261

203188

Our windows+Our windowsndash

Figure 5 Test error graph of key points under windows or not

6 Advances in Multimedia

indicates that each network needs separate hyper-parameters to train

In addition the learnable parameters in this algorithmare only 643920 approximately 246MB which is approx-imately 58 of the parameters of Chen Rui [13] and 56 ofthe ResNet18 It only requires 147milliseconds to process afacial image on the GPU As shown in Table 3 it greatlyreduces the consumption of hardware resources and can beused as an embedded algorithm for cameras or apps used inmobile devices such as mobile phones

Figure 8 shows the effect picture of the facial land-mark localization of the algorithm in this paper e firstline of facial images is output 1 of the first stage of CNNand so on it can be seen that the output of the first CNNstage (global key points coarse positioning stage) hasflaws and the positioning accuracy of the second CNNstage (local key points positioning stage) has improved

significantly e positioning accuracy of the third CNNstage (accurate positioning of local key points) alsoimproved and was no longer distinguishable by thenaked eye

Under the influence of distorted expressions and diversepostures the algorithm still achieves accurate positioning Ifwe face tasks with very high real-time requirements thenonly the first 2 layers of cascaded convolutional neuralnetworks are sufficient

188203 193

217 213

19

235244

277261

146137

RE N LM RM Averagetest error

LE

Facial key points

Test error of key points under shake or not

14

16

18

20

22

24

26

28

Test

erro

r

Our shake+Our shakendash

Figure 6 Test error graph of key points under shake or not

204194

21318

099 101

078 084

148

126133

073

103

127127

251234

219

RE N LM RM Averagetest error

LE

Facial key points

Test error of each key point of each algorithm

075

100

125

150

175

200

225

250

Test

erro

r

OursChen RuiSun

Figure 7 Test error graph of key points of each algorithm

Table 3 Parameters and speed of each algorithm

Methodsindicators Parameters (MB) Speed (ms)Algorithm of this paper 246 147ResNet18 4459 -ResNet34 8515 -Chen Rui [13] 424 159

Advances in Multimedia 7

5 Conclusions

e improved face alignment algorithm based on a cascadedconvolutional neural network in this paper realizes theprocess from coarse positioning to precise positioning bydesigning a 3-stage network structure e improvement ofwindows accelerates the training of the network and im-proved the positioning accuracy rate of the key points eproposed shake factor can be regarded as a data augmen-tation operation which improves the generalization abilityof the model e test on the LFPW face dataset shows thatthe average test error of the algorithm in this paper reaches103 which is approximately 116 lower than that of thesame type of algorithms in addition it only takes 147milliseconds to process a facial image on the GPU which is12 faster than Chen Ruirsquos [13] in milliseconds and a small-scale and efficient facial landmark localization network isrealized e algorithm shows good antiinterference abilitywith gestures and expressions Future work can performfurther research on data augmentation and add detectiontechnology to give a more accurate initial face frame

Data Availability

e data used to support the findings of the study areavailable from the corresponding author upon request

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported by the National Natural ScienceFoundation of China ([2018] 61741124) and Guizhou Sci-ence and Technology Plan Project ([2018] 5781)

References

[1] G Q Qi J M Yao H L Hu et al ldquoFace landmark detectionnetwork with multi-scale feature map fusionrdquo ApplicationResearch of Computers vol 1-6 pp 1001ndash3695 2019

[2] J Xu W J Tian and Y Y Fan ldquoSimulation of face key pointrecognition and location method based on deep learningrdquoComputer Simulation vol 37 no 06 pp 434ndash438 2020

[3] J Li e Research and Implement of Face Detection and FaceAlignment in the Wild NanJing University Nanjing China2019

[4] X F Yang Research on Facial Landmark Localization Basedon Deep Learning Xiamen University Xiamen China 2019

[5] C X Jing Research on Face Detection and Face AlignmentBased on Deep Learning China Jiliang University HangzhouChina 2018

[6] T F Cootes C J Taylor D H Cooper et al ldquoActive shapemodels-their training and applicationrdquo Computer Vision andImage Understanding vol 61 no 1 1995

[7] T F Cootes G J Edwards and C J Taylor ldquoActive ap-pearance modelsrdquo in Proceedings of the 5th European Con-ference on Computer Vision-Volume II-Volume II SpringerBerlin Heidelberg June 1998

[8] X Yu C Ma Y Hu et al ldquoNew neural network method forsolving nonsmooth pseudoconvex optimization problemsrdquoComputer Science vol 46 no 11 pp 228ndash234 2019

[9] A Sengupta Y Ye R Wang et al ldquoGoing deeper in spikingneural networks VGG and residual rrchitecturesrdquo Frontiersin Neuroence vol 13 2018

[10] C Szegedy S Ioffe V Vanhoucke et al ldquoInception-v4 in-ception-resNet and the impact of residual connections onlearningrdquo 2016 httpsarxivorgabs160207261

[11] K He X Zhang S Ren et al ldquoDeep residual learning forimage recognitionrdquo 2015 httpsarxivorgabs151203385

[12] J H Xie and Y S Zhang ldquoCaspn cascaded spatial pyramidnetwork for face landmark localizationrdquo Application Researchof Computers vol 1-6 2020

[13] R Chen and D Lin ldquoFace key point location based on cas-caded convolutional neural networkrdquo Journal of SiChuanTechnology University (natural Edition) vol 30 pp 32ndash372017

[14] L H Xu Z Li J J Jiang et al ldquoA high- precision andlightweight facial landmark detection algorithmrdquo Laser ampOptoelectronics Progress vol 1-12 2020

[15] J D Lin X Y Wu Y Chai et al ldquoStructure optimization ofconvolutional neural networksa surveyrdquo Journal of Auto-matica Sinica vol 46 no 01 pp 24ndash37 2020

[16] F Y Hu L Y Li X R Shang et al ldquoA review of objectdetection algorithms based on convolutional neural

Figure 8 Effect picture of facial landmark localization

8 Advances in Multimedia

networksrdquo Journal of SuZhou University of Science andTechnology (Natural Science Edition) vol 37 no 02pp 1ndash10+25 2020

[17] S Wu Research on Occlusion and Pose Robust Facial Land-mark Localization Zhejiang University Hangzhou China2019

[18] Y H Wu W Z Cha J J Ge et al ldquoAdaptive task sortingmodel based on improved linear regression methodrdquo Journalof Huazhong University of Science and Technology (NaturalScience Edition) vol 48 no 01 pp 93ndash97+107 2020

[19] X Glorot and Y Bengio ldquoUnderstanding the difficulty oftraining deep feedforward neural networksrdquo Journal of Ma-chine Learning Research vol 9 pp 249ndash256 2010

[20] S Park S-B Lee and J Park ldquoData augmentation method forimproving the accuracy of human pose estimation withcropped imagesrdquo Pattern Recognition Letters vol 136 2020

[21] Y Z Zhou X Y Cha and J Lan ldquoPower system transientstability prediction based on data augmentation and deepresidual networkrdquo Journal of China Electric Power vol 53no 01 pp 22ndash31 2020

[22] Y Sun X Wang and X Tang ldquoDeep convolutional networkcascade for facial point detectionrdquo Computer Vision andPattern Recognition vol 9 no 4 pp 3476ndash3483 2013

Advances in Multimedia 9

Page 4: Face Alignment Algorithm Based on an Improved Cascaded

Le1

Le1

Rm1

Rm1

LE21

Le2Average

Rm2

Average

Level 2

middotmiddotmiddot middotmiddotmiddot middotmiddotmiddot

LE22

RM21

RM22

Window 2

Shake 2

Window 1

Shake 1

Window 1

Shake 1

Window 2

Shake 2

Input Output

Figure 2 Schematic diagram of the level 2 network structure

15

15

3

4

4

12

12

20

Input

Convolution

2

2

6

6

20

Max pooling

33

4

4

40

2

2

Convolution

2

240

Max pooling

Full connection60

2Full connection

Figure 3 Network structure of level2-le21 CNN

Input

Output

Precise localization

Roughlocalization

Localization

First stage

Second stage

Third stage

(a) (b)

(d) (c)

(e) (f)

2

1

Figure 1 CCNN algorithm framework

4 Advances in Multimedia

LE21 CNN combined with the left eye coordinate valueoutput by the LE22 CNN and finally we weighted averagethe left eye coordinate value of level 2 in the manner offormula (4)

323 Proposing the Shake Mechanism and Windowse entire algorithm flow is carried out recursively in stagesAs shown in Figure 1 the original red frame of the inputpicture in the first stage can be regarded as a larger windowand each small green frame of the input picture in the secondstage is a smaller window Since the positioning method ofthe algorithm in this paper uses the relative positioning ofwindows the more accurate the windows the more accuratethe model positioning the smaller the windows the fasterthe model positioning speed To avoid the immutability ofwindows the shake mechanism is proposed to slightly shakewindows that is translation (which has been explained indetail in the second stage of Section 312) it can also beregarded as a method of data augmentation thereby im-proving the generalization ability of the model

324 Model Parameters e network structure parametersof the first stage are shown in Table 2 where the fullyconnected layers of F1 EN1 and NM1 are 10d-fc2 6d-fc2and 6d-fc2 respectively e network structure of the thirdstage is the same as that of the second stage e totalnumber of network parameters is only 643920 which isapproximately 246MB

33 CCNN Model Training e model training method inthis paper adopts the idea of parallel within groups and serialbetween groups For each lightweight CNN a rectified linearunit (ReLU) is used after the convolutional layer to increasethe nonlinear expression ability of the model max-poolinglayers are used to downsample the feature map of theconvolutional layers and the fully connected layers usedropout technology to improve the robustness of the model(the value is 025) e initial learning rate is 5eminus 3 andevery 100 iterations the learning rate decays to 90 of theoriginal (the other hyperparameters are selected accordingto the actual situation such as batch size and iterationtimes) e stochastic gradient descent (SGD) algorithm isused to update the weights of all connections betweenneurons during network training To additionally increasethe robustness of the model L2 regularization is introducedto penalize the learnable parameter W and then the finalexpression of the newly designed loss function is as follows

E 1m

1113944

m

j1

12N

1113944

N

i1y

ji minus d

ji

2

2⎛⎝ ⎞⎠ +

12ηW

22 (6)

In formula (6) m is the batch size and W is the weightmatrix in the network is weight matrix W is updatedduring the process of error back propagation Before thenetwork starts training the Xavier [19] method is used toinitialize the weight matrix W0 e weight matrix Wt + 1updated after t+ 1 iterations can be expressed as follows

Wt+1 Wt minus λ middotzE

zWt

(7)

4 Experiments and Results

e operating system used in the experiment is Centos764 bit a Dell server equipped with 2 RTX2080TI graphicscards and 32GB memory and the code running environ-ment is the PyTorch framework

41Dataset ese experiments use the facial dataset LFPWa total of 13466 facial images Each facial image has 3channels with RGB and has relevant coordinate annotationinformation (including key point coordinates and initial faceframe coordinates) e original data use 10000 images asthe training set and 3466 images as the test set Dataaugmentation [20 21] can be used to increase the number oftraining samples to effectively improve the performance ofthe convolutional neural network To be consistent with theactual facts we set the left 15-degree and right 12-degreecenter rotation (the label coordinates should also be rotated)and the rotation method is given in formula (8) where θ isthe rotation angle

xprime

yprime

1

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦

cos θ minussin θ 0

sin θ cos θ 0

0 0 1

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦

x

y

1

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦ (8)

en the training dataset is flipped horizontally (thelabel coordinates should also be flipped accordingly) and60000 training data are finally obtained Each training datarequires data standardization (minus the mean and dividingthe variance) In addition for each training data it is neededto eliminate the impact of the length and width of the faceframe (windows at each stage) e elimination method isshown as follows

x directionxprime minus x

11138681113868111386811138681113868111386811138681113868

w

y directionyprime minus y

11138681113868111386811138681113868111386811138681113868

h

⎧⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎩

(9)

In formula (9) x and y are the horizontal and verticalcoordinates of the upper left corner of the windows xprime and yprimeare the horizontal and vertical coordinates of the key pointspredicted by the current windows and w and h are the widthand height of the windows

42 Experiment Results

Experiment 1 Verifying the necessity of the 3-stage cas-caded structure

As shown in Figure 4 the vertical axis is the test errorthe horizontal axis is the facial key points and the blue redand green broken lines are the test errors of the first stagethe second stage and the third stage respectively

Advances in Multimedia 5

e test error of the second stage exceeded less than thatof the first stage e test error of the third stage also de-creased compared with the second stage the average testerror decreased from 221 to 137 and then to 103 Tobalance the positioning accuracy and positioning speedcomprehensively this article does not cascade the fourthstage of CNNs

Experiment 2 Verifying the necessity of under windowsFigure 5 shows the impact of the presence or absence of

window frames on the test error of each key point in the firststage e vertical axis is the test error the horizontal axis isthe facial key points and the blue broken line is windows-which uses absolute positioning the red broken line indi-cates the relative positioning of windows+

e relative positioning method is more stable and thetest error at each key point is lower than that of the absolutepositioning method e average test error decreased from338 to 235 Because the first stage uses global key pointcoarse positioning of the first stage the test errors of the blueand red broken lines are both high

Experiment 3 Verifying the necessity of the under shakemechanism

Figure 6 shows the impact of the presence or absence ofthe shake factor on the test error of each key point in the firststage

e vertical axis is the test error and the horizontal axisis the facial key points e blue broken line is shake- whichmeans no data augmentation operation the red broken lineis shake+ which denotes data augmentation on the partialimage e test error of the red broken line after dataaugmentation is lower than that of the blue broken linewhich shows that the shake mechanism has improved thegeneralization ability of the model e average test errordropped from 235 to 190 Because the first stage usesglobal key point coarse positioning the test errors of the blueand red broken lines are both high

Experiment 4 Comparison with other algorithms of thesame type in LE error RE error N error LM error RMerror and average test error

Figure 7 shows the test error of each key point of eachalgorithm e vertical axis is the test error and the hori-zontal axis is the facial key points

e blue and red broken lines are the algorithms ofSun [22] and Chen Rui [13] respectively and the green

broken line is the algorithm in this paper it can be seenthat the algorithm in this paper is lower than any otheralgorithm of the same type in LE RE N LM and averagetest error and the average test error reaches 103 etest error at the key points of RM is slightly higher thanthat of Chen Ruirsquos [13] algorithm which indirectly

Table 2 Network structure table of the first stage

Name Input Convolution kernelssteps Output Parametersconv1 39lowast39lowast3 4lowast4lowast 201 36lowast3 6lowast 20 960pool1 36lowast36lowast20 2lowast 22 18lowast18lowast20 -conv2 18lowast18lowast20 3lowast3lowast 401 16lowast16lowast40 7200pool2 16lowast16lowast40 2lowast 22 8lowast8lowast40 -conv3 8lowast8lowast40 3lowast3lowast 601 6lowast6lowast60 21600pool3 6lowast6lowast60 2lowast 22 3lowast3lowast60 -120d-fc1 6480010d-fc2 (F1) 6d-fc2 (EN1) 6d-fc2 (NM1)

RE N LM RM Averagetest error

LE

Facial key points

Test error at different stage of each key point

075

100

125

150

175

200

225

250

275

Test

erro

r

158

188

254272

231221

137144

133125

198

148

113107

078 084 073103

Third_stageSecond_stageFirst_stage

Figure 4 Test error graph of key points in different stages

RE N LM RM Averagetest error

LE

Facial key points

Test error of key points under windows or not

20

25

30

35

40

Test

erro

r

373

312

394

293319

338

235244

277261

203188

Our windows+Our windowsndash

Figure 5 Test error graph of key points under windows or not

6 Advances in Multimedia

indicates that each network needs separate hyper-parameters to train

In addition the learnable parameters in this algorithmare only 643920 approximately 246MB which is approx-imately 58 of the parameters of Chen Rui [13] and 56 ofthe ResNet18 It only requires 147milliseconds to process afacial image on the GPU As shown in Table 3 it greatlyreduces the consumption of hardware resources and can beused as an embedded algorithm for cameras or apps used inmobile devices such as mobile phones

Figure 8 shows the effect picture of the facial land-mark localization of the algorithm in this paper e firstline of facial images is output 1 of the first stage of CNNand so on it can be seen that the output of the first CNNstage (global key points coarse positioning stage) hasflaws and the positioning accuracy of the second CNNstage (local key points positioning stage) has improved

significantly e positioning accuracy of the third CNNstage (accurate positioning of local key points) alsoimproved and was no longer distinguishable by thenaked eye

Under the influence of distorted expressions and diversepostures the algorithm still achieves accurate positioning Ifwe face tasks with very high real-time requirements thenonly the first 2 layers of cascaded convolutional neuralnetworks are sufficient

188203 193

217 213

19

235244

277261

146137

RE N LM RM Averagetest error

LE

Facial key points

Test error of key points under shake or not

14

16

18

20

22

24

26

28

Test

erro

r

Our shake+Our shakendash

Figure 6 Test error graph of key points under shake or not

204194

21318

099 101

078 084

148

126133

073

103

127127

251234

219

RE N LM RM Averagetest error

LE

Facial key points

Test error of each key point of each algorithm

075

100

125

150

175

200

225

250

Test

erro

r

OursChen RuiSun

Figure 7 Test error graph of key points of each algorithm

Table 3 Parameters and speed of each algorithm

Methodsindicators Parameters (MB) Speed (ms)Algorithm of this paper 246 147ResNet18 4459 -ResNet34 8515 -Chen Rui [13] 424 159

Advances in Multimedia 7

5 Conclusions

e improved face alignment algorithm based on a cascadedconvolutional neural network in this paper realizes theprocess from coarse positioning to precise positioning bydesigning a 3-stage network structure e improvement ofwindows accelerates the training of the network and im-proved the positioning accuracy rate of the key points eproposed shake factor can be regarded as a data augmen-tation operation which improves the generalization abilityof the model e test on the LFPW face dataset shows thatthe average test error of the algorithm in this paper reaches103 which is approximately 116 lower than that of thesame type of algorithms in addition it only takes 147milliseconds to process a facial image on the GPU which is12 faster than Chen Ruirsquos [13] in milliseconds and a small-scale and efficient facial landmark localization network isrealized e algorithm shows good antiinterference abilitywith gestures and expressions Future work can performfurther research on data augmentation and add detectiontechnology to give a more accurate initial face frame

Data Availability

e data used to support the findings of the study areavailable from the corresponding author upon request

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported by the National Natural ScienceFoundation of China ([2018] 61741124) and Guizhou Sci-ence and Technology Plan Project ([2018] 5781)

References

[1] G Q Qi J M Yao H L Hu et al ldquoFace landmark detectionnetwork with multi-scale feature map fusionrdquo ApplicationResearch of Computers vol 1-6 pp 1001ndash3695 2019

[2] J Xu W J Tian and Y Y Fan ldquoSimulation of face key pointrecognition and location method based on deep learningrdquoComputer Simulation vol 37 no 06 pp 434ndash438 2020

[3] J Li e Research and Implement of Face Detection and FaceAlignment in the Wild NanJing University Nanjing China2019

[4] X F Yang Research on Facial Landmark Localization Basedon Deep Learning Xiamen University Xiamen China 2019

[5] C X Jing Research on Face Detection and Face AlignmentBased on Deep Learning China Jiliang University HangzhouChina 2018

[6] T F Cootes C J Taylor D H Cooper et al ldquoActive shapemodels-their training and applicationrdquo Computer Vision andImage Understanding vol 61 no 1 1995

[7] T F Cootes G J Edwards and C J Taylor ldquoActive ap-pearance modelsrdquo in Proceedings of the 5th European Con-ference on Computer Vision-Volume II-Volume II SpringerBerlin Heidelberg June 1998

[8] X Yu C Ma Y Hu et al ldquoNew neural network method forsolving nonsmooth pseudoconvex optimization problemsrdquoComputer Science vol 46 no 11 pp 228ndash234 2019

[9] A Sengupta Y Ye R Wang et al ldquoGoing deeper in spikingneural networks VGG and residual rrchitecturesrdquo Frontiersin Neuroence vol 13 2018

[10] C Szegedy S Ioffe V Vanhoucke et al ldquoInception-v4 in-ception-resNet and the impact of residual connections onlearningrdquo 2016 httpsarxivorgabs160207261

[11] K He X Zhang S Ren et al ldquoDeep residual learning forimage recognitionrdquo 2015 httpsarxivorgabs151203385

[12] J H Xie and Y S Zhang ldquoCaspn cascaded spatial pyramidnetwork for face landmark localizationrdquo Application Researchof Computers vol 1-6 2020

[13] R Chen and D Lin ldquoFace key point location based on cas-caded convolutional neural networkrdquo Journal of SiChuanTechnology University (natural Edition) vol 30 pp 32ndash372017

[14] L H Xu Z Li J J Jiang et al ldquoA high- precision andlightweight facial landmark detection algorithmrdquo Laser ampOptoelectronics Progress vol 1-12 2020

[15] J D Lin X Y Wu Y Chai et al ldquoStructure optimization ofconvolutional neural networksa surveyrdquo Journal of Auto-matica Sinica vol 46 no 01 pp 24ndash37 2020

[16] F Y Hu L Y Li X R Shang et al ldquoA review of objectdetection algorithms based on convolutional neural

Figure 8 Effect picture of facial landmark localization

8 Advances in Multimedia

networksrdquo Journal of SuZhou University of Science andTechnology (Natural Science Edition) vol 37 no 02pp 1ndash10+25 2020

[17] S Wu Research on Occlusion and Pose Robust Facial Land-mark Localization Zhejiang University Hangzhou China2019

[18] Y H Wu W Z Cha J J Ge et al ldquoAdaptive task sortingmodel based on improved linear regression methodrdquo Journalof Huazhong University of Science and Technology (NaturalScience Edition) vol 48 no 01 pp 93ndash97+107 2020

[19] X Glorot and Y Bengio ldquoUnderstanding the difficulty oftraining deep feedforward neural networksrdquo Journal of Ma-chine Learning Research vol 9 pp 249ndash256 2010

[20] S Park S-B Lee and J Park ldquoData augmentation method forimproving the accuracy of human pose estimation withcropped imagesrdquo Pattern Recognition Letters vol 136 2020

[21] Y Z Zhou X Y Cha and J Lan ldquoPower system transientstability prediction based on data augmentation and deepresidual networkrdquo Journal of China Electric Power vol 53no 01 pp 22ndash31 2020

[22] Y Sun X Wang and X Tang ldquoDeep convolutional networkcascade for facial point detectionrdquo Computer Vision andPattern Recognition vol 9 no 4 pp 3476ndash3483 2013

Advances in Multimedia 9

Page 5: Face Alignment Algorithm Based on an Improved Cascaded

LE21 CNN combined with the left eye coordinate valueoutput by the LE22 CNN and finally we weighted averagethe left eye coordinate value of level 2 in the manner offormula (4)

323 Proposing the Shake Mechanism and Windowse entire algorithm flow is carried out recursively in stagesAs shown in Figure 1 the original red frame of the inputpicture in the first stage can be regarded as a larger windowand each small green frame of the input picture in the secondstage is a smaller window Since the positioning method ofthe algorithm in this paper uses the relative positioning ofwindows the more accurate the windows the more accuratethe model positioning the smaller the windows the fasterthe model positioning speed To avoid the immutability ofwindows the shake mechanism is proposed to slightly shakewindows that is translation (which has been explained indetail in the second stage of Section 312) it can also beregarded as a method of data augmentation thereby im-proving the generalization ability of the model

324 Model Parameters e network structure parametersof the first stage are shown in Table 2 where the fullyconnected layers of F1 EN1 and NM1 are 10d-fc2 6d-fc2and 6d-fc2 respectively e network structure of the thirdstage is the same as that of the second stage e totalnumber of network parameters is only 643920 which isapproximately 246MB

33 CCNN Model Training e model training method inthis paper adopts the idea of parallel within groups and serialbetween groups For each lightweight CNN a rectified linearunit (ReLU) is used after the convolutional layer to increasethe nonlinear expression ability of the model max-poolinglayers are used to downsample the feature map of theconvolutional layers and the fully connected layers usedropout technology to improve the robustness of the model(the value is 025) e initial learning rate is 5eminus 3 andevery 100 iterations the learning rate decays to 90 of theoriginal (the other hyperparameters are selected accordingto the actual situation such as batch size and iterationtimes) e stochastic gradient descent (SGD) algorithm isused to update the weights of all connections betweenneurons during network training To additionally increasethe robustness of the model L2 regularization is introducedto penalize the learnable parameter W and then the finalexpression of the newly designed loss function is as follows

E 1m

1113944

m

j1

12N

1113944

N

i1y

ji minus d

ji

2

2⎛⎝ ⎞⎠ +

12ηW

22 (6)

In formula (6) m is the batch size and W is the weightmatrix in the network is weight matrix W is updatedduring the process of error back propagation Before thenetwork starts training the Xavier [19] method is used toinitialize the weight matrix W0 e weight matrix Wt + 1updated after t+ 1 iterations can be expressed as follows

Wt+1 Wt minus λ middotzE

zWt

(7)

4 Experiments and Results

e operating system used in the experiment is Centos764 bit a Dell server equipped with 2 RTX2080TI graphicscards and 32GB memory and the code running environ-ment is the PyTorch framework

41Dataset ese experiments use the facial dataset LFPWa total of 13466 facial images Each facial image has 3channels with RGB and has relevant coordinate annotationinformation (including key point coordinates and initial faceframe coordinates) e original data use 10000 images asthe training set and 3466 images as the test set Dataaugmentation [20 21] can be used to increase the number oftraining samples to effectively improve the performance ofthe convolutional neural network To be consistent with theactual facts we set the left 15-degree and right 12-degreecenter rotation (the label coordinates should also be rotated)and the rotation method is given in formula (8) where θ isthe rotation angle

xprime

yprime

1

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦

cos θ minussin θ 0

sin θ cos θ 0

0 0 1

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦

x

y

1

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦ (8)

en the training dataset is flipped horizontally (thelabel coordinates should also be flipped accordingly) and60000 training data are finally obtained Each training datarequires data standardization (minus the mean and dividingthe variance) In addition for each training data it is neededto eliminate the impact of the length and width of the faceframe (windows at each stage) e elimination method isshown as follows

x directionxprime minus x

11138681113868111386811138681113868111386811138681113868

w

y directionyprime minus y

11138681113868111386811138681113868111386811138681113868

h

⎧⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎩

(9)

In formula (9) x and y are the horizontal and verticalcoordinates of the upper left corner of the windows xprime and yprimeare the horizontal and vertical coordinates of the key pointspredicted by the current windows and w and h are the widthand height of the windows

42 Experiment Results

Experiment 1 Verifying the necessity of the 3-stage cas-caded structure

As shown in Figure 4 the vertical axis is the test errorthe horizontal axis is the facial key points and the blue redand green broken lines are the test errors of the first stagethe second stage and the third stage respectively

Advances in Multimedia 5

e test error of the second stage exceeded less than thatof the first stage e test error of the third stage also de-creased compared with the second stage the average testerror decreased from 221 to 137 and then to 103 Tobalance the positioning accuracy and positioning speedcomprehensively this article does not cascade the fourthstage of CNNs

Experiment 2 Verifying the necessity of under windowsFigure 5 shows the impact of the presence or absence of

window frames on the test error of each key point in the firststage e vertical axis is the test error the horizontal axis isthe facial key points and the blue broken line is windows-which uses absolute positioning the red broken line indi-cates the relative positioning of windows+

e relative positioning method is more stable and thetest error at each key point is lower than that of the absolutepositioning method e average test error decreased from338 to 235 Because the first stage uses global key pointcoarse positioning of the first stage the test errors of the blueand red broken lines are both high

Experiment 3 Verifying the necessity of the under shakemechanism

Figure 6 shows the impact of the presence or absence ofthe shake factor on the test error of each key point in the firststage

e vertical axis is the test error and the horizontal axisis the facial key points e blue broken line is shake- whichmeans no data augmentation operation the red broken lineis shake+ which denotes data augmentation on the partialimage e test error of the red broken line after dataaugmentation is lower than that of the blue broken linewhich shows that the shake mechanism has improved thegeneralization ability of the model e average test errordropped from 235 to 190 Because the first stage usesglobal key point coarse positioning the test errors of the blueand red broken lines are both high

Experiment 4 Comparison with other algorithms of thesame type in LE error RE error N error LM error RMerror and average test error

Figure 7 shows the test error of each key point of eachalgorithm e vertical axis is the test error and the hori-zontal axis is the facial key points

e blue and red broken lines are the algorithms ofSun [22] and Chen Rui [13] respectively and the green

broken line is the algorithm in this paper it can be seenthat the algorithm in this paper is lower than any otheralgorithm of the same type in LE RE N LM and averagetest error and the average test error reaches 103 etest error at the key points of RM is slightly higher thanthat of Chen Ruirsquos [13] algorithm which indirectly

Table 2 Network structure table of the first stage

Name Input Convolution kernelssteps Output Parametersconv1 39lowast39lowast3 4lowast4lowast 201 36lowast3 6lowast 20 960pool1 36lowast36lowast20 2lowast 22 18lowast18lowast20 -conv2 18lowast18lowast20 3lowast3lowast 401 16lowast16lowast40 7200pool2 16lowast16lowast40 2lowast 22 8lowast8lowast40 -conv3 8lowast8lowast40 3lowast3lowast 601 6lowast6lowast60 21600pool3 6lowast6lowast60 2lowast 22 3lowast3lowast60 -120d-fc1 6480010d-fc2 (F1) 6d-fc2 (EN1) 6d-fc2 (NM1)

RE N LM RM Averagetest error

LE

Facial key points

Test error at different stage of each key point

075

100

125

150

175

200

225

250

275

Test

erro

r

158

188

254272

231221

137144

133125

198

148

113107

078 084 073103

Third_stageSecond_stageFirst_stage

Figure 4 Test error graph of key points in different stages

RE N LM RM Averagetest error

LE

Facial key points

Test error of key points under windows or not

20

25

30

35

40

Test

erro

r

373

312

394

293319

338

235244

277261

203188

Our windows+Our windowsndash

Figure 5 Test error graph of key points under windows or not

6 Advances in Multimedia

indicates that each network needs separate hyper-parameters to train

In addition the learnable parameters in this algorithmare only 643920 approximately 246MB which is approx-imately 58 of the parameters of Chen Rui [13] and 56 ofthe ResNet18 It only requires 147milliseconds to process afacial image on the GPU As shown in Table 3 it greatlyreduces the consumption of hardware resources and can beused as an embedded algorithm for cameras or apps used inmobile devices such as mobile phones

Figure 8 shows the effect picture of the facial land-mark localization of the algorithm in this paper e firstline of facial images is output 1 of the first stage of CNNand so on it can be seen that the output of the first CNNstage (global key points coarse positioning stage) hasflaws and the positioning accuracy of the second CNNstage (local key points positioning stage) has improved

significantly e positioning accuracy of the third CNNstage (accurate positioning of local key points) alsoimproved and was no longer distinguishable by thenaked eye

Under the influence of distorted expressions and diversepostures the algorithm still achieves accurate positioning Ifwe face tasks with very high real-time requirements thenonly the first 2 layers of cascaded convolutional neuralnetworks are sufficient

188203 193

217 213

19

235244

277261

146137

RE N LM RM Averagetest error

LE

Facial key points

Test error of key points under shake or not

14

16

18

20

22

24

26

28

Test

erro

r

Our shake+Our shakendash

Figure 6 Test error graph of key points under shake or not

204194

21318

099 101

078 084

148

126133

073

103

127127

251234

219

RE N LM RM Averagetest error

LE

Facial key points

Test error of each key point of each algorithm

075

100

125

150

175

200

225

250

Test

erro

r

OursChen RuiSun

Figure 7 Test error graph of key points of each algorithm

Table 3 Parameters and speed of each algorithm

Methodsindicators Parameters (MB) Speed (ms)Algorithm of this paper 246 147ResNet18 4459 -ResNet34 8515 -Chen Rui [13] 424 159

Advances in Multimedia 7

5 Conclusions

e improved face alignment algorithm based on a cascadedconvolutional neural network in this paper realizes theprocess from coarse positioning to precise positioning bydesigning a 3-stage network structure e improvement ofwindows accelerates the training of the network and im-proved the positioning accuracy rate of the key points eproposed shake factor can be regarded as a data augmen-tation operation which improves the generalization abilityof the model e test on the LFPW face dataset shows thatthe average test error of the algorithm in this paper reaches103 which is approximately 116 lower than that of thesame type of algorithms in addition it only takes 147milliseconds to process a facial image on the GPU which is12 faster than Chen Ruirsquos [13] in milliseconds and a small-scale and efficient facial landmark localization network isrealized e algorithm shows good antiinterference abilitywith gestures and expressions Future work can performfurther research on data augmentation and add detectiontechnology to give a more accurate initial face frame

Data Availability

e data used to support the findings of the study areavailable from the corresponding author upon request

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported by the National Natural ScienceFoundation of China ([2018] 61741124) and Guizhou Sci-ence and Technology Plan Project ([2018] 5781)

References

[1] G. Q. Qi, J. M. Yao, H. L. Hu et al., "Face landmark detection network with multi-scale feature map fusion," Application Research of Computers, vol. 1-6, pp. 1001–3695, 2019.
[2] J. Xu, W. J. Tian, and Y. Y. Fan, "Simulation of face key point recognition and location method based on deep learning," Computer Simulation, vol. 37, no. 6, pp. 434–438, 2020.
[3] J. Li, The Research and Implement of Face Detection and Face Alignment in the Wild, NanJing University, Nanjing, China, 2019.
[4] X. F. Yang, Research on Facial Landmark Localization Based on Deep Learning, Xiamen University, Xiamen, China, 2019.
[5] C. X. Jing, Research on Face Detection and Face Alignment Based on Deep Learning, China Jiliang University, Hangzhou, China, 2018.
[6] T. F. Cootes, C. J. Taylor, D. H. Cooper et al., "Active shape models - their training and application," Computer Vision and Image Understanding, vol. 61, no. 1, 1995.
[7] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," in Proceedings of the 5th European Conference on Computer Vision, Volume II, Springer, Berlin, Heidelberg, June 1998.
[8] X. Yu, C. Ma, Y. Hu et al., "New neural network method for solving nonsmooth pseudoconvex optimization problems," Computer Science, vol. 46, no. 11, pp. 228–234, 2019.
[9] A. Sengupta, Y. Ye, R. Wang et al., "Going deeper in spiking neural networks: VGG and residual architectures," Frontiers in Neuroscience, vol. 13, 2018.
[10] C. Szegedy, S. Ioffe, V. Vanhoucke et al., "Inception-v4, Inception-ResNet and the impact of residual connections on learning," 2016, https://arxiv.org/abs/1602.07261.
[11] K. He, X. Zhang, S. Ren et al., "Deep residual learning for image recognition," 2015, https://arxiv.org/abs/1512.03385.
[12] J. H. Xie and Y. S. Zhang, "CASPN: cascaded spatial pyramid network for face landmark localization," Application Research of Computers, vol. 1-6, 2020.
[13] R. Chen and D. Lin, "Face key point location based on cascaded convolutional neural network," Journal of SiChuan Technology University (Natural Edition), vol. 30, pp. 32–37, 2017.
[14] L. H. Xu, Z. Li, J. J. Jiang et al., "A high-precision and lightweight facial landmark detection algorithm," Laser & Optoelectronics Progress, vol. 1-12, 2020.
[15] J. D. Lin, X. Y. Wu, Y. Chai et al., "Structure optimization of convolutional neural networks: a survey," Journal of Automatica Sinica, vol. 46, no. 1, pp. 24–37, 2020.

[16] F. Y. Hu, L. Y. Li, X. R. Shang et al., "A review of object detection algorithms based on convolutional neural networks," Journal of SuZhou University of Science and Technology (Natural Science Edition), vol. 37, no. 2, pp. 1–10+25, 2020.

[17] S. Wu, Research on Occlusion and Pose Robust Facial Landmark Localization, Zhejiang University, Hangzhou, China, 2019.
[18] Y. H. Wu, W. Z. Cha, J. J. Ge et al., "Adaptive task sorting model based on improved linear regression method," Journal of Huazhong University of Science and Technology (Natural Science Edition), vol. 48, no. 1, pp. 93–97+107, 2020.
[19] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," Journal of Machine Learning Research, vol. 9, pp. 249–256, 2010.
[20] S. Park, S.-B. Lee, and J. Park, "Data augmentation method for improving the accuracy of human pose estimation with cropped images," Pattern Recognition Letters, vol. 136, 2020.
[21] Y. Z. Zhou, X. Y. Cha, and J. Lan, "Power system transient stability prediction based on data augmentation and deep residual network," Journal of China Electric Power, vol. 53, no. 1, pp. 22–31, 2020.
[22] Y. Sun, X. Wang, and X. Tang, "Deep convolutional network cascade for facial point detection," Computer Vision and Pattern Recognition, vol. 9, no. 4, pp. 3476–3483, 2013.



