Research Article

Face Alignment Algorithm Based on an Improved Cascaded Convolutional Neural Network

Xun Duan, Yuanshun Wang, and Yun Wu

School of Computer Science and Technology, GuiZhou University, GuiYang 550025, China

Correspondence should be addressed to Yuanshun Wang: yswang837@gmail.com

Received 1 November 2020; Revised 12 February 2021; Accepted 30 March 2021; Published 8 April 2021

Academic Editor: Deepu Rajan

Copyright © 2021 Xun Duan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Aiming at the problem of a large number of parameters and high time complexity caused by current deep convolutional neural network models, an improved face alignment algorithm based on a cascaded convolutional neural network (CCNN) is proposed from the aspects of network structure, random perturbation factor (shake), and data scale. The algorithm steps are as follows: 3 groups of lightweight CNNs are designed. The first group takes facial images with a face frame as input, trains 3 CNNs in parallel, and outputs, by weighted average, the facial images with 5 facial key points (anchor points). Then the anchor points and 2 different windows with a shake mechanism are used to crop out 10 partial images of human faces. The networks in the second group train 10 CNNs in parallel, and every 2 networks' weighted average colocates a key point. Based on the second group of networks, the third group designs a smaller shake mechanism and smaller windows to achieve finer tuning. When training the networks, the idea of parallel within groups and serial between groups is adopted. Experiments show that, on the LFPW face dataset, the improved CCNN in this paper is superior to any other algorithm of the same type in positioning speed, algorithm parameter amount, and test error.
1. Introduction

The task of face alignment [1–3] is also called facial landmark localization: given a facial image, the algorithm locates the key points of the face, including the left eye, the right eye, the tip of the nose, and the left and right corners of the mouth. The difficulty lies in quickly and accurately predicting the coordinate values of the 5 key points of a given facial image. Facial landmark localization [4, 5] is a core algorithmic task that plays a crucial role in many scientific research and application topics, and it has long been a research issue in the fields of image processing, pattern recognition, and computer vision.
The current facial landmark localization methods are mainly divided into 3 categories: AAM [6] (active appearance model) based on traditional methods, ASM [7] (active shape model) based on models, and methods based on deep learning. The AAM method and the model-based ASM extract the semantic features of a given facial image by iteratively solving an optimization problem [8] under various constraints; in addition to their high computational complexity, they also easily fall into local minima. With the rise of deep learning methods, face alignment tasks have achieved extremely accurate results in terms of accuracy. However, current high-precision detection algorithms mostly use deep networks (such as VGG [9] and ResNet-34 [10]), which creates a substantial number of parameters, resulting in time and space complexity increasing dramatically; moreover, the chain rule can easily lead to gradient vanishing and gradient explosion in deep networks [11], which can easily cause the network to fail to learn useful rules. For this reason, this paper improves the cascaded convolutional neural network [12, 13] from the aspects of network structure, the random perturbation factor (shake), and the data scale. Under the condition of ensuring accuracy, lightweight [14] convolutional neural networks are designed to reduce the number of parameters that the networks need to learn and accelerate the speed of locating key points on the face.

Hindawi Advances in Multimedia, Volume 2021, Article ID 6654561, 9 pages, https://doi.org/10.1155/2021/6654561
2. Related Theories

Convolutional neural networks (CNNs) [15] are a type of artificial neural network proposed by combining traditional artificial neural networks and deep learning. Since fully connected neural networks have a large number of parameters when processing image data, CNNs optimize the structure of the traditional fully connected neural network by introducing weight sharing [16] and local perception, which greatly reduces the number of learnable parameters. Since the feature maps output by the CNN pooling layers have the advantages of rotation invariance and translation invariance, CNNs have natural robustness [17] for processing image data. In the common CNN model architecture, most CNNs alternate convolutional layers and pooling layers to extract the semantic information of the facial image from low level to high level. After the channel number and size of the feature maps reach a certain dimension, the feature maps are arranged in order and converted into a one-dimensional feature vector, which is connected with the fully connected layer for dimensional transformation. The operation of the convolutional layer can be expressed as follows:
$$X^{(l,k)} = f\left(\sum_{p=1}^{n_{l-1}} W^{(l,k,p)} \otimes X^{(l-1,p)} + b^{(l,k)}\right) \qquad (1)$$
In formula (1), X^(l,k) represents the k-th feature map output by the l-th hidden layer, n_l represents the number of channels of the l-th layer feature maps (so the sum runs over the n_{l-1} channels of the previous layer), and W^(l,k,p) represents the convolution kernel used when mapping the p-th group of feature maps in the (l-1)-th layer to the k-th feature map in the l-th layer. The algorithm in this paper uses max-pooling. After the pooling operation, the side length of the feature map is reduced by a factor of the pooling step size. Max-pooling can be expressed as follows:
$$X^{(l+1,k)}(m,n) = \max_{0 \le a,\,b < s}\left\{ X^{(l,k)}(m \cdot \mathrm{step} + a,\; n \cdot \mathrm{step} + b) \right\} \qquad (2)$$

In formula (2), X^(l+1,k)(m, n) represents the value at coordinates (m, n) of the k-th group of feature maps output by the (l+1)-th layer.
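As a concrete illustration of formulas (1) and (2), the following pure-Python sketch computes one output feature map and then max-pools it. The function names, the valid-convolution boundary handling, and the choice of ReLU for f are assumptions of this sketch, not code from the paper:

```python
def conv2d_valid(channels, kernels, bias):
    """Formula (1): one output feature map X^(l,k) from input channels
    X^(l-1,p), kernels W^(l,k,p), and bias b^(l,k); here f is ReLU."""
    P = len(channels)                       # number of input channels
    H, W = len(channels[0]), len(channels[0][0])
    kh, kw = len(kernels[0]), len(kernels[0][0])
    out = []
    for i in range(H - kh + 1):
        row = []
        for j in range(W - kw + 1):
            s = bias
            for p in range(P):              # sum over input channels
                for a in range(kh):
                    for b in range(kw):
                        s += kernels[p][a][b] * channels[p][i + a][j + b]
            row.append(max(0.0, s))         # f = ReLU (assumption)
        out.append(row)
    return out

def max_pool(fmap, s):
    """Formula (2): non-overlapping s-by-s max-pooling with step s."""
    H, W = len(fmap), len(fmap[0])
    return [[max(fmap[m * s + a][n * s + b]
                 for a in range(s) for b in range(s))
             for n in range(W // s)]
            for m in range(H // s)]
```

Pooling a 2-row map with s = 2 halves each spatial dimension, matching the "reduced by the step size" behavior described above.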
When optimizing the CNN model, the back propagation method is used to update all the connection weights between neurons to minimize the loss function. Since the face alignment task is a regression task [18], the mean square error (MSE) is used to define the loss function, which can be expressed as follows:
$$\mathrm{MSE} = \frac{1}{2N}\sum_{i=1}^{N}\left\| O_i - P_i \right\|_2^2 \qquad (3)$$
In formula (3), N is the number of nodes in the input layer of the neural network, O is the artificially labeled value, and P is the predicted value of the neural network.
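Formula (3) is a one-liner in practice. The sketch below treats O and P as flat lists of labeled and predicted coordinate values, which is an assumption about how the node outputs are laid out:

```python
def mse_loss(O, P):
    """Formula (3): MSE = 1/(2N) * sum_i (O_i - P_i)^2 over the
    labeled values O and predicted values P."""
    N = len(O)
    return sum((o - p) ** 2 for o, p in zip(O, P)) / (2 * N)
```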
3. Improved CCNN

To improve the efficiency and accuracy of detection, the CCNN network is improved in the following 4 aspects.

Improvement 1: designing shallower networks. Deep networks create a large number of parameters, resulting in a sharp increase in time and space complexity. Additionally, the chain rule can easily lead to gradient vanishing and gradient explosion in deep networks, which easily causes the networks to fail to learn useful rules. The CCNN algorithm has a total of 23 CNNs, and each CNN contains a maximum of 8 layers and a minimum of 6 layers. The unnecessary convolution and pooling layers are removed, and the number of algorithm parameters is greatly reduced to ensure the running speed of the algorithm.

Improvement 2: a shake mechanism is proposed to solve the problem that the rules learned by the networks tend toward the geometric center of the windows, which improves the generalization ability (robustness) of the model. The positioning accuracy and speed are increased by designing the size of the windows.

Improvement 3: adopting multilevel cascaded recursive CNNs. It is verified by experiments that the 3-stage cascaded structure of the CCNN selected in this paper can guarantee high positioning accuracy within an acceptable time range.

Improvement 4: designing a smaller input image size. A smaller input image size can speed up the training of the networks, and the number of channels of the convolution kernels can be increased to compensate for the information loss caused by a smaller input image.
3.1. CCNN Framework Design. The framework of the improved cascaded convolutional neural network in this paper is divided into 3 stages, which realizes the process from global coarse positioning to local precise positioning.

The overall design of the CCNN algorithm adopts the recursive idea, and the recursive termination condition is "the number of network stages that balance positioning accuracy and positioning time complexity." Based on this, a cascaded convolutional neural network with 3 stages is designed: in addition to guaranteeing a high positioning accuracy rate, it is also necessary to ensure that the positioning results are obtained within an acceptable time frame. As shown in Table 1, in the design of the CCNN algorithm, each stage recursively generates the key points of the face.
3.1.1. The First Stage. As shown in Figure 1, the original image is marked with the initial face frame, and Figure 1(a) is cropped into 3 areas according to the face distribution of the LFPW dataset (respectively: the key point area contained in the initial face frame; the left eye, right eye, and nose tip key points area; and the nose tip, left and right mouth corner key points area). These 3 areas are used as the input of the 3 CNNs in the first stage, and the coordinate values of the 5 key points are output as a weighted average to obtain Figure 1(b).

The rough positioning of the global key points is thus realized. The weighted average method is shown as follows:
$$(x, y)_P = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{x}_i, \hat{y}_i\right) \qquad (4)$$
In formula (4), P belongs to the set {left eye, right eye, nose tip, left mouth corner, right mouth corner}, and N is the number of repeated positionings of that point.
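The weighted averaging of formula (4) amounts to a plain mean over the candidate coordinates. A minimal sketch (function and variable names are illustrative):

```python
def fuse_predictions(candidates):
    """Formula (4): average N candidate (x, y) predictions of one
    key point into a single fused coordinate."""
    N = len(candidates)
    x = sum(c[0] for c in candidates) / N
    y = sum(c[1] for c in candidates) / N
    return (x, y)
```

In the second and third stages, N = 2: every pair of CNNs contributes one candidate each for its key point.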
3.1.2. The Second Stage. The 5 key points output in the first stage are used as anchor points and expanded to the top, bottom, left, and right of the anchor points to obtain the current 10 windows. The expansion method is based on the following formula:
$$\begin{cases} \text{windows}_i = \left(a_1 h,\; b_1 w\right), & i = 1, 2, \ldots, 5 \\ \text{windows}_j = \left(a_2 h,\; b_2 w\right), & j = 6, 7, \ldots, 10 \end{cases} \qquad (5)$$
In formula (5), windows_i represents the 5 windows of the first group, windows_j represents the 5 windows of the second group, a and b are hyperparameters (a1 and b1 take the value 0.16; a2 and b2 take the value 0.18), and h and w are the height and width of the initial face frame. Then, according to the size of the windows and shake (shake in this paper follows a normal distribution, namely shake ~ N(0, 1e-4), as arrow ① in Figure 1), 10 partial human face images are cut out with a random perturbation factor to obtain Figure 1(c). Shake can be regarded as a kind of data augmentation operation; its essence is a translation strategy that shifts the window randomly up, down, left, and right by shake units. These 10 partial facial images are taken as the input of the 10 CNNs in the second stage, and every 2 CNNs use formula (4) to jointly weight-average a key point to obtain Figure 1(d), thereby realizing local key point positioning.
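The window-plus-shake cropping can be sketched as follows. Reading N(0, 1e-4) as the variance gives a standard deviation of 0.01, and scaling the offset by the face-frame size is an assumption of this sketch, not something the paper specifies:

```python
import random

def make_window(anchor, h, w, a=0.16, b=0.16, shake_std=0.01):
    """Sketch of formula (5) plus the shake mechanism: a window of
    size (a*h, b*w) centered on an anchor key point, with the center
    translated by a random offset drawn from N(0, shake_std**2) in
    units of the face frame (assumed scaling)."""
    ax, ay = anchor
    dx = random.gauss(0.0, shake_std) * w   # horizontal shake
    dy = random.gauss(0.0, shake_std) * h   # vertical shake
    win_h, win_w = a * h, b * w
    cx, cy = ax + dx, ay + dy
    # (left, top, right, bottom) crop box around the jittered center
    return (cx - win_w / 2, cy - win_h / 2,
            cx + win_w / 2, cy + win_h / 2)
```

Because the jitter only moves the center, the crop size stays fixed at (a*h, b*w) while the key point lands slightly off-center, which is exactly what prevents the networks from always predicting the window's geometric center.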
3.1.3. The Third Stage. Taking the 5 key points output in the second stage as anchor points, and by designing smaller windows and shake (a1 and b1 take the value 0.08; a2 and b2 take the value 0.09; shake is the same as in the second stage, as arrow ② in Figure 1) to cut out 10 smaller partial images around the facial key points, we obtain Figure 1(e). These 10 partial images are used as the input of the 10 CNNs in the third stage to obtain Figure 1(f), and every 2 CNNs are jointly weight-averaged by formula (4) to locate a key point, obtaining the accurate localization of the partial image of the face, which is the final output of the algorithm in this paper.
3.2. Network Structure Design. Since the algorithm in this paper is based on the recursive idea, we now take the second stage (level 2) and the LE21 CNN and LE22 CNN, which locate the key point of the left eye (LE), as examples to describe the network structure design of the CCNN (F is the full face, L is left, R is right, E is the eye, N is the nose, M is the mouth corner, and LE21 CNN denotes the first CNN that locates the left eye in the second stage; the others are named in the same way).
3.2.1. Level 2 Network Structure. Figure 2 shows the level 2 network structure. This stage contains 10 lightweight CNNs, namely, LE21, LE22, RE21, RE22, N21, N22, LM21, LM22, RM21, and RM22; every 2 CNNs are jointly weight-averaged by formula (4) to locate a key point and combined with the positioning results of the other 8 CNNs, and finally the level 2 output is obtained.
3.2.2. Level2-LE21 CNN Network Structure. To improve the positioning speed and efficiency of the algorithm, in this section we design a new cascaded convolutional neural network, which accelerates the training and testing of the network by reducing the number of layers of each CNN and reducing the size of the input image. Compared with reference [13], this paper reduces the number of network layers from 12 to 6, and the input size of the second stage is reduced from 32×32×3 to 15×15×3.
Figure 3 shows the network structure of level2-LE21 CNN. We input a 15×15×3 left-eye local area map; after 2 sets of convolutional-pooling layers and 2 fully connected layers, we output the predicted left-eye coordinates of the LE21 CNN, combine them with the left-eye coordinates output by the LE22 CNN, and finally weight-average the level-2 left-eye coordinates in the manner of formula (4).

Table 1: Design of the CCNN algorithm.

Stage 1:
  Input: X (face images with face frames).
  Step 1: resize X to 39×39×3.
  Step 2: after 2 sets of convolutional-pooling layers and 2 fully connected layers, generate a candidate set of key points.
  Step 3: same as step 2, generate candidate face key point coordinates.
  Output: y1 (face image with 5 weighted key points).
Stage 2:
  Input: use 2 different windows with shake to crop y1 to obtain 10 partial face images.
  Step 1: similar to stage 1, step 2, generate a candidate set of key points.
  Step 2: every 2 CNNs colocate a key point.
  Output: y2 (face image with 5 weighted key points).
Stage 3:
  Input: use 2 smaller different windows with shake to crop y2 to obtain 10 partial face images.
  Step 1: similar to stage 2, step 1, generate a candidate set of key points.
  Step 2: every 2 CNNs colocate a key point.
  Output: y3 (face image with 5 weighted key points).

Figure 1: CCNN algorithm framework.

Figure 2: Schematic diagram of the level 2 network structure.

Figure 3: Network structure of level2-LE21 CNN.
3.2.3. Proposing the Shake Mechanism and Windows. The entire algorithm flow is carried out recursively in stages. As shown in Figure 1, the original red frame of the input picture in the first stage can be regarded as a larger window, and each small green frame of the input picture in the second stage is a smaller window. Since the positioning method of the algorithm in this paper uses the relative positioning of windows, the more accurate the windows, the more accurate the model positioning; and the smaller the windows, the faster the model positioning speed. To avoid the immutability of windows, the shake mechanism is proposed to slightly shake the windows, that is, to translate them (explained in detail in Section 3.1.2); it can also be regarded as a method of data augmentation, thereby improving the generalization ability of the model.
3.2.4. Model Parameters. The network structure parameters of the first stage are shown in Table 2, where the fully connected layers of F1, EN1, and NM1 are 10d-fc2, 6d-fc2, and 6d-fc2, respectively. The network structure of the third stage is the same as that of the second stage. The total number of network parameters is only 643,920, which is approximately 2.46 MB.
3.3. CCNN Model Training. The model training method in this paper adopts the idea of parallel within groups and serial between groups. For each lightweight CNN, a rectified linear unit (ReLU) is used after the convolutional layer to increase the nonlinear expression ability of the model; max-pooling layers are used to downsample the feature maps of the convolutional layers; and the fully connected layers use dropout to improve the robustness of the model (the value is 0.25). The initial learning rate is 5e-3, and every 100 iterations the learning rate decays to 90% of the original (the other hyperparameters, such as batch size and iteration count, are selected according to the actual situation). The stochastic gradient descent (SGD) algorithm is used to update the weights of all connections between neurons during network training. To additionally increase the robustness of the model, L2 regularization is introduced to penalize the learnable parameters W, and the final expression of the newly designed loss function is as follows:
$$E = \frac{1}{m}\sum_{j=1}^{m}\left(\frac{1}{2N}\sum_{i=1}^{N}\left\| y_i^{j} - d_i^{j} \right\|_2^2\right) + \frac{1}{2}\eta\,\|W\|_2^2 \qquad (6)$$
In formula (6), m is the batch size and W is the weight matrix of the network. This weight matrix W is updated during error back propagation. Before the network starts training, the Xavier [19] method is used to initialize the weight matrix W0. The weight matrix W_{t+1} obtained after t+1 iterations can be expressed as follows:
$$W_{t+1} = W_t - \lambda \cdot \frac{\partial E}{\partial W_t} \qquad (7)$$
4. Experiments and Results

The operating system used in the experiments is CentOS 7 64-bit, on a Dell server equipped with 2 RTX 2080 Ti graphics cards and 32 GB of memory, and the code runs in the PyTorch framework.

4.1. Dataset. These experiments use the facial dataset LFPW, a total of 13,466 facial images. Each facial image has 3 RGB channels and relevant coordinate annotation information (including key point coordinates and initial face frame coordinates). The original data use 10,000 images as the training set and 3,466 images as the test set. Data augmentation [20, 21] can be used to increase the number of training samples and effectively improve the performance of the convolutional neural network. To be consistent with actual conditions, we apply center rotations of 15 degrees to the left and 12 degrees to the right (the label coordinates are rotated accordingly); the rotation method is given in formula (8), where θ is the rotation angle:
$$\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \qquad (8)$$
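Applied to a key-point label, formula (8) is a two-line rotation. A minimal sketch (rotation about the origin, as in the matrix; rotating about the face-frame center only requires shifting coordinates before and after):

```python
import math

def rotate_label(x, y, theta_deg):
    """Formula (8): rotate a key-point label (x, y) by theta degrees
    using the homogeneous rotation matrix."""
    t = math.radians(theta_deg)
    return (math.cos(t) * x - math.sin(t) * y,
            math.sin(t) * x + math.cos(t) * y)
```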
Then the training dataset is flipped horizontally (the label coordinates are flipped accordingly), and 60,000 training samples are finally obtained. Each training sample is standardized (subtracting the mean and dividing by the variance). In addition, for each training sample, the impact of the length and width of the face frame (the windows at each stage) must be eliminated. The elimination method is shown as follows:
$$\begin{cases} x\text{ direction}: \dfrac{\lvert x' - x \rvert}{w} \\[2mm] y\text{ direction}: \dfrac{\lvert y' - y \rvert}{h} \end{cases} \qquad (9)$$
In formula (9), x and y are the horizontal and vertical coordinates of the upper left corner of the window, x′ and y′ are the horizontal and vertical coordinates of the key point predicted within the current window, and w and h are the width and height of the window.
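Formula (9) can be sketched directly:

```python
def normalize_offset(xp, yp, x, y, w, h):
    """Formula (9): express a predicted key point (xp, yp) relative
    to a window whose top-left corner is (x, y) and whose size is
    (w, h), removing the effect of window scale."""
    return abs(xp - x) / w, abs(yp - y) / h
```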
4.2. Experiment Results

Experiment 1: Verifying the necessity of the 3-stage cascaded structure.

As shown in Figure 4, the vertical axis is the test error, the horizontal axis is the facial key points, and the blue, red, and green broken lines are the test errors of the first stage, the second stage, and the third stage, respectively.
The test error of the second stage decreased compared with that of the first stage, and the test error of the third stage decreased further compared with the second stage; the average test error decreased from 2.21% to 1.37% and then to 1.03%. To balance positioning accuracy and positioning speed comprehensively, this article does not cascade a fourth stage of CNNs.
Experiment 2: Verifying the necessity of windows.

Figure 5 shows the impact of the presence or absence of window frames on the test error of each key point in the first stage. The vertical axis is the test error, and the horizontal axis is the facial key points; the blue broken line is windows-, which uses absolute positioning, and the red broken line indicates the relative positioning of windows+.

The relative positioning method is more stable, and the test error at each key point is lower than that of the absolute positioning method. The average test error decreased from 3.38% to 2.35%. Because the first stage uses global coarse positioning of key points, the test errors of the blue and red broken lines are both high.
Experiment 3: Verifying the necessity of the shake mechanism.

Figure 6 shows the impact of the presence or absence of the shake factor on the test error of each key point in the first stage.

The vertical axis is the test error, and the horizontal axis is the facial key points. The blue broken line is shake-, which means no data augmentation operation; the red broken line is shake+, which denotes data augmentation on the partial images. The test error of the red broken line after data augmentation is lower than that of the blue broken line, which shows that the shake mechanism improves the generalization ability of the model. The average test error dropped from 2.35% to 1.90%. Because the first stage uses global coarse positioning of key points, the test errors of the blue and red broken lines are both high.
Experiment 4: Comparison with other algorithms of the same type in LE error, RE error, N error, LM error, RM error, and average test error.

Figure 7 shows the test error of each key point for each algorithm. The vertical axis is the test error, and the horizontal axis is the facial key points.
The blue and red broken lines are the algorithms of Sun [22] and Chen Rui [13], respectively, and the green broken line is the algorithm in this paper. It can be seen that the algorithm in this paper is below every other algorithm of the same type in LE, RE, N, LM, and average test error, with the average test error reaching 1.03%. The test error at the RM key point is slightly higher than that of Chen Rui's [13] algorithm, which indirectly indicates that each network needs separate hyperparameters to train.

Table 2: Network structure table of the first stage.

Name | Input | Convolution kernels/steps | Output | Parameters
conv1 | 39×39×3 | 4×4×20 / 1 | 36×36×20 | 960
pool1 | 36×36×20 | 2×2 / 2 | 18×18×20 | -
conv2 | 18×18×20 | 3×3×40 / 1 | 16×16×40 | 7200
pool2 | 16×16×40 | 2×2 / 2 | 8×8×40 | -
conv3 | 8×8×40 | 3×3×60 / 1 | 6×6×60 | 21600
pool3 | 6×6×60 | 2×2 / 2 | 3×3×60 | -
120d-fc1 | - | - | - | 64800
10d-fc2 (F1), 6d-fc2 (EN1), 6d-fc2 (NM1)

Figure 4: Test error graph of key points in different stages.

Figure 5: Test error graph of key points under windows or not.
In addition, the learnable parameters of this algorithm number only 643,920, approximately 2.46 MB, which is approximately 58% of the parameters of Chen Rui [13] and 5.6% of those of ResNet18. It requires only 14.7 milliseconds to process a facial image on the GPU. As shown in Table 3, this greatly reduces the consumption of hardware resources, and the algorithm can be used as an embedded algorithm for cameras or apps on mobile devices such as mobile phones.
Figure 8 shows example results of the facial landmark localization of the algorithm in this paper. The first line of facial images is the output of the first CNN stage, and so on. It can be seen that the output of the first CNN stage (global coarse positioning of key points) has flaws; the positioning accuracy of the second CNN stage (local key point positioning) improves significantly; and the positioning accuracy of the third CNN stage (accurate positioning of local key points) also improves, to the point where differences are no longer distinguishable by the naked eye.
Under the influence of distorted expressions and diverse postures, the algorithm still achieves accurate positioning. For tasks with very high real-time requirements, only the first 2 stages of the cascaded convolutional neural networks are sufficient.
Figure 6: Test error graph of key points under shake or not.
Figure 7: Test error graph of key points of each algorithm.
Table 3: Parameters and speed of each algorithm.

Methods/indicators | Parameters (MB) | Speed (ms)
Algorithm of this paper | 2.46 | 14.7
ResNet18 | 44.59 | -
ResNet34 | 85.15 | -
Chen Rui [13] | 4.24 | 15.9
5. Conclusions

The improved face alignment algorithm based on a cascaded convolutional neural network in this paper realizes the process from coarse positioning to precise positioning by designing a 3-stage network structure. The improvement of windows accelerates the training of the network and improves the positioning accuracy of the key points. The proposed shake factor can be regarded as a data augmentation operation, which improves the generalization ability of the model. The test on the LFPW face dataset shows that the average test error of the algorithm in this paper reaches 1.03%, which is approximately 1.16% lower than that of algorithms of the same type; in addition, it takes only 14.7 milliseconds to process a facial image on the GPU, which is 1.2 milliseconds faster than Chen Rui's [13], and a small-scale and efficient facial landmark localization network is realized. The algorithm shows good anti-interference ability with respect to poses and expressions. Future work can perform further research on data augmentation and add detection technology to give a more accurate initial face frame.
Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China ([2018] 61741124) and the Guizhou Science and Technology Plan Project ([2018] 5781).
Figure 8: Effect picture of facial landmark localization.

References

[1] G. Q. Qi, J. M. Yao, H. L. Hu et al., "Face landmark detection network with multi-scale feature map fusion," Application Research of Computers, vol. 1-6, pp. 1001–3695, 2019.
[2] J. Xu, W. J. Tian, and Y. Y. Fan, "Simulation of face key point recognition and location method based on deep learning," Computer Simulation, vol. 37, no. 6, pp. 434–438, 2020.
[3] J. Li, The Research and Implement of Face Detection and Face Alignment in the Wild, NanJing University, Nanjing, China, 2019.
[4] X. F. Yang, Research on Facial Landmark Localization Based on Deep Learning, Xiamen University, Xiamen, China, 2019.
[5] C. X. Jing, Research on Face Detection and Face Alignment Based on Deep Learning, China Jiliang University, Hangzhou, China, 2018.
[6] T. F. Cootes, C. J. Taylor, D. H. Cooper et al., "Active shape models - their training and application," Computer Vision and Image Understanding, vol. 61, no. 1, 1995.
[7] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," in Proceedings of the 5th European Conference on Computer Vision, Volume II, Springer, Berlin, Heidelberg, June 1998.
[8] X. Yu, C. Ma, Y. Hu et al., "New neural network method for solving nonsmooth pseudoconvex optimization problems," Computer Science, vol. 46, no. 11, pp. 228–234, 2019.
[9] A. Sengupta, Y. Ye, R. Wang et al., "Going deeper in spiking neural networks: VGG and residual architectures," Frontiers in Neuroscience, vol. 13, 2018.
[10] C. Szegedy, S. Ioffe, V. Vanhoucke et al., "Inception-v4, Inception-ResNet and the impact of residual connections on learning," 2016, https://arxiv.org/abs/1602.07261.
[11] K. He, X. Zhang, S. Ren et al., "Deep residual learning for image recognition," 2015, https://arxiv.org/abs/1512.03385.
[12] J. H. Xie and Y. S. Zhang, "CASPN: cascaded spatial pyramid network for face landmark localization," Application Research of Computers, vol. 1-6, 2020.
[13] R. Chen and D. Lin, "Face key point location based on cascaded convolutional neural network," Journal of SiChuan Technology University (Natural Edition), vol. 30, pp. 32–37, 2017.
[14] L. H. Xu, Z. Li, J. J. Jiang et al., "A high-precision and lightweight facial landmark detection algorithm," Laser & Optoelectronics Progress, vol. 1-12, 2020.
[15] J. D. Lin, X. Y. Wu, Y. Chai et al., "Structure optimization of convolutional neural networks: a survey," Journal of Automatica Sinica, vol. 46, no. 1, pp. 24–37, 2020.
[16] F. Y. Hu, L. Y. Li, X. R. Shang et al., "A review of object detection algorithms based on convolutional neural networks," Journal of SuZhou University of Science and Technology (Natural Science Edition), vol. 37, no. 2, pp. 1–10+25, 2020.
[17] S. Wu, Research on Occlusion and Pose Robust Facial Landmark Localization, Zhejiang University, Hangzhou, China, 2019.
[18] Y. H. Wu, W. Z. Cha, J. J. Ge et al., "Adaptive task sorting model based on improved linear regression method," Journal of Huazhong University of Science and Technology (Natural Science Edition), vol. 48, no. 1, pp. 93–97+107, 2020.
[19] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," Journal of Machine Learning Research, vol. 9, pp. 249–256, 2010.
[20] S. Park, S.-B. Lee, and J. Park, "Data augmentation method for improving the accuracy of human pose estimation with cropped images," Pattern Recognition Letters, vol. 136, 2020.
[21] Y. Z. Zhou, X. Y. Cha, and J. Lan, "Power system transient stability prediction based on data augmentation and deep residual network," Journal of China Electric Power, vol. 53, no. 1, pp. 22–31, 2020.
[22] Y. Sun, X. Wang, and X. Tang, "Deep convolutional network cascade for facial point detection," Computer Vision and Pattern Recognition, vol. 9, no. 4, pp. 3476–3483, 2013.
designed to reduce the number of parameters that thenetworks need to learn and accelerate the speed at locatingkey points on the face
2 Related Theories
Convolutional neural networks (CNNs) [15] are a new typeof artificial neural network proposed by combining tradi-tional artificial neural networks and deep learning Sincefully connected neural networks have a large number ofparameters when processing image data CNNs optimize thestructure of the traditional fully connected neural networksby introducing weight sharing [16] and local perceptionmethods which greatly reduces the number of learnableparameters of the fully connected neural networks Since thefeature maps output by the CNN pooling layers have theadvantages of rotation invariance and translation invarianceCNNs have natural robustness [17] for processing imagedata In the common CNN model architecture most CNNsuse the convolutional layer and the pooling layer alternatelyto extract the semantic information of the facial image fromlow to high After the channel numbers and size of thefeature maps reach a certain dimension the feature maps arearranged in order and converted into a one-dimensionalfeature vector which is connected with the fully connectedlayer for dimensional transformation e operation processof the convolutional layer can be expressed as follows
X^{(l,k)} = f\left( \sum_{p=1}^{n_{l-1}} W^{(l,k,p)} \otimes X^{(l-1,p)} + b^{(l,k)} \right). \quad (1)
In formula (1), X^{(l,k)} represents the k-th feature map output by the l-th hidden layer, n_l represents the number of channels of the l-th layer feature maps, and W^{(l,k,p)} represents the convolution kernel used when mapping the p-th feature map of the (l-1)-th layer to the k-th feature map of the l-th layer. The algorithm in this paper uses max-pooling; after a pooling operation with stride step, the side length of the feature map is reduced to 1/step of the original. Max-pooling can be expressed as follows:
X^{(l+1,k)}(m, n) = \max_{0 \le a, b < s} \left\{ X^{(l,k)}(m \cdot step + a, \; n \cdot step + b) \right\}. \quad (2)
In formula (2), X^{(l+1,k)}(m, n) represents the value at position (m, n) of the k-th feature map output by the (l+1)-th layer, s is the side length of the pooling window, and step is the stride.
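As a concrete illustration of formula (2), here is a minimal pure-Python sketch of max-pooling over a single feature map (the helper name is ours, not the paper's code):

```python
# Sketch of formula (2): out[m][n] is the maximum over a size x size window
# whose top-left corner sits at (m*step, n*step) in the input feature map.
def max_pool(feature_map, size=2, step=2):
    rows, cols = len(feature_map), len(feature_map[0])
    out_rows = (rows - size) // step + 1
    out_cols = (cols - size) // step + 1
    return [
        [
            max(
                feature_map[m * step + a][n * step + b]
                for a in range(size)
                for b in range(size)
            )
            for n in range(out_cols)
        ]
        for m in range(out_rows)
    ]

fmap = [
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
print(max_pool(fmap))  # → [[6, 8], [3, 4]]
```

With size = step = 2, each side of the map is halved, matching the 1/step reduction described above.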
When optimizing the CNN model, backpropagation is used to update all the connection weights between neurons so as to minimize the loss function. Since face alignment is a regression task [18], the mean square error (MSE) is used as the loss function, which can be expressed as follows:
MSE = \frac{1}{2N} \sum_{i=1}^{N} \| O_i - P_i \|_2^2. \quad (3)
In formula (3), N is the number of nodes in the output layer of the neural network, O is the manually labeled value, and P is the value predicted by the network.
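A hedged numeric sketch of the loss in formula (3), with our own function name and made-up coordinate values (5 key points give N = 10 coordinates):

```python
# Formula (3): mean squared error with a 1/(2N) factor,
# comparing labels O against predictions P element-wise.
def mse_loss(O, P):
    n = len(O)
    return sum((o - p) ** 2 for o, p in zip(O, P)) / (2 * n)

# Hypothetical normalized coordinates of 5 key points (x1, y1, ..., x5, y5).
labels = [0.2, 0.3, 0.8, 0.3, 0.5, 0.5, 0.3, 0.7, 0.7, 0.7]
preds  = [0.2, 0.3, 0.8, 0.3, 0.5, 0.5, 0.3, 0.7, 0.7, 0.9]
print(mse_loss(labels, preds))  # ≈ 0.002 (only the last coordinate is off by 0.2)
```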
3. Improved CCNN
To improve the detection efficiency and accuracy of the algorithm in this paper, the CCNN network is improved in the following 4 aspects.
Improvement 1: designing shallower networks. Deep networks create a large number of parameters, resulting in a sharp increase in time and space complexity. Additionally, through the chain rule, deep networks easily suffer from vanishing and exploding gradients, which can prevent them from learning useful rules. The CCNN algorithm has a total of 23 CNNs, each containing at most 8 layers and at least 6 layers. Unnecessary convolution and pooling layers are removed, and the number of parameters is greatly reduced to ensure the running speed of the algorithm.
Improvement 2: a shake mechanism is proposed to solve the problem that the rules learned by the networks tend toward the geometric center of the windows, improving the generalization ability (robustness) of the model. The positioning accuracy and speed are increased by designing the size of the windows.
Improvement 3: adopting multilevel cascaded recursive CNNs. Experiments verify that the 3-stage cascaded structure chosen for the CCNN in this paper guarantees high positioning accuracy within an acceptable time range.
Improvement 4: designing a smaller input image size. A smaller input image speeds up the training of the networks, and the number of channels of the convolution kernels can be increased to compensate for the information loss caused by the smaller input.
3.1. CCNN Framework Design. The framework of the improved cascaded convolutional neural network in this paper is divided into 3 stages, which realizes the process from global coarse positioning to local precise positioning.
The overall design of the CCNN algorithm adopts a recursive idea, and the recursion terminates at "the number of network stages that balances positioning accuracy and positioning time complexity." On this basis, a cascaded convolutional neural network with 3 stages is designed: in addition to guaranteeing a high positioning accuracy, it must also deliver the positioning results within an acceptable time frame. As shown in Table 1, each stage of the CCNN algorithm recursively refines the key points of the face.
2 Advances in Multimedia
3.1.1. The First Stage. As shown in Figure 1, the original image is marked with the initial face frame, and Figure 1(a) is cropped into 3 areas according to the face distribution of the LFPW dataset (respectively, the key point area contained in the initial face frame; the left eye, right eye, and nose tip key point area; and the nose tip, left mouth corner, and right mouth corner key point area). These 3 areas are used as the input of the 3 CNNs of the first stage, and the coordinate values of the 5 key points are output as a weighted average to obtain Figure 1(b).
The rough positioning of the global key points is thereby realized. The weighted average method is shown as follows:
(x, y)_p = \frac{1}{N} \sum_{i=1}^{N} (\hat{x}_i, \hat{y}_i). \quad (4)
In formula (4), p belongs to the set {left eye, right eye, nose tip, left mouth corner, right mouth corner}, and N is the number of repeated predictions of that key point.
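Formula (4) can be sketched as a plain average over the candidate predictions of one key point (in the second and third stages N = 2, since every 2 CNNs co-locate one point). The function name and the example coordinates below are ours:

```python
# Formula (4): the final coordinate of key point p is the average of the
# N candidate predictions (x_hat_i, y_hat_i) produced for it.
def fuse_candidates(candidates):
    """candidates: list of (x_hat, y_hat) predictions for one key point."""
    n = len(candidates)
    x = sum(c[0] for c in candidates) / n
    y = sum(c[1] for c in candidates) / n
    return (x, y)

# e.g. the two level-2 left-eye CNNs (LE21, LE22) disagree slightly:
print(fuse_candidates([(30.0, 42.0), (32.0, 44.0)]))  # → (31.0, 43.0)
```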
3.1.2. The Second Stage. The 5 key points output by the first stage are used as anchor points and are expanded to the top, bottom, left, and right of the anchor points to obtain the current 10 windows. The expansion method is based on the following formula:
windows_i = (a_1 \cdot h, \; b_1 \cdot w), \quad i = 1, 2, \ldots, 5,
windows_j = (a_2 \cdot h, \; b_2 \cdot w), \quad j = 6, 7, \ldots, 10. \quad (5)
In formula (5), windows_i represents the 5 windows of the first group, windows_j represents the 5 windows of the second group, a and b are hyperparameters (a_1 and b_1 take the value 0.16; a_2 and b_2 take the value 0.18), and h and w are the height and width of the initial face frame. Then, according to the size of the windows and the shake (shake in this paper follows a normal distribution, namely, shake ∼ N(0, 1e−4), as arrow ①), 10 partial face images are cut out with a random perturbation factor to obtain Figure 1(c). Shake can be regarded as a kind of data augmentation operation; in essence, it is a translation strategy that shifts a window randomly up, down, left, and right by shake units. These 10 partial facial images are taken as the input of the 10 CNNs of the second stage, and every 2 CNNs use formula (4) to jointly locate a key point by weighted averaging, obtaining Figure 1(d) and thereby realizing local key point positioning.
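The window-plus-shake cropping could be sketched as follows. This is an illustrative reading, not the paper's code: the function name is ours, and treating shake ∼ N(0, 1e−4) as a jitter in face-frame-normalized units (std = 0.01) is our assumption.

```python
import random

# Sketch of stage-2 cropping: each anchor point gets a window of size
# (a*h, b*w) per formula (5), whose center is jittered by a Gaussian "shake"
# with std 0.01 (variance 1e-4) in units of the face frame size (assumption).
def shaken_window(anchor, face_h, face_w, a, b, std=0.01):
    """Return (left, top, right, bottom) of a window centered near `anchor`."""
    cx = anchor[0] + random.gauss(0.0, std) * face_w   # horizontal shake
    cy = anchor[1] + random.gauss(0.0, std) * face_h   # vertical shake
    win_h, win_w = a * face_h, b * face_w
    return (cx - win_w / 2, cy - win_h / 2, cx + win_w / 2, cy + win_h / 2)

random.seed(0)
# First window group of stage 2: a1 = b1 = 0.16 relative to the face frame.
box = shaken_window((60.0, 80.0), face_h=100.0, face_w=100.0, a=0.16, b=0.16)
print(box)  # a 16 x 16 window roughly centered on the anchor point
```

Because the jitter moves the window but the label stays put, the network cannot simply learn the window's geometric center, which is the stated motivation for the shake mechanism.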
3.1.3. The Third Stage. Taking the 5 key points output by the second stage as anchor points, smaller windows and shake are designed (a_1 and b_1 take the value 0.08; a_2 and b_2 take the value 0.09; shake is the same as in the second stage, as arrow ②) to cut out 10 smaller partial images around the facial key points, obtaining Figure 1(e). These 10 partial images are used as the input of the 10 CNNs of the third stage to obtain Figure 1(f); every 2 CNNs are jointly weighted averaged by formula (4) to locate a key point, yielding the accurate localization on the partial face images, which is the final output of the algorithm in this paper.
3.2. Network Structure Design. Since the algorithm in this paper is based on the recursive idea, we take the second stage (level 2), specifically the LE21 CNN and LE22 CNN that locate the key point of the left eye (LE), as examples to describe the network structure design of the CCNN (F is the full face, L is left, R is right, E is the eye, N is the nose, M is the mouth corner, and LE21 CNN denotes the first CNN that locates the left eye in the second stage; the same naming applies below).
3.2.1. Level 2 Network Structure. Figure 2 shows the level 2 network structure. This stage contains 10 lightweight CNNs, namely, LE21, LE22, RE21, RE22, N21, N22, LM21, LM22, RM21, and RM22; every 2 CNNs jointly locate a key point by the weighted average of formula (4), and combined with the positioning results of the other 8 CNNs, the level 2 output is finally obtained.
3.2.2. Level2-LE21 CNN Network Structure. To improve the positioning speed and efficiency of the algorithm in this article, we design a new cascaded convolutional neural network in this section, which accelerates the training and testing of the network by reducing the number of layers of each CNN and reducing the size of the input image. Compared with reference [13], this paper reduces the number of network layers from 12 to 6, and the input size of the second stage is reduced from 32×32×3 to 15×15×3.
Figure 3 shows the network structure of level2-LE21 CNN. We input a 15×15×3 left eye local area map; after 2 sets of convolution-pooling layers and 2 fully connected layers, we output the left eye coordinate value predicted by the
Table 1: Design of the CCNN algorithm.

Stage 1
  Input: X (face images with face frames)
  Step 1: resize X to 39×39×3
  Step 2: after 2 sets of convolution-pooling layers and 2 fully connected layers, generate a candidate set of key points
  Step 3: same as Step 2, generate candidate face key point coordinates
  Output: y1 (face image with 5 weighted key points)

Stage 2
  Input: use 2 different windows with shake to crop y1, obtaining 10 partial face images
  Step 1: similar to Stage 1, Step 2, generate a candidate set of key points
  Step 2: every 2 CNNs co-locate a key point
  Output: y2 (face image with 5 weighted key points)

Stage 3
  Input: use 2 smaller windows with shake to crop y2, obtaining 10 partial face images
  Step 1: similar to Stage 2, Step 1, generate a candidate set of key points
  Step 2: every 2 CNNs co-locate a key point
  Output: y3 (face image with 5 weighted key points)
[Figure 2: Schematic diagram of the level 2 network structure. Crops produced by window 1/shake 1 and window 2/shake 2 feed the 10 CNNs (LE21, LE22, ..., RM21, RM22); the two candidate coordinates of each key point are averaged to form the output.]
[Figure 3: Network structure of level2-LE21 CNN: 15×15×3 input → 4×4 convolution (12×12×20) → 2×2 max pooling (6×6×20) → 3×3 convolution (4×4×40) → 2×2 max pooling (2×2×40) → fully connected layer (60) → fully connected layer (2).]
[Figure 1: CCNN algorithm framework. Input (a)/(b): first stage, rough localization; (c)/(d): second stage, localization (arrow ①); (e)/(f): third stage, precise localization (arrow ②).]
LE21 CNN; combined with the left eye coordinate value output by the LE22 CNN, the level 2 left eye coordinate is finally obtained as a weighted average in the manner of formula (4).
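Under the assumption that the stated feature-map sizes imply unpadded convolutions, the level2-LE21 structure could be sketched in PyTorch (the framework used in Section 4) roughly as follows; the class and layer names are ours, not the paper's code:

```python
import torch
import torch.nn as nn

# Sketch of the level2-LE21 CNN inferred from the text and Figure 3:
# 15x15x3 input -> 4x4 conv (20 maps) -> 2x2 max pool -> 3x3 conv (40 maps)
# -> 2x2 max pool -> FC-60 -> FC-2 (the predicted left-eye coordinates).
class LE21(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 20, kernel_size=4),   # 15x15 -> 12x12
            nn.ReLU(),
            nn.MaxPool2d(2),                   # 12x12 -> 6x6
            nn.Conv2d(20, 40, kernel_size=3),  # 6x6 -> 4x4
            nn.ReLU(),
            nn.MaxPool2d(2),                   # 4x4 -> 2x2
        )
        self.head = nn.Sequential(
            nn.Flatten(),                      # 2 * 2 * 40 = 160
            nn.Linear(160, 60),
            nn.ReLU(),
            nn.Dropout(0.25),                  # dropout rate stated in Section 3.3
            nn.Linear(60, 2),                  # (x, y) of the left eye
        )

    def forward(self, x):
        return self.head(self.features(x))

out = LE21()(torch.randn(1, 3, 15, 15))
print(out.shape)  # torch.Size([1, 2])
```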
3.2.3. Proposing the Shake Mechanism and Windows. The entire algorithm flow is carried out recursively in stages. As shown in Figure 1, the original red frame of the input picture in the first stage can be regarded as a larger window, and each small green frame of the input picture in the second stage is a smaller window. Since the algorithm in this paper positions key points relative to windows, the more accurate the windows, the more accurate the model positioning; and the smaller the windows, the faster the positioning. To avoid the immutability of the windows, the shake mechanism is proposed to slightly shake, that is, translate, the windows (explained in detail for the second stage in Section 3.1.2); it can also be regarded as a method of data augmentation, thereby improving the generalization ability of the model.
3.2.4. Model Parameters. The network structure parameters of the first stage are shown in Table 2, where the fully connected output layers of F1, EN1, and NM1 are 10d-fc2, 6d-fc2, and 6d-fc2, respectively. The network structure of the third stage is the same as that of the second stage. The total number of network parameters is only 643,920, which is approximately 2.46 MB.
3.3. CCNN Model Training. The model training in this paper adopts the idea of parallel within groups and serial between groups. For each lightweight CNN, a rectified linear unit (ReLU) is used after each convolutional layer to increase the nonlinear expression ability of the model, max-pooling layers are used to downsample the feature maps of the convolutional layers, and the fully connected layers use dropout (with a rate of 0.25) to improve the robustness of the model. The initial learning rate is 5e−3, and every 100 iterations the learning rate decays to 90% of its value (the other hyperparameters, such as batch size and number of iterations, are selected according to the actual situation). The stochastic gradient descent (SGD) algorithm is used to update the weights of all connections between neurons during training. To further increase the robustness of the model, L2 regularization is introduced to penalize the learnable parameters W, and the final expression of the newly designed loss function is as follows:
E = \frac{1}{m} \sum_{j=1}^{m} \left( \frac{1}{2N} \sum_{i=1}^{N} \left\| y_i^{(j)} - d_i^{(j)} \right\|_2^2 \right) + \frac{1}{2} \eta \| W \|_2^2. \quad (6)
In formula (6), m is the batch size, and W is the weight matrix of the network, which is updated during error backpropagation. Before training starts, the Xavier [19] method is used to initialize the weight matrix W_0. The weight matrix W_{t+1} obtained after t+1 iterations can be expressed as follows:
W_{t+1} = W_t - \lambda \cdot \frac{\partial E}{\partial W_t}. \quad (7)
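Formulas (6) and (7) together give the usual SGD step with an L2 penalty. A minimal numeric sketch with scalar weights (the function name and the values of η and λ below are illustrative, not the paper's):

```python
# Update rule of formula (7) with the L2 term of formula (6):
# grad(E) = grad(data term) + eta * W, since d/dW of (1/2)*eta*||W||^2 is eta*W;
# then W <- W - lambda * grad(E).
def sgd_step(w, data_grad, lr=5e-3, eta=1e-4):
    total_grad = data_grad + eta * w
    return w - lr * total_grad

w = 0.5
w = sgd_step(w, data_grad=2.0)
print(w)  # ≈ 0.48999975, i.e. 0.5 - 0.005 * (2.0 + 1e-4 * 0.5)
```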
4. Experiments and Results
The experiments were run on CentOS 7 (64-bit) on a Dell server equipped with 2 RTX 2080 Ti graphics cards and 32 GB of memory; the code runs under the PyTorch framework.
4.1. Dataset. The experiments use the LFPW facial dataset, a total of 13,466 facial images. Each facial image has 3 RGB channels and relevant coordinate annotations (including key point coordinates and initial face frame coordinates). Of the original data, 10,000 images are used as the training set and 3,466 images as the test set. Data augmentation [20, 21] can be used to increase the number of training samples and effectively improve the performance of a convolutional neural network. To remain consistent with realistic conditions, we apply center rotations of 15 degrees to the left and 12 degrees to the right (the label coordinates are rotated accordingly); the rotation method is given in formula (8), where θ is the rotation angle:
\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}. \quad (8)
Then, the training dataset is flipped horizontally (the label coordinates are flipped accordingly), and 60,000 training samples are finally obtained. Each training sample is standardized (subtracting the mean and dividing by the variance). In addition, for each training sample, the influence of the length and width of the face frame (the windows at each stage) must be eliminated. The elimination method is shown as follows:
x_{direction} = \frac{| x' - x |}{w}, \qquad y_{direction} = \frac{| y' - y |}{h}. \quad (9)
In formula (9), x and y are the horizontal and vertical coordinates of the upper left corner of the window, x' and y' are the horizontal and vertical coordinates of the key point predicted within the current window, and w and h are the width and height of the window.
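Formulas (8) and (9) can be sketched together; the helper names are ours, and generalizing the rotation of formula (8) to an arbitrary center point is our reading of the "center rotation" described above:

```python
import math

def rotate_point(x, y, cx, cy, theta_deg):
    """Rotate (x, y) about (cx, cy) by theta degrees, as in formula (8)."""
    t = math.radians(theta_deg)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(t) - dy * math.sin(t),
            cy + dx * math.sin(t) + dy * math.cos(t))

def normalize(xp, yp, x, y, w, h):
    """Formula (9): offset of a predicted point from the window's top-left
    corner (x, y), divided by the window's width w and height h."""
    return (abs(xp - x) / w, abs(yp - y) / h)

# 90-degree sanity check: (1, 0) about the origin goes to (0, 1).
px, py = rotate_point(1.0, 0.0, 0.0, 0.0, 90)
print(round(px, 6), round(py, 6))  # 0.0 1.0
# A point at the center of a 30 x 30 window normalizes to (0.5, 0.5).
print(normalize(25.0, 35.0, 10.0, 20.0, 30.0, 30.0))  # (0.5, 0.5)
```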
4.2. Experiment Results

Experiment 1: verifying the necessity of the 3-stage cascaded structure.

As shown in Figure 4, the vertical axis is the test error, the horizontal axis is the facial key points, and the blue, red, and green broken lines are the test errors of the first, second, and third stages, respectively.
The test error of the second stage is lower than that of the first stage, and the test error of the third stage decreases further compared with the second stage; the average test error decreases from 2.21% to 1.37% and then to 1.03%. To balance positioning accuracy and positioning speed, this article does not cascade a fourth stage of CNNs.
Experiment 2: verifying the necessity of windows.

Figure 5 shows the impact of the presence or absence of windows on the test error of each key point in the first stage. The vertical axis is the test error, and the horizontal axis is the facial key points; the blue broken line is windows−, which uses absolute positioning, and the red broken line is windows+, which uses relative positioning.

The relative positioning method is more stable, and its test error at each key point is lower than that of absolute positioning; the average test error decreases from 3.38% to 2.35%. Because the first stage performs only coarse positioning of the global key points, the test errors of both broken lines are high.
Experiment 3: verifying the necessity of the shake mechanism.

Figure 6 shows the impact of the presence or absence of the shake factor on the test error of each key point in the first stage. The vertical axis is the test error, and the horizontal axis is the facial key points. The blue broken line is shake−, meaning no data augmentation; the red broken line is shake+, denoting data augmentation on the partial images. The test error of the red broken line is lower than that of the blue broken line, which shows that the shake mechanism improves the generalization ability of the model; the average test error drops from 2.35% to 1.90%. Because the first stage performs only coarse positioning of the global key points, the test errors of both broken lines are high.
Experiment 4: comparison with other algorithms of the same type on the LE, RE, N, LM, and RM errors and the average test error.

Figure 7 shows the test error of each key point for each algorithm; the vertical axis is the test error, and the horizontal axis is the facial key points.
The blue and red broken lines are the algorithms of Sun [22] and Chen Rui [13], respectively, and the green broken line is the algorithm in this paper. The algorithm in this paper is below every other algorithm of the same type on the LE, RE, N, and LM errors and the average test error, the latter reaching 1.03%. The test error at the RM key point is slightly higher than that of Chen Rui's [13] algorithm, which indirectly
Table 2: Network structure of the first stage.

Name | Input | Convolution kernel / stride | Output | Parameters
conv1 | 39×39×3 | 4×4×20 / 1 | 36×36×20 | 960
pool1 | 36×36×20 | 2×2 / 2 | 18×18×20 | -
conv2 | 18×18×20 | 3×3×40 / 1 | 16×16×40 | 7200
pool2 | 16×16×40 | 2×2 / 2 | 8×8×40 | -
conv3 | 8×8×40 | 3×3×60 / 1 | 6×6×60 | 21600
pool3 | 6×6×60 | 2×2 / 2 | 3×3×60 | -
120d-fc1 | | | | 64800
10d-fc2 (F1), 6d-fc2 (EN1), 6d-fc2 (NM1)
[Figure 4: Test error (%) of each key point (LE, RE, N, LM, RM) at the three stages; the average test error is 2.21 (first stage), 1.37 (second stage), and 1.03 (third stage).]
[Figure 5: Test error (%) of each key point with (windows+) and without (windows−) windows; the averages are 2.35 and 3.38, respectively.]
indicates that each network needs separate hyperparameters for training.
In addition, the learnable parameters of this algorithm number only 643,920, approximately 2.46 MB, which is approximately 58% of the parameters of Chen Rui [13] and 5.6% of those of ResNet18. It requires only 1.47 ms to process a facial image on the GPU. As shown in Table 3, this greatly reduces the consumption of hardware resources, so the algorithm can be embedded in cameras or in apps on mobile devices such as phones.
Figure 8 shows the facial landmark localization results of the algorithm in this paper. The first row of facial images is the output of the first CNN stage, and so on. The output of the first stage (global coarse positioning of the key points) has flaws; the positioning accuracy of the second stage (local key point positioning) improves significantly; and the positioning accuracy of the third stage (precise positioning of the local key points) improves again, to the point where differences are no longer distinguishable by the naked eye.
Even under distorted expressions and diverse postures, the algorithm still achieves accurate positioning. For tasks with very strict real-time requirements, the first 2 stages of the cascaded convolutional neural networks alone are sufficient.
[Figure 6: Test error (%) of each key point with (shake+) and without (shake−) the shake mechanism; the averages are 1.90 and 2.35, respectively.]
[Figure 7: Test error (%) of each key point for each algorithm (ours, Chen Rui [13], and Sun [22]); our average test error is 1.03.]
Table 3: Parameters and speed of each algorithm.

Method | Parameters (MB) | Speed (ms)
Algorithm of this paper | 2.46 | 1.47
ResNet18 | 44.59 | -
ResNet34 | 85.15 | -
Chen Rui [13] | 4.24 | 1.59
5. Conclusions
The improved face alignment algorithm based on a cascaded convolutional neural network in this paper realizes the process from coarse to precise positioning by designing a 3-stage network structure. The improvement of the windows accelerates the training of the network and improves the positioning accuracy of the key points. The proposed shake factor can be regarded as a data augmentation operation, which improves the generalization ability of the model. Tests on the LFPW face dataset show that the average test error of the algorithm in this paper reaches 1.03%, approximately 1.16% lower than that of algorithms of the same type; in addition, it takes only 1.47 ms to process a facial image on the GPU, 0.12 ms faster than Chen Rui's [13], realizing a small-scale and efficient facial landmark localization network. The algorithm shows good anti-interference ability against poses and expressions. Future work can study data augmentation further and add detection technology to give a more accurate initial face frame.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China ([2018] 61741124) and the Guizhou Science and Technology Plan Project ([2018] 5781).
References
[1] G. Q. Qi, J. M. Yao, H. L. Hu et al., "Face landmark detection network with multi-scale feature map fusion," Application Research of Computers, vol. 1-6, pp. 1001–3695, 2019.
[2] J. Xu, W. J. Tian, and Y. Y. Fan, "Simulation of face key point recognition and location method based on deep learning," Computer Simulation, vol. 37, no. 06, pp. 434–438, 2020.
[3] J. Li, The Research and Implement of Face Detection and Face Alignment in the Wild, NanJing University, Nanjing, China, 2019.
[4] X. F. Yang, Research on Facial Landmark Localization Based on Deep Learning, Xiamen University, Xiamen, China, 2019.
[5] C. X. Jing, Research on Face Detection and Face Alignment Based on Deep Learning, China Jiliang University, Hangzhou, China, 2018.
[6] T. F. Cootes, C. J. Taylor, D. H. Cooper et al., "Active shape models: their training and application," Computer Vision and Image Understanding, vol. 61, no. 1, 1995.
[7] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," in Proceedings of the 5th European Conference on Computer Vision, Springer, Berlin, Heidelberg, June 1998.
[8] X. Yu, C. Ma, Y. Hu et al., "New neural network method for solving nonsmooth pseudoconvex optimization problems," Computer Science, vol. 46, no. 11, pp. 228–234, 2019.
[9] A. Sengupta, Y. Ye, R. Wang et al., "Going deeper in spiking neural networks: VGG and residual architectures," Frontiers in Neuroscience, vol. 13, 2018.
[10] C. Szegedy, S. Ioffe, V. Vanhoucke et al., "Inception-v4, Inception-ResNet and the impact of residual connections on learning," 2016, https://arxiv.org/abs/1602.07261.
[11] K. He, X. Zhang, S. Ren et al., "Deep residual learning for image recognition," 2015, https://arxiv.org/abs/1512.03385.
[12] J. H. Xie and Y. S. Zhang, "CASPN: cascaded spatial pyramid network for face landmark localization," Application Research of Computers, vol. 1-6, 2020.
[13] R. Chen and D. Lin, "Face key point location based on cascaded convolutional neural network," Journal of SiChuan Technology University (Natural Edition), vol. 30, pp. 32–37, 2017.
[14] L. H. Xu, Z. Li, J. J. Jiang et al., "A high-precision and lightweight facial landmark detection algorithm," Laser & Optoelectronics Progress, vol. 1-12, 2020.
[15] J. D. Lin, X. Y. Wu, Y. Chai et al., "Structure optimization of convolutional neural networks: a survey," Journal of Automatica Sinica, vol. 46, no. 01, pp. 24–37, 2020.
[16] F. Y. Hu, L. Y. Li, X. R. Shang et al., "A review of object detection algorithms based on convolutional neural networks," Journal of SuZhou University of Science and Technology (Natural Science Edition), vol. 37, no. 02, pp. 1–10+25, 2020.
[Figure 8: Effect picture of facial landmark localization.]
[17] S. Wu, Research on Occlusion and Pose Robust Facial Landmark Localization, Zhejiang University, Hangzhou, China, 2019.
[18] Y. H. Wu, W. Z. Cha, J. J. Ge et al., "Adaptive task sorting model based on improved linear regression method," Journal of Huazhong University of Science and Technology (Natural Science Edition), vol. 48, no. 01, pp. 93–97+107, 2020.
[19] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," Journal of Machine Learning Research, vol. 9, pp. 249–256, 2010.
[20] S. Park, S.-B. Lee, and J. Park, "Data augmentation method for improving the accuracy of human pose estimation with cropped images," Pattern Recognition Letters, vol. 136, 2020.
[21] Y. Z. Zhou, X. Y. Cha, and J. Lan, "Power system transient stability prediction based on data augmentation and deep residual network," Journal of China Electric Power, vol. 53, no. 01, pp. 22–31, 2020.
[22] Y. Sun, X. Wang, and X. Tang, "Deep convolutional network cascade for facial point detection," Computer Vision and Pattern Recognition, vol. 9, no. 4, pp. 3476–3483, 2013.
Advances in Multimedia 9
311 e First Stage As shown in Figure 1 the originalimage is marked with the initial face frame and Figure 1(a) iscropped into 3 areas according to the face distribution of theLFPW dataset (respectively the key point area contained inthe initial face frame left right eye and nose tip key pointsarea and nose tip left and right mouth corner key pointsarea) these 3 areas are used as the input of the 3 CNNs in thefirst stage and the coordinate values of 5 key points areweighted average output to obtain Figure 1(b)
e rough positioning of global key points is realizede weighted average method is shown as follows
(x y)p 1N
1113944
N
i11113954xi 1113954yi( 1113857 (4)
In formula (4) P belongs to the set of left eye right eyenose tip left mouth corner right mouth corner and N is thenumber of repeated positions
312 e Second Stage e 5 key points output in the firststage are used as anchor points and expanded to the topbottom left and right of the anchor points to obtain thecurrent 10 windows e expansion method is based on thefollowing formula
windowsi a1lowasth b1lowastw( 1113857 i 1 2 5
windowsj a2lowasth b2lowastw( 1113857 j 6 7 10
⎧⎨
⎩ (5)
In formula (5) windowsi represents the 5 windows of thefirst group windowj represents the 5 windows of the secondgroup a and b are hyperparameters a1 and b1 take the valueof 016 a2 and b2 take the value of 018 and h and w are theheight and width of the initial face frameen according tothe size of the windows and shake (shake in the paper followsa normal distribution namely shakesimN (0 1eminus 4) as arrow①) 10 partial human face images are to cut out with arandom perturbation factor to obtain Figure 1(c) Shake canbe regarded as a kind of data augmentation operation Itsessence is a type of translation strategy It can shift thewindow randomly up down left and right by shake unitsese 10 partial facial images are taken as the input of the 10CNNs in the second stage and every 2 CNNs use formula (4)to jointly weight and average a key point to obtainFigure 1(d) thereby realizing local key point positioning
313 e ird Stage Taking the 5 key points output in thesecond stage as anchor points and by designing smallerwindows and shake (a1 and b1 take the value of 008 a2 andb2 take the value of 009 shake is the same as the secondstage as arrow②) to cut out the smaller 10 partial images ofthe facial key points we obtain Figure 1(e) ese 10 partialimages are used as the input of the 10 CNNs in the thirdstage to obtain Figure 1(f) and every 2 CNNs are jointlyweighted averaged by formula (4) to locate a key point toobtain the accurate localization of the partial image of theface which is the final output of the algorithm in this paper
32 Network Structure Design Since the algorithm in thispaper is based on the recursive idea now we take the secondstage (level 2) the LE21 CNN and LE22 CNN which locatethe key point of the left eye (LE) as examples to describe thenetwork structure design of the CCNN (F is the full face L isleft R is right E is the eye N is the nose M is the cornermouth and LE21 CNN represents the first CNN that locatesthe left eye in the second stage the same as follows)
321 Level 2 Network Structure Figure 2 shows the level 2network structure is stage contains 10 lightweight CNNsnamely LE21 LE22 RE21 RE22 N21 N22 and LM21
LM22 RM21 and RM22 every 2 CNNs are jointlyweighted average by formula (4) to locate a key point andcombined with the positioning results of the other 8 CNNsand finally the level 2 output is obtained
322 Level2-LE21 CNN Network Structure To improve thepositioning speed and efficiency of the algorithm in thisarticle we design a new cascaded convolutional neuralnetwork in this section which accelerates the training andtesting of the network by reducing the number of layers ofeach CNN and reducing the size of the input imageCompared with reference [13] this paper reduces thenumber of network layers from 12 to 6 layers and the inputsize of the second stage is reduced from 32lowast32lowast 3 to 15lowast15lowast3
Figure 3 shows the network structure of level2-LE21CNN We input 15lowast15lowast3 left eye local area map after 2 setsof convolutional pooling layers and 2 fully connected layerswe output the left eye prediction coordinates value of the
Table 1 Design of the CCNN algorithm
Stage 1
Input X (face images with face frames)Step 1 resize X to (39lowast39lowast3)
Step 2 after 2 sets of convolutional pooling layer and 2 fully connected layers generate a candidate set of key pointsStep 3 same as step 2 generate candidate face key point coordinates
Output y1 (face image with 5 weighted key points)
Stage 2
Input use 2 different windows with shake to crop y1 to obtain 10 partial face imagesStep 1 similar to stage1-step2 generate a candidate set of key points
Step 2 every 2 CNNs colocate a key pointOutput y2 (face image with 5 weighted key points)
Stage 3
Input use 2 smaller different windows with shake to crop y2 to obtain 10 partial face imagesStep 1 similar to stage2-step1 generate a candidate set of key points
Step 2 every 2 CNNs colocate a key pointOutput y3 (face image with 5 weighted key points)
Advances in Multimedia 3
Le1
Le1
Rm1
Rm1
LE21
Le2Average
Rm2
Average
Level 2
middotmiddotmiddot middotmiddotmiddot middotmiddotmiddot
LE22
RM21
RM22
Window 2
Shake 2
Window 1
Shake 1
Window 1
Shake 1
Window 2
Shake 2
Input Output
Figure 2 Schematic diagram of the level 2 network structure
15
15
3
4
4
12
12
20
Input
Convolution
2
2
6
6
20
Max pooling
33
4
4
40
2
2
Convolution
2
240
Max pooling
Full connection60
2Full connection
Figure 3 Network structure of level2-le21 CNN
Input
Output
Precise localization
Roughlocalization
Localization
First stage
Second stage
Third stage
(a) (b)
(d) (c)
(e) (f)
2
1
Figure 1 CCNN algorithm framework
4 Advances in Multimedia
LE21 CNN combined with the left eye coordinate valueoutput by the LE22 CNN and finally we weighted averagethe left eye coordinate value of level 2 in the manner offormula (4)
323 Proposing the Shake Mechanism and Windowse entire algorithm flow is carried out recursively in stagesAs shown in Figure 1 the original red frame of the inputpicture in the first stage can be regarded as a larger windowand each small green frame of the input picture in the secondstage is a smaller window Since the positioning method ofthe algorithm in this paper uses the relative positioning ofwindows the more accurate the windows the more accuratethe model positioning the smaller the windows the fasterthe model positioning speed To avoid the immutability ofwindows the shake mechanism is proposed to slightly shakewindows that is translation (which has been explained indetail in the second stage of Section 312) it can also beregarded as a method of data augmentation thereby im-proving the generalization ability of the model
3.2.4. Model Parameters. The network structure parameters of the first stage are shown in Table 2, where the fully connected layers of F1, EN1, and NM1 are 10d-fc2, 6d-fc2, and 6d-fc2, respectively. The network structure of the third stage is the same as that of the second stage. The total number of network parameters is only 643,920, which is approximately 2.46 MB.
3.3. CCNN Model Training. The model training method in this paper adopts the idea of parallel within groups and serial between groups. For each lightweight CNN, a rectified linear unit (ReLU) follows each convolutional layer to increase the nonlinear expression ability of the model; max-pooling layers downsample the feature maps of the convolutional layers; and the fully connected layers use dropout to improve the robustness of the model (with a rate of 0.25). The initial learning rate is 5e-3, and every 100 iterations the learning rate decays to 90% of its previous value (other hyperparameters, such as batch size and iteration count, are selected according to the actual situation). The stochastic gradient descent (SGD) algorithm is used to update the weights of all connections between neurons during network training. To further increase the robustness of the model, L2 regularization is introduced to penalize the learnable parameters W; the final expression of the newly designed loss function is as follows:
E = \frac{1}{m}\sum_{j=1}^{m}\left(\frac{1}{2N}\sum_{i=1}^{N}\left\|y_i^j - d_i^j\right\|_2^2\right) + \frac{1}{2}\eta\|W\|_2^2.  (6)
In formula (6), m is the batch size and W is the weight matrix of the network. This weight matrix W is updated during error back propagation. Before training starts, the Xavier [19] method is used to initialize the weight matrix W_0. The weight matrix W_{t+1} obtained after t+1 iterations can be expressed as follows:
W_{t+1} = W_t - \lambda \cdot \frac{\partial E}{\partial W_t}.  (7)
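The training recipe can be condensed into two small helpers: the step-decay schedule (base rate 5e-3, multiplied by 0.9 every 100 iterations) and the update rule of formula (7), where the L2 penalty (1/2)η‖W‖² of formula (6) contributes ηw to each weight's gradient. This is a minimal scalar sketch, not the paper's actual PyTorch training loop.

```python
def learning_rate(iteration, base_lr=5e-3, decay=0.9, step=100):
    """Step decay from Section 3.3: every `step` iterations the learning
    rate is multiplied by `decay`."""
    return base_lr * decay ** (iteration // step)

def sgd_step(weights, grads, lr, eta):
    """One application of formula (7) to a list of scalar weights; the L2
    term of formula (6) adds eta * w to each weight's gradient."""
    return [w - lr * (g + eta * w) for w, g in zip(weights, grads)]
```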
4 Experiments and Results
The operating system used in the experiments is CentOS 7 (64-bit) on a Dell server equipped with 2 RTX 2080 Ti graphics cards and 32 GB of memory, and the code runs on the PyTorch framework.
4.1. Dataset. These experiments use the LFPW facial dataset, a total of 13,466 facial images. Each facial image has 3 RGB channels and carries coordinate annotations (key point coordinates and initial face frame coordinates). The original data use 10,000 images as the training set and 3,466 images as the test set. Data augmentation [20, 21] can increase the number of training samples and thereby effectively improve the performance of the convolutional neural network. To be consistent with actual conditions, we apply left 15-degree and right 12-degree center rotations (the label coordinates are rotated accordingly); the rotation method is given in formula (8), where θ is the rotation angle:
\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}.  (8)
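Applying formula (8) to the label coordinates can be sketched as below; the optional rotation centre is an added assumption, since the paper rotates about the image centre while formula (8) as written rotates about the origin.

```python
import math

def rotate_point(x, y, theta_deg, cx=0.0, cy=0.0):
    """Rotate a label coordinate by theta degrees about (cx, cy), matching
    the homogeneous rotation matrix of formula (8)."""
    t = math.radians(theta_deg)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(t) - dy * math.sin(t),
            cy + dx * math.sin(t) + dy * math.cos(t))
```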
Then the training dataset is flipped horizontally (the label coordinates are flipped accordingly), and 60,000 training samples are finally obtained. Each training sample is standardized (subtracting the mean and dividing by the variance). In addition, for each training sample, the influence of the length and width of the face frame (the windows at each stage) must be eliminated. The elimination method is as follows:
x\ \text{direction}: \frac{|x' - x|}{w}, \qquad y\ \text{direction}: \frac{|y' - y|}{h}.  (9)
In formula (9), x and y are the horizontal and vertical coordinates of the upper left corner of the window, x' and y' are the horizontal and vertical coordinates of the key point predicted within the current window, and w and h are the width and height of the window.
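A direct transcription of formula (9), with the window given as its top-left corner plus width and height:

```python
def normalize_to_window(px, py, window):
    """Formula (9): express a predicted key point relative to a window as
    the absolute offset from the window's top-left corner, divided by the
    window's width and height."""
    x, y, w, h = window  # top-left corner (x, y), width w, height h
    return (abs(px - x) / w, abs(py - y) / h)
```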
4.2. Experiment Results
Experiment 1: verifying the necessity of the 3-stage cascaded structure.
As shown in Figure 4, the vertical axis is the test error, the horizontal axis is the facial key points, and the blue, red, and green broken lines are the test errors of the first, second, and third stages, respectively.
The test error of the second stage decreased compared with that of the first stage, and the test error of the third stage decreased further compared with the second stage: the average test error dropped from 2.21% to 1.37% and then to 1.03%. To balance positioning accuracy and positioning speed, this article does not cascade a fourth stage of CNNs.
Experiment 2: verifying the necessity of windows.
Figure 5 shows the impact of the presence or absence of window frames on the test error of each key point in the first stage. The vertical axis is the test error and the horizontal axis is the facial key points; the blue broken line (windows-) uses absolute positioning, while the red broken line (windows+) uses positioning relative to windows.
The relative positioning method is more stable, and its test error at each key point is lower than that of the absolute positioning method; the average test error decreased from 3.38% to 2.35%. Because the first stage performs only coarse global positioning of the key points, the test errors of both broken lines are high.
Experiment 3: verifying the necessity of the shake mechanism.
Figure 6 shows the impact of the presence or absence of the shake factor on the test error of each key point in the first stage.
The vertical axis is the test error and the horizontal axis is the facial key points. The blue broken line (shake-) means no data augmentation; the red broken line (shake+) denotes data augmentation on the partial images. The test error of the red broken line is lower than that of the blue broken line, which shows that the shake mechanism improves the generalization ability of the model; the average test error dropped from 2.35% to 1.90%. Because the first stage performs only coarse global positioning of the key points, the test errors of both broken lines are high.
Experiment 4: comparison with other algorithms of the same type in LE error, RE error, N error, LM error, RM error, and average test error.
Figure 7 shows the test error of each key point for each algorithm; the vertical axis is the test error and the horizontal axis is the facial key points.
The blue and red broken lines are the algorithms of Sun [22] and Chen Rui [13], respectively, and the green broken line is the algorithm in this paper. The algorithm in this paper is lower than the other algorithms of the same type in LE, RE, N, LM, and average test error, and the average test error reaches 1.03%. The test error at the RM key point is slightly higher than that of Chen Rui's [13] algorithm, which indirectly
Table 2: Network structure of the first stage.

Name      | Input     | Kernel / stride | Output    | Parameters
conv1     | 39×39×3   | 4×4×20 / 1      | 36×36×20  | 960
pool1     | 36×36×20  | 2×2 / 2         | 18×18×20  | -
conv2     | 18×18×20  | 3×3×40 / 1      | 16×16×40  | 7200
pool2     | 16×16×40  | 2×2 / 2         | 8×8×40    | -
conv3     | 8×8×40    | 3×3×60 / 1      | 6×6×60    | 21600
pool3     | 6×6×60    | 2×2 / 2         | 3×3×60    | -
120d-fc1  |           |                 |           | 64800
10d-fc2 (F1), 6d-fc2 (EN1), 6d-fc2 (NM1)
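The per-layer parameter counts in Table 2 follow from kernel size times input and output channels (bias terms are apparently not counted), and the quoted 643,920 total at 4 bytes per float32 weight yields the roughly 2.46 MB model size; a quick check:

```python
def conv_params(k, in_ch, out_ch):
    """Weights of a k x k convolution layer (no bias term counted)."""
    return k * k * in_ch * out_ch

counts = {
    "conv1": conv_params(4, 3, 20),   # 4*4*3*20  = 960
    "conv2": conv_params(3, 20, 40),  # 3*3*20*40 = 7200
    "conv3": conv_params(3, 40, 60),  # 3*3*40*60 = 21600
    "fc1":   3 * 3 * 60 * 120,        # flattened 3x3x60 into 120 units = 64800
}
# full model size quoted in the text: 643,920 float32 weights
total_model_mb = 643920 * 4 / 2**20   # about 2.46 MB
```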
Figure 4: Test error graph of each key point (LE, RE, N, LM, RM, average) at the first, second, and third stages.
Figure 5: Test error graph of each key point with (windows+) and without (windows-) windows.
indicates that each network needs separately tuned hyperparameters for training.
In addition, the learnable parameters in this algorithm number only 643,920, approximately 2.46 MB, which is approximately 58% of the parameters of Chen Rui [13] and 5.6% of those of ResNet18. The algorithm requires only 14.7 milliseconds to process a facial image on the GPU. As shown in Table 3, this greatly reduces the consumption of hardware resources, so the algorithm can be embedded in cameras or in apps on mobile devices such as mobile phones.
Figure 8 shows facial landmark localization results of the algorithm in this paper. The first row of facial images is output 1 of the first CNN stage, and so on. The output of the first CNN stage (coarse global key point positioning) has visible flaws; the positioning accuracy of the second CNN stage (local key point positioning) improves significantly; and the positioning accuracy of the third CNN stage (precise local key point positioning) improves further, to the point where differences are no longer distinguishable by the naked eye.
Even under distorted expressions and diverse postures, the algorithm still achieves accurate positioning. For tasks with very demanding real-time requirements, the first 2 stages of the cascaded convolutional neural networks alone are sufficient.
Figure 6: Test error graph of each key point with (shake+) and without (shake-) the shake mechanism.
Figure 7: Test error graph of each key point for each algorithm (Ours, Chen Rui [13], and Sun [22]).
Table 3: Parameters and speed of each algorithm.

Methods / indicators    | Parameters (MB) | Speed (ms)
Algorithm of this paper | 2.46            | 14.7
ResNet18                | 44.59           | -
ResNet34                | 85.15           | -
Chen Rui [13]           | 4.24            | 15.9
5 Conclusions
The improved face alignment algorithm based on a cascaded convolutional neural network in this paper realizes the process from coarse to precise positioning through a 3-stage network structure. The improvement of windows accelerates network training and improves the positioning accuracy of the key points. The proposed shake factor can be regarded as a data augmentation operation that improves the generalization ability of the model. Testing on the LFPW face dataset shows that the average test error of the algorithm reaches 1.03%, approximately 1.16% lower than that of algorithms of the same type; in addition, it takes only 14.7 milliseconds to process a facial image on the GPU, 1.2 ms faster than Chen Rui's [13], realizing a small-scale and efficient facial landmark localization network. The algorithm shows good anti-interference ability against poses and expressions. Future work can investigate data augmentation further and add detection technology to give a more accurate initial face frame.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China ([2018] 61741124) and the Guizhou Science and Technology Plan Project ([2018] 5781).
References
[1] G. Q. Qi, J. M. Yao, H. L. Hu et al., "Face landmark detection network with multi-scale feature map fusion," Application Research of Computers, vol. 1-6, pp. 1001–3695, 2019.
[2] J. Xu, W. J. Tian, and Y. Y. Fan, "Simulation of face key point recognition and location method based on deep learning," Computer Simulation, vol. 37, no. 06, pp. 434–438, 2020.
[3] J. Li, The Research and Implement of Face Detection and Face Alignment in the Wild, NanJing University, Nanjing, China, 2019.
[4] X. F. Yang, Research on Facial Landmark Localization Based on Deep Learning, Xiamen University, Xiamen, China, 2019.
[5] C. X. Jing, Research on Face Detection and Face Alignment Based on Deep Learning, China Jiliang University, Hangzhou, China, 2018.
[6] T. F. Cootes, C. J. Taylor, D. H. Cooper et al., "Active shape models - their training and application," Computer Vision and Image Understanding, vol. 61, no. 1, 1995.
[7] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," in Proceedings of the 5th European Conference on Computer Vision, Volume II, Springer, Berlin, Heidelberg, June 1998.
[8] X. Yu, C. Ma, Y. Hu et al., "New neural network method for solving nonsmooth pseudoconvex optimization problems," Computer Science, vol. 46, no. 11, pp. 228–234, 2019.
[9] A. Sengupta, Y. Ye, R. Wang et al., "Going deeper in spiking neural networks: VGG and residual architectures," Frontiers in Neuroscience, vol. 13, 2018.
[10] C. Szegedy, S. Ioffe, V. Vanhoucke et al., "Inception-v4, Inception-ResNet and the impact of residual connections on learning," 2016, https://arxiv.org/abs/1602.07261.
[11] K. He, X. Zhang, S. Ren et al., "Deep residual learning for image recognition," 2015, https://arxiv.org/abs/1512.03385.
[12] J. H. Xie and Y. S. Zhang, "CASPN: cascaded spatial pyramid network for face landmark localization," Application Research of Computers, vol. 1-6, 2020.
[13] R. Chen and D. Lin, "Face key point location based on cascaded convolutional neural network," Journal of SiChuan Technology University (Natural Edition), vol. 30, pp. 32–37, 2017.
[14] L. H. Xu, Z. Li, J. J. Jiang et al., "A high-precision and lightweight facial landmark detection algorithm," Laser & Optoelectronics Progress, vol. 1-12, 2020.
[15] J. D. Lin, X. Y. Wu, Y. Chai et al., "Structure optimization of convolutional neural networks: a survey," Journal of Automatica Sinica, vol. 46, no. 01, pp. 24–37, 2020.
[16] F. Y. Hu, L. Y. Li, X. R. Shang et al., "A review of object detection algorithms based on convolutional neural networks," Journal of SuZhou University of Science and Technology (Natural Science Edition), vol. 37, no. 02, pp. 1–10+25, 2020.

Figure 8: Effect picture of facial landmark localization.

[17] S. Wu, Research on Occlusion and Pose Robust Facial Landmark Localization, Zhejiang University, Hangzhou, China, 2019.
[18] Y. H. Wu, W. Z. Cha, J. J. Ge et al., "Adaptive task sorting model based on improved linear regression method," Journal of Huazhong University of Science and Technology (Natural Science Edition), vol. 48, no. 01, pp. 93–97+107, 2020.
[19] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," Journal of Machine Learning Research, vol. 9, pp. 249–256, 2010.
[20] S. Park, S.-B. Lee, and J. Park, "Data augmentation method for improving the accuracy of human pose estimation with cropped images," Pattern Recognition Letters, vol. 136, 2020.
[21] Y. Z. Zhou, X. Y. Cha, and J. Lan, "Power system transient stability prediction based on data augmentation and deep residual network," Journal of China Electric Power, vol. 53, no. 01, pp. 22–31, 2020.
[22] Y. Sun, X. Wang, and X. Tang, "Deep convolutional network cascade for facial point detection," Computer Vision and Pattern Recognition, vol. 9, no. 4, pp. 3476–3483, 2013.
Advances in Multimedia 9
Le1
Le1
Rm1
Rm1
LE21
Le2Average
Rm2
Average
Level 2
middotmiddotmiddot middotmiddotmiddot middotmiddotmiddot
LE22
RM21
RM22
Window 2
Shake 2
Window 1
Shake 1
Window 1
Shake 1
Window 2
Shake 2
Input Output
Figure 2 Schematic diagram of the level 2 network structure
15
15
3
4
4
12
12
20
Input
Convolution
2
2
6
6
20
Max pooling
33
4
4
40
2
2
Convolution
2
240
Max pooling
Full connection60
2Full connection
Figure 3 Network structure of level2-le21 CNN
Input
Output
Precise localization
Roughlocalization
Localization
First stage
Second stage
Third stage
(a) (b)
(d) (c)
(e) (f)
2
1
Figure 1 CCNN algorithm framework
4 Advances in Multimedia
LE21 CNN combined with the left eye coordinate valueoutput by the LE22 CNN and finally we weighted averagethe left eye coordinate value of level 2 in the manner offormula (4)
323 Proposing the Shake Mechanism and Windowse entire algorithm flow is carried out recursively in stagesAs shown in Figure 1 the original red frame of the inputpicture in the first stage can be regarded as a larger windowand each small green frame of the input picture in the secondstage is a smaller window Since the positioning method ofthe algorithm in this paper uses the relative positioning ofwindows the more accurate the windows the more accuratethe model positioning the smaller the windows the fasterthe model positioning speed To avoid the immutability ofwindows the shake mechanism is proposed to slightly shakewindows that is translation (which has been explained indetail in the second stage of Section 312) it can also beregarded as a method of data augmentation thereby im-proving the generalization ability of the model
324 Model Parameters e network structure parametersof the first stage are shown in Table 2 where the fullyconnected layers of F1 EN1 and NM1 are 10d-fc2 6d-fc2and 6d-fc2 respectively e network structure of the thirdstage is the same as that of the second stage e totalnumber of network parameters is only 643920 which isapproximately 246MB
33 CCNN Model Training e model training method inthis paper adopts the idea of parallel within groups and serialbetween groups For each lightweight CNN a rectified linearunit (ReLU) is used after the convolutional layer to increasethe nonlinear expression ability of the model max-poolinglayers are used to downsample the feature map of theconvolutional layers and the fully connected layers usedropout technology to improve the robustness of the model(the value is 025) e initial learning rate is 5eminus 3 andevery 100 iterations the learning rate decays to 90 of theoriginal (the other hyperparameters are selected accordingto the actual situation such as batch size and iterationtimes) e stochastic gradient descent (SGD) algorithm isused to update the weights of all connections betweenneurons during network training To additionally increasethe robustness of the model L2 regularization is introducedto penalize the learnable parameter W and then the finalexpression of the newly designed loss function is as follows
E 1m
1113944
m
j1
12N
1113944
N
i1y
ji minus d
ji
2
2⎛⎝ ⎞⎠ +
12ηW
22 (6)
In formula (6) m is the batch size and W is the weightmatrix in the network is weight matrix W is updatedduring the process of error back propagation Before thenetwork starts training the Xavier [19] method is used toinitialize the weight matrix W0 e weight matrix Wt + 1updated after t+ 1 iterations can be expressed as follows
Wt+1 Wt minus λ middotzE
zWt
(7)
4 Experiments and Results
e operating system used in the experiment is Centos764 bit a Dell server equipped with 2 RTX2080TI graphicscards and 32GB memory and the code running environ-ment is the PyTorch framework
41Dataset ese experiments use the facial dataset LFPWa total of 13466 facial images Each facial image has 3channels with RGB and has relevant coordinate annotationinformation (including key point coordinates and initial faceframe coordinates) e original data use 10000 images asthe training set and 3466 images as the test set Dataaugmentation [20 21] can be used to increase the number oftraining samples to effectively improve the performance ofthe convolutional neural network To be consistent with theactual facts we set the left 15-degree and right 12-degreecenter rotation (the label coordinates should also be rotated)and the rotation method is given in formula (8) where θ isthe rotation angle
xprime
yprime
1
⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣
⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦
cos θ minussin θ 0
sin θ cos θ 0
0 0 1
⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣
⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦
x
y
1
⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣
⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦ (8)
en the training dataset is flipped horizontally (thelabel coordinates should also be flipped accordingly) and60000 training data are finally obtained Each training datarequires data standardization (minus the mean and dividingthe variance) In addition for each training data it is neededto eliminate the impact of the length and width of the faceframe (windows at each stage) e elimination method isshown as follows
x directionxprime minus x
11138681113868111386811138681113868111386811138681113868
w
y directionyprime minus y
11138681113868111386811138681113868111386811138681113868
h
⎧⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎩
(9)
In formula (9) x and y are the horizontal and verticalcoordinates of the upper left corner of the windows xprime and yprimeare the horizontal and vertical coordinates of the key pointspredicted by the current windows and w and h are the widthand height of the windows
42 Experiment Results
Experiment 1 Verifying the necessity of the 3-stage cas-caded structure
As shown in Figure 4 the vertical axis is the test errorthe horizontal axis is the facial key points and the blue redand green broken lines are the test errors of the first stagethe second stage and the third stage respectively
Advances in Multimedia 5
e test error of the second stage exceeded less than thatof the first stage e test error of the third stage also de-creased compared with the second stage the average testerror decreased from 221 to 137 and then to 103 Tobalance the positioning accuracy and positioning speedcomprehensively this article does not cascade the fourthstage of CNNs
Experiment 2 Verifying the necessity of under windowsFigure 5 shows the impact of the presence or absence of
window frames on the test error of each key point in the firststage e vertical axis is the test error the horizontal axis isthe facial key points and the blue broken line is windows-which uses absolute positioning the red broken line indi-cates the relative positioning of windows+
e relative positioning method is more stable and thetest error at each key point is lower than that of the absolutepositioning method e average test error decreased from338 to 235 Because the first stage uses global key pointcoarse positioning of the first stage the test errors of the blueand red broken lines are both high
Experiment 3 Verifying the necessity of the under shakemechanism
Figure 6 shows the impact of the presence or absence ofthe shake factor on the test error of each key point in the firststage
e vertical axis is the test error and the horizontal axisis the facial key points e blue broken line is shake- whichmeans no data augmentation operation the red broken lineis shake+ which denotes data augmentation on the partialimage e test error of the red broken line after dataaugmentation is lower than that of the blue broken linewhich shows that the shake mechanism has improved thegeneralization ability of the model e average test errordropped from 235 to 190 Because the first stage usesglobal key point coarse positioning the test errors of the blueand red broken lines are both high
Experiment 4 Comparison with other algorithms of thesame type in LE error RE error N error LM error RMerror and average test error
Figure 7 shows the test error of each key point of eachalgorithm e vertical axis is the test error and the hori-zontal axis is the facial key points
e blue and red broken lines are the algorithms ofSun [22] and Chen Rui [13] respectively and the green
broken line is the algorithm in this paper it can be seenthat the algorithm in this paper is lower than any otheralgorithm of the same type in LE RE N LM and averagetest error and the average test error reaches 103 etest error at the key points of RM is slightly higher thanthat of Chen Ruirsquos [13] algorithm which indirectly
Table 2 Network structure table of the first stage
Name Input Convolution kernelssteps Output Parametersconv1 39lowast39lowast3 4lowast4lowast 201 36lowast3 6lowast 20 960pool1 36lowast36lowast20 2lowast 22 18lowast18lowast20 -conv2 18lowast18lowast20 3lowast3lowast 401 16lowast16lowast40 7200pool2 16lowast16lowast40 2lowast 22 8lowast8lowast40 -conv3 8lowast8lowast40 3lowast3lowast 601 6lowast6lowast60 21600pool3 6lowast6lowast60 2lowast 22 3lowast3lowast60 -120d-fc1 6480010d-fc2 (F1) 6d-fc2 (EN1) 6d-fc2 (NM1)
RE N LM RM Averagetest error
LE
Facial key points
Test error at different stage of each key point
075
100
125
150
175
200
225
250
275
Test
erro
r
158
188
254272
231221
137144
133125
198
148
113107
078 084 073103
Third_stageSecond_stageFirst_stage
Figure 4 Test error graph of key points in different stages
RE N LM RM Averagetest error
LE
Facial key points
Test error of key points under windows or not
20
25
30
35
40
Test
erro
r
373
312
394
293319
338
235244
277261
203188
Our windows+Our windowsndash
Figure 5 Test error graph of key points under windows or not
6 Advances in Multimedia
indicates that each network needs separate hyper-parameters to train
In addition the learnable parameters in this algorithmare only 643920 approximately 246MB which is approx-imately 58 of the parameters of Chen Rui [13] and 56 ofthe ResNet18 It only requires 147milliseconds to process afacial image on the GPU As shown in Table 3 it greatlyreduces the consumption of hardware resources and can beused as an embedded algorithm for cameras or apps used inmobile devices such as mobile phones
Figure 8 shows the effect picture of the facial land-mark localization of the algorithm in this paper e firstline of facial images is output 1 of the first stage of CNNand so on it can be seen that the output of the first CNNstage (global key points coarse positioning stage) hasflaws and the positioning accuracy of the second CNNstage (local key points positioning stage) has improved
significantly e positioning accuracy of the third CNNstage (accurate positioning of local key points) alsoimproved and was no longer distinguishable by thenaked eye
Under the influence of distorted expressions and diversepostures the algorithm still achieves accurate positioning Ifwe face tasks with very high real-time requirements thenonly the first 2 layers of cascaded convolutional neuralnetworks are sufficient
188203 193
217 213
19
235244
277261
146137
RE N LM RM Averagetest error
LE
Facial key points
Test error of key points under shake or not
14
16
18
20
22
24
26
28
Test
erro
r
Our shake+Our shakendash
Figure 6 Test error graph of key points under shake or not
204194
21318
099 101
078 084
148
126133
073
103
127127
251234
219
RE N LM RM Averagetest error
LE
Facial key points
Test error of each key point of each algorithm
075
100
125
150
175
200
225
250
Test
erro
r
OursChen RuiSun
Figure 7 Test error graph of key points of each algorithm
Table 3 Parameters and speed of each algorithm
Methodsindicators Parameters (MB) Speed (ms)Algorithm of this paper 246 147ResNet18 4459 -ResNet34 8515 -Chen Rui [13] 424 159
Advances in Multimedia 7
5 Conclusions
e improved face alignment algorithm based on a cascadedconvolutional neural network in this paper realizes theprocess from coarse positioning to precise positioning bydesigning a 3-stage network structure e improvement ofwindows accelerates the training of the network and im-proved the positioning accuracy rate of the key points eproposed shake factor can be regarded as a data augmen-tation operation which improves the generalization abilityof the model e test on the LFPW face dataset shows thatthe average test error of the algorithm in this paper reaches103 which is approximately 116 lower than that of thesame type of algorithms in addition it only takes 147milliseconds to process a facial image on the GPU which is12 faster than Chen Ruirsquos [13] in milliseconds and a small-scale and efficient facial landmark localization network isrealized e algorithm shows good antiinterference abilitywith gestures and expressions Future work can performfurther research on data augmentation and add detectiontechnology to give a more accurate initial face frame
Data Availability
e data used to support the findings of the study areavailable from the corresponding author upon request
Conflicts of Interest
e authors declare that they have no conflicts of interest
Acknowledgments
is work was supported by the National Natural ScienceFoundation of China ([2018] 61741124) and Guizhou Sci-ence and Technology Plan Project ([2018] 5781)
References
[1] G Q Qi J M Yao H L Hu et al ldquoFace landmark detectionnetwork with multi-scale feature map fusionrdquo ApplicationResearch of Computers vol 1-6 pp 1001ndash3695 2019
[2] J Xu W J Tian and Y Y Fan ldquoSimulation of face key pointrecognition and location method based on deep learningrdquoComputer Simulation vol 37 no 06 pp 434ndash438 2020
[3] J Li e Research and Implement of Face Detection and FaceAlignment in the Wild NanJing University Nanjing China2019
[4] X F Yang Research on Facial Landmark Localization Basedon Deep Learning Xiamen University Xiamen China 2019
[5] C X Jing Research on Face Detection and Face AlignmentBased on Deep Learning China Jiliang University HangzhouChina 2018
[6] T F Cootes C J Taylor D H Cooper et al ldquoActive shapemodels-their training and applicationrdquo Computer Vision andImage Understanding vol 61 no 1 1995
[7] T F Cootes G J Edwards and C J Taylor ldquoActive ap-pearance modelsrdquo in Proceedings of the 5th European Con-ference on Computer Vision-Volume II-Volume II SpringerBerlin Heidelberg June 1998
[8] X Yu C Ma Y Hu et al ldquoNew neural network method forsolving nonsmooth pseudoconvex optimization problemsrdquoComputer Science vol 46 no 11 pp 228ndash234 2019
[9] A Sengupta Y Ye R Wang et al ldquoGoing deeper in spikingneural networks VGG and residual rrchitecturesrdquo Frontiersin Neuroence vol 13 2018
[10] C Szegedy S Ioffe V Vanhoucke et al ldquoInception-v4 in-ception-resNet and the impact of residual connections onlearningrdquo 2016 httpsarxivorgabs160207261
[11] K He X Zhang S Ren et al ldquoDeep residual learning forimage recognitionrdquo 2015 httpsarxivorgabs151203385
[12] J H Xie and Y S Zhang ldquoCaspn cascaded spatial pyramidnetwork for face landmark localizationrdquo Application Researchof Computers vol 1-6 2020
[13] R Chen and D Lin ldquoFace key point location based on cas-caded convolutional neural networkrdquo Journal of SiChuanTechnology University (natural Edition) vol 30 pp 32ndash372017
[14] L H Xu Z Li J J Jiang et al ldquoA high- precision andlightweight facial landmark detection algorithmrdquo Laser ampOptoelectronics Progress vol 1-12 2020
[15] J D Lin X Y Wu Y Chai et al ldquoStructure optimization ofconvolutional neural networksa surveyrdquo Journal of Auto-matica Sinica vol 46 no 01 pp 24ndash37 2020
[16] F Y Hu L Y Li X R Shang et al ldquoA review of objectdetection algorithms based on convolutional neural
Figure 8 Effect picture of facial landmark localization
8 Advances in Multimedia
networksrdquo Journal of SuZhou University of Science andTechnology (Natural Science Edition) vol 37 no 02pp 1ndash10+25 2020
[17] S Wu Research on Occlusion and Pose Robust Facial Land-mark Localization Zhejiang University Hangzhou China2019
[18] Y H Wu W Z Cha J J Ge et al ldquoAdaptive task sortingmodel based on improved linear regression methodrdquo Journalof Huazhong University of Science and Technology (NaturalScience Edition) vol 48 no 01 pp 93ndash97+107 2020
[19] X Glorot and Y Bengio ldquoUnderstanding the difficulty oftraining deep feedforward neural networksrdquo Journal of Ma-chine Learning Research vol 9 pp 249ndash256 2010
[20] S Park S-B Lee and J Park ldquoData augmentation method forimproving the accuracy of human pose estimation withcropped imagesrdquo Pattern Recognition Letters vol 136 2020
[21] Y Z Zhou X Y Cha and J Lan ldquoPower system transientstability prediction based on data augmentation and deepresidual networkrdquo Journal of China Electric Power vol 53no 01 pp 22ndash31 2020
[22] Y Sun X Wang and X Tang ldquoDeep convolutional networkcascade for facial point detectionrdquo Computer Vision andPattern Recognition vol 9 no 4 pp 3476ndash3483 2013
Advances in Multimedia 9
LE21 CNN combined with the left eye coordinate valueoutput by the LE22 CNN and finally we weighted averagethe left eye coordinate value of level 2 in the manner offormula (4)
323 Proposing the Shake Mechanism and Windowse entire algorithm flow is carried out recursively in stagesAs shown in Figure 1 the original red frame of the inputpicture in the first stage can be regarded as a larger windowand each small green frame of the input picture in the secondstage is a smaller window Since the positioning method ofthe algorithm in this paper uses the relative positioning ofwindows the more accurate the windows the more accuratethe model positioning the smaller the windows the fasterthe model positioning speed To avoid the immutability ofwindows the shake mechanism is proposed to slightly shakewindows that is translation (which has been explained indetail in the second stage of Section 312) it can also beregarded as a method of data augmentation thereby im-proving the generalization ability of the model
3.2.4. Model Parameters. The network structure parameters of the first stage are shown in Table 2, where the fully connected layers of F1, EN1, and NM1 are 10d-fc2, 6d-fc2, and 6d-fc2, respectively. The network structure of the third stage is the same as that of the second stage. The total number of network parameters is only 643,920, which is approximately 2.46 MB.
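The size figure can be sanity-checked with a little arithmetic (a verification sketch, not the paper's code): a k×k convolution from c_in to c_out channels has k·k·c_in·c_out weights, and 643,920 float32 parameters at 4 bytes each occupy roughly 2.46 MB.

```python
def conv_weights(k, c_in, c_out):
    # weight count of a k x k convolution layer, ignoring biases
    return k * k * c_in * c_out

# first-stage layers from Table 2
print(conv_weights(4, 3, 20))   # conv1 → 960
print(conv_weights(3, 20, 40))  # conv2 → 7200
print(conv_weights(3, 40, 60))  # conv3 → 21600

# total parameter count quoted above, at 4 bytes per float32 parameter
size_mb = 643_920 * 4 / 1024 ** 2
print(round(size_mb, 2))        # → 2.46
```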
3.3. CCNN Model Training. The model training method in this paper adopts the idea of parallel within groups and serial between groups. For each lightweight CNN, a rectified linear unit (ReLU) is used after the convolutional layer to increase the nonlinear expression ability of the model; max-pooling layers are used to downsample the feature maps of the convolutional layers; and the fully connected layers use dropout technology to improve the robustness of the model (the dropout rate is 0.25). The initial learning rate is 5e−3, and every 100 iterations the learning rate decays to 90% of its value (the other hyperparameters, such as batch size and number of iterations, are selected according to the actual situation). The stochastic gradient descent (SGD) algorithm is used to update the weights of all connections between neurons during network training. To additionally increase the robustness of the model, L2 regularization is introduced to penalize the learnable parameters W, and then the final expression of the newly designed loss function is as follows:
E = \frac{1}{m}\sum_{j=1}^{m}\left(\frac{1}{2N}\sum_{i=1}^{N}\left\|y_{i}^{j}-d_{i}^{j}\right\|_{2}^{2}\right)+\frac{1}{2}\eta\left\|W\right\|_{2}^{2}.   (6)
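The loss of formula (6) can be sketched in NumPy (a stand-in for the actual training code; the array shapes below are assumptions): a batch mean of per-sample squared key-point errors plus an L2 penalty on the weights.

```python
import numpy as np

def ccnn_loss(y_pred, y_true, W, eta=1e-4):
    """Formula (6): batch mean of the per-sample squared key-point error,
    plus an L2 penalty on the weight matrix W.
    y_pred, y_true: (m, N, 2) arrays of N predicted/ground-truth key points."""
    m, N = y_pred.shape[0], y_pred.shape[1]
    per_sample = np.sum((y_pred - y_true) ** 2, axis=(1, 2)) / (2 * N)
    return per_sample.mean() + 0.5 * eta * np.sum(W ** 2)
```

With a perfect prediction and zero weights the loss is exactly zero, and the η term grows with the magnitude of W, which is the regularization effect described above.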
In formula (6), m is the batch size and W is the weight matrix in the network. This weight matrix W is updated during the process of error back propagation. Before the network starts training, the Xavier [19] method is used to initialize the weight matrix W_0. The weight matrix W_{t+1} obtained after t+1 iterations can be expressed as follows:
W_{t+1} = W_{t}-\lambda\cdot\frac{\partial E}{\partial W_{t}}.   (7)
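The update rule of formula (7), together with the learning-rate schedule stated above (initial rate 5e−3, decaying to 90% every 100 iterations), can be illustrated in plain Python (a simplified stand-in for the framework's optimizer):

```python
def sgd_step(weights, grads, lr):
    # formula (7): W_{t+1} = W_t - lr * dE/dW_t, element-wise
    return [w - lr * g for w, g in zip(weights, grads)]

def decayed_lr(base_lr, iteration, decay=0.9, every=100):
    # the learning rate drops to 90% of its value every 100 iterations
    return base_lr * decay ** (iteration // every)

w = sgd_step([1.0], [0.5], lr=decayed_lr(5e-3, 0))
```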
4. Experiments and Results
The operating system used in the experiments is CentOS 7 (64-bit) on a Dell server equipped with 2 RTX 2080 Ti graphics cards and 32 GB of memory, and the code runs under the PyTorch framework.
4.1. Dataset. These experiments use the LFPW facial dataset, a total of 13,466 facial images. Each facial image has 3 RGB channels and relevant coordinate annotation information (including key point coordinates and initial face frame coordinates). The original data use 10,000 images as the training set and 3,466 images as the test set. Data augmentation [20, 21] can be used to increase the number of training samples and thereby effectively improve the performance of the convolutional neural network. To be consistent with actual conditions, we apply center rotations of 15 degrees to the left and 12 degrees to the right (the label coordinates are rotated accordingly); the rotation method is given in formula (8), where θ is the rotation angle:
\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}.   (8)
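The label rotation of formula (8) amounts to multiplying each key-point coordinate by a 2-D rotation matrix. A minimal sketch (rotation about the origin; in practice the coordinates would first be shifted so the rotation is about the image center, as the text describes):

```python
import math

def rotate_point(x, y, theta_deg):
    """Apply formula (8): rotate (x, y) by theta degrees about the origin."""
    t = math.radians(theta_deg)
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t))
```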
Then the training dataset is flipped horizontally (the label coordinates are also flipped accordingly), and 60,000 training samples are finally obtained. Each training sample requires data standardization (subtracting the mean and dividing by the variance). In addition, for each training sample, the influence of the length and width of the face frame (the windows at each stage) must be eliminated. The elimination method is shown as follows:
\begin{cases} x\ \text{direction:}\ \dfrac{\left|x'-x\right|}{w}, \\ y\ \text{direction:}\ \dfrac{\left|y'-y\right|}{h}. \end{cases}   (9)
In formula (9), x and y are the horizontal and vertical coordinates of the upper left corner of the window, x' and y' are the horizontal and vertical coordinates of the key point predicted within the current window, and w and h are the width and height of the window.
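Formula (9) in code form (a hypothetical helper, with the window given by its top-left corner and size as in the definitions above):

```python
def normalize_to_window(xp, yp, x, y, w, h):
    """Formula (9): express predicted point (xp, yp) relative to a window with
    top-left corner (x, y), width w, and height h, removing the effect of
    window size."""
    return abs(xp - x) / w, abs(yp - y) / h
```

A point at the exact center of a window always maps to (0.5, 0.5), regardless of how large the window is, which is what makes the relative representation scale-invariant.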
4.2. Experiment Results
Experiment 1. Verifying the necessity of the 3-stage cascaded structure.
As shown in Figure 4, the vertical axis is the test error, the horizontal axis is the facial key points, and the blue, red, and green broken lines are the test errors of the first stage, the second stage, and the third stage, respectively.
The test error of the second stage decreased compared with that of the first stage, and the test error of the third stage decreased again compared with the second stage; the average test error decreased from 2.21% to 1.37% and then to 1.03%. To comprehensively balance positioning accuracy and positioning speed, this article does not cascade a fourth stage of CNNs.
Experiment 2. Verifying the necessity of windows.
Figure 5 shows the impact of the presence or absence of window frames on the test error of each key point in the first stage. The vertical axis is the test error and the horizontal axis is the facial key points; the blue broken line (windows−) uses absolute positioning, and the red broken line (windows+) indicates the relative positioning of windows.
The relative positioning method is more stable, and its test error at each key point is lower than that of the absolute positioning method. The average test error decreased from 3.38% to 2.35%. Because the first stage performs global coarse positioning of the key points, the test errors of the blue and red broken lines are both high.
Experiment 3. Verifying the necessity of the shake mechanism.
Figure 6 shows the impact of the presence or absence of the shake factor on the test error of each key point in the first stage.
The vertical axis is the test error and the horizontal axis is the facial key points. The blue broken line (shake−) means no data augmentation operation; the red broken line (shake+) denotes data augmentation on the partial images. The test error of the red broken line after data augmentation is lower than that of the blue broken line, which shows that the shake mechanism improves the generalization ability of the model. The average test error dropped from 2.35% to 1.90%. Because the first stage performs global coarse positioning of the key points, the test errors of the blue and red broken lines are both high.
Experiment 4. Comparison with other algorithms of the same type in LE error, RE error, N error, LM error, RM error, and average test error.
Figure 7 shows the test error of each key point for each algorithm. The vertical axis is the test error and the horizontal axis is the facial key points.
The blue and red broken lines are the algorithms of Sun [22] and Chen Rui [13], respectively, and the green broken line is the algorithm in this paper. It can be seen that the algorithm in this paper is lower than any other algorithm of the same type in LE, RE, N, LM, and average test error, and the average test error reaches 1.03%. The test error at the RM key point is slightly higher than that of Chen Rui's [13] algorithm, which indirectly
Table 2: Network structure table of the first stage.

Name       Input       Convolution kernels/steps   Output      Parameters
conv1      39×39×3     4×4×20 / 1                  36×36×20    960
pool1      36×36×20    2×2 / 2                     18×18×20    -
conv2      18×18×20    3×3×40 / 1                  16×16×40    7200
pool2      16×16×40    2×2 / 2                     8×8×40      -
conv3      8×8×40      3×3×60 / 1                  6×6×60      21600
pool3      6×6×60      2×2 / 2                     3×3×60      -
120d-fc1   -           -                           -           64800
10d-fc2 (F1), 6d-fc2 (EN1), 6d-fc2 (NM1)
[Figure 4: Test error graph of key points in different stages. Vertical axis: test error (%); horizontal axis: facial key points (LE, RE, N, LM, RM, average test error); lines: first, second, and third stages.]
[Figure 5: Test error graph of key points under windows or not. Vertical axis: test error (%); horizontal axis: facial key points (LE, RE, N, LM, RM, average test error); lines: our windows+ and our windows−.]
indicates that each network needs separate hyperparameters for training.
In addition, the learnable parameters in this algorithm number only 643,920, approximately 2.46 MB, which is approximately 58% of the parameters of Chen Rui [13] and 5.6% of ResNet18. It requires only 1.47 milliseconds to process a facial image on the GPU. As shown in Table 3, the algorithm greatly reduces the consumption of hardware resources and can be embedded in cameras or in apps on mobile devices such as mobile phones.
Figure 8 shows the facial landmark localization results of the algorithm in this paper. The first row of facial images is the output of the first CNN stage, and so on. It can be seen that the output of the first CNN stage (global coarse positioning of key points) has flaws, the positioning accuracy of the second CNN stage (local positioning of key points) improves significantly, and the positioning accuracy of the third CNN stage (accurate positioning of local key points) improves again, to the point that the remaining error is no longer distinguishable by the naked eye.
Under the influence of distorted expressions and diverse postures, the algorithm still achieves accurate positioning. For tasks with very high real-time requirements, the first 2 stages of the cascaded convolutional neural networks are sufficient.
[Figure 6: Test error graph of key points under shake or not. Vertical axis: test error (%); horizontal axis: facial key points (LE, RE, N, LM, RM, average test error); lines: our shake+ and our shake−.]
[Figure 7: Test error graph of key points of each algorithm. Vertical axis: test error (%); horizontal axis: facial key points (LE, RE, N, LM, RM, average test error); lines: ours, Chen Rui [13], and Sun [22].]
Table 3: Parameters and speed of each algorithm.

Methods/indicators        Parameters (MB)   Speed (ms)
Algorithm of this paper   2.46              1.47
ResNet18                  44.59             -
ResNet34                  85.15             -
Chen Rui [13]             4.24              1.59
5. Conclusions

The improved face alignment algorithm based on a cascaded convolutional neural network in this paper realizes the process from coarse positioning to precise positioning by designing a 3-stage network structure. The improvement of the windows accelerates the training of the network and improves the positioning accuracy of the key points. The proposed shake factor can be regarded as a data augmentation operation, which improves the generalization ability of the model. The test on the LFPW face dataset shows that the average test error of the algorithm in this paper reaches 1.03%, which is approximately 1.16% lower than that of algorithms of the same type; in addition, it takes only 1.47 milliseconds to process a facial image on the GPU, which is 0.12 ms faster than Chen Rui's [13], so a small-scale and efficient facial landmark localization network is realized. The algorithm shows good anti-interference ability against poses and expressions. Future work can perform further research on data augmentation and add detection technology to give a more accurate initial face frame.
Data Availability

The data used to support the findings of the study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China ([2018] 61741124) and the Guizhou Science and Technology Plan Project ([2018] 5781).
Figure 8: Effect picture of facial landmark localization.

References

[1] G. Q. Qi, J. M. Yao, H. L. Hu et al., "Face landmark detection network with multi-scale feature map fusion," Application Research of Computers, vol. 1-6, pp. 1001–3695, 2019.
[2] J. Xu, W. J. Tian, and Y. Y. Fan, "Simulation of face key point recognition and location method based on deep learning," Computer Simulation, vol. 37, no. 6, pp. 434–438, 2020.
[3] J. Li, The Research and Implement of Face Detection and Face Alignment in the Wild, NanJing University, Nanjing, China, 2019.
[4] X. F. Yang, Research on Facial Landmark Localization Based on Deep Learning, Xiamen University, Xiamen, China, 2019.
[5] C. X. Jing, Research on Face Detection and Face Alignment Based on Deep Learning, China Jiliang University, Hangzhou, China, 2018.
[6] T. F. Cootes, C. J. Taylor, D. H. Cooper et al., "Active shape models - their training and application," Computer Vision and Image Understanding, vol. 61, no. 1, 1995.
[7] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," in Proceedings of the 5th European Conference on Computer Vision, Volume II, Springer, Berlin, Heidelberg, June 1998.
[8] X. Yu, C. Ma, Y. Hu et al., "New neural network method for solving nonsmooth pseudoconvex optimization problems," Computer Science, vol. 46, no. 11, pp. 228–234, 2019.
[9] A. Sengupta, Y. Ye, R. Wang et al., "Going deeper in spiking neural networks: VGG and residual architectures," Frontiers in Neuroscience, vol. 13, 2018.
[10] C. Szegedy, S. Ioffe, V. Vanhoucke et al., "Inception-v4, Inception-ResNet and the impact of residual connections on learning," 2016, https://arxiv.org/abs/1602.07261.
[11] K. He, X. Zhang, S. Ren et al., "Deep residual learning for image recognition," 2015, https://arxiv.org/abs/1512.03385.
[12] J. H. Xie and Y. S. Zhang, "CASPN: cascaded spatial pyramid network for face landmark localization," Application Research of Computers, vol. 1-6, 2020.
[13] R. Chen and D. Lin, "Face key point location based on cascaded convolutional neural network," Journal of SiChuan Technology University (Natural Edition), vol. 30, pp. 32–37, 2017.
[14] L. H. Xu, Z. Li, J. J. Jiang et al., "A high-precision and lightweight facial landmark detection algorithm," Laser & Optoelectronics Progress, vol. 1-12, 2020.
[15] J. D. Lin, X. Y. Wu, Y. Chai et al., "Structure optimization of convolutional neural networks: a survey," Journal of Automatica Sinica, vol. 46, no. 1, pp. 24–37, 2020.
[16] F. Y. Hu, L. Y. Li, X. R. Shang et al., "A review of object detection algorithms based on convolutional neural networks," Journal of SuZhou University of Science and Technology (Natural Science Edition), vol. 37, no. 2, pp. 1–10+25, 2020.
[17] S. Wu, Research on Occlusion and Pose Robust Facial Landmark Localization, Zhejiang University, Hangzhou, China, 2019.
[18] Y. H. Wu, W. Z. Cha, J. J. Ge et al., "Adaptive task sorting model based on improved linear regression method," Journal of Huazhong University of Science and Technology (Natural Science Edition), vol. 48, no. 1, pp. 93–97+107, 2020.
[19] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," Journal of Machine Learning Research, vol. 9, pp. 249–256, 2010.
[20] S. Park, S.-B. Lee, and J. Park, "Data augmentation method for improving the accuracy of human pose estimation with cropped images," Pattern Recognition Letters, vol. 136, 2020.
[21] Y. Z. Zhou, X. Y. Cha, and J. Lan, "Power system transient stability prediction based on data augmentation and deep residual network," Journal of China Electric Power, vol. 53, no. 1, pp. 22–31, 2020.
[22] Y. Sun, X. Wang, and X. Tang, "Deep convolutional network cascade for facial point detection," Computer Vision and Pattern Recognition, vol. 9, no. 4, pp. 3476–3483, 2013.
e test error of the second stage exceeded less than thatof the first stage e test error of the third stage also de-creased compared with the second stage the average testerror decreased from 221 to 137 and then to 103 Tobalance the positioning accuracy and positioning speedcomprehensively this article does not cascade the fourthstage of CNNs
Experiment 2 Verifying the necessity of under windowsFigure 5 shows the impact of the presence or absence of
window frames on the test error of each key point in the firststage e vertical axis is the test error the horizontal axis isthe facial key points and the blue broken line is windows-which uses absolute positioning the red broken line indi-cates the relative positioning of windows+
e relative positioning method is more stable and thetest error at each key point is lower than that of the absolutepositioning method e average test error decreased from338 to 235 Because the first stage uses global key pointcoarse positioning of the first stage the test errors of the blueand red broken lines are both high
Experiment 3 Verifying the necessity of the under shakemechanism
Figure 6 shows the impact of the presence or absence ofthe shake factor on the test error of each key point in the firststage
e vertical axis is the test error and the horizontal axisis the facial key points e blue broken line is shake- whichmeans no data augmentation operation the red broken lineis shake+ which denotes data augmentation on the partialimage e test error of the red broken line after dataaugmentation is lower than that of the blue broken linewhich shows that the shake mechanism has improved thegeneralization ability of the model e average test errordropped from 235 to 190 Because the first stage usesglobal key point coarse positioning the test errors of the blueand red broken lines are both high
Experiment 4 Comparison with other algorithms of thesame type in LE error RE error N error LM error RMerror and average test error
Figure 7 shows the test error of each key point of eachalgorithm e vertical axis is the test error and the hori-zontal axis is the facial key points
e blue and red broken lines are the algorithms ofSun [22] and Chen Rui [13] respectively and the green
broken line is the algorithm in this paper it can be seenthat the algorithm in this paper is lower than any otheralgorithm of the same type in LE RE N LM and averagetest error and the average test error reaches 103 etest error at the key points of RM is slightly higher thanthat of Chen Ruirsquos [13] algorithm which indirectly
Table 2 Network structure table of the first stage
Name Input Convolution kernelssteps Output Parametersconv1 39lowast39lowast3 4lowast4lowast 201 36lowast3 6lowast 20 960pool1 36lowast36lowast20 2lowast 22 18lowast18lowast20 -conv2 18lowast18lowast20 3lowast3lowast 401 16lowast16lowast40 7200pool2 16lowast16lowast40 2lowast 22 8lowast8lowast40 -conv3 8lowast8lowast40 3lowast3lowast 601 6lowast6lowast60 21600pool3 6lowast6lowast60 2lowast 22 3lowast3lowast60 -120d-fc1 6480010d-fc2 (F1) 6d-fc2 (EN1) 6d-fc2 (NM1)
RE N LM RM Averagetest error
LE
Facial key points
Test error at different stage of each key point
075
100
125
150
175
200
225
250
275
Test
erro
r
158
188
254272
231221
137144
133125
198
148
113107
078 084 073103
Third_stageSecond_stageFirst_stage
Figure 4 Test error graph of key points in different stages
RE N LM RM Averagetest error
LE
Facial key points
Test error of key points under windows or not
20
25
30
35
40
Test
erro
r
373
312
394
293319
338
235244
277261
203188
Our windows+Our windowsndash
Figure 5 Test error graph of key points under windows or not
6 Advances in Multimedia
indicates that each network needs separate hyper-parameters to train
In addition the learnable parameters in this algorithmare only 643920 approximately 246MB which is approx-imately 58 of the parameters of Chen Rui [13] and 56 ofthe ResNet18 It only requires 147milliseconds to process afacial image on the GPU As shown in Table 3 it greatlyreduces the consumption of hardware resources and can beused as an embedded algorithm for cameras or apps used inmobile devices such as mobile phones
Figure 8 shows the effect picture of the facial land-mark localization of the algorithm in this paper e firstline of facial images is output 1 of the first stage of CNNand so on it can be seen that the output of the first CNNstage (global key points coarse positioning stage) hasflaws and the positioning accuracy of the second CNNstage (local key points positioning stage) has improved
significantly e positioning accuracy of the third CNNstage (accurate positioning of local key points) alsoimproved and was no longer distinguishable by thenaked eye
Under the influence of distorted expressions and diversepostures the algorithm still achieves accurate positioning Ifwe face tasks with very high real-time requirements thenonly the first 2 layers of cascaded convolutional neuralnetworks are sufficient
188203 193
217 213
19
235244
277261
146137
RE N LM RM Averagetest error
LE
Facial key points
Test error of key points under shake or not
14
16
18
20
22
24
26
28
Test
erro
r
Our shake+Our shakendash
Figure 6 Test error graph of key points under shake or not
204194
21318
099 101
078 084
148
126133
073
103
127127
251234
219
RE N LM RM Averagetest error
LE
Facial key points
Test error of each key point of each algorithm
075
100
125
150
175
200
225
250
Test
erro
r
OursChen RuiSun
Figure 7 Test error graph of key points of each algorithm
Table 3 Parameters and speed of each algorithm
Methodsindicators Parameters (MB) Speed (ms)Algorithm of this paper 246 147ResNet18 4459 -ResNet34 8515 -Chen Rui [13] 424 159
Advances in Multimedia 7
5 Conclusions
e improved face alignment algorithm based on a cascadedconvolutional neural network in this paper realizes theprocess from coarse positioning to precise positioning bydesigning a 3-stage network structure e improvement ofwindows accelerates the training of the network and im-proved the positioning accuracy rate of the key points eproposed shake factor can be regarded as a data augmen-tation operation which improves the generalization abilityof the model e test on the LFPW face dataset shows thatthe average test error of the algorithm in this paper reaches103 which is approximately 116 lower than that of thesame type of algorithms in addition it only takes 147milliseconds to process a facial image on the GPU which is12 faster than Chen Ruirsquos [13] in milliseconds and a small-scale and efficient facial landmark localization network isrealized e algorithm shows good antiinterference abilitywith gestures and expressions Future work can performfurther research on data augmentation and add detectiontechnology to give a more accurate initial face frame
Data Availability
e data used to support the findings of the study areavailable from the corresponding author upon request
Conflicts of Interest
e authors declare that they have no conflicts of interest
Acknowledgments
is work was supported by the National Natural ScienceFoundation of China ([2018] 61741124) and Guizhou Sci-ence and Technology Plan Project ([2018] 5781)
References
[1] G Q Qi J M Yao H L Hu et al ldquoFace landmark detectionnetwork with multi-scale feature map fusionrdquo ApplicationResearch of Computers vol 1-6 pp 1001ndash3695 2019
[2] J Xu W J Tian and Y Y Fan ldquoSimulation of face key pointrecognition and location method based on deep learningrdquoComputer Simulation vol 37 no 06 pp 434ndash438 2020
[3] J Li e Research and Implement of Face Detection and FaceAlignment in the Wild NanJing University Nanjing China2019
[4] X F Yang Research on Facial Landmark Localization Basedon Deep Learning Xiamen University Xiamen China 2019
[5] C X Jing Research on Face Detection and Face AlignmentBased on Deep Learning China Jiliang University HangzhouChina 2018
[6] T F Cootes C J Taylor D H Cooper et al ldquoActive shapemodels-their training and applicationrdquo Computer Vision andImage Understanding vol 61 no 1 1995
[7] T F Cootes G J Edwards and C J Taylor ldquoActive ap-pearance modelsrdquo in Proceedings of the 5th European Con-ference on Computer Vision-Volume II-Volume II SpringerBerlin Heidelberg June 1998
[8] X Yu C Ma Y Hu et al ldquoNew neural network method forsolving nonsmooth pseudoconvex optimization problemsrdquoComputer Science vol 46 no 11 pp 228ndash234 2019
[9] A Sengupta Y Ye R Wang et al ldquoGoing deeper in spikingneural networks VGG and residual rrchitecturesrdquo Frontiersin Neuroence vol 13 2018
[10] C Szegedy S Ioffe V Vanhoucke et al ldquoInception-v4 in-ception-resNet and the impact of residual connections onlearningrdquo 2016 httpsarxivorgabs160207261
[11] K He X Zhang S Ren et al ldquoDeep residual learning forimage recognitionrdquo 2015 httpsarxivorgabs151203385
[12] J H Xie and Y S Zhang ldquoCaspn cascaded spatial pyramidnetwork for face landmark localizationrdquo Application Researchof Computers vol 1-6 2020
[13] R Chen and D Lin ldquoFace key point location based on cas-caded convolutional neural networkrdquo Journal of SiChuanTechnology University (natural Edition) vol 30 pp 32ndash372017
[14] L H Xu Z Li J J Jiang et al ldquoA high- precision andlightweight facial landmark detection algorithmrdquo Laser ampOptoelectronics Progress vol 1-12 2020
[15] J D Lin X Y Wu Y Chai et al ldquoStructure optimization ofconvolutional neural networksa surveyrdquo Journal of Auto-matica Sinica vol 46 no 01 pp 24ndash37 2020
[16] F Y Hu L Y Li X R Shang et al ldquoA review of objectdetection algorithms based on convolutional neural
Figure 8 Effect picture of facial landmark localization
8 Advances in Multimedia
networksrdquo Journal of SuZhou University of Science andTechnology (Natural Science Edition) vol 37 no 02pp 1ndash10+25 2020
[17] S Wu Research on Occlusion and Pose Robust Facial Land-mark Localization Zhejiang University Hangzhou China2019
[18] Y H Wu W Z Cha J J Ge et al ldquoAdaptive task sortingmodel based on improved linear regression methodrdquo Journalof Huazhong University of Science and Technology (NaturalScience Edition) vol 48 no 01 pp 93ndash97+107 2020
[19] X Glorot and Y Bengio ldquoUnderstanding the difficulty oftraining deep feedforward neural networksrdquo Journal of Ma-chine Learning Research vol 9 pp 249ndash256 2010
[20] S Park S-B Lee and J Park ldquoData augmentation method forimproving the accuracy of human pose estimation withcropped imagesrdquo Pattern Recognition Letters vol 136 2020
[21] Y Z Zhou X Y Cha and J Lan ldquoPower system transientstability prediction based on data augmentation and deepresidual networkrdquo Journal of China Electric Power vol 53no 01 pp 22ndash31 2020
[22] Y Sun X Wang and X Tang ldquoDeep convolutional networkcascade for facial point detectionrdquo Computer Vision andPattern Recognition vol 9 no 4 pp 3476ndash3483 2013
Advances in Multimedia 9
indicates that each network needs separate hyper-parameters to train
In addition the learnable parameters in this algorithmare only 643920 approximately 246MB which is approx-imately 58 of the parameters of Chen Rui [13] and 56 ofthe ResNet18 It only requires 147milliseconds to process afacial image on the GPU As shown in Table 3 it greatlyreduces the consumption of hardware resources and can beused as an embedded algorithm for cameras or apps used inmobile devices such as mobile phones
Figure 8 shows the effect picture of the facial land-mark localization of the algorithm in this paper e firstline of facial images is output 1 of the first stage of CNNand so on it can be seen that the output of the first CNNstage (global key points coarse positioning stage) hasflaws and the positioning accuracy of the second CNNstage (local key points positioning stage) has improved
significantly e positioning accuracy of the third CNNstage (accurate positioning of local key points) alsoimproved and was no longer distinguishable by thenaked eye
Under the influence of distorted expressions and diversepostures the algorithm still achieves accurate positioning Ifwe face tasks with very high real-time requirements thenonly the first 2 layers of cascaded convolutional neuralnetworks are sufficient
188203 193
217 213
19
235244
277261
146137
RE N LM RM Averagetest error
LE
Facial key points
Test error of key points under shake or not
14
16
18
20
22
24
26
28
Test
erro
r
Our shake+Our shakendash
Figure 6 Test error graph of key points under shake or not
204194
21318
099 101
078 084
148
126133
073
103
127127
251234
219
RE N LM RM Averagetest error
LE
Facial key points
Test error of each key point of each algorithm
075
100
125
150
175
200
225
250
Test
erro
r
OursChen RuiSun
Figure 7 Test error graph of key points of each algorithm
Table 3 Parameters and speed of each algorithm
Methodsindicators Parameters (MB) Speed (ms)Algorithm of this paper 246 147ResNet18 4459 -ResNet34 8515 -Chen Rui [13] 424 159
Advances in Multimedia 7
5 Conclusions
e improved face alignment algorithm based on a cascadedconvolutional neural network in this paper realizes theprocess from coarse positioning to precise positioning bydesigning a 3-stage network structure e improvement ofwindows accelerates the training of the network and im-proved the positioning accuracy rate of the key points eproposed shake factor can be regarded as a data augmen-tation operation which improves the generalization abilityof the model e test on the LFPW face dataset shows thatthe average test error of the algorithm in this paper reaches103 which is approximately 116 lower than that of thesame type of algorithms in addition it only takes 147milliseconds to process a facial image on the GPU which is12 faster than Chen Ruirsquos [13] in milliseconds and a small-scale and efficient facial landmark localization network isrealized e algorithm shows good antiinterference abilitywith gestures and expressions Future work can performfurther research on data augmentation and add detectiontechnology to give a more accurate initial face frame
Data Availability
e data used to support the findings of the study areavailable from the corresponding author upon request
Conflicts of Interest
e authors declare that they have no conflicts of interest
Acknowledgments
is work was supported by the National Natural ScienceFoundation of China ([2018] 61741124) and Guizhou Sci-ence and Technology Plan Project ([2018] 5781)
References
[1] G Q Qi J M Yao H L Hu et al ldquoFace landmark detectionnetwork with multi-scale feature map fusionrdquo ApplicationResearch of Computers vol 1-6 pp 1001ndash3695 2019
[2] J Xu W J Tian and Y Y Fan ldquoSimulation of face key pointrecognition and location method based on deep learningrdquoComputer Simulation vol 37 no 06 pp 434ndash438 2020
[3] J Li e Research and Implement of Face Detection and FaceAlignment in the Wild NanJing University Nanjing China2019
[4] X F Yang Research on Facial Landmark Localization Basedon Deep Learning Xiamen University Xiamen China 2019
[5] C X Jing Research on Face Detection and Face AlignmentBased on Deep Learning China Jiliang University HangzhouChina 2018
[6] T F Cootes C J Taylor D H Cooper et al ldquoActive shapemodels-their training and applicationrdquo Computer Vision andImage Understanding vol 61 no 1 1995
[7] T F Cootes G J Edwards and C J Taylor ldquoActive ap-pearance modelsrdquo in Proceedings of the 5th European Con-ference on Computer Vision-Volume II-Volume II SpringerBerlin Heidelberg June 1998
[8] X Yu C Ma Y Hu et al ldquoNew neural network method forsolving nonsmooth pseudoconvex optimization problemsrdquoComputer Science vol 46 no 11 pp 228ndash234 2019
[9] A Sengupta Y Ye R Wang et al ldquoGoing deeper in spikingneural networks VGG and residual rrchitecturesrdquo Frontiersin Neuroence vol 13 2018
[10] C Szegedy S Ioffe V Vanhoucke et al ldquoInception-v4 in-ception-resNet and the impact of residual connections onlearningrdquo 2016 httpsarxivorgabs160207261
[11] K He X Zhang S Ren et al ldquoDeep residual learning forimage recognitionrdquo 2015 httpsarxivorgabs151203385
[12] J H Xie and Y S Zhang ldquoCaspn cascaded spatial pyramidnetwork for face landmark localizationrdquo Application Researchof Computers vol 1-6 2020
[13] R Chen and D Lin ldquoFace key point location based on cas-caded convolutional neural networkrdquo Journal of SiChuanTechnology University (natural Edition) vol 30 pp 32ndash372017
[14] L H Xu Z Li J J Jiang et al ldquoA high- precision andlightweight facial landmark detection algorithmrdquo Laser ampOptoelectronics Progress vol 1-12 2020
[15] J D Lin X Y Wu Y Chai et al ldquoStructure optimization ofconvolutional neural networksa surveyrdquo Journal of Auto-matica Sinica vol 46 no 01 pp 24ndash37 2020
[16] F Y Hu L Y Li X R Shang et al ldquoA review of objectdetection algorithms based on convolutional neural
Figure 8 Effect picture of facial landmark localization
8 Advances in Multimedia
5 Conclusions
The improved face alignment algorithm based on a cascaded convolutional neural network presented in this paper realizes the process from coarse positioning to precise positioning by designing a 3-stage network structure. The improvement of the cropping windows accelerates the training of the network and improves the positioning accuracy of the key points, and the proposed shake factor can be regarded as a data augmentation operation, which improves the generalization ability of the model. Tests on the LFPW face dataset show that the average test error of the proposed algorithm reaches 103, approximately 116 lower than that of algorithms of the same type; in addition, it takes only 147 milliseconds to process a facial image on the GPU, 12 milliseconds faster than the method of Chen and Lin [13], so a small-scale and efficient facial landmark localization network is realized. The algorithm also shows good anti-interference ability against pose and expression variations. Future work can further study data augmentation and add face detection technology to provide a more accurate initial face frame.
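The shake factor summarized above amounts to randomly perturbing the position of each crop window around a landmark anchor before the partial face image is cut out, so that the regression target varies between epochs. The following is a minimal sketch of that idea, not the authors' implementation; the function name `shake_crop` and its parameters are illustrative, and a smaller `shake` value would be used in the later, fine-tuning stages of the cascade.

```python
import numpy as np

def shake_crop(image, anchor, window, shake, rng=None):
    """Crop a square patch around a landmark anchor, randomly
    perturbing ("shaking") the window position by up to `shake` pixels.

    image:  H x W (or H x W x C) array
    anchor: (x, y) landmark coordinate in image pixels
    window: half-size of the crop window in pixels
    shake:  maximum perturbation in pixels (smaller in later stages)
    """
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    # Perturb the window centre; the regression target shifts accordingly.
    dx, dy = rng.integers(-shake, shake + 1, size=2)
    cx, cy = int(anchor[0]) + dx, int(anchor[1]) + dy
    # Clamp so the window stays entirely inside the image.
    cx = min(max(cx, window), w - window)
    cy = min(max(cy, window), h - window)
    patch = image[cy - window:cy + window, cx - window:cx + window]
    # Landmark coordinate relative to the patch: the training target.
    target = (anchor[0] - (cx - window), anchor[1] - (cy - window))
    return patch, target
```

Because the perturbation is applied to the window rather than the image, the same source image yields many distinct patch/target pairs, which is why the shake factor can be viewed as a data augmentation operation.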
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China ([2018] 61741124) and the Guizhou Science and Technology Plan Project ([2018] 5781).
References
[1] G. Q. Qi, J. M. Yao, H. L. Hu et al., "Face landmark detection network with multi-scale feature map fusion," Application Research of Computers, vol. 1-6, 2019.
[2] J. Xu, W. J. Tian, and Y. Y. Fan, "Simulation of face key point recognition and location method based on deep learning," Computer Simulation, vol. 37, no. 6, pp. 434–438, 2020.
[3] J. Li, The Research and Implement of Face Detection and Face Alignment in the Wild, NanJing University, Nanjing, China, 2019.
[4] X. F. Yang, Research on Facial Landmark Localization Based on Deep Learning, Xiamen University, Xiamen, China, 2019.
[5] C. X. Jing, Research on Face Detection and Face Alignment Based on Deep Learning, China Jiliang University, Hangzhou, China, 2018.
[6] T. F. Cootes, C. J. Taylor, D. H. Cooper et al., "Active shape models-their training and application," Computer Vision and Image Understanding, vol. 61, no. 1, 1995.
[7] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," in Proceedings of the 5th European Conference on Computer Vision, vol. II, Springer, Berlin, Heidelberg, June 1998.
[8] X. Yu, C. Ma, Y. Hu et al., "New neural network method for solving nonsmooth pseudoconvex optimization problems," Computer Science, vol. 46, no. 11, pp. 228–234, 2019.
[9] A. Sengupta, Y. Ye, R. Wang et al., "Going deeper in spiking neural networks: VGG and residual architectures," Frontiers in Neuroscience, vol. 13, 2018.
[10] C. Szegedy, S. Ioffe, V. Vanhoucke et al., "Inception-v4, Inception-ResNet and the impact of residual connections on learning," 2016, https://arxiv.org/abs/1602.07261.
[11] K. He, X. Zhang, S. Ren et al., "Deep residual learning for image recognition," 2015, https://arxiv.org/abs/1512.03385.
[12] J. H. Xie and Y. S. Zhang, "CASPN: cascaded spatial pyramid network for face landmark localization," Application Research of Computers, vol. 1-6, 2020.
[13] R. Chen and D. Lin, "Face key point location based on cascaded convolutional neural network," Journal of SiChuan Technology University (Natural Edition), vol. 30, pp. 32–37, 2017.
[14] L. H. Xu, Z. Li, J. J. Jiang et al., "A high-precision and lightweight facial landmark detection algorithm," Laser & Optoelectronics Progress, vol. 1-12, 2020.
[15] J. D. Lin, X. Y. Wu, Y. Chai et al., "Structure optimization of convolutional neural networks: a survey," Acta Automatica Sinica, vol. 46, no. 1, pp. 24–37, 2020.
[16] F. Y. Hu, L. Y. Li, X. R. Shang et al., "A review of object detection algorithms based on convolutional neural networks," Journal of SuZhou University of Science and Technology (Natural Science Edition), vol. 37, no. 2, pp. 1–10+25, 2020.
[17] S. Wu, Research on Occlusion and Pose Robust Facial Landmark Localization, Zhejiang University, Hangzhou, China, 2019.
[18] Y. H. Wu, W. Z. Cha, J. J. Ge et al., "Adaptive task sorting model based on improved linear regression method," Journal of Huazhong University of Science and Technology (Natural Science Edition), vol. 48, no. 1, pp. 93–97+107, 2020.
[19] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," Journal of Machine Learning Research, vol. 9, pp. 249–256, 2010.
[20] S. Park, S.-B. Lee, and J. Park, "Data augmentation method for improving the accuracy of human pose estimation with cropped images," Pattern Recognition Letters, vol. 136, 2020.
[21] Y. Z. Zhou, X. Y. Cha, and J. Lan, "Power system transient stability prediction based on data augmentation and deep residual network," Journal of China Electric Power, vol. 53, no. 1, pp. 22–31, 2020.
[22] Y. Sun, X. Wang, and X. Tang, "Deep convolutional network cascade for facial point detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3476–3483, 2013.