Parameterised Sigmoid and ReLU Hidden Activation Functions for DNN Acoustic Modelling
Chao Zhang & Phil Woodland
University of Cambridge
29 April 2015
Introduction
• The hidden activation function plays an important role in deep learning: pretraining (PT) & fine-tuning (FT), ReLU.
• Recent studies on learning parameterised activation functions resulted in improved performance.
• We study the parameterised forms of the Sigmoid (p-Sigmoid) and ReLU (p-ReLU) functions for SI DNN acoustic model training.
Parameterised Sigmoid Function
• The generalised form of Sigmoid, or the logistic function, is

$$f_i(a_i) = \eta_i \cdot \frac{1}{1 + e^{-\gamma_i a_i + \theta_i}}$$
• ηi, γi, and θi have different effects on fi(ai):
  ◦ ηi defines the boundaries of fi(ai). It Learns Hidden Unit i's (positive, zero, or negative) Contribution (LHUC);
  ◦ γi controls the steepness of the curve;
  ◦ θi applies a horizontal displacement to fi(ai).
• No bias term is added to fi(ai), since it works the same as the bias of the layer.
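A minimal NumPy sketch of the p-Sigmoid forward pass for one hidden layer (an illustrative re-implementation, not the authors' HTK code; array names and shapes are assumptions):

```python
import numpy as np

def p_sigmoid(a, eta, gamma, theta):
    """p-Sigmoid: f_i(a_i) = eta_i / (1 + exp(-gamma_i * a_i + theta_i)).

    a                : layer activations, shape (num_units,) or (batch, num_units)
    eta, gamma, theta: per-unit parameters, shape (num_units,)
    """
    return eta / (1.0 + np.exp(-gamma * a + theta))

# eta = 1, gamma = 1, theta = 0 recovers the standard Sigmoid.
a = np.array([-2.0, 0.0, 2.0])
print(p_sigmoid(a, np.ones(3), np.ones(3), np.zeros(3)))
```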
Parameterised Sigmoid Function
• By varying the parameters, p-Sigmoid(ηi, γi, θi) can piecewise-approximate other functions, e.g., the step, ReLU, and soft ReLU functions.
• It can also represent tanh, if the bias of the layer is taken into account.
Figure: Piecewise approximation by p-Sigmoid functions, plotted as f(a) against a for p-Sigmoid(1, 1, 0), p-Sigmoid(1, 30, 0), p-Sigmoid(4, 1, 2), p-Sigmoid(3, -2, 3), and p-Sigmoid(2, 2, 0) - 1.
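A quick numerical check of two of these approximations (a hedged sketch; the helper below just restates the p-Sigmoid formula):

```python
import numpy as np

def p_sigmoid(a, eta, gamma, theta):
    # eta / (1 + exp(-gamma * a + theta)), as defined earlier
    return eta / (1.0 + np.exp(-gamma * a + theta))

a = np.linspace(-5.0, 5.0, 101)

# A large gamma squashes the transition region, so p-Sigmoid(1, 30, 0) behaves like a step at a = 0.
step_like = p_sigmoid(a, 1.0, 30.0, 0.0)

# p-Sigmoid(2, 2, 0) - 1 equals tanh(a) exactly; the layer bias can absorb the constant -1 shift.
print(np.max(np.abs((p_sigmoid(a, 2.0, 2.0, 0.0) - 1.0) - np.tanh(a))))  # ~1e-16
```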
Parameterised ReLU Function
• Associate a scaling factor with each part of the function, to enable the two ends of the “hinge” to rotate separately around the “pin”.
$$f_i(a_i) = \begin{cases} \alpha_i \cdot a_i & \text{if } a_i > 0 \\ \beta_i \cdot a_i & \text{if } a_i \leq 0 \end{cases}$$
Figure: Illustration of the hinge-like shape of the p-ReLU function.
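A minimal NumPy sketch of the p-ReLU forward pass (illustrative only; the variable names are assumptions):

```python
import numpy as np

def p_relu(a, alpha, beta):
    """p-ReLU: alpha_i * a_i if a_i > 0, beta_i * a_i if a_i <= 0."""
    return np.where(a > 0, alpha * a, beta * a)

# alpha = 1, beta = 0 recovers the standard ReLU; beta = 0.25 gives a leaky negative slope.
a = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(p_relu(a, alpha=1.0, beta=0.25))
```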
EBP for Parameterised Activation Functions
• Assume
  ◦ F is the objective function;
  ◦ i, j are the output and input node numbers of a layer;
  ◦ ai, fi(·) are the activation value and activation function of node i;
  ◦ ϑi is a parameter of fi(·), and wji is an (extended) weight.
• According to the chain rule, we have

$$\frac{\partial F}{\partial \vartheta_i} = \frac{\partial f_i(a_i)}{\partial \vartheta_i} \sum_j \frac{\partial F}{\partial a_j}\, w_{ji}.$$
• Therefore, we need to compute
  ◦ ∂fi(ai)/∂ai for training weights & biases;
  ◦ ∂fi(ai)/∂ϑi for the activation function parameter ϑi.
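A sketch of this back-propagation step for a single frame in NumPy; the function name, array shapes, and the example choice ϑi = ηi (for p-Sigmoid(ηi, 1, 0)) are illustrative assumptions:

```python
import numpy as np

def grad_activation_param(dF_da_next, W_next, dfi_dvartheta):
    """dF/d(vartheta_i) = dfi(ai)/d(vartheta_i) * sum_j dF/da_j * w_ji.

    dF_da_next    : dF/da_j for the following layer, shape (num_next,)
    W_next        : (extended) weights w_ji of the following layer, shape (num_next, num_units)
    dfi_dvartheta : dfi(ai)/d(vartheta_i) per unit i, shape (num_units,)
    """
    back_sum = W_next.T @ dF_da_next      # sum_j dF/da_j * w_ji, one value per unit i
    return dfi_dvartheta * back_sum       # elementwise product over units i

# Example with vartheta_i = eta_i, where dfi(ai)/d(eta_i) = fi(ai) / eta_i (eta_i != 0).
f_out      = np.array([0.3, 0.7])         # hypothetical fi(ai) values for 2 hidden units
eta        = np.array([1.0, 2.0])
dF_da_next = np.array([0.1, -0.2, 0.05])  # hypothetical error signal for 3 following units
W_next     = np.ones((3, 2)) * 0.5
print(grad_activation_param(dF_da_next, W_next, f_out / eta))
```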
Experiments
• CE DNN-HMMs were trained on 72 hours of Mandarin CTS data.
• Three testing sets were used: dev04, eval03, and eval97.
  ◦ 42d CMLLR(HLDA(PLP 0 D A T Z) + Pitch D A) features.
  ◦ Context shift set c = [−4, −3, −2, −1, 0, +1, +2, +3, +4].
• 63k word dictionary and trigram LM trained using 1 billion words.
• DNN structure: 378 × 1000^5 × 6005.
• Improved NewBob learning rate scheduler:
  ◦ Sigmoid & p-Sigmoid: τ0 = 2.0 × 10^−3, Nmin = 12;
  ◦ ReLU & p-ReLU: τ0 = 5.0 × 10^−4, Nmin = 8.
• ηi, γi, θi, αi, and βi are initialised as 1.0, 1.0, 0.0, 1.0, and 0.25, respectively (sketched below).
• All GMM-HMM and DNN-HMM acoustic model training and decoding used HTK.
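A sketch of the per-unit initialisation listed above (the array layout and the layer size of 1000 units are assumptions for illustration):

```python
import numpy as np

num_units = 1000                     # hidden units per layer in the 378 x 1000^5 x 6005 DNN
eta   = np.full(num_units, 1.0)      # p-Sigmoid amplitude
gamma = np.full(num_units, 1.0)      # p-Sigmoid steepness
theta = np.full(num_units, 0.0)      # p-Sigmoid horizontal shift
alpha = np.full(num_units, 1.0)      # p-ReLU positive-side slope
beta  = np.full(num_units, 0.25)     # p-ReLU negative-side slope
```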
Experiments with p−Sigmoid
• Learning ηi, γi, and θi in PT & FT, and in FT only.
• Using combinations did not outperform using separate parameters.
ID   Activation Function        dev04
S0   Sigmoid                    27.9
S1+  p-Sigmoid(ηi, 1, 0)        27.6
S2+  p-Sigmoid(1, γi, 0)        27.7
S3+  p-Sigmoid(1, 1, θi)        27.7
S1   p-Sigmoid(ηi, 1, 0)        27.1
S2   p-Sigmoid(1, γi, 0)        27.5
S3   p-Sigmoid(1, 1, θi)        27.4
S6   p-Sigmoid(ηi, γi, θi)      27.3
Table: dev04 %WER for the p-Sigmoid systems. + means the activation function parameters were trained in both PT and FT.
Experiments with p−ReLU
• For ReLU DNNs, freezing the activation function parameters at the beginning of training (to avoid their impact on the other parameters) is not useful.
• αi has more impact on training than βi .
ID   Activation Function   dev04
R0   ReLU                  27.6
R1   p-ReLU(αi, 0)         26.8
R2   p-ReLU(1, βi)         27.0
R3   p-ReLU(αi, βi)        27.1
R1−  p-ReLU(αi, 0)         27.4
R2−  p-ReLU(1, βi)         27.0
Table: dev04 %WER for the p-ReLU systems. − indicates the activation function parameters were frozen in the 1st epoch.
Results on All Testing Sets
• S1 and R1 had 3.4% and 2.0% relatively lower WERs than S0 and R0, while increasing the number of parameters by only 0.06%.
• p-Sigmoid obtains gains by making Sigmoid similar to ReLU.
• Weighting the contribution of each hidden unit individually is quite useful.
ID   Activation Function    eval97   eval03   dev04
S0   Sigmoid                34.1     29.7     27.9
S1   p-Sigmoid(ηi, 1, 0)    32.9     28.6     27.1
R0   ReLU                   33.3     29.1     27.6
R1   p-ReLU(αi, 0)          32.7     28.7     26.8
Table: %WER on all test sets.
Summary
• Different types of parameters of p-Sigmoid and p-ReLU, and their combinations, were analysed and compared.
• A scaling factor with no constraint imposed is the most useful. With the linear scaling factors,
  ◦ p-Sigmoid(ηi, 1, 0) resulted in a 3.4% relative WER reduction compared with Sigmoid;
  ◦ p-ReLU(αi, 0) reduced the WER by 2.0% relative over the ReLU baseline.
• Learning different types of parameters simultaneously is difficult.
Appendix: Parameterised Sigmoid Function
• To compute the exact derivatives, it is necessary to store ai .
$$\frac{\partial f_i(a_i)}{\partial a_i} = \begin{cases} 0 & \text{if } \eta_i = 0 \\ \gamma_i f_i(a_i)\left(1 - \eta_i^{-1} f_i(a_i)\right) & \text{if } \eta_i \neq 0, \end{cases}$$

$$\frac{\partial f_i(a_i)}{\partial \eta_i} = \begin{cases} \left(1 + e^{-\gamma_i a_i + \theta_i}\right)^{-1} & \text{if } \eta_i = 0 \\ \eta_i^{-1} f_i(a_i) & \text{if } \eta_i \neq 0, \end{cases}$$

$$\frac{\partial f_i(a_i)}{\partial \gamma_i} = \begin{cases} 0 & \text{if } \eta_i = 0 \\ a_i f_i(a_i)\left(1 - \eta_i^{-1} f_i(a_i)\right) & \text{if } \eta_i \neq 0, \end{cases}$$

$$\frac{\partial f_i(a_i)}{\partial \theta_i} = \begin{cases} 0 & \text{if } \eta_i = 0 \\ -f_i(a_i)\left(1 - \eta_i^{-1} f_i(a_i)\right) & \text{if } \eta_i \neq 0. \end{cases}$$
• For p-Sigmoid(ηi, 1, 0), if ∂fi(ai)/∂ηi is set to zero for the ηi = 0 case, ai does not need to be stored.
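A NumPy sketch of these derivatives (an illustrative re-implementation, not the HTK code); writing the logistic part separately covers both the ηi = 0 and ηi ≠ 0 cases without an explicit branch, and it uses the stored ai:

```python
import numpy as np

def p_sigmoid_grads(a, eta, gamma, theta):
    """Derivatives of f_i(a_i) = eta_i / (1 + exp(-gamma_i * a_i + theta_i))."""
    sig = 1.0 / (1.0 + np.exp(-gamma * a + theta))   # eta-independent logistic part
    f = eta * sig
    s = f * (1.0 - sig)        # equals f_i(a_i) * (1 - f_i(a_i) / eta_i) for eta_i != 0, and 0 for eta_i = 0
    df_da     = gamma * s      # 0 when eta_i = 0
    df_deta   = sig            # reduces to f_i(a_i) / eta_i for eta_i != 0, and to the eta_i = 0 case otherwise
    df_dgamma = a * s          # 0 when eta_i = 0
    df_dtheta = -s             # 0 when eta_i = 0
    return df_da, df_deta, df_dgamma, df_dtheta

# Example for a single unit with p-Sigmoid(eta, 1, 0):
print(p_sigmoid_grads(np.array([0.5]), np.array([2.0]), np.array([1.0]), np.array([0.0])))
```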
Appendix: Parameterised ReLU Function
• Since αi and βi can be any real number, it is not possible to tell the sign of ai from fi(ai).
$$\frac{\partial f_i(a_i)}{\partial a_i} = \begin{cases} \alpha_i & \text{if } a_i > 0 \\ \beta_i & \text{if } a_i \leq 0, \end{cases}$$

$$\frac{\partial f_i(a_i)}{\partial \alpha_i} = \begin{cases} a_i & \text{if } a_i > 0 \\ 0 & \text{if } a_i \leq 0, \end{cases}$$

$$\frac{\partial f_i(a_i)}{\partial \beta_i} = \begin{cases} 0 & \text{if } a_i > 0 \\ a_i & \text{if } a_i \leq 0. \end{cases}$$
• For p-ReLU(αi, 0), if ∂fi(ai)/∂αi is set to zero for the αi = 0 case, it is not necessary to store ai.
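A matching NumPy sketch for the p-ReLU derivatives (illustrative only):

```python
import numpy as np

def p_relu_grads(a, alpha, beta):
    """Derivatives of f_i(a_i) = alpha_i * a_i (a_i > 0), beta_i * a_i (a_i <= 0)."""
    pos = a > 0
    df_da     = np.where(pos, alpha, beta)
    df_dalpha = np.where(pos, a, 0.0)
    df_dbeta  = np.where(pos, 0.0, a)
    return df_da, df_dalpha, df_dbeta

print(p_relu_grads(np.array([-1.5, 0.0, 2.0]), alpha=1.0, beta=0.25))
```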