Computer Vision - Neural Networks

Ing. Ivan Gruber Ph.D.

Department of Cybernetics
Faculty of Applied Sciences
University of West Bohemia

ESF project of the University of West Bohemia, reg. no. CZ.02.2.69/0.0/0.0/16 015/0002287

Content

• Neural Networks
  – General knowledge
  – Artificial neuron
• Neural network properties
  – Activation functions
  – Layers
  – Training
  – Parameters
• Important architectures


Deep neural networks

• The most popular machine learning technique nowadays
• Models are inspired by the biological brain
• Non-linearity is achieved by stacking layers with activation functions
• Many different neural network architectures
• Feedforward vs. recurrent
• Supervised vs. unsupervised learning
• Weights are updated via the back-propagation algorithm
• Advantages: end-to-end training (including feature extraction), state-of-the-art results
• Disadvantages: needs a huge amount of training data; choosing the correct architecture is a bit of an alchemy


Artificial neuron

• A biological neuron is composed of:
  – Soma - the body of the neuron
  – Axon - the output; each neuron has only one axon
  – Dendrites - the inputs; each neuron can have up to several thousand dendrites
  – Synapses - links between axons and dendrites; one-way gates with different synaptic strengths
  – Inputs (electrical impulses) are summed and sent down the axon if they exceed a certain threshold
• Artificial neuron (see the sketch below):
  – The strength of the synapses is modeled by the weights W
  – The threshold is ensured by the activation function f
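A minimal NumPy sketch of a single artificial neuron, assuming a vector input x, a weight vector w, a bias b, and a sigmoid chosen here as the activation f; all names are illustrative.

```python
import numpy as np

def sigmoid(xi):
    """Sigmoid activation used as the thresholding function f."""
    return 1.0 / (1.0 + np.exp(-xi))

def neuron_forward(x, w, b):
    """Single artificial neuron: weighted sum of inputs plus bias, then activation."""
    xi = np.dot(w, x) + b      # the weights w model the synaptic strengths
    return sigmoid(xi)

# toy usage
x = np.array([0.5, -1.2, 3.0])   # inputs (dendrites)
w = np.array([0.1, 0.4, -0.2])   # weights (synaptic strengths)
b = 0.05                         # bias (threshold shift)
print(neuron_forward(x, w, b))
```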


Activation function

• Defines the output of the neuron based on the input(s) and some fixed mathematical operation
• A neuron does not have to have an activation function
• Many different activation functions exist


Sigmoid Function

f(ξ) = 1 / (1 + e^(−ξ)),  where ξ = Σ_{i=1}^{n} (w_i^T x_i + b),  (1)

• Frequently used historically
• Two major drawbacks:
  – The function saturates outside roughly (−5, 5), which causes problems during back-propagation (vanishing gradients; illustrated in the sketch below)
  – It is not zero-centered, so the gradient during back-propagation will always be all positive or all negative
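A small NumPy sketch, assuming Equation (1): it evaluates the sigmoid and its derivative σ(ξ)(1 − σ(ξ)) to show how the gradient shrinks once |ξ| leaves roughly (−5, 5).

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def sigmoid_grad(xi):
    s = sigmoid(xi)
    return s * (1.0 - s)   # derivative used during back-propagation

for xi in [-10.0, -5.0, 0.0, 5.0, 10.0]:
    print(f"xi={xi:+.1f}  sigmoid={sigmoid(xi):.5f}  grad={sigmoid_grad(xi):.5f}")
# the gradient shrinks toward zero as |xi| grows - the saturation problem
```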


Tanh activation

f(ξ) = 2 / (1 + e^(−2ξ)) − 1,  (2)

• Zero-centered
• Saturation problem again


ReLU (Rectified Linear Unit)

f(ξ) = max(0, ξ),  (3)

• The most popular activation function
• Computational simplicity
• Danger of creating dead neurons
• Modifications: Leaky ReLU, PReLU (see the sketch below)
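A minimal NumPy sketch of ReLU and its Leaky ReLU modification; the small negative-side slope (0.01 here, an illustrative choice) keeps neurons from dying completely.

```python
import numpy as np

def relu(xi):
    return np.maximum(0.0, xi)

def leaky_relu(xi, alpha=0.01):
    # negative inputs get a small slope alpha instead of a hard zero
    return np.where(xi > 0, xi, alpha * xi)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))        # [0.  0.  0.  2.]
print(leaky_relu(x))  # [-0.03  -0.005  0.  2.]
```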


Softmax

f(ξ)_j = e^(ξ_j) / Σ_{k=1}^{N} e^(ξ_k),  (4)

• Used in the classification layer
• Converts raw values into posterior probabilities (see the sketch below)
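A sketch of Equation (4) in NumPy, with the usual max-subtraction trick for numerical stability (an implementation detail not on the slide).

```python
import numpy as np

def softmax(xi):
    """Convert a vector of raw scores into a posterior probability distribution."""
    shifted = xi - np.max(xi)    # stability trick: softmax is invariant to a constant shift
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())        # the probabilities sum to 1
```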


Layers

• A NN is formed by connecting artificial neurons together (acyclically in the feedforward case)
• The final purpose and function of the ANN are determined by these connections (the architecture of the network), by the weights, and by the types of neurons (activation functions)
• Neurons are organized into distinct layers
• The most common ones:
  – Fully-connected layer
  – Convolution layer
  – Pooling layer
  – Regularization layer


Fully-Connected Layer

• Each neuron is connected to all neurons in the previous layer (see the sketch below)
• The most common layer
• (Optionally) the last few layers in CNNs
• Prone to over-fitting → used together with dropout
• Hyperparameters: number of neurons
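A minimal sketch of a fully-connected layer forward pass in NumPy: every output neuron sees every input, so the layer is a matrix-vector product plus bias (ReLU is used here only as an example activation; the sizes are illustrative).

```python
import numpy as np

def fully_connected(x, W, b):
    """x: (n_in,), W: (n_out, n_in), b: (n_out,) -> (n_out,)"""
    return np.maximum(0.0, W @ x + b)   # affine map followed by ReLU

rng = np.random.default_rng(0)
x = rng.normal(size=4)                  # 4 input neurons
W = rng.normal(size=(3, 4))             # 3 output neurons, each connected to all 4 inputs
b = np.zeros(3)
print(fully_connected(x, W, b))
```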


Convolution Layer

• Neurons are connected only to a local region of the previous layer
• The size of the region is a hyperparameter called the kernel size (or receptive field)
• The size of the convolution step is called the stride (usually 1)
• The layer can be imagined as a set of filters (see the sketch below)
• The number of filters is called the depth
• All neurons within the same filter share weights
• Hyperparameters: kernel size, stride, number of filters
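A naive single-channel 2D convolution sketch in NumPy (valid padding), illustrating the kernel size, the stride, and weight sharing; real convolution layers add channels, padding, and many filters.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Naive valid convolution of a 2D image with a single 2D kernel."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)   # the same weights are shared at every position
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)   # a simple vertical-edge detector
print(conv2d(image, edge_kernel, stride=1))
```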


Kernel’s properties

• Kernels in the first layers = low-level features (edge detectors, for example)
• Kernels in the middle layers = higher-level features
• Kernels at the end = class-specific features


Pooling Layer

• No trainable weights, no activation function
• Performs a specific mathematical operation over the related region (see the sketch below)
• Hyperparameters: kernel size, stride
• Typical operations: maximum, average
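A 2×2 max-pooling sketch in NumPy (stride equal to the kernel size, the common non-overlapping case); swapping np.max for np.mean would give average pooling.

```python
import numpy as np

def max_pool2d(x, k=2):
    """Non-overlapping k x k max pooling over a 2D feature map."""
    h, w = x.shape[0] // k * k, x.shape[1] // k * k    # crop to a multiple of k
    x = x[:h, :w].reshape(h // k, k, w // k, k)
    return x.max(axis=(1, 3))                          # no weights, just a maximum

fmap = np.array([[1., 3., 2., 0.],
                 [4., 6., 1., 1.],
                 [0., 2., 5., 7.],
                 [1., 1., 8., 2.]])
print(max_pool2d(fmap))   # [[6. 2.]
                          #  [2. 8.]]
```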


Dropout

• Regularization technique
• Prevents over-fitting
• During training, each neuron's output has a probability p of being ignored (see the sketch below)
• Hyperparameters: the probability p
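A sketch of (inverted) dropout in NumPy: during training each activation is zeroed with probability p and the survivors are rescaled by 1/(1 − p), so inference needs no change; the inverted-scaling detail is a common convention, not stated on the slide.

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: drop each unit with probability p during training."""
    if not training or p == 0.0:
        return activations                      # at test time the layer is the identity
    keep = rng.random(activations.shape) >= p   # mask of surviving neurons
    return activations * keep / (1.0 - p)       # rescale so the expected value is unchanged

h = np.ones(8)
print(dropout(h, p=0.5))   # roughly half the entries zeroed, the rest scaled to 2.0
```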


Loss layer

• The last layer of the NN
• In the case of classification or regression we try to find such values of ω_i that minimize a chosen criterion
• The criterion usually incorporates information from the teacher t

Classification criteria (see the sketch below):
• Binary cross-entropy:

  E_k = −Σ_i [ t_i log o_i + (1 − t_i) log(1 − o_i) ],  (5)

• Categorical cross-entropy (used with a softmax layer):

  E_k = −Σ_i t_{k,i} log o_{k,i},  (6)

Regression criterion: mean squared / absolute error

  E_k = Σ_i (t_i − o_i)²,  (7)
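NumPy sketches of the three criteria above, assuming o are the network outputs (probabilities or values) and t are the teacher targets; the small epsilon guarding the logarithms is an implementation detail, not from the slide.

```python
import numpy as np

EPS = 1e-12   # avoid log(0)

def binary_cross_entropy(t, o):
    """Eq. (5)."""
    o = np.clip(o, EPS, 1.0 - EPS)
    return -np.sum(t * np.log(o) + (1.0 - t) * np.log(1.0 - o))

def categorical_cross_entropy(t, o):
    """Eq. (6); t is one-hot, o is a softmax output."""
    return -np.sum(t * np.log(np.clip(o, EPS, 1.0)))

def squared_error(t, o):
    """Eq. (7); divide by the number of outputs for the mean squared error."""
    return np.sum((t - o) ** 2)

t = np.array([0.0, 1.0, 0.0])
o = np.array([0.2, 0.7, 0.1])
print(binary_cross_entropy(t, o), categorical_cross_entropy(t, o), squared_error(t, o))
```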


Back-propagation

• The most common training method for NNs
• Used in conjunction with an optimization method
• Algorithm steps (a worked sketch follows below):
  1. The forward pass - the NN predicts an output
  2. The error is calculated based on the loss function
  3. The backward pass - by recursive application of the chain rule, the gradient for the individual parameters (W, b) is calculated, i.e. the loss is back-propagated to the individual neurons
  4. Using the gradients, the parameter update is performed according to the optimization method
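A compact sketch of one back-propagation step on a tiny two-layer network (sigmoid hidden layer, linear output neuron, squared-error loss); the shapes and the learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3,))            # input
t = np.array([1.0])                  # teacher target
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
lr = 0.1

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# 1. forward pass
z1 = W1 @ x + b1
h = sigmoid(z1)
o = W2 @ h + b2                      # linear output neuron

# 2. error calculation (squared-error loss)
loss = np.sum((t - o) ** 2)

# 3. backward pass (recursive chain rule)
d_o = 2.0 * (o - t)                  # dL/do
dW2 = np.outer(d_o, h); db2 = d_o
d_h = W2.T @ d_o                     # propagate the loss to the hidden layer
d_z1 = d_h * h * (1.0 - h)           # through the sigmoid
dW1 = np.outer(d_z1, x); db1 = d_z1

# 4. parameter update (plain gradient descent step)
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
print(loss)
```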


Loss function

• Cross-entropy loss (classification tasks), always used with softmax:

  L_ce = −Σ_{i=1}^{N} t_i log p_i,  (8)

• Mean-squared error (regression tasks):

  L_2 = ||f − y||_2^2,  (9)

• Others:
  – Contrastive loss
  – Triplet loss
  – Angular softmax loss
  – Arc loss


Weight initialization

• Before the training process, it is necessary to initialize the parameters
• A non-trivial task
• A popular subject of research
• Common initializers (see the sketch below):
  – Zeros
  – Normal
  – Gaussian random variables (µ = 0 and σ = 0.01 … 10^−5)
  – Xavier (Glorot)
  – LeCun
  – etc.
• The parameter update is then performed by an optimizer
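A sketch of two of the initializers listed above, assuming a weight matrix of shape (fan_out, fan_in): a zero-mean Gaussian with a small σ, and the Xavier/Glorot uniform scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_init(fan_out, fan_in, sigma=0.01):
    """Zero-mean Gaussian random variables with a small standard deviation."""
    return rng.normal(0.0, sigma, size=(fan_out, fan_in))

def xavier_uniform_init(fan_out, fan_in):
    """Xavier (Glorot) uniform: keeps the activation variance roughly constant across layers."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

W1 = gaussian_init(128, 64)
W2 = xavier_uniform_init(128, 64)
print(W1.std(), W2.std())
```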


Stochastic Gradient Descent (SGD)

• First-order optimization algorithm
• The training data are divided into batches (due to memory limitations)
• The gradient descent step is computed over those batches (see the sketch below)

  ω_{t+1} = ω_t − γ_t Σ_{i=1}^{n} ∇L_i(ω_t),  (10)

• Advantages: low computational cost; the best results with the right learning rate policy
• Disadvantages: the necessity of finding the right learning rate policy
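A minimal sketch of Equation (10) for a linear least-squares model: the data are split into mini-batches and each update uses only the gradients of that batch; the model, batch size, and learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))                 # training inputs
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=256)  # targets

w = np.zeros(3)                               # parameters omega
gamma = 0.1                                   # learning rate gamma_t (kept constant here)
batch_size = 32

for epoch in range(20):
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)   # gradient of the batch loss
        w = w - gamma * grad                            # SGD update, Eq. (10)

print(w)   # close to [2.0, -1.0, 0.5]
```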


Learning rate

• Effect of the learning rate
• Learning rate decay:
  – The learning rate changes during the training
  – Step decay
  – Exponential decay
  – Etc.


Momentum

• Improves results in most cases
• Weighted average of the newly computed gradient and the past gradients (see the sketch below)

  ω_{t+1} = ω_t + Δω_t = ω_t − γ_t ∇L(ω_t) + α Δω_{t−1},  (11)
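A sketch of the momentum update in Equation (11): the previous step Δω_{t−1} is reused, weighted by α (0.9 is a common but illustrative choice), so past gradients keep contributing to the current step.

```python
import numpy as np

def momentum_step(w, grad, prev_delta, gamma=0.01, alpha=0.9):
    """One update of Eq. (11): new step = -gamma * gradient + alpha * previous step."""
    delta = -gamma * grad + alpha * prev_delta
    return w + delta, delta

w = np.array([1.0, -2.0])
prev_delta = np.zeros_like(w)
for _ in range(3):
    grad = 2.0 * w                      # gradient of the toy loss ||w||^2
    w, prev_delta = momentum_step(w, grad, prev_delta)
print(w)
```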


Adaptive optimizers

• Change the learning rate adaptively
• Adagrad
• RMSprop
• Adam
• Etc.


AlexNet (2012)

• Winner of ImageNet 2012
• The first time a neural-network approach overcame the other approaches
• 5 convolutional layers + 3 fully-connected layers
• Innovations: ReLU non-linearity, dropout


VGGNet (2014)

• Usage of small kernels (3×3)
• Constant computational complexity across all convolutional layers
• State-of-the-art results


InceptionNet (2014)

• Large kernels are preferred for more global information, while smaller ones are preferred for local information
• The size of important objects can vary a lot
• Different operations are applied at the same depth = the Inception module
• Usage of 1×1 convolutions ("max pooling for the channel dimension")


Fully-Convolutional Networks

• No fully-connected layers
• A fully-connected layer has a huge number of parameters and is prone to overfitting
• Global average pooling is used instead of the last fully-connected layer
• Advantages:
  1. The correspondence between feature maps and categories is enforced
  2. Overfitting is avoided
  3. Global average pooling is more robust to spatial transformations
  4. Fully-convolutional networks have a great ability to encode localization without any further information


ResNet (2016)

• Problems with vanishing and exploding gradients
• The ease of learning is not the same for all transformations
• Inclusion of shortcut connections (see the sketch below)
• Winner of ImageNet 2015

  y = F(x, {W_i}) + x,  (12)
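A sketch of the residual connection in Equation (12) in NumPy: the residual branch F is here an arbitrary small two-layer transform (an illustrative stand-in, not the actual ResNet block), and its output is added to the unchanged input x through the shortcut.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0.0, z)

def residual_block(x, W1, W2):
    """y = F(x, {W_i}) + x, with F a small two-layer transform (illustrative)."""
    f = W2 @ relu(W1 @ x)    # the residual branch F(x, {W1, W2})
    return f + x             # shortcut connection: add the input back

d = 8
x = rng.normal(size=d)
W1 = 0.1 * rng.normal(size=(d, d))
W2 = 0.1 * rng.normal(size=(d, d))
print(residual_block(x, W1, W2))
```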


Autoencoders

• Bow-tie structure:
  – Encoder - a fully-convolutional network
  – Decoder - a deconvolutional network
• Semantic segmentation tasks


Challenges, codes and examples

• Kaggle
• ImageNet
• Papers with Code
• CS231n: Convolutional Neural Networks for Visual Recognition
• CS231n: YouTube
• Andrew Ng's Coursera courses
• Siraj Raval - YouTube (a fraud and a thief, but still very informative)
• 3Blue1Brown
• Two Minute Papers
• Deep learning news


Thank you for your attention!

Questions?
